cs.CV [Total: 58]
cs.CL [Total: 17]
cs.AI [Total: 3]
cs.GR [Total: 1]
cs.CY [Total: 1]
eess.IV [Total: 3]
cs.RO [Total: 4]
cs.LG [Total: 3]
eess.SP [Total: 1]

cs.CV

[1] Recurrence Meets Transformers for Universal Multimodal Retrieval cs.CV | cs.AI | cs.CL | cs.MM

Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

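TL;DR: This paper proposes ReT-2, a unified retrieval model that supports multimodal queries. It combines a recurrent Transformer with LSTM-inspired gating to dynamically integrate information across layers and modalities, achieving state-of-the-art performance on multimodal retrieval tasks.

Motivation: Existing retrieval methods are limited to single-modality queries or documents, while increasingly complex multimodal retrieval tasks call for a unified model that supports multimodal queries.

Result: ReT-2 performs strongly on the M2KR and M-BEIR benchmarks, with faster inference and lower memory usage than prior approaches, and it improves downstream performance in retrieval-augmented generation.

Insight: Dynamically integrating information across layers and modalities is key to better multimodal retrieval, and the recurrent Transformer architecture suits this kind of task well.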

Abstract: With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. In this paper, we propose ReT-2, a unified retrieval model that supports multimodal queries, composed of both images and text, and searches across multimodal document collections where text and images coexist. ReT-2 leverages multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations. Results demonstrate that ReT-2 consistently achieves state-of-the-art performance across diverse settings, while offering faster inference and reduced memory usage compared to prior approaches. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT-2
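The abstract describes LSTM-inspired gating that folds multi-layer features into a single representation. The following is a minimal sketch of that general idea only, assuming scalar forget and input gates driven by the dot product between the running state and each layer's feature vector. The function name `gated_layer_fusion` and the weights `w_forget` and `w_input` are hypothetical and are not taken from the ReT-2 code base.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_layer_fusion(layer_feats, w_forget, w_input):
    """Fold per-layer feature vectors into one state vector.

    Update rule (an assumption for illustration):
        state <- f * state + i * feat
    where the forget gate f and input gate i are sigmoids of the
    scaled dot product between the current state and the new feature.
    """
    state = [0.0] * len(layer_feats[0])
    for feat in layer_feats:
        dot = sum(s * x for s, x in zip(state, feat))
        f = sigmoid(w_forget * dot)  # how much of the old state to keep
        i = sigmoid(w_input * dot)   # how much of the new layer to admit
        state = [f * s + i * x for s, x in zip(state, feat)]
    return state


# Hypothetical features from three encoder layers.
fused = gated_layer_fusion([[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]], 1.0, 1.0)
```

A real model would use learned, vector-valued gates per modality; the scalar gates here only show how a recurrent update can weigh earlier layers against later ones.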

[2] Diffusion-Based Action Recognition Generalizes to Untrained Domains cs.CV

Rogerio Guimaraes, Frank Xiao, Pietro Perona, Markus Marks

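TL;DR: The paper extracts features with a Vision Diffusion Model (VDM) and aggregates them with a transformer to achieve human-like generalization in action recognition, particularly on untrained domains such as new animal species, viewpoints, and recording contexts.

Motivation: Humans recognize the same action regardless of context and viewpoint, whereas current deep learning models generalize poorly in this setting. The paper aims to improve generalization by using the feature-extraction ability of diffusion models.

Result: The method achieves state-of-the-art performance on cross-species, cross-viewpoint, and cross-context action recognition, substantially improving generalization to untrained domains.

Insight: Features produced by diffusion models capture high-level semantic information better, and transformer aggregation further improves performance on cross-domain tasks.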

Abstract: Humans can recognize the same actions despite large context and viewpoint variations, such as differences between species (walking in spiders vs. horses), viewpoints (egocentric vs. third-person), and contexts (real life vs. movies). Current deep learning models struggle with such generalization. We propose using features generated by a Vision Diffusion Model (VDM), aggregated via a transformer, to achieve human-like action recognition across these challenging conditions. We find that generalization is enhanced by the use of a model conditioned on earlier timesteps of the diffusion process to highlight semantic information over pixel-level details in the extracted features. We experimentally explore the generalization properties of our approach in classifying actions across animal species, across different viewing angles, and different recording contexts. Our model sets a new state-of-the-art across all three generalization benchmarks, bringing machine action recognition closer to human-like robustness. Project page: https://www.vision.caltech.edu/actiondiff/ Code: https://github.com/frankyaoxiao/ActionDiff

[3] PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness, and Controllability cs.CV | cs.AI

Tung Vu, Lam Nguyen, Quynh Dao

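TL;DR: This paper proposes PromptGuard, a modular prompting framework whose VulnGuard Prompt technique proactively prevents LLMs from generating harmful text for vulnerable populations. It combines multi-objective optimization with ethical reasoning and reports a 25-30% analytical harm reduction.

Motivation: In real-world applications, large language models (LLMs) can generate harmful, biased, or misleading information for vulnerable populations such as LGBTQ+ individuals and single parents. Existing methods mostly filter outputs after the fact and cannot prevent harm at the source.

Result: Through theoretical analysis, the paper derives a 25-30% reduction in harmful output and outlines a validation framework based on GitHub-sourced datasets.

Insight: 1. Preventing harmful generation at the source is more effective than post-hoc filtering. 2. Combining ethical reasoning with data-driven methods can improve model safety. 3. The modular design suits practical deployment.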

Abstract: The proliferation of Large Language Models (LLMs) in real-world applications poses unprecedented risks of generating harmful, biased, or misleading information to vulnerable populations including LGBTQ+ individuals, single parents, and marginalized communities. While existing safety approaches rely on post-hoc filtering or generic alignment techniques, they fail to proactively prevent harmful outputs at the generation source. This paper introduces PromptGuard, a novel modular prompting framework with our breakthrough contribution: VulnGuard Prompt, a hybrid technique that prevents harmful information generation using real-world data-driven contrastive learning. VulnGuard integrates few-shot examples from curated GitHub repositories, ethical chain-of-thought reasoning, and adaptive role-prompting to create population-specific protective barriers. Our framework employs theoretical multi-objective optimization with formal proofs demonstrating 25-30% analytical harm reduction through entropy bounds and Pareto optimality. PromptGuard orchestrates six core modules: Input Classification, VulnGuard Prompting, Ethical Principles Integration, External Tool Interaction, Output Validation, and User-System Interaction, creating an intelligent expert system for real-time harm prevention. We provide comprehensive mathematical formalization including convergence proofs, vulnerability analysis using information theory, and theoretical validation framework using GitHub-sourced datasets, establishing mathematical foundations for systematic empirical research.

[4] Similarity-based Outlier Detection for Noisy Object Re-Identification Using Beta Mixtures cs.CV | cs.AI | cs.LG | math.ST | stat.ML | stat.TH

Waqar Ahmad, Evan Murphy, Vladimir A. Krylov

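TL;DR: The paper proposes Beta-SOD, a similarity-based outlier detection method built on a Beta mixture model, which uses statistical modeling to make object re-identification (Re-ID) more robust to noisy labels.

Motivation: Object Re-ID is highly sensitive to label noise, which causes significant performance degradation; the paper addresses this problem.

Result: On CUHK03, Market-1501, and VeRi-776, Beta-SOD outperforms existing methods across noise levels of 10-30%.

Insight: Statistical modeling handles label noise effectively and offers a more robust solution for Re-ID.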

Abstract: Object re-identification (Re-ID) methods are highly sensitive to label noise, which typically leads to significant performance degradation. We address this challenge by reframing Re-ID as a supervised image similarity task and adopting a Siamese network architecture trained to capture discriminative pairwise relationships. Central to our approach is a novel statistical outlier detection (OD) framework, termed Beta-SOD (Beta mixture Similarity-based Outlier Detection), which models the distribution of cosine similarities between embedding pairs using a two-component Beta distribution mixture model. We establish a novel identifiability result for mixtures of two Beta distributions, ensuring that our learning task is well-posed. The proposed OD step complements the Re-ID architecture combining binary cross-entropy, contrastive, and cosine embedding losses that jointly optimize feature-level similarity learning. We demonstrate the effectiveness of Beta-SOD in de-noising and Re-ID tasks for person Re-ID, on the CUHK03 and Market-1501 datasets, and vehicle Re-ID, on the VeRi-776 dataset. Our method shows superior performance compared to the state-of-the-art methods across various noise levels (10-30%), demonstrating both robustness and broad applicability in noisy Re-ID scenarios. The implementation of Beta-SOD is available at: https://github.com/waqar3411/Beta-SOD
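To make the two-component Beta mixture idea concrete, here is a minimal sketch under stated assumptions: similarities are already rescaled into the open interval (0, 1), and the M-step uses weighted method-of-moments estimates instead of maximizing the likelihood, which is a simplification and not necessarily what Beta-SOD does. All function names are hypothetical.

```python
import math


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def moments_to_beta(mean, var):
    """Method-of-moments estimate of (a, b) from a mean and a variance."""
    k = mean * (1 - mean) / var - 1
    return mean * k, (1 - mean) * k


def fit_two_beta_mixture(xs, iters=30):
    """EM-style fit; returns (params, resp) with resp = P(high component | x)."""
    median = sorted(xs)[len(xs) // 2]
    resp = [1.0 if x >= median else 0.0 for x in xs]  # hard initial split
    params = []
    for _ in range(iters):
        # M-step: weighted moments for the high and the low component.
        params = []
        for weights in (resp, [1.0 - r for r in resp]):
            total = sum(weights)
            mean = sum(w * x for w, x in zip(weights, xs)) / total
            var = sum(w * (x - mean) ** 2 for w, x in zip(weights, xs)) / total
            # Clamp so the moment estimates stay valid (k > 0).
            var = min(max(var, 1e-6), 0.99 * mean * (1 - mean))
            params.append((total / len(xs),) + moments_to_beta(mean, var))
        # E-step: posterior probability of the high-similarity component.
        (w_hi, a_hi, b_hi), (w_lo, a_lo, b_lo) = params
        resp = []
        for x in xs:
            p_hi = w_hi * beta_pdf(x, a_hi, b_hi)
            p_lo = w_lo * beta_pdf(x, a_lo, b_lo)
            resp.append(p_hi / (p_hi + p_lo))
    return params, resp


# Toy similarities: five low-similarity pairs, five high-similarity pairs.
sims = [0.15, 0.18, 0.20, 0.22, 0.25, 0.75, 0.78, 0.80, 0.82, 0.85]
params, resp = fit_two_beta_mixture(sims)
suspect = [r < 0.5 for r in resp]  # candidate outliers among same-label pairs
```

Flagging pairs whose posterior falls below 0.5 is an illustrative rule, not the paper's decision criterion; in practice the threshold would be chosen from the fitted mixture.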

[5] Discovering Divergent Representations between Text-to-Image Models cs.CV

Lisa Dunlap, Joseph E. Gonzalez, Trevor Darrell, Fabian Caba Heilbron, Josef Sivic

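TL;DR: The paper studies when and how the visual representations learned by two text-to-image models diverge. It proposes CompCon, an evolutionary search algorithm that discovers visual attributes more prevalent in one model's outputs than the other's and uncovers the prompt concepts that trigger these differences.

Motivation: The goal is to probe how generative models differ in their visual representations, in order to better understand model behavior and potential biases.

Result: Experiments show that CompCon effectively discovers differences between models; for example, some models show distinctive visual features when given prompts expressing particular emotions.

Insight: Differences in visual representation between models may be tied to cultural or emotional prompts, revealing potential biases and diversity across models.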

Abstract: In this paper, we investigate when and how visual representations learned by two different generative models diverge. Given two text-to-image models, our goal is to discover visual attributes that appear in images generated by one model but not the other, along with the types of prompts that trigger these attribute differences. For example, “flames” might appear in one model’s outputs when given prompts expressing strong emotions, while the other model does not produce this attribute given the same prompts. We introduce CompCon (Comparing Concepts), an evolutionary search algorithm that discovers visual attributes more prevalent in one model’s output than the other, and uncovers the prompt concepts linked to these visual differences. To evaluate CompCon’s ability to find diverging representations, we create an automated data generation pipeline to produce ID2, a dataset of 60 input-dependent differences, and compare our approach to several LLM- and VLM-powered baselines. Finally, we use CompCon to compare popular text-to-image models, finding divergent representations such as how PixArt depicts prompts mentioning loneliness with wet streets and Stable Diffusion 3.5 depicts African American people in media professions. Code at: https://github.com/adobe-research/CompCon

[6] An U-Net-Based Deep Neural Network for Cloud Shadow and Sun-Glint Correction of Unmanned Aerial System (UAS) Imagery cs.CV

Yibin Wang, Wondimagegn Beshah, Padmanava Dash, Haifeng Wang

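TL;DR: The paper proposes a U-Net-based deep learning method for correcting cloud shadows and sun glint in unmanned aerial system (UAS) imagery, with the aim of improving water quality parameter estimation.

Motivation: UAS imagery acquired under clouds is prone to cloud shadows and sun glint, which degrade image quality and hamper the estimation of water quality parameters. An effective method is therefore needed to identify and correct the affected regions.

Result: Experiments show that the method successfully recovers the affected regions and produces high-quality corrected imagery.

Insight: Deep learning can handle complex interference in UAS imagery, providing a new tool for remote sensing applications.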

Abstract: The use of unmanned aerial systems (UASs) has increased tremendously in the current decade. They have significantly advanced remote sensing with the capability to deploy and image the terrain at the spatial, spectral, temporal, and radiometric resolutions required by various remote sensing applications. One of the major advantages of UAS imagery is that images can be acquired in cloudy conditions by flying the UAS under the clouds. The limitation of the technology is that the imagery is often sullied by cloud shadows. Images taken over water are additionally affected by sun glint. These two effects pose serious issues for estimating water quality parameters from UAS images. This study proposes a novel machine learning approach that first identifies and extracts regions with cloud shadows and sun glint, separating them from unobstructed clear-sky regions and regions unaffected by sun glint. Data were extracted from the images at the pixel level to train a U-Net-based deep learning model, and the best settings for model training were identified based on various evaluation metrics from test cases. Using this evaluation, a high-quality image correction model was determined, which was used to recover the cloud shadow and sun glint areas in the images.

[7] CoSwin: Convolution Enhanced Hierarchical Shifted Window Attention For Small-Scale Vision cs.CV

Puskal Khadka, Rodrigue Rizk, Longwei Wang, KC Santosh

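TL;DR: CoSwin is a convolution-enhanced hierarchical shifted-window attention architecture for small-scale vision tasks that delivers consistent performance gains over the baseline Swin Transformer.

Motivation: Vision Transformers excel at modeling global context but lack local feature extraction on small datasets. CoSwin addresses this by combining the local feature learning of convolutions with the global attention of Transformers.

Result: Experiments show that CoSwin outperforms existing convolutional and Transformer models on multiple image classification benchmarks, including CIFAR-10 and CIFAR-100.

Insight: Fusing local and global features significantly improves the generalization and robustness of Transformers on small-scale vision tasks.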

Abstract: Vision Transformers (ViTs) have achieved impressive results in computer vision by leveraging self-attention to model long-range dependencies. However, their emphasis on global context often comes at the expense of local feature extraction in small datasets, particularly due to the lack of key inductive biases such as locality and translation equivariance. To mitigate this, we propose CoSwin, a novel feature-fusion architecture that augments the hierarchical shifted window attention with localized convolutional feature learning. Specifically, CoSwin integrates a learnable local feature enhancement module into each attention block, enabling the model to simultaneously capture fine-grained spatial details and global semantic structure. We evaluate CoSwin on multiple image classification benchmarks including CIFAR-10, CIFAR-100, MNIST, SVHN, and Tiny ImageNet. Our experimental results show consistent performance gains over state-of-the-art convolutional and transformer-based models. Notably, CoSwin achieves improvements of 2.17% on CIFAR-10, 4.92% on CIFAR-100, 0.10% on MNIST, 0.26% on SVHN, and 4.47% on Tiny ImageNet over the baseline Swin Transformer. These improvements underscore the effectiveness of local-global feature fusion in enhancing the generalization and robustness of transformers for small-scale vision. Code and pretrained weights available at https://github.com/puskal-khadka/coswin

[8] iMatcher: Improve matching in point cloud registration via local-to-global geometric consistency learning cs.CV

Karim Slimani, Catherine Achard, Brahim Tamadazte

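TL;DR: iMatcher is a fully differentiable feature-matching framework for point cloud registration that improves matching through local-to-global geometric consistency learning and achieves state-of-the-art results on multiple datasets.

Motivation: Feature matching in point cloud registration often suffers from insufficient local and global geometric consistency, and traditional methods struggle to satisfy both. iMatcher aims to improve matching accuracy and robustness by learning both.

Result: iMatcher achieves inlier ratios of 95%-97% on KITTI, 94%-97% on KITTI-360, and up to 81.1% on 3DMatch, significantly outperforming existing methods.

Insight: Combining local and global geometric consistency is key to better point cloud registration; the learned features are more robust and apply across diverse scenes.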

Abstract: This paper presents iMatcher, a fully differentiable framework for feature matching in point cloud registration. The proposed method leverages learned features to predict a geometrically consistent confidence matrix, incorporating both local and global consistency. First, a local graph embedding module leads to an initialization of the score matrix. A subsequent repositioning step refines this matrix by considering bilateral source-to-target and target-to-source matching via nearest neighbor search in 3D space. The paired point features are then stacked together to be refined through global geometric consistency learning to predict a point-wise matching probability. Extensive experiments on real-world outdoor (KITTI, KITTI-360) and indoor (3DMatch) datasets, as well as on 6-DoF pose estimation (TUD-L) and partial-to-partial matching (MVP-RG), demonstrate that iMatcher significantly improves rigid registration performance. The method achieves state-of-the-art inlier ratios, scoring 95% - 97% on KITTI, 94% - 97% on KITTI-360, and up to 81.1% on 3DMatch, highlighting its robustness across diverse settings.
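The bilateral source-to-target and target-to-source nearest-neighbor step can be illustrated with a mutual nearest-neighbor check. This is a minimal sketch assuming brute-force search over 3D points and a hard mutual-agreement rule; iMatcher itself is differentiable and refines a score matrix, so the hard filter and the function names below are illustrative assumptions only.

```python
def nearest(point, candidates):
    """Index of the candidate closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(candidates)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, candidates[j])),
    )


def mutual_matches(source, target):
    """Keep (i, j) only if j is i's nearest target and i is j's nearest source."""
    src_to_tgt = [nearest(p, target) for p in source]
    tgt_to_src = [nearest(q, source) for q in target]
    return [(i, j) for i, j in enumerate(src_to_tgt) if tgt_to_src[j] == i]


source = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
target = [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)]
pairs = mutual_matches(source, target)
```

The third source point has a nearest target, but that target prefers a different source point, so the pair is dropped. Checking both directions in this way removes one-sided matches that a single nearest-neighbor pass would keep.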

[9] COCO-Urdu: A Large-Scale Urdu Image-Caption Dataset with Multimodal Quality Estimation cs.CV | cs.CL | 68T45 (Primary) 68T50 (Secondary)

Umair Hassan

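TL;DR: The paper presents COCO-Urdu, a large-scale Urdu image-caption dataset derived from MS COCO. It aims to fill the gap for Urdu in multimodal research and uses a multimodal quality estimation framework to ensure translation quality and semantic consistency.

Motivation: Urdu, spoken by more than 250 million people worldwide, has long been neglected in multimodal and vision-language research. The lack of high-quality Urdu datasets has limited the development of Urdu-capable systems and reinforced the bias of multilingual models toward high-resource languages. The author hopes COCO-Urdu will reduce language bias in multimodal research.

Result: COCO-Urdu scores consistently well on BLEU, SacreBLEU, and chrF, supporting the quality of the dataset.

Insight: The multimodal quality estimation framework can improve translation quality for low-resource languages and reduce reliance on manual annotation. The released data and tools provide a foundation for inclusive vision-language systems.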

Abstract: Urdu, spoken by over 250 million people, remains critically under-served in multimodal and vision-language research. The absence of large-scale, high-quality datasets has limited the development of Urdu-capable systems and reinforced biases in multilingual vision-language models trained primarily on high-resource languages. To address this gap, we present COCO-Urdu, a large-scale image-caption dataset derived from MS COCO, containing 59,000 images and 319,000 Urdu captions selected through stratified sampling to preserve the original distribution. Captions were translated using SeamlessM4T v2 and validated with a hybrid multimodal quality estimation framework that integrates COMET-Kiwi for translation quality, CLIP-based similarity for visual grounding, and BERTScore with back-translation for semantic consistency; low-scoring captions were iteratively refined using open-source large language models. We further benchmark COCO-Urdu on BLEU, SacreBLEU, and chrF, reporting consistently strong results. To the best of our knowledge, COCO-Urdu is the largest publicly available Urdu captioning dataset. By releasing both the dataset and the quality estimation pipeline, we aim to reduce language bias in multimodal research and establish a foundation for inclusive vision-language systems.

[10] VoxelFormer: Parameter-Efficient Multi-Subject Visual Decoding from fMRI cs.CV

Chenqian Le, Yilin Zhao, Nikasadat Emami, Kushagra Yadav, Xujin “Chris” Liu

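TL;DR: VoxelFormer is a lightweight Transformer architecture for multi-subject visual decoding from fMRI. It achieves parameter efficiency through a Token Merging Transformer and a Q-Former and performs well on the 7T Natural Scenes Dataset.

Motivation: Most existing fMRI visual decoding methods are trained per subject, which limits scalability and practical deployment. VoxelFormer addresses this by enabling multi-subject training.

Result: VoxelFormer achieves competitive retrieval performance on subjects included during training, with significantly fewer parameters than existing methods.

Insight: Token merging and query-based Transformers are effective strategies for parameter-efficient neural decoding.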

Abstract: Recent advances in fMRI-based visual decoding have enabled compelling reconstructions of perceived images. However, most approaches rely on subject-specific training, limiting scalability and practical deployment. We introduce VoxelFormer, a lightweight transformer architecture that enables multi-subject training for visual decoding from fMRI. VoxelFormer integrates a Token Merging Transformer (ToMer) for efficient voxel compression and a query-driven Q-Former that produces fixed-size neural representations aligned with the CLIP image embedding space. Evaluated on the 7T Natural Scenes Dataset, VoxelFormer achieves competitive retrieval performance on subjects included during training with significantly fewer parameters than existing methods. These results highlight token merging and query-based transformers as promising strategies for parameter-efficient neural decoding.
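Token merging compresses a sequence by combining tokens that carry similar information. The sketch below is a deliberately simplified greedy variant, assuming non-zero token vectors: it repeatedly averages the single most cosine-similar pair. It is not the matching scheme used by ToMer, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def merge_most_similar(tokens, num_merges):
    """Greedily average the most similar token pair, num_merges times."""
    tokens = [list(t) for t in tokens]
    for _ in range(num_merges):
        if len(tokens) < 2:
            break
        index_pairs = [
            (i, j) for i in range(len(tokens)) for j in range(i + 1, len(tokens))
        ]
        i, j = max(index_pairs, key=lambda p: cosine(tokens[p[0]], tokens[p[1]]))
        merged = [(a + b) / 2 for a, b in zip(tokens[i], tokens[j])]
        # Drop the two originals and append their average.
        tokens = [t for k, t in enumerate(tokens) if k not in (i, j)] + [merged]
    return tokens


voxel_tokens = [[1, 0], [0, 1], [1, 0.1]]
compressed = merge_most_similar(voxel_tokens, 1)
```

Each merge shortens the sequence by one token, which is what reduces the cost of later attention layers; the greedy pair search here is quadratic and would be replaced by a cheaper matching step in practice.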

[11] Integrating Anatomical Priors into a Causal Diffusion Model cs.CV

Binxu Li, Wei Peng, Mingjie Li, Ehsan Adeli, Kilian M. Pohl

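TL;DR: The paper proposes PCGM, which integrates anatomical priors into a causal diffusion model to generate high-quality, anatomically plausible brain MRIs, addressing the difficulty existing methods have in preserving medically relevant local details.

Motivation: Existing counterfactual models fail to preserve subtle anatomical details when generating brain MRIs, which limits the usefulness of the generated images in medical research. The paper aims to improve anatomical plausibility and medical value by introducing anatomical priors.

Result: Experiments on multiple datasets show that PCGM generates brain MRIs of higher quality than baseline methods and replicates the subtle effects of a disease on cortical regions reported in the neuroscience literature.

Insight: With anatomical priors introduced explicitly, generative models better preserve medically relevant local details, making them a more practical tool for brain MRI research.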

Abstract: 3D brain MRI studies often examine subtle morphometric differences between cohorts that are hard to detect visually. Given the high cost of MRI acquisition, these studies could greatly benefit from image syntheses, particularly counterfactual image generation, as seen in other domains, such as computer vision. However, counterfactual models struggle to produce anatomically plausible MRIs due to the lack of explicit inductive biases to preserve fine-grained anatomical details. This shortcoming arises from the training of the models aiming to optimize for the overall appearance of the images (e.g., via cross-entropy) rather than preserving subtle, yet medically relevant, local variations across subjects. To preserve subtle variations, we propose to explicitly integrate anatomical constraints on a voxel-level as prior into a generative diffusion framework. Called Probabilistic Causal Graph Model (PCGM), the approach captures anatomical constraints via a probabilistic graph module and translates those constraints into spatial binary masks of regions where subtle variations occur. The masks (encoded by a 3D extension of ControlNet) constrain a novel counterfactual denoising UNet, whose encodings are then transferred into high-quality brain MRIs via our 3D diffusion decoder. Extensive experiments on multiple datasets demonstrate that PCGM generates structural brain MRIs of higher quality than several baseline approaches. Furthermore, we show for the first time that brain measurements extracted from counterfactuals (generated by PCGM) replicate the subtle effects of a disease on cortical brain regions previously reported in the neuroscience literature. This achievement is an important milestone in the use of synthetic MRIs in studies investigating subtle morphological differences.

[12] Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models cs.CV

Qiuhui Chen, Xuancheng Yao, Huping Ye, Yi Hong

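TL;DR: The paper proposes Med3DInsight, a pretraining framework that integrates 3D image encoders with 2D multimodal large language models (MLLMs) through a plane-slice-aware transformer module. It deepens the semantic understanding of 3D medical images without human annotations and performs strongly on segmentation and classification tasks.

Motivation: Existing self-supervised learning methods for 3D medical images lack deep semantic understanding, while 2D MLLMs make it possible to enhance image understanding through text descriptions. The paper aims to use these advances to improve 3D medical image understanding.

Result: On segmentation and classification tasks across several public CT and MRI datasets, Med3DInsight outperforms existing self-supervised learning methods and reaches state-of-the-art performance.

Insight: 1. 2D MLLMs can efficiently assist 3D medical image understanding. 2. Multimodal representation learning is possible without human annotations. 3. The framework can be integrated seamlessly into existing 3D medical image understanding networks.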

Abstract: Understanding 3D medical image volumes is critical in the medical field, yet existing 3D medical convolution and transformer-based self-supervised learning (SSL) methods often lack deep semantic comprehension. Recent advancements in multimodal large language models (MLLMs) provide a promising approach to enhance image understanding through text descriptions. To leverage these 2D MLLMs for improved 3D medical image understanding, we propose Med3DInsight, a novel pretraining framework that integrates 3D image encoders with 2D MLLMs via a specially designed plane-slice-aware transformer module. Additionally, our model employs a partial optimal transport based alignment, demonstrating greater tolerance to potential noise in LLM-generated content. Med3DInsight introduces a new paradigm for scalable multimodal 3D medical representation learning without requiring human annotations. Extensive experiments demonstrate our state-of-the-art performance on two downstream tasks, i.e., segmentation and classification, across various public datasets with CT and MRI modalities, outperforming current SSL methods. Med3DInsight can be seamlessly integrated into existing 3D medical image understanding networks, potentially enhancing their performance. Our source code, generated datasets, and pre-trained models will be available at https://github.com/Qybc/Med3DInsight.

[13] Improvement of Human-Object Interaction Action Recognition Using Scene Information and Multi-Task Learning Approach cs.CV

Hesham M. Shehata, Mohammad Abdolrahmani

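TL;DR: The paper proposes a method that combines scene information with multi-task learning to improve human-object interaction action recognition.

Motivation: Existing graph convolutional networks (GCNs) perform well on human action recognition but lack an effective representation of scene information, so they struggle to recognize interactions between people and fixed objects.

Result: The method reaches 99.25% accuracy on the studied interaction and non-interaction actions, 2.75% higher than the baseline model that uses only human skeleton poses.

Insight: Scene information and multi-task learning compensate for the weaknesses of conventional action recognition, especially where people interact with fixed objects.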

Abstract: Recent graph convolutional neural networks (GCNs) have shown high performance in human action recognition using human skeleton poses. However, they fail to detect human-object interaction cases reliably because they lack an effective representation of scene information and appropriate learning architectures. In this context, we propose a methodology to improve human action recognition performance by considering fixed object information in the environment and following a multi-task learning approach. To evaluate the proposed method, we collected real data from public environments and prepared our own dataset, which includes interaction classes of hands on fixed objects (e.g., ATM ticketing machines, check-in/out machines, etc.) and non-interaction classes of walking and standing. The multi-task learning approach, along with interaction area information, succeeds in recognizing the studied interaction and non-interaction actions with an accuracy of 99.25%, outperforming the accuracy of the base model using only human skeleton poses by 2.75%.

[14] SQAP-VLA: A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models cs.CV | cs.AI

Hengyu Fang, Yijiang Liu, Yuan Du, Li Du, Huanrui Yang

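TL;DR: SQAP-VLA is a training-free inference acceleration framework that combines quantization with token pruning. It is the first to co-design the two, resolving the incompatibility between them in conventional VLA models.

Motivation: VLA models are hard to deploy in practice because of their heavy computational and memory costs, and existing methods cannot achieve efficient quantization and token pruning at the same time.

Result: On standard VLA models, SQAP-VLA achieves a 1.93× speedup and up to a 4.5% improvement in average success rate over the original model.

Insight: Co-designing quantization and pruning is key to improving VLA efficiency, and the framework can be deployed without additional training.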

Abstract: Vision-Language-Action (VLA) models exhibit unprecedented capabilities for embodied intelligence. However, their extensive computational and memory costs hinder their practical deployment. Existing VLA compression and acceleration approaches conduct quantization or token pruning in an ad-hoc manner but fail to enable both for a holistic efficiency improvement due to an observed incompatibility. This work introduces SQAP-VLA, the first structured, training-free VLA inference acceleration framework that simultaneously enables state-of-the-art quantization and token pruning. We overcome the incompatibility by co-designing the quantization and token pruning pipeline, where we propose new quantization-aware token pruning criteria that work on an aggressively quantized model while improving the quantizer design to enhance pruning effectiveness. When applied to standard VLA models, SQAP-VLA yields significant gains in computational efficiency and inference speed while successfully preserving core model performance, achieving a 1.93× speedup and up to a 4.5% average success rate enhancement compared to the original model.

[15] FPI-Det: a face–phone Interaction Dataset for phone-use detection and understanding cs.CV

Jianqin Gao, Tianqi Wang, Yu Zhang, Yishu Zhang, Chenyuan Wang

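TL;DR: The paper introduces FPI-Det, a dataset for detecting and understanding phone use. It fills the gap left by generic benchmarks, which lack fine-grained face-phone interaction data, and provides baseline detection results and analysis.

Motivation: The spread of mobile devices has created a need to detect phone use in settings such as safety monitoring and the workplace, but existing datasets do not adequately capture fine-grained face-phone interactions.

Result: The paper reports a performance analysis of representative YOLO and DETR detectors, providing baselines for future work.

Insight: FPI-Det fills the gap in fine-grained interaction datasets and offers an important resource for research on phone-use behavior.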

Abstract: The widespread use of mobile devices has created new challenges for vision systems in safety monitoring, workplace productivity assessment, and attention management. Detecting whether a person is using a phone requires not only object recognition but also an understanding of behavioral context, which involves reasoning about the relationship between faces, hands, and devices under diverse conditions. Existing generic benchmarks do not fully capture such fine-grained human–device interactions. To address this gap, we introduce FPI-Det, containing 22,879 images with synchronized annotations for faces and phones across workplace, education, transportation, and public scenarios. The dataset features extreme scale variation, frequent occlusions, and varied capture conditions. We evaluate representative YOLO and DETR detectors, providing baseline results and an analysis of performance across object sizes, occlusion levels, and environments. Source code and dataset are available at https://github.com/KvCgRv/FPI-Det.

[16] Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention cs.CV

Junhao Xing, Ryohei Miyakawa, Yang Yang, Xinpeng Liu, Risa Shinoda

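TL;DR: The paper proposes ZeroPlantSeg, a zero-shot hierarchical plant segmentation method that combines a foundation segmentation model with a vision-language model to segment individual plants from top-view images without additional training.

Motivation: Foundation segmentation models enable zero-shot leaf instance segmentation, but the more complex hierarchical segmentation task still requires annotated datasets. The paper proposes a training-free zero-shot method to address this.

Result: Evaluations on datasets covering multiple species, growth stages, and shooting environments show that the method surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods.

Insight: Combining the capabilities of multiple models enables zero-shot segmentation for complex tasks without additional annotated data.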

Abstract: Foundation segmentation models achieve reasonable leaf instance extraction from top-view crop images without training (i.e., zero-shot). However, segmenting entire plant individuals with each consisting of multiple overlapping leaves remains challenging. This problem is referred to as a hierarchical segmentation task, typically requiring annotated training datasets, which are often species-specific and require notable human labor. To address this, we introduce ZeroPlantSeg, a zero-shot segmentation for rosette-shaped plant individuals from top-view images. We integrate a foundation segmentation model, extracting leaf instances, and a vision-language model, reasoning about plants’ structures to extract plant individuals without additional training. Evaluations on datasets with multiple plant species, growth stages, and shooting environments demonstrate that our method surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods. Implementations are available at https://github.com/JunhaoXing/ZeroPlantSeg.

[17] Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval cs.CV

Tianlu Zheng, Yifan Zhang, Xiang An, Ziyong Feng, Kaicheng Yang

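TL;DR: The paper proposes GA-DMS, a CLIP-based framework for text-based person retrieval. Through efficient data construction and architectural improvements, it tackles data scarcity and noisy text and achieves state-of-the-art performance.

Motivation: Existing CLIP models face data scarcity and noisy text in person retrieval and need improvement to strengthen fine-grained matching.

Result: GA-DMS achieves state-of-the-art performance on multiple benchmarks.

Insight: Gradient-attention-guided masking and masked token prediction objectives improve robustness to noise and fine-grained semantic representation.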

Abstract: Although Contrastive Language-Image Pre-training (CLIP) exhibits strong performance across diverse vision tasks, its application to person representation learning faces two critical challenges: (i) the scarcity of large-scale annotated vision-language data focused on person-centric images, and (ii) the inherent limitations of global contrastive learning, which struggles to maintain discriminative local features crucial for fine-grained matching while remaining vulnerable to noisy text tokens. This work advances CLIP for person representation learning through synergistic improvements in data curation and model architecture. First, we develop a noise-resistant data construction pipeline that leverages the in-context learning capabilities of MLLMs to automatically filter and caption web-sourced images. This yields WebPerson, a large-scale dataset of 5M high-quality person-centric image-text pairs. Second, we introduce the GA-DMS (Gradient-Attention Guided Dual-Masking Synergetic) framework, which improves cross-modal alignment by adaptively masking noisy textual tokens based on the gradient-attention similarity score. Additionally, we incorporate masked token prediction objectives that compel the model to predict informative text tokens, enhancing fine-grained semantic representation learning. Extensive experiments show that GA-DMS achieves state-of-the-art performance across multiple benchmarks.

[18] ALL-PET: A Low-resource and Low-shot PET Foundation Model in the Projection Domain cs.CV

Bin Huang, Kang Chen, Bingxuan Li, Huafeng Liu, Qiegen Liu

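TL;DR: ALL-PET is a low-resource, low-shot PET foundation model that operates directly in the projection domain. Using a latent diffusion model (LDM) together with novel data augmentation and attention mechanisms, it handles positron emission tomography (PET) imaging tasks with high quality.

Motivation: In PET imaging, building large-scale foundation models is hindered by scarce labeled data and limited computational resources. The authors aim for a solution that completes PET imaging tasks efficiently under low-resource, low-shot conditions.

Result: ALL-PET generates high-quality projection data from only 500 samples, with performance comparable to models trained on larger datasets and memory use under 24 GB.

Insight: Geometry-driven, physically consistent attention makes it possible to generalize across PET imaging tasks under low-resource conditions, offering an efficient solution for medical image analysis.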

Abstract: Building a large-scale foundation model for PET imaging is hindered by limited access to labeled data and insufficient computational resources. To overcome data scarcity and efficiency limitations, we propose ALL-PET, a low-resource, low-shot PET foundation model operating directly in the projection domain. ALL-PET leverages a latent diffusion model (LDM) with three key innovations. First, we design a Radon mask augmentation strategy (RMAS) that generates over 200,000 structurally diverse training samples by projecting randomized image-domain masks into sinogram space, significantly improving generalization with minimal data. This is extended by a dynamic multi-mask (DMM) mechanism that varies mask quantity and distribution, enhancing data diversity without added model complexity. Second, we implement positive/negative mask constraints to embed strict geometric consistency, reducing parameter burden while preserving generation quality. Third, we introduce transparent medical attention (TMA), a parameter-free, geometry-driven mechanism that enhances lesion-related regions in raw projection data. Lesion-focused attention maps are derived from coarse segmentation, covering both hypermetabolic and hypometabolic areas, and projected into sinogram space for physically consistent guidance. The system supports clinician-defined ROI adjustments, ensuring flexible, interpretable, and task-adaptive emphasis aligned with PET acquisition physics. Experimental results show ALL-PET achieves high-quality sinogram generation using only 500 samples, with performance comparable to models trained on larger datasets. ALL-PET generalizes across tasks including low-dose reconstruction, attenuation correction, delayed-frame prediction, and tracer separation, operating efficiently with memory use under 24GB.
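Projecting an image-domain mask into sinogram space amounts to a Radon-type transform of the mask. The sketch below shows one simple way to do that, assuming a pixel-driven parallel-beam projection with nearest-bin accumulation and a random rectangular mask. The rectangle shape, the projector, and both function names are illustrative assumptions, not the RMAS implementation.

```python
import math
import random


def random_rect_mask(n, rng):
    """Random non-empty axis-aligned rectangle on an n x n grid (hypothetical mask shape)."""
    r0, r1 = sorted(rng.sample(range(n + 1), 2))
    c0, c1 = sorted(rng.sample(range(n + 1), 2))
    return [[1 if r0 <= r < r1 and c0 <= c < c1 else 0 for c in range(n)]
            for r in range(n)]


def project_mask(mask, angles):
    """Pixel-driven parallel-beam projection; one row of detector bins per angle."""
    n = len(mask)
    center = (n - 1) / 2.0
    sinogram = []
    for theta in angles:
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        bins = [0.0] * n
        for row in range(n):
            for col in range(n):
                if mask[row][col] == 0:
                    continue
                # Position of the pixel centre along the detector axis.
                t = (col - center) * cos_t + (row - center) * sin_t
                k = int(round(t + center))
                if 0 <= k < n:
                    bins[k] += mask[row][col]
        sinogram.append(bins)
    return sinogram


block = [[0, 0, 0, 0],
         [1, 1, 0, 0],
         [1, 1, 0, 0],
         [0, 0, 0, 0]]
block_sino = project_mask(block, [0.0, math.pi / 2])

rand_mask = random_rect_mask(8, random.Random(0))
rand_sino = project_mask(rand_mask, [0.0])
```

At angle 0 the projection reduces to column sums and at 90 degrees to row sums, which makes the output easy to check by hand. A production projector would interpolate between detector bins instead of rounding to the nearest one.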

[19] Noise-Robust Topology Estimation of 2D Image Data via Neural Networks and Persistent Homology cs.CVPDF

Dylan Peek, Matthew P. Skerritt, Stephan Chalup

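TL;DR: This paper studies the noise robustness of a neural network that predicts Betti numbers in 2D binary images and compares it with a persistent homology (PH) approach, finding that the neural network performs better under noise, likely because it can learn contextual and geometric priors from data.

Details

Motivation: Persistent homology and neural networks are two contrasting approaches to inferring topological structure. The authors set out to examine how robust neural-network topology prediction is under noise and to compare it with a conventional PH-based method.

Result: Experiments show that the neural network performs better under noise, possibly because it can learn contextual and geometric priors.

Insight: Neural networks offer a promising alternative for topology estimation under noise, although the area is still emerging.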

Abstract: Persistent Homology (PH) and Artificial Neural Networks (ANNs) offer contrasting approaches to inferring topological structure from data. In this study, we examine the noise robustness of a supervised neural network trained to predict Betti numbers in 2D binary images. We compare an ANN approach against a PH pipeline based on cubical complexes and the Signed Euclidean Distance Transform (SEDT), which is a widely adopted strategy for noise-robust topological analysis. Using one synthetic and two real-world datasets, we show that ANNs can outperform this PH approach under noise, likely due to their capacity to learn contextual and geometric priors from training data. Though still emerging, the use of ANNs for topology estimation offers a compelling alternative to PH under structural noise.
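For readers unfamiliar with the prediction target, the sketch below computes the Betti numbers of a noise-free 2D binary image directly: b0 counts connected foreground components and b1 counts holes. It is an illustration only, not the paper's PH pipeline or its network. The connectivity convention (8-connected foreground, 4-connected background) is an assumption, and the function names are hypothetical.

```python
from collections import deque

N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def count_components(grid, value, neighbors):
    """Count connected regions of cells equal to `value` using breadth-first search."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != value or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in neighbors:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] == value):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def betti_numbers(image):
    """Return (b0, b1) for a 2D binary image given as a list of 0/1 rows."""
    cols = len(image[0])
    # Pad with background so the outer region forms a single component.
    padded = ([[0] * (cols + 2)]
              + [[0] + list(row) + [0] for row in image]
              + [[0] * (cols + 2)])
    b0 = count_components(padded, 1, N8)       # foreground pieces
    b1 = count_components(padded, 0, N4) - 1   # background pieces minus the outside
    return b0, b1


ring = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
print(betti_numbers(ring))  # one component enclosing one hole
```

On clean images this count is exact; a single flipped pixel can add a spurious component or hole, which is the kind of structural noise that motivates the robustness comparison in the abstract.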

[20] Objectness Similarity: Capturing Object-Level Fidelity in 3D Scene Evaluation cs.CV | cs.AI | cs.GRPDF

Yuiko Uchida, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama

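TL;DR: This paper proposes Objectness SIMilarity (OSIM), an evaluation metric for 3D scenes that focuses on “objects”. It uses an object detection model to quantify the “objectness” of each object in a scene and aligns more closely with human perception.

Details

Motivation: Existing 3D scene evaluation metrics assess overall image quality, which leads to discrepancies with human perception. Drawing on neuropsychological insights, the authors hypothesize that humans attend to individual objects when recognizing 3D scenes, and propose OSIM to make evaluation more accurate.

Result: OSIM aligns more closely with human perception than existing metrics; recent 3D reconstruction and generation models are re-evaluated under a standardized experimental setup.

Insight: Object-level fidelity is essential to how humans understand 3D scenes; evaluation should move from overall quality assessment toward finer-grained, object-level metrics.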

Abstract: This paper presents Objectness SIMilarity (OSIM), a novel evaluation metric for 3D scenes that explicitly focuses on “objects,” which are fundamental units of human visual perception. Existing metrics assess overall image quality, leading to discrepancies with human perception. Inspired by neuropsychological insights, we hypothesize that human recognition of 3D scenes fundamentally involves attention to individual objects. OSIM enables object-centric evaluations by leveraging an object detection model and its feature representations to quantify the “objectness” of each object in the scene. Our user study demonstrates that OSIM aligns more closely with human perception compared to existing metrics. We also analyze the characteristics of OSIM using various approaches. Moreover, we re-evaluate recent 3D reconstruction and generation models under a standardized experimental setup to clarify advancements in this field. The code is available at https://github.com/Objectness-Similarity/OSIM.

[21] Video Understanding by Design: How Datasets Shape Architectures and Insights cs.CV | cs.AI | cs.LGPDF

Lei Wang, Piotr Koniusz, Yongsheng Gao

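TL;DR: This survey revisits the development of video understanding from a dataset-driven perspective, showing how datasets steer the evolution of model architectures through motion complexity, temporal span, hierarchical composition, and multimodal richness.

Details

Motivation: Existing surveys of video understanding mostly classify models by task or family and overlook the structural influence of datasets on architectural evolution. The paper aims to fill this gap.

Result: The paper provides a unified framework that ties together datasets, inductive biases, and model architectures, offering a roadmap for future work in video understanding.

Insight: Datasets are not just benchmarks for model performance; their properties (such as temporal dynamics or hierarchical structure) directly guide architectural design. Future research should pay more attention to co-designing datasets and models.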

Abstract: Video understanding has advanced rapidly, fueled by increasingly complex datasets and powerful architectures. Yet existing surveys largely classify models by task or family, overlooking the structural pressures through which datasets guide architectural evolution. This survey is the first to adopt a dataset-driven perspective, showing how motion complexity, temporal span, hierarchical composition, and multimodal richness impose inductive biases that models should encode. We reinterpret milestones, from two-stream and 3D CNNs to sequential, transformer, and multimodal foundation models, as concrete responses to these dataset-driven pressures. Building on this synthesis, we offer practical guidance for aligning model design with dataset invariances while balancing scalability and task demands. By unifying datasets, inductive biases, and architectures into a coherent framework, this survey provides both a comprehensive retrospective and a prescriptive roadmap for advancing general-purpose video understanding.

[22] OCELOT 2023: Cell Detection from Cell-Tissue Interaction Challenge cs.CV | cs.AIPDF

JaeWoong Shin, Jeongun Ryu, Aaron Valero Puche, Jinhee Lee, Biagio Brattoli

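TL;DR: The OCELOT 2023 challenge uses a dataset with multi-scale cell-tissue interaction annotations to test whether understanding cell-tissue relationships matters for cell detection, and to accelerate research in this area. Participating models improved substantially by integrating multi-scale semantics, showing the promise of the approach.

Details

Motivation: Existing deep learning-based cell detection models struggle to replicate how pathologists examine slides at multiple magnifications, and the lack of datasets with multi-scale interaction annotations is a key bottleneck. The OCELOT 2023 challenge collects and organizes multi-scale annotated data to validate the role of cell-tissue relationships in improving performance.

Result: The best entries improved the F1-score on the test set by up to 7.99 over the cell-only baseline, demonstrating the effectiveness of integrating multi-scale semantics.

Insight: Cell detection needs tissue context, and learning multi-scale semantics is key to reaching human-level performance.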

Abstract: Pathologists routinely alternate between different magnifications when examining Whole-Slide Images, allowing them to evaluate both broad tissue morphology and intricate cellular details to form comprehensive diagnoses. However, existing deep learning-based cell detection models struggle to replicate these behaviors and learn the interdependent semantics between structures at different magnifications. A key barrier in the field is the lack of datasets with multi-scale overlapping cell and tissue annotations. The OCELOT 2023 challenge was initiated to gather insights from the community to validate the hypothesis that understanding cell and tissue (cell-tissue) interactions is crucial for achieving human-level performance, and to accelerate the research in this field. The challenge dataset includes overlapping cell detection and tissue segmentation annotations from six organs, comprising 673 pairs sourced from 306 The Cancer Genome Atlas (TCGA) Whole-Slide Images with hematoxylin and eosin staining, divided into training, validation, and test subsets. Participants presented models that significantly enhanced the understanding of cell-tissue relationships. Top entries achieved up to a 7.99 increase in F1-score on the test set compared to the baseline cell-only model that did not incorporate cell-tissue relationships. This is a substantial improvement in performance over traditional cell-only detection methods, demonstrating the need for incorporating multi-scale semantics into the models. This paper provides a comparative analysis of the methods used by participants, highlighting innovative strategies implemented in the OCELOT 2023 challenge.

[23] RT-DETR++ for UAV Object Detection cs.CVPDF

Yuan Shufang

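TL;DR: RT-DETR++ improves the encoder of RT-DETR with a channel-gated attention mechanism and CSP-PAC feature fusion, improving detection of small and densely packed objects in UAV imagery while remaining real-time.

Details

Motivation: Object detection in UAV imagery faces densely packed small objects, large scale variations, and occlusion, and existing methods fall short in performance and efficiency.

Result: RT-DETR++ achieves superior performance on small and densely packed objects while maintaining real-time detection speed.

Insight: A better-designed encoder and feature fusion can improve detection performance without increasing computational complexity.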

Abstract: Object detection in unmanned aerial vehicle (UAV) imagery presents significant challenges. Issues such as densely packed small objects, scale variations, and occlusion are commonplace. This paper introduces RT-DETR++, which enhances the encoder component of the RT-DETR model. Our improvements focus on two key aspects. First, we introduce a channel-gated attention-based upsampling/downsampling (AU/AD) mechanism. This dual-path system minimizes errors and preserves details during feature layer propagation. Second, we incorporate CSP-PAC during feature fusion. This technique employs parallel hollow convolutions to process local and contextual information within the same layer, facilitating the integration of multi-scale features. Evaluation demonstrates that our novel neck design achieves superior performance in detecting small and densely packed objects. The model maintains sufficient speed for real-time detection without increasing computational complexity. This study provides an effective approach for feature encoding design in real-time detection systems.

[24] A Knowledge Noise Mitigation Framework for Knowledge-based Visual Question Answering cs.CV | cs.AIPDF

Zhiyue Liu, Sihang Liu, Jinyuan Liu, Xinru Zhang

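TL;DR: The paper proposes a training-free framework that mitigates the interference of knowledge redundancy in KB-VQA, improving answer generation by enhancing knowledge relevance and reducing redundancy.

Details

Motivation: Existing KB-VQA methods inject retrieved knowledge directly into the model and ignore the noise introduced by knowledge redundancy, which hurts answer accuracy.

Result: Experiments show that the method outperforms state-of-the-art approaches and uses critical knowledge more accurately to answer questions.

Insight: By reducing knowledge noise and integrating knowledge selectively, KB-VQA systems can use external knowledge more effectively and answer more accurately.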

Abstract: Knowledge-based visual question answering (KB-VQA) requires a model to understand images and utilize external knowledge to provide accurate answers. Existing approaches often directly augment models with retrieved information from knowledge sources while ignoring substantial knowledge redundancy, which introduces noise into the answering process. To address this, we propose a training-free framework with knowledge focusing for KB-VQA that mitigates the impact of noise by enhancing knowledge relevance and reducing redundancy. First, for knowledge retrieval, our framework summarizes the essential parts of the image-question pairs, creating low-noise queries that enhance the retrieval of highly relevant knowledge. Considering that redundancy still persists in the retrieved knowledge, we then prompt large models to identify and extract answer-beneficial segments from knowledge. In addition, we introduce a selective knowledge integration strategy, allowing the model to incorporate knowledge only when it lacks confidence in answering the question, thereby mitigating the influence of redundant information. Our framework enables the acquisition of accurate and critical knowledge, and extensive experiments demonstrate that it outperforms state-of-the-art methods.
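The selective knowledge integration step can be pictured as a simple confidence gate. The sketch below is a toy illustration under stated assumptions, not the paper's method: `answer_fn`, `retrieve_fn`, the 0.7 threshold, and the stub functions are all hypothetical placeholders, and the abstract does not say how confidence is measured.

```python
def answer_selectively(question, answer_fn, retrieve_fn, threshold=0.7):
    """Answer without knowledge first; consult retrieved knowledge only if unsure.

    answer_fn(question, knowledge) -> (answer, confidence in [0, 1])
    retrieve_fn(question) -> knowledge string
    Returns (answer, used_knowledge).
    """
    answer, confidence = answer_fn(question, None)
    if confidence >= threshold:
        return answer, False  # confident enough: skip retrieval entirely
    knowledge = retrieve_fn(question)
    answer, _ = answer_fn(question, knowledge)
    return answer, True


# Toy stand-ins for a model and a retriever (hypothetical, for demonstration only).
def toy_answer(question, knowledge):
    if knowledge is not None:
        return knowledge.split()[-1], 0.95
    if "capital" in question:
        return "unknown", 0.2
    return "blue", 0.9


def toy_retrieve(question):
    return "The capital of France is Paris"


print(answer_selectively("What is the capital of France?", toy_answer, toy_retrieve))
print(answer_selectively("What color is the sky?", toy_answer, toy_retrieve))
```

The design point is that retrieval is skipped on the confident path, so redundant knowledge cannot perturb answers the model already gets right.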

[25] CWSSNet: Hyperspectral Image Classification Enhanced by Wavelet Domain Convolution cs.CVPDF

Yulin Tong, Fengzong Zhang, Haiqin Cheng

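TL;DR: CWSSNet is a hyperspectral image classification framework that combines 3D spectral-spatial features with wavelet convolution; it improves classification through a multiscale convolutional attention module and multi-band wavelet decomposition, and remains robust under small-sample training.

Details

Motivation: The many bands, high dimensionality, and spectral mixing of hyperspectral images cause feature redundancy, which limits the classification performance of traditional methods and calls for a way past this bottleneck.

Result: CWSSNet reaches 74.50% mIoU, 82.73% mAcc, and 84.94% mF1, with the highest IoU for water bodies, vegetation, and bare land.

Insight: Wavelet-domain convolution and a multiscale attention module can substantially improve hyperspectral image classification, and the model stays robust under small-sample training.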

Abstract: Hyperspectral remote sensing technology has significant application value in fields such as forestry ecology and precision agriculture, while also putting forward higher requirements for fine ground object classification. However, although hyperspectral images are rich in spectral information and can improve recognition accuracy, they tend to cause prominent feature redundancy due to their numerous bands, high dimensionality, and spectral mixing characteristics. To address this, this study used hyperspectral images from the ZY1F satellite as a data source and selected Yugan County, Shangrao City, Jiangxi Province as the research area to perform ground object classification research. A classification framework named CWSSNet was proposed, which integrates 3D spectral-spatial features and wavelet convolution. This framework integrates multimodal information using a multiscale convolutional attention module and breaks through the classification performance bottleneck of traditional methods by introducing multi-band decomposition and convolution operations in the wavelet domain. The experiments showed that CWSSNet achieved 74.50%, 82.73%, and 84.94% in mean Intersection over Union (mIoU), mean Accuracy (mAcc), and mean F1-score (mF1) respectively in Yugan County. It also obtained the highest Intersection over Union (IoU) in the classification of water bodies, vegetation, and bare land, demonstrating good robustness. Additionally, when the training set proportion was 70%, the increase in training time was limited, and the classification effect was close to the optimal level, indicating that the model maintains reliable performance under small-sample training conditions.
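The three reported metrics can all be derived from a per-class confusion matrix. The sketch below shows one common convention, assumed here rather than taken from the paper: mAcc is the mean of per-class recall, and each metric is averaged over classes with equal weight. The function name and the example matrix are hypothetical.

```python
def segmentation_metrics(confusion):
    """Compute mIoU, mAcc, and mF1 from a square confusion matrix.

    confusion[i][j] is the number of pixels of true class i predicted as class j.
    """
    n = len(confusion)
    ious, accs, f1s = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                        # class k pixels missed
        fp = sum(confusion[i][k] for i in range(n)) - tp   # other pixels labeled k
        ious.append(tp / (tp + fp + fn) if tp + fp + fn else 0.0)
        accs.append(tp / (tp + fn) if tp + fn else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0)
    return {
        "mIoU": sum(ious) / n,
        "mAcc": sum(accs) / n,
        "mF1": sum(f1s) / n,
    }


# Made-up two-class example: rows are true classes, columns are predictions.
example = [
    [8, 2],
    [1, 9],
]
print(segmentation_metrics(example))
```

Since per-class F1 equals 2·IoU / (1 + IoU), F1 is never lower than IoU for the same class, which is consistent with the reported mF1 exceeding the reported mIoU.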

[26] Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios cs.CVPDF

Chunxiao Li, Xiaoxiao Wang, Meiling Li, Boming Miao, Peng Sun

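TL;DR: This study introduces the Real-World Robustness Dataset (RRDataset) for evaluating AI-generated image detectors in complex real-world conditions. The dataset covers seven major scenarios and tests robustness after internet transmission and re-digitization. Results reveal the limitations of current methods under real-world conditions and underscore the need to draw on human adaptability to build more robust detection algorithms.

Details

Motivation: With the rapid advancement of generative models, highly realistic image synthesis poses new challenges to digital security and media credibility. Evaluation of existing AI-generated image detection methods under complex real-world conditions remains a research gap, which the authors aim to fill.

Result: Benchmarking shows that existing AI detection methods perform poorly under real-world conditions, with marked degradation after internet transmission and re-digitization. The human study indicates that people adapt better to complex scenarios after few-shot learning.

Insight: Current AI-generated image detectors have clear limitations in complex real-world scenarios; future work should draw on human adaptability to develop more robust algorithms, particularly for images altered by transmission and processing.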

Abstract: With the rapid advancement of generative models, highly realistic image synthesis has posed new challenges to digital security and media credibility. Although AI-generated image detection methods have partially addressed these concerns, a substantial research gap remains in evaluating their performance under complex real-world conditions. This paper introduces the Real-World Robustness Dataset (RRDataset) for comprehensive evaluation of detection models across three dimensions: 1) Scenario Generalization: RRDataset encompasses high-quality images from seven major scenarios (War and Conflict, Disasters and Accidents, Political and Social Events, Medical and Public Health, Culture and Religion, Labor and Production, and everyday life), addressing existing dataset gaps from a content perspective. 2) Internet Transmission Robustness: examining detector performance on images that have undergone multiple rounds of sharing across various social media platforms. 3) Re-digitization Robustness: assessing model effectiveness on images altered through four distinct re-digitization methods. We benchmarked 17 detectors and 10 vision-language models (VLMs) on RRDataset and conducted a large-scale human study involving 192 participants to investigate human few-shot learning capabilities in detecting AI-generated images. The benchmarking results reveal the limitations of current AI detection methods under real-world conditions and underscore the importance of drawing on human adaptability to develop more robust detection algorithms.

[27] VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models: Methods and Results cs.CVPDF

Hanwei Zhu, Haoning Wu, Zicheng Zhang, Lingyu Zhu, Yixuan Li

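TL;DR: This paper summarizes the VQualA challenge held at the ICCV 2025 Workshop on Visual Quality Assessment, which aims to evaluate and improve the ability of large multimodal models (LMMs) to compare visual quality differences. The competition introduced a new benchmark of thousands of coarse-to-fine grained visual quality comparison tasks and drew around 100 participants, with five models demonstrating the potential of instruction-tuned LMMs for quality assessment.

Details

Motivation: Existing LMMs still need to improve at open-ended, detailed reasoning about visual quality differences, so the VQualA challenge was designed to advance the relevant techniques.

Result: Five models demonstrated the emerging capabilities of instruction-tuned LMMs in visual quality assessment, marking a significant step toward open-domain visual quality reasoning.

Insight: The competition shows the potential of instruction-tuned LMMs for quality assessment and points the way for future research on interpretable, human-aligned quality evaluation systems.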

Abstract: This paper presents a summary of the VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models (LMMs), hosted as part of the ICCV 2025 Workshop on Visual Quality Assessment. The challenge aims to evaluate and enhance the ability of state-of-the-art LMMs to perform open-ended and detailed reasoning about visual quality differences across multiple images. To this end, the competition introduces a novel benchmark comprising thousands of coarse-to-fine grained visual quality comparison tasks, spanning single images, pairs, and multi-image groups. Each task requires models to provide accurate quality judgments. The competition emphasizes holistic evaluation protocols, including 2AFC-based binary preference and multi-choice questions (MCQs). Around 100 participants submitted entries, with five models demonstrating the emerging capabilities of instruction-tuned LMMs on quality assessment. This challenge marks a significant step toward open-domain visual quality reasoning and comparison and serves as a catalyst for future research on interpretable and human-aligned quality evaluation systems.

[28] Medverse: A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement cs.CVPDF

Jiesi Hu, Jianfeng Cao, Yanwu Yang, Chenfei Ye, Yixuan Zhang

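TL;DR: Medverse is a universal in-context learning model for segmentation, transformation, and enhancement of 3D medical images, addressing the limitations of existing methods in high-fidelity prediction and global anatomical understanding.

Details

Motivation: Current in-context learning models for medical image analysis cannot achieve high-fidelity predictions and global anatomical understanding at the same time, and no unified model spans tasks and anatomical regions, which limits their potential in medical imaging.

Result: Medverse substantially outperforms existing methods on unseen datasets and establishes a new paradigm for in-context learning.

Insight: Medverse's success suggests that a unified in-context learning model can work across tasks and modalities, opening new possibilities for medical image analysis.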

Abstract: In-context learning (ICL) offers a promising paradigm for universal medical image analysis, enabling models to perform diverse image processing tasks without retraining. However, current ICL models for medical imaging remain limited in two critical aspects: they cannot simultaneously achieve high-fidelity predictions and global anatomical understanding, and there is no unified model trained across diverse medical imaging tasks (e.g., segmentation and enhancement) and anatomical regions. As a result, the full potential of ICL in medical imaging remains underexplored. Thus, we present Medverse, a universal ICL model for 3D medical imaging, trained on 22 datasets covering diverse tasks in universal image segmentation, transformation, and enhancement across multiple organs, imaging modalities, and clinical centers. Medverse employs a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine, generating consistent, full-resolution volumetric outputs and enabling multi-scale anatomical awareness. We further propose a blockwise cross-attention module that facilitates long-range interactions between context and target inputs while preserving computational efficiency through spatial sparsity. Medverse is extensively evaluated on a broad collection of held-out datasets covering previously unseen clinical centers, organs, species, and imaging modalities. Results demonstrate that Medverse substantially outperforms existing ICL baselines and establishes a novel paradigm for in-context learning. Code and model weights will be made publicly available. Our model is publicly available at https://github.com/jiesihu/Medverse.

[29] CoAtNeXt: An Attention-Enhanced ConvNeXtV2-Transformer Hybrid Model for Gastric Tissue Classification cs.CV | cs.AIPDF

Mustafa Yurdakul, Sakir Tasdemir

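TL;DR: CoAtNeXt is a hybrid model combining ConvNeXtV2 and Transformer components for gastric tissue image classification, outperforming conventional CNN and ViT models.

Details

Motivation: Early diagnosis of gastric diseases is crucial to prevent fatal outcomes, but the manual histopathologic examination it currently relies on is labor-intensive and subject to variability among pathologists, so automated methods are needed.

Result: CoAtNeXt performs strongly on two datasets, reaching 96.47% accuracy on HMU-GC-HE-30K and 98.29% on GasHisSDB, and outperforms all compared models.

Insight: CoAtNeXt shows potential for histopathological classification, helping to improve diagnostic accuracy and reduce pathologists' workload.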

Abstract: Background and objective: Early diagnosis of gastric diseases is crucial to prevent fatal outcomes. Although histopathologic examination remains the diagnostic gold standard, it is performed entirely manually, making evaluations labor-intensive and prone to variability among pathologists. Critical findings may be missed, and lack of standard procedures reduces consistency. These limitations highlight the need for automated, reliable, and efficient methods for gastric tissue analysis. Methods: In this study, a novel hybrid model named CoAtNeXt was proposed for the classification of gastric tissue images. The model is built upon the CoAtNet architecture by replacing its MBConv layers with enhanced ConvNeXtV2 blocks. Additionally, the Convolutional Block Attention Module (CBAM) is integrated to improve local feature extraction through channel and spatial attention mechanisms. The architecture was scaled to achieve a balance between computational efficiency and classification performance. CoAtNeXt was evaluated on two publicly available datasets, HMU-GC-HE-30K for eight-class classification and GasHisSDB for binary classification, and was compared against ten Convolutional Neural Networks (CNNs) and ten Vision Transformer (ViT) models. Results: CoAtNeXt achieved 96.47% accuracy, 96.60% precision, 96.47% recall, 96.45% F1 score, and 99.89% AUC on HMU-GC-HE-30K. On GasHisSDB, it reached 98.29% accuracy, 98.07% precision, 98.41% recall, 98.23% F1 score, and 99.90% AUC. It outperformed all CNN and ViT models tested and surpassed previous studies in the literature. Conclusion: Experimental results show that CoAtNeXt is a robust architecture for histopathological classification of gastric tissue images, performing well on both binary and multiclass tasks. This highlights its potential to assist pathologists by enhancing diagnostic accuracy and reducing workload.

[30] Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis cs.CV | cs.MMPDF

Jing Hao, Yuxuan Fan, Yanpeng Sun, Kaixin Guo, Lizhuo Lin

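TL;DR: This paper introduces MMOral, the first multimodal instruction dataset and benchmark for panoramic X-ray analysis, and builds the OralGPT model on it, substantially improving performance in the dental domain.

Details

Motivation: Although large vision-language models (LVLMs) perform well on general-purpose medical tasks, their effectiveness in specialized domains such as dentistry remains underexplored. Panoramic X-rays are hard for existing models to interpret accurately because of their dense anatomical structures and subtle pathological cues.

Result: Existing LVLMs perform poorly on MMOral-Bench (best accuracy 41.45%), while OralGPT achieves a 24.73% improvement after a single epoch of supervised fine-tuning.

Insight: 1. Specialized domains need dedicated datasets and benchmarks; 2. supervised fine-tuning is highly effective at improving model performance; 3. open-source resources are essential for advancing dental AI.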

Abstract: Recent advances in large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, pose interpretative challenges due to dense anatomical structures and subtle pathological cues, which are not captured by existing medical benchmarks or instruction datasets. To this end, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, i.e., GPT-4o, only achieves 41.45% accuracy, revealing significant limitations of current models in this domain. To promote the progress of this specific domain, we also propose OralGPT, which conducts supervised fine-tuning (SFT) upon Qwen2.5-VL-7B with our meticulously curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance enhancements for LVLMs, e.g., OralGPT demonstrates a 24.73% improvement. Both MMOral and OralGPT hold significant potential as a critical foundation for intelligent dentistry and enable more clinically impactful multimodal AI systems in the dental field. The dataset, model, benchmark, and evaluation suite are available at https://github.com/isbrycee/OralGPT.

[31] DATE: Dynamic Absolute Time Enhancement for Long Video Understanding cs.CVPDF

Chao Yuan, Yang Yang, Yehui Yang, Zach Cheng

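TL;DR: DATE (Dynamic Absolute Time Enhancement) combines a timestamp injection mechanism with a temporal-aware similarity sampling strategy to improve the temporal awareness of multimodal large language models in long video understanding.

Details

Motivation: Long video understanding is a fundamental challenge for multimodal large language models; conventional approaches rely on uniform frame sampling and implicit position encodings, which lose critical information and degrade temporal comprehension.

Result: On hour-long video benchmarks, both the 7B and 72B models achieve state-of-the-art results, and the 7B model even exceeds many 72B models on some benchmarks.

Insight: Explicit temporal modeling and semantically guided sampling are crucial for temporal reasoning over long videos, and small models with efficient methods can outperform larger ones.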

Abstract: Long video understanding remains a fundamental challenge for multimodal large language models (MLLMs), particularly in tasks requiring precise temporal reasoning and event localization. Existing approaches typically adopt uniform frame sampling and rely on implicit position encodings to model temporal order. However, these methods struggle with long-range dependencies, leading to critical information loss and degraded temporal comprehension. In this paper, we propose Dynamic Absolute Time Enhancement (DATE), which enhances temporal awareness in MLLMs through the Timestamp Injection Mechanism (TIM) and a semantically guided Temporal-Aware Similarity Sampling (TASS) strategy. Specifically, we interleave video frame embeddings with textual timestamp tokens to construct a continuous temporal reference system. We further reformulate the video sampling problem as a vision-language retrieval task and introduce a two-stage algorithm to ensure both semantic relevance and temporal coverage: enriching each query into a descriptive caption to better align with the vision features, and sampling key events with a similarity-driven, temporally regularized greedy strategy. Our method achieves remarkable improvements w.r.t. absolute time understanding and key event localization, resulting in state-of-the-art performance among 7B and 72B models on hour-long video benchmarks. Notably, our 7B model even exceeds many 72B models on some benchmarks.
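The two mechanisms named in the abstract can be sketched in a few lines of plain Python. This is a rough illustration, not the authors' TIM or TASS code: the `<HH:MM:SS>` token format, the minimum-gap rule used as the temporal regularizer, and both function names are assumptions.

```python
def interleave_timestamps(frame_tokens, fps):
    """Place a textual timestamp token before each frame token.

    frame_tokens stands in for per-frame embeddings; fps is the sampling rate
    of those frames in frames per second.
    """
    sequence = []
    for index, token in enumerate(frame_tokens):
        seconds = int(index / fps)
        hours, rest = divmod(seconds, 3600)
        minutes, secs = divmod(rest, 60)
        sequence.append(f"<{hours:02d}:{minutes:02d}:{secs:02d}>")
        sequence.append(token)
    return sequence


def greedy_temporal_sampling(similarities, k, min_gap):
    """Pick up to k frames by descending similarity, at least min_gap frames apart.

    similarities[i] is the query-to-frame similarity of frame i. The gap
    constraint spreads the selection over time instead of clustering it
    around a single peak.
    """
    order = sorted(range(len(similarities)),
                   key=lambda i: similarities[i], reverse=True)
    chosen = []
    for i in order:
        if all(abs(i - j) >= min_gap for j in chosen):
            chosen.append(i)
            if len(chosen) == k:
                break
    return sorted(chosen)


scores = [0.1, 0.9, 0.85, 0.2, 0.3, 0.8, 0.1]
picked = greedy_temporal_sampling(scores, k=2, min_gap=2)
print(picked)
print(interleave_timestamps([f"frame{i}" for i in picked], fps=0.5))
```

In this toy run, frame 2 is skipped despite its high score because it sits next to the already chosen frame 1, so the second pick falls on a later event.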

[32] Unified Start, Personalized End: Progressive Pruning for Efficient 3D Medical Image Segmentation cs.CVPDF

Linhao Li, Yiwen Ye, Ziyang Chen, Yong Xia

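TL;DR: PSP-Seg is a progressive pruning framework for efficient, dynamic 3D medical image segmentation that iteratively prunes redundant modules, sharply reducing resource consumption while preserving performance.

Details

Motivation: 3D medical image segmentation is typically resource-intensive, and existing efficient models are mostly static designs that adapt poorly to diverse tasks and struggle to balance performance with efficiency.

Result: PSP-Seg-S performs on par with nnU-Net while reducing GPU memory usage by 42-45%, training time by 29-48%, and parameter count by 83-87%.

Insight: Dynamic pruning can markedly improve resource efficiency in medical image segmentation while maintaining high performance.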

Abstract: 3D medical image segmentation often faces heavy resource and time consumption, limiting its scalability and rapid deployment in clinical environments. Existing efficient segmentation models are typically static and manually designed prior to training, which restricts their adaptability across diverse tasks and makes it difficult to balance performance with resource efficiency. In this paper, we propose PSP-Seg, a progressive pruning framework that enables dynamic and efficient 3D segmentation. PSP-Seg begins with a redundant model and iteratively prunes redundant modules through a combination of block-wise pruning and a functional decoupling loss. We evaluate PSP-Seg on five public datasets, benchmarking it against seven state-of-the-art models and six efficient segmentation models. Results demonstrate that the lightweight variant, PSP-Seg-S, achieves performance on par with nnU-Net while reducing GPU memory usage by 42-45%, training time by 29-48%, and parameter number by 83-87% across all datasets. These findings underscore PSP-Seg’s potential as a cost-effective yet high-performing alternative for widespread clinical application.

[33] Visual Programmability: A Guide for Code-as-Thought in Chart Understanding cs.CVPDF

Bohao Tang, Yan Ma, Fei Zhang, Jiadi Su, Ethan Chern

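TL;DR: The paper proposes Code-as-Thought (CaT), an approach that strengthens the reasoning of vision-language models on chart understanding, and uses reinforcement learning to select the best reasoning pathway dynamically.

Details

Motivation: Existing chart understanding methods either depend on external tools or adopt a single reasoning strategy (such as text-based chain-of-thought), which limits flexibility and accuracy. To address this, the paper explores code as a symbolic representation of visual information and proposes a scheme for dynamically choosing the reasoning pathway.

Result: The method performs strongly across multiple chart-understanding benchmarks, validating dynamic selection of reasoning pathways.

Insight: Vision-language models can learn not only to reason but also to choose the best reasoning pathway dynamically, which improves task adaptability and offers a new approach to complex visual reasoning tasks.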

Abstract: Chart understanding presents a critical test to the reasoning capabilities of Vision-Language Models (VLMs). Prior approaches face critical limitations: some rely on external tools, making them brittle and constrained by a predefined toolkit, while others fine-tune specialist models that often adopt a single reasoning strategy, such as text-based chain-of-thought (CoT). The intermediate steps of text-based reasoning are difficult to verify, which complicates the use of reinforcement-learning signals that reward factual accuracy. To address this, we propose a Code-as-Thought (CaT) approach to represent the visual information of a chart in a verifiable, symbolic format. Our key insight is that this strategy must be adaptive: a fixed, code-only implementation consistently fails on complex charts where symbolic representation is unsuitable. This finding leads us to introduce Visual Programmability: a learnable property that determines if a chart-question pair is better solved with code or direct visual analysis. We implement this concept in an adaptive framework where a VLM learns to choose between the CaT pathway and a direct visual reasoning pathway. The selection policy of the model is trained with reinforcement learning using a novel dual-reward system. This system combines a data-accuracy reward to ground the model in facts and prevent numerical hallucination, with a decision reward that teaches the model when to use each strategy, preventing it from defaulting to a single reasoning mode. Experiments demonstrate strong and robust performance across diverse chart-understanding benchmarks. Our work shows that VLMs can be taught not only to reason but also how to reason, dynamically selecting the optimal reasoning pathway for each task.

[34] Modality-Agnostic Input Channels Enable Segmentation of Brain lesions in Multimodal MRI with Sequences Unavailable During Training cs.CV | cs.AIPDF

Anthony P. Addison, Felix Wagner, Wentian Xu, Natalie Voets, Konstantinos Kamnitsas

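TL;DR: The paper proposes modality-agnostic input channels that let a U-net architecture process MRI data containing modalities unseen during training, alone or mixed with seen ones, making brain lesion segmentation more flexible.

Details

Motivation: Existing brain MRI segmentation models are usually restricted to the fixed modalities seen in training and cannot handle new modalities or mixed combinations at inference. The paper aims to develop a more flexible model that adapts to any available MRI modalities, including ones unseen during training.

Result: Experiments on 8 MRI databases covering 5 pathology types show that the method still processes trained modalities effectively while improving segmentation with unseen modalities.

Insight: Combining a modality-agnostic design with synthetic augmentation offers a practical and flexible solution for multimodal medical images.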

Abstract: Segmentation models are important tools for the detection and analysis of lesions in brain MRI. Depending on the type of brain pathology that is imaged, MRI scanners can acquire multiple, different image modalities (contrasts). Most segmentation models for multimodal brain MRI are restricted to fixed modalities and cannot effectively process new ones at inference. Some models generalize to unseen modalities but may lose discriminative modality-specific information. This work aims to develop a model that can perform inference on data that contain image modalities unseen during training, previously seen modalities, and heterogeneous combinations of both, thus allowing a user to utilize any available imaging modalities. We demonstrate this is possible with a simple, thus practical alteration to the U-net architecture, by integrating a modality-agnostic input channel or pathway, alongside modality-specific input channels. To train this modality-agnostic component, we develop an image augmentation scheme that synthesizes artificial MRI modalities. Augmentations differentially alter the appearance of pathological and healthy brain tissue to create artificial contrasts between them while maintaining realistic anatomical integrity. We evaluate the method using 8 MRI databases that include 5 types of pathologies (stroke, tumours, traumatic brain injury, multiple sclerosis and white matter hyperintensities) and 8 modalities (T1, T1+contrast, T2, PD, SWI, DWI, ADC and FLAIR). The results demonstrate that the approach preserves the ability to effectively process MRI modalities encountered during training, while being able to process new, unseen modalities to improve its segmentation. Project code: https://github.com/Anthony-P-Addison/AGN-MOD-SEG

[35] Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on Materials Characterization cs.CV | cs.AI | cs.CL | cs.MMPDF

Zhengzhao Lai, Youbin Zheng, Zhenyang Cai, Haonan Lyu, Jinpu Yang

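TL;DR: The paper introduces the MatCha benchmark for evaluating how well multimodal large language models (MLLMs) understand materials characterization images; existing models perform poorly on expert-level tasks, and prompting methods do little to close the gap.

Details

Motivation: Materials characterization is central to materials science, yet the ability of current MLLMs to understand real-world characterization images remains underexplored. The authors present MatCha to fill this gap.

Result: Experiments show a significant gap between existing MLLMs and human experts on materials characterization image understanding, especially on tasks requiring higher-level expertise and sophisticated visual perception.

Insight: Existing MLLMs adapt poorly to real-world materials characterization scenarios, and prompting methods such as few-shot and chain-of-thought offer little improvement. MatCha is expected to support research on new material discovery and autonomous scientific agents.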

Abstract: Materials characterization is fundamental to acquiring materials information, revealing the processing-microstructure-property relationships that guide material design and optimization. While multimodal large language models (MLLMs) have recently shown promise in generative and predictive tasks within materials science, their capacity to understand real-world characterization imaging data remains underexplored. To bridge this gap, we present MatCha, the first benchmark for materials characterization image understanding, comprising 1,500 questions that demand expert-level domain expertise. MatCha encompasses four key stages of materials research comprising 21 distinct tasks, each designed to reflect authentic challenges faced by materials scientists. Our evaluation of state-of-the-art MLLMs on MatCha reveals a significant performance gap compared to human experts. These models exhibit degradation when addressing questions requiring higher-level expertise and sophisticated visual perception. Simple few-shot and chain-of-thought prompting struggle to alleviate these limitations. These findings highlight that existing MLLMs still exhibit limited adaptability to real-world materials characterization scenarios. We hope MatCha will facilitate future research in areas such as new material discovery and autonomous scientific agents. MatCha is available at https://github.com/FreedomIntelligence/MatCha.

Hao Si, Ehsan Javanmardi, Manabu Tsukada

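TL;DR: PHCP is a new collaborative perception framework that addresses the practical challenges of heterogeneous vehicle collaboration by dynamically aligning features at inference time, without labeled data or joint training.

Details

Motivation: In real-world scenarios, the perception models on different vehicles are usually heterogeneous. Existing methods require joint training or pre-stored collaborator models, so they cannot be applied directly at inference. PHCP aims to enable efficient collaboration directly at inference without these steps.

Result: On the OPV2V dataset, PHCP performs strongly across heterogeneous scenarios, reaching performance comparable to SOTA methods trained on the entire dataset while using only a small amount of unlabeled data.

Insight: PHCP recasts collaborative perception as a dynamic adaptation problem at inference time, offering a feasible route to real-time heterogeneous collaboration.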

Abstract: Collaborative perception enables vehicles to overcome individual perception limitations by sharing information, allowing them to see further and through occlusions. In real-world scenarios, models on different vehicles are often heterogeneous due to manufacturer variations. Existing methods for heterogeneous collaborative perception address this challenge by fine-tuning adapters or the entire network to bridge the domain gap. However, these methods are impractical in real-world applications, as each new collaborator must undergo joint training with the ego vehicle on a dataset before inference, or the ego vehicle stores models for all potential collaborators in advance. Therefore, we pose a new question: Can we tackle this challenge directly during inference, eliminating the need for joint training? To answer this, we introduce Progressive Heterogeneous Collaborative Perception (PHCP), a novel framework that formulates the problem as few-shot unsupervised domain adaptation. Unlike previous work, PHCP dynamically aligns features by self-training an adapter during inference, eliminating the need for labeled data and joint training. Extensive experiments on the OPV2V dataset demonstrate that PHCP achieves strong performance across diverse heterogeneous scenarios. Notably, PHCP achieves performance comparable to SOTA methods trained on the entire dataset while using only a small amount of unlabeled data.

[37] Image Recognition with Vision and Language Embeddings of VLMs cs.CV

Illia Volkov, Nikita Kisel, Klara Janouskova, Jiri Matas

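TL;DR: The paper comprehensively evaluates the vision-only inference capabilities of vision-language models (VLMs) and introduces a learning-free fusion method based on per-class precision that exploits the complementarity of language and vision to improve classification.

Details

Motivation: VLMs excel at zero-shot classification, but their purely visual inference capabilities are under-explored. The paper aims to fill this gap and to explore strategies for combining language and vision.

Result: Experiments show that language and vision are complementary in classification: some classes favour textual prompts, while others are better handled by visual similarity. The proposed fusion method improves classification performance.

Insight: Vision and language embeddings each have their own strengths in image classification, and combining them can significantly improve performance. The fusion method needs no additional training, which makes it practical.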

Abstract: Vision-language models (VLMs) have enabled strong zero-shot classification through image-text alignment. Yet, their purely visual inference capabilities remain under-explored. In this work, we conduct a comprehensive evaluation of both language-guided and vision-only image classification with a diverse set of dual-encoder VLMs, including both well-established and recent models such as SigLIP 2 and RADIOv2.5. The performance is compared in a standard setup on the ImageNet-1k validation set and its label-corrected variant. The key factors affecting accuracy are analysed, including prompt design, class diversity, the number of neighbours in k-NN, and reference set size. We show that language and vision offer complementary strengths, with some classes favouring textual prompts and others better handled by visual similarity. To exploit this complementarity, we introduce a simple, learning-free fusion method based on per-class precision that improves classification performance. The code is available at: https://github.com/gonikisgo/bmvc2025-vlm-image-recognition.
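The abstract does not spell out the fusion rule, so the following is only a minimal sketch of one way a learning-free, per-class-precision fusion could work. It assumes each branch (text prompts and k-NN over image embeddings) has already produced a class prediction, and that per-class precision is estimated on a held-out reference set. The function names, the tie-breaking rule, and the toy numbers are all hypothetical and are not taken from the paper or its repository.

```python
from collections import Counter

def per_class_precision(preds, labels, num_classes):
    """Precision per predicted class: correct predictions / all predictions of that class."""
    predicted = Counter(preds)
    correct = Counter(p for p, y in zip(preds, labels) if p == y)
    # A class that is never predicted gets precision 0.0 (a convention chosen for this sketch).
    return [correct[c] / predicted[c] if predicted[c] else 0.0 for c in range(num_classes)]

def fuse(text_pred, knn_pred, text_prec, knn_prec):
    """Trust whichever branch is more precise for the class it predicts; ties go to text."""
    if text_prec[text_pred] >= knn_prec[knn_pred]:
        return text_pred
    return knn_pred

# Made-up held-out reference predictions for a 3-class problem.
ref_labels = [0, 0, 0, 1, 1, 1, 2, 2]
ref_text = [0, 0, 0, 0, 0, 1, 2, 2]  # the text branch over-predicts class 0
ref_knn = [0, 0, 1, 1, 1, 1, 2, 2]   # the k-NN branch over-predicts class 1

text_prec = per_class_precision(ref_text, ref_labels, 3)
knn_prec = per_class_precision(ref_knn, ref_labels, 3)

# A new image on which the branches disagree: text says class 0, k-NN says class 1.
fused = fuse(0, 1, text_prec, knn_prec)
print(text_prec, knn_prec, fused)
```

Because the rule only compares precomputed precision values, it adds no trainable parameters, which matches the "learning-free" description; the actual method may weight or combine scores differently.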

[38] Fine-Grained Customized Fashion Design with Image-into-Prompt benchmark and dataset from LMM cs.CV

Hui Li, Yi You, Qiqi Chen, Bingfeng Zhang, George Q. Huang

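TL;DR: The paper proposes BUG (Better Understanding Generation), a fine-grained fashion design customization workflow built on a large multimodal model (LMM). By converting images into prompts, it automatically generates and customizes garment designs, addressing the design difficulties caused by the uncertainty of text input.

Details

Motivation: Current generative AI models can readily turn ideas into designs, but for end users without professional background knowledge, fine-grained customization is still limited by the uncertainty of text input. The paper aims to use an LMM to lower the barrier to fashion design and improve the user experience.

Result: Experiments show that the BUG workflow improves design controllability and user satisfaction and lowers the barrier to design.

Insight: LMMs have great potential in fashion design, and the image-into-prompt mechanism offers a reference for fine-grained generation tasks in other domains.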

Abstract: Generative AI is changing how complex workflows are executed in industry, and large multimodal models (LMMs) now support fashion design in the garment industry. Current generative AI models easily turn brainstorming into polished designs, but fine-grained customization still suffers from text uncertainty when end users lack professional background knowledge. We therefore propose the Better Understanding Generation (BUG) workflow, which uses an LMM to automatically create and fine-grain customize clothing designs from chat via image-into-prompt. Our framework unleashes users’ creative potential beyond words and lowers the barriers to clothing design and editing without further human involvement. To demonstrate the effectiveness of our model, we propose a new FashionEdit dataset that simulates the real-world clothing design workflow and is evaluated in terms of generation similarity, user satisfaction, and quality. The code and dataset: https://github.com/detectiveli/FashionEdit.

[39] Exploring Pre-training Across Domains for Few-Shot Surgical Skill Assessment cs.CV | cs.LG

Dimitrios Anastasiou, Razvan Caramalau, Nazir Sirajudeen, Matthew Boal, Philip Edwards

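TL;DR: This paper studies how self-supervised pre-training strategies affect few-shot surgical skill assessment (SSA). It finds that small but domain-relevant datasets outperform large but domain-mismatched ones, and that incorporating procedure-specific data into pre-training significantly improves performance.

Details

Motivation: Surgical skill annotations are scarce and time-consuming to produce, so few-shot learning (FSL) is studied as an alternative, but its effectiveness depends on the pre-training strategy. Pre-training has so far been largely unexplored in SSA.

Result: Accuracies of 60.16%, 66.03%, and 73.65% are reached in the 1-, 2-, and 5-shot settings, respectively. Pre-training that incorporates procedure-specific data yields an average gain of 1.22% in accuracy and 2.28% in F1-score.

Insight: Domain relevance matters more than data scale. Adding procedure-specific data significantly improves performance, but only when the pre-training source is well matched to the target domain.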

Abstract: Automated surgical skill assessment (SSA) is a central task in surgical computer vision. Developing robust SSA models is challenging due to the scarcity of skill annotations, which are time-consuming to produce and require expert consensus. Few-shot learning (FSL) offers a scalable alternative enabling model development with minimal supervision, though its success critically depends on effective pre-training. While widely studied for several surgical downstream tasks, pre-training has remained largely unexplored in SSA. In this work, we formulate SSA as a few-shot task and investigate how self-supervised pre-training strategies affect downstream few-shot SSA performance. We annotate a publicly available robotic surgery dataset with Objective Structured Assessment of Technical Skill (OSATS) scores, and evaluate various pre-training sources across three few-shot settings. We quantify domain similarity and analyze how domain gap and the inclusion of procedure-specific data into pre-training influence transferability. Our results show that small but domain-relevant datasets can outperform large scale, less aligned ones, achieving accuracies of 60.16%, 66.03%, and 73.65% in the 1-, 2-, and 5-shot settings, respectively. Moreover, incorporating procedure-specific data into pre-training with a domain-relevant external dataset significantly boosts downstream performance, with an average gain of +1.22% in accuracy and +2.28% in F1-score; however, applying the same strategy with less similar but large-scale sources can instead lead to performance degradation. Code and models are available at https://github.com/anastadimi/ssa-fsl.

[40] Classification of Driver Behaviour Using External Observation Techniques for Autonomous Vehicles cs.CV | cs.AI | cs.ET | cs.RO | eess.IV

Ian Nell, Shane Gilroy

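TL;DR: This paper proposes a driver behaviour classification system based on external observation techniques. It uses computer vision to detect distracted and impaired driving, works for non-connected vehicles, and shows good reliability and adaptability.

Details

Motivation: Road traffic accidents are mainly caused by human error such as distracted and impaired driving. Existing systems mostly rely on inter-vehicular communication and cannot cover non-connected vehicles. The paper aims to develop a vision-based solution that does not depend on vehicle communication.

Result: Experiments on diverse video datasets show that the system is reliable and adaptable across varying road and environmental conditions.

Insight: Vision-based methods offer a non-intrusive solution for driver behaviour analysis that is particularly suited to non-connected vehicles and can help improve road safety.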

Abstract: Road traffic accidents remain a significant global concern, with human error, particularly distracted and impaired driving, among the leading causes. This study introduces a novel driver behavior classification system that uses external observation techniques to detect indicators of distraction and impairment. The proposed framework employs advanced computer vision methodologies, including real-time object tracking, lateral displacement analysis, and lane position monitoring. The system identifies unsafe driving behaviors such as excessive lateral movement and erratic trajectory patterns by implementing the YOLO object detection model and custom lane estimation algorithms. Unlike systems reliant on inter-vehicular communication, this vision-based approach enables behavioral analysis of non-connected vehicles. Experimental evaluations on diverse video datasets demonstrate the framework’s reliability and adaptability across varying road and environmental conditions.
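As a rough illustration of what lateral displacement analysis might involve, the sketch below normalizes a tracked vehicle's horizontal position against estimated lane boundaries and flags high variability. It assumes a detector and tracker (for example YOLO-based) and a lane estimator have already produced per-frame pixel coordinates. The function names and the `max_std` threshold are hypothetical and are not taken from the paper.

```python
from statistics import pstdev

def lateral_offsets(centres_x, lane_left_x, lane_right_x):
    """Per-frame offset from the lane centre, normalized so that +/-1 is a lane boundary."""
    offsets = []
    for cx, left, right in zip(centres_x, lane_left_x, lane_right_x):
        lane_centre = (left + right) / 2
        half_width = (right - left) / 2
        offsets.append((cx - lane_centre) / half_width)
    return offsets

def is_erratic(offsets, max_std=0.25):
    """Flag excessive lateral movement; max_std is an illustrative threshold, not a tuned value."""
    return pstdev(offsets) > max_std

# Toy tracks in pixel coordinates, with lane boundaries fixed at x=50 and x=150.
lane_left, lane_right = [50] * 4, [150] * 4
steady = lateral_offsets([100, 101, 99, 100], lane_left, lane_right)
weaving = lateral_offsets([60, 140, 65, 135], lane_left, lane_right)
print(is_erratic(steady), is_erratic(weaving))
```

Normalizing by lane width keeps the measure comparable across camera distances; a real system would also need to handle lane changes, perspective, and tracking noise.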

[41] FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark cs.CV | cs.CL

Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai

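TL;DR: The paper introduces FLUX-Reason-6M, a large-scale reasoning-oriented text-to-image dataset, and PRISM-Bench, a comprehensive evaluation benchmark, with the aim of closing the gap that open-source models show in complex reasoning.

Details

Motivation: Open-source text-to-image (T2I) models lag behind closed-source systems because large-scale reasoning-oriented datasets and comprehensive evaluation standards are lacking. The authors hope to advance the open-source community by introducing such a dataset and benchmark.

Result: An evaluation of 19 leading models reveals performance gaps and directions for improvement, and the dataset and benchmark give the community new resources.

Insight: The work addresses the gap in reasoning capability for open-source T2I models, in particular through GCoT and the Long Text challenge, which are aimed at improving models' understanding of complex prompts.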

Abstract: The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source systems. To address this challenge, we introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions specifically designed to teach complex reasoning. The images are organized according to six key characteristics: Imagination, Entity, Text rendering, Style, Affection, and Composition, and we design explicit Generation Chain-of-Thought (GCoT) to provide detailed breakdowns of image generation steps. The whole data curation takes 15,000 A100 GPU days, providing the community with a resource previously unattainable outside of large industrial labs. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it utilizes advanced vision-language models for nuanced human-aligned assessment of prompt-image alignment and image aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code are released to catalyze the next wave of reasoning-oriented T2I generation. Project page: https://flux-reason-6m.github.io/

[42] Plug-and-play Diffusion Models for Image Compressive Sensing with Data Consistency Projection cs.CV

Xiaodong Wang, Ping Wang, Zhangyuan Li, Xin Yuan

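TL;DR: By decoupling the diffusion process into three stages (denoising, data consistency, and sampling), the paper proposes a unified framework that combines PnP methods with DDIM to solve inverse problems in single-pixel imaging.

Details

Motivation: To study the connection between PnP methods and diffusion models (DDIM in particular) for ill-posed inverse problems such as single-pixel imaging, with the aim of integrating learned priors with physical forward models.

Result: On single-pixel imaging tasks, the method achieves better reconstruction quality.

Insight: With the decoupled, unified framework, diffusion models can combine domain knowledge with data-driven methods more effectively to solve inverse problems.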

Abstract: We explore the connection between Plug-and-Play (PnP) methods and Denoising Diffusion Implicit Models (DDIM) for solving ill-posed inverse problems, with a focus on single-pixel imaging. We begin by identifying key distinctions between PnP and diffusion models, particularly in their denoising mechanisms and sampling procedures. By decoupling the diffusion process into three interpretable stages: denoising, data consistency enforcement, and sampling, we provide a unified framework that integrates learned priors with physical forward models in a principled manner. Building upon this insight, we propose a hybrid data-consistency module that linearly combines multiple PnP-style fidelity terms. This hybrid correction is applied directly to the denoised estimate, improving measurement consistency without disrupting the diffusion sampling trajectory. Experimental results on single-pixel imaging tasks demonstrate that our method achieves better reconstruction quality.
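The three stages named in the abstract can be sketched for a tiny linear measurement model y = A x. This is only an illustration under stated assumptions. The two fidelity terms (a gradient step and an exact projection), the mixing `weight`, the `step` size, and every function name are hypothetical choices. The noise estimate `eps_hat` stands in for a trained denoiser, and the projection helper only handles a two-row `A` to keep the code dependency-free. The paper's actual module may combine different terms.

```python
import math

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def rmatvec(A, r):
    """Apply the transpose: A^T r."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]

def gradient_fidelity(x, A, y, step):
    """One gradient step on 0.5 * ||y - A x||^2."""
    r = [yi - ai for yi, ai in zip(y, matvec(A, x))]
    return [xi + step * gi for xi, gi in zip(x, rmatvec(A, r))]

def projection_fidelity(x, A, y):
    """Project onto {x : A x = y} via x + A^T (A A^T)^-1 (y - A x); two-row A only."""
    r = [yi - ai for yi, ai in zip(y, matvec(A, x))]
    G = [[sum(a * b for a, b in zip(A[i], A[j])) for j in range(2)] for i in range(2)]
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    z = [(G[1][1] * r[0] - G[0][1] * r[1]) / det,
         (G[0][0] * r[1] - G[1][0] * r[0]) / det]
    return [xi + ci for xi, ci in zip(x, rmatvec(A, z))]

def hybrid_consistency(x0, A, y, weight, step):
    """Linear combination of the two fidelity corrections (weight=1 is pure projection)."""
    p = projection_fidelity(x0, A, y)
    g = gradient_fidelity(x0, A, y, step)
    return [weight * pi + (1.0 - weight) * gi for pi, gi in zip(p, g)]

def pnp_ddim_step(x_t, eps_hat, abar_t, abar_prev, A, y, weight=0.5, step=0.1):
    # Stage 1, denoising: recover a clean estimate from the noise prediction.
    x0 = [(xt - math.sqrt(1 - abar_t) * e) / math.sqrt(abar_t) for xt, e in zip(x_t, eps_hat)]
    # Stage 2, data consistency: correct the denoised estimate only.
    x0 = hybrid_consistency(x0, A, y, weight, step)
    # Stage 3, sampling: deterministic DDIM update (eta = 0) reusing the same noise estimate.
    x_prev = [math.sqrt(abar_prev) * x + math.sqrt(1 - abar_prev) * e for x, e in zip(x0, eps_hat)]
    return x_prev, x0

A = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # toy measurement matrix
y = [2.0, 3.0]                          # toy measurements
x_t, eps_hat = [0.5, -0.2, 0.1], [0.1, 0.0, -0.1]
x_prev, x0 = pnp_ddim_step(x_t, eps_hat, 0.5, 0.8, A, y, weight=1.0)
print(matvec(A, x0))  # with weight=1.0 the corrected estimate reproduces y up to rounding
```

Applying the correction to the denoised estimate, rather than to the noisy iterate, is what lets the sampling update in stage 3 stay in its usual DDIM form.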

[43] A Fully Automatic Framework for Intracranial Pressure Grading: Integrating Keyframe Identification, ONSD Measurement and Clinical Data cs.CV

Pengxu Wen, Tingting Yu, Ziwei Nie, Cheng Jiang, Zhenyu Yin

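TL;DR: The paper proposes a fully automatic framework that combines keyframe identification, ONSD measurement, and clinical data for intracranial pressure grading, significantly improving accuracy and reliability.

Details

Motivation: Invasive methods for monitoring intracranial pressure (ICP) carry risks, while non-invasive methods based on optic nerve sheath diameter (ONSD) are not reliable enough because of inconsistent operation and subjectivity.

Result: Validation accuracy is 0.845 ± 0.071 and independent test accuracy is 0.786, significantly better than the conventional threshold-based method (0.637 ± 0.111 and 0.429).

Insight: By reducing operator variability and combining multi-source data, the study provides a reliable tool for non-invasive ICP assessment that may improve the management of acute neurological conditions.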

Abstract: Intracranial pressure (ICP) elevation poses severe threats to cerebral function, thus necessitating monitoring for timely intervention. While lumbar puncture is the gold standard for ICP measurement, its invasiveness and associated risks drive the need for non-invasive alternatives. Optic nerve sheath diameter (ONSD) has emerged as a promising biomarker, as elevated ICP directly correlates with increased ONSD. However, current clinical practices for ONSD measurement suffer from inconsistency in manual operation, subjectivity in optimal view selection, and variability in thresholding, limiting their reliability. To address these challenges, we introduce a fully automatic two-stage framework for ICP grading, integrating keyframe identification, ONSD measurement and clinical data. Specifically, the fundus ultrasound video processing stage performs frame-level anatomical segmentation, rule-based keyframe identification guided by an international consensus statement, and precise ONSD measurement. The intracranial pressure grading stage then fuses ONSD metrics with clinical features to enable the prediction of ICP grades, thereby demonstrating an innovative blend of interpretable ultrasound analysis and multi-source data integration for objective clinical evaluation. Experimental results demonstrate that our method achieves a validation accuracy of $0.845 \pm 0.071$ (with standard deviation from five-fold cross-validation) and an independent test accuracy of 0.786, significantly outperforming the conventional threshold-based method ($0.637 \pm 0.111$ validation accuracy, $0.429$ test accuracy). By effectively reducing operator variability and integrating multi-source information, our framework establishes a reliable non-invasive approach for clinical ICP evaluation, holding promise for improving patient management in acute neurological conditions.

[44] Decoupling Clinical and Class-Agnostic Features for Reliable Few-Shot Adaptation under Shift cs.CV

Umaima Rahman, Raza Imam, Mohammad Yaqub, Dwarikanath Mahapatra

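TL;DR: DRiFt decouples clinically relevant features from task-agnostic features to improve the reliability and generalization of medical vision-language models (VLMs) under distribution shift.

Details

Motivation: Medical VLMs are unreliable under distribution shift and tend to learn task-agnostic spurious correlations, which limits their use in real clinical settings.

Result: In-distribution performance improves by 11.4% in Top-1 accuracy and 3.3% in Macro-F1, and the model remains robust on unseen datasets.

Insight: Feature decoupling and alignment significantly improve generalization and reduce unpredictable behaviour under domain shift, pointing toward safer medical VLMs.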

Abstract: Medical vision-language models (VLMs) offer promise for clinical decision support, yet their reliability under distribution shifts remains a major concern for safe deployment. These models often learn task-agnostic correlations due to variability in imaging protocols and free-text reports, limiting their generalizability and increasing the risk of failure in real-world settings. We propose DRiFt, a structured feature decoupling framework that explicitly separates clinically relevant signals from task-agnostic noise using parameter-efficient tuning (LoRA) and learnable prompt tokens. To enhance cross-modal alignment and reduce uncertainty, we curate high-quality, clinically grounded image-text pairs by generating captions for a diverse medical dataset. Our approach improves in-distribution performance by +11.4% Top-1 accuracy and +3.3% Macro-F1 over prior prompt-based methods, while maintaining strong robustness across unseen datasets. Ablation studies reveal that disentangling task-relevant features and careful alignment significantly enhance model generalization and reduce unpredictable behavior under domain shift. These insights contribute toward building safer, more trustworthy VLMs for clinical use. The code is available at https://github.com/rumaima/DRiFt.

[45] FS-Diff: Semantic guidance and clarity-aware simultaneous multimodal image fusion and super-resolution cs.CV

Yuchan Jie, Yushen Xu, Xiaosong Li, Fuqiang Zhou, Jianming Lv

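TL;DR: FS-Diff is a multimodal image fusion and super-resolution method with semantic guidance and clarity awareness. It unifies the two tasks as a conditional generation problem and uses a modified U-Net to produce high-quality results.

Details

Motivation: In real applications such as military reconnaissance, target and background structures in multimodal images are easily corrupted and carry weak semantic information. Existing methods perform poorly, so a method is needed that improves resolution and semantic information at the same time.

Result: On multiple public datasets and AVMS, FS-Diff outperforms existing methods on joint fusion and super-resolution and recovers more detail and semantic information.

Insight: Through semantic guidance and clarity awareness, FS-Diff provides an end-to-end solution for multimodal image processing that is especially suited to low-resolution and high-noise scenes.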

Abstract: As an influential information fusion and low-level vision technique, image fusion integrates complementary information from source images to yield an informative fused image. A few attempts have been made in recent years to jointly realize image fusion and super-resolution. However, in real-world applications such as military reconnaissance and long-range detection missions, the target and background structures in multimodal images are easily corrupted, with low resolution and weak semantic information, which leads to suboptimal results in current fusion techniques. In response, we propose FS-Diff, a semantic guidance and clarity-aware joint image fusion and super-resolution method. FS-Diff unifies image fusion and super-resolution as a conditional generation problem. It leverages semantic guidance from the proposed clarity sensing mechanism for adaptive low-resolution perception and cross-modal feature extraction. Specifically, we initialize the desired fused result as pure Gaussian noise and introduce the bidirectional feature Mamba to extract the global features of the multimodal images. Moreover, utilizing the source images and semantics as conditions, we implement a random iterative denoising process via a modified U-Net network. This network is trained for denoising at multiple noise levels to produce high-resolution fusion results with cross-modal features and abundant semantic information. We also construct a powerful aerial view multiscene (AVMS) benchmark covering 600 pairs of images. Extensive joint image fusion and super-resolution experiments on six public and our AVMS datasets demonstrated that FS-Diff outperforms the state-of-the-art methods at multiple magnifications and can recover richer details and semantics in the fused images. The code is available at https://github.com/XylonXu01/FS-Diff.

Yushen Xu, Xiaosong Li, Yuchun Wang, Xiaoqi Cheng, Huafeng Li

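TL;DR: FlexiD-Fuse is a diffusion-based multimodal medical image fusion method that supports a flexible number of input modalities. It overcomes the fixed-input limitation of existing methods and outperforms them.

Details

Motivation: Existing medical image fusion methods handle only a fixed number of input modalities (such as two-modal or tri-modal) and cannot directly adapt to a varying number of inputs, which limits clinical use. This work aims to solve that problem.

Result: Experiments on the Harvard datasets and several tasks show that FlexiD-Fuse performs best in medical image fusion with a flexible number of inputs and outperforms other SOTA methods on the extension tasks.

Insight: By combining diffusion models with Bayesian modeling, FlexiD-Fuse shows potential for handling a variable number of inputs in multimodal image fusion and gives clinicians a more flexible solution.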

Abstract: Different modalities of medical images provide unique physiological and anatomical information for diseases. Multi-modal medical image fusion integrates useful information from different complementary medical images with different modalities, producing a fused image that comprehensively and objectively reflects lesion characteristics to assist doctors in clinical diagnosis. However, existing fusion methods can only handle a fixed number of modality inputs, such as accepting only two-modal or tri-modal inputs, and cannot directly process varying input quantities, which hinders their application in clinical settings. To tackle this issue, we introduce FlexiD-Fuse, a diffusion-based image fusion network designed to accommodate flexible quantities of input modalities. It can process two-modal and tri-modal medical image fusion end to end with the same weights. FlexiD-Fuse transforms the diffusion fusion problem, which supports only fixed-condition inputs, into a maximum likelihood estimation problem based on the diffusion process and hierarchical Bayesian modeling. By incorporating the Expectation-Maximization algorithm into the diffusion sampling iteration process, FlexiD-Fuse can generate high-quality fused images with cross-modal information from source images, independently of the number of input images. We compare against the latest two-modal and tri-modal medical image fusion methods on the Harvard datasets, using nine popular metrics. The experimental results show that our method achieves the best performance in medical image fusion with varying inputs. Meanwhile, we conducted extensive extension experiments on infrared-visible, multi-exposure, and multi-focus image fusion tasks with arbitrary numbers of inputs, and compared them with the respective SOTA methods. The results of the extension experiments consistently demonstrate the effectiveness and superiority of our method.

[47] OpenFake: An Open Dataset and Platform Toward Large-Scale Deepfake Detection cs.CV | cs.AI | cs.LG | I.4.9; I.5.4; I.2.10

Victor Livernoche, Akshatha Arodi, Andreea Musulan, Zachary Yang, Adam Salvail

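TL;DR: The paper presents OpenFake, a dataset and platform intended to advance deepfake detection through a large-scale, high-quality synthetic image dataset and a crowdsourced adversarial platform.

Details

Motivation: Deepfake technology is advancing quickly and intensifying the spread of misinformation, especially in politically sensitive contexts. Existing detection datasets are often small, rely on outdated generation methods, or lack diversity, so they offer limited support for modern detection methods.

Result: The results show that synthetic images from modern proprietary models are significantly harder to tell apart from real images, which demonstrates the need for the dataset and platform.

Insight: The paper highlights the continuing improvement in deepfake realism and the importance of community-driven adversarial approaches for countering misinformation threats over the long term.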

Abstract: Deepfakes, synthetic media created using advanced AI techniques, have intensified the spread of misinformation, particularly in politically sensitive contexts. Existing deepfake detection datasets are often limited, relying on outdated generation methods, low realism, or single-face imagery, restricting the effectiveness for general synthetic image detection. By analyzing social media posts, we identify multiple modalities through which deepfakes propagate misinformation. Furthermore, our human perception study demonstrates that recently developed proprietary models produce synthetic images increasingly indistinguishable from real ones, complicating accurate identification by the general public. Consequently, we present a comprehensive, politically-focused dataset specifically crafted for benchmarking detection against modern generative models. This dataset contains three million real images paired with descriptive captions, which are used for generating 963k corresponding high-quality synthetic images from a mix of proprietary and open-source models. Recognizing the continual evolution of generative techniques, we introduce an innovative crowdsourced adversarial platform, where participants are incentivized to generate and submit challenging synthetic images. This ongoing community-driven initiative ensures that deepfake detection methods remain robust and adaptive, proactively safeguarding public discourse from sophisticated misinformation threats.

[48] Region-Wise Correspondence Prediction between Manga Line Art Images cs.CV

Yingxuan Li, Jiafeng Mao, Qianru Qiu, Yusuke Matsui

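TL;DR: The paper proposes a Transformer-based framework for predicting region-wise correspondence between unlabeled manga line art images, supporting downstream applications such as colorization and in-between frame generation.

Details

Motivation: Region-wise correspondence in manga processing has received little study, especially when no annotations or segmentation are available, yet the task is critical for applications such as automatic colorization and in-between frame generation.

Result: Tested on multiple datasets, the method reaches a patch-level accuracy of up to 96.34% and produces consistent region-level correspondences.

Insight: For unlabeled manga line art images, patch-level learning combined with region matching yields effective region correspondence, offering a new approach for manga processing tasks.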

Abstract: Understanding region-wise correspondence between manga line art images is a fundamental task in manga processing, enabling downstream applications such as automatic line art colorization and in-between frame generation. However, this task remains largely unexplored, especially in realistic scenarios without pre-existing segmentation or annotations. In this paper, we introduce a novel and practical task: predicting region-wise correspondence between raw manga line art images without any pre-existing labels or masks. To tackle this problem, we divide each line art image into a set of patches and propose a Transformer-based framework that learns patch-level similarities within and across images. We then apply edge-aware clustering and a region matching algorithm to convert patch-level predictions into coherent region-level correspondences. To support training and evaluation, we develop an automatic annotation pipeline and manually refine a subset of the data to construct benchmark datasets. Experiments on multiple datasets demonstrate that our method achieves high patch-level accuracy (e.g., 96.34%) and generates consistent region-level correspondences, highlighting its potential for real-world manga applications.
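The step from patch-level predictions to region-level correspondences can be illustrated with a simple majority vote. This is a simplified stand-in, not the paper's edge-aware clustering or region matching algorithm. It assumes each patch has already been assigned to a region and that the model has already produced a best-matching patch in the other image. All names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def match_regions(patch_matches, region_of_a, region_of_b):
    """Map each region in image A to the image-B region that most of its patches match."""
    votes = defaultdict(Counter)
    for patch_a, patch_b in patch_matches.items():
        votes[region_of_a[patch_a]][region_of_b[patch_b]] += 1
    return {region: counts.most_common(1)[0][0] for region, counts in votes.items()}

# Toy example: four patches per image, two regions each.
region_of_a = {0: "hair", 1: "hair", 2: "hair", 3: "face"}
region_of_b = {0: "hair", 1: "hair", 2: "face", 3: "face"}
patch_matches = {0: 0, 1: 1, 2: 2, 3: 3}  # patch in A -> predicted best patch in B

correspondence = match_regions(patch_matches, region_of_a, region_of_b)
print(correspondence)
```

Voting makes the region-level result tolerant of a few wrong patch matches: patch 2 matches a "face" patch in image B, but the "hair" region is still matched correctly.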

[49] DualTrack: Sensorless 3D Ultrasound needs Local and Global Context cs.CV

Paul F. R. Wilson, Matteo Ronchetti, Rüdiger Göbl, Viktoria Markova, Sebastian Rosenzweig

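TL;DR: The paper proposes DualTrack, a dual-encoder architecture that processes the local and global features of ultrasound images separately to improve 3D reconstruction accuracy.

Details

Motivation: Traditional 3D ultrasound systems are costly and complex. Deep-learning-based sensorless 3D ultrasound needs to capture both local features (such as speckle patterns) and global features (such as anatomical structures), but existing methods do not adequately decouple the two.

Result: On a public benchmark, DualTrack achieves an average reconstruction error below 5 mm and outperforms existing methods.

Insight: Decoupling local and global feature extraction significantly improves 3D ultrasound reconstruction accuracy, indicating that the two are complementary in medical image analysis.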

Abstract: Three-dimensional ultrasound (US) offers many clinical advantages over conventional 2D imaging, yet its widespread adoption is limited by the cost and complexity of traditional 3D systems. Sensorless 3D US, which uses deep learning to estimate a 3D probe trajectory from a sequence of 2D US images, is a promising alternative. Local features, such as speckle patterns, can help predict frame-to-frame motion, while global features, such as coarse shapes and anatomical structures, can situate the scan relative to anatomy and help predict its general shape. In prior approaches, global features are either ignored or tightly coupled with local feature extraction, restricting the ability to robustly model these two complementary aspects. We propose DualTrack, a novel dual-encoder architecture that leverages decoupled local and global encoders specialized for their respective scales of feature extraction. The local encoder uses dense spatiotemporal convolutions to capture fine-grained features, while the global encoder utilizes an image backbone (e.g., a 2D CNN or foundation model) and temporal attention layers to embed high-level anatomical features and long-range dependencies. A lightweight fusion module then combines these features to estimate the trajectory. Experimental results on a large public benchmark show that DualTrack achieves state-of-the-art accuracy and globally consistent 3D reconstructions, outperforming previous methods and yielding an average reconstruction error below 5 mm.
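To make "estimating a 3D probe trajectory" concrete, the sketch below shows how per-frame relative motion estimates could be chained into absolute poses with 4x4 homogeneous transforms. It assumes each relative transform is expressed in the previous frame's coordinate system. The helper names are hypothetical and the network that would predict the relative transforms is not shown.

```python
import math

def matmul4(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def translation(dx, dy, dz):
    return [[1.0, 0.0, 0.0, dx], [0.0, 1.0, 0.0, dy], [0.0, 0.0, 1.0, dz], [0.0, 0.0, 0.0, 1.0]]

def rotation_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, 0.0], [s, c, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

def accumulate(relative_poses):
    """Chain frame-to-frame transforms into absolute poses, starting from the identity."""
    pose = translation(0.0, 0.0, 0.0)
    trajectory = [pose]
    for step in relative_poses:
        pose = matmul4(pose, step)  # the new pose is expressed in the first frame's coordinates
        trajectory.append(pose)
    return trajectory

def position(pose):
    return [pose[0][3], pose[1][3], pose[2][3]]

# Toy motion: move 1 unit along z, turn 90 degrees about z, then move 1 unit along the new x axis.
steps = [translation(0.0, 0.0, 1.0), rotation_z(math.pi / 2), translation(1.0, 0.0, 0.0)]
trajectory = accumulate(steps)
print(position(trajectory[-1]))  # approximately [0, 1, 1]
```

Because every pose is a product of all earlier estimates, small frame-to-frame errors compound along the sweep, which is one plausible reason global anatomical context helps keep reconstructions consistent.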

[50] Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders cs.CV | cs.AI

Dohun Lee, Hyeonho Jeong, Jiwook Kim, Duygu Ceylan, Jong Chul Ye

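TL;DR: The paper proposes Align4Gen, a method that improves video diffusion Transformer training through multi-feature fusion and alignment with self-supervised vision encoders. It significantly improves video generation quality.

Details

Motivation: Video diffusion models have progressed through architectural innovations and new training objectives, but improving their feature representation power has received less attention. The paper seeks to improve video generation quality by aligning with features from pre-trained vision encoders.

Result: On unconditional and class-conditional video generation, Align4Gen significantly improves the quality of generated videos, as confirmed by various metrics.

Insight: Aligning with features from pre-trained vision encoders is an effective way to improve video diffusion models, and multi-feature fusion further strengthens generation.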

Abstract: Video diffusion models have advanced rapidly in recent years as a result of a series of architectural innovations (e.g., diffusion transformers) and the use of novel training objectives (e.g., flow matching). In contrast, less attention has been paid to improving the feature representation power of such models. In this work, we show that training video diffusion models can benefit from aligning the intermediate features of the video generator with feature representations of pre-trained vision encoders. We propose a new metric and conduct an in-depth analysis of various vision encoders to evaluate their discriminability and temporal consistency, thereby assessing their suitability for video feature alignment. Based on the analysis, we present Align4Gen which provides a novel multi-feature fusion and alignment method integrated into video diffusion model training. We evaluate Align4Gen both for unconditional and class-conditional video generation tasks and show that it results in improved video generation as quantified by various metrics. Full video results are available on our project page: https://align4gen.github.io/align4gen/
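A feature alignment term of the kind described above is often written as a cosine-similarity loss between generator features and frozen encoder features. The sketch below assumes that form; the abstract does not state the actual loss, the projection head, or how multiple encoders are fused. The names and the weighting factor `lam` are hypothetical, and the features are assumed to have already been projected to a common dimension.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(generator_feats, encoder_feats):
    """1 - mean cosine similarity over tokens: 0 when aligned, 2 when opposite."""
    sims = [cosine(g, e) for g, e in zip(generator_feats, encoder_feats)]
    return 1.0 - sum(sims) / len(sims)

def total_loss(diffusion_loss, generator_feats, encoder_feats, lam=0.5):
    """Diffusion objective plus a weighted alignment term; lam is an illustrative weight."""
    return diffusion_loss + lam * alignment_loss(generator_feats, encoder_feats)

# Two tokens with 2-D features: the first is aligned, the second is orthogonal.
gen = [[1.0, 0.0], [0.0, 2.0]]
enc = [[3.0, 0.0], [1.0, 0.0]]
print(alignment_loss(gen, enc))  # 0.5
```

Cosine similarity ignores feature magnitude, so the generator is pushed to match the encoder's feature directions rather than its scale; the encoder itself would stay frozen during training.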

[51] Invisible Attributes, Visible Biases: Exploring Demographic Shortcuts in MRI-based Alzheimer’s Disease Classification cs.CV | cs.AI

Akshit Achara, Esther Puyol Anton, Alexander Hammers, Andrew P. King

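TL;DR: This paper examines shortcut learning and demographic bias in deep-learning-based Alzheimer's disease (AD) classification from MRI. It reveals race- and sex-related distributional shifts and drops in model performance, and uses feature attribution analysis to lay the foundation for fairer diagnostic tools.

Details

Motivation: Deep learning methods for MRI-assisted AD diagnosis may suffer from shortcut learning, which can lead to performance bias by race and sex and undermine fairness.

Result: Experiments demonstrate race- and sex-based shortcut learning and performance bias in AD classification, reflected in the model's reliance on features from certain brain regions.

Insight: Implicit demographic differences in MRI data can lead to model bias. Future work needs fairer diagnostic tools to avoid discrimination against underrepresented groups.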

Abstract: Magnetic resonance imaging (MRI) is the gold standard for brain imaging. Deep learning (DL) algorithms have been proposed to aid in the diagnosis of diseases such as Alzheimer’s disease (AD) from MRI scans. However, DL algorithms can suffer from shortcut learning, in which spurious features, not directly related to the output label, are used for prediction. When these features are related to protected attributes, they can lead to performance bias against underrepresented protected groups, such as those defined by race and sex. In this work, we explore the potential for shortcut learning and demographic bias in DL based AD diagnosis from MRI. We first investigate if DL algorithms can identify race or sex from 3D brain MRI scans to establish the presence or otherwise of race and sex based distributional shifts. Next, we investigate whether training set imbalance by race or sex can cause a drop in model performance, indicating shortcut learning and bias. Finally, we conduct a quantitative and qualitative analysis of feature attributions in different brain regions for both the protected attribute and AD classification tasks. Through these experiments, and using multiple datasets and DL models (ResNet and SwinTransformer), we demonstrate the existence of both race and sex based shortcut learning and bias in DL based AD classification. Our work lays the foundation for fairer DL diagnostic tools in brain MRI. The code is provided at https://github.com/acharaakshit/ShortMR

[52] PeftCD: Leveraging Vision Foundation Models with Parameter-Efficient Fine-Tuning for Remote Sensing Change Detection cs.CV

Sijun Dong, Yuxuan Hu, LiBo Wang, Geng Chen, Xiaoliang Meng

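TL;DR: PeftCD is a remote sensing change detection framework built on vision foundation models (VFMs) and parameter-efficient fine-tuning (PEFT). It integrates LoRA and Adapter modules for efficient task adaptation, uses SAM2 and DINOv3 as backbones, and achieves SOTA performance on multiple datasets.

Details

Motivation: To address pseudo changes, the scarcity of labeled samples, and the difficulty of cross-domain generalization in multi-temporal, multi-source remote sensing imagery.

Result: SOTA performance on SYSU-CD, WHUCD, and other datasets, with precise boundary delineation and strong suppression of pseudo changes.

Insight: PeftCD offers a strong balance of accuracy, efficiency, and generalization and provides a paradigm for applying large-scale VFMs to remote sensing change detection.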

Abstract: To tackle the prevalence of pseudo changes, the scarcity of labeled samples, and the difficulty of cross-domain generalization in multi-temporal and multi-source remote sensing imagery, we propose PeftCD, a change detection framework built upon Vision Foundation Models (VFMs) with Parameter-Efficient Fine-Tuning (PEFT). At its core, PeftCD employs a weight-sharing Siamese encoder derived from a VFM, into which LoRA and Adapter modules are seamlessly integrated. This design enables highly efficient task adaptation by training only a minimal set of additional parameters. To fully unlock the potential of VFMs, we investigate two leading backbones: the Segment Anything Model v2 (SAM2), renowned for its strong segmentation priors, and DINOv3, a state-of-the-art self-supervised representation learner. The framework is complemented by a deliberately lightweight decoder, ensuring the focus remains on the powerful feature representations from the backbones. Extensive experiments demonstrate that PeftCD achieves state-of-the-art performance across multiple public datasets, including SYSU-CD (IoU 73.81%), WHUCD (92.05%), MSRSCD (64.07%), MLCD (76.89%), CDD (97.01%), S2Looking (52.25%) and LEVIR-CD (85.62%), with notably precise boundary delineation and strong suppression of pseudo-changes. In summary, PeftCD presents an optimal balance of accuracy, efficiency, and generalization. It offers a powerful and scalable paradigm for adapting large-scale VFMs to real-world remote sensing change detection applications. The code and pretrained models will be released at https://github.com/dyzy41/PeftCD.
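The LoRA idea mentioned above can be shown on a single linear layer: the frozen weight W is augmented with a low-rank update scaled by alpha / r, and the same adapted layer is applied to both temporal images (weight sharing). This is a minimal sketch of the standard LoRA formulation only. The absolute-difference change feature, the toy matrices, and all names are illustrative assumptions, not the PeftCD architecture.

```python
def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(x, W, A, B, alpha):
    """y = W x + (alpha / r) * B (A x), with W frozen and only A (r x d_in), B (d_out x r) trained."""
    r = len(A)  # rank of the update
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

def siamese_change_feature(x1, x2, W, A, B, alpha):
    """Encode both dates with the same weights, then take an element-wise absolute difference."""
    f1 = lora_forward(x1, W, A, B, alpha)
    f2 = lora_forward(x2, W, A, B, alpha)
    return [abs(a - b) for a, b in zip(f1, f2)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight (identity, for readability)
A = [[1.0, 1.0]]              # rank r = 1
B_zero = [[0.0], [0.0]]       # LoRA usually starts with B = 0, so the layer is initially unchanged
B_trained = [[1.0], [0.0]]    # made-up values standing in for a trained B

print(lora_forward([1.0, 2.0], W, A, B_zero, alpha=2.0))     # [1.0, 2.0]
print(lora_forward([1.0, 2.0], W, A, B_trained, alpha=2.0))  # [7.0, 2.0]
```

For a d_out x d_in layer the update trains only r * (d_in + d_out) values instead of d_out * d_in, which is where the parameter efficiency comes from when r is small.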

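[53] Visual Grounding from Event Cameras cs.CV | cs.RO

Lingdong Kong, Dongyue Lu, Ao Liang, Rong Li, Yuhao Dong

TL;DR: The paper introduces Talk2Event, the first large-scale benchmark for language-driven object grounding on event camera data, aiming to close the gap in multimodal perception between event cameras and natural language understanding.

Details

Motivation: Event cameras offer microsecond precision and robustness to motion blur in highly dynamic scenes, but their integration with natural language understanding has received little study, which limits progress in multimodal perception.

Result: Talk2Event provides a foundation for contextual reasoning in dynamic environments and supports research on multimodal and temporally aware perception.

Insight: Combining event cameras with language understanding could advance fields such as robotics and human-AI interaction.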

Abstract: Event cameras capture changes in brightness with microsecond precision and remain reliable under motion blur and challenging illumination, offering clear advantages for modeling highly dynamic scenes. Yet, their integration with natural language understanding has received little attention, leaving a gap in multimodal perception. To address this, we introduce Talk2Event, the first large-scale benchmark for language-driven object grounding using event data. Built on real-world driving scenarios, Talk2Event comprises 5,567 scenes, 13,458 annotated objects, and more than 30,000 carefully validated referring expressions. Each expression is enriched with four structured attributes – appearance, status, relation to the viewer, and relation to surrounding objects – that explicitly capture spatial, temporal, and relational cues. This attribute-centric design supports interpretable and compositional grounding, enabling analysis that moves beyond simple object recognition to contextual reasoning in dynamic environments. We envision Talk2Event as a foundation for advancing multimodal and temporally-aware perception, with applications spanning robotics, human-AI interaction, and so on.

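[54] Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis cs.CV

Yikang Ding, Jiwen Liu, Wenyuan Zhang, Zekun Wang, Wentao Hu

TL;DR: Kling-Avatar proposes a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation, addressing the limitations of existing methods in narrative coherence and character expressiveness.

Details

Motivation: Existing audio-driven avatar video generation methods treat instruction conditioning merely as low-level tracking and do not model the communicative purpose the instructions convey, which leads to weak narrative coherence and expressiveness.

Result: Experiments show that Kling-Avatar generates vivid, fluent long-duration videos (up to 1080p at 48 fps) and performs strongly in lip synchronization, emotional expressiveness, and instruction controllability.

Insight: With a global-to-local framework and a parallel generation strategy, Kling-Avatar preserves fine-grained detail while efficiently capturing the high-level intent of the instructions, making it suitable for real-world applications.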

Abstract: Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.

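[55] Measuring Epistemic Humility in Multimodal Large Language Models cs.CV

Bingkui Tong, Jiaer Xia, Sifeng Shang, Kaiyang Zhou

TL;DR: The paper presents HumbleBench, a benchmark designed to evaluate the ability of multimodal large language models (MLLMs) to recognize and reject incorrect answers, filling the gap left by existing benchmarks that overlook “epistemic humility”.

Details

Motivation: Existing MLLM benchmarks focus on whether a model can pick the correct answer from the candidates, but overlook its ability to recognize and reject incorrect answers, which matters especially in safety-critical applications.

Result: The experiments evaluate a range of state-of-the-art MLLMs; the results show that current models are poor at identifying incorrect answers, which underlines the value of HumbleBench.

Insight: The study exposes the shortcomings of current MLLMs in epistemic humility and stresses the need to test a model's ability to reject incorrect answers in practical applications, pointing to directions for future research.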

Abstract: Hallucinations in multimodal large language models (MLLMs) – where the model generates content inconsistent with the input image – pose significant risks in real-world applications, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks primarily test recognition accuracy, i.e., evaluating whether models can select the correct answer among distractors. This overlooks an equally critical capability for trustworthy AI: recognizing when none of the provided options are correct, a behavior reflecting epistemic humility. We present HumbleBench, a new hallucination benchmark designed to evaluate MLLMs’ ability to reject plausible but incorrect answers across three hallucination types: object, relation, and attribute. Built from a panoptic scene graph dataset, we leverage fine-grained scene graph annotations to extract ground-truth entities and relations, and prompt GPT-4-Turbo to generate multiple-choice questions, followed by a rigorous manual filtering process. Each question includes a “None of the above” option, requiring models not only to recognize correct visual information but also to identify when no provided answer is valid. We evaluate a variety of state-of-the-art MLLMs – including both general-purpose and specialized reasoning models – on HumbleBench and share valuable findings and insights with the community. By incorporating explicit false-option rejection, HumbleBench fills a key gap in current evaluation suites, providing a more realistic measure of MLLM reliability in safety-critical settings. Our code and dataset are released publicly and can be accessed at https://github.com/maifoundations/HumbleBench.
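A benchmark in which every question carries a “None of the above” option can be scored separately on the questions where that option is the correct answer. The sketch below is illustrative only: the record format, the option label `"E"` for “None of the above”, and the metric names are assumptions, not HumbleBench's actual data schema or evaluation script.

```python
NOTA = "E"  # assumed label of the "None of the above" option


def score(records):
    """Score multiple-choice predictions.

    records: list of (predicted_label, gold_label) pairs.
    Returns overall accuracy plus accuracy restricted to questions
    whose gold answer is "None of the above".
    """
    total = len(records)
    correct = sum(1 for pred, gold in records if pred == gold)
    nota = [(pred, gold) for pred, gold in records if gold == NOTA]
    nota_correct = sum(1 for pred, gold in nota if pred == gold)
    return {
        "accuracy": correct / total if total else 0.0,
        "nota_accuracy": nota_correct / len(nota) if nota else 0.0,
    }
```

Reporting the two numbers side by side makes it visible when a model answers ordinary questions well but rarely rejects a full set of wrong options.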

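[56] Can Understanding and Generation Truly Benefit Together – or Just Coexist? cs.CV

Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han

TL;DR: Taking an auto-encoder view, the paper proposes UAE, a unified multimodal learning framework in which bidirectional information flow between image-to-text (I2T) and text-to-image (T2I) yields mutual gains for understanding and generation.

Details

Motivation: To explore whether understanding (I2T) and generation (T2I) can truly benefit each other rather than merely coexist, and to enforce bidirectional information flow through a unified objective (reconstruction fidelity).

Result: During RL training, the encoder produces richer captions and the decoder's ability to understand them improves, significantly raising reconstruction fidelity and producing gains in both directions.

Insight: Understanding and generation can be optimized jointly within a unified framework, and RL-driven iterative training is key; the mutual reinforcement between caption quality and generation ability was an unexpected finding.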

Abstract: In this paper, we introduce an insightful paradigm through the Auto-Encoder lens: understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. Using reconstruction fidelity as the unified training objective, we enforce a coherent bidirectional information flow between the understanding and generation processes, bringing mutual gains. To implement this, we propose UAE, a novel framework for unified multimodal learning. We begin by pre-training the decoder with large-scale long-context image captions to capture fine-grained semantic and complex spatial relationships. We then propose Unified-GRPO via reinforcement learning (RL), which covers three stages: (1) A cold-start phase to gently initialize both encoder and decoder with a semantic reconstruction loss; (2) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder’s reconstruction quality, enhancing its visual understanding; (3) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. For evaluation, we introduce Unified-Bench, the first benchmark tailored to assess the degree of unification of unified multimodal models (UMMs). A surprising “aha moment” arises within the multimodal learning domain: as RL progresses, the encoder autonomously produces more descriptive captions, while the decoder simultaneously demonstrates a profound ability to understand these intricate descriptions, resulting in reconstructions of striking fidelity.

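[57] Locality in Image Diffusion Models Emerges from Data Statistics cs.CV

Artem Lukoianov, Chenyang Yuan, Justin Solomon, Vincent Sitzmann

TL;DR: The paper investigates where locality in image diffusion models comes from and argues that it stems from the statistics of the image data itself rather than from the inductive bias of convolutional neural networks. The authors show theoretically and experimentally that an optimal parametric linear denoiser exhibits locality similar to that of deep neural denoisers, and use this insight to design an analytical denoiser that better matches deep diffusion models.

Details

Motivation: The diffusion training objective has a closed-form optimal solution (the optimal denoiser), but using it directly only reproduces training images and cannot generate new ones. Prior work attributed this gap to the shift equivariance and locality inductive biases of convolutional neural networks; this paper questions that assumption.

Result: Experiments show that the proposed analytical denoiser predicts the scores of a deep diffusion model more accurately than the prior expert-crafted method, supporting the view that locality stems from data statistics.

Insight: The paper shows that the root source of locality in image diffusion models is the data rather than the model architecture, which offers a new perspective on diffusion model behavior and motivates analytical methods that are closer to real models.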

Abstract: Among generative models, diffusion models are uniquely intriguing due to the existence of a closed-form optimal minimizer of their training objective, often referred to as the optimal denoiser. However, diffusion using this optimal denoiser merely reproduces images in the training set and hence fails to capture the behavior of deep diffusion models. Recent work has attempted to characterize this gap between the optimal denoiser and deep diffusion models, proposing analytical, training-free models that can generate images that resemble those generated by a trained UNet. The best-performing method hypothesizes that shift equivariance and locality inductive biases of convolutional neural networks are the cause of the performance gap, hence incorporating these assumptions into its analytical model. In this work, we present evidence that the locality in deep diffusion models emerges as a statistical property of the image dataset, not due to the inductive bias of convolutional neural networks. Specifically, we demonstrate that an optimal parametric linear denoiser exhibits similar locality properties to the deep neural denoisers. We further show, both theoretically and experimentally, that this locality arises directly from the pixel correlations present in natural image datasets. Finally, we use these insights to craft an analytical denoiser that better matches scores predicted by a deep diffusion model than the prior expert-crafted alternative.
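For intuition on how a linear denoiser can inherit its structure purely from data statistics, the standard MSE-optimal affine denoiser can be written in closed form. This is the textbook Wiener-filter result, stated under assumptions that do not come from the abstract: additive noise $x = x_0 + \sigma\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$, and a dataset with mean $\mu$ and covariance $\Sigma$. The paper's exact parametrization may differ.

```latex
\begin{aligned}
\hat{x}_0(x) &= \mu + W_\sigma\,(x - \mu),
\qquad
W_\sigma = \Sigma\,(\Sigma + \sigma^2 I)^{-1}, \\
\Sigma &= U \Lambda U^\top
\;\Longrightarrow\;
W_\sigma = U\,\mathrm{diag}\!\left(\frac{\lambda_i}{\lambda_i + \sigma^2}\right) U^\top .
\end{aligned}
```

Row $i$ of $W_\sigma$ holds the weights that output pixel $i$ places on each input pixel, and those weights depend only on the pixel covariance $\Sigma$ and the noise level $\sigma$. In particular, if the pixels were uncorrelated ($\Sigma$ diagonal), $W_\sigma$ would be diagonal as well.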

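[58] SpatialVID: A Large-Scale Video Dataset with Spatial Annotations cs.CV

Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao

TL;DR: SpatialVID is a large-scale video dataset with rich spatial annotations, including camera poses and depth maps, intended to address the limited scale and diversity of existing data.

Details

Motivation: The scalability and real-world fidelity of current spatial intelligence models are constrained by the scarcity of high-quality training data. Existing datasets fall short in scale, diversity, and annotation richness.

Result: SpatialVID's data statistics show a richness and diversity that can directly foster better model generalization and performance.

Insight: SpatialVID fills the gap in high-quality, large-scale video data and provides an important resource for video and 3D vision research.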

Abstract: Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect SpatialVID, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw video, and process it into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID’s data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.

cs.CL

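[59] Automated Evidence Extraction and Scoring for Corporate Climate Policy Engagement: A Multilingual RAG Approach cs.CL

Imene Kolli, Ario Saeid Vaghefi, Chiara Colesanti Senni, Shantam Raj, Markus Leippold

TL;DR: The paper proposes an AI-assisted framework based on Retrieval-Augmented Generation (RAG) that automates the extraction and classification of climate policy evidence from multilingual corporate documents, to accelerate the monitoring of corporate climate policy engagement.

Details

Motivation: Much of InfluenceMap's monitoring work still relies on manual effort, which makes it time-consuming, labor-intensive, and error-prone. The authors therefore aim to automate it with AI.

Result: The evaluation shows that the framework extracts evidence from multilingual documents efficiently, but human involvement is still needed to ensure the accuracy of the analysis.

Insight: Automated tools can speed up evidence extraction, but complex analysis still requires expert involvement; the technology should augment rather than replace expert judgment.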

Abstract: InfluenceMap’s LobbyMap Platform monitors the climate policy engagement of over 500 companies and 250 industry associations, assessing each entity’s support or opposition to science-based policy pathways for achieving the Paris Agreement’s goal of limiting global warming to 1.5°C. Although InfluenceMap has made progress with automating key elements of the analytical workflow, a significant portion of the assessment remains manual, making it time- and labor-intensive and susceptible to human error. We propose an AI-assisted framework to accelerate the monitoring of corporate climate policy engagement by leveraging Retrieval-Augmented Generation to automate the most time-intensive extraction of relevant evidence from large-scale textual data. Our evaluation shows that a combination of layout-aware parsing, the Nomic embedding model, and few-shot prompting strategies yields the best performance in extracting and classifying evidence from multilingual corporate documents. We conclude that while the automated RAG system effectively accelerates evidence extraction, the nuanced nature of the analysis necessitates a human-in-the-loop approach where the technology augments, rather than replaces, expert judgment to ensure accuracy.

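[60] BRoverbs – Measuring how much LLMs understand Portuguese proverbs cs.CL

Thales Sales Almeida, Giovana Kerche Bonás, João Guilherme Alves Santos

TL;DR: The paper introduces BRoverbs, a dataset for evaluating how well large language models (LLMs) understand Brazilian Portuguese proverbs.

Details

Motivation: Existing evaluation frameworks for Portuguese are limited, in particular because translated datasets cannot fully capture linguistic and cultural nuance; the authors aim to fill this gap with Brazilian proverbs.

Result: BRoverbs provides a new benchmark for Portuguese-language LLMs and advances regionally informed evaluation. The dataset is publicly available on Hugging Face.

Insight: As core carriers of language and culture, proverbs can effectively probe the regional language ability of LLMs, and offer a reference for similar evaluations in other languages.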

Abstract: Large Language Models (LLMs) exhibit significant performance variations depending on the linguistic and cultural context in which they are applied. This disparity signals the necessity of mature evaluation frameworks that can assess their capabilities in specific regional settings. In the case of Portuguese, existing evaluations remain limited, often relying on translated datasets that may not fully capture linguistic nuances or cultural references. Meanwhile, native Portuguese-language datasets predominantly focus on structured national exams or sentiment analysis of social media interactions, leaving gaps in evaluating broader linguistic understanding. To address this limitation, we introduce BRoverbs, a dataset specifically designed to assess LLM performance through Brazilian proverbs. Proverbs serve as a rich linguistic resource, encapsulating cultural wisdom, figurative expressions, and complex syntactic structures that challenge a model’s comprehension of regional expressions. BRoverbs aims to provide a new evaluation tool for Portuguese-language LLMs, contributing to advancing regionally informed benchmarking. The benchmark is available at https://huggingface.co/datasets/Tropic-AI/BRoverbs.

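[61] Can Vision-Language Models Solve Visual Math Equations? cs.CL | cs.AI | cs.CV

Monjoy Narayan Choudhury, Junling Wang, Yifan Hou, Mrinmaya Sachan

TL;DR: The paper studies how vision-language models (VLMs) handle visual math equations and finds that coefficient counting is the primary bottleneck, even when variable recognition is accurate, with multi-step visual reasoning posing a further challenge.

Details

Motivation: Although VLMs perform well in visual understanding and language-based reasoning, they are limited on tasks that combine perception with symbolic computation. The paper probes this limitation through visual equation solving to expose their weaknesses.

Result: VLMs do well on textual equations but poorly on visual ones, with pronounced errors in coefficient counting and multi-step reasoning. As equation complexity grows, symbolic reasoning also becomes a limiting factor.

Insight: The limitations of current VLMs in visual mathematical reasoning stem mainly from weak counting and multi-step reasoning; future work needs to improve both to achieve better combined visual-symbolic ability.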

Abstract: Despite strong performance in visual understanding and language-based reasoning, Vision-Language Models (VLMs) struggle with tasks requiring integrated perception and symbolic computation. We study this limitation through visual equation solving, where mathematical equations are embedded in images, variables are represented by object icons, and coefficients must be inferred by counting. While VLMs perform well on textual equations, they fail on visually grounded counterparts. To understand this gap, we decompose the task into coefficient counting and variable recognition, and find that counting is the primary bottleneck, even when recognition is accurate. We also observe that composing recognition and reasoning introduces additional errors, highlighting challenges in multi-step visual reasoning. Finally, as equation complexity increases, symbolic reasoning itself becomes a limiting factor. These findings reveal key weaknesses in current VLMs and point toward future improvements in visually grounded mathematical reasoning.

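[62] MR-UIE: Multi-Perspective Reasoning with Reinforcement Learning for Universal Information Extraction cs.CL

Zhongqiu Li, Shiquan Wang, Ruiyu Fang, Mengjiao Bao, Zhenhe Wu

TL;DR: The paper proposes MR-UIE, a method that combines multi-perspective reasoning with reinforcement learning to improve the generalization and accuracy of large language models on universal information extraction.

Details

Motivation: Large language models underperform on universal information extraction, especially in structured output scenarios that involve complex schema descriptions and require multi-step reasoning. Existing approaches such as in-context learning and instruction tuning still have clear limitations, so a new method is needed to improve reasoning and generalization.

Result: Experiments show that MR-UIE consistently improves accuracy on information extraction tasks across domains and surpasses state-of-the-art methods on several datasets. Introducing multi-perspective reasoning also strengthens generalization on complex tasks.

Insight: The paper highlights the key role of reasoning in complex information extraction and shows that combining multi-perspective reasoning with reinforcement learning effectively improves generalization and task performance.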

Abstract: Large language models (LLMs) demonstrate robust capabilities across diverse research domains. However, their performance in universal information extraction (UIE) remains insufficient, especially when tackling structured output scenarios that involve complex schema descriptions and require multi-step reasoning. While existing approaches enhance the performance of LLMs through in-context learning and instruction tuning, significant limitations nonetheless persist. To enhance the model’s generalization ability, we propose integrating reinforcement learning (RL) with multi-perspective reasoning for information extraction (IE) tasks. Our work transitions LLMs from passive extractors to active reasoners, enabling them to understand not only what to extract but also how to reason. Experiments conducted on multiple IE benchmarks demonstrate that MR-UIE consistently elevates extraction accuracy across domains and surpasses state-of-the-art methods on several datasets. Furthermore, incorporating multi-perspective reasoning into RL notably enhances generalization in complex IE tasks, underscoring the critical role of reasoning in challenging scenarios.

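[63] TigerCoder: A Novel Suite of LLMs for Code Generation in Bangla cs.CL

Nishat Raihan, Antonios Anastasopoulos, Marcos Zampieri

TL;DR: The paper presents TigerCoder, the first family of large language models dedicated to code generation in Bangla, and reports significant performance gains obtained through high-quality datasets and an evaluation benchmark.

Details

Motivation: Although Bangla is the fifth most spoken language in the world, it is underrepresented in LLMs for code generation, mainly because high-quality datasets are scarce.

Result: TigerCoder improves Pass@1 by roughly 11-18% over existing multilingual and general-purpose Bangla LLMs.

Insight: The study finds that curated, high-quality datasets can effectively offset the limitations of smaller models for low-resource languages.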

Abstract: Despite being the 5th most spoken language, Bangla remains underrepresented in Large Language Models (LLMs), particularly for code generation. This primarily stems from the scarcity of high-quality data to pre-train and/or finetune such models. Hence, we introduce the first dedicated family of Code LLMs for Bangla (1B & 9B). We offer three major contributions: (1) comprehensive Bangla code instruction datasets for programming domain adaptation; (2) MBPP-Bangla, an evaluation benchmark for Bangla code generation; and (3) the TigerCoder-family of Code LLMs, achieving significant ~11-18% performance gains at Pass@1 over existing multilingual and general-purpose Bangla LLMs. Our findings show that curated, high-quality datasets can overcome limitations of smaller models for low-resource languages. We open-source all resources to advance further Bangla LLM research.
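The abstract reports gains at Pass@1 without saying how the metric is computed. A minimal sketch of the commonly used unbiased pass@k estimator follows, on the assumption (not confirmed by the abstract) that the authors follow this convention: for each problem, `n` samples are generated, `c` of them pass the unit tests, and pass@k is one minus the probability that a random size-`k` subset contains no passing sample.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: number that pass, k: budget.
    Equals 1 - C(n - c, k) / C(n, k); for k = 1 this reduces to c / n.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level score would then be the mean of `pass_at_k` over all problems.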

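[64] Compass-v3: Scaling Domain-Specific LLMs for Multilingual E-Commerce in Southeast Asia cs.CL

Sophia Maria

TL;DR: Compass-v3 is a vertical-domain Mixture-of-Experts model for multilingual e-commerce in Southeast Asia. It improves performance through hardware optimizations and a customized training strategy, and outperforms existing models on multilingual and e-commerce tasks.

Details

Motivation: General-purpose LLMs fall short in specialized domains such as e-commerce, especially given multilingual and highly dynamic data. To address this, the authors developed Compass-v3, a model dedicated to Southeast Asian e-commerce.

Result: Compass-v3 surpasses DeepSeek-V3.1, the GPT-4 series, and Qwen3-235B on e-commerce tasks, shows strong multilingual capability, and accounts for over 70% of total LLM usage on Shopee's platform.

Insight: 1. Domain-specific vertical models can gain significantly from hardware optimization and mixed training; 2. OTPO noticeably improves instruction alignment; 3. multilingual capability can remain competitive in low-resource languages.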

Abstract: Large language models (LLMs) excel in general-domain applications, yet their performance often degrades in specialized tasks requiring domain-specific knowledge. E-commerce is particularly challenging, as its data are noisy, heterogeneous, multilingual, and highly dynamic. We present Compass-v3, a vertical-domain Mixture-of-Experts (MoE) model with 245B total parameters and 71B active per token, designed for Southeast Asian e-commerce. Compass-v3 adopts fewer but larger experts, combined with hardware-efficient optimizations, such as intra-node expert parallelism and a customized memcpy operator, to maximize GPU utilization. The model is trained on 12T tokens of curated multilingual corpora and large-scale synthetic e-commerce instructions using a mixed-training strategy. To enhance alignment, we propose Optimal-Transport Direct Preference Optimization (OTPO), which captures token-level distinctions and improves instruction adherence in commerce-specific scenarios. Extensive evaluations demonstrate that Compass-v3 delivers state-of-the-art e-commerce performance, surpassing DeepSeek-V3.1, GPT-4 series, and Qwen3-235B. Moreover, Compass-v3 demonstrates strong multilingual capability across low-resource Southeast Asian languages (Indonesian, Thai, Filipino, Vietnamese, Malay, Tagalog) and Portuguese while sustaining competitive performance on general benchmarks. It has already been widely applied in Shopee’s industrial-scale e-commerce platform and is gradually replacing OpenAI’s traffic, now accounting for over 70% of total LLM usage, highlighting its dual strengths in specialized commerce expertise and broad linguistic competence.

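[65] Target-oriented Multimodal Sentiment Classification with Counterfactual-enhanced Debiasing cs.CL | cs.AI

Zhiyue Liu, Fanrong Ma, Xin Ling

TL;DR: The paper proposes a counterfactual-enhanced debiasing framework for target-oriented multimodal sentiment classification, which improves classification accuracy by reducing spurious correlations between text features and labels.

Details

Motivation: Existing target-oriented multimodal sentiment classification methods over-rely on textual content and ignore dataset biases (in particular word-level contextual biases), which creates spurious correlations between text features and output labels and hurts classification.

Result: Experiments on several benchmark datasets show that the method outperforms state-of-the-art baselines.

Insight: 1) Counterfactual data augmentation effectively reduces spurious correlations; 2) the adaptive debiasing contrastive learning mechanism performs well on multimodal tasks.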

Abstract: Target-oriented multimodal sentiment classification seeks to predict sentiment polarity for specific targets from image-text pairs. While existing works achieve competitive performance, they often over-rely on textual content and fail to consider dataset biases, in particular word-level contextual biases. This leads to spurious correlations between text features and output labels, impairing classification accuracy. In this paper, we introduce a novel counterfactual-enhanced debiasing framework to reduce such spurious correlations. Our framework incorporates a counterfactual data augmentation strategy that minimally alters sentiment-related causal features, generating detail-matched image-text samples to guide the model’s attention toward content tied to sentiment. Furthermore, for learning robust features from counterfactual data and prompting model decisions, we introduce an adaptive debiasing contrastive learning mechanism, which effectively mitigates the influence of biased words. Experimental results on several benchmark datasets show that our proposed method outperforms state-of-the-art baselines.

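[66] EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs cs.CL | cs.AI | cs.SD

Yuhao Zhang, Yuhao Du, Zhanchen Dai, Xiangnan Ma, Kaiqi Kou

TL;DR: EchoX uses echo training to bridge the acoustic-semantic gap in speech-to-speech large language models (SLLMs), combining acoustic and semantic learning to improve reasoning ability.

Details

Motivation: Current SLLMs show degraded knowledge and reasoning capabilities, which the authors attribute to training paradigms that fail to bridge the acoustic and semantic feature spaces.

Result: With about 6,000 hours of training data, EchoX performs strongly on multiple knowledge-based question-answering benchmarks.

Insight: Integrating acoustic and semantic learning is crucial to the reasoning ability of speech LLMs, and dynamically generating training targets is an effective way to achieve it.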

Abstract: Speech-to-speech large language models (SLLMs) are attracting increasing attention. Derived from text-based large language models (LLMs), SLLMs often exhibit degradation in knowledge and reasoning capabilities. We hypothesize that this limitation arises because current training paradigms for SLLMs fail to bridge the acoustic-semantic gap in the feature representation space. To address this issue, we propose EchoX, which leverages semantic representations and dynamically generates speech training targets. This approach integrates both acoustic and semantic learning, enabling EchoX to preserve strong reasoning abilities as a speech LLM. Experimental results demonstrate that EchoX, with about six thousand hours of training data, achieves advanced performance on multiple knowledge-based question-answering benchmarks. The project is available at https://github.com/FreedomIntelligence/EchoX.

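[67] Efficient Trie-based Biasing using K-step Prediction for Rare Word Recognition cs.CL | cs.AI

Chin Yuen Kwok, Jia Qi Yip

TL;DR: The paper proposes an efficient Trie-based biasing method that uses K-step prediction to improve rare word recognition, avoiding the computational overhead of conventional methods and significantly reducing word error rate.

Details

Motivation: Current Trie-based biasing methods give “bonus scores” to partial hypotheses during decoding; if the rare word is not ultimately recognized, the system must revoke those bonuses, which is computationally expensive for large decoders.

Result: After fine-tuning with only 10 hours of synthetic data, the word error rate on the NSC Part 2 test set drops from 30.86% to 12.19%.

Insight: Multi-step prediction can substantially reduce the computational burden of conventional methods while improving rare word recognition.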

Abstract: Contextual biasing improves rare word recognition of ASR models by prioritizing the output of rare words during decoding. A common approach is Trie-based biasing, which gives “bonus scores” to partial hypotheses (e.g., “Bon”) that may lead to the generation of the rare word (e.g., “Bonham”). If the full word (“Bonham”) isn’t ultimately recognized, the system revokes those earlier bonuses. This revocation is limited to beam search and is computationally expensive, particularly for models with large decoders. To overcome these limitations, we propose adapting ASR models to look ahead and predict multiple steps at once. This avoids the revocation step entirely by better estimating whether a partial hypothesis will lead to the generation of the full rare word. By fine-tuning Whisper with only 10 hours of synthetic data, our method reduces the word error rate on the NSC Part 2 test set from 30.86% to 12.19%.
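The conventional bonus-and-revoke scheme that the abstract describes (the baseline, not the proposed K-step method) can be sketched as follows. This is a simplified illustration under stated assumptions: tokens are single characters rather than subword units, every matched step earns the same `bonus`, and the function names are hypothetical. A real decoder would apply this per hypothesis inside beam search.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False


def build_trie(words):
    """Build a trie over the rare-word list, one character per edge."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def biased_score(tokens, root, bonus=1.0):
    """Return the net biasing bonus accumulated along one hypothesis."""
    node, pending, score = root, 0.0, 0.0
    for tok in tokens:
        if tok not in node.children:
            # The hypothesis left the trie mid-word: revoke unearned bonuses.
            score -= pending
            pending = 0.0
            node = root
        if tok in node.children:
            node = node.children[tok]
            score += bonus
            pending += bonus
            if node.is_word:
                # Full rare word matched: the bonuses are now earned.
                pending = 0.0
                if not node.children:
                    node = root
    return score
```

With the trie built from `["Bonham"]`, the hypothesis `"Bonham"` keeps all six bonuses, `"Bonus"` earns three and then loses them at `"u"`, and the still-open prefix `"Bon"` holds three provisional bonuses.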

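[68] Improving Synthetic Data Training for Contextual Biasing Models with a Keyword-Aware Cost Function cs.CL | cs.AI

Chin Yuen Kwok, Jia Qi Yip, Eng Siong Chng

TL;DR: The paper proposes a keyword-aware loss function that combines a masked cross-entropy term with a binary classification term, improving the training of TCPGen-based contextual biasing models on synthetic data and significantly reducing word error rate.

Details

Motivation: Contextual biasing models trained on synthetic data tend to overfit to artifacts in the data, which hurts rare word recognition. To mitigate this, the authors propose a new loss function design.

Result: Fine-tuning Whisper on 10 hours of synthetic data reduces the word error rate from 29.71% to 11.81% on the NSC Part 2 test set.

Insight: Jointly optimizing biased word prediction and biased word position detection effectively mitigates overfitting in synthetic data training and improves rare word recognition.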

Abstract: Rare word recognition can be improved by adapting ASR models to synthetic data that includes these words. Further improvements can be achieved through contextual biasing, which trains and adds a biasing module into the model architecture to prioritize rare words. While training the module on synthetic rare word data is more effective than using non-rare-word data, it can lead to overfitting due to artifacts in the synthetic audio. To address this, we enhance the TCPGen-based contextual biasing approach and propose a keyword-aware loss function that additionally focuses on biased words when training biasing modules. This loss includes a masked cross-entropy term for biased word prediction and a binary classification term for detecting biased word positions. These two terms complementarily support the decoding of biased words during inference. By adapting Whisper to 10 hours of synthetic data, our method reduced the word error rate on the NSC Part 2 test set from 29.71% to 11.81%.
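The two terms named in the abstract, a masked cross-entropy over biased-word positions and a binary classification term for detecting those positions, can be sketched as below. The input layout, the equal default weights, and the function names are assumptions for illustration; the abstract does not give the exact formulation or how the terms are combined with the standard ASR loss.

```python
import math

EPS = 1e-12  # numerical guard against log(0)


def masked_cross_entropy(probs, targets, mask):
    """Mean negative log-likelihood over positions where mask == 1.

    probs[i] is the predicted distribution over the vocabulary at step i,
    targets[i] the gold token id, mask[i] 1 if the token is a biased word.
    """
    losses = [-math.log(p[t] + EPS) for p, t, m in zip(probs, targets, mask) if m]
    return sum(losses) / len(losses) if losses else 0.0


def binary_cross_entropy(det_probs, mask):
    """Mean BCE between detector outputs and the biased-position mask."""
    if not det_probs:
        return 0.0
    terms = [
        -(m * math.log(q + EPS) + (1 - m) * math.log(1.0 - q + EPS))
        for q, m in zip(det_probs, mask)
    ]
    return sum(terms) / len(terms)


def keyword_aware_loss(probs, targets, mask, det_probs, ce_weight=1.0, bce_weight=1.0):
    """Weighted sum of the two keyword-focused terms (weights are assumed)."""
    return ce_weight * masked_cross_entropy(probs, targets, mask) + bce_weight * binary_cross_entropy(
        det_probs, mask
    )
```

In training, this quantity would presumably be added to the usual sequence loss, since the abstract says the loss "additionally" focuses on biased words.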

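[69] From scratch to silver: Creating trustworthy training data for patent-SDG classification using Large Language Models cs.CL

Grazia Sveva Ascione, Nicolò Tamagnone

TL;DR: The paper proposes a method based on weak supervision and large language models (LLMs) for classifying patents by UN Sustainable Development Goals (SDGs). By building a silver-standard multi-label dataset, it addresses the lack of scalability and generalizability in existing methods.

Details

Motivation: Classifying patents by SDG is crucial for tracking how innovation addresses global challenges, but without a large labeled dataset, existing methods (keyword searches, transfer learning, and citation-based heuristics) fall short. A scalable and generalizable solution is therefore needed.

Result: 1. In internal validation, the method outperforms several baselines, including transformer-based models and a zero-shot LLM; 2. in external validation, the proposed labels show greater thematic, cognitive, and organizational coherence in patent citation, co-inventor, and co-applicant graphs.

Insight: Weak supervision and semantic alignment can markedly improve the scalability and effectiveness of SDG classification when large labeled datasets are unavailable.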

Abstract: Classifying patents by their relevance to the UN Sustainable Development Goals (SDGs) is crucial for tracking how innovation addresses global challenges. However, the absence of a large, labeled dataset limits the use of supervised learning. Existing methods, such as keyword searches, transfer learning, and citation-based heuristics, lack scalability and generalizability. This paper frames patent-to-SDG classification as a weak supervision problem, using citations from patents to SDG-tagged scientific publications (NPL citations) as a noisy initial signal. To address its sparsity and noise, we develop a composite labeling function (LF) that uses large language models (LLMs) to extract structured concepts, namely functions, solutions, and applications, from patents and SDG papers based on a patent ontology. Cross-domain similarity scores are computed and combined using a rank-based retrieval approach. The LF is calibrated via a custom positive-only loss that aligns with known NPL-SDG links without penalizing discovery of new SDG associations. The result is a silver-standard, soft multi-label dataset mapping patents to SDGs, enabling the training of effective multi-label regression models. We validate our approach through two complementary strategies: (1) internal validation against held-out NPL-based labels, where our method outperforms several baselines including transformer-based models and a zero-shot LLM; and (2) external validation using network modularity in patent citation, co-inventor, and co-applicant graphs, where our labels reveal greater thematic, cognitive, and organizational coherence than traditional technological classifications. These results show that weak supervision and semantic alignment can enhance SDG classification at scale.

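[70] MetaRAG: Metamorphic Testing for Hallucination Detection in RAG Systems cs.CL

Channdeth Sok, David Luz, Yacine Haddam

TL;DR: MetaRAG is a metamorphic testing framework for detecting hallucinations in RAG systems; it runs in real time without supervision and is suitable for high-stakes and proprietary domains.

Details

Motivation: LLMs are widely deployed in enterprise applications, but hallucinations (confident but factually incorrect information) limit their reliability. Existing methods such as SelfCheckGPT and MetaQA target standalone LLMs and do not address the distinct challenge of RAG systems, where responses must be consistent with the retrieved evidence.

Result: MetaRAG's effectiveness is demonstrated on a proprietary enterprise dataset, where it detects hallucinations and supports trustworthy deployment of RAG conversational agents.

Insight: By localizing unsupported claims and providing safeguards for identity-sensitive queries, MetaRAG demonstrates practical value for identity-aware AI.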

Abstract: Large Language Models (LLMs) are increasingly deployed in enterprise applications, yet their reliability remains limited by hallucinations, i.e., confident but factually incorrect information. Existing detection approaches, such as SelfCheckGPT and MetaQA, primarily target standalone LLMs and do not address the unique challenges of Retrieval-Augmented Generation (RAG) systems, where responses must be consistent with retrieved evidence. We therefore present MetaRAG, a metamorphic testing framework for hallucination detection in Retrieval-Augmented Generation (RAG) systems. MetaRAG operates in a real-time, unsupervised, black-box setting, requiring neither ground-truth references nor access to model internals, making it suitable for proprietary and high-stakes domains. The framework proceeds in four stages: (1) decompose answers into atomic factoids, (2) generate controlled mutations of each factoid using synonym and antonym substitutions, (3) verify each variant against the retrieved context (synonyms are expected to be entailed and antonyms contradicted), and (4) aggregate penalties for inconsistencies into a response-level hallucination score. Crucially for identity-aware AI, MetaRAG localizes unsupported claims at the factoid span where they occur (e.g., pregnancy-specific precautions, LGBTQ+ refugee rights, or labor eligibility), allowing users to see flagged spans and enabling system designers to configure thresholds and guardrails for identity-sensitive queries. Experiments on a proprietary enterprise dataset illustrate the effectiveness of MetaRAG for detecting hallucinations and enabling trustworthy deployment of RAG-based conversational agents. We also outline a topic-based deployment design that translates MetaRAG’s span-level scores into identity-aware safeguards; this design is discussed but not evaluated in our experiments.
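Stages (3) and (4) of the pipeline above, checking each mutated factoid against its expected verdict and aggregating the penalties, can be sketched as follows. The verdict labels, the penalty values (including the 0.5 for a neutral verdict), and the max-over-factoids aggregation are all assumptions for illustration; the abstract does not specify them, and the verdicts themselves would come from an entailment checker that is outside this sketch.

```python
# Expected verdict for each mutation type, as stated in the abstract:
# synonym variants should be entailed, antonym variants contradicted.
EXPECTED = {"synonym": "entailed", "antonym": "contradicted"}


def variant_penalty(kind, verdict):
    """Penalty for one mutated factoid given the checker's verdict."""
    if verdict == EXPECTED[kind]:
        return 0.0
    if verdict == "neutral":
        return 0.5  # assumed partial penalty for an inconclusive check
    return 1.0  # the opposite of what was expected


def factoid_score(checks):
    """Mean penalty over a factoid's (kind, verdict) checks."""
    if not checks:
        return 0.0
    return sum(variant_penalty(kind, verdict) for kind, verdict in checks) / len(checks)


def response_score(factoids):
    """Aggregate per-factoid scores into a response-level score.

    factoids: dict mapping factoid text to a list of (kind, verdict) pairs.
    Returns (worst factoid score, per-factoid scores), so that the
    highest-scoring spans can be flagged for the user.
    """
    scores = {text: factoid_score(checks) for text, checks in factoids.items()}
    return max(scores.values(), default=0.0), scores
```

Keeping the per-factoid scores alongside the response-level score mirrors the span-level localization the abstract emphasizes.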

[71] Modelling Analogies and Analogical Reasoning: Connecting Cognitive Science Theory and NLP Research cs.CLPDF

Molly R Petersen, Claire E Stevenson, Lonneke van der Plas

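TL;DR: The paper examines how analogical reasoning connects cognitive science and NLP research, and discusses how a cognitive perspective can improve relational understanding in NLP.

Details

Motivation: Analogical reasoning is central to human cognition, yet NLP research rarely examines its underlying processes from a cognitive science perspective. The paper aims to fill this gap.

Result: The paper shows that cognitive science theories of analogical reasoning can guide NLP research beyond entity similarity and toward relational understanding.

Insight: NLP can improve its modeling of relations in text by adopting a cognitive science perspective rather than relying solely on entity-level similarity.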

Abstract: Analogical reasoning is an essential aspect of human cognition. In this paper, we summarize key theory about the processes underlying analogical reasoning from the cognitive science literature and relate it to current research in natural language processing. While these processes can be easily linked to concepts in NLP, they are generally not viewed through a cognitive lens. Furthermore, we show how these notions are relevant for several major challenges in NLP research, not directly related to analogy solving. This may guide researchers to better optimize relational understanding in text, as opposed to relying heavily on entity-level similarity.

[72] Hierarchical Bracketing Encodings Work for Dependency Graphs cs.CLPDF

Ana Ezquerro, Carlos Gómez-Rodríguez, David Vilares

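TL;DR: The paper presents a dependency graph parsing approach based on hierarchical bracketing encodings that reduces the label space while preserving structural information, and achieves competitive results on a multilingual benchmark.

Details

Motivation: Dependency graph parsing plays an important role in NLP, but traditional methods struggle to handle linearized graph representations efficiently, especially in the presence of reentrancies, cycles, and empty nodes. The paper addresses these issues with hierarchical bracketing encodings.

Result: On a multilingual, multi-formalism benchmark, the method improves on other existing methods in exact match accuracy, supporting its effectiveness.

Insight: Hierarchical bracketing encodings offer an efficient, compact representation for dependency graph parsing that simplifies parsing without losing structural information and suits complex graph structures.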

Abstract: We revisit hierarchical bracketing encodings from a practical perspective in the context of dependency graph parsing. The approach encodes graphs as sequences, enabling linear-time parsing with $n$ tagging actions, and still representing reentrancies, cycles, and empty nodes. Compared to existing graph linearizations, this representation substantially reduces the label space while preserving structural information. We evaluate it on a multilingual and multi-formalism benchmark, showing competitive results and consistent improvements over other methods in exact match accuracy.

[73] Mitigating Language Barriers in Education: Developing Multilingual Digital Learning Materials with Machine Translation cs.CLPDF

Lucie Poláková, Martin Popel, Věra Kloudová, Michal Novák, Mariia Anisimova

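TL;DR: The EdUKate project uses machine translation and related technologies to develop multilingual learning materials that address language barriers in Czech primary and secondary education, with a focus on a Czech-Ukrainian machine translation system.

Details

Motivation: To address the language barriers faced by non-Czech-speaking students in the Czech education system and to improve educational inclusion through multilingual learning materials.

Result: The project's outputs are freely available to students, educators, and researchers, including up to 9,000 translated exercises and the machine translation system deployed on the web portal.

Insight: Machine translation for education needs to be tailored to specific language pairs and content formats to improve translation quality and usefulness.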

Abstract: The EdUKate project combines digital education, linguistics, translation studies, and machine translation to develop multilingual learning materials for Czech primary and secondary schools. Launched through collaboration between a major Czech academic institution and the country’s largest educational publisher, the project is aimed at translating up to 9,000 multimodal interactive exercises from Czech into Ukrainian, English, and German for an educational web portal. It emphasizes the development and evaluation of a direct Czech-Ukrainian machine translation system tailored to the educational domain, with special attention to processing formatted content such as XML and PDF and handling technical and scientific terminology. We present findings from an initial survey of Czech teachers regarding the needs of non-Czech-speaking students and describe the system’s evaluation and implementation on the web portal. All resulting applications are freely available to students, educators, and researchers.

[74] All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens cs.CL | I.2.7PDF

Siddarth Mamidanna, Daking Rai, Ziyu Yao, Yilun Zhou

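TL;DR: By inhibiting input-specific token computation in the initial layers, restricting the routes of information transfer, and forcing all computation onto the last token, the paper shows that on mental math tasks LLMs compute mainly at the last token, using information transferred through a few specific middle layers.

Details

Motivation: To investigate the internal mechanisms of LLMs on mental math tasks and clarify how they complete computation through information transfer between tokens.

Result: Experiments show that the AF1 subgraph is sufficient and necessary for high performance across a variety of models and arithmetic expressions, transfers across models, and is robust to input style.

Insight: On mental math tasks, LLM computation concentrates at the last token, and information transfer in specific middle layers is critical, which offers a new perspective on the internal mechanisms of LLMs.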

Abstract: Large language models (LLMs) demonstrate proficiency across numerous computational tasks, yet their inner workings remain unclear. In theory, the combination of causal self-attention and multilayer perceptron layers allows every token to access and compute information based on all preceding tokens. In practice, to what extent are such operations present? In this paper, on mental math tasks (i.e., direct math calculation via next-token prediction without explicit reasoning), we investigate this question in three steps: inhibiting input-specific token computations in the initial layers, restricting the routes of information transfer across token positions in the next few layers, and forcing all computation to happen at the last token in the remaining layers. With two proposed techniques, Context-Aware Mean Ablation (CAMA) and Attention-Based Peeking (ABP), we identify an All-for-One subgraph (AF1) with high accuracy on a wide variety of mental math tasks, where meaningful computation occurs very late (in terms of layer depth) and only at the last token, which receives information of other tokens in few specific middle layers. Experiments on a variety of models and arithmetic expressions show that this subgraph is sufficient and necessary for high model performance, transfers across different models, and works on a variety of input styles. Ablations on different CAMA and ABP alternatives reveal their unique advantages over other methods, which may be of independent interest.

[75] CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models cs.CL | cs.AI | cs.LGPDF

Runpeng Dai, Linfeng Song, Haolin Liu, Zhenwen Liang, Dian Yu

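TL;DR: The paper proposes Curiosity-Driven Exploration (CDE), a framework that addresses inefficient exploration in reinforcement learning for large language models (LLMs). CDE uses generation perplexity and the variance of value estimates as intrinsic rewards, improving response diversity and robustness and yielding measurable gains in experiments.

Details

Motivation: Current reinforcement learning approaches for LLMs, such as RLVR, explore poorly, which leads to premature convergence and entropy collapse and limits gains in reasoning ability. The authors propose a curiosity-driven exploration mechanism to address this.

Result: Experiments show that CDE improves by about 3 points over standard RLVR with GRPO/PPO on AIME benchmarks. The study also identifies a calibration collapse mechanism in RLVR that helps explain common LLM failure modes.

Insight: 1. Curiosity signals (such as perplexity and value variance) can markedly improve exploration efficiency and model diversity.
2. Calibration collapse in RLVR may be one of the key bottlenecks to improving model performance.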

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for enhancing the reasoning ability of Large Language Models (LLMs). Yet current RLVR methods often explore poorly, leading to premature convergence and entropy collapse. To address this challenge, we introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model’s own intrinsic sense of curiosity to guide exploration. We formalize curiosity with signals from both the actor and the critic: for the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Both signals serve as an exploration bonus within the RLVR framework to guide the model. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses; moreover, we connect the critic-wise bonus to the well-established count-based exploration bonus in RL. Empirically, our method achieves an approximate +3 point improvement over standard RLVR using GRPO/PPO on AIME benchmarks. Further analysis identifies a calibration collapse mechanism within RLVR, shedding light on common LLM failure modes.
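The two curiosity signals named in the abstract can be computed from quantities an actor-critic setup already has. The sketch below is a minimal illustration under stated assumptions: the function names, the additive combination, and the coefficients `alpha` and `beta` are hypothetical, and the paper's actual bonus may be clipped, scaled, or combined differently.

```python
import math
from statistics import pvariance
from typing import List


def perplexity(token_logprobs: List[float]) -> float:
    """Actor signal: exp of the negative mean log-probability of the sampled tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def value_variance(head_values: List[float]) -> float:
    """Critic signal: variance of the value estimates across critic heads."""
    return pvariance(head_values)


def shaped_reward(task_reward: float,
                  token_logprobs: List[float],
                  head_values: List[float],
                  alpha: float = 0.1,
                  beta: float = 0.1) -> float:
    # Assumption: bonuses are simply added to the verifiable reward.
    # alpha and beta are hypothetical weights, not values from the paper.
    return (task_reward
            + alpha * perplexity(token_logprobs)
            + beta * value_variance(head_values))
```

A response the actor finds surprising (high perplexity) or on which the critic heads disagree (high variance) receives a larger bonus, which is the intuition behind using both as exploration signals.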

cs.AI [Back]

Amna Hassan

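TL;DR: The paper proposes a framework that uses natural language processing and multimodal large language models to automatically convert Game Design Documents (GDDs) into functional Unity game templates.

Details

Motivation: Moving from a game design document to actual development is often complex and time-consuming, and the lack of automation tools lowers development efficiency. The framework aims to fill this gap by using AI to ease the transition from design to implementation.

Result: Evaluation shows the method outperforms baseline models on compilation success, GDD adherence, best-practice adoption, and code modularity, and that it applies to multiple game genres.

Insight: Multimodal LLMs have the potential to markedly improve the efficiency of going from design to implementation and to become a valuable tool in the game development workflow.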

Abstract: This paper presents a novel framework for automated game template generation by transforming Game Design Documents (GDDs) into functional Unity game prototypes using Natural Language Processing (NLP) and multi-modal Large Language Models (LLMs). We introduce an end-to-end system that parses GDDs, extracts structured game specifications, and synthesizes Unity-compatible C# code that implements the core mechanics, systems, and architecture defined in the design documentation. Our approach combines a fine-tuned LLaMA-3 model specialized for Unity code generation with a custom Unity integration package that streamlines the implementation process. Evaluation results demonstrate significant improvements over baseline models, with our fine-tuned model achieving superior performance (4.8/5.0 average score) compared to state-of-the-art LLMs across compilation success, GDD adherence, best practices adoption, and code modularity metrics. The generated templates demonstrate high adherence to GDD specifications across multiple game genres. Our system effectively addresses critical gaps in AI-assisted game development, positioning LLMs as valuable tools in streamlining the transition from game design to implementation.

[77] Tree-OPO: Off-policy Monte Carlo Tree-Guided Advantage Optimization for Multistep Reasoning cs.AI | cs.CL | cs.LGPDF

Bingning Huang, Tu Nguyen, Matthieu Zimmer

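TL;DR: The paper proposes Tree-OPO, a method combining MCTS and RL that uses MCTS-generated trajectories to improve policy optimization in preference-consistent reinforcement learning, and offers heuristic and statistical remedies for the issues that arise.

Details

Motivation: In multistep reasoning with large language models (LLMs), MCTS has proven effective at generating high-quality intermediate trajectories. How to use these trajectories for policy optimization, especially in preference-consistent learning, remains an open question.

Result: Initial results indicate that tree-structured advantage estimation can stabilize updates and better reflect compositional reasoning quality, but advantage saturation and reward signal collapse still need to be addressed.

Insight: MCTS-generated trajectories can enrich RL policy optimization, but learning under tree-structured rewards remains challenging and needs further theoretical study.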

Abstract: Recent advances in reasoning with large language models (LLMs) have shown the effectiveness of Monte Carlo Tree Search (MCTS) for generating high-quality intermediate trajectories, particularly in math and symbolic domains. Inspired by this, we explore how MCTS-derived trajectories, traditionally used for training value or reward models, can be repurposed to improve policy optimization in preference-based reinforcement learning (RL). Specifically, we focus on Group Relative Policy Optimization (GRPO), a recent algorithm that enables preference-consistent policy learning without value networks. We propose a staged GRPO training paradigm where completions are derived from partially revealed MCTS rollouts, introducing a novel tree-structured setting for advantage estimation. This leads to a rich class of prefix-conditioned reward signals, which we analyze theoretically and empirically. Our initial results indicate that while structured advantage estimation can stabilize updates and better reflect compositional reasoning quality, challenges such as advantage saturation and reward signal collapse remain. We propose heuristic and statistical solutions to mitigate these issues and discuss open challenges for learning under staged or tree-like reward structures.
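For context, the group-relative advantage that GRPO uses in place of a value network can be sketched as follows. This shows only the standard per-group normalization, not the paper's staged, prefix-conditioned variant; the function name is an assumption, and the use of the population standard deviation is one choice among several seen in implementations.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each completion's reward against the other completions
    sampled for the same prompt (or, in a tree setting, the same prefix)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Note what happens when every completion in a group earns the same reward: all advantages become zero and the group contributes no gradient. That degenerate case is one plausible reading of the "reward signal collapse" challenge the abstract mentions, for example when a long revealed prefix makes every continuation succeed.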

[78] Mind Meets Space: Rethinking Agentic Spatial Intelligence from a Neuroscience-inspired Perspective cs.AI | cs.CVPDF

Bui Duc Manh, Soumyaratna Debnath, Zetong Zhang, Shriram Damodaran, Arvind Kumar

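TL;DR: The paper proposes a neuroscience-inspired computational framework intended to improve the spatial reasoning of agentic AI systems and narrow the gap between current AI and human spatial intelligence.

Details

Motivation: Agentic AI systems currently have limited spatial reasoning ability, whereas human spatial intelligence builds on multisensory perception and cognitive maps and supports flexible decision-making in unstructured environments. This motivates rethinking AI spatial intelligence from a neuroscience perspective.

Result: 1. Analyzes the limitations of existing methods. 2. Outlines future research directions, particularly the potential to generalize spatial reasoning to dynamic and unstructured environments.

Insight: A neuroscience perspective offers a structured pathway for AI spatial reasoning, with broad application prospects in robotics and virtual systems.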

Abstract: Recent advances in agentic AI have led to systems capable of autonomous task execution and language-based reasoning, yet their spatial reasoning abilities remain limited and underexplored, largely constrained to symbolic and sequential processing. In contrast, human spatial intelligence, rooted in integrated multisensory perception, spatial memory, and cognitive maps, enables flexible, context-aware decision-making in unstructured environments. Therefore, bridging this gap is critical for advancing Agentic Spatial Intelligence toward better interaction with the physical 3D world. To this end, we first start from scrutinizing the spatial neural models as studied in computational neuroscience, and accordingly introduce a novel computational framework grounded in neuroscience principles. This framework maps core biological functions to six essential computation modules: bio-inspired multimodal sensing, multi-sensory integration, egocentric-allocentric conversion, an artificial cognitive map, spatial memory, and spatial reasoning. Together, these modules form a perspective landscape for agentic spatial reasoning capability across both virtual and physical environments. On top, we conduct a framework-guided analysis of recent methods, evaluating their relevance to each module and identifying critical gaps that hinder the development of more neuroscience-grounded spatial reasoning modules. We further examine emerging benchmarks and datasets and explore potential application domains ranging from virtual to embodied systems, such as robotics. Finally, we outline potential research directions, emphasizing the promising roadmap that can generalize spatial reasoning across dynamic or unstructured environments. We hope this work will benefit the research community with a neuroscience-grounded perspective and a structured pathway. Our project page can be found at Github.

cs.GR [Back]

[79] CameraVDP: Perceptual Display Assessment with Uncertainty Estimation via Camera and Visual Difference Prediction cs.GR | cs.CVPDF

Yancheng Cai, Robert Wanat, Rafal Mantiuk

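TL;DR: CameraVDP combines a camera-based reconstruction pipeline with a visual difference predictor for perceptual display assessment with uncertainty estimation, addressing the inability of traditional display measurement to capture spatially varying and high-frequency distortions.

Details

Motivation: Traditional display measurement methods cannot capture high-frequency and pixel-level distortions, while cameras, despite having sufficient spatial resolution, introduce optical, sampling, and photometric distortions. In addition, a model of the visual system is needed to assess whether distortions are visible.

Result: CameraVDP is validated on defective pixel detection, color fringing awareness, and display non-uniformity evaluation. Its uncertainty analysis framework gives a theoretical upper bound on defective pixel detection performance and confidence intervals for VDP quality scores.

Insight: Beyond addressing the limitations of traditional methods, CameraVDP brings in a visual perception model, so display assessment is closer to what the human visual system actually perceives.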

Abstract: Accurate measurement of images produced by electronic displays is critical for the evaluation of both traditional and computational displays. Traditional display measurement methods based on sparse radiometric sampling and fitting a model are inadequate for capturing spatially varying display artifacts, as they fail to capture high-frequency and pixel-level distortions. While cameras offer sufficient spatial resolution, they introduce optical, sampling, and photometric distortions. Furthermore, the physical measurement must be combined with a model of a visual system to assess whether the distortions are going to be visible. To enable perceptual assessment of displays, we propose a combination of a camera-based reconstruction pipeline with a visual difference predictor, which account for both the inaccuracy of camera measurements and visual difference prediction. The reconstruction pipeline combines HDR image stacking, MTF inversion, vignetting correction, geometric undistortion, homography transformation, and color correction, enabling cameras to function as precise display measurement instruments. By incorporating a Visual Difference Predictor (VDP), our system models the visibility of various stimuli under different viewing conditions for the human visual system. We validate the proposed CameraVDP framework through three applications: defective pixel detection, color fringing awareness, and display non-uniformity evaluation. Our uncertainty analysis framework enables the estimation of the theoretical upper bound for defect pixel detection performance and provides confidence intervals for VDP quality scores.

cs.CY [Back]

[80] A vibe coding learning design to enhance EFL students’ talking to, through, and about AI cs.CY | cs.AI | cs.CLPDF

David James Woo, Kai Guo, Yangyang Yu

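TL;DR: This innovative practice article reports a pilot of vibe coding (collaborating with AI in natural language to build software applications) in English as a Foreign Language (EFL) teaching. The study developed a human-AI meta-languaging framework with three dimensions: talking to AI (prompt engineering), talking through AI (negotiating authorship), and talking about AI (mental models of AI). Case studies show that differences in students' vibe coding outcomes relate to their prompt engineering approaches and AI mental models.

Details

Motivation: To explore how AI techniques such as vibe coding can enrich EFL students' language learning, and to reveal the challenges students face when collaborating with AI and the reasons behind them.

Result: One student successfully designed an application that functioned as intended, while another ran into technical difficulties, with major gaps between intended design and actual functionality. The differences stem mainly from the students' prompt engineering approaches and AI mental models.

Insight: Effective vibe coding instruction requires explicit meta-languaging scaffolding: structured prompt engineering training, critical discussion of authorship, and vocabulary for articulating AI mental models.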

Abstract: This innovative practice article reports on the piloting of vibe coding (using natural language to create software applications with AI) for English as a Foreign Language (EFL) education. We developed a human-AI meta-languaging framework with three dimensions: talking to AI (prompt engineering), talking through AI (negotiating authorship), and talking about AI (mental models of AI). Using backward design principles, we created a four-hour workshop where two students designed applications addressing authentic EFL writing challenges. We adopted a case study methodology, collecting data from worksheets and video recordings, think-aloud protocols, screen recordings, and AI-generated images. Contrasting cases showed one student successfully vibe coding a functional application cohering to her intended design, while another encountered technical difficulties with major gaps between intended design and actual functionality. Analysis reveals differences in students’ prompt engineering approaches, suggesting different AI mental models and tensions in attributing authorship. We argue that AI functions as a beneficial languaging machine, and that differences in how students talk to, through, and about AI explain vibe coding outcome variations. Findings indicate that effective vibe coding instruction requires explicit meta-languaging scaffolding, teaching structured prompt engineering, facilitating critical authorship discussions, and developing vocabulary for articulating AI mental models.

eess.IV [Back]

[81] Dynamic Structural Recovery Parameters Enhance Prediction of Visual Outcomes After Macular Hole Surgery eess.IV | cs.CV | I.4.6PDF

Yinzheng Zhao, Zhihao Zhao, Rundong Jiang, Louisa Sackewitz, Quanmin Liang

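TL;DR: The paper proposes a multimodal deep learning model that incorporates dynamic structural parameters to predict visual recovery after macular hole surgery, markedly improving prediction accuracy.

Details

Motivation: Existing methods for predicting visual recovery after macular hole surgery do not make full use of dynamic structural parameters, which limits prediction accuracy. The paper aims to fill this gap.

Result: Dynamic parameters consistently improved logistic regression AUC, and the multimodal deep learning model outperformed logistic regression at every stage (AUC difference of up to 0.12).

Insight: Combining dynamic structural parameters with multimodal data can markedly improve predictive performance and provides a more precise tool for clinical decision-making.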

Abstract: Purpose: To introduce novel dynamic structural parameters and evaluate their integration within a multimodal deep learning (DL) framework for predicting postoperative visual recovery in idiopathic full-thickness macular hole (iFTMH) patients. Methods: We utilized a publicly available longitudinal OCT dataset at five stages (preoperative, 2 weeks, 3 months, 6 months, and 12 months). A stage specific segmentation model delineated related structures, and an automated pipeline extracted quantitative, composite, qualitative, and dynamic features. Binary logistic regression models, constructed with and without dynamic parameters, assessed their incremental predictive value for best-corrected visual acuity (BCVA). A multimodal DL model combining clinical variables, OCT-derived features, and raw OCT images was developed and benchmarked against regression models. Results: The segmentation model achieved high accuracy across all timepoints (mean Dice > 0.89). Univariate and multivariate analyses identified base diameter, ellipsoid zone integrity, and macular hole area as significant BCVA predictors (P < 0.05). Incorporating dynamic recovery rates consistently improved logistic regression AUC, especially at the 3-month follow-up. The multimodal DL model outperformed logistic regression, yielding higher AUCs and overall accuracy at each stage. The difference is as high as 0.12, demonstrating the complementary value of raw image volume and dynamic parameters. Conclusions: Integrating dynamic parameters into the multimodal DL model significantly enhances the accuracy of predictions. This fully automated process therefore represents a promising clinical decision support tool for personalized postoperative management in macular hole surgery.

[82] Virtual staining for 3D X-ray histology of bone implants eess.IV | cs.AI | cs.CV | physics.comp-ph | q-bio.QMPDF

Sarah C. Irvine, Christian Lucas, Diana Krüger, Bianca Guedert, Julian Moosmann

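TL;DR: The paper proposes a deep learning method that extends virtual staining to 3D X-ray imaging, using cross-modality image translation to virtually stain bone-implant samples and thereby improve biochemical specificity and interpretability without physical sectioning or chemical staining.

Details

Motivation: Conventional 2D histology requires physical sectioning and chemical staining. 3D X-ray histology offers non-invasive volumetric imaging, but its greyscale contrast limits biochemical specificity. Virtual staining could address this, but it has so far been applied mainly to optical images.

Result: The method outperforms Pix2Pix and standard CycleGAN baselines on SSIM, PSNR, and LPIPS, and can generate virtually stained 3D datasets with high-resolution structural detail.

Insight: Although the method reproduces features such as new bone formation, its depiction of implant degradation layers varies, indicating a need for more training data and further refinement.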

Abstract: Three-dimensional X-ray histology techniques offer a non-invasive alternative to conventional 2D histology, enabling volumetric imaging of biological tissues without the need for physical sectioning or chemical staining. However, the inherent greyscale image contrast of X-ray tomography limits its biochemical specificity compared to traditional histological stains. Within digital pathology, deep learning-based virtual staining has demonstrated utility in simulating stained appearances from label-free optical images. In this study, we extend virtual staining to the X-ray domain by applying cross-modality image translation to generate artificially stained slices from synchrotron-radiation-based micro-CT scans. Using over 50 co-registered image pairs of micro-CT and toluidine blue-stained histology from bone-implant samples, we trained a modified CycleGAN network tailored for limited paired data. Whole slide histology images were downsampled to match the voxel size of the CT data, with on-the-fly data augmentation for patch-based training. The model incorporates pixelwise supervision and greyscale consistency terms, producing histologically realistic colour outputs while preserving high-resolution structural detail. Our method outperformed Pix2Pix and standard CycleGAN baselines across SSIM, PSNR, and LPIPS metrics. Once trained, the model can be applied to full CT volumes to generate virtually stained 3D datasets, enhancing interpretability without additional sample preparation. While features such as new bone formation were able to be reproduced, some variability in the depiction of implant degradation layers highlights the need for further training data and refinement. This work introduces virtual staining to 3D X-ray imaging and offers a scalable route for chemically informative, label-free tissue characterisation in biomedical research.

[83] In-Loop Filtering Using Learned Look-Up Tables for Video Coding eess.IV | cs.CV | cs.MMPDF

Zhuoyuan Li, Jiacheng Li, Yao Li, Jialin Li, Li Li

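TL;DR: The paper proposes LUT-ILF++, an in-loop filtering (ILF) framework based on look-up tables (LUTs). By caching the outputs of a deep neural network into LUTs, it substantially reduces computational complexity and storage requirements while retaining good coding gains.

Details

Motivation: Neural network-based in-loop filtering can markedly improve video coding quality, but its high computational complexity and hardware demands limit practical use. A more practical alternative is therefore needed.

Result: Implemented in the VVC reference software, the framework achieves average bitrate reductions of 0.82%/2.97%/1.63% and 0.85%/4.11%/2.06% under the AI and RA configurations, respectively, with much lower time complexity and storage cost than DNN-based solutions.

Insight: By combining the light weight of LUTs with the effectiveness of DNNs, the work shows a feasible path to balancing performance and complexity in video coding and offers a new approach for practical deployment.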

Abstract: In-loop filtering (ILF) is a key technology in video coding standards to reduce artifacts and enhance visual quality. Recently, neural network-based ILF schemes have achieved remarkable coding gains, emerging as a powerful candidate for next-generation video coding standards. However, the use of deep neural networks (DNN) brings significant computational and time complexity or high demands for dedicated hardware, making it challenging for general use. To address this limitation, we study a practical ILF solution by adopting look-up tables (LUTs). After training a DNN with a restricted reference range for ILF, all possible inputs are traversed, and the output values of the DNN are cached into LUTs. During the coding process, the filtering process is performed by simply retrieving the filtered pixel through locating the input pixels and interpolating between the cached values, instead of relying on heavy inference computations. In this paper, we propose a universal LUT-based ILF framework, termed LUT-ILF++. First, we introduce the cooperation of multiple kinds of filtering LUTs and propose a series of customized indexing mechanisms to enable better filtering reference perception with limited storage consumption. Second, we propose the cross-component indexing mechanism to enable the filtering of different color components jointly. Third, in order to make our solution practical for coding uses, we propose the LUT compaction scheme to enable the LUT pruning, achieving a lower storage cost of the entire solution. The proposed framework is implemented in the VVC reference software. Experimental results show that the proposed framework achieves on average 0.82%/2.97%/1.63% and 0.85%/4.11%/2.06% bitrate reduction for common test sequences, under the AI and RA configurations, respectively. Compared to DNN-based solutions, our proposed solution has much lower time complexity and storage cost.
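The core idea of caching a trained filter's outputs and replacing inference with retrieval plus interpolation can be illustrated with a one-dimensional toy. This is a heavily simplified sketch: the real LUTs are indexed by several reference pixels, and `build_lut`, `lut_lookup`, the sampling step of 16, and the stand-in `fn` are all assumptions for illustration.

```python
from typing import Callable, List


def build_lut(fn: Callable[[float], float], step: int = 16,
              value_range: int = 256) -> List[float]:
    """Cache fn at inputs 0, step, 2*step, ..., value_range (inclusive).
    fn stands in for the trained network; sampling sparsely keeps the table small."""
    return [fn(i * step) for i in range(value_range // step + 1)]


def lut_lookup(lut: List[float], x: int, step: int = 16) -> float:
    """Retrieve the filtered value for pixel x (0 <= x < value_range) by
    linear interpolation between the two nearest cached entries."""
    idx = x // step
    frac = (x % step) / step
    return lut[idx] * (1.0 - frac) + lut[idx + 1] * frac
```

With a step of 16 over an 8-bit range the table holds 17 entries instead of 256, and the lookup costs two reads and a blend. The trade-off is interpolation error wherever the cached function is not locally linear, which is why index design and table compaction matter in the full method.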

cs.RO [Back]

[84] OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning cs.RO | cs.AI | cs.CL | cs.CVPDF

Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yuzheng Zhuang

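TL;DR: OmniEVA is a versatile embodied planner built on multimodal large language models. Through task-adaptive 3D grounding and embodiment-aware reasoning, it addresses the Geometric Adaptability Gap and the Embodiment Constraint Gap in existing systems.

Details

Motivation: Applying multimodal large language models (MLLMs) to embodied robotic agents faces two problems: poor adaptability of spatial information (relying only on 2D inputs or on hard-coded 3D geometry injection) and neglect of embodiment constraints (ignoring the robot's physical limits). OmniEVA aims to address both and make robot task plans more adaptable and feasible.

Result: OmniEVA achieves state-of-the-art general embodied reasoning performance and adapts well across a wide range of downstream scenarios.

Insight: Dynamically regulating 3D information and physical constraints can markedly improve task planning, suggesting that adaptivity is key to the generalization of embodied agents.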

Abstract: Recent advances in multimodal large language models (MLLMs) have opened new opportunities for embodied intelligence, enabling multimodal understanding, reasoning, and interaction, as well as continuous spatial decision-making. Nevertheless, current MLLM-based embodied systems face two critical limitations. First, Geometric Adaptability Gap: models trained solely on 2D inputs or with hard-coded 3D geometry injection suffer from either insufficient spatial information or restricted 2D generalization, leading to poor adaptability across tasks with diverse spatial demands. Second, Embodiment Constraint Gap: prior work often neglects the physical constraints and capacities of real robots, resulting in task plans that are theoretically valid but practically infeasible. To address these gaps, we introduce OmniEVA – an embodied versatile planner that enables advanced embodied reasoning and task planning through two pivotal innovations: (1) a Task-Adaptive 3D Grounding mechanism, which introduces a gated router to perform explicit selective regulation of 3D fusion based on contextual requirements, enabling context-aware 3D grounding for diverse embodied tasks. (2) an Embodiment-Aware Reasoning framework that jointly incorporates task goals and embodiment constraints into the reasoning loop, resulting in planning decisions that are both goal-directed and executable. Extensive experimental results demonstrate that OmniEVA not only achieves state-of-the-art general embodied reasoning performance, but also exhibits a strong ability across a wide range of downstream scenarios. Evaluations of a suite of proposed embodied benchmarks, including both primitive and composite tasks, confirm its robust and versatile planning capabilities. Project page: https://omnieva.github.io

[85] SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning cs.RO | cs.AI | cs.CL | cs.LGPDF

Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang

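TL;DR: SimpleVLA-RL proposes a reinforcement learning framework for improving the long-horizon action planning of Vision-Language-Action (VLA) models, reducing dependence on large-scale data and improving generalization.

Details

Motivation: Current VLA models rely on costly, large-scale human-operated trajectories for supervised fine-tuning (SFT) and generalize poorly under distribution shift. Inspired by Large Reasoning Models (LRMs), whose reasoning improves through reinforcement learning, the paper explores whether RL can likewise improve action planning in VLA models.

Result: The method achieves SoTA performance on LIBERO and outperforms $\pi_0$ on RoboTwin 1.0 & 2.0. The authors also identify a new phenomenon during RL training, which they call "pushcut".

Insight: Reinforcement learning not only reduces VLA models' dependence on large-scale data but also uncovers action patterns not seen earlier in training, and it drives performance gains on real-world tasks.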

Abstract: Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of large-scale human-operated robotic trajectories required for SFT scaling, and (ii) limited generalization to tasks involving distribution shift. Recent breakthroughs in Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can dramatically enhance step-by-step reasoning capabilities, raising a natural question: Can RL similarly improve the long-horizon step-by-step action planning of VLA? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. When applied to OpenVLA-OFT, SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms $\pi_0$ on RoboTwin 1.0&2.0 with the exploration-enhancing strategies we introduce. SimpleVLA-RL not only reduces dependence on large-scale data and enables robust generalization, but also remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon “pushcut” during RL training, wherein the policy discovers previously unseen patterns beyond those seen in the previous training process. GitHub: https://github.com/PRIME-RL/SimpleVLA-RL

Sourav Garg, Dustin Craggs, Vineeth Bhat, Lachlan Mares, Stefan Podgorski

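TL;DR: The paper proposes a visual navigation method based on object-relative control, replacing the conventional image-relative approach with object-level representations to improve generalization and adaptability across scenes.

Details

Motivation: Conventional visual navigation relies on image-relative control, whose performance is limited by changes in viewpoint and robot pose. Object-level representations are more stable and generalize better across scenes.

Result: Experiments show that object-relative control outperforms its image-relative counterpart under viewpoint changes and on tasks such as reverse-direction navigation, and that the simulation-trained policy generalizes well to real-world indoor environments.

Insight: Object-level representations provide a more stable basis for navigation control, reduce dependence on viewpoint and robot pose, and suit deployment across scenes.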

Abstract: Visual navigation using only a single camera and a topological map has recently become an appealing alternative to methods that require additional sensors and 3D maps. This is typically achieved through an “image-relative” approach to estimating control from a given pair of current observation and subgoal image. However, image-level representations of the world have limitations because images are strictly tied to the agent’s pose and embodiment. In contrast, objects, being a property of the map, offer an embodiment- and trajectory-invariant world representation. In this work, we present a new paradigm of learning “object-relative” control that exhibits several desirable characteristics: a) new routes can be traversed without strictly requiring to imitate prior experience, b) the control prediction problem can be decoupled from solving the image matching problem, and c) high invariance can be achieved in cross-embodiment deployment for variations across both training-testing and mapping-execution settings. We propose a topometric map representation in the form of a “relative” 3D scene graph, which is used to obtain more informative object-level global path planning costs. We train a local controller, dubbed “ObjectReact”, conditioned directly on a high-level “WayObject Costmap” representation that eliminates the need for an explicit RGB input. We demonstrate the advantages of learning object-relative control over its image-relative counterpart across sensor height variations and multiple navigation tasks that challenge the underlying spatial understanding capability, e.g., navigating a map trajectory in the reverse direction. We further show that our sim-only policy is able to generalize well to real-world indoor environments. Code and supplementary material are accessible via project page: https://object-react.github.io/

[87] Dexplore: Scalable Neural Control for Dexterous Manipulation from Reference-Scoped Exploration cs.RO | cs.CVPDF

Sirui Xu, Yu-Wei Chao, Liuyu Bian, Arsalan Mousavian, Yu-Xiong Wang

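TL;DR: Dexplore proposes a unified single-loop optimization that learns robot control policies directly from motion-capture data, avoiding the error accumulation of the conventional three-stage workflow.

Details

Motivation: Conventional approaches to robot control from motion-capture data use a three-stage workflow (retargeting, tracking, and residual correction), which tends to compound errors and underuse the data. Dexplore aims to solve these problems with a unified single-loop optimization.

Result: Dexplore preserves demonstration intent while letting robot-specific strategies emerge, improves robustness to noise, and scales to large demonstration corpora. The resulting vision-based controller supports generalization across objects and real-world deployment.

Insight: Treating demonstrations as soft guidance rather than ground truth, combined with training under adaptive spatial scopes, makes effective use of imperfect demonstrations while avoiding error accumulation across stages.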

Abstract: Hand-object motion-capture (MoCap) repositories offer large-scale, contact-rich demonstrations and hold promise for scaling dexterous robotic manipulation. Yet demonstration inaccuracies and embodiment gaps between human and robot hands limit the straightforward use of these data. Existing methods adopt a three-stage workflow, including retargeting, tracking, and residual correction, which often leaves demonstrations underused and compound errors across stages. We introduce Dexplore, a unified single-loop optimization that jointly performs retargeting and tracking to learn robot control policies directly from MoCap at scale. Rather than treating demonstrations as ground truth, we use them as soft guidance. From raw trajectories, we derive adaptive spatial scopes, and train with reinforcement learning to keep the policy in-scope while minimizing control effort and accomplishing the task. This unified formulation preserves demonstration intent, enables robot-specific strategies to emerge, improves robustness to noise, and scales to large demonstration corpora. We distill the scaled tracking policy into a vision-based, skill-conditioned generative controller that encodes diverse manipulation skills in a rich latent representation, supporting generalization across objects and real-world deployment. Taken together, these contributions position Dexplore as a principled bridge that transforms imperfect demonstrations into effective training signals for dexterous manipulation.

cs.LG

[88] Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents cs.LG | cs.CL

Jiawei Wang, Jiacai Liu, Yuqian Fu, Yingru Li, Xintao Wang

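TL;DR: To address sparse rewards and credit assignment for LLM agents in long-horizon tasks, the paper proposes Entropy-Modulated Policy Gradients (EMPG), which re-calibrates the learning signal to improve efficiency and stability.

Details

Motivation: In long-horizon tasks, LLM-based agents face sparse rewards and difficult credit assignment. Previous methods guide learning with dense reward signals but overlook the coupling between policy-gradient magnitude and entropy.

Result: On three tasks, WebShop, ALFWorld, and Deep Search, EMPG achieves performance gains and significantly outperforms the baseline methods.

Insight: The coupling between policy-gradient magnitude and entropy is a major cause of inefficient and unstable learning in LLM agents; entropy modulation can markedly improve the learning dynamics.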

Abstract: In long-horizon tasks, recent agents based on Large Language Models (LLMs) face a significant challenge: sparse, outcome-based rewards make it difficult to assign credit to intermediate steps. Previous methods mainly focus on creating dense reward signals to guide learning, either through traditional reinforcement learning techniques like inverse reinforcement learning or by using Process Reward Models for step-by-step feedback. In this paper, we identify a fundamental problem in the learning dynamics of LLMs: the magnitude of policy gradients is inherently coupled with the entropy, which leads to inefficiently small updates for confident correct actions and potentially destabilizing large updates for uncertain ones. To resolve this, we propose Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the learning signal based on step-wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps to stabilize exploration. We further introduce a bonus term for future clarity that encourages agents to find more predictable solution paths. Through comprehensive experiments on three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we demonstrate that EMPG achieves substantial performance gains and significantly outperforms strong policy gradient baselines. The project page is at https://empgseed-seed.github.io/
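
As a rough illustration of the idea (not the paper's actual formulation), the sketch below rescales a single trajectory-level advantage at each step according to the normalized entropy of the policy at that step. The weighting function `exp(-k * H)`, the parameter `k`, the mean-one normalization, and both function names are assumptions made for this example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def modulated_advantages(step_probs, outcome_advantage, k=1.0):
    """Scale one outcome-level advantage per step by policy confidence.

    step_probs: one action distribution per step (each with at least 2 actions).
    outcome_advantage: scalar advantage from the final task outcome
        (positive for success, negative for failure).
    k: hypothetical sharpness of the modulation.
    """
    # Normalize entropy to [0, 1] so steps with different action counts compare.
    norm_h = [entropy(p) / math.log(len(p)) for p in step_probs]

    # Assumed weighting: low entropy (confident) -> large weight.
    raw = [math.exp(-k * h) for h in norm_h]

    # Rescale so the average weight is 1 and the overall signal scale is kept.
    mean_raw = sum(raw) / len(raw)
    return [outcome_advantage * w / mean_raw for w in raw]


# A confident step followed by a maximally uncertain step.
steps = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]
success = modulated_advantages(steps, outcome_advantage=1.0)
failure = modulated_advantages(steps, outcome_advantage=-1.0)
```

With a positive outcome the confident step receives the larger update, with a negative outcome it receives the larger penalty, and the uncertain step is attenuated in both cases. The future-clarity bonus mentioned in the abstract is not modeled here.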

[89] Adaptive Pareto-Optimal Token Merging for Edge Transformer Models in Semantic Communication cs.LG | cs.AI | cs.CV | eess.IV

Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis

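TL;DR: The paper proposes a training-free framework that uses adaptive token merging in pretrained vision transformers to reduce inference time and transmission resource usage. It balances accuracy against computational cost through multi-objective optimization and Bayesian optimization, and can be adjusted flexibly for dynamic applications. Experiments show that it lowers computational complexity while remaining competitive in accuracy.

Details

Motivation: Large-scale transformer models perform well in semantic communication systems, but their high computational demands limit practical deployment in resource-constrained 6G networks, so an efficient method is needed to address this challenge.

Result: Experiments show that the method reduces floating-point operations while maintaining accuracy, and that it adapts to different signal-to-noise ratio conditions.

Insight: Dynamically adjusting the merging policy copes effectively with fluctuations in channel quality, offering an efficient and flexible way to deploy transformers in edge intelligence systems.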

Abstract: Large-scale transformer models have emerged as a powerful tool for semantic communication systems, enabling edge devices to extract rich representations for robust inference across noisy wireless channels. However, their substantial computational demands remain a major barrier to practical deployment in resource-constrained 6G networks. In this paper, we present a training-free framework for adaptive token merging in pretrained vision transformers to jointly reduce inference time and transmission resource usage. We formulate the selection of per-layer merging proportions as a multi-objective optimization problem to balance accuracy and computational cost. We employ Gaussian process-based Bayesian optimization to construct a Pareto frontier of optimal configurations, enabling flexible runtime adaptation to dynamic application requirements and channel conditions. Extensive experiments demonstrate that our method consistently outperforms other baselines and achieves significant reductions in floating-point operations while maintaining competitive accuracy across a wide range of signal-to-noise ratio (SNR) conditions. Additional results highlight the effectiveness of adaptive policies that adjust merging aggressiveness in response to channel quality, providing a practical mechanism to trade off latency and semantic fidelity on demand. These findings establish a scalable and efficient approach for deploying transformer-based semantic communication in future edge intelligence systems.
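
The Pareto-frontier step can be illustrated independently of the Bayesian optimization that proposes candidate configurations. In the sketch below, each candidate is a hypothetical `(name, cost, accuracy)` tuple; the configuration names, cost values, and accuracies are invented for the example, and `select_for_budget` is an assumed stand-in for whatever runtime policy the paper uses.

```python
def pareto_front(configs):
    """Keep configurations that are not dominated (lower cost, higher accuracy).

    configs: iterable of (name, cost, accuracy) tuples.
    Returns the non-dominated tuples sorted by increasing cost.
    """
    # Sort by cost ascending; break cost ties by accuracy descending.
    ordered = sorted(configs, key=lambda c: (c[1], -c[2]))
    front = []
    best_acc = float("-inf")
    for name, cost, acc in ordered:
        # A config survives only if it beats every cheaper (or equal-cost) one.
        if acc > best_acc:
            front.append((name, cost, acc))
            best_acc = acc
    return front


def select_for_budget(front, cost_budget):
    """Pick the most accurate frontier config whose cost fits the budget."""
    feasible = [c for c in front if c[1] <= cost_budget]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c[2])


# Invented candidates: (merging configuration, relative cost, accuracy).
candidates = [
    ("r0", 10.0, 0.80),
    ("r1", 7.0, 0.79),
    ("r2", 7.5, 0.75),
    ("r3", 4.0, 0.70),
    ("r4", 4.0, 0.65),
]
front = pareto_front(candidates)
choice = select_for_budget(front, cost_budget=8.0)
```

Computing the frontier once offline and then looking up a configuration per request is one simple way to realize runtime adaptation: a tighter latency budget or a poorer channel maps to a smaller `cost_budget`.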

[90] Graph Alignment via Dual-Pass Spectral Encoding and Latent Space Communication cs.LG | cs.AI | cs.CV

Maysam Behmanesh, Erkan Turan, Maks Ovsjanikov

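TL;DR: The paper proposes a new graph alignment framework that uses dual-pass spectral encoding and latent-space communication to enhance node distinctiveness and maintain geometric consistency, outperforming existing unsupervised baselines.

Details

Motivation: Existing unsupervised graph alignment methods produce unreliable node correspondences because of oversmoothing in GNN embeddings and misaligned latent spaces, and need improvement.

Result: The method outperforms existing approaches on graph benchmarks and on cross-modal (vision-language) tasks, and is strongly robust to structural noise and heterogeneity.

Insight: Combining spectral filtering with geometric consistency constraints effectively addresses embedding oversmoothing and latent-space misalignment, and the framework generalizes across domains.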

Abstract: Graph alignment, the problem of identifying corresponding nodes across multiple graphs, is fundamental to numerous applications. Most existing unsupervised methods embed node features into latent representations to enable cross-graph comparison without ground-truth correspondences. However, these methods suffer from two critical limitations: the degradation of node distinctiveness due to oversmoothing in GNN-based embeddings, and the misalignment of latent spaces across graphs caused by structural noise, feature heterogeneity, and training instability, ultimately leading to unreliable node correspondences. We propose a novel graph alignment framework that simultaneously enhances node distinctiveness and enforces geometric consistency across latent spaces. Our approach introduces a dual-pass encoder that combines low-pass and high-pass spectral filters to generate embeddings that are both structure-aware and highly discriminative. To address latent space misalignment, we incorporate a geometry-aware functional map module that learns bijective and isometric transformations between graph embeddings, ensuring consistent geometric relationships across different representations. Extensive experiments on graph benchmarks demonstrate that our method consistently outperforms existing unsupervised alignment baselines, exhibiting superior robustness to structural inconsistencies and challenging alignment scenarios. Additionally, comprehensive evaluation on vision-language benchmarks using diverse pretrained models shows that our framework effectively generalizes beyond graph domains, enabling unsupervised alignment of vision and language representations.
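
To make the low-pass/high-pass distinction concrete, here is a minimal sketch using the simplest first-order filters: the low-pass output averages a node's features with its neighbors', and the high-pass output is the residual that the averaging removed. This is an assumed toy construction, not the paper's encoder; the function name `dual_pass_features` and the adjacency-list input format are hypothetical.

```python
def dual_pass_features(adjacency, features):
    """Concatenate low-pass and high-pass filtered features for each node.

    adjacency: adjacency[i] is the list of neighbor indices of node i.
    features: features[i] is the feature vector (list of floats) of node i.
    """
    embeddings = []
    for node, neighbors in enumerate(adjacency):
        group = [node] + list(neighbors)  # include a self-loop
        dim = len(features[node])

        # Low-pass: neighborhood mean. Repeating this step many times is what
        # drives node embeddings together (oversmoothing).
        low = [sum(features[j][d] for j in group) / len(group) for d in range(dim)]

        # High-pass: how the node deviates from its neighborhood mean, which
        # preserves node-specific, discriminative detail.
        high = [features[node][d] - low[d] for d in range(dim)]

        embeddings.append(low + high)
    return embeddings


# Path graph 0 - 1 - 2 with one scalar feature per node.
adjacency = [[1], [0, 2], [1]]
features = [[1.0], [0.0], [1.0]]
embeddings = dual_pass_features(adjacency, features)
```

By construction the two halves of each embedding sum back to the original feature, so the concatenation keeps the smoothed structural signal without discarding the detail that smoothing removes.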

eess.SP

[91] Ultrafast Deep Learning-Based Scatter Estimation in Cone-Beam Computed Tomography eess.SP | cs.CV

Harshit Agrawal, Ari Hietanen, Simo Särkkä

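TL;DR: The paper proposes a fast deep learning-based scatter estimation method that, by optimizing network structure and input size across multiple resolutions, substantially reduces computational cost and memory requirements while maintaining high accuracy.

Details

Motivation: Scatter artifacts in cone-beam computed tomography (CBCT) severely degrade image quality. Existing deep learning methods are effective, but their large networks make them hard to deploy on mobile CBCT systems or edge devices.

Result: Experiments show that the optimized method is comparable to the baseline in MAPE and MSE, with a 78-fold reduction in FLOPs and with inference time and GPU memory reduced by factors of 16 and 12, respectively. The scatter correction results are robust on both simulated and real data.

Insight: The paper highlights the important role of downsampling in deep learning-based scatter estimation and demonstrates the feasibility of lightweight networks in resource-constrained environments.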

Abstract: Purpose: Scatter artifacts drastically degrade the image quality of cone-beam computed tomography (CBCT) scans. Although deep learning-based methods show promise in estimating scatter from CBCT measurements, their deployment in mobile CBCT systems or edge devices is still limited due to the large memory footprint of the networks. This study addresses the issue by applying networks at varying resolutions and suggesting an optimal one, based on speed and accuracy. Methods: First, the reconstruction error in down-up sampling of CBCT scatter signal was examined at six resolutions by comparing four interpolation methods. Next, a recent state-of-the-art method was trained across five image resolutions and evaluated for the reductions in floating-point operations (FLOPs), inference times, and GPU memory requirements. Results: Reducing the input size and network parameters achieved a 78-fold reduction in FLOPs compared to the baseline method, while maintaining comparable performance in terms of mean-absolute-percentage-error (MAPE) and mean-square-error (MSE). Specifically, the MAPE decreased to 3.85% compared to 4.42%, and the MSE decreased to $1.34 \times 10^{-2}$ compared to $2.01 \times 10^{-2}$. Inference time and GPU memory usage were reduced by factors of 16 and 12, respectively. Further experiments comparing scatter-corrected reconstructions on a large, simulated dataset and real CBCT scans from water and Sedentex CT phantoms clearly demonstrated the robustness of our method. Conclusion: This study highlights the underappreciated role of downsampling in deep learning-based scatter estimation. The substantial reduction in FLOPs and GPU memory requirements achieved by our method enables scatter correction in resource-constrained environments, such as mobile CBCT and edge devices.
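
The down-up sampling experiment described under Methods can be mimicked on a toy signal. The sketch below is an assumed one-dimensional analogue: it block-averages a smooth synthetic signal, restores it with two interpolation methods, and reports the MAPE of the round trip. The synthetic sine signal, the choice of nearest and linear interpolation, and all function names are illustrative assumptions; the paper works on 2D CBCT scatter signals and compares four interpolation methods that the abstract does not name.

```python
import math


def downsample(signal, factor):
    """Block-average by `factor`; len(signal) must be divisible by `factor`."""
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal), factor)]


def upsample_nearest(coarse, factor):
    """Repeat each coarse sample `factor` times."""
    return [value for value in coarse for _ in range(factor)]


def upsample_linear(coarse, factor):
    """Linearly interpolate between block centres, clamping at the edges."""
    n = len(coarse)
    out = []
    for i in range(n * factor):
        # Position of fine sample i in coarse-index coordinates.
        pos = (i + 0.5) / factor - 0.5
        pos = min(max(pos, 0.0), n - 1.0)
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        t = pos - lo
        out.append((1.0 - t) * coarse[lo] + t * coarse[hi])
    return out


def mape(reference, estimate):
    """Mean absolute percentage error; reference values must be non-zero."""
    return 100.0 * sum(abs(r - e) / abs(r)
                       for r, e in zip(reference, estimate)) / len(reference)


# Smooth, strictly positive stand-in for a scatter profile.
signal = [2.0 + math.sin(2.0 * math.pi * i / 64) for i in range(64)]

errors = {}
for factor in (2, 4, 8):
    coarse = downsample(signal, factor)
    errors[factor] = {
        "nearest": mape(signal, upsample_nearest(coarse, factor)),
        "linear": mape(signal, upsample_linear(coarse, factor)),
    }
```

Because scatter is typically a smooth, low-frequency signal, round-trip errors of this kind indicate how aggressively the network input can be shrunk before interpolation alone limits accuracy.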

Table of Contents

cs.CV

[1] Recurrence Meets Transformers for Universal Multimodal Retrieval cs.CV | cs.AI | cs.CL | cs.MM

[2] Diffusion-Based Action Recognition Generalizes to Untrained Domains cs.CV

[3] PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness, and Controllability cs.CV | cs.AI

[4] Similarity-based Outlier Detection for Noisy Object Re-Identification Using Beta Mixtures cs.CV | cs.AI | cs.LG | math.ST | stat.ML | stat.TH

[5] Discovering Divergent Representations between Text-to-Image Models cs.CV

[6] An U-Net-Based Deep Neural Network for Cloud Shadow and Sun-Glint Correction of Unmanned Aerial System (UAS) Imagery cs.CV

[7] CoSwin: Convolution Enhanced Hierarchical Shifted Window Attention For Small-Scale Vision cs.CV

[8] iMatcher: Improve matching in point cloud registration via local-to-global geometric consistency learning cs.CV

[9] COCO-Urdu: A Large-Scale Urdu Image-Caption Dataset with Multimodal Quality Estimation cs.CV | cs.CL | 68T45 (Primary) 68T50 (Secondary)

[10] VoxelFormer: Parameter-Efficient Multi-Subject Visual Decoding from fMRI cs.CV

[11] Integrating Anatomical Priors into a Causal Diffusion Model cs.CV

[12] Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models cs.CV

[13] Improvement of Human-Object Interaction Action Recognition Using Scene Information and Multi-Task Learning Approach cs.CV

[14] SQAP-VLA: A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models cs.CV | cs.AI

[15] FPI-Det: a face–phone Interaction Dataset for phone-use detection and understanding cs.CV

[16] Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention cs.CV

[17] Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval cs.CV

[18] ALL-PET: A Low-resource and Low-shot PET Foundation Model in the Projection Domain cs.CV

[19] Noise-Robust Topology Estimation of 2D Image Data via Neural Networks and Persistent Homology cs.CV

[20] Objectness Similarity: Capturing Object-Level Fidelity in 3D Scene Evaluation cs.CV | cs.AI | cs.GR

[21] Video Understanding by Design: How Datasets Shape Architectures and Insights cs.CV | cs.AI | cs.LG

[22] OCELOT 2023: Cell Detection from Cell-Tissue Interaction Challenge cs.CV | cs.AI

[23] RT-DETR++ for UAV Object Detection cs.CV

[24] A Knowledge Noise Mitigation Framework for Knowledge-based Visual Question Answering cs.CV | cs.AI

[25] CWSSNet: Hyperspectral Image Classification Enhanced by Wavelet Domain Convolution cs.CV

[26] Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios cs.CV

[27] VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models: Methods and Results cs.CV

[28] Medverse: A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement cs.CV

[29] CoAtNeXt: An Attention-Enhanced ConvNeXtV2-Transformer Hybrid Model for Gastric Tissue Classification cs.CV | cs.AI

[30] Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis cs.CV | cs.MM

[31] DATE: Dynamic Absolute Time Enhancement for Long Video Understanding cs.CV

[32] Unified Start, Personalized End: Progressive Pruning for Efficient 3D Medical Image Segmentation cs.CV

[33] Visual Programmability: A Guide for Code-as-Thought in Chart Understanding cs.CV

[34] Modality-Agnostic Input Channels Enable Segmentation of Brain lesions in Multimodal MRI with Sequences Unavailable During Training cs.CV | cs.AI

[35] Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on Materials Characterization cs.CV | cs.AI | cs.CL | cs.MM

[36] You Share Beliefs, I Adapt: Progressive Heterogeneous Collaborative Perception cs.CV

[37] Image Recognition with Vision and Language Embeddings of VLMs cs.CV

[38] Fine-Grained Customized Fashion Design with Image-into-Prompt benchmark and dataset from LMM cs.CV

[39] Exploring Pre-training Across Domains for Few-Shot Surgical Skill Assessment cs.CV | cs.LG

[40] Classification of Driver Behaviour Using External Observation Techniques for Autonomous Vehicles cs.CV | cs.AI | cs.ET | cs.RO | eess.IV

[41] FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark cs.CV | cs.CL

[42] Plug-and-play Diffusion Models for Image Compressive Sensing with Data Consistency Projection cs.CV

[43] A Fully Automatic Framework for Intracranial Pressure Grading: Integrating Keyframe Identification, ONSD Measurement and Clinical Data cs.CV

[44] Decoupling Clinical and Class-Agnostic Features for Reliable Few-Shot Adaptation under Shift cs.CV

[45] FS-Diff: Semantic guidance and clarity-aware simultaneous multimodal image fusion and super-resolution cs.CV

[46] FlexiD-Fuse: Flexible number of inputs multi-modal medical image fusion based on diffusion model cs.CV

[47] OpenFake: An Open Dataset and Platform Toward Large-Scale Deepfake Detection cs.CV | cs.AI | cs.LG | I.4.9; I.5.4; I.2.10

[48] Region-Wise Correspondence Prediction between Manga Line Art Images cs.CV

[49] DualTrack: Sensorless 3D Ultrasound needs Local and Global Context cs.CV

[50] Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders cs.CV | cs.AI

[51] Invisible Attributes, Visible Biases: Exploring Demographic Shortcuts in MRI-based Alzheimer’s Disease Classification cs.CV | cs.AI

[52] PeftCD: Leveraging Vision Foundation Models with Parameter-Efficient Fine-Tuning for Remote Sensing Change Detection cs.CV

[53] Visual Grounding from Event Cameras cs.CV | cs.RO

[54] Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis cs.CV

[55] Measuring Epistemic Humility in Multimodal Large Language Models cs.CV

[56] Can Understanding and Generation Truly Benefit Together – or Just Coexist? cs.CV

[57] Locality in Image Diffusion Models Emerges from Data Statistics cs.CV

[58] SpatialVID: A Large-Scale Video Dataset with Spatial Annotations cs.CV

cs.CL

[59] Automated Evidence Extraction and Scoring for Corporate Climate Policy Engagement: A Multilingual RAG Approach cs.CL

[60] BRoverbs – Measuring how much LLMs understand Portuguese proverbs cs.CL

[61] Can Vision-Language Models Solve Visual Math Equations? cs.CL | cs.AI | cs.CV

[62] MR-UIE: Multi-Perspective Reasoning with Reinforcement Learning for Universal Information Extraction cs.CL

[63] TigerCoder: A Novel Suite of LLMs for Code Generation in Bangla cs.CL

[64] Compass-v3: Scaling Domain-Specific LLMs for Multilingual E-Commerce in Southeast Asia cs.CL

[65] Target-oriented Multimodal Sentiment Classification with Counterfactual-enhanced Debiasing cs.CL | cs.AI

[66] EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs cs.CL | cs.AI | cs.SD

[67] Efficient Trie-based Biasing using K-step Prediction for Rare Word Recognition cs.CL | cs.AI

[68] Improving Synthetic Data Training for Contextual Biasing Models with a Keyword-Aware Cost Function cs.CL | cs.AI

[69] From scratch to silver: Creating trustworthy training data for patent-SDG classification using Large Language Models cs.CL

[70] MetaRAG: Metamorphic Testing for Hallucination Detection in RAG Systems cs.CL

[71] Modelling Analogies and Analogical Reasoning: Connecting Cognitive Science Theory and NLP Research cs.CL

[72] Hierarchical Bracketing Encodings Work for Dependency Graphs cs.CL

[73] Mitigating Language Barriers in Education: Developing Multilingual Digital Learning Materials with Machine Translation cs.CL

[74] All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens cs.CL | I.2.7

[75] CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models cs.CL | cs.AI | cs.LG

cs.AI

[76] Automated Unity Game Template Generation from GDDs via NLP and Multi-Modal LLMs cs.AI | cs.CL | cs.LG | cs.SE