cs.CV [Total: 63]
cs.CL [Total: 20]
eess.IV [Total: 7]
cs.CR [Total: 1]
cs.GR [Total: 1]
cs.SC [Total: 1]
cs.IR [Total: 1]
q-bio.NC [Total: 1]
eess.SP [Total: 1]
cs.NE [Total: 1]
astro-ph.IM [Total: 1]
cs.AI [Total: 1]
cs.SE [Total: 2]
cs.RO [Total: 3]
cs.SD [Total: 1]
cs.LG [Total: 2]

cs.CV

[1] An Memory-Efficient Framework for Deformable Transformer with Neural Architecture Search cs.CV | cs.AI

Wendong Mao, Mingfan Zhao, Jianfeng Guan, Qiwei Dong, Zhongfeng Wang

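TL;DR: This paper proposes a hardware-friendly optimization framework for the Deformable Attention Transformer (DAT). Using neural architecture search (NAS) with a new slicing strategy, it automatically divides the input features into uniform patches during inference, avoiding memory conflicts while preserving model accuracy. FPGA verification shows a substantial reduction in memory accesses.

Details

Motivation: DATs perform well on computer vision tasks, but their data-dependent sampling mechanism produces irregular memory access patterns that are hard to deploy efficiently on hardware. Existing methods either incur high hardware overhead or sacrifice model accuracy.

Result: On ImageNet-1K, accuracy drops by only 0.2%; in FPGA tests, the number of memory accesses falls to 18% of that of existing methods.

Insight: Intelligent input partitioning combined with hardware co-design can substantially improve Transformer deployment efficiency on edge devices with almost no loss in accuracy.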

Abstract: Deformable Attention Transformers (DAT) have shown remarkable performance in computer vision tasks by adaptively focusing on informative image regions. However, their data-dependent sampling mechanism introduces irregular memory access patterns, posing significant challenges for efficient hardware deployment. Existing acceleration methods either incur high hardware overhead or compromise model accuracy. To address these issues, this paper proposes a hardware-friendly optimization framework for DAT. First, a neural architecture search (NAS)-based method with a new slicing strategy is proposed to automatically divide the input feature into uniform patches during the inference process, avoiding memory conflicts without modifying the model architecture. The method explores the optimal slice configuration by jointly optimizing hardware cost and inference accuracy. Second, an FPGA-based verification system is designed to test the performance of this framework on edge-side hardware. Algorithm experiments on the ImageNet-1K dataset demonstrate that our hardware-friendly framework incurs only a 0.2% accuracy drop compared to the baseline DAT. Hardware experiments on a Xilinx FPGA show that the proposed method reduces DRAM access times to 18% of those of existing DAT acceleration methods.

[2] Reprogramming Vision Foundation Models for Spatio-Temporal Forecasting cs.CV | cs.AI

Changlu Chen, Yanbin Liu, Chaoxi Niu, Ling Chen, Tianqing Zhu

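TL;DR: Proposes ST-VFM, a framework that reprograms Vision Foundation Models (VFMs) for spatio-temporal forecasting, addressing the limitations of VFMs in spatio-temporal modeling and achieving superior performance on multiple datasets.

Details

Motivation: Existing foundation models such as large language models perform poorly on time-series forecasting, in particular because they cannot model spatio-temporal correlations. VFMs carry strong spatial priors but lack temporal modeling capacity, and there is a modality gap between visual and spatio-temporal data.

Result: On ten spatio-temporal datasets, ST-VFM outperforms state-of-the-art methods and remains robust across VFM backbones (e.g., DINO, CLIP, DEIT).

Insight: Reprogramming VFMs makes effective use of their strong spatial priors, while auxiliary flow inputs and a dynamic interaction mechanism compensate for the missing temporal modeling, offering a new approach to spatio-temporal forecasting.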

Abstract: Foundation models have achieved remarkable success in natural language processing and computer vision, demonstrating strong capabilities in modeling complex patterns. While recent efforts have explored adapting large language models (LLMs) for time-series forecasting, LLMs primarily capture one-dimensional sequential dependencies and struggle to model the richer spatio-temporal (ST) correlations essential for accurate ST forecasting. In this paper, we present ST-VFM, a novel framework that systematically reprograms Vision Foundation Models (VFMs) for general-purpose spatio-temporal forecasting. While VFMs offer powerful spatial priors, two key challenges arise when applying them to ST tasks: (1) the lack of inherent temporal modeling capacity and (2) the modality gap between visual and ST data. To address these, ST-VFM adopts a dual-branch architecture that integrates raw ST inputs with auxiliary ST flow inputs, where the flow encodes lightweight temporal difference signals interpretable as dynamic spatial cues. To effectively process these dual-branch inputs, ST-VFM introduces two dedicated reprogramming stages. The pre-VFM reprogramming stage applies a Temporal-Aware Token Adapter to embed temporal context and align both branches into VFM-compatible feature spaces. The post-VFM reprogramming stage introduces a Bilateral Cross-Prompt Coordination module, enabling dynamic interaction between branches through prompt-based conditioning, thus enriching joint representation learning without modifying the frozen VFM backbone. Extensive experiments on ten spatio-temporal datasets show that ST-VFM outperforms state-of-the-art baselines, demonstrating effectiveness and robustness across VFM backbones (e.g., DINO, CLIP, DEIT) and in ablation studies, establishing it as a strong general framework for spatio-temporal forecasting.
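The abstract describes the flow branch only as encoding lightweight temporal difference signals. A minimal sketch of one such signal, assuming a plain first-order difference between consecutive frames (the function name `temporal_difference` and the nested-list grid layout are illustrative choices, not taken from the paper):

```python
from typing import List

Grid = List[List[float]]  # one spatial snapshot, stored as rows of values


def temporal_difference(frames: List[Grid]) -> List[Grid]:
    """Return frame[t] - frame[t-1] for every consecutive pair of frames.

    This is an assumed form of the "temporal difference" flow input; the
    paper may define its flow differently.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames to compute a difference")
    return [
        [
            [cur - prev for cur, prev in zip(row_cur, row_prev)]
            for row_cur, row_prev in zip(frame_cur, frame_prev)
        ]
        for frame_prev, frame_cur in zip(frames, frames[1:])
    ]


# Toy 2x2 traffic-like grids over three time steps (made-up values).
frames = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[2.0, 2.0], [5.0, 1.0]],
    [[2.0, 3.0], [5.0, 0.0]],
]
flow = temporal_difference(frames)
```

A sequence of T frames yields T-1 flow grids of the same spatial size, so the flow can be fed to a second branch alongside the raw frames.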

[3] Expert Operational GANS: Towards Real-Color Underwater Image Restoration cs.CV | cs.AI | eess.IV

Ozer Can Devecioglu, Serkan Kiranyaz, Mehmet Yamac, Moncef Gabbouj

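TL;DR: The paper proposes xOp-GAN, a new GAN model in which multiple expert generator networks each handle images of a different quality range, and the discriminator's perceptual confidence score selects the best restored image, markedly improving underwater image restoration.

Details

Motivation: Underwater image restoration is challenging because complex light propagation, scattering, and depth-dependent attenuation produce diverse deformation artifacts. A single generator network cannot cover the full quality range, which motivates multiple expert generators.

Result: On the LSUI dataset, xOp-GAN reaches a PSNR of up to 25.16 dB, far above single-regressor models, with lower complexity.

Insight: Multiple expert generators combined with discriminator-based selection handle the diverse degradations of underwater images at a finer granularity, suggesting a new direction for image restoration in complex domains.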

Abstract: The wide range of deformation artifacts arising from complex light propagation, scattering, and depth-dependent attenuation makes underwater image restoration a challenging problem. Like other single deep regressor networks, conventional GAN-based restoration methods struggle to perform well across this heterogeneous domain, since a single generator network is typically insufficient to capture the full range of visual degradations. To overcome this limitation, we propose xOp-GAN, a novel GAN model with several expert generator networks, each trained solely on a particular subset with a certain image quality. Thus, each generator can learn to maximize its restoration performance for a particular quality range. Once an xOp-GAN is trained, each generator can restore the input image, and the best restored image can then be selected by the discriminator based on its perceptual confidence score. As a result, xOp-GAN is the first GAN model with multiple generators in which the discriminator is used during inference for the regression task. Experimental results on the benchmark Large Scale Underwater Image (LSUI) dataset demonstrate that xOp-GAN achieves PSNR levels up to 25.16 dB, surpassing all single-regressor models by a large margin, even with reduced complexity.
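The inference-time selection step can be sketched as follows. This is only an illustration of "run every expert, keep the output the discriminator scores highest"; the toy generators and the toy scoring rule are hypothetical stand-ins, not the networks used in the paper.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[float]  # toy stand-in for an image: a flat list of pixel values in [0, 1]


def select_best_restoration(
    image: Image,
    generators: Sequence[Callable[[Image], Image]],
    discriminator: Callable[[Image], float],
) -> Tuple[int, Image, float]:
    """Restore the input with every expert and keep the highest-scoring output."""
    candidates = [generate(image) for generate in generators]
    scores = [discriminator(candidate) for candidate in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, candidates[best], scores[best]


# Hypothetical experts: each applies a different brightness gain.
experts = [
    lambda img, gain=gain: [min(1.0, pixel * gain) for pixel in img]
    for gain in (1.0, 1.5, 2.0)
]


def toy_discriminator(img: Image) -> float:
    """Hypothetical confidence score: prefers a mean intensity close to 0.5."""
    mean = sum(img) / len(img)
    return 1.0 - abs(mean - 0.5)


index, restored, score = select_best_restoration([0.2, 0.3, 0.4], experts, toy_discriminator)
```

In this toy run the second expert (gain 1.5) wins because its output has the mean intensity closest to 0.5. In the real model the score would come from the trained discriminator.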

[4] Data-Driven Meta-Analysis and Public-Dataset Evaluation for Sensor-Based Gait Age Estimation cs.CV | eess.IV

Varun Velankar

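TL;DR: Through a data-driven meta-analysis and evaluation on public datasets, the paper systematically studies sensor-based gait age estimation and proposes practical guidelines for reducing error.

Details

Motivation: Gait age estimation has important applications in healthcare, security, and human-computer interaction, but existing work lacks large-scale systematic evaluation and performance baselines.

Result: Multi-sensor fusion models have the lowest error (3.4 years); deep learning models reach 96% accuracy on the VersatileGait dataset while processing each sample in under 0.1 seconds.

Insight: The study shows that the knee and pelvic regions are key to gait age estimation and offers practical recommendations for reducing error below three years in real-world scenarios.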

Abstract: Estimating a person’s age from their gait has important applications in healthcare, security and human-computer interaction. In this work, we review fifty-nine studies involving over seventy-five thousand subjects recorded with video, wearable and radar sensors. We observe that convolutional neural networks produce an average error of about 4.2 years, inertial-sensor models about 4.5 years and multi-sensor fusion as low as 3.4 years, with notable differences between lab and real-world data. We then analyse sixty-three thousand eight hundred forty-six gait cycles from the OU-ISIR Large-Population dataset to quantify correlations between age and five key metrics: stride length, walking speed, step cadence, step-time variability and joint-angle entropy, with correlation coefficients of at least 0.27. Next, we fine-tune a ResNet34 model and apply Grad-CAM to reveal that the network attends to the knee and pelvic regions, consistent with known age-related gait changes. Finally, on a one hundred thousand sample subset of the VersatileGait database, we compare support vector machines, decision trees, random forests, multilayer perceptrons and convolutional neural networks, finding that deep networks achieve up to 96 percent accuracy while processing each sample in under 0.1 seconds. By combining a broad meta-analysis with new large-scale experiments and interpretable visualizations, we establish solid performance baselines and practical guidelines for reducing gait-age error below three years in real-world scenarios.
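The abstract reports correlation coefficients between age and gait metrics without naming the coefficient. Assuming the Pearson correlation, a minimal sketch looks like this (the ages and stride lengths below are made-up illustrative values, not data from OU-ISIR):

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical subjects: age in years and stride length in metres.
ages = [20, 35, 50, 65, 80]
stride_lengths = [1.45, 1.40, 1.32, 1.25, 1.10]
r = pearson_r(ages, stride_lengths)  # strongly negative for this toy sample
```

The sign shows the direction of the relationship and the magnitude its strength, so a reported value of "at least 0.27" most likely refers to the absolute value.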

[5] What cat is that? A re-id model for feral cats cs.CV | cs.AI

Victor Caquilpan

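TL;DR: The paper explores re-identification (re-ID) of feral cats using a modified PPGNet model (PPGNet-Cat) to help monitor their ecological impact, and reports strong performance.

Details

Motivation: Feral cats pose a serious threat to Australian wildlife, so an efficient monitoring method is needed; re-ID can support this using camera-trap images.

Result: PPGNet-Cat achieves a mAP of 0.86 and a rank-1 accuracy of 0.95, demonstrating its competitiveness for feral cat re-ID.

Insight: With suitable modifications and contrastive learning methods, existing re-ID models can be transferred successfully to new species such as feral cats.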

Abstract: Feral cats exert a substantial and detrimental impact on Australian wildlife, placing them among the most dangerous invasive species worldwide. Closely monitoring these cats is therefore essential to minimising their effects. In this context, Re-Identification (re-ID) can enhance monitoring of these animals using images captured by camera traps. This project explores different CV approaches to create a re-ID model able to identify individual feral cats in the wild. The main approach consists of modifying a part-pose guided network (PPGNet) model, initially used in the re-ID of Amur tigers, to be applicable to feral cats. This adaptation results in PPGNet-Cat, which incorporates specific modifications to suit the characteristics of feral cat images. Additionally, various experiments were conducted, particularly exploring contrastive learning approaches such as ArcFace loss. The main results indicate that PPGNet-Cat excels in identifying feral cats, achieving a mean Average Precision (mAP) of 0.86 and a rank-1 accuracy of 0.95. These outcomes establish PPGNet-Cat as a competitive model within the realm of re-ID.
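For readers unfamiliar with the two reported metrics, the sketch below computes rank-1 accuracy and mAP from ranked gallery labels. It assumes a simplified protocol in which every gallery image is a valid match candidate; the paper's exact evaluation protocol is not stated here, and the identities are invented.

```python
from typing import List, Sequence, Tuple


def rank1_and_map(
    rankings: Sequence[Sequence[str]], query_ids: Sequence[str]
) -> Tuple[float, float]:
    """Compute rank-1 accuracy and mean Average Precision.

    rankings[i] holds the gallery identity labels for query i, sorted from
    most to least similar.
    """
    if not query_ids or len(rankings) != len(query_ids):
        raise ValueError("need one ranking per query")
    rank1_hits = 0
    average_precisions: List[float] = []
    for ranking, query_id in zip(rankings, query_ids):
        if ranking and ranking[0] == query_id:
            rank1_hits += 1
        matches_so_far = 0
        precisions = []
        for rank, gallery_id in enumerate(ranking, start=1):
            if gallery_id == query_id:
                matches_so_far += 1
                precisions.append(matches_so_far / rank)  # precision at each hit
        average_precisions.append(sum(precisions) / len(precisions) if precisions else 0.0)
    n = len(query_ids)
    return rank1_hits / n, sum(average_precisions) / n


# Two hypothetical queries (cats "a" and "b") against a three-image gallery.
rank1, mean_ap = rank1_and_map([["a", "c", "a"], ["c", "b", "b"]], ["a", "b"])
```

Rank-1 only checks the top match, while mAP rewards ranking all images of the same individual near the top, which is why the two numbers can differ noticeably.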

[6] SketchDNN: Joint Continuous-Discrete Diffusion for CAD Sketch Generation cs.CV | cs.LG

Sathvik Chereddy, John Femiani

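TL;DR: SketchDNN is a generative model that synthesizes CAD sketches through a joint continuous-discrete diffusion process. Its core innovation, Gaussian-Softmax diffusion, significantly improves generation quality.

Details

Motivation: In CAD sketch generation, the heterogeneity of continuous parameters and discrete classes, together with the permutation invariance of primitives, poses challenges and calls for a unified modeling approach.

Result: On the SketchGraphs dataset, FID drops from 16.04 to 7.80 and NLL from 84.8 to 81.33, setting a new state of the art.

Insight: A joint continuous-discrete diffusion process can effectively handle parameter heterogeneity and permutation invariance in CAD sketches.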

Abstract: We present SketchDNN, a generative model for synthesizing CAD sketches that jointly models both continuous parameters and discrete class labels through a unified continuous-discrete diffusion process. Our core innovation is Gaussian-Softmax diffusion, where logits perturbed with Gaussian noise are projected onto the probability simplex via a softmax transformation, facilitating blended class labels for discrete variables. This formulation addresses two key challenges, namely, the heterogeneity of primitive parameterizations and the permutation invariance of primitives in CAD sketches. Our approach significantly improves generation quality, reducing Fréchet Inception Distance (FID) from 16.04 to 7.80 and negative log-likelihood (NLL) from 84.8 to 81.33, establishing a new state-of-the-art in CAD sketch generation on the SketchGraphs dataset.
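The core operation named in the abstract, perturbing logits with Gaussian noise and projecting them onto the probability simplex with a softmax, can be sketched in a few lines. This shows a single perturbation step only; the noise schedule, the logit scale, and the class meanings below are assumptions for illustration and are not taken from the paper.

```python
import math
import random
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax mapping logits onto the probability simplex."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_softmax(logits: Sequence[float], sigma: float, rng: random.Random) -> List[float]:
    """Add N(0, sigma^2) noise to each logit, then apply softmax."""
    noisy = [value + rng.gauss(0.0, sigma) for value in logits]
    return softmax(noisy)


rng = random.Random(0)
class_logits = [4.0, 0.0, 0.0]  # hypothetical logits favouring class 0
clean = gaussian_softmax(class_logits, sigma=0.0, rng=rng)    # no noise: plain softmax
blended = gaussian_softmax(class_logits, sigma=2.0, rng=rng)  # noisy: a blended class label
```

Because the output always lies on the simplex, a noised discrete label stays a valid distribution over classes, which is what allows blended class labels.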

[7] Interpretable Prediction of Lymph Node Metastasis in Rectal Cancer MRI Using Variational Autoencoders cs.CV | cs.AI | cs.LG

Benjamin Keel, Aaron Quyn, David Jayne, Maryam Mohsin, Samuel D. Relton

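TL;DR: The paper uses a Variational Autoencoder (VAE) as the feature encoder in place of a conventional pre-trained CNN to improve the accuracy and interpretability of lymph node metastasis prediction in rectal cancer MRI. The model performs well on an in-house dataset, with an AUC of 0.86.

Details

Motivation: Existing radiological criteria based on lymph node morphology have limited diagnostic accuracy, and pre-trained CNNs lack interpretability. A VAE encodes visual features directly through image reconstruction, yielding a structured latent space that is easier to interpret.

Result: In cross-validation the model achieves an AUC of 0.86, sensitivity of 0.79, and specificity of 0.85, outperforming existing methods.

Insight: The VAE latent space is more interpretable than that of a CNN, helping reveal key features in medical images and providing transparent support for clinical decisions.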

Abstract: Effective treatment for rectal cancer relies on accurate lymph node metastasis (LNM) staging. However, radiological criteria based on lymph node (LN) size, shape and texture morphology have limited diagnostic accuracy. In this work, we investigate applying a Variational Autoencoder (VAE) as a feature encoder model to replace the large pre-trained Convolutional Neural Network (CNN) used in existing approaches. The motivation for using a VAE is that the generative model aims to reconstruct the images, so it directly encodes visual features and meaningful patterns across the data. This leads to a disentangled and structured latent space which can be more interpretable than a CNN. Models are deployed on an in-house MRI dataset with 168 patients who did not undergo neo-adjuvant treatment. The post-operative pathological N stage was used as the ground truth to evaluate model predictions. Our proposed model ‘VAE-MLP’ achieved state-of-the-art performance on the MRI dataset, with cross-validated metrics of AUC 0.86 +/- 0.05, Sensitivity 0.79 +/- 0.06, and Specificity 0.85 +/- 0.05. Code is available at: https://github.com/benkeel/Lymph_Node_Classification_MIUA.
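As a reminder of what the reported sensitivity and specificity measure, here is a minimal sketch that derives both from binary labels. The labels are invented for illustration (1 = node-positive, 0 = node-negative) and have no connection to the study's data.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels where 1 is positive."""
    tp = fn = tn = fp = 0
    for truth, prediction in zip(y_true, y_pred):
        if truth == 1:
            if prediction == 1:
                tp += 1  # positive case correctly flagged
            else:
                fn += 1  # positive case missed
        else:
            if prediction == 0:
                tn += 1  # negative case correctly cleared
            else:
                fp += 1  # negative case wrongly flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical ground truth and predictions for seven patients.
sensitivity, specificity = sensitivity_specificity(
    [1, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 1],
)
```

Sensitivity is the fraction of truly positive patients that the model flags, and specificity is the fraction of truly negative patients that it clears.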

[8] Posture-Driven Action Intent Inference for Playing style and Fatigue Assessment cs.CV | cs.LG

Abhishek Jaiswal, Nisheeth Srivastava

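TL;DR: The paper proposes a posture-based method for inferring action intent to assess athletes' playing style and fatigue, and validates it with experiments on cricket.

Details

Motivation: Posture as a tool for inferring mental state has potential for diagnosing fatigue, preventing injury, and enhancing performance, but the sensitivity of human-subject data makes conventional approaches difficult. Sports settings offer a viable alternative for data collection.

Result: The method performs well on intent classification (F1 score > 75%, AUC-ROC > 80%), showing that posture signals carry strong inferential power.

Insight: Posture leaks intent information even when the data are noisy. Weak supervision offers a potential way around labeling limitations and may generalize to sports analytics and other areas of behavior analysis.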

Abstract: Posture-based mental state inference has significant potential in diagnosing fatigue, preventing injury, and enhancing performance across various domains. Such tools must be research-validated with large datasets before being translated into practice. Unfortunately, such vision diagnosis faces serious challenges due to the sensitivity of human subject data. To address this, we identify sports settings as a viable alternative for accumulating data from human subjects experiencing diverse emotional states. We test our hypothesis in the game of cricket and present a posture-based solution to identify human intent from activity videos. Our method achieves over 75% F1 score and over 80% AUC-ROC in discriminating aggressive and defensive shot intent through motion analysis. These findings indicate that posture leaks out strong signals for intent inference, even with inherent noise in the data pipeline. Furthermore, we utilize existing data statistics as weak supervision to validate our findings, offering a potential solution for overcoming data labelling limitations. This research contributes to generalizable techniques for sports analytics and also opens possibilities for applying human behavior analysis across various fields.

[9] VISTA: Monocular Segmentation-Based Mapping for Appearance and View-Invariant Global Localization cs.CV | cs.RO

Hannah Shafferman, Annika Thomas, Jouko Kinnari, Michael Ricard, Jose Nino

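TL;DR: VISTA is a global localization framework based on monocular segmentation and tracking. It localizes consistently across viewpoint and seasonal changes without domain-specific training, outperforms baseline methods, and keeps a very small memory footprint.

Details

Motivation: Global localization is critical for autonomous navigation, but traditional methods perform poorly in unstructured environments because of viewpoint variation, seasonal changes, and related issues. VISTA aims to address these challenges.

Result: On seasonal and oblique-angle datasets, VISTA improves recall by up to 69% over baseline methods, with a map only 0.6% the size of the most memory-conservative baseline.

Insight: Methods based on segmentation and geometric consistency cope effectively with viewpoint and appearance changes, and the lightweight object map makes real-time use on resource-constrained platforms feasible.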

Abstract: Global localization is critical for autonomous navigation, particularly in scenarios where an agent must localize within a map generated in a different session or by another agent, as agents often have no prior knowledge about the correlation between reference frames. However, this task remains challenging in unstructured environments due to appearance changes induced by viewpoint variation, seasonal changes, spatial aliasing, and occlusions – known failure modes for traditional place recognition methods. To address these challenges, we propose VISTA (View-Invariant Segmentation-Based Tracking for Frame Alignment), a novel open-set, monocular global localization framework that combines: 1) a front-end, object-based, segmentation and tracking pipeline, followed by 2) a submap correspondence search, which exploits geometric consistencies between environment maps to align vehicle reference frames. VISTA enables consistent localization across diverse camera viewpoints and seasonal changes, without requiring any domain-specific training or finetuning. We evaluate VISTA on seasonal and oblique-angle aerial datasets, achieving up to a 69% improvement in recall over baseline methods. Furthermore, we maintain a compact object-based map that is only 0.6% the size of the most memory-conservative baseline, making our approach capable of real-time implementation on resource-constrained platforms.

[10] Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis cs.CV | cs.AI

Maciej Szankin, Vidhyananth Venkatasamy, Lihang Ying

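TL;DR: The paper systematically benchmarks multimodal Vision-Language Models (VLMs) against a lightweight CNN-based OCR model on billboard text recognition, finding that VLMs are better at holistic scene understanding while the CNN model is more efficient on cropped text.

Details

Motivation: Verifying the text visibility of outdoor advertisements remains challenging in modern marketing, and traditional OCR struggles in complex outdoor scenes. Multimodal VLMs may offer a better end-to-end solution.

Result: VLMs perform better at holistic scene understanding, but lightweight CNN models remain competitive on cropped text at a lower computational cost.

Insight: The practical contribution is evidence that lightweight CNN models are viable for edge deployment, alongside the potential of multimodal VLMs for scene understanding.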

Abstract: Outdoor advertisements remain a critical medium for modern marketing, yet accurately verifying billboard text visibility under real-world conditions is still challenging. Traditional Optical Character Recognition (OCR) pipelines excel at cropped text recognition but often struggle with complex outdoor scenes, varying fonts, and weather-induced visual noise. Recently, multimodal Vision-Language Models (VLMs) have emerged as promising alternatives, offering end-to-end scene understanding with no explicit detection step. This work systematically benchmarks representative VLMs (including Qwen 2.5 VL 3B, InternVL3, and SmolVLM2) against a compact CNN-based OCR baseline (PaddleOCRv4) across two public datasets (ICDAR 2015 and SVT), augmented with synthetic weather distortions to simulate realistic degradation. Our results reveal that while selected VLMs excel at holistic scene reasoning, lightweight CNN pipelines still achieve competitive accuracy for cropped text at a fraction of the computational cost, an important consideration for edge deployment. To foster future research, we release our weather-augmented benchmark and evaluation code publicly.

[11] Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning cs.CV | cs.AI

Fan Shi, Bin Li, Xiangyang Xue

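TL;DR: The paper proposes UCGS, a unified generative framework that solves multiple abstract visual reasoning tasks through multi-task training and demonstrates zero-shot reasoning ability.

Details

Motivation: Existing abstract visual reasoning (AVR) methods are usually designed for specific tasks and generalize poorly to new ones, raising computational and design costs. The paper aims to develop a unified framework that avoids task-specific retraining and architecture changes.

Result: With a single round of multi-task training, UCGS demonstrates abstract reasoning ability across various AVR tasks and performs zero-shot reasoning on unseen tasks at test time.

Insight: A generative framework can unify multiple AVR tasks without task-specific designs, and its zero-shot reasoning ability suggests a new route to model generalization.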

Abstract: Abstract visual reasoning (AVR) enables humans to quickly discover and generalize abstract rules to new scenarios. Designing intelligent systems with human-like AVR abilities has been a long-standing topic in the artificial intelligence community. Deep AVR solvers have recently achieved remarkable success in various AVR tasks. However, they usually use task-specific designs or parameters in different tasks. In such a paradigm, solving new tasks often means retraining the model, and sometimes retuning the model architecture, which increases the cost of solving AVR problems. In contrast to task-specific approaches, this paper proposes a novel Unified Conditional Generative Solver (UCGS), aiming to address multiple AVR tasks in a unified framework. First, we prove that some well-known AVR tasks can be reformulated as the problem of estimating the predictability of target images in problem panels. Then, we illustrate that, under the proposed framework, training one conditional generative model can solve various AVR tasks. The experiments show that with a single round of multi-task training, UCGS demonstrates abstract reasoning ability across various AVR tasks. In particular, UCGS exhibits zero-shot reasoning ability, enabling it to perform abstract reasoning on problems from unseen AVR tasks in the testing phase.

[12] CorrMoE: Mixture of Experts with De-stylization Learning for Cross-Scene and Cross-Domain Correspondence Pruning cs.CV

Peiwen Xia, Tangfei Liao, Wei Zhu, Danhuai Zhao, Jianjun Ke

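TL;DR: CorrMoE is a new correspondence pruning framework that improves robustness on cross-scene and cross-domain tasks through de-stylization and an adaptive mixture of experts.

Details

Motivation: Existing methods handle cross-domain and cross-scene correspondence pruning poorly, mainly because they overlook the challenges of domain shift and scene diversity.

Result: On multiple benchmark datasets, CorrMoE outperforms existing methods, with higher accuracy and better generalization.

Insight: De-stylization and dynamic expert fusion effectively improve cross-domain and cross-scene performance, offering a new approach to correspondence pruning.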

Abstract: Establishing reliable correspondences between image pairs is a fundamental task in computer vision, underpinning applications such as 3D reconstruction and visual localization. Although recent methods have made progress in pruning outliers from dense correspondence sets, they often hypothesize consistent visual domains and overlook the challenges posed by diverse scene structures. In this paper, we propose CorrMoE, a novel correspondence pruning framework that enhances robustness under cross-domain and cross-scene variations. To address domain shift, we introduce a De-stylization Dual Branch, performing style mixing on both implicit and explicit graph features to mitigate the adverse influence of domain-specific representations. For scene diversity, we design a Bi-Fusion Mixture of Experts module that adaptively integrates multi-perspective features through linear-complexity attention and dynamic expert routing. Extensive experiments on benchmark datasets demonstrate that CorrMoE achieves superior accuracy and generalization compared to state-of-the-art methods. The code and pre-trained models are available at https://github.com/peiwenxia/CorrMoE.

[13] ProtoConNet: Prototypical Augmentation and Alignment for Open-Set Few-Shot Image Classification cs.CV

Kexuan Shi, Zhuang Qi, Jingjing Zhu, Lei Meng, Yaochen Zhang

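TL;DR: ProtoConNet is a prototypical augmentation and alignment method that integrates contextual information to improve open-set few-shot image classification. It comprises three core modules: CDS, CSR, and PA.

Details

Motivation: Existing few-shot image classification methods rely mostly on visual information from a single image and overlook the potential of contextual information, which limits generalization. ProtoConNet addresses this by integrating background information.

Result: Experiments on two datasets show that ProtoConNet outperforms existing methods in representation learning and open-set sample recognition.

Insight: Contextual information is crucial for few-shot classification, and prototypical alignment effectively separates known from unknown classes.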

Abstract: Open-set few-shot image classification aims to train models using a small amount of labeled data, enabling them to achieve good generalization when confronted with unknown environments. Existing methods mainly use visual information from a single image to learn class representations to distinguish known from unknown categories. However, these methods often overlook the benefits of integrating rich contextual information. To address this issue, this paper proposes a prototypical augmentation and alignment method, termed ProtoConNet, which incorporates background information from different samples to enhance the diversity of the feature space, breaking the spurious associations between context and image subjects in few-shot scenarios. Specifically, it consists of three main modules: the clustering-based data selection (CDS) module mines diverse data patterns while preserving core features; the contextual-enhanced semantic refinement (CSR) module builds a context dictionary to integrate into image representations, which boosts the model’s robustness in various scenarios; and the prototypical alignment (PA) module reduces the gap between image representations and class prototypes, amplifying feature distances for known and unknown classes. Experimental results from two datasets verified that ProtoConNet enhances the effectiveness of representation learning in few-shot scenarios and identifies open-set samples, making it superior to existing methods.

Yu Liu, Leyuan Qu, Hanlei Shi, Di Gao, Yuhua Zheng

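TL;DR: The paper proposes GRACE, which combines dynamic motion modeling, semantic text refinement, and cross-modal alignment with a coarse-to-fine text enhancement module and a motion-difference weighting mechanism. It significantly improves dynamic facial expression recognition and reaches SOTA on multiple benchmark datasets.

Details

Motivation: Existing methods underuse the subtle emotional cues in text and lack effective mechanisms for filtering out facial dynamics unrelated to emotion, which limits recognition performance.

Result: Strong performance on three benchmark datasets, especially with ambiguous or imbalanced emotion classes, with SOTA results in both UAR and WAR.

Insight: Combining text refinement with dynamic motion modeling captures subtle emotional cues and improves the robustness of cross-modal emotion recognition.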

Abstract: Dynamic Facial Expression Recognition (DFER) aims to identify human emotions from temporally evolving facial movements and plays a critical role in affective computing. While recent vision-language approaches have introduced semantic textual descriptions to guide expression recognition, existing methods still face two key limitations: they often underutilize the subtle emotional cues embedded in generated text, and they have yet to incorporate sufficiently effective mechanisms for filtering out facial dynamics that are irrelevant to emotional expression. To address these gaps, we propose GRACE (Granular Representation Alignment for Cross-modal Emotion recognition), which integrates dynamic motion modeling, semantic text refinement, and token-level cross-modal alignment to facilitate the precise localization of emotionally salient spatiotemporal features. Our method constructs emotion-aware textual descriptions via a Coarse-to-fine Affective Text Enhancement (CATE) module and highlights expression-relevant facial motion through a motion-difference weighting mechanism. These refined semantic and visual signals are aligned at the token level using entropy-regularized optimal transport. Experiments on three benchmark datasets demonstrate that our method significantly improves recognition performance, particularly in challenging settings with ambiguous or imbalanced emotion classes, establishing new state-of-the-art (SOTA) results in terms of both UAR and WAR.
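Entropy-regularized optimal transport is commonly solved with Sinkhorn iterations. The sketch below illustrates that standard algorithm on a toy cost matrix between two text tokens and three visual tokens. The cost values, the uniform token masses, and the regularization strength `eps` are assumptions for illustration; the abstract does not specify how GRACE sets them.

```python
import math
from typing import List, Sequence


def sinkhorn_plan(
    cost: Sequence[Sequence[float]],
    a: Sequence[float],
    b: Sequence[float],
    eps: float = 0.5,
    iters: int = 200,
) -> List[List[float]]:
    """Entropy-regularized OT plan via Sinkhorn scaling.

    cost[i][j] is the cost of matching source token i to target token j;
    a and b are the source and target mass vectors (each sums to 1).
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical costs, e.g. 1 - cosine similarity between token embeddings.
cost = [[0.1, 0.9, 0.5], [0.8, 0.2, 0.5]]
text_mass = [0.5, 0.5]
visual_mass = [1 / 3, 1 / 3, 1 / 3]
plan = sinkhorn_plan(cost, text_mass, visual_mass)
```

Each entry `plan[i][j]` is the soft amount of text token i aligned to visual token j. A smaller `eps` makes the plan sharper and a larger one makes it more diffuse.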

[15] Spatial Frequency Modulation for Semantic Segmentation cs.CV | cs.AI

Linwei Chen, Ying Fu, Lin Gu, Dezhi Zheng, Jifeng Dai

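TL;DR: The paper proposes Spatial Frequency Modulation (SFM), a new method that addresses the aliasing of high-frequency information caused by downsampling in semantic segmentation. Using adaptive resampling (ARS) and Multi-Scale Adaptive Upsampling (MSAU), SFM preserves high-frequency details and proves broadly applicable across tasks.

Details

Motivation: In semantic segmentation, high-frequency information such as texture detail is critical to accuracy, but downsampling layers such as strided convolution can alias or distort it. The paper sets out to solve this problem.

Result: SFM alleviates aliasing, retains high-frequency details, and performs well on semantic segmentation, image classification, adversarial robustness, and other tasks.

Insight: Through its frequency modulation mechanism, the paper highlights the importance of high-frequency information during downsampling and offers an extensible solution applicable to a range of vision tasks.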

Abstract: High spatial frequency information, including fine details like textures, significantly contributes to the accuracy of semantic segmentation. However, according to the Nyquist-Shannon Sampling Theorem, high-frequency components are vulnerable to aliasing or distortion when propagating through downsampling layers such as strided-convolution. Here, we propose a novel Spatial Frequency Modulation (SFM) that modulates high-frequency features to a lower frequency before downsampling and then demodulates them back during upsampling. Specifically, we implement modulation through adaptive resampling (ARS) and design a lightweight add-on that can densely sample the high-frequency areas to scale up the signal, thereby lowering its frequency in accordance with the Frequency Scaling Property. We also propose Multi-Scale Adaptive Upsampling (MSAU) to demodulate the modulated feature and recover high-frequency information through non-uniform upsampling. This module further improves segmentation by explicitly exploiting information interaction between densely and sparsely resampled areas at multiple scales. Both modules can seamlessly integrate with various architectures, extending from convolutional neural networks to transformers. Feature visualization and analysis confirm that our method effectively alleviates aliasing while successfully retaining details after demodulation. Finally, we validate the broad applicability and effectiveness of SFM by extending it to image classification, adversarial robustness, instance segmentation, and panoptic segmentation tasks. The code is available at https://github.com/Linwei-Chen/SFM.
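The Frequency Scaling Property invoked in the abstract is the standard Fourier-transform scaling identity. Written out in one dimension, with the scale factor a, the component frequency f_0, the sampling rate f_s, and the stride s introduced here as notation (they are not symbols from the paper):

```latex
% Frequency scaling property of the Fourier transform (a \neq 0)
x(t) \;\longleftrightarrow\; X(f)
\quad\Longrightarrow\quad
x(at) \;\longleftrightarrow\; \frac{1}{|a|}\, X\!\left(\frac{f}{a}\right)

% Stretching the signal (0 < a < 1) compresses its spectrum:
% a component at frequency f_0 moves to a f_0 < f_0.
% Downsampling by stride s lowers the sampling rate from f_s to f_s / s,
% so the stretched component stays alias-free when
a\, f_0 \;<\; \frac{f_s}{2s}
```

Read this way, densely sampling a high-frequency region acts like choosing a smaller a for that region, which pulls its content below the post-downsampling Nyquist limit.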

[16] SEPose: A Synthetic Event-based Human Pose Estimation Dataset for Pedestrian Monitoring cs.CV

Kaustav Chanda, Aayush Atul Verma, Arpitsinh Vaghela, Yezhou Yang, Bharatesh Chakravarthi

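TL;DR: The paper presents SEPose, a synthetic event-based human pose estimation dataset for fixed-viewpoint pedestrian monitoring, filling a gap in existing data.

Details

Motivation: Event-based sensors perform well in pedestrian monitoring, but real-world data are scarce. The authors aim to address this with synthetic data.

Result: Experiments show that models trained on SEPose generalize to real event data, demonstrating the dataset's practical value.

Insight: Synthetic data can effectively compensate for scarce real data, especially for pedestrian pose estimation in complex scenes.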

Abstract: Event-based sensors have emerged as a promising solution for addressing challenging conditions in pedestrian and traffic monitoring systems. Their low-latency and high dynamic range allow for improved response time in safety-critical situations caused by distracted walking or other unusual movements. However, the availability of data covering such scenarios remains limited. To address this gap, we present SEPose – a comprehensive synthetic event-based human pose estimation dataset for fixed pedestrian perception generated using dynamic vision sensors in the CARLA simulator. With nearly 350K annotated pedestrians with body pose keypoints from the perspective of fixed traffic cameras, SEPose is a comprehensive synthetic multi-person pose estimation dataset that spans busy and light crowds and traffic across diverse lighting and weather conditions in 4-way intersections in urban, suburban, and rural environments. We train existing state-of-the-art models such as RVT and YOLOv8 on our dataset and evaluate them on real event-based data to demonstrate the sim-to-real generalization capabilities of the proposed dataset.

[17] Dark-EvGS: Event Camera as an Eye for Radiance Field in the Dark cs.CV

Jingqian Wu, Peiqi Duan, Zongqiang Wang, Changwei Wang, Boxin Shi

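TL;DR: Dark-EvGS is an event-camera-based 3D Gaussian Splatting framework for reconstructing radiance fields in low light and generating bright frames from multiple viewpoints. Triplet-level supervision and a color tone matching block address noisy events and low-quality frames, and the method performs well in experiments.

Details

Motivation: In low light, conventional cameras struggle to capture clear multi-view images because of dynamic range limits and motion blur. The high dynamic range and high speed of event cameras offer a way to solve this.

Result: Experiments show that Dark-EvGS outperforms existing methods in low light, achieving high-quality radiance field reconstruction and frame rendering.

Insight: Combining event cameras with 3D Gaussian Splatting opens a new direction for multi-view imaging in low light; noise suppression and real-time performance could be improved further in future work.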

Abstract: In low-light environments, conventional cameras often struggle to capture clear multi-view images of objects due to dynamic range limitations and motion blur caused by long exposure. Event cameras, with their high dynamic range and high-speed properties, have the potential to mitigate these issues. Additionally, 3D Gaussian Splatting (GS) enables radiance field reconstruction, facilitating bright frame synthesis from multiple viewpoints in low-light conditions. However, naively using an event-assisted 3D GS approach still faces challenges because, in low light, events are noisy, frames lack quality, and the color tone may be inconsistent. To address these issues, we propose Dark-EvGS, the first event-assisted 3D GS framework that enables the reconstruction of bright frames from arbitrary viewpoints along the camera trajectory. Triplet-level supervision is proposed to gain holistic knowledge, granular details, and sharp scene rendering. A color tone matching block is proposed to guarantee the color consistency of the rendered frames. Furthermore, we introduce the first real-captured dataset for the event-guided bright frame synthesis task via 3D GS-based radiance field reconstruction. Experiments demonstrate that our method achieves better results than existing methods, conquering radiance field reconstruction under challenging low-light conditions. The code and sample data are included in the supplementary material.

[18] Hyperphantasia: A Benchmark for Evaluating the Mental Visualization Capabilities of Multimodal LLMs cs.CV

Mohammad Shahab Sepehri, Berk Tinaz, Zalan Fabian, Mahdi Soltanolkotabi

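TL;DR: The paper introduces Hyperphantasia, a benchmark for evaluating the "mental visualization" ability of multimodal large language models, and finds that current models lag significantly behind humans.

Details

Motivation: Mental visualization is a core cognitive ability, but current benchmarks for multimodal LLMs focus mainly on passive visual perception and do not test the ability to actively construct visual representations.

Result: The evaluation shows that current MLLMs perform significantly below humans on mental visualization tasks; some models show only partial competence in recognizing visual patterns.

Insight: Mental visualization remains an unsolved challenge for multimodal models; further work on reinforcement learning or other methods may be needed to improve it.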

Abstract: Mental visualization, the ability to construct and manipulate visual representations internally, is a core component of human cognition and plays a vital role in tasks involving reasoning, prediction, and abstraction. Despite the rapid progress of Multimodal Large Language Models (MLLMs), current benchmarks primarily assess passive visual perception, offering limited insight into the more active capability of internally constructing visual patterns to support problem solving. Yet mental visualization is a critical cognitive skill in humans, supporting abilities such as spatial navigation, predicting physical trajectories, and solving complex visual problems through imaginative simulation. To bridge this gap, we introduce Hyperphantasia, a synthetic benchmark designed to evaluate the mental visualization abilities of MLLMs through four carefully constructed puzzles. Each task is procedurally generated and presented at three difficulty levels, enabling controlled analysis of model performance across increasing complexity. Our comprehensive evaluation of state-of-the-art models reveals a substantial gap between the performance of humans and MLLMs. Additionally, we explore the potential of reinforcement learning to improve visual simulation capabilities. Our findings suggest that while some models exhibit partial competence in recognizing visual patterns, robust mental visualization remains an open challenge for current MLLMs.

[19] RaDL: Relation-aware Disentangled Learning for Multi-Instance Text-to-Image Generation cs.CV | cs.AI

Geon Park, Seon Bin Kim, Gunho Jung, Seong-Whan Lee

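TL;DR: The paper proposes the RaDL framework, which uses relation-aware disentangled learning to address relationship discrepancy and attribute leakage in multi-instance text-to-image generation, significantly improving positional accuracy and inter-instance relationships in the generated images.

Details

Motivation: Existing methods struggle with relationship discrepancy and attribute leakage between instances in multi-instance image generation, which leads to unsatisfactory results. RaDL aims to solve these problems.

Result: On benchmarks including COCO-Position, COCO-MIG, and DrawBench, RaDL significantly outperforms existing methods, particularly in positional accuracy and the handling of instance relationships.

Insight: By combining relation awareness with disentangled learning, RaDL offers a more complete solution for multi-instance text-to-image generation and underscores the importance of inter-instance relationships.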

Abstract: With recent advancements in text-to-image (T2I) models, effectively generating multiple instances within a single image prompt has become a crucial challenge. Existing methods, while successful in generating positions of individual instances, often struggle to account for relationship discrepancy and multiple attributes leakage. To address these limitations, this paper proposes the relation-aware disentangled learning (RaDL) framework. RaDL enhances instance-specific attributes through learnable parameters and generates relation-aware image features via Relation Attention, utilizing action verbs extracted from the global prompt. Through extensive evaluations on benchmarks such as COCO-Position, COCO-MIG, and DrawBench, we demonstrate that RaDL outperforms existing methods, showing significant improvements in positional accuracy, multiple attributes consideration, and the relationships between instances. Our results present RaDL as the solution for generating images that consider both the relationships and multiple attributes of each instance within the multi-instance image.

[20] Prototypical Progressive Alignment and Reweighting for Generalizable Semantic Segmentation cs.CV

Yuhang Zhang, Zhengyu Zhang, Muxin Liao, Shishun Tian, Wenbin Zou

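TL;DR: This paper proposes the PPAR framework, which improves the generalization of semantic segmentation to unseen target domains through progressive prototypical alignment and a reweighting mechanism, leveraging the strong generalization ability of the CLIP model to achieve SOTA results.

Details

Motivation: To address the insufficient generalization of existing generalizable semantic segmentation methods, which stems from coarse prototypical alignment, overfitting to source data, and ignoring differences in feature adaptation difficulty.

Result: Achieves SOTA performance across multiple benchmarks, validating the method's effectiveness.

Insight: Progressive alignment and reweighting can significantly improve generalization to unseen domains, and introducing CLIP strengthens prototype stability.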

Abstract: Generalizable semantic segmentation aims to perform well on unseen target domains, a critical challenge due to real-world applications requiring high generalizability. Class-wise prototypes, representing class centroids, serve as domain-invariant cues that benefit generalization due to their stability and semantic consistency. However, this approach faces three challenges. First, existing methods often adopt coarse prototypical alignment strategies, which may hinder performance. Second, naive prototypes computed by averaging source batch features are prone to overfitting and may be negatively affected by unrelated source data. Third, most methods treat all source samples equally, ignoring the fact that different features have varying adaptation difficulties. To address these limitations, we propose a novel framework for generalizable semantic segmentation: Prototypical Progressive Alignment and Reweighting (PPAR), leveraging the strong generalization ability of the CLIP model. Specifically, we define two prototypes: the Original Text Prototype (OTP) and Visual Text Prototype (VTP), generated via CLIP to serve as a solid base for alignment. We then introduce a progressive alignment strategy that aligns features in an easy-to-difficult manner, reducing domain gaps gradually. Furthermore, we propose a prototypical reweighting mechanism that estimates the reliability of source data and adjusts its contribution, mitigating the effect of irrelevant or harmful features (i.e., reducing negative transfer). We also provide a theoretical analysis showing the alignment between our method and domain generalization theory. Extensive experiments across multiple benchmarks demonstrate that PPAR achieves state-of-the-art performance, validating its effectiveness.

[21] Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos cs.CV | eess.AS | eess.IV

Yuchi Ishikawa, Shota Nakada, Hokuto Munakata, Kazuhiro Saito, Tatsuya Komatsu

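TL;DR: LG-CAV-MAE is a contrastive audio-visual masked autoencoder that incorporates a text encoder and learns across modalities from automatically generated audio-visual-text triplets, significantly improving task performance.

Details

Motivation: To improve audio-visual representation learning by introducing the text modality and automatically generated audio-visual-text triplets, reducing reliance on manual annotation.

Result: Improves recall@10 by up to 5.6% on audio-visual retrieval tasks and improves the classification task by 3.2%.

Insight: Combining automatically generated multimodal triplets with text-guided contrastive learning significantly improves model performance.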

Abstract: In this paper, we propose Language-Guided Contrastive Audio-Visual Masked Autoencoders (LG-CAV-MAE) to improve audio-visual representation learning. LG-CAV-MAE integrates a pretrained text encoder into contrastive audio-visual masked autoencoders, enabling the model to learn across audio, visual and text modalities. To train LG-CAV-MAE, we introduce an automatic method to generate audio-visual-text triplets from unlabeled videos. We first generate frame-level captions using an image captioning model and then apply CLAP-based filtering to ensure strong alignment between audio and captions. This approach yields high-quality audio-visual-text triplets without requiring manual annotations. We evaluate LG-CAV-MAE on audio-visual retrieval tasks, as well as an audio-visual classification task. Our method significantly outperforms existing approaches, achieving up to a 5.6% improvement in recall@10 for retrieval tasks and a 3.2% improvement for the classification task.
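
The caption filtering step described above can be pictured as a similarity threshold on paired embeddings. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the toy vectors stand in for CLAP audio and text embeddings, and `filter_triplets` and the `threshold` value are hypothetical names chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filter_triplets(triplets, threshold=0.5):
    """Keep triplets whose audio and caption embeddings agree.

    The threshold is an assumed hyperparameter; the abstract does not
    state how the filter is tuned.
    """
    return [
        t for t in triplets
        if cosine(t["audio_emb"], t["text_emb"]) >= threshold
    ]


# Toy embeddings standing in for the outputs of an audio-text model.
triplets = [
    {"caption": "a dog barking", "audio_emb": [1.0, 0.0], "text_emb": [0.9, 0.1]},
    {"caption": "a quiet street", "audio_emb": [1.0, 0.0], "text_emb": [0.0, 1.0]},
]
kept = filter_triplets(triplets)
```

Only the first triplet survives here, because its caption embedding points in nearly the same direction as its audio embedding.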

Sahid Hossain Mustakim, S M Jishanul Islam, Ummay Maria Muna, Montasir Chowdhury, Mohammed Jawwadul Islam

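TL;DR: The paper proposes a tri-modal adversarial attack framework for multimodal large language models (MLLMs) that evaluates model safety on short-form video content, revealing vulnerabilities in visual, auditory, and semantic reasoning.

Details

Motivation: Safety evaluation for content moderation currently relies mostly on unimodal attacks and overlooks the risks of combined multimodal attacks, so a comprehensive evaluation of MLLM robustness in tri-modal scenarios is needed.

Result: Experiments show that MLLMs suffer high attack success rates (ASR) under combined attacks and exhibit biases toward misclassifying benign or policy-violating content.

Insight: The work exposes weaknesses of MLLMs in multimodal safety evaluation and offers key insights for developing safer models.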

Abstract: Multimodal Large Language Models (MLLMs) are increasingly used for content moderation, yet their robustness in short-form video contexts remains underexplored. Current safety evaluations often rely on unimodal attacks, failing to address combined attack vulnerabilities. In this paper, we introduce a comprehensive framework for evaluating the tri-modal safety of MLLMs. First, we present the Short-Video Multimodal Adversarial (SVMA) dataset, comprising diverse short-form videos with human-guided synthetic adversarial attacks. Second, we propose ChimeraBreak, a novel tri-modal attack strategy that simultaneously challenges visual, auditory, and semantic reasoning pathways. Extensive experiments on state-of-the-art MLLMs reveal significant vulnerabilities with high Attack Success Rates (ASR). Our findings uncover distinct failure modes, showing model biases toward misclassifying benign or policy-violating content. We assess results using LLM-as-a-judge, demonstrating attack reasoning efficacy. Our dataset and findings provide crucial insights for developing more robust and safe MLLMs.

[23] GS-Bias: Global-Spatial Bias Learner for Single-Image Test-Time Adaptation of Vision-Language Models cs.CV

Zhaohong Huang, Yuxin Zhang, Jingjing Xie, Fei Chao, Rongrong Ji

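TL;DR: GS-Bias is an efficient test-time adaptation method that improves the zero-shot generalization of vision-language models by learning global and spatial biases, significantly reducing computational overhead.

Details

Motivation: Existing test-time adaptation methods struggle to balance performance and efficiency: they either tune text prompts, which incurs excessive overhead, or rely on handcrafted visual feature enhancement, whose benefits are unstable.

Result: Achieves SOTA performance on 15 benchmark datasets, for example improving on TPT by 2.23% in cross-dataset generalization and 2.72% in domain generalization, while using only 6.5% of TPT's memory on ImageNet.

Insight: Lightweight bias learning applied directly to the output logits avoids the computational bottleneck of prior methods while retaining the ability to capture semantic features.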

Abstract: Recent advances in test-time adaptation (TTA) for Vision-Language Models (VLMs) have garnered increasing attention, particularly through the use of multiple augmented views of a single image to boost zero-shot generalization. Unfortunately, existing methods fail to strike a satisfactory balance between performance and efficiency, either due to excessive overhead of tuning text prompts or unstable benefits from handcrafted, training-free visual feature enhancement. In this paper, we present Global-Spatial Bias Learner (GS-Bias), an efficient and effective TTA paradigm that incorporates two learnable biases during TTA: a global bias and a spatial bias. In particular, the global bias captures the global semantic features of a test image by learning consistency across augmented views, while the spatial bias learns the semantic coherence between regions in the image’s spatial visual representation. It is worth highlighting that these two sets of biases are added directly to the logits output by the pretrained VLM, which circumvents the full backpropagation through the VLM that hinders the efficiency of existing TTA methods. This endows GS-Bias with extremely high efficiency while achieving state-of-the-art performance on 15 benchmark datasets. For example, it achieves a 2.23% improvement over TPT in cross-dataset generalization and a 2.72% improvement in domain generalization, while requiring only 6.5% of TPT’s memory usage on ImageNet.
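
To illustrate why adding biases at the logit level is cheap, the sketch below shifts frozen logits with two bias vectors and renormalizes. It is a minimal sketch rather than the authors' implementation: the function names, the per-class shape of both biases, and the toy numbers are assumptions, and the learning of the biases is omitted entirely.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def biased_probs(frozen_logits, global_bias, spatial_bias):
    """Add two bias vectors to frozen logits and renormalize.

    The logits come from the pretrained model and are never updated;
    in a real setup only the two biases would receive gradients.
    """
    adjusted = [
        logit + g + s
        for logit, g, s in zip(frozen_logits, global_bias, spatial_bias)
    ]
    return softmax(adjusted)


frozen_logits = [2.0, 1.5, 0.5]   # toy zero-shot logits for 3 classes
global_bias = [0.0, 0.8, 0.0]     # assumed learned values
spatial_bias = [0.0, 0.3, 0.0]    # assumed learned values
probs = biased_probs(frozen_logits, global_bias, spatial_bias)
```

Because the biases act after the model's forward pass, updating them needs no gradient flow through the encoder, which is the property the abstract credits for the memory savings.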

[24] EC-Diff: Fast and High-Quality Edge-Cloud Collaborative Inference for Diffusion Models cs.CV

Jiajian Xie, Shengyu Zhang, Zhou Zhao, Fan Wu, Fei Wu

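TL;DR: EC-Diff is an edge-cloud collaborative inference framework that improves the output quality and inference speed of diffusion models through gradient-based noise estimation and a K-step noise approximation strategy.

Details

Motivation: Diffusion models excel at image and video synthesis, but model size and latency degrade the user experience. Current edge-cloud collaborative frameworks suffer from either long inference time or semantic ambiguity.

Result: Significantly improves generation quality over edge inference while achieving an average 2x speedup over cloud inference.

Insight: Dynamically adjusting the division of labor between the cloud and edge models can improve inference speed and generation quality at the same time.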

Abstract: Diffusion Models have shown remarkable proficiency in image and video synthesis. As growing model size and latency limit user experience, a hybrid edge-cloud collaborative framework was recently proposed to realize fast inference and high-quality generation, where the cloud model initiates high-quality semantic planning and the edge model expedites later-stage refinement. However, excessive cloud denoising prolongs inference time, while insufficient steps cause semantic ambiguity, leading to inconsistency in edge model output. To address these challenges, we propose EC-Diff, which accelerates cloud inference through gradient-based noise estimation while identifying the optimal point for cloud-edge handoff to maintain generation quality. Specifically, we design a K-step noise approximation strategy to reduce cloud inference frequency by using noise gradients between steps and applying cloud inference periodically to adjust errors. Then we design a two-stage greedy search algorithm to efficiently find the optimal parameters for noise approximation and edge model switching. Extensive experiments demonstrate that our method significantly enhances generation quality compared to edge inference, while achieving an average speedup of up to $2\times$ in inference compared to cloud inference. Video samples and source code are available at https://ec-diff.github.io/.
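
One way to picture the K-step idea is to call the expensive model only every K steps and extrapolate the noise estimate in between from the most recent difference. The sketch below is a toy illustration under that assumption: scalar "noise" values, a finite-difference gradient, and the names `k_step_noise` and `toy_cloud` are all hypothetical and are not taken from the paper.

```python
def k_step_noise(cloud_noise_fn, num_steps, k):
    """Return (noise estimates, number of cloud calls).

    The cloud model is queried on the first two steps (to obtain a
    difference) and then on every k-th step to correct drift. All other
    steps extrapolate linearly from the last two estimates.
    """
    estimates = []
    cloud_calls = 0
    for t in range(num_steps):
        if t < 2 or t % k == 0:
            eps = cloud_noise_fn(t)          # expensive, exact
            cloud_calls += 1
        else:
            gradient = estimates[-1] - estimates[-2]
            eps = estimates[-1] + gradient   # cheap approximation
        estimates.append(eps)
    return estimates, cloud_calls


def toy_cloud(t):
    """Stand-in for the cloud model: noise that varies linearly with t."""
    return 2.0 * t + 1.0


estimates, calls = k_step_noise(toy_cloud, num_steps=10, k=4)
```

With a linear toy signal the extrapolation is exact, so ten steps are served by four cloud calls; a real noise trajectory is not linear, which is why the periodic cloud calls are needed to correct accumulated error.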

[25] Unsupervised Part Discovery via Descriptor-Based Masked Image Restoration with Optimized Constraints cs.CV

Jiahao Xia, Yike Wu, Wenjian Huang, Jianguo Zhang, Jian Zhang

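TL;DR: The paper proposes MPAE, an unsupervised part discovery method that uses descriptor-based masked image restoration with optimized constraints to robustly discover parts that closely match object shapes in complex scenes.

Details

Motivation: Part-level features are crucial for image understanding, but the lack of fine-grained annotations has limited research on them. Existing unsupervised methods are not robust across categories and scenes, which restricts their range of application.

Result: Experiments show that MPAE robustly discovers meaningful parts across many categories and scenes, and supports occlusion handling and the exploration of cross-category part similarity.

Insight: Combining masked restoration with descriptor learning enables more precise part discovery without supervision, offering a new approach to image understanding in complex scenes.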

Abstract: Part-level features are crucial for image understanding, but few studies focus on them because of the lack of fine-grained labels. Although unsupervised part discovery can eliminate the reliance on labels, most existing methods cannot maintain robustness across various categories and scenarios, which restricts their application range. To overcome this limitation, we present a more effective paradigm for unsupervised part discovery, named Masked Part Autoencoder (MPAE). It first learns part descriptors as well as a feature map from the inputs and produces patch features from a masked version of the original images. Then, the masked regions are filled with the learned part descriptors based on the similarity between the local features and descriptors. By restoring these masked patches using the part descriptors, they become better aligned with their part shapes, guided by appearance features from unmasked patches. Finally, MPAE robustly discovers meaningful parts that closely match the actual object shapes, even in complex scenarios. Moreover, several looser yet more effective constraints are proposed to enable MPAE to identify the presence of parts across various scenarios and categories in an unsupervised manner. This provides the foundation for addressing challenges posed by occlusion and for exploring part similarity across multiple categories. Extensive experiments demonstrate that our method robustly discovers meaningful parts across various categories and scenarios. The code is available at https://github.com/Jiahao-UTS/MPAE.

[26] Frequency-Dynamic Attention Modulation for Dense Prediction cs.CV | cs.AI

Linwei Chen, Lin Gu, Ying Fu

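TL;DR: This paper proposes FDAM, a method that modulates the frequency response of ViTs to overcome the loss of detail caused by their low-pass filtering, improving performance on a range of vision tasks.

Details

Motivation: The attention mechanism in Vision Transformers (ViTs) makes each layer act as a low-pass filter, and stacking layers attenuates frequency signals, losing critical details and textures. A solution that dynamically modulates the frequency response is therefore needed.

Result: Improves performance across multiple models and tasks, avoids representation collapse, and reaches SOTA in remote sensing detection.

Insight: A circuit-theory-inspired dynamic frequency modulation method can effectively address the low-frequency dominance of ViTs and improve their performance on dense prediction tasks.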

Abstract: Vision Transformers (ViTs) have significantly advanced computer vision, demonstrating strong performance across various tasks. However, the attention mechanism in ViTs makes each layer function as a low-pass filter, and the stacked-layer architecture in existing transformers suffers from frequency vanishing. This leads to the loss of critical details and textures. We propose a novel, circuit-theory-inspired strategy called Frequency-Dynamic Attention Modulation (FDAM), which can be easily plugged into ViTs. FDAM directly modulates the overall frequency response of ViTs and consists of two techniques: Attention Inversion (AttInv) and Frequency Dynamic Scaling (FreqScale). Since circuit theory uses low-pass filters as fundamental elements, we introduce AttInv, a method that generates complementary high-pass filtering by inverting the low-pass filter in the attention matrix and dynamically combines the two. We further design FreqScale to weight different frequency components for fine-grained adjustments to the target response function. Through feature similarity analysis and effective rank evaluation, we demonstrate that our approach avoids representation collapse, leading to consistent performance improvements across various models, including SegFormer, DeiT, and MaskDINO. These improvements are evident in tasks such as semantic segmentation, object detection, and instance segmentation. Additionally, we apply our method to remote sensing detection, achieving state-of-the-art results in single-scale settings. The code is available at https://github.com/Linwei-Chen/FDAM.
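
The inversion idea can be illustrated with the textbook complement of a low-pass filter: if an attention matrix A averages tokens, then I - A keeps what the average removed. The sketch below assumes that form and uses fixed scalar mixing weights; the paper's actual inversion and its dynamic, input-dependent weighting are not specified in the abstract, so `att_inv`, `low_weight`, and `high_weight` are hypothetical.

```python
def matvec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vec)) for row in matrix]


def att_inv(attention, x, low_weight=1.0, high_weight=1.0):
    """Split x into low-pass and complementary high-pass parts, then mix.

    low  = A x        (what attention keeps: smooth content)
    high = (I - A) x  (what attention removes: detail)
    """
    low = matvec(attention, x)
    high = [xi - li for xi, li in zip(x, low)]
    mixed = [low_weight * l + high_weight * h for l, h in zip(low, high)]
    return low, high, mixed


n = 4
# Uniform attention is the most extreme low-pass filter: a plain mean.
uniform_attention = [[1.0 / n] * n for _ in range(n)]
signal = [1.0, 3.0, 1.0, 3.0]   # constant level 2 plus an alternating detail
low, high, mixed = att_inv(uniform_attention, signal, high_weight=2.0)
```

Here the low-pass branch returns the mean of the signal at every position, the high-pass branch returns the alternating detail, and raising `high_weight` amplifies that detail in the mixed output.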

[27] Dual form Complementary Masking for Domain-Adaptive Image Segmentation cs.CV | cs.AI

Jiawen Wang, Yinda Chen, Xiaoyu Liu, Che Liu, Dong Liu

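TL;DR: The paper proposes MaskTwins, a framework that uses dual-form complementary masking to reconstruct sparse signals, improving the extraction of domain-invariant features for cross-domain image segmentation and enabling end-to-end domain generalization without separate pre-training.

Details

Motivation: Prior work treats masked image modeling (MIM) only as a deformation of the input image and neglects its theory and potential. This paper re-analyzes masked reconstruction from a sparse signal reconstruction perspective and explores its role in enhancing feature extraction and representation learning.

Result: Outperforms baseline methods on natural and biological image segmentation, verifying the superiority of MaskTwins in extracting domain-invariant features.

Insight: Masked reconstruction is more than data augmentation: a theory-driven complementary masking strategy can significantly improve robustness to domain shift.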

Abstract: Recent works have correlated Masked Image Modeling (MIM) with consistency regularization in Unsupervised Domain Adaptation (UDA). However, they merely treat masking as a special form of deformation on the input images and neglect the theoretical analysis, which leads to a superficial understanding of masked reconstruction and insufficient exploitation of its potential in enhancing feature extraction and representation learning. In this paper, we reframe masked reconstruction as a sparse signal reconstruction problem and theoretically prove that the dual form of complementary masks possesses superior capabilities in extracting domain-agnostic image features. Based on this compelling insight, we propose MaskTwins, a simple yet effective UDA framework that integrates masked reconstruction directly into the main training pipeline. MaskTwins uncovers intrinsic structural patterns that persist across disparate domains by enforcing consistency between predictions of images masked in complementary ways, enabling domain generalization in an end-to-end manner. Extensive experiments verify the superiority of MaskTwins over baseline methods in natural and biological image segmentation. These results demonstrate the significant advantages of MaskTwins in extracting domain-invariant features without the need for separate pre-training, offering a new paradigm for domain-adaptive segmentation.
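
Complementary masking means each patch is visible in exactly one of two masked views, and the two views' predictions are then pushed to agree. The sketch below shows only the masking and a consistency term, under assumptions: patches are scalars, the segmentation network is omitted, and mean squared error is used as a stand-in consistency loss that the abstract does not specify.

```python
import random


def complementary_masks(num_patches, ratio=0.5, seed=0):
    """Return a random binary mask and its exact complement."""
    rng = random.Random(seed)
    mask = [1 if rng.random() < ratio else 0 for _ in range(num_patches)]
    complement = [1 - m for m in mask]
    return mask, complement


def apply_mask(patches, mask):
    """Zero out the patches where the mask is 0."""
    return [p * m for p, m in zip(patches, mask)]


def consistency_loss(pred_a, pred_b):
    """Mean squared difference between two predictions (assumed loss)."""
    return sum((a - b) ** 2 for a, b in zip(pred_a, pred_b)) / len(pred_a)


patches = [0.2, 0.5, 0.1, 0.9, 0.4, 0.7]
mask, complement = complementary_masks(len(patches))
view_a = apply_mask(patches, mask)
view_b = apply_mask(patches, complement)
```

In training, `view_a` and `view_b` would each pass through the same network and `consistency_loss` would compare the two outputs; since the views share no visible patch, agreement has to come from structure rather than from copied pixels.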

[28] Deep Neural Encoder-Decoder Model to Relate fMRI Brain Activity with Naturalistic Stimuli cs.CV | cs.HC

Florian David, Michael Chan, Elenor Morgenroth, Patrik Vuilleumier, Dimitri Van De Ville

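TL;DR: The paper proposes an end-to-end deep neural encoder-decoder model that uses fMRI data to encode and decode brain activity under naturalistic stimuli. With temporal convolutional layers, the model bridges the temporal resolution gap between natural movie stimuli and fMRI acquisition, predicts voxel activity in the visual cortex, and reconstructs the corresponding visual input from neural activity. Saliency map analysis identifies the middle occipital area, the fusiform area, and the calcarine as key regions for visual decoding.

Details

Motivation: The study uses deep learning to explore brain activity patterns under naturalistic stimuli such as films, especially the response mechanisms of the visual cortex. By reconstructing visual input with the model, the authors hope to better understand the neural basis of visual processing.

Result: The model effectively predicts voxel activity in the visual cortex and reconstructs visual features such as edges, faces, and contrasts. Saliency maps indicate that the middle occipital area (shape perception), the fusiform area (complex recognition), and the calcarine (basic visual features) are the key regions for decoding.

Insight: The model's decoding ability is highly consistent with known functions of the visual cortex, such as edge detection and face recognition, which suggests that deep learning models can serve as a proxy for studying the neural mechanisms of vision.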

Abstract: We propose an end-to-end deep neural encoder-decoder model to encode and decode brain activity in response to naturalistic stimuli using functional magnetic resonance imaging (fMRI) data. Leveraging temporally correlated input from consecutive film frames, we employ temporal convolutional layers in our architecture, which effectively bridge the temporal resolution gap between natural movie stimuli and fMRI acquisitions. Our model predicts activity of voxels in and around the visual cortex and performs reconstruction of corresponding visual inputs from neural activity. Finally, we investigate brain regions contributing to visual decoding through saliency maps. We find that the most contributing regions are the middle occipital area, the fusiform area, and the calcarine, respectively employed in shape perception, complex recognition (in particular face perception), and basic visual features such as edges and contrasts. That these functions are strongly solicited is in line with the decoder’s capability to reconstruct edges, faces, and contrasts. All in all, this suggests the possibility of probing our understanding of visual processing in films by using the behaviour of deep learning models, such as the one proposed in this paper, as a proxy.

[29] SS-DC: Spatial-Spectral Decoupling and Coupling Across Visible-Infrared Gap for Domain Adaptive Object Detection cs.CV | cs.AI

Xiwei Zhang, Chunjin Yang, Yiming Xiao, Runtong Zhang, Fanman Meng

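TL;DR: The paper proposes SS-DC, a framework based on a decoupling-coupling strategy for unsupervised domain adaptive object detection (UDAOD) from the visible to the infrared (RGB-IR) domain, which improves performance by effectively decoupling and coupling spectral and spatial features.

Details

Motivation: Existing UDAOD methods treat the visible domain as a single unified domain and ignore the differences among its subdomains, such as daytime, nighttime, and foggy scenes. The paper argues that decoupling domain-invariant (DI) and domain-specific (DS) features across these subdomains helps cross-domain adaptation.

Result: Significantly outperforms the baseline and other UDAOD methods on multiple RGB-IR datasets, including under a new experimental protocol based on the FLIR-ADAS dataset.

Insight: Decoupling domain-invariant and domain-specific features and combining spatial and spectral information can effectively improve cross-domain object detection, especially in complex multi-subdomain scenarios.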

Abstract: Unsupervised domain adaptive object detection (UDAOD) from the visible domain to the infrared (RGB-IR) domain is challenging. Existing methods regard the RGB domain as a unified domain and neglect the multiple subdomains within it, such as daytime, nighttime, and foggy scenes. We argue that decoupling the domain-invariant (DI) and domain-specific (DS) features across these multiple subdomains is beneficial for RGB-IR domain adaptation. To this end, this paper proposes a new SS-DC framework based on a decoupling-coupling strategy. In terms of decoupling, we design a Spectral Adaptive Idempotent Decoupling (SAID) module in the aspect of spectral decomposition. Due to the style and content information being highly embedded in different frequency bands, this module can decouple DI and DS components more accurately and interpretably. A novel filter bank-based spectral processing paradigm and a self-distillation-driven decoupling loss are proposed to improve the spectral domain decoupling. In terms of coupling, a new spatial-spectral coupling method is proposed, which realizes joint coupling through spatial and spectral DI feature pyramids. Meanwhile, this paper introduces DS from decoupling to reduce the domain bias. Extensive experiments demonstrate that our method can significantly improve the baseline performance and outperform existing UDAOD methods on multiple RGB-IR datasets, including a new experimental protocol proposed in this paper based on the FLIR-ADAS dataset.

[30] Dataset Ownership Verification for Pre-trained Masked Models cs.CV

Yuechen Xie, Jie Song, Yicheng Shan, Xiaoyan Zhang, Yuanyu Wan

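TL;DR: The paper proposes DOV4MM, a method for dataset ownership verification of masked models, filling a gap left by existing techniques.

Details

Motivation: High-quality open-source datasets are vital to the progress of deep learning, but they are at risk of being exploited. Existing verification techniques mainly target supervised and contrastive pre-trained models and cannot be applied directly to masked models.

Result: Experiments on ImageNet-1K and WikiText-103 show that DOV4MM rejects the null hypothesis (p-value well below 0.05) and outperforms existing methods.

Insight: Pre-training a masked model leaves a distinctive, verifiable trace in the embedding space, offering a new approach to protecting dataset ownership.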

Abstract: High-quality open-source datasets have emerged as a pivotal catalyst driving the swift advancement of deep learning, while facing the looming threat of potential exploitation. Protecting these datasets is of paramount importance for the interests of their owners. The verification of dataset ownership has evolved into a crucial approach in this domain; however, existing verification techniques are predominantly tailored to supervised models and contrastive pre-trained models, rendering them ill-suited for direct application to the increasingly prevalent masked models. In this work, we introduce the inaugural methodology addressing this critical, yet unresolved challenge, termed Dataset Ownership Verification for Masked Modeling (DOV4MM). The central objective is to ascertain whether a suspicious black-box model has been pre-trained on a particular unlabeled dataset, thereby assisting dataset owners in safeguarding their rights. DOV4MM is grounded in our empirical observation that when a model is pre-trained on the target dataset, the difficulty of reconstructing masked information within the embedding space exhibits a marked contrast to models not pre-trained on that dataset. We validated the efficacy of DOV4MM through ten masked image models on ImageNet-1K and four masked language models on WikiText-103. The results demonstrate that DOV4MM rejects the null hypothesis, with a $p$-value considerably below 0.05, surpassing all prior approaches. Code is available at https://github.com/xieyc99/DOV4MM.

[31] MVAR: MultiVariate AutoRegressive Air Pollutants Forecasting Model cs.CV | cs.LG

Xu Fan, Zhihao Wang, Yuetan Lin, Yan Zhang, Yang Xiang

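TL;DR: The paper proposes MVAR, a multivariate autoregressive air pollutant forecasting model that reduces reliance on long-time-window inputs and improves data utilization efficiency, achieves 120-hour long-term sequential forecasting, and couples meteorological data to improve the learning of spatial responses.

Details

Motivation: Existing studies mostly focus on forecasting a single pollutant, neglecting the interactions among pollutants and the diversity of their spatial responses, and so cannot meet practical multivariate forecasting needs.

Result: Experiments show that MVAR outperforms existing methods, validating the effectiveness of its architecture.

Insight: Combining multivariate interactions with meteorological data is key to improving air pollutant forecasting accuracy, and the standardized dataset provides important support for future research.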

Abstract: Air pollutants pose a significant threat to the environment and human health, so accurately forecasting pollutant concentrations is essential for pollution warnings and policy-making. Existing studies predominantly focus on single-pollutant forecasting, neglecting the interactions among different pollutants and their diverse spatial responses. To address the practical needs of forecasting multivariate air pollutants, we propose the MultiVariate AutoRegressive air pollutants forecasting model (MVAR), which reduces the dependency on long-time-window inputs and boosts data utilization efficiency. We also design the Multivariate Autoregressive Training Paradigm, enabling MVAR to achieve 120-hour long-term sequential forecasting. Additionally, MVAR includes a Meteorological Coupled Spatial Transformer block, enabling the flexible coupling of AI-based meteorological forecasts while learning the interactions among pollutants and their diverse spatial responses. To address the lack of standardized datasets in air pollutant forecasting, we construct a comprehensive dataset covering 6 major pollutants across 75 cities in North China from 2018 to 2023, including ERA5 reanalysis data and FuXi-2.0 forecast data. Experimental results demonstrate that the proposed model outperforms state-of-the-art methods and validate the effectiveness of the proposed architecture.

Rongtao Xu, Han Gao, Mingming Yu, Dong An, Shunpeng Chen

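TL;DR: The paper proposes the 3D-MoRe framework, which uses foundation models to generate large-scale 3D-language datasets and significantly improves performance on question answering and dense captioning in 3D scenes.

Details

Motivation: Existing 3D scene tasks such as question answering and dense captioning need more diverse and scalable data. The paper aims to improve task performance by combining multimodal data with high-level reasoning.

Result: CIDEr improves by 2.15% on ScanQA; CIDEr@0.5 improves by 1.84% on ScanRefer.

Insight: Fusing multimodal context with high-level reasoning can effectively improve performance on 3D scene tasks, and the generated large-scale dataset may help advance the field.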

Abstract: With the growing need for diverse and scalable data in indoor scene tasks, such as question answering and dense captioning, we propose 3D-MoRe, a novel paradigm designed to generate large-scale 3D-language datasets by leveraging the strengths of foundational models. The framework integrates key components, including multi-modal embedding, cross-modal interaction, and a language model decoder, to process natural language instructions and 3D scene data. This approach facilitates enhanced reasoning and response generation in complex 3D environments. Using the ScanNet 3D scene dataset, along with text annotations from ScanQA and ScanRefer, 3D-MoRe generates 62,000 question-answer (QA) pairs and 73,000 object descriptions across 1,513 scenes. We also employ various data augmentation techniques and implement semantic filtering to ensure high-quality data. Experiments on ScanQA demonstrate that 3D-MoRe significantly outperforms state-of-the-art baselines, with the CIDEr score improving by 2.15%. Similarly, on ScanRefer, our approach achieves a notable increase in CIDEr@0.5 by 1.84%, highlighting its effectiveness in both tasks. Our code and generated datasets will be publicly released to benefit the community, and both can be accessed at https://3D-MoRe.github.io.

[33] Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery cs.CV | cs.AI

Xinhang Wan, Jiyuan Liu, Qian Qu, Suyuan Liu, Chuyu Zhang

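TL;DR: The paper proposes IICMVNCD, the first NCD method for multi-view data, which uses intra-view and inter-view correlations to overcome two limitations of existing methods: reliance on pseudo-labels and neglect of multi-view data.

Details

Motivation: Existing NCD methods consider only single-view data and depend on pseudo-labels, which makes performance unstable, while multi-view data (such as multi-omics data) is increasingly common in practice and calls for more robust methods.

Result: Experiments validate the effectiveness of IICMVNCD and show superior performance on multi-view data.

Insight: Transferring information about inter-view relationships and dynamically adjusting view weights are key to improving multi-view NCD performance.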

Abstract: In this paper, we address the problem of novel class discovery (NCD), which aims to cluster novel classes by leveraging knowledge from disjoint known classes. While recent advances have made significant progress in this area, existing NCD methods face two major limitations. First, they primarily focus on single-view data (e.g., images), overlooking the increasingly common multi-view data, such as multi-omics datasets used in disease diagnosis. Second, their reliance on pseudo-labels to supervise novel class clustering often results in unstable performance, as pseudo-label quality is highly sensitive to factors such as data noise and feature dimensionality. To address these challenges, we propose a novel framework named Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery (IICMVNCD), which is the first attempt to explore NCD in the multi-view setting. Specifically, at the intra-view level, leveraging the distributional similarity between known and novel classes, we employ matrix factorization to decompose features into view-specific shared base matrices and factor matrices. The base matrices capture distributional consistency between the two datasets, while the factor matrices model pairwise relationships between samples. At the inter-view level, we utilize view relationships among known classes to guide the clustering of novel classes. This includes generating predicted labels through the weighted fusion of factor matrices and dynamically adjusting view weights of known classes based on the supervision loss, which are then transferred to novel class learning. Experimental results validate the effectiveness of our proposed approach.

[34] InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing cs.CV | cs.AI | cs.MM

Kun-Hsiang Lin, Yu-Wen Tseng, Kang-Yang Huang, Jhih-Ciang Wu, Wen-Huang Cheng

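TL;DR: InstructFLIP is an instruction-tuned framework based on vision-language models (VLMs) that improves the generalization of face anti-spoofing; by decoupling instructions into content and style components, it significantly reduces cross-domain training redundancy.

Details

Motivation: Current face anti-spoofing (FAS) research focuses on cross-domain generalization but faces two major challenges: limited semantic understanding of attack types and training redundancy across domains. This paper combines VLMs with a meta-domain strategy to address them.

Result: Experiments show that InstructFLIP outperforms existing SOTA models in accuracy and substantially reduces cross-domain training redundancy.

Insight: Decoupling instructions offers a new approach to FAS; textual guidance can effectively improve a model's semantic understanding and generalization.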

Abstract: Face anti-spoofing (FAS) aims to construct a robust system that can withstand diverse attacks. While recent efforts have concentrated mainly on cross-domain generalization, two significant challenges persist: limited semantic understanding of attack types and training redundancy across domains. We address the first by integrating vision-language models (VLMs) to enhance the perception of visual input. For the second challenge, we employ a meta-domain strategy to learn a unified model that generalizes well across multiple domains. Our proposed InstructFLIP is a novel instruction-tuned framework that leverages VLMs to enhance generalization via textual guidance trained solely on a single domain. At its core, InstructFLIP explicitly decouples instructions into content and style components, where content-based instructions focus on the essential semantics of spoofing, and style-based instructions consider variations related to the environment and camera characteristics. Extensive experiments demonstrate the effectiveness of InstructFLIP by outperforming SOTA models in accuracy and substantially reducing training redundancy across diverse domains in FAS. Project website is available at https://kunkunlin1221.github.io/InstructFLIP.

[35] MS-DETR: Towards Effective Video Moment Retrieval and Highlight Detection by Joint Motion-Semantic Learning cs.CV

Hongxu Ma, Guanshuo Wang, Fufu Yu, Qiong Jia, Shouhong Ding

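TL;DR: MS-DETR is a joint motion-semantics learning framework for video moment retrieval (MR) and highlight detection (HD); it significantly improves performance by disentangling the intra-modal correlations of motion and semantics and exploiting cross-modal task correlations.

Details

Motivation: Existing DETR frameworks for MR/HD do not fully exploit the relationship between motion and semantics in video, and they suffer from data sparsity.

Result: Outperforms existing SOTA models on four benchmarks.

Insight: In video tasks, jointly learning motion and semantics and addressing data sparsity are both crucial.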

Abstract: Video Moment Retrieval (MR) and Highlight Detection (HD) aim to pinpoint specific moments and assess clip-wise relevance based on the text query. While DETR-based joint frameworks have made significant strides, there remains untapped potential in harnessing the intricate relationships between temporal motion and spatial semantics within video content. In this paper, we propose the Motion-Semantics DETR (MS-DETR), a framework that captures rich motion-semantics features through unified learning for MR/HD tasks. The encoder first explicitly models disentangled intra-modal correlations within motion and semantics dimensions, guided by the given text queries. Subsequently, the decoder utilizes the task-wise correlation across temporal motion and spatial semantics dimensions to enable precise query-guided localization for MR and refined highlight boundary delineation for HD. Furthermore, we observe the inherent sparsity dilemma within the motion and semantics dimensions of MR/HD datasets. To address this issue, we enrich the corpus from both dimensions by generation strategies and propose contrastive denoising learning to ensure the above components learn robustly and effectively. Extensive experiments on four MR/HD benchmarks demonstrate that our method outperforms existing state-of-the-art models by a margin. Our code is available at https://github.com/snailma0229/MS-DETR.git.

[36] Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics cs.CV | cs.RO

Muleilan Pei, Shaoshuai Shi, Xuesong Chen, Xu Liu, Shaojie Shen

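TL;DR: This paper proposes a reinforcement-learning-based trajectory prediction method that combines behavior intentions with reward heuristics, significantly improving the accuracy and confidence of trajectory prediction.

Details

Motivation: Motion forecasting is a critical but challenging task in autonomous driving systems. Conventional methods predict trajectories directly and overlook the importance of behavior intentions. This paper rethinks the task from a planning perspective and proposes a new strategy that combines intention reasoning with reward heuristics.

Result: On the Argoverse and nuScenes datasets, the method significantly improves prediction confidence and achieves performance highly competitive with the state of the art.

Insight: Rethinking motion forecasting from a planning perspective and combining intention reasoning with reward heuristics can significantly improve prediction performance.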

Abstract: Motion forecasting for on-road traffic agents presents both a significant challenge and a critical necessity for ensuring safety in autonomous driving systems. In contrast to most existing data-driven approaches that directly predict future trajectories, we rethink this task from a planning perspective, advocating a “First Reasoning, Then Forecasting” strategy that explicitly incorporates behavior intentions as spatial guidance for trajectory prediction. To achieve this, we introduce an interpretable, reward-driven intention reasoner grounded in a novel query-centric Inverse Reinforcement Learning (IRL) scheme. Our method first encodes traffic agents and scene elements into a unified vectorized representation, then aggregates contextual features through a query-centric paradigm. This enables the derivation of a reward distribution, a compact yet informative representation of the target agent’s behavior within the given scene context via IRL. Guided by this reward heuristic, we perform policy rollouts to reason about multiple plausible intentions, providing valuable priors for subsequent trajectory generation. Finally, we develop a hierarchical DETR-like decoder integrated with bidirectional selective state space models to produce accurate future trajectories along with their associated probabilities. Extensive experiments on the large-scale Argoverse and nuScenes motion forecasting datasets demonstrate that our approach significantly enhances trajectory prediction confidence, achieving highly competitive performance relative to state-of-the-art methods.

[37] YOLOv8-SMOT: An Efficient and Robust Framework for Real-Time Small Object Tracking via Slice-Assisted Training and Adaptive Association cs.CV

Xiang Yu, Xinyao Liu, Guang Liang

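TL;DR: YOLOv8-SMOT is an efficient and robust framework for real-time small object tracking. Slice-assisted training and adaptive association address the sparse features, complex motion, and occlusion that make small objects hard to track, and the method won the MVA 2025 challenge.

Details

Motivation: Tracking small, agile objects such as birds from a UAV perspective is highly challenging. The main difficulties are scarce appearance features, complex motion, and identity ambiguity caused by frequent occlusion.

Result: The method achieves state-of-the-art performance on the SMOT4SB test set with an SO-HOTA score of 55.205, validating the effectiveness of the framework.

Insight: SliceTrain and the motion direction maintenance mechanism are the key innovations for addressing insufficient training data and identity ambiguity in small object tracking.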

Abstract: Tracking small, agile multi-objects (SMOT), such as birds, from an Unmanned Aerial Vehicle (UAV) perspective is a highly challenging computer vision task. The difficulty stems from three main sources: the extreme scarcity of target appearance features, the complex motion entanglement caused by the combined dynamics of the camera and the targets themselves, and the frequent occlusions and identity ambiguity arising from dense flocking behavior. This paper details our championship-winning solution in the MVA 2025 “Finding Birds” Small Multi-Object Tracking Challenge (SMOT4SB), which adopts the tracking-by-detection paradigm with targeted innovations at both the detection and association levels. On the detection side, we propose a systematic training enhancement framework named SliceTrain. This framework, through the synergy of ‘deterministic full-coverage slicing’ and ‘slice-level stochastic augmentation’, effectively addresses the problem of insufficient learning for small objects in high-resolution image training. On the tracking side, we designed a robust tracker that is completely independent of appearance information. By integrating a motion direction maintenance (EMA) mechanism and an adaptive similarity metric combining bounding box expansion and distance penalty into the OC-SORT framework, our tracker can stably handle irregular motion and maintain target identities. Our method achieves state-of-the-art performance on the SMOT4SB public test set, reaching an SO-HOTA score of 55.205, which fully validates the effectiveness and advancement of our framework in solving complex real-world SMOT problems. The source code will be made available at https://github.com/Salvatore-Love/YOLOv8-SMOT.
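
The idea of deterministic full-coverage slicing can be illustrated with a minimal sketch. The abstract does not state SliceTrain's tile size, overlap, or border handling, so the `tile` and `overlap` defaults, the shift-back rule for the last tile, and the function names `slice_starts` and `full_coverage_slices` are all assumptions made for illustration.

```python
from typing import List, Tuple


def slice_starts(length: int, tile: int, overlap: int) -> List[int]:
    """Start offsets of tiles that together cover [0, length)."""
    if tile >= length:
        return [0]  # a single tile already covers this axis
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than tile")
    starts = list(range(0, length - tile, stride))
    # Assumed border rule: shift the final tile back so it ends at the border.
    starts.append(length - tile)
    return starts


def full_coverage_slices(
    width: int, height: int, tile: int = 640, overlap: int = 64
) -> List[Tuple[int, int, int, int]]:
    """Return (x0, y0, x1, y1) boxes for a fixed, repeatable grid of tiles."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in slice_starts(height, tile, overlap)
        for x in slice_starts(width, tile, overlap)
    ]


boxes = full_coverage_slices(1000, 800)
```

For a 1000 x 800 frame with the assumed defaults this yields four overlapping tiles. Each box can then be cropped and augmented on its own, which is one plausible reading of slice-level stochastic augmentation.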

[38] Out-of-distribution data supervision towards biomedical semantic segmentation cs.CV

Yiquan Gao, Duohui Xu

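TL;DR: The paper proposes Med-OoD, a data-centric framework that introduces Out-of-Distribution (OoD) data supervision to address misclassification in biomedical image segmentation, with no need for external data, feature regularization, or additional annotations. The method can be applied directly to existing segmentation networks, considerably improves performance, and shows the potential of training a segmentation network using only OoD data.

Details

Motivation: Biomedical segmentation networks trained on limited and imperfect datasets are prone to misclassifying foreground and background. The strong results of OoD data on other visual tasks inspired the authors to explore its use in segmentation.

Result: The method achieves considerable performance improvements on the Lizard dataset and reaches 76.1% mIoU when trained only on OoD data.

Insight: OoD data may play an important role in biomedical image segmentation, which challenges the conventional learning paradigm that depends on annotated data.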

Abstract: Biomedical segmentation networks easily suffer from unexpected misclassification between foreground and background objects when learning on limited and imperfect medical datasets. Inspired by the strong power of Out-of-Distribution (OoD) data on other visual tasks, we propose a data-centric framework, Med-OoD, to address this issue by introducing OoD data supervision into fully-supervised biomedical segmentation with none of the following needs: (i) external data sources, (ii) feature regularization objectives, (iii) additional annotations. Our method can be seamlessly integrated into segmentation networks without any modification of the architectures. Extensive experiments show that Med-OoD largely prevents various segmentation networks from pixel misclassification on medical images and achieves considerable performance improvements on the Lizard dataset. We also present an emerging learning paradigm of training a medical segmentation network completely using OoD data devoid of foreground class labels, which surprisingly yields 76.1% mIoU as the test result. We hope this learning paradigm will attract people to rethink the roles of OoD data. Code is made available at https://github.com/StudioYG/Med-OoD.

[39] Non-Adaptive Adversarial Face Generation cs.CV | cs.AI | cs.CR | I.2.6; I.5.4; D.4.6; K.6.5; I.4.8

Sunpill Kim, Seunghun Paik, Chanwoo Hwang, Minsu Kim, Jae Hong Seo

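TL;DR: This paper proposes a new non-adaptive method for generating adversarial faces. By exploiting structural properties of the FRS feature space, it needs only a small number of queries to generate adversarial faces that are visually distinct yet recognized as the target identity, without relying on transferability or open-source surrogate models.

Details

Motivation: Face recognition systems (FRSs) face serious security and privacy risks from adversarial attacks, especially in identity verification scenarios. Existing methods usually rely on iterative optimization or transfer-based attacks; this paper aims for a more efficient generation method that requires no adaptive queries.

Result: Against AWS's CompareFaces API, a single non-adaptive query consisting of 100 face images achieves a success rate of over 93%, significantly outperforming existing methods.

Insight: Structural properties of the FRS feature space, such as attributed subspheres, open a new research direction for adversarial attacks and also expose potential vulnerabilities in existing systems.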

Abstract: Adversarial attacks on face recognition systems (FRSs) pose serious security and privacy threats, especially when these systems are used for identity verification. In this paper, we propose a novel method for generating adversarial faces: synthetic facial images that are visually distinct yet recognized as a target identity by the FRS. Unlike iterative optimization-based approaches (e.g., gradient descent or other iterative solvers), our method leverages the structural characteristics of the FRS feature space. We find that individuals sharing the same attribute (e.g., gender or race) form an attributed subsphere. By utilizing such subspheres, our method achieves both non-adaptiveness and a remarkably small number of queries. This eliminates the need for relying on transferability and open-source surrogate models, which have been a typical strategy when repeated adaptive queries to commercial FRSs are impossible. Despite requiring only a single non-adaptive query consisting of 100 face images, our method achieves a high success rate of over 93% against AWS’s CompareFaces API at its default threshold. Furthermore, unlike many existing attacks that perturb a given image, our method can deliberately produce adversarial faces that impersonate the target identity while exhibiting high-level attributes chosen by the adversary.

[40] LidarPainter: One-Step Away From Any Lidar View To Novel Guidance cs.CV

Yuzhou Ji, Ke Ma, Hong Cai, Anchun Zhang, Lizhuang Ma

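TL;DR: LidarPainter is a one-step diffusion model that recovers consistent driving views in real time from sparse LiDAR conditions and artifact-corrupted renderings, supporting high-fidelity lane shifts and stylized generation.

Details

Motivation: Dynamic driving scene reconstruction is important for digital twin systems and autonomous driving simulation. With existing methods, however, background and vehicle model quality degrades when the view deviates from the input trajectory, and they also suffer from problems with speed, consistency, and resource efficiency.

Result: Experiments show that LidarPainter outperforms existing methods in speed, quality, and resource efficiency (7x faster than StreetCrafter with only one fifth of the GPU memory) and supports stylized generation.

Insight: With a one-step diffusion model, LidarPainter achieves efficient, high-quality driving scene reconstruction, offering a new solution for digital twins and autonomous driving simulation.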

Abstract: Dynamic driving scene reconstruction is of great importance in fields like digital twin systems and autonomous driving simulation. However, unacceptable degradation occurs when the view deviates from the input trajectory, leading to corrupted background and vehicle models. To improve reconstruction quality on novel trajectories, existing methods are subject to various limitations including inconsistency, deformation, and time consumption. This paper proposes LidarPainter, a one-step diffusion model that recovers consistent driving views from sparse LiDAR conditions and artifact-corrupted renderings in real time, enabling high-fidelity lane shifts in driving scene reconstruction. Extensive experiments show that LidarPainter outperforms state-of-the-art methods in speed, quality and resource efficiency, specifically 7x faster than StreetCrafter with only one fifth of the GPU memory required. LidarPainter also supports stylized generation using text prompts such as “foggy” and “night”, allowing for a diverse expansion of the existing asset library.

[41] Open-Vocabulary Indoor Object Grounding with 3D Hierarchical Scene Graph cs.CV

Sergey Linok, Gleb Naumov

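TL;DR: The paper proposes OVIGo-3DHSG, which grounds indoor objects in open-vocabulary settings using a 3D hierarchical scene graph and uses a large language model to strengthen spatial reasoning.

Details

Motivation: Existing indoor scene understanding methods struggle with complex spatial relations and open-vocabulary queries, so a multi-level representation that combines geometric and semantic information is needed.

Result: On Habitat Matterport 3D multi-floor scenes, the method shows efficient scene comprehension and robust object grounding.

Insight: Combining a hierarchical scene graph with a language model can significantly improve performance on complex spatial tasks and suits applications that require high-precision spatial reasoning.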

Abstract: We propose the OVIGo-3DHSG method: Open-Vocabulary Indoor Grounding of objects using a 3D Hierarchical Scene Graph. OVIGo-3DHSG represents an extensive indoor environment over a Hierarchical Scene Graph derived from sequences of RGB-D frames utilizing a set of open-vocabulary foundation models and sensor data processing. The hierarchical representation explicitly models spatial relations across floors, rooms, locations, and objects. To effectively address complex queries involving spatial reference to other objects, we integrate the hierarchical scene graph with a Large Language Model for multistep reasoning. This integration leverages inter-layer (e.g., room-to-object) and intra-layer (e.g., object-to-object) connections, enhancing spatial contextual understanding. We investigate the semantic and geometric accuracy of the hierarchical representation on Habitat Matterport 3D Semantic multi-floor scenes. Our approach demonstrates efficient scene comprehension and robust object grounding compared to existing methods. Overall, OVIGo-3DHSG demonstrates strong potential for applications requiring spatial reasoning and understanding of indoor environments. Related materials can be found at https://github.com/linukc/OVIGo-3DHSG.

[42] Block-based Symmetric Pruning and Fusion for Efficient Vision Transformers cs.CV

Yi-Kuan Hsieh, Jun-Wei Hsieh, Xin Li, Yu-Ming Chang, Yu-Chee Tseng

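TL;DR: This paper proposes BSPF-ViT, a new method that improves the computational efficiency of ViT through symmetric pruning and fusion, significantly improving accuracy while reducing computational cost.

Details

Motivation: The high computational complexity of Vision Transformers limits their practical use, and existing pruning methods ignore interactions between tokens, which costs accuracy.

Result: The method performs well across multiple ViT models: ImageNet classification accuracy rises by 1.3% on DeiT-T and 2.0% on DeiT-S, computational overhead falls by 50%, and speed improves by 40%.

Insight: Symmetric pruning and fusion can improve both accuracy and efficiency, offering a new approach to lightweight ViT design.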

Abstract: Vision Transformer (ViT) has achieved impressive results across various vision tasks, yet its high computational cost limits practical applications. Recent methods have aimed to reduce ViT’s $O(n^2)$ complexity by pruning unimportant tokens. However, these techniques often sacrifice accuracy by independently pruning query (Q) and key (K) tokens, leading to performance degradation due to overlooked token interactions. To address this limitation, we introduce a novel Block-based Symmetric Pruning and Fusion for efficient ViT (BSPF-ViT) that optimizes the pruning of Q/K tokens jointly. Unlike previous methods that consider only a single direction, our approach evaluates each token and its neighbors to decide which tokens to retain by taking token interaction into account. The retained tokens are compressed through a similarity fusion step, preserving key information while reducing computational costs. The shared weights of Q/K tokens create a symmetric attention matrix, allowing pruning only the upper triangular part for speedup. BSPF-ViT consistently outperforms state-of-the-art ViT methods at all pruning levels, increasing ImageNet classification accuracy by 1.3% on DeiT-T and 2.0% on DeiT-S, while reducing computational overhead by 50%. It achieves 40% speedup with improved accuracy across various ViTs.
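
A small sketch can show why shared Q/K weights make half of the score matrix redundant. This is not the BSPF-ViT implementation: the pure-Python helpers `project` and `symmetric_scores` are hypothetical names, and the symmetry shown here holds for the pre-softmax scores (a row-wise softmax would generally break it).

```python
def project(tokens, weight):
    """Multiply an (n x d) token matrix by a (d x k) projection matrix."""
    d, k = len(weight), len(weight[0])
    return [[sum(t[i] * weight[i][j] for i in range(d)) for j in range(k)]
            for t in tokens]


def symmetric_scores(tokens, shared_weight):
    """Pre-softmax scores Q K^T when Q and K share one projection."""
    qk = project(tokens, shared_weight)  # Q == K because the weights are shared
    n = len(qk)
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # upper triangle only, diagonal included
            s = sum(a * b for a, b in zip(qk[i], qk[j]))
            scores[i][j] = s
            scores[j][i] = s  # mirror the value instead of recomputing it
    return scores


tokens = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
shared_weight = [[1.0, 0.0], [1.0, 2.0]]
scores = symmetric_scores(tokens, shared_weight)
```

With n tokens this needs n(n+1)/2 dot products instead of n^2, which is the kind of saving the abstract attributes to working on the upper triangular part only.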

[43] AD-GS: Object-Aware B-Spline Gaussian Splatting for Self-Supervised Autonomous Driving cs.CV

Jiawei Xu, Kai Deng, Zexin Fan, Shenlong Wang, Jin Xie

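TL;DR: AD-GS is a self-supervised framework for high-quality driving scene rendering. It models dynamic objects by combining B-spline curves with global trigonometric functions, and it segments scenes and improves rendering without annotations.

Details

Motivation: Current high-quality dynamic scene rendering methods depend on costly annotations, while self-supervised methods fail to capture dynamic motion accurately or decompose scenes properly, which leads to rendering artifacts. AD-GS aims to solve this problem.

Result: AD-GS performs strongly among annotation-free methods and is competitive with annotation-dependent methods.

Insight: A novel motion model and self-supervised segmentation provide an efficient, low-cost solution for dynamic scene rendering.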

Abstract: Modeling and rendering dynamic urban driving scenes is crucial for self-driving simulation. Current high-quality methods typically rely on costly manual object tracklet annotations, while self-supervised approaches fail to capture dynamic object motions accurately and decompose scenes properly, resulting in rendering artifacts. We introduce AD-GS, a novel self-supervised framework for high-quality free-viewpoint rendering of driving scenes from a single log. At its core is a novel learnable motion model that integrates locality-aware B-spline curves with global-aware trigonometric functions, enabling flexible yet precise dynamic object modeling. Rather than requiring comprehensive semantic labeling, AD-GS automatically segments scenes into objects and background with simplified pseudo 2D segmentation, representing objects using dynamic Gaussians and bidirectional temporal visibility masks. Further, our model incorporates visibility reasoning and physically rigid regularization to enhance robustness. Extensive evaluations demonstrate that our annotation-free model significantly outperforms current state-of-the-art annotation-free methods and is competitive with annotation-dependent approaches.

[44] Fine-Grained Image Recognition from Scratch with Teacher-Guided Data Augmentation cs.CV | I.2; I.4

Edwin Arkel Rios, Fernando Mikael, Oswin Gosal, Femiloye Oyerinde, Hao-Chun Liang

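TL;DR: This paper proposes TGDA, a new framework that uses teacher-guided data augmentation and knowledge distillation to train high-performance fine-grained image recognition models from scratch, removing the dependence on pretrained models.

Details

Motivation: Existing fine-grained image recognition methods depend on large-scale pretrained models, which limits their use in resource-constrained environments and the development of task-specific architectures. This paper explores whether training from scratch is feasible.

Result: With both low-resolution and high-resolution inputs, TGDA matches or surpasses pretrained models. LRNets improve accuracy by up to 23% with up to 20.6x fewer parameters; ViTFS-T matches the performance of ViT B-16 with 15.3x fewer parameters.

Insight: Training fine-grained image recognition systems from scratch is feasible. TGDA opens new options for task-specific and hardware-aware architecture design and reduces dependence on pretrained models.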

Abstract: Fine-grained image recognition (FGIR) aims to distinguish visually similar sub-categories within a broader class, such as identifying bird species. While most existing FGIR methods rely on backbones pretrained on large-scale datasets like ImageNet, this dependence limits adaptability to resource-constrained environments and hinders the development of task-specific architectures tailored to the unique challenges of FGIR. In this work, we challenge the conventional reliance on pretrained models by demonstrating that high-performance FGIR systems can be trained entirely from scratch. We introduce a novel training framework, TGDA, that integrates data-aware augmentation with weak supervision via a fine-grained-aware teacher model, implemented through knowledge distillation. This framework unlocks the design of task-specific and hardware-aware architectures, including LRNets for low-resolution FGIR and ViTFS, a family of Vision Transformers optimized for efficient inference. Extensive experiments across three FGIR benchmarks over diverse settings involving low-resolution and high-resolution inputs show that our method consistently matches or surpasses state-of-the-art pretrained counterparts. In particular, in the low-resolution setting, LRNets trained with TGDA improve accuracy by up to 23% over prior methods while requiring up to 20.6x fewer parameters, lower FLOPs, and significantly less training data. Similarly, ViTFS-T can match the performance of a ViT B-16 pretrained on ImageNet-21k while using 15.3x fewer trainable parameters and requiring orders of magnitude less data. These results highlight TGDA’s potential as an adaptable alternative to pretraining, paving the way for more efficient fine-grained vision systems.

[45] Hybrid Ensemble Approaches: Optimal Deep Feature Fusion and Hyperparameter-Tuned Classifier Ensembling for Enhanced Brain Tumor Classification cs.CV

Zahid Ullah, Dragan Pamucar, Jihie Kim

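TL;DR: The paper proposes a novel double ensembling framework that ensembles pretrained deep learning models and machine learning classifiers, combined with feature fusion and hyperparameter tuning, to significantly improve brain tumor classification accuracy.

Details

Motivation: Conventional MRI diagnosis relies on expert evaluation and is susceptible to fatigue, limited expertise, and insufficient image detail, which can lead to misdiagnosis or missed diagnoses. This paper aims to improve diagnostic precision with an automated method that combines deep learning and machine learning.

Result: Feature fusion and classifier fusion improve upon existing methods, and hyperparameter tuning further improves the ensemble. An ablation study shows how each component contributes to classification accuracy.

Insight: Combining deep learning with machine learning (feature extraction plus classifier optimization) offers clear advantages in medical image classification, and hyperparameter tuning is key to the performance gain. Feature fusion and ensemble learning can ease classification challenges with small samples or complex backgrounds.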

Abstract: Magnetic Resonance Imaging (MRI) is widely recognized as the most reliable tool for detecting tumors due to its capability to produce detailed images that reveal their presence. However, the accuracy of diagnosis can be compromised when human specialists evaluate these images. Factors such as fatigue, limited expertise, and insufficient image detail can lead to errors. For example, small tumors might go unnoticed, or overlap with healthy brain regions could result in misidentification. To address these challenges and enhance diagnostic precision, this study proposes a novel double ensembling framework, consisting of ensembled pre-trained deep learning (DL) models for feature extraction and ensembled machine learning (ML) models with fine-tuned hyperparameters to efficiently classify brain tumors. Specifically, our method includes extensive preprocessing and augmentation, transfer learning with various pre-trained deep convolutional neural networks and vision transformer networks to extract deep features from brain MRI, and fine-tuning of the hyperparameters of ML classifiers. Our experiments utilized three different publicly available Kaggle MRI brain tumor datasets to evaluate the pre-trained DL feature extractor models, ML classifiers, and the effectiveness of an ensemble of deep features along with an ensemble of ML classifiers for brain tumor classification. Our results indicate that the proposed feature fusion and classifier fusion improve upon the state of the art, with hyperparameter fine-tuning providing a significant enhancement over the ensemble method. Additionally, we present an ablation study to illustrate how each component contributes to accurate brain tumor classification.

[46] Revealing the Ancient Beauty: Digital Reconstruction of Temple Tiles using Computer Vision cs.CV | cs.AI

Arkaprabha Basu

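TL;DR: The paper proposes three computer vision techniques for the digital reconstruction of Indian monuments: Fractal Convolution, Self-Sensitive Tile Filling (SSTF), and the data augmentation method MosaicSlice. It also applies super-resolution to improve image quality, balancing efficiency and aesthetics in cultural heritage preservation.

Details

Motivation: Demand for modern digital methods in cultural heritage preservation is growing, and Indian monuments call for specialized techniques because of their distinctive architectural style and aesthetic value. The study therefore proposes new methods based on computer vision.

Result: The study achieves highly detailed reconstruction of monument tiles while preserving the authenticity of the heritage, and automation lowers costs, yielding an efficient and aesthetically strong solution.

Insight: Computer vision techniques can preserve and restore cultural heritage efficiently while balancing tradition and innovation, offering new options for multidisciplinary collaboration.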

Abstract: Modern digitised approaches have dramatically changed the preservation and restoration of cultural treasures, integrating computer scientists into multidisciplinary projects with ease. Machine learning, deep learning, and computer vision techniques have revolutionised developing sectors like 3D reconstruction, picture inpainting, IoT-based methods, genetic algorithms, and image processing with the integration of computer scientists into multidisciplinary initiatives. We suggest three cutting-edge techniques in recognition of the special qualities of Indian monuments, which are famous for their architectural skill and aesthetic appeal. The first is the Fractal Convolution methodology, a segmentation method based on image processing that successfully reveals subtle architectural patterns within these irreplaceable cultural buildings. The second is a revolutionary Self-Sensitive Tile Filling (SSTF) method created especially for West Bengal’s mesmerising Bankura Terracotta Temples, and the third is a brand-new data augmentation method called MosaicSlice. Furthermore, we delve deeper into the Super Resolution strategy to upscale the images without losing a significant amount of quality. Our methods allow for the development of seamless region-filling and highly detailed tiles while maintaining authenticity using a novel data augmentation strategy within affordable costs introducing automation. By providing effective solutions that preserve the delicate balance between tradition and innovation, this study improves the subject and eventually ensures unrivalled efficiency and aesthetic excellence in cultural heritage protection. The suggested approaches advance the field into an era of unmatched efficiency and aesthetic quality while carefully upholding the delicate equilibrium between tradition and innovation.

[47] MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM cs.CV

Tao Chen, Jingyi Zhang, Decheng Liu, Chunlei Peng

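TL;DR: The paper proposes the MGFFD-VLM framework, which uses multi-granularity prompt learning and an attribute-driven hybrid LoRA strategy to improve the performance of visual large language models (VLMs) in deepfake detection while improving interpretability.

Details

Motivation: Existing VLM-based deepfake detection methods do not fully exploit face quality-related attributes and lack effective training strategies.

Result: Experiments show that MGFFD-VLM outperforms existing methods in text-based forgery judgment and analysis, with higher accuracy.

Insight: Combining multi-granularity prompts with an attribute-driven strategy can effectively improve both the performance and the interpretability of VLMs in deepfake detection.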

Abstract: Recent studies have utilized visual large language models (VLMs) to answer not only “Is this face a forgery?” but also “Why is the face a forgery?” These studies introduced forgery-related attributes, such as forgery location and type, to construct deepfake VQA datasets and train VLMs, achieving high accuracy while providing human-understandable explanatory text descriptions. However, these methods still have limitations. For example, they do not fully leverage face quality-related attributes, which are often abnormal in forged faces, and they lack effective training strategies for forgery-aware VLMs. In this paper, we extend the VQA dataset to create DD-VQA+, which features a richer set of attributes and a more diverse range of samples. Furthermore, we introduce a novel forgery detection framework, MGFFD-VLM, which integrates an Attribute-Driven Hybrid LoRA Strategy to enhance the capabilities of Visual Large Language Models (VLMs). Additionally, our framework incorporates Multi-Granularity Prompt Learning and a Forgery-Aware Training Strategy. By transforming classification and forgery segmentation results into prompts, our method not only improves forgery classification but also enhances interpretability. To further boost detection performance, we design multiple forgery-related auxiliary losses. Experimental results demonstrate that our approach surpasses existing methods in both text-based forgery judgment and analysis, achieving superior accuracy.

[48] Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models cs.CV

Felix Nützel, Mischa Dombrowski, Bernhard Kainz

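TL;DR: The paper proposes a method based on generative diffusion models with multimodal text conditioning for phrase grounding in medical images, and introduces a new post-processing technique (BBM) that further improves performance.

Details

Motivation: Current methods based on discriminative, self-supervised contrastive learning show limited performance on phrase grounding in medical images, and the potential of generative diffusion models remains underexplored.

Result: Experiments show that the method doubles the mIoU of current discriminative methods, a marked improvement in grounding performance.

Insight: Generative models hold great potential for phrase grounding in medical images. Combining them with a domain-specific language model and post-processing can significantly improve performance and offers more robust and interpretable solutions for clinical use.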

Abstract: Phrase grounding, i.e., mapping natural language phrases to specific image regions, holds significant potential for disease localization in medical imaging through clinical reports. While current state-of-the-art methods rely on discriminative, self-supervised contrastive models, we demonstrate that generative text-to-image diffusion models, leveraging cross-attention maps, can achieve superior zero-shot phrase grounding performance. Contrary to prior assumptions, we show that fine-tuning diffusion models with a frozen, domain-specific language model, such as CXR-BERT, substantially outperforms domain-agnostic counterparts. This setup achieves remarkable improvements, with mIoU scores doubling those of current discriminative methods. These findings highlight the underexplored potential of generative models for phrase grounding tasks. To further enhance performance, we introduce Bimodal Bias Merging (BBM), a novel post-processing technique that aligns text and image biases to identify regions of high certainty. BBM refines cross-attention maps, achieving even greater localization accuracy. Our results establish generative approaches as a more effective paradigm for phrase grounding in the medical imaging domain, paving the way for more robust and interpretable applications in clinical practice. The source code and model weights are available at https://github.com/Felix-012/generate_to_ground.

[49] Calisthenics Skills Temporal Video Segmentation cs.CV

Antonio Finocchiaro, Giovanni Maria Farinella, Antonino Furnari

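TL;DR: This paper introduces the problem of temporal video segmentation of static calisthenics skills and builds an annotated dataset, providing an initial basis for developing automated tools.

Details

Motivation: Static calisthenics skills are evaluated by difficulty and hold duration, but automated tools that segment and assess these skills from video are lacking. The paper aims to fill this gap and support athlete training and competition judging.

Result: The results show that the problem is feasible, although there is still room for improvement.

Insight: This is the first study of temporal segmentation of calisthenics skills; future work could use more advanced video understanding and pose analysis methods to improve performance.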

Abstract: Calisthenics is a fast-growing bodyweight discipline that consists of different categories, one of which is focused on skills. Skills in calisthenics encompass both static and dynamic elements performed by athletes. The evaluation of static skills is based on their difficulty level and the duration of the hold. Automated tools able to recognize isometric skills from a video by segmenting them to estimate their duration would be desirable to assist athletes in their training and judges during competitions. Although the video understanding literature on action recognition through body pose analysis is rich, no previous work has specifically addressed the problem of calisthenics skill temporal video segmentation. This study aims to provide an initial step towards the implementation of automated tools within the field of Calisthenics. To advance knowledge in this context, we propose a dataset of video footage of static calisthenics skills performed by athletes. Each video is annotated with a temporal segmentation which determines the extent of each skill. We hence report the results of a baseline approach to address the problem of skill temporal segmentation on the proposed dataset. The results highlight the feasibility of the proposed problem, while there is still room for improvement.
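
As a minimal sketch of what a temporal segmentation output looks like, the snippet below collapses per-frame skill labels into (skill, start, end) segments from which hold durations follow directly. The abstract does not describe the baseline or the annotation format, so the function name `frames_to_segments`, the `"none"` background label, the skill names, and the 25 fps frame rate are all hypothetical.

```python
from itertools import groupby


def frames_to_segments(labels, fps, background="none"):
    """Collapse per-frame labels into (skill, start_sec, end_sec) segments."""
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))  # number of consecutive frames with this label
        if label != background:
            segments.append((label, start / fps, (start + length) / fps))
        start += length
    return segments


# Hypothetical per-frame predictions for a 90-frame clip at 25 fps.
labels = ["none"] * 5 + ["planche"] * 50 + ["none"] * 10 + ["front_lever"] * 25
segments = frames_to_segments(labels, fps=25)
# -> [("planche", 0.2, 2.2), ("front_lever", 2.6, 3.6)]
```

The hold duration of each skill is then simply `end - start`, the quantity that the abstract says is used when evaluating static skills.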

[50] Site-Level Fine-Tuning with Progressive Layer Freezing: Towards Robust Prediction of Bronchopulmonary Dysplasia from Day-1 Chest Radiographs in Extremely Preterm Infants cs.CV | cs.AI | cs.LG

Sybelle Goedicke-Fritz, Michelle Bous, Annika Engel, Matthias Flotho, Pascal Hirsch

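TL;DR: This paper proposes a deep learning method that uses progressive layer freezing and linear probing to predict bronchopulmonary dysplasia (BPD) from chest X-rays taken within 24 hours of birth in preterm infants. The method performs well when built on domain-specific pretraining and is clinically practical.

Details

Motivation: Bronchopulmonary dysplasia (BPD) is a serious chronic lung disease of preterm infants, and early prediction is crucial for avoiding unnecessary treatment risks. Because routine imaging indicators such as IRDS grades have limited predictive power, the researchers explored a non-invasive prediction method based on deep learning.

Result: For predicting moderate/severe BPD, the model achieves an AUROC of 0.78 ± 0.10, a balanced accuracy of 0.69 ± 0.10, and an F1-score of 0.67 ± 0.11, outperforming ImageNet-initialized models and routine IRDS grades.

Insight: Domain-specific pretraining is crucial for medical imaging tasks. Combining progressive layer freezing with linear probing both improves performance and lowers computational cost, which suits clinical deployment and federated learning.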

Abstract: Bronchopulmonary dysplasia (BPD) is a chronic lung disease affecting 35% of extremely low birth weight infants. Defined by oxygen dependence at 36 weeks postmenstrual age, it causes lifelong respiratory complications. However, preventive interventions carry severe risks, including neurodevelopmental impairment, ventilator-induced lung injury, and systemic complications. Therefore, early BPD prognosis and prediction of BPD outcome are crucial to avoid unnecessary toxicity in low-risk infants. Admission radiographs of extremely preterm infants are routinely acquired within 24h of life and could serve as a non-invasive prognostic tool. In this work, we developed and investigated a deep learning approach using chest X-rays from 163 extremely low-birth-weight infants (≤32 weeks gestation, 401-999g) obtained within 24 hours of birth. We fine-tuned a ResNet-50 pretrained specifically on adult chest radiographs, employing progressive layer freezing with discriminative learning rates to prevent overfitting, and evaluated CutMix augmentation and linear probing. For moderate/severe BPD outcome prediction, our best performing model with progressive freezing, linear probing and CutMix achieved an AUROC of 0.78 ± 0.10, balanced accuracy of 0.69 ± 0.10, and an F1-score of 0.67 ± 0.11. In-domain pre-training significantly outperformed ImageNet initialization (p = 0.031), which confirms domain-specific pretraining to be important for BPD outcome prediction. Routine IRDS grades showed limited prognostic value (AUROC 0.57 ± 0.11), confirming the need for learned markers. Our approach demonstrates that domain-specific pretraining enables accurate BPD prediction from routine day-1 radiographs. Through progressive freezing and linear probing, the method remains computationally feasible for site-level implementation and future federated learning deployments.
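
Two of the reported metrics, balanced accuracy and F1-score, can be computed from a binary confusion matrix as sketched below. The labels are toy values invented for illustration, not study data, and the helper name `binary_metrics` is hypothetical. The abstract does not say how the ± values were obtained, so no averaging over folds or runs is shown.

```python
def binary_metrics(y_true, y_pred):
    """Balanced accuracy and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall on positives
    specificity = tn / (tn + fp) if tn + fp else 0.0  # recall on negatives
    denom = 2 * tp + fp + fn
    return {
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": 2 * tp / denom if denom else 0.0,
    }


# Toy example: 3 positive and 5 negative cases (not data from the paper).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
```

Balanced accuracy averages sensitivity and specificity, so it is not inflated by the majority class the way plain accuracy can be on an imbalanced cohort.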

[51] Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation cs.CV

Antonio Finocchiaro, Giovanni Maria Farinella, Antonino Furnari

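TL;DR: This paper proposes an efficient calisthenics skill classification method that replaces computationally expensive pose estimation with foreground instance selection and depth estimation, significantly improving speed and accuracy.

Details

Motivation: Conventional classification methods based on pose estimation are computationally expensive and complex, which limits real-time and mobile use, so a more efficient alternative is needed.

Result: Compared with skeleton-based methods, the approach runs inference 38.3x faster with RGB image patches and is more accurate with depth patches (0.837 vs. 0.815).

Insight: Skipping pose estimation and processing foreground and depth information directly can significantly improve efficiency and accuracy, which suits real-time and mobile applications.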

Abstract: Calisthenics skill classification is the computer vision task of inferring the skill performed by an athlete from images, enabling automatic performance assessment and personalized analytics. Traditional methods for calisthenics skill recognition are based on pose estimation methods to determine the position of skeletal data from images, which is later fed to a classification algorithm to infer the performed skill. Despite the progress in human pose estimation algorithms, they still involve high computational costs, long inference times, and complex setups, which limit the applicability of such approaches in real-time applications or mobile devices. This work proposes a direct approach to calisthenics skill recognition, which leverages depth estimation and athlete patch retrieval to avoid the computationally expensive human pose estimation module. Using Depth Anything V2 for depth estimation and YOLOv10 for athlete localization, we segment the subject from the background rather than relying on traditional pose estimation techniques. This strategy increases efficiency, reduces inference time, and improves classification accuracy. Our approach significantly outperforms skeleton-based methods, achieving 38.3x faster inference with RGB image patches and improved classification accuracy with depth patches (0.837 vs. 0.815). Beyond these performance gains, the modular design of our pipeline allows for flexible replacement of components, enabling future enhancements and adaptation to real-world applications.

[52] Unsupervised Monocular 3D Keypoint Discovery from Multi-View Diffusion Priors cs.CV

Subin Jeon, In Cho, Junyoung Hong, Seon Joo Kim

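TL;DR: KeyDiff3D is an unsupervised monocular 3D keypoint estimation framework that uses the geometric priors of a pretrained multi-view diffusion model to predict accurate 3D keypoints from a single image, without manual annotations or calibrated multi-view data.

Details

Motivation: Existing methods depend on costly manual annotations or calibrated multi-view data, which limits the use of 3D keypoint estimation. KeyDiff3D aims to estimate 3D keypoints in an unsupervised way from single-view images alone.

Result: Experiments show that KeyDiff3D is accurate and generalizes well on datasets including Human3.6M and Stanford Dogs, and that it can manipulate 3D objects generated by the diffusion model.

Insight: The implicit 3D priors of a diffusion model can be converted into explicit 3D features, opening a new approach to unsupervised 3D vision tasks.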

Abstract: This paper introduces KeyDiff3D, a framework for unsupervised monocular 3D keypoint estimation that accurately predicts 3D keypoints from a single image. While previous methods rely on manual annotations or calibrated multi-view images, both of which are expensive to collect, our method enables monocular 3D keypoint estimation using only a collection of single-view images. To achieve this, we leverage powerful geometric priors embedded in a pretrained multi-view diffusion model. In our framework, this model generates multi-view images from a single image, serving as a supervision signal to provide 3D geometric cues to our model. We also use the diffusion model as a powerful 2D multi-view feature extractor and construct 3D feature volumes from its intermediate representations. This transforms implicit 3D priors learned by the diffusion model into explicit 3D features. Beyond accurate keypoint estimation, we further introduce a pipeline that enables manipulation of 3D objects generated by the diffusion model. Experimental results on diverse aspects and datasets, including Human3.6M, Stanford Dogs, and several in-the-wild and out-of-domain datasets, highlight the effectiveness of our method in terms of accuracy, generalization, and its ability to enable manipulation of 3D objects generated by the diffusion model from a single image.

[53] Cluster Contrast for Unsupervised Visual Representation Learning cs.CV | cs.AI

Nikolaos Giakoumoglou, Tania Stathaki

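TL;DR: The paper proposes Cluster Contrast (CueCo), an unsupervised visual representation learning method that combines the strengths of contrastive learning and clustering and improves model performance by scattering and aligning feature representations.

Details

Motivation: Existing unsupervised representation learning methods are limited in how well they scatter and align features in the feature space, which caps model performance. CueCo aims to solve this by combining contrastive learning with clustering.

Result: On CIFAR-10, CIFAR-100, and ImageNet-100, CueCo achieves top-1 classification accuracy of 91.40%, 68.56%, and 78.65%, respectively, significantly outperforming existing methods.

Insight: By combining contrastive learning with a clustering objective, CueCo shows the importance of feature scattering and alignment in unsupervised representation learning and points to a new direction for future research.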

Abstract: We introduce Cluster Contrast (CueCo), a novel approach to unsupervised visual representation learning that effectively combines the strengths of contrastive learning and clustering methods. Inspired by recent advancements, CueCo is designed to simultaneously scatter and align feature representations within the feature space. This method utilizes two neural networks, a query and a key, where the key network is updated through a slow-moving average of the query outputs. CueCo employs a contrastive loss to push dissimilar features apart, enhancing inter-class separation, and a clustering objective to pull together features of the same cluster, promoting intra-class compactness. Our method achieves 91.40% top-1 classification accuracy on CIFAR-10, 68.56% on CIFAR-100, and 78.65% on ImageNet-100 using linear evaluation with a ResNet-18 backbone. By integrating contrastive learning with clustering, CueCo sets a new direction for advancing unsupervised visual representation learning.
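
The slow-moving average that updates the key network can be sketched as a momentum (exponential moving average) step. This follows the common momentum-encoder pattern and is only one reading of the abstract: applying the average to flattened parameter lists, the `momentum` value, and the function name `momentum_update` are assumptions, not details taken from CueCo.

```python
def momentum_update(key_params, query_params, momentum=0.99):
    """Move each key parameter a small step toward its query counterpart."""
    return [
        momentum * k + (1.0 - momentum) * q
        for k, q in zip(key_params, query_params)
    ]


# Toy parameters: the key network drifts slowly toward the query network.
key = [0.0, 1.0]
query = [1.0, 1.0]
for _ in range(3):
    key = momentum_update(key, query, momentum=0.9)
```

With a momentum close to 1 the key network changes slowly, which keeps the targets it produces stable while the query network is trained by gradient descent.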

[54] OD-VIRAT: A Large-Scale Benchmark for Object Detection in Realistic Surveillance Environments cs.CVPDF

Hayat Ullah, Abbas Khan, Arslan Munir, Hari Kalva

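TL;DR: The paper presents two large-scale object detection benchmarks for surveillance scenes, OD-VIRAT Large and OD-VIRAT Tiny, for evaluating object detection models in complex environments, and tests several state-of-the-art architectures including RETMDET and YOLOX.

Details

Motivation: Developing robust object detection algorithms for complex surveillance scenes (occlusion, small objects, complex backgrounds) requires diverse and challenging datasets for evaluating model performance.

Result: The datasets provide 8.7 million (Large) and about 289,000 (Tiny) annotated instances, and the paper reports how different models perform on them.

Insight: Object detection in complex surveillance scenes remains challenging; performance on small and occluded objects in particular needs further improvement.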

Abstract: Realistic human surveillance datasets are crucial for training and evaluating computer vision models under real-world conditions, facilitating the development of robust algorithms for human and human-interacting object detection in complex environments. These datasets need to offer diverse and challenging data to enable a comprehensive assessment of model performance and the creation of more reliable surveillance systems for public safety. To this end, we present two visual object detection benchmarks named OD-VIRAT Large and OD-VIRAT Tiny, aiming at advancing visual understanding tasks in surveillance imagery. The video sequences in both benchmarks cover 10 different scenes of human surveillance recorded from significant height and distance. The proposed benchmarks offer rich annotations of bounding boxes and categories, where OD-VIRAT Large has 8.7 million annotated instances in 599,996 images and OD-VIRAT Tiny has 288,901 annotated instances in 19,860 images. This work also focuses on benchmarking state-of-the-art object detection architectures, including RETMDET, YOLOX, RetinaNet, DETR, and Deformable-DETR on this object detection-specific variant of VIRAT dataset. To the best of our knowledge, it is the first work to examine the performance of these recently published state-of-the-art object detection architectures on realistic surveillance imagery under challenging conditions such as complex backgrounds, occluded objects, and small-scale objects. The proposed benchmarking and experimental settings will help in providing insights concerning the performance of selected object detection models and set the base for developing more efficient and robust object detection architectures.

[55] AutoVDC: Automated Vision Data Cleaning Using Vision-Language Models cs.CV | cs.AI | cs.LG | cs.ROPDF

Santosh Vasa, Aditi Ramadwar, Jnana Rama Krishna Darabattula, Md Zafar Anwar, Stanislaw Antol

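TL;DR: AutoVDC is a framework that uses vision-language models (VLMs) to automatically detect erroneous annotations in vision datasets, aiming to improve data quality for autonomous driving and reduce manual annotation cost.

Details

Motivation: Training autonomous driving systems depends on high-quality annotated data, but human annotation is imperfect and expensive, so automated tools are needed to improve data quality.

Result: AutoVDC performs well in error detection and data cleaning experiments, supporting its potential to improve the reliability of large-scale datasets.

Insight: VLMs are efficient and scalable for data cleaning, and fine-tuning further improves performance, offering a new approach to data management in autonomous driving.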

Abstract: Training of autonomous driving systems requires extensive datasets with precise annotations to attain robust performance. Human annotations suffer from imperfections, and multiple iterations are often needed to produce high-quality datasets. However, manually reviewing large datasets is laborious and expensive. In this paper, we introduce the AutoVDC (Automated Vision Data Cleaning) framework and investigate the use of Vision-Language Models (VLMs) to automatically identify erroneous annotations in vision datasets, thereby enabling users to eliminate these errors and enhance data quality. We validate our approach using the KITTI and nuImages datasets, which contain object detection benchmarks for autonomous driving. To test the effectiveness of AutoVDC, we create dataset variants with intentionally injected erroneous annotations and observe the error detection rate of our approach. Additionally, we compare the detection rates using different VLMs and explore the impact of VLM fine-tuning on our pipeline. The results demonstrate our method’s high performance in error detection and data cleaning experiments, indicating its potential to significantly improve the reliability and accuracy of large-scale production datasets in autonomous driving.

[56] InterpIoU: Rethinking Bounding Box Regression with Interpolation-Based IoU Optimization cs.CVPDF

Haoyuan Liu, Hiroshi Watanabe

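TL;DR: The paper proposes InterpIoU, a new bounding box regression loss that optimizes the IoU loss through interpolation, addressing the poor small-object detection and bounding box enlargement caused by geometric penalties in existing methods.

Details

Motivation: Existing IoU-based bounding box regression losses often rely on handcrafted geometric penalties to address IoU's non-differentiability in non-overlapping cases, but these penalties are sensitive to box shape, size, and distribution, which easily leads to poor small-object detection and bounding box enlargement.

Result: On COCO, VisDrone, and PASCAL VOC, both InterpIoU and Dynamic InterpIoU outperform existing IoU losses, with especially strong results on small-object detection.

Insight: IoU itself is an ideal regression target, and handcrafted geometric penalties are unnecessary and suboptimal; interpolation-based optimization handles IoU's non-differentiability more naturally and avoids the side effects of misalignment.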

Abstract: Bounding box regression (BBR) is fundamental to object detection, where the regression loss is crucial for accurate localization. Existing IoU-based losses often incorporate handcrafted geometric penalties to address IoU’s non-differentiability in non-overlapping cases and enhance BBR performance. However, these penalties are sensitive to box shape, size, and distribution, often leading to suboptimal optimization for small objects and undesired behaviors such as bounding box enlargement due to misalignment with the IoU objective. To address these limitations, we propose InterpIoU, a novel loss function that replaces handcrafted geometric penalties with a term based on the IoU between interpolated boxes and the target. By using interpolated boxes to bridge the gap between predictions and ground truth, InterpIoU provides meaningful gradients in non-overlapping cases and inherently avoids the box enlargement issue caused by misaligned penalties. Simulation results further show that IoU itself serves as an ideal regression target, while existing geometric penalties are both unnecessary and suboptimal. Building on InterpIoU, we introduce Dynamic InterpIoU, which dynamically adjusts interpolation coefficients based on IoU values, enhancing adaptability to scenarios with diverse object distributions. Experiments on COCO, VisDrone, and PASCAL VOC show that our methods consistently outperform state-of-the-art IoU-based losses across various detection frameworks, with particularly notable improvements in small object detection, confirming their effectiveness.
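
Sketch: The idea of scoring an interpolated box against the target can be illustrated as below. This is a hedged sketch, not the paper's loss: the abstract does not give the exact formula, so the linear interpolation, the way the two IoU terms are summed, and the coefficient `alpha` (with its default) are assumptions. Boxes are `(x1, y1, x2, y2)` tuples.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def interpolate(pred, target, alpha):
    """Box lying between the prediction (alpha=0) and the target (alpha=1)."""
    return tuple((1.0 - alpha) * p + alpha * t for p, t in zip(pred, target))

def interp_iou_loss(pred, target, alpha=0.75):
    """Assumed form: plain IoU loss plus an IoU loss on the interpolated box.

    The second term stays informative when pred and target do not overlap,
    provided alpha is large enough for the interpolated box to reach the target.
    """
    bridge = interpolate(pred, target, alpha)
    return (1.0 - iou(pred, target)) + (1.0 - iou(bridge, target))
```

With `target = (2, 0, 3, 1)`, the non-overlapping predictions `(0, 0, 1, 1)` and `(1, 0, 2, 1)` both have zero IoU with the target, yet the closer one receives a lower loss in this sketch, which is the kind of signal the abstract attributes to interpolated boxes.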

[57] DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition cs.CVPDF

Hayat Ullah, Muhammad Ali Shafique, Abbas Khan, Arslan Munir

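TL;DR: This paper proposes DVFL-Net, a lightweight video focal modulation network that uses knowledge distillation and spatio-temporal feature modulation to achieve efficient spatio-temporal action recognition while maintaining high performance.

Details

Motivation: Although existing Transformer models perform well on spatio-temporal action recognition, they are computationally expensive, especially on dense video data. The paper aims to design a lightweight network that retains high performance and can be deployed efficiently on-device.

Result: Experiments on UCF50, UCF101, HMDB51, SSV2, and Kinetics-400 show that DVFL-Net achieves the best balance among memory usage, computation (GFLOPs), and accuracy, making it suitable for real-time applications.

Insight: Combining spatio-temporal focal modulation with knowledge distillation is an effective way to improve lightweight models, and forward KL divergence plays a key role in knowledge transfer.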

Abstract: The landscape of video recognition has evolved significantly, shifting from traditional Convolutional Neural Networks (CNNs) to Transformer-based architectures for improved accuracy. While 3D CNNs have been effective at capturing spatiotemporal dynamics, recent Transformer models leverage self-attention to model long-range spatial and temporal dependencies. Despite achieving state-of-the-art performance on major benchmarks, Transformers remain computationally expensive, particularly with dense video data. To address this, we propose a lightweight Video Focal Modulation Network, DVFL-Net, which distills spatiotemporal knowledge from a large pre-trained teacher into a compact nano student model, enabling efficient on-device deployment. DVFL-Net utilizes knowledge distillation and spatial-temporal feature modulation to significantly reduce computation while preserving high recognition performance. We employ forward Kullback-Leibler (KL) divergence alongside spatio-temporal focal modulation to effectively transfer both local and global context from the Video-FocalNet Base (teacher) to the proposed VFL-Net (student). We evaluate DVFL-Net on UCF50, UCF101, HMDB51, SSV2, and Kinetics-400, benchmarking it against recent state-of-the-art methods in Human Action Recognition (HAR). Additionally, we conduct a detailed ablation study analyzing the impact of forward KL divergence. The results confirm the superiority of DVFL-Net in achieving an optimal balance between performance and efficiency, demonstrating lower memory usage, reduced GFLOPs, and strong accuracy, making it a practical solution for real-time HAR applications.
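
Sketch: Forward KL divergence for distillation, KL(teacher || student), can be written in a few lines. This is a generic illustration rather than the DVFL-Net training code: the temperature-scaled softmax, the `temperature` parameter, and the function names are assumptions not stated in the abstract.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def forward_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(p_teacher || p_student) = sum_i p_i * (log p_i - log q_i).

    The expectation is taken under the teacher distribution, so the student
    is penalized wherever it puts too little mass on classes the teacher favors.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0.0)
```

The value is zero when the two distributions match and positive otherwise; it is not symmetric in its arguments, which is why the direction (forward versus reverse) is a design choice worth ablating.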

[58] Describe Anything Model for Visual Question Answering on Text-rich Images cs.CV | cs.LGPDF

Yen-Linh Vu, Dinh-Thang Duong, Truong-Binh Duong, Anh-Khoi Nguyen, Thanh-Huy Nguyen

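TL;DR: The DAM-QA framework leverages the region-aware capability of the Describe Anything Model and aggregates answers from multiple regions to improve VQA on text-rich images, significantly outperforming the baseline model.

Details

Motivation: Existing vision-language models underperform on VQA for text-rich images; the region-aware DAM can generate detailed descriptions, which offers a possible route to solving text-related VQA problems.

Result: On six VQA benchmarks, DAM-QA significantly outperforms the baseline DAM, with a gain of 7+ points on DocVQA, while using fewer parameters.

Insight: Region-aware models have great potential for text-rich VQA, and efficient strategies for integrating regional information are key.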

Abstract: Recent progress has been made in region-aware vision-language modeling, particularly with the emergence of the Describe Anything Model (DAM). DAM is capable of generating detailed descriptions of any specific image areas or objects without the need for additional localized image-text alignment supervision. We hypothesize that such region-level descriptive capability is beneficial for the task of Visual Question Answering (VQA), especially in challenging scenarios involving images with dense text. In such settings, the fine-grained extraction of textual information is crucial to producing correct answers. Motivated by this, we introduce DAM-QA, a framework with a tailored evaluation protocol, developed to investigate and harness the region-aware capabilities from DAM for the text-rich VQA problem that requires reasoning over text-based information within images. DAM-QA incorporates a mechanism that aggregates answers from multiple regional views of image content, enabling more effective identification of evidence that may be tied to text-related elements. Experiments on six VQA benchmarks show that our approach consistently outperforms the baseline DAM, with a notable 7+ point gain on DocVQA. DAM-QA also achieves the best overall performance among region-aware models with fewer parameters, significantly narrowing the gap with strong generalist VLMs. These results highlight the potential of DAM-like models for text-rich and broader VQA tasks when paired with efficient usage and integration strategies. Our code is publicly available at https://github.com/Linvyl/DAM-QA.git.

[59] Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios cs.CVPDF

Van-Hoang-Anh Phan, Chi-Tam Nguyen, Doan-Trung Au, Thanh-Danh Phan, Minh-Thien Duong

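TL;DR: This paper proposes an efficient vision-based obstacle avoidance system that combines YOLOv11 object detection with monocular depth estimation models (such as Depth Anything V2) and uses a Frenet-Pure Pursuit planning strategy for safe autonomous vehicle navigation.

Details

Motivation: Autonomous vehicles need accurate perception and motion planning to stay safe in complex environments, and existing vision-based perception and obstacle avoidance methods still have efficiency and robustness problems.

Result: The system was validated in diverse campus scenarios and showed good obstacle avoidance and real-time performance.

Insight: Combining monocular depth estimation with object detection can effectively improve an autonomous vehicle's environment perception, but requires a trade-off between efficiency and robustness.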

Abstract: Obstacle avoidance is essential for ensuring the safety of autonomous vehicles. Accurate perception and motion planning are crucial to enabling vehicles to navigate complex environments while avoiding collisions. In this paper, we propose an efficient obstacle avoidance pipeline that leverages a camera-only perception module and a Frenet-Pure Pursuit-based planning strategy. By integrating advancements in computer vision, the system utilizes YOLOv11 for object detection and state-of-the-art monocular depth estimation models, such as Depth Anything V2, to estimate object distances. A comparative analysis of these models provides valuable insights into their accuracy, efficiency, and robustness in real-world conditions. The system is evaluated in diverse scenarios on a university campus, demonstrating its effectiveness in handling various obstacles and enhancing autonomous navigation. The video presenting the results of the obstacle avoidance experiments is available at: https://www.youtube.com/watch?v=FoXiO5S_tA8

[60] Mitigating Object Hallucinations via Sentence-Level Early Intervention cs.CVPDF

Shangpin Peng, Senqiao Yang, Li Jiang, Zhuotao Tian

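TL;DR: The paper proposes SENTINEL, a framework that reduces hallucinations in multimodal large language models through sentence-level early intervention. It generates preference pairs with an unsupervised method and trains the model with a context-aware preference loss (C-DPO); experiments show hallucinations reduced by over 90%.

Details

Motivation: Multimodal large language models (MLLMs) excel at cross-modal understanding but commonly hallucinate (generate content that contradicts the visual input). Existing methods are costly or introduce data distribution mismatches, and the authors find that hallucinations mainly emerge in the early stages of generation and then propagate.

Result: Experiments show that SENTINEL reduces hallucinations by over 90% compared with the original model and outperforms existing methods on both hallucination benchmarks and general capability benchmarks.

Insight: Hallucinations mainly originate in the early stages of generation; sentence-level intervention can effectively block their propagation, and unsupervised preference learning is an efficient mitigation route.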

Abstract: Multimodal large language models (MLLMs) have revolutionized cross-modal understanding but continue to struggle with hallucinations - fabricated content contradicting visual inputs. Existing hallucination mitigation methods either incur prohibitive computational costs or introduce distribution mismatches between training data and model outputs. We identify a critical insight: hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs. To address this, we propose SENTINEL (Sentence-level Early iNtervention Through IN-domain prEference Learning), a framework that eliminates dependency on human annotations. Specifically, we first bootstrap high-quality in-domain preference pairs by iteratively sampling model outputs, validating object existence through cross-checking with two open-vocabulary detectors, and classifying sentences into hallucinated/non-hallucinated categories. Subsequently, we use context-coherent positive samples and hallucinated negative samples to build context-aware preference data iteratively. Finally, we train models using a context-aware preference loss (C-DPO) that emphasizes discriminative learning at the sentence level where hallucinations initially manifest. Experimental results show that SENTINEL can reduce hallucinations by over 90% compared to the original model and outperforms the previous state-of-the-art method on both hallucination benchmarks and general capabilities benchmarks, demonstrating its superiority and generalization ability. The models, datasets, and code are available at https://github.com/pspdada/SENTINEL.

[61] SpatialTrackerV2: 3D Point Tracking Made Easy cs.CVPDF

Yuxi Xiao, Jianyuan Wang, Nan Xue, Nikita Karaev, Yuri Makarov

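TL;DR: SpatialTrackerV2 is a feed-forward 3D point tracking method for monocular video that learns geometry and motion jointly and significantly outperforms existing methods.

Details

Motivation: Existing 3D tracking methods mostly rely on modular pipelines and off-the-shelf components, which limits performance and adaptability to data.

Result: Validated on a range of datasets, it improves performance by 30% and runs 50 times faster than dynamic 3D reconstruction methods.

Insight: Jointly learning geometry and motion improves generalization and efficiency and suits heterogeneous data.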

Abstract: We present SpatialTrackerV2, a feed-forward 3D point tracking method for monocular videos. Going beyond modular pipelines built on off-the-shelf components for 3D tracking, our approach unifies the intrinsic connections between point tracking, monocular depth, and camera pose estimation into a high-performing and feedforward 3D point tracker. It decomposes world-space 3D motion into scene geometry, camera ego-motion, and pixel-wise object motion, with a fully differentiable and end-to-end architecture, allowing scalable training across a wide range of datasets, including synthetic sequences, posed RGB-D videos, and unlabeled in-the-wild footage. By learning geometry and motion jointly from such heterogeneous data, SpatialTrackerV2 outperforms existing 3D tracking methods by 30%, and matches the accuracy of leading dynamic 3D reconstruction approaches while running 50$\times$ faster.

[62] MMHU: A Massive-Scale Multimodal Benchmark for Human Behavior Understanding cs.CVPDF

Renjie Li, Ruijie Ye, Mingyang Wu, Hao Frank Yang, Zhiwen Fan

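TL;DR: The paper proposes MMHU, a large-scale multimodal benchmark dataset for human behavior understanding in autonomous driving, with rich annotations and multi-task evaluation.

Details

Motivation: Existing datasets do not analyze human behavior comprehensively, especially in autonomous driving scenarios, and there is no unified benchmark for evaluating multiple aspects of human behavior.

Result: MMHU provides comprehensive dataset analysis and multi-task evaluation (such as motion prediction and behavior question answering), giving the research community a strong tool.

Insight: Combining multimodal data with rich annotations is essential for understanding complex human behavior, especially in autonomous driving.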

Abstract: Humans are integral components of the transportation ecosystem, and understanding their behaviors is crucial to facilitating the development of safe driving systems. Although recent progress has explored various aspects of human behavior (such as motion, trajectories, and intention), a comprehensive benchmark for evaluating human behavior understanding in autonomous driving remains unavailable. In this work, we propose MMHU, a large-scale benchmark for human behavior analysis featuring rich annotations, such as human motion and trajectories, text description for human motions, human intention, and critical behavior labels relevant to driving safety. Our dataset encompasses 57k human motion clips and 1.73M frames gathered from diverse sources, including established driving datasets such as Waymo, in-the-wild videos from YouTube, and self-collected data. A human-in-the-loop annotation pipeline is developed to generate rich behavior captions. We provide a thorough dataset analysis and benchmark multiple tasks, ranging from motion prediction to motion generation and human behavior question answering, thereby offering a broad evaluation suite. Project page: https://MMHU-Benchmark.github.io.

[63] PhysX: Physical-Grounded 3D Asset Generation cs.CVPDF

Ziang Cao, Zhaoxi Chen, Linag Pan, Ziwei Liu

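TL;DR: The paper proposes PhysX, a physical-grounded 3D asset generation approach that addresses existing methods' neglect of physical properties; by building the PhysXNet dataset and the PhysXGen model, it achieves physics-driven 3D generation.

Details

Motivation: Existing 3D generation methods focus mainly on geometry and texture and neglect physical properties, which limits their use in areas such as simulation and embodied AI.

Result: Experiments validate PhysXGen's superior performance in physical prediction and geometry quality and demonstrate its generalization ability.

Insight: Physical properties are essential to the realism and usefulness of 3D generation; combining human-in-the-loop annotation with a dual-branch architecture is key to achieving physics-driven generation effectively.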

Abstract: 3D modeling is moving from virtual to physical. Existing 3D generation primarily emphasizes geometries and textures while neglecting physical-grounded modeling. Consequently, despite the rapid development of 3D generative models, the synthesized 3D assets often overlook rich and important physical properties, hampering their real-world application in physical domains like simulation and embodied AI. As an initial attempt to address this challenge, we propose PhysX, an end-to-end paradigm for physical-grounded 3D asset generation. 1) To bridge the critical gap in physics-annotated 3D datasets, we present PhysXNet, the first physics-grounded 3D dataset systematically annotated across five foundational dimensions: absolute scale, material, affordance, kinematics, and function description. In particular, we devise a scalable human-in-the-loop annotation pipeline based on vision-language models, which enables efficient creation of physics-first assets from raw 3D assets. 2) Furthermore, we propose PhysXGen, a feed-forward framework for physics-grounded image-to-3D asset generation, injecting physical knowledge into the pre-trained 3D structural space. Specifically, PhysXGen employs a dual-branch architecture to explicitly model the latent correlations between 3D structures and physical properties, thereby producing 3D assets with plausible physical predictions while preserving the native geometry quality. Extensive experiments validate the superior performance and promising generalization capability of our framework. All the code, data, and models will be released to facilitate future research in generative physical AI.

cs.CL [Back]

[64] MapIQ: Benchmarking Multimodal Large Language Models for Map Question Answering cs.CL | cs.AI | cs.CV | cs.LGPDF

Varun Srivastava, Fan Lei, Srija Mukhopadhyay, Vivek Gupta, Ross Maciejewski

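TL;DR: MapIQ is a new benchmark dataset for evaluating multimodal large language models (MLLMs) on map question answering (Map-VQA), covering three map types and six themes, with experiments analyzing model robustness and sensitivity to map design changes.

Details

Motivation: Existing Map-VQA research is largely limited to choropleth maps and covers a narrow range of themes and visual analytical tasks, so a more comprehensive benchmark is needed to evaluate MLLMs on map question answering.

Result: The experiments reveal performance differences among MLLMs on map question answering, their sensitivity to map design changes, and the extent to which they rely on internal geographic knowledge.

Insight: The study points to directions for improving Map-VQA performance, such as optimizing map design to reduce model reliance on specific visual features.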

Abstract: Recent advancements in multimodal large language models (MLLMs) have driven researchers to explore how well these models read data visualizations, e.g., bar charts, scatter plots. More recently, attention has shifted to visual question answering with maps (Map-VQA). However, Map-VQA research has primarily focused on choropleth maps, which cover only a limited range of thematic categories and visual analytical tasks. To address these gaps, we introduce MapIQ, a benchmark dataset comprising 14,706 question-answer pairs across three map types: choropleth maps, cartograms, and proportional symbol maps spanning topics from six distinct themes (e.g., housing, crime). We evaluate multiple MLLMs using six visual analytical tasks, comparing their performance against one another and a human baseline. An additional experiment examining the impact of map design changes (e.g., altered color schemes, modified legend designs, and removal of map elements) provides insights into the robustness and sensitivity of MLLMs, their reliance on internal geographic knowledge, and potential avenues for improving Map-VQA performance.

Guimin Hu, Yi Xin, Lijie Hu, Zhihong Zhu, Hasti Seifi

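TL;DR: The paper proposes a partitioner-guided modal learning framework (PgM) that uses a modal partitioner, a uni-modal learner, a paired-modal learner, and a uni-paired modal decoder to learn uni-modal and paired-modal features systematically, and demonstrates its effectiveness and transferability across several tasks.

Details

Motivation: Multimodal learning benefits from multimodal information, but existing methods do not adequately separate the learning of uni-modal and paired-modal features. The paper aims to learn both kinds of features more thoroughly through a partitioner-guided framework and to adapt flexibly to different downstream tasks.

Result: PgM performs well on four multimodal tasks, and its transferability to existing models is verified. Visualization reveals differences in the contributions of uni-modal and paired-modal features.

Insight: Partitioned learning captures multimodal features more systematically; the distribution and contribution of uni-modal and paired-modal features vary by task and modality, and flexibility is key to improving multimodal learning performance.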

Abstract: Multimodal learning benefits from information in multiple modalities, and each learned modal representation can be divided into uni-modal features, which can be learned from uni-modal training, and paired-modal features, which can be learned from cross-modal interaction. Building on this perspective, we propose a partitioner-guided modal learning framework, PgM, which consists of the modal partitioner, uni-modal learner, paired-modal learner, and uni-paired modal decoder. The modal partitioner segments the learned modal representation into uni-modal and paired-modal features. The modal learner incorporates two dedicated components for uni-modal and paired-modal learning. The uni-paired modal decoder reconstructs the modal representation based on uni-modal and paired-modal features. PgM offers three key benefits: 1) thorough learning of uni-modal and paired-modal features, 2) flexible distribution adjustment for uni-modal and paired-modal representations to suit diverse downstream tasks, and 3) different learning rates across modalities and partitions. Extensive experiments demonstrate the effectiveness of PgM across four multimodal tasks and further highlight its transferability to existing models. Additionally, we visualize the distribution of uni-modal and paired-modal features across modalities and tasks, offering insights into their respective contributions.

[66] ExpliCIT-QA: Explainable Code-Based Image Table Question Answering cs.CL | cs.AIPDF

Maximiliano Hormazábal Lagos, Álvaro Bueno Sáez, Pedro Alonso Doval, Jorge Alcalde Vesteiro, Héctor Cerezo-Costas

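TL;DR: ExpliCIT-QA is an explainable table-image question answering system built on the MRT approach; it achieves transparency through a modular design and improves explainability through language-based reasoning and code generation.

Details

Motivation: To address the lack of explainability in existing end-to-end TableVQA systems, especially in sensitive domains such as finance and healthcare where results must be auditable.

Result: The system shows improved interpretability and transparency on the TableVQA-Bench benchmark.

Insight: Modular design and visualization of intermediate results fill the explainability gap in TableVQA systems and suit domains that require auditing.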

Abstract: We present ExpliCIT-QA, a system that extends our previous MRT approach for tabular question answering into a multimodal pipeline capable of handling complex table images and providing explainable answers. ExpliCIT-QA follows a modular design, consisting of: (1) Multimodal Table Understanding, which uses a Chain-of-Thought approach to extract and transform content from table images; (2) Language-based Reasoning, where a step-by-step explanation in natural language is generated to solve the problem; (3) Automatic Code Generation, where Python/Pandas scripts are created based on the reasoning steps, with feedback for handling errors; (4) Code Execution to compute the final answer; and (5) Natural Language Explanation that describes how the answer was computed. The system is built for transparency and auditability: all intermediate outputs, parsed tables, reasoning steps, generated code, and final answers are available for inspection. This strategy works towards closing the explainability gap in end-to-end TableVQA systems. We evaluated ExpliCIT-QA on the TableVQA-Bench benchmark, comparing it with existing baselines. We demonstrated improvements in interpretability and transparency, which open the door for applications in sensitive domains like finance and healthcare where auditing results are critical.

[67] CRABS: A syntactic-semantic pincer strategy for bounding LLM interpretation of Python notebooks cs.CL | cs.AIPDF

Meng Li, Timothy M. McPhillips, Dingmin Wang, Shin-Rong Tsai, Bertram Ludäscher

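TL;DR: The paper proposes CRABS, a strategy that combines shallow syntactic analysis with an LLM to address the errors LLMs make when interpreting Python notebooks because of hallucinations and long-context challenges. By capturing and parsing the notebook's syntactic structure, the method assists the LLM in cell-by-cell zero-shot learning and significantly improves the accuracy of notebook information flows and the identification of cell execution dependencies.

Details

Motivation: Python notebooks are widely used in data science and machine learning, but re-executing them is often infeasible because of data and software dependency problems. Although pre-trained LLMs understand code well, they still suffer from hallucinations and weak long-context understanding on real notebooks, so a more reliable way to understand a notebook's information flows and execution dependencies is needed.

Result: In an evaluation on 50 Kaggle notebooks, CRABS achieves average F1 scores of 98% for cell-to-cell information flows and 99% for cell execution dependencies. The LLM correctly resolved 98% of the remaining ambiguities (1397 of 1425).

Insight: A two-pronged strategy combining syntactic analysis with an LLM can significantly improve notebook understanding. Shallow syntactic analysis provides bounding constraints, while the LLM fills the fine-grained semantic gaps, yielding a reliable tool for reusing and extending notebooks.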

Abstract: Recognizing the information flows and operations comprising data science and machine learning Python notebooks is critical for evaluating, reusing, and adapting notebooks for new tasks. Investigating a notebook via re-execution often is impractical due to the challenges of resolving data and software dependencies. While Large Language Models (LLMs) pre-trained on large codebases have demonstrated effectiveness in understanding code without running it, we observe that they fail to understand some realistic notebooks due to hallucinations and long-context challenges. To address these issues, we propose a notebook understanding task yielding an information flow graph and corresponding cell execution dependency graph for a notebook, and demonstrate the effectiveness of a pincer strategy that uses limited syntactic analysis to assist full comprehension of the notebook using an LLM. Our Capture and Resolve Assisted Bounding Strategy (CRABS) employs shallow syntactic parsing and analysis of the abstract syntax tree (AST) to capture the correct interpretation of a notebook between lower and upper estimates of the inter-cell I/O sets, then uses an LLM to resolve remaining ambiguities via cell-by-cell zero-shot learning, thereby identifying the true data inputs and outputs of each cell. We evaluate and demonstrate the effectiveness of our approach using an annotated dataset of 50 representative, highly up-voted Kaggle notebooks that together represent 3454 actual cell inputs and outputs. The LLM correctly resolves 1397 of 1425 (98%) ambiguities left by analyzing the syntactic structure of these notebooks. Across 50 notebooks, CRABS achieves average F1 scores of 98% identifying cell-to-cell information flows and 99% identifying transitive cell execution dependencies.
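
Sketch: The syntactic half of such a pincer strategy can be approximated with Python's standard `ast` module. The code below is only an illustration of shallow AST analysis producing a loose upper estimate of each cell's inputs and outputs; it is not the CRABS implementation, it omits the lower estimate and the LLM resolution step entirely, and the function names and the "last writer wins" flow rule are assumptions.

```python
import ast
import builtins

BUILTIN_NAMES = set(dir(builtins))

def cell_io(source):
    """Return (candidate_inputs, candidate_outputs) name sets for one cell.

    Deliberately loose: names such as function parameters or comprehension
    variables may appear as inputs; a later step would have to rule them out.
    """
    defined, inputs = set(), set()
    for stmt in ast.parse(source).body:  # parse only, nothing is executed
        loads, stores = set(), set()
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name):
                # Load context means the name is read; Store/Del means bound.
                (loads if isinstance(node.ctx, ast.Load) else stores).add(node.id)
            elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
                stores.add(node.name)
            elif isinstance(node, ast.alias):  # import x.y as z binds z (or x)
                stores.add((node.asname or node.name).split(".")[0])
        inputs |= loads - defined - BUILTIN_NAMES
        defined |= stores
    return inputs, defined

def information_flows(cells):
    """Link each candidate input to the most recent earlier cell that bound it."""
    last_writer, flows = {}, set()
    for index, source in enumerate(cells):
        inputs, outputs = cell_io(source)
        for name in inputs:
            if name in last_writer:
                flows.add((last_writer[name], index, name))
        for name in outputs:
            last_writer[name] = index
    return flows

cells = [
    "import pandas as pd\ndf = pd.read_csv('train.csv')",
    "clean = df.dropna()",
    "print(len(clean))",
]
print(information_flows(cells))
```

For the three example cells this yields a flow of `df` from cell 0 to cell 1 and of `clean` from cell 1 to cell 2. Because the cells are only parsed, the missing `train.csv` and the pandas dependency do not matter, which mirrors the abstract's point about avoiding re-execution.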

[68] AI Wizards at CheckThat! 2025: Enhancing Transformer-Based Embeddings with Sentiment for Subjectivity Detection in News Articles cs.CL | cs.IRPDF

Matteo Fasulo, Luca Babboni, Luca Tedeschini

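TL;DR: AI Wizards propose enhancing Transformer-based classifiers with integrated sentiment scores for subjectivity detection in news articles. The approach performs well in multilingual and zero-shot settings and ranked first on the Greek task.

Details

Motivation: Subjectivity detection in news articles is an important task, but existing methods generalize poorly in multilingual and zero-shot settings. Sentiment information may help distinguish subjective from objective sentences.

Result: Sentiment features significantly improve performance, especially the subjective F1 score. The system achieved the top result on the Greek task with a Macro F1 of 0.51.

Insight: Sentiment information can effectively improve subjectivity detection, especially in multilingual and zero-shot settings.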

Abstract: This paper presents AI Wizards’ participation in the CLEF 2025 CheckThat! Lab Task 1: Subjectivity Detection in News Articles, classifying sentences as subjective/objective in monolingual, multilingual, and zero-shot settings. Training/development datasets were provided for Arabic, German, English, Italian, and Bulgarian; final evaluation included additional unseen languages (e.g., Greek, Romanian, Polish, Ukrainian) to assess generalization. Our primary strategy enhanced transformer-based classifiers by integrating sentiment scores, derived from an auxiliary model, with sentence representations, aiming to improve upon standard fine-tuning. We explored this sentiment-augmented architecture with mDeBERTaV3-base, ModernBERT-base (English), and Llama3.2-1B. To address class imbalance, prevalent across languages, we employed decision threshold calibration optimized on the development set. Our experiments show sentiment feature integration significantly boosts performance, especially subjective F1 score. This framework led to high rankings, notably 1st for Greek (Macro F1 = 0.51).
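
Sketch: Decision threshold calibration on a development set can be illustrated as a simple grid search. This is a minimal sketch under assumptions, not the team's code: the abstract does not say which metric or search procedure was used, so optimizing macro F1 over a fixed grid, the grid itself, and all names are hypothetical. Label 1 stands for "subjective" and `probs` are predicted probabilities of that class.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0 when the class is never predicted or present."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores for classes 0 and 1."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)

def calibrate_threshold(probs, labels, grid=None):
    """Return (threshold, score) maximizing macro F1 on the development set."""
    grid = grid or [i / 100 for i in range(1, 100)]
    best_threshold, best_score = 0.5, -1.0
    for threshold in grid:
        preds = [1 if p >= threshold else 0 for p in probs]
        score = macro_f1(labels, preds)
        if score > best_score:  # strict comparison keeps the lowest best threshold
            best_threshold, best_score = threshold, score
    return best_threshold, best_score
```

On an imbalanced or poorly calibrated classifier the default cut-off of 0.5 can under-predict the minority class; the tuned threshold is then reused unchanged at test time.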

[69] DualReward: A Dynamic Reinforcement Learning Framework for Cloze Tests Distractor Generation cs.CLPDF

Tianyou Huang, Xinglu Chen, Jingshen Zhang, Xinying Qiu, Ruiying Niu

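TL;DR: DualReward proposes a new reinforcement learning framework for generating distractors in cloze tests, improving distractor quality by dynamically adjusting reward signal intensity.

Details

Motivation: Conventional cloze distractor generation methods rely mostly on supervised learning or static generative models and lack dynamic optimization of distractor diversity and quality. DualReward aims to address this with a dynamic reward mechanism based on reinforcement learning.

Result: It outperforms baselines on both the CLOTH-F and MCQ datasets, with substantial gains on cross-domain data (MCQ), where P@1 improves by 3.48-3.86%.

Insight: The dynamic reward mechanism performs better on diverse data, indicating its potential for complex tasks; the framework is flexible and suited to practical needs.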

Abstract: This paper introduces DualReward, a novel reinforcement learning framework for automatic distractor generation in cloze tests. Unlike conventional approaches that rely primarily on supervised learning or static generative models, our method employs a dual reward structure with adaptive scaling that differentiates between human-created gold standard distractors and model-generated candidates. The framework dynamically adjusts reward signal intensity based on model performance and confidence. We evaluate our approach on both passage-level (CLOTH-F) and sentence-level (MCQ) cloze test datasets, demonstrating consistent improvements over state-of-the-art baselines. Experimental results show that our adaptive reward scaling mechanism provides modest but consistent benefits on homogeneous datasets (CLOTH-F) and more substantial improvements (3.48-3.86% in P@1) on diverse, cross-domain data (MCQ), suggesting its particular effectiveness for handling varied question types and domains. Our work offers a flexible framework that effectively balances learning from reliable human examples while exploring novel, high-quality distractors for automated test generation.

[70] A Survey of Deep Learning for Geometry Problem Solving cs.CL | cs.AI | cs.CV | cs.LGPDF

Jianzhe Ma, Wenxuan Wang, Qin Jin

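TL;DR: This paper surveys the application of deep learning to geometry problem solving, covering a summary of tasks, a review of deep learning methods, an analysis of evaluation metrics, and a discussion of current challenges and future directions, with the aim of providing a comprehensive reference for research in the field.

Details

Motivation: Geometry problem solving is a key area of mathematical reasoning with broad use in education and in assessing the abilities of AI systems. With advances in deep learning, especially the rise of multimodal large language models, studying how to apply deep learning to geometry problems has become particularly important.

Result: The paper provides a continuously updated list of resources on GitHub (https://github.com/majianz/dl4gps) as a practical reference for researchers.

Insight: The rise of multimodal large language models opens new possibilities for geometry problem solving, but combining the rigor of geometric reasoning with the flexibility of deep learning remains a focus for future research.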

Abstract: Geometry problem solving is a key area of mathematical reasoning, which is widely involved in many important fields such as education, mathematical ability assessment of artificial intelligence, and multimodal ability assessment. In recent years, the rapid development of deep learning technology, especially the rise of multimodal large language models, has triggered a widespread research boom. This paper provides a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of the current challenges and future directions that can be explored. Our goal is to provide a comprehensive and practical reference of deep learning for geometry problem solving to promote further developments in this field. We create a continuously updated list of papers on GitHub: https://github.com/majianz/dl4gps.

[71] POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering cs.CL | cs.AI | cs.CV | cs.MMPDF

Yichen Xu, Liangyu Chen, Liang Zhang, Wenxuan Wang, Qin Jin

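TL;DR: PolyChartQA is a large-scale multilingual chart question answering benchmark covering 22,606 charts and 26,151 question-answer pairs in 10 languages, intended to advance globally inclusive vision-language models.

Details

Motivation: Existing chart understanding benchmarks focus mainly on English, limiting their accessibility and applicability to global audiences.

Result: Experiments show a significant performance gap for existing vision-language models between English and low-resource languages with non-Latin scripts.

Insight: PolyChartQA provides a tool for systematic evaluation of multilingual chart understanding and reveals model limitations on non-English languages.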

Abstract: Charts are a universally adopted medium for interpreting and communicating data. However, existing chart understanding benchmarks are predominantly English-centric, limiting their accessibility and applicability to global audiences. In this paper, we present PolyChartQA, the first large-scale multilingual chart question answering benchmark covering 22,606 charts and 26,151 question-answering pairs across 10 diverse languages. PolyChartQA is built using a decoupled pipeline that separates chart data from rendering code, allowing multilingual charts to be flexibly generated by simply translating the data and reusing the code. We leverage state-of-the-art LLM-based translation and enforce rigorous quality control in the pipeline to ensure the linguistic and semantic consistency of the generated multilingual charts. PolyChartQA facilitates systematic evaluation of multilingual chart understanding. Experiments on both open- and closed-source large vision-language models reveal a significant performance gap between English and other languages, especially low-resource ones with non-Latin scripts. This benchmark lays a foundation for advancing globally inclusive vision-language models.

[72] The benefits of query-based KGQA systems for complex and temporal questions in LLM era cs.CL | cs.LGPDF

Artem Alekseev, Mikhail Chaichuk, Miron Butko, Alexander Panchenko, Elena Tutubalina

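TL;DR: The paper argues that, in the era of large language models (LLMs), query-based knowledge graph question answering (KGQA) systems retain advantages on complex, multi-hop questions. It proposes a multi-stage query generation framework that significantly improves performance on multi-hop and temporal questions.

Details

Motivation: Although LLMs perform well on question answering, they still fall short on multi-hop reasoning and temporal questions. Query-based KGQA systems offer a modular alternative that generates executable queries rather than direct answers.

Result: Experiments show that the framework significantly improves performance on multi-hop and temporal questions, demonstrating the potential of query-based KGQA systems built on small language models.

Insight: A query-based, multi-stage KGQA framework is an effective approach to complex and temporal questions, and is particularly suited to resource-constrained small language models.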

Abstract: Large language models excel in question-answering (QA) yet still struggle with multi-hop reasoning and temporal questions. Query-based knowledge graph QA (KGQA) offers a modular alternative by generating executable queries instead of direct answers. We explore a multi-stage query-based framework for WikiData QA, proposing a multi-stage approach that enhances performance on challenging multi-hop and temporal benchmarks. Through generalization and rejection studies, we evaluate robustness across multi-hop and temporal QA datasets. Additionally, we introduce a novel entity linking and predicate matching method using CoT reasoning. Our results demonstrate the potential of a query-based multi-stage KGQA framework for improving multi-hop and temporal QA with small language models. Code and data: https://github.com/ar2max/NLDB-KGQA-System

[73] Improving Data and Parameter Efficiency of Neural Language Models Using Representation Analysis cs.CLPDF

Josip Jukić

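TL;DR: The thesis examines how representation analysis and optimization techniques can improve the data and parameter efficiency of neural language models. It proposes approaches based on representation smoothness, including regularization strategies that use Jacobian and Hessian matrices to stabilize training, and methods that combine active learning with parameter-efficient fine-tuning. Experiments show that these methods substantially outperform traditional approaches in performance and efficiency.

Details

Motivation: To address data and parameter efficiency challenges in neural language models and to improve model robustness and generalization.

Result: Experiments show that the proposed methods substantially outperform traditional methods in performance, stability, and efficiency, with particularly strong results in low-resource settings.

Insight: Representation smoothness and in-context learning are key techniques for improving model efficiency, especially in resource-limited or dynamic data environments.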

Abstract: This thesis addresses challenges related to data and parameter efficiency in neural language models, with a focus on representation analysis and the introduction of new optimization techniques. The first part examines the properties and dynamics of language representations within neural models, emphasizing their significance in enhancing robustness and generalization. It proposes innovative approaches based on representation smoothness, including regularization strategies that utilize Jacobian and Hessian matrices to stabilize training and mitigate sensitivity to input perturbations. The second part focuses on methods to significantly enhance data and parameter efficiency by integrating active learning strategies with parameter-efficient fine-tuning, guided by insights from representation smoothness analysis. It presents smoothness-informed early-stopping techniques designed to eliminate the need for labeled validation sets and proposes innovative combinations of active learning and parameter-efficient fine-tuning to reduce labeling efforts and computational resources. Extensive experimental evaluations across various NLP tasks demonstrate that these combined approaches substantially outperform traditional methods in terms of performance, stability, and efficiency. The third part explores weak supervision techniques enhanced by in-context learning to effectively utilize unlabeled data, further reducing dependence on extensive labeling. It shows that using in-context learning as a mechanism for weak supervision enables models to better generalize from limited labeled data by leveraging unlabeled examples more effectively during training. Comprehensive empirical evaluations confirm significant gains in model accuracy, adaptability, and robustness, especially in low-resource settings and dynamic data environments.

[74] Evaluating the Ability of Large Language Models to Reason about Cardinal Directions, Revisited cs.CLPDF

Anthony G Cohn, Robert E Blackwell

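TL;DR: The paper studies the ability of 28 large language models (LLMs) to reason about cardinal directions (CDs), testing correctness on a benchmark generated from templates. Even the newer large reasoning models cannot reliably determine the correct CD for all questions.

Details

Motivation: To examine how well LLMs reason about cardinal directions, fill a gap in this area, and assess model reliability.

Result: Even the newer large reasoning models cannot reliably determine the correct CD for all questions.

Insight: LLMs remain limited in cardinal direction reasoning and need further improvement.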

Abstract: We investigate the abilities of 28 Large Language Models (LLMs) to reason about cardinal directions (CDs) using a benchmark generated from a set of templates, extensively testing an LLM’s ability to determine the correct CD given a particular scenario. The templates allow for a number of degrees of variation, such as the means of locomotion of the agent involved and whether the scenario is set in the first, second or third person. Even the newer Large Reasoning Models are unable to reliably determine the correct CD for all questions. This paper summarises and extends earlier work presented at COSIT-24.

[75] Findings of MEGA: Maths Explanation with LLMs using the Socratic Method for Active Learning cs.CLPDF

Tosin Adewumi, Foteini Simistira Liwicki, Marcus Liwicki, Viktor Gardelli, Lama Alkhaled

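TL;DR: MEGA combines the Socratic method, chain-of-thought reasoning, simplified gamification, and formative feedback, delivered through large language models, to improve university students' mathematics learning. Results show that MEGA outperforms the traditional step-by-step method.

Details

Motivation: Some students avoid mathematics-related subjects because they struggle with mathematics, and traditional teaching methods are often ineffective. The study aims to improve the mathematics learning experience with the MEGA method.

Result: MEGA outperforms the traditional method on both the GSM8K and MATH datasets, with a larger margin on the more difficult MATH dataset (47.5% vs. 26.67%).

Insight: MEGA is particularly well suited to explaining difficult mathematics problems; its combination of strategies improves learning and suggests a new direction for mathematics education.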

Abstract: This paper presents an intervention study on the effects of the combined methods of (1) the Socratic method, (2) Chain of Thought (CoT) reasoning, (3) simplified gamification and (4) formative feedback on university students’ Maths learning driven by large language models (LLMs). We call our approach Mathematics Explanations through Games by AI LLMs (MEGA). Some students struggle with Maths and as a result avoid Maths-related disciplines or subjects despite the importance of Maths across many fields, including signal processing. Oftentimes, students’ Maths difficulties stem from suboptimal pedagogy. We compared the MEGA method to the traditional step-by-step (CoT) method to ascertain which is better, using a within-group design after randomly assigning questions to the participants, who are university students. Samples (n=60) were randomly drawn from each of the two test sets of the Grade School Math 8K (GSM8K) and Mathematics Aptitude Test of Heuristics (MATH) datasets, based on the error margin of 11%, the confidence level of 90%, and a manageable number of samples for the student evaluators. These samples were used to evaluate two capable LLMs at length (Generative Pretrained Transformer 4o (GPT4o) and Claude 3.5 Sonnet) out of the initial six that were tested for capability. The results showed that, for both datasets, students agreed in more instances that the MEGA method is better for learning. It is rated much better than CoT (47.5% compared to 26.67%) on the more difficult MATH dataset, indicating that MEGA is better at explaining difficult Maths problems.
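
The sample size quoted in this abstract can be sanity-checked with the standard formula for estimating a proportion, n = z^2 * p * (1 - p) / e^2. The sketch below is an illustration, not the authors' procedure: it assumes the worst-case proportion p = 0.5 and no finite-population correction, neither of which the abstract states. Under those assumptions, an 11% margin at 90% confidence calls for about 56 samples, which is in line with the n=60 the authors describe as a manageable number.

```python
import math
from statistics import NormalDist


def sample_size(margin: float, confidence: float, p: float = 0.5) -> int:
    """Minimum sample size for estimating a proportion (infinite population).

    p = 0.5 is an assumed worst case; the abstract does not state the value used.
    """
    # Two-sided critical value, roughly 1.645 for 90% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z * z * p * (1 - p) / (margin * margin))


if __name__ == "__main__":
    print(sample_size(margin=0.11, confidence=0.90))
```

Tightening the margin to 5% at 95% confidence gives the familiar 385 under the same assumptions, which shows how quickly the evaluation burden grows for human raters.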

Payal Bhattad, Sai Manoj Pudukotai Dinakarrao, Anju Gupta

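TL;DR: The paper proposes an evaluation framework for large language model (LLM) based text augmentation, comprising Scalability Analysis and Iterative Augmentation with Summarization Refinement (IASR), to address the problem of balancing semantic consistency and diversity during augmentation.

Details

Motivation: Existing text augmentation techniques lack adequate mechanisms for semantic preservation, which leads to redundancy and instability; a structured evaluation framework is needed.

Result: In a BERTopic task with GPT-enhanced few-shot labeling, topic granularity increased by 400% and topic overlaps were completely eliminated.

Insight: A structured evaluation framework can substantially improve the effectiveness of LLM-based augmentation in practical NLP pipelines, particularly in balancing semantic consistency and diversity.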

Abstract: Text data augmentation is a widely used strategy for mitigating data sparsity in natural language processing (NLP), particularly in low-resource settings where limited samples hinder effective semantic modeling. While augmentation can improve input diversity and downstream interpretability, existing techniques often lack mechanisms to ensure semantic preservation during large-scale or iterative generation, leading to redundancy and instability. This work introduces a principled evaluation framework for large language model (LLM) based text augmentation, comprising two components: (1) Scalability Analysis, which measures semantic consistency as augmentation volume increases, and (2) Iterative Augmentation with Summarization Refinement (IASR), which evaluates semantic drift across recursive paraphrasing cycles. Empirical evaluations across state-of-the-art LLMs show that GPT-3.5 Turbo achieves the best balance of semantic fidelity, diversity, and generation efficiency. Applied to a real-world topic modeling task using BERTopic with GPT-enhanced few-shot labeling, the proposed approach results in a 400% increase in topic granularity and complete elimination of topic overlaps. These findings validate the utility of the proposed framework for structured evaluation of LLM-based augmentation in practical NLP pipelines.

[77] Overview of the Sensemaking Task at the ELOQUENT 2025 Lab: LLMs as Teachers, Students and Evaluators cs.CL | I.2.7PDF

Pavel Šindelář, Ondřej Bojar

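TL;DR: The paper presents the Sensemaking task of the ELOQUENT 2025 lab, which evaluates generative language models in three steps: question generation, question answering, and answer evaluation. The experiments use multilingual materials, with participating Teacher, Student, and Evaluator systems, and reveal limitations of current methods.

Details

Motivation: To develop easily testable, high-level criteria for evaluating generative language models, particularly their ability to understand a text and generate content grounded in it.

Result: Question generation proved difficult to evaluate. In question answering, LLMs performed acceptably overall but had trouble restricting their answers to the given input texts. In evaluation, LLM-as-a-Judge systems wrongly rated garbled question-answer pairs and answers to mixed-up questions as acceptable.

Insight: Current LLMs show clear weaknesses in text understanding and evaluation; assessing generated questions and keeping answers strictly grounded in the input both need further work.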

Abstract: ELOQUENT is a set of shared tasks that aims to create easily testable high-level criteria for evaluating generative language models. Sensemaking is one such shared task. In Sensemaking, we try to assess how well generative models “make sense out of a given text” in three steps inspired by exams in a classroom setting: (1) Teacher systems should prepare a set of questions, (2) Student systems should answer these questions, and (3) Evaluator systems should score these answers, all adhering rather strictly to a given set of input materials. We report on the 2025 edition of Sensemaking, where we had 7 sources of test materials (fact-checking analyses of statements, textbooks, transcribed recordings of a lecture, and educational videos) spanning English, German, Ukrainian, and Czech languages. This year, 4 teams participated, providing us with 2 Teacher submissions, 2 Student submissions, and 2 Evaluator submissions. We added baselines for Teacher and Student using commercial large language model systems. We devised a fully automatic evaluation procedure, which we compare to a minimalistic manual evaluation. We were able to make some interesting observations. For the first task, the creation of questions, better evaluation strategies will still have to be devised because it is difficult to discern the quality of the various candidate question sets. In the second task, question answering, the LLMs examined overall perform acceptably, but restricting their answers to the given input texts remains problematic. In the third task, evaluation of question answers, our adversarial tests reveal that systems using the LLM-as-a-Judge paradigm erroneously rate both garbled question-answer pairs and answers to mixed-up questions as acceptable.

[78] Toward a Behavioural Translation Style Space: Simulating the Temporal Dynamics of Affect, Behaviour, and Cognition in Human Translation Production cs.CLPDF

Michael Carl, Takanori Mizowaki, Aishvarya Ray, Masaru Yamada, Devi Sri Bandaru

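TL;DR: The paper proposes a Behavioural Translation Style Space (BTSS) that describes possible behavioural translation patterns, and uses a computational translation agent to simulate the temporal dynamics of affect, automatized behaviour, and cognition during translation.

Details

Motivation: To study the higher-order cognitive and affective processes behind translation behaviour, using gaze and keystroke data to reveal the hidden structure of mental processing and thereby better understand the dynamics of translation behaviour.

Result: The BTSS captures the complex dynamics of translation behaviour and provides a new framework for simulating human translation behaviour.

Insight: Translation behaviour is shaped not only by physical actions but also by cognition and affect; the BTSS offers a systematic description of this complex process.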

Abstract: The paper introduces a Behavioural Translation Style Space (BTSS) that describes possible behavioural translation patterns. The suggested BTSS is organized as a hierarchical structure that entails various embedded processing layers. We posit that observable translation behaviour - i.e., eye and finger movements - is fundamental when executing the physical act of translation but it is caused and shaped by higher-order cognitive processes and affective translation states. We analyse records of keystrokes and gaze data as indicators of the hidden mental processing structure and organize the behavioural patterns as a multi-layered embedded BTSS. The BTSS serves as the basis for a computational translation agent to simulate the temporal dynamics of affect, automatized behaviour and cognition during human translation production.

[79] Towards few-shot isolated word reading assessment cs.CL | eess.ASPDF

Reuben Smit, Retief Louw, Herman Kamper

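TL;DR: The paper proposes a few-shot isolated word reading assessment method based on self-supervised learning (SSL) models. It performs well on adult data, but performance drops substantially on child speech.

Details

Motivation: To assess isolated word reading in low-resource settings without relying on automatic speech recognition (ASR), and in particular to examine how well such a method works for child speech.

Result: The system performs reasonably on adult data, but performance drops substantially for child speech input, even when child templates are used.

Insight: The work exposes the limitations of SSL representations for processing child speech in a few-shot classification system and highlights the need for further improvement.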

Abstract: We explore an ASR-free method for isolated word reading assessment in low-resource settings. Our few-shot approach compares input child speech to a small set of adult-provided reference templates. Inputs and templates are encoded using intermediate layers from large self-supervised learned (SSL) models. Using an Afrikaans child speech benchmark, we investigate design options such as discretising SSL features and barycentre averaging of the templates. Idealised experiments show reasonable performance for adults, but a substantial drop for child speech input, even with child templates. Despite the success of employing SSL representations in low-resource speech tasks, our work highlights the limitations of SSL representations for processing child data when used in a few-shot classification system.
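
The abstract does not say how an input is compared to the reference templates. Its mention of barycentre averaging suggests dynamic time warping (DTW), so the sketch below assumes DTW over frame-level feature sequences followed by nearest-template classification. The function names are hypothetical, and the arrays stand in for SSL features; this is an illustration of template matching in general, not the authors' system.

```python
import numpy as np


def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Length-normalised DTW cost between feature sequences of shape (T, d)."""
    ta, tb = len(a), len(b)
    acc = np.full((ta + 1, tb + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            frame_cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed warping steps.
            acc[i, j] = frame_cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[ta, tb] / (ta + tb))


def classify(query: np.ndarray, templates: dict[str, list[np.ndarray]]) -> str:
    """Return the word whose closest template has the lowest DTW cost."""
    return min(templates, key=lambda w: min(dtw_distance(query, t) for t in templates[w]))
```

Because DTW absorbs differences in speaking rate, a query that simply holds one frame longer than a template still matches it at zero cost. Differences in the features themselves, such as those between adult and child speech, are not absorbed this way.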

Meysam Alizadeh, Fabrizio Gilardi, Zeynab Samei, Mohsen Mosleh

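TL;DR: The study shows that LLMs with web browsing capabilities can infer the demographic attributes of social media users from their usernames, with a potential risk of gender and political bias. It recommends restricting this capability in public-facing applications while preserving access for research.

Details

Motivation: Traditional LLMs rely on static data, whereas LLMs with web browsing capabilities can retrieve information in real time. The study explores whether such LLMs can infer user attributes from social media usernames.

Result: LLMs can predict user demographics with reasonable accuracy, but may introduce gender and political biases against accounts with minimal activity.

Insight: This capability is useful for computational social science but can also be misused; access should be restricted in public-facing applications while being preserved for research.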

Abstract: Large language models (LLMs) have traditionally relied on static training data, limiting their knowledge to fixed snapshots. Recent advancements, however, have equipped LLMs with web browsing capabilities, enabling real-time information retrieval and multi-step reasoning over live web content. While prior studies have demonstrated LLMs’ ability to access and analyze websites, their capacity to directly retrieve and analyze social media data remains unexplored. Here, we evaluate whether web-browsing LLMs can infer demographic attributes of social media users given only their usernames. Using a synthetic dataset of 48 X (Twitter) accounts and a survey dataset of 1,384 international participants, we show that these models can access social media content and predict user demographics with reasonable accuracy. Analysis of the synthetic dataset further reveals how LLMs parse and interpret social media profiles, which may introduce gender and political biases against accounts with minimal activity. While this capability holds promise for computational social science in the post-API era, it also raises risks of misuse, particularly in information operations and targeted advertising, underscoring the need for safeguards. We recommend that LLM providers restrict this capability in public-facing applications, while preserving controlled access for verified research purposes.

[81] Probing for Arithmetic Errors in Language Models cs.CL | cs.AIPDF

Yucheng Sun, Alessandro Stolfo, Mrinmaya Sachan

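TL;DR: The paper studies how to detect arithmetic errors from a language model's internal activations. It finds that simple probes can decode the correct answer, and trains lightweight error detectors that guide selective re-prompting to improve task accuracy.

Details

Motivation: To explore whether a language model's internal activations can be used to detect arithmetic errors, providing a lightweight route to model self-correction.

Result: The error detectors exceed 90% accuracy, the probes generalize consistently to a more complex setting, and selective re-prompting improves task accuracy.

Insight: Arithmetic errors can be anticipated from internal activations, and probes offer a viable path toward lightweight self-correction.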

Abstract: We investigate whether internal activations in language models can be used to detect arithmetic errors. Starting with a controlled setting of 3-digit addition, we show that simple probes can accurately decode both the model’s predicted output and the correct answer from hidden states, regardless of whether the model’s output is correct. Building on this, we train lightweight error detectors that predict model correctness with over 90% accuracy. We then extend our analysis to structured chain-of-thought traces on addition-only GSM8K problems and find that probes trained on simple arithmetic generalize well to this more complex setting, revealing consistent internal representations. Finally, we demonstrate that these probes can guide selective re-prompting of erroneous reasoning steps, improving task accuracy with minimal disruption to correct outputs. Our findings suggest that arithmetic errors can be anticipated from internal activations alone, and that simple probes offer a viable path toward lightweight model self-correction.
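
To make the idea of a lightweight error detector concrete, the sketch below trains a logistic-regression probe with plain gradient descent. It is a minimal illustration under strong assumptions: the "hidden states" are synthetic Gaussian vectors in which correctness is linearly encoded by construction, and the probe architecture, dimensions, and hyperparameters are hypothetical rather than taken from the paper.

```python
import numpy as np


def train_probe(x: np.ndarray, y: np.ndarray, steps: int = 300, lr: float = 0.5):
    """Fit a logistic-regression probe (weights w, bias b) by gradient descent."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(x @ w + b, -30, 30)))
        grad = p - y  # derivative of the log-loss with respect to each logit
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(w: np.ndarray, b: float, x: np.ndarray, y: np.ndarray) -> float:
    """Fraction of examples where the probe's sign matches the label."""
    return float(((x @ w + b > 0) == (y > 0.5)).mean())


def demo(seed: int = 0) -> float:
    # Synthetic stand-in for hidden states; real activations would replace x.
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=16)
    x = rng.normal(size=(400, 16))
    y = (x @ direction > 0).astype(float)  # 1 = "model output correct"
    w, b = train_probe(x[:300], y[:300])
    return probe_accuracy(w, b, x[300:], y[300:])
```

A probe this small adds one dot product per prediction, which is why it can plausibly run alongside generation and trigger re-prompting only for steps it flags.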

[82] Advancing Retrieval-Augmented Generation for Structured Enterprise and Internal Data cs.CL | cs.AI | cs.CE | cs.IRPDF

Chandana Cheerla

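TL;DR: The paper proposes an improved retrieval-augmented generation (RAG) framework designed for structured enterprise data, using hybrid retrieval strategies and metadata filtering to substantially improve retrieval precision and generation quality.

Details

Motivation: Enterprises rely on proprietary data (such as HR records and tabular documents) for decision-making, but existing LLMs and conventional RAG are limited when handling heterogeneous structured data.

Result: In experiments, Precision@5 improved by 15 percent and Recall@5 by 13 percent, with markedly higher Faithfulness, Completeness, and Relevance scores for the generated responses.

Insight: Semantic chunking of structured data and the use of metadata are key to improving RAG; future work can extend to multimodal data and agent-based retrieval.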

Abstract: Organizations increasingly rely on proprietary enterprise data, including HR records, structured reports, and tabular documents, for critical decision-making. While Large Language Models (LLMs) have strong generative capabilities, they are limited by static pretraining, short context windows, and challenges in processing heterogeneous data formats. Conventional Retrieval-Augmented Generation (RAG) frameworks address some of these gaps but often struggle with structured and semi-structured data. This work proposes an advanced RAG framework that combines hybrid retrieval strategies using dense embeddings (all-mpnet-base-v2) and BM25, enhanced by metadata-aware filtering with SpaCy NER and cross-encoder reranking. The framework applies semantic chunking to maintain textual coherence and retains tabular data structures to preserve row-column integrity. Quantized indexing optimizes retrieval efficiency, while human-in-the-loop feedback and conversation memory improve adaptability. Experiments on enterprise datasets show notable improvements: Precision@5 increased by 15 percent (90 versus 75), Recall@5 by 13 percent (87 versus 74), and Mean Reciprocal Rank by 16 percent (0.85 versus 0.69). Qualitative evaluations show higher scores in Faithfulness (4.6 versus 3.0), Completeness (4.2 versus 2.5), and Relevance (4.5 versus 3.2) on a 5-point Likert scale. These results demonstrate the framework’s effectiveness in delivering accurate, comprehensive, and contextually relevant responses for enterprise tasks. Future work includes extending to multimodal data and integrating agent-based retrieval. The source code will be released at https://github.com/CheerlaChandana/Enterprise-Chatbot
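
The retrieval metrics quoted in this abstract have standard definitions, sketched below for reference. The document identifiers are made up, and the abstract does not describe how relevance judgments were obtained, so this shows only how the numbers are computed once a ranked list and a set of relevant documents are available.

```python
def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k retrieved documents that are relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant documents that appear in the top k."""
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant document over all queries."""
    total = 0.0
    for ranked, relevant in runs:
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(runs)
```

Note that the gains quoted in the abstract (for example 90 versus 75) read as differences between two scores rather than relative changes.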

[83] Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models cs.CL | cs.AI | cs.LGPDF

Yik Siu Chan, Zheng-Xin Yong, Stephen H. Bach

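TL;DR: The paper studies whether chains-of-thought (CoTs) can be used to predict the alignment of a language model's final output. It finds that linear probes on CoT activations outperform text-based methods and can predict unsafe outputs ahead of time.

Details

Motivation: Open-weights reasoning language models generate long chains-of-thought before producing a final response. This improves performance but introduces alignment risks, since harmful content can appear in the CoT or the final output. The authors therefore ask whether CoTs can be used to predict the alignment of the final response.

Result: (1) Linear probes perform best at predicting the safety of the final response, outperforming text-based methods; (2) the probes reach high predictive accuracy early in the reasoning process, which supports real-time monitoring and early intervention.

Insight: CoT text can be unfaithful and can mislead monitoring tools, whereas the model's latent representations (activations) provide a more reliable predictive signal, and lightweight probes can make real-time safety monitoring efficient.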

Abstract: Open-weights reasoning language models generate long chains-of-thought (CoTs) before producing a final response, which improves performance but introduces additional alignment risks, with harmful content often appearing in both the CoTs and the final outputs. In this work, we investigate if we can use CoTs to predict final response misalignment. We evaluate a range of monitoring approaches, including humans, highly-capable large language models, and text classifiers, using either CoT text or activations. First, we find that a simple linear probe trained on CoT activations can significantly outperform all text-based methods in predicting whether a final response will be safe or unsafe. CoT texts are often unfaithful and can mislead humans and classifiers, while model latents (i.e., CoT activations) offer a more reliable predictive signal. Second, the probe makes accurate predictions before reasoning completes, achieving strong performance even when applied to early CoT segments. These findings generalize across model sizes, families, and safety benchmarks, suggesting that lightweight probes could enable real-time safety monitoring and early intervention during generation.
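
Applying a probe to an early CoT segment can be pictured as pooling the activations of the first part of the trace and scoring the pooled vector. The sketch below assumes mean pooling over a token prefix and an already-trained probe (w, b); the abstract specifies neither, and both function names are hypothetical.

```python
import math

import numpy as np


def prefix_mean(activations: np.ndarray, fraction: float) -> np.ndarray:
    """Mean-pool the first `fraction` of CoT token activations, shape (T, d)."""
    n = max(1, math.ceil(fraction * len(activations)))
    return activations[:n].mean(axis=0)


def unsafe_probability(activations: np.ndarray, w: np.ndarray, b: float,
                       fraction: float = 1.0) -> float:
    """Score the pooled prefix with a linear probe and squash to a probability."""
    logit = float(prefix_mean(activations, fraction) @ w + b)
    return 1.0 / (1.0 + math.exp(-logit))
```

Calling `unsafe_probability` with increasing `fraction` values while the model is still generating would give a running estimate that a monitor could threshold for early intervention.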

eess.IV [Back]

[84] CompressedVQA-HDR: Generalized Full-reference and No-reference Quality Assessment Models for Compressed High Dynamic Range Videos eess.IV | cs.CVPDF

Wei Sun, Linhan Cao, Kang Fu, Dandan Zhu, Jun Jia

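TL;DR: CompressedVQA-HDR is a framework for quality assessment of compressed high dynamic range (HDR) videos. It uses the Swin Transformer and SigLip 2 as backbone networks for the full-reference (FR) and no-reference (NR) models, respectively, and improves performance through pre-training and mixed-dataset training strategies.

Details

Motivation: Existing compressed video quality assessment methods do not handle the diversity of HDR content well, so a more general framework is needed.

Result: Experiments show that the models outperform existing methods, and the FR model won first place in the FR track of the Generalizable HDR & SDR Video Quality Measurement Grand Challenge at IEEE ICME 2025.

Insight: Combining pre-training on SDR data with mixed-dataset training can substantially improve the generalization of HDR video quality assessment.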

Abstract: Video compression is a standard procedure applied to all videos to minimize storage and transmission demands while preserving visual quality as much as possible. Therefore, evaluating the visual quality of compressed videos is crucial for guiding the practical usage and further development of video compression algorithms. Although numerous compressed video quality assessment (VQA) methods have been proposed, they often lack the generalization capability needed to handle the increasing diversity of video types, particularly high dynamic range (HDR) content. In this paper, we introduce CompressedVQA-HDR, an effective VQA framework designed to address the challenges of HDR video quality assessment. Specifically, we adopt the Swin Transformer and SigLip 2 as the backbone networks for the proposed full-reference (FR) and no-reference (NR) VQA models, respectively. For the FR model, we compute deep structural and textural similarities between reference and distorted frames using intermediate-layer features extracted from the Swin Transformer as its quality-aware feature representation. For the NR model, we extract the global mean of the final-layer feature maps from SigLip 2 as its quality-aware representation. To mitigate the issue of limited HDR training data, we pre-train the FR model on a large-scale standard dynamic range (SDR) VQA dataset and fine-tune it on the HDRSDR-VQA dataset. For the NR model, we employ an iterative mixed-dataset training strategy across multiple compressed VQA datasets, followed by fine-tuning on the HDRSDR-VQA dataset. Experimental results show that our models achieve state-of-the-art performance compared to existing FR and NR VQA models. Moreover, CompressedVQA-HDR-FR won first place in the FR track of the Generalizable HDR & SDR Video Quality Measurement Grand Challenge at IEEE ICME 2025. The code is available at https://github.com/sunwei925/CompressedVQA-HDR.

[85] Identifying Signatures of Image Phenotypes to Track Treatment Response in Liver Disease eess.IV | cs.AI | cs.CVPDF

Matthias Perkonigg, Nina Bastati, Ahmed Ba-Ssalamah, Peter Mesenbrink, Alexander Goehler

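TL;DR: The paper proposes an unsupervised deep clustering network approach for identifying image patterns in liver magnetic resonance images that are associated with disease progression and treatment response. By establishing a tissue vocabulary, the method quantifies treatment response, and it is validated on patients with non-alcoholic steatohepatitis.

Details

Motivation: In diffuse liver disease, quantifiable image patterns are critical for guiding individual treatment and developing new therapies. Existing methods often rely on invasive biopsy, whereas this study aims to extract useful information from non-invasive imaging data through unsupervised learning.

Result: The method identifies specific liver tissue change pathways associated with treatment and separates treatment groups better than established non-imaging measures. The vocabulary can also predict biopsy-derived features from non-invasive imaging data.

Insight: Unsupervised learning can automatically extract meaningful patterns from medical images, providing a new tool for disease management and treatment monitoring and reducing reliance on invasive biopsy.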

Abstract: Quantifiable image patterns associated with disease progression and treatment response are critical tools for guiding individual treatment, and for developing novel therapies. Here, we show that unsupervised machine learning can identify a pattern vocabulary of liver tissue in magnetic resonance images that quantifies treatment response in diffuse liver disease. Deep clustering networks simultaneously encode and cluster patches of medical images into a low-dimensional latent space to establish a tissue vocabulary. The resulting tissue types capture differential tissue change and its location in the liver associated with treatment response. We demonstrate the utility of the vocabulary on a randomized controlled trial cohort of non-alcoholic steatohepatitis patients. First, we use the vocabulary to compare longitudinal liver change in a placebo and a treatment cohort. Results show that the method identifies specific liver tissue change pathways associated with treatment, and enables a better separation between treatment groups than established non-imaging measures. Moreover, we show that the vocabulary can predict biopsy derived features from non-invasive imaging data. We validate the method on a separate replication cohort to demonstrate the applicability of the proposed method.

[86] Benchmarking and Explaining Deep Learning Cortical Lesion MRI Segmentation in Multiple Sclerosis eess.IV | cs.CVPDF

Nataliia Molchanova, Alessandro Cagol, Mario Ocampo-Pineda, Po-Jui Lu, Matthias Weigel

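TL;DR: The paper presents a multi-centric benchmark for evaluating deep learning detection and segmentation of cortical lesions (CLs) in multiple sclerosis (MS) on MRI, and provides adapted methods and publicly available models.

Details

Motivation: Cortical lesions have important diagnostic and prognostic value in MS, but their clinical use is limited by their subtle appearance on MRI, the difficulty of expert annotation, and the lack of standardized automated methods.

Result: The model achieves F1-scores of 0.64 in-domain and 0.5 out-of-domain, demonstrating strong lesion detection capability.

Insight: The paper analyzes how data variability, lesion ambiguity, and protocol differences affect model performance, and offers recommendations for addressing these barriers to clinical adoption.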

Abstract: Cortical lesions (CLs) have emerged as valuable biomarkers in multiple sclerosis (MS), offering high diagnostic specificity and prognostic relevance. However, their routine clinical integration remains limited due to subtle magnetic resonance imaging (MRI) appearance, challenges in expert annotation, and a lack of standardized automated methods. We propose a comprehensive multi-centric benchmark of CL detection and segmentation in MRI. A total of 656 MRI scans, including clinical trial and research data from four institutions, were acquired at 3T and 7T using MP2RAGE and MPRAGE sequences with expert-consensus annotations. We rely on the self-configuring nnU-Net framework, designed for medical imaging segmentation, and propose adaptations tailored to improved CL detection. We evaluated model generalization through out-of-distribution testing, demonstrating strong lesion detection capabilities with F1-scores of 0.64 and 0.5 in and out of the domain, respectively. We also analyze internal model features and model errors for a better understanding of AI decision-making. Our study examines how data variability, lesion ambiguity, and protocol differences impact model performance, offering future recommendations to address these barriers to clinical adoption. To reinforce reproducibility, the implementation and models will be publicly accessible and ready to use at https://github.com/Medical-Image-Analysis-Laboratory/ and https://doi.org/10.5281/zenodo.15911797.

[87] 3D Wavelet Latent Diffusion Model for Whole-Body MR-to-CT Modality Translation eess.IV | cs.AI | cs.CVPDF

Jiaxu Zheng, Meiman He, Xuhui Tang, Xiong Wang, Tuoyu Cao

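TL;DR: The paper proposes a novel 3D Wavelet Latent Diffusion Model (3D-WLDM) for synthesizing computed tomography (CT) images from magnetic resonance (MR) images, addressing the poor spatial alignment and insufficient image quality of existing methods.

Details

Motivation: MR imaging is essential in clinical diagnostics, but applications such as hybrid PET/MR imaging and MR-only radiation therapy require CT synthesized from MR to estimate radiation attenuation. Existing methods suffer from spatial alignment and image quality problems that undermine reliability in clinical tasks.

Result: 3D-WLDM generates high-resolution CT images with improved representation of bony structures and soft-tissue contrast, with better spatial alignment and image quality.

Insight: Performing modality translation in a latent space, combined with wavelet analysis and a diffusion model, can address the key challenges of MR-to-CT synthesis and provide a more reliable solution for clinical tasks.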

Abstract: Magnetic Resonance (MR) imaging plays an essential role in contemporary clinical diagnostics. It is increasingly integrated into advanced therapeutic workflows, such as hybrid Positron Emission Tomography/Magnetic Resonance (PET/MR) imaging and MR-only radiation therapy. These integrated approaches are critically dependent on accurate estimation of radiation attenuation, which is typically facilitated by synthesizing Computed Tomography (CT) images from MR scans to generate attenuation maps. However, existing MR-to-CT synthesis methods for whole-body imaging often suffer from poor spatial alignment between the generated CT and input MR images, and insufficient image quality for reliable use in downstream clinical tasks. In this paper, we present a novel 3D Wavelet Latent Diffusion Model (3D-WLDM) that addresses these limitations by performing modality translation in a learned latent space. By incorporating a Wavelet Residual Module into the encoder-decoder architecture, we enhance the capture and reconstruction of fine-scale features across image and latent spaces. To preserve anatomical integrity during the diffusion process, we disentangle structural and modality-specific characteristics and anchor the structural component to prevent warping. We also introduce a Dual Skip Connection Attention mechanism within the diffusion model, enabling the generation of high-resolution CT images with improved representation of bony structures and soft-tissue contrast.

[88] Predicting Pulmonary Hypertension in Newborns: A Multi-view VAE Approach eess.IV | cs.AI | cs.CVPDF

Lucas Erlacher, Samuel Ruipérez-Campillo, Holger Michel, Sven Wellmann, Thomas M. Sutter

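TL;DR: The paper proposes a multi-view variational autoencoder (VAE) approach for predicting pulmonary hypertension (PH) in newborns. Using multi-view echocardiographic videos, the method improves the robustness of feature extraction and shows better generalization and classification accuracy than single-view and supervised learning approaches.

Details

Motivation: Diagnosis of PH in newborns usually relies on operator-dependent echocardiography, which makes assessment subjective. Existing automated methods mostly target adults and use single-view data, so they generalize poorly. Multi-view echocardiography is promising, but existing models struggle with generalizability.

Result: The multi-view VAE approach shows better classification accuracy and generalization than single-view and supervised learning approaches, supporting the effectiveness of multi-view learning for PH assessment.

Insight: Multi-view data can capture pathological features more comprehensively, and the VAE's latent representations further improve robustness. This suggests a new direction for automated diagnosis of PH in newborns.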

Abstract: Pulmonary hypertension (PH) in newborns is a critical condition characterized by elevated pressure in the pulmonary arteries, leading to right ventricular strain and heart failure. While right heart catheterization (RHC) is the diagnostic gold standard, echocardiography is preferred due to its non-invasive nature, safety, and accessibility. However, its accuracy highly depends on the operator, making PH assessment subjective. While automated detection methods have been explored, most models focus on adults and rely on single-view echocardiographic frames, limiting their performance in diagnosing PH in newborns. While multi-view echocardiography has shown promise in improving PH assessment, existing models struggle with generalizability. In this work, we employ a multi-view variational autoencoder (VAE) for PH prediction using echocardiographic videos. By leveraging the VAE framework, our model captures complex latent representations, improving feature extraction and robustness. We compare its performance against single-view and supervised learning approaches. Our results show improved generalization and classification accuracy, highlighting the effectiveness of multi-view learning for robust PH assessment in newborns.

[89] Are Vision Foundation Models Ready for Out-of-the-Box Medical Image Registration? eess.IV | cs.AI | cs.CVPDF

Hanxue Gu, Yaqian Chen, Nicholas Konz, Qihang Li, Maciej A. Mazurowski

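TL;DR: The paper evaluates foundation model-based medical image registration algorithms on breast MRI. Some models (such as SAM) outperform traditional methods on global alignment but struggle with fine-grained tissue alignment.

Details

Motivation: To examine whether foundation models (such as DINO-v2 and SAM) can achieve zero-shot registration of medical images, particularly for complex, deformable anatomy such as breast MRI.

Result: SAM outperforms traditional methods on overall alignment but struggles to align fine fibroglandular tissue; medical-specific pre-training (as in MedSAM) does not improve performance and may even reduce it.

Insight: Foundation models hold considerable promise for medical image registration, but further work is needed on capturing fine structures, and domain-specific training must be designed carefully.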

Abstract: Foundation models, pre-trained on large image datasets and capable of capturing rich feature representations, have recently shown potential for zero-shot image registration. However, their performance has mostly been tested in the context of rigid or less complex structures, such as the brain or abdominal organs, and it remains unclear whether these models can handle more challenging, deformable anatomy. Breast MRI registration is particularly difficult due to significant anatomical variation between patients, deformation caused by patient positioning, and the presence of thin and complex internal structure of fibroglandular tissue, where accurate alignment is crucial. Whether foundation model-based registration algorithms can address this level of complexity remains an open question. In this study, we provide a comprehensive evaluation of foundation model-based registration algorithms for breast MRI. We assess five pre-trained encoders, including DINO-v2, SAM, MedSAM, SSLSAM, and MedCLIP, across four key breast registration tasks that capture variations in different years and dates, sequences, modalities, and patient disease status (lesion versus no lesion). Our results show that foundation model-based algorithms such as SAM outperform traditional registration baselines for overall breast alignment, especially under large domain shifts, but struggle with capturing fine details of fibroglandular tissue. Interestingly, additional pre-training or fine-tuning on medical or breast-specific images, as in MedSAM and SSLSAM, does not improve registration performance and may even decrease it in some cases. Further work is needed to understand how domain-specific training influences registration and to explore targeted strategies that improve both global alignment and fine structure accuracy. We also publicly release our code on GitHub: https://github.com/mazurowski-lab/Foundation-based-reg

[90] Unit-Based Histopathology Tissue Segmentation via Multi-Level Feature Representation eess.IV | cs.AI | cs.CV | cs.LGPDF

Ashkan Shakarami, Azade Farshad, Yousef Yeganeh, Lorenzo Nicole, Peter Schuffler

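TL;DR: The paper proposes UTS, a unit-based tissue segmentation framework that uses a Multi-Level Vision Transformer (L-ViT) to classify 32 × 32 tiles, reducing annotation cost and improving computational efficiency.

Details

Motivation: Conventional tissue segmentation methods demand pixel-level annotation and are computationally inefficient; the authors address both problems with tile-level classification.

Result: Evaluated on 386,371 tiles from 459 H&E-stained regions, UTS outperforms U-Net variants and transformer-based baselines.

Insight: Tile-level classification reduces annotation requirements while preserving accuracy, and multi-level feature fusion helps segmentation performance.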

Abstract: We propose UTS, a unit-based tissue segmentation framework for histopathology that classifies each fixed-size 32 × 32 tile, rather than each pixel, as the segmentation unit. This approach reduces annotation effort and improves computational efficiency without compromising accuracy. To implement this approach, we introduce a Multi-Level Vision Transformer (L-ViT), which leverages multi-level feature representation to capture both fine-grained morphology and global tissue context. Trained to segment breast tissue into three categories (infiltrating tumor, non-neoplastic stroma, and fat), UTS supports clinically relevant tasks such as tumor-stroma quantification and surgical margin assessment. Evaluated on 386,371 tiles from 459 H&E-stained regions, it outperforms U-Net variants and transformer-based baselines. Code and dataset will be available on GitHub.
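
To make the tile-as-unit idea concrete, here is a minimal sketch of how a pixel-level mask collapses to one label per 32 × 32 tile. It is not the UTS code: the helper names `tile_grid` and `majority_label` are hypothetical, and the majority vote is an assumption made only for illustration (the abstract does not say how tile labels are obtained).

```python
# Sketch only: hypothetical helpers, not the authors' implementation.
from collections import Counter

TILE = 32  # tile edge length in pixels, as stated in the abstract


def tile_grid(height, width, tile=TILE):
    """Return the (row, col) pixel origins of every full tile in an image."""
    return [(r, c)
            for r in range(0, height - tile + 1, tile)
            for c in range(0, width - tile + 1, tile)]


def majority_label(mask, origin, tile=TILE):
    """Collapse a pixel-level mask to a single label for one tile.

    Majority voting is an assumption for this sketch.
    """
    r0, c0 = origin
    counts = Counter(mask[r][c]
                     for r in range(r0, r0 + tile)
                     for c in range(c0, c0 + tile))
    return counts.most_common(1)[0][0]


# Toy 64 x 96 mask: the left 40 columns are "tumor", the rest "stroma".
mask = [["tumor" if c < 40 else "stroma" for c in range(96)] for _ in range(64)]
tiles = tile_grid(64, 96)
labels = {origin: majority_label(mask, origin) for origin in tiles}

pixel_units = 64 * 96    # labels needed for pixel-level annotation
tile_units = len(tiles)  # labels needed for tile-level annotation
```

In this toy case the number of units to label drops from 6,144 pixels to 6 tiles, which is the kind of saving in annotation effort the abstract refers to.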

cs.CR [Back]

[91] Effective Fine-Tuning of Vision Transformers with Low-Rank Adaptation for Privacy-Preserving Image Classification cs.CR | cs.AI | cs.CVPDF

Haiwei Lin, Shoko Imaizumi, Hitoshi Kiya

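TL;DR: The paper proposes a low-rank adaptation method for training privacy-preserving ViT models: pre-trained weights are frozen, trainable low-rank decomposition matrices are injected, and the patch embedding layer is left unfrozen, which reduces the number of trainable parameters while maintaining high accuracy.

Details

Motivation: Conventional low-rank adaptation freezes the patch embedding layer in ViTs, which can cost performance; this paper addresses that issue while balancing parameter efficiency and privacy preservation.

Result: The method reduces the number of trainable parameters while keeping accuracy close to that of full fine-tuning.

Insight: Unfreezing the patch embedding layer may be key to the performance of low-rank adaptation in ViTs, suggesting a new direction for efficient privacy-preserving training.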

Abstract: We propose a low-rank adaptation method for training privacy-preserving vision transformer (ViT) models that efficiently freezes pre-trained ViT model weights. In the proposed method, trainable rank decomposition matrices are injected into each layer of the ViT architecture, and moreover, the patch embedding layer is not frozen, unlike in conventional low-rank adaptation methods. The proposed method allows us not only to reduce the number of trainable parameters but also to maintain almost the same accuracy as that of full fine-tuning.
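
The low-rank update described in the abstract can be sketched in a few lines. This is a generic illustration of low-rank adaptation, not the authors' code: the function names are hypothetical, the matrices are plain Python lists, and the rank of 4 is an arbitrary choice for the parameter count.

```python
# Generic low-rank adaptation sketch (hypothetical names, plain lists).

def matmul(x, y):
    """Multiply matrix x (m x k) by matrix y (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def lora_weight(w, b, a, scale=1.0):
    """Effective weight W + scale * (B @ A); W stays frozen, B and A train."""
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_trainable_params(d_in, d_out, rank):
    """Parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


w = [[1.0, 2.0], [3.0, 4.0]]  # frozen pre-trained weight
a = [[1.0, 0.0]]              # rank-1 factor A (1 x 2)
b_init = [[0.0], [0.0]]       # B starts at zero, so the model is unchanged
b_trained = [[0.5], [0.0]]    # illustrative values after some training

unchanged = lora_weight(w, b_init, a)
adapted = lora_weight(w, b_trained, a)

# 768 is the hidden size of ViT-Base; rank 4 is purely illustrative.
full = 768 * 768
low_rank = lora_trainable_params(768, 768, rank=4)
```

In a framework such as PyTorch, leaving the patch embedding layer unfrozen would amount to keeping that layer's parameters trainable alongside A and B while every other pre-trained weight is frozen; how the paper implements this is not specified in the abstract.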

cs.GR [Back]

[92] MOSPA: Human Motion Generation Driven by Spatial Audio cs.GR | cs.CV | cs.ROPDF

Shuyang Xu, Zhiyang Dou, Mingyi Shi, Liang Pan, Leo Ho

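TL;DR: The paper introduces the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset and develops MOSPA, a diffusion-based generative framework that models human motion in response to spatial audio. The method achieves state-of-the-art performance in the experiments.

Details

Motivation: Current human motion generation research mainly maps modalities such as speech and music to motion and overlooks how the spatial features in spatial audio signals affect human motion. Filling this gap and enabling high-quality motion generation from spatial audio is the core motivation.

Result: MOSPA generates diverse and realistic motions and achieves state-of-the-art performance in benchmark experiments.

Insight: Spatial features of spatial audio signals strongly influence human motion generation, and a diffusion model with a fusion mechanism can model this relationship with high quality. The authors plan to open-source the dataset and model, which should support further research in this area.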

Abstract: Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most previous works have primarily focused on mapping modalities like speech, audio, and music to generate human motion. As of yet, these models typically overlook the impact of spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse and high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective fusion mechanism. Once trained, MOSPA could generate diverse realistic human motions conditioned on varying spatial audio inputs. We perform a thorough investigation of the proposed dataset and conduct extensive experiments for benchmarking, where our method achieves state-of-the-art performance on this task. Our model and dataset will be open-sourced upon acceptance. Please refer to our supplementary video for more details.

cs.SC [Back]

[93] FactorHD: A Hyperdimensional Computing Model for Multi-Object Multi-Class Representation and Factorization cs.SC | cs.AI | cs.CVPDF

Yifei Zhou, Xuchu Huang, Chenyu Ni, Min Zhou, Zheyu Yan

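TL;DR: FactorHD is a novel HDC model for efficiently representing and factorizing complex class-subclass relations, significantly improving computing efficiency and accuracy.

Details

Motivation: Existing HDC models struggle to represent complex class-subclass relations and, in multi-object multi-class settings, are hard to factorize efficiently, which is a crucial task for neuro-symbolic AI systems.

Result: At a representation size of 10^9, FactorHD is approximately 5667x faster than existing HDC models; integrated with ResNet-18, it achieves 92.48% factorization accuracy on Cifar-10.

Insight: By introducing a memorization clause and an efficient factorization algorithm, FactorHD overcomes the “superposition catastrophe” and “the problem of 2” in HDC models, giving neuro-symbolic AI a more efficient tool.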

Abstract: Neuro-symbolic artificial intelligence (neuro-symbolic AI) excels in logical analysis and reasoning. Hyperdimensional Computing (HDC), a promising brain-inspired computational model, is integral to neuro-symbolic AI. Various HDC models have been proposed to represent class-instance and class-class relations, but when representing the more complex class-subclass relation, where multiple objects are associated with different levels of classes and subclasses, they face challenges in factorization, a crucial task for neuro-symbolic AI systems. In this article, we propose FactorHD, a novel HDC model capable of representing and factorizing the complex class-subclass relation efficiently. FactorHD features a symbolic encoding method that embeds an extra memorization clause, preserving more information for multiple objects. In addition, it employs an efficient factorization algorithm that selectively eliminates redundant classes by identifying the memorization clause of the target class. Such a model significantly enhances computing efficiency and accuracy in representing and factorizing multiple objects with class-subclass relations, overcoming limitations of existing HDC models such as “superposition catastrophe” and “the problem of 2”. Evaluations show that FactorHD achieves approximately 5667x speedup at a representation size of 10^9 compared to existing HDC models. When integrated with the ResNet-18 neural network, FactorHD achieves 92.48% factorization accuracy on the Cifar-10 dataset.
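
For readers unfamiliar with HDC, the sketch below shows the standard binding, bundling, and lookup operations on bipolar hypervectors that such models build on. It is a textbook-style illustration, not FactorHD's encoding or its memorization clause; the dimension, the role names, and the codebook entries are all assumptions chosen for the example.

```python
# Generic HDC operations on +1/-1 hypervectors (not FactorHD itself).
import random

D = 2048  # hypervector dimension, chosen arbitrarily for this sketch


def random_hv(rng):
    """Draw a random bipolar hypervector."""
    return [rng.choice((-1, 1)) for _ in range(D)]


def bind(x, y):
    """Bind two vectors by element-wise multiplication."""
    return [a * b for a, b in zip(x, y)]


def bundle(*vectors):
    """Superpose vectors by element-wise addition."""
    return [sum(vals) for vals in zip(*vectors)]


def similarity(x, y):
    """Normalized dot product between two vectors."""
    return sum(a * b for a, b in zip(x, y)) / D


def query(record, role, codebook):
    """Unbind a role and return the closest codebook entry.

    For +1/-1 vectors, binding with the same role twice cancels out.
    """
    probe = bind(record, role)
    return max(codebook, key=lambda name: similarity(probe, codebook[name]))


rng = random.Random(0)
CLASS, SUBCLASS = random_hv(rng), random_hv(rng)  # hypothetical role vectors
items = {name: random_hv(rng) for name in ("animal", "vehicle", "dog", "car")}

# One object described by a class and a subclass.
record = bundle(bind(CLASS, items["animal"]), bind(SUBCLASS, items["dog"]))
```

Here `query(record, SUBCLASS, items)` recovers `"dog"` because the other term in the superposition behaves like noise. Bundling several such objects into one record is where plain superposition loses track of which class belongs to which subclass, the kind of limitation the abstract says FactorHD targets.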

cs.IR [Back]

[94] Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker cs.IR | cs.CLPDF

Rachna Saxena, Abhijeet Kumar, Suresh Shanmugam

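TL;DR: The paper proposes a vision-augmented Q&A system that combines visual embedding retrieval with a late interaction re-ranker, addressing efficiency and quality problems in multimodal retrieval.

Details

Motivation: Text-only language models cannot handle visual elements such as infographics, while multimodal LLMs (MLLMs) become inefficient when retrieving over large document collections.

Result: Experiments show that the system substantially speeds up retrieval without degrading performance quality, making it suitable for production use.

Insight: Combining hybrid search with late interaction re-ranking is an efficient approach to multimodal retrieval.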

Abstract: Traditional information extraction systems struggle with text-only language models because they do not consider infographics (visual elements of information) such as tables, charts, and images, which are often used to convey complex information to readers. Multimodal LLMs (MLLMs) face the needle-in-a-haystack problem, i.e., either a longer context length or a substantial number of documents as the search space. Late interaction mechanisms over visual language models have shown state-of-the-art performance in retrieval-based vision-augmented Q&A tasks. There are still a few challenges in using them for RAG-based multimodal Q&A. Firstly, many popular and widely adopted vector databases do not natively support multi-vector retrieval. Secondly, late interaction requires computation that inflates the space footprint and can hinder enterprise adoption. Lastly, the current state of the late interaction mechanism does not leverage approximate nearest-neighbor search indexing methods for large speed-ups in the retrieval process. This paper explores a pragmatic approach to making the vision retrieval process scalable and efficient without compromising performance quality. We propose a multi-step custom implementation that uses widely adopted hybrid search (metadata and embedding) and a state-of-the-art late interaction re-ranker to retrieve the best-matching pages. Finally, MLLMs are prompted as readers to generate answers from the contextualized best-matching pages. Through experiments, we observe that the proposed design is scalable (significant speed-up) and stable (without degrading performance quality), and hence can be used in production systems at enterprises.
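
The re-ranking step can be sketched with the ColBERT-style MaxSim score, which is the usual form of late interaction: each query token vector is matched to its best page patch vector and the maxima are summed. Whether the paper uses exactly this operator is an assumption, and the function names, page ids, and toy vectors below are hypothetical.

```python
# Late interaction re-ranking sketch (MaxSim); names and data are made up.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, page_vecs):
    """Sum, over query vectors, of the best match among the page vectors."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rerank(query_vecs, shortlist):
    """Order a first-stage shortlist {page_id: vectors} by MaxSim score."""
    return sorted(shortlist,
                  key=lambda pid: maxsim_score(query_vecs, shortlist[pid]),
                  reverse=True)


query_vecs = [[1.0, 0.0], [0.0, 1.0]]  # two query token embeddings

# Shortlist as it might come back from a cheap hybrid (metadata + embedding)
# search over single-vector page summaries.
shortlist = {
    "page_b": [[1.0, 0.0], [1.0, 0.0]],  # matches only the first token
    "page_a": [[1.0, 0.0], [0.0, 1.0]],  # matches both tokens
}
ranking = rerank(query_vecs, shortlist)
```

Running the multi-vector scoring only on a shortlist is what lets the first stage stay on a conventional single-vector index, which fits the abstract's point that many vector databases lack native multi-vector retrieval.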

q-bio.NC [Back]

[95] Spontaneous Spatial Cognition Emerges during Egocentric Video Viewing through Non-invasive BCI q-bio.NC | cs.CV | eess.SPPDF

Weichen Dai, Yuxuan Huang, Li Zhu, Dongjun Liu, Yu Zhang

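TL;DR: Using non-invasive brain-computer interface (BCI) decoding, the paper shows for the first time that spontaneous, fine-grained 6D pose (3D position and orientation) can be decoded during passive viewing of egocentric video. The finding challenges the traditional distinction between active and passive spatial cognition.

Details

Motivation: Although hippocampal encoding of position and orientation has been widely studied, the large-scale neural dynamics that support spatial representation during naturalistic, passive experience remain unclear. The paper investigates this question with EEG.

Result: EEG can decode continuous 6D pose, and decoding performance improves when frames are presented at 100 ms per image. The study identifies a distributed yet complementary neural encoding scheme.

Insight: Spatial cognition systems operate spontaneously and continuously even under passive conditions, suggesting the boundary between active and passive cognition is blurrier than traditionally assumed.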

Abstract: Humans possess a remarkable capacity for spatial cognition, allowing for self-localization even in novel or unfamiliar environments. While hippocampal neurons encoding position and orientation are well documented, the large-scale neural dynamics supporting spatial representation, particularly during naturalistic, passive experience, remain poorly understood. Here, we demonstrate for the first time that non-invasive brain-computer interfaces (BCIs) based on electroencephalography (EEG) can decode spontaneous, fine-grained egocentric 6D pose, comprising three-dimensional position and orientation, during passive viewing of egocentric video. Despite EEG’s limited spatial resolution and high signal noise, we find that spatially coherent visual input (i.e., continuous and structured motion) reliably evokes decodable spatial representations, aligning with participants’ subjective sense of spatial engagement. Decoding performance further improves when visual input is presented at a frame rate of 100 ms per image, suggesting alignment with intrinsic neural temporal dynamics. Using gradient-based backpropagation through a neural decoding model, we identify distinct EEG channels contributing to position-specific and orientation-specific components, revealing a distributed yet complementary neural encoding scheme. These findings indicate that the brain’s spatial systems operate spontaneously and continuously, even under passive conditions, challenging traditional distinctions between active and passive spatial cognition. Our results offer a non-invasive window into the automatic construction of egocentric spatial maps and advance our understanding of how the human mind transforms everyday sensory experience into structured internal representations.

eess.SP [Back]

[96] DoRF: Doppler Radiance Fields for Robust Human Activity Recognition Using Wi-Fi eess.SP | cs.CVPDF

Navid Hasanzadeh, Shahrokh Valaee

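TL;DR: Inspired by NeRF, the paper proposes DoRF (Doppler Radiance Fields), a new approach based on Doppler velocity projections from Wi-Fi CSI that learns a 3D latent motion representation to improve the robustness and generalization of human activity recognition (HAR) under environmental changes.

Details

Motivation: Doppler velocity projections from Wi-Fi CSI give HAR some robustness, but generalization remains insufficient for practical deployment. Inspired by NeRF, the paper tackles this with a 3D latent motion representation.

Result: Experiments show that DoRF noticeably improves the generalization accuracy of Wi-Fi-based HAR, indicating potential for practical applications.

Insight: A 3D latent representation and a Doppler radiance field give a more comprehensive view of the motion, which helps overcome environmental variability.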

Abstract: Wi-Fi Channel State Information (CSI) has gained increasing interest for remote sensing applications. Recent studies show that Doppler velocity projections extracted from CSI can enable human activity recognition (HAR) that is robust to environmental changes and generalizes to new users. However, despite these advances, generalizability still remains insufficient for practical deployment. Inspired by neural radiance fields (NeRF), which learn a volumetric representation of a 3D scene from 2D images, this work proposes a novel approach to reconstruct an informative 3D latent motion representation from one-dimensional Doppler velocity projections extracted from Wi-Fi CSI. The resulting latent representation is then used to construct a uniform Doppler radiance field (DoRF) of the motion, providing a comprehensive view of the performed activity and improving the robustness to environmental variability. The results show that the proposed approach noticeably enhances the generalization accuracy of Wi-Fi-based HAR, highlighting the strong potential of DoRFs for practical sensing applications.

cs.NE [Back]

[97] Simulated Language Acquisition in a Biologically Realistic Model of the Brain cs.NE | cs.CLPDF

Daniel Mitropolsky, Christos Papadimitriou

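TL;DR: The paper presents a biologically inspired brain model that, through a mathematical formulation of six principles of neuroscience, achieves basic language acquisition.

Details

Motivation: Despite tremendous progress in neuroscience, there is still no clear account of how neuronal activity gives rise to high-level cognitive phenomena such as language. The paper aims to fill this gap.

Result: From a modest number of grounded sentences, the system learns word semantics, syntactic roles, and word order, and can even generate novel sentences.

Insight: Such a biologically inspired model offers a new way to investigate the neural mechanisms behind high-level cognitive phenomena such as language.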

Abstract: Despite tremendous progress in neuroscience, we do not have a compelling narrative for the precise way whereby the spiking of neurons in our brain results in high-level cognitive phenomena such as planning and language. We introduce a simple mathematical formulation of six basic and broadly accepted principles of neuroscience: excitatory neurons, brain areas, random synapses, Hebbian plasticity, local inhibition, and inter-area inhibition. We implement a simulated neuromorphic system based on this formalism, which is capable of basic language acquisition: Starting from a tabula rasa, the system learns, in any language, the semantics of words, their syntactic role (verb versus noun), and the word order of the language, including the ability to generate novel sentences, through the exposure to a modest number of grounded sentences in the same language. We discuss several possible extensions and implications of this result.

astro-ph.IM [Back]

[98] Image-Based Multi-Survey Classification of Light Curves with a Pre-Trained Vision Transformer astro-ph.IM | cs.CVPDF

Daniel Moreno-Cartagena, Guillermo Cabrera-Vives, Alejandra M. Muñoz Arancibia, Pavlos Protopapas, Francisco Förster

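TL;DR: The paper explores using a pre-trained vision Transformer (Swin Transformer V2) for photometric classification on multi-survey data (ZTF and ATLAS), and finds that an architecture that processes the surveys jointly performs best.

Details

Motivation: The goal is a scalable classifier that handles light curves from different surveys and addresses the problem of integrating multi-survey data.

Result: The architecture that processes the surveys jointly achieves the best classification performance among the integration strategies evaluated, supporting the importance of modeling cross-survey interactions.

Insight: Modeling survey-specific characteristics and cross-survey interactions is key to classification performance, providing guidance for scalable classifiers in future time-domain astronomy.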

Abstract: We explore the use of Swin Transformer V2, a pre-trained vision Transformer, for photometric classification in a multi-survey setting by leveraging light curves from the Zwicky Transient Facility (ZTF) and the Asteroid Terrestrial-impact Last Alert System (ATLAS). We evaluate different strategies for integrating data from these surveys and find that a multi-survey architecture which processes them jointly achieves the best performance. These results highlight the importance of modeling survey-specific characteristics and cross-survey interactions, and provide guidance for building scalable classifiers for future time-domain astronomy.

cs.AI [Back]

[99] Let’s Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification cs.AI | cs.CL | cs.LG | cs.MA | cs.ROPDF

Moises Andrade, Joonhyuk Cha, Brandon Ho, Vriksha Srihari, Karmesh Yadav

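TL;DR: The paper proposes Self-Grounded Verification (SGV), a lightweight two-step method that mitigates agreement bias in multimodal large language models (MLLMs) used as verifiers. The method substantially improves verification accuracy and failure detection rates.

Details

Motivation: In domains such as math and board games, verifiers have driven AI progress by assigning rewards. In domains without clear-cut success criteria (e.g., computer use), designing verifiers remains challenging. MLLMs are a promising solution given their world knowledge, human-preference alignment, and reasoning skills, but they exhibit agreement bias when used as verifiers.

Result: SGV improves MLLM verifiers by up to 20 points in accuracy and failure detection rates and enables real-time supervision of heterogeneous agents (a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena), setting a new state of the art on the benchmark and surpassing the previous best by 48%.

Insight: The paper identifies agreement bias in MLLM verifiers and shows that a simple two-step procedure substantially mitigates it, suggesting that with careful design an MLLM's own generation mechanisms can better serve verification of complex tasks.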

Abstract: Verifiers – functions assigning rewards to agent behavior – have been key for AI progress in domains like math and board games. However, extending these gains to domains without clear-cut success criteria (e.g., computer use) remains a challenge: while humans can recognize suitable outcomes, translating this intuition into scalable rules is non-trivial. Multimodal Large Language Models (MLLMs) emerge as a promising solution, given their world knowledge, human-preference alignment, and reasoning skills. We evaluate MLLMs as verifiers of agent trajectories across web navigation, computer use, and robotic manipulation, and identify a critical limitation: agreement bias, a strong tendency for MLLMs to favor information in their context window, often generating chains of thought to rationalize flawed behavior. This bias is pervasive across models, resilient to test-time scaling, and can impact several methods using MLLMs as evaluators (e.g., data filtering). Notably, it occurs despite MLLMs showing strong, human-aligned priors on desired behavior. To address this, we propose Self-Grounded Verification (SGV), a lightweight method that enables more effective use of MLLMs’ knowledge and reasoning by harnessing their own sampling mechanisms via unconditional and conditional generation. SGV operates in two steps: first, the MLLM is elicited to retrieve broad priors about task completion, independent of the data under evaluation. Then, conditioned on self-generated priors, it reasons over and evaluates a candidate trajectory. Enhanced with SGV, MLLM verifiers show gains of up to 20 points in accuracy and failure detection rates, and can perform real-time supervision of heterogeneous agents, boosting task completion of a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena – setting a new state of the art on the benchmark, surpassing the previous best by 48%.
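
The two-step procedure can be sketched as two calls to the same model, where only the second call sees the trajectory. This is a minimal illustration under stated assumptions: `generate` stands for any text-in, text-out model call, the prompt wording is invented here, and `stub_model` is a hypothetical stand-in so the example runs without an MLLM.

```python
# Two-step verification sketch; prompts and stub are hypothetical.

def sgv_verify(generate, task, trajectory):
    """Return (priors, success_flag) for a candidate trajectory."""
    # Step 1: elicit priors about task completion WITHOUT the trajectory.
    priors = generate(
        f"Task: {task}\n"
        "Describe what successful completion generally looks like."
    )
    # Step 2: evaluate the trajectory conditioned on the self-generated priors.
    verdict = generate(
        f"Task: {task}\n"
        f"Expected behavior: {priors}\n"
        f"Trajectory: {trajectory}\n"
        "Answer SUCCESS or FAILURE, then explain."
    )
    return priors, verdict.strip().upper().startswith("SUCCESS")


calls = []  # records every prompt the stub receives


def stub_model(prompt):
    """Canned responses standing in for a real MLLM."""
    calls.append(prompt)
    if "Trajectory:" in prompt:
        return "FAILURE: the cart is still empty."
    return "The requested item appears in the cart."


task = "add a red mug to the cart"
trajectory = "clicked search; closed the tab"
priors, ok = sgv_verify(stub_model, task, trajectory)
```

The design point is that the first prompt never contains the trajectory, so the priors cannot be shaped by the behavior under evaluation.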

cs.SE [Back]

[100] MetaLint: Generalizable Idiomatic Code Quality Analysis through Instruction-Following and Easy-to-Hard Generalization cs.SE | cs.CL | cs.LGPDF

Atharva Naik, Lawanya Baghel, Dhakshin Govindarajan, Darsh Agrawal, Daniel Fried

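TL;DR: MetaLint is an instruction-following framework that supports code quality analysis through instruction tuning on synthetic linter-generated data, adapting to novel or complex code patterns without retraining and improving generalization to unseen idioms.

Details

Motivation: Existing LLMs are limited by static training data in code quality analysis and cannot flexibly adapt to evolving best practices.

Result: MetaLint generalizes well to unseen PEP idioms, reaching a 70.37% F-score on idiom detection, and at 4B parameters performs comparably to larger models.

Insight: With instruction tuning and synthetic data, a model can adapt to new coding practices without updated training data, offering a more flexible approach to code quality analysis.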

Abstract: Large Language Models, though successful in code generation, struggle with code quality analysis because they are limited by static training data and can’t easily adapt to evolving best practices. We introduce MetaLint, a new instruction-following framework that formulates code quality analysis as the task of detecting and fixing problematic semantic code fragments or code idioms based on high-level specifications. Unlike conventional approaches that train models on static, rule-based data, MetaLint employs instruction tuning on synthetic linter-generated data to support easy-to-hard generalization, enabling models to adapt to novel or complex code patterns without retraining. To evaluate this, we construct a benchmark of challenging idioms inspired by real-world coding standards such as Python Enhancement Proposals (PEPs) and assess whether MetaLint-trained models reason adaptively or simply memorize. Our results show that MetaLint improves generalization to unseen PEP idioms, achieving a 70.37% F-score on idiom detection with the highest recall (70.43%) among all evaluated models. It also achieves 26.73% on localization, competitive for its 4B parameter size and comparable to larger state-of-the-art models like o3-mini, highlighting its potential for future-proof code quality analysis.

[101] MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks cs.SE | cs.AI | cs.CLPDF

Artem Chervyakov, Alexander Kharitonov, Pavel Zadorozhny, Adamenko Pavel, Rodion Levichev

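TL;DR: The paper presents MERA Code, a benchmark for evaluating code generation LLMs that covers 8 programming languages and 11 tasks, addressing the lack of attention to code quality in existing evaluations.

Details

Motivation: Existing LLM evaluations focus mainly on natural language tasks and overlook code quality and real-world production performance, leaving the assessment of models' true capabilities and risks incomplete.

Result: Open LLMs and frontier API models are evaluated, revealing their limitations on practical coding tasks in non-English languages.

Insight: MERA Code gives future research a standardized evaluation tool, helping model developers address coding tasks in non-English settings and advancing code generation.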

Abstract: Advancements in LLMs have enhanced task automation in software engineering; however, current evaluations primarily focus on natural language tasks, overlooking code quality. Most benchmarks prioritize high-level reasoning over executable code and real-world performance, leaving gaps in understanding true capabilities and risks associated with these models in production. To address this issue, we propose MERA Code, a new addition to the MERA benchmark family, specifically focused on evaluating code for the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. Our proposed evaluation methodology features a taxonomy that outlines the practical coding skills necessary for models to complete these tasks. The benchmark comprises an open-source codebase for users to conduct MERA assessments, a scoring system compatible with various programming environments, and a platform featuring a leaderboard and submission system. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages. We are publicly releasing MERA to guide future research, anticipate groundbreaking features in model development, and standardize evaluation procedures.

cs.RO [Back]

[102] Towards Autonomous Riding: A Review of Perception, Planning, and Control in Intelligent Two-Wheelers cs.RO | cs.CV | 93C85 | F.2.2; I.2.7PDF

Mohammed Hassanin, Mohammad Abu Alsheikh, Carlos C. N. Kuhn, Damith Herath, Dinh Thai Hoang

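TL;DR: This review analyzes the three core components of autonomous riding (AR) systems for two-wheelers (perception, planning, and control), compares them with autonomous driving (AD) technologies, identifies gaps in current research, and proposes future directions.

Details

Motivation: The spread of micromobility vehicles such as e-scooters and e-bikes has created demand for autonomous two-wheeler technology. The instability of two-wheeled platforms, their limited size and power, and unpredictable environments pose unique challenges that call for research.

Result: The review clarifies how AR differs from AD and identifies key open problems and promising directions in AR research.

Insight: AR research needs more industry and government support, along with a focus on multimodal sensor techniques for lightweight platforms and edge deep learning architectures.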

Abstract: The rapid adoption of micromobility solutions, particularly two-wheeled vehicles like e-scooters and e-bikes, has created an urgent need for reliable autonomous riding (AR) technologies. While autonomous driving (AD) systems have matured significantly, AR presents unique challenges due to the inherent instability of two-wheeled platforms, limited size, limited power, and unpredictable environments, which pose very serious concerns about road users’ safety. This review provides a comprehensive analysis of AR systems by systematically examining their core components, perception, planning, and control, through the lens of AD technologies. We identify critical gaps in current AR research, including a lack of comprehensive perception systems for various AR tasks, limited industry and government support for such developments, and insufficient attention from the research community. The review analyses the gaps of AR from the perspective of AD to highlight promising research directions, such as multimodal sensor techniques for lightweight platforms and edge deep learning architectures. By synthesising insights from AD research with the specific requirements of AR, this review aims to accelerate the development of safe, efficient, and scalable autonomous riding systems for future urban mobility.

[103] A Multi-Level Similarity Approach for Single-View Object Grasping: Matching, Planning, and Fine-Tuning cs.RO | cs.CVPDF

Hao Chen, Takuya Kiyokawa, Zhengtao Hu, Weiwei Wan, Kensuke Harada

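TL;DR: The paper proposes a multi-level similarity approach to single-view object grasping that uses similarity matching, planning, and fine-tuning to grasp unknown objects robustly.

Details

Motivation: Conventional learning frameworks are sensitive to sensing noise and environmental changes and fall short of highly generalized grasping. The authors therefore abandon the traditional framework and explore similarity matching.

Result: The method robustly grasps unknown objects from a single view and generalizes better than traditional learning frameworks.

Insight: Similarity matching and transfer of existing knowledge can markedly improve the robustness of unknown-object grasping, especially under partial observation.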

Abstract: Grasping unknown objects from a single view has remained a challenging topic in robotics due to the uncertainty of partial observation. Recent advances in large-scale models have led to benchmark solutions such as GraspNet-1Billion. However, such learning-based approaches still face a critical limitation in performance robustness owing to their sensitivity to sensing noise and environmental changes. To address this bottleneck in achieving highly generalized grasping, we abandon the traditional learning framework and introduce a new perspective: similarity matching, where similar known objects are utilized to guide the grasping of unknown target objects. We propose a method that robustly achieves unknown-object grasping from a single viewpoint through three key steps: 1) Leverage the visual features of the observed object to perform similarity matching with an existing database containing various object models, identifying potential candidates with high similarity; 2) Use the candidate models with pre-existing grasping knowledge to plan imitative grasps for the unknown target object; 3) Optimize the grasp quality through a local fine-tuning process. To address the uncertainty caused by partial and noisy observation, we propose a multi-level similarity matching framework that integrates semantic, geometric, and dimensional features for comprehensive evaluation. In particular, we introduce a novel point cloud geometric descriptor, the C-FPFH descriptor, which facilitates accurate similarity assessment between partial point clouds of observed objects and complete point clouds of database models. In addition, we incorporate the use of large language models, introduce the semi-oriented bounding box, and develop a novel point cloud registration approach based on plane detection to enhance matching accuracy under single-view conditions. Videos are available at https://youtu.be/qQDIELMhQmk.

[104] EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos cs.RO | cs.AI | cs.CV | cs.LGPDF

Ruihan Yang, Qinxi Yu, Yecheng Wu, Rui Yan, Borui Li

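TL;DR: The paper proposes EgoVLA, an approach that trains vision-language-action (VLA) models on egocentric human videos, converts human actions to robot actions through inverse kinematics and retargeting, and fine-tunes with a few robot demonstrations, significantly improving performance on robot manipulation tasks.

Details

Motivation: Robot imitation learning needs large amounts of real data, but the hardware requirement limits data scale. Human videos are not only large in scale but also rich in scenes and tasks, so the paper explores training VLA models on egocentric human videos.

Result: Evaluated on the Isaac Humanoid Manipulation Benchmark, EgoVLA significantly outperforms baselines, and ablations confirm the importance of human data.

Insight: Human videos provide both scale and broader coverage of scenes and tasks, offering a new route for robot manipulation learning.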

Abstract: Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of using human videos is not only for their scale but more importantly for the richness of scenes and tasks. With a VLA trained on human video that predicts human wrist and hand actions, we can perform Inverse Kinematics and retargeting to convert the human actions to robot actions. We fine-tune the model using a few robot manipulation demonstrations to obtain the robot policy, namely EgoVLA. We propose a simulation benchmark called Isaac Humanoid Manipulation Benchmark, where we design diverse bimanual manipulation tasks with demonstrations. We fine-tune and evaluate EgoVLA with Isaac Humanoid Manipulation Benchmark and show significant improvements over baselines and ablate the importance of human data. Videos can be found on our website: https://rchalyang.github.io/EgoVLA

cs.SD [Back]

[105] Stereo Sound Event Localization and Detection with Onscreen/offscreen Classification cs.SD | cs.CV | cs.MM | eess.AS | eess.IVPDF

Kazuki Shimada, Archontis Politis, Iran R. Roman, Parthasaarathy Sudarsanam, David Diaz-Guerra

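TL;DR: This paper presents the objective, dataset, baseline, and metrics of Task 3 of the DCASE2025 Challenge, which focuses on sound event localization and detection (SELD) with stereo audio and adds a sub-task of onscreen/offscreen event classification.

Details

Motivation: Previous editions used four-channel audio (first-order Ambisonics (FOA) and microphone array formats); this year's task moves to more commonplace stereo audio scenarios with a limited field of view (FOV), which are closer to practical use.

Result: The baseline system performs reasonably well on the stereo audio data.

Insight: The limitations of stereo audio (such as angular ambiguity) lead the task to focus on azimuth and distance estimation, while onscreen/offscreen classification addresses limited-FOV scenarios.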

Abstract: This paper presents the objective, dataset, baseline, and metrics of Task 3 of the DCASE2025 Challenge on sound event localization and detection (SELD). In previous editions, the challenge used four-channel audio formats of first-order Ambisonics (FOA) and microphone array. In contrast, this year’s challenge investigates SELD with stereo audio data (termed stereo SELD). This change shifts the focus from more specialized 360° audio and audiovisual scene analysis to more commonplace audio and media scenarios with limited field-of-view (FOV). Due to inherent angular ambiguities in stereo audio data, the task focuses on direction-of-arrival (DOA) estimation in the azimuth plane (left-right axis) along with distance estimation. The challenge remains divided into two tracks: audio-only and audiovisual, with the audiovisual track introducing a new sub-task of onscreen/offscreen event classification necessitated by the limited FOV. This challenge introduces the DCASE2025 Task3 Stereo SELD Dataset, whose stereo audio and perspective video clips are sampled and converted from the STARSS23 recordings. The baseline system is designed to process stereo audio and corresponding video frames as inputs. In addition to the typical SELD event classification and localization, it integrates onscreen/offscreen classification for the audiovisual track. The evaluation metrics have been modified to introduce an onscreen/offscreen accuracy metric, which assesses the models’ ability to identify which sound sources are onscreen. In the experimental evaluation, the baseline system performs reasonably well with the stereo audio data.

cs.LG [Back]

[106] MNIST-Gen: A Modular MNIST-Style Dataset Generation Using Hierarchical Semantics, Reinforcement Learning, and Category Theory cs.LG | cs.AI | cs.CV | cs.HCPDF

Pouya Shaeri, Arash Karimi, Ariane Middel

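TL;DR: The paper presents MNIST-Gen, an automated, modular framework for generating customized MNIST-style datasets that combines hierarchical semantic categorization, reinforcement learning, and category theory to make dataset generation more efficient and flexible.

Details

Motivation: Standard datasets such as MNIST are limited to generic classes and do not meet the needs of domain-specific tasks. Creating custom datasets by hand is time-consuming and complex, so an automated, flexible solution is needed.

Result: Two new datasets (Tree-MNIST and Food-MNIST) demonstrate the framework's utility, with 85% automatic categorization accuracy and 80% time savings compared to manual approaches.

Insight: Combining semantic understanding with reinforcement learning and human feedback can generate customized datasets efficiently; the category-theory-inspired design improves the framework's extensibility.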

Abstract: Neural networks are often benchmarked using standard datasets such as MNIST, FashionMNIST, or other variants of MNIST, which, while accessible, are limited to generic classes such as digits or clothing items. For researchers working on domain-specific tasks, such as classifying trees, food items, or other real-world objects, these datasets are insufficient and irrelevant. Additionally, creating and publishing a custom dataset can be time consuming, legally constrained, or beyond the scope of individual projects. We present MNIST-Gen, an automated, modular, and adaptive framework for generating MNIST-style image datasets tailored to user-specified categories using hierarchical semantic categorization. The system combines CLIP-based semantic understanding with reinforcement learning and human feedback to achieve intelligent categorization with minimal manual intervention. Our hierarchical approach supports complex category structures with semantic characteristics, enabling fine-grained subcategorization and multiple processing modes: individual review for maximum control, smart batch processing for large datasets, and fast batch processing for rapid creation. Inspired by category theory, MNIST-Gen models each data transformation stage as a composable morphism, enhancing clarity, modularity, and extensibility. As proof of concept, we generate and benchmark two novel datasets, Tree-MNIST and Food-MNIST, demonstrating MNIST-Gen’s utility for producing task-specific evaluation data while achieving 85% automatic categorization accuracy and 80% time savings compared to manual approaches.
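
The "composable morphism" idea can be illustrated with ordinary function composition: each stage maps an image to an image, so stages chain freely and the grouping does not matter. This is a sketch of the general design pattern, not MNIST-Gen's pipeline; the stage names, the pixel-skipping downsampler, and the threshold are assumptions made for the example.

```python
# Pipeline-as-composition sketch; stages are hypothetical examples.
from functools import reduce


def compose(*stages):
    """Compose stages left to right into a single callable."""
    return lambda x: reduce(lambda acc, stage: stage(acc), stages, x)


def to_grayscale(img):
    """Average the RGB channels of each pixel."""
    return [[(r + g + b) // 3 for (r, g, b) in row] for row in img]


def downsample(img, factor=2):
    """Keep every `factor`-th pixel in both directions."""
    return [row[::factor] for row in img[::factor]]


def binarize(img, threshold=128):
    """Map each pixel to 0 or 255."""
    return [[255 if p >= threshold else 0 for p in row] for row in img]


# Toy 4 x 4 RGB image: dark left half, bright right half.
image = [[(200, 200, 200) if c >= 2 else (30, 30, 30) for c in range(4)]
         for _ in range(4)]

pipeline = compose(to_grayscale, downsample, binarize)
result = pipeline(image)

# Composition is associative, so stages can be grouped into reusable blocks.
grouped_left = compose(compose(to_grayscale, downsample), binarize)(image)
grouped_right = compose(to_grayscale, compose(downsample, binarize))(image)
```

Because every stage has the same input and output type, a new stage can be inserted or swapped without touching the others, which is the modularity benefit the abstract attributes to this design.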

[107] RegCL: Continual Adaptation of Segment Anything Model via Model Merging cs.LG | cs.CVPDF

Yuan-Chen Shu, Zhiwei Lin, Yongtao Wang

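TL;DR: RegCL achieves continual adaptation of the Segment Anything Model (SAM) through model merging, addressing the performance degradation of conventional adapter-based methods in cross-domain use.

Details

Motivation: To address SAM's limited performance in specific domains; conventional adapter-based methods suffer performance degradation and catastrophic forgetting when applied across multiple domains.

Result: Experiments show that RegCL performs favorably across multiple downstream datasets, validating its effectiveness in dynamic scenarios.

Insight: RegCL requires no historical data storage and keeps model size constant, making it suitable for multi-task scenarios.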

Abstract: To address the performance limitations of the Segment Anything Model (SAM) in specific domains, existing works primarily adopt adapter-based one-step adaptation paradigms. However, some of these methods are developed specifically for particular domains and may suffer performance degradation when applied to other domains. This issue of catastrophic forgetting severely limits the model’s scalability. To address this issue, this paper proposes RegCL, a novel non-replay continual learning (CL) framework designed for efficient multi-domain knowledge integration through model merging. Specifically, RegCL incorporates the model merging algorithm into the continual learning paradigm by merging the parameters of SAM’s adaptation modules (e.g., LoRA modules) trained on different domains. The merging process is guided by weight optimization, which minimizes prediction discrepancies between the merged model and each of the domain-specific models. RegCL effectively consolidates multi-domain knowledge while maintaining parameter efficiency, i.e., the model size remains constant regardless of the number of tasks, and no historical data storage is required. Experimental results demonstrate that RegCL achieves favorable continual learning performance across multiple downstream datasets, validating its effectiveness in dynamic scenarios.
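
A weighted parameter merge, the basic operation behind this kind of model merging, can be sketched as follows. This is a simplified illustration rather than RegCL: the adapter contents are toy numbers, the merge coefficients are fixed by hand instead of being optimized against prediction discrepancies, and merging the LoRA factors separately (rather than their products) is a simplifying assumption.

```python
# Weighted adapter merging sketch; data and coefficients are hypothetical.

def merge_adapters(adapters, weights):
    """Merge same-shaped adapters as a weighted sum of their parameters."""
    if len(adapters) != len(weights):
        raise ValueError("need one weight per adapter")
    merged = {}
    for name in adapters[0]:
        columns = zip(*(adapter[name] for adapter in adapters))
        merged[name] = [sum(w * v for w, v in zip(weights, column))
                        for column in columns]
    return merged


def num_params(adapter):
    """Count the scalar parameters in an adapter."""
    return sum(len(values) for values in adapter.values())


# Two toy adapters, each imagined as trained on a different domain.
adapter_a = {"lora_A": [1.0, 0.0], "lora_B": [2.0]}
adapter_b = {"lora_A": [0.0, 1.0], "lora_B": [4.0]}

merged = merge_adapters([adapter_a, adapter_b], weights=[0.5, 0.5])
```

The merged adapter has the same number of parameters as each input, which matches the abstract's point that model size stays constant as domains are added. In the paper the coefficients are obtained by weight optimization; that step is not reproduced here.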

Table of Contents

cs.CV

[1] An Memory-Efficient Framework for Deformable Transformer with Neural Architecture Search cs.CV | cs.AI

[2] Reprogramming Vision Foundation Models for Spatio-Temporal Forecasting cs.CV | cs.AI

[3] Expert Operational GANS: Towards Real-Color Underwater Image Restoration cs.CV | cs.AI | eess.IV

[4] Data-Driven Meta-Analysis and Public-Dataset Evaluation for Sensor-Based Gait Age Estimation cs.CV | eess.IV

[5] What cat is that? A re-id model for feral cats cs.CV | cs.AI

[6] SketchDNN: Joint Continuous-Discrete Diffusion for CAD Sketch Generation cs.CV | cs.LG

[7] Interpretable Prediction of Lymph Node Metastasis in Rectal Cancer MRI Using Variational Autoencoders cs.CV | cs.AI | cs.LG

[8] Posture-Driven Action Intent Inference for Playing style and Fatigue Assessment cs.CV | cs.LG

[9] VISTA: Monocular Segmentation-Based Mapping for Appearance and View-Invariant Global Localization cs.CV | cs.RO

[10] Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis cs.CV | cs.AI

[11] Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning cs.CV | cs.AI

[12] CorrMoE: Mixture of Experts with De-stylization Learning for Cross-Scene and Cross-Domain Correspondence Pruning cs.CV

[13] ProtoConNet: Prototypical Augmentation and Alignment for Open-Set Few-Shot Image Classification cs.CV

[14] From Coarse to Nuanced: Cross-Modal Alignment of Fine-Grained Linguistic Cues and Visual Salient Regions for Dynamic Emotion Recognition cs.CV | cs.AI | cs.HC

[15] Spatial Frequency Modulation for Semantic Segmentation cs.CV | cs.AI

[16] SEPose: A Synthetic Event-based Human Pose Estimation Dataset for Pedestrian Monitoring cs.CV

[17] Dark-EvGS: Event Camera as an Eye for Radiance Field in the Dark cs.CV

[18] Hyperphantasia: A Benchmark for Evaluating the Mental Visualization Capabilities of Multimodal LLMs cs.CV

[19] RaDL: Relation-aware Disentangled Learning for Multi-Instance Text-to-Image Generation cs.CV | cs.AI

[20] Prototypical Progressive Alignment and Reweighting for Generalizable Semantic Segmentation cs.CV

[21] Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos cs.CV | eess.AS | eess.IV

[22] Watch, Listen, Understand, Mislead: Tri-modal Adversarial Attacks on Short Videos for Content Appropriateness Evaluation cs.CV

[23] GS-Bias: Global-Spatial Bias Learner for Single-Image Test-Time Adaptation of Vision-Language Models cs.CV

[24] EC-Diff: Fast and High-Quality Edge-Cloud Collaborative Inference for Diffusion Models cs.CV

[25] Unsupervised Part Discovery via Descriptor-Based Masked Image Restoration with Optimized Constraints cs.CV

[26] Frequency-Dynamic Attention Modulation for Dense Prediction cs.CV | cs.AI

[27] Dual form Complementary Masking for Domain-Adaptive Image Segmentation cs.CV | cs.AI

[28] Deep Neural Encoder-Decoder Model to Relate fMRI Brain Activity with Naturalistic Stimuli cs.CV | cs.HC

[29] SS-DC: Spatial-Spectral Decoupling and Coupling Across Visible-Infrared Gap for Domain Adaptive Object Detection cs.CV | cs.AI

[30] Dataset Ownership Verification for Pre-trained Masked Models cs.CV

[31] MVAR: MultiVariate AutoRegressive Air Pollutants Forecasting Model cs.CV | cs.LG

[32] 3D-MoRe: Unified Modal-Contextual Reasoning for Embodied Question Answering cs.CV

[33] Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery cs.CV | cs.AI

[34] InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing cs.CV | cs.AI | cs.MM

[35] MS-DETR: Towards Effective Video Moment Retrieval and Highlight Detection by Joint Motion-Semantic Learning cs.CV

[36] Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics cs.CV | cs.RO

[37] YOLOv8-SMOT: An Efficient and Robust Framework for Real-Time Small Object Tracking via Slice-Assisted Training and Adaptive Association cs.CV

[38] Out-of-distribution data supervision towards biomedical semantic segmentation cs.CV

[39] Non-Adaptive Adversarial Face Generation cs.CV | cs.AI | cs.CR | I.2.6; I.5.4; D.4.6; K.6.5; I.4.8

[40] LidarPainter: One-Step Away From Any Lidar View To Novel Guidance cs.CV

[41] Open-Vocabulary Indoor Object Grounding with 3D Hierarchical Scene Graph cs.CV

[42] Block-based Symmetric Pruning and Fusion for Efficient Vision Transformers cs.CV

[43] AD-GS: Object-Aware B-Spline Gaussian Splatting for Self-Supervised Autonomous Driving cs.CV

[44] Fine-Grained Image Recognition from Scratch with Teacher-Guided Data Augmentation cs.CV | I.2; I.4

[45] Hybrid Ensemble Approaches: Optimal Deep Feature Fusion and Hyperparameter-Tuned Classifier Ensembling for Enhanced Brain Tumor Classification cs.CV

[46] Revealing the Ancient Beauty: Digital Reconstruction of Temple Tiles using Computer Vision cs.CV | cs.AI

[47] MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM cs.CV

[48] Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models cs.CV

[49] Calisthenics Skills Temporal Video Segmentation cs.CV

[50] Site-Level Fine-Tuning with Progressive Layer Freezing: Towards Robust Prediction of Bronchopulmonary Dysplasia from Day-1 Chest Radiographs in Extremely Preterm Infants cs.CV | cs.AI | cs.LG

[51] Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation cs.CV

[52] Unsupervised Monocular 3D Keypoint Discovery from Multi-View Diffusion Priors cs.CV

[53] Cluster Contrast for Unsupervised Visual Representation Learning cs.CV | cs.AI

[54] OD-VIRAT: A Large-Scale Benchmark for Object Detection in Realistic Surveillance Environments cs.CV

[55] AutoVDC: Automated Vision Data Cleaning Using Vision-Language Models cs.CV | cs.AI | cs.LG | cs.RO

[56] InterpIoU: Rethinking Bounding Box Regression with Interpolation-Based IoU Optimization cs.CV

[57] DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition cs.CV

[58] Describe Anything Model for Visual Question Answering on Text-rich Images cs.CV | cs.LG

[59] Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios cs.CV

[60] Mitigating Object Hallucinations via Sentence-Level Early Intervention cs.CV

[61] SpatialTrackerV2: 3D Point Tracking Made Easy cs.CV

[62] MMHU: A Massive-Scale Multimodal Benchmark for Human Behavior Understanding cs.CV

[63] PhysX: Physical-Grounded 3D Asset Generation cs.CV

cs.CL

[64] MapIQ: Benchmarking Multimodal Large Language Models for Map Question Answering cs.CL | cs.AI | cs.CV | cs.LG

[65] Partitioner Guided Modal Learning Framework cs.CL | cs.AI

[66] ExpliCIT-QA: Explainable Code-Based Image Table Question Answering cs.CL | cs.AI

[67] CRABS: A syntactic-semantic pincer strategy for bounding LLM interpretation of Python notebooks cs.CL | cs.AI

[68] AI Wizards at CheckThat! 2025: Enhancing Transformer-Based Embeddings with Sentiment for Subjectivity Detection in News Articles cs.CL | cs.IR

[69] DualReward: A Dynamic Reinforcement Learning Framework for Cloze Tests Distractor Generation cs.CL

[70] A Survey of Deep Learning for Geometry Problem Solving cs.CL | cs.AI | cs.CV | cs.LG

[71] POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering cs.CL | cs.AI | cs.CV | cs.MM

[72] The benefits of query-based KGQA systems for complex and temporal questions in LLM era cs.CL | cs.LG

[73] Improving Data and Parameter Efficiency of Neural Language Models Using Representation Analysis cs.CL

[74] Evaluating the Ability of Large Language Models to Reason about Cardinal Directions, Revisited cs.CL

[75] Findings of MEGA: Maths Explanation with LLMs using the Socratic Method for Active Learning cs.CL

[76] Iterative Augmentation with Summarization Refinement (IASR) Evaluation for Unstructured Survey data Modeling and Analysis cs.CL | cs.LG

[77] Overview of the Sensemaking Task at the ELOQUENT 2025 Lab: LLMs as Teachers, Students and Evaluators cs.CL | I.2.7