cs.CV [Total: 125]
cs.CL [Total: 25]
cs.AI [Total: 6]
cs.RO [Total: 3]
cs.IR [Total: 1]
cs.MA [Total: 1]
cs.MM [Total: 2]
cs.DB [Total: 1]
cs.CR [Total: 2]
cs.SE [Total: 1]
q-bio.NC [Total: 2]
cs.SD [Total: 2]
eess.IV [Total: 8]
cs.LG [Total: 7]
cs.GR [Total: 1]
eess.AS [Total: 2]

cs.CV

Josh Qixuan Sun, Xiaoying Xing, Huaiyuan Weng, Chul Min Yeum, Mark Crowley

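TL;DR: The paper proposes VIL, which uses contrastive learning and a teacher-student framework to make vision-language navigation more robust to viewpoint changes.

Details

Motivation: Existing vision-language navigation policies are sensitive to viewpoint changes, such as camera height and viewing angle, which degrades their performance.

Result: On the R2R-CE and RxR-CE datasets, VIL improves Success Rate by 8-15% over existing methods.

Insight: VIL improves performance under viewpoint changes and can also serve as a plug-and-play post-training method without reducing standard-viewpoint performance.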

Abstract: Vision-Language Navigation in Continuous Environments (VLNCE), where an agent follows instructions and moves freely to reach a destination, is a key research problem in embodied AI. However, most navigation policies are sensitive to viewpoint changes, i.e., variations in camera height and viewing angle that alter the agent’s observation. In this paper, we introduce a generalized scenario, V2-VLNCE (VLNCE with Varied Viewpoints), and propose VIL (View Invariant Learning), a view-invariant post-training strategy that enhances the robustness of existing navigation policies to changes in camera viewpoint. VIL employs a contrastive learning framework to learn sparse and view-invariant features. Additionally, we introduce a teacher-student framework for the Waypoint Predictor Module, a core component of most VLNCE baselines, where a view-dependent teacher model distills knowledge into a view-invariant student model. We employ an end-to-end training paradigm to jointly optimize these components, thus eliminating the cost for individual module training. Empirical results show that our method outperforms state-of-the-art approaches on V2-VLNCE by 8-15% measured on Success Rate for two standard benchmark datasets R2R-CE and RxR-CE. Furthermore, we evaluate VIL under the standard VLNCE setting and find that, despite being trained for varied viewpoints, it often still improves performance. On the more challenging RxR-CE dataset, our method also achieved state-of-the-art performance across all metrics when compared to other map-free methods. This suggests that adding VIL does not diminish the standard viewpoint performance and can serve as a plug-and-play post-training method.
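
The contrastive part of the abstract can be illustrated with a minimal sketch, assuming an InfoNCE-style loss over paired features from a standard and a varied viewpoint. The function name `info_nce`, the temperature, and the toy feature vectors are illustrative assumptions, not the authors' implementation.

```python
import math

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; anchors[i] and positives[i] are two views of scene i."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate view, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # -log softmax of the matching view
    return total / len(a)

# Toy features (assumed): two scenes seen from a standard and a varied viewpoint.
standard_view = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # varied view stays close to its own scene
swapped = [[0.1, 0.9], [0.9, 0.1]]   # varied view resembles the other scene
loss_aligned = info_nce(standard_view, aligned)
loss_swapped = info_nce(standard_view, swapped)
```

The loss is small when each varied-viewpoint feature stays closest to its own scene and large when viewpoint changes make scenes confusable, which is the behavior a view-invariant encoder is trained toward.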

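[2] Detecting Deepfake Talking Heads from Facial Biometric Anomalies cs.CV

Justin D. Norman, Hany Farid

TL;DR: Proposes a forensic machine learning method that detects deepfake videos from anomalies in facial biometrics, targeting impersonations built from highly realistic voice cloning and visual deepfake generation.

Details

Motivation: Deepfake techniques such as voice cloning and face swapping are readily used for fraud and political disinformation, so an effective detection method is needed to counter this threat.

Result: The technique detects deepfake videos effectively and is reasonably robust to video laundering and to previously unseen deepfake generators.

Insight: Anomalies in facial biometrics can serve as a reliable indicator for detecting deepfake videos, and the method can adapt to newly emerging forgery techniques.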

Abstract: The combination of highly realistic voice cloning, along with visually compelling avatar, face-swap, or lip-sync deepfake video generation, makes it relatively easy to create a video of anyone saying anything. Today, such deepfake impersonations are often used to power frauds, scams, and political disinformation. We propose a novel forensic machine learning technique for the detection of deepfake video impersonations that leverages unnatural patterns in facial biometrics. We evaluate this technique across a large dataset of deepfake techniques and impersonations, as well as assess its reliability to video laundering and its generalization to previously unseen video deepfake generators.

[3] PRISM: Reducing Spurious Implicit Biases in Vision-Language Models with LLM-Guided Embedding Projection cs.CV | cs.LG

Mahdiyar Molahasani, Azadeh Motamedi, Michael Greenspan, Il-Min Kim, Ali Etemad

TL;DR: PRISM is a data-free, unsupervised method that reduces implicit bias in vision-language models by combining LLM-generated scene descriptions with a contrastive-style debiasing loss.

Details

Motivation: Vision-language models (VLMs) such as CLIP tend to inherit and amplify implicit biases from their training data, which skews predictions. Conventional debiasing methods rely on predefined bias categories or additional data; PRISM aims to debias without either.

Result: On the Waterbirds and CelebA datasets, PRISM outperforms existing debiasing methods.

Insight: PRISM shows the potential of task-agnostic, data-free debiasing and offers a new approach to fairness in VLMs.

Abstract: We introduce Projection-based Reduction of Implicit Spurious bias in vision-language Models (PRISM), a new data-free and task-agnostic solution for bias mitigation in VLMs like CLIP. VLMs often inherit and amplify biases in their training data, leading to skewed predictions. PRISM is designed to debias VLMs without relying on predefined bias categories or additional external data. It operates in two stages: first, an LLM is prompted with simple class prompts to generate scene descriptions that contain spurious correlations. Next, PRISM uses our novel contrastive-style debiasing loss to learn a projection that maps the embeddings onto a latent space that minimizes spurious correlations while preserving the alignment between image and text embeddings. Extensive experiments demonstrate that PRISM outperforms current debiasing methods on the commonly used Waterbirds and CelebA datasets. We make our code public at: https://github.com/MahdiyarMM/PRISM.
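
To make the idea of projecting embeddings away from a spurious correlation concrete, here is a minimal sketch. It deliberately swaps in a closed-form orthogonal projection along one known direction; PRISM itself learns its projection with a contrastive-style loss, which is not reproduced here. The helper `project_out`, the axis layout, and the toy embedding are assumptions for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(embedding, direction):
    """Remove the component of `embedding` that lies along `direction`."""
    scale = dot(embedding, direction) / dot(direction, direction)
    return [e - scale * d for e, d in zip(embedding, direction)]

# Toy 3-D embedding space (assumed): axis 0 carries the class signal,
# axis 1 carries a spurious background cue, axis 2 is everything else.
spurious_axis = [0.0, 1.0, 0.0]
waterbird_on_land = [0.8, -0.6, 0.1]
debiased = project_out(waterbird_on_land, spurious_axis)
```

After projection the embedding has no component along the spurious axis, while the class-related coordinate is untouched.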

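[4] Video Inference for Human Mesh Recovery with Vision Transformer cs.CV

Hanbyel Cho, Jaesung Ahn, Yooshin Cho, Junmo Kim

TL;DR: The paper proposes HMR-ViT, which combines temporal and kinematic information by constructing a temporal-kinematic feature image and encoding it with a Vision Transformer for human mesh recovery, with competitive performance.

Details

Motivation: Existing human mesh recovery methods use either temporal or kinematic information but not both, which limits performance. This paper aims to improve recovery accuracy by using both.

Result: The method performs competitively on the 3DPW and Human3.6M datasets.

Insight: Combining temporal and kinematic information is key to improving HMR performance, and the Channel Rearranging Matrix (CRM) design markedly improves the feature representation.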

Abstract: Human Mesh Recovery (HMR) from an image is a challenging problem because of the inherent ambiguity of the task. Existing HMR methods utilized either temporal information or kinematic relationships to achieve higher accuracy, but there is no method using both. Hence, we propose “Video Inference for Human Mesh Recovery with Vision Transformer (HMR-ViT)” that can take into account both temporal and kinematic information. In HMR-ViT, a Temporal-kinematic Feature Image is constructed using feature vectors obtained from video frames by an image encoder. When generating the feature image, we use a Channel Rearranging Matrix (CRM) so that similar kinematic features could be located spatially close together. The feature image is then further encoded using Vision Transformer, and the SMPL pose and shape parameters are finally inferred using a regression network. Extensive evaluation on the 3DPW and Human3.6M datasets indicates that our method achieves a competitive performance in HMR.
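
The construction of the Temporal-kinematic Feature Image can be sketched as follows, assuming the Channel Rearranging Matrix acts as a fixed permutation of feature channels. That is a simplifying assumption for illustration; the abstract does not say how the matrix is obtained, and `build_feature_image` and the channel order are hypothetical.

```python
def build_feature_image(frame_features, channel_order):
    """Stack per-frame feature vectors as rows (time axis) and reorder the
    channels of every row so related channels end up in adjacent columns."""
    return [[feat[c] for c in channel_order] for feat in frame_features]

# Three frames with four feature channels each (toy values).
frames = [
    [10, 20, 30, 40],
    [11, 21, 31, 41],
    [12, 22, 32, 42],
]
# Hypothetical order: channels 0 and 2 describe kinematically related joints.
order = [0, 2, 1, 3]
image = build_feature_image(frames, order)
```

Each row of `image` is one frame and each column one rearranged channel, so a Vision Transformer patch spans neighboring frames and kinematically related channels at once.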

[5] From images to properties: a NeRF-driven framework for granular material parameter inversion cs.CV | physics.geo-ph

Cheng-Hsi Hsiao, Krishna Kumar

TL;DR: The paper proposes a new framework combining NeRF and MPM that infers the friction angle of granular materials from visual observations, with an error within 2 degrees.

Details

Motivation: In real-world settings, directly measuring physical parameters of granular materials such as sand (for example, the friction angle) can be impractical. The paper aims to invert these parameters from purely visual observations.

Result: Experiments show that the method keeps the friction-angle estimation error within 2 degrees.

Insight: Purely visual observations can be used to invert complex physical parameters, offering a new approach for scenarios where direct measurement is not possible.

Abstract: We introduce a novel framework that integrates Neural Radiance Fields (NeRF) with Material Point Method (MPM) simulation to infer granular material properties from visual observations. Our approach begins by generating synthetic experimental data, simulating a plow interacting with sand. The experiment is rendered into realistic images as the photographic observations. These observations include multi-view images of the experiment’s initial state and time-sequenced images from two fixed cameras. Using NeRF, we reconstruct the 3D geometry from the initial multi-view images, leveraging its capability to synthesize novel viewpoints and capture intricate surface details. The reconstructed geometry is then used to initialize material point positions for the MPM simulation, where the friction angle remains unknown. We render images of the simulation under the same camera setup and compare them to the observed images. By employing Bayesian optimization, we minimize the image loss to estimate the best-fitting friction angle. Our results demonstrate that friction angle can be estimated with an error within 2 degrees, highlighting the effectiveness of inverse analysis through purely visual observations. This approach offers a promising solution for characterizing granular materials in real-world scenarios where direct measurement is impractical or impossible.
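
The inversion loop (simulate, render, compare, update the friction angle) can be sketched in a few lines. This is a toy under stated assumptions: a one-line analytic function stands in for the MPM simulation and rendering, a scalar stands in for an image, and a plain grid search replaces the Bayesian optimization used in the paper.

```python
import math

def simulate_runout(friction_angle_deg):
    """Toy stand-in for 'MPM simulation + rendering': returns one scalar
    observation that decreases monotonically with the friction angle."""
    return 1.0 / math.tan(math.radians(friction_angle_deg))

def image_loss(candidate_deg, observed):
    """Squared difference between simulated and observed observation."""
    return (simulate_runout(candidate_deg) - observed) ** 2

true_angle = 32.0                       # unknown in a real experiment
observed = simulate_runout(true_angle)  # plays the role of the photographs

# Grid search over 20..40 degrees in 0.5-degree steps (swapped in for
# Bayesian optimization to keep the sketch dependency-free).
candidates = [20.0 + 0.5 * k for k in range(41)]
estimate = min(candidates, key=lambda a: image_loss(a, observed))
```

With a real simulator each loss evaluation is expensive, which is the usual reason to prefer a sample-efficient optimizer such as Bayesian optimization over a dense grid.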

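[6] VISTA: A Visual Analytics Framework to Enhance Foundation Model-Generated Data Labels cs.CV

Xiwei Xuan, Xiaoqi Wang, Wenbin He, Jorge Piazentin Ono, Liang Gou

TL;DR: VISTA is a visual analytics framework that aims to improve the quality of labels generated by foundation models. It combines multi-phase validation with human intervention to address the shortcomings of existing methods in assessing data quality.

Details

Motivation: Multi-modal foundation models such as CLIP and LLaVA make automatic labeling of large-scale datasets convenient, but the quality of the labels they generate is understudied, and existing methods focus on data quantity rather than quality.

Result: Use cases on two benchmark datasets and expert reviews demonstrate VISTA's effectiveness, with both quantitative and qualitative results indicating improved data quality.

Insight: VISTA shows the need to combine automated and human validation, and offers a new approach to improving the reliability of large-scale labeled data.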

Abstract: The advances in multi-modal foundation models (FMs) (e.g., CLIP and LLaVA) have facilitated the auto-labeling of large-scale datasets, enhancing model performance in challenging downstream tasks such as open-vocabulary object detection and segmentation. However, the quality of FM-generated labels is less studied as existing approaches focus more on data quantity over quality. This is because validating large volumes of data without ground truth presents a considerable challenge in practice. Existing methods typically rely on limited metrics to identify problematic data, lacking a comprehensive perspective, or apply human validation to only a small data fraction, failing to address the full spectrum of potential issues. To overcome these challenges, we introduce VISTA, a visual analytics framework that improves data quality to enhance the performance of multi-modal models. Targeting the complex and demanding domain of open-vocabulary image segmentation, VISTA integrates multi-phased data validation strategies with human expertise, enabling humans to identify, understand, and correct hidden issues within FM-generated labels. Through detailed use cases on two benchmark datasets and expert reviews, we demonstrate VISTA’s effectiveness from both quantitative and qualitative perspectives.

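[7] BrainLesion Suite: A Flexible and User-Friendly Framework for Modular Brain Lesion Image Analysis cs.CV | cs.AI | cs.LG | cs.SE

Florian Kofler, Marcel Rosier, Mehdi Astaraki, Hendrik Möller, Ilhem Isra Mekki

TL;DR: BrainLesion Suite is a flexible Python toolkit designed for modular brain lesion image analysis, providing preprocessing, missing-modality synthesis, lesion segmentation, and related functions for clinical and scientific practice.

Details

Motivation: Current medical image analysis tools often lack flexibility and struggle to accommodate complex workflows, especially multi-modal analysis of brain lesions. The authors therefore developed a modular, easily extensible toolkit to streamline development and reduce cognitive load.

Result: BrainLesion Suite handles complex brain lesion analysis tasks and can be extended to other biomedical image analysis domains. The packages and tutorials are available on GitHub.

Insight: The toolkit's flexibility makes it suitable beyond brain lesion analysis and opens it to other biomedical imaging tasks, showing the potential of modular design in medical image analysis.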

Abstract: BrainLesion Suite is a versatile toolkit for building modular brain lesion image analysis pipelines in Python. Following Pythonic principles, BrainLesion Suite is designed to provide a ‘brainless’ development experience, minimizing cognitive effort and streamlining the creation of complex workflows for clinical and scientific practice. At its core is an adaptable preprocessing module that performs co-registration, atlas registration, and optional skull-stripping and defacing on arbitrary multi-modal input images. BrainLesion Suite leverages algorithms from the BraTS challenge to synthesize missing modalities, inpaint lesions, and generate pathology-specific tumor segmentations. BrainLesion Suite also enables quantifying segmentation model performance, with tools such as panoptica to compute lesion-wise metrics. Although BrainLesion Suite was originally developed for image analysis pipelines of brain lesions such as glioma, metastasis, and multiple sclerosis, it can be adapted for other biomedical image analysis applications. The individual BrainLesion Suite packages and tutorials are accessible on GitHub.

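[8] Can Contrastive Learning Improve Class-Imbalanced Diffusion Model? cs.CV | cs.LG

Fang Chen, Alex Villa, Gongbo Liang, Xiaoyi Lu, Meng Tang

TL;DR: The paper proposes using contrastive learning to improve class-imbalanced diffusion models: two simple contrastive loss functions increase the diversity of tail-class images while preserving the quality and diversity of head-class images.

Details

Motivation: When diffusion models are trained on class-imbalanced data, tail-class images suffer from limited diversity and mode collapse. Existing methods have not effectively solved this problem, particularly for diffusion models.

Result: Experiments show that the method outperforms standard DDPM and alternative methods on several datasets, including CIFAR10/100-LT, PlacesLT, TinyImageNetLT, and ImageNetLT, and markedly improves tail-class image diversity.

Insight: Balancing class-imbalanced diffusion models with contrastive learning shows that simple but carefully designed loss functions can address long-tailed distributions in diffusion models while maintaining head-class performance.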

Abstract: Training data for class-conditional image synthesis often exhibit a long-tailed distribution with limited images for tail classes. Such an imbalance causes mode collapse and reduces the diversity of synthesized images for tail classes. For class-conditional diffusion models trained on imbalanced data, we aim to improve the diversity of tail class images without compromising the fidelity and diversity of head class images. We achieve this by introducing two deceptively simple but highly effective contrastive loss functions. Firstly, we employ an unsupervised InfoNCE loss utilizing negative samples to increase the distance/dissimilarity among synthetic images, particularly for tail classes. To further enhance the diversity of tail classes, our second loss is an MSE loss that contrasts class-conditional generation with unconditional generation at large timesteps. This second loss makes the denoising process insensitive to class conditions for the initial steps, which enriches tail classes through knowledge sharing from head classes. Conditional-unconditional alignment has been shown to enhance the performance of long-tailed GAN. We are the first to adapt such alignment to diffusion models. We successfully leveraged contrastive learning for class-imbalanced diffusion models. Our contrastive learning framework is easy to implement and outperforms standard DDPM and alternative methods for class-imbalanced diffusion models across various datasets, including CIFAR10/100-LT, PlacesLT, TinyImageNetLT, and ImageNetLT.
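
The second loss in the abstract, an MSE that aligns class-conditional and unconditional predictions at large timesteps, can be sketched as below. The function name, the hard threshold `t_min`, and the toy predictions are assumptions; the abstract does not state how "large timesteps" are selected or weighted.

```python
def alignment_loss(eps_cond, eps_uncond, t, t_min):
    """MSE between conditional and unconditional noise predictions,
    applied only at large (noisy) timesteps t >= t_min."""
    if t < t_min:
        return 0.0
    squared = [(c - u) ** 2 for c, u in zip(eps_cond, eps_uncond)]
    return sum(squared) / len(squared)

# Toy noise predictions for one sample (assumed values).
eps_cond = [0.5, -0.5]    # prediction given a tail-class label
eps_uncond = [0.0, 0.0]   # prediction with the label dropped

late = alignment_loss(eps_cond, eps_uncond, t=900, t_min=800)   # penalized
early = alignment_loss(eps_cond, eps_uncond, t=100, t_min=800)  # left alone
```

Penalizing the difference only at noisy timesteps makes the early denoising steps nearly class-agnostic, which is how the abstract describes tail classes borrowing structure from head classes.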

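[9] Infinite Video Understanding cs.CV | cs.AI | cs.IR | cs.LG | cs.MM

Dell Zhang, Xiangyu Chen, Jixiang Luo, Mengxi Jia, Changzhi Sun

TL;DR: This paper introduces the concept of 'Infinite Video Understanding', which targets the challenge of processing and understanding very long or continuous video content, and positions it as a new direction for multimedia and AI research.

Details

Motivation: Large language models and their multimodal extensions have made notable progress in video understanding, but they still face severe computational, memory, and temporal-coherence limits when handling very long or continuous video. The authors therefore propose Infinite Video Understanding as a frontier goal to drive innovation in the relevant techniques.

Result: As a position paper, it presents no experimental results, but it lays out a research framework and directions for future work on Infinite Video Understanding.

Insight: The paper argues that treating Infinite Video Understanding as a blue-sky research objective can drive innovation in streaming architectures, persistent memory mechanisms, hierarchical representations, and related areas, and that new evaluation paradigms are needed to measure progress.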

Abstract: The rapid advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have ushered in remarkable progress in video understanding. However, a fundamental challenge persists: effectively processing and comprehending video content that extends beyond minutes or hours. While recent efforts like Video-XL-2 have demonstrated novel architectural solutions for extreme efficiency, and advancements in positional encoding such as HoPE and VideoRoPE++ aim to improve spatio-temporal understanding over extensive contexts, current state-of-the-art models still encounter significant computational and memory constraints when faced with the sheer volume of visual tokens from lengthy sequences. Furthermore, maintaining temporal coherence, tracking complex events, and preserving fine-grained details over extended periods remain formidable hurdles, despite progress in agentic reasoning systems like Deep Video Discovery. This position paper posits that a logical, albeit ambitious, next frontier for multimedia research is Infinite Video Understanding – the capability for models to continuously process, understand, and reason about video data of arbitrary, potentially never-ending duration. We argue that framing Infinite Video Understanding as a blue-sky research objective provides a vital north star for the multimedia, and the wider AI, research communities, driving innovation in areas such as streaming architectures, persistent memory mechanisms, hierarchical and adaptive representations, event-centric reasoning, and novel evaluation paradigms. Drawing inspiration from recent work on long/ultra-long video understanding and several closely related fields, we outline the core challenges and key research directions towards achieving this transformative capability.

[10] BlindSight: Harnessing Sparsity for Efficient VLMs cs.CV | I.2.10

Tharun Adithya Srikrishnan, Deval Shah, Steven K. Reinhardt

TL;DR: The paper proposes BlindSight, which exploits attention sparsity in vision-language models to make inference more efficient, substantially reducing compute while maintaining accuracy.

Details

Motivation: When vision-language models (VLMs) process text and images, the quadratic complexity of attention leads to long prefill times. This paper aims to improve inference efficiency by exploiting attention sparsity.

Result: FLOPs drop by 32%-41% on average while accuracy stays within -2% to +2% of the original model.

Insight: Attention sparsity is widespread in VLMs, and well-designed sparsity masks can substantially improve inference efficiency without retraining the model.

Abstract: Large vision-language models (VLMs) enable the joint processing of text and images. However, the inclusion of vision data significantly expands the prompt length. Along with the quadratic complexity of the attention computation, this results in a longer prefill duration. An approach to mitigate this bottleneck is to leverage the inherent sparsity in the attention computation. In our analysis of attention patterns in VLMs, we observe that a substantial portion of layers exhibit minimal cross-image attention, except through attention-sink tokens per image. These sparse attention patterns fall into distinct categories: sink-only, document mask, and a hybrid document-sink mask. Based on this, we propose BlindSight: a training-free approach to optimize VLM inference using an input-template-aware attention sparsity mask. We utilize samples from a dataset to derive a prompt-agnostic sparsity categorization for every attention head. We evaluate the proposed technique using VLMs such as Qwen2-VL, Qwen2.5-VL and Gemma-3. BlindSight results in a 32%-41% reduction in FLOPs on average, with accuracy within -2% to +2% of the original model, in most evaluated multi-image understanding benchmarks.
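
The three mask categories named in the abstract can be sketched as boolean attention masks. The exact definitions below are assumptions made for illustration: tokens of one image form a contiguous "document", the first token of each image is treated as its attention sink, and every mask is causal. `build_mask` and `count_allowed` are hypothetical helpers, not BlindSight's API.

```python
def build_mask(doc_ids, sink_flags, kind):
    """Return mask[q][k] = True when query q may attend to key k (causal)."""
    n = len(doc_ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: only keys at or before the query
            same_doc = doc_ids[q] == doc_ids[k]
            to_sink = sink_flags[k]
            if kind == "document":    # attend only within the same image
                allowed = same_doc
            elif kind == "sink":      # attend to sink tokens and to itself
                allowed = to_sink or k == q
            elif kind == "hybrid":    # same image, plus sinks of other images
                allowed = same_doc or to_sink
            else:
                raise ValueError(f"unknown mask kind: {kind}")
            mask[q][k] = allowed
    return mask

def count_allowed(mask):
    """Number of query-key pairs that still need to be computed."""
    return sum(cell for row in mask for cell in row)

# Two images of three tokens each; the first token of each image is its sink.
doc_ids = [0, 0, 0, 1, 1, 1]
sink_flags = [True, False, False, True, False, False]
dense_pairs = len(doc_ids) * (len(doc_ids) + 1) // 2  # full causal attention
document_pairs = count_allowed(build_mask(doc_ids, sink_flags, "document"))
sink_pairs = count_allowed(build_mask(doc_ids, sink_flags, "sink"))
hybrid_pairs = count_allowed(build_mask(doc_ids, sink_flags, "hybrid"))
```

In this toy layout the sparse masks keep 12 to 15 of the 21 causal query-key pairs; the savings grow with the number and length of images because cross-image blocks dominate the dense mask.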

[11] From Physics to Foundation Models: A Review of AI-Driven Quantitative Remote Sensing Inversion cs.CV

Zhenyu Yu, Mohd Yamani Idna Idris, Hua Wang, Pei Wang, Junyi Chen

TL;DR: This paper reviews the evolution of quantitative remote sensing inversion, from physical models to data-driven and foundation model (FM) approaches. It compares the assumptions, application scenarios, and limitations of each paradigm and discusses future directions.

Details

Motivation: As remote sensing systems and artificial intelligence advance, traditional physics-based inversion methods are giving way to data-driven and foundation-model approaches. The paper aims to review this evolution systematically and examine the strengths and challenges of the new methods.

Result: The review finds that foundation models stand out in self-supervised pretraining, multi-modal integration, and cross-task adaptation, but still face challenges in physical interpretability, domain generalization, and uncertainty quantification.

Insight: Future directions include next-generation foundation models with unified modeling capacity, cross-domain generalization, and stronger physical interpretability.

Abstract: Quantitative remote sensing inversion aims to estimate continuous surface variables (such as biomass, vegetation indices, and evapotranspiration) from satellite observations, supporting applications in ecosystem monitoring, carbon accounting, and land management. With the evolution of remote sensing systems and artificial intelligence, traditional physics-based paradigms are giving way to data-driven and foundation model (FM)-based approaches. This paper systematically reviews the methodological evolution of inversion techniques, from physical models (e.g., PROSPECT, SCOPE, DART) to machine learning methods (e.g., deep learning, multimodal fusion), and further to foundation models (e.g., SatMAE, GFM, mmEarth). We compare the modeling assumptions, application scenarios, and limitations of each paradigm, with emphasis on recent FM advances in self-supervised pretraining, multi-modal integration, and cross-task adaptation. We also highlight persistent challenges in physical interpretability, domain generalization, limited supervision, and uncertainty quantification. Finally, we envision the development of next-generation foundation models for remote sensing inversion, emphasizing unified modeling capacity, cross-domain generalization, and physical interpretability.

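[12] Taming generative video models for zero-shot optical flow extraction cs.CV

Seungwoo Kim, Khai Loong Aw, Klemen Kotar, Cristobal Eyzaguirre, Wanhee Lee

TL;DR: The paper proposes a zero-shot optical flow extraction method that requires no fine-tuning. It injects a counterfactual perturbation into a generative video model and computes a KL divergence between the resulting predictions, improving on existing optical flow methods.

Details

Motivation: Optical flow extraction is a core computer vision problem, but existing methods rely on fine-tuning or synthetic datasets and suffer from scarce labels and a sim-to-real gap. The paper explores whether flow can be extracted from frozen self-supervised video models without additional training.

Result: The method outperforms existing models, with a 16.6% relative improvement in endpoint error on TAP-Vid DAVIS and a 4.7% relative improvement on TAP-Vid Kubric.

Insight: Counterfactual prompting of controllable generative video models is a scalable and effective way to extract high-quality optical flow without supervised learning or photometric losses.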

Abstract: Extracting optical flow from videos remains a core computer vision problem. Motivated by the success of large general-purpose models, we ask whether frozen self-supervised video models trained only for future frame prediction can be prompted, without fine-tuning, to output flow. Prior work reading out depth or illumination from video generators required fine-tuning, which is impractical for flow where labels are scarce and synthetic datasets suffer from a sim-to-real gap. Inspired by the Counterfactual World Model (CWM) paradigm, which can obtain point-wise correspondences by injecting a small tracer perturbation into a next-frame predictor and tracking its propagation, we extend this idea to generative video models. We explore several popular architectures and find that successful zero-shot flow extraction in this manner is aided by three model properties: (1) distributional prediction of future frames (avoiding blurry or noisy outputs); (2) factorized latents that treat each spatio-temporal patch independently; and (3) random-access decoding that can condition on any subset of future pixels. These properties are uniquely present in the recent Local Random Access Sequence (LRAS) architecture. Building on LRAS, we propose KL-tracing: a novel test-time procedure that injects a localized perturbation into the first frame, rolls out the model one step, and computes the Kullback-Leibler divergence between perturbed and unperturbed predictive distributions. Without any flow-specific fine-tuning, our method outperforms state-of-the-art models on real-world TAP-Vid DAVIS dataset (16.6% relative improvement for endpoint error) and synthetic TAP-Vid Kubric (4.7% relative improvement). Our results indicate that counterfactual prompting of controllable generative video models is a scalable and effective alternative to supervised or photometric-loss approaches for high-quality flow.
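
The KL-tracing step can be sketched as follows, assuming the model exposes a categorical predictive distribution per next-frame location. The dictionaries below are toy stand-ins for model outputs, and the direction of the divergence (perturbed relative to clean) is an assumption; the abstract only says the divergence is computed between the two predictive distributions.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def trace_point(clean_preds, perturbed_preds):
    """Return the next-frame location whose predictive distribution changed
    most after the first-frame perturbation."""
    scores = {loc: kl_divergence(perturbed_preds[loc], clean_preds[loc])
              for loc in clean_preds}
    return max(scores, key=scores.get)

# Toy next-frame predictions at three locations (row, col), two tokens each.
clean = {(0, 0): [0.5, 0.5], (0, 1): [0.5, 0.5], (1, 1): [0.5, 0.5]}
perturbed = {(0, 0): [0.5, 0.5], (0, 1): [0.9, 0.1], (1, 1): [0.55, 0.45]}

source = (0, 0)                       # where the tracer was injected
target = trace_point(clean, perturbed)
flow = (target[0] - source[0], target[1] - source[1])
```

The location with the largest divergence is read as where the perturbed point moved, and the offset from the injection site is the flow vector for that point.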

[13] MI CAM: Mutual Information Weighted Activation Mapping for Causal Visual Explanations of Convolutional Neural Networks cs.CV | cs.LG

Ram S Iyer, Narayan S Iyer, Rugmini Ammal P

TL;DR: MI CAM is a post-hoc visual explanation method based on activation mapping. It weights feature maps by their mutual information with the input and produces explanations whose causal validity is checked with counterfactual analysis.

Details

Motivation: As vision models are increasingly deployed in critical domains such as healthcare and automation, understanding how convolutional neural networks reach their inferences has become essential.

Result: MI CAM matches or exceeds existing methods on qualitative and quantitative measures and provides unbiased explanations.

Insight: Mutual information can effectively capture the causal relationship between the input and the model's inference, offering a new direction for explanation methods.

Abstract: With the intervention of machine vision in our crucial day-to-day necessities, including healthcare and automated power plants, attention has been drawn to the internal mechanisms of convolutional neural networks and to the reasons why a network provides specific inferences. This paper proposes a novel post-hoc visual explanation method called MI CAM based on activation mapping. Differing from previous class activation mapping based approaches, MI CAM produces saliency visualizations by weighting each feature map by its mutual information with the input image, and the final result is generated by a linear combination of weights and activation maps. It also produces causal interpretations, as validated with the help of counterfactual analysis. We aim to exhibit the visual performance and unbiased justifications for the model inference procedure achieved by MI CAM. Our approach performs on par with all state-of-the-art methods and outperforms some of them in terms of qualitative and quantitative measures. The implementation of the proposed method can be found at https://anonymous.4open.science/r/MI-CAM-4D27
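
The weighting scheme described above (mutual information per feature map, then a linear combination) can be sketched on tiny discrete inputs. The sketch assumes the image and feature maps are already flattened, quantized, and the same size; how MI CAM actually estimates mutual information is not stated in the abstract, so the plug-in estimator here is an assumption.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in MI estimate (nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def mi_cam(image, feature_maps):
    """Weight each feature map by its MI with the image, then sum them."""
    weights = [mutual_information(image, fm) for fm in feature_maps]
    saliency = [sum(w * fm[i] for w, fm in zip(weights, feature_maps))
                for i in range(len(image))]
    return weights, saliency

image = [0, 0, 1, 1]         # flattened, binarized toy image
informative = [0, 0, 1, 1]   # activation that tracks the image content
constant = [1, 1, 1, 1]      # activation that ignores the input
weights, saliency = mi_cam(image, [informative, constant])
```

The constant map receives zero weight because it shares no information with the input, so the saliency map highlights only the pixels where the informative map fires.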

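[14] RadEyeVideo: Enhancing general-domain Large Vision Language Model for chest X-ray analysis with video representations of eye gaze cs.CV

Yunsoo Kim, Jinge Wu, Honghan Wu

TL;DR: RadEyeVideo integrates radiologists' eye-gaze data as a video sequence, substantially improving general-domain large vision-language models on chest X-ray analysis and report generation.

Details

Motivation: Existing methods usually ignore the sequential order of eye gaze. RadEyeVideo aims to capture the spatio-temporal dynamics of gaze through a video sequence to improve model performance and interaction.

Result: Performance improves by up to 24.6% on report generation and by 15.2% on average across both tasks, even surpassing task-specific medical LVLMs such as MAIRA-2 and CheXagent.

Insight: Effectively integrating domain-expert knowledge, such as eye-gaze information, can substantially improve general-domain models on clinical tasks and offers a scalable, human-centered approach to medical image analysis.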

Abstract: Large Vision-Language Models (LVLMs) have demonstrated promising performance in chest X-ray (CXR) analysis. To enhance human-computer interaction, several studies have incorporated radiologists’ eye gaze, typically through heatmaps or textual prompts. However, these methods often overlook the sequential order of eye movements, which could provide valuable insights by highlighting both the areas of interest and the order in which they are examined. In this work, we propose a novel approach called RadEyeVideo that integrates radiologists’ eye-fixation data as a video sequence, capturing both the temporal and spatial dynamics of their gaze. We evaluate this method in CXR report generation and disease diagnosis using three general-domain, open-source LVLMs with video input capabilities. When prompted with eye-gaze videos, model performance improves by up to 24.6% in the report generation task and on average 15.2% for both tasks using scaled evaluation metrics. Notably, RadEyeVideo enhanced an open-domain LVLM model, LLaVA-OneVision, to surpass task-specific medical LVLMs such as MAIRA-2 and CheXagent, trained on large Chest X-ray data. This work highlights that domain expert’s knowledge (eye-gaze information in this case), when effectively integrated with LVLMs, can significantly enhance general-domain models’ capabilities in clinical tasks. RadEyeVideo is a step toward a scalable human-centered approach of utilizing LVLMs in medical image analytics.
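
One plausible way to turn ordered fixations into a video-like sequence is sketched below. The data layout (x, y, duration in seconds), the frame rate, and the helper `fixations_to_frames` are assumptions for illustration; the abstract does not describe how RadEyeVideo renders its videos.

```python
def fixations_to_frames(fixations, fps=10):
    """Expand ordered (x, y, duration_s) fixations into per-frame gaze points.
    Longer fixations occupy more frames; the viewing order is preserved."""
    frames = []
    for x, y, duration in fixations:
        n_frames = max(1, round(duration * fps))
        frames.extend([(x, y)] * n_frames)
    return frames

# Three fixations on a chest X-ray, in the order the radiologist made them.
fixations = [(120, 80, 0.3), (200, 150, 0.1), (90, 210, 0.2)]
frames = fixations_to_frames(fixations)
```

Unlike a single heatmap, the frame sequence keeps both where the radiologist looked and in which order, which is the information the abstract says earlier approaches discard.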

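[15] Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning cs.CV

Yiyang Chen, Shanshan Zhao, Lunhao Duan, Changxing Ding, Dacheng Tao

TL;DR: PointSD uses a text-to-image diffusion model (Stable Diffusion) for self-supervised learning on 3D point clouds. It replaces the text encoder with a 3D encoder and trains a point-to-image diffusion model to improve point cloud representations.

Details

Motivation: Existing 3D diffusion models are limited by small 3D datasets, whereas text-to-image diffusion models such as Stable Diffusion are trained on large-scale datasets and generalize better, which may improve point cloud self-supervised learning.

Result: Experiments show that PointSD performs well on downstream point cloud tasks, demonstrating that the Stable Diffusion model benefits point cloud self-supervised learning.

Insight: Knowledge from large-scale 2D diffusion models can transfer effectively to the 3D domain, easing 3D data scarcity and pointing to new directions for cross-modal learning.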

Abstract: Diffusion-based models, widely used in text-to-image generation, have proven effective in 2D representation learning. Recently, this framework has been extended to 3D self-supervised learning by constructing a conditional point generator for enhancing 3D representations. However, its performance remains constrained by the 3D diffusion model, which is trained on the available 3D datasets with limited size. We hypothesize that the robust capabilities of text-to-image diffusion models, particularly Stable Diffusion (SD), which is trained on large-scale datasets, can help overcome these limitations. To investigate this hypothesis, we propose PointSD, a framework that leverages the SD model for 3D self-supervised learning. By replacing the SD model’s text encoder with a 3D encoder, we train a point-to-image diffusion model that allows point clouds to guide the denoising of rendered noisy images. With the trained point-to-image diffusion model, we use noise-free images as the input and point clouds as the condition to extract SD features. Next, we train a 3D backbone by aligning its features with these SD features, thereby facilitating direct semantic learning. Comprehensive experiments on downstream point cloud tasks and ablation studies demonstrate that the SD model can enhance point cloud self-supervised learning. Code is publicly available at https://github.com/wdttt/PointSD.

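[16] Hybrid Autoregressive-Diffusion Model for Real-Time Streaming Sign Language Production cs.CV

Maoxiao Ye, Xinfeng Ye, Mano Manoharan

TL;DR: The paper proposes a hybrid approach combining autoregressive and diffusion models for real-time sign language production, addressing error accumulation in conventional autoregressive models and the real-time limitations of diffusion models.

Details

Motivation: Conventional autoregressive models cannot avoid error accumulation at inference, while diffusion models cannot meet real-time requirements because of their iterative nature. The paper aims to combine the strengths of both for high-quality, real-time streaming sign language production.

Result: On the PHOENIX14T and How2Sign datasets, the method performs well in both generation quality and real-time streaming efficiency.

Insight: The hybrid model combines the strengths of autoregressive and diffusion models, offering a new approach for real-time sequence generation; the multi-scale and dynamic attention designs may carry over to other sequence generation tasks.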

Abstract: Earlier Sign Language Production (SLP) models typically relied on autoregressive methods that generate output tokens one by one, which inherently provide temporal alignment. Although techniques like Teacher Forcing can prevent model collapse during training, they still cannot solve the problem of error accumulation during inference, since ground truth is unavailable at that stage. In contrast, more recent approaches based on diffusion models leverage step-by-step denoising to enable high-quality generation. However, the iterative nature of these models and the requirement to denoise entire sequences limit their applicability in real-time tasks like SLP. To address it, we apply a hybrid approach combining autoregressive and diffusion models to SLP for the first time, leveraging the strengths of both models in sequential dependency modeling and output refinement. To capture fine-grained body movements, we design a Multi-Scale Pose Representation module that separately extracts detailed features from distinct articulators and integrates them via a Multi-Scale Fusion module. Furthermore, we introduce a Confidence-Aware Causal Attention mechanism that utilizes joint-level confidence scores to dynamically guide the pose generation process, improving accuracy and robustness. Extensive experiments on the PHOENIX14T and How2Sign datasets demonstrate the effectiveness of our method in both generation quality and real-time streaming efficiency.

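[17] RoHOI: Robustness Benchmark for Human-Object Interaction Detection cs.CV | cs.HC | cs.RO | eess.IV

Di Wen, Kunyu Peng, Kailun Yang, Yufan Chen, Ruiping Liu

TL;DR: This paper introduces RoHOI, the first robustness benchmark for human-object interaction (HOI) detection, which evaluates models under a range of corruptions, and proposes a Semantic-Aware Masking-based Progressive Learning (SAMPL) strategy to improve robustness.

Details

Motivation: HOI detection is crucial for robot-human assistance, but existing models degrade markedly in the real world under environmental corruptions such as occlusion and noise. A way to evaluate and improve model robustness is therefore needed.

Result: Experiments show that the SAMPL strategy outperforms existing methods in robustness, setting a new standard for HOI detection.

Insight: Models are sensitive to corruptions, especially environmental variability and noise; dynamically adjusting the optimization strategy can effectively improve robustness.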

Abstract: Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. However, models trained on clean datasets degrade in real-world conditions due to unforeseen corruptions, leading to inaccurate prediction. To address this, we introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Despite advances, current models struggle with environmental variability, occlusion, and noise. Our benchmark, RoHOI, includes 20 corruption types based on HICO-DET and V-COCO datasets and a new robustness-focused metric. We systematically analyze existing models in the related field, revealing significant performance drops under corruptions. To improve robustness, we propose a Semantic-Aware Masking-based Progressive Learning (SAMPL) strategy to guide the model to be optimized based on holistic and partial cues, dynamically adjusting the model’s optimization to enhance robust feature learning. Extensive experiments show our approach outperforms state-of-the-art methods, setting a new standard for robust HOI detection. Benchmarks, datasets, and code will be made publicly available at https://github.com/Kratos-Wen/RoHOI.

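[18] Mind the Gap: Preserving and Compensating for the Modality Gap in CLIP-Based Continual Learning cs.CV | cs.LG

Linlan Huang, Xusheng Cao, Haori Lu, Yifan Meng, Fei Yang

TL;DR: The paper proposes MG-CLIP, a modality-gap-based continual learning method for CLIP that preserves and compensates for the modality gap to mitigate forgetting and improve learning of new data; experiments show it outperforms existing methods.

Details

Motivation: Existing CLIP-based continual learning work overlooks the modality gap. The paper finds that the modality gap reflects how well pre-trained knowledge is preserved, and proposes using this property to improve continual learning.

Result: Experiments show MG-CLIP outperforms existing methods on class-incremental learning, particularly in retaining knowledge and adapting to new tasks.

Insight: The modality gap is a key indicator in CLIP that can be used to balance knowledge retention and adaptation to new tasks in continual learning.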

Abstract: Continual learning aims to enable models to learn sequentially from continuously incoming data while retaining performance on previously learned tasks. With the Contrastive Language-Image Pre-trained model (CLIP) exhibiting strong capabilities across various downstream tasks, there has been growing interest in leveraging CLIP for continual learning in such scenarios. Most existing works overlook the inherent modality gap in CLIP, a key factor in its generalization and adaptability. In this paper, we analyze the variations in the modality gap during the fine-tuning of vision-language pre-trained models. Our observations reveal that the modality gap effectively reflects the extent to which pre-trained knowledge is preserved. Based on these insights, we propose a simple yet effective method, MG-CLIP, that improves CLIP’s performance in class-incremental learning. Our approach leverages modality gap preservation to mitigate forgetting and modality gap compensation to enhance the capacity for new data, introducing a novel modality-gap-based perspective for continual learning. Extensive experiments on multiple benchmarks demonstrate that our method outperforms existing approaches without requiring additional replay data. Our code is available at https://github.com/linlany/MindtheGap.
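
A common way to quantify the modality gap is the distance between the centroids of image and text embeddings; the sketch below uses that definition as an assumption, since the abstract does not say how MG-CLIP measures the gap. The toy embeddings and helper names are illustrative.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def modality_gap(image_embs, text_embs):
    """Euclidean distance between the image and text embedding centroids."""
    ci, ct = centroid(image_embs), centroid(text_embs)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ci, ct)))

# Toy unit-norm embeddings before fine-tuning (assumed values).
image_embs = [[1.0, 0.0], [1.0, 0.0]]
text_embs = [[0.0, 1.0], [0.0, 1.0]]
gap_before = modality_gap(image_embs, text_embs)

# After fine-tuning on a new task the text embeddings drift toward the images.
drifted_text = [[0.6, 0.8], [0.6, 0.8]]
gap_after = modality_gap(image_embs, drifted_text)
```

Tracking a scalar like this across fine-tuning steps is one way to monitor how far a model has moved from its pre-trained geometry, which is the signal the abstract ties to knowledge preservation.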

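[19] SnapMoGen: Human Motion Generation from Expressive Texts cs.CV

Chuan Guo, Inwoo Hwang, Jian Wang, Bing Zhou

TL;DR: SnapMoGen introduces a new text-motion dataset and an improved generative model, MoMask++, enabling long-sequence motion generation and fine-grained control with better expressiveness and generalization.

Details

Motivation: Existing text-to-motion methods are limited to short or generic prompts. Dataset constraints leave fine-grained control and generalization to unseen prompts lacking.

Result: MoMask++ achieves state-of-the-art performance on the HumanML3D and SnapMoGen benchmarks and can handle casual user prompts.

Insight: Long-sequence motion generation requires rich text annotations and multi-scale modeling, and aligning user input with the dataset's narration style is key to practical usability.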

Abstract: Text-to-motion generation has experienced remarkable progress in recent years. However, current approaches remain limited to synthesizing motion from short or general text prompts, primarily due to dataset constraints. This limitation undermines fine-grained controllability and generalization to unseen prompts. In this paper, we introduce SnapMoGen, a new text-motion dataset featuring high-quality motion capture data paired with accurate, expressive textual annotations. The dataset comprises 20K motion clips totaling 44 hours, accompanied by 122K detailed textual descriptions averaging 48 words per description (vs. 12 words of HumanML3D). Importantly, these motion clips preserve original temporal continuity as they were in long sequences, facilitating research in long-term motion generation and blending. We also improve upon previous generative masked modeling approaches. Our model, MoMask++, transforms motion into multi-scale token sequences that better exploit the token capacity, and learns to generate all tokens using a single generative masked transformer. MoMask++ achieves state-of-the-art performance on both HumanML3D and SnapMoGen benchmarks. Additionally, we demonstrate the ability to process casual user prompts by employing an LLM to reformat inputs to align with the expressivity and narration style of SnapMoGen. Project webpage: https://snap-research.github.io/SnapMoGen/

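[20] PoseLLM: Enhancing Language-Guided Human Pose Estimation with MLP Alignment cs.CV

Dewen Zhang, Tahir Hussain, Wangpeng An, Hayaru Shouno

TL;DR: PoseLLM proposes a pose estimation framework based on a large language model (LLM) that replaces the linear projector with a nonlinear MLP vision-language connector, improving localization accuracy and cross-modal feature fusion.

Details

Motivation: Traditional pose estimation relies on keypoint priors and generalizes poorly, while the linear projector in existing language-guided methods such as LocLLM cannot capture complex spatial-textual interactions.

Result: Trained only on COCO, PoseLLM reaches 77.8 AP on the COCO validation set and shows zero-shot generalization on Human-Art and MPII.

Insight: A nonlinear connector effectively improves language-guided pose estimation without sacrificing generalization.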

Abstract: Human pose estimation traditionally relies on architectures that encode keypoint priors, limiting their generalization to novel poses or unseen keypoints. Recent language-guided approaches like LocLLM reformulate keypoint localization as a vision-language task, enabling zero-shot generalization through textual descriptions. However, LocLLM’s linear projector fails to capture complex spatial-textual interactions critical for high-precision localization. To address this, we propose PoseLLM, the first Large Language Model (LLM)-based pose estimation framework that replaces the linear projector with a nonlinear MLP vision-language connector. This lightweight two-layer MLP with GELU activation enables hierarchical cross-modal feature transformation, enhancing the fusion of visual patches and textual keypoint descriptions. Trained exclusively on COCO data, PoseLLM achieves 77.8 AP on the COCO validation set, outperforming LocLLM by +0.4 AP, while maintaining strong zero-shot generalization on Human-Art and MPII. Our work demonstrates that a simple yet powerful nonlinear connector significantly boosts localization accuracy without sacrificing generalization, advancing the state-of-the-art in language-guided pose estimation. Code is available at https://github.com/Ody-trek/PoseLLM.
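
The connector described above is a two-layer MLP with a GELU activation between the layers. The following is a minimal pure-Python sketch of that structure for a single visual patch feature. The function names, dimensions, and toy weights are hypothetical and are not taken from the PoseLLM code.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, weight, bias):
    """Affine map; weight is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def mlp_connector(patch, w1, b1, w2, b2):
    """Project a visual patch feature with Linear -> GELU -> Linear."""
    hidden = [gelu(h) for h in linear(patch, w1, b1)]
    return linear(hidden, w2, b2)

# Toy example: 2-D patch feature, 2 hidden units, 1 output unit.
patch = [1.0, -1.0]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, 1.0]], [0.0]
out = mlp_connector(patch, w1, b1, w2, b2)
```

A single linear projector would map this patch to 0 with the same output weights, whereas the GELU makes the response depend on the sign of each hidden unit, which is the kind of nonlinearity the abstract argues for.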

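[21] $I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting cs.CV

Zhimin Liao, Ping Wei, Ruijie Zhang, Shuaijia Chen, Haoxuan Wang

TL;DR: I²-World proposes an efficient framework for dynamic 4D scene forecasting that decouples scene tokenization into intra-scene and inter-scene tokenizers, modeling temporal dependencies efficiently while preserving spatial detail.

Details

Motivation: Occupancy-based 3D world models for autonomous driving are limited in handling complex scenes. In particular, efficiently tokenizing complex 3D scenes to support dynamic forecasting remains a key challenge.

Result: On 4D occupancy forecasting, the method outperforms existing approaches by 25.1% in mIoU and 36.9% in IoU, while requiring only 2.9 GB of training memory and running inference at 37.0 FPS.

Insight: A divide-and-conquer tokenization strategy that separates space and time improves efficiency while retaining dynamic modeling capability, showing that spatiotemporal decoupling is effective for complex scene forecasting.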

Abstract: Forecasting the evolution of 3D scenes and generating unseen scenarios via occupancy-based world models offers substantial potential for addressing corner cases in autonomous driving systems. While tokenization has revolutionized image and video generation, efficiently tokenizing complex 3D scenes remains a critical challenge for 3D world models. To address this, we propose $I^{2}$-World, an efficient framework for 4D occupancy forecasting. Our method decouples scene tokenization into intra-scene and inter-scene tokenizers. The intra-scene tokenizer employs a multi-scale residual quantization strategy to hierarchically compress 3D scenes while preserving spatial details. The inter-scene tokenizer residually aggregates temporal dependencies across timesteps. This dual design preserves the compactness of 3D tokenizers while retaining the dynamic expressiveness of 4D tokenizers. Unlike decoder-only GPT-style autoregressive models, $I^{2}$-World adopts an encoder-decoder architecture. The encoder aggregates spatial context from the current scene and predicts a transformation matrix to enable high-level control over scene generation. The decoder, conditioned on this matrix and historical tokens, ensures temporal consistency during generation. Experiments demonstrate that $I^{2}$-World achieves state-of-the-art performance, outperforming existing methods by 25.1% in mIoU and 36.9% in IoU for 4D occupancy forecasting while exhibiting exceptional computational efficiency: it requires merely 2.9 GB of training memory and achieves real-time inference at 37.0 FPS. Our code is available on https://github.com/lzzzzzm/II-World.
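
The intra-scene tokenizer relies on residual quantization, where each stage quantizes whatever the previous stages failed to capture. The sketch below illustrates only that generic idea on a single scalar with hand-picked codebooks. It is an assumption-laden toy, not the paper's multi-scale 3D tokenizer, and all names and values are hypothetical.

```python
def nearest(codebook, value):
    """Index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))

def residual_quantize(value, codebooks):
    """Quantize value stage by stage, each stage coding the remaining residual."""
    residual = value
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual -= codebook[idx]
    return indices, residual

def reconstruct(indices, codebooks):
    """Sum the selected entries from every stage."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))

# Coarse-to-fine codebooks (hypothetical values).
codebooks = [[-1.0, 0.0, 1.0], [-0.5, 0.0, 0.5], [-0.25, 0.0, 0.25]]
indices, residual = residual_quantize(0.8, codebooks)
approx = reconstruct(indices, codebooks)
```

Each extra stage shrinks the leftover residual, so a short list of small indices can represent a value more precisely than one coarse codebook alone.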

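[22] Learning and Transferring Better with Depth Information in Visual Reinforcement Learning cs.CV | cs.RO

Zichun Xu, Yuntao Li, Zhaomin Wang, Lei Zhuang, Guocai Yang

TL;DR: The paper proposes a vision-transformer-based visual backbone that fuses RGB and depth modalities to improve generalization in reinforcement learning, combined with contrastive unsupervised learning and curriculum learning for efficient sim2real transfer.

Details

Motivation: Depth information is robust to changes in scene appearance and carries 3D spatial detail, so combining RGB and depth in visual reinforcement learning can improve generalization and sample efficiency.

Result: The method performs well in multimodal fusion and sim2real transfer, improving generalization and sample efficiency in visual reinforcement learning.

Insight: Fusing depth with RGB is an effective way to improve generalization in visual reinforcement learning, and combining contrastive learning with curriculum learning can further improve training efficiency and transfer.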

Abstract: Depth information is robust to scene appearance variations and inherently carries 3D spatial details. In this paper, a visual backbone based on the vision transformer is proposed to fuse RGB and depth modalities for enhancing generalization. Different modalities are first processed by separate CNN stems, and the combined convolutional features are delivered to the scalable vision transformer to obtain visual representations. Moreover, a contrastive unsupervised learning scheme is designed with masked and unmasked tokens to improve sample efficiency during reinforcement learning. For sim2real transfer, a flexible curriculum learning schedule is developed to deploy domain randomization over the training process.

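[23] MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models cs.CV

Qiyan Zhao, Xiaofeng Zhang, Yiheng Li, Yun Xing, Xiaosong Yuan

TL;DR: The paper shows that long-term decay in RoPE causes a multimodal alignment bias in large vision-language models (LVLMs) and proposes MCA-LLaVA, based on Manhattan distance, to reduce hallucination.

Details

Motivation: Hallucination is a serious problem in LVLMs. The authors find that the long-term decay of Rotary Position Encoding (RoPE) biases image-instruction alignment, which calls for better multimodal alignment.

Result: Experiments show that MCA-LLaVA performs well across multiple hallucination and general benchmarks and effectively reduces hallucination.

Insight: Position-encoding design is critical for multimodal alignment, and Manhattan distance offers a more balanced way to model spatial perception.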

Abstract: Hallucinations pose a significant challenge in Large Vision Language Models (LVLMs), with misalignment between multimodal features identified as a key contributing factor. This paper reveals the negative impact of the long-term decay in Rotary Position Encoding (RoPE), used for positional modeling in LVLMs, on multimodal alignment. Concretely, under long-term decay, instruction tokens exhibit uneven perception of image tokens located at different positions within the two-dimensional space: prioritizing image tokens from the bottom-right region since in the one-dimensional sequence, these tokens are positionally closer to the instruction tokens. This biased perception leads to insufficient image-instruction interaction and suboptimal multimodal alignment. We refer to this phenomenon as image alignment bias. To enhance instruction’s perception of image tokens at different spatial locations, we propose MCA-LLaVA, based on Manhattan distance, which extends the long-term decay to a two-dimensional, multi-directional spatial decay. MCA-LLaVA integrates the one-dimensional sequence order and two-dimensional spatial position of image tokens for positional modeling, mitigating hallucinations by alleviating image alignment bias. Experimental results of MCA-LLaVA across various hallucination and general benchmarks demonstrate its effectiveness and generality. The code can be accessed in https://github.com/ErikZ719/MCA-LLaVA.
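
The bias described above comes from flattening a 2D grid of image tokens into a 1D sequence. The toy sketch below contrasts the 1D sequence distance with a 2D Manhattan distance for two tokens that are spatially symmetric. It assumes row-by-row flattening with the instruction tokens placed right after the image tokens, and it is not MCA-LLaVA's actual attention formulation. All names are hypothetical.

```python
def raster_distance(row, col, height, width):
    """1D distance from an image token to the first instruction token.

    Assumes image tokens are flattened row by row and the instruction
    starts immediately after the last image token.
    """
    return height * width - (row * width + col)

def manhattan(p, q):
    """Manhattan (L1) distance between two grid positions."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

H, W = 4, 4
top_right, bottom_left = (0, 3), (3, 0)
bottom_right = (H - 1, W - 1)

# In the 1D sequence the two tokens are far from equally distant...
d1_top_right = raster_distance(*top_right, H, W)
d1_bottom_left = raster_distance(*bottom_left, H, W)

# ...but in 2D both are equally far from the bottom-right corner.
d2_top_right = manhattan(top_right, bottom_right)
d2_bottom_left = manhattan(bottom_left, bottom_right)
```

Under a distance-based decay, the 1D view would favor the bottom-left token even though both tokens sit symmetrically in the image, which is the kind of imbalance a 2D distance avoids.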

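[24] THYME: Temporal Hierarchical-Cyclic Interactivity Modeling for Video Scene Graphs in Aerial Footage cs.CV

Trong-Thuan Nguyen, Pha Nguyen, Jackson Cothren, Alper Yilmaz, Minh-Triet Tran

TL;DR: THYME combines hierarchical feature aggregation with cyclic temporal refinement for video scene graph generation, improving the accuracy and consistency of dynamic scene understanding.

Details

Motivation: Existing video scene graph generation methods cannot capture fine-grained spatial detail and long-range temporal dependencies at the same time, which yields fragmented representations and poor understanding of dynamic scenes.

Result: THYME outperforms existing methods on the ASPIRe and AeroEye-v1.0 datasets.

Insight: Jointly designed hierarchical and cyclic mechanisms are key to better dynamic scene graph generation, and high-quality datasets are essential for model evaluation.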

Abstract: The rapid proliferation of video in applications such as autonomous driving, surveillance, and sports analytics necessitates robust methods for dynamic scene understanding. Despite advances in static scene graph generation and early attempts at video scene graph generation, previous methods often suffer from fragmented representations, failing to capture fine-grained spatial details and long-range temporal dependencies simultaneously. To address these limitations, we introduce the Temporal Hierarchical Cyclic Scene Graph (THYME) approach, which synergistically integrates hierarchical feature aggregation with cyclic temporal refinement. In particular, THYME effectively models multi-scale spatial context and enforces temporal consistency across frames, yielding more accurate and coherent scene graphs. In addition, we present AeroEye-v1.0, a novel aerial video dataset enriched with five types of interactivity that overcome the constraints of existing datasets and provide a comprehensive benchmark for dynamic scene graph generation. Empirically, extensive experiments on ASPIRe and AeroEye-v1.0 demonstrate that the proposed THYME approach outperforms state-of-the-art methods, offering improved scene understanding in ground-view and aerial scenarios.

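[25] Visual Surface Wave Elastography: Revealing Subsurface Physical Properties via Visible Surface Waves cs.CV

Alexander C. Ogren, Berthy T. Feng, Jihoon Ahn, Katherine L. Bouman, Chiara Daraio

TL;DR: The paper proposes a method that infers a material's thickness and stiffness from how surface waves propagate: it extracts the waves' dispersion relation from video and fits the parameters through a physics-based optimization. Experiments validate its accuracy on simulated and real data.

Details

Motivation: Surface-wave propagation carries information about physical properties beneath a material's surface, but existing methods usually require complex equipment or invasive measurement. The authors aim for contactless, low-cost inference of physical properties through video analysis.

Result: Experiments on simulated and real data show strong agreement with ground-truth measurements.

Insight: The method opens the possibility of at-home health monitoring and also applies to fields such as human-computer interaction, showing the potential of combining computer vision with physical models.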

Abstract: Wave propagation on the surface of a material contains information about physical properties beneath its surface. We propose a method for inferring the thickness and stiffness of a structure from just a video of waves on its surface. Our method works by extracting a dispersion relation from the video and then solving a physics-based optimization problem to find the best-fitting thickness and stiffness parameters. We validate our method on both simulated and real data, in both cases showing strong agreement with ground-truth measurements. Our technique provides a proof-of-concept for at-home health monitoring of medically-informative tissue properties, and it is further applicable to fields such as human-computer interaction.

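[26] Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models cs.CV

Xiao Liang, Di Wang, Zhicheng Jiao, Ronghan Li, Pengfei Yang

TL;DR: The paper proposes Expert-CFG, an expert-in-the-loop framework that uses uncertainty estimation and expert guidance to improve the reliability of medical vision-language models without additional training.

Details

Motivation: Current medical vision-language models (MedVLMs) have inherent probabilistic uncertainty and may produce erroneous or unverified responses, a serious risk in medical applications. Existing methods rely on changing the model structure or fine-tuning on data, which is costly and still insufficiently aligned with clinical expertise.

Result: On three medical visual question answering benchmarks, Expert-CFG (4.2B parameters), with only limited expert annotations, outperforms state-of-the-art models with 13B parameters.

Insight: Expert-CFG makes it feasible to deploy reliable medical vision-language models in resource-limited settings and highlights how efficient expert intervention can be for improving model outputs.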

Abstract: The rapid advancements in Vision Language Models (VLMs) have prompted the development of multi-modal medical assistant systems. Despite this progress, current models still have inherent probabilistic uncertainties, often producing erroneous or unverified responses, an issue with serious implications in medical applications. Existing methods aim to enhance the performance of Medical Vision Language Model (MedVLM) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning. However, these training-dependent strategies are costly and still lack sufficient alignment with clinical expertise. To address these issues, we propose an expert-in-the-loop framework named Expert-Controlled Classifier-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training. This framework introduces an uncertainty estimation strategy to identify unreliable outputs. It then retrieves relevant references to assist experts in highlighting key terms and applies classifier-free guidance to refine the token embeddings of MedVLM, ensuring that the adjusted outputs are correct and align with expert highlights. Evaluations across three medical visual question answering benchmarks demonstrate that the proposed Expert-CFG, with 4.2B parameters and limited expert annotations, outperforms state-of-the-art models with 13B parameters. The results demonstrate the feasibility of deploying such a system in resource-limited settings for clinical use.
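
Classifier-free guidance is usually written as an extrapolation from an unconditional prediction toward a conditional one. The sketch below shows that standard formula on plain vectors. How Expert-CFG builds the conditional branch from expert highlights is not specified in the abstract, so the inputs and the function name here are hypothetical.

```python
def classifier_free_guidance(cond, uncond, scale):
    """Standard CFG combination: uncond + scale * (cond - uncond).

    scale = 0 returns the unconditional vector, scale = 1 returns the
    conditional one, and scale > 1 pushes further toward the condition.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

# Hypothetical 2-D predictions with and without the expert condition.
cond = [1.0, 0.0]
uncond = [0.5, 0.5]
guided = classifier_free_guidance(cond, uncond, scale=2.0)
```

Because the combination happens at inference time, the guidance strength can be tuned without retraining the underlying model, which matches the training-free setting described above.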

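[27] Stereo-based 3D Anomaly Object Detection for Autonomous Driving: A New Dataset and Baseline cs.CV

Shiyi Mu, Zichong Gu, Hanqi Lyu, Yilin Gao, Shugong Xu

TL;DR: The paper proposes S3AD, a stereo-based 3D anomaly object detection algorithm, and releases a new dataset, KITTI-AR, to improve the generalization and anomaly detection performance of 3D detection models in autonomous driving.

Details

Motivation: In autonomous driving, 3D detection models trained on closed sets struggle to recognize rare anomalous objects on the road. Models need to generalize to targets of arbitrary shape and be able to filter out anomalies.

Result: Experiments verify the performance of the S3AD algorithm and the KITTI-AR dataset, showing improved 3D anomaly detection.

Insight: Decoupling 2D and 3D training releases the model's ability to generalize to arbitrary 3D foreground detection, and synthetic data effectively addresses sample sparsity.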

Abstract: 3D detection technology is widely used in the field of autonomous driving, with its application scenarios gradually expanding from enclosed highways to open conventional roads. For rare anomaly categories that appear on the road, 3D detection models trained on closed sets often misdetect or fail to detect anomaly objects. To address this risk, it is necessary to enhance the generalization ability of 3D detection models for targets of arbitrary shapes and to equip them with the capability to filter out anomalies. The generalization of 3D detection is limited by two factors: the coupled training of 2D and 3D, and the insufficient diversity in the scale distribution of training samples. This paper proposes a Stereo-based 3D Anomaly object Detection (S3AD) algorithm, which decouples the training strategy of 3D and 2D to release the generalization ability for arbitrary 3D foreground detection, and proposes an anomaly scoring algorithm based on foreground confidence prediction, achieving target-level anomaly scoring. In order to further verify and enhance the generalization of anomaly detection, we use a 3D rendering method to synthesize two augmented reality binocular stereo 3D detection datasets, named KITTI-AR. KITTI-AR extends KITTI by adding 97 new categories, totaling 6k pairs of stereo images. The KITTI-AR-ExD subset includes 39 common categories as extra training data to address the sparse sample distribution issue. Additionally, 58 rare categories form the KITTI-AR-OoD subset, which is not used in training, to simulate zero-shot scenarios in real-world settings; it serves solely for evaluating 3D anomaly detection. Finally, the performance of the algorithm and the dataset is verified in the experiments. (Code and dataset can be obtained at https://github.com/xxxx/xxx).

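[28] 360-Degree Full-view Image Segmentation by Spherical Convolution compatible with Large-scale Planar Pre-trained Models cs.CV

Jingguo Liu, Han Yu, Shigang Li, Jianfeng Li

TL;DR: The paper proposes a novel spherical sampling method that lets existing 2D pre-trained models be used directly for panoramic image tasks, addressing the distortions and discontinuities of panoramic images, and achieves good results on the indoor dataset Stanford2D3D.

Details

Motivation: Because large-scale panoramic image datasets are lacking, existing tasks rely on 2D pre-trained models, which cannot recognize the distortions and discontinuities in panoramic images and therefore lose performance. The paper addresses this with spherical sampling.

Result: The method achieves good segmentation results on the commonly used indoor dataset Stanford2D3D.

Insight: Combining spherical sampling with channel attention makes it possible to use existing 2D pre-trained models for panoramic image tasks, offering a new direction for related research.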

Abstract: Due to the current lack of million-scale panoramic datasets, tasks involving panoramic images predominantly rely on existing two-dimensional pre-trained image benchmark models as backbone networks. However, these networks are not equipped to recognize the distortions and discontinuities inherent in panoramic images, which adversely affects their performance in such tasks. In this paper, we introduce a novel spherical sampling method for panoramic images that enables the direct utilization of existing pre-trained models developed for two-dimensional images. Our method employs spherical discrete sampling based on the weights of the pre-trained models, effectively mitigating distortions while achieving favorable initial training values. Additionally, we apply the proposed sampling method to panoramic image segmentation, utilizing features obtained from the spherical model as masks for specific channel attentions, which yields commendable results on the commonly used indoor dataset Stanford2D3D.

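[29] Online Long-term Point Tracking in the Foundation Model Era cs.CV

Görkay Aydemir

TL;DR: The work proposes Track-On, an online long-term point tracking method built on visual foundation models and a transformer architecture, which addresses long-term tracking under causal constraints.

Details

Motivation: Most existing long-term point tracking methods work offline and depend on future frames, whereas practical applications such as streaming video and embodied AI require online processing. The work aims to design an online tracker that needs no future information.

Result: Track-On sets a new state of the art across seven public benchmarks, showing that online long-term point tracking is feasible without access to future frames.

Insight: Visual foundation models lack temporal reasoning on their own, but combining them with a dedicated memory mechanism (such as transformer queries) can significantly improve online tracking. Causal design is the key challenge.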

Abstract: Point tracking aims to identify the same physical point across video frames and serves as a geometry-aware representation of motion. This representation supports a wide range of applications, from robotics to augmented reality, by enabling accurate modeling of dynamic environments. Most existing long-term tracking approaches operate in an offline setting, where future frames are available to refine predictions and recover from occlusions. However, real-world scenarios often demand online predictions: the model must operate causally, using only current and past frames. This constraint is critical in streaming video and embodied AI, where decisions must be made immediately based on past observations. Under such constraints, viewpoint invariance becomes essential. Visual foundation models, trained on diverse large-scale datasets, offer the potential for robust geometric representations. While they lack temporal reasoning on their own, they can be integrated into tracking pipelines to enrich spatial features. In this thesis, we address the problem of long-term point tracking in an online setting, where frames are processed sequentially without access to future information or sliding windows. We begin by evaluating the suitability of visual foundation models for this task and find that they can serve as useful initializations and be integrated into tracking pipelines. However, to enable long-term tracking in an online setting, a dedicated design is still required. In particular, maintaining coherence over time in this causal regime requires memory to propagate appearance and context across frames. To address this, we introduce Track-On, a transformer-based model that treats each tracked point as a query and processes video frames one at a time. Track-On sets a new state of the art across seven public benchmarks, demonstrating the feasibility of long-term tracking without future access.

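[30] Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift cs.CV | cs.LG

Behraj Khan, Tahir Syed

TL;DR: The paper proposes StaRFM, a framework that uses a Fisher information penalty and a confidence misalignment penalty to address distribution shift and confidence misalignment in foundation models, improving performance on vision-language and medical imaging tasks.

Details

Motivation: Foundation models such as CLIP and SAM excel at low-shot transfer learning but face distribution shift and confidence misalignment at deployment. These problems manifest differently across tasks (e.g., vision-language classification and medical image segmentation), and existing solutions are mostly domain-specific, so a unified framework is needed.

Result: On 19 vision datasets (e.g., ImageNet, Office-Home), StaRFM improves accuracy by 3.5% and lowers ECE by 28%. In medical image segmentation (e.g., BraTS, ATLAS), it reaches 84.7% DSC and 4.8 mm HD95, with a 40% smaller cross-domain performance gap.

Insight: StaRFM offers a general approach to improving foundation models under distribution shift and miscalibration, supported by both theory and experiments. Its plug-and-play design makes it easy to integrate into existing models.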

Abstract: Foundation models like CLIP and SAM have transformed computer vision and medical imaging via low-shot transfer learning. However, deployment of these models is hindered by two key challenges: distribution shift between training and test data, and confidence misalignment that leads to overconfident incorrect predictions. These issues manifest differently in vision-language classification and medical segmentation tasks, yet existing solutions remain domain-specific. We propose StaRFM, a unified framework addressing both challenges. It introduces a Fisher information penalty (FIP), extended to 3D medical data via patch-wise regularization, to reduce covariate shift in CLIP and SAM embeddings. Additionally, a confidence misalignment penalty (CMP), reformulated for voxel-level predictions, calibrates uncertainty in segmentation tasks. We theoretically derive PAC-Bayes bounds showing FIP controls generalization via the Fisher-Rao norm, while CMP minimizes calibration error through Brier score optimization. StaRFM shows consistent gains: +3.5% accuracy and 28% lower ECE on 19 vision datasets (e.g., ImageNet, Office-Home), 84.7% DSC and 4.8 mm HD95 in medical segmentation (e.g., BraTS, ATLAS), and a 40% lower cross-domain performance gap compared to prior benchmarking methods. The framework is plug-and-play, requiring minimal architectural changes for seamless integration with foundation models. Code and models will be released at https://anonymous.4open.science/r/StaRFM-C0CD/README.md
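
The abstract reports calibration through ECE and ties CMP to the Brier score. As a minimal sketch of those two standard metrics, assuming equal-width confidence bins for ECE and binary outcomes for the Brier score, they could be computed as follows. This is not StaRFM's evaluation code, and the toy predictions are hypothetical.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy per bin.

    Uses equal-width bins over [0, 1]; a confidence of 1.0 falls in the
    last bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece

# Hypothetical predictions: confidence and whether each was correct.
confidences = [0.9, 0.9, 0.6, 0.6]
correct = [1, 0, 1, 1]
ece = expected_calibration_error(confidences, correct)
brier = brier_score(confidences, correct)
```

In this toy case the model is overconfident in the 0.9 bin and underconfident in the 0.6 bin, and both gaps contribute equally to the ECE.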

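[31] EgoAnimate: Generating Human Animations from Egocentric top-down Views cs.CV

G. Kutay Türkoglu, Julian Tanke, Iheb Belgacem, Lev Markhasin

TL;DR: The paper proposes EgoAnimate, which uses a generative prior to produce realistic animatable avatars from a single top-down egocentric image, removing the dependence of prior methods on multi-view data.

Details

Motivation: Existing avatar generation methods mostly rely on multi-view data or complex equipment. Top-down egocentric images are portable and easy to capture but suffer from occlusion and distorted body proportions. The study uses a generative prior to address these problems, reduce training burden, and improve generalizability.

Result: The method generates realistic animatable avatars from a single top-down image, reducing training data requirements and improving generation quality and generalizability.

Insight: Generative priors can address occlusion and distortion in egocentric images, offering a viable route to low-cost, portable virtual reality systems.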

Abstract: An ideal digital telepresence experience requires accurate replication of a person’s body, clothing, and movements. To capture and transfer these movements into virtual reality, the egocentric (first-person) perspective can be adopted, which enables the use of a portable and cost-effective device without front-view cameras. However, this viewpoint introduces challenges such as occlusions and distorted body proportions. There are few works reconstructing human appearance from egocentric views, and none use a generative prior-based approach. Some methods create avatars from a single egocentric image during inference, but still rely on multi-view datasets during training. To our knowledge, this is the first study using a generative backbone to reconstruct animatable avatars from egocentric inputs. Based on Stable Diffusion, our method reduces training burden and improves generalizability. Inspired by methods such as SiTH and MagicMan, which perform 360-degree reconstruction from a frontal image, we introduce a pipeline that generates realistic frontal views from occluded top-down images using ControlNet and a Stable Diffusion backbone. Our goal is to convert a single top-down egocentric image into a realistic frontal representation and feed it into an image-to-motion model. This enables generation of avatar motions from minimal input, paving the way for more accessible and generalizable telepresence systems.

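[32] PPJudge: Towards Human-Aligned Assessment of Artistic Painting Process cs.CV

Shiqi Jiang, Xinpeng Li, Xi Mao, Changbo Wang, Chenhui Li

TL;DR: The paper proposes a new framework for assessing the artistic painting process, comprising PPAD, the first large-scale dataset of its kind, and PPJudge, a Transformer-based model that assesses multi-stage painting processes dynamically and aligns better with human judgment.

Details

Motivation: Existing artistic image assessment methods focus on static final images and overlook the dynamic, multi-stage nature of painting. The paper fills this gap with a dynamic assessment framework that better matches human judgment.

Result: Experiments show that PPJudge outperforms existing baselines in accuracy, robustness, and alignment with human judgment.

Insight: The work offers a new perspective for computational creativity and art education, emphasizing that assessing the dynamic process is more informative than analyzing static images alone.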

Abstract: Artistic image assessment has become a prominent research area in computer vision. In recent years, the field has witnessed a proliferation of datasets and methods designed to evaluate the aesthetic quality of paintings. However, most existing approaches focus solely on static final images, overlooking the dynamic and multi-stage nature of the artistic painting process. To address this gap, we propose a novel framework for human-aligned assessment of painting processes. Specifically, we introduce the Painting Process Assessment Dataset (PPAD), the first large-scale dataset comprising real and synthetic painting process images, annotated by domain experts across eight detailed attributes. Furthermore, we present PPJudge (Painting Process Judge), a Transformer-based model enhanced with temporally-aware positional encoding and a heterogeneous mixture-of-experts architecture, enabling effective assessment of the painting process. Experimental results demonstrate that our method outperforms existing baselines in accuracy, robustness, and alignment with human judgment, offering new insights into computational creativity and art education.

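[33] AGCD-Net: Attention Guided Context Debiasing Network for Emotion Recognition cs.CV | cs.AI

Varsha Devi, Amine Bohi, Pardeep Kumar

TL;DR: AGCD-Net proposes an attention-guided context debiasing method that removes context bias in emotion recognition through a causal intervention module, combined with Hybrid ConvNeXt for stronger feature recalibration, and performs strongly on the CAER-S dataset.

Details

Motivation: Traditional context-aware emotion recognition methods are prone to context bias (e.g., spurious correlations between background and emotion labels), which degrades recognition performance.

Result: AGCD-Net outperforms existing methods on the CAER-S dataset, validating the effectiveness of causal debiasing.

Insight: Context debiasing is essential for robust emotion recognition in complex scenes, and causal intervention is an effective way to achieve it.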

Abstract: Context-aware emotion recognition (CAER) enhances affective computing in real-world scenarios, but traditional methods often suffer from context bias, a spurious correlation between background context and emotion labels (e.g., associating "garden" with "happy"). In this paper, we propose AGCD-Net, an Attention Guided Context Debiasing model that introduces Hybrid ConvNeXt, a novel convolutional encoder that extends the ConvNeXt backbone by integrating Spatial Transformer Network and Squeeze-and-Excitation layers for enhanced feature recalibration. At the core of AGCD-Net is the Attention Guided - Causal Intervention Module (AG-CIM), which applies causal theory, perturbs context features, isolates spurious correlations, and performs an attention-driven correction guided by face features to mitigate context bias. Experimental results on the CAER-S dataset demonstrate the effectiveness of AGCD-Net, achieving state-of-the-art performance and highlighting the importance of causal debiasing for robust emotion recognition in complex settings.

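[34] Ambiguity-Aware and High-Order Relation Learning for Multi-Grained Image-Text Matching cs.CV | cs.IR | cs.MM

Junyu Chen, Yihua Gao, Mingyuan Ge, Mingyong Li

TL;DR: The paper proposes AAHR, a framework that uses dynamic clustering prototype contrastive learning and a GNN to strengthen semantic interaction, addressing semantic ambiguity and high-order relation learning in image-text matching and improving matching performance.

Details

Motivation: Existing methods handle semantic ambiguity among similar instances (such as soft positive and soft negative samples) and high-order relation learning poorly, so a new framework is needed to improve matching accuracy.

Result: AAHR outperforms existing methods on the Flickr30K, MSCOCO, and ECCV Caption datasets, improving matching accuracy and efficiency.

Insight: By combining dynamic clustering with a GNN, AAHR better captures semantic ambiguity and high-order relations, offering a new approach to image-text matching.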

Abstract: Image-text matching is crucial for bridging the semantic gap between computer vision and natural language processing. However, existing methods still face challenges in handling high-order associations and semantic ambiguities among similar instances. These ambiguities arise from subtle differences between soft positive samples (semantically similar but incorrectly labeled) and soft negative samples (locally matched but globally inconsistent), creating matching uncertainties. Furthermore, current methods fail to fully utilize the neighborhood relationships among semantically similar instances within training batches, limiting the model’s ability to learn high-order shared knowledge. This paper proposes the Ambiguity-Aware and High-order Relation learning framework (AAHR) to address these issues. AAHR constructs a unified representation space through dynamic clustering prototype contrastive learning, effectively mitigating the soft positive sample problem. The framework introduces global and local feature extraction mechanisms and an adaptive aggregation network, significantly enhancing full-grained semantic understanding capabilities. Additionally, AAHR employs intra-modal and inter-modal correlation matrices to investigate neighborhood relationships among sample instances thoroughly. It incorporates GNN to enhance semantic interactions between instances. Furthermore, AAHR integrates momentum contrastive learning to expand the negative sample set. These combined strategies significantly improve the model’s ability to discriminate between features. Experimental results demonstrate that AAHR outperforms existing state-of-the-art methods on Flickr30K, MSCOCO, and ECCV Caption datasets, considerably improving the accuracy and efficiency of image-text matching. The code and model checkpoints for this research are available at https://github.com/Image-Text-Matching/AAHR.

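[35] SAGE: Segment-Aware Gloss-Free Encoding for Token-Efficient Sign Language Translation cs.CV

JianHe Low, Ozge Mercanoglu Sincan, Richard Bowden

TL;DR: The paper proposes SAGE, a visual tokenization framework that uses sign segmentation to convert continuous video into discrete visual tokens, substantially reducing input sequence length and memory usage while improving cross-modal alignment.

Details

Motivation: Gloss-free sign language translation (SLT), which needs no gloss annotations, has advanced rapidly, but existing methods often come with large increases in model complexity and compute demand, which hurts scalability. The paper aims for an efficient, scalable SLT method.

Result: SAGE exceeds existing methods on the PHOENIX14T benchmark while reducing sequence length by up to 50% and memory usage by up to 2.67x.

Insight: Visual tokenization and efficient cross-modal alignment can lower compute requirements while improving performance, making large-scale sign language translation more practical.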

Abstract: Gloss-free Sign Language Translation (SLT) has advanced rapidly, achieving strong performances without relying on gloss annotations. However, these gains have often come with increased model complexity and high computational demands, raising concerns about scalability, especially as large-scale sign language datasets become more common. We propose a segment-aware visual tokenization framework that leverages sign segmentation to convert continuous video into discrete, sign-informed visual tokens. This reduces input sequence length by up to 50% compared to prior methods, resulting in up to 2.67x lower memory usage and better scalability on larger datasets. To bridge the visual and linguistic modalities, we introduce a token-to-token contrastive alignment objective, along with a dual-level supervision that aligns both language embeddings and intermediate hidden states. This improves fine-grained cross-modal alignment without relying on gloss-level supervision. Our approach notably exceeds the performance of state-of-the-art methods on the PHOENIX14T benchmark, while significantly reducing sequence length. Further experiments also demonstrate our improved performance over prior work under comparable sequence-lengths, validating the potential of our tokenization and alignment strategies.

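[36] Cross Knowledge Distillation between Artificial and Spiking Neural Networks cs.CV | cs.AI

Shuhan Ye, Yuanbin Qian, Chong Wang, Sunqi Lin, Jiazhen Xu

TL;DR: The paper proposes cross knowledge distillation (CKD), a cross-modality and cross-architecture distillation method that improves spiking neural networks (SNNs) on event data by drawing on RGB data and knowledge from artificial neural networks (ANNs).

Details

Motivation: SNNs underperform ANNs on event data because annotated event data is limited and SNN architectures are immature, so RGB data and ANN knowledge are used to improve SNN performance.

Result: CKD outperforms current state-of-the-art methods on mainstream neuromorphic datasets such as N-Caltech101 and CEP-DVS.

Insight: Cross-modality and cross-architecture knowledge distillation can effectively improve SNN performance on event data, offering a new direction for SNN development.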

Abstract: Recently, Spiking Neural Networks (SNNs) have demonstrated rich potential in the computer vision domain due to their high biological plausibility, event-driven characteristics, and energy efficiency. Still, limited annotated event-based datasets and immature SNN architectures leave their performance inferior to that of Artificial Neural Networks (ANNs). To enhance the performance of SNNs on their optimal data format, DVS data, we explore using RGB data and well-performing ANNs to implement knowledge distillation. In this case, solving cross-modality and cross-architecture challenges is necessary. In this paper, we propose cross knowledge distillation (CKD), which not only leverages semantic similarity and sliding replacement to mitigate the cross-modality challenge, but also uses an indirect phased knowledge distillation to mitigate the cross-architecture challenge. We validated our method on mainstream neuromorphic datasets, including N-Caltech101 and CEP-DVS. The experimental results show that our method outperforms current state-of-the-art methods. The code will be available at https://github.com/ShawnYE618/CKD
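
For readers unfamiliar with the base technique, the sketch below shows the generic temperature-scaled distillation loss (KL divergence between softened teacher and student distributions). It does not implement CKD's semantic similarity, sliding replacement, or phased distillation, which the abstract does not detail. The function names and toy logits are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened outputs, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return kl * temperature ** 2

# Hypothetical logits from an ANN teacher and an SNN student.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
loss = distillation_loss(student, teacher)
```

A higher temperature flattens both distributions, so the student also learns from the teacher's relative ranking of the non-target classes.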

[37] Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models cs.CV | cs.AI | cs.CL

Anita Kriz, Elizabeth Laura Janes, Xing Shen, Tal Arbel

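TL;DR: Prompt4Trust is a reinforcement-learning-based prompt augmentation framework for confidence calibration in multimodal large language models (MLLMs), aimed in particular at medical settings. It trains a lightweight LLM to generate context-aware auxiliary prompts, which improves both calibration and task accuracy.

Details

Motivation: Deploying MLLMs in safety-critical settings such as healthcare faces two main problems: sensitivity to prompt design and a tendency to give incorrect answers with high confidence. Both reduce trust in the model and need to be addressed.

Result: Achieves state-of-the-art medical visual question answering performance on the PMC-VQA benchmark, and a framework trained with a small downstream model shows zero-shot generalization to larger MLLMs.

Insight: Automated yet human-aligned prompt engineering can improve the trustworthiness of MLLMs in safety-critical settings, and the lightweight framework has potential advantages in scalability and computational cost.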

Abstract: Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model’s stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the trustworthiness of MLLMs in safety-critical settings. Our codebase can be found at https://github.com/xingbpshen/vccrl-llm.
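The requirement that stated confidence should track accuracy is commonly quantified with expected calibration error (ECE). The sketch below computes ECE with equal-width bins; it is a conventional calibration measure shown for orientation only, and it is not the clinically motivated objective that Prompt4Trust optimizes. The function name, bin count, and sample data are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between average confidence and accuracy per confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        index = min(int(confidence * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((confidence, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / len(confidences) * abs(avg_confidence - accuracy)
    return ece


# An overconfident model: 90% stated confidence, 25% accuracy.
overconfident = expected_calibration_error([0.9] * 4, [True, False, False, False])
# A calibrated model: 75% stated confidence, 75% accuracy.
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
```

A model that is wrong while expressing high confidence, the failure mode the abstract highlights, produces a large ECE, while a model whose confidence matches its accuracy scores near zero.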

Chenhao Ding, Jiangtao Zhang, Zongsheng Yue, Hui Wang, Qian Zhao

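TL;DR: The paper proposes a new blind motion deblurring framework based on a deep generative model. A kernel generator pre-trained as a generative adversarial network (GAN), together with a kernel initializer, alleviates the strong sensitivity of existing methods to the initial blur kernel and achieves leading performance on benchmark datasets.

Details

Motivation: Existing blind motion deblurring (BMD) methods are extremely sensitive to the initial blur kernel because the underlying optimization is highly non-convex, which limits their performance. The paper addresses this by using a generative model to encode the kernel prior and to improve initialization.

Result: Achieves state-of-the-art performance on several challenging datasets, with particularly strong results on non-uniform motion deblurring.

Insight: Encoding the prior with a generative model is an effective way to reduce the initialization sensitivity of BMD, and the plug-and-play modular design matters for the performance gains.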

Abstract: Deep prior-based approaches have demonstrated remarkable success in blind motion deblurring (BMD) recently. These methods, however, are often limited by the high non-convexity of the underlying optimization process in BMD, which leads to extreme sensitivity to the initial blur kernel. To address this issue, we propose a novel framework for BMD that leverages a deep generative model to encode the kernel prior and induce a better initialization for the blur kernel. Specifically, we pre-train a kernel generator based on a generative adversarial network (GAN) to aptly characterize the kernel’s prior distribution, as well as a kernel initializer to provide a well-informed and high-quality starting point for kernel estimation. By combining these two components, we constrain the BMD solution within a compact latent kernel manifold, thus alleviating the aforementioned sensitivity for kernel initialization. Notably, the kernel generator and initializer are designed to be easily integrated with existing BMD methods in a plug-and-play manner, enhancing their overall performance. Furthermore, we extend our approach to tackle blind non-uniform motion deblurring without the need for additional priors, achieving state-of-the-art performance on challenging benchmark datasets. The source code is available at https://github.com/dch0319/GLKM-Deblur.

[39] Supercharging Floorplan Localization with Semantic Rays cs.CV | cs.LG

Yuval Grader, Hadar Averbuch-Elor

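TL;DR: The paper proposes a semantic-aware floorplan localization framework that jointly estimates depth and semantic rays to build a structural-semantic probability volume, and it substantially outperforms existing methods on two standard benchmarks.

Details

Motivation: Current floorplan localization techniques rely mainly on depth-based structural cues and ignore the rich semantic information in floorplans, such as the locations of windows and doors. The paper aims to use this semantic information to improve localization accuracy.

Result: On two standard benchmarks, the method substantially outperforms prior work on recall metrics, and it can incorporate additional metadata to further improve accuracy and efficiency.

Insight: Semantic information such as window and door locations is important for floorplan localization, and jointly modeling structure and semantics substantially improves performance; the dynamic sampling strategy balances computational cost and accuracy.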

Abstract: Floorplans provide a compact representation of the building’s structure, revealing not only layout information but also detailed semantics such as the locations of windows and doors. However, contemporary floorplan localization techniques mostly focus on matching depth-based structural cues, ignoring the rich semantics communicated within floorplans. In this work, we introduce a semantic-aware localization framework that jointly estimates depth and semantic rays, consolidating over both for predicting a structural-semantic probability volume. Our probability volume is constructed in a coarse-to-fine manner: We first sample a small set of rays to obtain an initial low-resolution probability volume. We then refine these probabilities by performing a denser sampling only in high-probability regions and process the refined values for predicting a 2D location and orientation angle. We conduct an evaluation on two standard floorplan localization benchmarks. Our experiments demonstrate that our approach substantially outperforms state-of-the-art methods, achieving significant improvements in recall metrics compared to prior works. Moreover, we show that our framework can easily incorporate additional metadata such as room labels, enabling additional gains in both accuracy and efficiency.

[40] Geo-RepNet: Geometry-Aware Representation Learning for Surgical Phase Recognition in Endoscopic Submucosal Dissection cs.CV | cs.RO

Rui Tang, Haochen Yin, Guankun Wang, Long Bai, An Wang

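TL;DR: Geo-RepNet is a geometry-aware convolutional framework that integrates RGB and depth information to improve surgical phase recognition in Endoscopic Submucosal Dissection (ESD).

Details

Motivation: Surgical phase recognition is critical for intelligent assistance systems, but in ESD the phases look visually similar and RGB images lack structural information, whereas depth provides geometric cues.

Result: On the ESD dataset constructed by the authors, Geo-RepNet achieves state-of-the-art performance while remaining robust and computationally efficient in complex, low-texture surgical environments.

Insight: Depth information can markedly improve surgical phase recognition, especially when phases are visually similar and structural cues are missing.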

Abstract: Surgical phase recognition plays a critical role in developing intelligent assistance systems for minimally invasive procedures such as Endoscopic Submucosal Dissection (ESD). However, the high visual similarity across different phases and the lack of structural cues in RGB images pose significant challenges. Depth information offers valuable geometric cues that can complement appearance features by providing insights into spatial relationships and anatomical structures. In this paper, we pioneer the use of depth information for surgical phase recognition and propose Geo-RepNet, a geometry-aware convolutional framework that integrates RGB image and depth information to enhance recognition performance in complex surgical scenes. Built upon a re-parameterizable RepVGG backbone, Geo-RepNet incorporates the Depth-Guided Geometric Prior Generation (DGPG) module that extracts geometry priors from raw depth maps, and the Geometry-Enhanced Multi-scale Attention (GEMA) to inject spatial guidance through geometry-aware cross-attention and efficient multi-scale aggregation. To evaluate the effectiveness of our approach, we construct a nine-phase ESD dataset with dense frame-level annotations from real-world ESD videos. Extensive experiments on the proposed dataset demonstrate that Geo-RepNet achieves state-of-the-art performance while maintaining robustness and high computational efficiency under complex and low-texture surgical environments.

[41] ViT-ProtoNet for Few-Shot Image Classification: A Multi-Benchmark Evaluation cs.CV | cs.AI | cs.LG

Abdulvahap Mutlu, Şengül Doğan, Türker Tuncer

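TL;DR: ViT-ProtoNet combines a Vision Transformer (ViT) with the Prototypical Network framework for few-shot image classification. It consistently outperforms CNN-based prototypical networks and is validated on several standard datasets.

Details

Motivation: Although Vision Transformers perform well across computer vision, their potential for few-shot image classification has not been fully exploited. The paper explores how to use the representational power of ViTs to improve few-shot classification.

Result: On Mini-ImageNet, FC100, CUB-200, and CIFAR-FS, ViT-ProtoNet improves 5-shot accuracy over CNN-based prototypical networks by up to 3.2% and yields better feature separability.

Insight: ViT feature representations have a clear advantage in few-shot classification; a lightweight ViT-Small backbone is enough to outperform or match other transformer-based methods, setting a new baseline for few-shot learning.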

Abstract: The remarkable representational power of Vision Transformers (ViTs) remains underutilized in few-shot image classification. In this work, we introduce ViT-ProtoNet, which integrates a ViT-Small backbone into the Prototypical Network framework. By averaging class conditional token embeddings from a handful of support examples, ViT-ProtoNet constructs robust prototypes that generalize to novel categories under 5-shot settings. We conduct an extensive empirical evaluation on four standard benchmarks: Mini-ImageNet, FC100, CUB-200, and CIFAR-FS, including overlapped support variants to assess robustness. Across all splits, ViT-ProtoNet consistently outperforms CNN-based prototypical counterparts, achieving up to a 3.2% improvement in 5-shot accuracy and demonstrating superior feature separability in latent space. Furthermore, it outperforms or is competitive with transformer-based competitors using a more lightweight backbone. Comprehensive ablations examine the impact of transformer depth, patch size, and fine-tuning strategy. To foster reproducibility, we release code and pretrained weights. Our results establish ViT-ProtoNet as a powerful, flexible approach for few-shot classification and set a new baseline for transformer-based meta-learners.
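The prototype construction described in the abstract, averaging the embeddings of a few support examples per class and assigning a query to the nearest prototype, can be sketched as follows. This is a simplified illustration: the 2-D vectors stand in for ViT-Small token embeddings, and the function names and the choice of Euclidean distance are assumptions rather than details confirmed by the abstract.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def build_prototypes(support):
    """Map each class label to the mean of its support embeddings."""
    return {label: mean_vector(embeddings) for label, embeddings in support.items()}


def classify(query, prototypes):
    """Return the label whose prototype is closest to the query embedding."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Hypothetical 2-shot support set with toy embeddings.
support = {
    "cat": [[1.0, 0.0], [0.8, 0.2]],
    "dog": [[0.0, 1.0], [0.2, 0.8]],
}
prototypes = build_prototypes(support)
prediction = classify([0.7, 0.3], prototypes)
```

Because classification depends only on distances to class means, new categories can be added at test time by embedding a handful of labelled examples, with no retraining of the backbone.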

[42] DAA*: Deep Angular A Star for Image-based Path Planning cs.CV | cs.LG | eess.IV

Zhiwei Xu

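TL;DR: DAA* is a novel path planning method that introduces path angular freedom (PAF) to improve path smoothness, so that paths learned by imitation align more closely with the reference path. On 7 datasets, DAA* improves both path similarity and path optimality.

Details

Motivation: Existing path imitation learning often overlooks path smoothness, so the generated paths are not similar enough to the reference paths.

Result: Across 7 datasets, DAA* improves the path similarity metrics SPR, ASIM, and PSIM over neural A* and TransPath, and it yields shorter paths than neural A* when the shortest path is plausible.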

Insight:

Abstract: Path smoothness is often overlooked in path imitation learning from expert demonstrations. In this paper, we introduce a novel learning method, termed deep angular A* (DAA*), by incorporating the proposed path angular freedom (PAF) into A* to improve path similarity through adaptive path smoothness. The PAF aims to explore the effect of move angles on path node expansion by finding the trade-off between their minimum and maximum values, allowing for high adaptiveness for imitation learning. DAA* improves path optimality by closely aligning with the reference path through joint optimization of path shortening and smoothing, which correspond to heuristic distance and PAF, respectively. Throughout comprehensive evaluations on 7 datasets, including 4 maze datasets, 2 video-game datasets, and a real-world drone-view dataset containing 2 scenarios, we demonstrate remarkable improvements of our DAA* over neural A* in path similarity between the predicted and reference paths with a shorter path length when the shortest path is plausible, improving by 9.0% SPR, 6.9% ASIM, and 3.9% PSIM. Furthermore, when jointly learning pathfinding with both path loss and path probability map loss, DAA* significantly outperforms the state-of-the-art TransPath by 6.7% SPR, 6.5% PSIM, and 3.7% ASIM. We also discuss the minor trade-off between path optimality and search efficiency where applicable.
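The idea of trading path length against smoothness can be illustrated with a plain A* search whose step cost includes a penalty for each change of direction. This is a hand-written sketch, not DAA*: the fixed `turn_weight` stands in for the learned path angular freedom, no imitation learning is involved, and the grid encoding, function names, and 4-connected moves are all assumptions.

```python
import heapq

MOVES = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # right, down, left, up


def quarter_turns(prev_dir, new_dir):
    """Number of 90-degree turns between two move indices (0, 1, or 2)."""
    if prev_dir is None:  # the first move has no previous heading
        return 0
    diff = abs(prev_dir - new_dir)
    return min(diff, 4 - diff)


def angular_a_star(grid, start, goal, turn_weight=0.5):
    """A* over (cell, heading) states; cost = steps + turn_weight * quarter turns."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):  # Manhattan distance; admissible because turn costs are >= 0
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    start_state = (start, None)
    best = {start_state: 0.0}
    parent = {}
    tie = 0  # unique tie-breaker so heap entries never compare states
    frontier = [(heuristic(start), tie, start_state)]
    while frontier:
        _, _, state = heapq.heappop(frontier)
        cell, direction = state
        if cell == goal:
            cost = best[state]
            path = [cell]
            while state in parent:
                state = parent[state]
                path.append(state[0])
            return path[::-1], cost
        for new_dir, (dr, dc) in enumerate(MOVES):
            r, c = cell[0] + dr, cell[1] + dc
            if not (0 <= r < rows and 0 <= c < cols) or grid[r][c] == "#":
                continue
            new_cost = best[state] + 1 + turn_weight * quarter_turns(direction, new_dir)
            new_state = ((r, c), new_dir)
            if new_cost < best.get(new_state, float("inf")):
                best[new_state] = new_cost
                parent[new_state] = state
                tie += 1
                heapq.heappush(frontier, (new_cost + heuristic((r, c)), tie, new_state))
    return None, float("inf")


def count_turns(path):
    """Count direction changes along a path of grid cells."""
    steps = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(path, path[1:])]
    return sum(1 for s, t in zip(steps, steps[1:]) if s != t)


grid = ["...", "...", "..."]  # '.' is free space, '#' would be an obstacle
path, cost = angular_a_star(grid, (0, 0), (2, 2), turn_weight=0.5)
```

With a positive `turn_weight`, the search prefers the L-shaped route with a single turn over staircase routes of the same length; with `turn_weight=0` it reduces to ordinary shortest-path A*.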

[43] ProactiveBench: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models cs.CV

Yueqian Wang, Xiaojun Meng, Yifan Wang, Huishuai Zhang, Dongyan Zhao

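TL;DR: The paper introduces ProactiveBench, the first comprehensive benchmark for evaluating the proactive interaction ability of video large language models, and PAUC, a metric that accounts for the temporal dynamics of responses. Experiments show that PAUC agrees better with human preferences than traditional metrics.

Details

Motivation: As research on multimodal dialogue systems deepens, users increasingly expect systems to take the initiative. Conventional turn-by-turn interaction is no longer sufficient, so a system's proactive interaction ability needs to be evaluated in real-time scenarios such as video playback.

Result: PAUC agrees better with human preferences than traditional metrics that consider only the textual content of responses, and it reflects user experience in proactive interaction scenarios more faithfully.

Insight: Temporal dynamics are an important dimension in evaluating proactive interaction, and PAUC offers a new approach to evaluating future multimodal dialogue systems.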

Abstract: With the growing research focus on multimodal dialogue systems, the capability for proactive interaction is gradually gaining recognition. As an alternative to conventional turn-by-turn dialogue, users increasingly expect multimodal systems to take more initiative, for example, by autonomously determining the timing of multi-turn responses in real time during video playback. To facilitate progress in this emerging area, we introduce ProactiveBench, the first comprehensive benchmark to evaluate a system’s ability to engage in proactive interaction. Since model responses are generated at varying timestamps, we further propose PAUC, the first metric that accounts for the temporal dynamics of model responses. This enables a more accurate evaluation of systems operating in proactive settings. Through extensive benchmarking of various baseline systems on ProactiveBench and a user study of human preferences, we show that PAUC is in better agreement with human preferences than traditional evaluation metrics, which typically only consider the textual content of responses. These findings demonstrate that PAUC provides a more faithful assessment of user experience in proactive interaction scenarios. Project homepage: https://github.com/yellow-binary-tree/ProactiveBench

[44] Dynamic Inter-Class Confusion-Aware Encoder for Audio-Visual Fusion in Human Activity Recognition cs.CV

Kaixuan Cong, Yifan Wang, Rongkun Xue, Yuyang Jiang, Yiming Feng

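TL;DR: The paper proposes the Dynamic Inter-Class Confusion-Aware Encoder (DICCAE) for fine-grained alignment of audio and video modalities. It dynamically adjusts a confusion loss to improve discrimination between similar activities and adds a self-supervised pre-training strategy to address data scarcity.

Details

Motivation: Existing audio-video pre-training paradigms focus only on overall alignment between the modalities and do not reinforce the ability to distinguish easily confused classes through cognitive induction and contrast.

Result: Achieves 65.5% top-1 accuracy on the VGGSound dataset; ablation studies validate the necessity of each module.

Insight: Dynamically adjusting the inter-class confusion loss improves the model's ability to distinguish similar activities, and self-supervised pre-training is an effective way to address data scarcity.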

Abstract: Humans do not understand individual events in isolation; rather, they generalize concepts within classes and compare them to others. Existing audio-video pre-training paradigms only focus on the alignment of the overall audio-video modalities, without considering the reinforcement of distinguishing easily confused classes through cognitive induction and contrast during training. This paper proposes the Dynamic Inter-Class Confusion-Aware Encoder (DICCAE), an encoder that aligns audio-video representations at a fine-grained, category-level. DICCAE addresses category confusion by dynamically adjusting the confusion loss based on inter-class confusion degrees, thereby enhancing the model’s ability to distinguish between similar activities. To further extend the application of DICCAE, we also introduce a novel training framework that incorporates both audio and video modalities, as well as their fusion. To mitigate the scarcity of audio-video data in the human activity recognition task, we propose a cluster-guided audio-video self-supervised pre-training strategy for DICCAE. DICCAE achieves near state-of-the-art performance on the VGGSound dataset, with a top-1 accuracy of 65.5%. We further evaluate its feature representation quality through extensive ablation studies, validating the necessity of each module.

Wencan Huang, Daizong Liu, Wei Hu

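TL;DR: Fast3D accelerates 3D multi-modal large language models (MLLMs) through global attention prediction and sample-adaptive visual token pruning, improving the efficiency of 3D scene understanding.

Details

Motivation: 3D MLLMs are strong at scene understanding, but computational inefficiency, especially the processing of redundant 3D visual tokens, makes practical deployment difficult.

Result: Evaluations on five benchmarks validate the effectiveness of Fast3D, particularly at high visual token pruning ratios.

Insight: Global attention patterns are predictive of which tokens are non-essential in 3D contexts, which is key to making 3D MLLMs more efficient.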

Abstract: While 3D Multi-modal Large Language Models (MLLMs) demonstrate remarkable scene understanding capabilities, their practical deployment faces critical challenges due to computational inefficiency. The key bottleneck stems from processing excessive object-centric visual tokens required for comprehensive 3D scene representation. Although visual token pruning has shown promise in accelerating 2D MLLMs, its applicability to 3D domains remains largely unexplored due to fundamental disparities in token structures. In this paper, we reveal two critical insights: (1) Significant redundancy exists in object-level 3D token representations, analogous to patch-level redundancy in 2D systems; (2) Global attention patterns exhibit strong predictive power for identifying non-essential tokens in 3D contexts. Building on these observations, we propose Fast3D, a plug-and-play visual token pruning framework for 3D MLLMs featuring two technical innovations: (1) Global Attention Prediction (GAP), where a lightweight neural network learns to predict the global attention distributions of the target model, enabling efficient token importance estimation for precise pruning guidance; (2) Sample-Adaptive visual token Pruning (SAP), which introduces dynamic token budgets through attention-based complexity assessment, automatically adjusting layer-wise pruning ratios based on input characteristics. Both of these two techniques operate without modifying the parameters of the target model. Extensive evaluations across five benchmarks validate the effectiveness of Fast3D, particularly under high visual token pruning ratios. Code is available at https://github.com/wencan25/Fast3D
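Attention-guided pruning with a per-sample budget can be sketched as below. This is an illustration under assumptions rather than the Fast3D implementation: the importance scores are simply given (in the paper they come from a lightweight GAP network), a single budget is used instead of layer-wise ratios, and the entropy-based budget rule, the ratio bounds, and all names are hypothetical.

```python
import math


def normalized_entropy(scores):
    """Entropy of the normalized scores divided by its maximum, giving a value in [0, 1]."""
    if len(scores) < 2:
        return 0.0
    total = sum(scores)
    probs = [s / total for s in scores]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(scores))


def adaptive_budget(scores, min_ratio=0.1, max_ratio=0.5):
    """Keep more tokens when attention is spread out (assumed complexity proxy)."""
    ratio = min_ratio + (max_ratio - min_ratio) * normalized_entropy(scores)
    return max(1, round(ratio * len(scores)))


def prune_tokens(tokens, scores):
    """Keep the highest-scoring tokens within the budget, preserving their original order."""
    budget = adaptive_budget(scores)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in sorted(ranked[:budget])]


tokens = list("abcdefghij")  # stand-ins for object-centric visual tokens
kept_peaked = prune_tokens(tokens, [0.91] + [0.01] * 9)  # attention focused on one token
kept_uniform = prune_tokens(tokens, [0.1] * 10)          # attention spread evenly
```

The target model's parameters are untouched; only the set of tokens passed onward changes, which matches the plug-and-play framing in the abstract.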

[46] Simplifying Traffic Anomaly Detection with Video Foundation Models cs.CV

Svetlana Orlova, Tommie Kerssies, Brunó B. Englert, Gijs Dubbelman

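TL;DR: The paper examines how pre-trained Video ViTs can simplify Traffic Anomaly Detection (TAD) and finds that a simple encoder-only architecture combined with self-supervised Masked Video Modeling (MVM) pre-training and Domain-Adaptive Pre-Training (DAPT) can match or even surpass complex models.

Details

Motivation: Existing TAD methods mostly use complex multi-stage or multi-representation fusion architectures, but whether that complexity is necessary is unclear. The emergence of visual foundation models motivates the study of simple yet effective architectures.

Result: The simple architecture matches or surpasses specialized methods in performance while being significantly more efficient, with the best results obtained using self-supervised MVM and DAPT.

Insight: Pre-training matters more than architectural complexity, with self-supervised learning and domain adaptation as the key ingredients. A simplified model design can yield effective, efficient, and scalable TAD.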

Abstract: Recent methods for ego-centric Traffic Anomaly Detection (TAD) often rely on complex multi-stage or multi-representation fusion architectures, yet it remains unclear whether such complexity is necessary. Recent findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple yet flexible architectures to outperform specialized designs. Therefore, in this work, we investigate an architecturally simple encoder-only approach using plain Video Vision Transformers (Video ViTs) and study how pre-training enables strong TAD performance. We find that: (i) strong pre-training enables simple encoder-only models to match or even surpass the performance of specialized state-of-the-art TAD methods, while also being significantly more efficient; (ii) although weakly- and fully-supervised pre-training are advantageous on standard benchmarks, we find them less effective for TAD. Instead, self-supervised Masked Video Modeling (MVM) provides the strongest signal; and (iii) Domain-Adaptive Pre-Training (DAPT) on unlabeled driving videos further improves downstream performance, without requiring anomalous examples. Our findings highlight the importance of pre-training and show that effective, efficient, and scalable TAD models can be built with minimal architectural complexity. We release our code, domain-adapted encoders, and fine-tuned models to support future work: https://github.com/tue-mps/simple-tad.

[47] Automated Multi-Class Crop Pathology Classification via Convolutional Neural Networks: A Deep Learning Approach for Real-Time Precision Agriculture cs.CV | I.2.6; I.5.4

Sourish Suri, Yifei Shao

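TL;DR: The paper presents a convolutional neural network (CNN)-based image classification system that automatically detects and classifies eight common crop diseases and provides real-time diagnosis and treatment recommendations through a mobile platform.

Details

Motivation: Crop diseases are a major threat to agricultural production and global food security, and current manual inspection is slow and inaccurate, so an automated solution is needed.

Result: Training accuracy is about 90% and validation accuracy about 60%, a gap the authors attribute to minor overfitting. The authors report reliable performance on unseen data, and the model is deployed on a mobile-compatible platform.

Insight: CNNs perform well for crop disease classification, and pairing them with a treatment recommendation module can improve agricultural management, especially in resource-poor regions. The work offers a scalable, practical tool for precision agriculture.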

Abstract: Crop diseases present a significant barrier to agricultural productivity and global food security, especially in large-scale farming where early identification is often delayed or inaccurate. This research introduces a Convolutional Neural Network (CNN)-based image classification system designed to automate the detection and classification of eight common crop diseases using leaf imagery. The methodology involves a complete deep learning pipeline: image acquisition from a large, labeled dataset, preprocessing via resizing, normalization, and augmentation, and model training using TensorFlow with Keras’ Sequential API. The CNN architecture comprises three convolutional layers with increasing filter sizes and ReLU activations, followed by max pooling, flattening, and fully connected layers, concluding with a softmax output for multi-class classification. The system achieves high training accuracy (~90%) and demonstrates reliable performance on unseen data, although a validation accuracy of ~60% suggests minor overfitting. Notably, the model integrates a treatment recommendation module, providing actionable guidance by mapping each detected disease to suitable pesticide or fungicide interventions. Furthermore, the solution is deployed on an open-source, mobile-compatible platform, enabling real-time image-based diagnostics for farmers in remote areas. This research contributes a scalable and accessible tool to the field of precision agriculture, reducing reliance on manual inspection and promoting sustainable disease management practices. By merging deep learning with practical agronomic support, this work underscores the potential of CNNs to transform crop health monitoring and enhance food production resilience on a global scale.
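The layer sequence in the abstract (three convolutional layers with increasing filter counts, max pooling, flattening, fully connected layers, and a softmax over eight classes) can be traced for output shapes and parameter counts without any deep learning library. Only the layer order and the eight output classes come from the abstract; the 128x128 RGB input, 3x3 kernels without padding, 2x2 pooling after each convolution, the filter counts 32/64/128, and the single 128-unit hidden dense layer are all assumptions made for illustration.

```python
def trace_cnn(input_size=128, channels=3, filters=(32, 64, 128),
              dense_units=128, n_classes=8, kernel=3, pool=2):
    """Return (layer name, output shape, parameter count) per layer and the total parameters."""
    layers = []
    size, in_channels = input_size, channels
    for out_channels in filters:
        size = size - kernel + 1  # unpadded convolution shrinks each spatial side
        conv_params = (kernel * kernel * in_channels + 1) * out_channels  # weights + biases
        layers.append(("conv+relu", (size, size, out_channels), conv_params))
        size //= pool  # max pooling has no trainable parameters
        layers.append(("maxpool", (size, size, out_channels), 0))
        in_channels = out_channels
    flat = size * size * in_channels
    layers.append(("flatten", (flat,), 0))
    layers.append(("dense+relu", (dense_units,), (flat + 1) * dense_units))
    layers.append(("dense+softmax", (n_classes,), (dense_units + 1) * n_classes))
    total = sum(params for _, _, params in layers)
    return layers, total


layers, total_params = trace_cnn()
for name, shape, params in layers:
    print(f"{name:14s} {str(shape):18s} {params:>9d}")
```

Under these assumed sizes, the first dense layer holds the large majority of the parameters, which is one common reason a small CNN of this shape fits its training set far better than its validation set.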

[48] GreenCrossingAI: A Camera Trap/Computer Vision Pipeline for Environmental Science Research Groups cs.CV | cs.LG

Bernie Boscoe, Shawn Johnson, Andrea Osborn, Chandler Campbell, Karen Mager

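TL;DR: The paper presents GreenCrossingAI, a low-resource pipeline for processing camera trap data, intended to help small, resource-limited research groups use ML/AI tools to analyze wildlife data.

Details

Motivation: Camera trap data is voluminous and hard to label, environmental conditions vary, and ML/AI tools are difficult to integrate into traditional workflows, all of which limit the analysis capacity of small research groups.

Result: The pipeline lets small research groups process and analyze camera trap data more efficiently and extract meaningful ecological insights.

Insight: With customized ML/AI tools and on-premise processing, small groups can analyze data efficiently with limited resources, filling a gap left by traditional methods.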

Abstract: Camera traps have long been used by wildlife researchers to monitor and study animal behavior, population dynamics, habitat use, and species diversity in a non-invasive and efficient manner. While data collection from the field has increased with new tools and capabilities, methods to develop, process, and manage the data, especially the adoption of ML/AI tools, remain challenging. These challenges include the sheer volume of data generated, the need for accurate labeling and annotation, variability in environmental conditions affecting data quality, and the integration of ML/AI tools into existing workflows that often require domain-specific customization and computational resources. This paper provides a guide to a low-resource pipeline to process camera trap data on-premise, incorporating ML/AI capabilities tailored for small research groups with limited resources and computational expertise. By focusing on practical solutions, the pipeline offers accessible approaches for data transmission, inference, and evaluation, enabling researchers to discover meaningful insights from their ever-increasing camera trap datasets.

[49] Domain Adaptation and Multi-view Attention for Learnable Landmark Tracking with Sparse Data cs.CV | cs.AI | cs.LG | cs.RO

Timothy Chase Jr, Karthik Dantu

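TL;DR: The paper proposes a learning-based approach that combines domain adaptation and multi-view attention for learnable landmark tracking with sparse data, aiming to overcome the limitations of traditional methods in autonomous spacecraft missions.

Details

Motivation: Traditional methods for detecting celestial terrain features rely on extensive offline processing and a priori data, which is costly and poorly suited to real-time use. Learning-based vision methods are effective but computationally demanding and short of labeled data.

Result: The proposed unified system outperforms existing state-of-the-art techniques and is suited to sparse data and diverse extraterrestrial environments.

Insight: Combining domain adaptation with attention mechanisms can markedly improve the robustness of lightweight models under data scarcity and viewpoint variation.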

Abstract: The detection and tracking of celestial surface terrain features are crucial for autonomous spaceflight applications, including Terrain Relative Navigation (TRN), Entry, Descent, and Landing (EDL), hazard analysis, and scientific data collection. Traditional photoclinometry-based pipelines often rely on extensive a priori imaging and offline processing, constrained by the computational limitations of radiation-hardened systems. While historically effective, these approaches typically increase mission costs and duration, operate at low processing rates, and have limited generalization. Recently, learning-based computer vision has gained popularity to enhance spacecraft autonomy and overcome these limitations. While promising, emerging techniques frequently impose computational demands exceeding the capabilities of typical spacecraft hardware for real-time operation and are further challenged by the scarcity of labeled training data for diverse extraterrestrial environments. In this work, we present novel formulations for in-situ landmark tracking via detection and description. We utilize lightweight, computationally efficient neural network architectures designed for real-time execution on current-generation spacecraft flight processors. For landmark detection, we propose improved domain adaptation methods that enable the identification of celestial terrain features with distinct, cheaply acquired training data. Concurrently, for landmark description, we introduce a novel attention alignment formulation that learns robust feature representations that maintain correspondence despite significant landmark viewpoint variations. Together, these contributions form a unified system for landmark tracking that demonstrates superior performance compared to existing state-of-the-art techniques.

[50] SegVec3D: A Method for Vector Embedding of 3D Objects Oriented Towards Robot manipulation cs.CV | cs.RO

Zhihan Kang, Boyu Wang

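TL;DR: SegVec3D is a new framework for 3D point cloud instance segmentation that combines attention mechanisms, embedding learning, and cross-modal alignment, targeting minimal supervision and practical deployability.

Details

Motivation: Current methods for 3D point cloud instance segmentation and cross-modal understanding typically need heavy supervision and are hard to deploy in practice; SegVec3D aims to address these problems.

Result: Relative to recent methods such as Mask3D and ULIP, SegVec3D unifies instance segmentation and zero-shot retrieval in one framework with minimal supervision; the abstract gives no quantitative comparison.

Insight: By integrating attention mechanisms and cross-modal alignment, SegVec3D shows generality and efficiency in 3D data processing and is suited to practical scenarios such as robot manipulation.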

Abstract: We propose SegVec3D, a novel framework for 3D point cloud instance segmentation that integrates attention mechanisms, embedding learning, and cross-modal alignment. The approach builds a hierarchical feature extractor to enhance geometric structure modeling and enables unsupervised instance segmentation via contrastive clustering. It further aligns 3D data with natural language queries in a shared semantic space, supporting zero-shot retrieval. Compared to recent methods like Mask3D and ULIP, our method uniquely unifies instance segmentation and multimodal understanding with minimal supervision and practical deployability.

[51] CKAA: Cross-subspace Knowledge Alignment and Aggregation for Robust Continual Learning cs.CV

Lingfeng He, De Cheng, Zhiheng Ma, Huaijie Wang, Dingwen Zhang

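TL;DR: CKAA is a new continual learning framework that improves robustness to misleading task-ids through cross-subspace knowledge alignment and aggregation.

Details

Motivation: Existing continual learning methods based on parameter-efficient fine-tuning (PEFT) train sub-modules independently, so their feature subspaces are misaligned and the model tends to make ambiguous decisions under incorrect task-ids.

Result: Experiments show that CKAA outperforms existing PEFT-based continual learning methods.

Insight: Cross-subspace feature alignment and task-confidence-guided knowledge aggregation are key to robust continual learning.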

Abstract: Continual Learning (CL) empowers AI models to continuously learn from sequential task streams. Recently, parameter-efficient fine-tuning (PEFT)-based CL methods have garnered increasing attention due to their superior performance. They typically allocate a unique sub-module for learning each task, with a task recognizer to select the appropriate sub-modules for testing images. However, due to the feature subspace misalignment from independently trained sub-modules, these methods tend to produce ambiguous decisions under misleading task-ids. To address this, we propose Cross-subspace Knowledge Alignment and Aggregation (CKAA), a novel framework that enhances model robustness against misleading task-ids through two key innovations: (1) Dual-level Knowledge Alignment (DKA): By aligning intra-class feature distributions across different subspaces and learning a robust global classifier through a feature simulation process, DKA enables the model to distinguish features from both correct and incorrect subspaces during training. (2) Task-Confidence-guided Mixture of Adapters (TC-MoA): A robust inference scheme that adaptively aggregates task-specific knowledge from relevant sub-modules based on task-confidence scores, avoiding overconfidence in misleading task-id predictions. Extensive experiments demonstrate that CKAA outperforms existing PEFT-based CL methods.
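The inference-time idea behind TC-MoA, blending the outputs of several task-specific sub-modules according to task-confidence scores instead of committing to one predicted task-id, can be sketched as a confidence-weighted average. The softmax weighting, the temperature, and the function names are assumptions; the abstract says only that aggregation is adaptive and based on task-confidence scores.

```python
import math


def softmax(scores, temperature=1.0):
    """Temperature-scaled softmax, computed stably by subtracting the max."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_adapters(adapter_outputs, task_confidences, temperature=1.0):
    """Blend per-task adapter features using softmax-normalized task confidences."""
    weights = softmax(task_confidences, temperature)
    dim = len(adapter_outputs[0])
    mixed = [sum(w * out[d] for w, out in zip(weights, adapter_outputs)) for d in range(dim)]
    return mixed, weights


# Two hypothetical adapters producing 2-D features for the same test image.
outputs = [[1.0, 0.0], [0.0, 1.0]]
unsure_mix, _ = aggregate_adapters(outputs, [0.0, 0.0])      # recognizer cannot decide
confident_mix, _ = aggregate_adapters(outputs, [10.0, 0.0])  # recognizer favours task 0
```

When the task recognizer is unsure, the blended feature draws on both sub-modules, which avoids the all-or-nothing failure of hard selection under a wrong task-id.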

[52] HMID-Net: An Exploration of Masked Image Modeling and Knowledge Distillation in Hyperbolic Space cs.CV | cs.AI

Changli Wang, Fang Yin, Jiafeng Liu, Rui Wu

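TL;DR: HMID-Net combines Masked Image Modeling (MIM) and knowledge distillation in hyperbolic space. According to the authors, it is the first method to bring these two techniques into hyperbolic space, and it significantly outperforms existing models on several downstream tasks.

Details

Motivation: Visual and semantic concepts are often hierarchically structured, and hyperbolic space captures such hierarchies well. The earlier work MERU extended multimodal learning from Euclidean to hyperbolic space, but its training efficiency leaves room for improvement.

Result: Experiments show that MIM and knowledge distillation are as effective in hyperbolic space as in Euclidean space, and the model performs strongly on image classification and retrieval.

Insight: Hyperbolic space is not only an effective framework for multimodal learning; combining it with MIM and knowledge distillation can further improve model efficiency and performance.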

Abstract: Visual and semantic concepts are often structured in a hierarchical manner. For instance, the textual concept ‘cat’ entails all images of cats. A recent study, MERU, successfully adapts multimodal learning techniques from Euclidean space to hyperbolic space, effectively capturing the visual-semantic hierarchy. However, a critical question remains: how can we more efficiently train a model to capture and leverage this hierarchy? In this paper, we propose the Hyperbolic Masked Image and Distillation Network (HMID-Net), a novel and efficient method that integrates Masked Image Modeling (MIM) and knowledge distillation techniques within hyperbolic space. To the best of our knowledge, this is the first approach to leverage MIM and knowledge distillation in hyperbolic space to train highly efficient models. In addition, we introduce a distillation loss function specifically designed to facilitate effective knowledge transfer in hyperbolic space. Our experiments demonstrate that MIM and knowledge distillation techniques in hyperbolic space can achieve the same remarkable success as in Euclidean space. Extensive evaluations show that our method excels across a wide range of downstream tasks, significantly outperforming existing models like MERU and CLIP in both image classification and retrieval.
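Working in hyperbolic space means replacing Euclidean distance with a geodesic distance. The sketch below computes it in the Lorentz (hyperboloid) model with curvature -c. The choice of model is an assumption, since the abstract does not say which model of hyperbolic space HMID-Net uses, and the function names are illustrative.

```python
import math


def lift_to_hyperboloid(space, curvature=1.0):
    """Prepend the time component so the point satisfies <x, x>_L = -1/c."""
    time = math.sqrt(1.0 / curvature + sum(s * s for s in space))
    return [time] + list(space)


def lorentz_inner(x, y):
    """Lorentzian inner product: minus the time term plus the spatial dot product."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x, y, curvature=1.0):
    """Geodesic distance on the hyperboloid of curvature -c."""
    argument = max(1.0, -curvature * lorentz_inner(x, y))  # clamp against rounding below 1
    return math.acosh(argument) / math.sqrt(curvature)


origin = lift_to_hyperboloid([0.0, 0.0])
point = lift_to_hyperboloid([math.sinh(1.0), 0.0])
distance = lorentz_distance(origin, point)
```

Distances grow quickly with spatial norm, so volume expands exponentially away from the origin; this is why tree-like hierarchies, with general concepts near the origin and specific ones further out, embed well in this geometry.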

[53] GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them? cs.CV

Yiyang Zhou, Linjie Li, Shi Qiu, Zhengyuan Yang, Yuyang Zhao

TL;DR: GLIMPSE is a new benchmark designed to test if large vision-language models (LVLMs) can truly understand videos beyond superficial frame-level analysis, highlighting their current limitations.

Details

Motivation: Existing video benchmarks often resemble image-based tasks, allowing models to answer by scanning key frames without deep temporal reasoning. This limits the assessment of whether LVLMs can genuinely ‘think’ with videos.

Result: Human accuracy on GLIMPSE is 94.82%, but the best LVLM (GPT-o3) achieves only 66.43%, showing significant gaps in deep video understanding.

Insight: LVLMs still struggle with temporal reasoning and comprehensive video understanding, indicating a need for improved architectures or training methods.

Abstract: Existing video benchmarks often resemble image-based benchmarks, with question types like “What actions does the person perform throughout the video?” or “What color is the woman’s dress in the video?” For these, models can often answer by scanning just a few key frames, without deep temporal reasoning. This limits our ability to assess whether large vision-language models (LVLMs) can truly think with videos rather than perform superficial frame-level analysis. To address this, we introduce GLIMPSE, a benchmark specifically designed to evaluate whether LVLMs can genuinely think with videos. Unlike prior benchmarks, GLIMPSE emphasizes comprehensive video understanding beyond static image cues. It consists of 3,269 videos and over 4,342 highly visual-centric questions across 11 categories, including Trajectory Analysis, Temporal Reasoning, and Forensics Detection. All questions are carefully crafted by human annotators and require watching the entire video and reasoning over the full video context; this is what we mean by thinking with video. These questions cannot be answered by scanning selected frames or relying on text alone. In human evaluations, annotators reach 94.82% accuracy on GLIMPSE, but current LVLMs face significant challenges. Even the best-performing model, GPT-o3, reaches only 66.43%, highlighting that LVLMs still struggle to move beyond surface-level reasoning to truly think with videos.

[54] SDTN and TRN: Adaptive Spectral-Spatial Feature Extraction for Hyperspectral Image Classification cs.CV | cs.AI

Fuyin Ye, Erwen Yao, Jianyong Chen, Fengmei He, Junxiang Zhang

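TL;DR: The paper proposes SDTN and TRN, two methods for adaptive spectral-spatial feature extraction in hyperspectral image classification, addressing high-dimensional data, spectral-spatial redundancy, and the scarcity of labeled samples.

Details

Motivation: Hyperspectral image classification is vital to precision agriculture, but traditional methods perform poorly because of high-dimensional data, spectral-spatial redundancy, and scarce labeled samples, so more efficient methods are needed.

Result: Experiments on the PaviaU dataset show significantly better classification accuracy with fewer model parameters than existing methods.

Insight: Combining SDTN and TRN not only raises classification accuracy but also lowers computational complexity, making the approach suitable for real-time deployment in resource-constrained settings.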

Abstract: Hyperspectral image classification plays a pivotal role in precision agriculture, providing accurate insights into crop health monitoring, disease detection, and soil analysis. However, traditional methods struggle with high-dimensional data, spectral-spatial redundancy, and the scarcity of labeled samples, often leading to suboptimal performance. To address these challenges, we propose the Self-Adaptive Tensor-Regularized Network (SDTN), which combines tensor decomposition with regularization mechanisms to dynamically adjust tensor ranks, ensuring optimal feature representation tailored to the complexity of the data. Building upon SDTN, we propose the Tensor-Regularized Network (TRN), which integrates the features extracted by SDTN into a lightweight network capable of capturing spectral-spatial features at multiple scales. This approach not only maintains high classification accuracy but also significantly reduces computational complexity, making the framework highly suitable for real-time deployment in resource-constrained environments. Experiments on the PaviaU dataset demonstrate significant improvements in accuracy and reduced model parameters compared to state-of-the-art methods.

[55] Advancing Reliable Test-Time Adaptation of Vision-Language Models under Visual Variations cs.CV

Yiwen Liang, Hui Chen, Yizhe Xiong, Zihan Zhou, Mengyao Lyu

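TL;DR: The paper proposes ReTA, a reliable test-time adaptation method that uses consistency-aware entropy reweighting and diversity-driven distribution calibration to counter the performance drop of vision-language models under distribution shift.

Details

Motivation: Vision-language models (VLMs) perform well on zero-shot tasks but struggle with distribution shift in downstream tasks where no labeled data is available. Test-time adaptation (TTA) can improve performance, but existing cache-based methods degrade because entropy becomes unreliable and decision boundaries are inflexible.

Result: Experiments show that ReTA outperforms existing methods across a range of challenging distribution-shift scenarios.

Insight: Reliable entropy estimates and adaptable decision boundaries are key to effective test-time adaptation. Combining consistency constraints with distribution calibration is an effective way to address both.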

Abstract: Vision-language models (VLMs) exhibit remarkable zero-shot capabilities but struggle with distribution shifts in downstream tasks when labeled data is unavailable, which has motivated the development of Test-Time Adaptation (TTA) to improve VLMs’ performance during inference without annotations. Among various TTA approaches, cache-based methods show promise by preserving historical knowledge from low-entropy samples in a dynamic cache and fostering efficient adaptation. However, these methods face two critical reliability challenges: (1) entropy often becomes unreliable under distribution shifts, causing error accumulation in the cache and degradation in adaptation performance; (2) the final predictions may be unreliable due to inflexible decision boundaries that fail to accommodate large downstream shifts. To address these challenges, we propose a Reliable Test-time Adaptation (ReTA) method that integrates two complementary strategies to enhance reliability from two perspectives. First, to mitigate the unreliability of entropy as a sample selection criterion for cache construction, we introduce Consistency-aware Entropy Reweighting (CER), which incorporates consistency constraints to weight entropy during cache updating. While conventional approaches rely solely on low entropy for cache prioritization and risk introducing noise, our method leverages predictive consistency to maintain a high-quality cache and facilitate more robust adaptation. Second, we present Diversity-driven Distribution Calibration (DDC), which models class-wise text embeddings as multivariate Gaussian distributions, enabling adaptive decision boundaries for more accurate predictions across visually diverse content. Extensive experiments demonstrate that ReTA consistently outperforms state-of-the-art methods, particularly under challenging real-world distribution shifts.
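To make the cache-selection criterion in this abstract concrete, here is a minimal Python sketch of entropy-based sample selection with a consistency term. It is not the ReTA implementation: the function names, the `agreement` input (assumed here to be the fraction of augmented views that agree with the prediction), and the division-based reweighting are all hypothetical choices made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; lower means a more confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def reweighted_entropy(logits, agreement, eps=1e-6):
    """Hypothetical reweighting: entropy divided by agreement in (0, 1].

    A sample whose prediction is inconsistent across views (low agreement)
    gets a larger score and is therefore less likely to enter the cache.
    """
    return entropy(softmax(logits)) / (agreement + eps)

def select_for_cache(samples, capacity):
    """Keep the `capacity` samples with the lowest reweighted entropy."""
    ranked = sorted(
        samples,
        key=lambda s: reweighted_entropy(s["logits"], s["agreement"]),
    )
    return ranked[:capacity]
```

With plain entropy, two samples with identical logits would rank equally; under this assumed reweighting, the one with higher cross-view agreement is preferred.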

[56] Online Micro-gesture Recognition Using Data Augmentation and Spatial-Temporal Attention cs.CV

Pengyu Liu, Kun Li, Fei Wang, Yanyan Wei, Junhui She

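TL;DR: The HFUT-VUT team's method improves online micro-gesture recognition through data augmentation and spatial-temporal attention, ranking first in the IJCAI 2025 MiGA Challenge and outperforming the previous state of the art by 37.9%.

Details

Motivation: Online micro-gesture recognition is highly challenging: it requires locating the temporal positions and recognizing the categories of multiple micro-gesture instances in untrimmed videos. Compared with traditional temporal action detection, it places more emphasis on distinguishing micro-gesture categories and on precise temporal localization. Because micro-gestures are spontaneous and vary more than other human actions, existing methods struggle with them.

Result: The method achieves an F1 score of 38.03 on the online micro-gesture recognition task, outperforming the previous best method by 37.9%, and ranks first in the IJCAI 2025 MiGA Challenge.

Insight: Hand-crafted data augmentation and spatial-temporal attention can effectively handle the high variability and the precise temporal localization demands of micro-gesture recognition, offering ideas for similar tasks.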

Abstract: In this paper, we introduce the latest solution developed by our team, HFUT-VUT, for the Micro-gesture Online Recognition track of the IJCAI 2025 MiGA Challenge. The Micro-gesture Online Recognition task is a highly challenging problem that aims to locate the temporal positions and recognize the categories of multiple micro-gesture instances in untrimmed videos. Compared to traditional temporal action detection, this task places greater emphasis on distinguishing between micro-gesture categories and precisely identifying the start and end times of each instance. Moreover, micro-gestures are typically spontaneous human actions, with greater differences than those found in other human actions. To address these challenges, we propose hand-crafted data augmentation and spatial-temporal attention to enhance the model’s ability to classify and localize micro-gestures more accurately. Our solution achieved an F1 score of 38.03, outperforming the previous state-of-the-art by 37.9%. As a result, our method ranked first in the Micro-gesture Online Recognition track.

[57] QuarterMap: Efficient Post-Training Token Pruning for Visual State Space Models cs.CV | cs.AI

Tien-Yu Chi, Hung-Yueh Chiang, Diana Marculescu, Kai-Chiang Wu

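TL;DR: QuarterMap is an efficient post-training token pruning method designed for visual state space models. It removes redundant spatial activations and restores dimensions with nearest-neighbor upsampling, improving throughput without retraining.

Details

Motivation: State space models for vision, such as VMamba, suffer from spatial redundancy that limits their efficiency. The authors aim to optimize the model after training to increase deployment-time throughput.

Result: On ImageNet-1K, QuarterMap speeds up VMamba by up to 11% with an accuracy drop of less than 0.9%; it yields similar gains on ADE20K segmentation and on MedMamba's medical imaging tasks.

Insight: QuarterMap shows the potential of post-training pruning for optimizing models with a specific structure, especially for redundancy in SSMs, and provides a low-overhead tool for deployment-time optimization.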

Abstract: State space models (SSMs) reduce the quadratic complexity of transformers by leveraging linear recurrence. Recently, VMamba has emerged as a strong SSM-based vision backbone, yet remains bottlenecked by spatial redundancy in its four-directional scan. We propose QuarterMap, a post-training activation pruning method that removes redundant spatial activations before scanning and restores dimensions via nearest-neighbor upsampling. Our method improves throughput without retraining. On ImageNet-1K, QuarterMap achieves up to 11% speedup on VMamba with less than 0.9% accuracy drop, and yields similar gains on ADE20K segmentation. Beyond VMamba, we validate QuarterMap on MedMamba, a domain-specific model that shares the same four-directional scanning structure, where it consistently improves throughput while preserving accuracy across multiple medical imaging tasks. Compared to token merging methods like ToMe, QuarterMap is tailored for SSMs and avoids costly merge-unmerge operations. Our method offers a plug-and-play tool for deployment-time efficiency without compromising transferability.
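The prune-then-restore idea in this abstract can be sketched on a single 2D feature map in plain Python. This is an illustration under assumptions, not the QuarterMap code: the stride-2 pattern (keeping one position in four) and both function names are hypothetical, and the expensive scan that would run on the reduced map is omitted.

```python
def downsample_stride2(fmap):
    """Keep every second row and column, i.e. about a quarter of the positions."""
    return [row[::2] for row in fmap[::2]]

def upsample_nearest(fmap, out_h, out_w):
    """Nearest-neighbor upsampling by a factor of 2, cropped to (out_h, out_w)."""
    restored = []
    for row in fmap:
        # Repeat each value twice along the width, then crop.
        wide = [value for value in row for _ in range(2)][:out_w]
        # Repeat the widened row twice along the height.
        restored.append(wide)
        restored.append(list(wide))
    return restored[:out_h]

# A 4x4 map is reduced to 2x2 before the (omitted) scan, then restored to 4x4.
feature_map = [[r * 4 + c for c in range(4)] for r in range(4)]
reduced = downsample_stride2(feature_map)
restored = upsample_nearest(reduced, 4, 4)
```

The restored map has the original spatial size, so downstream layers see unchanged dimensions, while the work between the two calls runs on a quarter of the positions.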

[58] When Schrödinger Bridge Meets Real-World Image Dehazing with Unpaired Training cs.CV

Yunwei Lan, Zhigao Cui, Xin Luo, Chang Liu, Nian Wang

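TL;DR: The paper proposes DehazeSB, an unpaired dehazing framework based on the Schrödinger Bridge that uses optimal transport theory to bridge the distributions of hazy and clear images directly, with detail-preserving regularization and prompt learning to improve performance.

Details

Motivation: Existing GAN-based methods for unpaired dehazing are limited by the generator's transport mapping capability and cannot fully exploit unpaired training, so a better solution is needed.

Result: Experiments on multiple real-world datasets demonstrate the method's superiority.

Insight: The Schrödinger Bridge and optimal transport theory show promise for unpaired image generation tasks; combining them with detail preservation and prompt learning can improve performance further.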

Abstract: Recent advancements in unpaired dehazing, particularly those using GANs, show promising performance in processing real-world hazy images. However, these methods tend to face limitations due to the generator’s limited transport mapping capability, which hinders the full exploitation of their effectiveness in unpaired training paradigms. To address these challenges, we propose DehazeSB, a novel unpaired dehazing framework based on the Schrödinger Bridge. By leveraging optimal transport (OT) theory, DehazeSB directly bridges the distributions between hazy and clear images. This enables optimal transport mappings from hazy to clear images in fewer steps, thereby generating high-quality results. To ensure the consistency of structural information and details in the restored images, we introduce detail-preserving regularization, which enforces pixel-level alignment between hazy inputs and dehazed outputs. Furthermore, we propose a novel prompt learning strategy that leverages pre-trained CLIP models to distinguish hazy images from clear ones by learning a haze-aware vision-language alignment. Extensive experiments on multiple real-world datasets demonstrate our method’s superiority. Code: https://github.com/ywxjm/DehazeSB.

[59] VDInstruct: Zero-Shot Key Information Extraction via Content-Aware Vision Tokenization cs.CV | cs.AI | cs.LG

Son Nguyen, Giang Nguyen, Hung Dao, Thao Do, Daeyoung Kim

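TL;DR: The paper proposes VDInstruct, a multimodal large language model (MLLM) whose content-aware vision tokenization strategy substantially improves key information extraction (KIE) on dense documents and surpasses existing baselines in zero-shot settings.

Details

Motivation: Existing MLLMs perform poorly on dense documents, and conventional vision tokenization scales with image size, causing redundant computation and wasted memory. The paper therefore proposes a content-aware tokenization strategy to improve both efficiency and performance.

Result: (1) State-of-the-art results on KIE benchmarks while using roughly 3.6x fewer image tokens; (2) in zero-shot evaluation, a gain of 5.5 F1 points over DocOwl 1.5, demonstrating the model's robustness.

Insight: Combining content-aware tokenization with explicit layout modeling offers an efficient and effective direction for document understanding.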

Abstract: Key Information Extraction (KIE) underpins the understanding of visual documents (e.g., receipts and contracts) by extracting precise semantic content and accurately capturing spatial structure. Yet existing multimodal large language models (MLLMs) often perform poorly on dense documents and rely on vision tokenization approaches that scale with image size, leading to redundant computation and memory inefficiency. To address these challenges, we introduce VDInstruct, an MLLM that separates spatial region detection from semantic feature extraction. Central to our model is a content-aware tokenization strategy: rather than fragmenting the entire image uniformly, it generates tokens in proportion to document complexity, preserving critical structure while eliminating wasted tokens. Leveraging a three-stage training paradigm, our model achieves state-of-the-art (SOTA) results on KIE benchmarks, matching or exceeding the accuracy of leading approaches while reducing the number of image tokens by roughly 3.6x. In zero-shot evaluations, VDInstruct surpasses strong baselines, such as DocOwl 1.5, by +5.5 F1 points, highlighting its robustness to unseen documents. These findings show that content-aware tokenization combined with explicit layout modeling offers a promising direction forward for document understanding. Data, source code, and model weights will be made publicly available.

[60] Prompt Engineering in Segment Anything Model: Methodologies, Applications, and Emerging Challenges cs.CV | cs.AI

Yidong Jiang

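TL;DR: This paper systematically surveys prompt engineering methods, applications, and challenges for the Segment Anything Model (SAM) and its variants, filling a gap in the literature.

Details

Motivation: SAM has transformed image segmentation through its prompt-based approach, but the critical role of prompt engineering in its success remains underexplored. The paper aims to fill this gap.

Result: The survey traces how prompt engineering has evolved and how it enables adaptation to domains such as medical imaging and remote sensing, and it identifies optimization challenges and future research directions.

Insight: The key takeaways are the central role of prompt engineering in SAM and its potential for cross-domain applications.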

Abstract: The Segment Anything Model (SAM) has revolutionized image segmentation through its innovative prompt-based approach, yet the critical role of prompt engineering in its success remains underexplored. This paper presents the first comprehensive survey focusing specifically on prompt engineering techniques for SAM and its variants. We systematically organize and analyze the rapidly growing body of work in this emerging field, covering fundamental methodologies, practical applications, and key challenges. Our review reveals how prompt engineering has evolved from simple geometric inputs to sophisticated multimodal approaches, enabling SAM’s adaptation across diverse domains including medical imaging and remote sensing. We identify unique challenges in prompt optimization and discuss promising research directions. This survey fills an important gap in the literature by providing a structured framework for understanding and advancing prompt engineering in foundation models for segmentation.

[61] WordCraft: Interactive Artistic Typography with Attention Awareness and Noise Blending cs.CV

Zhe Wang, Jingbo Zhang, Tianyi Wei, Wanchao Su, Can Wang

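TL;DR: WordCraft is an interactive artistic typography synthesis system that combines diffusion models, a regional attention mechanism, and noise blending. It supports localized edits, iterative refinement, and multi-character composition, significantly improving the interactivity and flexibility of artistic typography generation.

Details

Motivation: Traditional artistic typography design relies on manual work, and existing generative methods lack interactivity and flexibility, failing to support localized edits, multi-character composition, and open-ended prompts.

Result: The system generates high-quality, diverse artistic typography, supports single- and multi-character inputs, and works across multiple languages and user needs.

Insight: Interactivity and the parsing of user intent are key to making artistic typography generation practical.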

Abstract: Artistic typography aims to stylize input characters with visual effects that are both creative and legible. Traditional approaches rely heavily on manual design, while recent generative models, particularly diffusion-based methods, have enabled automated character stylization. However, existing solutions remain limited in interactivity, lacking support for localized edits, iterative refinement, multi-character composition, and open-ended prompt interpretation. We introduce WordCraft, an interactive artistic typography system that integrates diffusion models to address these limitations. WordCraft features a training-free regional attention mechanism for precise, multi-region generation and a noise blending technique that supports continuous refinement without compromising visual quality. To support flexible, intent-driven generation, we incorporate a large language model to parse and structure both concrete and abstract user prompts. These components allow our framework to synthesize high-quality, stylized typography for single- and multi-character inputs in multiple languages, supporting diverse user-centered workflows. Our system significantly enhances interactivity in artistic typography synthesis, opening up creative possibilities for artists and designers.

[62] MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models cs.CV | cs.AI | cs.CL

Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu

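TL;DR: MENTOR proposes an efficient multimodal-conditioned tuning framework whose two-stage training achieves fine-grained alignment between multimodal inputs and image outputs, improving both the precision of generation control and training efficiency.

Details

Motivation: Existing text-to-image models fall short in precise control with multimodal inputs, in balancing those inputs, and in training efficiency for complex generation tasks.

Result: MENTOR outperforms competing baselines on DreamBench++, with better concept preservation, prompt following, and image reconstruction fidelity.

Insight: The two-stage training paradigm achieves efficient multimodal generation control even with limited resources, suggesting a new direction for applying autoregressive models to multimodal tasks.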

Abstract: Recent text-to-image models produce high-quality results but still struggle with precise visual control, balancing multimodal inputs, and requiring extensive training for complex multimodal image generation. To address these limitations, we propose MENTOR, a novel autoregressive (AR) framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two-stage training consists of: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Despite modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. Additionally, our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods. Dataset, code, and models are available at: https://github.com/HaozheZhao/MENTOR

[63] Memory-Augmented SAM2 for Training-Free Surgical Video Segmentation cs.CV

Ming Yin, Fu Wang, Xujiong Ye, Yanda Meng, Zeyu Fu

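TL;DR: The paper proposes MA-SAM2, a training-free, memory-augmented method for surgical video segmentation. It improves on SAM2's greedy-selection memory design to handle occlusion and multi-instrument interaction in surgical videos, yielding significant performance gains.

Details

Motivation: Rapid instrument movement, frequent occlusion, and complex instrument-tissue interaction in surgical videos limit the performance of existing segmentation models such as SAM2, so a more robust method is needed.

Result: On the EndoVis2017 and EndoVis2018 datasets, MA-SAM2 improves over SAM2 by 4.36% and 6.1%, respectively.

Insight: Memory-augmented designs that require no additional training or parameters can significantly improve segmentation performance in complex scenarios such as surgical video.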

Abstract: Surgical video segmentation is a critical task in computer-assisted surgery, essential for enhancing surgical quality and patient outcomes. Recently, the Segment Anything Model 2 (SAM2) framework has demonstrated remarkable advancements in both image and video segmentation. However, the inherent limitations of SAM2’s greedy selection memory design are amplified by the unique properties of surgical videos (rapid instrument movement, frequent occlusion, and complex instrument-tissue interaction), resulting in diminished performance in the segmentation of complex, long videos. To address these challenges, we introduce Memory Augmented (MA)-SAM2, a training-free video object segmentation strategy, featuring novel context-aware and occlusion-resilient memory models. MA-SAM2 exhibits strong robustness against occlusions and interactions arising from complex instrument movements while maintaining accuracy in segmenting objects throughout videos. Employing a multi-target, single-loop, one-prompt inference further enhances the efficiency of the tracking process in multi-instrument videos. Without introducing any additional parameters or requiring further training, MA-SAM2 achieved performance improvements of 4.36% and 6.1% over SAM2 on the EndoVis2017 and EndoVis2018 datasets, respectively, demonstrating its potential for practical surgical applications.

[64] Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score cs.CV

Eman Ali, Sathira Silva, Chetan Arora, Muhammad Haris Khan

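TL;DR: The paper proposes FAIR, an approach that dynamically aligns localized image features with language embeddings to improve the fine-grained unsupervised adaptation of CLIP, significantly raising classification accuracy.

Details

Motivation: Existing vision-language models such as CLIP underperform on fine-grained classification, and adaptation methods typically rely on fixed alignment scores or computationally expensive pseudo-labeling strategies. FAIR aims to address these issues through dynamic alignment and interaction refinement.

Result: Across 13 fine-grained datasets, FAIR achieves an overall gain of 2.78% over existing SOTA methods.

Insight: Dynamic alignment and the refinement of cross-modal interactions are key to improving vision-language models on fine-grained tasks. A self-training weighting mechanism can further reduce the noise caused by inter-class ambiguity.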

Abstract: Vision-language models (VLMs) like CLIP excel in zero-shot learning by aligning image and text representations through contrastive pretraining. Existing approaches to unsupervised adaptation (UA) for fine-grained classification with VLMs either rely on fixed alignment scores that cannot capture evolving, subtle class distinctions or use computationally expensive pseudo-labeling strategies that limit scalability. In contrast, we show that modeling fine-grained cross-modal interactions during adaptation produces more accurate, class-discriminative pseudo-labels and substantially improves performance over state-of-the-art (SOTA) methods. We introduce Fine-grained Alignment and Interaction Refinement (FAIR), an innovative approach that dynamically aligns localized image features with descriptive language embeddings through a set of Class Description Anchors (CDA). This enables the definition of a Learned Alignment Score (LAS), which incorporates CDA as an adaptive classifier, facilitating cross-modal interactions to improve self-training in unsupervised adaptation. Furthermore, we propose a self-training weighting mechanism designed to refine pseudo-labels in the presence of inter-class ambiguities. Our approach, FAIR, delivers a substantial performance boost in fine-grained unsupervised adaptation, achieving a notable overall gain of 2.78% across 13 fine-grained datasets compared to SOTA methods.

[65] Generate Aligned Anomaly: Region-Guided Few-Shot Anomaly Image-Mask Pair Synthesis for Industrial Inspection cs.CV

Yilin Lu, Jianghang Lin, Linhuang Xie, Kai Zhao, Yansong Qu

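TL;DR: The paper proposes Generate Aligned Anomaly (GAA), a region-guided, few-shot framework for generating anomaly image-mask pairs for industrial inspection, addressing the low realism, inaccurate mask alignment, and poor generalization of existing methods.

Details

Motivation: The scarcity of anomaly samples in industrial inspection limits existing methods on tasks such as localization and classification, and existing anomaly synthesis methods suffer from low realism, inaccurate mask alignment, and poor generalization, so a more effective solution is needed.

Result: On the MVTec AD and LOCO datasets, GAA performs strongly in both anomaly synthesis quality and downstream tasks such as localization and classification.

Insight: By combining the strong priors of a pretrained model with fine-grained semantic control, GAA efficiently generates high-quality anomaly samples, offering a new approach to data augmentation for industrial inspection.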

Abstract: Anomaly inspection plays a vital role in industrial manufacturing, but the scarcity of anomaly samples significantly limits the effectiveness of existing methods in tasks such as localization and classification. While several anomaly synthesis approaches have been introduced for data augmentation, they often struggle with low realism, inaccurate mask alignment, and poor generalization. To overcome these limitations, we propose Generate Aligned Anomaly (GAA), a region-guided, few-shot anomaly image-mask pair generation framework. GAA leverages the strong priors of a pretrained latent diffusion model to generate realistic, diverse, and semantically aligned anomalies using only a small number of samples. The framework first employs Localized Concept Decomposition to jointly model the semantic features and spatial information of anomalies, enabling flexible control over the type and location of anomalies. It then utilizes Adaptive Multi-Round Anomaly Clustering to perform fine-grained semantic clustering of anomaly concepts, thereby enhancing the consistency of anomaly representations. Subsequently, a region-guided mask generation strategy ensures precise alignment between anomalies and their corresponding masks, while a low-quality sample filtering module is introduced to further improve the overall quality of the generated samples. Extensive experiments on the MVTec AD and LOCO datasets demonstrate that GAA achieves superior performance in both anomaly synthesis quality and downstream tasks such as localization and classification.

[66] Brain Stroke Detection and Classification Using CT Imaging with Transformer Models and Explainable AI cs.CV | cs.AI

Shomukh Qari, Maha A. Thafar

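TL;DR: This paper proposes a CT-based multiclass stroke classification method that combines MaxViT with explainable AI (XAI) techniques, achieving high accuracy and interpretability.

Details

Motivation: Stroke is one of the leading causes of death worldwide, and fast, accurate diagnosis is critical to patient outcomes. CT imaging is a key tool because it is fast, accessible, and cost-effective. The paper aims to use AI to improve the accuracy and transparency of stroke classification.

Result: The MaxViT model trained with augmented data performs best, reaching 98% accuracy and F1-score and outperforming the other models and the baseline methods.

Insight: Combining transformer models with XAI techniques improves stroke classification accuracy while also increasing model transparency and clinical applicability, providing a useful reference for developing AI-assisted diagnostic tools.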

Abstract: Stroke is one of the leading causes of death globally, making early and accurate diagnosis essential for improving patient outcomes, particularly in emergency settings where timely intervention is critical. CT scans are the key imaging modality because of their speed, accessibility, and cost-effectiveness. This study proposed an artificial intelligence framework for multiclass stroke classification (ischemic, hemorrhagic, and no stroke) using CT scan images from a dataset provided by the Republic of Turkey’s Ministry of Health. The proposed method adopted MaxViT, a state-of-the-art Vision Transformer, as the primary deep learning model for image-based stroke classification, with additional transformer variants (vision transformer, transformer-in-transformer, and ConvNext). To enhance model generalization and address class imbalance, we applied data augmentation techniques, including synthetic image generation. The MaxViT model trained with augmentation achieved the best performance, reaching an accuracy and F1-score of 98.00%, outperforming all other evaluated models and the baseline methods. The primary goal of this study was to distinguish between stroke types with high accuracy while addressing crucial issues of transparency and trust in artificial intelligence models. To achieve this, Explainable Artificial Intelligence (XAI) was integrated into the framework, particularly Grad-CAM++. It provides visual explanations of the model’s decisions by highlighting relevant stroke regions in the CT scans and establishing an accurate, interpretable, and clinically applicable solution for early stroke detection. This research contributed to the development of a trustworthy AI-assisted diagnostic tool for stroke, facilitating its integration into clinical practice and enhancing access to timely and optimal stroke diagnosis in emergency departments, thereby saving more lives.

[67] Disentanglement and Assessment of Shortcuts in Ophthalmological Retinal Imaging Exams cs.CV | cs.LG

Leonor Fernandes, Tiago Gonçalves, João Matos, Luis Filipe Nakayama, Jaime S. Cardoso

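TL;DR: The paper evaluates the fairness and performance of three models for diabetic retinopathy (DR) prediction, examines how effective disentanglement is at reducing bias, and stresses the importance of fairness in medical imaging AI.

Details

Motivation: Diabetic retinopathy (DR) is a leading cause of vision loss in working-age adults. Traditional screening is costly and hard to scale, and although AI algorithms are promising, concerns about fairness and generalization persist.

Result: All models perform well on DR prediction (up to 94% AUROC), but fairness disparities appear with respect to age and gender/sex. The effect of disentanglement varies by model: DINOv2 gains 2% AUROC, whereas ConvNeXt V2 and Swin V2 drop by 7% and 3%, respectively.

Insight: The effect of disentanglement in medical imaging depends on the model, and fairness is a complex problem that requires targeted optimization; AI applications in healthcare should prioritize fairness and reliability.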

Abstract: Diabetic retinopathy (DR) is a leading cause of vision loss in working-age adults. While screening reduces the risk of blindness, traditional imaging is often costly and inaccessible. Artificial intelligence (AI) algorithms present a scalable diagnostic solution, but concerns regarding fairness and generalization persist. This work evaluates the fairness and performance of image-trained models in DR prediction, as well as the impact of disentanglement as a bias mitigation technique, using the diverse mBRSET fundus dataset. Three models, ConvNeXt V2, DINOv2, and Swin V2, were trained on macula images to predict DR and sensitive attributes (SAs) (e.g., age and gender/sex). Fairness was assessed between subgroups of SAs, and disentanglement was applied to reduce bias. All models achieved high DR prediction performance in diagnosing (up to 94% AUROC) and could reasonably predict age and gender/sex (91% and 77% AUROC, respectively). Fairness assessment suggests disparities, such as a 10% AUROC gap between age groups in DINOv2. Disentangling SAs from DR prediction had varying results, depending on the model selected. Disentanglement improved DINOv2 performance (2% AUROC gain), but led to performance drops in ConvNeXt V2 and Swin V2 (7% and 3%, respectively). These findings highlight the complexity of disentangling fine-grained features in fundus imaging and emphasize the importance of fairness in medical imaging AI to ensure equitable and reliable healthcare solutions.
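A subgroup AUROC gap of the kind reported in this abstract can be computed as in the following minimal Python sketch. It illustrates the metric only and is not the authors' evaluation code: the function names and the toy data are hypothetical, and AUROC is computed with the pairwise (Mann-Whitney) definition, counting ties as one half.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

def auroc_gap(labels, scores, groups):
    """Return (largest AUROC difference between subgroups, per-group AUROC)."""
    per_group = {}
    for group in set(groups):
        idx = [i for i, g in enumerate(groups) if g == group]
        per_group[group] = auroc([labels[i] for i in idx], [scores[i] for i in idx])
    return max(per_group.values()) - min(per_group.values()), per_group

# Toy data: two hypothetical age subgroups, "a" and "b".
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.2, 0.6, 0.7, 0.9, 0.1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap, per_group = auroc_gap(labels, scores, groups)
```

In this toy example the model ranks subgroup "a" perfectly but misorders one pair in subgroup "b", so the gap is the difference between the two subgroup AUROCs.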

[68] Prompt2DEM: High-Resolution DEMs for Urban and Open Environments from Global Prompts Using a Monocular Foundation Model cs.CV | eess.IV

Osher Rafaeli, Tal Svoray, Ariel Nahlieli

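TL;DR: The paper proposes Prompt2DEM, a prompt-based monocular depth estimation method that generates high-resolution digital elevation models (DEMs) from RGB imagery at global scale, substantially improving resolution and accuracy.

Details

Motivation: High-resolution DEMs are essential for research on hydrology, urban morphology, and ecosystems. Existing methods are limited: super-resolution techniques are constrained by the upscaling factor, and monocular depth estimation lacks global elevation context, so a new solution is needed.

Result: In tests across three diverse U.S. landscapes, the system raises resolution to 30 cm with an MAE below 5 m, improving over SRTM by up to 18%, and is suitable for hydrological and environmental studies.

Insight: (1) The prompting strategy effectively incorporates global elevation information; (2) vision transformers perform well for DEM generation; (3) the method scales well and is applicable to large regions.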

Abstract: High-resolution elevation estimations are essential to understand catchment and hillslope hydrology, study urban morphology and dynamics, and monitor the growth, decline, and mortality of terrestrial ecosystems. Various deep learning approaches (e.g., super-resolution techniques, monocular depth estimation) have been developed to create high-resolution Digital Elevation Models (DEMs). However, super-resolution techniques are limited by the upscaling factor, and monocular depth estimation lacks global elevation context, making its conversion to a seamless DEM restricted. The recently introduced technique of prompt-based monocular depth estimation has opened new opportunities to extract estimates of absolute elevation in a global context. We present here a framework for the estimation of high-resolution DEMs as a new paradigm for absolute global elevation mapping. It is exemplified using low-resolution Shuttle Radar Topography Mission (SRTM) elevation data as prompts and high-resolution RGB imagery from the National Agriculture Imagery Program (NAIP). The approach fine-tunes a vision transformer encoder with LiDAR-derived DEMs and employs a versatile prompting strategy, enabling tasks such as DEM estimation, void filling, and updating. Our framework achieves a 100x resolution gain (from 30-m to 30-cm), surpassing prior methods by an order of magnitude. Evaluations across three diverse U.S. landscapes show robust generalization, capturing urban structures and fine-scale terrain features with < 5 m MAE relative to LiDAR, improving over SRTM by up to 18%. Hydrological analysis confirms suitability for hazard and environmental studies. We demonstrate scalability by applying the framework to large regions in the U.S. and Israel. All code and pretrained models are publicly available at: https://osherr1996.github.io/prompt2dem_propage/.

[69] ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models cs.CV | cs.AI | cs.CL

Yongheng Zhang, Xu Liu, Ruihan Tao, Qiguang Chen, Hao Fei

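TL;DR: ViTCoT proposes a new video reasoning paradigm that interleaves video and text information in chain-of-thought reasoning, significantly improving the video understanding performance of large language models.

Details

Motivation: Existing large language models rely mainly on textual information for video reasoning and overlook the visual modality, whereas humans naturally re-examine visual content while reasoning. ViTCoT is proposed to enable more intuitive reasoning.

Result: Experiments show that ViTCoT significantly improves performance over traditional text-only CoT and activates more neurons in MLLMs.

Insight: Reasoning that combines multimodal information (video and text) is closer to human cognition and helps improve a model's video understanding.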

Abstract: Video understanding plays a vital role in bridging low-level visual signals with high-level cognitive reasoning, and is fundamental to applications such as autonomous driving, embodied AI, and the broader pursuit of AGI. The rapid development of large language models (LLMs), particularly those utilizing Chain-of-Thought (CoT) technology, has significantly advanced video reasoning capabilities. However, current approaches primarily depend on textual information for reasoning, overlooking the visual modality in the actual video reasoning process. In contrast, humans naturally re-examine visual content while reasoning. Motivated by this, we introduce a novel video reasoning paradigm: Video-Text Interleaved CoT (ViTCoT), which facilitates more intuitive and cognitively aligned reasoning. To this end, we first construct the Video-Text Interleaved Benchmark (ViTIB), which is created using MLLMs for key-video selection and manually verified. Furthermore, we extensively explore the potential of the ViTCoT paradigm in the video understanding field. Extensive experiments demonstrate that ViTCoT significantly enhances performance compared to the traditional text-only CoT paradigm and effectively activates more neuron values in MLLMs.

[70] ExpStar: Towards Automatic Commentary Generation for Multi-discipline Scientific Experiments cs.CV

Jiali Chen, Yujie Jia, Zihan Wu, Jinyu Yang, Jianpeng Chen

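TL;DR: The paper proposes ExpStar, a model that automatically generates commentary for scientific experiments. It builds ExpInstruct, the first experiment commentary dataset, and uses a retrieval-augmented mechanism to improve commentary quality across disciplines, substantially outperforming existing large multimodal models.

Details

Motivation: Experiment commentary is crucial for describing procedures, explaining scientific principles, and giving safety guidance, but producing it by hand takes considerable time and expertise. The paper therefore introduces the task of automatic commentary generation for multi-discipline experiments.

Result: ExpStar performs strongly in experiments, substantially surpassing existing large multimodal models and validating both the dataset and the model.

Insight: The study demonstrates the potential of automatic experiment commentary generation; a retrieval-augmented mechanism that draws on external knowledge is advantageous in multi-discipline tasks and points to a new direction for AI-assisted experiment instruction.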

Abstract: Experiment commentary is crucial in describing the experimental procedures, delving into underlying scientific principles, and incorporating content-related safety guidelines. In practice, human teachers rely heavily on subject-specific expertise and invest significant time preparing such commentary. To address this challenge, we introduce the task of automatic commentary generation across multi-discipline scientific experiments. While recent progress in large multimodal models (LMMs) has demonstrated promising capabilities in video understanding and reasoning, their ability to generate fine-grained and insightful experiment commentary remains largely underexplored. In this paper, we make the following contributions: (i) We construct ExpInstruct, the first dataset tailored for experiment commentary generation, featuring over 7K step-level commentaries across 21 scientific subjects from 3 core disciplines (i.e., science, healthcare and engineering). Each sample includes procedural descriptions along with potential scientific principles (e.g., chemical equations and physical laws) and safety guidelines. (ii) We propose ExpStar, an automatic experiment commentary generation model that leverages a retrieval-augmented mechanism to adaptively access, evaluate, and utilize external knowledge. (iii) Extensive experiments show that our ExpStar substantially outperforms 14 leading LMMs, which highlights the superiority of our dataset and model. We believe that ExpStar holds great potential for advancing AI-assisted scientific experiment instruction.

[71] Token Compression Meets Compact Vision Transformers: A Survey and Comparative Evaluation for Edge AI cs.CV

Phat Nguyen, Ngai-Man Cheung

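TL;DR: The paper presents a systematic survey and comparative evaluation of token compression techniques, addressing two gaps in prior work: the lack of a unified taxonomy and the lack of evidence on how these techniques perform on compact Vision Transformers (ViTs).

Details

Motivation: Token compression is used to accelerate ViT inference, but there is no unified taxonomy and its effect on compact ViTs has not been evaluated.

Result: Token compression is effective on standard ViTs but often underperforms on compact designs.

Insight: Future research should focus on adapting token compression techniques to resource-constrained edge AI devices.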

Abstract: Token compression techniques have recently emerged as powerful tools for accelerating Vision Transformer (ViT) inference in computer vision. Due to the quadratic computational complexity with respect to the token sequence length, these methods aim to remove less informative tokens before the attention layers to improve inference throughput. While numerous studies have explored various accuracy-efficiency trade-offs on large-scale ViTs, two critical gaps remain. First, there is a lack of unified survey that systematically categorizes and compares token compression approaches based on their core strategies (e.g., pruning, merging, or hybrid) and deployment settings (e.g., fine-tuning vs. plug-in). Second, most benchmarks are limited to standard ViT models (e.g., ViT-B, ViT-L), leaving open the question of whether such methods remain effective when applied to structurally compressed transformers, which are increasingly deployed on resource-constrained edge devices. To address these gaps, we present the first systematic taxonomy and comparative study of token compression methods, and we evaluate representative techniques on both standard and compact ViT architectures. Our experiments reveal that while token compression methods are effective for general-purpose ViTs, they often underperform when directly applied to compact designs. These findings not only provide practical insights but also pave the way for future research on adapting token optimization techniques to compact transformer-based networks for edge AI and AI agent applications.
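
As a rough illustration of the pruning family mentioned in the abstract, the minimal sketch below keeps the k tokens with the highest importance score and drops the rest before the next attention layer. The function name `prune_tokens`, the per-token score (for example, attention received from the class token), and the toy values are assumptions made for illustration; they are not taken from any method surveyed in the paper.

```python
def prune_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order.

    tokens: list of token embeddings (any objects)
    scores: list of floats, one hypothetical importance score per token
    keep:   number of tokens to retain
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    keep = max(0, min(keep, len(tokens)))
    # Indices of the top-`keep` scores, re-sorted to restore sequence order.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]


# Toy example: five patch tokens with made-up importance scores.
patches = ["p0", "p1", "p2", "p3", "p4"]
importance = [0.05, 0.40, 0.10, 0.30, 0.15]
kept = prune_tokens(patches, importance, keep=3)
print(kept)
```

Because self-attention cost grows quadratically with sequence length, keeping 3 of 5 tokens in this toy case cuts the number of pairwise token interactions from 25 to 9.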

Tom Kouwenhoven, Kiana Shahrasbi, Tessa Verhoef

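TL;DR: The paper re-evaluates cross-modal association in vision-and-language models (VLMs), focusing on the bouba-kiki effect (the human tendency to associate “bouba” with round shapes and “kiki” with jagged ones). Tests on two CLIP variants (ResNet and ViT) show that the models do not consistently exhibit the effect and align poorly with human cognitive behaviour.

Details

Motivation: The rapid progress of multimodal models has raised the question of whether they integrate cross-modal information, and in particular whether they do so in ways that reflect human cognition. The bouba-kiki effect is a classic test case for this question.

Result: The models do not consistently exhibit the bouba-kiki effect. Although ResNet shows a preference for round shapes, the expected associations are absent overall, and model behaviour differs markedly from human cognition.

Insight: The study exposes shortcomings in the cross-modal representations of VLMs, indicating a gap between their internal representations and human intuition, and points to directions for further research on model understanding.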

Abstract: Recent advances in multimodal models have raised questions about whether vision-and-language models (VLMs) integrate cross-modal information in ways that reflect human cognition. One well-studied test case in this domain is the bouba-kiki effect, where humans reliably associate pseudowords like “bouba” with round shapes and “kiki” with jagged ones. Given the mixed evidence found in prior studies for this effect in VLMs, we present a comprehensive re-evaluation focused on two variants of CLIP, ResNet and Vision Transformer (ViT), given their centrality in many state-of-the-art VLMs. We apply two complementary methods closely modelled after human experiments: a prompt-based evaluation that uses probabilities as model preference, and we use Grad-CAM as a novel way to interpret visual attention in shape-word matching tasks. Our findings show that these models do not consistently exhibit the bouba-kiki effect. While ResNet shows a preference for round shapes, overall performance across both models lacks the expected associations. Moreover, direct comparison with prior human data on the same task shows that the models’ responses fall markedly short of the robust, modality-integrated behaviour characteristic of human cognition. These results contribute to the ongoing debate about the extent to which VLMs truly understand cross-modal concepts, highlighting limitations in their internal representations and alignment with human intuitions.

[73] NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection cs.CV | cs.LGPDF

Amirhossein Ansari, Ke Wang, Pulei Xiong

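TL;DR: NegRefine is a negative label refinement framework for zero-shot OOD detection. It filters subcategory labels and proper nouns out of the negative label set and adds a dynamically adjusted, multi-matching-aware scoring function, making vision-language models such as CLIP more robust at separating in-distribution from out-of-distribution samples.

Details

Motivation: Existing negative label-based zero-shot OOD detection methods (such as NegLabel and CSP) misclassify in-distribution samples as OOD, mainly because negative labels may be subcategories of in-distribution labels or proper nouns, and because these methods cannot handle images that match multiple labels.

Result: The method performs strongly on large-scale benchmarks including ImageNet-1K, supporting its effectiveness.

Insight: The choice of negative labels and a dynamic scoring strategy are critical for zero-shot OOD detection, especially for avoiding interference from multiple semantically overlapping labels.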

Abstract: Recent advancements in Vision-Language Models like CLIP have enabled zero-shot OOD detection by leveraging both image and textual label information. Among these, negative label-based methods such as NegLabel and CSP have shown promising results by utilizing a lexicon of words to define negative labels for distinguishing OOD samples. However, these methods suffer from detecting in-distribution samples as OOD due to negative labels that are subcategories of in-distribution labels or proper nouns. They also face limitations in handling images that match multiple in-distribution and negative labels. We propose NegRefine, a novel negative label refinement framework for zero-shot OOD detection. By introducing a filtering mechanism to exclude subcategory labels and proper nouns from the negative label set and incorporating a multi-matching-aware scoring function that dynamically adjusts the contributions of multiple labels matching an image, NegRefine ensures a more robust separation between in-distribution and OOD samples. We evaluate NegRefine on large-scale benchmarks, including ImageNet-1K. Source code is available at https://github.com/ah-ansari/NegRefine.
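
To make the idea of negative label-based scoring concrete, here is a minimal sketch of a generic score: the share of softmax mass that an image places on in-distribution labels rather than negative labels. This is an assumed, simplified formulation for illustration only; it is not NegRefine's multi-matching-aware scoring function, and the function name, temperature value, and similarity values are hypothetical.

```python
import math


def negative_label_score(id_sims, neg_sims, temperature=0.01):
    """Share of softmax mass that falls on in-distribution (ID) labels.

    id_sims:  image-text similarities for the ID labels
    neg_sims: image-text similarities for the negative labels
    Returns a value between 0 and 1; higher means "more likely in-distribution".
    """
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(list(id_sims) + list(neg_sims))
    id_mass = sum(math.exp((s - peak) / temperature) for s in id_sims)
    neg_mass = sum(math.exp((s - peak) / temperature) for s in neg_sims)
    return id_mass / (id_mass + neg_mass)


# Toy similarities (hypothetical values, not real CLIP outputs).
id_like = negative_label_score([0.31, 0.22], [0.20, 0.18])
ood_like = negative_label_score([0.19, 0.17], [0.30, 0.24])
print(round(id_like, 3), round(ood_like, 3))
```

Under this kind of score, a negative label that is really a subcategory of an in-distribution class would pull mass away from the ID labels for a genuine in-distribution image, which is the failure mode the abstract describes filtering out.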

[74] VRU-Accident: A Vision-Language Benchmark for Video Question Answering and Dense Captioning for Accident Scene Understanding cs.CVPDF

Younggun Kim, Ahmed S. Abdelrahman, Mohamed Abdel-Aty

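TL;DR: VRU-Accident is a large-scale vision-language benchmark for understanding accident scenes involving vulnerable road users (VRUs). It contains 1K real-world accident videos, 6K multiple-choice question-answer pairs, and 1K dense scene descriptions, and is used to evaluate the reasoning and description abilities of multimodal large language models (MLLMs).

Details

Motivation: There is currently no standardized benchmark for evaluating MLLM reasoning in complex, safety-critical VRU accident scenarios, which limits progress on VRU safety in autonomous driving systems.

Result: MLLMs perform reasonably well on visual attributes but face significant challenges in reasoning about accident causes, types, and preventability.

Insight: VRU-Accident exposes the limitations of MLLMs in safety-critical scenarios and points to directions for improvement in future research.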

Abstract: Ensuring the safety of vulnerable road users (VRUs), such as pedestrians and cyclists, is a critical challenge for autonomous driving systems, as crashes involving VRUs often result in severe or fatal consequences. While multimodal large language models (MLLMs) have shown promise in enhancing scene understanding and decision making in autonomous vehicles, there is currently no standardized benchmark to quantitatively evaluate their reasoning abilities in complex, safety-critical scenarios involving VRUs. To address this gap, we present VRU-Accident, a large-scale vision-language benchmark designed to evaluate MLLMs in high-risk traffic scenarios involving VRUs. VRU-Accident comprises 1K real-world dashcam accident videos, annotated with 6K multiple-choice question-answer pairs across six safety-critical categories (with 24K candidate options and 3.4K unique answer choices), as well as 1K dense scene descriptions. Unlike prior works, our benchmark focuses explicitly on VRU-vehicle accidents, providing rich, fine-grained annotations that capture both spatial-temporal dynamics and causal semantics of accidents. To assess the current landscape of MLLMs, we conduct a comprehensive evaluation of 17 state-of-the-art models on the multiple-choice VQA task and on the dense captioning task. Our findings reveal that while MLLMs perform reasonably well on visually grounded attributes, they face significant challenges in reasoning and describing accident causes, types, and preventability.

[75] Hierarchical Abstraction Enables Human-Like 3D Object Recognition in Deep Learning Models cs.CV | cs.LGPDF

Shuhao Fu, Philip J. Kellman, Hongjing Lu

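TL;DR: The paper examines how deep learning models differ from human vision in 3D object recognition and argues that the point transformer model achieves more human-like behaviour through hierarchical abstraction.

Details

Motivation: The study tests whether deep learning models, like humans, build comparable 3D shape representations from sparse 3D shape information (such as point clouds), and compares the performance of different models.

Result: The point transformer model accounts for human performance in the experiments better than the convolution-based model, and is closer to humans particularly in representing global 3D shape.

Insight: Hierarchical abstraction is the key to more human-like 3D object recognition, which suggests a direction for designing more effective 3D vision models.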

Abstract: Both humans and deep learning models can recognize objects from 3D shapes depicted with sparse visual information, such as a set of points randomly sampled from the surfaces of 3D objects (termed a point cloud). Although deep learning models achieve human-like performance in recognizing objects from 3D shapes, it remains unclear whether these models develop 3D shape representations similar to those used by human vision for object recognition. We hypothesize that training with 3D shapes enables models to form representations of local geometric structures in 3D shapes. However, their representations of global 3D object shapes may be limited. We conducted two human experiments systematically manipulating point density and object orientation (Experiment 1), and local geometric structure (Experiment 2). Humans consistently performed well across all experimental conditions. We compared two types of deep learning models, one based on a convolutional neural network (DGCNN) and the other on visual transformers (point transformer), with human performance. We found that the point transformer model provided a better account of human performance than the convolution-based model. The advantage mainly results from the mechanism in the point transformer model that supports hierarchical abstraction of 3D shapes.

[76] FaceLLM: A Multimodal Large Language Model for Face Understanding cs.CV | cs.AI | cs.CLPDF

Hatef Otroshi Shahreza, Sébastien Marcel

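TL;DR: The paper proposes FaceLLM, a multimodal large language model designed for facial image understanding. It uses ChatGPT in a weakly supervised pipeline to generate high-quality question-answer pairs, builds the FairFaceGPT dataset, and achieves state-of-the-art performance on face-centric tasks.

Details

Motivation: Existing MLLMs are trained mainly on generic datasets and lack the ability to reason over domain-specific visual cues such as those in facial images, particularly for fine-grained understanding of facial structure, expression, emotion, and demographic features.

Result: FaceLLM improves MLLM performance on a variety of face-centric tasks and reaches state-of-the-art results, demonstrating the potential of synthetic supervision from language models for domain-specialized MLLMs.

Insight: 1) Language models can serve as synthetic supervision tools for generating domain-specific data; 2) performance gains on fine-grained domain tasks depend on targeted datasets; 3) the work offers an example for trustworthy, human-centric multimodal AI systems.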

Abstract: Multimodal large language models (MLLMs) have shown remarkable performance in vision-language tasks. However, existing MLLMs are primarily trained on generic datasets, limiting their ability to reason on domain-specific visual cues such as those in facial images. In particular, tasks that require detailed understanding of facial structure, expression, emotion, and demographic features remain underexplored by MLLMs due to the lack of large-scale annotated face image-text datasets. In this work, we introduce FaceLLM, a multimodal large language model trained specifically for facial image understanding. To construct the training data, we propose a novel weakly supervised pipeline that uses ChatGPT with attribute-aware prompts to generate high-quality question-answer pairs based on images from the FairFace dataset. The resulting corpus, called FairFaceGPT, covers a diverse set of attributes including expression, pose, skin texture, and forensic information. Our experiments demonstrate that FaceLLM improves the performance of MLLMs on various face-centric tasks and achieves state-of-the-art performance. This work highlights the potential of synthetic supervision via language models for building domain-specialized MLLMs, and sets a precedent for trustworthy, human-centric multimodal AI systems. FairFaceGPT dataset and pretrained FaceLLM models are publicly available in the project page.

[77] A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends cs.CV | cs.AIPDF

Yihao Ding, Siwen Luo, Yue Dai, Yanbei Jiang, Zechuan Li

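TL;DR: This survey reviews methods, challenges, and emerging trends in visually rich document understanding (VRDU) based on multimodal large language models (MLLMs). It focuses on how textual, visual, and layout features are encoded and fused, on training paradigms, and on the relevant datasets, and proposes directions for future work.

Details

Motivation: As demand for automatic processing of complex documents grows, VRDU has become an important research area. MLLMs have shown potential in this area, and the survey aims to organize recent progress systematically.

Result: The survey finds that MLLMs perform strongly in VRDU but still face challenges such as efficiency and generalization.

Insight: Future research directions include improving the efficiency, generalizability, and robustness of models to advance VRDU systems.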

Abstract: Visually-Rich Document Understanding (VRDU) has emerged as a critical field, driven by the need to automatically process documents containing complex visual, textual, and layout information. Recently, Multimodal Large Language Models (MLLMs) have shown remarkable potential in this domain, leveraging both Optical Character Recognition (OCR)-dependent and OCR-free frameworks to extract and interpret information in document images. This survey reviews recent advancements in MLLM-based VRDU, highlighting three core components: (1) methods for encoding and fusing textual, visual, and layout features; (2) training paradigms, including pretraining strategies, instruction-response tuning, and the trainability of different model modules; and (3) datasets utilized for pretraining, instruction-tuning, and supervised fine-tuning. Finally, we discuss the challenges and opportunities in this evolving field and propose future directions to advance the efficiency, generalizability, and robustness of VRDU systems.

[78] SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation cs.CV | eess.ASPDF

Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou

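TL;DR: SpeakerVid-5M is the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. It contains more than 5.2 million video clips totaling over 8,743 hours and covers multiple interaction types and quality tiers.

Details

Motivation: With the development of large-scale models, academia has turned to the challenge of audio-visual dyadic interactive virtual humans, but high-quality datasets are lacking. SpeakerVid-5M is proposed to fill this gap.

Result: The dataset contains more than 5.2 million video clips totaling over 8,743 hours and covers multiple interaction types. The paper also provides an AR-based video chat baseline and an evaluation benchmark.

Insight: The dataset's dual structure (interaction type plus quality tier) flexibly supports 2D virtual human tasks and can advance future research on audio-visual interaction.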

Abstract: The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual human. To facilitate research in this emerging area, we present SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data to serve as a benchmark VidChatBench for future work. Both the dataset and the corresponding data processing code will be publicly released. Project page: https://dorniwang.github.io/SpeakerVid-5M/

[79] OpenHuman4D: Open-Vocabulary 4D Human Parsing cs.CVPDF

Keito Suzuki, Bang Du, Runfa Blark Li, Kunyao Chen, Lei Wang

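TL;DR: OpenHuman4D is an open-vocabulary 4D human parsing framework that reduces inference time and adds open-vocabulary capability, addressing current methods' reliance on closed-set datasets and their long inference times.

Details

Motivation: Existing human part segmentation methods are constrained by closed-set datasets and long inference times, which limits their applicability. OpenHuman4D aims to address these problems and improve understanding of dynamic 3D human representations.

Result: Experiments show that the method is efficient and flexible on 4D human parsing tasks, with a substantial speedup in inference.

Insight: Through open vocabulary and an efficient design, OpenHuman4D opens a new direction for dynamic 3D human parsing, with applications in virtual and extended reality.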

Abstract: Understanding dynamic 3D human representation has become increasingly critical in virtual and extended reality applications. However, existing human part segmentation methods are constrained by reliance on closed-set datasets and prolonged inference times, which significantly restrict their applicability. In this paper, we introduce the first 4D human parsing framework that simultaneously addresses these challenges by reducing the inference time and introducing open-vocabulary capabilities. Building upon state-of-the-art open-vocabulary 3D human parsing techniques, our approach extends the support to 4D human-centric video with three key innovations: 1) We adopt mask-based video object tracking to efficiently establish spatial and temporal correspondences, avoiding the necessity of segmenting all frames. 2) A novel Mask Validation module is designed to manage new target identification and mitigate tracking failures. 3) We propose a 4D Mask Fusion module, integrating memory-conditioned attention and logits equalization for robust embedding fusion. Extensive experiments demonstrate the effectiveness and flexibility of the proposed method on 4D human-centric parsing tasks, achieving up to 93.3% acceleration compared to the previous state-of-the-art method, which was limited to parsing fixed classes.

[80] MCGA: Mixture of Codebooks Hyperspectral Reconstruction via Grayscale-Aware Attention cs.CVPDF

Zhanjiang Yang, Lijun Sun, Jiawei Dong, Xiaoxin An, Yang Liu

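TL;DR: MCGA is a two-stage approach that uses a Mixture of Codebooks (MoC) and Grayscale-Aware Attention to reconstruct hyperspectral images efficiently from RGB images, markedly improving reconstruction quality.

Details

Motivation: Existing learning-based hyperspectral reconstruction methods learn the RGB-to-hyperspectral mapping directly, overlooking the inherent difficulty of going from low-dimensional to high-dimensional information. MCGA aims to address this.

Result: Experiments show that MCGA achieves state-of-the-art performance on hyperspectral reconstruction.

Insight: By learning spectral patterns first and estimating the mapping afterwards, combined with a physically motivated attention mechanism, MCGA achieves lightweight, efficient, high-quality hyperspectral reconstruction.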

Abstract: Reconstructing hyperspectral images (HSI) from RGB images is a cost-effective solution for various vision-based applications. However, most existing learning-based hyperspectral reconstruction methods directly learn the RGB-to-HSI mapping using complex attention mechanisms, neglecting the inherent challenge of transitioning from low-dimensional to high-dimensional information. To address this limitation, we propose a two-stage approach, MCGA, which first learns spectral patterns before estimating the mapping. In the first stage, a multi-scale VQ-VAE learns representations from heterogeneous HSI datasets, extracting a Mixture of Codebooks (MoC). In the second stage, the RGB-to-HSI mapping is refined by querying features from the MoC to replace latent HSI representations, incorporating prior knowledge rather than forcing a direct high-dimensional transformation. To further enhance reconstruction quality, we introduce Grayscale-Aware Attention and Quantized Self-Attention, which adaptively adjust feature map intensities to meet hyperspectral reconstruction requirements. This physically motivated attention mechanism ensures lightweight and efficient HSI recovery. Moreover, we propose an entropy-based Test-Time Adaptation strategy to improve robustness in real-world scenarios. Extensive experiments demonstrate that our method, MCGA, achieves state-of-the-art performance. The code and models will be released at https://github.com/Fibonaccirabbit/MCGA
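
The second stage described above queries features from a learned codebook. As a hedged illustration of what a codebook query generally looks like in vector-quantized models, the sketch below returns the codebook entry nearest to a feature vector in squared Euclidean distance. This is the standard nearest-neighbour lookup, assumed here for illustration; MCGA's multi-scale VQ-VAE and Mixture of Codebooks are not reproduced, and the function name and toy values are hypothetical.

```python
def nearest_code(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector`.

    Distance is squared Euclidean; ties resolve to the lowest index.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    idx = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return idx, codebook[idx]


# Toy 2-D codebook with three entries (made-up values).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
idx, code = nearest_code([0.9, 1.2], codebook)
print(idx, code)
```

Replacing a latent feature with its nearest code is one way to inject prior knowledge learned from real hyperspectral data, rather than asking the network to produce the high-dimensional representation from scratch.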

[81] EmbRACE-3K: Embodied Reasoning and Action in Complex Environments cs.CV | cs.AI | cs.CLPDF

Mingxian Lin, Wei Huang, Yitang Li, Chengjie Jiang, Kui Wu

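TL;DR: The paper introduces the EmbRACE-3K dataset for evaluating and improving the ability of vision-language models (VLMs) to perform embodied tasks in complex environments. Experiments show that existing VLMs perform poorly in zero-shot settings (success rates below 20%), while fine-tuning Qwen2.5-VL-7B yields substantial improvements.

Details

Motivation: Current VLMs perform well on offline image and video understanding but are of limited effectiveness on embodied tasks that require real-time interaction and understanding of dynamic environments. New datasets and methods are therefore needed to advance VLMs in this area.

Result: In zero-shot settings, existing VLMs achieve success rates below 20%. The fine-tuned Qwen2.5-VL-7B model improves substantially across all three task categories.

Insight: Embodied tasks place higher demands on VLMs, especially in dynamic environment understanding and multi-stage planning. EmbRACE-3K provides an important tool for studying and improving the interactive capabilities of VLMs.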

Abstract: Recent advanced vision-language models (VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding, remains limited. In such scenarios, an agent perceives the environment from a first-person perspective, with each action dynamically shaping subsequent observations. Even state-of-the-art models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment interactions, exhibiting clear limitations in spatial reasoning and long-horizon planning. To address this gap, we introduce EmbRACE-3K, a dataset of over 3,000 language-guided tasks situated in diverse, photorealistic environments constructed using Unreal Engine and the UnrealCV-Zoo framework. The tasks encompass a wide range of embodied challenges, including navigation, object manipulation, and multi-stage goal execution. Each task unfolds as a multi-step trajectory, pairing first-person visual observations with high-level instructions, grounded actions, and natural language rationales that express the agent’s intent at every step. Using EmbRACE-3K, we establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments. To demonstrate the utility of EmbRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning followed by reinforcement learning. This approach yields substantial improvements across all three challenge categories, highlighting the dataset’s effectiveness in enabling the development of embodied reasoning capabilities.

[82] IGD: Instructional Graphic Design with Multimodal Layer Generation cs.CVPDF

Yadong Qu, Shancheng Fang, Yuxin Wang, Xiaorui Wang, Zhineng Chen

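TL;DR: The paper proposes IGD (Instructional Graphic Designer), an instruction-driven graphic design method based on multimodal layer generation. It quickly generates editable graphic designs from natural language instructions, addressing the lack of creativity and editability in existing methods.

Details

Motivation: Existing graphic design methods are either two-stage approaches built on layout generation or diffusion-based image-level generation. The former lack creativity, while the latter produce non-editable files with poor text legibility. An automated solution that is both fast and editable is therefore needed.

Result: Experimental results show that IGD offers a new solution for graphic design and outperforms existing methods.

Insight: By combining parametric rendering with image asset generation, IGD achieves fast, editable graphic design and supports scalability to complex design tasks.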

Abstract: Graphic design visually conveys information and data by creating and combining text, images and graphics. Two-stage methods that rely primarily on layout generation lack creativity and intelligence, making graphic design still labor-intensive. Existing diffusion-based methods generate non-editable graphic design files at image level with poor legibility in visual text rendering, which prevents them from achieving satisfactory and practical automated graphic design. In this paper, we propose Instructional Graphic Designer (IGD) to swiftly generate multimodal layers with editable flexibility with only natural language instructions. IGD adopts a new paradigm that leverages parametric rendering and image asset generation. First, we develop a design platform and establish a standardized format for multi-scenario design files, thus laying the foundation for scaling up data. Second, IGD utilizes the multimodal understanding and reasoning capabilities of MLLM to accomplish attribute prediction, sequencing and layout of layers. It also employs a diffusion model to generate image content for assets. By enabling end-to-end training, IGD architecturally supports scalability and extensibility in complex graphic design tasks. The superior experimental results demonstrate that IGD offers a new solution for graphic design.

[83] Can GPT-4o mini and Gemini 2.0 Flash Predict Fine-Grained Fashion Product Attributes? A Zero-Shot Analysis cs.CV | cs.AIPDF

Shubham Shukla, Kunal Sonalkar

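TL;DR: The paper presents a zero-shot evaluation of GPT-4o-mini and Gemini 2.0 Flash on fine-grained fashion attribute recognition, finds that Gemini performs better, and discusses their potential for product attribute annotation in e-commerce.

Details

Motivation: Success in the fashion industry depends on an accurate understanding of product attributes, yet the performance of existing large language models (LLMs) on fine-grained fashion attribute recognition has not been studied thoroughly. The paper aims to fill this gap.

Result: Gemini 2.0 Flash performs best (macro F1 of 56.79%), ahead of GPT-4o-mini (43.28%). The models do well on some attributes but need domain-specific fine-tuning.

Insight: Current LLMs remain limited in fine-grained attribute recognition and require domain adaptation. Gemini is practical and well suited to deployment in e-commerce settings. The study lays groundwork for multimodal attribute extraction in fashion AI.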

Abstract: The fashion retail business is centered around the capacity to comprehend products. Product attribution helps in comprehending products depending on the business process. Quality attribution improves the customer experience as they navigate through millions of products offered by a retail website. It leads to well-organized product catalogs. In the end, product attribution directly impacts the ‘discovery experience’ of the customer. Although large language models (LLMs) have shown remarkable capabilities in understanding multimodal data, their performance on fine-grained fashion attribute recognition remains under-explored. This paper presents a zero-shot evaluation of state-of-the-art LLMs that balance performance with speed and cost efficiency, mainly GPT-4o-mini and Gemini 2.0 Flash. We have used the dataset DeepFashion-MultiModal (https://github.com/yumingj/DeepFashion-MultiModal) to evaluate these models in the attribution tasks of fashion products. Our study evaluates these models across 18 categories of fashion attributes, offering insight into where these models excel. We only use images as the sole input for product information to create a constrained environment. Our analysis shows that Gemini 2.0 Flash demonstrates the strongest overall performance with a macro F1 score of 56.79% across all attributes, while GPT-4o-mini scored a macro F1 score of 43.28%. Through detailed error analysis, our findings provide practical insights for deploying these LLMs in production e-commerce product attribution-related tasks and highlight the need for domain-specific fine-tuning approaches. This work also lays the groundwork for future research in fashion AI and multimodal attribute extraction.
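
The headline numbers above are macro F1 scores. For readers unfamiliar with the metric, the minimal sketch below computes macro F1 as the unweighted mean of per-class F1 on a toy attribute. The label names and predictions are made up, and how the paper aggregates scores across its 18 attribute categories is not stated in the abstract, so this only illustrates the metric itself.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in truth or prediction."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy "sleeve length" attribute with hypothetical labels.
truth = ["long", "long", "long", "short"]
preds = ["long", "long", "short", "short"]
score = macro_f1(truth, preds)
print(round(score, 4))
```

Because every class contributes equally, macro F1 penalizes a model that does well on frequent attribute values but poorly on rare ones, which matters for long-tailed fashion attributes.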

[84] Uncertainty Quantification for Incomplete Multi-View Data Using Divergence Measures cs.CVPDF

Zhipeng Xue, Yan Zhang, Ming Li, Chun Li, Yue Liu

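TL;DR: The paper proposes KPHD-Net, a method based on Hölder divergence for uncertainty quantification in multi-view classification and clustering. By combining Dempster-Shafer evidence theory with a Kalman filter, it improves the reliability and accuracy of multi-view fusion.

Details

Motivation: Current multi-view classification and clustering methods usually estimate uncertainty with Kullback-Leibler (KL) divergence but ignore domain gaps between modalities, which undermines reliability. A more effective way to quantify uncertainty and improve the reliability of fusion results is therefore needed.

Result: Experiments show that KPHD-Net outperforms existing methods in accuracy, robustness, and reliability on classification and clustering tasks, with theoretical guarantees.

Insight: Hölder divergence is better suited than KL divergence to measuring distribution discrepancies in multi-view data, and combining it with Dempster-Shafer evidence theory can markedly improve the reliability of uncertainty estimates.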

Abstract: Existing multi-view classification and clustering methods typically improve task accuracy by leveraging and fusing information from different views. However, ensuring the reliability of multi-view integration and final decisions is crucial, particularly when dealing with noisy or corrupted data. Current methods often rely on Kullback-Leibler (KL) divergence to estimate uncertainty of network predictions, ignoring domain gaps between different modalities. To address this issue, KPHD-Net, based on Hölder divergence, is proposed for multi-view classification and clustering tasks. Generally, our KPHD-Net employs a variational Dirichlet distribution to represent class probability distributions, models evidences from different views, and then integrates it with Dempster-Shafer evidence theory (DST) to improve uncertainty estimation effects. Our theoretical analysis demonstrates that Proper Hölder divergence offers a more effective measure of distribution discrepancies, ensuring enhanced performance in multi-view learning. Moreover, Dempster-Shafer evidence theory, recognized for its superior performance in multi-view fusion tasks, is introduced and combined with the Kalman filter to provide future state estimations. This integration further enhances the reliability of the final fusion results. Extensive experiments show that the proposed KPHD-Net outperforms the current state-of-the-art methods in both classification and clustering tasks regarding accuracy, robustness, and reliability, with theoretical guarantees.

[85] 3DGAA: Realistic and Robust 3D Gaussian-based Adversarial Attack for Autonomous Driving cs.CVPDF

Yixun Zhang, Lizhi Wang, Junjun Zhao, Wending Zhao, Feng Zhou

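TL;DR: 3DGAA is a novel adversarial attack framework that uses 3D Gaussian Splatting to jointly optimize geometry and appearance attributes, producing physically realizable adversarial objects and markedly improving attack robustness and realism.

Details

Motivation: Camera-based object detection systems in autonomous driving are vulnerable to adversarial threats in real-world environments. Existing 2D and 3D physical attacks typically optimize only texture and struggle to balance physical realism with attack robustness.

Result: In virtual and physical experiments, 3DGAA reduces detection mAP from 87.21% to 7.38%, significantly outperforming existing 3D physical attacks, and shows high transferability.

Insight: Jointly optimizing geometry and appearance is key to making adversarial attacks more realistic and effective, and the physical simulation module strengthens robustness in complex environments.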

Abstract: Camera-based object detection systems play a vital role in autonomous driving, yet they remain vulnerable to adversarial threats in real-world environments. While existing 2D and 3D physical attacks typically optimize texture, they often struggle to balance physical realism and attack robustness. In this work, we propose 3D Gaussian-based Adversarial Attack (3DGAA), a novel adversarial object generation framework that leverages the full 14-dimensional parameterization of 3D Gaussian Splatting (3DGS) to jointly optimize geometry and appearance in physically realizable ways. Unlike prior works that rely on patches or texture, 3DGAA jointly perturbs both geometric attributes (shape, scale, rotation) and appearance attributes (color, opacity) to produce physically realistic and transferable adversarial objects. We further introduce a physical filtering module to preserve geometric fidelity, and a physical augmentation module to simulate complex physical scenarios, thus enhancing attack generalization under real-world conditions. We evaluate 3DGAA on both virtual benchmarks and physical-world setups using miniature vehicle models. Experimental results show that 3DGAA reduces the detection mAP from 87.21% to 7.38%, significantly outperforming existing 3D physical attacks. Moreover, our method maintains high transferability across different physical conditions, demonstrating a new state-of-the-art in physically realizable adversarial attacks. These results validate 3DGAA as a practical attack framework for evaluating the safety of perception systems in autonomous driving.

[86] Leveraging Swin Transformer for enhanced diagnosis of Alzheimer’s disease using multi-shell diffusion MRI cs.CV | q-bio.NC | q-bio.QMPDF

Quentin Dessain, Nicolas Delinte, Bernard Hanseeuw, Laurence Dricot, Benoît Macq

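TL;DR: Using the Swin Transformer and multi-shell diffusion MRI data, the paper develops a deep learning framework for early diagnosis of Alzheimer’s disease and detection of amyloid accumulation.

Details

Motivation: Early diagnosis of Alzheimer’s disease and detection of amyloid accumulation are clinically important, but existing methods perform poorly when data are limited. The paper aims to address this using the microstructural information in multi-shell diffusion MRI together with a Transformer model.

Result: 1. Balanced accuracy of 95.2% for distinguishing cognitively normal individuals from those with Alzheimer’s disease dementia. 2. For amyloid detection, balanced accuracy of 77.2% for distinguishing amyloid-positive mild cognitive impairment/Alzheimer’s disease dementia subjects from amyloid-negative cognitively normal subjects, and 67.9% for identifying amyloid-positive individuals among cognitively normal subjects. 3. Explainability analysis shows that brain regions such as the parahippocampal gyrus and hippocampus contribute substantially to model predictions.

Insight: 1. Combining multi-shell diffusion MRI with Transformer models performs well in neuroimaging analysis. 2. Low-Rank Adaptation effectively addresses limited labeled data. 3. The strong results support the clinical potential of biomarker-driven diagnostics.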

Abstract: Objective: This study aims to support early diagnosis of Alzheimer’s disease and detection of amyloid accumulation by leveraging the microstructural information available in multi-shell diffusion MRI (dMRI) data, using a vision transformer-based deep learning framework. Methods: We present a classification pipeline that employs the Swin Transformer, a hierarchical vision transformer model, on multi-shell dMRI data for the classification of Alzheimer’s disease and amyloid presence. Key metrics from DTI and NODDI were extracted and projected onto 2D planes to enable transfer learning with ImageNet-pretrained models. To efficiently adapt the transformer to limited labeled neuroimaging data, we integrated Low-Rank Adaptation. We assessed the framework on diagnostic group prediction (cognitively normal, mild cognitive impairment, Alzheimer’s disease dementia) and amyloid status classification. Results: The framework achieved competitive classification results within the scope of multi-shell dMRI-based features, with the best balanced accuracy of 95.2% for distinguishing cognitively normal individuals from those with Alzheimer’s disease dementia using NODDI metrics. For amyloid detection, it reached 77.2% balanced accuracy in distinguishing amyloid-positive mild cognitive impairment/Alzheimer’s disease dementia subjects from amyloid-negative cognitively normal subjects, and 67.9% for identifying amyloid-positive individuals among cognitively normal subjects. Grad-CAM-based explainability analysis identified clinically relevant brain regions, including the parahippocampal gyrus and hippocampus, as key contributors to model predictions. Conclusion: This study demonstrates the promise of diffusion MRI and transformer-based architectures for early detection of Alzheimer’s disease and amyloid pathology, supporting biomarker-driven diagnostics in data-limited biomedical settings.
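
The results above are reported as balanced accuracy, which is the mean of per-class recall and therefore not inflated by class imbalance. The minimal sketch below shows the computation on a toy imbalanced cohort. The class labels and predictions are invented for illustration and have no connection to the study's data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, computed over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)


# Toy cohort: 8 cognitively normal (CN) and 2 dementia (AD) subjects.
truth = ["CN"] * 8 + ["AD"] * 2
preds = ["CN"] * 8 + ["AD", "CN"]  # one AD subject is missed
plain_accuracy = sum(t == p for t, p in zip(truth, preds)) / len(truth)
bal_acc = balanced_accuracy(truth, preds)
print(plain_accuracy, bal_acc)
```

In this toy case plain accuracy is 0.9 while balanced accuracy is 0.75, because missing one of only two AD subjects halves the recall for that class.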

[87] Vision-Based Anti Unmanned Aerial Technology: Opportunities and Challenges cs.CVPDF

Guanghai Ding, Yihua Ren, Yuting Liu, Qijun Zhao, Shuiwang Li

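TL;DR: The paper surveys vision-based anti-UAV technology, discusses its opportunities and challenges, and summarizes current mainstream methods, datasets, and future research directions.

Details

Motivation: With the rapid development and wide application of UAV technology, efficient and accurate anti-UAV tracking has become essential, especially in complex environments such as public safety and border patrol.

Result: By summarizing current techniques and challenges, the paper gives researchers dataset resources and algorithm analysis and identifies future research directions.

Insight: The key to anti-UAV technology lies in multi-sensor data fusion and efficient tracking algorithms for complex environments. Future research can further optimize both.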

Abstract: With the rapid advancement of UAV technology and its extensive application in various fields such as military reconnaissance, environmental monitoring, and logistics, achieving efficient and accurate Anti-UAV tracking has become essential. The importance of Anti-UAV tracking is increasingly prominent, especially in scenarios such as public safety, border patrol, search and rescue, and agricultural monitoring, where operations in complex environments can provide enhanced security. Current mainstream Anti-UAV tracking technologies are primarily centered around computer vision techniques, particularly those that integrate multi-sensor data fusion with advanced detection and tracking algorithms. This paper first reviews the characteristics and current challenges of Anti-UAV detection and tracking technologies. Next, it investigates and compiles several publicly available datasets, providing accessible links to support researchers in efficiently addressing related challenges. Furthermore, the paper analyzes the major vision-based and vision-fusion-based Anti-UAV detection and tracking algorithms proposed in recent years. Finally, based on the above research, this paper outlines future research directions, aiming to provide valuable insights for advancing the field.

[88] (Almost) Free Modality Stitching of Foundation Models cs.CV | cs.AI | cs.LG

Jaisidh Singh, Diganta Misra, Boris Knyazev, Antonio Orvieto

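TL;DR: This paper proposes Hypernetwork Model Alignment (Hyma), a method that efficiently selects the best uni-modal models and trains the connector module, addressing the high computational cost of model selection and connector training in multi-modal foundation models.

Details

Motivation: Multi-modal foundation models are typically built by connecting several pretrained uni-modal models, but selecting the models and training the connector modules is computationally expensive. Hyma is proposed to address this.

Result: Experiments show that Hyma reduces the cost of searching for the optimal uni-modal model pair by 10× while matching grid search on multi-modal benchmarks.

Insight: The paper demonstrates the potential of hypernetworks for building multi-modal models and offers a new approach to the complexity of model selection and connector training.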

Abstract: Foundation multi-modal models are often designed by stitching of multiple existing pretrained uni-modal models: for example, an image classifier with an autoregressive text model. This stitching process is performed by training a connector module that aims to align the representation-representation or representation-input spaces of these uni-modal models. However, given the complexity of training such connectors on large scale web-based datasets coupled with the ever-increasing number of available pretrained uni-modal models, the task of uni-modal models selection and subsequent connector module training becomes computationally demanding. To address this under-studied critical problem, we propose Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal uni-modal model selection and connector training by leveraging hypernetworks. Specifically, our framework utilizes the parameter prediction capability of a hypernetwork to obtain jointly trained connector modules for $N \times M$ combinations of uni-modal models. In our experiments, Hyma reduces the optimal uni-modal model pair search cost by $10\times$ (averaged across all experiments), while matching the ranking and trained connector performance obtained via grid search across a suite of diverse multi-modal benchmarks.

[89] LifelongPR: Lifelong knowledge fusion for point cloud place recognition based on replay and prompt learning cs.CV | cs.RO

Xianghong Zou, Jianping Li, Zhe Chen, Zhen Cao, Zhen Dong

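TL;DR: This paper proposes LifelongPR, a continual learning framework that addresses catastrophic forgetting in point cloud place recognition (PCPR) through dynamic replay sample selection and a lightweight prompt-learning module, significantly improving model performance.

Details

Motivation: In real-world PCPR deployments, models must continually adapt to dynamic, changing environments, but existing methods suffer from catastrophic forgetting, which degrades performance and complicates deployment. The authors propose LifelongPR to address this.

Result: On large-scale public and self-collected datasets, LifelongPR improves mIR@1 by 6.50% and mR@1 by 7.96%, and reduces forgetting (F) by 8.95%.

Insight: Combining dynamic sample selection with lightweight prompt learning is an effective route to continual learning, particularly for dynamic tasks such as PCPR.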

Abstract: Point cloud place recognition (PCPR) plays a crucial role in photogrammetry and robotics applications such as autonomous driving, intelligent transportation, and augmented reality. In real-world large-scale deployments of a positioning system, PCPR models must continuously acquire, update, and accumulate knowledge to adapt to diverse and dynamic environments, i.e., the ability known as continual learning (CL). However, existing PCPR models often suffer from catastrophic forgetting, leading to significant performance degradation in previously learned scenes when adapting to new environments or sensor types. This results in poor model scalability, increased maintenance costs, and system deployment difficulties, undermining the practicality of PCPR. To address these issues, we propose LifelongPR, a novel continual learning framework for PCPR, which effectively extracts and fuses knowledge from sequential point cloud data. First, to alleviate the knowledge loss, we propose a replay sample selection method that dynamically allocates sample sizes according to each dataset’s information quantity and selects spatially diverse samples for maximal representativeness. Second, to handle domain shifts, we design a prompt learning-based CL framework with a lightweight prompt module and a two-stage training strategy, enabling domain-specific feature adaptation while minimizing forgetting. Comprehensive experiments on large-scale public and self-collected datasets are conducted to validate the effectiveness of the proposed method. Compared with state-of-the-art (SOTA) methods, our method achieves 6.50% improvement in mIR@1, 7.96% improvement in mR@1, and an 8.95% reduction in F. The code and pre-trained models are publicly available at https://github.com/zouxianghong/LifelongPR.
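
The replay selection described above has two parts: splitting a replay budget across datasets according to their information quantity, and picking spatially diverse samples. The sketch below illustrates both under stated assumptions: the `info` scores are hypothetical placeholders for whatever information measure the authors use, and farthest point sampling is swapped in as a common stand-in for "spatially diverse" selection, not the paper's actual procedure.

```python
import math

def allocate_budget(info, budget):
    """Split `budget` replay slots across datasets in proportion to `info`,
    using largest-remainder rounding so the slots sum exactly to `budget`."""
    total = sum(info.values())
    raw = {k: budget * v / total for k, v in info.items()}
    slots = {k: int(math.floor(r)) for k, r in raw.items()}
    leftover = budget - sum(slots.values())
    # Hand the remaining slots to the largest fractional parts.
    by_fraction = sorted(raw, key=lambda k: raw[k] - slots[k], reverse=True)
    for k in by_fraction[:leftover]:
        slots[k] += 1
    return slots

def farthest_point_sampling(points, k):
    """Greedily pick up to k indices, each as far as possible from those
    already chosen, starting from index 0."""
    chosen = [0]
    dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=lambda i: dist[i])
        chosen.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return chosen

# Hypothetical information scores for two previously seen datasets.
print(allocate_budget({"city": 3.0, "campus": 1.0}, budget=8))
# Hypothetical 2D scan positions along a road.
print(farthest_point_sampling([(0, 0), (1, 0), (10, 0), (0.5, 0)], k=3))
```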

[90] CoSMo: A Multimodal Transformer for Page Stream Segmentation in Comic Books cs.CV

Marc Serra Ortega, Emanuele Vivoli, Artemis Llabrés, Dimosthenis Karatzas

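TL;DR: The paper presents CoSMo, a multimodal Transformer for Page Stream Segmentation (PSS) in comic books, and demonstrates experimentally its superiority in vision-only and multimodal settings.

Details

Motivation: Automated content understanding of comic books first requires page stream segmentation to support downstream tasks such as character analysis and story indexing. Existing methods, however, perform poorly on this unique medium.

Result: CoSMo outperforms the baselines on F1-Macro, Panoptic Quality, and stream-level metrics, with multimodality showing particular benefits in resolving ambiguities.

Insight: Visual features dominate the macro-structure of comic PSS, but multimodal features play an important role in resolving difficult ambiguities.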

Abstract: This paper introduces CoSMo, a novel multimodal Transformer for Page Stream Segmentation (PSS) in comic books, a critical task for automated content understanding, as it is a necessary first stage for many downstream tasks like character analysis, story indexing, or metadata enrichment. We formalize PSS for this unique medium and curate a new 20,800-page annotated dataset. CoSMo, developed in vision-only and multimodal variants, consistently outperforms traditional baselines and significantly larger general-purpose vision-language models across F1-Macro, Panoptic Quality, and stream-level metrics. Our findings highlight the dominance of visual features for comic PSS macro-structure, yet demonstrate multimodal benefits in resolving challenging ambiguities. CoSMo establishes a new state-of-the-art, paving the way for scalable comic book analysis.

[91] MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second cs.CV

Chenguo Lin, Yuchen Lin, Panwang Pan, Yifan Yu, Honglei Yan

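TL;DR: MoVieS is a feed-forward model that synthesizes 4D dynamic novel views from monocular video in one second. It represents dynamic 3D scenes with pixel-aligned grids of Gaussian primitives and explicitly supervises their time-varying motion, unifying the modeling of appearance, geometry, and motion.

Details

Motivation: Existing methods for novel view synthesis and geometry reconstruction of dynamic scenes often require complex optimization. MoVieS aims to perform these tasks efficiently in a unified framework while supporting zero-shot applications.

Result: Experiments show that MoVieS performs well across multiple tasks while delivering substantial speedups.

Insight: By modeling the attributes of dynamic scenes jointly, MoVieS shows the potential of an efficient, general framework and offers a new direction for dynamic scene understanding.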

Abstract: We present MoVieS, a novel feed-forward model that synthesizes 4D dynamic novel views from monocular videos in one second. MoVieS represents dynamic 3D scenes using pixel-aligned grids of Gaussian primitives, explicitly supervising their time-varying motion. This allows, for the first time, the unified modeling of appearance, geometry and motion, and enables view synthesis, reconstruction and 3D point tracking within a single learning-based framework. By bridging novel view synthesis with dynamic geometry reconstruction, MoVieS enables large-scale training on diverse datasets with minimal dependence on task-specific supervision. As a result, it also naturally supports a wide range of zero-shot applications, such as scene flow estimation and moving object segmentation. Extensive experiments validate the effectiveness and efficiency of MoVieS across multiple tasks, achieving competitive performance while offering several orders of magnitude speedups.

[92] A Transfer Learning-Based Method for Water Body Segmentation in Remote Sensing Imagery: A Case Study of the Zhada Tulin Area cs.CV | cs.LG

Haonan Chen, Xin Tong

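TL;DR: This paper proposes a two-stage transfer learning method based on SegFormer to address cross-domain and small-sample problems in water body segmentation of remote sensing imagery, and achieves a marked improvement in the Zhada Tulin area of Tibet.

Details

Motivation: To address the performance degradation caused by domain shift and small sample sizes in remote sensing water body segmentation, and to meet the need for applications in regions with highly complex topography and spectral features.

Result: On the target-domain dataset, IoU rises from 25.50% with direct transfer to 64.84%.

Insight: A two-stage transfer learning strategy effectively mitigates domain shift and offers a workable approach to remote sensing information extraction in data-scarce, environmentally unique settings.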

Abstract: To address the prevalent challenges of domain shift and small sample sizes in remote sensing image water body segmentation, this study proposes and validates a two-stage transfer learning strategy based on the SegFormer model. The approach begins by training a foundational segmentation model on a diverse source domain, where it achieves an Intersection over Union (IoU) of 68.80% on its validation set, followed by fine-tuning on data from the distinct target domain. Focusing on the Zhada Tulin area in Tibet – a region characterized by highly complex topography and spectral features – the experimental results demonstrate that this strategy significantly boosts the IoU for the water body segmentation task from 25.50% (for direct transfer) to 64.84%. This not only effectively resolves the model performance degradation caused by domain discrepancy but also provides an effective technical paradigm for high-precision thematic information extraction in data-scarce and environmentally unique remote sensing scenarios.
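
All three figures quoted above are Intersection over Union values. For reference, a minimal sketch of IoU on binary water masks follows; the masks are hypothetical toy data and the convention for two empty masks (returning 1.0) is an assumption of this sketch.

```python
def iou(pred, target):
    """Intersection over Union for two binary masks given as nested lists of 0/1."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0

# Hypothetical 2x3 masks: 1 marks a water pixel.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(iou(pred, target))
```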

[93] FIX-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text cs.CV

Bingchao Wang, Zhiwei Ning, Jianyu Ding, Xuanang Gao, Yin Li

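TL;DR: The paper proposes FIX-CLIP, which uses dual-branch training, regional prompt learners, and a hierarchical feature alignment module to overcome CLIP's limitations on long-text input, and boosts performance by synthesizing a large number of image-text pairs.

Details

Motivation: CLIP performs well on short-text tasks but, limited by the text encoder's input length, cannot handle long-text inputs (>77 tokens). FIX-CLIP aims to close this gap and improve the model's understanding of long text.

Result: FIX-CLIP achieves SOTA performance on long-text and short-text retrieval tasks and shows good plug-and-play capability in diffusion models with long-text input.

Insight: Through its dual-branch design and hierarchical feature alignment, FIX-CLIP removes CLIP's long-text bottleneck while preserving short-text ability; the synthetic data provides a rich training resource.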

Abstract: CLIP has shown promising performance across many short-text tasks in a zero-shot manner. However, limited by the input length of the text encoder, CLIP struggles on downstream tasks with long-text inputs (>77 tokens). To remedy this issue, we propose FIX-CLIP which includes three novel modules: (1) A dual-branch training pipeline that aligns short and long texts with masked and raw images respectively, which boosts the long-text representation while preserving the short-text ability. (2) Multiple learnable regional prompts with unidirectional masks in Transformer layers for regional information extraction. (3) A hierarchical feature alignment module in the intermediate encoder layers to promote the consistency of multi-scale features. Furthermore, we collect 30M images and utilize existing MLLMs to synthesize long-text captions for training. Extensive experiments show that FIX-CLIP achieves state-of-the-art performance on both long-text and short-text retrieval benchmarks. For downstream applications, we reveal that FIX-CLIP’s text encoder delivers promising performance in a plug-and-play manner for diffusion models with long-text input.

[94] DEARLi: Decoupled Enhancement of Recognition and Localization for Semi-supervised Panoptic Segmentation cs.CV

Ivan Martinović, Josip Šarić, Marin Oršić, Matej Kristan, Siniša Šegvić

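TL;DR: DEARLi is a new semi-supervised panoptic segmentation method that uses two dedicated foundation models (CLIP and SAM) to enhance recognition and localization respectively, significantly improving performance with few labels and large taxonomies.

Details

Motivation: Pixel-level annotation is expensive and time-consuming, and mechanisms for exploiting foundation models in semi-supervised segmentation remain underexplored. DEARLi addresses this by combining the capabilities of CLIP and SAM.

Result: On ADE20K, with only 158 labeled images, it achieves 29.9 PQ and 38.9 mIoU.

Insight: Decoupled enhancement of recognition and localization effectively improves semi-supervised panoptic segmentation, especially with scarce labels and large taxonomies.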

Abstract: Pixel-level annotation is expensive and time-consuming. Semi-supervised segmentation methods address this challenge by learning models on few labeled images alongside a large corpus of unlabeled images. Although foundation models could further account for label scarcity, effective mechanisms for their exploitation remain underexplored. We address this by devising a novel semi-supervised panoptic approach fueled by two dedicated foundation models. We enhance recognition by complementing unsupervised mask-transformer consistency with zero-shot classification of CLIP features. We enhance localization by class-agnostic decoder warm-up with respect to SAM pseudo-labels. The resulting decoupled enhancement of recognition and localization (DEARLi) particularly excels in the most challenging semi-supervised scenarios with large taxonomies and limited labeled data. Moreover, DEARLi outperforms the state of the art in semi-supervised semantic segmentation by a large margin while requiring 8x less GPU memory, in spite of being trained only for the panoptic objective. We observe 29.9 PQ and 38.9 mIoU on ADE20K with only 158 labeled images. The source code is available at https://github.com/helen1c/DEARLi.

[95] Taming Modern Point Tracking for Speckle Tracking Echocardiography via Impartial Motion cs.CV | cs.AI

Md Abulkalam Azad, John Nyberg, Håvard Dalen, Bjørnar Grenne, Lasse Lovstakken

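TL;DR: This paper studies modern point tracking methods in echocardiography and proposes an improved training strategy and a lightweight network, significantly improving the accuracy and generalization of motion estimation.

Details

Motivation: Traditional methods such as block matching or optical flow struggle with complex cardiac motion, while modern point tracking methods remain underexplored in echocardiography. This paper aims to fill that gap.

Result: Experiments show that fine-tuning with the proposed strategies substantially improves models over their baselines; EchoTracker, for instance, improves position accuracy by 60.7% and reduces median trajectory error by 61.5%. Clinical evaluation shows that these methods improve GLS measurements, aligning more closely with expert-validated, semi-automated tools.

Insight: Modern point tracking methods have limitations in echocardiography, but targeted improvements can significantly boost performance. A lightweight design may outperform complex spatiotemporal models on this task.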

Abstract: Accurate motion estimation for tracking deformable tissues in echocardiography is essential for precise cardiac function measurements. While traditional methods like block matching or optical flow struggle with intricate cardiac motion, modern point tracking approaches remain largely underexplored in this domain. This work investigates the potential of state-of-the-art (SOTA) point tracking methods for ultrasound, with a focus on echocardiography. Although these novel approaches demonstrate strong performance in general videos, their effectiveness and generalizability in echocardiography remain limited. By analyzing cardiac motion throughout the heart cycle in real B-mode ultrasound videos, we identify that a directional motion bias across different views is affecting the existing training strategies. To mitigate this, we refine the training procedure and incorporate a set of tailored augmentations to reduce the bias and enhance tracking robustness and generalization through impartial cardiac motion. We also propose a lightweight network leveraging multi-scale cost volumes from spatial context alone to challenge the advanced spatiotemporal point tracking models. Experiments demonstrate that fine-tuning with our strategies significantly improves models’ performances over their baselines, even for out-of-distribution (OOD) cases. For instance, EchoTracker boosts overall position accuracy by 60.7% and reduces median trajectory error by 61.5% across heart cycle phases. Interestingly, several point tracking models fail to outperform our proposed simple model in terms of tracking accuracy and generalization, reflecting their limitations when applied to echocardiography. Nevertheless, clinical evaluation reveals that these methods improve GLS measurements, aligning more closely with expert-validated, semi-automated tools and thus demonstrating better reproducibility in real-world applications.

[96] Deep Recurrence for Dynamical Segmentation Models cs.CV | cs.LG

David Calhas, Arlindo L. Oliveira

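TL;DR: This paper proposes a predictive-coding-based feedback mechanism that uses a recurrent loop to improve segmentation models, significantly outperforming feedforward models in noisy conditions.

Details

Motivation: Biological vision systems rely on feedback connections to refine perception iteratively, whereas artificial neural networks are mostly static and feedforward, lacking dynamic adjustment. The authors therefore propose a feedback mechanism to improve robustness and data efficiency.

Result: The feedback model significantly outperforms the feedforward model under noise and exceeds random performance with only two training examples (the feedforward model needs at least four), showing stronger robustness and generalization.

Insight: Recurrent feedback not only mimics the adaptivity of biological vision but also markedly improves performance in low-data or noisy settings, pointing toward more biologically inspired neural architectures.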

Abstract: While biological vision systems rely heavily on feedback connections to iteratively refine perception, most artificial neural networks remain purely feedforward, processing input in a single static pass. In this work, we propose a predictive coding inspired feedback mechanism that introduces a recurrent loop from output to input, allowing the model to refine its internal state over time. We implement this mechanism within a standard U-Net architecture and introduce two biologically motivated operations, softmax projection and exponential decay, to ensure stability of the feedback loop. Through controlled experiments on a synthetic segmentation task, we show that the feedback model significantly outperforms its feedforward counterpart in noisy conditions and generalizes more effectively with limited supervision. Notably, feedback achieves above random performance with just two training examples, while the feedforward model requires at least four. Our findings demonstrate that feedback enhances robustness and data efficiency, and offer a path toward more adaptive and biologically inspired neural architectures. Code is available at: github.com/DCalhas/feedback_segmentation.

[97] SlumpGuard: An AI-Powered Real-Time System for Automated Concrete Slump Prediction via Video Analysis cs.CV

Youngmin Kim, Giyeong Oh, Kwangsoo Youm, Youngjae Yu

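TL;DR: The paper proposes SlumpGuard, an AI-based real-time video analysis system that automatically predicts concrete slump, addressing the inefficiency and inconsistency of traditional slump testing.

Details

Motivation: Traditional slump testing is inefficient and depends on manual operation, so it cannot provide real-time monitoring, which hampers construction quality control.

Result: Real-world deployment results show that SlumpGuard significantly improves the accuracy and efficiency of concrete quality control.

Insight: Combining video analysis with AI offers a new approach to real-time monitoring of civil engineering material properties.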

Abstract: Concrete workability is essential for construction quality, with the slump test being the most common on-site method for its assessment. However, traditional slump testing is manual, time-consuming, and prone to inconsistency, limiting its applicability for real-time monitoring. To address these challenges, we propose SlumpGuard, an AI-powered, video-based system that automatically analyzes concrete flow from the truck chute to assess workability in real time. Our system enables full-batch inspection without manual intervention, improving both the accuracy and efficiency of quality control. We present the system design, the construction of a dedicated dataset, and empirical results from real-world deployment, demonstrating the effectiveness of SlumpGuard as a practical solution for modern concrete quality assurance.

[98] A Training-Free, Task-Agnostic Framework for Enhancing MLLM Performance on High-Resolution Images cs.CV | cs.AI

Jaeseong Lee, Yeeun Choi, Heechan Choi, Hanjung Kim, Seonjoo Kim

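TL;DR: This paper proposes Extract Candidate then Predict (ECP), a training-free, task-agnostic two-stage framework that improves multimodal large language model (MLLM) performance on high-resolution images, addressing the performance drop that existing methods suffer from resolution discrepancy.

Details

Motivation: MLLMs perform poorly on high-resolution images mainly because training and test resolutions differ: feeding high-resolution images directly generalizes poorly, while downsampling loses fine-grained visual detail.

Result: On 4K GUI grounding and on 4K and 8K MLLM perception tasks, it achieves absolute improvements of 21.3%, 5.8%, and 5.2% respectively.

Insight: Using the implicit cues in predictions on low-resolution images to guide prediction on the high-resolution image improves performance without sacrificing detail.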

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language understanding, reasoning, and generation. However, they struggle with tasks requiring fine-grained localization and reasoning in high-resolution images. This constraint stems from the fact that MLLMs are fine-tuned with fixed image resolution to align with the pre-trained image encoder used in the MLLM. Consequently, feeding high-resolution images directly into MLLMs leads to poor generalization due to a train-test resolution discrepancy, while downsampling these images, although it ensures consistency, compromises fine-grained visual details and ultimately degrades performance. To address this challenge, we propose Extract Candidate then Predict (ECP), a novel training-free, task-agnostic two-stage framework designed to enhance MLLM performance on high-resolution images. The key intuition behind ECP is that while MLLMs struggle with high-resolution images, their predictions on downsampled images still contain implicit localization cues. By first identifying a candidate region using the coarse prediction and then predicting the final output based on that candidate region, ECP effectively preserves fine-grained details while mitigating the challenges posed by high-resolution data. We validate our framework on 4K GUI grounding and 4K, 8K MLLM perception, achieving +21.3%, +5.8%, +5.2% absolute improvement compared to baseline respectively, demonstrating its effectiveness. Code is available at https://github.com/yenncye/ECP.
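
The first stage of the two-stage idea, turning a coarse prediction on the downsampled image into a full-resolution candidate region, can be sketched as a coordinate mapping. This is an illustration under assumptions, not the authors' code: the box format, the `pad` margin, and the function name are hypothetical, and the second stage (running the MLLM on the crop) is left as a comment.

```python
def candidate_region(coarse_box, low_size, high_size, pad=0.5):
    """Map a box predicted on the downsampled image to full-resolution pixels,
    enlarge it by `pad` (fraction of the box size on each side), and clamp it.

    coarse_box: (x0, y0, x1, y1) in low-resolution pixels.
    low_size, high_size: (width, height) of the two images.
    """
    sx = high_size[0] / low_size[0]
    sy = high_size[1] / low_size[1]
    x0, y0, x1, y1 = coarse_box
    x0, x1 = x0 * sx, x1 * sx
    y0, y1 = y0 * sy, y1 * sy
    dx, dy = (x1 - x0) * pad, (y1 - y0) * pad
    return (
        max(0, int(x0 - dx)),
        max(0, int(y0 - dy)),
        min(high_size[0], int(x1 + dx)),
        min(high_size[1], int(y1 + dy)),
    )

# Hypothetical coarse prediction on a 480x270 downsample of a 4K screenshot.
region = candidate_region((100, 50, 140, 70), (480, 270), (3840, 2160))
print(region)
# Stage two (not shown): crop the 4K image to `region` and query the model again.
```

The margin matters because the coarse box is only a cue; a generous pad trades some resolution for a lower chance of cutting off the target.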

[99] Improving Multimodal Learning via Imbalanced Learning cs.CV

Shicai Wei, Chunbo Luo, Yang Luo

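TL;DR: This paper proposes the Asymmetric Representation Learning (ARL) strategy, which improves multimodal learning through imbalanced optimization; it proves that a modality dependence inversely proportional to modality variance is optimal and validates this experimentally.

Details

Motivation: Multimodal learning often underperforms because of imbalanced learning across modalities. Conventional methods try to fix this with gradient balancing, but the authors argue that imbalanced learning may be better.

Result: Experiments on multiple datasets validate the effectiveness and versatility of ARL.

Insight: Imbalanced rather than balanced learning may be the optimal strategy for multimodal learning; dependence on each modality should be inversely proportional to its variance.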

Abstract: Multimodal learning often encounters the under-optimized problem and may perform worse than unimodal learning. Existing approaches attribute this issue to imbalanced learning across modalities and tend to address it through gradient balancing. However, this paper argues that balanced learning is not the optimal setting for multimodal learning. With bias-variance analysis, we prove that imbalanced dependency on each modality obeying the inverse ratio of their variances contributes to optimal performance. To this end, we propose the Asymmetric Representation Learning (ARL) strategy to assist multimodal learning via imbalanced optimization. ARL introduces auxiliary regularizers for each modality encoder to calculate their prediction variance. ARL then calculates coefficients via the unimodal variance to re-weight the optimization of each modality, forcing the modality dependence ratio to be inversely proportional to the modality variance ratio. Moreover, to minimize the generalization error, ARL further introduces the prediction bias of each modality and jointly optimizes them with multimodal loss. Notably, all auxiliary regularizers share parameters with the multimodal model and rely only on the modality representation. Thus the proposed ARL strategy introduces no extra parameters and is independent of the structures and fusion methods of the multimodal model. Finally, extensive experiments on various datasets validate the effectiveness and versatility of ARL. Code is available at https://github.com/shicaiwei123/ICCV2025-ARL
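
The "inverse ratio of their variances" rule has the same shape as classical inverse-variance weighting, where independent unbiased estimates are combined with weights proportional to 1/variance. The sketch below shows only that classical rule with hypothetical variance values; it is not ARL's regularizer or training loop.

```python
def inverse_variance_weights(variances):
    """Normalized weights proportional to 1/variance for each modality."""
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    return [w / total for w in inverse]

# Hypothetical prediction variances for an audio and a video modality.
variances = [0.5, 2.0]
weights = inverse_variance_weights(variances)
print(weights)  # the lower-variance modality receives the larger weight

# Variance of the weighted combination, assuming independent modalities.
fused_variance = sum(w * w * v for w, v in zip(weights, variances))
print(fused_variance)
```

With a variance ratio of 1:4 the weights come out in the ratio 4:1, and under the independence assumption the fused variance falls below that of either modality alone.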

[100] Boosting Multimodal Learning via Disentangled Gradient Learning cs.CV

Shicai Wei, Chunbo Luo, Yang Luo

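TL;DR: This paper proposes the disentangled gradient learning (DGL) framework, which decouples the optimization of the modality encoders and the modality fusion module to resolve the optimization conflict in multimodal learning and improve multimodal model performance.

Details

Motivation: Multimodal learning is often under-optimized and can even perform worse than unimodal models. Existing methods attribute this to imbalanced learning between modalities but cannot explain why the dominant modality also underperforms in multimodal models.

Result: Experiments across multiple modalities, tasks, and frameworks validate the effectiveness and versatility of DGL.

Insight: Decoupling the optimization of the modality encoders from that of the fusion module is key to improving multimodal learning.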

Abstract: Multimodal learning often encounters the under-optimized problem and may have worse performance than unimodal learning. Existing methods attribute this problem to the imbalanced learning between modalities and rebalance them through gradient modulation. However, they fail to explain why the dominant modality in multimodal models also underperforms that in unimodal learning. In this work, we reveal the optimization conflict between the modality encoder and modality fusion module in multimodal models. Specifically, we prove that the cross-modal fusion in multimodal models decreases the gradient passed back to each modality encoder compared with unimodal models. Consequently, the performance of each modality in the multimodal model is inferior to that in the unimodal model. To this end, we propose a disentangled gradient learning (DGL) framework to decouple the optimization of the modality encoder and modality fusion module in the multimodal model. DGL truncates the gradient back-propagated from the multimodal loss to the modality encoder and replaces it with the gradient from unimodal loss. Besides, DGL removes the gradient back-propagated from the unimodal loss to the modality fusion module. This helps eliminate the gradient interference between the modality encoder and modality fusion module while ensuring their respective optimization processes. Finally, extensive experiments on multiple types of modalities, tasks, and frameworks with dense cross-modal interaction demonstrate the effectiveness and versatility of the proposed DGL. Code is available at https://github.com/shicaiwei123/ICCV2025-GDL

[101] Straighten Viscous Rectified Flow via Noise Optimization cs.CV

Jimin Dai, Jiexi Yan, Jian Yang, Lei Luo

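TL;DR: The paper proposes the VRFNO framework, which improves Rectified Flow through noise optimization and a historical velocity term, addressing Reflow's distribution gap when generating high-quality images and improving one-step and few-step generation.

Details

Motivation: Reflow constructs deterministic couplings between noise and images during training to improve generation trajectories, but suffers from a distribution gap when generating high-quality images quickly.

Result: Experiments on synthetic and real data show that VRFNO significantly outperforms Reflow and achieves SOTA performance in one-step and few-step generation.

Insight: Combining noise optimization with a historical velocity term effectively addresses trajectory deviation in generative models and improves generation quality.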

Abstract: The Reflow operation aims to straighten the inference trajectories of the rectified flow during training by constructing deterministic couplings between noises and images, thereby improving the quality of generated images in single-step or few-step generation. However, we identify critical limitations in Reflow, particularly its inability to rapidly generate high-quality images due to a distribution gap between images in its constructed deterministic couplings and real images. To address these shortcomings, we propose a novel alternative called Straighten Viscous Rectified Flow via Noise Optimization (VRFNO), which is a joint training framework integrating an encoder and a neural velocity field. VRFNO introduces two key innovations: (1) a historical velocity term that enhances trajectory distinction, enabling the model to more accurately predict the velocity of the current trajectory, and (2) the noise optimization through reparameterization to form optimized couplings with real images which are then utilized for training, effectively mitigating errors caused by Reflow’s limitations. Comprehensive experiments on synthetic data and real datasets with varying resolutions show that VRFNO significantly mitigates the limitations of Reflow, achieving state-of-the-art performance in both one-step and few-step generation tasks.
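
To see why straight trajectories matter for one-step generation, recall the standard rectified flow setup: a noise sample and an image are joined by the straight line x_t = (1 - t) x_0 + t x_1, and the velocity target is x_1 - x_0. The sketch below covers only this baseline with a toy two-dimensional "image"; it does not include VRFNO's historical velocity term or noise optimization, and the constant velocity field stands in for a trained network.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity field along that straight path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise = [0.0, 2.0]   # hypothetical noise sample
image = [1.0, -1.0]  # hypothetical two-pixel "image"
v = target_velocity(noise, image)

# With a perfectly straight (constant-velocity) trajectory, a single Euler
# step already lands on the image; curved trajectories would need more steps.
print(euler_sample(noise, lambda x, t: v, steps=1))
```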

[102] Spatial Lifting for Dense Prediction cs.CV | cs.LG | eess.IV

Mingzhi Xu, Yizhe Zhang

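TL;DR: The paper proposes Spatial Lifting (SL), which lifts standard inputs (such as 2D images) into a higher-dimensional space (such as 3D) and processes them with a higher-dimensional network, achieving efficient performance on dense prediction tasks.

Details

Motivation: Existing dense prediction methods typically rely on complex 2D networks with high parameter counts and compute costs. SL aims to reduce model parameters and inference cost through higher-dimensional modeling while maintaining or improving performance.

Result: SL is validated on 19 benchmark datasets (13 for semantic segmentation, 6 for depth estimation), maintaining competitive performance while reducing the parameter count by over 98% (in the U-Net case).

Insight: SL presents a new vision modeling paradigm: lifting the dimensionality can improve dense prediction performance while reducing resource consumption. This idea opens the way to more efficient and accurate vision models.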

Abstract: We present Spatial Lifting (SL), a novel methodology for dense prediction tasks. SL operates by lifting standard inputs, such as 2D images, into a higher-dimensional space and subsequently processing them using networks designed for that higher dimension, such as a 3D U-Net. Counterintuitively, this dimensionality lifting allows us to achieve good performance on benchmark tasks compared to conventional approaches, while reducing inference costs and significantly lowering the number of model parameters. The SL framework produces intrinsically structured outputs along the lifted dimension. This emergent structure facilitates dense supervision during training and enables robust, near-zero-additional-cost prediction quality assessment at test time. We validate our approach across 19 benchmark datasets (13 for semantic segmentation and 6 for depth estimation), demonstrating competitive dense prediction performance while reducing the model parameter count by over 98% (in the U-Net case) and lowering inference costs. Spatial Lifting introduces a new vision modeling paradigm that offers a promising path toward more efficient, accurate, and reliable deep networks for dense prediction tasks in vision.

[103] ProGait: A Multi-Purpose Video Dataset and Benchmark for Transfemoral Prosthesis Users cs.CV | cs.AI

Xiangyu Yin, Boyuan Yang, Weichen Liu, Qiyao Xue, Abrar Alamri

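TL;DR: ProGait is a multi-purpose dataset for video object segmentation, 2D human pose estimation, and gait analysis, designed for transfemoral prosthesis users. It provides 412 video clips and baseline models that generalize better on prosthesis-related tasks.

Details

Motivation: Prostheses are crucial in rehabilitation, but existing computer vision methods perform poorly at detecting and analyzing them because of their unique appearance and movement patterns.

Result: The baseline models outperform pre-trained vision models on prosthesis-related tasks, showing better generalization.

Insight: ProGait fills the data gap in vision-based analysis of prosthesis users and provides a practical tool for prosthesis design and optimization.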

Abstract: Prosthetic legs play a pivotal role in clinical rehabilitation, allowing individuals with lower-limb amputations the ability to regain mobility and improve their quality of life. Gait analysis is fundamental for optimizing prosthesis design and alignment, directly impacting the mobility and life quality of individuals with lower-limb amputations. Vision-based machine learning (ML) methods offer a scalable and non-invasive solution to gait analysis, but face challenges in correctly detecting and analyzing prosthesis, due to their unique appearances and new movement patterns. In this paper, we aim to bridge this gap by introducing a multi-purpose dataset, namely ProGait, to support multiple vision tasks including Video Object Segmentation, 2D Human Pose Estimation, and Gait Analysis (GA). ProGait provides 412 video clips from four above-knee amputees when testing multiple newly-fitted prosthetic legs through walking trials, and depicts the presence, contours, poses, and gait patterns of human subjects with transfemoral prosthetic legs. Alongside the dataset itself, we also present benchmark tasks and fine-tuned baseline models to illustrate the practical application and performance of the ProGait dataset. We compared our baseline models against pre-trained vision models, demonstrating improved generalizability when applying the ProGait dataset for prosthesis-specific tasks. Our code is available at https://github.com/pittisl/ProGait and dataset at https://huggingface.co/datasets/ericyxy98/ProGait.

[104] Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection cs.CV

Jinglun Li, Kaixun Jiang, Zhaoyu Chen, Bo Lin, Yao Tang

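TL;DR: SynOOD uses diffusion models and multimodal large language models to generate synthetic OOD samples near the boundary, and fine-tunes CLIP to sharpen the boundary between InD and OOD samples, significantly improving OOD detection.

Details

Motivation: Existing vision-language models excel at OOD detection but still misclassify OOD samples that lie close to InD data in feature space. Using emerging foundation models (such as diffusion models and MLLMs) to generate near-boundary OOD samples can further improve OOD detection.

Result: On the ImageNet benchmark, SynOOD significantly surpasses existing methods, improving AUROC by 2.80% and reducing FPR95 by 11.13%.

Insight: Generating near-boundary OOD samples with foundation models is an effective strategy for improving OOD detection, raising performance without a significant increase in computational overhead.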

Abstract: Pre-trained vision-language models have exhibited remarkable abilities in detecting out-of-distribution (OOD) samples. However, some challenging OOD samples, which lie close to in-distribution (InD) data in image feature space, can still lead to misclassification. The emergence of foundation models like diffusion models and multimodal large language models (MLLMs) offers a potential solution to this issue. In this work, we propose SynOOD, a novel approach that harnesses foundation models to generate synthetic, challenging OOD data for fine-tuning CLIP models, thereby enhancing boundary-level discrimination between InD and OOD samples. Our method uses an iterative in-painting process guided by contextual prompts from MLLMs to produce nuanced, boundary-aligned OOD samples. These samples are refined through noise adjustments based on gradients from OOD scores like the energy score, effectively sampling from the InD/OOD boundary. With these carefully synthesized images, we fine-tune the CLIP image encoder and negative label features derived from the text encoder to strengthen connections between near-boundary OOD samples and a set of negative labels. Finally, SynOOD achieves state-of-the-art performance on the large-scale ImageNet benchmark, with minimal increases in parameters and runtime. Our approach significantly surpasses existing methods, improving AUROC by 2.80% and reducing FPR95 by 11.13%. Code is available at https://github.com/Jarvisgivemeasuit/SynOOD.
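
The abstract mentions the energy score as one OOD score whose gradients steer the refinement. In its commonly used form the score is the negative temperature-scaled log-sum-exp of the classifier logits, with lower energy suggesting in-distribution input. A minimal sketch of that standard definition follows; the logits are hypothetical, and how SynOOD applies the score (including its sign convention) is not shown here.

```python
import math

def energy_score(logits, temperature=1.0):
    """Energy E(x) = -T * log(sum_i exp(logit_i / T)), computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return -temperature * lse

confident = [10.0, 0.0, 0.0]  # hypothetical logits with one dominant class
uncertain = [0.0, 0.0, 0.0]   # hypothetical flat logits

# The confident prediction has the lower (more negative) energy.
print(energy_score(confident), energy_score(uncertain))
```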

[105] Transferring Styles for Reduced Texture Bias and Improved Robustness in Semantic Segmentation Networks cs.CV

Ben Hamscher, Edgar Heinert, Annika Mütze, Kira Maag, Matthias Rottmann

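TL;DR: Style transfer reduces the texture bias of semantic segmentation networks and improves their robustness; the method augments data with random style regions formed from Voronoi cells, and experiments confirm its effectiveness.

Details

Motivation: Prior work shows that stylized images reduce the texture bias of DNNs in image classification and improve robustness; this paper investigates whether the same holds for semantic segmentation.

Result: On Cityscapes and PASCAL Context, the method markedly reduces texture bias and improves robustness to image corruptions and adversarial attacks.

Insight: Style transfer applies not only to image classification but also to semantic segmentation, strengthening robustness through reliance on shape features.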

Abstract: Recent research has investigated the shape and texture biases of deep neural networks (DNNs) in image classification which influence their generalization capabilities and robustness. It has been shown that, in comparison to regular DNN training, training with stylized images reduces texture biases in image classification and improves robustness with respect to image corruptions. In an effort to advance this line of research, we examine whether style transfer can likewise deliver these two effects in semantic segmentation. To this end, we perform style transfer with style varying across artificial image areas. Those random areas are formed by a chosen number of Voronoi cells. The resulting style-transferred data is then used to train semantic segmentation DNNs with the objective of reducing their dependence on texture cues while enhancing their reliance on shape-based features. In our experiments, it turns out that in semantic segmentation, style transfer augmentation reduces texture bias and strongly increases robustness with respect to common image corruptions as well as adversarial attacks. These observations hold for convolutional neural networks and transformer architectures on the Cityscapes dataset as well as on PASCAL Context, showing the generality of the proposed method.
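
The random areas are described as being formed by a chosen number of Voronoi cells. A minimal sketch of generating such a partition follows: random sites are drawn and every pixel is labeled with its nearest site. The function name, the uniform site sampling, and the seed are assumptions of this sketch; the style transfer applied per cell is not shown.

```python
import random

def voronoi_labels(height, width, num_cells, seed=0):
    """Label each pixel with the index of its nearest random site."""
    rng = random.Random(seed)
    sites = [(rng.randrange(height), rng.randrange(width)) for _ in range(num_cells)]
    labels = []
    for r in range(height):
        row = []
        for c in range(width):
            nearest = min(
                range(num_cells),
                key=lambda k: (sites[k][0] - r) ** 2 + (sites[k][1] - c) ** 2,
            )
            row.append(nearest)
        labels.append(row)
    return labels, sites

labels, sites = voronoi_labels(height=6, width=12, num_cells=3, seed=1)
for row in labels:
    print("".join(str(v) for v in row))
# Each label would then select a different style image for that region.
```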

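[106] Kaleidoscopic Background Attack: Disrupting Pose Estimation with Multi-Fold Radial Symmetry Textures cs.CV

Xinlong Ding, Hongwei Yu, Jiawei Li, Feifan Li, Yu Shang

TL;DR: This paper proposes the Kaleidoscopic Background Attack (KBA), which degrades camera pose estimation using textures with multi-fold radial symmetry, and introduces a projected orientation consistency loss to optimize the attack.

Details

Motivation: Camera pose estimation is a fundamental computer vision task, but in object-centric scenes with sparse inputs, background textures can strongly affect its accuracy. The paper sets out to attack pose estimation models with specially designed background textures.

Result: Experiments show that optimized adversarial KBA backgrounds effectively attack a range of camera pose estimation models.

Insight: Background texture has a non-negligible effect on pose estimation; combining radial symmetry with an optimized loss makes the attack markedly more effective.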

Abstract: Camera pose estimation is a fundamental computer vision task that is essential for applications like visual localization and multi-view stereo reconstruction. In the object-centric scenarios with sparse inputs, the accuracy of pose estimation can be significantly influenced by background textures that occupy major portions of the images across different viewpoints. In light of this, we introduce the Kaleidoscopic Background Attack (KBA), which uses identical segments to form discs with multi-fold radial symmetry. These discs maintain high similarity across different viewpoints, enabling effective attacks on pose estimation models even with natural texture segments. Additionally, a projected orientation consistency loss is proposed to optimize the kaleidoscopic segments, leading to significant enhancement in the attack effectiveness. Experimental results show that optimized adversarial kaleidoscopic backgrounds can effectively attack various camera pose estimation models.
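
A disc with multi-fold radial symmetry can be built by folding every polar angle back into a single wedge, so that one segment is repeated around the centre. The sketch below assumes a hypothetical `segment_texture(r, phi)` callable standing in for the natural or optimized segment; the paper's actual textures and loss are not reproduced.

```python
import math

def kaleidoscope_disc(size, folds, segment_texture):
    """Fill a size x size grid with a disc made of `folds` identical wedges.

    segment_texture(r, phi) gives the value at radius r in [0, 1] and
    angle phi in [0, 2*pi/folds); pixels outside the disc stay None.
    """
    centre = (size - 1) / 2.0
    wedge = 2.0 * math.pi / folds
    disc = [[None] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            dx, dy = x - centre, y - centre
            r = math.hypot(dx, dy) / centre
            if r > 1.0:
                continue
            theta = math.atan2(dy, dx) % (2.0 * math.pi)
            disc[y][x] = segment_texture(r, theta % wedge)  # fold into one wedge
    return disc

# Hypothetical smooth segment texture, used only for illustration.
demo = kaleidoscope_disc(9, 4, lambda r, phi: r + phi * (math.pi / 2 - phi))
```

Rotating the resulting disc by one wedge angle leaves it unchanged, which is the property that makes views from different directions look alike.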

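[107] FTCFormer: Fuzzy Token Clustering Transformer for Image Classification cs.CV

Muyi Bao, Changyu Zeng, Yifan Wang, Zhengni Yang, Zimu Wang

TL;DR: Proposes FTCFormer, a new Transformer that uses clustering to dynamically generate semantically meaningful vision tokens, improving image classification.

Details

Motivation: Most existing Transformers generate tokens from a uniform grid, ignoring the semantic importance of image regions and yielding suboptimal feature representations.

Result: Outperforms the TCFormer baseline across 32 datasets, with gains of 1.43% on fine-grained, 1.09% on natural image, 0.97% on medical, and 0.55% on remote sensing datasets.

Insight: Dynamic token allocation captures semantic importance better, and a clustering-based strategy handles irregularly shaped regions flexibly.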

Abstract: Transformer-based deep neural networks have achieved remarkable success across various computer vision tasks, largely attributed to their long-range self-attention mechanism and scalability. However, most transformer architectures embed images into uniform, grid-based vision tokens, neglecting the underlying semantic meanings of image regions, resulting in suboptimal feature representations. To address this issue, we propose Fuzzy Token Clustering Transformer (FTCFormer), which incorporates a novel clustering-based downsampling module to dynamically generate vision tokens based on the semantic meanings instead of spatial positions. It allocates fewer tokens to less informative regions and more to semantically important regions, regardless of their spatial adjacency or shape irregularity. To further enhance feature extraction and representation, we propose a Density Peak Clustering-Fuzzy K-Nearest Neighbor (DPC-FKNN) mechanism for clustering center determination, a Spatial Connectivity Score (SCS) for token assignment, and a channel-wise merging (Cmerge) strategy for token merging. Extensive experiments on 32 datasets across diverse domains validate the effectiveness of FTCFormer on image classification, showing consistent improvements over the TCFormer baseline, with gains of 1.43% on five fine-grained datasets, 1.09% on six natural image datasets, 0.97% on three medical datasets and 0.55% on four remote sensing datasets. The code is available at: https://github.com/BaoBao0926/FTCFormer/tree/main.
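
The abstract builds on Density Peak Clustering for choosing cluster centers. As a rough illustration, the sketch below implements plain density-peak scoring with a k-nearest-neighbour density. It is not the paper's DPC-FKNN variant, whose fuzzy weighting is not described in the abstract, and the points stand in for token features.

```python
import math

def density_peak_scores(points, k=2):
    """Score each point by density * distance to the nearest denser point."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:k]
        rho.append(math.exp(-sum(d * d for d in nearest) / k))  # kNN density
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # The densest point has no denser neighbour: use its largest distance.
        delta.append(min(higher) if higher else max(dist[i]))
    return [r * d for r, d in zip(rho, delta)]

# Two hypothetical groups of token features in 2D.
tokens = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
scores = density_peak_scores(tokens)
centers = sorted(range(len(tokens)), key=lambda i: -scores[i])[:2]
print(sorted(centers))
```

Points that are both dense and far from any denser point score highest, so the top-scoring points serve as cluster centers and the remaining tokens would be merged into them.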

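[108] Show and Polish: Reference-Guided Identity Preservation in Face Video Restoration cs.CV

Wenkang Han, Wang Lin, Yiyun Zhou, Qi Liu, Shulei Wang

TL;DR: IP-FVR is a reference-guided method for restoring degraded face videos. It preserves identity through decoupled cross-attention and addresses intra-clip and inter-clip identity drift with feedback learning and a blending strategy.

Details

Motivation: Traditional face video restoration methods struggle to preserve fine-grained, identity-specific features under severe degradation, producing faces that lack individual characteristics.

Result: On synthetic and real-world datasets, IP-FVR outperforms existing methods in both restoration quality and identity consistency.

Insight: A reference image combined with dynamically adjusted attention significantly improves identity preservation, and negative prompts effectively constrain generation quality.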

Abstract: Face Video Restoration (FVR) aims to recover high-quality face videos from degraded versions. Traditional methods struggle to preserve fine-grained, identity-specific features when degradation is severe, often producing average-looking faces that lack individual characteristics. To address these challenges, we introduce IP-FVR, a novel method that leverages a high-quality reference face image as a visual prompt to provide identity conditioning during the denoising process. IP-FVR incorporates semantically rich identity information from the reference image using decoupled cross-attention mechanisms, ensuring detailed and identity consistent results. For intra-clip identity drift (within 24 frames), we introduce an identity-preserving feedback learning method that combines cosine similarity-based reward signals with suffix-weighted temporal aggregation. This approach effectively minimizes drift within sequences of frames. For inter-clip identity drift, we develop an exponential blending strategy that aligns identities across clips by iteratively blending frames from previous clips during the denoising process. This method ensures consistent identity representation across different clips. Additionally, we enhance the restoration process with a multi-stream negative prompt, guiding the model’s attention to relevant facial attributes and minimizing the generation of low-quality or incorrect features. Extensive experiments on both synthetic and real-world datasets demonstrate that IP-FVR outperforms existing methods in both quality and identity preservation, showcasing its substantial potential for practical applications in face video restoration.

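[109] DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs cs.CV

Jiahe Zhao, Rongkun Zheng, Yi Wang, Helin Wang, Hengshuang Zhao

TL;DR: DisCo is a new visual encapsulation method designed for semantic distinctness and temporal coherence, which substantially improves the performance of video multimodal large language models.

Details

Motivation: In video MLLMs, the widely used linear projectors introduce semantic indistinctness and temporal incoherence during visual encapsulation, and existing methods have not effectively solved these problems.

Result: Clearly outperforms existing methods on a variety of video understanding benchmarks while achieving higher token efficiency thanks to reduced semantic indistinctness.

Insight: Semantic distinctness and temporal coherence of visual tokens are critical to video MLLM performance, and DisCo offers an effective solution.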

Abstract: In video Multimodal Large Language Models (video MLLMs), the visual encapsulation process plays a pivotal role in converting video contents into representative tokens for LLM input. While linear projectors are widely employed for encapsulation, they introduce semantic indistinctness and temporal incoherence when applied to videos. Conversely, the structure of resamplers shows promise in tackling these challenges, but an effective solution remains unexplored. Drawing inspiration from resampler structures, we introduce DisCo, a novel visual encapsulation method designed to yield semantically distinct and temporally coherent visual tokens for video MLLMs. DisCo integrates two key components: (1) A Visual Concept Discriminator (VCD) module, assigning unique semantics for visual tokens by associating them in pair with discriminative concepts in the video. (2) A Temporal Focus Calibrator (TFC) module, ensuring consistent temporal focus of visual tokens to video elements across every video frame. Through extensive experiments on multiple video MLLM frameworks, we demonstrate that DisCo remarkably outperforms previous state-of-the-art methods across a variety of video understanding benchmarks, while also achieving higher token efficiency thanks to the reduction of semantic indistinctness. The code: https://github.com/ZJHTerry18/DisCo.

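[110] Contrastive Pretraining with Dual Visual Encoders for Gloss-Free Sign Language Translation cs.CV

Ozge Mercanoglu Sincan, Richard Bowden

TL;DR: The paper proposes a gloss-free sign language translation framework based on dual visual encoders and contrastive pretraining, which significantly improves translation performance.

Details

Motivation: Traditional sign language translation relies on costly intermediate gloss annotations that cannot fully capture the complexity of continuous signing, so an efficient gloss-free method is needed.

Result: On the Phoenix-2014T benchmark, the dual-encoder architecture consistently outperforms its single-stream variants and achieves the highest BLEU-4 score among existing gloss-free approaches.

Insight: Combining contrastive pretraining with dual visual encoders effectively improves gloss-free sign language translation without relying on intermediate gloss annotations.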

Abstract: Sign Language Translation (SLT) aims to convert sign language videos into spoken or written text. While early systems relied on gloss annotations as an intermediate supervision, such annotations are costly to obtain and often fail to capture the full complexity of continuous signing. In this work, we propose a two-phase, dual visual encoder framework for gloss-free SLT, leveraging contrastive visual-language pretraining. During pretraining, our approach employs two complementary visual backbones whose outputs are jointly aligned with each other and with sentence-level text embeddings via a contrastive objective. During the downstream SLT task, we fuse the visual features and input them into an encoder-decoder model. On the Phoenix-2014T benchmark, our dual encoder architecture consistently outperforms its single stream variants and achieves the highest BLEU-4 score among existing gloss-free SLT approaches.
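
The contrastive objective mentioned in the abstract can be illustrated with a generic InfoNCE loss between two sets of embeddings. This is a sketch under assumptions: the abstract does not state the exact loss, the temperature value is arbitrary, and only a single video-to-text direction is shown, whereas the paper also aligns the two visual backbones with each other.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss where positives[i] is the match for anchors[i]."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]  # -log softmax of the true pair
    return total / len(a)

# Hypothetical 2D embeddings for two video clips and their sentences.
video = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[0.9, 0.1], [0.1, 0.9]]
text_swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = info_nce(video, text_aligned)
loss_swapped = info_nce(video, text_swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each video embedding is closest to its own sentence and large when the pairs are swapped, which is what pulls matching visual and text features together during pretraining.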

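[111] Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching cs.CV

Yuhan Liu, Jingwen Fu, Yang Wu, Kangyi Wu, Pengna Li

TL;DR: The paper proposes IMD, a framework that uses a pre-trained diffusion model to address the misalignment of vision foundation models in image feature matching. It relies on the generative model to capture instance-level details and designs a cross-image interaction prompting module to improve performance.

Details

Motivation: Vision foundation models excel at single-image understanding but are misaligned with the cross-image requirements of feature matching, leading to poor performance in multi-instance scenarios.

Result: Achieves state-of-the-art results on commonly used benchmarks and a 12% improvement on the IMIM benchmark.

Insight: Generative models capture instance-level details better, and the prompt mechanism can effectively serve the need for cross-image understanding.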

Abstract: Leveraging vision foundation models has emerged as a mainstream paradigm that improves the performance of image feature matching. However, previous works have ignored the misalignment that arises when foundation models are introduced into feature matching. The misalignment stems from the discrepancy between foundation models, which focus on single-image understanding, and the cross-image understanding required by feature matching. Specifically, 1) the embeddings derived from commonly used foundation models differ from the optimal embeddings required for feature matching; 2) there is no effective mechanism for carrying single-image understanding over to cross-image understanding. A significant consequence of the misalignment is that these models struggle with multi-instance feature matching problems. To address this, we introduce a simple but effective framework, called IMD (Image feature Matching with a pre-trained Diffusion model), with two parts: 1) Unlike the dominant solutions employing contrastive-learning based foundation models that emphasize global semantics, we integrate generative diffusion models to effectively capture instance-level details. 2) We leverage the prompt mechanism in the generative model as a natural tunnel and propose a novel cross-image interaction prompting module to facilitate bidirectional information interaction between image pairs. To more accurately measure the misalignment, we propose a new benchmark called IMIM, which focuses on multi-instance scenarios. Our proposed IMD establishes a new state of the art on commonly evaluated benchmarks, and its 12% improvement on IMIM indicates that our method effectively mitigates the misalignment.

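[112] FGSSNet: Feature-Guided Semantic Segmentation of Real World Floorplans cs.CV

Hugo Norrby, Gabriel Färm, Kevin Hernandez-Diaz, Fernando Alonso-Fernandez

TL;DR: FGSSNet is a multi-headed feature-guided semantic segmentation architecture that injects domain-specific feature maps into the latent space of a U-Net to improve the generalization of wall segmentation.

Details

Motivation: Existing methods for wall segmentation on real-world floorplans generalize poorly. FGSSNet introduces a multi-headed feature extractor that extracts domain-specific features to guide segmentation.

Result: Experiments show that FGSSNet outperforms the vanilla U-Net, supporting the effectiveness of feature injection.

Insight: Extracting and using domain-specific features such as wall texture and width can improve semantic segmentation performance.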

Abstract: We introduce FGSSNet, a novel multi-headed feature-guided semantic segmentation (FGSS) architecture designed to improve the generalization ability of wall segmentation on floorplans. FGSSNet features a U-Net segmentation backbone with a multi-headed dedicated feature extractor used to extract domain-specific feature maps which are injected into the latent space of U-Net to guide the segmentation process. This dedicated feature extractor is trained as an encoder-decoder with selected wall patches, representative of the walls present in the input floorplan, to produce a compressed latent representation of wall patches while jointly trained to predict the wall width. In doing so, we expect that the feature extractor encodes texture and width features of wall patches that are useful to guide the wall segmentation process. Our experiments show increased performance by the use of such injected features in comparison to the vanilla U-Net, highlighting the validity of the proposed approach.

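[113] Beyond Graph Model: Reliable VLM Fine-Tuning via Random Graph Adapter cs.CV

Bo Jiang, Xueyang Ze, Beibei Wang, Xixi Wang, Xixi Wan

TL;DR: The paper proposes a random graph adapter (VRGAdapter) that models per-category description diversity and inter-class relationships with a Vertex Random Knowledge Graph (VRKG), and introduces an Uncertainty-guided Multi-branch Fusion (UMF) scheme, improving vision-language model (VLM) performance on downstream tasks.

Details

Motivation: Traditional deterministic adapters cannot adequately capture the diversity of category descriptions or the semantics of inter-class relationships, limiting the potential of VLMs on downstream tasks.

Result: Experiments on multiple benchmark datasets demonstrate the effectiveness of VRGAdapter and UMF.

Insight: Modeling diversity and inter-class relationships with a random graph provides richer semantic knowledge, and dynamic model ensembling further improves downstream robustness.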

Abstract: Textual adapter-based tuning methods have shown significant potential in transferring knowledge from pre-trained Vision-Language Models (VLMs) to downstream tasks. Existing works generally employ a deterministic textual feature adapter to refine each category's textual representation. However, due to inherent factors such as different attributes and contexts, there exists significant diversity in textual descriptions for each category. Such description diversity offers rich discriminative semantic knowledge that can benefit downstream visual learning tasks. A traditional deterministic adapter model cannot adequately capture this varied semantic information. It is also desirable to exploit the inter-class relationships in the VLM adapter. To address these issues, we propose to incorporate a random graph model into the VLM adapter and develop a novel Vertex Random Graph Adapter (VRGAdapter). VRGAdapter first models the inherent diverse descriptions of each category and inter-class relationships of different categories simultaneously by leveraging a Vertex Random Knowledge Graph (VRKG) model. Then, it employs probabilistic message propagation on VRKG to learn a context-aware distribution representation for each class node. Finally, it adopts a reparameterized sampling function to achieve textual adapter learning. Note that VRGAdapter provides a more general adapter solution that encompasses the traditional graph-based adapter as a special case. In addition, to enable more robust performance for downstream tasks, we also introduce a new Uncertainty-guided Multi-branch Fusion (UMF) scheme that dynamically integrates multiple pre-trained models for ensemble prediction. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our approach.
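
The reparameterized sampling step mentioned in the abstract is usually written as a mean plus scaled noise, which keeps the sample differentiable with respect to the distribution parameters. The sketch below assumes a diagonal Gaussian per class node, which the abstract does not confirm; the function name and the numbers are illustrative.

```python
import math
import random

def reparameterized_sample(mean, log_var, rng):
    """Draw z = mean + std * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, log_var)]

rng = random.Random(0)
class_mean = [0.2, -1.0, 0.5]       # hypothetical class-node mean
class_log_var = [-2.0, -2.0, -2.0]  # hypothetical class-node log-variance
samples = [reparameterized_sample(class_mean, class_log_var, rng) for _ in range(5)]
for s in samples:
    print([round(v, 3) for v in s])
```

Each draw gives a slightly different textual feature for the same class, which is one way a stochastic adapter can represent description diversity. With the variance driven to zero the sample collapses to the mean, matching the abstract's remark that deterministic adapters are a special case.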

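[114] Test-Time Canonicalization by Foundation Models for Robust Perception cs.CV | cs.LG

Utkarsh Singhal, Ryan Feng, Stella X. Yu, Atul Prakash

TL;DR: FOCAL is a test-time, data-driven framework that uses foundation models to generate and optimize candidate transformations, achieving robust visual perception without re-training or architectural changes.

Details

Motivation: Current methods rely on specialized architectures or predefined data augmentations, which limits generalization. FOCAL aims for more flexible robustness by exploiting the prior knowledge of foundation models.

Result: Experiments show that FOCAL improves the robustness of CLIP and SAM across a range of transformations, including rotations, illumination shifts, and day-night variations.

Insight: FOCAL challenges the assumption that transform-specific training is necessary and offers a scalable path to robust visual perception.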

Abstract: Real-world visual perception requires invariance to diverse transformations, yet current methods rely heavily on specialized architectures or training on predefined augmentations, limiting generalization. We propose FOCAL, a test-time, data-driven framework that achieves robust perception by leveraging internet-scale visual priors from foundation models. By generating and optimizing candidate transformations toward visually typical, “canonical” views, FOCAL enhances robustness without re-training or architectural changes. Our experiments demonstrate improved robustness of CLIP and SAM across challenging transformations, including 2D/3D rotations, illumination shifts (contrast and color), and day-night variations. We also highlight potential applications in active vision. Our approach challenges the assumption that transform-specific training is necessary, instead offering a scalable path to invariance. Our code is available at: https://github.com/sutkarsh/focal.
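
The idea of generating candidate transformations and keeping the most typical view can be sketched as a search over transformed copies of the input. Everything concrete below is an assumption for illustration: the candidates are limited to quarter-turn rotations, and `top_brightness` is a toy stand-in for the foundation-model prior the paper uses to judge how canonical a view looks.

```python
def rotate90(img):
    """Rotate a 2D list-of-lists image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def canonicalize(img, score_fn):
    """Try 0, 90, 180 and 270 degree rotations and keep the best-scoring one."""
    candidates, current = [], img
    for quarter_turns in range(4):
        candidates.append((quarter_turns, current))
        current = rotate90(current)
    return max(candidates, key=lambda item: score_fn(item[1]))

def top_brightness(img):
    """Toy typicality score: upright scenes are brighter in the top half."""
    half = len(img) // 2
    return sum(map(sum, img[:half])) - sum(map(sum, img[half:]))

upright = [[9, 9], [1, 1]]    # hypothetical tiny image: bright sky on top
tilted = rotate90(upright)    # the input arrives rotated
best_k, best_img = canonicalize(tilted, top_brightness)
print(best_k, best_img)
```

The downstream model would then run on `best_img` instead of the raw input, so robustness comes from the search at test time and not from retraining.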

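[115] Improving Remote Sensing Classification using Topological Data Analysis and Convolutional Neural Networks cs.CV | cs.LG

Aaryam Sharma

TL;DR: The paper combines topological data analysis (TDA) with convolutional neural networks (CNNs), adding extracted geometric features to improve remote sensing image classification. It raises ResNet18 accuracy on the EuroSAT and RESISC45 datasets.

Details

Motivation: CNNs over-rely on local texture features in image classification, whereas TDA captures geometric information effectively. Combining the two can compensate for this CNN limitation and improve performance.

Result: Reaches 99.33% accuracy on EuroSAT, 1.44% above the ResNet18 baseline, and exceeds the baseline by 1.82% on RESISC45. On EuroSAT it also surpasses previously reported results from larger models.

Insight: TDA features complement the texture bias of CNNs and improve classification even when the data has no explicit topological structure, broadening the applicability of TDA.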

Abstract: Topological data analysis (TDA) is a relatively new field that is gaining rapid adoption due to its robustness and ability to effectively describe complex datasets by quantifying geometric information. In imaging contexts, TDA typically models data as filtered cubical complexes from which we can extract discriminative features using persistent homology. Meanwhile, convolutional neural networks (CNNs) have been shown to be biased towards texture-based local features. To address this limitation, we propose a TDA feature engineering pipeline and a simple method to integrate topological features with deep learning models on remote sensing classification. Our method improves the performance of a ResNet18 model on the EuroSAT dataset by 1.44%, achieving 99.33% accuracy, which surpasses all previously reported single-model accuracies, including those with larger architectures, such as ResNet50 (2x larger) and XL Vision Transformers (197x larger). We additionally show that our method’s accuracy is 1.82% higher than our ResNet18 baseline on the RESISC45 dataset. To our knowledge, this is the first application of TDA features in satellite scene classification with deep learning. This demonstrates that TDA features can be integrated with deep learning models, even on datasets without explicit topological structures, thereby increasing the applicability of TDA. A clean implementation of our method will be made publicly available upon publication.

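[116] Numerically Computing Galois Groups of Minimal Problems cs.CV | cs.SC | math.AG | 68W30

Timothy Duff

TL;DR: The paper explores the intersection of algebra, numerical computation, and computer vision, focusing on solving parametric systems of algebraic equations, especially as they arise in RanSaC-style methods in computer vision.

Details

Motivation: The work is motivated by the need to efficiently solve the parametric systems that arise in robust model fitting in computer vision (e.g., RanSaC) and to measure their intrinsic difficulty.

Result: Surveys progress over the past five-plus years on such problems and reports strides toward practical solutions.

Insight: Combining algebra with numerical computation can effectively address hard problems in computer vision and suggests new directions for future research.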

Abstract: I discuss a seemingly unlikely confluence of topics in algebra, numerical computation, and computer vision. The motivating problem is that of solving multiple instances of a parametric family of systems of algebraic (polynomial or rational function) equations. No doubt already of interest to ISSAC attendees, this problem arises in the context of robust model-fitting paradigms currently utilized by the computer vision community (namely “Random Sampling and Consensus”, aka “RanSaC”). This talk will give an overview of work in the last 5+ years that aspires to measure the intrinsic difficulty of solving such parametric systems, and makes strides towards practical solutions.

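[117] Text-Visual Semantic Constrained AI-Generated Image Quality Assessment cs.CV | I.4.7

Qiang Li, Qingsen Yan, Haojian Huang, Peng Wu, Haokui Zhang

TL;DR: The paper proposes SC-AGIQA, a unified framework that uses text-visual semantic constraints to improve quality assessment of AI-generated images, addressing semantic misalignment and missing perception of fine details.

Details

Motivation: With the rapid progress of AI image generation, accurately assessing image quality has become increasingly important. Existing methods rely on cross-modal models but suffer from semantic misalignment and insufficient perception of details.

Result: Experiments on multiple benchmark datasets show that SC-AGIQA outperforms existing methods.

Insight: Combining text semantics with frequency-domain analysis gives a more complete assessment of AI-generated image quality and offers a new approach to cross-modal evaluation.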

Abstract: With the rapid advancements in Artificial Intelligence Generated Image (AGI) technology, accurately assessing the quality of such images has become an increasingly vital requirement. Prevailing methods typically rely on cross-modal models like CLIP or BLIP to evaluate text-image alignment and visual quality. However, when applied to AGIs, these methods encounter two primary challenges: semantic misalignment and missing perception of details. To address these limitations, we propose Text-Visual Semantic Constrained AI-Generated Image Quality Assessment (SC-AGIQA), a unified framework that leverages text-visual semantic constraints to significantly enhance the comprehensive evaluation of both text-image consistency and perceptual distortion in AI-generated images. Our approach integrates key capabilities from multiple models and tackles the aforementioned challenges by introducing two core modules: the Text-assisted Semantic Alignment Module (TSAM), which leverages Multimodal Large Language Models (MLLMs) to bridge the semantic gap by generating an image description and comparing it against the original prompt for a refined consistency check, and the Frequency-domain Fine-Grained Degradation Perception Module (FFDPM), which draws inspiration from Human Visual System (HVS) properties by employing frequency domain analysis combined with perceptual sensitivity weighting to better quantify subtle visual distortions and enhance the capture of fine-grained visual quality details in images. Extensive experiments conducted on multiple benchmark datasets demonstrate that SC-AGIQA outperforms existing state-of-the-art methods. The code is publicly available at https://github.com/mozhu1/SC-AGIQA.

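[118] 4D-Animal: Freely Reconstructing Animatable 3D Animals from Videos cs.CV

Shanshan Zhong, Jiawei Peng, Zehan Zheng, Zhongzhan Huang, Wufei Ma

TL;DR: The paper proposes 4D-Animal, a framework that reconstructs animatable 3D animals from videos without sparse keypoint annotations, using a dense feature network and a hierarchical alignment strategy for efficient and stable reconstruction.

Details

Motivation: Existing methods fit parametric models using sparse semantic keypoints, but keypoint annotation is labor-intensive and keypoint detectors are unreliable, so a keypoint-free solution is needed.

Result: Experiments show that 4D-Animal outperforms both model-based and model-free baselines and produces high-quality 3D assets.

Insight: Keypoint-free methods can scale to large applications and can also supply high-quality data for other 3D tasks.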

Abstract: Existing methods for reconstructing animatable 3D animals from videos typically rely on sparse semantic keypoints to fit parametric models. However, obtaining such keypoints is labor-intensive, and keypoint detectors trained on limited animal data are often unreliable. To address this, we propose 4D-Animal, a novel framework that reconstructs animatable 3D animals from videos without requiring sparse keypoint annotations. Our approach introduces a dense feature network that maps 2D representations to SMAL parameters, enhancing both the efficiency and stability of the fitting process. Furthermore, we develop a hierarchical alignment strategy that integrates silhouette, part-level, pixel-level, and temporal cues from pre-trained 2D visual models to produce accurate and temporally coherent reconstructions across frames. Extensive experiments demonstrate that 4D-Animal outperforms both model-based and model-free baselines. Moreover, the high-quality 3D assets generated by our method can benefit other 3D tasks, underscoring its potential for large-scale applications. The code is released at https://github.com/zhongshsh/4D-Animal.

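[119] CoralVQA: A Large-Scale Visual Question Answering Dataset for Coral Reef Image Understanding cs.CV | cs.AI

Hongyong Han, Wei Wang, Gaowei Zhang, Mingjie Li, Yi Wang

TL;DR: CoralVQA is the first large-scale visual question answering (VQA) dataset for coral reef image understanding. It contains 12,805 real-world coral images and 277,653 question-answer pairs, built with a semi-automatic annotation pipeline, and is intended to support ecological conservation.

Details

Motivation: Coral reef ecosystems need expert monitoring and user-friendly interaction tools, but existing VQA work lacks a dedicated dataset for coral reef images.

Result: Evaluation shows that current large vision-language models (LVLMs) still have limitations in understanding coral reef images, pointing to directions for future model improvement.

Insight: CoralVQA provides a research foundation for coral reef conservation and reveals both the challenges and the potential of applying LVLMs in specialized domains.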

Abstract: Coral reefs are vital yet vulnerable ecosystems that require continuous monitoring to support conservation. While coral reef images provide essential information in coral monitoring, interpreting such images remains challenging due to the need for domain expertise. Visual Question Answering (VQA), powered by Large Vision-Language Models (LVLMs), has great potential in user-friendly interaction with coral reef images. However, applying VQA to coral imagery demands a dedicated dataset that addresses two key challenges: domain-specific annotations and multidimensional questions. In this work, we introduce CoralVQA, the first large-scale VQA dataset for coral reef analysis. It contains 12,805 real-world coral images from 67 coral genera collected from 3 oceans, along with 277,653 question-answer pairs that comprehensively assess ecological and health-related conditions. To construct this dataset, we develop a semi-automatic data construction pipeline in collaboration with marine biologists to ensure both scalability and professional-grade data quality. CoralVQA presents novel challenges and provides a comprehensive benchmark for studying vision-language reasoning in the context of coral reef images. By evaluating several state-of-the-art LVLMs, we reveal key limitations and opportunities. These insights form a foundation for future LVLM development, with a particular emphasis on supporting coral conservation efforts.

Zhicun Yin, Junjie Chen, Ming Liu, Zhixin Wang, Fan Li

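TL;DR: This paper proposes RefSTAR, a blind facial image restoration method based on reference selection, transfer, and reconstruction. By building reference selection and feature fusion modules, it improves identity preservation and the quality of reference feature transfer.

Details

Motivation: Blind facial image restoration is highly challenging because of unknown complex degradations and human sensitivity to faces. Existing methods perform poorly on identity preservation and on introducing detailed texture features.

Result: Experiments on various backbone models show that the method outperforms existing methods in identity preservation and reference feature transfer quality.

Insight: Effective reference selection and feature fusion are key to better blind facial image restoration; a cycle consistency loss redesigned with masks improves results further.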

Abstract: Blind facial image restoration is highly challenging due to unknown complex degradations and the sensitivity of humans to faces. Although existing methods introduce auxiliary information from generative priors or high-quality reference images, they still struggle with identity preservation problems, mainly due to improper feature introduction on detailed textures. In this paper, we focus on effectively incorporating appropriate features from high-quality reference images, presenting a novel blind facial image restoration method that considers reference selection, transfer, and reconstruction (RefSTAR). In terms of selection, we construct a reference selection (RefSel) module. For training the RefSel module, we construct a RefSel-HQ dataset through a mask generation pipeline, which contains annotating masks for 10,000 ground truth-reference pairs. As for the transfer, due to the trivial solution in vanilla cross-attention operations, a feature fusion paradigm is designed to force the features from the reference to be integrated. Finally, we propose a reference image reconstruction mechanism that further ensures the presence of reference image features in the output image. The cycle consistency loss is also redesigned in conjunction with the mask. Extensive experiments on various backbone models demonstrate superior performance, showing better identity preservation ability and reference feature transfer quality. Source code, dataset, and pre-trained models are available at https://github.com/yinzhicun/RefSTAR.

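[121] Privacy-Preserving Multi-Stage Fall Detection Framework with Semi-supervised Federated Learning and Robotic Vision Confirmation cs.CV | cs.AI | cs.RO

Seyed Alireza Rahimi Azghadi, Truong-Thanh-Hung Nguyen, Helene Fournier, Monica Wachowicz, Rene Richard

TL;DR: The paper proposes a multi-stage fall detection framework that combines semi-supervised federated learning with robotic vision confirmation, aiming to give older adults efficient and reliable fall detection while preserving user privacy. The combined system reports an overall accuracy of 99.99%.

Details

Motivation: As the aging population grows, so does the risk of falls among older adults, and timely detection can reduce medical expenses and recovery time. Detection systems must, however, be both effective and privacy-preserving.

Result: SF2D reaches 99.19% accuracy, the vision-based detection system 96.3% accuracy, and the navigation system a 95% success rate, for a reported overall accuracy of 99.99%.

Insight: Combining complementary systems across multiple stages can substantially raise the overall reliability of fall detection, while federated learning protects user privacy.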

Abstract: The aging population is growing rapidly, and so is the danger of falls in older adults. Falls are a major cause of injury, and timely detection can greatly reduce medical expenses and recovery time. However, to provide timely intervention and avoid unnecessary alarms, detection systems must be effective and reliable while addressing privacy concerns regarding the user. In this work, we propose a framework for detecting falls using several complementary systems: a semi-supervised federated learning-based fall detection system (SF2D), an indoor localization and navigation system, and a vision-based human fall recognition system. A wearable device and an edge device identify a fall scenario in the first system. On top of that, the second system uses an indoor localization technique first to localize the fall location and then navigate a robot to inspect the scenario. A vision-based detection system running on an edge device with a camera mounted on a robot is used to recognize fallen people. Each of the systems of this proposed framework achieves different accuracy rates. Specifically, the SF2D has a 0.81% failure rate, equivalent to 99.19% accuracy, while the vision-based fallen people detection achieves 96.3% accuracy. However, when we combine the accuracy of these two systems with the accuracy of the navigation system (95% success rate), our proposed framework achieves highly reliable fall detection, with an overall accuracy of 99.99%. Not only is the proposed framework safe for older adults, but it is also a privacy-preserving solution for detecting falls.

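[122] BenchReAD: A systematic benchmark for retinal anomaly detection cs.CV | cs.AI | cs.LG

Chenyu Lian, Hong-Yu Zhou, Zhanli Hu, Jing Qin

TL;DR: The paper presents BenchReAD, a systematic benchmark for retinal anomaly detection that fills the gap left by the lack of a comprehensive public benchmark, and introduces NFM-DRA to improve performance.

Details

Motivation: Retinal anomaly detection is crucial for disease screening, but the lack of a comprehensive public benchmark has hindered fair evaluation and progress. Existing benchmarks mostly cover one-class supervised methods and overlook the labeled abnormal data and unlabeled data that are common in clinical practice.

Result: NFM-DRA achieves the best performance on the BenchReAD benchmark, outperforming existing methods, with a notably smaller performance drop on unseen anomalies.

Insight: Making full use of labeled abnormal data and unlabeled data improves generalization; borrowing mechanisms from one-class supervised learning may offer new ideas for multi-class anomaly detection.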

Abstract: Retinal anomaly detection plays a pivotal role in screening ocular and systemic diseases. Despite its significance, progress in the field has been hindered by the absence of a comprehensive and publicly available benchmark, which is essential for the fair evaluation and advancement of methodologies. Due to this limitation, previous anomaly detection work related to retinal images has been constrained by (1) a limited and overly simplistic set of anomaly types, (2) test sets that are nearly saturated, and (3) a lack of generalization evaluation, resulting in less convincing experimental setups. Furthermore, existing benchmarks in medical anomaly detection predominantly focus on one-class supervised approaches (training only with negative samples), overlooking the vast amounts of labeled abnormal data and unlabeled data that are commonly available in clinical practice. To bridge these gaps, we introduce a benchmark for retinal anomaly detection, which is comprehensive and systematic in terms of data and algorithm. Through categorizing and benchmarking previous methods, we find that a fully supervised approach leveraging disentangled representations of abnormalities (DRA) achieves the best performance but suffers from significant drops in performance when encountering certain unseen anomalies. Inspired by the memory bank mechanisms in one-class supervised learning, we propose NFM-DRA, which integrates DRA with a Normal Feature Memory to mitigate the performance degradation, establishing a new SOTA. The benchmark is publicly available at https://github.com/DopamineLcy/BenchReAD.
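
The memory bank mechanism that inspires the Normal Feature Memory can be illustrated with a generic nearest-neighbour scorer over stored normal features. This is only a sketch of that general idea: the class name, the Euclidean distance, and the 2D features are assumptions, and how NFM-DRA combines the memory with DRA is not described in the abstract.

```python
import math

class NormalFeatureMemory:
    """Stores features of normal samples and scores queries by distance."""

    def __init__(self):
        self.bank = []

    def add(self, feature):
        self.bank.append(list(feature))

    def anomaly_score(self, feature):
        # Distance to the closest stored normal feature; larger = more anomalous.
        return min(math.dist(feature, stored) for stored in self.bank)

memory = NormalFeatureMemory()
for normal in [(0.0, 0.0), (1.0, 0.0), (0.5, 0.4)]:  # hypothetical normal features
    memory.add(normal)

near_normal = memory.anomaly_score((0.1, 0.0))
far_from_normal = memory.anomaly_score((5.0, 5.0))
print(near_normal < far_from_normal)
```

Because the score depends only on distance from known normal data, it stays meaningful for anomaly types that never appeared in training, which is the weakness of the fully supervised approach that the abstract points to.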

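[123] Cameras as Relative Positional Encoding cs.CV | cs.AI

Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma

TL;DR: The paper proposes PRoPE, a relative positional encoding that uses camera geometry to improve multi-view Transformers. Experiments show gains across a range of tasks and settings.

Details

Motivation: In multi-view computer vision tasks, geometric relationships between viewpoints are critical for 3D perception. Existing Transformer methods do not fully capture camera geometry, so a more effective way to incorporate camera parameters is needed.

Result: Experiments show that PRoPE improves performance on several tasks (novel view synthesis, stereo depth estimation, and spatial cognition) and still generalizes when input sequence lengths and camera intrinsics are out of distribution.

Insight: Explicitly encoding camera geometry as relative positional information can significantly improve multi-view Transformers, especially in complex and varied 3D scenes.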

Abstract: Transformers are increasingly prevalent for multi-view computer vision tasks, where geometric relationships between viewpoints are critical for 3D perception. To leverage these relationships, multi-view transformers must use camera geometry to ground visual tokens in 3D space. In this work, we compare techniques for conditioning transformers on cameras: token-level raymap encodings, attention-level relative pose encodings, and a new relative encoding we propose – Projective Positional Encoding (PRoPE) – that captures complete camera frustums, both intrinsics and extrinsics, as a relative positional encoding. Our experiments begin by showing how relative camera conditioning improves performance in feedforward novel view synthesis, with further gains from PRoPE. This holds across settings: scenes with both shared and varying intrinsics, when combining token- and attention-level conditioning, and for generalization to inputs with out-of-distribution sequence lengths and camera intrinsics. We then verify that these benefits persist for different tasks, stereo depth estimation and discriminative spatial cognition, as well as larger model sizes.
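
One way to see why a relative encoding of full camera frustums is attractive: the transform between two cameras' projective frames does not depend on the choice of world coordinate system. The sketch below assumes each camera is a 4x4 matrix built from pinhole intrinsics and a world-to-camera pose, and that the relative quantity is `P_i @ inverse(P_j)`. These forms are assumptions for illustration, and how PRoPE injects such a quantity into attention is not shown.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def inverse(m):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def projection(fx, fy, cx, cy, world_to_cam):
    """4x4 matrix combining pinhole intrinsics with a world-to-camera pose."""
    k = [[fx, 0, cx, 0], [0, fy, cy, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    return matmul(k, world_to_cam)

def relative_projective(p_i, p_j):
    return matmul(p_i, inverse(p_j))

def rigid(yaw, tx, ty, tz):
    """4x4 rigid transform: rotation about the z-axis plus a translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0, tx], [s, c, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]

# Two hypothetical cameras with different intrinsics and poses.
t_i, t_j = rigid(0.3, 0.0, 0.0, 2.0), rigid(-0.5, 1.0, 0.2, 3.0)
rel = relative_projective(projection(500, 500, 320, 240, t_i),
                          projection(800, 800, 400, 300, t_j))

# Re-express both poses in another world frame: world_to_cam' = world_to_cam @ g.
g = rigid(1.1, -4.0, 2.0, 0.5)
rel_moved = relative_projective(projection(500, 500, 320, 240, matmul(t_i, g)),
                                projection(800, 800, 400, 300, matmul(t_j, g)))
max_diff = max(abs(a - b) for ra, rb in zip(rel, rel_moved) for a, b in zip(ra, rb))
print(max_diff < 1e-6)
```

The relative matrix is the same up to rounding in both world frames, so a model conditioned on it sees identical inputs no matter where the scene origin is placed, while still receiving both intrinsics and extrinsics.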

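[124] Quantize-then-Rectify: Efficient VQ-VAE Training cs.CV | cs.LG

Borui Zhang, Qihang Rao, Wenzhao Zheng, Jie Zhou, Jiwen Lu

TL;DR: By quickly converting a pre-trained VAE into a VQ-VAE, the proposed ReVQ framework combines channel multi-group quantization with a post rectifier, sharply cutting training cost (22 hours on a single GPU) while keeping reconstruction quality high.

Details

Motivation: Training high-compression-rate VQ-VAEs usually demands large amounts of compute. ReVQ aims to use a pre-trained VAE to train a VQ-VAE quickly and efficiently, substantially lowering the cost.

Result: On ImageNet, images are compressed to at most 512 tokens while retaining high reconstruction quality (rFID = 1.06); training takes only 22 hours on a single GPU.

Insight: Making good use of a pre-trained VAE can greatly accelerate VQ-VAE training; controlling quantization noise and rectifying quantization errors are the key ingredients.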

Abstract: Visual tokenizers are pivotal in multimodal large models, acting as bridges between continuous inputs and discrete tokens. Nevertheless, training high-compression-rate VQ-VAEs remains computationally demanding, often necessitating thousands of GPU hours. This work demonstrates that a pre-trained VAE can be efficiently transformed into a VQ-VAE by controlling quantization noise within the VAE’s tolerance threshold. We present Quantize-then-Rectify (ReVQ), a framework leveraging pre-trained VAEs to enable rapid VQ-VAE training with minimal computational overhead. By integrating channel multi-group quantization to enlarge codebook capacity and a post rectifier to mitigate quantization errors, ReVQ compresses ImageNet images into at most 512 tokens while sustaining competitive reconstruction quality (rFID = 1.06). Significantly, ReVQ reduces training costs by over two orders of magnitude relative to state-of-the-art approaches: ReVQ finishes full training on a single NVIDIA 4090 in approximately 22 hours, whereas comparable methods require 4.5 days on 32 A100 GPUs. Experimental results show that ReVQ achieves superior efficiency-reconstruction trade-offs.
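
To make the idea of enlarging codebook capacity by quantizing channel groups separately more concrete, here is a minimal sketch of generic grouped vector quantization. It is not the ReVQ implementation: the abstract does not specify how channels are grouped, so the equal-width split, the function names, and the toy codebooks are all assumptions for illustration.

```python
def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def group_quantize(latent, codebooks):
    """Split latent into equal channel groups and quantize each group
    against its own codebook (hypothetical helper, not from the paper)."""
    groups = len(codebooks)
    if len(latent) % groups != 0:
        raise ValueError("channel count must be divisible by the group count")
    width = len(latent) // groups
    indices, quantized = [], []
    for g, codebook in enumerate(codebooks):
        chunk = latent[g * width:(g + 1) * width]
        idx = nearest_code(chunk, codebook)
        indices.append(idx)
        quantized.extend(codebook[idx])
    return indices, quantized


# Two groups with two codes each: 2 * 2 = 4 representable combinations,
# while only 4 code vectors are stored in total.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
latent = [0.9, 0.8, 0.1, 0.7]
indices, quantized = group_quantize(latent, codebooks)
print(indices, quantized)
```

With G groups of K codes each, the number of representable combinations grows as K to the power G while storage grows only as G times K, which is one common reading of "enlarging codebook capacity".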

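[125] Self-supervised Learning on Camera Trap Footage Yields a Strong Universal Face Embedder cs.CV | cs.AI | cs.LG

Vladimir Iashin, Horace Lee, Dan Schofield, Andrew Zisserman

TL;DR: The paper proposes a fully self-supervised approach that uses the DINOv2 framework to train Vision Transformers, learning chimpanzee face embeddings from unlabeled camera-trap data. The model performs strongly on open-set re-identification, even surpassing some supervised baselines, and offers a scalable solution for biodiversity monitoring.

Details

Motivation: Camera traps produce large volumes of data, but manually identifying individual animals is slow and inefficient. Self-supervised learning addresses the scarcity of labeled data and increases the automation of wildlife monitoring.

Result: On challenging benchmarks such as Bossou, the model surpasses supervised baselines and shows strong generalization.

Insight: Self-supervised learning has great potential in biodiversity monitoring, enabling non-invasive, scalable population studies.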

Abstract: Camera traps are revolutionising wildlife monitoring by capturing vast amounts of visual data; however, the manual identification of individual animals remains a significant bottleneck. This study introduces a fully self-supervised approach to learning robust chimpanzee face embeddings from unlabeled camera-trap footage. Leveraging the DINOv2 framework, we train Vision Transformers on automatically mined face crops, eliminating the need for identity labels. Our method demonstrates strong open-set re-identification performance, surpassing supervised baselines on challenging benchmarks such as Bossou, despite utilising no labelled data during training. This work underscores the potential of self-supervised learning in biodiversity monitoring and paves the way for scalable, non-invasive population studies.

cs.CL

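[126] SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems cs.CL | cs.AI

Wenliang Shan, Michael Fu, Rui Yang, Chakkrit Tantithamthavorn

TL;DR: SEALGuard is a multilingual guardrail that aims to improve the safety alignment of LLM systems in multilingual settings, particularly for low-resource languages such as those of Southeast Asia.

Details

Motivation: Existing guardrails such as LlamaGuard detect unsafe English inputs well but perform poorly on multilingual inputs, especially in low-resource languages, leaving LLM systems vulnerable to multilingual unsafe or jailbreak prompts.

Result: SEALGuard performs best at detecting multilingual unsafe prompts, with DSR, precision, and F1-score all markedly better than LlamaGuard's.

Insight: Multilingual safety alignment requires purpose-built design, since existing monolingual guardrails degrade markedly in multilingual settings; model size and adaptation strategy have a significant effect on performance.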

Abstract: Safety alignment is critical for LLM-powered systems. While recent LLM-powered guardrail approaches such as LlamaGuard achieve high detection accuracy of unsafe inputs written in English (e.g., "How to create a bomb?"), they struggle with multilingual unsafe inputs. This limitation leaves LLM systems vulnerable to unsafe and jailbreak prompts written in low-resource languages such as those in Southeast Asia. This paper introduces SEALGuard, a multilingual guardrail designed to improve the safety alignment across diverse languages. It aims to address the multilingual safety alignment gap of existing guardrails and ensure effective filtering of unsafe and jailbreak prompts in LLM-powered systems. We adapt a general-purpose multilingual language model into a multilingual guardrail using low-rank adaptation (LoRA). We construct SEALSBench, a large-scale multilingual safety alignment dataset containing over 260,000 prompts in ten languages, including safe, unsafe, and jailbreak cases. We evaluate SEALGuard against state-of-the-art guardrails such as LlamaGuard on this benchmark. Our findings show that multilingual unsafe and jailbreak prompts substantially degrade the performance of the state-of-the-art LlamaGuard, which experiences a drop in Defense Success Rate (DSR) by 9% and 18%, respectively, compared to its performance on English-only prompts. In contrast, SEALGuard outperforms existing guardrails in detecting multilingual unsafe and jailbreak prompts, improving DSR by 48% over LlamaGuard and achieving the best DSR, precision, and F1-score. Our ablation study further reveals the contributions of adaptation strategies and model size to the overall performance of SEALGuard. SEALGuard advances the safety alignment of LLM systems by introducing an effective multilingual guardrail.

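[127] From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation cs.CL | cs.AI

Seokhee Hong, Sunkyoung Kim, Guijin Son, Soyeon Kim, Yeonjung Hong

TL;DR: The paper introduces two expert-level Korean benchmarks, KMMLU-Redux and KMMLU-Pro, for evaluating the applicability of large language models (LLMs) to real-world scenarios in Korea, covering expertise in both academic and industrial domains.

Details

Motivation: To develop benchmarks that effectively evaluate how applicable LLMs are in real-world scenarios, in particular benchmarks that cover industrial expertise.

Result: Experiments demonstrate that these benchmarks comprehensively represent industrial knowledge in Korea.

Insight: Professional benchmarks are essential for evaluating LLMs in real-world applications, especially in multilingual and non-English settings.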

Abstract: The development of Large Language Models (LLMs) requires robust benchmarks that encompass not only academic domains but also industrial fields to effectively evaluate their applicability in real-world scenarios. In this paper, we introduce two Korean expert-level benchmarks. KMMLU-Redux, reconstructed from the existing KMMLU, consists of questions from the Korean National Technical Qualification exams, with critical errors removed to enhance reliability. KMMLU-Pro is based on Korean National Professional Licensure exams to reflect professional knowledge in Korea. Our experiments demonstrate that these benchmarks comprehensively represent industrial knowledge in Korea. We release our dataset publicly.

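[128] Self-Improving Model Steering cs.CL

Rongyi Zhu, Yuhui Wang, Tanqiu Jiang, Jiacheng Liang, Ting Wang

TL;DR: The paper proposes SIMS, a self-improving model-steering framework that needs no externally annotated data and achieves dynamic, adaptive steering by autonomously generating and refining contrastive samples.

Details

Motivation: Conventional model-steering methods depend on externally annotated data, which limits their adaptability to different contexts and ties their effectiveness to annotation quality; a self-improving method that needs no external supervision is therefore required.

Result: Across diverse LLMs and benchmarks, SIMS substantially outperforms existing methods, showing the potential of self-improving model steering for inference-time LLM alignment.

Insight: Self-improving model steering is a viable direction for future research on inference-time LLM alignment: it removes the dependence on externally annotated data and improves adaptability.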

Abstract: Model steering represents a powerful technique that dynamically aligns large language models (LLMs) with human preferences during inference. However, conventional model-steering methods rely heavily on externally annotated data, not only limiting their adaptability to varying contexts but also tethering their effectiveness to annotation quality. In this paper, we present SIMS, the first self-improving model-steering framework that operates without relying on external supervision. At its core, SIMS autonomously generates and refines contrastive samples through iterative self-improvement cycles, enabling adaptive, context-specific steering. Additionally, SIMS employs novel strategies, including prompt ranking and contrast sampling, to further enhance steering efficacy. Extensive evaluation across diverse LLMs and benchmarks demonstrates that SIMS substantially outperforms existing methods in steering effectiveness and adaptability, highlighting self-improving model steering as a promising direction for future research on inference-time LLM alignment.

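[129] Beyond vividness: Content analysis of induced hallucinations reveals the hidden structure of individual differences in visual imagery cs.CL | q-bio.NC | q-bio.QM

Ana Chkhaidze, Reshanne R. Reeder, Connor Gag, Anastasia Kiyonaga, Seana Coulson

TL;DR: The paper studies how individual differences in the visual system affect the content of Ganzflicker-induced hallucinations, finding that strong imagers describe complex, naturalistic content while weak imagers report simple geometric patterns.

Details

Motivation: To explore how individual differences along the visual imagery spectrum affect internally generated visual experiences, specifically the content of Ganzflicker-induced hallucinations.

Result: Strong imagers described hallucinations that were more complex and naturalistic, whereas weak imagers mostly reported simple geometric patterns. Vision-language models captured these differences better than text-only language models.

Insight: The study suggests that differences in coordination between early visual areas and higher-order regions may underlie individual differences along the imagery spectrum.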

Abstract: A rapidly alternating red and black display known as Ganzflicker induces visual hallucinations that reflect the generative capacity of the visual system. Recent proposals regarding the imagery spectrum, that is, differences in the visual system of individuals with absent imagery, typical imagery, and vivid imagery, suggest these differences should impact the complexity of other internally generated visual experiences. Here, we used tools from natural language processing to analyze free-text descriptions of hallucinations from over 4,000 participants, asking whether people with different imagery phenotypes see different things in their mind’s eye during Ganzflicker-induced hallucinations. Strong imagers described complex, naturalistic content, while weak imagers reported simple geometric patterns. Embeddings from vision language models better captured these differences than text-only language models, and participants with stronger imagery used language with richer sensorimotor associations. These findings may reflect individual variation in coordination between early visual areas and higher-order regions relevant for the imagery spectrum.

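[130] ALIGN: Prompt-based Attribute Alignment for Reliable, Responsible, and Personalized LLM-based Decision-Making cs.CL | cs.AI

Bharadwaj Ravichandran, David Joy, Paul Elliott, Brian Hu, Jadie Adams

TL;DR: The paper presents ALIGN, a system that uses prompt-based attribute alignment to provide personalized decision support with large language models (LLMs), accommodating diverse user values and preferences.

Details

Motivation: As LLMs are increasingly used for decision support, users' diverse values and preferences call for new alignment and personalization methods that make decisions more reliable and personalized.

Result: The ALIGN system demonstrates the effectiveness of its alignment approaches in two domains (public opinion surveys and medical triage), supporting personalized and reliable LLM decision-making.

Insight: The work offers a new approach to personalized LLM alignment and highlights the importance of structured, modular design for complex decision-making tasks.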

Abstract: Large language models (LLMs) are increasingly being used as decision aids. However, users have diverse values and preferences that can affect their decision-making, which requires novel methods for LLM alignment and personalization. Existing LLM comparison tools largely focus on benchmarking tasks, such as knowledge-based question answering. In contrast, our proposed ALIGN system focuses on dynamic personalization of LLM-based decision-makers through prompt-based alignment to a set of fine-grained attributes. Key features of our system include robust configuration management, structured output generation with reasoning, and several algorithm implementations with swappable LLM backbones, enabling different types of analyses. Our user interface enables a qualitative, side-by-side comparison of LLMs and their alignment to various attributes, with a modular backend for easy algorithm integration. Additionally, we perform a quantitative analysis comparing alignment approaches in two different domains: demographic alignment for public opinion surveys and value alignment for medical triage decision-making. The entire ALIGN framework is open source and will enable new research on reliable, responsible, and personalized LLM-based decision-makers.

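[131] OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique cs.CL

Wasi Uddin Ahmad, Somshubra Majumdar, Aleksander Ficek, Sean Narenthiran, Mehrzad Samadi

TL;DR: OpenCodeReasoning-II introduces a code reasoning dataset of 2.5 million question-solution-critique triples and, using a two-stage supervised fine-tuning approach, achieves strong results on code generation and critique tasks.

Details

Motivation: Code reasoning and critique depend on large-scale, high-quality datasets, but publicly available datasets are limited in size. The paper aims to improve code generation and critique by enlarging the dataset and refining the fine-tuning strategy.

Result: The fine-tuned Qwen2.5-Instruct models perform strongly on code generation, and combining the generation and critique models yields significant gains on competitive programming tasks.

Insight: Large-scale, high-quality datasets and joint multi-task training are key to improving code reasoning models.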

Abstract: Recent advancements in reasoning-based Large Language Models (LLMs), particularly their potential through test-time scaling, have created significant opportunities for distillation in code generation and critique. However, progress in both areas fundamentally depends on large-scale, high-quality datasets. In this work, we introduce OpenCodeReasoning-II, a dataset consisting of 2.5M question-solution-critique triples (approx. 35K unique programming questions), making it nearly twice the size of the previous largest publicly available code reasoning dataset. We employ a two-stage supervised fine-tuning strategy. The first stage focuses on fine-tuning for code generation, while the second stage involves the joint training of models for both code generation and critique. Our resulting finetuned Qwen2.5-Instruct models achieve performance in code generation that either exceeds or equals the best prior open-weight distilled models. Notably, the integration of our code generation and critique models leads to significant improvements in competitive coding performance. Furthermore, we present an extension of the LiveCodeBench benchmark to specifically support the C++ programming language, thereby facilitating more comprehensive LLM evaluation using this benchmark.

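[132] CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards cs.CL | cs.AI

Taolin Zhang, Maosong Cao, Alexander Lam, Songyang Zhang, Kai Chen

TL;DR: CompassJudger-2 is a generalist judge model that improves robustness and generalization through verifiable rewards and a multi-domain data strategy, and performs well on multiple benchmarks.

Details

Motivation: Existing judge models are narrowly specialized and lack robustness, so they cannot evaluate large language models comprehensively.

Result: CompassJudger-2 achieves strong results on multiple benchmarks, and the 7B model approaches the performance of much larger models.

Insight: Verifiable rewards and diverse data can markedly improve the generalization and robustness of judge models.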

Abstract: Recently, the role of LLM-as-judge in evaluating large language models has gained prominence. However, current judge models suffer from narrow specialization and limited robustness, undermining their capacity for comprehensive evaluations. In this work, we present CompassJudger-2, a novel generalist judge model that overcomes these limitations via a task-driven, multi-domain data curation strategy. Central to our approach is supervising judgment tasks with verifiable rewards, guiding intrinsic critical reasoning through rejection sampling to foster robust, generalizable judgment capabilities. We introduce a refined learning objective with margin policy gradient loss to enhance performance. Empirically, CompassJudger-2 achieves superior results across multiple judge and reward benchmarks, and our 7B model demonstrates competitive judgment accuracy with significantly larger models like DeepSeek-V3 and Qwen3-235B-A22B. Additionally, we propose JudgerBenchV2, a comprehensive benchmark evaluating cross-domain judgment accuracy and rank consistency to standardize judge model evaluation. These contributions advance robust, scalable LLM judgment and establish new performance and evaluation standards.

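[133] OPENXRD: A Comprehensive Benchmark and Enhancement Framework for LLM/MLLM XRD Question Answering cs.CL | cs.AI | 68T50, 68T07

Ali Vosoughi, Ayoub Shahnazari, Yufeng Xi, Zeliang Zhang, Griffin Hess

TL;DR: OPENXRD is an open-book framework designed for crystallography question answering; it improves the performance of smaller models on X-ray diffraction (XRD) tasks by incorporating domain-specific supporting content generated by GPT-4.5. Experiments show that the GPT-4.5-generated summaries significantly improve model accuracy, especially for models with limited prior training in crystallography.

Details

Motivation: The work aims to avoid the copyright issues that arise when crystallography question-answering systems rely on scanned textbooks, while helping smaller models perform better by generating compact, domain-specific reference content.

Result: Supporting content generated by GPT-4.5 significantly improves accuracy on 217 expert-level XRD questions, particularly for models with limited prior training.

Insight: Domain-specific supporting content can help smaller models reason more effectively on scientific tasks; the work lays a foundation for future extensions (such as incorporating images) and shows the potential of open-book systems in materials science.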

Abstract: This work presents OPENXRD, an open-book pipeline designed for crystallography question answering, which integrates textual prompts with concise supporting content generated by GPT-4.5. Instead of using scanned textbooks, which may lead to copyright issues, OPENXRD generates compact, domain-specific references that help smaller models understand key concepts in X-ray diffraction (XRD). We evaluate OPENXRD on a well-defined set of 217 expert-level XRD questions by comparing different vision-language models, including GPT-4 and LLaVA-based frameworks such as Mistral, LLaMA, and QWEN, under both closed-book (without supporting material) and open-book (with supporting material) conditions. Our experimental results show significant accuracy improvements in models that use the GPT-4.5-generated summaries, particularly those with limited prior training in crystallography. OPENXRD uses knowledge from larger models to fill knowledge gaps in crystallography and shows that AI-generated texts can help smaller models reason more effectively in scientific tasks. While the current version of OPENXRD focuses on text-based inputs, we also explore future extensions such as adding real crystal diagrams or diffraction patterns to improve interpretation in specialized materials science contexts. Overall, OPENXRD shows that specialized open-book systems can be useful in materials science and provides a foundation for broader natural language processing (NLP) tools in critical scientific fields.

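[134] RAMA: Retrieval-Augmented Multi-Agent Framework for Misinformation Detection in Multimodal Fact-Checking cs.CL

Shuo Yang, Zijian Yu, Zhenzhe Ying, Yuqin Dai, Guoqing Wang

TL;DR: RAMA is a retrieval-augmented multi-agent framework for multimodal misinformation detection that improves performance through strategic query formulation, cross-verification evidence aggregation, and a multi-agent architecture.

Details

Motivation: The rapid spread of multimodal misinformation challenges automated fact-checking systems, especially when claims are ambiguous or lack context.

Result: RAMA performs strongly on benchmark datasets and is particularly good at handling ambiguous or improbable claims.

Insight: Integrating web-based evidence with multi-agent reasoning is essential for trustworthy multimodal verification.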

Abstract: The rapid proliferation of multimodal misinformation presents significant challenges for automated fact-checking systems, especially when claims are ambiguous or lack sufficient context. We introduce RAMA, a novel retrieval-augmented multi-agent framework designed for verifying multimedia misinformation. RAMA incorporates three core innovations: (1) strategic query formulation that transforms multimodal claims into precise web search queries; (2) cross-verification evidence aggregation from diverse, authoritative sources; and (3) a multi-agent ensemble architecture that leverages the complementary strengths of multiple multimodal large language models and prompt variants. Extensive experiments demonstrate that RAMA achieves superior performance on benchmark datasets, particularly excelling in resolving ambiguous or improbable claims by grounding verification in retrieved factual evidence. Our findings underscore the necessity of integrating web-based evidence and multi-agent reasoning for trustworthy multimedia verification, paving the way for more reliable and scalable fact-checking solutions. RAMA will be publicly available at https://github.com/kalendsyang/RAMA.git.

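[135] Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models cs.CL | cs.LG

Ameen Ali, Shahar Katz, Lior Wolf, Ivan Titov

TL;DR: The paper proposes a fine-tuning method that improves generalization by identifying and pruning neurons in large language models (LLMs) that are associated with dataset-specific mechanisms.

Details

Motivation: LLMs readily develop dataset-dependent mechanisms on particular datasets, and these mechanisms degrade performance on new tasks or distributions, so a method for improving generalization is needed.

Result: On multiple-choice benchmarks, the method significantly improves model performance, surpassing prior non-pruning adaptation methods.

Insight: The paper shows that dataset-specific neurons harm generalization and that pruning them improves performance, suggesting a new approach to LLM optimization.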

Abstract: Large language models (LLMs) often develop learned mechanisms specialized to specific datasets, such as reliance on domain-specific correlations, which yield high-confidence predictions without generalizable reasoning. While beneficial in one setting, these dataset-specific mechanisms typically degrade performance when models encounter novel tasks or distributions. In this work, we introduce a fine-tuning approach designed to enhance generalization by identifying and pruning neurons associated with dataset-specific mechanisms in transformer-based LLMs. Our method employs Integrated Gradients to quantify each neuron’s influence on high-confidence predictions, pinpointing those that disproportionately contribute to dataset-specific performance without supporting robust, transferable reasoning. Selectively pruning these neurons compels the model to depend on generalizable representations. Evaluated across multiple-choice benchmarks, our pruning-based fine-tuning significantly enhances performance, surpassing prior (non-pruning) adaptation methods.
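
As background on the attribution step, Integrated Gradients scores feature i as (x_i - b_i) times the average of the partial derivative of F along the straight path from a baseline b to the input x. The sketch below approximates this with a midpoint Riemann sum and finite-difference gradients on a toy function. It is only an illustration under stated assumptions: the toy function stands in for a model output, its inputs stand in for neuron activations, and none of the names come from the paper.

```python
def numerical_grad(f, x, eps=1e-5):
    """Central-difference gradient of a scalar function f at point x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=50):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(numerical_grad(f, point)):
            totals[i] += g
    # Scale the averaged gradient by the input-minus-baseline difference.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


def toy_model(x):
    """Hypothetical stand-in for a model's confidence score."""
    return x[0] * x[1] + 2.0 * x[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_model, x, baseline)
print(attributions)
```

A useful sanity check is the completeness property: the attributions should sum to approximately `toy_model(x) - toy_model(baseline)`. A pruning procedure could then rank units by attribution magnitude, although the selection rule the authors actually use is not given in the abstract.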

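[136] Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs cs.CL | cs.AI

Yangning Li, Weizhi Zhang, Yuyao Yang, Wei-Chieh Huang, Yaozu Wu

TL;DR: This survey examines how to combine retrieval-augmented generation (RAG) with reasoning in large language models (LLMs) to improve multi-step reasoning on complex problems.

Details

Motivation: RAG improves the factuality of LLMs but falls short on tasks requiring multi-step reasoning, while purely reasoning-oriented methods tend to hallucinate or mis-ground facts. The survey aims to combine the strengths of both.

Result: The survey shows that frameworks combining RAG and reasoning achieve state-of-the-art performance on knowledge-intensive tasks, and it provides a categorization of methods, datasets, and open challenges.

Insight: Future directions include RAG-reasoning systems that are more effective, multimodally adaptive, trustworthy, and human-centric.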

Abstract: Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge, yet it falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts. This survey synthesizes both strands under a unified reasoning-retrieval perspective. We first map how advanced reasoning optimizes each stage of RAG (Reasoning-Enhanced RAG). Then, we show how retrieved knowledge of different types supplies missing premises and expands context for complex inference (RAG-Enhanced Reasoning). Finally, we spotlight emerging Synergized RAG-Reasoning frameworks, where (agentic) LLMs iteratively interleave search and reasoning to achieve state-of-the-art performance across knowledge-intensive benchmarks. We categorize methods, datasets, and open challenges, and outline research avenues toward deeper RAG-Reasoning systems that are more effective, multimodally-adaptive, trustworthy, and human-centric. The collection is available at https://github.com/DavidZWZ/Awesome-RAG-Reasoning.

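[137] ViSP: A PPO-Driven Framework for Sarcasm Generation with Contrastive Learning cs.CL | cs.AI | cs.HC

Changli Wang, Rui Wu, Fang Yin

TL;DR: ViSP combines PPO (Proximal Policy Optimization) with contrastive learning in a sarcasm generation framework; on the M2SaG dataset it outperforms existing baselines and generates more sarcastic text.

Details

Motivation: Sarcasm is a complex form of emotional expression. Existing research relies mostly on a single modality (text) and neglects visual cues, and the mismatch between image content and sarcastic intent in existing datasets remains unresolved.

Result: ViSP surpasses the baselines (including large language models) on five metric sets, and its generated texts score higher than the original dataset on both Sarcasm Score (0.898 vs. 0.770) and Factual Incongruity (0.768 vs. 0.739).

Insight: Multimodal data (image plus text) and reinforcement learning can effectively improve sarcasm generation quality; large language models have limitations in sarcasm generation.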

Abstract: Human emotions are complex, with sarcasm being a subtle and distinctive form. Despite progress in sarcasm research, sarcasm generation remains underexplored, primarily due to the overreliance on textual modalities and the neglect of visual cues, as well as the mismatch between image content and sarcastic intent in existing datasets. In this paper, we introduce M2SaG, a multimodal sarcasm generation dataset with 4,970 samples, each containing an image, a sarcastic text, and a sarcasm target. To benchmark M2SaG, we propose ViSP, a generation framework that integrates Proximal Policy Optimization (PPO) and contrastive learning. PPO utilizes reward scores from DIP to steer the generation of sarcastic texts, while contrastive learning encourages the model to favor outputs with higher reward scores. These strategies improve overall generation quality and produce texts with more pronounced sarcastic intent. We evaluate ViSP across five metric sets and find it surpasses all baselines, including large language models, underscoring their limitations in sarcasm generation. Furthermore, we analyze the distributions of Sarcasm Scores and Factual Incongruity for both M2SaG and the texts generated by ViSP. The generated texts exhibit higher mean Sarcasm Scores (0.898 vs. 0.770) and Factual Incongruity (0.768 vs. 0.739), demonstrating that ViSP produces higher-quality sarcastic content than the original dataset. Our dataset and code will be released at https://github.com/wclapply/ViSP.

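[138] Balanced Training Data Augmentation for Aspect-Based Sentiment Analysis cs.CL

Junjie Liu, Yuanhe Tian, Yan Song

TL;DR: The paper proposes a training data augmentation method based on large language models (LLMs) to improve aspect-based sentiment analysis (ABSA), and uses reinforcement learning to optimize the augmentation process.

Details

Motivation: In ABSA, short texts and scarce, unbalanced labeled data (mostly positive sentiment) make it hard for models to learn contextual information adequately. Data augmentation (DA) is a feasible remedy, but existing methods struggle to guarantee the quality of the augmented data.

Result: On English ABSA benchmark datasets, the method outperforms strong baselines and most existing studies.

Insight: Balancing the data distribution and optimizing generation quality can effectively improve ABSA performance, especially when data is scarce and unbalanced.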

Abstract: Aspect-based sentiment analysis (ABSA) is a crucial fine-grained task in social media scenarios to identify the sentiment polarity of specific aspect terms in a sentence. Although many existing studies leverage large language models (LLMs) to perform ABSA due to their strong context understanding capabilities, they still face challenges in learning the context information in the running text because of the short text, as well as the small and unbalanced labeled training data, where most data are labeled with positive sentiment. Data augmentation (DA) is a feasible strategy for providing richer contextual information, especially when using LLMs to create synthetic training data, but faces challenges in ensuring a high quality of the augmented data. In this paper, we propose an LLM-based ABSA approach with training data augmentation. Specifically, an LLM is prompted to generate augmented training data based on the original training data, so as to construct a new training set with a larger size and balanced label distributions to better train an ABSA model. Meanwhile, in order to improve the quality of the augmented data, we propose a reinforcement learning approach to optimize the data augmentation LLM. Experimental results and further analyses on English benchmark datasets for ABSA demonstrate the effectiveness of our approach, where superior performance is observed over strong baselines and most existing studies.

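[139] GoalfyMax: A Protocol-Driven Multi-Agent System for Intelligent Experience Entities cs.CL

Siyi Wu, Zeyu Wang, Xinyuan Song, Zhengpeng Zhou, Lifan Sun

TL;DR: GoalfyMax is a protocol-driven multi-agent system that uses a standardized Agent-to-Agent communication layer and a layered memory architecture to address the shortcomings of traditional AI systems in coordination, memory reuse, and task decomposition.

Details

Motivation: Modern enterprise environments need highly autonomous, adaptable intelligent systems that can handle complex, dynamic, multi-faceted tasks, but traditional single-purpose AI systems lack coordination and scalability and struggle to meet real-world needs.

Result: On complex task orchestration benchmarks and in a case study, GoalfyMax shows better adaptability, coordination, and experience reuse than baseline frameworks.

Insight: With standardized protocols and a structured memory design, GoalfyMax provides a scalable foundation for multi-agent systems, suited to the demands of future complex tasks.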

Abstract: Modern enterprise environments demand intelligent systems capable of handling complex, dynamic, and multi-faceted tasks with high levels of autonomy and adaptability. However, traditional single-purpose AI systems often lack sufficient coordination, memory reuse, and task decomposition capabilities, limiting their scalability in realistic settings. To address these challenges, we present GoalfyMax, a protocol-driven framework for end-to-end multi-agent collaboration. GoalfyMax introduces a standardized Agent-to-Agent (A2A) communication layer built on the Model Context Protocol (MCP), allowing independent agents to coordinate through asynchronous, protocol-compliant interactions. It incorporates the Experience Pack (XP) architecture, a layered memory system that preserves both task rationales and execution traces, enabling structured knowledge retention and continual learning. Moreover, our system integrates advanced features including multi-turn contextual dialogue, long-short term memory modules, and dynamic safety validation, supporting robust, real-time strategy adaptation. Empirical results on complex task orchestration benchmarks and a case study demonstrate that GoalfyMax achieves superior adaptability, coordination, and experience reuse compared to baseline frameworks. These findings highlight its potential as a scalable, future-ready foundation for multi-agent intelligent systems.

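[140] Can Group Relative Policy Optimization Improve Thai Legal Reasoning and Question Answering? cs.CL

Pawitsapak Akarajaradwong, Chompakorn Chaksangchaichot, Pirat Pothavorn, Attapol Thamrongrattanarit-Rutherford, Ekapol Chuangsuwanich

TL;DR: The paper proposes an approach based on Group-Relative Policy Optimization (GRPO) that significantly improves the performance and resource efficiency of Thai legal question-answering systems.

Details

Motivation: Existing retrieval-augmented generation (RAG) systems perform poorly on Thai legal question answering, especially when complex legal reasoning is required.

Result: On the NitiBench benchmark, GRPO achieves up to a 90% gain in citation-F1 over the base model and a 31% gain in joint quality metrics over instruction tuning.

Insight: GRPO not only improves performance but also shows greater robustness on complex legal reasoning tasks, while reducing resource consumption.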

Abstract: The performance of Retrieval-Augmented Generation (RAG) systems on Thai legal question answering is still limited, especially for questions requiring extensive, complex legal reasoning. To address these limitations, we introduce an approach aligning LLMs toward improved law citation accuracy and better response quality using Group-Relative Policy Optimization (GRPO). Our approach leverages BGE-M3 embeddings as a cost-efficient semantic-similarity reward, significantly reducing computational expenses by up to 2.5x compared to large language model judges. Experiments on the NitiBench benchmark demonstrate substantial improvements: GRPO achieves up to 90% citation-F1 gains from the base model and a 31% increase in joint quality metrics over instruction tuning. Crucially, our method shows enhanced robustness on complex legal reasoning tasks compared to instruction tuning, providing an effective and resource-efficient solution for enhancing Thai legal LLMs.
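
As an illustration of the group-relative idea behind GRPO combined with a similarity-based reward, the sketch below scores several sampled answers against a reference by cosine similarity and then normalizes the scores within the group. It is a simplified sketch, not the authors' training code: the toy 2-D vectors stand in for BGE-M3 embeddings, and the function names and the use of the population standard deviation are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy embeddings: one reference answer and three sampled answers to the same question.
reference = [1.0, 0.0]
samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

rewards = [cosine_similarity(s, reference) for s in samples]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because each advantage is measured relative to the other samples for the same prompt, no separate value network is needed, and an embedding-similarity reward avoids calling a large judge model for every sample.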

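[141] Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces cs.CL | cs.LG

Baturay Saglam, Paul Kassianik, Blaine Nelson, Sajana Weerawardhena, Yaron Singer

TL;DR: The paper studies the geometry of the latent space of large language models (LLMs) and finds that high-level semantic information is concentrated in low-dimensional linear subspaces that are linearly separable across domains. The separation is more pronounced in deeper layers, and model behavior (such as chain-of-thought) can be influenced through a simple vector direction.

Details

Motivation: Understanding the latent space geometry of LLMs is key to interpreting their behavior and improving alignment. It remains unclear how LLMs internally organize representations related to semantic understanding.

Result: The linear separation of semantic information is more pronounced in deeper layers, and reasoning patterns such as chain-of-thought can be captured by a simple vector direction; experiments demonstrate a lightweight latent-space detector that identifies adversarial prompts with high precision.

Insight: The geometric properties of the latent space make it possible to build geometry-aware tools that operate directly on representations (for example, adversarial defenses), helping to improve model safety and alignment.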

Abstract: Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. However, it remains unclear to what extent LLMs internally organize representations related to semantic understanding. To investigate this, we conduct a large-scale empirical study of hidden states in transformer-based LLMs, analyzing 11 decoder-only models across 6 scientific topics and 12 layers each. We find that high-level semantic information consistently lies in low-dimensional subspaces that form linearly separable representations across distinct domains. This separability becomes more pronounced in deeper layers and under prompts that trigger structured reasoning or alignment behaviors, even when surface content is unchanged. This geometry enables simple yet effective causal interventions in hidden space; for example, reasoning patterns like chain-of-thought can be captured by a single vector direction. Together, these findings support the development of geometry-aware tools that operate directly on latent representations to detect and mitigate harmful or adversarial content, using methods such as transport-based defenses that leverage this separability. As a proof of concept, we demonstrate this potential by training a simple MLP classifier as a lightweight latent-space guardrail, which detects adversarial and malicious prompts with high precision.
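
One common way to obtain a single vector direction of this kind is the difference between the mean hidden states of two prompt sets; the sketch below shows that construction and a simple additive intervention. This is an assumption for illustration: the abstract does not say how the authors compute their direction, and the toy 2-D vectors and function names are hypothetical stand-ins for real hidden states.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(col) / count for col in zip(*vectors)]


def steering_direction(positive, negative):
    """Unit-length difference between the two class means."""
    diff = [p - n for p, n in zip(mean_vector(positive), mean_vector(negative))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to extract")
    return [d / norm for d in diff]


def project(hidden, direction):
    """Scalar projection of a hidden state onto the direction."""
    return sum(h * d for h, d in zip(hidden, direction))


def steer(hidden, direction, strength):
    """Shift a hidden state along the direction by the given strength."""
    return [h + strength * d for h, d in zip(hidden, direction)]


# Toy hidden states: with and without a reasoning-style prompt.
with_reasoning = [[2.0, 1.0], [2.0, 3.0]]
without_reasoning = [[0.0, 1.0], [0.0, 3.0]]

direction = steering_direction(with_reasoning, without_reasoning)
hidden = [0.5, 1.0]
steered = steer(hidden, direction, strength=2.0)
print(direction, project(hidden, direction), project(steered, direction))
```

The projection onto the direction can serve as a simple detector feature, while adding a multiple of the direction is the simplest form of intervention in hidden space.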

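[142] Tiny Reward Models cs.CL | cs.AI

Sarah Pan

TL;DR: The paper presents TinyRM, a family of lightweight bidirectional language models with as few as 400 million parameters that, using domain-specific tuning strategies (such as FLAN-style prompting and DoRA), rival much larger models on reasoning and safety preference modeling tasks, pointing to a new direction for efficient preference modeling.

Details

Motivation: As reward models are widely deployed in RLHF, their inference cost has become a significant concern. The work aims to develop small yet effective reward models that reduce resource consumption.

Result: Experiments show that TinyRM performs strongly on reasoning and safety tasks, with lightweight fine-tuning methods proving especially effective.

Insight: Small models can reach strong performance through domain-specific tuning, and bidirectional architectures show promise for preference modeling, but generalist models and conversational preference modeling remain challenging.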

Abstract: Large decoder-based language models have become the dominant architecture for reward modeling in reinforcement learning from human feedback (RLHF). However, as reward models are increasingly deployed in test-time strategies, their inference costs become a growing concern. We present TinyRM, a family of small, bidirectional masked language models (MLMs) with as few as 400 million parameters, that rival the capabilities of models over 175 times larger on reasoning and safety preference modeling tasks. TinyRM combines FLAN-style prompting, Directional Low-Rank Adaptation (DoRA), and layer freezing to achieve strong performance on RewardBench, despite using significantly fewer resources. Our experiments suggest that small models benefit from domain-specific tuning strategies, particularly in reasoning, where lightweight finetuning methods are especially effective. While challenges remain in building generalist models and conversational preference modeling, our preliminary results highlight the promise of lightweight bidirectional architectures as efficient, scalable alternatives for preference modeling.

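[143] Enhancing Chain-of-Thought Reasoning with Critical Representation Fine-tuning cs.CL | cs.AI

Chenxi Huang, Shaotian Yan, Liang Xie, Binbin Lin, Sinan Fan

TL;DR: The paper proposes CRFT, a method that uses information flow analysis to identify and optimize critical representations, substantially improving performance on complex reasoning tasks.

Details

Motivation: Standard ReFT performs poorly on complex reasoning tasks because representations at fixed positions have an uncertain effect on the output. The authors observe that certain critical representations strongly influence the final output of reasoning tasks, and propose optimizing those representations instead.

Result: The method is validated on eight arithmetic and commonsense reasoning benchmarks; in the few-shot setting it raises one-shot accuracy by 16.4%.

Insight: Critical representations play a decisive role in complex reasoning, and representation-level optimization is a lightweight, effective alternative to conventional fine-tuning.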

Abstract: Representation Fine-tuning (ReFT), a recently proposed Parameter-Efficient Fine-Tuning (PEFT) method, has attracted widespread attention for significantly improving parameter efficiency by editing representation space alone. In this work, we investigate applying ReFT to complex reasoning tasks. However, directly using the native ReFT method, which modifies fixed representations at the beginning and end of each layer, yields suboptimal performance, as these fixed-position representations have uncertain impact on the outputs. We observe that, in complex reasoning tasks, there often exist certain critical representations. These representations either integrate significant information from preceding layers or regulate subsequent layer representations. Through layer-by-layer propagation, they exert a substantial influence on the final output. Naturally, fine-tuning these critical representations has the potential to greatly enhance reasoning performance. Building upon these insights, we propose Critical Representation Fine-Tuning (CRFT), a novel method that identifies and optimizes these critical representations through information flow analysis. CRFT operates within a supervised learning framework, dynamically optimizing critical representations in a low-rank linear subspace while freezing the base model. The effectiveness and efficiency of our method are validated across eight benchmarks for arithmetic and commonsense reasoning, using LLaMA and Mistral model families. Furthermore, our method also adapts effectively to few-shot settings, boosting one-shot accuracy by 16.4%. Our work highlights the untapped potential of representation-level optimization for CoT reasoning, offering a lightweight yet powerful alternative to traditional PEFT methods.
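
To make "optimizing representations in a low-rank linear subspace while freezing the base model" concrete, here is a minimal sketch of a low-rank edit of the form h + Rᵀ(Wh + b − Rh), the form used by the low-rank variant of ReFT. Whether CRFT uses exactly this parameterization is not stated in the abstract, and the step that selects which representations are critical is not shown. The matrices and values are toy assumptions.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Compute m v for an r x d matrix m and a length-d vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose_matvec(m: Matrix, v: Vector) -> Vector:
    """Compute m^T v for an r x d matrix m and a length-r vector v."""
    d = len(m[0])
    return [sum(m[i][j] * v[i] for i in range(len(m))) for j in range(d)]


def low_rank_intervention(h: Vector, R: Matrix, W: Matrix, b: Vector) -> Vector:
    """Edit h only inside the subspace spanned by the rows of R.

    Returns h + R^T (W h + b - R h); the base model weights are untouched.
    """
    target = [wh + bi for wh, bi in zip(matvec(W, h), b)]  # learned target
    current = matvec(R, h)                                 # current projection
    delta = [t - c for t, c in zip(target, current)]
    update = transpose_matvec(R, delta)
    return [hi + ui for hi, ui in zip(h, update)]


# Toy rank-1 example in 3 dimensions (all values are illustrative).
R = [[1.0, 0.0, 0.0]]   # subspace: first coordinate only
W = [[0.0, 0.0, 0.0]]
b = [5.0]
h = [2.0, 3.0, 4.0]

edited = low_rank_intervention(h, R, W, b)
print(edited)
```

Only the coordinate inside the subspace changes; in training, R, W, and b would be the only learned parameters.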

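[144] Fusing Large Language Models with Temporal Transformers for Time Series Forecasting cs.CL

Chen Su, Yuanhe Tian, Qinyu Liu, Jun Zhang, Yan Song

TL;DR: The paper proposes an architecture that combines large language models (LLMs) with time series Transformers to compensate for the weakness of LLMs in modeling continuous time series data and to improve forecasting accuracy.

Details

Motivation: LLMs excel at text tasks, but when applied directly to time series forecasting (TSF) they underperform dedicated Transformer models. LLMs are good at discrete semantic patterns, whereas time series are continuous numerical data, which leaves a modeling gap.

Result: Experiments on benchmark datasets show that the method fuses semantic and temporal information and forecasts better than models that use an LLM or a vanilla Transformer alone.

Insight: The semantic modeling ability of LLMs and the ability of time series Transformers to capture temporal dynamics are complementary; fusing the two can markedly improve forecasting performance.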

Abstract: Recently, large language models (LLMs) have demonstrated powerful capabilities in performing various tasks and thus are applied by recent studies to time series forecasting (TSF) tasks, which predict future values with the given historical time series. Existing LLM-based approaches transfer knowledge learned from text data to time series prediction using prompting or fine-tuning strategies. However, LLMs are proficient at reasoning over discrete tokens and semantic patterns but are not initially designed to model continuous numerical time series data. The gaps between text and time series data lead LLMs to achieve inferior performance to a vanilla Transformer model that is directly trained on TSF data. However, the vanilla Transformers often struggle to learn high-level semantic patterns. In this paper, we design a novel Transformer-based architecture that complementarily leverages LLMs and vanilla Transformers, so as to integrate the high-level semantic representations learned by LLMs into the temporal information encoded by time series Transformers, where a hybrid representation is obtained by fusing the representations from the LLM and the Transformer. The resulting fused representation contains both historical temporal dynamics and semantic variation patterns, allowing our model to predict more accurate future values. Experiments on benchmark datasets demonstrate the effectiveness of the proposed approach.

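[145] Task-Based Flexible Feature Distillation for LLMs cs.CL

Khouloud Saadi, Di Wang

TL;DR: The paper proposes a task-based feature distillation method that addresses the performance loss in LLM knowledge distillation when teacher and student have different hidden sizes, without introducing additional parameters.

Details

Motivation: Traditional feature distillation assumes that teacher and student share the same hidden size, which limits the flexibility of the student architecture. A linear projector removes this restriction but adds parameters and degrades task performance. The paper aims to avoid both problems.

Result: Experiments show that the method outperforms the linear projection baseline on classification, instruction-following, and summarization, with gains of up to 3%.

Insight: Distilling only the task-relevant hidden units transfers knowledge efficiently while preserving the flexibility and performance of the model.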

Abstract: Knowledge Distillation (KD) in general and feature distillation in particular are promising techniques for reducing the high computational demand of large language models (LLMs). However, traditional feature KD methods typically assume that the teacher and the student share the same hidden size, limiting the flexibility of the student’s architecture. A common solution to this problem involves training a linear projector to align their feature spaces, but this introduces additional parameters that must be learned from scratch and often degrades performance on downstream tasks, especially in generative settings. To address this issue, in this work, we propose a novel task-based feature distillation method that enables knowledge transfer between teacher and student models with different hidden layer dimensions, without introducing any new parameters. Leveraging the insight that only a subset of LLM components contribute significantly to a specific downstream task, our approach identifies the most task-relevant hidden units in the teacher and directly distills their activations to the student. Our method is flexible and easily integrates with other distillation frameworks. Empirical results show consistent improvements over prior approaches across diverse tasks, including classification, instruction-following, and summarization, achieving up to a 3% performance gain over the linear projection baseline.
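
The core mechanism, picking the most task-relevant teacher units and matching the student to them without a projector, can be sketched as follows. The importance scores are assumed to be given; how the paper computes task relevance is not described in the abstract, and the function names are hypothetical.

```python
from typing import List


def select_top_units(importance: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring teacher units, in ascending order."""
    order = sorted(range(len(importance)),
                   key=lambda i: importance[i], reverse=True)
    return sorted(order[:k])


def feature_distillation_loss(teacher_act: List[float],
                              student_act: List[float],
                              selected: List[int]) -> float:
    """Mean squared error between student units and selected teacher units."""
    if len(selected) != len(student_act):
        raise ValueError("need one selected teacher unit per student unit")
    diffs = [(teacher_act[i] - s) ** 2 for i, s in zip(selected, student_act)]
    return sum(diffs) / len(diffs)


# Toy example: teacher hidden size 4, student hidden size 2.
importance = [0.1, 0.9, 0.3, 0.7]    # assumed task-relevance scores
teacher_act = [1.0, 2.0, 3.0, 4.0]
student_act = [2.0, 3.0]

selected = select_top_units(importance, k=len(student_act))
loss = feature_distillation_loss(teacher_act, student_act, selected)
print(selected, loss)
```

Because the student is matched directly to a subset of teacher units, no new parameters are introduced, which is the property the abstract emphasizes.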

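[146] Meanings are like Onions: a Layered Approach to Metaphor Processing cs.CL

Silvia Cappa, Anna Sofia Lippolis, Stefano Zoia

TL;DR: The paper proposes a layered model that splits metaphor processing into three levels (content analysis, conceptual blending, and pragmatic intentionality), giving computational systems a richer and more cognitively grounded approach to metaphor interpretation.

Details

Motivation: Metaphorical meaning is usually treated as a simple mapping between concepts, but it is in fact a complex cognitive phenomenon involving multiple levels of interpretation. Existing approaches fail to capture this complexity, which motivates a new layered model.

Result: By unifying the three levels, the model can go beyond surface associations toward deeper, more context-sensitive metaphor reasoning.

Insight: Understanding metaphorical meaning requires integrating multiple levels; introducing the pragmatic level in particular brings computational models closer to human cognition.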

Abstract: Metaphorical meaning is not a flat mapping between concepts, but a complex cognitive phenomenon that integrates multiple levels of interpretation. In this paper, we propose a stratified model of metaphor processing that treats meaning as an onion: a multi-layered structure comprising (1) content analysis, (2) conceptual blending, and (3) pragmatic intentionality. This three-dimensional framework allows for a richer and more cognitively grounded approach to metaphor interpretation in computational systems. At the first level, metaphors are annotated through basic conceptual elements. At the second level, we model conceptual combinations, linking components to emergent meanings. Finally, at the third level, we introduce a pragmatic vocabulary to capture speaker intent, communicative function, and contextual effects, aligning metaphor understanding with pragmatic theories. By unifying these layers into a single formal framework, our model lays the groundwork for computational methods capable of representing metaphorical meaning beyond surface associations, toward deeper, more context-sensitive reasoning.

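[147] From Sequence to Structure: Uncovering Substructure Reasoning in Transformers cs.CL | cs.AI

Xinnan Dai, Kai Yang, Jay Revolinsky, Kai Guo, Aoran Wang

TL;DR: The paper examines how decoder-only Transformer architectures understand graph structures given in textual descriptions. It proposes the Induced Substructure Filtration (ISF) perspective, shows how substructures are identified across the layers of a multi-layer Transformer, and validates the approach on several graph types.

Details

Motivation: Large language models (LLMs) can solve graph reasoning tasks and answer questions effectively even when the graph structure is embedded in text. How a decoder-only Transformer understands these structures remains unclear, which calls for a closer look at its internal mechanisms.

Result: The study shows that decoder-only Transformers can successfully extract substructures from attributed graphs and exhibit consistent internal dynamics across layers.

Insight: Sequence-based Transformers can extract structural information from graph data effectively by thinking in substructures, which offers a new perspective on their ability to handle complex graph tasks.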

Abstract: Recent studies suggest that large language models (LLMs) possess the capability to solve graph reasoning tasks. Notably, even when graph structures are embedded within textual descriptions, LLMs can still effectively answer related questions. This raises a fundamental question: How can a decoder-only Transformer architecture understand underlying graph structures? To address this, we start with the substructure extraction task, interpreting the inner mechanisms inside the transformers and analyzing the impact of the input queries. Specifically, through both empirical results and theoretical analysis, we present Induced Substructure Filtration (ISF), a perspective that captures the substructure identification in the multi-layer transformers. We further validate the ISF process in LLMs, revealing consistent internal dynamics across layers. Building on these insights, we explore the broader capabilities of Transformers in handling diverse graph types. Specifically, we introduce the concept of thinking in substructures to efficiently extract complex composite patterns, and demonstrate that decoder-only Transformers can successfully extract substructures from attributed graphs, such as molecular graphs. Together, our findings offer a new insight on how sequence-based Transformers perform the substructure extraction task over graph data.

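[148] Referential ambiguity and clarification requests: comparing human and LLM behaviour cs.CL | cs.AI

Chris Madge, Matthew Purver, Massimo Poesio

TL;DR: The paper studies the ability of LLMs to ask clarification questions in task-oriented dialogue and compares human and LLM behaviour. It finds low correlation between the clarification behaviour of humans and LLMs when facing ambiguity.

Details

Motivation: To examine whether LLMs can ask clarification questions in task-oriented dialogue and how their behaviour differs from that of humans, in order to understand whether this ability depends on their reasoning capabilities.

Result: Humans rarely ask clarification questions about referential ambiguity and more often ask about task uncertainty; LLMs show the opposite pattern. Reasoning increases the frequency and relevance of LLM questions.

Insight: LLM clarification behaviour differs markedly from human behaviour, and reasoning appears to be a key factor in the ability to ask questions.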

Abstract: In this work we examine LLMs’ ability to ask clarification questions in task-oriented dialogues that follow the asynchronous instruction-giver/instruction-follower format. We present a new corpus that combines two existing annotations of the Minecraft Dialogue Corpus – one for reference and ambiguity in reference, and one for SDRT including clarifications – into a single common format providing the necessary information to experiment with clarifications and their relation to ambiguity. With this corpus we compare LLM actions with original human-generated clarification questions, examining how both humans and LLMs act in the case of ambiguity. We find that there is only a weak link between ambiguity and humans producing clarification questions in these dialogues, and low correlation between humans and LLMs. Humans hardly ever produce clarification questions for referential ambiguity, but often do so for task-based uncertainty. Conversely, LLMs produce more clarification questions for referential ambiguity, but less so for task uncertainty. We question if LLMs’ ability to ask clarification questions is predicated on their recent ability to simulate reasoning, and test this with different reasoning approaches, finding that reasoning does appear to increase question frequency and relevancy.

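[149] CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks cs.CL | cs.AI | cs.SE

Hongchao Jiang, Yiming Chen, Yushi Cao, Hung-yi Lee, Robby T. Tan

TL;DR: The paper introduces CodeJudgeBench, a benchmark for evaluating LLMs acting as judges on coding tasks. Across code generation, code repair, and unit test generation, thinking models outperform non-thinking models, but all models show significant randomness.

Details

Motivation: Although the LLM-as-a-Judge paradigm is increasingly used for coding tasks, its effectiveness has not been studied thoroughly because no dedicated benchmark exists.

Result: Thinking models significantly outperform non-thinking models, yet all models exhibit randomness in their judgments; pairwise comparison prompts outperform point-wise scoring.

Insight: The reliability of LLM-as-a-Judge depends on response order and model choice; retaining the full output, including comments and reasoning, improves judge performance.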

Abstract: Large Language Models (LLMs) have significantly advanced the state-of-the-art in various coding tasks. Beyond directly answering user queries, LLMs can also serve as judges, assessing and comparing the quality of responses generated by other models. Such an evaluation capability is crucial both for benchmarking different LLMs and for improving response quality through response ranking. However, despite the growing adoption of the LLM-as-a-Judge paradigm, its effectiveness in coding scenarios remains underexplored due to the absence of dedicated benchmarks. To address this gap, we introduce CodeJudgeBench, a benchmark explicitly designed to evaluate the performance of LLM-as-a-Judge models across three critical coding tasks: code generation, code repair, and unit test generation. Through comprehensive benchmarking of 26 LLM-as-a-Judge models, we find that recent thinking models significantly outperform non-thinking models on our carefully designed code judging tasks. Notably, even relatively small thinking models, such as Qwen3-8B, can outperform specially trained LLM-as-a-Judge models up to 70B in size. Nevertheless, all models still exhibit significant randomness in their judgment of coding tasks. For pairwise judging tasks, simply changing the order in which responses are presented can substantially impact accuracy. In addition, when judging code and unit tests written by different LLMs, LLM-as-a-Judge models also show variance in performance. This sensitivity raises concerns about the reliability and consistency of LLM-as-a-Judge in coding scenarios. Lastly, we study optimal prompting strategies for LLM-as-a-Judge. We find that using pair-wise comparison outperforms scalar point-wise judging. Furthermore, retaining comments and reasoning in the full, unprocessed LLM response leads to improved judge performance.
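
The order sensitivity described for pairwise judging can be measured with a simple swap test: judge each pair twice with the responses in both orders and check whether the same response wins. The sketch below assumes a judge is any callable that returns "A" or "B"; the two toy judges and the consistency metric are illustrative and are not part of the benchmark itself.

```python
from typing import Callable, List, Tuple

# A judge sees (task, first_response, second_response) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def position_consistent(judge: Judge, task: str, resp_x: str, resp_y: str) -> bool:
    """True if the same response wins regardless of presentation order."""
    first = judge(task, resp_x, resp_y)    # x is shown as "A"
    second = judge(task, resp_y, resp_x)   # x is shown as "B"
    winner_first = "x" if first == "A" else "y"
    winner_second = "y" if second == "A" else "x"
    return winner_first == winner_second


def consistency_rate(judge: Judge, items: List[Tuple[str, str, str]]) -> float:
    """Fraction of (task, x, y) items judged consistently under swapping."""
    hits = sum(position_consistent(judge, t, x, y) for t, x, y in items)
    return hits / len(items)


def always_first(task: str, a: str, b: str) -> str:
    """Toy judge with a pure position bias."""
    return "A"


def prefer_longer(task: str, a: str, b: str) -> str:
    """Toy judge that depends only on content (response length)."""
    return "A" if len(a) >= len(b) else "B"


items = [("t1", "short", "much longer answer"), ("t2", "abc", "abcdef")]
print(consistency_rate(always_first, items), consistency_rate(prefer_longer, items))
```

A real evaluation would replace the toy judges with calls to an LLM and would also compare the verdicts against ground-truth labels.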

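[150] REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once cs.CL

Zhuoshi Pan, Qizhi Pei, Yu Li, Qiyao Sun, Zinan Tang

TL;DR: REST is a framework that tests large reasoning models (LRMs) on several problems at once. It exposes performance degradation in current models under stress and discriminates between models better than conventional single-problem testing.

Details

Motivation: Existing evaluations target single-problem reasoning and cannot assess models under multi-context pressure. They are also vulnerable to data contamination, so new questions must be created continually at high cost.

Result: Even SOTA models such as DeepSeek-R1 degrade substantially under stress testing; REST is more discriminative than conventional evaluation and reveals differences between models.

Insight: 1. Performance degradation under stress stems partly from the “overthinking trap”; 2. Models trained with the “long2short” technique perform better under REST.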

Abstract: Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting in critical limitations: (1) vulnerability to data contamination and insufficient difficulty (e.g., DeepSeek-R1 achieves 97.0% on MATH500), forcing the costly and perpetual creation of new questions with large human effort, and (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously. Beyond basic reasoning, REST specifically evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management. Our evaluation reveals several striking findings: Even state-of-the-art (SOTA) models like DeepSeek-R1 exhibit substantial performance degradation under stress testing. Crucially, REST demonstrates stronger discriminative power than existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluations. Some key mechanistic insights emerge from our analysis: (1) the “overthinking trap” is a critical factor contributing to the performance degradation; (2) models trained with the “long2short” technique preserve more of their single-problem accuracy under REST, outperforming standard-trained counterparts. These results establish REST as a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands while reducing reliance on continuous human annotation.
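
The basic move of presenting several existing benchmark questions in one prompt can be sketched as below. The prompt wording, the batching scheme, and the function names are assumptions for illustration; the abstract does not give the exact template used by REST.

```python
from typing import List


def stress_batches(questions: List[str], level: int) -> List[List[str]]:
    """Group questions into consecutive batches of `level` problems each."""
    return [questions[i:i + level] for i in range(0, len(questions), level)]


def build_rest_prompt(questions: List[str]) -> str:
    """Concatenate several problems into a single multi-problem prompt."""
    lines = [f"Solve all {len(questions)} problems below. "
             "Answer each one separately."]
    for i, question in enumerate(questions, start=1):
        lines.append(f"Problem {i}: {question}")
    return "\n".join(lines)


batches = stress_batches(["q1", "q2", "q3", "q4", "q5"], level=2)
prompt = build_rest_prompt(["What is 2 + 2?", "What is 3 * 3?"])
print(batches)
print(prompt)
```

Raising `level` increases the number of problems per prompt, so the same question pool can be reused at several stress levels without writing new questions.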

cs.AI [Back]

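[151] Think Clearly: Improving Reasoning via Redundant Token Pruning cs.AI | cs.CL | cs.LG

Daewon Choi, Jimin Lee, Jihoon Tack, Woomin Song, Saket Dingliwal

TL;DR: The paper proposes redundant token pruning to improve the reasoning performance of large language models: by identifying and pruning redundant tokens in the reasoning path, it significantly raises accuracy on complex reasoning tasks.

Details

Motivation: Current large language models show redundancy in long-form reasoning, and incorrect answers in particular tend to exhibit greater attention sparsity, which suggests that redundant tokens may interfere with clear thinking.

Result: Experiments show significant accuracy gains on several reasoning-intensive benchmarks, with particularly strong results on mathematical competition benchmarks such as AIME and AMC.

Insight: Redundant tokens may be a key factor that disrupts clear reasoning, and a structure-aware pruning strategy can improve reasoning without any additional training cost.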

Abstract: Recent large language models have shown promising capabilities in long-form reasoning, following structured chains of thought before arriving at a final answer. However, we observe that these reasoning paths tend to include substantial redundancy; analyzing attention patterns reveals that attention scores are widely scattered, and incorrect answers in particular exhibit greater attention sparsity. In this paper, we demonstrate that deliberately removing this redundancy in the reasoning process significantly improves performance through clear thinking, i.e., removing distraction. Specifically, we systematically identify reasoning redundancy by measuring token-level attention scores to a special end-of-thinking token, which is appended to an explicit instruction inserted to conclude each intermediate reasoning step. Furthermore, we propose structure-aware pruning that prioritizes removing tokens in low-contributing reasoning chunks over individual tokens. After evicting redundant tokens, we remove the injected end-of-thinking instruction, then resume the reasoning generation. We demonstrate that our method significantly improves overall accuracy across reasoning-intensive benchmarks without any training involved. In particular, our method shows strong performance on challenging mathematical competition benchmarks such as AIME and AMC, where reasoning redundancy is more prevalent.
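
The structure-aware pruning step, which evicts whole low-contributing reasoning chunks before individual tokens, can be sketched as follows. The per-token scores are supplied here as a plain list, whereas the method reads them from the attention paid by a special end-of-thinking token; the mean-score ranking and the token budget rule are assumptions made for illustration.

```python
from typing import List, Tuple

Chunk = Tuple[int, int]  # half-open token range [start, end)


def chunk_scores(token_scores: List[float], chunks: List[Chunk]) -> List[float]:
    """Mean per-token score of each reasoning chunk."""
    return [sum(token_scores[s:e]) / (e - s) for s, e in chunks]


def prune_chunks(token_scores: List[float], chunks: List[Chunk],
                 keep_tokens: int) -> List[int]:
    """Evict the lowest-scoring chunks until at most keep_tokens remain.

    Returns the indices of the tokens that are kept, in original order.
    """
    scores = chunk_scores(token_scores, chunks)
    order = sorted(range(len(chunks)), key=lambda i: scores[i])  # lowest first
    kept = set(range(len(chunks)))
    total = sum(e - s for s, e in chunks)
    for i in order:
        if total <= keep_tokens:
            break
        start, end = chunks[i]
        kept.discard(i)
        total -= end - start
    return [t for i in sorted(kept) for t in range(*chunks[i])]


# Toy example: six tokens in three reasoning chunks of two tokens each.
token_scores = [0.9, 0.8, 0.1, 0.1, 0.5, 0.7]
chunks = [(0, 2), (2, 4), (4, 6)]

kept_tokens = prune_chunks(token_scores, chunks, keep_tokens=4)
print(kept_tokens)
```

In an actual decoder the kept indices would determine which key-value cache entries survive before generation resumes.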

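[152] Towards Concise and Adaptive Thinking in Large Reasoning Models: A Survey cs.AI | cs.CL

Jason Zhu, Hongyu Li

TL;DR: This survey covers concise and adaptive thinking in large reasoning models (LRMs), focusing on how to cut redundant reasoning chains and switch dynamically between fast and slow thinking.

Details

Motivation: LRMs perform well on complex tasks, but their reasoning chains are often overly long and redundant, wasting resources and increasing response time, which hinders practical use. Shortening reasoning chains and enabling adaptive thinking is therefore crucial.

Result: The paper reports no experimental results of its own; it summarizes the strengths and weaknesses of existing methods and outlines directions for future research.

Insight: LRMs need more efficient reasoning strategies, especially the ability to adjust between fast and slow thinking according to input difficulty, which matters for their practical value.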

Abstract: Large reasoning models (LRMs) like OpenAI o1 and DeepSeek R1 have demonstrated impressive performance on complex reasoning tasks like mathematics and programming with long Chain-of-Thought (CoT) reasoning sequences (slow-thinking), compared with traditional large language models (fast-thinking). However, these reasoning models also face a major challenge: they generate unnecessarily lengthy and redundant reasoning chains even for trivial questions. This phenomenon leads to a significant waste of inference resources, increases the response time for simple queries, and hinders the practical application of LRMs in real-world products. To this end, it is crucial to shorten lengthy reasoning chains and learn adaptive reasoning between fast and slow thinking based on input difficulty. In this survey, we provide a comprehensive overview of recent progress in concise and adaptive thinking for efficient reasoning of LRMs, including methodologies, benchmarks, and challenges for future exploration. We hope this survey can help researchers quickly understand the landscape of this field and inspire novel adaptive thinking ideas to facilitate better usage of LRMs.

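[153] Sound and Complete Neuro-symbolic Reasoning with LLM-Grounded Interpretations cs.AI | cs.CL | cs.LO

Bradley P. Allen, Prateek Chhikara, Thomas Macaulay Ferguson, Filip Ilievski, Paul Groth

TL;DR: The paper proposes a neuro-symbolic reasoning framework that draws on the broad knowledge of LLMs while preserving the soundness and completeness of the underlying logic, addressing the logical inconsistency of LLM outputs.

Details

Motivation: LLMs excel at natural language understanding and generation but are logically inconsistent. Harnessing their broad knowledge for formal reasoning is a key challenge.

Result: Experiments support the feasibility of the method, which makes effective use of LLM knowledge while preserving the logical properties.

Insight: The method provides a theoretical framework for neuro-symbolic reasoning that combines the broad knowledge of LLMs with the rigor of formal logic.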

Abstract: Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but they exhibit problems with logical consistency in the output they generate. How can we harness LLMs’ broad-coverage parametric knowledge in formal reasoning despite their inconsistency? We present a method for directly integrating an LLM into the interpretation function of the formal semantics for a paraconsistent logic. We provide experimental evidence for the feasibility of the method by evaluating the function using datasets created from several short-form factuality benchmarks. Unlike prior work, our method offers a theoretical framework for neuro-symbolic reasoning that leverages an LLM’s knowledge while preserving the underlying logic’s soundness and completeness properties.

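[154] On The Role of Intentionality in Knowledge Representation: Analyzing Scene Context for Cognitive Agents with a Tiny Language Model cs.AI | cs.CL | I.2.11; F.4.1; I.2.4; G.2.2

Mark Burgess

TL;DR: The paper explores the role of intentionality in knowledge representation and proposes a lightweight language model approach based on intentionality and scene context analysis, suited to low-cost processing in cognitive agents.

Details

Motivation: Philosophical work on intentionality has long gone without practical application in science and technology. The author aims to fill this gap with a low-cost, efficient analysis method built on a model from Promise Theory.

Result: The method achieves basic intentionality analysis at low computational cost, which suits the processing needs of basic cognitive agents. Its capacity for concept formation is limited by the memory capacity of the agent.

Insight: Intentionality analysis can be carried out with lightweight methods and without heavy computation, which suggests a simpler design for cognitive agents.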

Abstract: Since Searle’s work deconstructing intent and intentionality in the realm of philosophy, the practical meaning of intent has received little attention in science and technology. Intentionality and context are both central to the scope of Promise Theory’s model of Semantic Spacetime, used as an effective Tiny Language Model. One can identify themes and concepts from a text, on a low level (without knowledge of the specific language) by using process coherence as a guide. Any agent process can assess superficially a degree of latent ‘intentionality’ in data by looking for anomalous multi-scale anomalies and assessing the work done to form them. Scale separation can be used to sort parts into ‘intended’ content and ‘ambient context’, using the spacetime coherence as a measure. This offers an elementary but pragmatic interpretation of latent intentionality for very low computational cost, and without reference to extensive training or reasoning capabilities. The process is well within the reach of basic organisms as it does not require large scale artificial probabilistic batch processing. The level of concept formation depends, however, on the memory capacity of the agent.

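[155] Automating SPARQL Query Translations between DBpedia and Wikidata cs.AI | cs.CL

Malte Christian Bartels, Debayan Banerjee, Ricardo Usbeck

TL;DR: The paper studies how well large language models (LLMs) can automatically translate SPARQL queries between knowledge graph (KG) schemas such as DBpedia and Wikidata, and evaluates different models and prompting strategies.

Details

Motivation: Research on KG interoperability has a notable gap, particularly in automatic SPARQL query translation, which this paper aims to fill.

Result: Translation performance varies by model and prompting strategy; Wikidata-to-DBpedia translation works far better than the reverse direction.

Insight: Differences between KG schemas and query complexity are key factors in translation performance; future work could refine prompting strategies to improve results.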

Abstract: This paper investigates whether state-of-the-art Large Language Models (LLMs) can automatically translate SPARQL between popular Knowledge Graph (KG) schemas. We focus on translations between the DBpedia and Wikidata KGs, and later on the DBLP and OpenAlex KGs. This study addresses a notable gap in KG interoperability research by rigorously evaluating LLM performance on SPARQL-to-SPARQL translation. Two benchmarks are assembled: the first aligns 100 DBpedia-Wikidata queries from QALD-9-Plus; the second contains 100 DBLP queries aligned to OpenAlex, testing generalizability beyond encyclopaedic KGs. Three open LLMs (Llama-3-8B, DeepSeek-R1-Distill-Llama-70B, and Mistral-Large-Instruct-2407) are selected based on their sizes and architectures and tested with zero-shot, few-shot, and two chain-of-thought variants. Outputs were compared with gold answers, and resulting errors were categorized. We find that the performance varies markedly across models and prompting strategies, and that translations for Wikidata to DBpedia work far better than translations for DBpedia to Wikidata.

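[156] DeepResearch$^{\text{Eco}}$: A Recursive Agentic Workflow for Complex Scientific Question Answering in Ecology cs.AI | cs.CL | cs.MA

Jennifer D’Souza, Endres Keno Sander, Andrei Aioanei

TL;DR: DeepResearch$^{\text{Eco}}$ is an LLM-based recursive agentic system for answering complex scientific questions in ecology; depth- and breadth-controlled exploration increases the diversity and nuance of literature retrieval.

Details

Motivation: Conventional retrieval-augmented generation pipelines lack transparency and configurability for scientific literature synthesis and struggle to deliver both high throughput and analytical rigor.

Result: On 49 ecological research questions, source integration increases by up to 21-fold and sources integrated per 1,000 words by 14.9-fold; high-parameter settings approach expert-level depth.

Insight: With transparency and configurability built in, DeepResearch$^{\text{Eco}}$ offers an efficient and rigorous automated tool for scientific question answering.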

Abstract: We introduce DeepResearch$^{\text{Eco}}$, a novel agentic LLM-based system for automated scientific synthesis that supports recursive, depth- and breadth-controlled exploration of original research questions – enhancing search diversity and nuance in the retrieval of relevant scientific literature. Unlike conventional retrieval-augmented generation pipelines, DeepResearch enables user-controllable synthesis with transparent reasoning and parameter-driven configurability, facilitating high-throughput integration of domain-specific evidence while maintaining analytical rigor. Applied to 49 ecological research questions, DeepResearch achieves up to a 21-fold increase in source integration and a 14.9-fold rise in sources integrated per 1,000 words. High-parameter settings yield expert-level analytical depth and contextual diversity. Source code available at: https://github.com/sciknoworg/deep-research.

cs.RO [Back]

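[157] Multimodal HD Mapping for Intersections by Intelligent Roadside Units cs.RO | cs.CV

Zhongzhang Chen, Miao Fan, Shengtong Xu, Mengmeng Yang, Kun Jiang

TL;DR: The paper proposes a multimodal HD map generation framework based on intelligent roadside units (IRUs) that fuses camera and LiDAR data, and releases RS-seq, a dataset with enhanced annotations. The multimodal approach clearly outperforms unimodal methods on semantic segmentation.

Details

Motivation: Traditional vehicle-based HD mapping struggles at complex intersections because of occlusions and limited perspectives. Multimodal data (camera and LiDAR) from IRUs can supply more complete environmental information for HD maps.

Result: On the RS-seq dataset, the multimodal approach improves semantic segmentation mIoU by 4% over the image-only baseline and by 18% over the point cloud-only baseline.

Insight: Roadside multimodal data can effectively address HD map generation at complex intersections and opens a new avenue for infrastructure-assisted autonomous driving.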

Abstract: High-definition (HD) semantic mapping of complex intersections poses significant challenges for traditional vehicle-based approaches due to occlusions and limited perspectives. This paper introduces a novel camera-LiDAR fusion framework that leverages elevated intelligent roadside units (IRUs). Additionally, we present RS-seq, a comprehensive dataset developed through the systematic enhancement and annotation of the V2X-Seq dataset. RS-seq includes precisely labelled camera imagery and LiDAR point clouds collected from roadside installations, along with vectorized maps for seven intersections annotated with detailed features such as lane dividers, pedestrian crossings, and stop lines. This dataset facilitates the systematic investigation of cross-modal complementarity for HD map generation using IRU data. The proposed fusion framework employs a two-stage process that integrates modality-specific feature extraction and cross-modal semantic integration, capitalizing on camera high-resolution texture and precise geometric data from LiDAR. Quantitative evaluations using the RS-seq dataset demonstrate that our multimodal approach consistently surpasses unimodal methods. Specifically, compared to unimodal baselines evaluated on the RS-seq dataset, the multimodal approach improves the mean Intersection-over-Union (mIoU) for semantic segmentation by 4% over the image-only results and 18% over the point cloud-only results. This study establishes a baseline methodology for IRU-based HD semantic mapping and provides a valuable dataset for future research in infrastructure-assisted autonomous driving systems.
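
For readers unfamiliar with the headline metric, the following sketch computes mean Intersection-over-Union for flat lists of per-pixel class labels. Skipping classes that appear in neither prediction nor ground truth is one common convention and is an assumption here; the evaluation protocol in the paper may differ.

```python
from typing import List


def mean_iou(pred: List[int], target: List[int], num_classes: int) -> float:
    """Mean IoU over classes present in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # ignore classes absent from both label maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with four pixels and two classes.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]

score = mean_iou(pred, target, num_classes=2)
print(round(score, 4))
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.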

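[158] Visual Homing in Outdoor Robots Using Mushroom Body Circuits and Learning Walks cs.RO | cs.AI | cs.CV

Gabriel G. Gattaux, Julien R. Serres, Franck Ruffier, Antoine Wystrach

TL;DR: The paper proposes a bio-inspired method based on ant visual homing and presents the first real-world implementation of a Mushroom Body (MB) architecture on a compact autonomous car-like robot for visual homing in natural outdoor environments.

Details

Motivation: Ants achieve robust visual homing with only a few learning walks and minimal sensory input, which inspired the authors to develop an efficient, resource-frugal solution for autonomous navigation.

Result: Experiments show that the system achieves precise homing in natural outdoor environments while running at 8 Hz with a memory footprint under 9 kB.

Insight: Bio-inspired MB architectures can be used on resource-constrained robots, providing an efficient visual homing solution whose behavior resembles that of ants.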

Abstract: Ants achieve robust visual homing with minimal sensory input and only a few learning walks, inspiring biomimetic solutions for autonomous navigation. While Mushroom Body (MB) models have been used in robotic route following, they have not yet been applied to visual homing. We present the first real-world implementation of a lateralized MB architecture for visual homing onboard a compact autonomous car-like robot. We test whether the sign of the angular path integration (PI) signal can categorize panoramic views, acquired during learning walks and encoded in the MB, into “goal on the left” and “goal on the right” memory banks, enabling robust homing in natural outdoor settings. We validate this approach through four incremental experiments: (1) simulation showing attractor-like nest dynamics; (2) real-world homing after decoupled learning walks, producing nest search behavior; (3) homing after random walks using noisy PI emulated with GPS-RTK; and (4) precise stopping-at-the-goal behavior enabled by a fifth MB Output Neuron (MBON) encoding goal-views to control velocity. This mimics the accurate homing behavior of ants and functionally resembles waypoint-based position control in robotics, despite relying solely on visual input. Operating at 8 Hz on a Raspberry Pi 4 with 32x32 pixel views and a memory footprint under 9 kB, our system offers a biologically grounded, resource-efficient solution for autonomous visual homing.

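[159] Scene-Aware Conversational ADAS with Generative AI for Real-Time Driver Assistance cs.RO | cs.AI | cs.CV | cs.HC

Kyungtae Han, Yitao Chen, Rohit Gupta, Onur Altintas

TL;DR: The paper proposes Scene-Aware Conversational ADAS (SC-ADAS), a system that combines generative AI, integrating large language models with vision-to-text interpretation, to enable natural language interaction and adaptive driver assistance in dynamic environments.

Details

Motivation: Current ADAS lack scene understanding and natural language interaction, which makes it hard for them to adapt to dynamic environments or driver intent; a more intelligent solution is needed.

Result: Feasibility is demonstrated in the CARLA simulator, though the evaluation reveals trade-offs: added latency from vision-based context retrieval and token growth from accumulated dialogue history.

Insight: Generative AI can integrate conversational reasoning, scene perception, and modular ADAS control, offering a viable path toward next-generation intelligent driver assistance.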

Abstract: While autonomous driving technologies continue to advance, current Advanced Driver Assistance Systems (ADAS) remain limited in their ability to interpret scene context or engage with drivers through natural language. These systems typically rely on predefined logic and lack support for dialogue-based interaction, making them inflexible in dynamic environments or when adapting to driver intent. This paper presents Scene-Aware Conversational ADAS (SC-ADAS), a modular framework that integrates Generative AI components including large language models, vision-to-text interpretation, and structured function calling to enable real-time, interpretable, and adaptive driver assistance. SC-ADAS supports multi-turn dialogue grounded in visual and sensor context, allowing natural language recommendations and driver-confirmed ADAS control. Implemented in the CARLA simulator with cloud-based Generative AI, the system executes confirmed user intents as structured ADAS commands without requiring model fine-tuning. We evaluate SC-ADAS across scene-aware, conversational, and revisited multi-turn interactions, highlighting trade-offs such as increased latency from vision-based context retrieval and token growth from accumulated dialogue history. These results demonstrate the feasibility of combining conversational reasoning, scene perception, and modular ADAS control to support the next generation of intelligent driver assistance.

cs.IR [Back]

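[160] Overview of the TREC 2023 deep learning track cs.IR | cs.AI | cs.CL

Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Hossein A. Rahmani, Daniel Campos

TL;DR: A summary of the TREC 2023 Deep Learning track, which builds on the MS MARCO datasets, introduces synthetic queries (generated with T5 and GPT-4), and finds that LLM prompting methods outperform earlier approaches.

Details

Motivation: To study deep learning for information retrieval, in particular by introducing synthetic queries and LLM prompting methods to further improve ranking performance.

Result: LLM prompting methods outperform the previously best approach, and evaluation with synthetic queries agrees closely with human queries (system ordering agreement τ=0.8487).

Insight: Synthetic queries can usefully complement human queries; the use of LLMs in information retrieval merits further exploration.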

Abstract: This is the fifth year of the TREC Deep Learning track. As in previous years, we leverage the MS MARCO datasets that made hundreds of thousands of human-annotated training labels available for both passage and document ranking tasks. We mostly repeated last year’s design, to get another matching test set, based on the larger, cleaner, less-biased v2 passage and document set, with passage ranking as primary and document ranking as a secondary task (using labels inferred from passage). As we did last year, we sample from MS MARCO queries that were completely held out, unused in corpus construction, unlike the test queries in the first three years. This approach yields a more difficult test with more headroom for improvement. Alongside the usual human queries from MS MARCO, this year we generated synthetic queries using a fine-tuned T5 model and using a GPT-4 prompt. The new headline result this year is that runs using Large Language Model (LLM) prompting in some way outperformed runs that use the “nnlm” approach, which was the best approach in the previous four years. Since this is the last year of the track, future iterations of prompt-based ranking can happen in other tracks. Human relevance assessments were applied to all query types, not just human MS MARCO queries. Evaluation using synthetic queries gave similar results to human queries, with system ordering agreement of $\tau=0.8487$. However, human effort was needed to select a subset of the synthetic queries that were usable. We did not see clear evidence of bias, where runs using GPT-4 were favored when evaluated using synthetic GPT-4 queries, or where runs using T5 were favored when evaluated on synthetic T5 queries.
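
The system ordering agreement reported as $\tau$ is a Kendall rank correlation between two rankings of the same systems. The sketch below computes the simple tau-a variant (no tie correction) on made-up scores; the abstract does not say which variant the track used, and the four scores per query type are purely illustrative.

```python
from itertools import combinations
from typing import List


def kendall_tau(scores_a: List[float], scores_b: List[float]) -> float:
    """Kendall tau-a between two score lists for the same systems."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1      # both evaluations order the pair the same way
        elif sign < 0:
            discordant += 1      # the two evaluations disagree on the pair
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical effectiveness scores of four systems under each query type.
human_query_scores = [0.70, 0.65, 0.60, 0.50]
synthetic_query_scores = [0.68, 0.66, 0.55, 0.58]

tau = kendall_tau(human_query_scores, synthetic_query_scores)
print(round(tau, 4))
```

Five of the six system pairs are ordered the same way by both query types, which gives tau = (5 - 1) / 6.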

cs.MA [Back]

[161] TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit cs.MA | cs.AI | cs.CL | cs.HC | I.2.11; I.6.5; I.6.7PDF

Paulo Salem, Robert Sim, Christopher Olsen, Prerit Saxena, Rafael Barcelos

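TL;DR: TinyTroupe is an LLM-powered multiagent simulation toolkit that addresses gaps in existing tools' persona definition and behavior simulation, supporting detailed persona specifications and experimental control.

Details

Motivation: Existing multiagent system (MAS) tools lack fine-grained persona definitions and experimentation support for simulating realistic human behavior, which limits their use in behavioral studies and social simulation.

Result: The toolkit's capabilities are demonstrated through worked examples, including brainstorming and market research sessions, along with quantitative and qualitative evaluations.

Insight: LLMs show considerable potential for multiagent simulation, but their limitations and trade-offs still need attention.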

Abstract: Recent advances in Large Language Models (LLM) have led to a new class of autonomous agents, renewing and expanding interest in the area. LLM-powered Multiagent Systems (MAS) have thus emerged, both for assistive and simulation purposes, yet tools for realistic human behavior simulation – with its distinctive challenges and opportunities – remain underdeveloped. Existing MAS libraries and tools lack fine-grained persona specifications, population sampling facilities, experimentation support, and integrated validation, among other key capabilities, limiting their utility for behavioral studies, social simulation, and related applications. To address these deficiencies, in this work we introduce TinyTroupe, a simulation toolkit enabling detailed persona definitions (e.g., nationality, age, occupation, personality, beliefs, behaviors) and programmatic control via numerous LLM-driven mechanisms. This allows for the concise formulation of behavioral problems of practical interest, either at the individual or group level, and provides effective means for their solution. TinyTroupe’s components are presented using representative working examples, such as brainstorming and market research sessions, thereby simultaneously clarifying their purpose and demonstrating their usefulness. Quantitative and qualitative evaluations of selected aspects are also provided, highlighting possibilities, limitations, and trade-offs. The approach, though realized as a specific Python implementation, is meant as a novel conceptual contribution, which can be partially or fully incorporated in other contexts. The library is available as open source at https://github.com/microsoft/tinytroupe.

cs.MM [Back]

[162] ESG-Net: Event-Aware Semantic Guided Network for Dense Audio-Visual Event Localization cs.MM | cs.CVPDF

Huilai Li, Yonghao Dang, Ying Xing, Yiming Wang, Jianqin Yin

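TL;DR: This paper proposes ESG-Net, which uses multi-stage semantic guidance and event-dependency modeling to address the cross-modal semantic gap and multi-event correlation in dense audio-visual event localization.

Details

Motivation: Existing work lacks cross-modal semantic bridging in intermediate layers, which leaves a modality semantic gap, and it ignores correlations between events, which limits inference of multiple events in complex scenes.

Result: Experiments show that ESG-Net significantly outperforms existing methods while greatly reducing parameters and computation.

Insight: Early semantic interaction and event-dependency modeling allow multimodal features to be extracted more efficiently, improving event localization in complex scenes.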

Abstract: Dense audio-visual event localization (DAVE) aims to identify event categories and locate the temporal boundaries in untrimmed videos. Most studies only employ event-related semantic constraints on the final outputs, lacking cross-modal semantic bridging in intermediate layers. This causes a modality semantic gap for further fusion, making it difficult to distinguish between event-related content and irrelevant background content. Moreover, they rarely consider the correlations between events, which limits the model's ability to infer concurrent events in complex scenarios. In this paper, we incorporate multi-stage semantic guidance and multi-event relationship modeling, which respectively enable hierarchical semantic understanding of audio-visual events and adaptive extraction of event dependencies, thereby better focusing on event-related information. Specifically, our event-aware semantic guided network (ESG-Net) includes an early semantics interaction (ESI) module and a mixture of dependency experts (MoDE) module. ESI applies multi-stage semantic guidance to explicitly constrain the model in learning semantic information through multi-modal early fusion and several classification loss functions, ensuring hierarchical understanding of event-related content. MoDE promotes the extraction of multi-event dependencies through multiple serial mixtures of experts with adaptive weight allocation. Extensive experiments demonstrate that our method significantly surpasses the state-of-the-art methods, while greatly reducing parameters and computational load. Our code will be released on https://github.com/uchiha99999/ESG-Net.

[163] LayLens: Improving Deepfake Understanding through Simplified Explanations cs.MM | cs.CVPDF

Abhijeet Narang, Parul Gupta, Liuyijia Su, Abhinav Dhall

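TL;DR: LayLens is a tool that helps users understand deepfakes through simplified explanations. Its pipeline has three stages (detection, language simplification, and visual reconstruction), and it significantly improved users' clarity and confidence.

Details

Motivation: Existing deepfake detection tools often rely on technical jargon that non-expert users struggle to follow. LayLens aims to bridge the gap between model reasoning and human understanding with simplified explanations.

Result: A user study shows that simplified explanations significantly improve clarity and reduce cognitive load, and that users feel more confident identifying deepfakes.

Insight: User-friendly explanation tools can make deepfake detection more transparent and trustworthy for users of all educational backgrounds.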

Abstract: This demonstration paper presents $\mathbf{LayLens}$, a tool aimed to make deepfake understanding easier for users of all educational backgrounds. While prior works often rely on outputs containing technical jargon, LayLens bridges the gap between model reasoning and human understanding through a three-stage pipeline: (1) explainable deepfake detection using a state-of-the-art forgery localization model, (2) natural language simplification of technical explanations using a vision-language model, and (3) visual reconstruction of a plausible original image via guided image editing. The interface presents both technical and layperson-friendly explanations in addition to a side-by-side comparison of the uploaded and reconstructed images. A user study with 15 participants shows that simplified explanations significantly improve clarity and reduce cognitive load, with most users expressing increased confidence in identifying deepfakes. LayLens offers a step toward transparent, trustworthy, and user-centric deepfake forensics.

cs.DB [Back]

[164] TRACER: Efficient Object Re-Identification in Networked Cameras through Adaptive Query Processing cs.DB | cs.CVPDF

Pramod Chunduri, Yao Lu, Joy Arulraj

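TL;DR: TRACER is a VDBMS for efficiently processing object re-identification (Re-ID) queries across multi-camera networks; its adaptive query processing framework and incremental search windows significantly improve performance and recall.

Details

Motivation: The existing system, Spatula, has limited spatio-temporal filtering accuracy on large camera networks and lacks adaptive query processing, so it cannot meet high-recall requirements.

Result: TRACER outperforms the state-of-the-art system by 3.9x on average across diverse datasets.

Insight: Adaptive query processing and long-term history modeling can significantly improve Re-ID efficiency in multi-camera networks, and synthetic benchmarks can compensate for the shortage of data caused by privacy concerns.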

Abstract: Efficiently re-identifying and tracking objects across a network of cameras is crucial for applications like traffic surveillance. Spatula is the state-of-the-art video database management system (VDBMS) for processing Re-ID queries. However, it suffers from two limitations. Its spatio-temporal filtering scheme has limited accuracy on large camera networks due to localized camera history. It is not suitable for critical video analytics applications that require high recall due to a lack of support for adaptive query processing. In this paper, we present Tracer, a novel VDBMS for efficiently processing Re-ID queries using an adaptive query processing framework. Tracer selects the optimal camera to process at each time step by training a recurrent network to model long-term historical correlations. To accelerate queries under a high recall constraint, Tracer incorporates a probabilistic adaptive search model that processes camera feeds in incremental search windows and dynamically updates the sampling probabilities using an exploration-exploitation strategy. To address the paucity of benchmarks for the Re-ID task due to privacy concerns, we present a novel synthetic benchmark for generating multi-camera Re-ID datasets based on real-world traffic distribution. Our evaluation shows that Tracer outperforms the state-of-the-art cross-camera analytics system by 3.9x on average across diverse datasets.

cs.CR [Back]

[165] RAG Safety: Exploring Knowledge Poisoning Attacks to Retrieval-Augmented Generation cs.CR | cs.CLPDF

Tianzhe Zhao, Jiaoyan Chen, Yanchi Ru, Haiping Zhu, Nan Hu

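TL;DR: This paper presents the first systematic study of the security of knowledge-graph-based retrieval-augmented generation (KG-RAG), proposes a stealthy data poisoning attack strategy, and demonstrates its effectiveness experimentally.

Details

Motivation: KG-RAG is effective at mitigating LLM hallucination and outdated knowledge, but the security risks of its structured knowledge graphs have not been studied sufficiently; this paper aims to fill that gap.

Result: The attack strategy significantly degrades KG-RAG performance even with minimal perturbations; the paper also analyzes security threats within the internal stages of KG-RAG and the robustness of LLMs to adversarial knowledge.

Insight: The structured, editable nature of knowledge graphs makes them vulnerable to attack, so KG-RAG systems need stronger defenses against potential security threats.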

Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by retrieving external data to mitigate hallucinations and outdated knowledge issues. Benefiting from the strong ability in facilitating diverse data sources and supporting faithful reasoning, knowledge graphs (KGs) have been increasingly adopted in RAG systems, giving rise to KG-based RAG (KG-RAG) methods. Though RAG systems are widely applied in various applications, recent studies have also revealed its vulnerabilities to data poisoning attacks, where malicious information injected into external knowledge sources can mislead the system into producing incorrect or harmful responses. However, these studies focus exclusively on RAG systems using unstructured textual data sources, leaving the security risks of KG-RAG largely unexplored, despite the fact that KGs present unique vulnerabilities due to their structured and editable nature. In this work, we conduct the first systematic investigation of the security issue of KG-RAG methods through data poisoning attacks. To this end, we introduce a practical, stealthy attack setting that aligns with real-world implementation. We propose an attack strategy that first identifies adversarial target answers and then inserts perturbation triples to complete misleading inference chains in the KG, increasing the likelihood that KG-RAG methods retrieve and rely on these perturbations during generation. Through extensive experiments on two benchmarks and four recent KG-RAG methods, our attack strategy demonstrates strong effectiveness in degrading KG-RAG performance, even with minimal KG perturbations. In-depth analyses are also conducted to understand the safety threats within the internal stages of KG-RAG systems and to explore the robustness of LLMs against adversarial knowledge.

[166] EventHunter: Dynamic Clustering and Ranking of Security Events from Hacker Forum Discussions cs.CR | cs.AI | cs.CLPDF

Yasir Ech-Chammakhy, Anas Motii, Anass Rabii, Jaafar Chbili

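TL;DR: This paper proposes EventHunter, an unsupervised framework that uses Transformer embeddings fine-tuned with contrastive learning to automatically detect, cluster, and prioritize security events in unstructured hacker forum content, significantly reducing noise and surfacing high-priority threats.

Details

Motivation: Hacker forums provide important early warning signals for emerging cybersecurity threats, but their content is unstructured and noisy, which makes actionable intelligence hard to extract; automated tools are needed to improve threat detection and analysis.

Result: Experimental evaluation shows that the method effectively reduces noise and surfaces high-priority threats, giving security analysts actionable intelligence.

Insight: Unsupervised learning and dynamic clustering can extract structured threat intelligence from unstructured text, offering a new automated approach for cybersecurity.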

Abstract: Hacker forums provide critical early warning signals for emerging cybersecurity threats, but extracting actionable intelligence from their unstructured and noisy content remains a significant challenge. This paper presents an unsupervised framework that automatically detects, clusters, and prioritizes security events discussed across hacker forum posts. Our approach leverages Transformer-based embeddings fine-tuned with contrastive learning to group related discussions into distinct security event clusters, identifying incidents like zero-day disclosures or malware releases without relying on predefined keywords. The framework incorporates a daily ranking mechanism that prioritizes identified events using quantifiable metrics reflecting timeliness, source credibility, information completeness, and relevance. Experimental evaluation on real-world hacker forum data demonstrates that our method effectively reduces noise and surfaces high-priority threats, enabling security analysts to mount proactive responses. By transforming disparate hacker forum discussions into structured, actionable intelligence, our work addresses fundamental challenges in automated threat detection and analysis.

cs.SE [Back]

[167] Semantic Source Code Segmentation using Small and Large Language Models cs.SE | cs.CL | cs.PLPDF

Abdelhalim Dahou, Ansgar Scherp, Sebastian Kurten, Brigitte Mathiak, Madhu Chauhan

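TL;DR: This paper proposes an automated approach that uses large language models (LLMs) and small language models (SLMs) to semantically segment research R code. It tests two approaches, line-by-line analysis with context and range-based segment determination, and finds that the line-by-line approach performs better.

Details

Motivation: As codebases grow and demand rises for low-resource languages such as R, manual and syntactic-analysis approaches have become impractical, so an automated, domain-specific code segmentation method is needed.

Result: Context-based line-by-line analysis outperforms range-based segmentation, and fine-tuned small language models (such as CodeBERT and an encoder-only version of CodeT5+) outperform LLMs.

Insight: Even though the small models saw no R code during pre-training, fine-tuning on a small amount of manually annotated data is enough for them to outperform LLMs, which shows the value of fine-tuning for domain-specific tasks.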

Abstract: Source code segmentation, dividing code into functionally coherent segments, is crucial for knowledge retrieval and maintenance in software development. While enabling efficient navigation and comprehension of large codebases, manual and syntactic analysis approaches have become impractical as repositories grow, especially for low-resource languages like R and their research domains (e.g., social sciences, psychology). This paper introduces an automated, domain-specific approach for research R code segmentation using Large and Small Language Models (LLMs/SLMs). It presents two novel approaches and a human-annotated dataset, StatCodeSeg. We explore two distinct approaches: line-by-line analysis with context and range-based segment determination. We experiment with LLMs and fine-tuned SLMs. To support the generalizability of our approaches, we also include experiments on Python code from the computer science domain. Our results show that context-based line-by-line analysis is superior to range-based segmentation. Smaller language models like CodeBERT and an encoder-only version of CodeT5+ perform better than their LLM counterparts. Most notably, these two best-performing models, unlike the LLMs, did not see R code during pre-training and were only fine-tuned on 4,130 lines of manually annotated code.
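The two formulations compared in the abstract differ in output shape: line-by-line analysis assigns a label to each line, while range-based segmentation emits line spans directly. A minimal sketch of how per-line labels could be collapsed into ranges is shown below; the function name and the label vocabulary are hypothetical and are not taken from the paper or from StatCodeSeg.

```python
def labels_to_ranges(line_labels):
    """Collapse per-line labels into (start, end, label) ranges, 1-indexed and inclusive."""
    ranges = []
    start = 1
    for i in range(1, len(line_labels) + 1):
        at_end = i == len(line_labels)
        # Close the current segment at the last line or when the next label differs.
        if at_end or line_labels[i] != line_labels[i - 1]:
            ranges.append((start, i, line_labels[i - 1]))
            start = i + 1
    return ranges

# Hypothetical labels predicted for a six-line analysis script.
predicted = ["load", "load", "clean", "model", "model", "model"]
segments = labels_to_ranges(predicted)
```

Here `segments` groups lines 1-2, line 3, and lines 4-6 into three ranges, which is the form a range-based model would be asked to produce in one step.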

q-bio.NC [Back]

[168] CNeuroMod-THINGS, a densely-sampled fMRI dataset for visual neuroscience q-bio.NC | cs.CVPDF

Marie St-Laurent, Basile Pinsard, Oliver Contier, Elizabeth DuPre, Katja Seeliger

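TL;DR: CNeuroMod-THINGS is a densely sampled fMRI dataset designed to meet the data demands of neuro-AI modelling; it combines the resources of the THINGS and CNeuroMod projects to capture neural representations of a wide set of semantic concepts.

Details

Motivation: Neuro-AI modelling needs large, high-quality neuroimaging datasets, and existing datasets often fall short. CNeuroMod-THINGS aims to fill this gap by combining the strengths of two existing projects.

Result: The dataset shows high-quality behavioural and neuroimaging metrics, supporting broad modelling of human visual experience.

Insight: Combining existing resources is an efficient way to build large-scale neuroimaging datasets and advance neuro-AI modelling.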

Abstract: Data-hungry neuro-AI modelling requires ever larger neuroimaging datasets. CNeuroMod-THINGS meets this need by capturing neural representations for a wide set of semantic concepts using well-characterized stimuli in a new densely-sampled, large-scale fMRI dataset. Importantly, CNeuroMod-THINGS exploits synergies between two existing projects: the THINGS initiative (THINGS) and the Courtois Project on Neural Modelling (CNeuroMod). THINGS has developed a common set of thoroughly annotated images broadly sampling natural and man-made objects which is used to acquire a growing collection of large-scale multimodal neural responses. Meanwhile, CNeuroMod is acquiring hundreds of hours of fMRI data from a core set of participants during controlled and naturalistic tasks, including visual tasks like movie watching and videogame playing. For CNeuroMod-THINGS, four CNeuroMod participants each completed 33-36 sessions of a continuous recognition paradigm using approximately 4000 images from the THINGS stimulus set spanning 720 categories. We report behavioural and neuroimaging metrics that showcase the quality of the data. By bridging together large existing resources, CNeuroMod-THINGS expands our capacity to model broad slices of the human visual experience.

[169] Self-supervised pretraining of vision transformers for animal behavioral analysis and neural encoding q-bio.NC | cs.CVPDF

Yanchen Wang, Han Yu, Ari Blau, Yizi Zhang, The International Brain Laboratory

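TL;DR: The paper proposes BEAST, a framework that pretrains vision transformers (ViTs) with self-supervision to reduce the labeling burden in neuro-behavioral analysis; it combines masked autoencoding with temporal contrastive learning to improve behavioral feature extraction, pose estimation, and action segmentation.

Details

Motivation: Modern neuroscience relies on behavioral analysis to understand the brain, but existing methods need large amounts of labeled data, which limits scalability. BEAST aims to reduce that dependence through self-supervised learning.

Result: BEAST outperforms existing methods on several tasks (neural-behavioral correlation analysis, pose estimation, action segmentation), most notably when labeled data is scarce.

Insight: A self-supervised pretrained ViT can serve as a general backbone for behavioral analysis, giving neuroscience an efficient tool that depends less on labeled data.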

Abstract: The brain can only be fully understood through the lens of the behavior it generates – a guiding principle in modern neuroscience research that nevertheless presents significant technical challenges. Many studies capture behavior with cameras, but video analysis approaches typically rely on specialized models requiring extensive labeled data. We address this limitation with BEAST (BEhavioral Analysis via Self-supervised pretraining of Transformers), a novel and scalable framework that pretrains experiment-specific vision transformers for diverse neuro-behavior analyses. BEAST combines masked autoencoding with temporal contrastive learning to effectively leverage unlabeled video data. Through comprehensive evaluation across multiple species, we demonstrate improved performance in three critical neuro-behavioral tasks: extracting behavioral features that correlate with neural activity, and pose estimation and action segmentation in both the single- and multi-animal settings. Our method establishes a powerful and versatile backbone model that accelerates behavioral analysis in scenarios where labeled data remains scarce.

cs.SD [Back]

[170] Less Stress, More Privacy: Stress Detection on Anonymized Speech of Air Traffic Controllers cs.SD | cs.CL | eess.AS | I.2.7; I.5.5PDF

Janaki Viswanathan, Alexander Blatt, Konrad Hagemann, Dietrich Klakow

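TL;DR: The paper studies stress detection for air traffic controllers (ATCOs) on privacy-preserving anonymized speech, evaluates several architectures, and reaches high accuracy on anonymized datasets.

Details

Motivation: Air traffic control (ATC) is stressful work in which errors have severe consequences, so stress detection is key to maintaining high safety standards. Privacy regulations such as the GDPR, however, require voice data to be anonymized.

Result: The best models reach 93.6% accuracy on an anonymized version of the SUSAS dataset and 80.1% on an anonymized ATC simulation dataset.

Insight: Privacy protection such as speech anonymization does not have to significantly hurt deep learning model performance, which opens the way to AI applications in settings with strict privacy requirements.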

Abstract: Air traffic control (ATC) demands multi-tasking under time pressure with high consequences of an error. This can induce stress. Detecting stress is a key point in maintaining the high safety standards of ATC. However, processing ATC voice data entails privacy restrictions, e.g. the General Data Protection Regulation (GDPR) law. Anonymizing the ATC voice data is one way to comply with these restrictions. In this paper, different architectures for stress detection for anonymized ATCO speech are evaluated. Our best networks reach a stress detection accuracy of 93.6% on an anonymized version of the Speech Under Simulated and Actual Stress (SUSAS) dataset and an accuracy of 80.1% on our anonymized ATC simulation dataset. This shows that privacy does not have to be an impediment in building well-performing deep-learning-based models.

[171] Voice Conversion for Lombard Speaking Style with Implicit and Explicit Acoustic Feature Conditioning cs.SD | cs.CL | eess.ASPDF

Dominika Woszczyk, Manuel Sam Ribeiro, Thomas Merritt, Daniel Korzekwa

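TL;DR: The paper studies voice conversion for the Lombard speaking style with implicit and explicit acoustic feature conditioning, and proposes an implicit conditioning strategy that matches the intelligibility gain of explicit conditioning while preserving speaker similarity.

Details

Motivation: TTS systems in Lombard speaking style can improve speech intelligibility, which helps listeners with hearing loss and in noisy conditions. Training such models requires a large amount of data, however, and the Lombard effect is hard to record because of speaker and noise variability and tiring recording conditions. Voice conversion (VC) is a useful data augmentation technique for training TTS systems when no data from the target speaker is available.

Result: Experiments show that the implicit conditioning strategy achieves an intelligibility gain comparable to the explicitly conditioned model while preserving speaker similarity.

Insight: Implicit acoustic feature conditioning is an effective alternative that achieves high-quality speaking-style conversion without explicit features, which is promising for data-scarce scenarios.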

Abstract: Text-to-Speech (TTS) systems in Lombard speaking style can improve the overall intelligibility of speech, useful for hearing loss and noisy conditions. However, training those models requires a large amount of data and the Lombard effect is challenging to record due to speaker and noise variability and tiring recording conditions. Voice conversion (VC) has been shown to be a useful augmentation technique to train TTS systems in the absence of recorded data from the target speaker in the target speaking style. In this paper, we are concerned with Lombard speaking style transfer. Our goal is to convert speaker identity while preserving the acoustic attributes that define the Lombard speaking style. We compare voice conversion models with implicit and explicit acoustic feature conditioning. We observe that our proposed implicit conditioning strategy achieves an intelligibility gain comparable to the model conditioned on explicit acoustic features, while also preserving speaker similarity.

eess.IV [Back]

Yang Ming, Jiang Shi Zhong, Zhou Su Juan

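TL;DR: The paper proposes a novel asymmetric cross-modal cross-attention network that fuses multimodal medical data (PET, MRI, genetic, and clinical data) to accurately diagnose Alzheimer's disease (AD).

Details

Motivation: Traditional convolutional neural networks and simple feature concatenation fail to exploit complementary information in multimodal fusion and tend to lose key information, so a more effective method is needed.

Result: The model reaches 94.88% accuracy on the test set, outperforming traditional unimodal and multimodal methods.

Insight: The asymmetric cross-modal attention mechanism offers a clear advantage for multimodal data fusion and improves AD diagnosis performance.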

Abstract: Alzheimer’s Disease (AD) is an irreversible neurodegenerative disease characterized by progressive cognitive decline as its main symptom. In the research field of deep learning-assisted diagnosis of AD, traditional convolutional neural networks and simple feature concatenation methods fail to effectively utilize the complementary information between multimodal data, and the simple feature concatenation approach is prone to cause the loss of key information during the process of modal fusion. In recent years, the development of deep learning technology has brought new possibilities for solving the problem of how to effectively fuse multimodal features. This paper proposes a novel deep learning algorithm framework to assist medical professionals in AD diagnosis. By fusing medical multi-view information such as brain fluorodeoxyglucose positron emission tomography (PET), magnetic resonance imaging (MRI), genetic data, and clinical data, it can accurately detect the presence of AD, Mild Cognitive Impairment (MCI), and Cognitively Normal (CN). The innovation of the algorithm lies in the use of an asymmetric cross-modal cross-attention mechanism, which can effectively capture the key information features of the interactions between different data modal features. This paper compares the asymmetric cross-modal cross-attention mechanism with the traditional algorithm frameworks of unimodal and multimodal deep learning models for AD diagnosis, and evaluates the importance of the asymmetric cross-modal cross-attention mechanism. The algorithm model achieves an accuracy of 94.88% on the test set.

[173] VIP: Visual Information Protection through Adversarial Attacks on Vision-Language Models eess.IV | cs.CV | cs.LGPDF

Hanene F. Z. Brachemi Meftah, Wassim Hamidouche, Sid Ahmed Fezza, Olivier Déforges

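TL;DR: The paper proposes protecting visual information through adversarial attacks that selectively conceal sensitive regions of an image, preventing vision-language models (VLMs) from accessing private content while preserving the semantic integrity of the rest of the image.

Details

Motivation: As vision-language models (VLMs) are widely adopted, risks to user privacy are growing. Conventional adversarial attacks often disrupt the whole image, and methods that protect specific sensitive regions are lacking.

Result: Across three state-of-the-art VLMs, the method reduces detection of targeted ROIs by up to 98%, while the adversarial images stay semantically close to the originals.

Insight: Privacy can be protected with targeted adversarial attacks without disrupting the whole image. The approach offers a new direction for privacy protection in multimodal models, balancing privacy and utility.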

Abstract: Recent years have witnessed remarkable progress in developing Vision-Language Models (VLMs) capable of processing both textual and visual inputs. These models have demonstrated impressive performance, leading to their widespread adoption in various applications. However, this widespread adoption raises serious concerns regarding user privacy, particularly when models inadvertently process or expose private visual information. In this work, we frame the preservation of privacy in VLMs as an adversarial attack problem. We propose a novel attack strategy that selectively conceals information within designated Regions of Interest (ROIs) in an image, effectively preventing VLMs from accessing sensitive content while preserving the semantic integrity of the remaining image. Unlike conventional adversarial attacks that often disrupt the entire image, our method maintains high coherence in unmasked areas. Experimental results across three state-of-the-art VLMs, namely LLaVA, Instruct-BLIP, and BLIP2-T5, demonstrate up to 98% reduction in detecting targeted ROIs, while keeping global image semantics intact, as confirmed by high similarity scores between clean and adversarial outputs. We believe that this work contributes to a more privacy-conscious use of multimodal models and offers a practical tool for further research, with the source code publicly available at: https://github.com/hbrachemi/Vlm_defense-attack.

Mehmet Onurcan Kaya, Figen S. Oktem

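TL;DR: The paper proposes a phase retrieval method based on an image-to-image diffusion model (I2I-PR). It produces a robust initial estimate with a hybrid iterative technique and an acceleration mechanism, then refines it iteratively with the learned diffusion model, significantly improving reconstruction quality and training efficiency.

Details

Motivation: Phase retrieval is essential in many fields, but existing methods are sensitive to initialization and noise. Diffusion models perform well on image reconstruction tasks, so applying them to phase retrieval could improve performance and robustness.

Result: The method achieves significant gains in both training efficiency and reconstruction quality, outperforming existing methods.

Insight: Diffusion models are promising for phase retrieval, especially when classical iterative methods are combined with learned techniques to make reconstruction more efficient and robust.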

Abstract: Phase retrieval involves recovering a signal from intensity-only measurements, crucial in many fields such as imaging, holography, optical computing, crystallography, and microscopy. Although there are several well-known phase retrieval algorithms, including classical iterative solvers, the reconstruction performance often remains sensitive to initialization and measurement noise. Recently, image-to-image diffusion models have gained traction in various image reconstruction tasks, yielding significant theoretical insights and practical breakthroughs. In this work, we introduce a novel phase retrieval approach based on an image-to-image diffusion framework called Inversion by Direct Iteration. Our method begins with an enhanced initialization stage that leverages a hybrid iterative technique, combining the Hybrid Input-Output and Error Reduction methods and incorporating a novel acceleration mechanism to obtain a robust crude estimate. Then, it iteratively refines this initial crude estimate using the learned image-to-image pipeline. Our method achieves substantial improvements in both training efficiency and reconstruction quality. Furthermore, our approach utilizes aggregation techniques to refine quality metrics and demonstrates superior results compared to both classical and contemporary techniques. This highlights its potential for effective and efficient phase retrieval across various applications.
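The initialization stage described in the abstract combines the Hybrid Input-Output (HIO) and Error Reduction (ER) methods. The following is a minimal 1D sketch of those two classical update rules only, assuming a real, non-negative signal with known support. The iteration schedule, the `beta` value, and the toy signal are illustrative assumptions; the paper's acceleration mechanism and diffusion-based refinement are not reproduced here.

```python
import cmath
import random

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def fourier_projection(x, magnitude):
    """Keep the current Fourier phase and impose the measured magnitude."""
    X = dft(x)
    projected = [m * cmath.exp(1j * cmath.phase(Xk)) for m, Xk in zip(magnitude, X)]
    return [v.real for v in idft(projected)]

def fourier_error(x, magnitude):
    """Distance between the estimate's Fourier magnitude and the measurement."""
    return sum((abs(Xk) - m) ** 2 for Xk, m in zip(dft(x), magnitude)) ** 0.5

def error_reduction_step(x, magnitude, support):
    # ER: zero every sample that violates the support or non-negativity constraint.
    y = fourier_projection(x, magnitude)
    return [v if s and v >= 0 else 0.0 for v, s in zip(y, support)]

def hio_step(x, magnitude, support, beta=0.9):
    # HIO: where the constraint is violated, push the previous estimate away
    # from the projected value instead of zeroing it.
    y = fourier_projection(x, magnitude)
    return [v if s and v >= 0 else xo - beta * v
            for v, xo, s in zip(y, x, support)]

# Toy problem: recover a 16-sample signal supported on its first 6 samples.
random.seed(0)
n = 16
support = [t < 6 for t in range(n)]
truth = [random.random() if s else 0.0 for s in support]
magnitude = [abs(v) for v in dft(truth)]  # intensity-only measurement

x = [random.random() if s else 0.0 for s in support]  # random feasible start
for _ in range(30):  # assumed schedule: HIO first, then ER to settle
    x = hio_step(x, magnitude, support)
er_errors = []
for _ in range(30):
    x = error_reduction_step(x, magnitude, support)
    er_errors.append(fourier_error(x, magnitude))
```

HIO is typically used to escape stagnation, while ER never increases the Fourier-domain error once the estimate satisfies the object-domain constraints, which is why the sketch ends with ER iterations.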

[175] Pre-trained Under Noise: A Framework for Robust Bone Fracture Detection in Medical Imaging eess.IV | cs.CVPDF

Robby Hoover, Nelly Elsayed, Zag ElSayed, Chengcheng Li

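TL;DR: This paper studies how robust pre-trained deep learning models are when classifying bone fractures in X-ray images. It tests ResNet50, VGG16, and EfficientNetv2 under simulated equipment-quality conditions and proposes a methodological framework for assessing model degradation.

Details

Motivation: Uneven quality of medical imaging equipment worldwide leads to disparities in bone fracture detection; the paper explores how pre-trained models behave under different noise conditions.

Result: The experiments reveal how the performance of different pre-trained models degrades under noise, which informs model selection in practice.

Insight: Noise significantly affects model performance, and EfficientNetv2 proves more robust, offering practical guidance for model optimization in medical image analysis.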

Abstract: Medical imaging is considered one of the crucial diagnostic tools for bone-related diseases, especially bone fractures. This paper investigates the robustness of pre-trained deep learning models for classifying bone fractures in X-ray images and seeks to address global healthcare disparity through the lens of technology. Three deep learning models have been tested under varying simulated equipment quality conditions. ResNet50, VGG16, and EfficientNetv2 are the three pre-trained architectures compared. These models were used to perform bone fracture classification as images were progressively degraded using noise. This paper empirically studies how noise affects bone fracture detection and how the performance of pre-trained models changes as noise degrades the quality of the X-ray images. This paper aims to help replicate real-world challenges experienced by medical imaging technicians across the world. Thus, this paper establishes a methodological framework for assessing AI model degradation using transfer learning and controlled noise augmentation. The findings provide practical insight into how robust and generalizable different pre-trained deep-learning-powered computer vision models can be when used in different contexts.
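The controlled noise augmentation mentioned in the abstract can be pictured as producing a series of progressively degraded copies of each image. The sketch below assumes additive Gaussian noise on 8-bit grayscale pixels; the abstract does not state the noise model or levels, so the noise type, the sigma values, and the function names are illustrative assumptions.

```python
import random

def add_gaussian_noise(image, sigma, rng):
    """Return a noisy copy of a grayscale image (rows of 0-255 ints), clipped to range."""
    return [[min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
            for row in image]

def degradation_series(image, sigmas, seed=0):
    """Map each noise level to a degraded copy; sigma=0 reproduces the input."""
    rng = random.Random(seed)
    return {sigma: add_gaussian_noise(image, sigma, rng) for sigma in sigmas}

# Hypothetical 2x3 image patch and noise levels.
image = [[0, 128, 255], [64, 200, 10]]
series = degradation_series(image, [0, 10, 50])
```

Each degraded copy would then be passed to the same pre-trained classifier, so that accuracy can be tracked as a function of the noise level.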

[176] AI-Enhanced Pediatric Pneumonia Detection: A CNN-Based Approach Using Data Augmentation and Generative Adversarial Networks (GANs) eess.IV | cs.AI | cs.CVPDF

Abdul Manaf, Nimra Mughal

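TL;DR: This paper proposes a CNN-based approach that combines data augmentation and GANs for pediatric pneumonia detection, aiming to improve diagnostic accuracy and efficiency.

Details

Motivation: Pneumonia is a leading cause of death in children under five, so accurate diagnosis is essential. Conventional diagnosis depends on clinician experience and is prone to subjectivity, so automated diagnostic aids are needed.

Result: The model performed best when trained on combined original, augmented, and GAN-generated data (high accuracy and F1 score), and was deployed as a real-time classification tool.

Insight: Deep learning combined with data augmentation and GANs can address limited-data problems in medicine and improve diagnostic efficiency, particularly in resource-limited clinical settings.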

Abstract: Pneumonia is a leading cause of mortality in children under five, requiring accurate chest X-ray diagnosis. This study presents a machine learning-based Pediatric Chest Pneumonia Classification System to assist healthcare professionals in diagnosing pneumonia from chest X-ray images. The CNN-based model was trained on 5,863 labeled chest X-ray images from children aged 0-5 years from the Guangzhou Women and Children’s Medical Center. To address limited data, we applied augmentation techniques (rotation, zooming, shear, horizontal flipping) and employed GANs to generate synthetic images, addressing class imbalance. The system achieved optimal performance using combined original, augmented, and GAN-generated data, evaluated through accuracy and F1 score metrics. The final model was deployed via a Flask web application, enabling real-time classification with probability estimates. Results demonstrate the potential of deep learning and GANs in improving diagnostic accuracy and efficiency for pediatric pneumonia classification, particularly valuable in resource-limited clinical settings. https://github.com/AbdulManaf12/Pediatric-Chest-Pneumonia-Classification

[177] Advanced U-Net Architectures with CNN Backbones for Automated Lung Cancer Detection and Segmentation in Chest CT Images eess.IV | cs.AI | cs.CV | cs.LGPDF

Alireza Golkarieh, Kiana Kiashemshaki, Sajjad Rezvani Boroujeni, Nasibeh Asadi Isakan

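TL;DR: This study examines U-Net combined with different CNN backbones (ResNet50, VGG16, and Xception) for automated lung cancer detection and segmentation in chest CT images. The results indicate that these architectures outperform existing methods on both segmentation and classification.

Details

Motivation: Clinical practice needs accurate diagnostic tools to make lung cancer detection more efficient, especially automated analysis of CT images.

Result: The best models reach Dice coefficients of 0.9495 (cancerous) and 0.9532 (non-cancerous) for segmentation, and the Xception-based model reaches 99.1% classification accuracy.

Insight: U-Net with CNN backbones performs well in medical image analysis, and hybrid models such as CNN-SVM also achieve strong classification performance.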

Abstract: This study investigates the effectiveness of U-Net architectures integrated with various convolutional neural network (CNN) backbones for automated lung cancer detection and segmentation in chest CT images, addressing the critical need for accurate diagnostic tools in clinical settings. A balanced dataset of 832 chest CT images (416 cancerous and 416 non-cancerous) was preprocessed using Contrast Limited Adaptive Histogram Equalization (CLAHE) and resized to 128x128 pixels. U-Net models were developed with three CNN backbones: ResNet50, VGG16, and Xception, to segment lung regions. After segmentation, CNN-based classifiers and hybrid models combining CNN feature extraction with traditional machine learning classifiers (Support Vector Machine, Random Forest, and Gradient Boosting) were evaluated using 5-fold cross-validation. Metrics included accuracy, precision, recall, F1-score, Dice coefficient, and ROC-AUC. U-Net with ResNet50 achieved the best performance for cancerous lungs (Dice: 0.9495, Accuracy: 0.9735), while U-Net with VGG16 performed best for non-cancerous segmentation (Dice: 0.9532, Accuracy: 0.9513). For classification, the CNN model using U-Net with Xception achieved 99.1 percent accuracy, 99.74 percent recall, and 99.42 percent F1-score. The hybrid CNN-SVM-Xception model achieved 96.7 percent accuracy and 97.88 percent F1-score. Compared to prior methods, our framework consistently outperformed existing models. In conclusion, combining U-Net with advanced CNN backbones provides a powerful method for both segmentation and classification of lung cancer in CT scans, supporting early diagnosis and clinical decision-making.
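The segmentation results above are reported as Dice coefficients, defined for a predicted mask A and a reference mask B as 2|A∩B| / (|A| + |B|). A small sketch of that metric on flattened binary masks follows; the masks are made up, and returning 1.0 when both masks are empty is a convention assumed here rather than something the abstract specifies.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A∩B| / (|A| + |B|) for flat binary masks containing 0 or 1."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Hypothetical flattened masks for six pixels.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
dice = dice_coefficient(pred, target)
```

The two masks share two foreground pixels and contain three each, so the score is 4/6; a value such as 0.9495 therefore means the predicted and reference regions overlap almost completely.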

Guohao Huo, Ruiting Dai, Hao Tang

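TL;DR: The paper proposes a lightweight graph-based multi-modal interaction network (GMLN-BTS) for brain tumor segmentation, paired with the Edge Iterative MRI Lesion Localization System (EdgeIMLocSys), which uses continuous learning from human feedback to adapt the model to different MRI scanners. GMLN-BTS uses graph structures for collaborative multi-modal interaction, and a novel Voxel Refinement UpSampling Module (VRUM) improves segmentation boundary accuracy.

Details

Motivation: Differences in imaging quality across MRI scanners limit model generalization, so a lightweight, high-accuracy brain tumor segmentation method is needed for resource-constrained clinical environments.

Result: The model reaches a Dice score of 85.1% on the BraTS2017 dataset with 98% fewer parameters than mainstream 3D Transformer models, and outperforms existing lightweight approaches.

Insight: Graph structures can model multi-modal relationships effectively, and combining lightweight design with continuous learning yields an efficient segmentation solution for clinical settings.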

Abstract: Brain tumor segmentation plays a critical role in clinical diagnosis and treatment planning, yet the variability in imaging quality across different MRI scanners presents significant challenges to model generalization. To address this, we propose the Edge Iterative MRI Lesion Localization System (EdgeIMLocSys), which integrates Continuous Learning from Human Feedback to adaptively fine-tune segmentation models based on clinician feedback, thereby enhancing robustness to scanner-specific imaging characteristics. Central to this system is the Graph-based Multi-Modal Interaction Lightweight Network for Brain Tumor Segmentation (GMLN-BTS), which employs a Modality-Aware Adaptive Encoder (M2AE) to extract multi-scale semantic features efficiently, and a Graph-based Multi-Modal Collaborative Interaction Module (G2MCIM) to model complementary cross-modal relationships via graph structures. Additionally, we introduce a novel Voxel Refinement UpSampling Module (VRUM) that synergistically combines linear interpolation and multi-scale transposed convolutions to suppress artifacts while preserving high-frequency details, improving segmentation boundary accuracy. Our proposed GMLN-BTS model achieves a Dice score of 85.1% on the BraTS2017 dataset with only 4.58 million parameters, representing a 98% reduction compared to mainstream 3D Transformer models, and significantly outperforms existing lightweight approaches. This work demonstrates a synergistic breakthrough in achieving high-accuracy, resource-efficient brain tumor segmentation suitable for deployment in resource-constrained clinical environments.

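[179] DepViT-CAD: Deployable Vision Transformer-Based Cancer Diagnosis in Histopathology eess.IV | cs.AI | cs.CV | cs.LG

Ashkan Shakarami, Lorenzo Nicole, Rocco Cappellesso, Angelo Paolo Dei Tos, Stefano Ghidoni

TL;DR: The paper presents DepViT-CAD, a deployable AI system that performs multi-class cancer diagnosis from histopathology slides using a Multi-Attention Vision Transformer (MAViT), and reports strong results in large-scale real-world validation.

Details

Motivation: Accurate and timely cancer diagnosis is vital for clinical decision-making, which calls for a robust and scalable AI system that can handle diverse tumor types.

Result: The system achieves diagnostic sensitivities of 94.11% and 92% on two independent validation cohorts (275 WSIs from The Cancer Genome Atlas and 50 routine clinical cases), demonstrating its robustness.

Insight: Combining an advanced transformer architecture with large-scale real-world validation provides a reliable and scalable solution for AI-assisted cancer diagnosis.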

Abstract: Accurate and timely cancer diagnosis from histopathological slides is vital for effective clinical decision-making. This paper introduces DepViT-CAD, a deployable AI system for multi-class cancer diagnosis in histopathology. At its core is MAViT, a novel Multi-Attention Vision Transformer designed to capture fine-grained morphological patterns across diverse tumor types. MAViT was trained on expert-annotated patches from 1008 whole-slide images, covering 11 diagnostic categories, including 10 major cancers and non-tumor tissue. DepViT-CAD was validated on two independent cohorts: 275 WSIs from The Cancer Genome Atlas and 50 routine clinical cases from pathology labs, achieving diagnostic sensitivities of 94.11% and 92%, respectively. By combining state-of-the-art transformer architecture with large-scale real-world validation, DepViT-CAD offers a robust and scalable approach for AI-assisted cancer diagnostics. To support transparency and reproducibility, software and code will be made publicly available at GitHub.

cs.LG [Back]

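[180] Multiple Choice Learning of Low Rank Adapters for Language Modeling cs.LG | cs.AI | cs.CL | stat.ML

Victor Letzelter, Hugo Malard, Mathieu Fontaine, Gaël Richard, Slim Essid

TL;DR: The authors propose LoRA-MCL, a language model training scheme that combines Multiple Choice Learning (MCL) with Low-Rank Adaptation (LoRA) to address contextual ambiguity in language modeling and generate diverse, relevant sentence continuations.

Details

Motivation: Traditional language modeling is an ill-posed problem because a given context may admit several plausible continuations. The authors therefore propose a method that handles this ambiguity and produces diverse outputs.

Result: On visual and audio captioning tasks, LoRA-MCL generates outputs with high diversity and high relevance.

Insight: Combining MCL with LoRA handles ambiguity in language modeling efficiently, yielding outputs that are both diverse and relevant.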

Abstract: We propose LoRA-MCL, a training scheme that extends next-token prediction in language models with a method designed to decode diverse, plausible sentence continuations at inference time. Traditional language modeling is an intrinsically ill-posed problem: given a context, multiple futures may be equally plausible. Our approach leverages Multiple Choice Learning (MCL) and the Winner-Takes-All (WTA) loss to efficiently handle ambiguity through Low-Rank Adaptation (LoRA). We provide a theoretical interpretation of applying Multiple Choice Learning to Language Modeling, assuming the data is generated from a mixture of distributions. To illustrate the proposed approach, we use data sampled from mixtures of Markov chains. We then demonstrate with extensive experiments on real-world visual and audio captioning tasks that our method achieves high diversity and relevance in generated outputs.
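The Winner-Takes-All (WTA) loss mentioned in the abstract can be illustrated on precomputed losses. This is a minimal sketch, assuming each of K hypotheses has already produced a scalar loss per training sample; in LoRA-MCL the hypotheses would correspond to LoRA adapters, but the function name `wta_loss` and the numbers below are illustrative and are not the authors' implementation.

```python
def wta_loss(per_hypothesis_losses):
    """Winner-Takes-All: mean over samples of the smallest loss among the hypotheses.

    per_hypothesis_losses[i][k] is the loss of hypothesis k on sample i.
    Returns the mean winning loss and the index of the winning hypothesis per sample.
    """
    winners = []
    total = 0.0
    for losses in per_hypothesis_losses:
        # Only the best hypothesis for this sample contributes to the loss.
        k_best = min(range(len(losses)), key=lambda k: losses[k])
        winners.append(k_best)
        total += losses[k_best]
    return total / len(per_hypothesis_losses), winners


# Hypothetical losses for 2 samples and 3 hypotheses.
losses = [[2.0, 0.5, 1.0],
          [0.3, 0.9, 0.4]]
mean_loss, winners = wta_loss(losses)
print(mean_loss, winners)
```

Because only the winning hypothesis is penalized for each sample, different hypotheses are free to specialize on different plausible continuations, which is what allows diverse outputs at inference time.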

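[181] Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination cs.LG | cs.AI | cs.CL

Mingqi Wu, Zhihao Zhang, Qiaole Dong, Zhiheng Xi, Jun Zhao

TL;DR: The paper shows that studies using reinforcement learning (RL) to enhance the reasoning of large language models (LLMs) may report unreliable results because of data contamination, and it proposes a generator of leakage-free datasets to verify the effectiveness of RL methods.

Details

Motivation: Many recent studies report large reasoning gains for LLMs from RL methods, but the reliability of these results is in doubt: the gains are pronounced on Qwen2.5 models yet fail to appear on other models. The authors suspect data contamination is the main cause.

Result: Experiments show that on uncontaminated datasets only accurate reward signals consistently improve performance, while noisy or incorrect signals do not.

Insight: RL methods should be evaluated on uncontaminated datasets and validated across model families to avoid spurious conclusions caused by data contamination.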

Abstract: The reasoning capabilities of large language models (LLMs) have been a longstanding focus of research. Recent works have further enhanced these capabilities using reinforcement learning (RL), with many new methods claiming significant improvements with minimal or no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance reasoning performance. However, these breakthroughs are mostly reported on the Qwen2.5 model family and evaluated on well-known benchmarks such as MATH-500, AMC, and AIME, while failing to achieve similar gains on other models like Llama, which warrants further investigation. Our analysis shows that although Qwen2.5 achieves strong mathematical reasoning performance, its pretraining on large-scale web corpora makes it vulnerable to data contamination in popular benchmarks. As a result, results derived from these benchmarks may be unreliable. To address this, we introduce a generator that produces fully synthetic arithmetic problems of arbitrary length and difficulty, yielding a clean dataset we call RandomCalculation. Using these leakage-free datasets, we show that only accurate reward signals consistently improve performance, while noisy or incorrect signals do not. We advocate for evaluating RL methods on uncontaminated benchmarks and across diverse model families to ensure trustworthy conclusions.
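The idea of a fully synthetic arithmetic benchmark with an exact reward can be sketched in a few lines. This is an illustration only: the abstract does not describe how the RandomCalculation generator works, so `make_problem`, `exact_match_reward`, the operator set, and the operand range are all assumptions made for the example.

```python
import random


def make_problem(num_operands, rng, max_value=99):
    """Build a random integer expression with `num_operands` operands and its exact answer."""
    if num_operands < 1:
        raise ValueError("need at least one operand")
    value = rng.randint(0, max_value)
    expression = str(value)
    answer = value
    for _ in range(num_operands - 1):
        op = rng.choice("+-*")
        operand = rng.randint(0, max_value)
        # Parenthesize the running expression so evaluation is strictly left to right.
        expression = f"({expression} {op} {operand})"
        if op == "+":
            answer += operand
        elif op == "-":
            answer -= operand
        else:
            answer *= operand
    return expression, answer


def exact_match_reward(predicted, answer):
    """Accurate reward signal: 1.0 only when the prediction equals the true answer."""
    return 1.0 if predicted == answer else 0.0


rng = random.Random(0)  # fixed seed so the sketch is reproducible
expression, answer = make_problem(5, rng)
print(expression, "=", answer)
```

Because every problem is drawn fresh from a random generator, it cannot have appeared in a pretraining corpus, and the length parameter gives a simple handle on difficulty.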

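[182] Learning Diffusion Models with Flexible Representation Guidance cs.LG | cs.AI | cs.CV

Chenyu Wang, Cai Zhou, Sharut Gupta, Zongyu Lin, Stefanie Jegelka

TL;DR: The paper presents a systematic framework for improving diffusion models through auxiliary representation guidance and introduces two new strategies for enhancing representation alignment; experiments show substantial gains in performance and training speed.

Details

Motivation: Guidance from the representations of pre-trained models can improve the generation quality and training efficiency of diffusion models, but existing methods lack a systematic framework.

Result: The method performs strongly on image, protein sequence, and molecule generation tasks; on the class-conditional ImageNet 256×256 benchmark, training is 23.3 times faster than the original SiT-XL.

Insight: Representation guidance can effectively improve both the generation quality and the training efficiency of diffusion models; multimodal alignment and an optimized training curriculum are key.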

Abstract: Diffusion models can be improved with additional guidance towards more effective representations of input. Indeed, prior empirical work has already shown that aligning internal representations of the diffusion model with those of pre-trained models improves generation quality. In this paper, we present a systematic framework for incorporating representation guidance into diffusion models. We provide alternative decompositions of denoising models along with their associated training criteria, where the decompositions determine when and how the auxiliary representations are incorporated. Guided by our theoretical insights, we introduce two new strategies for enhancing representation alignment in diffusion models. First, we pair examples with target representations either derived from themselves or arisen from different synthetic modalities, and subsequently learn a joint model over the multimodal pairs. Second, we design an optimal training curriculum that balances representation learning and data generation. Our experiments across image, protein sequence, and molecule generation tasks demonstrate superior performance as well as accelerated training. In particular, on the class-conditional ImageNet $256\times 256$ benchmark, our guidance results in $23.3$ times faster training than the original SiT-XL as well as four times speedup over the state-of-the-art method REPA. The code is available at https://github.com/ChenyuWang-Monica/REED.

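[183] Confounder-Free Continual Learning via Recursive Feature Normalization cs.LG | cs.CV

Yash Shah, Camila Gonzalez, Mohammad H. Abbasi, Qingyu Zhao, Kilian M. Pohl

TL;DR: The paper proposes a recursive feature normalization layer (R-MDN) that removes the influence of confounders in continual learning, improving the fairness and robustness of model predictions.

Details

Motivation: Confounders affect a model's feature representations, leading to spurious correlations and biased predictions. Traditional methods such as metadata normalization (MDN) can handle this, but removing the influence of confounders in continual learning settings remains a challenge.

Result: Experiments show that R-MDN improves the fairness of predictions across population groups in both static and continual learning, and reduces catastrophic forgetting caused by confounder effects that change over time.

Insight: The recursive approach adapts effectively to dynamically changing confounders, offering a new solution for fairness and continual learning.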

Abstract: Confounders are extraneous variables that affect both the input and the target, resulting in spurious correlations and biased predictions. There are recent advances in dealing with or removing confounders in traditional models, such as metadata normalization (MDN), where the distribution of the learned features is adjusted based on the study confounders. However, in the context of continual learning, where a model learns continuously from new data over time without forgetting, learning feature representations that are invariant to confounders remains a significant challenge. To remove their influence from intermediate feature representations, we introduce the Recursive MDN (R-MDN) layer, which can be integrated into any deep learning architecture, including vision transformers, and at any model stage. R-MDN performs statistical regression via the recursive least squares algorithm to maintain and continually update an internal model state with respect to changing distributions of data and confounding variables. Our experiments demonstrate that R-MDN promotes equitable predictions across population groups, both within static learning and across different stages of continual learning, by reducing catastrophic forgetting caused by confounder effects changing over time.
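The recursive least squares (RLS) algorithm that R-MDN relies on can be shown in isolation. The sketch below regresses a single scalar feature on a confounder one sample at a time and returns the residual, which is the part of the feature the confounder does not explain. It assumes a scalar feature and a two-element design vector `[1, confounder]`; the class name, the initial value `delta`, and the toy data are illustrative and do not reproduce the authors' layer.

```python
class RecursiveLeastSquares:
    """Online linear regression y ≈ theta · x, updated one sample at a time."""

    def __init__(self, dim, delta=1e3, forgetting=1.0):
        self.theta = [0.0] * dim
        # P tracks the inverse of the (weighted) input covariance; a large delta means a weak prior.
        self.P = [[delta if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.lam = forgetting

    def predict(self, x):
        return sum(t * xi for t, xi in zip(self.theta, x))

    def update(self, x, y):
        """Absorb one (x, y) pair and return the residual computed before the update."""
        n = len(x)
        Px = [sum(self.P[i][j] * x[j] for j in range(n)) for i in range(n)]
        denom = self.lam + sum(x[i] * Px[i] for i in range(n))
        gain = [v / denom for v in Px]
        residual = y - self.predict(x)
        self.theta = [t + g * residual for t, g in zip(self.theta, gain)]
        # P <- (P - gain * x^T P) / lambda; x^T P equals Px transposed because P is symmetric.
        self.P = [[(self.P[i][j] - gain[i] * Px[j]) / self.lam for j in range(n)]
                  for i in range(n)]
        return residual


# Toy stream (hypothetical): the feature depends linearly on the confounder c.
rls = RecursiveLeastSquares(dim=2)
for c in range(10):
    feature = 1.0 + 2.0 * c
    rls.update([1.0, float(c)], feature)
print([round(t, 2) for t in rls.theta])
```

Setting `forgetting` below 1.0 down-weights old samples, which is one standard way for RLS to follow a confounder distribution that drifts over time.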

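[184] Warm Starts Accelerate Generative Modelling cs.LG | cs.CV | stat.ML

Jonas Scholz, Richard E. Turner

TL;DR: The paper proposes a "warm-start model" that accelerates iterative generative models by providing a better starting point, substantially reducing the number of function evaluations required.

Details

Motivation: Iterative generative models such as diffusion and flow-matching models need heavy computation to produce high-fidelity samples, often hundreds of function evaluations, which limits their practical efficiency. The paper aims to accelerate generation by improving the initial condition.

Result: On image inpainting, the method achieves results competitive with a 1000-step DDPM baseline using only 11 function evaluations (1 for the warm start, 10 for generation).

Insight: Optimizing the initial condition (the warm start) can substantially reduce the computational burden of iterative generative models while maintaining generation quality, making such models more practical to deploy.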

Abstract: Iterative generative models, like diffusion and flow-matching, create high-fidelity samples by progressively refining a noise vector into data. However, this process is notoriously slow, often requiring hundreds of function evaluations. We introduce the warm-start model, a simple, deterministic model that dramatically accelerates conditional generation by providing a better starting point. Instead of starting generation from an uninformed N(0, I) prior, our warm-start model predicts an informed prior N(mu, sigma), whose moments are conditioned on the input context. This “warm start” substantially reduces the distance the generative process must traverse, particularly when the conditioning information is strongly informative. On tasks like image inpainting, our method achieves results competitive with a 1000-step DDPM baseline using only 11 total function evaluations (1 for the warm start, 10 for generation). A simple conditional normalization trick makes our method compatible with any standard generative model and sampler without modification, allowing it to be combined with other efficient sampling techniques for further acceleration. Our implementation is available at https://github.com/jonas-scholz123/warm-start-model.
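The informed prior N(mu, sigma) described in the abstract can be sketched with plain lists. The code below assumes a diagonal Gaussian, and it reads the "conditional normalization trick" as standardizing data with the predicted moments so that an unmodified sampler can still start from N(0, I). That reading is an assumption, as the abstract does not spell the trick out, and all function names are illustrative.

```python
import random


def warm_start_noise(mu, sigma, rng):
    """Draw the initial state from N(mu, diag(sigma^2)) instead of N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def normalize(x, mu, sigma):
    """Map data into the space where the informed prior becomes N(0, I) (assumed trick)."""
    return [(xi - m) / s for xi, m, s in zip(x, mu, sigma)]


def denormalize(z, mu, sigma):
    """Map a sample from the standard generative model back to data space."""
    return [m + s * zi for zi, m, s in zip(z, mu, sigma)]


# Hypothetical moments that a warm-start network might predict from the context.
mu = [1.0, -2.0]
sigma = [0.5, 2.0]
rng = random.Random(0)
start = warm_start_noise(mu, sigma, rng)
round_trip = denormalize(normalize([1.5, 0.0], mu, sigma), mu, sigma)
print(start, round_trip)
```

With the budget quoted in the abstract, the total cost is 1 + 10 = 11 function evaluations, roughly 91 times fewer than the 1000-step baseline.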

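[185] MLoRQ: Bridging Low-Rank and Quantization for Transformer Compression cs.LG | cs.CV

Ofir Gordon, Ariel Lapid, Elad Cohen, Yarden Yagil, Arnon Netzer

TL;DR: MLoRQ is a new method that combines low-rank approximation with mixed-precision quantization, using a two-stage optimization to choose bit-width and rank assignments, and it substantially improves the compression of Transformer models for resource-constrained devices.

Details

Motivation: Deploying Transformer models on resource-constrained edge devices is challenging, and existing techniques such as low-rank approximation and quantization are usually applied separately. MLoRQ aims to combine the two to compress models more efficiently.

Result: Experiments on Vision Transformers show performance improvements of up to 15%, with state-of-the-art results on image classification, object detection, and instance segmentation.

Insight: Combining low-rank approximation and quantization compresses models more efficiently, but optimizing the two jointly is the key. The two-stage optimization could extend to other compression tasks.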

Abstract: Deploying transformer-based neural networks on resource-constrained edge devices presents a significant challenge. This challenge is often addressed through various techniques, such as low-rank approximation and mixed-precision quantization. In this work, we introduce Mixed Low-Rank and Quantization (MLoRQ), a novel method that integrates both techniques. MLoRQ employs a two-stage optimization process to determine optimal bit-width and rank assignments for each layer, adhering to predefined memory constraints. This process includes: (i) an intra-layer optimization that identifies potentially optimal compression solutions out of all low-rank and quantization combinations; (ii) an inter-layer optimization that assigns bit-width precision and rank to each layer while ensuring the memory constraint is met. An optional final step applies a sequential optimization process using a modified adaptive rounding technique to mitigate compression-induced errors in joint low-rank approximation and quantization. The method is compatible and can be seamlessly integrated with most existing quantization algorithms. MLoRQ shows state-of-the-art results with up to 15% performance improvement, evaluated on Vision Transformers for image classification, object detection, and instance segmentation tasks.
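The memory trade-off behind joint low-rank and quantized compression can be made concrete. The sketch below counts weight storage for a layer at a given bit-width and optional rank, then filters candidate (memory, error) pairs down to the non-dominated ones. Treating the intra-layer stage as a Pareto filter is one reading of "potentially optimal compression solutions", and the error values are hypothetical placeholders; neither comes from the paper.

```python
def layer_bits(rows, cols, bits, rank=None):
    """Weight storage in bits for a dense layer, optionally factorized as (rows x rank)(rank x cols).

    Biases and quantization parameters are ignored for simplicity.
    """
    if rank is None:
        return rows * cols * bits
    return rank * (rows + cols) * bits


def pareto_front(candidates):
    """Keep (memory_bits, error, label) candidates that no cheaper candidate matches or beats on error."""
    front = []
    best_error = float("inf")
    for cand in sorted(candidates, key=lambda c: (c[0], c[1])):
        if cand[1] < best_error:
            front.append(cand)
            best_error = cand[1]
    return front


# A 768 x 768 layer with hypothetical reconstruction errors for each configuration.
candidates = [
    (layer_bits(768, 768, 8), 0.01, "full-8b"),
    (layer_bits(768, 768, 4), 0.05, "full-4b"),
    (layer_bits(768, 768, 8, rank=128), 0.04, "r128-8b"),
    (layer_bits(768, 768, 4, rank=128), 0.09, "r128-4b"),
    (layer_bits(768, 768, 4, rank=256), 0.06, "r256-4b"),
]
front = pareto_front(candidates)
print([label for _, _, label in front])
```

An inter-layer stage would then pick one surviving candidate per layer so that the summed memory stays within the budget, which is a knapsack-style selection problem.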

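[186] Learning Private Representations through Entropy-based Adversarial Training cs.LG | cs.AI | cs.CV

Tassilo Klein, Moin Nabi

TL;DR: The paper proposes a privacy-preserving representation learning method based on adversarial training, introducing focal entropy to reduce information leakage while maintaining high predictive performance.

Details

Motivation: How to protect user privacy while retaining high predictive performance is a core question in privacy-preserving machine learning. Existing entropy-based methods may leak information and therefore need improvement.

Result: The method is validated on multiple benchmarks, achieving high target utility at moderate privacy leakage.

Insight: A framework that combines adversarial training with entropy offers a new direction for privacy-preserving representation learning, and the focal entropy design strikes a better balance between privacy and performance.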

Abstract: How can we learn a representation with high predictive power while preserving user privacy? We present an adversarial representation learning method for sanitizing sensitive content from the learned representation. Specifically, we introduce a variant of entropy - focal entropy, which mitigates the potential information leakage of the existing entropy-based approaches. We showcase feasibility on multiple benchmarks. The results suggest high target utility at moderate privacy leakage.

cs.GR [Back]

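[187] RectifiedHR: High-Resolution Diffusion via Energy Profiling and Adaptive Guidance Scheduling cs.GR | cs.CV

Ankit Sanjyal

TL;DR: The paper proposes a high-resolution diffusion approach based on energy profiling and adaptive guidance scheduling, addressing energy instabilities and guidance artifacts in image synthesis.

Details

Motivation: High-resolution image synthesis with diffusion models often suffers from energy instabilities and guidance artifacts that degrade visual quality; the author aims to mitigate this by optimizing the guidance schedule.

Result: Experiments show strong stability and consistency (a stability score of 0.9998 and a consistency score of 0.9873), with sharper images and fewer artifacts.

Insight: The energy profiling framework is a powerful diagnostic tool for understanding and improving diffusion model behavior, and it reveals the potential of optimized guidance scheduling.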

Abstract: High-resolution image synthesis with diffusion models often suffers from energy instabilities and guidance artifacts that degrade visual quality. We analyze the latent energy landscape during sampling and propose adaptive classifier-free guidance (CFG) schedules that maintain stable energy trajectories. Our approach introduces energy-aware scheduling strategies that modulate guidance strength over time, achieving superior stability scores (0.9998) and consistency metrics (0.9873) compared to fixed-guidance approaches. We demonstrate that DPM++ 2M with linear-decreasing CFG scheduling yields optimal performance, providing sharper, more faithful images while reducing artifacts. Our energy profiling framework serves as a powerful diagnostic tool for understanding and improving diffusion model behavior.
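A linear-decreasing classifier-free guidance (CFG) schedule of the kind named in the abstract can be sketched as follows. The start and end weights (10.0 and 5.0) and the function names are assumptions chosen for illustration; the abstract does not give the values used in the paper. The combination rule in `apply_cfg` is the standard CFG formula.

```python
def linear_decreasing_cfg(step, num_steps, w_start=10.0, w_end=5.0):
    """Guidance weight that falls linearly from w_start at step 0 to w_end at the last step."""
    if num_steps < 2:
        return w_start
    frac = step / (num_steps - 1)
    return w_start + (w_end - w_start) * frac


def apply_cfg(eps_uncond, eps_cond, weight):
    """Classifier-free guidance: eps_uncond + weight * (eps_cond - eps_uncond)."""
    return [u + weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Guidance weights over a hypothetical 5-step sampling run.
schedule = [linear_decreasing_cfg(t, 5) for t in range(5)]
print(schedule)

# Toy noise predictions standing in for the two forward passes of a denoiser.
guided = apply_cfg([0.0, 1.0], [1.0, 1.0], schedule[0])
print(guided)
```

Strong guidance early in sampling pushes the sample toward the prompt, while the lower weight in later steps leaves fine detail less distorted, which matches the motivation given for time-varying guidance.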

eess.AS [Back]

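[188] ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching eess.AS | cs.CL

Han Zhu, Wei Kang, Liyong Guo, Zengwei Yao, Fangjun Kuang

TL;DR: The paper proposes ZipVoice-Dialog, a non-autoregressive, zero-shot spoken dialogue generation model built on flow matching, which addresses the slow and unstable inference of existing autoregressive models.

Details

Motivation: Existing spoken dialogue generation models are mostly autoregressive and suffer from slow, unstable inference. ZipVoice-Dialog aims to solve these problems with a non-autoregressive approach.

Result: The model performs strongly in intelligibility, speaker turn-taking accuracy, speaker similarity, and inference speed.

Insight: Non-autoregressive methods show promise for spoken dialogue generation, particularly for real-time use and stability; high-quality datasets and evaluation benchmarks are also essential to the research.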

Abstract: Generating spoken dialogue is more challenging than monologue text-to-speech (TTS) due to the need for realistic turn-taking and distinct speaker timbres. Existing spoken dialogue generation models, being auto-regressive, suffer from slow and unstable inference. To overcome these limitations, we introduce ZipVoice-Dialog, a non-autoregressive zero-shot spoken dialogue generation model built upon flow matching. Key designs include: 1) speaker-turn embeddings for precise speaker turn-taking; 2) a curriculum learning strategy for stable speech-text alignment; 3) specialized strategies to enable stereo dialogue generation. Additionally, recognizing the lack of open-source large-scale spoken dialogue datasets, we curated OpenDialog, a 6.8k-hour spoken dialogue dataset from in-the-wild speech data. Furthermore, we established a benchmark to comprehensively evaluate various models. Experimental results demonstrate that ZipVoice-Dialog achieves superior performance in intelligibility, speaker turn-taking accuracy, speaker similarity, and inference speed. Our codes, model checkpoints, demo samples, and the OpenDialog dataset are all publicly available at https://github.com/k2-fsa/ZipVoice.

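[189] Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction eess.AS | cs.CV | cs.SD

Shu-wen Yang, Byeonggeun Kim, Kuan-Po Huang, Qingming Tang, Huy Phan

TL;DR: The paper proposes a language modeling approach to audio generation that uses continuous-valued tokens and masked next-token prediction, substantially improving generation quality while being more parameter-efficient than existing methods.

Details

Motivation: Conventional autoregressive models based on discrete tokens struggle with continuous audio data. The paper explores how a causal language model can model continuous audio directly while improving generation quality and parameter efficiency.

Result: Compared with AudioGen, the approach achieves 20% and 40% relative gains on AudioCaps in FAD and KL divergence, respectively. The masked next-token prediction task yields 41% and 33% relative FAD improvements over AudioGen Base and AudioGen Large, respectively, with fewer parameters.

Insight: Combining continuous-valued tokens with masked prediction offers a new direction for audio generation and balances performance with efficiency.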

Abstract: Autoregressive next-token prediction with the Transformer decoder has become a de facto standard in large language models (LLMs), achieving remarkable success in Natural Language Processing (NLP) at scale. Extending this paradigm to audio poses unique challenges due to its inherently continuous nature. We research audio generation with a causal language model (LM) without discrete tokens. We leverage token-wise diffusion to model the continuous distribution of the next continuous-valued token. Our approach delivers significant improvements over previous discrete solution, AudioGen, achieving 20% and 40% relative gains on AudioCaps in Frechet Audio Distance (FAD) and Kullback-Leibler (KL) divergence, respectively. Additionally, we propose a novel masked next-token prediction task that incorporates masked prediction into the causal LM framework. On AudioCaps, the innovation yields 41% and 33% relative FAD improvements over AudioGen Base (285M) and AudioGen Large (1B) models, respectively, and is on par with the state-of-the-art (SOTA) diffusion models. Furthermore, we achieve these results with significantly fewer parameters – 193M for our Base and 462M for our Large models.
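The parameter savings implied by the model sizes in the abstract can be worked out directly. This is a small sketch, assuming that "1B" means 1000M parameters; the helper name `relative_reduction` is illustrative. The same formula would apply to a lower-is-better metric such as FAD, although the abstract does not state how its relative gains are computed.

```python
def relative_reduction(baseline, value):
    """Fraction by which `value` is smaller than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - value) / baseline


# Parameter counts in millions, taken from the abstract; "1B" is treated as 1000M.
base_saving = relative_reduction(285, 193)
large_saving = relative_reduction(1000, 462)
print(f"Base: {base_saving:.1%} fewer parameters than AudioGen Base")
print(f"Large: {large_saving:.1%} fewer parameters than AudioGen Large")
```

Under that assumption the Base model uses about a third fewer parameters than AudioGen Base, and the Large model uses a little over half as few as AudioGen Large.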

Table of Contents

cs.CV [Back]

[1] View Invariant Learning for Vision-Language Navigation in Continuous Environments cs.CV | cs.LG | cs.RO

[2] Detecting Deepfake Talking Heads from Facial Biometric Anomalies cs.CV

[3] PRISM: Reducing Spurious Implicit Biases in Vision-Language Models with LLM-Guided Embedding Projection cs.CV | cs.LG

[4] Video Inference for Human Mesh Recovery with Vision Transformer cs.CV

[5] From images to properties: a NeRF-driven framework for granular material parameter inversion cs.CV | physics.geo-ph

[6] VISTA: A Visual Analytics Framework to Enhance Foundation Model-Generated Data Labels cs.CV

[7] BrainLesion Suite: A Flexible and User-Friendly Framework for Modular Brain Lesion Image Analysis cs.CV | cs.AI | cs.LG | cs.SE

[8] Can Contrastive Learning Improve Class-Imbalanced Diffusion Model? cs.CV | cs.LG

[9] Infinite Video Understanding cs.CV | cs.AI | cs.IR | cs.LG | cs.MM

[10] BlindSight: Harnessing Sparsity for Efficient VLMs cs.CV | I.2.10

[11] From Physics to Foundation Models: A Review of AI-Driven Quantitative Remote Sensing Inversion cs.CV

[12] Taming generative video models for zero-shot optical flow extraction cs.CV

[13] MI CAM: Mutual Information Weighted Activation Mapping for Causal Visual Explanations of Convolutional Neural Networks cs.CV | cs.LG

[14] RadEyeVideo: Enhancing general-domain Large Vision Language Model for chest X-ray analysis with video representations of eye gaze cs.CV

[15] Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning cs.CV

[16] Hybrid Autoregressive-Diffusion Model for Real-Time Streaming Sign Language Production cs.CV

[17] RoHOI: Robustness Benchmark for Human-Object Interaction Detection cs.CV | cs.HC | cs.RO | eess.IV

[18] Mind the Gap: Preserving and Compensating for the Modality Gap in CLIP-Based Continual Learning cs.CV | cs.LG

[19] SnapMoGen: Human Motion Generation from Expressive Texts cs.CV

[20] PoseLLM: Enhancing Language-Guided Human Pose Estimation with MLP Alignment cs.CV

[21] $I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting cs.CV

[22] Learning and Transferring Better with Depth Information in Visual Reinforcement Learning cs.CV | cs.RO

[23] MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models cs.CV

[24] THYME: Temporal Hierarchical-Cyclic Interactivity Modeling for Video Scene Graphs in Aerial Footage cs.CV

[25] Visual Surface Wave Elastography: Revealing Subsurface Physical Properties via Visible Surface Waves cs.CV

[26] Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models cs.CV

[27] Stereo-based 3D Anomaly Object Detection for Autonomous Driving: A New Dataset and Baseline cs.CV

[28] 360-Degree Full-view Image Segmentation by Spherical Convolution compatible with Large-scale Planar Pre-trained Models cs.CV

[29] Online Long-term Point Tracking in the Foundation Model Era cs.CV

[30] Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift cs.CV | cs.LG

[31] EgoAnimate: Generating Human Animations from Egocentric top-down Views cs.CV

[32] PPJudge: Towards Human-Aligned Assessment of Artistic Painting Process cs.CV

[33] AGCD-Net: Attention Guided Context Debiasing Network for Emotion Recognition cs.CV | cs.AI

[34] Ambiguity-Aware and High-Order Relation Learning for Multi-Grained Image-Text Matching cs.CV | cs.IR | cs.MM

[35] SAGE: Segment-Aware Gloss-Free Encoding for Token-Efficient Sign Language Translation cs.CV

[36] Cross Knowledge Distillation between Artificial and Spiking Neural Networks cs.CV | cs.AI

[37] Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models cs.CV | cs.AI | cs.CL

[38] Generative Latent Kernel Modeling for Blind Motion Deblurring cs.CV

[39] Supercharging Floorplan Localization with Semantic Rays cs.CV | cs.LG

[40] Geo-RepNet: Geometry-Aware Representation Learning for Surgical Phase Recognition in Endoscopic Submucosal Dissection cs.CV | cs.RO

[41] ViT-ProtoNet for Few-Shot Image Classification: A Multi-Benchmark Evaluation cs.CV | cs.AI | cs.LG

[42] DAA*: Deep Angular A Star for Image-based Path Planning cs.CV | cs.LG | eess.IV

[43] ProactiveBench: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models cs.CV

[44] Dynamic Inter-Class Confusion-Aware Encoder for Audio-Visual Fusion in Human Activity Recognition cs.CV

[45] Fast3D: Accelerating 3D Multi-modal Large Language Models for Efficient 3D Scene Understanding cs.CV

[46] Simplifying Traffic Anomaly Detection with Video Foundation Models cs.CV

[47] Automated Multi-Class Crop Pathology Classification via Convolutional Neural Networks: A Deep Learning Approach for Real-Time Precision Agriculture cs.CV | I.2.6; I.5.4

[48] GreenCrossingAI: A Camera Trap/Computer Vision Pipeline for Environmental Science Research Groups cs.CV | cs.LG

[49] Domain Adaptation and Multi-view Attention for Learnable Landmark Tracking with Sparse Data cs.CV | cs.AI | cs.LG | cs.RO

[50] SegVec3D: A Method for Vector Embedding of 3D Objects Oriented Towards Robot manipulation cs.CV | cs.RO

[51] CKAA: Cross-subspace Knowledge Alignment and Aggregation for Robust Continual Learning cs.CV

[52] HMID-Net: An Exploration of Masked Image Modeling and Knowledge Distillation in Hyperbolic Space cs.CV | cs.AI

[53] GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them? cs.CV

[54] SDTN and TRN: Adaptive Spectral-Spatial Feature Extraction for Hyperspectral Image Classification cs.CV | cs.AI

[55] Advancing Reliable Test-Time Adaptation of Vision-Language Models under Visual Variations cs.CV

[56] Online Micro-gesture Recognition Using Data Augmentation and Spatial-Temporal Attention cs.CV

[57] QuarterMap: Efficient Post-Training Token Pruning for Visual State Space Models cs.CV | cs.AI

[58] When Schrödinger Bridge Meets Real-World Image Dehazing with Unpaired Training cs.CV

[59] VDInstruct: Zero-Shot Key Information Extraction via Content-Aware Vision Tokenization cs.CV | cs.AI | cs.LG

[60] Prompt Engineering in Segment Anything Model: Methodologies, Applications, and Emerging Challenges cs.CV | cs.AI

[61] WordCraft: Interactive Artistic Typography with Attention Awareness and Noise Blending cs.CV

[62] MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models cs.CV | cs.AI | cs.CL

[63] Memory-Augmented SAM2 for Training-Free Surgical Video Segmentation cs.CV

[64] Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score cs.CV

[65] Generate Aligned Anomaly: Region-Guided Few-Shot Anomaly Image-Mask Pair Synthesis for Industrial Inspection cs.CV

[66] Brain Stroke Detection and Classification Using CT Imaging with Transformer Models and Explainable AI cs.CV | cs.AI

[67] Disentanglement and Assessment of Shortcuts in Ophthalmological Retinal Imaging Exams cs.CV | cs.LG

[68] Prompt2DEM: High-Resolution DEMs for Urban and Open Environments from Global Prompts Using a Monocular Foundation Model cs.CV | eess.IV

[69] ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models cs.CV | cs.AI | cs.CL

[70] ExpStar: Towards Automatic Commentary Generation for Multi-discipline Scientific Experiments cs.CV

[71] Token Compression Meets Compact Vision Transformers: A Survey and Comparative Evaluation for Edge AI cs.CV

[72] Cross-modal Associations in Vision and Language Models: Revisiting the bouba-kiki effect cs.CV | cs.CL

[73] NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection cs.CV | cs.LG

[74] VRU-Accident: A Vision-Language Benchmark for Video Question Answering and Dense Captioning for Accident Scene Understanding cs.CV

[75] Hierarchical Abstraction Enables Human-Like 3D Object Recognition in Deep Learning Models cs.CV | cs.LG

[76] FaceLLM: A Multimodal Large Language Model for Face Understanding cs.CV | cs.AI | cs.CL

[77] A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends cs.CV | cs.AI

[78] SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation cs.CV | eess.AS