cs.CV [Total: 48]
cs.CL [Total: 12]
cs.RO [Total: 5]
cs.SI [Total: 1]
cs.SD [Total: 1]
eess.IV [Total: 4]
cs.LG [Total: 3]

cs.CV

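[1] 3D and 4D World Modeling: A Survey cs.CV | cs.RO

Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang

TL;DR: This paper presents the first comprehensive survey dedicated to 3D and 4D world modeling. It establishes precise definitions and a taxonomy, summarizes the relevant datasets and evaluation metrics, and discusses practical applications and future research directions.

Details

Motivation: Existing work mostly focuses on generative methods for 2D images and video and overlooks native 3D and 4D representations such as RGB-D imagery, occupancy grids, and LiDAR point clouds. In addition, the lack of a standardized definition and taxonomy for "world models" has led to fragmented and inconsistent claims in the literature.

Result: The paper gives a comprehensive overview of 3D and 4D world modeling, covering a taxonomy, datasets, evaluation metrics, and future research directions.

Insight: 3D and 4D representations offer clear advantages for modeling dynamic environments; standardized definitions and evaluation metrics should help the field advance; future directions include multimodal fusion and real-time modeling.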

Abstract: World modeling has become a cornerstone in AI research, enabling agents to understand, represent, and predict the dynamic environments they inhabit. While prior work largely emphasizes generative methods for 2D image and video data, it overlooks the rapidly growing body of work that leverages native 3D and 4D representations such as RGB-D imagery, occupancy grids, and LiDAR point clouds for large-scale scene modeling. At the same time, the absence of a standardized definition and taxonomy for "world models" has led to fragmented and sometimes inconsistent claims in the literature. This survey addresses these gaps by presenting the first comprehensive review explicitly dedicated to 3D and 4D world modeling and generation. We establish precise definitions, introduce a structured taxonomy spanning video-based (VideoGen), occupancy-based (OccGen), and LiDAR-based (LiDARGen) approaches, and systematically summarize datasets and evaluation metrics tailored to 3D/4D settings. We further discuss practical applications, identify open challenges, and highlight promising research directions, aiming to provide a coherent and foundational reference for advancing the field. A systematic summary of existing literature is available at https://github.com/worldbench/survey

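[2] Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs cs.CV | cs.LG

Hyungjin Chung, Hyelin Nam, Jiyeon Kim, Hyojun Go, Byeongjun Park

TL;DR: Video Parallel Scaling (VPS) is an inference-time method that processes different subsets of a video's frames in parallel and aggregates the results, expanding the perceptual capacity of VideoLLMs without enlarging the context window.

Details

Motivation: When VideoLLMs process more frames to capture fine-grained temporal detail, they run into prohibitive computational cost and performance degradation. A method is needed that improves performance without increasing the computational burden.

Result: Experiments across model architectures and scales (2B-32B) show that VPS significantly improves performance on benchmarks such as Video-MME and EventHallusion and scales more favorably than other parallel methods.

Insight: VPS is a memory-efficient and robust framework that strengthens the temporal reasoning of VideoLLMs and is complementary to other decoding strategies.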

Abstract: Video Large Language Models (VideoLLMs) face a critical bottleneck: increasing the number of input frames to capture fine-grained temporal detail leads to prohibitive computational costs and performance degradation from long context lengths. We introduce Video Parallel Scaling (VPS), an inference-time method that expands a model’s perceptual bandwidth without increasing its context window. VPS operates by running multiple parallel inference streams, each processing a unique, disjoint subset of the video’s frames. By aggregating the output probabilities from these complementary streams, VPS integrates a richer set of visual information than is possible with a single pass. We theoretically show that this approach effectively contracts the Chinchilla scaling law by leveraging uncorrelated visual evidence, thereby improving performance without additional training. Extensive experiments across various model architectures and scales (2B-32B) on benchmarks such as Video-MME and EventHallusion demonstrate that VPS consistently and significantly improves performance. It scales more favorably than other parallel alternatives (e.g. Self-consistency) and is complementary to other decoding strategies, offering a memory-efficient and robust framework for enhancing the temporal reasoning capabilities of VideoLLMs.
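
The stream-and-aggregate idea in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the interleaved split, the plain averaging rule, and the toy probability values are all assumptions, since the abstract only states that streams see disjoint frame subsets and that their output probabilities are aggregated.

```python
def disjoint_frame_subsets(num_frames, num_streams):
    """Split frame indices into disjoint, interleaved subsets (one per stream)."""
    return [list(range(s, num_frames, num_streams)) for s in range(num_streams)]


def aggregate_probabilities(stream_probs):
    """Average the per-stream next-token distributions and renormalize.

    Averaging is an assumed aggregation rule chosen for illustration.
    """
    n = len(stream_probs)
    vocab = len(stream_probs[0])
    mean = [sum(p[i] for p in stream_probs) / n for i in range(vocab)]
    total = sum(mean)
    return [m / total for m in mean]


# Two streams over an 8-frame video: each stream sees every other frame.
subsets = disjoint_frame_subsets(num_frames=8, num_streams=2)

# Made-up next-token distributions over a 3-token vocabulary, one per stream.
stream_probs = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
combined = aggregate_probabilities(stream_probs)
next_token = max(range(len(combined)), key=combined.__getitem__)
```

In this toy case the first stream alone would pick token 0, while the aggregated distribution picks token 1. Each stream keeps the same context length as a single pass, which is the property the abstract emphasizes.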

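[3] Two Stage Context Learning with Large Language Models for Multimodal Stance Detection on Climate Change cs.CV | cs.CY

Lata Pangtey, Omkar Kabde, Shahid Shafi Dar, Nagendra Kumar

TL;DR: This paper proposes a staged multimodal stance detection framework built on large language models. It combines textual and visual information to classify climate-related stances in social media content and outperforms existing methods on the MultiClimate dataset.

Details

Motivation: Social media content is increasingly multimodal, yet existing stance detection methods rely mainly on text. To fill this gap, the paper proposes a multimodal method that combines textual and visual information.

Result: On the MultiClimate dataset the method reaches 76.2% accuracy, 76.3% precision, 76.2% recall, and a 76.2% F1-score, outperforming existing methods.

Insight: Jointly modeling multimodal data can markedly improve stance detection, especially on complex topics such as climate change.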

Abstract: With the rapid proliferation of information across digital platforms, stance detection has emerged as a pivotal challenge in social media analysis. While most of the existing approaches focus solely on textual data, real-world social media content increasingly combines text with visual elements creating a need for advanced multimodal methods. To address this gap, we propose a multimodal stance detection framework that integrates textual and visual information through a hierarchical fusion approach. Our method first employs a Large Language Model to retrieve stance-relevant summaries from source text, while a domain-aware image caption generator interprets visual content in the context of the target topic. These modalities are then jointly modeled along with the reply text, through a specialized transformer module that captures interactions between the texts and images. The proposed modality fusion framework integrates diverse modalities to facilitate robust stance classification. We evaluate our approach on the MultiClimate dataset, a benchmark for climate change-related stance detection containing aligned video frames and transcripts. We achieve accuracy of 76.2%, precision of 76.3%, recall of 76.2% and F1-score of 76.2%, respectively, outperforming existing state-of-the-art approaches.

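[4] Sparse Transformer for Ultra-sparse Sampled Video Compressive Sensing cs.CV

Miao Cao, Siming Zheng, Lishun Wang, Ziyang Chen, David Brady

TL;DR: The paper proposes an Ultra-Sparse Sampling (USS) strategy and BSTFormer, a sparse Transformer, for video compressive sensing, significantly improving reconstruction quality under sparse sampling.

Details

Motivation: High-resolution, high-frame-rate video capture consumes too much power, and current video compressive sensing based on Random Sampling (RS) is not efficient enough. The paper proposes an ultra-sparse sampling strategy that reduces power consumption and increases dynamic range.

Result: On both simulated and real data, BSTFormer significantly outperforms existing methods, and the USS strategy offers higher dynamic range and a fixed exposure time.

Insight: 1. Ultra-sparse sampling offers better energy efficiency and dynamic range in compressive sensing; 2. The sparse Transformer uses multi-granularity attention to handle the decomposition of sparse measurements.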

Abstract: Digital cameras consume ~0.1 microjoule per pixel to capture and encode video, resulting in a power usage of ~20W for a 4K sensor operating at 30 fps. Imagining gigapixel cameras operating at 100-1000 fps, the current processing model is unsustainable. To address this, physical layer compressive measurement has been proposed to reduce power consumption per pixel by 10-100X. Video Snapshot Compressive Imaging (SCI) introduces high frequency modulation in the optical sensor layer to increase effective frame rate. A commonly used sampling strategy of video SCI is Random Sampling (RS) where each mask element value is randomly set to be 0 or 1. Similarly, image inpainting (I2P) has demonstrated that images can be recovered from a fraction of the image pixels. Inspired by I2P, we propose Ultra-Sparse Sampling (USS) regime, where at each spatial location, only one sub-frame is set to 1 and all others are set to 0. We then build a Digital Micro-mirror Device (DMD) encoding system to verify the effectiveness of our USS strategy. Ideally, we can decompose the USS measurement into sub-measurements for which we can utilize I2P algorithms to recover high-speed frames. However, due to the mismatch between the DMD and CCD, the USS measurement cannot be perfectly decomposed. To this end, we propose BSTFormer, a sparse TransFormer that utilizes local Block attention, global Sparse attention, and global Temporal attention to exploit the sparsity of the USS measurement. Extensive results on both simulated and real-world data show that our method significantly outperforms all previous state-of-the-art algorithms. Additionally, an essential advantage of the USS strategy is its higher dynamic range than that of the RS strategy. Finally, from the application perspective, the USS strategy is a good choice to implement a complete video SCI system on chip due to its fixed exposure time.
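
To make the USS sampling rule concrete, the sketch below builds a mask in which each pixel is open in exactly one sub-frame, forms the single coded measurement, and splits it into sparse sub-measurements for inpainting. It is an idealized illustration: the function names, the random choice of the open sub-frame, and the toy frame values are assumptions, and the DMD/CCD mismatch that motivates BSTFormer is ignored. The last lines redo the power estimate from the first sentence, assuming a 3840 x 2160 sensor.

```python
import random


def uss_mask(num_subframes, height, width, seed=0):
    """Ultra-sparse mask: every pixel is set to 1 in exactly one sub-frame."""
    rng = random.Random(seed)
    mask = [[[0] * width for _ in range(height)] for _ in range(num_subframes)]
    for y in range(height):
        for x in range(width):
            mask[rng.randrange(num_subframes)][y][x] = 1
    return mask


def snapshot(frames, mask):
    """Single coded measurement: per-pixel sum of mask * frame over sub-frames."""
    T, H, W = len(frames), len(frames[0]), len(frames[0][0])
    return [[sum(mask[t][y][x] * frames[t][y][x] for t in range(T))
             for x in range(W)] for y in range(H)]


def decompose(measurement, mask):
    """Split the measurement into one sparse image per sub-frame.

    None marks pixels that an inpainting algorithm would have to fill in.
    """
    return [[[measurement[y][x] if m_t[y][x] else None
              for x in range(len(m_t[0]))] for y in range(len(m_t))]
            for m_t in mask]


T, H, W = 4, 3, 3
# Toy high-speed frames with distinct values per (t, y, x).
frames = [[[10 * t + y * W + x for x in range(W)] for y in range(H)] for t in range(T)]
mask = uss_mask(T, H, W)
y_meas = snapshot(frames, mask)
subs = decompose(y_meas, mask)

# Power estimate: 0.1 microjoule per pixel, 3840 x 2160 pixels (assumed), 30 fps.
power_watts = 0.1e-6 * 3840 * 2160 * 30  # about 24.9 W, the same order as the quoted ~20 W
```

Because only one sub-frame contributes to each pixel, every measured value belongs unambiguously to one time step, which is what makes the inpainting view of the problem possible in the ideal case.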

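[5] GTA-Crime: A Synthetic Dataset and Generation Framework for Fatal Violence Detection with Adversarial Snippet-Level Domain Adaptation cs.CV

Seongho Kim, Sejong Ryu, Hyoukjun You, Je Hyeong Hong

TL;DR: GTA-Crime is a synthetic fatal-violence detection dataset and generation framework built on GTA5 that addresses data scarcity and ethical issues in real-world settings. With a snippet-level domain adaptation strategy (Wasserstein adversarial training), it improves detection accuracy on real datasets such as UCF-Crime.

Details

Motivation: Data on real-world fatal violence (such as shootings and stabbings) is hard to obtain and raises ethical issues, and existing video anomaly detection methods perform poorly on these scenarios.

Result: Experiments show that GTA-Crime combined with its domain adaptation strategy significantly improves real-world fatal violence detection accuracy.

Insight: Synthetic data can be an effective supplement where real data is scarce, and adversarial training performs well for domain adaptation with few samples.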

Abstract: Recent advancements in video anomaly detection (VAD) have enabled identification of various criminal activities in surveillance videos, but detecting fatal incidents such as shootings and stabbings remains difficult due to their rarity and ethical issues in data collection. Recognizing this limitation, we introduce GTA-Crime, a fatal video anomaly dataset and generation framework using Grand Theft Auto 5 (GTA5). Our dataset contains fatal situations such as shootings and stabbings, captured from CCTV multiview perspectives under diverse conditions including action types, weather, time of day, and viewpoints. To address the rarity of such scenarios, we also release a framework for generating these types of videos. Additionally, we propose a snippet-level domain adaptation strategy using Wasserstein adversarial training to bridge the gap between synthetic GTA-Crime features and real-world features like UCF-Crime. Experimental results validate our GTA-Crime dataset and demonstrate that incorporating GTA-Crime with our domain adaptation strategy consistently enhances real world fatal violence detection accuracy. Our dataset and the data generation framework are publicly available at https://github.com/ta-ho/GTA-Crime.

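[6] RepViT-CXR: A Channel Replication Strategy for Vision Transformers in Chest X-ray Tuberculosis and Pneumonia Classification cs.CV | cs.LG | F.2.2; I.2.7

Faisal Ahmed

TL;DR: RepViT-CXR proposes a channel replication strategy that adapts single-channel chest X-ray images to ViT architectures, significantly improving tuberculosis and pneumonia classification.

Details

Motivation: Chest X-ray (CXR) imaging is an important tool for detecting tuberculosis and pneumonia, but most ViT models are pretrained on three-channel natural images and cannot take single-channel CXR images directly, so an adaptation method is needed.

Result: On the TB-CXR dataset the method reaches 99.9% accuracy and 99.9% AUC, surpassing Topo-CXR; on the Pediatric Pneumonia dataset both recall and precision exceed 99%; on the Shenzhen TB dataset it also outperforms CNN-based methods.

Insight: A simple channel replication strategy can effectively adapt ViT models to single-channel medical imaging tasks, showing the strong potential of ViTs in medical image analysis.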

Abstract: Chest X-ray (CXR) imaging remains one of the most widely used diagnostic tools for detecting pulmonary diseases such as tuberculosis (TB) and pneumonia. Recent advances in deep learning, particularly Vision Transformers (ViTs), have shown strong potential for automated medical image analysis. However, most ViT architectures are pretrained on natural images and require three-channel inputs, while CXR scans are inherently grayscale. To address this gap, we propose RepViT-CXR, a channel replication strategy that adapts single-channel CXR images into a ViT-compatible format without introducing additional information loss. We evaluate RepViT-CXR on three benchmark datasets. On the TB-CXR dataset, our method achieved an accuracy of 99.9% and an AUC of 99.9%, surpassing prior state-of-the-art methods such as Topo-CXR (99.3% accuracy, 99.8% AUC). For the Pediatric Pneumonia dataset, RepViT-CXR obtained 99.0% accuracy, with 99.2% recall, 99.3% precision, and an AUC of 99.0%, outperforming strong baselines including DCNN and VGG16. On the Shenzhen TB dataset, our approach achieved 91.1% accuracy and an AUC of 91.2%, marking a performance improvement over previously reported CNN-based methods. These results demonstrate that a simple yet effective channel replication strategy allows ViTs to fully leverage their representational power on grayscale medical imaging tasks. RepViT-CXR establishes a new state of the art for TB and pneumonia detection from chest X-rays, showing strong potential for deployment in real-world clinical screening systems.
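
Channel replication itself is a one-line operation. The sketch below shows the general idea on a toy 2 x 2 image in plain Python; the function name and the channel-first layout are assumptions, and the abstract does not say how the paper implements this step or what preprocessing surrounds it.

```python
def replicate_channels(gray, channels=3):
    """Copy a single-channel H x W image into `channels` identical planes (C x H x W)."""
    return [[row[:] for row in gray] for _ in range(channels)]


# Toy grayscale "scan" with intensities in [0, 1].
cxr = [[0.0, 0.5], [1.0, 0.25]]
rgb_like = replicate_channels(cxr)
```

With NumPy arrays the same result comes from `np.repeat(gray[None], 3, axis=0)`. No pixel value is changed, which is consistent with the abstract's point that the adaptation introduces no additional information loss.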

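[7] Symmetry Interactive Transformer with CNN Framework for Diagnosis of Alzheimer’s Disease Using Structural MRI cs.CV

Zheng Yang, Yanteng Zhang, Xupeng Kou, Yang Liu, Chao Ren

TL;DR: The paper proposes a network that combines a 3D CNN encoder with a Symmetry Interactive Transformer (SIT) to diagnose Alzheimer's disease (AD) from sMRI. It focuses on asymmetric features between the left and right hemispheres and improves diagnostic accuracy.

Details

Motivation: Existing deep learning methods for sMRI-based AD diagnosis overlook the asymmetric features caused by brain disorders, so the authors propose a new network structure that captures and exploits this asymmetry to improve diagnosis.

Result: On the ADNI dataset the method reaches 92.5% diagnostic accuracy, better than other CNN methods and CNNs combined with a general transformer. Visualizations also show that the network attends to regions of brain atrophy, especially the asymmetric pathological features induced by AD.

Insight: The paper highlights the importance of left-right hemispheric asymmetry for AD diagnosis and proposes an effective way to capture and use it, offering a new idea for deep learning in medical image analysis.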

Abstract: Structural magnetic resonance imaging (sMRI) combined with deep learning has achieved remarkable progress in the prediction and diagnosis of Alzheimer’s disease (AD). Existing studies have used CNNs and transformers to build well-performing networks, but most of them rely on pretraining or ignore the asymmetric characteristics caused by brain disorders. We propose an end-to-end network for detecting the disease-induced asymmetry caused by left and right brain atrophy, which consists of a 3D CNN Encoder and a Symmetry Interactive Transformer (SIT). Following the inter-equal grid block fetch operation, the corresponding left and right hemisphere features are aligned and subsequently fed into the SIT for diagnostic analysis. SIT can help the model focus more on the regions of asymmetry caused by structural changes, thus improving diagnostic performance. We evaluated our method on the ADNI dataset, and the results show that the method achieves better diagnostic accuracy (92.5%) compared to several CNN methods and CNNs combined with a general transformer. The visualization results show that our network pays more attention to regions of brain atrophy, especially the asymmetric pathological characteristics induced by AD, demonstrating the interpretability and effectiveness of the method.

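[8] EVDI++: Event-based Video Deblurring and Interpolation via Self-Supervised Learning cs.CV

Chi Zhang, Xiang Zhang, Chenxu Jiang, Gui-Song Xia, Lei Yu

TL;DR: EVDI++ is a self-supervised learning framework that uses the high temporal resolution of event cameras to address frame blur and inter-frame interpolation. It relies on a Learnable Double Integral network and an adaptive fusion strategy and performs strongly on synthetic and real datasets.

Details

Motivation: Conventional frame-based cameras with long exposure times produce noticeable motion blur and lose information between frames; the high temporal resolution of event cameras offers a way to address this.

Result: On both synthetic and real-world datasets, EVDI++ achieves state-of-the-art performance in video deblurring and interpolation.

Insight: The high temporal resolution of event cameras can effectively address motion blur in frame-based cameras; a self-supervised learning framework can ease the shortage of labeled real-world data.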

Abstract: Frame-based cameras with extended exposure times often produce perceptible visual blurring and information loss between frames, significantly degrading video quality. To address this challenge, we introduce EVDI++, a unified self-supervised framework for Event-based Video Deblurring and Interpolation that leverages the high temporal resolution of event cameras to mitigate motion blur and enable intermediate frame prediction. Specifically, the Learnable Double Integral (LDI) network is designed to estimate the mapping relation between reference frames and sharp latent images. Then, we refine the coarse results and optimize overall training efficiency by introducing a learning-based division reconstruction module, enabling images to be converted with varying exposure intervals. We devise an adaptive parameter-free fusion strategy to obtain the final results, utilizing the confidence embedded in the LDI outputs of concurrent events. A self-supervised learning framework is proposed to enable network training with real-world blurry videos and events by exploring the mutual constraints among blurry frames, latent images, and event streams. We further construct a dataset with real-world blurry images and events using a DAVIS346c camera, demonstrating the generalizability of the proposed EVDI++ in real-world scenarios. Extensive experiments on both synthetic and real-world datasets show that our method achieves state-of-the-art performance in video deblurring and interpolation tasks.

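[9] Hyperspectral Mamba for Hyperspectral Object Tracking cs.CV

Long Gao, Yunhe Zhang, Yan Jiang, Weiying Xie, Yunsong Li

TL;DR: The paper proposes HyMamba, a new hyperspectral object tracking network that unifies spectral, cross-depth, and temporal modeling through state space modules. Using the Spectral State Integration module and the Hyperspectral Mamba module, it learns spatial and spectral information synchronously and achieves state-of-the-art performance on multiple benchmark datasets.

Details

Motivation: Hyperspectral object tracking is promising in complex scenes thanks to its rich spectral information, but existing methods struggle to capture intrinsic spectral information, temporal dependencies, and cross-depth interactions.

Result: The method performs strongly on seven benchmark datasets; for example, it reaches an AUC score of 73.0% and a DP@20 score of 96.3% on the HOTC2020 dataset.

Insight: Combining original spectral features with false-color input, together with cross-depth and temporal modeling, can significantly improve hyperspectral object tracking.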

Abstract: Hyperspectral object tracking holds great promise due to the rich spectral information and fine-grained material distinctions in hyperspectral images, which are beneficial in challenging scenarios. While existing hyperspectral trackers have made progress by either transforming hyperspectral data into false-color images or incorporating modality fusion strategies, they often fail to capture the intrinsic spectral information, temporal dependencies, and cross-depth interactions. To address these limitations, a new hyperspectral object tracking network equipped with Mamba (HyMamba), is proposed. It unifies spectral, cross-depth, and temporal modeling through state space modules (SSMs). The core of HyMamba lies in the Spectral State Integration (SSI) module, which enables progressive refinement and propagation of spectral features with cross-depth and temporal spectral information. Embedded within each SSI, the Hyperspectral Mamba (HSM) module is introduced to learn spatial and spectral information synchronously via three directional scanning SSMs. Based on SSI and HSM, HyMamba constructs joint features from false-color and hyperspectral inputs, and enhances them through interaction with original spectral features extracted from raw hyperspectral images. Extensive experiments conducted on seven benchmark datasets demonstrate that HyMamba achieves state-of-the-art performance. For instance, it achieves 73.0% of the AUC score and 96.3% of the DP@20 score on the HOTC2020 dataset. The code will be released at https://github.com/lgao001/HyMamba.

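[10] Examining Vision Language Models through Multi-dimensional Experiments with Vision and Text Features cs.CV

Saurav Sengupta, Nazanin Moradinasab, Jiebei Liu, Donald E. Brown

TL;DR: The paper studies the performance of vision language models (VLMs) through a multi-dimensional experimental framework. It finds that models are highly sensitive to input characteristics such as image size, number of objects, background color, and prompt specificity, and that small changes in these lead to significant differences in answer generation and performance.

Details

Motivation: Prior research suggests that VLMs rely on inherent biases learned during training when answering questions, and that they do poorly on specific questions that require focusing on image details. The paper aims to study systematically which input characteristics cause these performance differences.

Result: Even minor changes to the input (such as image characteristics or prompt specificity) lead to large changes in how VLMs answer and in their overall performance.

Insight: VLMs are highly sensitive to input characteristics; future improvements should reduce reliance on inherent biases and strengthen the models' ability to capture visual detail.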

Abstract: Recent research on Vision Language Models (VLMs) suggests that they rely on inherent biases learned during training to respond to questions about visual properties of an image. These biases are exacerbated when VLMs are asked highly specific questions that require focusing on specific areas of the image. For example, a VLM tasked with counting stars on a modified American flag (e.g., with more than 50 stars) will often disregard the visual evidence and fail to answer accurately. We build upon this research and develop a multi-dimensional examination framework to systematically determine which characteristics of the input data, including both the image and the accompanying prompt, lead to such differences in performance. Using open-source VLMs, we further examine how attention values fluctuate with varying input parameters (e.g., image size, number of objects in the image, background color, prompt specificity). This research aims to learn how the behavior of vision language models changes and to explore methods for characterizing such changes. Our results suggest, among other things, that even minor modifications in image characteristics and prompt specificity can lead to large changes in how a VLM formulates its answer and, subsequently, its overall performance.

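[11] Generalized Zero-Shot Learning for Point Cloud Segmentation with Evidence-Based Dynamic Calibration cs.CV

Hyeonseok Kim, Byeongkeun Kang, Yeejin Lee

TL;DR: The paper proposes E3DPC-GZSL, a new method that addresses biased predictions in generalized zero-shot learning for point cloud segmentation using an evidence-based uncertainty estimator and a dynamic calibration strategy. It achieves state-of-the-art performance on the ScanNet v2 and S3DIS datasets.

Details

Motivation: In generalized zero-shot semantic segmentation of 3D point clouds, models tend to favor the classes seen during training, and the problem is more severe in 3D tasks where training data is smaller in scale.

Result: The method achieves state-of-the-art performance on the ScanNet v2 and S3DIS datasets.

Insight: Evidence-based methods and semantic-space refinement can effectively reduce a model's bias toward seen classes and improve generalization to unseen classes.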

Abstract: Generalized zero-shot semantic segmentation of 3D point clouds aims to classify each point into both seen and unseen classes. A significant challenge with these models is their tendency to make biased predictions, often favoring the classes encountered during training. This problem is more pronounced in 3D applications, where the scale of the training data is typically smaller than in image-based tasks. To address this problem, we propose a novel method called E3DPC-GZSL, which reduces overconfident predictions towards seen classes without relying on separate classifiers for seen and unseen data. E3DPC-GZSL tackles the overconfidence problem by integrating an evidence-based uncertainty estimator into a classifier. This estimator is then used to adjust prediction probabilities using a dynamic calibrated stacking factor that accounts for pointwise prediction uncertainty. In addition, E3DPC-GZSL introduces a novel training strategy that improves uncertainty estimation by refining the semantic space. This is achieved by merging learnable parameters with text-derived features, thereby improving model optimization for unseen data. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance on generalized zero-shot semantic segmentation datasets, including ScanNet v2 and S3DIS.
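
The general idea of uncertainty-dependent calibrated stacking can be sketched as follows. This is not the paper's formulation: the Dirichlet-style uncertainty (alpha = evidence + 1, u = K / sum(alpha)) is the common evidential deep learning recipe, and scaling the seen-class penalty as `gamma * u` is a hypothetical rule chosen only to show how a per-point factor could shift predictions toward unseen classes when the classifier is unsure.

```python
def dirichlet_uncertainty(evidence):
    """Evidential estimate: alpha = evidence + 1, uncertainty u = K / sum(alpha)."""
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    probs = [a / strength for a in alpha]
    return probs, len(alpha) / strength


def calibrated_prediction(evidence, seen_mask, gamma=0.5):
    """Subtract an uncertainty-scaled factor from seen-class scores, then take argmax.

    `gamma` and the product `gamma * u` are hypothetical choices for illustration.
    """
    probs, u = dirichlet_uncertainty(evidence)
    factor = gamma * u
    adjusted = [p - factor if seen else p for p, seen in zip(probs, seen_mask)]
    label = max(range(len(adjusted)), key=adjusted.__getitem__)
    return label, u


seen_mask = [True, True, False]  # classes 0 and 1 were seen in training, class 2 was not

# A point with strong evidence for a seen class keeps its label.
label_conf, u_conf = calibrated_prediction([20.0, 1.0, 0.5], seen_mask)

# A point with weak evidence everywhere is pushed toward the unseen class.
label_unc, u_unc = calibrated_prediction([0.6, 0.1, 0.4], seen_mask)
uncalibrated = max(range(3), key=dirichlet_uncertainty([0.6, 0.1, 0.4])[0].__getitem__)
```

A fixed stacking factor would penalize both points equally; making it depend on per-point uncertainty leaves confident seen-class predictions largely untouched.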

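[12] Dual-Thresholding Heatmaps to Cluster Proposals for Weakly Supervised Object Detection cs.CV

Yuelin Guo, Haoyu He, Zhiyuan Chen, Zitong Huang, Renhao Lu

TL;DR: The paper proposes a weakly supervised object detection method based on dual-thresholding heatmaps. It improves proposal selection and the network architecture to address three main problems of existing methods.

Details

Motivation: Current weakly supervised object detection methods suffer from inadequate proposal selection, a missing background class, and slow convergence; the paper aims to fix these limitations.

Result: The method reaches mAP/mCorLoc scores of 58.5%/81.8% on PASCAL VOC 2007 and 55.6%/80.5% on PASCAL VOC 2012, outperforming existing methods.

Insight: Dual-thresholding heatmaps and the introduction of a background class significantly improve both the performance and the convergence speed of weakly supervised object detection.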

Abstract: Weakly supervised object detection (WSOD) has attracted significant attention in recent years, as it does not require box-level annotations. State-of-the-art methods generally adopt a multi-module network, which employs WSDDN as the multiple instance detection network module and multiple instance refinement modules to refine performance. However, these approaches suffer from three key limitations. First, existing methods tend to generate pseudo GT boxes that either focus only on discriminative parts, failing to capture the whole object, or cover the entire object but fail to distinguish between adjacent intra-class instances. Second, the foundational WSDDN architecture lacks a crucial background class representation for each proposal and exhibits a large semantic gap between its branches. Third, prior methods discard ignored proposals during optimization, leading to slow convergence. To address these challenges, we first design a heatmap-guided proposal selector (HGPS) algorithm, which utilizes dual thresholds on heatmaps to pre-select proposals, enabling pseudo GT boxes to both capture the full object extent and distinguish between adjacent intra-class instances. We then present a weakly supervised basic detection network (WSBDN), which augments each proposal with a background class representation and uses heatmaps for pre-supervision to bridge the semantic gap between matrices. At last, we introduce a negative certainty supervision loss on ignored proposals to accelerate convergence. Extensive experiments on the challenging PASCAL VOC 2007 and 2012 datasets demonstrate the effectiveness of our framework. We achieve mAP/mCorLoc scores of 58.5%/81.8% on VOC 2007 and 55.6%/80.5% on VOC 2012, performing favorably against the state-of-the-art WSOD methods. Our code is publicly available at https://github.com/gyl2565309278/DTH-CP.

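[13] An Open Benchmark Dataset for GeoAI Foundation Models for Oil Palm Mapping in Indonesia cs.CV

M. Warizmi Wafiq, Peter Cutter, Ate Poortinga, Daniel Marc G. dela Torre, Karis Tenneson

TL;DR: The paper presents an open geospatial benchmark dataset that supports sustainability monitoring and regulatory implementation for oil palm cultivation in Indonesia.

Details

Motivation: Oil palm cultivation is one of the leading causes of deforestation in Indonesia. The lack of high-quality training data limits the use of remote sensing and hinders sustainability monitoring and regulatory implementation.

Result: The dataset fills a gap in high-quality training data for remote sensing, supports transparent monitoring of oil palm expansion, and contributes to global deforestation-reduction goals.

Insight: Open, high-quality datasets advance GeoAI and remote sensing and provide an important tool for sustainability monitoring.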

Abstract: Oil palm cultivation remains one of the leading causes of deforestation in Indonesia. To better track and address this challenge, detailed and reliable mapping is needed to support sustainability efforts and emerging regulatory frameworks. We present an open-access geospatial dataset of oil palm plantations and related land cover types in Indonesia, produced through expert labeling of high-resolution satellite imagery from 2020 to 2024. The dataset provides polygon-based, wall-to-wall annotations across a range of agro-ecological zones and includes a hierarchical typology that distinguishes oil palm planting stages as well as similar perennial crops. Quality was ensured through multi-interpreter consensus and field validation. The dataset was created using wall-to-wall digitization over large grids, making it suitable for training and benchmarking both conventional convolutional neural networks and newer geospatial foundation models. Released under a CC-BY license, it fills a key gap in training data for remote sensing and aims to improve the accuracy of land cover types mapping. By supporting transparent monitoring of oil palm expansion, the resource contributes to global deforestation reduction goals and follows FAIR data principles.

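[14] SimCroP: Radiograph Representation Learning with Similarity-driven Cross-granularity Pre-training cs.CV

Rongsheng Wang, Fenghe Tang, Qingsong Yao, Rui Yan, Xu Zhang

TL;DR: The SimCroP framework uses similarity-driven cross-granularity pre-training to improve the alignment and fusion of CT images and reports. It improves representation learning for sparse lesions and outperforms existing methods on multiple downstream tasks.

Details

Motivation: Lesions in CT scans are sparsely distributed and structurally complex, and the multi-granularity relationships between images and reports are hard to capture, so an effective pre-training method is needed to improve representation learning.

Result: On image classification and segmentation tasks across five public datasets, SimCroP outperforms existing self-supervised and cross-modal pre-training methods.

Insight: Cross-granularity alignment and fusion can mitigate the sparsity problem in CT images, while multimodal masked modeling captures fine-grained semantic information.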

Abstract: Medical vision-language pre-training shows great potential in learning representative features from massive paired radiographs and reports. However, in computed tomography (CT) scans, the distribution of lesions which contain intricate structures is characterized by spatial sparsity. Besides, the complex and implicit relationships between different pathological descriptions in each sentence of the report and their corresponding sub-regions in radiographs pose additional challenges. In this paper, we propose a Similarity-Driven Cross-Granularity Pre-training (SimCroP) framework on chest CTs, which combines similarity-driven alignment and cross-granularity fusion to improve radiograph interpretation. We first leverage multi-modal masked modeling to optimize the encoder for understanding precise low-level semantics from radiographs. Then, similarity-driven alignment is designed to pre-train the encoder to adaptively select and align the correct patches corresponding to each sentence in reports. The cross-granularity fusion module integrates multimodal information across instance level and word-patch level, which helps the model better capture key pathology structures in sparse radiographs, resulting in improved performance for multi-scale downstream tasks. SimCroP is pre-trained on a large-scale paired CT-reports dataset and validated on image classification and segmentation tasks across five public datasets. Experimental results demonstrate that SimCroP outperforms both cutting-edge medical self-supervised learning methods and medical vision-language pre-training methods. Codes and models are available at https://github.com/ToniChopp/SimCroP.

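[15] Boosted Training of Lightweight Early Exits for Optimizing CNN Image Classification Inference cs.CV

Yehudit Aperstein, Alexander Apartsin

TL;DR: The paper proposes the Boosted Training Scheme for Early Exits (BTS-EE), which trains intermediate classifiers sequentially to address the covariance shift in conventional early-exit strategies. Combined with a lightweight branch architecture and a Class Precision Margin calibration method, it significantly improves CNN inference efficiency on resource-constrained platforms.

Details

Motivation: Real-time image classification on resource-constrained platforms must balance accuracy against computational cost. Conventional early-exit strategies suffer from a mismatch between training and inference data distributions (covariance shift), which limits the efficiency-accuracy trade-off.

Result: On the CINIC-10 dataset with ResNet18, BTS-EE outperforms non-boosted training across 64 configurations, reducing computation by up to 45% with only a 2% drop in accuracy.

Insight: BTS-EE not only improves inference efficiency but also offers new design options for deploying CNNs on resource-constrained platforms, with applications in industrial inspection, embedded vision, and UAV-based monitoring.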

Abstract: Real-time image classification on resource-constrained platforms demands inference methods that balance accuracy with strict latency and power budgets. Early-exit strategies address this need by attaching auxiliary classifiers to intermediate layers of convolutional neural networks (CNNs), allowing “easy” samples to terminate inference early. However, conventional training of early exits introduces a covariance shift: downstream branches are trained on full datasets, while at inference they process only the harder, non-exited samples. This mismatch limits efficiency–accuracy trade-offs in practice. We introduce the Boosted Training Scheme for Early Exits (BTS-EE), a sequential training approach that aligns branch training with inference-time data distributions. Each branch is trained and calibrated before the next, ensuring robustness under selective inference conditions. To further support embedded deployment, we propose a lightweight branch architecture based on 1D convolutions and a Class Precision Margin (CPM) calibration method that enables per-class threshold tuning for reliable exit decisions. Experiments on the CINIC-10 dataset with a ResNet18 backbone demonstrate that BTS-EE consistently outperforms non-boosted training across 64 configurations, achieving up to 45 percent reduction in computation with only 2 percent accuracy degradation. These results expand the design space for deploying CNNs in real-time image processing systems, offering practical efficiency gains for applications such as industrial inspection, embedded vision, and UAV-based monitoring.
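
The two mechanisms described here, per-class exit thresholds at inference and training each branch only on samples that earlier branches did not release, can be sketched as below. The threshold values and branch outputs are made-up numbers, the function names are hypothetical, and the CPM procedure that would actually choose the thresholds is not specified in the abstract.

```python
def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def exits_here(probs, class_thresholds):
    """True if the top class probability clears that class's own threshold."""
    label = argmax(probs)
    return probs[label] >= class_thresholds[label]


def early_exit_predict(branch_probs, thresholds):
    """Return (label, branch_index) for one sample; the final branch always answers."""
    for i, class_thresholds in enumerate(thresholds):
        if exits_here(branch_probs[i], class_thresholds):
            return argmax(branch_probs[i]), i
    return argmax(branch_probs[-1]), len(branch_probs) - 1


def samples_reaching(branch_index, all_branch_probs, thresholds):
    """Indices of samples that no earlier branch let exit (training set for this branch)."""
    return [n for n, bp in enumerate(all_branch_probs)
            if not any(exits_here(bp[i], thresholds[i]) for i in range(branch_index))]


# Per-class thresholds for two early branches on a 2-class problem (illustrative values).
thresholds = [[0.9, 0.8], [0.85, 0.7]]

# Class probabilities per sample at [branch 0, branch 1, final classifier].
all_branch_probs = [
    [[0.95, 0.05], [0.97, 0.03], [0.99, 0.01]],  # easy sample
    [[0.60, 0.40], [0.20, 0.80], [0.10, 0.90]],  # medium sample
    [[0.55, 0.45], [0.60, 0.40], [0.30, 0.70]],  # hard sample
]
results = [early_exit_predict(bp, thresholds) for bp in all_branch_probs]
train_set_branch1 = samples_reaching(1, all_branch_probs, thresholds)
train_set_final = samples_reaching(2, all_branch_probs, thresholds)
```

Here the easy sample leaves at branch 0, so branch 1 would be trained only on the other two, which mirrors the inference-time distribution it will actually see.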

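[16] Retrieval-Augmented VLMs for Multimodal Melanoma Diagnosis cs.CV | cs.AI | cs.LG

Jihyun Moon, Charmgil Hong

TL;DR: The paper proposes a retrieval-augmented vision-language model (VLM) framework for multimodal melanoma diagnosis that improves diagnostic accuracy by incorporating semantically similar patient cases.

Details

Motivation: Existing convolutional neural networks (CNNs) for dermoscopic image analysis neglect clinical metadata and need extensive preprocessing, while general-domain vision-language models (VLMs) struggle to capture clinical specificity.

Result: The method significantly improves classification accuracy and outperforms conventional baselines at error correction.

Insight: Retrieval-augmented prompting provides a robust strategy for clinical decision support, especially when domain-specific training data is lacking.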

Abstract: Accurate and early diagnosis of malignant melanoma is critical for improving patient outcomes. While convolutional neural networks (CNNs) have shown promise in dermoscopic image analysis, they often neglect clinical metadata and require extensive preprocessing. Vision-language models (VLMs) offer a multimodal alternative but struggle to capture clinical specificity when trained on general-domain data. To address this, we propose a retrieval-augmented VLM framework that incorporates semantically similar patient cases into the diagnostic prompt. Our method enables informed predictions without fine-tuning and significantly improves classification accuracy and error correction over conventional baselines. These results demonstrate that retrieval-augmented prompting provides a robust strategy for clinical decision support.
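
Retrieval-augmented prompting of this kind usually comes down to two steps: rank stored cases by embedding similarity, then paste the top matches into the prompt. The sketch below shows that pattern with toy data. The embeddings, metadata strings, labels, and prompt wording are all invented for illustration, and cosine similarity is an assumed choice of metric; the abstract does not describe the paper's retriever or prompt template.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_vec, cases, k=2):
    """Return the k stored cases most similar to the query embedding."""
    return sorted(cases, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)[:k]


def build_prompt(query_meta, neighbors):
    """Assemble a diagnostic prompt that includes the retrieved cases as context."""
    lines = ["Similar past cases:"]
    for c in neighbors:
        lines.append(f"- {c['metadata']} -> diagnosis: {c['label']}")
    lines.append(f"New case: {query_meta}")
    lines.append("Question: is the lesion melanoma or benign?")
    return "\n".join(lines)


# Toy case base with 2-D embeddings (hypothetical values).
cases = [
    {"id": "a", "embedding": [0.9, 0.1], "metadata": "age 61, back", "label": "melanoma"},
    {"id": "b", "embedding": [0.0, 1.0], "metadata": "age 25, arm", "label": "benign"},
    {"id": "c", "embedding": [0.8, 0.3], "metadata": "age 58, back", "label": "benign"},
]
neighbors = retrieve([1.0, 0.0], cases, k=2)
prompt = build_prompt("age 63, back", neighbors)
```

The resulting prompt would be sent to the VLM together with the image, so no model weights change, which matches the abstract's claim that the approach works without fine-tuning.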

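[17] InsFusion: Rethink Instance-level LiDAR-Camera Fusion for 3D Object Detection cs.CV

Zhongyu Xia, Hansong Yang, Yongtao Wang

TL;DR: InsFusion proposes a new LiDAR-camera fusion method that extracts proposals from both raw and fused features and uses attention mechanisms to reduce error accumulation in 3D object detection.

Details

Motivation: In 3D object detection from multi-view cameras and LiDAR, noise and error accumulate through feature extraction, perspective transformation, and feature fusion, which hurts detection performance.

Result: On the nuScenes dataset, InsFusion is compatible with several advanced baseline methods and achieves new state-of-the-art performance.

Insight: Querying the raw features directly and applying attention mechanisms can effectively reduce the error accumulated during feature fusion and improve 3D object detection accuracy.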

Abstract: Three-dimensional object detection from multi-view cameras and LiDAR is a crucial component for autonomous driving and smart transportation. However, in the process of basic feature extraction, perspective transformation, and feature fusion, noise and error will gradually accumulate. To address this issue, we propose InsFusion, which extracts proposals from both raw and fused features and uses these proposals to query the raw features, thereby mitigating the impact of accumulated errors. Additionally, it applies attention mechanisms to the raw features. Experiments on the nuScenes dataset demonstrate that InsFusion is compatible with various advanced baseline methods and delivers new state-of-the-art performance for 3D object detection.

[18] Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video cs.CVPDF

Xiao Li, Qi Chen, Xiulian Peng, Kai Yu, Xie Chen

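TL;DR: This paper proposes a new framework that uses self-supervised learning to disentangle video data into dynamic motion and static content, with low-bitrate vector quantization promoting the disentanglement.

Details

Motivation: Dynamic motion and static content in video are usually entangled, and traditional disentanglement methods rely on strong assumptions or inductive biases. The paper aims for a more general self-supervised framework that depends less on prior knowledge.

Result: The framework is validated on motion transfer and autoregressive motion generation tasks, and it generalizes to multiple types of video.

Insight: Controlling the bitrate promotes disentanglement more effectively, and self-supervised learning can still learn meaningful video representations when labeled data is lacking.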

Abstract: We propose a novel and general framework to disentangle video data into its dynamic motion and static content components. Our proposed method is a self-supervised pipeline with fewer assumptions and inductive biases than previous works: it utilizes a transformer-based architecture to jointly generate flexible implicit features for frame-wise motion and clip-wise content, and incorporates a low-bitrate vector quantization as an information bottleneck to promote disentanglement and form a meaningful discrete motion space. The bitrate-controlled latent motion and content are used as conditional inputs to a denoising diffusion model to facilitate self-supervised representation learning. We validate our disentangled representation learning framework on real-world talking head videos with motion transfer and auto-regressive motion generation tasks. Furthermore, we also show that our method can generalize to other types of video data, such as pixel sprites of 2D cartoon characters. Our work presents a new perspective on self-supervised learning of disentangled video representations, contributing to the broader field of video analysis and generation.
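
To make the "low-bitrate vector quantization as an information bottleneck" idea concrete, here is a minimal sketch of nearest-codeword quantization and the bitrate it implies. It is not the paper's implementation: the codebook values, the function names `quantize` and `bitrate_bits_per_second`, and the one-token-per-frame setting are all assumptions chosen for illustration.

```python
import math

def quantize(vec, codebook):
    """Return (index, codeword) of the nearest codebook entry (squared L2)."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, code)) for code in codebook]
    idx = dists.index(min(dists))
    return idx, codebook[idx]

def bitrate_bits_per_second(codebook_size, tokens_per_frame, fps):
    """Bitrate of uncompressed indices: log2(K) bits per quantized token."""
    return math.log2(codebook_size) * tokens_per_frame * fps

# Hypothetical 2-D codebook with K = 4 entries (2 bits per token).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
idx, code = quantize([0.9, 0.2], codebook)
rate = bitrate_bits_per_second(len(codebook), tokens_per_frame=1, fps=25)
```

Shrinking the codebook or the number of tokens per frame lowers the bitrate, which limits how much information the motion branch can carry.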

[19] Semantic Causality-Aware Vision-Based 3D Occupancy Prediction cs.CV | cs.AIPDF

Dubing Chen, Huan Zheng, Yucheng Zhou, Xianfei Li, Wenlong Liao

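TL;DR: This paper proposes a semantic causality-aware method for vision-based 3D occupancy prediction; a novel causal loss enables holistic, end-to-end supervision of the modular 2D-to-3D transformation pipeline.

Details

Motivation: Existing methods rely on modular pipelines whose modules are optimized independently or use pre-configured inputs, which leads to cascading errors. The paper addresses this by introducing a supervision mechanism based on semantic causality.

Result: The method achieves state-of-the-art performance on the Occ3D benchmark, with significantly improved robustness and 2D-to-3D semantic consistency.

Insight: Semantic causality offers a new supervision mechanism that can effectively address cascading errors in modular pipelines.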

Abstract: Vision-based 3D semantic occupancy prediction is a critical task in 3D vision that integrates volumetric 3D reconstruction with semantic understanding. Existing methods, however, often rely on modular pipelines. These modules are typically optimized independently or use pre-configured inputs, leading to cascading errors. In this paper, we address this limitation by designing a novel causal loss that enables holistic, end-to-end supervision of the modular 2D-to-3D transformation pipeline. Grounded in the principle of 2D-to-3D semantic causality, this loss regulates the gradient flow from 3D voxel representations back to the 2D features. Consequently, it renders the entire pipeline differentiable, unifying the learning process and making previously non-trainable components fully learnable. Building on this principle, we propose the Semantic Causality-Aware 2D-to-3D Transformation, which comprises three components guided by our causal loss: Channel-Grouped Lifting for adaptive semantic mapping, Learnable Camera Offsets for enhanced robustness against camera perturbations, and Normalized Convolution for effective feature propagation. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the Occ3D benchmark, demonstrating significant robustness to camera perturbations and improved 2D-to-3D semantic consistency.
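
The abstract names Normalized Convolution as the feature propagation component but does not spell it out. The sketch below shows the classical one-dimensional form, in which a confidence-weighted convolution is divided by the convolution of the confidences so that empty cells are filled from valid neighbors. How the paper adapts this to voxel features is not stated here; the function name and the toy data are assumptions.

```python
def normalized_conv1d(signal, confidence, kernel):
    """Classical normalized convolution: conv(signal * conf) / conv(conf)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        num = den = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                num += w * confidence[j] * signal[j]
                den += w * confidence[j]
        # Positions with no confident neighbor stay at 0.0.
        out.append(num / den if den > 0 else 0.0)
    return out

# Every second sample is missing (confidence 0) and gets interpolated.
signal = [2.0, 0.0, 4.0, 0.0, 6.0]
confidence = [1.0, 0.0, 1.0, 0.0, 1.0]
filled = normalized_conv1d(signal, confidence, [1.0, 1.0, 1.0])
```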

[20] VRAE: Vertical Residual Autoencoder for License Plate Denoising and Deblurring cs.CVPDF

Cuong Nguyen, Dung T. Tran, Hong Nguyen, Xuan-Vu Phan, Nam-Phong Nguyen

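TL;DR: The paper proposes a Vertical Residual Autoencoder (VRAE) for denoising and deblurring license plate images in traffic surveillance, with significant performance gains.

Details

Motivation: Under adverse weather, low light, or high-speed motion, license plate images in traffic surveillance often suffer from noise and blur. Existing methods fall short in preserving information, so a more efficient solution is needed.

Result: Compared with conventional autoencoders (AE), generative adversarial networks (GAN), and flow-based (FB) methods, VRAE shows significant improvements in PSNR, NMSE, and SSIM with only a small increase in parameters.

Insight: An input-aware feature injection mechanism can effectively improve information preservation in autoencoders and suits small-object image restoration tasks.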

Abstract: In real-world traffic surveillance, vehicle images captured under adverse weather, poor lighting, or high-speed motion often suffer from severe noise and blur. Such degradations significantly reduce the accuracy of license plate recognition systems, especially when the plate occupies only a small region within the full vehicle image. Restoring these degraded images in a fast, real-time manner is thus a crucial pre-processing step to enhance recognition performance. In this work, we propose a Vertical Residual Autoencoder (VRAE) architecture designed for the image enhancement task in traffic surveillance. The method incorporates an enhancement strategy that employs an auxiliary block, which injects input-aware features at each encoding stage to guide the representation learning process, enabling better general information preservation throughout the network compared to conventional autoencoders. Experiments on a vehicle image dataset with visible license plates demonstrate that our method consistently outperforms Autoencoder (AE), Generative Adversarial Network (GAN), and Flow-Based (FB) approaches. Compared with AE at the same depth, it improves PSNR by about 20%, reduces NMSE by around 50%, and enhances SSIM by 1%, while requiring only a marginal increase of roughly 1% in parameters.
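
For readers unfamiliar with the reported metrics, the following sketch computes PSNR and NMSE on flattened pixel lists. The exact NMSE normalization used in the paper is not given in the abstract, so the definition below (squared error divided by reference energy) is an assumption, as are the function names and the 8-bit peak value.

```python
import math

def mse(ref, est):
    """Mean squared error between two equally long pixel lists."""
    return sum((r - e) ** 2 for r, e in zip(ref, est)) / len(ref)

def psnr(ref, est, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    err = mse(ref, est)
    return math.inf if err == 0 else 10.0 * math.log10(max_val ** 2 / err)

def nmse(ref, est):
    """Squared error normalized by reference energy; lower is better."""
    num = sum((r - e) ** 2 for r, e in zip(ref, est))
    return num / sum(r ** 2 for r in ref)

ref = [100.0, 200.0]   # toy "clean" pixels
est = [110.0, 190.0]   # toy "restored" pixels
```

Because PSNR is logarithmic, a 20% relative gain in dB corresponds to a much larger reduction in mean squared error.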

[21] Sparse BEV Fusion with Self-View Consistency for Multi-View Detection and Tracking cs.CV | cs.AIPDF

Keisuke Toida, Taigo Sakai, Naoki Kato, Kazutoyo Yokota, Takeshi Nakamura

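TL;DR: The paper proposes the SCFusion framework, which improves multi-view feature fusion through sparse transformation, density-aware weighting, and a multi-view consistency loss, boosting multi-view object detection and tracking performance.

Details

Motivation: In multi-view multi-object tracking (MVMOT), viewpoint changes, lighting differences, and occlusions often cause inconsistent object identities. Existing methods improve robustness through BEV projection but suffer from feature distortion and non-uniform density.

Result: SCFusion reaches an IDF1 of 95.9% on WildTrack and a MODP of 89.2% on MultiviewX, outperforming the baseline method TrackTacular.

Insight: Through sparsification and consistency constraints, SCFusion effectively mitigates the limitations of BEV projection and provides a more robust solution for multi-view tracking.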

Abstract: Multi-View Multi-Object Tracking (MVMOT) is essential for applications such as surveillance, autonomous driving, and sports analytics. However, maintaining consistent object identities across multiple cameras remains challenging due to viewpoint changes, lighting variations, and occlusions, which often lead to tracking errors. Recent methods project features from multiple cameras into a unified Bird’s-Eye-View (BEV) space to improve robustness against occlusion. However, this projection introduces feature distortion and non-uniform density caused by variations in object scale with distance. These issues degrade the quality of the fused representation and reduce detection and tracking accuracy. To address these problems, we propose SCFusion, a framework that combines three techniques to improve multi-view feature integration. First, it applies a sparse transformation to avoid unnatural interpolation during projection. Next, it performs density-aware weighting to adaptively fuse features based on spatial confidence and camera distance. Finally, it introduces a multi-view consistency loss that encourages each camera to learn discriminative features independently before fusion. Experiments show that SCFusion achieves state-of-the-art performance, reaching an IDF1 score of 95.9% on WildTrack and a MODP of 89.2% on MultiviewX, outperforming the baseline method TrackTacular. These results demonstrate that SCFusion effectively mitigates the limitations of conventional BEV projection and provides a robust and accurate solution for multi-view object detection and tracking.

[22] LD-ViCE: Latent Diffusion Model for Video Counterfactual Explanations cs.CV | cs.LGPDF

Payal Varshney, Adriano Lucieri, Christoph Balada, Sheraz Ahmed, Andreas Dengel

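TL;DR: LD-ViCE proposes a latent diffusion-based framework for video counterfactual explanations that addresses the limited explainability of video AI systems; it lowers computational cost, improves semantic fidelity, and outperforms existing methods on three datasets.

Details

Motivation: The wide adoption of video AI systems in safety-critical domains such as autonomous driving and healthcare demands better explainability, yet current explanation methods fall short in temporal coherence, robustness, and causal insight.

Result: LD-ViCE outperforms existing methods on the EchoNet-Dynamic, FERV39k, and Something-Something V2 datasets while halving inference time.

Insight: Combining latent-space operation with a refinement step is key to improving the quality and efficiency of video counterfactual explanations.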

Abstract: Video-based AI systems are increasingly adopted in safety-critical domains such as autonomous driving and healthcare. However, interpreting their decisions remains challenging due to the inherent spatiotemporal complexity of video data and the opacity of deep learning models. Existing explanation techniques often suffer from limited temporal coherence, insufficient robustness, and a lack of actionable causal insights. Current counterfactual explanation methods typically do not incorporate guidance from the target model, reducing semantic fidelity and practical utility. We introduce Latent Diffusion for Video Counterfactual Explanations (LD-ViCE), a novel framework designed to explain the behavior of video-based AI models. Compared to previous approaches, LD-ViCE reduces the computational costs of generating explanations by operating in latent space using a state-of-the-art diffusion model, while producing realistic and interpretable counterfactuals through an additional refinement step. Our experiments demonstrate the effectiveness of LD-ViCE across three diverse video datasets, including EchoNet-Dynamic (cardiac ultrasound), FERV39k (facial expression), and Something-Something V2 (action recognition). LD-ViCE outperforms a recent state-of-the-art method, achieving an increase in R2 score of up to 68% while reducing inference time by half. Qualitative analysis confirms that LD-ViCE generates semantically meaningful and temporally coherent explanations, offering valuable insights into the target model behavior. LD-ViCE represents a valuable step toward the trustworthy deployment of AI in safety-critical domains.

[23] Beyond Distribution Shifts: Adaptive Hyperspectral Image Classification at Test Time cs.CVPDF

Xia Yue, Anfeng Liu, Ning Chen, Chenjia Huang, Hui Liu

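TL;DR: The paper proposes the HyperTTA framework to make hyperspectral image (HSI) classification models more robust under diverse degradation conditions. It builds a multi-degradation dataset, designs a spectral-spatial transformer classifier (SSTC), and introduces a lightweight test-time adaptation strategy (CELA), achieving dynamic adaptation without source data or target annotations.

Details

Motivation: HSI classification models are highly sensitive to real-world degradations such as noise and blur, and existing methods struggle with diverse distribution shifts.

Result: Experiments on two benchmark datasets show that HyperTTA outperforms existing baselines across a range of degradation scenarios.

Insight: A lightweight test-time adaptation strategy can adapt efficiently to dynamic degradation conditions without relying on source data or target annotations.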

Abstract: Hyperspectral image (HSI) classification models are highly sensitive to distribution shifts caused by various real-world degradations such as noise, blur, compression, and atmospheric effects. To address this challenge, we propose HyperTTA, a unified framework designed to enhance model robustness under diverse degradation conditions. Specifically, we first construct a multi-degradation hyperspectral dataset that systematically simulates nine representative types of degradations, providing a comprehensive benchmark for robust classification evaluation. Based on this, we design a spectral-spatial transformer classifier (SSTC) enhanced with a multi-level receptive field mechanism and label smoothing regularization to jointly capture multi-scale spatial context and improve generalization. Furthermore, HyperTTA incorporates a lightweight test-time adaptation (TTA) strategy, the confidence-aware entropy-minimized LayerNorm adapter (CELA), which updates only the affine parameters of LayerNorm layers by minimizing prediction entropy on high-confidence unlabeled target samples. This confidence-aware adaptation prevents unreliable updates from noisy predictions, enabling robust and dynamic adaptation without access to source data or target annotations. Extensive experiments on two benchmark datasets demonstrate that HyperTTA outperforms existing baselines across a wide range of degradation scenarios, validating the effectiveness of both its classification backbone and the proposed TTA scheme. Code will be made available publicly.
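
The confidence-aware part of CELA can be illustrated with a small sketch that keeps only high-confidence predictions and averages their entropy. This covers the loss computation only; in the described method the gradient of such a loss would update just the LayerNorm affine parameters, which is omitted here. The threshold of 0.9 and all function names are assumptions, not values from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def confident_entropy_loss(batch_logits, threshold=0.9):
    """Mean entropy over samples whose top probability is >= threshold.

    Returns (loss, number_of_selected_samples); loss is 0.0 if none qualify.
    """
    selected = [p for p in map(softmax, batch_logits) if max(p) >= threshold]
    if not selected:
        return 0.0, 0
    return sum(entropy(p) for p in selected) / len(selected), len(selected)

# First sample is confident, second is nearly uniform and gets filtered out.
loss, n_selected = confident_entropy_loss([[5.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
```

Filtering before averaging is what keeps noisy, low-confidence predictions from driving the adaptation step.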

[24] Spherical Brownian Bridge Diffusion Models for Conditional Cortical Thickness Forecasting cs.CV | cs.AI | cs.LG | q-bio.NCPDF

Ivan Stoyanov, Fabian Bongratz, Christian Wachinger

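TL;DR: The paper proposes a new method, the Spherical Brownian Bridge Diffusion Model (SBDM), for forecasting individualized cortical thickness (CTh) trajectories, addressing the challenges of non-Euclidean geometry and multi-modal data integration.

Details

Motivation: Accurately forecasting high-resolution cortical thickness (CTh) trajectories is essential for detecting neurodegenerative changes and enabling early intervention, but the task is very challenging because of the cortex's complex geometry and the need to integrate multi-modal data.

Result: On the ADNI and OASIS datasets, SBDM significantly outperforms previous methods and can generate individual factual and counterfactual CTh trajectories.

Insight: SBDM provides a new framework for exploring hypothetical scenarios of cortical development, demonstrating its potential for neuroscience research.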

Abstract: Accurate forecasting of individualized, high-resolution cortical thickness (CTh) trajectories is essential for detecting subtle cortical changes, providing invaluable insights into neurodegenerative processes and facilitating earlier and more precise intervention strategies. However, CTh forecasting is a challenging task due to the intricate non-Euclidean geometry of the cerebral cortex and the need to integrate multi-modal data for subject-specific predictions. To address these challenges, we introduce the Spherical Brownian Bridge Diffusion Model (SBDM). Specifically, we propose a bidirectional conditional Brownian bridge diffusion process to forecast CTh trajectories at the vertex level of registered cortical surfaces. Our technical contribution includes a new denoising model, the conditional spherical U-Net (CoS-UNet), which combines spherical convolutions and dense cross-attention to integrate cortical surfaces and tabular conditions seamlessly. Compared to previous approaches, SBDM achieves significantly reduced prediction errors, as demonstrated by our experiments based on longitudinal datasets from the ADNI and OASIS. Additionally, we demonstrate SBDM’s ability to generate individual factual and counterfactual CTh trajectories, offering a novel framework for exploring hypothetical scenarios of cortical development.
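
As background for the Brownian bridge process mentioned above, this sketch gives the textbook marginal of a scalar Brownian bridge pinned at both ends: the mean interpolates linearly between the endpoints and the variance vanishes at both of them. It treats a single vertex value and ignores the bidirectional conditioning and the spherical network from the paper; `sigma`, the time scale `T`, and the function names are assumptions.

```python
import math
import random

def bridge_marginal(x0, xT, t, T, sigma=1.0):
    """Mean and std of a Brownian bridge pinned at x0 (t=0) and xT (t=T)."""
    mean = (1 - t / T) * x0 + (t / T) * xT
    std = sigma * math.sqrt(t * (T - t) / T)
    return mean, std

def sample_bridge(x0, xT, t, T, sigma=1.0, rng=random):
    """Draw one sample of the bridge state at time t."""
    mean, std = bridge_marginal(x0, xT, t, T, sigma)
    return mean + std * rng.gauss(0.0, 1.0)

# Hypothetical thickness values (mm) at a baseline and a follow-up visit.
mid_mean, mid_std = bridge_marginal(2.0, 3.0, t=5, T=10)
```

Because the noise is zero at both endpoints, the process connects an observed baseline to a target state instead of to pure Gaussian noise, which is the main difference from a standard diffusion forward process.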

[25] First-order State Space Model for Lightweight Image Super-resolution cs.CVPDF

Yujie Zhu, Xinyi Zhang, Yekai Lu, Guang Yang, Faming Fang

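TL;DR: The paper proposes an improved state space model, the First-order State Space Model (FSSM), for lightweight image super-resolution; by introducing a first-order hold condition and modifying the computation of the SSM module, it improves performance without increasing the parameter count.

Details

Motivation: State space models (SSMs) perform well in NLP tasks but have been applied less to vision tasks. The authors explore the potential of SSMs for lightweight image super-resolution, focusing on improving the SSM module itself.

Result: Experiments show that FSSM improves the performance of MambaIR on five benchmark datasets and surpasses current lightweight SR methods, achieving state-of-the-art results.

Insight: Improvements to the SSM module still hold potential for vision tasks; in particular, finer-grained discretization and error analysis can improve performance.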

Abstract: State space models (SSMs), particularly Mamba, have shown promise in NLP tasks and are increasingly applied to vision tasks. However, most Mamba-based vision models focus on network architecture and scan paths, with little attention to the SSM module. In order to explore the potential of SSMs, we modify the calculation process of SSM without increasing the number of parameters to improve the performance on lightweight super-resolution tasks. In this paper, we introduce the First-order State Space Model (FSSM) to improve the original Mamba module, enhancing performance by incorporating token correlations. We apply a first-order hold condition in SSMs, derive the new discretized form, and analyze the cumulative error. Extensive experimental results demonstrate that FSSM improves the performance of MambaIR on five benchmark datasets without additionally increasing the number of parameters, and surpasses current lightweight SR methods, achieving state-of-the-art results.
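
The difference between zero-order and first-order hold can be seen on a scalar linear system x'(t) = a·x(t) + b·u(t). The sketch below uses the textbook discretizations, not the paper's derived form, which may differ in detail; the system parameters, step size, and ramp input are assumptions. Under first-order hold the update depends on both the current and the next input, which is one way neighboring tokens become coupled.

```python
import math

def zoh_step(x, u_k, a, b, dt):
    """Zero-order hold: input held constant at u_k over the step (a != 0)."""
    ea = math.exp(a * dt)
    return ea * x + (ea - 1) / a * b * u_k

def foh_step(x, u_k, u_next, a, b, dt):
    """First-order hold: input interpolated linearly from u_k to u_next."""
    ea = math.exp(a * dt)
    g0 = (ea - 1) / a                      # integral of exp(a * s) over the step
    g1 = (ea - 1 - a * dt) / (a * a * dt)  # weight carried by the next input
    return ea * x + b * ((g0 - g1) * u_k + g1 * u_next)

a, b, dt, n = -1.0, 1.0, 0.5, 4
x_zoh = x_foh = 0.0
for k in range(n):
    u_k, u_next = k * dt, (k + 1) * dt     # ramp input u(t) = t
    x_zoh = zoh_step(x_zoh, u_k, a, b, dt)
    x_foh = foh_step(x_foh, u_k, u_next, a, b, dt)

# Closed-form solution of x' = -x + t with x(0) = 0, evaluated at t = n * dt.
exact = n * dt - 1 + math.exp(-n * dt)
```

For this piecewise-linear input the first-order hold update matches the closed-form solution up to rounding, while zero-order hold lags behind it.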

[26] Maximally Useful and Minimally Redundant: The Key to Self Supervised Learning for Imbalanced Data cs.CVPDF

Yash Kumar Sharma, Vineet Nair, Wilson Naik

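TL;DR: The paper proposes a self-supervised learning method based on multi-view mutual information to address feature learning on imbalanced datasets, and achieves significant performance gains.

Details

Motivation: Contrastive self-supervised learning (CSSL) works well on balanced datasets, but its robustness on imbalanced datasets has not been studied enough. Inspired by Yann LeCun's multi-view framework, the paper explores how mutual information theory can improve self-supervised learning on imbalanced datasets.

Result: The method achieves significant gains on several imbalanced datasets: 2% on Cifar10-LT (ResNet-18), 5% on Cifar100-LT (ResNet-18), and 3% on Imagenet-LT (1k, ResNet-50), setting a new SOTA.

Insight: Multi-view mutual information theory shows promise for imbalanced datasets and can effectively extract features of tail classes. The loss function filters out extreme features, further improving robustness and generalization.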

Abstract: The robustness of contrastive self-supervised learning (CSSL) for imbalanced datasets is largely unexplored. CSSL usually makes use of multi-view assumptions to learn discriminatory features via similar and dissimilar data samples. CSSL works well on balanced datasets, but does not generalize well for imbalanced datasets. In a very recent paper, as part of future work, Yann LeCun pointed out that the self-supervised multiview framework can be extended to cases involving more than two views. Taking a cue from this insight we propose a theoretical justification based on the concept of mutual information to support the more than two views objective and apply it to the problem of dataset imbalance in self-supervised learning. The proposed method helps extract representative characteristics of the tail classes by segregating between intra and inter discriminatory characteristics. We introduce a loss function that helps us to learn better representations by filtering out extreme features. Experimental evaluation on a variety of self-supervised frameworks (both contrastive and non-contrastive) also shows that the more than two view objective works well for imbalanced datasets. We achieve a new state-of-the-art accuracy in self-supervised imbalanced dataset classification (2% improvement in Cifar10-LT using Resnet-18, 5% improvement in Cifar100-LT using Resnet-18, 3% improvement in Imagenet-LT (1k) using Resnet-50).

[27] Prompt-Driven Image Analysis with Multimodal Generative AI: Detection, Segmentation, Inpainting, and Interpretation cs.CV | cs.AIPDF

Kaleem Ahmad

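TL;DR: The paper presents a prompt-driven image analysis pipeline built on multimodal generative AI that combines open-vocabulary detection, promptable segmentation, text-conditioned inpainting, and vision-language description, supporting transparent debugging and efficient operation.

Details

Motivation: By integrating multimodal AI and vision models, the work aims to simplify the complex image analysis workflow and improve transparency and reliability for tasks such as object replacement, scene augmentation, and removal.

Result: For single-word prompts, detection and segmentation reach an accuracy above 85%; inpainting accounts for 60 to 75% of total runtime and requires careful tuning.

Insight: Transparent debugging and multimodal integration are key, and operational practices such as version pinning and seed control are essential for reliability and consistency.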

Abstract: Prompt-driven image analysis converts a single natural-language instruction into multiple steps: locate, segment, edit, and describe. We present a practical case study of a unified pipeline that combines open-vocabulary detection, promptable segmentation, text-conditioned inpainting, and vision-language description into a single workflow. The system works end to end from a single prompt, retains intermediate artifacts for transparent debugging (such as detections, masks, overlays, edited images, and before and after composites), and provides the same functionality through an interactive UI and a scriptable CLI for consistent, repeatable runs. We highlight integration choices that reduce brittleness, including threshold adjustments, mask inspection with light morphology, and resource-aware defaults. In a small, single-word prompt segment, detection and segmentation produced usable masks in over 90% of cases with an accuracy above 85% based on our criteria. On a high-end GPU, inpainting makes up 60 to 75% of total runtime under typical guidance and sampling settings, which highlights the need for careful tuning. The study offers implementation-guided advice on thresholds, mask tightness, and diffusion parameters, and details version pinning, artifact logging, and seed control to support replay. Our contribution is a transparent, reliable pattern for assembling modern vision and multimodal models behind a single prompt, with clear guardrails and operational practices that improve reliability in object replacement, scene augmentation, and removal.

[28] A Structured Review of Underwater Object Detection Challenges and Solutions: From Traditional to Large Vision Language Models cs.CV | cs.AIPDF

Edwine Nabahirwa, Wei Song, Minghua Zhang, Yi Fang, Zhou Ni

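TL;DR: This review systematically analyzes the challenges of and solutions for underwater object detection (UOD), covering progress from traditional methods to large vision-language models (LVLMs), and proposes future research directions.

Details

Motivation: UOD is vital to marine applications, but because underwater environments are complex, existing methods cannot fully address its challenges.

Result: 1. Existing methods cannot fully cope with the dynamics of underwater environments and image degradation; 2. synthetic data generation shows potential but needs refinement; 3. LVLMs are promising for UOD, but real-time application needs further research.

Insight: 1. LVLMs may be key to solving the complex challenges of UOD; 2. generating and refining synthetic data is a priority for future research; 3. practical deployment of LVLMs needs further optimization.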

Abstract: Underwater object detection (UOD) is vital to diverse marine applications, including oceanographic research, underwater robotics, and marine conservation. However, UOD faces numerous challenges that compromise its performance. Over the years, various methods have been proposed to address these issues, but they often fail to fully capture the complexities of underwater environments. This review systematically categorizes UOD challenges into five key areas: Image quality degradation, target-related issues, data-related challenges, computational and processing constraints, and limitations in detection methodologies. To address these challenges, we analyze the progression from traditional image processing and object detection techniques to modern approaches. Additionally, we explore the potential of large vision-language models (LVLMs) in UOD, leveraging their multi-modal capabilities demonstrated in other domains. We also present case studies, including synthetic dataset generation using DALL-E 3 and fine-tuning Florence-2 LVLM for UOD. This review identifies three key insights: (i) Current UOD methods are insufficient to fully address challenges like image degradation and small object detection in dynamic underwater environments. (ii) Synthetic data generation using LVLMs shows potential for augmenting datasets but requires further refinement to ensure realism and applicability. (iii) LVLMs hold significant promise for UOD, but their real-time application remains under-explored, requiring further research on optimization techniques.

[29] Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening cs.CVPDF

Piyush Bagad, Andrew Zisserman

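TL;DR: The paper proposes a time-sensitive video representation learning method; it introduces the chiral action recognition task (distinguishing temporally opposite actions) and a self-supervised adaptation recipe to build a compact, time-sensitive video embedding model.

Details

Motivation: Existing video embedding models perform poorly at distinguishing temporally opposite actions (such as "opening vs. closing a door"), which occur frequently in everyday life and require understanding visual change over time.

Result: The method performs strongly on multiple datasets (Something-Something, EPIC-Kitchens, Charade), surpasses video models pre-trained at large scale, and improves the classification performance of existing models.

Insight: Time-sensitive feature learning captures simple visual changes in video more effectively, improving performance on time-related tasks.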

Abstract: Our objective is to develop compact video representations that are sensitive to visual change over time. To measure such time-sensitivity, we introduce a new task: chiral action recognition, where one needs to distinguish between a pair of temporally opposite actions, such as “opening vs. closing a door”, “approaching vs. moving away from something”, “folding vs. unfolding paper”, etc. Such actions (i) occur frequently in everyday life, (ii) require understanding of simple visual change over time (in object state, size, spatial position, count, etc.), and (iii) are known to be poorly represented by many video embeddings. Our goal is to build time-aware video representations which offer linear separability between these chiral pairs. To that end, we propose a self-supervised adaptation recipe to inject time-sensitivity into a sequence of frozen image features. Our model is based on an auto-encoder with a latent space with inductive bias inspired by perceptual straightening. We show that this results in a compact but time-sensitive video representation for the proposed task across three datasets: Something-Something, EPIC-Kitchens, and Charade. Our method (i) outperforms much larger video models pre-trained on large-scale video datasets, and (ii) leads to an improvement in classification performance on standard benchmarks when combined with these existing models.
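
Perceptual straightening is usually quantified by the curvature of a trajectory in representation space: the average angle between successive displacement vectors, with lower values meaning a straighter path. The sketch below computes that quantity for a list of per-frame latent vectors. How the paper turns this idea into an inductive bias is not described in the abstract, so this is only the measurement, with an assumed function name, and it assumes consecutive points are distinct.

```python
import math

def mean_curvature(traj):
    """Average angle (radians) between successive displacement vectors."""
    diffs = [[b - a for a, b in zip(p, q)] for p, q in zip(traj, traj[1:])]
    angles = []
    for d1, d2 in zip(diffs, diffs[1:]):
        dot = sum(x * y for x, y in zip(d1, d2))
        norm = math.hypot(*d1) * math.hypot(*d2)
        # Clamp to [-1, 1] to guard acos against rounding error.
        angles.append(math.acos(max(-1.0, min(1.0, dot / norm))))
    return sum(angles) / len(angles)

straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
zigzag = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]]
```

A straight latent path makes temporal direction easy to read off linearly, which fits the goal of separating an action from its time-reversed counterpart.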

Liyang Chen, Tianxiang Ma, Jiawei Liu, Bingchuan Li, Zhuowei Chen

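TL;DR: HuMo proposes a unified human-centric video generation framework that addresses the coordination of multimodal inputs through two-stage training and task-specific strategies, and performs strongly in experiments.

Details

Motivation: Existing methods struggle to coordinate multimodal inputs and lack high-quality training data and effective mechanisms for task collaboration.

Result: HuMo outperforms existing methods on the sub-tasks and achieves unified, collaborative control over multimodal inputs.

Insight: 1) High-quality data is essential for multimodal tasks; 2) staged training and task-specific strategies improve model performance; 3) a dynamically adjusted guidance strategy increases flexibility.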

Abstract: Human-Centric Video Generation (HCVG) methods seek to synthesize human videos from multimodal inputs, including text, image, and audio. Existing methods struggle to effectively coordinate these heterogeneous modalities due to two challenges: the scarcity of training data with paired triplet conditions and the difficulty of collaborating the sub-tasks of subject preservation and audio-visual sync with multimodal inputs. In this work, we present HuMo, a unified HCVG framework for collaborative multimodal control. For the first challenge, we construct a high-quality dataset with diverse and paired text, reference images, and audio. For the second challenge, we propose a two-stage progressive multimodal training paradigm with task-specific strategies. For the subject preservation task, to maintain the prompt following and visual generation abilities of the foundation model, we adopt the minimal-invasive image injection strategy. For the audio-visual sync task, besides the commonly adopted audio cross-attention layer, we propose a focus-by-predicting strategy that implicitly guides the model to associate audio with facial regions. For joint learning of controllabilities across multimodal inputs, building on previously acquired capabilities, we progressively incorporate the audio-visual sync task. During inference, for flexible and fine-grained multimodal control, we design a time-adaptive Classifier-Free Guidance strategy that dynamically adjusts guidance weights across denoising steps. Extensive experimental results demonstrate that HuMo surpasses specialized state-of-the-art methods in sub-tasks, establishing a unified framework for collaborative multimodal-conditioned HCVG. Project Page: https://phantom-video.github.io/HuMo.
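
The time-adaptive Classifier-Free Guidance idea can be sketched as guidance weights that depend on the denoising step. The abstract does not give the schedule, so the linear interpolation, the start and end weights, and the two-condition combination rule below are all hypothetical choices for illustration, not HuMo's actual formulation.

```python
def guidance_weights(step, num_steps, w_text=(7.5, 3.0), w_audio=(1.0, 5.0)):
    """Linearly interpolate each weight from its start to its end value."""
    frac = step / max(num_steps - 1, 1)
    w_t = w_text[0] + frac * (w_text[1] - w_text[0])
    w_a = w_audio[0] + frac * (w_audio[1] - w_audio[0])
    return w_t, w_a

def guided_noise(eps_uncond, eps_text, eps_audio, w_text, w_audio):
    """Two-condition classifier-free guidance applied elementwise."""
    return [u + w_text * (t - u) + w_audio * (a - u)
            for u, t, a in zip(eps_uncond, eps_text, eps_audio)]

# Hypothetical schedule: text dominates early steps, audio dominates late ones.
first = guidance_weights(0, 50)
last = guidance_weights(49, 50)
mixed = guided_noise([0.0], [1.0], [2.0], w_text=2.0, w_audio=0.5)
```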

[31] MESH – Understanding Videos Like Human: Measuring Hallucinations in Large Video Models cs.CV | cs.AIPDF

Garry Yang, Zizhe Chen, Man Hon Wong, Haoyu Lei, Yongqiang Chen

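TL;DR: The paper proposes the MESH benchmark, which uses a question-answering framework to systematically evaluate hallucinations in Large Video Models (LVMs) and reveals their limitations with fine details and multi-action alignment.

Details

Motivation: Existing benchmarks rely on manual categorization of video content and ignore the natural process by which humans perceive videos. MESH aims to fill this gap.

Result: LVMs recognize basic objects and features well but are prone to hallucination when handling fine details or aligning multiple actions.

Insight: MESH offers a comprehensive evaluation approach for LVMs that is closer to human understanding and reveals their limitations in complex scenes.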

Abstract: Large Video Models (LVMs) build on the semantic capabilities of Large Language Models (LLMs) and vision modules by integrating temporal information to better understand dynamic video content. Despite their progress, LVMs are prone to hallucinations, producing inaccurate or irrelevant descriptions. Current benchmarks for video hallucination depend heavily on manual categorization of video content, neglecting the perception-based processes through which humans naturally interpret videos. We introduce MESH, a benchmark designed to evaluate hallucinations in LVMs systematically. MESH uses a Question-Answering framework with binary and multi-choice formats incorporating target and trap instances. It follows a bottom-up approach, evaluating basic objects, coarse-to-fine subject features, and subject-action pairs, aligning with human video understanding. We demonstrate that MESH offers an effective and comprehensive approach for identifying hallucinations in videos. Our evaluations show that while LVMs excel at recognizing basic objects and features, their susceptibility to hallucinations increases markedly when handling fine details or aligning multiple actions involving various subjects in longer videos.

[32] Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles cs.CV | cs.CLPDF

Eric Slyman, Mehrab Tanjim, Kushal Kafle, Stefan Lee

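TL;DR: The paper proposes MMB, a multimodal Bayesian prompt ensemble method for calibrating multimodal large language models (MLLMs) used as judges of text-to-image generation systems, addressing their bias, overconfidence, and inconsistent performance.

Details

Motivation: MLLM judges suffer from bias, overconfidence, and inconsistent cross-domain performance when evaluating text-to-image generation systems. Existing prompt ensembling methods work well on unimodal text tasks but poorly on multimodal tasks.

Result: On two text-to-image benchmarks (HPSv2 and MJBench), MMB outperforms existing baselines in alignment with human annotations and in calibration.

Insight: Multimodal-specific calibration strategies are essential for reliable judging, and MMB offers a viable path toward large-scale text-to-image evaluation.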

Abstract: Multimodal large language models (MLLMs) are increasingly used to evaluate text-to-image (TTI) generation systems, providing automated judgments based on visual and textual context. However, these “judge” models often suffer from biases, overconfidence, and inconsistent performance across diverse image domains. While prompt ensembling has shown promise for mitigating these issues in unimodal, text-only settings, our experiments reveal that standard ensembling methods fail to generalize effectively for TTI tasks. To address these limitations, we propose a new multimodal-aware method called Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt ensemble approach augmented by image clustering, allowing the judge to dynamically assign prompt weights based on the visual characteristics of each sample. We show that MMB improves accuracy in pairwise preference judgments and greatly enhances calibration, making it easier to gauge the judge’s true uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB outperforms existing baselines in alignment with human annotations and calibration across varied image content. Our findings highlight the importance of multimodal-specific strategies for judge calibration and suggest a promising path forward for reliable large-scale TTI evaluation.

[33] Vision-Language Semantic Aggregation Leveraging Foundation Model for Generalizable Medical Image Segmentation cs.CVPDF

Wenjun Yu, Yinchen Zhou, Jia-Xuan Jiang, Shubin Zeng, Yuee Li

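TL;DR: This paper proposes a vision-language model based on semantic aggregation; an EM Aggregation mechanism and a Text-Guided Pixel Decoder address the semantic gap and feature dispersion in multimodal fusion for medical image segmentation, significantly improving generalization.

Details

Motivation: Multimodal models excel at natural image segmentation but perform poorly in the medical domain because of the semantic gap between abstract textual prompts and fine-grained medical visual features, together with feature dispersion.

Result: Experiments on public cardiac and fundus datasets show that the method consistently outperforms existing SOTA methods on multiple domain generalization benchmarks.

Insight: Semantic aggregation is an effective way to address multimodal fusion problems in medical image segmentation, and text-guided visual representation learning can significantly improve generalization.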

Abstract: Multimodal models have achieved remarkable success in natural image segmentation, yet they often underperform when applied to the medical domain. Through extensive study, we attribute this performance gap to the challenges of multimodal fusion, primarily the significant semantic gap between abstract textual prompts and fine-grained medical visual features, as well as the resulting feature dispersion. To address these issues, we revisit the problem from the perspective of semantic aggregation. Specifically, we propose an Expectation-Maximization (EM) Aggregation mechanism and a Text-Guided Pixel Decoder. The former mitigates feature dispersion by dynamically clustering features into compact semantic centers to enhance cross-modal correspondence. The latter is designed to bridge the semantic gap by leveraging domain-invariant textual knowledge to effectively guide deep visual representations. The synergy between these two mechanisms significantly improves the model’s generalization ability. Extensive experiments on public cardiac and fundus datasets demonstrate that our method consistently outperforms existing SOTA approaches across multiple domain generalization benchmarks.

[34] Improving Greenland Bed Topography Mapping with Uncertainty-Aware Graph Learning on Sparse Radar Data cs.CVPDF

Bayu Adhi Tama, Homayra Alam, Mostafa Cham, Omar Faruque, Jianwu Wang

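TL;DR: The authors propose GraphTopoNet, a graph-learning framework that fuses heterogeneous supervision and models uncertainty with Monte Carlo dropout to improve the accuracy of Greenland bed topography mapping. The method combines spatial graphs with dynamically balanced regularization and significantly reduces error.

Details

Motivation: Accurate mapping of Greenland's bed topography is essential for sea-level projections, but radar data is sparse and unevenly distributed, and existing methods struggle to capture the topographic features.

Result: In three Greenland subregions, GraphTopoNet reduces error by up to 60% relative to baseline methods while preserving fine-scale glacial features.

Insight: Graph machine learning can convert sparse, uncertain geophysical observations into actionable knowledge at continental scale, supporting climate forecasting and decision-making.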

Abstract: Accurate maps of Greenland’s subglacial bed are essential for sea-level projections, but radar observations are sparse and uneven. We introduce GraphTopoNet, a graph-learning framework that fuses heterogeneous supervision and explicitly models uncertainty via Monte Carlo dropout. Spatial graphs built from surface observables (elevation, velocity, mass balance) are augmented with gradient features and polynomial trends to capture both local variability and broad structure. To handle data gaps, we employ a hybrid loss that combines confidence-weighted radar supervision with dynamically balanced regularization. Applied to three Greenland subregions, GraphTopoNet outperforms interpolation, convolutional, and graph-based baselines, reducing error by up to 60 percent while preserving fine-scale glacial features. The resulting bed maps improve reliability for operational modeling, supporting agencies engaged in climate forecasting and policy. More broadly, GraphTopoNet shows how graph machine learning can convert sparse, uncertain geophysical observations into actionable knowledge at continental scale.
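
Monte Carlo dropout, which the abstract names as the uncertainty mechanism, keeps dropout active at prediction time and summarizes many stochastic passes by their mean and spread. The sketch below applies it to a toy linear model rather than a graph network; the dropout rate, number of passes, weights, and function names are assumptions.

```python
import random
import statistics

def dropout_forward(features, weights, p, rng):
    """One stochastic pass of a linear model with inverted dropout on inputs."""
    kept = [x / (1 - p) if rng.random() >= p else 0.0 for x in features]
    return sum(w * x for w, x in zip(weights, kept))

def mc_dropout_predict(features, weights, p=0.2, passes=200, seed=0):
    """Predictive mean and standard deviation over repeated stochastic passes."""
    rng = random.Random(seed)
    samples = [dropout_forward(features, weights, p, rng) for _ in range(passes)]
    return statistics.mean(samples), statistics.stdev(samples)

# Hypothetical node features (e.g. elevation, velocity, mass balance) and weights.
features, weights = [1.0, 2.0, 3.0], [0.5, -0.2, 0.1]
mean, std = mc_dropout_predict(features, weights)
```

The standard deviation serves as a per-location uncertainty estimate, which could then feed a confidence-weighted loss of the kind the abstract describes.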

[35] EfficientIML: Efficient High-Resolution Image Manipulation Localization cs.CVPDF

Jinhan Li, Haoyang He, Lei Xie, Jiangning Zhang

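TL;DR: The paper proposes EfficientIML, an efficient method for high-resolution image manipulation localization; a new dataset and the lightweight EfficientRWKV network address the computational resource limits of existing methods.

Details

Motivation: As high-resolution images and diffusion-based forgery methods become widespread, traditional manipulation detectors cannot handle the new forgery types and have high computational complexity.

Result: The approach outperforms ViT-based and other lightweight baselines on the proposed dataset and standard benchmarks in localization performance, FLOPs, and inference speed.

Insight: A lightweight hybrid network is efficient and practical for high-resolution image processing and suits real-time forensic applications.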

Abstract: With imaging devices delivering ever-higher resolutions and the emerging diffusion-based forgery methods, current detectors trained only on traditional datasets (with splicing, copy-moving and object removal forgeries) lack exposure to this new manipulation type. To address this, we propose a novel high-resolution SIF dataset of 1200+ diffusion-generated manipulations with semantically extracted masks. However, this also imposes a challenge on existing methods, as they face significant computational resource constraints due to their prohibitive computational complexities. Therefore, we propose a novel EfficientIML model with a lightweight, three-stage EfficientRWKV backbone. EfficientRWKV’s hybrid state-space and attention network captures global context and local details in parallel, while a multi-scale supervision strategy enforces consistency across hierarchical predictions. Extensive evaluations on our dataset and standard benchmarks demonstrate that our approach outperforms ViT-based and other SOTA lightweight baselines in localization performance, FLOPs and inference speed, underscoring its suitability for real-time forensic applications.

Zhihao Zhao, Yinzheng Zhao, Junjie Yang, Xiangtong Yao, Quanmin Liang

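TL;DR: CLAPS is a unified auto-prompt segmentation method built on CLIP and SAM for multi-modal retinal images; it addresses modality ambiguity, reliance on manual prompts, and the lack of a unified framework in current methods.

Details

Motivation: Current retinal image segmentation methods face modality ambiguity, dependence on manual prompting, and the absence of a unified framework; CLAPS aims to resolve these through automation and multi-modal unification.

Result: Experiments on 12 datasets and 11 segmentation tasks show that CLAPS performs on par with specialized expert models and surpasses existing benchmarks, demonstrating broad generalizability.

Insight: By automating prompts and unifying modalities, CLAPS offers an efficient, general-purpose solution for medical image segmentation and has the potential to serve as a foundation model.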

Abstract: Recent advancements in foundation models, such as the Segment Anything Model (SAM), have significantly impacted medical image segmentation, especially in retinal imaging, where precise segmentation is vital for diagnosis. Despite this progress, current methods face critical challenges: 1) modality ambiguity in textual disease descriptions, 2) a continued reliance on manual prompting for SAM-based workflows, and 3) a lack of a unified framework, with most methods being modality- and task-specific. To overcome these hurdles, we propose CLIP-unified Auto-Prompt Segmentation (CLAPS), a novel method for unified segmentation across diverse tasks and modalities in retinal imaging. Our approach begins by pre-training a CLIP-based image encoder on a large, multi-modal retinal dataset to handle data scarcity and distribution imbalance. We then leverage GroundingDINO to automatically generate spatial bounding box prompts by detecting local lesions. To unify tasks and resolve ambiguity, we use text prompts enhanced with a unique “modality signature” for each imaging modality. Ultimately, these automated textual and spatial prompts guide SAM to execute precise segmentation, creating a fully automated and unified pipeline. Extensive experiments on 12 diverse datasets across 11 critical segmentation categories show that CLAPS achieves performance on par with specialized expert models while surpassing existing benchmarks across most metrics, demonstrating its broad generalizability as a foundation model.

[37] AdsQA: Towards Advertisement Video Understanding cs.CVPDF

Xinwei Long, Kai Tian, Peng Xu, Guoli Jia, Jingxuan Li

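TL;DR: The paper introduces AdsQA, the first question-answering benchmark built on advertisement videos, for evaluating how well large language models understand ad videos. It also proposes ReAd-R, a model that generates answers via reward-driven optimization and performs strongly on the benchmark.

Details

Motivation: To exploit the clue-rich, information-dense traits of ad videos (such as marketing logic, persuasive strategies, and audience engagement) in order to test whether LLMs can understand content beyond that of common visual domains.

Result: ReAd-R achieves strong results on the AdsQA benchmark, significantly outperforming other models equipped with long-chain reasoning capabilities.

Insight: Ad videos offer a challenging platform for testing the multi-dimensional understanding of LLMs, and RL-based reward-driven optimization effectively improves model performance.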

Abstract: Large language models (LLMs) have taken a great step towards AGI. Meanwhile, an increasing number of domain-specific problems such as math and programming push these general-purpose models to evolve continuously by learning deeper expertise. Now is thus the time to further extend the diversity of specialized applications for knowledgeable LLMs, though collecting high-quality data with unexpected and informative tasks is challenging. In this paper, we propose to use advertisement (ad) videos as a challenging test-bed to probe the ability of LLMs to perceive beyond the objective physical content of common visual domains. Our motivation is to take full advantage of the clue-rich and information-dense traits of ad videos, e.g., marketing logic, persuasive strategies, and audience engagement. Our contribution is three-fold: (1) To our knowledge, this is the first attempt to use ad videos with well-designed tasks to evaluate LLMs. We contribute AdsQA, a challenging ad Video QA benchmark derived from 1,544 ad videos with 10,962 clips, totaling 22.7 hours, providing 5 challenging tasks. (2) We propose ReAd-R, a DeepSeek-R1-styled RL model that reflects on questions and generates answers via reward-driven optimization. (3) We benchmark 14 top-tier LLMs on AdsQA, and our ReAd-R achieves state-of-the-art performance, outperforming strong competitors equipped with long-chain reasoning capabilities by a clear margin.

[38] UOPSL: Unpaired OCT Predilection Sites Learning for Fundus Image Diagnosis Augmentation cs.CV | cs.AI | I.4.10PDF

Zhihao Zhao, Yinzheng Zhao, Junjie Yang, Xiangtong Yao, Quanmin Liang

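TL;DR: The paper proposes UOPSL, a novel multimodal framework that learns lesion predilection sites from unpaired OCT and fundus images to strengthen disease diagnosis based on fundus images alone.

Details

Motivation: Pairing multimodal ophthalmic images is costly, fundus and OCT data are imbalanced across modalities, and conventional methods struggle to capture fine-grained spatial information.

Result: Across 28 critical categories on 9 datasets, UOPSL significantly outperforms existing benchmark methods.

Insight: Using spatial priors from OCT to enhance fundus-image diagnosis addresses the modality imbalance problem and offers a new approach to multimodal medical image analysis.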

Abstract: Significant advancements in AI-driven multimodal medical image diagnosis have led to substantial improvements in ophthalmic disease identification in recent years. However, acquiring paired multimodal ophthalmic images remains prohibitively expensive. While fundus photography is simple and cost-effective, the limited availability of OCT data and inherent modality imbalance hinder further progress. Conventional approaches that rely solely on fundus or textual features often fail to capture fine-grained spatial information, as each imaging modality provides distinct cues about lesion predilection sites. In this study, we propose a novel unpaired multimodal framework, UOPSL, that utilizes extensive OCT-derived spatial priors to dynamically identify predilection sites, enhancing fundus image-based disease recognition. Our approach bridges unpaired fundus and OCT images via extended disease text descriptions. Initially, we employ contrastive learning on a large corpus of unpaired OCT and fundus images while simultaneously learning the predilection sites matrix in the OCT latent space. Through extensive optimization, this matrix captures lesion localization patterns within the OCT feature space. During the fine-tuning or inference phase of the downstream classification task based solely on fundus images, where paired OCT data is unavailable, we eliminate OCT input and utilize the predilection sites matrix to assist in fundus image classification learning. Extensive experiments conducted on 9 diverse datasets across 28 critical categories demonstrate that our framework outperforms existing benchmarks.

[39] LADB: Latent Aligned Diffusion Bridges for Semi-Supervised Domain Translation cs.CVPDF

Xuqin Wang, Tao Wu, Yanfeng Zhang, Lu Liu, Dong Wang

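TL;DR: LADB is a semi-supervised domain translation framework that uses partially paired data to align source and target distributions in a shared latent space, combining a pretrained diffusion model with a target-domain Latent Aligned Diffusion Model for efficient domain translation.

Details

Motivation: Diffusion models perform poorly in data-scarce domains and often need large amounts of paired data. LADB aims to solve this with partially paired data, improving the efficiency and controllability of domain translation.

Result: Experiments show that LADB performs strongly on depth-to-image translation under partial supervision and extends to multi-source and multi-target translation tasks.

Insight: LADB achieves efficient, controllable domain translation from partially paired data and is particularly suited to scenarios where annotation is costly or incomplete.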

Abstract: Diffusion models excel at generating high-quality outputs but face challenges in data-scarce domains, where exhaustive retraining or costly paired data are often required. To address these limitations, we propose Latent Aligned Diffusion Bridges (LADB), a semi-supervised framework for sample-to-sample translation that effectively bridges domain gaps using partially paired data. By aligning source and target distributions within a shared latent space, LADB seamlessly integrates pretrained source-domain diffusion models with a target-domain Latent Aligned Diffusion Model (LADM), trained on partially paired latent representations. This approach enables deterministic domain mapping without the need for full supervision. Compared to unpaired methods, which often lack controllability, and fully paired approaches that require large, domain-specific datasets, LADB strikes a balance between fidelity and diversity by leveraging a mixture of paired and unpaired latent-target couplings. Our experimental results demonstrate superior performance in depth-to-image translation under partial supervision. Furthermore, we extend LADB to handle multi-source translation (from depth maps and segmentation masks) and multi-target translation in a class-conditioned style transfer task, showcasing its versatility in handling diverse and heterogeneous use cases. Ultimately, we present LADB as a scalable and versatile solution for real-world domain translation, particularly in scenarios where data annotation is costly or incomplete.

[40] FractalPINN-Flow: A Fractal-Inspired Network for Unsupervised Optical Flow Estimation with Total Variation Regularization cs.CVPDF

Sara Behnamian, Rasoul Khaksarinezhad, Andreas Langer

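TL;DR: FractalPINN-Flow is an unsupervised optical flow estimation framework inspired by fractal geometry; using a Fractal Deformation Network (FDN) and total variation (TV) regularization, it achieves accurate optical flow estimation on high-resolution data.

Details

Motivation: Traditional optical flow methods rely on annotated data and struggle with high resolutions and large motion ranges; FractalPINN-Flow aims to overcome these limits through unsupervised learning and a fractal structure.

Result: Experiments show that FractalPINN-Flow performs well on high-resolution data, preserves edges well, and suits scenarios with limited annotations.

Insight: The recursive fractal design captures the multi-scale features of optical flow effectively, and TV regularization markedly improves the smoothness and consistency of the flow field in the unsupervised setting.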

Abstract: We present FractalPINN-Flow, an unsupervised deep learning framework for dense optical flow estimation that learns directly from consecutive grayscale frames without requiring ground truth. The architecture centers on the Fractal Deformation Network (FDN) - a recursive encoder-decoder inspired by fractal geometry and self-similarity. Unlike traditional CNNs with sequential downsampling, FDN uses repeated encoder-decoder nesting with skip connections to capture both fine-grained details and long-range motion patterns. The training objective is based on a classical variational formulation using total variation (TV) regularization. Specifically, we minimize an energy functional that combines $L^1$ and $L^2$ data fidelity terms to enforce brightness constancy, along with a TV term that promotes spatial smoothness and coherent flow fields. Experiments on synthetic and benchmark datasets show that FractalPINN-Flow produces accurate, smooth, and edge-preserving optical flow fields. The model is especially effective for high-resolution data and scenarios with limited annotations.
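The variational objective described in this abstract can be written out as a rough sketch. The symbols below are assumptions introduced for illustration: $I_1, I_2$ are the two consecutive grayscale frames, $\mathbf{u}$ is the flow field on the image domain $\Omega$, and $\lambda_1, \lambda_2, \alpha$ are hypothetical weights. The abstract only states that $L^1$ and $L^2$ data fidelity terms enforcing brightness constancy are combined with a TV term, so the exact form used by the authors may differ.

```latex
% Sketch of a TV-regularized brightness-constancy energy (assumed form).
\begin{aligned}
E(\mathbf{u}) &= \lambda_1 \int_\Omega \bigl| I_2(\mathbf{x} + \mathbf{u}(\mathbf{x})) - I_1(\mathbf{x}) \bigr| \, d\mathbf{x}
  + \lambda_2 \int_\Omega \bigl| I_2(\mathbf{x} + \mathbf{u}(\mathbf{x})) - I_1(\mathbf{x}) \bigr|^2 \, d\mathbf{x}
  + \alpha \, \mathrm{TV}(\mathbf{u}), \\
\mathrm{TV}(\mathbf{u}) &= \int_\Omega \bigl| \nabla \mathbf{u}(\mathbf{x}) \bigr| \, d\mathbf{x}.
\end{aligned}
```

In this reading, the first two terms penalize violations of brightness constancy, while the TV term favors piecewise-smooth flow and tolerates sharp motion boundaries, which is consistent with the edge-preserving behavior the abstract reports.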

Zhen Tian, Christos Anagnostopoulos, Qiyuan Wang, Zhiwei Gao

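TL;DR: The paper proposes a systematic robust enhancement framework (Robust U-Net) based on HSV color space supervision and multi-modal constraints to improve the accuracy and stability of coastal water segmentation.

Details

Motivation: Coastal water segmentation in satellite imagery is challenged by complex spectral characteristics and irregular boundary patterns; traditional RGB-based methods are unstable and generalize poorly across diverse maritime environments.

Result: Experiments show that HSV supervision has the largest impact (influence score 0.85), and the complete framework significantly improves training stability (84% variance reduction) and segmentation quality.

Insight: HSV color space supervision is more effective under complex illumination and spectral variation, and multi-modal constraints markedly improve the robustness and generalization of segmentation.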

Abstract: Coastal water segmentation from satellite imagery presents unique challenges due to complex spectral characteristics and irregular boundary patterns. Traditional RGB-based approaches often suffer from training instability and poor generalization in diverse maritime environments. This paper introduces a systematic robust enhancement framework, referred to as Robust U-Net, that leverages HSV color space supervision and multi-modal constraints for improved coastal water segmentation. Our approach integrates five synergistic components: HSV-guided color supervision, gradient-based coastline optimization, morphological post-processing, sea area cleanup, and connectivity control. Through comprehensive ablation studies, we demonstrate that HSV supervision provides the highest impact (0.85 influence score), while the complete framework achieves superior training stability (84% variance reduction) and enhanced segmentation quality. Our method shows consistent improvements across multiple evaluation metrics while maintaining computational efficiency. For reproducibility, our training configurations and code are available here: https://github.com/UofgCoastline/ICASSP-2026-Robust-Unet.

[42] Computational Imaging for Enhanced Computer Vision cs.CVPDF

Humera Shaikh, Kaur Jashanpreet

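TL;DR: This paper surveys computational imaging (CI) techniques and their transformative impact on computer vision (CV) applications, examining how CI improves image acquisition and reconstruction to handle challenging scenes such as low light and motion blur, and thereby improves CV task performance.

Details

Motivation: Conventional imaging methods struggle to capture high-quality visual data in challenging conditions (such as low light and motion blur), which limits the performance of computer vision systems. Computational imaging offers a new way to address these problems by improving image acquisition and reconstruction.

Result: The study highlights the potential of CI techniques to improve the robustness and accuracy of CV tasks, especially in real-world applications such as autonomous driving, surveillance, AR, and robotics.

Insight: Future research directions include developing adaptive imaging pipelines to further extend CI techniques to complex scenes.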

Abstract: This paper presents a comprehensive survey of computational imaging (CI) techniques and their transformative impact on computer vision (CV) applications. Conventional imaging methods often fail to deliver high-fidelity visual data in challenging conditions, such as low light, motion blur, or high dynamic range scenes, thereby limiting the performance of state-of-the-art CV systems. Computational imaging techniques, including light field imaging, high dynamic range (HDR) imaging, deblurring, high-speed imaging, and glare mitigation, address these limitations by enhancing image acquisition and reconstruction processes. This survey systematically explores the synergies between CI techniques and core CV tasks, including object detection, depth estimation, optical flow, face recognition, and keypoint detection. By analyzing the relationships between CI methods and their practical contributions to CV applications, this work highlights emerging opportunities, challenges, and future research directions. We emphasize the potential for task-specific, adaptive imaging pipelines that improve robustness, accuracy, and efficiency in real-world scenarios, such as autonomous navigation, surveillance, augmented reality, and robotics.

Sike Xiang, Shuang Chen, Amir Atapour-Abarghouei

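TL;DR: The paper proposes BcQLM, a lightweight multimodal large language model framework for end-to-end visual question answering, built around the compact BreezeCLIP vision-language encoder and totaling 1.2 billion parameters.

Details

Motivation: The large-scale architectures of multimodal large language models make them hard to deploy in resource-constrained environments, so lightweight, high-performance models are needed for real-world applications.

Result: With only 1.2 billion parameters overall, the model reduces computational cost while achieving performance comparable to standard-size MLLMs across multiple datasets.

Insight: The modular and extensible design enables generalisation to broader multimodal tasks and offers a path toward deployable MLLMs under practical hardware constraints.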

Abstract: As multimodal large language models (MLLMs) advance, their large-scale architectures pose challenges for deployment in resource-constrained environments. In the age of large models, where energy efficiency, computational scalability and environmental sustainability are paramount, the development of lightweight and high-performance models is critical for real-world applications. As such, we propose a lightweight MLLM framework for end-to-end visual question answering. Our proposed approach centres on BreezeCLIP, a compact yet powerful vision-language encoder optimised for efficient multimodal understanding. With only 1.2 billion parameters overall, our model significantly reduces computational cost while achieving performance comparable to standard-size MLLMs. Experiments conducted on multiple datasets further validate its effectiveness in balancing accuracy and efficiency. The modular and extensible design enables generalisation to broader multimodal tasks. The proposed lightweight vision-language framework is denoted as BcQLM (BreezeCLIP-enhanced Q-Gated Multimodal Language Model). It offers a promising path toward deployable MLLMs under practical hardware constraints. The source code is available at https://github.com/thico0224/BcQLM.

[44] ArgoTweak: Towards Self-Updating HD Maps through Structured Priors cs.CVPDF

Lena Wild, Rafael Valencia, Patric Jensfelt

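TL;DR: ArgoTweak introduces a new dataset that fills the gap left by the absence of realistic map priors in existing HD map research, along with a bijective mapping framework for accurately detecting and integrating map changes.

Details

Motivation: Existing HD map research lacks a public dataset containing realistic map priors, current maps, and sensor data, so current methods rely on synthetic priors, which cause inconsistencies and a significant sim2real gap.

Result: Experiments show that ArgoTweak significantly narrows the sim2real gap, and ablation studies confirm the contribution of structured priors and detailed change annotations.

Insight: Structured priors and fine-grained atomic changes are key to self-updating HD maps, and the results underline the importance of realistic datasets for model performance.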

Abstract: Reliable integration of prior information is crucial for self-verifying and self-updating HD maps. However, no public dataset includes the required triplet of prior maps, current maps, and sensor data. As a result, existing methods must rely on synthetic priors, which create inconsistencies and lead to a significant sim2real gap. To address this, we introduce ArgoTweak, the first dataset to complete the triplet with realistic map priors. At its core, ArgoTweak employs a bijective mapping framework, breaking down large-scale modifications into fine-grained atomic changes at the map element level, thus ensuring interpretability. This paradigm shift enables accurate change detection and integration while preserving unchanged elements with high fidelity. Experiments show that training models on ArgoTweak significantly reduces the sim2real gap compared to synthetic priors. Extensive ablations further highlight the impact of structured priors and detailed change annotations. By establishing a benchmark for explainable, prior-aided HD mapping, ArgoTweak advances scalable, self-improving mapping solutions. The dataset, baselines, map modification toolbox, and further resources are available at https://kth-rpl.github.io/ArgoTweak/.

[45] Quantifying Accuracy of an Event-Based Star Tracker via Earth’s Rotation cs.CVPDF

Dennis Melamed, Connor Hashemi, Scott McCloskey

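TL;DR: Using the Earth's rotation as ground truth, the paper quantifies the accuracy of an event-based camera (EBC) star tracking system and demonstrates its practicality for low-cost, low-latency star tracking.

Details

Motivation: Event cameras show promise for star tracking, but ground truth for real data has been lacking. The regular motion of the Earth's rotation provides a new way to quantify accuracy.

Result: The event camera system achieves a root mean squared across error of 18.47 arcseconds, demonstrating its potential for star tracking.

Insight: With sparse data streams, high dynamic range, and low energy consumption, event cameras suit low-cost, low-latency star tracking applications.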

Abstract: Event-based cameras (EBCs) are a promising new technology for star tracking-based attitude determination, but prior studies have struggled to determine accurate ground truth for real data. We analyze the accuracy of an EBC star tracking system utilizing the Earth’s motion as the ground truth for comparison. The Earth rotates in a regular way with very small irregularities which are measured to the level of milli-arcseconds. By keeping an event camera static and pointing it through a ground-based telescope at the night sky, we create a system where the only camera motion in the celestial reference frame is that induced by the Earth’s rotation. The resulting event stream is processed to generate estimates of orientation which we compare to the orientation of the Earth as measured by the International Earth Rotation and Reference Systems Service (IERS). The event camera system is able to achieve a root mean squared across error of 18.47 arcseconds and an about error of 78.84 arcseconds. Combined with the other benefits of event cameras over framing sensors (reduced computation due to sparser data streams, higher dynamic range, lower energy consumption, faster update rates), this level of accuracy suggests the utility of event cameras for low-cost and low-latency star tracking. We provide all code and data used to generate our results: https://gitlab.kitware.com/nest-public/telescope_accuracy_quantification.
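To put the reported errors in context, the short calculation below converts an angular error into the equivalent duration of Earth rotation. It is only an illustrative sketch: the sidereal day length is a standard approximate constant rather than a value from the paper, and the function names are hypothetical. It describes the rotation angle of the Earth itself, not the apparent drift of any particular star in the camera frame.

```python
# Illustrative arithmetic only; constants are standard approximations,
# not values taken from the paper.
SIDEREAL_DAY_S = 86164.0905      # approximate length of one sidereal day, in seconds
ARCSEC_PER_TURN = 360 * 3600     # arcseconds in one full rotation


def earth_rotation_rate_arcsec_per_s() -> float:
    """Mean rotation rate of the Earth relative to the stars."""
    return ARCSEC_PER_TURN / SIDEREAL_DAY_S


def seconds_of_rotation(error_arcsec: float) -> float:
    """Time the Earth needs to rotate through the given angle."""
    return error_arcsec / earth_rotation_rate_arcsec_per_s()


rate = earth_rotation_rate_arcsec_per_s()
across_equiv_s = seconds_of_rotation(18.47)   # reported RMS across error
about_equiv_s = seconds_of_rotation(78.84)    # reported about error
print(f"rate = {rate:.3f} arcsec/s")
print(f"18.47 arcsec ~ {across_equiv_s:.2f} s of rotation")
print(f"78.84 arcsec ~ {about_equiv_s:.2f} s of rotation")
```

The Earth turns roughly 15 arcseconds per second, so the reported across error corresponds to a little over one second of rotation.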

[46] GeneVA: A Dataset of Human Annotations for Generative Text to Video Artifacts cs.CVPDF

Jenna Kang, Maria Silva, Patsorn Sangkloy, Kenneth Chen, Niall Williams

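TL;DR: GeneVA is a large-scale human-annotated dataset focused on spatio-temporal artifacts in videos generated from text, intended to address the fact that existing benchmarks concentrate mainly on generated images.

Details

Motivation: Generative models have advanced in text-driven video generation, but their randomness can produce unpredictable artifacts. Existing datasets focus mainly on static images and lack systematic evaluation of the spatio-temporal complexity of video.

Result: The GeneVA dataset provides an important resource for evaluating generated video quality and improving models.

Insight: Spatio-temporal consistency is a core challenge in video generation, and GeneVA's annotations provide key support for future research.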

Abstract: Recent advances in probabilistic generative models have extended capabilities from static image synthesis to text-driven video generation. However, the inherent randomness of their generation process can lead to unpredictable artifacts, such as impossible physics and temporal inconsistency. Progress in addressing these challenges requires systematic benchmarks, yet existing datasets primarily focus on generative images due to the unique spatio-temporal complexities of videos. To bridge this gap, we introduce GeneVA, a large-scale artifact dataset with rich human annotations that focuses on spatio-temporal artifacts in videos generated from natural text prompts. We hope GeneVA can enable and assist critical applications, such as benchmarking model performance and improving generative video quality.

[47] RewardDance: Reward Scaling in Visual Generation cs.CVPDF

Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li

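TL;DR: RewardDance proposes a scalable reward modeling framework that uses a generative reward paradigm to address the scaling of reward models (RMs) in visual generation and to effectively avoid reward hacking.

Details

Motivation: Existing CLIP-based reward models are limited in architecture and input modality, and the Bradley-Terry loss is mismatched with the next-token prediction mechanism of vision-language models (VLMs), making reward models hard to scale. In addition, reward hacking during RLHF optimization hinders improvements in model quality.

Result: RewardDance significantly outperforms existing methods on text-to-image, text-to-video, and image-to-video generation. Large-scale RMs maintain high reward variance during RL fine-tuning, resist reward hacking, and produce diverse, high-quality outputs.

Insight: Through the generative reward paradigm, RewardDance addresses reward model scaling in visual generation and offers a new direction for avoiding mode collapse.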

Abstract: Reward Models (RMs) are critical for improving generation models via Reinforcement Learning (RL), yet the RM scaling paradigm in visual generation remains largely unexplored. This is primarily due to fundamental limitations in existing approaches: CLIP-based RMs suffer from architectural and input modality constraints, while prevalent Bradley-Terry losses are fundamentally misaligned with the next-token prediction mechanism of Vision-Language Models (VLMs), hindering effective scaling. More critically, the RLHF optimization process is plagued by the reward hacking issue, where models exploit flaws in the reward signal without improving true quality. To address these challenges, we introduce RewardDance, a scalable reward modeling framework that overcomes these barriers through a novel generative reward paradigm. By reformulating the reward score as the model’s probability of predicting a “yes” token, indicating that the generated image outperforms a reference image according to specific criteria, RewardDance intrinsically aligns reward objectives with VLM architectures. This alignment unlocks scaling across two dimensions: (1) Model Scaling: Systematic scaling of RMs up to 26 billion parameters; (2) Context Scaling: Integration of task-specific instructions, reference examples, and chain-of-thought (CoT) reasoning. Extensive experiments demonstrate that RewardDance significantly surpasses state-of-the-art methods in text-to-image, text-to-video, and image-to-video generation. Crucially, we resolve the persistent challenge of “reward hacking”: Our large-scale RMs exhibit and maintain high reward variance during RL fine-tuning, proving their resistance to hacking and ability to produce diverse, high-quality outputs. This greatly relieves the mode collapse problem that plagues smaller models.
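The core idea of scoring a generation by the probability of a "yes" token can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `yes_token_reward`, the toy three-token vocabulary, and the logit values are all hypothetical, and a real VLM would produce logits over its full vocabulary at the answer position.

```python
import math


def yes_token_reward(logits: dict[str, float], yes_token: str = "yes") -> float:
    """Softmax probability assigned to the 'yes' token (hypothetical helper).

    `logits` maps candidate next tokens to raw scores at the answer position.
    """
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    return exps[yes_token] / sum(exps.values())


# Toy logits, assumed for illustration only.
toy_logits = {"yes": 2.0, "no": 0.0, "maybe": 0.0}
reward = yes_token_reward(toy_logits)
print(f"reward = {reward:.3f}")
```

Because the reward is an ordinary next-token probability, it lies in (0, 1) and is produced by the same prediction head the VLM already uses, which is the alignment the abstract emphasizes.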

[48] SAFT: Shape and Appearance of Fabrics from Template via Differentiable Physical Simulations from Monocular Video cs.CVPDF

David Stotko, Reinhard Klein

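TL;DR: This paper proposes SAFT, a novel method that reconstructs the 3D geometry and appearance of fabrics from a monocular RGB video sequence by combining physical simulation with differentiable rendering. Two new regularization terms address depth ambiguity in monocular video, reducing 3D reconstruction error by a factor of 2.64 at a runtime of 30 minutes per scene. The optimized motion is of sufficient quality for appearance estimation, recovering sharp details of the deforming fabric.

Details

Motivation: Dynamic 3D scene reconstruction is one of the core challenges in computer vision. Existing methods struggle with high-quality geometry reconstruction and appearance estimation of fabrics from monocular video, particularly when it comes to depth ambiguity. The authors therefore propose a method that combines physical simulation and differentiable rendering to improve reconstruction accuracy and realism.

Result: 1. 3D reconstruction error is reduced by a factor of 2.64 compared with existing methods. 2. Runtime is 30 minutes per scene. 3. The optimized motion supports high-quality appearance estimation, recovering sharp details of the deforming fabric.

Insight: 1. Combining physical simulation with differentiable rendering is an effective route to monocular video reconstruction. 2. The design of the regularization terms is crucial to reconstruction quality, especially for handling depth ambiguity. 3. The method shows the potential of estimating geometry and appearance jointly from monocular video.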

Abstract: The reconstruction of three-dimensional dynamic scenes is a well-established yet challenging task within the domain of computer vision. In this paper, we propose a novel approach that combines the domains of 3D geometry reconstruction and appearance estimation for physically based rendering and present a system that is able to perform both tasks for fabrics, utilizing only a single monocular RGB video sequence as input. In order to obtain realistic and high-quality deformations and renderings, a physical simulation of the cloth geometry and differentiable rendering are employed. In this paper, we introduce two novel regularization terms for the 3D reconstruction task that improve the plausibility of the reconstruction by addressing the depth ambiguity problem in monocular video. In comparison with the most recent methods in the field, we have reduced the error in the 3D reconstruction by a factor of 2.64 while requiring a medium runtime of 30 min per scene. Furthermore, the optimized motion achieves sufficient quality to perform an appearance estimation of the deforming object, recovering sharp details from this single monocular RGB video.

cs.CL [Back]

[49] AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs cs.CLPDF

Debdeep Sanyal, Manodeep Ray, Murari Mandal

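TL;DR: This paper proposes AntiDote, a bi-level adversarial training method that makes open-weight large language models (LLMs) resistant to malicious fine-tuning attacks while preserving their general capabilities.

Details

Motivation: Releasing open-weight LLMs creates a tension between accessible research and potential misuse, such as malicious fine-tuning to generate harmful content. Current safety measures struggle to preserve general capabilities while withstanding an attacker with full access to the weights and architecture.

Result: Against 52 red-teaming attacks, AntiDote is up to 27.4% more robust than baseline methods, with less than 0.5% performance degradation on capability benchmarks such as MMLU and HellaSwag.

Insight: AntiDote shows how to embed more resilient safety into open-weight models with almost no loss of utility, offering a compute-efficient methodology for safety research.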

Abstract: The release of open-weight large language models (LLMs) creates a tension between advancing accessible research and preventing misuse, such as malicious fine-tuning to elicit harmful content. Current safety measures struggle to preserve the general capabilities of the LLM while resisting a determined adversary with full access to the model’s weights and architecture, who can use full-parameter fine-tuning to erase existing safeguards. To address this, we introduce AntiDote, a bi-level optimization procedure for training LLMs to be resistant to such tampering. AntiDote involves an auxiliary adversary hypernetwork that learns to generate malicious Low-Rank Adaptation (LoRA) weights conditioned on the defender model’s internal activations. The defender LLM is then trained with an objective to nullify the effect of these adversarial weight additions, forcing it to maintain its safety alignment. We validate this approach against a diverse suite of 52 red-teaming attacks, including jailbreak prompting, latent space manipulation, and direct weight-space attacks. AntiDote is up to 27.4% more robust against adversarial attacks compared to both tamper-resistance and unlearning baselines. Crucially, this robustness is achieved with a minimal trade-off in utility, incurring a performance degradation of less than 0.5% across capability benchmarks including MMLU, HellaSwag, and GSM8K. Our work offers a practical and compute-efficient methodology for building open-weight models where safety is a more integral and resilient property.

[50] NOWJ@COLIEE 2025: A Multi-stage Framework Integrating Embedding Models and Large Language Models for Legal Retrieval and Entailment cs.CL | cs.AIPDF

Hoang-Trung Nguyen, Tan-Minh Nguyen, Xuan-Bach Le, Tuan-Kiet Le, Khanh-Huyen Nguyen

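TL;DR: For the COLIEE 2025 competition, the NOWJ team proposes a multi-stage framework that combines embedding models and large language models (LLMs) for legal retrieval and entailment, taking first place in the Legal Case Entailment task.

Details

Motivation: To address retrieval and entailment challenges in legal information processing by combining the strengths of traditional information retrieval techniques and modern generative models.

Result: First place in the Legal Case Entailment task with an F1 score of 0.3195; strong performance on the other tasks as well.

Insight: Hybrid models that combine traditional IR techniques with generative models show promise for legal information processing and provide a reference for future research.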

Abstract: This paper presents the methodologies and results of the NOWJ team’s participation across all five tasks at the COLIEE 2025 competition, emphasizing advancements in the Legal Case Entailment task (Task 2). Our comprehensive approach systematically integrates pre-ranking models (BM25, BERT, monoT5), embedding-based semantic representations (BGE-m3, LLM2Vec), and advanced Large Language Models (Qwen-2, QwQ-32B, DeepSeek-V3) for summarization, relevance scoring, and contextual re-ranking. Specifically, in Task 2, our two-stage retrieval system combined lexical-semantic filtering with contextualized LLM analysis, achieving first place with an F1 score of 0.3195. Additionally, in other tasks–including Legal Case Retrieval, Statute Law Retrieval, Legal Textual Entailment, and Legal Judgment Prediction–we demonstrated robust performance through carefully engineered ensembles and effective prompt-based reasoning strategies. Our findings highlight the potential of hybrid models integrating traditional IR techniques with contemporary generative models, providing a valuable reference for future advancements in legal information processing.

[51] SciGPT: A Large Language Model for Scientific Literature Understanding and Knowledge Discovery cs.CLPDF

Fengyu She, Nan Wang, Hongfei Wu, Ziyi Wan, Jingmian Wang

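TL;DR: SciGPT is a large language model for scientific literature understanding that, through domain adaptation techniques and a novel attention mechanism, outperforms GPT-4o on scientific tasks.

Details

Motivation: The rapid growth of scientific literature makes it hard for researchers to extract knowledge efficiently, and general-purpose LLMs struggle with the technical details and complex tasks of scientific domains.

Result: On ScienceBench, SciGPT outperforms GPT-4o on sequence labeling, generation, and inference tasks, and shows strong robustness on unseen scientific tasks.

Insight: Domain adaptation and an expert attention mechanism can significantly improve LLM performance on scientific tasks, opening new possibilities for AI-assisted scientific discovery.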

Abstract: Scientific literature is growing exponentially, creating a critical bottleneck for researchers to efficiently synthesize knowledge. While general-purpose Large Language Models (LLMs) show potential in text processing, they often fail to capture scientific domain-specific nuances (e.g., technical jargon, methodological rigor) and struggle with complex scientific tasks, limiting their utility for interdisciplinary research. To address these gaps, this paper presents SciGPT, a domain-adapted foundation model for scientific literature understanding and ScienceBench, an open source benchmark tailored to evaluate scientific LLMs. Built on the Qwen3 architecture, SciGPT incorporates three key innovations: (1) low-cost domain distillation via a two-stage pipeline to balance performance and efficiency; (2) a Sparse Mixture-of-Experts (SMoE) attention mechanism that cuts memory consumption by 55% for 32,000-token long-document reasoning; and (3) knowledge-aware adaptation integrating domain ontologies to bridge interdisciplinary knowledge gaps. Experimental results on ScienceBench show that SciGPT outperforms GPT-4o in core scientific tasks including sequence labeling, generation, and inference. It also exhibits strong robustness in unseen scientific tasks, validating its potential to facilitate AI-augmented scientific discovery.

[52] No for Some, Yes for Others: Persona Prompts and Other Sources of False Refusal in Language Models cs.CLPDF

Flor Miriam Plaza-del-Arco, Paul Röttger, Nino Scherrer, Emanuele Borgonovo, Elmar Plischke

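TL;DR: The study quantifies the effect of 15 sociodemographic persona prompts on the false refusal rate of language models and finds that model capability and task type may matter more than persona prompts, suggesting that earlier estimates may have been too high.

Details

Motivation: Personalizing large language models (LLMs) can lead them to falsely refuse user requests, but earlier work did not fully quantify the phenomenon. This paper aims to fill that gap by examining how persona prompts and other factors affect false refusal.

Result: More capable models are less affected by personas; certain sociodemographic personas increase false refusal in some models; model choice and task type (especially sensitive content tasks) significantly influence false refusal.

Insight: The effect of persona prompts on false refusal may have been overestimated; model capability, task type, and biases in safety mechanisms are more important factors.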

Abstract: Large language models (LLMs) are increasingly integrated into our daily lives and personalized. However, LLM personalization might also increase unintended side effects. Recent work suggests that persona prompting can lead models to falsely refuse user requests. However, no work has fully quantified the extent of this issue. To address this gap, we measure the impact of 15 sociodemographic personas (based on gender, race, religion, and disability) on false refusal. To control for other factors, we also test 16 different models, 3 tasks (Natural Language Inference, politeness, and offensiveness classification), and nine prompt paraphrases. We propose a Monte Carlo-based method to quantify this issue in a sample-efficient manner. Our results show that as models become more capable, personas impact the refusal rate less and less. Certain sociodemographic personas increase false refusal in some models, which suggests underlying biases in the alignment strategies or safety mechanisms. However, we find that the model choice and task significantly influence false refusals, especially in sensitive content tasks. Our findings suggest that persona effects have been overestimated, and might be due to other factors.

[53] MERLIN: Multi-Stage Curriculum Alignment for Multilingual Encoder and LLM Fusion cs.CLPDF

Kosei Uemura, David Guzmán, Quang Phuoc Nguyen, Jesujoba Oluwadara Alabi, En-shiun Annie Lee

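TL;DR: MERLIN is a two-stage model-stacking framework that applies a curriculum learning strategy, moving from general bilingual data to task-specific data, and adapts only a small set of DoRA weights, significantly improving reasoning in low-resource languages (LRLs). On the AfriMGSM benchmark, MERLIN improves accuracy by 12.9 percentage points over MindMerger and even outperforms GPT-4o-mini.

Details

Motivation: Large language models excel in English but fall short on complex reasoning tasks in low-resource languages. Existing encoder-plus-decoder methods work for mid- and high-resource languages but still leave a large gap on LRLs.

Result: Accuracy on AfriMGSM improves by 12.9 percentage points, surpassing MindMerger and GPT-4o-mini; gains on MGSM and MSVAMP are 0.9 and 2.8 percentage points, respectively.

Insight: Curriculum learning with lightweight weight adaptation can effectively narrow the performance gap between LRLs and high-resource languages, and the method is broadly applicable.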

Abstract: Large language models excel in English but still struggle with complex reasoning in many low-resource languages (LRLs). Existing encoder-plus-decoder methods such as LangBridge and MindMerger raise accuracy on mid and high-resource languages, yet they leave a large gap on LRLs. We present MERLIN, a two-stage model-stacking framework that applies a curriculum learning strategy – from general bilingual bitext to task-specific data – and adapts only a small set of DoRA weights. On the AfriMGSM benchmark MERLIN improves exact-match accuracy by +12.9 pp over MindMerger and outperforms GPT-4o-mini. It also yields consistent gains on MGSM and MSVAMP (+0.9 and +2.8 pp), demonstrating effectiveness across both low and high-resource settings.

[54] Bias after Prompting: Persistent Discrimination in Large Language Models cs.CL | cs.LGPDF

Nivedha Sivakumar, Natalie Mackraz, Samira Khorshidi, Krishna Patel, Barry-John Theobald

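TL;DR: The paper finds that biases in large language models (LLMs) transfer to downstream tasks through prompting, and that existing prompt-based debiasing methods do not consistently reduce this transfer.

Details

Motivation: To reveal the persistence of bias transfer under prompt adaptation, challenging the earlier assumption that biases do not transfer from pre-trained models to downstream tasks.

Result: Intrinsic biases and biases after prompt adaptation are moderately to strongly correlated (e.g., gender rho >= 0.94), and existing methods do not consistently reduce bias transfer.

Insight: 1. Correcting bias in the intrinsic model may help prevent its propagation to downstream tasks. 2. Prompt-based debiasing methods need to be optimized for different tasks and groups.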

Abstract: A dangerous assumption that can be made from prior work on the bias transfer hypothesis (BTH) is that biases do not transfer from pre-trained large language models (LLMs) to adapted models. We invalidate this assumption by studying the BTH in causal models under prompt adaptations, as prompting is an extremely popular and accessible adaptation strategy used in real-world applications. In contrast to prior work, we find that biases can transfer through prompting and that popular prompt-based mitigation methods do not consistently prevent biases from transferring. Specifically, the correlation between intrinsic biases and those after prompt adaptation remains moderate to strong across demographics and tasks – for example, gender (rho >= 0.94) in co-reference resolution, and age (rho >= 0.98) and religion (rho >= 0.69) in question answering. Further, we find that biases remain strongly correlated when varying few-shot composition parameters, such as sample size, stereotypical content, occupational distribution and representational balance (rho >= 0.90). We evaluate several prompt-based debiasing strategies and find that different approaches have distinct strengths, but none consistently reduce bias transfer across models, tasks or demographics. These results demonstrate that correcting bias, and potentially improving reasoning ability, in intrinsic models may prevent propagation of biases to downstream tasks.
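The rho values quoted in this abstract read as rank correlations between bias scores before and after prompt adaptation. The sketch below computes a Spearman rank correlation from scratch; treating rho as Spearman's coefficient is an assumption, and the bias scores are made-up toy numbers rather than data from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy bias scores for five hypothetical models (illustrative only).
intrinsic_bias = [0.10, 0.40, 0.25, 0.80, 0.55]
prompted_bias = [0.12, 0.35, 0.30, 0.90, 0.50]
rho = spearman_rho(intrinsic_bias, prompted_bias)
print(f"rho = {rho:.2f}")
```

A rho near 1 means that models ranked as more biased intrinsically stay more biased after prompting, which is the sense in which bias "transfers".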

[55] Verbalized Algorithms cs.CL

Supriya Lall, Christian Farrell, Hari Pathanjaly, Marko Pavic, Sarvesh Chezhian

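TL;DR: The paper proposes a new paradigm called "verbalized algorithms" (VAs), which decomposes a task into simple natural-language operations and restricts the LLM to those operations, making reasoning tasks more reliable.

Details

Motivation: Querying an LLM in a single shot is unreliable. The authors instead combine LLMs with classical algorithms, decomposing tasks into simple operations that LLMs can handle reliably.

Result: The approach is shown to be effective on sorting and clustering tasks.

Insight: Combining LLMs with classical algorithms improves controllability and reliability, and suggests a new direction for targeted optimization of LLMs.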

Abstract: Instead of querying LLMs in a one-shot manner and hoping to get the right answer for a reasoning task, we propose a paradigm we call verbalized algorithms (VAs), which leverage classical algorithms with established theoretical understanding. VAs decompose a task into simple elementary operations on natural language strings that LLMs should be able to answer reliably, and limit the scope of LLMs to only those simple tasks. For example, for sorting a series of natural language strings, verbalized sorting uses an LLM as a binary comparison oracle in a known and well-analyzed sorting algorithm (e.g., bitonic sorting network). We demonstrate the effectiveness of this approach on sorting and clustering tasks.
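
To make the idea of an LLM acting as a binary comparison oracle concrete, here is a minimal sketch of a bitonic sorting network whose only data-dependent step is a call to a comparison function. It is an illustration under stated assumptions, not the authors' implementation: `stub_oracle` is a hypothetical stand-in for an LLM call (it simply compares string lengths), and the network requires the input length to be a power of two.

```python
def bitonic_sort(items, less_than):
    """Sort with a bitonic sorting network; `less_than(a, b)` is the oracle."""
    n = len(items)
    if n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    a = list(items)
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance between compared positions
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Each compare-exchange asks the oracle one yes/no question.
                    if ascending and less_than(a[partner], a[i]):
                        a[i], a[partner] = a[partner], a[i]
                    elif not ascending and less_than(a[i], a[partner]):
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


def stub_oracle(first, second):
    """Hypothetical stand-in for an LLM asked: 'Does `first` come before `second`?'"""
    return len(first) < len(second)


ordered = bitonic_sort(["ccc", "a", "dddd", "bb"], stub_oracle)
```

Because the sequence of comparisons in a sorting network is fixed in advance, the oracle calls within one stage are independent of each other and could in principle be issued in parallel.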

[56] Balancing Quality and Variation: Spam Filtering Distorts Data Label Distributions cs.CL | cs.AI

Eve Fleisig, Matthias Orlikowski, Philipp Cimiano, Dan Klein

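TL;DR: The paper studies how to balance label quality and label variation when filtering spam annotations. It finds that conventional filtering methods often remove annotators who disagree rather than actual spammers, and it recommends conservative filtering.

Details

Motivation: On subjective tasks, annotations must preserve variation to reflect the real distribution of opinions, but conventional spam filters can mistake disagreement for low quality and bias the data. The paper examines how to balance annotator reliability and representation.

Result: Conventional filtering increases label-distribution error once more than about 5% of annotators are removed. Most spammers are distributionally similar to genuine annotators, and the distinguishable minority tend to give fixed answers rather than random ones.

Insight: On tasks that must preserve variation, conventional filters (which treat variation as noise) perform poorly. New methods are needed that separate true spam from legitimate disagreement.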

Abstract: For machine learning datasets to accurately represent diverse opinions in a population, they must preserve variation in data labels while filtering out spam or low-quality responses. How can we balance annotator reliability and representation? We empirically evaluate how a range of heuristics for annotator filtering affect the preservation of variation on subjective tasks. We find that these methods, designed for contexts in which variation from a single ground-truth label is considered noise, often remove annotators who disagree instead of spam annotators, introducing suboptimal tradeoffs between accuracy and label diversity. We find that conservative settings for annotator removal (<5%) are best, after which all tested methods increase the mean absolute error from the true average label. We analyze performance on synthetic spam to observe that these methods often assume spam annotators are more random than real spammers tend to be: most spammers are distributionally indistinguishable from real annotators, and the minority that are distinguishable tend to give fixed answers, not random ones. Thus, tasks requiring the preservation of variation reverse the intuition of existing spam filtering methods: spammers tend to be less random than non-spammers, so metrics that assume variation is spam fare worse. These results highlight the need for spam removal methods that account for label diversity.
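
The error measure named in this abstract, the mean absolute error from the true average label, can be sketched as below. This is a toy illustration with invented data: the annotator IDs, the 1-5 rating scale, and the assumption that the unfiltered pool defines the "true" average are all hypothetical.

```python
def item_means(labels, removed=frozenset()):
    """Average label per item, ignoring annotators listed in `removed`."""
    totals, counts = {}, {}
    for annotator, answers in labels.items():
        if annotator in removed:
            continue
        for item, label in answers.items():
            totals[item] = totals.get(item, 0.0) + label
            counts[item] = counts.get(item, 0) + 1
    return {item: totals[item] / counts[item] for item in totals}


def mean_absolute_error(estimated, reference):
    """Mean absolute difference over the items present in both mappings."""
    shared = [item for item in reference if item in estimated]
    return sum(abs(estimated[i] - reference[i]) for i in shared) / len(shared)


# Hypothetical ratings on a 1-5 scale; annotator "a3" disagrees with the majority.
labels = {
    "a1": {"q1": 1, "q2": 5},
    "a2": {"q1": 1, "q2": 5},
    "a3": {"q1": 5, "q2": 1},
    "a4": {"q1": 1, "q2": 5},
}
true_means = item_means(labels)                       # all annotators are genuine here
filtered_means = item_means(labels, removed={"a3"})   # a filter drops the dissenter
error = mean_absolute_error(filtered_means, true_means)
```

In this toy case the filter removes a genuine minority opinion, so the filtered averages move one full point away from the population average on both items.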

[57] Towards Knowledge-Aware Document Systems: Modeling Semantic Coverage Relations via Answerability Detection cs.CL

Yehudit Aperstein, Alon Gottlib, Gal Benita, Alexander Apartsin

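TL;DR: The paper proposes a question-answering (QA)-based framework for modeling Semantic Coverage Relations (SCR), which describe how the informational content of a document pair aligns. It builds a synthetic dataset and compares generative and discriminative models on SCR prediction.

Details

Motivation: Understanding how information is shared across documents is essential for information retrieval, summarization, and content alignment. There is currently no systematic way to quantify semantic relations between documents, especially information overlap across different forms of expression.

Result: Discriminative models significantly outperform generative ones: RoBERTa-base reaches the highest accuracy (61.4%), and the Random Forest-based model achieves the best macro-F1 (52.9%).

Insight: QA provides an effective lens for analyzing semantic relations and reveals how well current models reason about information beyond surface similarity. Discriminative models are better suited to this task.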

Abstract: Understanding how information is shared across documents, regardless of the format in which it is expressed, is critical for tasks such as information retrieval, summarization, and content alignment. In this work, we introduce a novel framework for modelling Semantic Coverage Relations (SCR), which classifies document pairs based on how their informational content aligns. We define three core relation types: equivalence, where both texts convey the same information using different textual forms or styles; inclusion, where one document fully contains the information of another and adds more; and semantic overlap, where each document presents partially overlapping content. To capture these relations, we adopt a question answering (QA)-based approach, using the answerability of shared questions across documents as an indicator of semantic coverage. We construct a synthetic dataset derived from the SQuAD corpus by paraphrasing source passages and selectively omitting information, enabling precise control over content overlap. This dataset allows us to benchmark generative language models and train transformer-based classifiers for SCR prediction. Our findings demonstrate that discriminative models significantly outperform generative approaches, with the RoBERTa-base model achieving the highest accuracy of 61.4% and the Random Forest-based model showing the best balance with a macro-F1 score of 52.9%. The results show that QA provides an effective lens for assessing semantic relations across stylistically diverse texts, offering insights into the capacity of current models to reason about information beyond surface similarity. The dataset and code developed in this study are publicly available to support reproducibility.
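
The three relation types defined in this abstract can be read as set relations over the questions each document can answer. The sketch below is one illustrative reading, not the paper's classifier: it assumes answerability has already been decided for every shared question, and the `"unrelated"` fallback label is an addition for completeness.

```python
def coverage_relation(answerable_a, answerable_b):
    """Classify a document pair from the sets of questions each one can answer."""
    a, b = set(answerable_a), set(answerable_b)
    if a == b:
        return "equivalence"        # both documents answer the same questions
    if a < b or b < a:
        return "inclusion"          # one document answers everything the other does, and more
    if a & b:
        return "semantic overlap"   # some shared questions, some unique to each
    return "unrelated"              # hypothetical fallback, not one of the paper's classes


# Hypothetical question IDs answerable from each passage.
relation = coverage_relation({"q1", "q2"}, {"q1", "q2", "q3"})
```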

[58] Memorization in Large Language Models in Medicine: Prevalence, Characteristics, and Implications cs.CL | cs.AI

Anran Li, Lingfei Qian, Mengmeng Du, Yu Yin, Yan Hu

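TL;DR: The study presents the first comprehensive evaluation of memorization by large language models (LLMs) in medicine, covering its prevalence, characteristics, and potential impact, and offers practical recommendations.

Details

Motivation: LLMs are widely used in medicine, but how much they memorize their training data has not been studied systematically, which may affect model development and deployment.

Result: Memorization is prevalent across all adaptation scenarios and higher than reported in the general domain. It falls into three types (beneficial, uninformative, and harmful) and affects both the development and adoption of medical LLMs.

Insight: Memorization is double-edged for medical LLMs: targeted measures are needed to encourage memorization that improves accuracy, reduce uninformative memorization, and prevent leakage of sensitive information.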

Abstract: Large Language Models (LLMs) have demonstrated significant potential in medicine. To date, LLMs have been widely applied to tasks such as diagnostic assistance, medical question answering, and clinical information synthesis. However, a key open question remains: to what extent do LLMs memorize medical training data? In this study, we present the first comprehensive evaluation of memorization of LLMs in medicine, assessing its prevalence (how frequently it occurs), characteristics (what is memorized), volume (how much content is memorized), and potential downstream impacts (how memorization may affect medical applications). We systematically analyze common adaptation scenarios: (1) continued pretraining on medical corpora, (2) fine-tuning on standard medical benchmarks, and (3) fine-tuning on real-world clinical data, including over 13,000 unique inpatient records from Yale New Haven Health System. The results demonstrate that memorization is prevalent across all adaptation scenarios and significantly higher than reported in the general domain. Memorization affects both the development and adoption of LLMs in medicine and can be categorized into three types: beneficial (e.g., accurate recall of clinical guidelines and biomedical references), uninformative (e.g., repeated disclaimers or templated medical document language), and harmful (e.g., regeneration of dataset-specific or sensitive clinical content). Based on these findings, we offer practical recommendations to facilitate beneficial memorization that enhances domain-specific reasoning and factual accuracy, minimize uninformative memorization to promote deeper learning beyond surface-level patterns, and mitigate harmful memorization to prevent the leakage of sensitive or identifiable patient information.

[59] Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling cs.CL

Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse

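TL;DR: The paper introduces Delayed Streams Modeling (DSM), a flexible formulation for streaming multimodal sequence-to-sequence learning that achieves efficient streaming inference by aligning streams in pre-processing and introducing delays between them.

Details

Motivation: Sequence-to-sequence generation is usually cast offline, which does not suit streaming settings. DSM addresses this and supports arbitrary combinations of input and output streams, including arbitrarily long sequences.

Result: In experiments, DSM delivers state-of-the-art performance and latency on ASR and TTS and is even competitive with offline baselines.

Insight: DSM's key idea is to reduce the hard problem of streaming alignment to a pre-processing step plus a choice of delays, giving a flexible solution for multimodal sequence tasks.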

Abstract: We introduce Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence generation is often cast in an offline manner, where the model consumes the complete input sequence before generating the first output timestep. Alternatively, streaming sequence-to-sequence models rely on learning a policy for choosing when to advance on the input stream, or write to the output stream. DSM instead models already time-aligned streams with a decoder-only language model. By moving the alignment to a pre-processing step, and introducing appropriate delays between streams, DSM provides streaming inference of arbitrary output sequences, from any input combination, making it applicable to many sequence-to-sequence problems. In particular, given text and audio streams, automatic speech recognition (ASR) corresponds to the text stream being delayed, while the opposite gives a text-to-speech (TTS) model. We perform extensive experiments for these two major sequence-to-sequence tasks, showing that DSM provides state-of-the-art performance and latency while supporting arbitrarily long sequences, being even competitive with offline baselines. Code, samples and demos are available at https://github.com/kyutai-labs/delayed-streams-modeling
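
The role of the delay can be illustrated with a small sketch: two already time-aligned token streams are zipped frame by frame after one of them is shifted by a fixed number of steps. This is a toy under stated assumptions, not code from the linked repository; the `<pad>` token, the function names, and the token strings are all hypothetical.

```python
PAD = "<pad>"


def delay_stream(stream, delay, pad=PAD):
    """Shift a stream to the right by `delay` frames, filling with padding."""
    return [pad] * delay + list(stream)


def align_streams(audio, text, audio_delay=0, text_delay=0, pad=PAD):
    """Pair the two streams frame by frame after applying per-stream delays."""
    a = delay_stream(audio, audio_delay, pad)
    t = delay_stream(text, text_delay, pad)
    length = max(len(a), len(t))
    a += [pad] * (length - len(a))
    t += [pad] * (length - len(t))
    return list(zip(a, t))


audio_tokens = ["a0", "a1", "a2"]
text_tokens = ["t0", "t1", "t2"]

# ASR-like setup: text is delayed, so each text token is predicted after
# the model has already seen the corresponding audio frames.
asr_frames = align_streams(audio_tokens, text_tokens, text_delay=2)

# TTS-like setup: audio is delayed relative to text.
tts_frames = align_streams(audio_tokens, text_tokens, audio_delay=2)
```

With this layout a decoder-only model can consume one frame at a time, treating the undelayed stream as input and the delayed stream as the output to predict.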

[60] A Survey of Reinforcement Learning for Large Reasoning Models cs.CL | cs.AI | cs.LG

Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu

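TL;DR: This survey reviews progress in reinforcement learning (RL) for Large Reasoning Models (LRMs), discussing its success in improving the logical reasoning of large language models (LLMs) and the challenges it faces.

Details

Motivation: With RL succeeding on LLM tasks such as mathematics and coding, its scalability problems must be addressed to advance LRMs and move toward Artificial SuperIntelligence (ASI).

Result: The survey identifies scaling challenges in computational resources, algorithm design, training data, and infrastructure, and discusses future directions.

Insight: RL is a key method for improving the reasoning of LRMs, but applying it at scale still requires solving challenges on several fronts.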

Abstract: In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into Large Reasoning Models (LRMs). With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models. GitHub: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs

cs.RO

Jonathan Lee, Abhishek Rathod, Kshitij Goel, John Stecklein, Wennie Tabib

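TL;DR: The paper proposes a reinforcement learning-based quadrotor navigation method that uses differentiable simulation, novel loss functions, and privileged information to navigate around large obstacles, and it performs well in complex environments.

Details

Motivation: Existing learning-based methods handle scenes with narrow obstacles but struggle when large obstacles such as walls or terrain block the goal. The paper therefore proposes a navigation method that exploits privileged information.

Result: The method reaches an 86% success rate in simulation, outperforming baseline strategies by 34%. In real-world tests it completed 20 flights covering 589 meters at speeds up to 4 m/s without collisions.

Insight: Privileged information such as time-of-arrival (ToA) maps is a clear advantage in complex navigation tasks, and the yaw alignment loss helps guide the quadrotor around large obstacles.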

Abstract: This paper presents a reinforcement learning-based quadrotor navigation method that leverages efficient differentiable simulation, novel loss functions, and privileged information to navigate around large obstacles. Prior learning-based methods perform well in scenes that exhibit narrow obstacles, but struggle when the goal location is blocked by large walls or terrain. In contrast, the proposed method utilizes time-of-arrival (ToA) maps as privileged information and a yaw alignment loss to guide the robot around large obstacles. The policy is evaluated in photo-realistic simulation environments containing large obstacles, sharp corners, and dead-ends. Our approach achieves an 86% success rate and outperforms baseline strategies by 34%. We deploy the policy onboard a custom quadrotor in outdoor cluttered environments both during the day and night. The policy is validated across 20 flights, covering 589 meters without collisions at speeds up to 4 m/s.
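
A time-of-arrival map assigns to each free location the time needed to reach the goal while going around obstacles. The abstract does not say how the authors compute it, so the following is only a grid-based sketch that assumes unit-size cells, 4-connected motion, and constant speed, using breadth-first search outward from the goal. The occupancy grid and all names are hypothetical.

```python
from collections import deque


def toa_map(grid, goal, speed=1.0):
    """Breadth-first time-of-arrival map on an occupancy grid (1 = obstacle)."""
    rows, cols = len(grid), len(grid[0])
    unreachable = float("inf")
    toa = [[unreachable] * cols for _ in range(rows)]
    goal_row, goal_col = goal
    toa[goal_row][goal_col] = 0.0
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and toa[nr][nc] == unreachable:
                # One cell of travel at constant speed.
                toa[nr][nc] = toa[r][c] + 1.0 / speed
                queue.append((nr, nc))
    return toa


# A wall (the 1s) separates the start at (2, 0) from the goal at (2, 2).
occupancy = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
]
arrival_times = toa_map(occupancy, goal=(2, 2))
```

Unlike straight-line distance to the goal, the value at the start cell reflects the detour around the wall, which is the kind of signal that can steer a policy around large obstacles.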

[62] Foundation Models for Autonomous Driving Perception: A Survey Through Core Capabilities cs.RO | cs.CV

Rajendramayavan Sathyam, Yueqi Li

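TL;DR: This survey covers foundation models for autonomous driving perception and examines how they address core challenges in generalization, scalability, and robustness to distribution shift. It proposes a new taxonomy built around four key capabilities and reviews current approaches.

Details

Motivation: Autonomous driving perception is shifting from task-specific deep learning models to general-purpose foundation models trained on large, diverse datasets. Integrating these models' capabilities to achieve robust performance in dynamic driving environments remains a major challenge.

Result: Through its taxonomy and synthesis, the survey lays out the potential and limitations of foundation models in autonomous driving perception, particularly shortcomings in real-time operation and reliability.

Insight: The survey stresses the value of a capability-driven framework and points to deployment challenges in dynamic environments that need more attention, such as computational demands and hallucinations.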

Abstract: Foundation models are revolutionizing autonomous driving perception, transitioning the field from narrow, task-specific deep learning models to versatile, general-purpose architectures trained on vast, diverse datasets. This survey examines how these models address critical challenges in autonomous perception, including limitations in generalization, scalability, and robustness to distributional shifts. The survey introduces a novel taxonomy structured around four essential capabilities for robust performance in dynamic driving environments: generalized knowledge, spatial understanding, multi-sensor robustness, and temporal reasoning. For each capability, the survey elucidates its significance and comprehensively reviews cutting-edge approaches. Diverging from traditional method-centric surveys, our unique framework prioritizes conceptual design principles, providing a capability-driven guide for model development and clearer insights into foundational aspects. We conclude by discussing key challenges, particularly those associated with the integration of these capabilities into real-time, scalable systems, and broader deployment challenges related to computational demands and ensuring model reliability against issues like hallucinations and out-of-distribution failures. The survey also outlines crucial future research directions to enable the safe and effective deployment of foundation models in autonomous driving systems.

[63] Good Deep Features to Track: Self-Supervised Feature Extraction and Tracking in Visual Odometry cs.RO | cs.CV

Sai Puneeth Reddy Gottam, Haoming Zhang, Eivydas Keras

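TL;DR: The paper proposes improving deep feature extraction and tracking through self-supervised learning, in order to improve visual odometry in challenging environments.

Details

Motivation: Visual localization often degrades in large-scale, outdoor, and long-term settings because of lighting changes, dynamic scenes, and low-texture areas. Learning-based methods such as SuperPoint and SuperGlue improve feature coverage and robustness but still generalize poorly to out-of-distribution data.

Result: The method promotes stable and informative features, improving generalization and reliability in challenging conditions such as lighting changes and low-texture areas.

Insight: Self-supervised learning combined with task-specific feedback can improve feature generalization, offering a new approach for visual odometry in challenging environments.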

Abstract: Visual-based localization has made significant progress, yet its performance often drops in large-scale, outdoor, and long-term settings due to factors like lighting changes, dynamic scenes, and low-texture areas. These challenges degrade feature extraction and tracking, which are critical for accurate motion estimation. While learning-based methods such as SuperPoint and SuperGlue show improved feature coverage and robustness, they still face generalization issues with out-of-distribution data. We address this by enhancing deep feature extraction and tracking through self-supervised learning with task-specific feedback. Our method promotes stable and informative features, improving generalization and reliability in challenging environments.

Stefan Podgorski, Sourav Garg, Mehdi Hosseinzadeh, Lachlan Mares, Feras Dayoub

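TL;DR: TANGO is an RGB-only visual navigation method that combines global topological path planning with local metric trajectory control, enabling zero-shot, long-horizon navigation in open-set environments without 3D maps or pre-trained controllers.

Details

Motivation: Traditional visual navigation relies on globally consistent 3D maps or learned controllers, which are computationally expensive and generalize poorly. TANGO addresses this with object-level topometric navigation that adapts to open-set environments.

Result: In simulated and real-world tests, TANGO outperforms existing methods and proves adaptable and effective in open-set environments. The code is open source.

Insight: By relying on foundation models, TANGO avoids costly 3D mapping and domain-specific fine-tuning, offering a new approach to open-set visual navigation.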

Abstract: Visual navigation in robotics traditionally relies on globally-consistent 3D maps or learned controllers, which can be computationally expensive and difficult to generalize across diverse environments. In this work, we present a novel RGB-only, object-level topometric navigation pipeline that enables zero-shot, long-horizon robot navigation without requiring 3D maps or pre-trained controllers. Our approach integrates global topological path planning with local metric trajectory control, allowing the robot to navigate towards object-level sub-goals while avoiding obstacles. We address key limitations of previous methods by continuously predicting local trajectory using monocular depth and traversability estimation, and incorporating an auto-switching mechanism that falls back to a baseline controller when necessary. The system operates using foundational models, ensuring open-set applicability without the need for domain-specific fine-tuning. We demonstrate the effectiveness of our method in both simulated environments and real-world tests, highlighting its robustness and deployability. Our approach outperforms existing state-of-the-art methods, offering a more adaptable and effective solution for visual navigation in open-set environments. The source code is made publicly available: https://github.com/podgorki/TANGO.

Michael J. Munje, Chen Tang, Shuijing Liu, Zichao Hu, Yifeng Zhu

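TL;DR: The paper introduces SocialNav-SUB, a benchmark for evaluating the scene understanding of vision-language models (VLMs) in social robot navigation. Experiments show that state-of-the-art VLMs still fall short in understanding complex social scenes.

Details

Motivation: Robot navigation in dynamic, human-centered environments requires socially compliant decisions grounded in robust scene understanding. VLMs show promise, but their ability to understand complex social navigation scenes (e.g., inferring spatial-temporal relations and human intentions) has not been systematically evaluated.

Result: The best-performing VLM achieves an encouraging probability of agreeing with human answers but still underperforms a simpler rule-based approach and human consensus baselines, indicating critical gaps in social scene understanding.

Insight: VLMs need further improvement before they can be applied to complex social scenes; SocialNav-SUB provides a basis for evaluating and improving them.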

Abstract: Robot navigation in dynamic, human-centered environments requires socially-compliant decisions grounded in robust scene understanding. Recent Vision-Language Models (VLMs) exhibit promising capabilities such as object recognition, common-sense reasoning, and contextual understanding, capabilities that align with the nuanced requirements of social robot navigation. However, it remains unclear whether VLMs can accurately understand complex social navigation scenes (e.g., inferring the spatial-temporal relations among agents and human intentions), which is essential for safe and socially compliant robot navigation. While some recent works have explored the use of VLMs in social robot navigation, no existing work systematically evaluates their ability to meet these necessary conditions. In this paper, we introduce the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a Visual Question Answering (VQA) dataset and benchmark designed to evaluate VLMs for scene understanding in real-world social robot navigation scenarios. SocialNav-SUB provides a unified framework for evaluating VLMs against human and rule-based baselines across VQA tasks requiring spatial, spatiotemporal, and social reasoning in social robot navigation. Through experiments with state-of-the-art VLMs, we find that while the best-performing VLM achieves an encouraging probability of agreeing with human answers, it still underperforms a simpler rule-based approach and human consensus baselines, indicating critical gaps in social scene understanding of current VLMs. Our benchmark sets the stage for further research on foundation models for social robot navigation, offering a framework to explore how VLMs can be tailored to meet real-world social robot navigation needs. An overview of this paper along with the code and data can be found at https://larg.github.io/socialnav-sub .

cs.SI

[66] Scaling Truth: The Confidence Paradox in AI Fact-Checking cs.SI | cs.AI | cs.CL | cs.CY

Ihsan A. Qazi, Zohaib Khan, Abdullah Ghani, Agha A. Raza, Zafar A. Qazi

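TL;DR: The paper systematically evaluates nine large language models (LLMs) on multilingual fact-checking across global contexts and reveals an inverse relationship between model size and confidence that may widen information inequality.

Details

Motivation: The spread of misinformation calls for scalable and reliable fact-checking. LLMs show promise for automating fact verification, but their effectiveness across languages and global contexts remains unclear.

Result: (1) Smaller, more accessible models show high confidence despite lower accuracy, while larger models are more accurate but less confident. (2) Performance gaps are most pronounced for non-English languages and claims from the Global South. (3) Because resource-constrained organizations typically use smaller models, this pattern risks systemic bias.

Insight: (1) The mismatch between model size and confidence may undermine the fairness of fact-checking. (2) Policy and technical interventions are needed to ensure equitable access to AI-assisted fact-checking worldwide. (3) The study provides a multilingual benchmark for future research.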

Abstract: The rise of misinformation underscores the need for scalable and reliable fact-checking solutions. Large language models (LLMs) hold promise in automating fact verification, yet their effectiveness across global contexts remains uncertain. We systematically evaluate nine established LLMs across multiple categories (open/closed-source, multiple sizes, diverse architectures, reasoning-based) using 5,000 claims previously assessed by 174 professional fact-checking organizations across 47 languages. Our methodology tests model generalizability on claims postdating training cutoffs and four prompting strategies mirroring both citizen and professional fact-checker interactions, with over 240,000 human annotations as ground truth. Findings reveal a concerning pattern resembling the Dunning-Kruger effect: smaller, accessible models show high confidence despite lower accuracy, while larger models demonstrate higher accuracy but lower confidence. This risks systemic bias in information verification, as resource-constrained organizations typically use smaller models. Performance gaps are most pronounced for non-English languages and claims originating from the Global South, threatening to widen existing information inequalities. These results establish a multilingual benchmark for future research and provide an evidence base for policy aimed at ensuring equitable access to trustworthy, AI-assisted fact-checking.
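
The pattern described here, high confidence paired with low accuracy, can be summarized by a single overconfidence number: mean stated confidence minus observed accuracy. The sketch below is an illustration only; the abstract does not state how the authors quantify confidence, and the verdict data are invented.

```python
def overconfidence_gap(confidences, is_correct):
    """Mean stated confidence minus accuracy; positive values mean overconfidence."""
    n = len(confidences)
    mean_confidence = sum(confidences) / n
    accuracy = sum(1 for flag in is_correct if flag) / n
    return mean_confidence - accuracy


# Hypothetical verdicts from a small model and a large model on the same four claims.
small_model_gap = overconfidence_gap([0.9, 0.9, 0.9, 0.9], [True, False, False, True])
large_model_gap = overconfidence_gap([0.6, 0.7, 0.6, 0.7], [True, True, False, True])
```

In this invented example the small model is overconfident (positive gap) and the large model is underconfident (negative gap), mirroring the direction of the effect the abstract reports.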

cs.SD

[67] PianoVAM: A Multimodal Piano Performance Dataset cs.SD | cs.AI | cs.CV | cs.MM | eess.AS

Yonghyun Kim, Junhyung Park, Joonhyung Bae, Kirak Kim, Taegyun Kwon

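TL;DR: PianoVAM is a multimodal piano performance dataset that includes videos, audio, MIDI, hand landmarks, fingering labels, and rich metadata, intended to support music information retrieval (MIR) tasks.

Details

Motivation: The multimodal nature of music performance has driven growing interest in the MIR community in data beyond audio; PianoVAM aims to fill this gap.

Result: The paper presents benchmark results for audio-only and audio-visual piano transcription and discusses potential applications.

Insight: Multimodal data can improve music transcription and analysis, while the variety of realistic performance conditions poses new challenges for data quality.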

Abstract: The multimodal nature of music performance has driven increasing interest in data beyond the audio domain within the music information retrieval (MIR) community. This paper introduces PianoVAM, a comprehensive piano performance dataset that includes videos, audio, MIDI, hand landmarks, fingering labels, and rich metadata. The dataset was recorded using a Disklavier piano, capturing audio and MIDI from amateur pianists during their daily practice sessions, alongside synchronized top-view videos in realistic and varied performance conditions. Hand landmarks and fingering labels were extracted using a pretrained hand pose estimation model and a semi-automated fingering annotation algorithm. We discuss the challenges encountered during data collection and the alignment process across different modalities. Additionally, we describe our fingering annotation method based on hand landmarks extracted from videos. Finally, we present benchmarking results for both audio-only and audio-visual piano transcription using the PianoVAM dataset and discuss additional potential applications.

eess.IV

[68] STROKEVISION-BENCH: A Multimodal Video And 2D Pose Benchmark For Tracking Stroke Recovery eess.IV | cs.CV | cs.LG

David Robinson, Animesh Gupta, Rizwan Quershi, Qiushi Fu, Mubarak Shah

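TL;DR: The paper introduces StrokeVision-Bench, the first dataset dedicated to stroke rehabilitation assessment, providing video and 2D pose modalities and filling a gap left by existing datasets.

Details

Motivation: Stroke rehabilitation assessment currently relies on subjective observation and coarse scoring systems that are insensitive to subtle motor improvements. Computer vision could enable objective, quantitative assessment, but existing datasets focus mainly on daily living activities and lack clinically structured tasks.

Result: The paper establishes performance baselines for stroke rehabilitation assessment, to facilitate research on automated assessment.

Insight: The dataset fills the gap in data on clinically structured tasks and is an important resource for applying computer vision to stroke rehabilitation.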

Abstract: Despite advancements in rehabilitation protocols, clinical assessment of upper extremity (UE) function after stroke largely remains subjective, relying heavily on therapist observation and coarse scoring systems. This subjectivity limits the sensitivity of assessments to detect subtle motor improvements, which are critical for personalized rehabilitation planning. Recent progress in computer vision offers promising avenues for enabling objective, quantitative, and scalable assessment of UE motor function. Among standardized tests, the Box and Block Test (BBT) is widely utilized for measuring gross manual dexterity and tracking stroke recovery, providing a structured setting that lends itself well to computational analysis. However, existing datasets targeting stroke rehabilitation primarily focus on daily living activities and often fail to capture clinically structured assessments such as block transfer tasks. Furthermore, many available datasets include a mixture of healthy and stroke-affected individuals, limiting their specificity and clinical utility. To address these critical gaps, we introduce StrokeVision-Bench, the first-ever dedicated dataset of stroke patients performing clinically structured block transfer tasks. StrokeVision-Bench comprises 1,000 annotated videos categorized into four clinically meaningful action classes, with each sample represented in two modalities: raw video frames and 2D skeletal keypoints. We benchmark several state-of-the-art video action recognition and skeleton-based action classification methods to establish performance baselines for this domain and facilitate future research in automated stroke rehabilitation assessment.

[69] Expert-Guided Explainable Few-Shot Learning for Medical Image Diagnosis eess.IV | cs.AI | cs.CV

Ifrat Ikhtear Uddin, Longwei Wang, KC Santosh

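TL;DR: The paper proposes an expert-guided, explainable few-shot learning framework that integrates radiologist-provided regions of interest (ROIs) to improve both classification performance and interpretability in medical image diagnosis.

Details

Motivation: Medical image analysis is held back by limited expert-annotated data, which hinders model generalization and clinical adoption. Existing methods have yet to balance performance and interpretability.

Result: Accuracy improves from 77.09% to 83.61% on BraTS and from 54.33% to 73.29% on VinDr-CXR. Grad-CAM visualizations confirm that model attention aligns better with diagnostic regions.

Insight: Expert-guided attention supervision can bridge the gap between performance and interpretability in few-shot medical image diagnosis, improving reliability and clinical trustworthiness.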

Abstract: Medical image analysis often faces significant challenges due to limited expert-annotated data, hindering both model generalization and clinical adoption. We propose an expert-guided explainable few-shot learning framework that integrates radiologist-provided regions of interest (ROIs) into model training to simultaneously enhance classification performance and interpretability. Leveraging Grad-CAM for spatial attention supervision, we introduce an explanation loss based on Dice similarity to align model attention with diagnostically relevant regions during training. This explanation loss is jointly optimized with a standard prototypical network objective, encouraging the model to focus on clinically meaningful features even under limited data conditions. We evaluate our framework on two distinct datasets: BraTS (MRI) and VinDr-CXR (Chest X-ray), achieving significant accuracy improvements from 77.09% to 83.61% on BraTS and from 54.33% to 73.29% on VinDr-CXR compared to non-guided models. Grad-CAM visualizations further confirm that expert-guided training consistently aligns attention with diagnostic regions, improving both predictive reliability and clinical trustworthiness. Our findings demonstrate the effectiveness of incorporating expert-guided attention supervision to bridge the gap between performance and interpretability in few-shot medical image diagnosis.
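
An explanation loss based on Dice similarity can be sketched in a few lines. The version below is a plain-Python illustration on flattened maps rather than the authors' training code: the smoothing constant `eps`, the weighting factor `weight`, and the idea of adding the term to a scalar classification loss are assumptions made for the example.

```python
def dice_loss(attention, roi_mask, eps=1e-6):
    """1 - Dice similarity between an attention map and an ROI mask (flattened lists)."""
    intersection = sum(a * m for a, m in zip(attention, roi_mask))
    total = sum(attention) + sum(roi_mask)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


def joint_loss(classification_loss, attention, roi_mask, weight=1.0):
    """Classification objective plus a weighted explanation term (hypothetical weighting)."""
    return classification_loss + weight * dice_loss(attention, roi_mask)


# Hypothetical 2x2 attention maps, flattened; values lie in [0, 1].
roi = [1.0, 1.0, 0.0, 0.0]
aligned_attention = [1.0, 1.0, 0.0, 0.0]
misaligned_attention = [0.0, 0.0, 1.0, 1.0]

loss_when_aligned = dice_loss(aligned_attention, roi)
loss_when_misaligned = dice_loss(misaligned_attention, roi)
```

The loss approaches 0 when attention coincides with the radiologist's ROI and approaches 1 when the two do not overlap, so minimizing it alongside the classification objective pulls attention toward the annotated region.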

[70] CardioComposer: Flexible and Compositional Anatomical Structure Generation with Disentangled Geometric Guidance eess.IV | cs.AI | cs.CV | cs.LG

Karim Kadry, Shoaib Goraya, Ajay Manicka, Abdalla Abdelwahed, Farhad Nezami

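TL;DR: CardioComposer is a programmable, compositional framework for guiding unconditional diffusion models of 3D human anatomy, using interpretable ellipsoidal primitives to control the generated structures while preserving anatomical realism.

Details

Motivation: Current generative models of anatomy face a trade-off between controllability and anatomical realism, which limits their use in clinical research and medical device design.

Result: The framework generates 3D anatomical structures while supporting independent control over size, shape, and position, and the composition of multi-component constraints during inference.

Insight: Guidance with interpretable geometric primitives offers a new balance between controllability and realism in generative models of anatomy.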

Abstract: Generative models of 3D anatomy, when integrated with biophysical simulators, enable the study of structure-function relationships for clinical research and medical device design. However, current models face a trade-off between controllability and anatomical realism. We propose a programmable and compositional framework for guiding unconditional diffusion models of human anatomy using interpretable ellipsoidal primitives embedded in 3D space. Our method involves the selection of certain tissues within multi-tissue segmentation maps, upon which we apply geometric moment losses to guide the reverse diffusion process. This framework supports the independent control over size, shape, and position, as well as the composition of multi-component constraints during inference.
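
Geometric moments summarize a shape by its size (zeroth moment), position (first moment), and spread (second central moments), which is what makes an ellipsoid a natural target. The sketch below computes these for a set of voxel coordinates and measures the squared deviation from a target size and centroid. It is a simplified, non-differentiable illustration: the paper applies its losses during reverse diffusion, and the function names, the hard voxel list, and the unweighted sum of terms are all assumptions.

```python
def geometric_moments(voxels):
    """Return (size, centroid, covariance) for a list of (x, y, z) voxel coordinates."""
    size = len(voxels)
    centroid = tuple(sum(v[d] for v in voxels) / size for d in range(3))
    covariance = [
        [
            sum((v[i] - centroid[i]) * (v[j] - centroid[j]) for v in voxels) / size
            for j in range(3)
        ]
        for i in range(3)
    ]
    return size, centroid, covariance


def moment_loss(voxels, target_size, target_centroid):
    """Squared deviation of size and centroid from their targets (hypothetical weighting)."""
    size, centroid, _ = geometric_moments(voxels)
    size_term = (size - target_size) ** 2
    position_term = sum((c - t) ** 2 for c, t in zip(centroid, target_centroid))
    return size_term + position_term


# A hypothetical tissue occupying four voxels in the z = 0 plane.
tissue = [(0, 0, 0), (2, 0, 0), (0, 2, 0), (2, 2, 0)]
size, centroid, covariance = geometric_moments(tissue)
loss = moment_loss(tissue, target_size=4, target_centroid=(1.0, 1.0, 0.0))
```

Because size, centroid, and covariance are separate quantities, each can be constrained on its own, which is one way to read the independent control over size, position, and shape described in the abstract.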

[71] RoentMod: A Synthetic Chest X-Ray Modification Model to Identify and Correct Image Interpretation Model Shortcuts eess.IV | cs.AI | cs.CV | I.4, I.2, J.3

Lauren H. Cooke, Matthias Jung, Jan M. Brendel, Nora M. Kerkovits, Borek Foldyna

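TL;DR: RoentMod is a synthetic chest X-ray modification model for identifying and correcting shortcut learning in image interpretation models. It generates realistic X-rays with user-specified pathology, and experiments show that it improves model robustness and generalization.

Details

Motivation: Chest radiographs (CXRs) are among the most common medical tests. Deep learning models perform well on CXR interpretation but tend to rely on shortcuts that are not clinically relevant. RoentMod is designed to address this.

Result: RoentMod images appeared realistic in 93% of cases and correctly incorporated the specified finding in 89-99% of cases. Using them in training improved model discrimination by 3-19% AUC in internal validation.

Insight: RoentMod is a broadly applicable tool for medical AI: controlled counterfactual interventions improve model robustness and interpretability, and the strategy may generalize to other medical imaging tasks.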

Abstract: Chest radiographs (CXRs) are among the most common tests in medicine. Automated image interpretation may reduce radiologists' workload and expand access to diagnostic expertise. Deep learning multi-task and foundation models have shown strong performance for CXR interpretation but are vulnerable to shortcut learning, where models rely on spurious and off-target correlations rather than clinically relevant features to make decisions. We introduce RoentMod, a counterfactual image editing framework that generates anatomically realistic CXRs with user-specified, synthetic pathology while preserving unrelated anatomical features of the original scan. RoentMod combines an open-source medical image generator (RoentGen) with an image-to-image modification model without requiring retraining. In reader studies with board-certified radiologists and radiology residents, RoentMod-produced images appeared realistic in 93% of cases, correctly incorporated the specified finding in 89-99% of cases, and preserved native anatomy comparable to real follow-up CXRs. Using RoentMod, we demonstrate that state-of-the-art multi-task and foundation models frequently exploit off-target pathology as shortcuts, limiting their specificity. Incorporating RoentMod-generated counterfactual images during training mitigated this vulnerability, improving model discrimination across multiple pathologies by 3-19% AUC in internal validation and by 1-11% for 5 out of 6 tested pathologies in external testing. These findings establish RoentMod as a broadly applicable tool for probing and correcting shortcut learning in medical AI. By enabling controlled counterfactual interventions, RoentMod enhances the robustness and interpretability of CXR interpretation models and provides a generalizable strategy for improving foundation models in medical imaging.

cs.LG

[72] AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning cs.LG | cs.AI | cs.CL

Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo

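TL;DR: The paper presents AgentGym-RL, a framework for training LLM agents for long-horizon decision making through multi-turn reinforcement learning, and ScalingInter-RL, a training approach that balances exploration and exploitation.

Details

Motivation: The community lacks a unified, interactive reinforcement learning framework that can train agents for long-horizon decision making across diverse environments without relying on supervised fine-tuning.

Result: The trained agents match or surpass commercial models on 27 tasks across diverse environments.

Insight: Gradually increasing the interaction horizon makes agents less prone to collapse in long-horizon decision making.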

Abstract: Developing autonomous LLM agents capable of making a series of intelligent decisions to solve complex, real-world tasks is a fast-evolving frontier. Like human cognitive development, agents are expected to acquire knowledge and skills through exploration and interaction with the environment. Despite advances, the community still lacks a unified, interactive reinforcement learning (RL) framework that can effectively train such agents from scratch – without relying on supervised fine-tuning (SFT) – across diverse and realistic environments. To bridge this gap, we introduce AgentGym-RL, a new framework to train LLM agents for multi-turn interactive decision-making through RL. The framework features a modular and decoupled architecture, ensuring high flexibility and extensibility. It encompasses a wide variety of real-world scenarios, and supports mainstream RL algorithms. Furthermore, we propose ScalingInter-RL, a training approach designed for exploration-exploitation balance and stable RL optimization. In early stages, it emphasizes exploitation by restricting the number of interactions, and gradually shifts towards exploration with larger horizons to encourage diverse problem-solving strategies. In this way, the agent develops more diverse behaviors and is less prone to collapse under long horizons. We perform extensive experiments to validate the stability and effectiveness of both the AgentGym-RL framework and the ScalingInter-RL approach. Our agents match or surpass commercial models on 27 tasks across diverse environments. We offer key insights and will open-source the complete AgentGym-RL framework – including code and datasets – to empower the research community in developing the next generation of intelligent agents.
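
The ScalingInter-RL idea of starting with few interactions and gradually allowing longer horizons can be pictured as a schedule on the per-episode turn budget. The staircase below is purely illustrative: the abstract gives no concrete schedule, so the starting budget, increment, interval, and cap are hypothetical values.

```python
def interaction_budget(training_step, start=5, increment=5, interval=100, maximum=30):
    """Maximum number of agent-environment turns allowed at a given training step.

    The budget starts small (favoring exploitation of short solutions) and grows
    by `increment` every `interval` steps until it reaches `maximum`.
    """
    return min(maximum, start + increment * (training_step // interval))


# Budget at a few points during a hypothetical training run.
schedule = [interaction_budget(step) for step in (0, 99, 100, 250, 10_000)]
```

An RL training loop would truncate each rollout at the current budget, so early updates see only short episodes and later updates see progressively longer ones.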

[73] Merge-of-Thought Distillation cs.LG | cs.AI | cs.CL [PDF]

Zhanming Shen, Zeyu Qin, Zenan Huang, Hao Chen, Jiaqi Hu

TL;DR: Merge-of-Thought Distillation (MoT) is proposed to efficiently distill reasoning abilities from multiple teachers into a compact student model, outperforming single-teacher methods and naive multi-teacher unions with significant performance gains on math benchmarks.

Details

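Motivation: Current reasoning distillation methods typically assume a single perfect teacher model, overlooking the fact that multiple candidate teachers and growing chain-of-thought (CoT) datasets are available in practice. A method is therefore needed that integrates the reasoning abilities of multiple teachers while resolving conflicts among their supervision.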

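Result: On competition math benchmarks, MoT significantly outperforms single-teacher distillation and the naive multi-teacher union, surpassing strong models including DEEPSEEK-R1 and QWEN3-30B-A3B. MoT also reduces catastrophic forgetting and improves reasoning in domains beyond mathematics.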

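Insight: MoT shows that consensus-filtered reasoning features transfer broadly, indicating that a lightweight framework can efficiently integrate the reasoning abilities of multiple teachers.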

Abstract: Efficient reasoning distillation for long chain-of-thought (CoT) models is increasingly constrained by the assumption of a single oracle teacher, despite practical availability of multiple candidate teachers and growing CoT corpora. We revisit teacher selection and observe that different students have different “best teachers,” and even for the same student the best teacher can vary across datasets. Therefore, to unify multiple teachers’ reasoning abilities into student with overcoming conflicts among various teachers’ supervision, we propose Merge-of-Thought Distillation (MoT), a lightweight framework that alternates between teacher-specific supervised fine-tuning branches and weight-space merging of the resulting student variants. On competition math benchmarks, using only about 200 high-quality CoT samples, applying MoT to a Qwen3-14B student surpasses strong models including DEEPSEEK-R1, QWEN3-30B-A3B, QWEN3-32B, and OPENAI-O1, demonstrating substantial gains. Besides, MoT consistently outperforms the best single-teacher distillation and the naive multi-teacher union, raises the performance ceiling while mitigating overfitting, and shows robustness to distribution-shifted and peer-level teachers. Moreover, MoT reduces catastrophic forgetting, improves general reasoning beyond mathematics and even cultivates a better teacher, indicating that consensus-filtered reasoning features transfer broadly. These results position MoT as a simple, scalable route to efficiently distilling long CoT capabilities from diverse teachers into compact students.
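
The alternation described in the abstract (teacher-specific fine-tuning branches followed by weight-space merging) can be sketched in a few lines. This is a toy illustration rather than the authors' implementation: uniform averaging is an assumed merge rule, parameters are plain lists of floats instead of tensors, and `merge_weights`, `mot_round`, and the `fine_tune` callable are hypothetical names.

```python
def merge_weights(variants):
    """Uniformly average parameter dicts of the form {name: [float, ...]}.

    Uniform averaging is an assumption; the abstract only says that the
    student variants are merged in weight space.
    """
    if not variants:
        raise ValueError("need at least one student variant")
    names = variants[0].keys()
    merged = {}
    for name in names:
        length = len(variants[0][name])
        for v in variants:
            # All variants must share parameter names and shapes.
            if v.keys() != names or len(v[name]) != length:
                raise ValueError("variants must have identical structure")
        columns = zip(*(v[name] for v in variants))
        merged[name] = [sum(col) / len(variants) for col in columns]
    return merged


def mot_round(student, teacher_datasets, fine_tune):
    """Run one round: branch per teacher, fine-tune each branch, then merge.

    `fine_tune(weights, data)` is a hypothetical callable that returns new
    weights and must not modify its input (only a shallow copy is passed).
    """
    branches = [fine_tune(dict(student), data) for data in teacher_datasets]
    return merge_weights(branches)
```

Calling `mot_round` repeatedly, feeding each merged result back in as the next `student`, gives the alternating structure; in a real setting `fine_tune` would be a supervised fine-tuning run on one teacher's CoT data.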

[74] Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics cs.LG | cs.AI | cs.CV | hep-ex [PDF]

Dikshant Sagar, Kaiwen Yu, Alejandro Yankelevich, Jianming Bian, Pierre Baldi

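TL;DR: This paper explores how to apply vision-language models (VLMs) to neutrino event classification in high-energy physics and, by fine-tuning a LLaMa 3.2 model, shows that VLMs outperform conventional convolutional neural networks (CNNs) in both performance and interpretability.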

Details

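Motivation: In recent years, large language models (LLMs) have shown strong capabilities in processing multimodal data. The paper aims to leverage the strengths of vision-language models (VLMs) to classify neutrino events in high-energy physics experiments, improving both classification performance and model interpretability.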

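Result: Experiments show that VLMs outperform CNNs on the classification task while offering greater flexibility and interpretability, and that they integrate auxiliary textual or semantic information more readily.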

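Insight: Owing to their high performance, interpretability, and generalizability, VLMs have the potential to become a general-purpose framework for physics event classification, advancing the use of multimodal reasoning in experimental physics.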

Abstract: Recent advances in Large Language Models (LLMs) have demonstrated their remarkable capacity to process and reason over structured and unstructured data modalities beyond natural language. In this work, we explore the applications of Vision Language Models (VLMs), specifically a fine-tuned variant of LLaMa 3.2, to the task of identifying neutrino interactions in pixelated detector data from high-energy physics (HEP) experiments. We benchmark this model against a state-of-the-art convolutional neural network (CNN) architecture, similar to those used in the NOvA and DUNE experiments, which have achieved high efficiency and purity in classifying electron and muon neutrino events. Our evaluation considers both the classification performance and interpretability of the model predictions. We find that VLMs can outperform CNNs, while also providing greater flexibility in integrating auxiliary textual or semantic information and offering more interpretable, reasoning-based predictions. This work highlights the potential of VLMs as a general-purpose backbone for physics event classification, due to their high performance, interpretability, and generalizability, which opens new avenues for integrating multimodal reasoning in experimental neutrino physics.
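
The abstract refers to efficiency and purity in classifying electron and muon neutrino events. As a minimal sketch, assuming the usual definitions (efficiency as the fraction of true signal events that are selected, purity as the fraction of selected events that are true signal), the two can be computed from predicted and true labels as follows. The function name and the label strings are hypothetical and are not taken from the paper.

```python
def efficiency_and_purity(true_labels, predicted_labels, signal):
    """Compute selection efficiency and purity for one signal class.

    efficiency = TP / (TP + FN), the fraction of true signal events selected
    purity     = TP / (TP + FP), the fraction of selected events that are signal
    Returns 0.0 for a quantity whose denominator is zero.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == signal and p == signal)
    fn = sum(1 for t, p in pairs if t == signal and p != signal)
    fp = sum(1 for t, p in pairs if t != signal and p == signal)

    efficiency = tp / (tp + fn) if (tp + fn) else 0.0
    purity = tp / (tp + fp) if (tp + fp) else 0.0
    return efficiency, purity
```

Evaluating both classifiers with the same function on the same held-out events gives a like-for-like comparison of the kind the abstract describes.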
