cs.CV [Total: 100]
cs.CL [Total: 38]
cs.SI [Total: 1]
cs.NE [Total: 1]
physics.med-ph [Total: 1]
cs.GR [Total: 2]
cs.LG [Total: 6]
cs.RO [Total: 1]
cs.NI [Total: 1]
cs.CR [Total: 2]
eess.IV [Total: 11]
cs.MM [Total: 1]
cs.AI [Total: 4]
cs.IR [Total: 3]

cs.CV

[1] Outlier Detection Algorithm for Circle Fitting cs.CV | eess.IV

Ahmet Gökhan Poyraz

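TL;DR: This paper proposes the Polar Coordinate-Based Outlier Detection (PCOD) algorithm, which transforms the point set into polar coordinates and computes local and global standard deviations to identify outliers, significantly improving circle-fitting accuracy in industrial applications.

Details

Motivation: In quality control and design applications, the accuracy of circle-fitting algorithms is often degraded by noisy points, and conventional methods have limited effectiveness, so a more efficient outlier detection and removal method is needed.

Result: In high-precision diameter measurement of industrial washers, a comparison spanning ten circle-fitting algorithms and five outlier detection methods shows that PCOD delivers the best accuracy on the dataset.

Insight: The polar-coordinate transform offers a new perspective on outlier detection, and combining local and global statistics makes the identification of noisy points more robust.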

Abstract: Circle fitting methods are extensively utilized in various industries, particularly in quality control processes and design applications. The effectiveness of these algorithms can be significantly compromised when the point sets to be predicted are noisy. To mitigate this issue, outlier detection and removal algorithms are often applied before the circle fitting procedure. This study introduces the Polar Coordinate-Based Outlier Detection (PCOD) algorithm, which can be effectively employed in circle fitting applications. In the proposed approach, the point set is first transformed into polar coordinates, followed by the calculation of both local and global standard deviations. Outliers are then identified by comparing local mean values with the global standard deviation. The practicality and efficiency of the proposed method are demonstrated by focusing on the high-precision diameter measurement of industrial washer parts. Images from a machine vision system are processed through preprocessing steps, including sub-pixel edge detection. The resulting sub-pixel edge points are then cleaned using the proposed outlier detection and removal algorithm, after which circle fitting is performed. A comparison is made using ten different circle fitting algorithms and five distinct outlier detection methods. The results indicate that the proposed method outperforms the other approaches, delivering the best performance in terms of accuracy within the dataset, thereby demonstrating its potential for enhancing circle fitting applications in industrial environments.
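
The abstract gives only the outline of PCOD (polar transform, local and global standard deviations, comparison of local means against the global deviation), so the following is a minimal sketch of that general idea rather than the author's algorithm. The centroid as polar origin, the angular neighbor window `half_window`, the multiplier `k`, and the function name `polar_outliers` are all assumptions made for illustration.

```python
import math
import statistics

def polar_outliers(points, half_window=5, k=2.0):
    """Flag points whose radius deviates from the local mean radius.

    Assumed rule (not from the paper): a point is an outlier when
    |r - local_mean| > k * global_std, where local_mean is taken over
    the 2 * half_window angular neighbors of the point.
    Intended for point sets with more than 2 * half_window points.
    """
    # Assumption: use the centroid as the origin of the polar transform.
    cx = statistics.fmean(x for x, _ in points)
    cy = statistics.fmean(y for _, y in points)
    polar = [(math.atan2(y - cy, x - cx), math.hypot(x - cx, y - cy), i)
             for i, (x, y) in enumerate(points)]
    polar.sort()  # order by angle so neighbors are adjacent on the contour
    radii = [r for _, r, _ in polar]
    global_std = statistics.pstdev(radii)
    n = len(radii)
    flagged = []
    for pos, (_, r, idx) in enumerate(polar):
        # Circular window of neighbors, excluding the point itself.
        neighbors = [radii[(pos + d) % n]
                     for d in range(-half_window, half_window + 1) if d != 0]
        local_mean = statistics.fmean(neighbors)
        if abs(r - local_mean) > k * global_std:
            flagged.append(idx)
    return sorted(flagged)  # indices into the original point list
```

The flagged indices would be removed before passing the remaining edge points to any circle-fitting routine. A robust statistic such as the median could replace the mean if outliers cluster together.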

[2] Enhancing Diameter Measurement Accuracy in Machine Vision Applications cs.CV | eess.IV

Ahmet Gokhan Poyraz, Ahmet Emir Dirik, Hakan Gurkan, Mehmet Kacmaz

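TL;DR: This paper proposes two new methods (a conversion factor-based method and a pixel-based method) to improve diameter measurement accuracy in machine vision, significantly reducing error.

Details

Motivation: In camera measurement systems, mechanical and software factors still cause measurement errors even when specialized equipment such as telecentric lenses is used, particularly when parts of different diameters are measured. This study aims to address that problem.

Result: Tests on glass samples (1-12 mm) and metal workpieces (3-24 mm) show that measurement errors dropped from 13-114 micrometers to 1-2 micrometers.

Insight: Only a few reference parts are needed to achieve high-accuracy measurement, significantly improving the reliability and accuracy of existing diameter measurement techniques.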

Abstract: In camera measurement systems, specialized equipment such as telecentric lenses is often employed to measure parts with narrow tolerances. However, despite the use of such equipment, measurement errors can occur due to mechanical and software-related factors within the system. These errors are particularly evident in applications where parts of different diameters are measured using the same setup. This study proposes two innovative approaches to enhance measurement accuracy using multiple known reference parts: a conversion factor-based method and a pixel-based method. In the first approach, the conversion factor is estimated from known references to calculate the diameter (mm) of the unknown part. In the second approach, the diameter (mm) is directly estimated using pixel-based diameter information from the references. The experimental setup includes an industrial-grade camera and telecentric lenses. Tests conducted on glass samples (1-12 mm) and metal workpieces (3-24 mm) show that measurement errors, which originally ranged from 13-114 micrometers, were reduced to 1-2 micrometers using the proposed methods. By utilizing only a few known reference parts, the proposed approach enables high-accuracy measurement of all parts within the camera’s field of view. Additionally, this method enhances the existing diameter measurement literature by significantly reducing error rates and improving measurement reliability.
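
The two approaches can be illustrated with a small sketch. The abstract does not say how the references are combined, so piecewise-linear interpolation between reference parts is an assumption here, and the function names and the reference values in the usage note are hypothetical.

```python
from bisect import bisect_left

def _interp(x, xs, ys):
    """Piecewise-linear interpolation over ascending xs (extrapolates at the ends)."""
    j = min(max(bisect_left(xs, x), 1), len(xs) - 1)
    x0, x1, y0, y1 = xs[j - 1], xs[j], ys[j - 1], ys[j]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def diameter_via_conversion_factor(d_px, refs):
    """refs: list of (pixel_diameter, known_mm_diameter), at least two entries.

    Estimate a mm-per-pixel factor for this pixel diameter from the
    references, then scale the measured pixel diameter by it.
    """
    refs = sorted(refs)
    px = [p for p, _ in refs]
    factors = [mm / p for p, mm in refs]  # one conversion factor per reference
    return _interp(d_px, px, factors) * d_px

def diameter_via_pixels(d_px, refs):
    """Map the pixel diameter directly to millimeters using the references."""
    refs = sorted(refs)
    return _interp(d_px, [p for p, _ in refs], [mm for _, mm in refs])
```

With made-up references `[(100.0, 1.0), (300.0, 3.03), (600.0, 6.12)]`, a part measured at 200 px comes out at about 2.010 mm with the first function and 2.015 mm with the second. A single global conversion factor of 0.01 mm/px would give 2.000 mm, which shows how a size-dependent correction changes the result.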

[3] Multimodal Video Emotion Recognition with Reliable Reasoning Priors cs.CV | cs.AI

Zhepeng Wang, Yingjian Zhu, Guanghao Dong, Hongzhu Yi, Feng Chen

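TL;DR: This paper proposes a multimodal emotion recognition method that combines reliable reasoning priors generated by MLLMs with Balanced Dual-Contrastive Learning, yielding substantial gains on the MER2024 benchmark.

Details

Motivation: Multimodal emotion recognition suffers from class imbalance and insufficient cross-modal interaction, which calls for reliable prior knowledge and an effective loss function.

Result: The method achieves substantial performance gains on the MER2024 benchmark, validating its effectiveness.

Insight: Combining MLLM-generated reasoning priors with a lightweight fusion network provides a robust and scalable solution for emotion recognition.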

Abstract: This study investigates the integration of trustworthy prior reasoning knowledge from MLLMs into multimodal emotion recognition. We employ Gemini to generate fine-grained, modality-separable reasoning traces, which are injected as priors during the fusion stage to enrich cross-modal interactions. To mitigate the pronounced class-imbalance in multimodal emotion recognition, we introduce Balanced Dual-Contrastive Learning, a loss formulation that jointly balances inter-class and intra-class distributions. Applied to the MER2024 benchmark, our prior-enhanced framework yields substantial performance gains, demonstrating that the reliability of MLLM-derived reasoning can be synergistically combined with the domain adaptability of lightweight fusion networks for robust, scalable emotion recognition.

[4] From Waveforms to Pixels: A Survey on Audio-Visual Segmentation cs.CV

Jia Li, Yapeng Tian

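TL;DR: This paper is a comprehensive survey of audio-visual segmentation (AVS). It covers the problem formulation, datasets, evaluation metrics, and methodological progress, compares architectures, fusion strategies, and training paradigms, and outlines current challenges and future directions.

Details

Motivation: AVS combines the visual and audio modalities to segment sound-producing objects in videos at a fine-grained level. As multimodal perception advances, AVS has become an increasingly important research area.

Result: The paper provides an extensive comparison of AVS methods, showing how architectures, fusion strategies, and training paradigms affect performance, and identifies the limitations of existing methods.

Insight: 1. Existing methods still fall short in temporal modeling and in robustness in complex environments; 2. The use of foundation models for few-shot learning and generalization needs further exploration; 3. Reducing reliance on labeled data is key.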

Abstract: Audio-Visual Segmentation (AVS) aims to identify and segment sound-producing objects in videos by leveraging both visual and audio modalities. It has emerged as a significant research area in multimodal perception, enabling fine-grained object-level understanding. In this survey, we present a comprehensive overview of the AVS field, covering its problem formulation, benchmark datasets, evaluation metrics, and the progression of methodologies. We analyze a wide range of approaches, including architectures for unimodal and multimodal encoding, key strategies for audio-visual fusion, and various decoder designs. Furthermore, we examine major training paradigms, from fully supervised learning to weakly supervised and training-free methods. Notably, we provide an extensive comparison of AVS methods across standard benchmarks, highlighting the impact of different architectural choices, fusion strategies, and training paradigms on performance. Finally, we outline the current challenges, such as limited temporal modeling, modality bias toward vision, lack of robustness in complex environments, and high computational demands, and propose promising future directions, including improving temporal reasoning and multimodal fusion, leveraging foundation models for better generalization and few-shot learning, reducing reliance on labeled data through self- and weakly supervised learning, and incorporating higher-level reasoning for more intelligent AVS systems.

[5] A Large Language Model Powered Integrated Circuit Footprint Geometry Understanding cs.CV

Yida Wang, Taiting Lu, Runze Liu, Lanqing Yang, Yifan Yang

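TL;DR: This paper proposes LLM4-IC8K, a large language model (LLM)-based framework for automatically parsing footprint geometry from integrated circuit (IC) mechanical drawings, and shows superior performance on geometry understanding tasks.

Details

Motivation: IC footprint geometry labeling is essential for PCB layout, but no automated method exists for parsing geometry from mechanical drawings, so the work depends on manual labeling. Large multimodal models (LMMs) perform poorly on geometric perception tasks, which motivated the new method.

Result: LLM4-IC8K outperforms existing LMMs on the ICGeo8K dataset, confirming its advantage on geometry understanding tasks.

Insight: 1. Pretraining on synthetic data and then fine-tuning on real data effectively improves performance; 2. Geometry parsing should be solved step by step, mimicking the reasoning of human engineers; 3. Multimodal datasets are essential for advancing this line of research.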

Abstract: Printed-Circuit-board (PCB) footprint geometry labeling of integrated circuits (IC) is essential in defining the physical interface between components and the PCB layout, requiring exceptional visual perception proficiency. However, due to the unstructured footprint drawing and abstract diagram annotations, automated parsing and accurate footprint geometry modeling remain highly challenging. Despite its importance, no methods currently exist for automated package geometry labeling directly from IC mechanical drawings. In this paper, we first investigate the visual perception performance of Large Multimodal Models (LMMs) when solving IC footprint geometry understanding. Our findings reveal that current LMMs severely suffer from inaccurate geometric perception, which hinders their performance in solving the footprint geometry labeling problem. To address these limitations, we propose LLM4-IC8K, a novel framework that treats IC mechanical drawings as images and leverages LLMs for structured geometric interpretation. To mimic the step-by-step reasoning approach used by human engineers, LLM4-IC8K addresses three sub-tasks: perceiving the number of pins, computing the center coordinates of each pin, and estimating the dimensions of individual pins. We present a two-stage framework that first trains LMMs on synthetically generated IC footprint diagrams to learn fundamental geometric reasoning and then fine-tunes them on real-world datasheet drawings to enhance robustness and accuracy in practical scenarios. To support this, we introduce ICGeo8K, a multi-modal dataset with 8,608 labeled samples, including 4138 hand-crafted IC footprint samples and 4470 synthetically generated samples. Extensive experiments demonstrate that our model outperforms state-of-the-art LMMs on the proposed benchmark.

[6] TIR-Diffusion: Diffusion-based Thermal Infrared Image Denoising via Latent and Wavelet Domain Optimization cs.CV | cs.RO | eess.IV

Tai Hyoung Rhee, Dong-guw Lee, Ayoung Kim

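TL;DR: The paper proposes a diffusion-based denoising method for thermal infrared (TIR) images that uses latent-space and wavelet-domain optimization, a novel loss function, and a cascaded refinement stage, significantly improving denoising quality and showing strong zero-shot generalization.

Details

Motivation: TIR imaging is valuable under poor lighting conditions, but its inherent fixed-pattern noise severely affects tasks such as object detection and localization. Existing methods struggle with this non-uniform noise.

Result: The method outperforms existing methods on benchmark datasets and shows strong zero-shot generalization on real-world TIR data.

Insight: Combining diffusion models with wavelet-domain optimization can effectively handle the complex noise in TIR images, offering a practical solution for robotic perception tasks.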

Abstract: Thermal infrared imaging exhibits considerable potentials for robotic perception tasks, especially in environments with poor visibility or challenging lighting conditions. However, TIR images typically suffer from heavy non-uniform fixed-pattern noise, complicating tasks such as object detection, localization, and mapping. To address this, we propose a diffusion-based TIR image denoising framework leveraging latent-space representations and wavelet-domain optimization. Utilizing a pretrained stable diffusion model, our method fine-tunes the model via a novel loss function combining latent-space and discrete wavelet transform (DWT) / dual-tree complex wavelet transform (DTCWT) losses. Additionally, we implement a cascaded refinement stage to enhance fine details, ensuring high-fidelity denoising results. Experiments on benchmark datasets demonstrate superior performance of our approach compared to state-of-the-art denoising methods. Furthermore, our method exhibits robust zero-shot generalization to diverse and challenging real-world TIR datasets, underscoring its effectiveness for practical robotic deployment.

[7] StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization cs.CV | cs.AI

Gopalji Gaur, Mohammadreza Zolfaghari, Thomas Brox

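TL;DR: StorySync proposes a training-free method that maintains subject consistency in text-to-image generation through Regional Feature Harmonization and masked cross-image attention sharing, avoiding fine-tuning or retraining of the pre-trained model.

Details

Motivation: Existing methods typically require fine-tuning or retraining, which is computationally expensive and can interfere with the model's existing capabilities. This paper aims for an efficient training-free method for subject consistency in text-to-image generation.

Result: Experiments show that the method generates visually consistent subjects across a variety of scenarios while preserving the diffusion model's creative abilities.

Insight: A training-free method can address subject consistency efficiently without interfering with the pre-trained model, offering a new approach to text-to-image generation.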

Abstract: Generating a coherent sequence of images that tells a visual story, using text-to-image diffusion models, often faces the critical challenge of maintaining subject consistency across all story scenes. Existing approaches, which typically rely on fine-tuning or retraining models, are computationally expensive, time-consuming, and often interfere with the model’s pre-existing capabilities. In this paper, we follow a training-free approach and propose an efficient consistent-subject-generation method. This approach works seamlessly with pre-trained diffusion models by introducing masked cross-image attention sharing to dynamically align subject features across a batch of images, and Regional Feature Harmonization to refine visually similar details for improved subject consistency. Experimental results demonstrate that our approach successfully generates visually consistent subjects across a variety of scenarios while maintaining the creative abilities of the diffusion model.

[8] Fusion of Pervasive RF Data with Spatial Images via Vision Transformers for Enhanced Mapping in Smart Cities cs.CV | cs.AI

Rafayel Mkrtchyan, Armen Manukyan, Hrant Khachatrian, Theofanis P. Raptis

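TL;DR: The paper proposes a deep learning approach that integrates the DINOv2 architecture, combining open-source maps with radio frequency (RF) data in a vision transformer to improve building mapping accuracy in smart cities.

Details

Motivation: Conventional smart city mapping techniques such as satellite imagery, LiDAR scans, and manual annotation suffer from high cost, poor accessibility, and limited accuracy; open-source mapping platforms introduce biases due to human error and the evolving nature of real-world environments.

Result: On a synthetic dataset, the model reaches a macro IoU of 65.3%, significantly surpassing the baselines: 40.1% for the erroneous maps, 37.3% for an RF-only method, and 42.2% for a non-AI fusion baseline.

Insight: Combining RF data with visual information can compensate for the shortcomings of conventional methods and provide more reliable mapping for smart city applications.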

Abstract: Environment mapping is an important computing task for a wide range of smart city applications, including autonomous navigation, wireless network operations and extended reality environments. Conventional smart city mapping techniques, such as satellite imagery, LiDAR scans, and manual annotations, often suffer from limitations related to cost, accessibility and accuracy. Open-source mapping platforms have been widely utilized in artificial intelligence applications for environment mapping, serving as a source of ground truth. However, human errors and the evolving nature of real-world environments introduce biases that can negatively impact the performance of neural networks trained on such data. In this paper, we present a deep learning-based approach that integrates the DINOv2 architecture to improve building mapping by combining maps from open-source platforms with radio frequency (RF) data collected from multiple wireless user equipments and base stations. Our approach leverages a vision transformer-based architecture to jointly process both RF and map modalities within a unified framework, effectively capturing spatial dependencies and structural priors for enhanced mapping accuracy. For the evaluation purposes, we employ a synthetic dataset co-produced by Huawei. We develop and train a model that leverages only aggregated path loss information to tackle the mapping problem. We measure the results according to three performance metrics which capture different qualities: (i) The Jaccard index, also known as intersection over union (IoU), (ii) the Hausdorff distance, and (iii) the Chamfer distance. Our design achieves a macro IoU of 65.3%, significantly surpassing (i) the erroneous maps baseline, which yields 40.1%, (ii) an RF-only method from the literature, which yields 37.3%, and (iii) a non-AI fusion baseline that we designed which yields 42.2%.
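
The three metrics named in the abstract have standard definitions, sketched below on small point sets. This is an illustration only: treating a building map as a set of occupied grid cells is an assumption, and Chamfer distance has several conventions (this version sums the two directed mean nearest-neighbor distances). The paper's exact evaluation protocol, including how the macro average is formed, is not given in the abstract.

```python
import math

def iou(a, b):
    """Jaccard index (intersection over union) of two sets of cells."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def _nearest(p, pts):
    """Euclidean distance from p to its closest point in pts."""
    return min(math.dist(p, q) for q in pts)

def hausdorff(a, b):
    """Symmetric Hausdorff distance: the worst-case nearest-neighbor gap."""
    return max(max(_nearest(p, b) for p in a),
               max(_nearest(q, a) for q in b))

def chamfer(a, b):
    """Chamfer distance: sum of the two directed mean nearest-neighbor gaps."""
    return (sum(_nearest(p, b) for p in a) / len(a)
            + sum(_nearest(q, a) for q in b) / len(b))
```

IoU rewards overlap, Hausdorff penalizes the single worst misplaced cell, and Chamfer averages the misplacement, which is why the three are described as capturing different qualities.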

[9] VQ-DeepISC: Vector Quantized-Enabled Digital Semantic Communication with Channel Adaptive Image Transmission cs.CV | cs.AI

Jianqiao Chen, Tingting Zhu, Huishi Song, Nan Ma, Xiaodong Xu

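TL;DR: VQ-DeepISC is a vector-quantized digital semantic communication system that extracts semantic features with a Swin Transformer and combines VQ modules with a channel adaptation mechanism for efficient discrete-symbol transmission. Experiments show that its reconstruction quality exceeds that of benchmark methods.

Details

Motivation: Digitizing semantic features requires preserving continuity and context while compressing them into discrete symbols, and remaining robust to channel degradation. This paper addresses that challenge.

Result: Experiments show that VQ-DeepISC achieves higher reconstruction fidelity than benchmark methods, validating its effectiveness.

Insight: Discretizing semantic features and adding a channel adaptation mechanism enables efficient digital semantic communication, while EMA and distributional regularization are important for training stability.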

Abstract: Discretization of semantic features enables interoperability between semantic and digital communication systems, showing significant potential for practical applications. The fundamental difficulty in digitizing semantic features stems from the need to preserve continuity and context in inherently analog representations during their compression into discrete symbols while ensuring robustness to channel degradation. In this paper, we propose a vector quantized (VQ)-enabled digital semantic communication system with channel adaptive image transmission, named VQ-DeepISC. Guided by deep joint source-channel coding (DJSCC), we first design a Swin Transformer backbone for hierarchical semantic feature extraction, followed by VQ modules projecting features into discrete latent spaces. Consequently, it enables efficient index-based transmission instead of raw feature transmission. To further optimize this process, we develop an attention mechanism-driven channel adaptation module to dynamically optimize index transmission. Secondly, to counteract codebook collapse during training process, we impose a distributional regularization by minimizing the Kullback-Leibler divergence (KLD) between codeword usage frequencies and a uniform prior. Meanwhile, exponential moving average (EMA) is employed to stabilize training and ensure balanced feature coverage during codebook updates. Finally, digital communication is implemented using quadrature phase shift keying (QPSK) modulation alongside orthogonal frequency division multiplexing (OFDM), adhering to the IEEE 802.11a standard. Experimental results demonstrate superior reconstruction fidelity of the proposed system over benchmark methods.
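
Two of the ingredients described above, index-based quantization and the KL regularizer against a uniform prior, can be sketched in a few lines. This is a toy illustration under stated assumptions: nearest-codeword assignment by squared Euclidean distance and the direction KL(usage || uniform) are common choices but are not confirmed by the abstract, and the function names are hypothetical.

```python
import math

def quantize(features, codebook):
    """Return, for each feature vector, the index of its nearest codeword."""
    def sq_dist(z, e):
        return sum((zi - ei) ** 2 for zi, ei in zip(z, e))
    return [min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
            for z in features]

def kl_usage_to_uniform(indices, num_codes):
    """KL(p || u) for empirical codeword usage p and uniform prior u = 1/K.

    Equals log(K) - H(p): zero when all codewords are used equally,
    log(K) when a single codeword absorbs every feature (full collapse).
    """
    counts = [0] * num_codes
    for i in indices:
        counts[i] += 1
    total = len(indices)
    return sum((c / total) * math.log((c / total) * num_codes)
               for c in counts if c > 0)
```

Only the integer indices need to be transmitted, which is what makes the scheme compatible with a digital modulation stage. In training, a differentiable surrogate of the usage frequencies would be needed, since hard counts carry no gradient.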

[10] Tobler’s First Law in GeoAI: A Spatially Explicit Deep Learning Model for Terrain Feature Detection Under Weak Supervision cs.CV | cs.AI

Wenwen Li, Chia-Yu Hsu, Maosheng Hu

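TL;DR: This paper proposes a spatially explicit deep learning model based on Tobler's first law of geography for terrain feature detection under weak supervision, and improves performance with a multistage training strategy and attention maps.

Details

Motivation: GeoAI faces a lack of training data and a neglect of spatial principles in model design, which hinder the deep integration of AI with geospatial research, so new methods are needed.

Result: The model was applied to impact crater detection on Mars and generalizes to natural and human-made features on Earth and other planets.

Insight: Tobler's first law of geography can serve as a theoretical foundation for GeoAI models, and weak supervision enables effective object detection when data is scarce.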

Abstract: Recent interest in geospatial artificial intelligence (GeoAI) has fostered a wide range of applications using artificial intelligence (AI), especially deep learning, for geospatial problem solving. However, major challenges such as a lack of training data and the neglect of spatial principles and spatial effects in AI model design remain, significantly hindering the in-depth integration of AI with geospatial research. This paper reports our work in developing a deep learning model that enables object detection, particularly of natural features, in a weakly supervised manner. Our work makes three contributions: First, we present a method of object detection using only weak labels. This is achieved by developing a spatially explicit model based on Tobler’s first law of geography. Second, we incorporate attention maps into the object detection pipeline and develop a multistage training strategy to improve performance. Third, we apply this model to detect impact craters on Mars, a task that previously required extensive manual effort. The model generalizes to both natural and human-made features on the surfaces of Earth and other planets. This research advances the theoretical and methodological foundations of GeoAI.

[11] Closed-Circuit Television Data as an Emergent Data Source for Urban Rail Platform Crowding Estimation cs.CV | eess.IV

Riccardo Fiorista, Awad Abdelhalim, Anson F. Stewart, Gabriel L. Pincus, Ian Thistle

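TL;DR: This study examines the potential of closed-circuit television (CCTV) data for real-time estimation of urban rail platform crowding using computer vision. It compares object detection, crowd-level classification, and semantic segmentation, and presents a novel linear-optimization-based counting approach.

Details

Motivation: Accurate platform crowding estimates are important for the safety, efficiency, and customer experience of transit operations, but current practice relies mainly on indirect data such as fare collection records or staff observations. CCTV footage is a promising source of accurate, real-time estimates.

Result: Tested on a privacy-preserving dataset of more than 600 hours of video, the computer vision approaches provide substantive value for crowd estimation, and CCTV data can support real-time responses independently of other data sources.

Insight: CCTV data can serve as a standalone data source that, combined with computer vision, enables precise real-time crowding estimation and supports operational decisions.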

Abstract: Accurately estimating urban rail platform occupancy can enhance transit agencies’ ability to make informed operational decisions, thereby improving safety, operational efficiency, and customer experience, particularly in the context of crowding. However, sensing real-time crowding remains challenging and often depends on indirect proxies such as automatic fare collection data or staff observations. Recently, Closed-Circuit Television (CCTV) footage has emerged as a promising data source with the potential to yield accurate, real-time occupancy estimates. The presented study investigates this potential by comparing three state-of-the-art computer vision approaches for extracting crowd-related features from platform CCTV imagery: (a) object detection and counting using YOLOv11, RT-DETRv2, and APGCC; (b) crowd-level classification via a custom-trained Vision Transformer, Crowd-ViT; and (c) semantic segmentation using DeepLabV3. Additionally, we present a novel, highly efficient linear-optimization-based approach to extract counts from the generated segmentation maps while accounting for image object depth and, thus, for passenger dispersion along a platform. Tested on a privacy-preserving dataset created in collaboration with the Washington Metropolitan Area Transit Authority (WMATA) that encompasses more than 600 hours of video material, our results demonstrate that computer vision approaches can provide substantive value for crowd estimation. This work demonstrates that CCTV image data, independent of other data sources available to a transit agency, can enable more precise real-time crowding estimation and, eventually, timely operational responses for platform crowding mitigation.

[12] Modular Transformer Architecture for Precision Agriculture Imaging cs.CV

Brian Gopalan, Nathalia Nascimento, Vishal Monga

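TL;DR: This paper proposes a modular transformer architecture that dynamically routes inputs according to the type of image degradation (such as blur and noise), significantly improving weed segmentation quality and computational efficiency for drone imagery in precision agriculture.

Details

Motivation: In precision agriculture, drone images are often degraded by blur or noise, which lowers segmentation accuracy. Existing CNN methods handle such degradation poorly, so an efficient and robust solution is needed.

Result: On weed segmentation, the system achieves higher segmentation quality and computational efficiency than existing CNN-based methods.

Insight: A modular dynamic routing strategy makes the model markedly more adaptable to image degradation, offering a new approach to agricultural image analysis.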

Abstract: This paper addresses the critical need for efficient and accurate weed segmentation from drone video in precision agriculture. A quality-aware modular deep-learning framework is proposed that addresses common image degradation by analyzing quality conditions, such as blur and noise, and routing inputs through specialized pre-processing and transformer models optimized for each degradation type. The system first analyzes drone images for noise and blur using Mean Absolute Deviation and the Laplacian. Data is then dynamically routed to one of three vision transformer models: a baseline for clean images, a modified transformer with Fisher Vector encoding for noise reduction, or another with an unrolled Lucy-Robinson decoder to correct blur. This novel routing strategy allows the system to outperform existing CNN-based methods in both segmentation quality and computational efficiency, demonstrating a significant advancement in deep-learning applications for agriculture.
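
The quality check and routing step can be sketched as follows. The abstract names the two statistics but not how they are applied, so this toy version makes several assumptions: both scores are computed on a 4-neighbor Laplacian response, low Laplacian variance is read as blur, high mean absolute deviation is read as noise, and the thresholds and route labels are hypothetical.

```python
import statistics

def laplacian(img):
    """4-neighbor Laplacian response over the interior of a 2D grayscale image."""
    h, w = len(img), len(img[0])
    return [img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1]
            - 4 * img[y][x]
            for y in range(1, h - 1) for x in range(1, w - 1)]

def mean_abs_dev(values):
    """Mean absolute deviation around the mean."""
    m = statistics.fmean(values)
    return statistics.fmean(abs(v - m) for v in values)

def route(img, noise_thresh=200.0, blur_thresh=50.0):
    """Pick a processing branch; thresholds are illustrative, not tuned."""
    lap = laplacian(img)
    if mean_abs_dev(lap) > noise_thresh:
        return "noise"   # would go to the noise-oriented model
    if statistics.pvariance(lap) < blur_thresh:
        return "blur"    # would go to the deblurring model
    return "clean"       # would go to the baseline model
```

Because both scores come from the same Laplacian response here, the router effectively splits images by high-frequency energy; a real system would likely use independent estimators and thresholds calibrated on labeled data.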

[13] Refine-IQA: Multi-Stage Reinforcement Finetuning for Perceptual Image Quality Assessment cs.CV | cs.AI

Ziheng Jia, Jiaying Qian, Zicheng Zhang, Zijian Chen, Xiongkuo Min

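TL;DR: This paper proposes Refine-IQA, a multi-stage reinforcement fine-tuning framework for image quality assessment (IQA) that strengthens the model's visual quality perception and supervises its "think" process, significantly improving performance.

Details

Motivation: Existing reinforcement fine-tuning (RFT)-based IQA methods verify only the model's outputs with rule-based rewards, do not supervise the "think" process, and do not explicitly enhance the model's low-level visual quality perception, which limits the performance upper bound.

Result: The Refine-IQA Series Models perform strongly on visual quality perception and scoring tasks and show a robust "think" capability on the quality interpreting benchmark.

Insight: Explicitly enhancing low-level visual perception and supervising the reasoning process are key to improving IQA performance.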

Abstract: Reinforcement fine-tuning (RFT) is a proliferating paradigm for LMM training. Analogous to high-level reasoning tasks, RFT is similarly applicable to low-level vision domains, including image quality assessment (IQA). Existing RFT-based IQA methods typically use rule-based output rewards to verify the model’s rollouts but provide no reward supervision for the “think” process, leaving its correctness and efficacy uncontrolled. Furthermore, these methods typically fine-tune directly on downstream IQA tasks without explicitly enhancing the model’s native low-level visual quality perception, which may constrain its performance upper bound. In response to these gaps, we propose the multi-stage RFT IQA framework (Refine-IQA). In Stage-1, we build the Refine-Perception-20K dataset (with 12 main distortions, 20,907 locally-distorted images, and over 55K RFT samples) and design multi-task reward functions to strengthen the model’s visual quality perception. In Stage-2, targeting the quality scoring task, we introduce a probability difference reward involved strategy for “think” process supervision. The resulting Refine-IQA Series Models achieve outstanding performance on both perception and scoring tasks, and notably, our paradigm activates a robust “think” (quality interpreting) capability that also attains exceptional results on the corresponding quality interpreting benchmark.

[14] HPSv3: Towards Wide-Spectrum Human Preference Score cs.CV

Yuhang Ma, Xiaoshi Wu, Keqiang Sun, Hongsheng Li

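TL;DR: HPSv3 is a wide-spectrum human preference score built on the new HPDv3 dataset and a VLM-based preference model; the accompanying Chain-of-Human-Preference (CoHP) method uses it to improve image generation quality.

Details

Motivation: Existing evaluation metrics for text-to-image generation are limited in data coverage, feature extraction, and loss functions, which makes them hard to align with human preference.

Result: HPSv3 is a robust metric for wide-spectrum image evaluation, and CoHP efficiently improves the quality of generated images.

Insight: Large-scale annotated data and iterative refinement can better align with human preference and make evaluating and improving generative models more efficient.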

Abstract: Evaluating text-to-image generation models requires alignment with human perception, yet existing human-centric metrics are constrained by limited data coverage, suboptimal feature extraction, and inefficient loss functions. To address these challenges, we introduce Human Preference Score v3 (HPSv3). (1) We release HPDv3, the first wide-spectrum human preference dataset integrating 1.08M text-image pairs and 1.17M annotated pairwise comparisons from state-of-the-art generative models and low to high-quality real-world images. (2) We introduce a VLM-based preference model trained using an uncertainty-aware ranking loss for fine-grained ranking. Besides, we propose Chain-of-Human-Preference (CoHP), an iterative image refinement method that enhances quality without extra data, using HPSv3 to select the best image at each step. Extensive experiments demonstrate that HPSv3 serves as a robust metric for wide-spectrum image evaluation, and CoHP offers an efficient and human-aligned approach to improve image generation quality. The code and dataset are available at the HPSv3 Homepage.

[15] Deep learning framework for crater detection and identification on the Moon and Mars cs.CV | cs.AI

Yihan Ma, Zeyang Yu, Rohitash Chandra

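TL;DR: This paper proposes a deep learning framework for detecting and identifying impact craters on the Moon and Mars, using a two-stage approach for crater identification and localisation.

Details

Motivation: Impact craters are among the most prominent geomorphological features on planetary surfaces and matter greatly to planetary science. Their spatial distribution and morphology provide key information on surface composition, geological history, and impact processes. Rapid advances in deep learning have driven interest in automated crater detection.

Result: Experiments show that YOLO gives the most balanced crater detection performance, while ResNet-50 identifies large craters with high precision.

Insight: The study shows the potential of deep learning models in planetary science, particularly the two-stage framework for crater detection and identification. The comparison of YOLO and ResNet-50 also offers guidance for future model selection.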

Abstract: Impact craters are among the most prominent geomorphological features on planetary surfaces and are of substantial significance in planetary science research. Their spatial distribution and morphological characteristics provide critical information on planetary surface composition, geological history, and impact processes. In recent years, the rapid advancement of deep learning models has fostered significant interest in automated crater detection. In this paper, we apply advancements in deep learning models for impact crater detection and identification. We use novel models, including Convolutional Neural Networks (CNNs) and variants such as YOLO and ResNet. We present a framework that features a two-stage approach where the first stage features crater identification using simple classic CNN, ResNet-50 and YOLO. In the second stage, our framework employs YOLO-based detection for crater localisation. Therefore, we detect and identify different types of craters and present a summary report with remote sensing data for a selected region. We consider selected regions for craters and identification from Mars and the Moon based on remote sensing data. Our results indicate that YOLO demonstrates the most balanced crater detection performance, while ResNet-50 excels in identifying large craters with high precision.

[16] Point-Based Shape Representation Generation with a Correspondence-Preserving Diffusion Model cs.CV | cs.LG

Shen Zhu, Yinzhu Jin, Ifrah Zawar, P. Thomas Fletcher

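TL;DR: This paper proposes a diffusion model for generating point-based shape representations that preserves the point correspondences present in the training data, and demonstrates strong performance on hippocampal shape generation.

Details

Motivation: Traditional statistical shape models consider point correspondences, whereas current deep learning methods mainly handle unordered point clouds and ignore them. The paper addresses this limitation with a diffusion model that generates point-based shapes with correspondences.

Result: The generated shapes are more realistic than those of existing methods while preserving correspondences, and the model works well in downstream tasks.

Insight: Introducing point correspondences offers a new perspective on shape generation, and in medical image analysis in particular it may be valuable for disease research and diagnostic modeling.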

Abstract: We propose a diffusion model designed to generate point-based shape representations with correspondences. Traditional statistical shape models have considered point correspondences extensively, but current deep learning methods do not take them into account, focusing on unordered point clouds instead. Current deep generative models for point clouds do not address generating shapes with point correspondences between generated shapes. This work aims to formulate a diffusion model that is capable of generating realistic point-based shape representations, which preserve point correspondences that are present in the training data. Using shape representation data with correspondences derived from Open Access Series of Imaging Studies 3 (OASIS-3), we demonstrate that our correspondence-preserving model effectively generates point-based hippocampal shape representations that are highly realistic compared to existing methods. We further demonstrate the applications of our generative model by downstream tasks, such as conditional generation of healthy and AD subjects and predicting morphological changes of disease progression by counterfactual generation.

[17] Policy to Assist Iteratively Local Segmentation: Optimising Modality and Location Selection for Prostate Cancer Localisation cs.CV | cs.AI

Xiangcen Wu, Shaheer U. Saeed, Yipei Wang, Ester Bonmati Coll, Yipeng Hu

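TL;DR: The paper proposes a recommender system in which a policy network dynamically selects the optimal imaging modality and regions of interest to improve prostate cancer segmentation.

Details

Motivation: Radiologists often combine multiple modalities and local image regions when reading images, but existing segmentation models lack this dynamic selection ability, which hurts segmentation efficiency and accuracy.

Result: Validated on 1325 labelled multiparametric MRI images, the approach surpasses standard segmentation networks, and the policy network independently developed a strategy that is not necessarily consistent with PI-RADS.

Insight: Dynamic selection strategies can improve segmentation performance, and the independently learned strategy suggests new interactive human-machine applications.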

Abstract: Radiologists often mix medical image reading strategies, including inspection of individual modalities and local image regions, using information at different locations from different images independently as well as concurrently. In this paper, we propose a recommend system to assist machine learning-based segmentation models, by suggesting appropriate image portions along with the best modality, such that prostate cancer segmentation performance can be maximised. Our approach trains a policy network that assists tumor localisation, by recommending both the optimal imaging modality and the specific sections of interest for review. During training, a pre-trained segmentation network mimics radiologist inspection on individual or variable combinations of these imaging modalities and their sections - selected by the policy network. Taking the locally segmented regions as an input for the next step, this dynamic decision making process iterates until all cancers are best localised. We validate our method using a data set of 1325 labelled multiparametric MRI images from prostate cancer patients, demonstrating its potential to improve annotation efficiency and segmentation accuracy, especially when challenging pathology is present. Experimental results show that our approach can surpass standard segmentation networks. Perhaps more interestingly, our trained agent independently developed its own optimal strategy, which may or may not be consistent with current radiologist guidelines such as PI-RADS. This observation also suggests a promising interactive application, in which the proposed policy networks assist human radiologists.

[18] Scaling Up Audio-Synchronized Visual Animation: An Efficient Training Paradigm cs.CV

Lin Zhang, Zefan Cai, Yufan Zhou, Shentong Mo, Jinhong Lin

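TL;DR: This paper proposes an efficient two-stage training paradigm that uses abundant noisy videos to scale up audio-synchronized visual animation, greatly reducing reliance on manually curated high-quality videos.

Details

Motivation: Existing methods depend on expensive manually curated videos and are hard to scale to diverse open-world audio-video classes.

Result: Experiments show that the method reduces the need for manual curation by more than 10x while generalizing to many open classes.

Insight: Making good use of noisy data together with a small amount of high-quality data can substantially improve scalability and synchronization.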

Abstract: Recent advances in audio-synchronized visual animation enable control of video content using audios from specific classes. However, existing methods rely heavily on expensive manual curation of high-quality, class-specific training videos, posing challenges to scaling up to diverse audio-video classes in the open world. In this work, we propose an efficient two-stage training paradigm to scale up audio-synchronized visual animation using abundant but noisy videos. In stage one, we automatically curate large-scale videos for pretraining, allowing the model to learn diverse but imperfect audio-video alignments. In stage two, we finetune the model on manually curated high-quality examples, but only at a small scale, significantly reducing the required human effort. We further enhance synchronization by allowing each frame to access rich audio context via multi-feature conditioning and window attention. To efficiently train the model, we leverage pretrained text-to-video generator and audio encoders, introducing only 1.9% additional trainable parameters to learn audio-conditioning capability without compromising the generator’s prior knowledge. For evaluation, we introduce AVSync48, a benchmark with videos from 48 classes, which is 3$\times$ more diverse than previous benchmarks. Extensive experiments show that our method significantly reduces reliance on manual curation by over 10$\times$, while generalizing to many open classes.

[19] RAVID: Retrieval-Augmented Visual Detection: A Knowledge-Driven Approach for AI-Generated Image Identification cs.CV | cs.CR | cs.IRPDF

Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed, Abdenour Hadid

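TL;DR: RAVID is an AI-generated image detection framework based on visual retrieval-augmented generation (RAG). It dynamically retrieves relevant images to strengthen detection, combines a fine-tuned CLIP encoder with a vision-language model (VLM), reaches 93.85% average accuracy on the UniversalFakeDetect benchmark, and is more robust under image degradation.

Details

Motivation: Existing AI-generated image detectors rely on low-level artifacts and model-specific features, so they generalize poorly and lack robustness. RAG works well for text, but its use in the visual domain remains underexplored. RAVID aims to fill this gap by using retrieval augmentation to improve detection.

Result: RAVID achieves 93.85% average accuracy on UniversalFakeDetect, outperforming existing methods. Under image degradation (Gaussian blur and JPEG compression), its average accuracy is 80.27%, well above the 63.44% of C2P-CLIP.

Insight: Visual RAG can improve the generalization and robustness of AI-generated image detection; combining fine-tuned pre-trained models with multimodal fusion is an important direction for future work.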

Abstract: In this paper, we introduce RAVID, the first framework for AI-generated image detection that leverages visual retrieval-augmented generation (RAG). While RAG methods have shown promise in mitigating factual inaccuracies in foundation models, they have primarily focused on text, leaving visual knowledge underexplored. Meanwhile, existing detection methods, which struggle with generalization and robustness, often rely on low-level artifacts and model-specific features, limiting their adaptability. To address this, RAVID dynamically retrieves relevant images to enhance detection. Our approach utilizes a fine-tuned CLIP image encoder, RAVID CLIP, enhanced with category-related prompts to improve representation learning. We further integrate a vision-language model (VLM) to fuse retrieved images with the query, enriching the input and improving accuracy. Given a query image, RAVID generates an embedding using RAVID CLIP, retrieves the most relevant images from a database, and combines these with the query image to form an enriched input for a VLM (e.g., Qwen-VL or Openflamingo). Experiments on the UniversalFakeDetect benchmark, which covers 19 generative models, show that RAVID achieves state-of-the-art performance with an average accuracy of 93.85%. RAVID also outperforms traditional methods in terms of robustness, maintaining high accuracy even under image degradations such as Gaussian blur and JPEG compression. Specifically, RAVID achieves an average accuracy of 80.27% under degradation conditions, compared to 63.44% for the state-of-the-art model C2P-CLIP, demonstrating consistent improvements in both Gaussian blur and JPEG compression scenarios. The code will be publicly available upon acceptance.
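The retrieval step described in the abstract (embed the query, then fetch the most relevant database images) can be illustrated with a small sketch. This is not RAVID's implementation: the embeddings below are hand-written toy vectors standing in for RAVID CLIP outputs, cosine similarity is an assumed ranking criterion, and `retrieve_top_k` and the database layout are hypothetical names chosen for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_embedding, database, k=2):
    """Return the k entries most similar to the query (hypothetical helper).

    `database` is a list of (image_id, label, embedding) tuples; the result
    is a list of (similarity, image_id, label), best match first.
    """
    scored = [
        (cosine_similarity(query_embedding, emb), image_id, label)
        for image_id, label, emb in database
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]

# Toy 3-dimensional embeddings; a real image encoder produces far larger vectors.
database = [
    ("img_real_01", "real", [0.9, 0.1, 0.0]),
    ("img_fake_01", "fake", [0.1, 0.9, 0.2]),
    ("img_fake_02", "fake", [0.0, 0.8, 0.6]),
    ("img_real_02", "real", [0.8, 0.3, 0.1]),
]
query = [0.05, 0.85, 0.3]
neighbours = retrieve_top_k(query, database, k=2)
```

In the pipeline the abstract describes, the retrieved images would then be combined with the query image as input to a VLM; that fusion step is not modelled here.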

[20] Investigating the Impact of Large-Scale Pre-training on Nutritional Content Estimation from 2D Images cs.CVPDF

Michele Andrade, Guilherme A. L. Silva, Valéria Santos, Gladston Moreira, Eduardo Luz

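TL;DR: This paper studies how large-scale pre-training affects nutritional content estimation from 2D images alone, comparing public datasets (ImageNet and COYO) against a proprietary one (JFT-300M). Models pre-trained on JFT-300M significantly outperform those pre-trained on public datasets, and COYO unexpectedly performs worse than ImageNet.

Details

Motivation: Nutritional content estimation is important for health and dietary monitoring, but existing methods rely on proprietary datasets for pre-training, which limits reproducibility. The paper explores how well public datasets serve this task and how their performance differs.

Result: Models pre-trained on JFT-300M perform best, while the COYO pre-trained model performs worse than the ImageNet one, contrary to the initial hypothesis. This highlights the importance of the choice of pre-training dataset.

Insight: The scale, domain relevance, and quality of the pre-training dataset are critical for nutritional estimation. Public datasets can be used for this task, but they may underperform proprietary ones, especially when domain relevance is lacking.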

Abstract: Estimating the nutritional content of food from images is a critical task with significant implications for health and dietary monitoring. This is challenging, especially when relying solely on 2D images, due to the variability in food presentation, lighting, and the inherent difficulty in inferring volume and mass without depth information. Furthermore, reproducibility in this domain is hampered by the reliance of state-of-the-art methods on proprietary datasets for large-scale pre-training. In this paper, we investigate the impact of large-scale pre-training datasets on the performance of deep learning models for nutritional estimation using only 2D images. We fine-tune and evaluate Vision Transformer (ViT) models pre-trained on two large public datasets, ImageNet and COYO, comparing their performance against baseline CNN models (InceptionV2 and ResNet-50) and a state-of-the-art method pre-trained on the proprietary JFT-300M dataset. We conduct extensive experiments on the Nutrition5k dataset, a large-scale collection of real-world food plates with high-precision nutritional annotations. Our evaluation using Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAE%) reveals that models pre-trained on JFT-300M significantly outperform those pre-trained on public datasets. Unexpectedly, the model pre-trained on the massive COYO dataset performs worse than the model pre-trained on ImageNet for this specific regression task, refuting our initial hypothesis. Our analysis provides quantitative evidence highlighting the critical role of pre-training dataset characteristics, including scale, domain relevance, and curation quality, for effective transfer learning in 2D nutritional estimation.
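The two metrics named in the abstract can be written down in a few lines. The abstract does not define MAE%, so this sketch assumes one common convention: MAE expressed as a percentage of the mean ground-truth value. The function names and the calorie figures are illustrative only.

```python
def mean_absolute_error(predictions, targets):
    """Average absolute difference between predictions and ground truth."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and equal in length")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)

def mae_percent(predictions, targets):
    """MAE as a percentage of the mean target value (assumed definition)."""
    mean_target = sum(targets) / len(targets)
    return 100.0 * mean_absolute_error(predictions, targets) / mean_target

# Made-up calorie values for three plates.
true_calories = [200.0, 400.0, 600.0]
predicted_calories = [220.0, 380.0, 630.0]

mae = mean_absolute_error(predicted_calories, true_calories)
mae_pct = mae_percent(predicted_calories, true_calories)
```

Normalizing by the mean target makes errors comparable across quantities with different units and ranges, such as calories and grams of protein.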

[21] CAD-Judge: Toward Efficient Morphological Grading and Verification for Text-to-CAD Generation cs.CVPDF

Zheyuan Zhou, Jiayi Han, Liang Du, Naiyu Fang, Lemiao Qiu

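TL;DR: The paper proposes CAD-Judge, a novel reward system for efficient morphological grading and verification in text-to-CAD generation, addressing the inefficiency and reward hacking of conventional approaches.

Details

Motivation: Conventional text-to-CAD systems suffer from slow rendering, and VLM-based evaluation is costly and can induce reward hacking, so an efficient and reliable evaluation method is needed.

Result: Experiments on several challenging CAD datasets show that the method achieves state-of-the-art performance while remaining efficient.

Insight: A compiler module can serve as a fast, direct reward signal, prospect theory helps optimize the alignment of the generative model, and the verification module further improves system robustness.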

Abstract: Computer-Aided Design (CAD) models are widely used across industrial design, simulation, and manufacturing processes. Text-to-CAD systems aim to generate editable, general-purpose CAD models from textual descriptions, significantly reducing the complexity and entry barrier associated with traditional CAD workflows. However, rendering CAD models can be slow, and deploying VLMs to review CAD models can be expensive and may introduce reward hacking that degrades the systems. To address these challenges, we propose CAD-Judge, a novel, verifiable reward system for efficient and effective CAD preference grading and grammatical validation. We adopt the Compiler-as-a-Judge Module (CJM) as a fast, direct reward signal, optimizing model alignment by maximizing generative utility through prospect theory. To further improve the robustness of Text-to-CAD in the testing phase, we introduce a simple yet effective agentic CAD generation approach and adopt the Compiler-as-a-Review Module (CRM), which efficiently verifies the generated CAD models, enabling the system to refine them accordingly. Extensive experiments on challenging CAD datasets demonstrate that our method achieves state-of-the-art performance while maintaining superior efficiency.

[22] $\text{S}^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation cs.CVPDF

Weilun Feng, Haotong Qin, Chuanguang Yang, Xiangqi Li, Han Yang

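TL;DR: The paper proposes $\text{S}^2$Q-VDiT, a post-training quantization framework for video diffusion models. Salient data selection and sparse token distillation address the calibration and learning challenges caused by long token sequences, enabling accurate quantization and efficient inference.

Details

Motivation: Video diffusion models (V-DMs) jointly model spatial and temporal information, which produces very long token sequences and raises calibration variance and learning difficulty. Conventional quantization methods struggle with these problems.

Result: Under W4A6 quantization the method achieves lossless performance, with 3.9× model compression and 1.3× faster inference.

Insight: Salient data selection and sparse token distillation effectively ease the quantization challenges that long token sequences create in video diffusion models, and offer a reference point for quantizing other models of high-dimensional data.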

Abstract: Diffusion transformers have emerged as the mainstream paradigm for video generation models. However, the use of up to billions of parameters incurs significant computational costs. Quantization offers a promising solution by reducing memory usage and accelerating inference. Nonetheless, we observe that the joint modeling of spatial and temporal information in video diffusion models (V-DMs) leads to extremely long token sequences, which introduces high calibration variance and learning challenges. To address these issues, we propose $\text{S}^2$Q-VDiT, a post-training quantization framework for V-DMs that leverages Salient data and Sparse token distillation. During the calibration phase, we identify that quantization performance is highly sensitive to the choice of calibration data. To mitigate this, we introduce Hessian-aware Salient Data Selection, which constructs high-quality calibration datasets by considering both diffusion and quantization characteristics unique to V-DMs. To tackle the learning challenges, we further analyze the sparse attention patterns inherent in V-DMs. Based on this observation, we propose Attention-guided Sparse Token Distillation, which exploits token-wise attention distributions to emphasize tokens that are more influential to the model’s output. Under W4A6 quantization, $\text{S}^2$Q-VDiT achieves lossless performance while delivering $3.9\times$ model compression and $1.3\times$ inference acceleration. Code will be available at https://github.com/wlfeng0509/s2q-vdit.
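To make "W4" concrete, here is a minimal sketch of uniform symmetric quantization to 4-bit integer codes, followed by dequantization. It is a generic textbook quantizer, not the calibration or distillation procedure of $\text{S}^2$Q-VDiT; the per-tensor max-abs scale and the name `quantize_symmetric` are assumptions made for illustration.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integer codes of width `bits`, then back to floats.

    Uses a single per-tensor scale derived from the largest magnitude
    (an assumed, simple calibration rule).
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit codes
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    dequantized = [c * scale for c in codes]
    return codes, dequantized, scale

weights = [0.42, -0.17, 0.03, -0.88, 0.65, 0.0]
codes, approx, scale = quantize_symmetric(weights, bits=4)
max_error = max(abs(w - a) for w, a in zip(weights, approx))
```

With round-to-nearest and no clipping, the reconstruction error of each value is at most half a quantization step (`scale / 2`). That bound grows when a few large outliers inflate the scale, which is one reason the choice of calibration data matters for quantization.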

[23] Can Large Multimodal Models Actively Recognize Faulty Inputs? A Systematic Evaluation Framework of Their Input Scrutiny Ability cs.CVPDF

Haiqi Yang, Jinzhe Li, Gengxu Li, Yi Chang, Yuan Wu

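TL;DR: The paper presents ISEval, a systematic evaluation framework that tests whether large multimodal models (LMMs) can proactively identify faulty inputs. Most models struggle to detect flawed textual premises without guidance, and performance varies markedly with error type.

Details

Motivation: Prior work shows that large language models tend to passively accept defective inputs, but whether LMMs can actively detect erroneous inputs has not been explored. Filling this gap is the paper's main motivation.

Result: Most models struggle to identify textual errors without guidance. They do well on logical fallacies but poorly on surface-level linguistic errors and conditional flaws, and modality trust differs markedly across models.

Insight: LMMs depend strongly on explicit prompts and show unbalanced modality trust (for example, some models over-rely on text), which underscores the need to strengthen proactive verification of input validity.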

Abstract: Large Multimodal Models (LMMs) have witnessed remarkable growth, showcasing formidable capabilities in handling intricate multimodal tasks with exceptional performance. Recent research has underscored the inclination of large language models to passively accept defective inputs, often resulting in futile reasoning on invalid prompts. However, the same critical question of whether LMMs can actively detect and scrutinize erroneous inputs still remains unexplored. To address this gap, we introduce the Input Scrutiny Ability Evaluation Framework (ISEval), which encompasses seven categories of flawed premises and three evaluation metrics. Our extensive evaluation of ten advanced LMMs has identified key findings. Most models struggle to actively detect flawed textual premises without guidance, which reflects a strong reliance on explicit prompts for premise error identification. Error type affects performance: models excel at identifying logical fallacies but struggle with surface-level linguistic errors and certain conditional flaws. Modality trust varies: Gemini 2.5 pro and Claude Sonnet 4 balance visual and textual information, while aya-vision-8b over-relies on text when the two conflict. These findings underscore the urgent need to enhance LMMs’ proactive verification of input validity and offer new insights into mitigating the problem. The code is available at https://github.com/MLGroupJLU/LMM_ISEval.

[24] Dual Prompt Learning for Adapting Vision-Language Models to Downstream Image-Text Retrieval cs.CV | cs.IRPDF

Yifan Wang, Tao Wang, Chenwei Tang, Caiyang Yu, Zhengqing Zang

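TL;DR: The paper proposes DCAR, a dual-prompt learning framework that jointly optimizes attribute and category features to improve vision-language models on image-text retrieval.

Details

Motivation: Existing prompt learning methods perform well on image classification but face challenges in image-text retrieval, especially in distinguishing fine-grained attributes and similar subcategories.

Result: Experiments on the FDRD dataset show that DCAR outperforms existing baselines and achieves state-of-the-art performance.

Insight: Joint optimization at the attribute and category levels effectively improves fine-grained representation learning and addresses the challenges of the downstream task.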

Abstract: Recently, prompt learning has demonstrated remarkable success in adapting pre-trained Vision-Language Models (VLMs) to various downstream tasks such as image classification. However, its application to the downstream Image-Text Retrieval (ITR) task is more challenging. We find that the challenge lies in discriminating both fine-grained attributes and similar subcategories of the downstream data. To address this challenge, we propose Dual prompt Learning with Joint Category-Attribute Reweighting (DCAR), a novel dual-prompt learning framework to achieve precise image-text matching. The framework dynamically adjusts prompt vectors from both semantic and visual dimensions to improve the performance of CLIP on the downstream ITR task. Based on the prompt paradigm, DCAR jointly optimizes attribute and class features to enhance fine-grained representation learning. Specifically, (1) at the attribute level, it dynamically updates the weights of attribute descriptions based on text-image mutual information correlation; (2) at the category level, it introduces negative samples from multiple perspectives with category-matching weighting to learn subcategory distinctions. To validate our method, we construct the Fine-class Described Retrieval Dataset (FDRD), which serves as a challenging benchmark for ITR in downstream data domains. It covers over 1,500 downstream fine categories and 230,000 image-caption pairs with detailed attribute annotations. Extensive experiments on FDRD demonstrate that DCAR achieves state-of-the-art performance over existing baselines.

[25] VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning cs.CVPDF

Yuheng Ji, Yipu Wang, Yuyang Liu, Xiaoshuai Hao, Yue Liu

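TL;DR: The paper introduces VisualTrans, a benchmark for evaluating visual transformation reasoning in real-world settings. It addresses shortcomings of existing benchmarks and provides a systematic evaluation across several reasoning dimensions.

Details

Motivation: Existing benchmarks suffer from a sim-to-real gap, limited task complexity, and incomplete reasoning coverage, which limits their use in practical scenarios.

Result: Leading vision-language models perform well on static spatial tasks but show clear weaknesses in dynamic, multi-step reasoning tasks such as intermediate state recognition and transformation sequence planning.

Insight: The study exposes fundamental weaknesses of current models in temporal modeling and causal reasoning, pointing the way toward more general and capable visual transformation reasoning systems.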

Abstract: Visual transformation reasoning (VTR) is a vital cognitive capability that empowers intelligent agents to understand dynamic scenes, model causal relationships, and predict future states, thereby guiding actions and laying the foundation for advanced intelligent systems. However, existing benchmarks suffer from a sim-to-real gap, limited task complexity, and incomplete reasoning coverage, limiting their practical use in real-world scenarios. To address these limitations, we introduce VisualTrans, the first comprehensive benchmark specifically designed for VTR in real-world human-object interaction scenarios. VisualTrans encompasses 12 semantically diverse manipulation tasks and systematically evaluates three essential reasoning dimensions - spatial, procedural, and quantitative - through 6 well-defined subtask types. The benchmark features 472 high-quality question-answer pairs in various formats, including multiple-choice, open-ended counting, and target enumeration. We introduce a scalable data construction pipeline built upon first-person manipulation videos, which integrates task selection, image pair extraction, automated metadata annotation with large multimodal models, and structured question generation. Human verification ensures the final benchmark is both high-quality and interpretable. Evaluations of various state-of-the-art vision-language models show strong performance in static spatial tasks. However, they reveal notable shortcomings in dynamic, multi-step reasoning scenarios, particularly in areas like intermediate state recognition and transformation sequence planning. These findings highlight fundamental weaknesses in temporal modeling and causal reasoning, providing clear directions for future research aimed at developing more capable and generalizable VTR systems. The dataset and code are available at https://github.com/WangYipu2002/VisualTrans.

[26] Iterative pseudo-labeling based adaptive copy-paste supervision for semi-supervised tumor segmentation cs.CVPDF

Qiangguo Jin, Hui Cui, Junbo Wang, Changming Sun, Yimiao He

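TL;DR: The paper proposes iterative pseudo-labeling based adaptive copy-paste supervision (IPA-CP) for tumor segmentation in CT scans, targeting the difficulty semi-supervised learning has with small or numerous tumors.

Details

Motivation: Existing semi-supervised methods focus mainly on segmenting large organs and neglect the challenge of small or numerous tumors. They also do not fully exploit data augmentation across labeled and unlabeled data.

Result: Experiments on in-house and public datasets show that IPA-CP outperforms existing semi-supervised methods. Ablation studies confirm the effectiveness of the technical contributions.

Insight: An adaptive augmentation mechanism combined with an iterative pseudo-labeling strategy can substantially improve tumor segmentation, especially when tumors are small or numerous.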

Abstract: Semi-supervised learning (SSL) has attracted considerable attention in medical image processing. The latest SSL methods use a combination of consistency regularization and pseudo-labeling to achieve remarkable success. However, most existing SSL studies focus on segmenting large organs, neglecting the challenging scenarios where there are numerous tumors or tumors of small volume. Furthermore, the extensive capabilities of data augmentation strategies, particularly in the context of both labeled and unlabeled data, have yet to be thoroughly investigated. To tackle these challenges, we introduce a straightforward yet effective approach, termed iterative pseudo-labeling based adaptive copy-paste supervision (IPA-CP), for tumor segmentation in CT scans. IPA-CP incorporates a two-way uncertainty based adaptive augmentation mechanism, aiming to inject tumor uncertainties present in the mean teacher architecture into adaptive augmentation. Additionally, IPA-CP employs an iterative pseudo-label transition strategy to generate more robust and informative pseudo labels for the unlabeled samples. Extensive experiments on both in-house and public datasets show that our framework outperforms state-of-the-art SSL methods in medical image segmentation. Ablation study results demonstrate the effectiveness of our technical contributions.

[27] Motion is the Choreographer: Learning Latent Pose Dynamics for Seamless Sign Language Generation cs.CVPDF

Jiayi He, Xu Wang, Shengeng Tang, Yaxiong Wang, Lechao Cheng

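TL;DR: The paper proposes a new sign language video generation framework that decouples motion semantics from signer identity, enabling flexible, high-quality generation.

Details

Motivation: Conventional methods depend heavily on signer-specific data and generalize poorly, so the coupling between motion semantics and identity needs to be resolved.

Result: Experiments show that the method performs strongly in both generation quality and signer personalization.

Insight: Decoupling motion from identity is not only feasible but effective, and it provides a more flexible basis for motion generation.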

Abstract: Sign language video generation requires producing natural signing motions with realistic appearances under precise semantic control, yet faces two critical challenges: excessive signer-specific data requirements and poor generalization. We propose a new paradigm for sign language video generation that decouples motion semantics from signer identity through a two-phase synthesis framework. First, we construct a signer-independent multimodal motion lexicon, where each gloss is stored as identity-agnostic pose, gesture, and 3D mesh sequences, requiring only one recording per sign. This compact representation enables our second key innovation: a discrete-to-continuous motion synthesis stage that transforms retrieved gloss sequences into temporally coherent motion trajectories, followed by identity-aware neural rendering to produce photorealistic videos of arbitrary signers. Unlike prior work constrained by signer-specific datasets, our method treats motion as a first-class citizen: the learned latent pose dynamics serve as a portable “choreography layer” that can be visually realized through different human appearances. Extensive experiments demonstrate that disentangling motion from identity is not just viable but advantageous - enabling both high-quality synthesis and unprecedented flexibility in signer personalization.

[28] TCSAFormer: Efficient Vision Transformer with Token Compression and Sparse Attention for Medical Image Segmentation cs.CVPDF

Zunhui Xia, Hongxing Li, Libin Lan

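TL;DR: TCSAFormer is an efficient medical image segmentation network. A compressed attention module and a dual-branch feed-forward network address the computational complexity of Transformers and their weakness in capturing local features, and the model performs strongly on several datasets.

Details

Motivation: Current Transformer-based medical image segmentation methods have high computational complexity (quadratic in the input sequence length) and cannot effectively capture local contextual information.

Result: On ISIC-2018, CVC-ClinicDB, and Synapse, TCSAFormer outperforms SOTA methods with lower computational overhead.

Insight: By combining token compression with multiscale local feature capture, TCSAFormer balances efficiency and accuracy, offering a practical approach to large-scale medical image segmentation.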

Abstract: In recent years, transformer-based methods have achieved remarkable progress in medical image segmentation due to their superior ability to capture long-range dependencies. However, these methods typically suffer from two major limitations. First, their computational complexity scales quadratically with the input sequences. Second, the feed-forward network (FFN) modules in vanilla Transformers typically rely on fully connected layers, which limits models’ ability to capture local contextual information and multiscale features critical for precise semantic segmentation. To address these issues, we propose an efficient medical image segmentation network, named TCSAFormer. The proposed TCSAFormer adopts two key ideas. First, it incorporates a Compressed Attention (CA) module, which combines token compression and pixel-level sparse attention to dynamically focus on the most relevant key-value pairs for each query. This is achieved by pruning globally irrelevant tokens and merging redundant ones, significantly reducing computational complexity while enhancing the model’s ability to capture relationships between tokens. Second, it introduces a Dual-Branch Feed-Forward Network (DBFFN) module as a replacement for the standard FFN to capture local contextual features and multiscale information, thereby strengthening the model’s feature representation capability. We conduct extensive experiments on three publicly available medical image segmentation datasets: ISIC-2018, CVC-ClinicDB, and Synapse, to evaluate the segmentation performance of TCSAFormer. Experimental results demonstrate that TCSAFormer achieves superior performance compared to existing state-of-the-art (SOTA) methods, while maintaining lower computational overhead, thus achieving an optimal trade-off between efficiency and accuracy.

[29] Beyond the Visible: Benchmarking Occlusion Perception in Multimodal Large Language Models cs.CVPDF

Zhaochen Liu, Kaiwen Gao, Shuyi Liang, Bin Xiao, Limeng Qiao

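TL;DR: The paper introduces O-Bench, the first visual question answering (VQA) benchmark for evaluating occlusion perception in multimodal large language models (MLLMs). A layered synthesis approach yields 1,365 images with semantically coherent occlusion scenarios, annotated with 4,588 question-answer pairs. Experiments show a significant gap between current MLLMs and the human baseline that neither model scaling nor a thinking process sufficiently closes.

Details

Motivation: Occlusion perception is a foundation of human spatial understanding, yet how well MLLMs handle it has not been studied thoroughly. The authors therefore propose the first benchmark dedicated to occlusion perception.

Result: Current MLLMs fall significantly short of the human baseline on occlusion perception tasks, and the gap is not sufficiently closed by model scaling or a thinking process. The authors also identify three typical failure patterns.

Insight: Occlusion perception is a weak point in the visual reasoning of MLLMs and deserves more attention in future research. O-Bench provides an important tool for evaluating and advancing the visual intelligence of MLLMs.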

Abstract: Occlusion perception, a critical foundation for human-level spatial understanding, embodies the challenge of integrating visual recognition and reasoning. Though multimodal large language models (MLLMs) have demonstrated remarkable capabilities, their performance on occlusion perception remains under-explored. To address this gap, we introduce O-Bench, the first visual question answering (VQA) benchmark specifically designed for occlusion perception. Based on SA-1B, we construct 1,365 images featuring semantically coherent occlusion scenarios through a novel layered synthesis approach. Upon this foundation, we annotate 4,588 question-answer pairs in total across five tailored tasks, employing a reliable, semi-automatic workflow. Our extensive evaluation of 22 representative MLLMs against the human baseline reveals a significant performance gap between current MLLMs and humans, which, we find, cannot be sufficiently bridged by model scaling or thinking process. We further identify three typical failure patterns, including an overly conservative bias, a fragile gestalt prediction, and a struggle with quantitative tasks. We believe O-Bench can not only provide a vital evaluation tool for occlusion perception, but also inspire the development of MLLMs for better visual intelligence. Our benchmark will be made publicly available upon paper publication.

[30] TNet: Terrace Convolutional Decoder Network for Remote Sensing Image Semantic Segmentation cs.CVPDF

Chengqian Dai, Yonghong Guo, Hongzhao Xiang, Yigui Luo

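TL;DR: TNet is a terrace-style decoder network for remote sensing image semantic segmentation that uses only convolution and addition, progressively fusing multi-resolution features to improve global-local information interaction.

Details

Motivation: Existing methods (such as UNet combined with Transformers or Mamba) usually focus on feature interactions within a single scale and neglect global contextual dependencies across resolutions. TNet addresses this with a terrace-style convolutional decoder.

Result: The model performs well on ISPRS Vaihingen (85.35% mIoU), ISPRS Potsdam (87.05%), and LoveDA (52.19%), with high computational efficiency.

Insight: With a terrace-style design, convolution alone can fuse global and local information without complex modules such as Transformers, which suggests a new direction for lightweight segmentation networks.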

Abstract: In remote sensing, most segmentation networks adopt the UNet architecture, often incorporating modules such as Transformers or Mamba to enhance global-local feature interactions within decoder stages. However, these enhancements typically focus on intra-scale relationships and neglect the global contextual dependencies across multiple resolutions. To address this limitation, we introduce the Terrace Convolutional Decoder Network (TNet), a simple yet effective architecture that leverages only convolution and addition operations to progressively integrate low-resolution features (rich in global context) into higher-resolution features (rich in local details) across decoding stages. This progressive fusion enables the model to learn spatially-aware convolutional kernels that naturally blend global and local information in a stage-wise manner. We implement TNet with a ResNet-18 encoder (TNet-R) and evaluate it on three benchmark datasets. TNet-R achieves competitive performance with a mean Intersection-over-Union (mIoU) of 85.35% on ISPRS Vaihingen, 87.05% on ISPRS Potsdam, and 52.19% on LoveDA, while maintaining high computational efficiency. Code is publicly available.
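The coarse-to-fine fusion described in the abstract can be sketched on plain nested lists. This is a deliberately reduced illustration: it keeps only the upsample-and-add step and omits the learned convolutions that TNet applies, and `terrace_fuse`, nearest-neighbour upsampling, and the 2× scale factor between levels are all assumptions made here, not details taken from the paper.

```python
def upsample_nearest_2x(feature_map):
    """Double the height and width of a 2D map by repeating each value."""
    out = []
    for row in feature_map:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def add_maps(a, b):
    """Element-wise sum of two 2D maps with identical shapes."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def terrace_fuse(pyramid):
    """Fuse feature maps ordered coarse to fine (hypothetical helper).

    At each stage the running result is upsampled and added to the next,
    higher-resolution map, so global context flows into local detail.
    """
    fused = pyramid[0]
    for finer in pyramid[1:]:
        fused = add_maps(upsample_nearest_2x(fused), finer)
    return fused

coarse = [[1.0]]                                # 1x1, global context
mid = [[0.1, 0.2], [0.3, 0.4]]                  # 2x2
fine = [[0.0] * 4 for _ in range(4)]            # 4x4, local detail
result = terrace_fuse([coarse, mid, fine])
```

In a real decoder, a convolution after each addition would let the network learn how to blend the two sources rather than summing them with fixed weights.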

[31] Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework cs.CVPDF

Yi-Ting Chen, Ting-Hsuan Liao, Pengsheng Guo, Alexander Schwing, Jia-Bin Huang

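TL;DR: The paper proposes 3DSR, a super-resolution framework based on 3D Gaussian splatting that uses off-the-shelf diffusion models for 2D super-resolution while enforcing 3D consistency.

Details

Motivation: Existing super-resolution approaches (such as image upsampling or video super-resolution) either ignore 3D consistency or consider it only implicitly. 3DSR aims for cross-view 3D consistency through an explicit 3D Gaussian splatting representation.

Result: Experiments show that 3DSR produces high-resolution images while preserving 3D structural consistency, with better visual quality than conventional methods.

Insight: An explicit 3D representation is the key to incorporating 2D super-resolution models, improving visual quality without sacrificing 3D consistency.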

Abstract: We propose 3D Super Resolution (3DSR), a novel 3D Gaussian-splatting-based super-resolution framework that leverages off-the-shelf diffusion-based 2D super-resolution models. 3DSR encourages 3D consistency across views via the use of an explicit 3D Gaussian-splatting-based scene representation. This makes the proposed 3DSR different from prior work, such as image upsampling or the use of video super-resolution, which either don’t consider 3D consistency or aim to incorporate 3D consistency implicitly. Notably, our method enhances visual quality without additional fine-tuning, ensuring spatial coherence within the reconstructed scene. We evaluate 3DSR on MipNeRF360 and LLFF data, demonstrating that it produces high-resolution results that are visually compelling, while maintaining structural consistency in 3D reconstructions. Code will be released.

[32] DET-GS: Depth- and Edge-Aware Regularization for High-Fidelity 3D Gaussian Splatting cs.CV | cs.AIPDF

Zexu Huang, Min Xu, Stuart Perry

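TL;DR: DET-GS is a depth- and edge-aware regularization framework for 3D Gaussian Splatting (3DGS). Hierarchical geometric depth supervision and edge-aware regularization substantially improve geometric reconstruction accuracy and visual quality under sparse-view conditions.

Details

Motivation: Current 3DGS methods struggle to reconstruct geometry accurately from sparse views. Non-local depth regularization fails to capture fine-grained structure and is sensitive to depth estimation noise, while conventional smoothing ignores semantic boundaries.

Result: On sparse-view novel view synthesis benchmarks, DET-GS surpasses SOTA methods in both geometric accuracy and visual fidelity.

Insight: Regularization that combines depth and edge information effectively improves sparse-view 3D reconstruction, and the edge-aware mechanism is essential for preserving high-frequency detail.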

Abstract: 3D Gaussian Splatting (3DGS) represents a significant advancement in the field of efficient and high-fidelity novel view synthesis. Despite recent progress, achieving accurate geometric reconstruction under sparse-view conditions remains a fundamental challenge. Existing methods often rely on non-local depth regularization, which fails to capture fine-grained structures and is highly sensitive to depth estimation noise. Furthermore, traditional smoothing methods neglect semantic boundaries and indiscriminately degrade essential edges and textures, consequently limiting the overall quality of reconstruction. In this work, we propose DET-GS, a unified depth and edge-aware regularization framework for 3D Gaussian Splatting. DET-GS introduces a hierarchical geometric depth supervision framework that adaptively enforces multi-level geometric consistency, significantly enhancing structural fidelity and robustness against depth estimation noise. To preserve scene boundaries, we design an edge-aware depth regularization guided by semantic masks derived from Canny edge detection. Furthermore, we introduce an RGB-guided edge-preserving Total Variation loss that selectively smooths homogeneous regions while rigorously retaining high-frequency details and textures. Extensive experiments demonstrate that DET-GS achieves substantial improvements in both geometric accuracy and visual fidelity, outperforming state-of-the-art (SOTA) methods on sparse-view novel view synthesis benchmarks.
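The idea behind an image-guided, edge-preserving Total Variation term can be shown with a small sketch. The abstract does not give the loss formula, so this uses a common form from the depth estimation literature as an assumption: each depth gradient is down-weighted by `exp(-sharpness * |image gradient|)`. The image is treated as a single grayscale channel for brevity, and `edge_aware_tv` and `sharpness` are hypothetical names.

```python
import math

def edge_aware_tv(depth, image, sharpness=10.0):
    """Mean absolute depth gradient, down-weighted where the image has edges.

    `depth` and `image` are 2D lists of the same shape. The exponential
    weighting is an assumed, commonly used form, not DET-GS's exact loss.
    """
    height, width = len(depth), len(depth[0])
    total, count = 0.0, 0
    for y in range(height):
        for x in range(width):
            if x + 1 < width:   # horizontal neighbour
                weight = math.exp(-sharpness * abs(image[y][x + 1] - image[y][x]))
                total += weight * abs(depth[y][x + 1] - depth[y][x])
                count += 1
            if y + 1 < height:  # vertical neighbour
                weight = math.exp(-sharpness * abs(image[y + 1][x] - image[y][x]))
                total += weight * abs(depth[y + 1][x] - depth[y][x])
                count += 1
    return total / count if count else 0.0

# A depth step that coincides with an intensity edge in the image...
depth_step = [[1.0, 1.0, 2.0, 2.0] for _ in range(3)]
image_with_edge = [[0.2, 0.2, 0.9, 0.9] for _ in range(3)]
# ...and the same depth step over a textureless image.
image_flat = [[0.5] * 4 for _ in range(3)]

loss_aligned = edge_aware_tv(depth_step, image_with_edge)
loss_unaligned = edge_aware_tv(depth_step, image_flat)
```

The depth discontinuity is penalized far less when it lines up with an image edge, so the regularizer smooths homogeneous regions while leaving genuine boundaries sharp.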

[33] NEARL-CLIP: Interacted Query Adaptation with Orthogonal Regularization for Medical Vision-Language Understanding cs.CVPDF

Zelin Peng, Yichen Zhao, Yu Huang, Piao Yang, Feilong Tang

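TL;DR: NEARL-CLIP is a novel cross-modality interaction framework that uses dynamically generated queries and orthogonal regularization to address the domain gap of vision-language models in medicine, improving performance in a parameter-efficient way.

Details

Motivation: Medical image analysis is constrained by limited annotated data, and directly applying vision-language models such as CLIP runs into a domain gap. Existing methods usually introduce knowledge into a single modality only, which misaligns the modalities and leaves the potential of VLMs untapped.

Result: NEARL-CLIP effectively addresses the domain gap of VLMs in the medical domain and improves model performance.

Insight: Cross-modality interaction and decoupling of new knowledge are key to better medical vision-language understanding, and orthogonal regularization offers a new approach to knowledge learning.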

Abstract: Computer-aided medical image analysis is crucial for disease diagnosis and treatment planning, yet limited annotated datasets restrict medical-specific model development. While vision-language models (VLMs) like CLIP offer strong generalization capabilities, their direct application to medical imaging analysis is impeded by a significant domain gap. Existing approaches to bridge this gap, including prompt learning and one-way modality interaction techniques, typically focus on introducing domain knowledge to a single modality. Although this may offer performance gains, it often causes modality misalignment, thereby failing to unlock the full potential of VLMs. In this paper, we propose NEARL-CLIP (iNteracted quEry Adaptation with oRthogonaL Regularization), a novel cross-modality interaction VLM-based framework that contains two contributions: (1) Unified Synergy Embedding Transformer (USEformer), which dynamically generates cross-modality queries to promote interaction between modalities, thus fostering the mutual enrichment and enhancement of multi-modal medical domain knowledge; (2) Orthogonal Cross-Attention Adapter (OCA). OCA introduces an orthogonality technique to decouple the new knowledge from USEformer into two distinct components: the truly novel information and the incremental knowledge. By isolating the learning process from the interference of incremental knowledge, OCA enables a more focused acquisition of new information, thereby further facilitating modality interaction and unleashing the capability of VLMs. Notably, NEARL-CLIP achieves these two contributions in a parameter-efficient style, which only introduces 1.46M learnable parameters.

[34] AR as an Evaluation Playground: Bridging Metrics and Visual Perception of Computer Vision Models cs.CVPDF

Ashkan Ganj, Yiqin Zhao, Tian Guo

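TL;DR: The paper presents ARCADE, an augmented reality (AR) platform that simplifies human perception evaluation of computer vision (CV) models, with support for cross-platform data collection and custom experiment protocols.

Details

Motivation: Conventional human perception studies are complex and hard to scale, while the rich context and interactivity of AR open new opportunities for evaluating CV models.

Result: ARCADE is validated with depth and lighting estimation models, demonstrating its practicality for human perception evaluation.

Insight: AR can serve as a new tool for evaluating CV models, using human perceptual feedback to strengthen qualitative assessment of model performance.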

Abstract: Human perception studies can provide complementary insights to qualitative evaluation for understanding computer vision (CV) model performance. However, conducting human perception studies remains a non-trivial task: it often requires complex, end-to-end system setups that are time-consuming and difficult to scale. In this paper, we explore the unique opportunity presented by augmented reality (AR) for helping CV researchers to conduct perceptual studies. We design ARCADE, an evaluation platform that allows researchers to easily leverage AR’s rich context and interactivity for human-centered CV evaluation. Specifically, ARCADE supports cross-platform AR data collection, custom experiment protocols via pluggable model inference, and AR streaming for user studies. We demonstrate ARCADE using two types of CV models, depth and lighting estimation, and show that AR tasks can be effectively used to elicit human perceptual judgments of model quality. We also evaluate the system’s usability and performance across different deployment and study settings, highlighting its flexibility and effectiveness as a human-centered evaluation platform.

[35] Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decode cs.CV | cs.AIPDF

Jingchao Wang, Zhijian Wu, Dingjiang Huang, Yefeng Zheng, Hong Wang

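TL;DR: The paper proposes MLLMSeg, a lightweight decoder framework for referring expression segmentation (RES). It fully exploits the detailed visual features of the MLLM vision encoder, needs no extra visual encoder, and balances performance against cost.

Details

Motivation: Existing RES methods either rely on the parameter-heavy SAM model (632M parameters) or adopt lightweight pipelines that sacrifice accuracy. MLLMSeg is proposed to resolve this trade-off between performance and cost.

Result: Experiments show that MLLMSeg outperforms both SAM-based and SAM-free competitors in performance and cost.

Insight: Fully exploiting the features of the MLLM vision encoder can reduce the parameter count while maintaining or even improving performance, which offers guidance for building lightweight RES systems.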

Abstract: Referring Expression Segmentation (RES) aims to segment image regions specified by referring expressions and has become popular with the rise of multimodal large language models (MLLMs). While MLLMs excel in semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction. Existing RES methods either couple MLLMs with the parameter-heavy Segment Anything Model (SAM), which has 632M network parameters, or adopt SAM-free lightweight pipelines that sacrifice accuracy. To address the trade-off between performance and cost, we propose MLLMSeg, a novel framework that fully exploits the inherent visual detail features encoded in the MLLM vision encoder without introducing an extra visual encoder. In addition, we propose a detail-enhanced and semantic-consistent feature fusion module (DSFF) that fully integrates the detail-related visual feature with the semantic-related feature output by the large language model (LLM) of the MLLM. Finally, we establish a light-weight mask decoder with only 34M network parameters that optimally leverages detailed spatial features from the visual encoder and semantic features from the LLM to achieve precise mask prediction. Extensive experiments demonstrate that our method generally surpasses both SAM-based and SAM-free competitors, striking a better balance between performance and cost. Code is available at https://github.com/jcwang0602/MLLMSeg.

[36] CLIPVehicle: A Unified Framework for Vision-based Vehicle Search cs.CVPDF

Likai Wang, Ruize Han, Xiangqun Zhang, Wei Feng

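TL;DR: The paper proposes CLIPVehicle, a unified framework for joint vehicle detection and re-identification that resolves the conflicting objectives of the two tasks through a dual-granularity semantic-region alignment module and a multi-level vehicle identity learning strategy.

Details

Motivation: Existing vehicle search methods detect first and re-identify afterwards, which is resource-intensive and impractical. The paper aims for end-to-end joint detection and re-identification.

Result: Experiments show that CLIPVehicle outperforms existing methods on vehicle re-identification and person search tasks.

Insight: Joint learning and multi-level feature extraction can effectively resolve the objective conflict between detection and re-identification, improving the efficiency and performance of vehicle search.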

Abstract: Vehicles are among the most common and significant objects in the real world, and computer vision research on them, such as vehicle detection and vehicle re-identification, has made remarkable progress. To search for a vehicle of interest in surveillance videos, existing methods first pre-detect and store all vehicle patches and then apply vehicle re-identification models, which is resource-intensive and not very practical. In this work, we aim to achieve joint detection and re-identification for vehicle search. However, the conflicting objectives of detection, which focuses on shared vehicle commonness, and re-identification, which focuses on individual vehicle uniqueness, make it challenging for a model to learn in an end-to-end system. To address this problem, we propose a new unified framework, namely CLIPVehicle, which contains a dual-granularity semantic-region alignment module to leverage VLMs (Vision-Language Models) for vehicle discrimination modeling, and a multi-level vehicle identification learning strategy to learn the identity representation from global, instance and feature levels. We also construct a new benchmark for vehicle search, including a real-world dataset, CityFlowVS, and two synthetic datasets, SynVS-Day and SynVS-All. Extensive experimental results demonstrate that our method outperforms the state-of-the-art methods on both vehicle Re-ID and person search tasks.

[37] Conditional Latent Diffusion Models for Zero-Shot Instance Segmentation cs.CVPDF

Maximilian Ulmer, Wout Boerdijk, Rudolph Triebel, Maximilian Durner

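TL;DR: The paper proposes OC-DiT, a new method based on conditional latent diffusion models for zero-shot instance segmentation. It generates instance masks by combining object templates and image features in the diffusion model's latent space and achieves strong performance.

Details

Motivation: Existing instance segmentation methods usually require large amounts of annotated data and generalize poorly to new categories or zero-shot settings. The paper explores the potential of diffusion models for generative segmentation, aiming at zero-shot instance segmentation without training on target data.

Result: The model achieves state-of-the-art zero-shot performance on multiple real-world benchmarks without fine-tuning on target data.

Insight: Through conditioning and a generative formulation, diffusion models can effectively disentangle object instances, offering a new approach to instance segmentation.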

Abstract: This paper presents OC-DiT, a novel class of diffusion models designed for object-centric prediction, and applies it to zero-shot instance segmentation. We propose a conditional latent diffusion framework that generates instance masks by conditioning the generative process on object templates and image features within the diffusion model’s latent space. This allows our model to effectively disentangle object instances through the diffusion process, which is guided by visual object descriptors and localized image cues. Specifically, we introduce two model variants: a coarse model for generating initial object instance proposals, and a refinement model that refines all proposals in parallel. We train these models on a newly created, large-scale synthetic dataset comprising thousands of high-quality object meshes. Remarkably, our model achieves state-of-the-art performance on multiple challenging real-world benchmarks, without requiring any retraining on target data. Through comprehensive ablation studies, we demonstrate the potential of diffusion models for instance segmentation tasks.

Zheng Cheng, Wenri Wang, Guangyong Chen, Yakun Ju, Yihua Cheng

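TL;DR: This paper proposes a Single-Scale Decomposition Network (SSD-Net) and shows, through a novel single-scale feature extraction technique, that single-scale methods can match or even surpass multi-scale methods in underwater image enhancement while significantly reducing complexity.

Details

Motivation: Current underwater image enhancement (UIE) techniques rely mainly on multi-scale feature fusion, yet experiments show that high-quality image reconstruction does not necessarily require multi-scale features. This motivates exploring the potential of single-scale features.

Result: Experiments show that single-scale feature extraction can match or surpass multi-scale methods while reducing computational complexity.

Insight: Single-scale features hold great potential for underwater image enhancement, and well-designed decomposition and fusion strategies can significantly improve performance.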

Abstract: Underwater image enhancement (UIE) techniques aim to improve the visual quality of images captured in aquatic environments by addressing degradation issues caused by light absorption and scattering effects, including color distortion, blurring, and low contrast. Current mainstream solutions predominantly employ multi-scale feature extraction (MSFE) mechanisms to enhance reconstruction quality through multi-resolution feature fusion. However, our extensive experiments demonstrate that high-quality image reconstruction does not necessarily rely on multi-scale feature fusion. Contrary to popular belief, our experiments show that single-scale feature extraction alone can match or surpass the performance of multi-scale methods while significantly reducing complexity. To comprehensively explore single-scale feature potential in underwater enhancement, we propose an innovative Single-Scale Decomposition Network (SSD-Net). This architecture introduces an asymmetrical decomposition mechanism that disentangles the input image into a clean layer and a degradation layer. The former contains scene-intrinsic information and the latter encodes medium-induced interference. It uniquely combines CNN’s local feature extraction capabilities with Transformer’s global modeling strengths through two core modules: 1) Parallel Feature Decomposition Block (PFDB), implementing dual-branch feature space decoupling via efficient attention operations and adaptive sparse transformer; 2) Bidirectional Feature Communication Block (BFCB), enabling cross-layer residual interactions for complementary feature mining and fusion. This synergistic design preserves feature decomposition independence while establishing dynamic cross-layer information pathways, effectively enhancing degradation decoupling capacity.

[39] Learning Using Privileged Information for Litter Detection cs.CV | cs.ET | cs.LG | cs.PFPDF

Matthias Bartolo, Konstantinos Makantasis, Dylan Seychell

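TL;DR: This paper proposes a new method for litter detection that combines privileged information with deep learning object detection, improving detection performance by encoding bounding box information as a binary mask.

Details

Motivation: Litter pollution is growing worldwide, so developing efficient automated litter detection tools is important.

Result: The method performs better on the SODA, BDW, and UAVVaste datasets without increasing model complexity.

Insight: The method improves detection accuracy while preserving computational efficiency, making it suitable for practical use.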

Abstract: As litter pollution continues to rise globally, developing automated tools capable of detecting litter effectively remains a significant challenge. This study presents a novel approach that combines, for the first time, privileged information with deep learning object detection to improve litter detection while maintaining model efficiency. We evaluate our method across five widely used object detection models, addressing challenges such as detecting small litter and objects partially obscured by grass or stones. A further key contribution of our work is a means of encoding bounding box information as a binary mask, which can be fed to the detection model to refine detection guidance. Through within-dataset evaluation on the renowned SODA dataset and cross-dataset evaluation on the BDW and UAVVaste litter detection datasets, we demonstrate consistent performance improvements across all models. Our approach not only bolsters detection accuracy within the training sets but also generalises well to other litter detection contexts. Crucially, these improvements are achieved without increasing model complexity or adding extra layers, ensuring computational efficiency and scalability. Our results suggest that this methodology offers a practical solution for litter detection, balancing accuracy and efficiency in real-world applications.
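
The bounding-box encoding mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the box convention, mask resolution, or how the mask is fed to the detector, so everything below (the `boxes_to_mask` helper, the `(x_min, y_min, x_max, y_max)` layout with exclusive maxima, and the clipping behaviour) is an assumption for illustration rather than the authors' implementation.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels, max values exclusive.
Box = Tuple[int, int, int, int]


def boxes_to_mask(boxes: List[Box], height: int, width: int) -> List[List[int]]:
    """Rasterise bounding boxes into a height x width binary mask (1 = inside a box)."""
    mask = [[0] * width for _ in range(height)]
    for x_min, y_min, x_max, y_max in boxes:
        # Clip each box to the image so out-of-range annotations do not raise errors.
        x0, x1 = max(0, x_min), min(width, x_max)
        y0, y1 = max(0, y_min), min(height, y_max)
        for y in range(y0, y1):
            for x in range(x0, x1):
                mask[y][x] = 1
    return mask


# Two hypothetical litter annotations; the second one extends past the right edge.
mask = boxes_to_mask([(1, 1, 3, 3), (4, 0, 8, 2)], height=4, width=6)
for row in mask:
    print("".join(str(v) for v in row))
# prints:
# 000011
# 011011
# 011000
# 000000
```

A mask like this has the same spatial layout as the image, which is what makes it usable as an extra guidance signal at training time; how it is consumed is left to the paper.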

[40] SVC 2025: the First Multimodal Deception Detection Challenge cs.CVPDF

Xun Lin, Xiaobao Guo, Taorui Wang, Yingjie Ma, Jiajian Huang

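TL;DR: The paper introduces the SVC 2025 Multimodal Deception Detection Challenge, which aims to advance research on cross-domain generalization in audio-visual deception detection, using multimodal data (audio, video, text) to improve model adaptability, explainability, and practical deployability.

Details

Motivation: Existing deception detection research mostly focuses on single-domain scenarios and overlooks the performance degradation that occurs under domain shift. The challenge aims to fill this gap and advance multimodal cross-domain deception detection.

Result: 21 teams submitted final results, demonstrating the cross-domain performance of their models.

Insight: Multimodal data helps improve model generalization and practicality, and cross-domain deception detection is a key direction for future research.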

Abstract: Deception detection is a critical task in real-world applications such as security screening, fraud prevention, and credibility assessment. While deep learning methods have shown promise in surpassing human-level performance, their effectiveness often depends on the availability of high-quality and diverse deception samples. Existing research predominantly focuses on single-domain scenarios, overlooking the significant performance degradation caused by domain shifts. To address this gap, we present the SVC 2025 Multimodal Deception Detection Challenge, a new benchmark designed to evaluate cross-domain generalization in audio-visual deception detection. Participants are required to develop models that not only perform well within individual domains but also generalize across multiple heterogeneous datasets. By leveraging multimodal data, including audio, video, and text, this challenge encourages the design of models capable of capturing subtle and implicit deceptive cues. Through this benchmark, we aim to foster the development of more adaptable, explainable, and practically deployable deception detection systems, advancing the broader field of multimodal learning. By the conclusion of the workshop competition, a total of 21 teams had submitted their final results. See https://sites.google.com/view/svc-mm25 for more information.

[41] DS$^2$Net: Detail-Semantic Deep Supervision Network for Medical Image Segmentation cs.CV | cs.AIPDF

Zhaohong Huang, Yuxin Zhang, Mingbao Lin, Taojian Zhou, Guorong Cai

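TL;DR: DS$^2$Net is a new medical image segmentation method that supervises low-level detail features and high-level semantic features simultaneously, combining the DEM and SEM modules with an uncertainty-based loss to significantly improve segmentation performance.

Details

Motivation: Existing deep supervision networks usually supervise only coarse-grained semantic features or only fine-grained detail features, ignoring how the two complement each other in medical image analysis. The paper aims to improve segmentation through multi-view deep supervision.

Result: Experiments on six benchmark datasets covering colonoscopy, ultrasound, and microscopy show that DS$^2$Net significantly outperforms existing methods.

Insight: Joint supervision of detail and semantic features is crucial for medical image segmentation, and an adaptive loss design avoids the limitations of manual tuning.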

Abstract: Deep Supervision Networks exhibit significant efficacy for the medical imaging community. Nevertheless, existing work merely supervises either the coarse-grained semantic features or fine-grained detailed features in isolation, which overlooks the fact that these two types of features hold vital relationships in medical image analysis. We advocate the power of complementary feature supervision for medical image segmentation by proposing a Detail-Semantic Deep Supervision Network (DS$^2$Net). DS$^2$Net navigates both low-level detailed and high-level semantic feature supervision through a Detail Enhance Module (DEM) and a Semantic Enhance Module (SEM). DEM and SEM respectively harness low-level and high-level feature maps to create detail and semantic masks for enhancing feature supervision. This is a novel shift from single-view deep supervision to multi-view deep supervision. DS$^2$Net is also equipped with a novel uncertainty-based supervision loss that adaptively assigns the supervision strength of features within distinct scales based on their uncertainty, thus circumventing the sub-optimal heuristic design that typifies previous works. Through extensive experiments on six benchmarks captured under colonoscopy, ultrasound, or microscopy, we demonstrate that DS$^2$Net consistently outperforms state-of-the-art methods for medical image analysis.

[42] UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval cs.CV | cs.AIPDF

Hongyu Guo, Kuan Zhu, Xiangzhao Hao, Haiyun Guo, Ming Tang

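TL;DR: UniFGVC is a universal training-free framework that addresses few-shot fine-grained visual classification through multimodal retrieval.

Details

Motivation: Existing methods mainly improve performance by fine-tuning pre-trained vision-language models, but they suffer from overfitting and weak generalization. UniFGVC aims to avoid these drawbacks through multimodal retrieval.

Result: On 12 FGVC benchmarks, UniFGVC outperforms existing few-shot CLIP-based methods and several fully supervised MLLM-based methods.

Insight: Multimodal retrieval achieves generality without training, and combining open-world knowledge with structured descriptions improves the discriminative power of fine-grained classification.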

Abstract: Few-shot fine-grained visual classification (FGVC) aims to leverage limited data to enable models to discriminate subtly distinct categories. Recent works mostly fine-tune pre-trained vision-language models to achieve performance gains, yet suffer from overfitting and weak generalization. To deal with this, we introduce UniFGVC, a universal training-free framework that reformulates few-shot FGVC as multimodal retrieval. First, we propose the Category-Discriminative Visual Captioner (CDV-Captioner) to exploit the open-world knowledge of multimodal large language models (MLLMs) to generate a structured text description that captures the fine-grained attribute features distinguishing closely related classes. CDV-Captioner uses chain-of-thought prompting and visually similar reference images to reduce hallucination and enhance the discrimination of generated captions. Using it, we can convert each image into an image-description pair, enabling more comprehensive feature representation, and construct the multimodal category templates using few-shot samples for the subsequent retrieval pipeline. Then, off-the-shelf vision and text encoders embed query and template pairs, and FGVC is accomplished by retrieving the nearest template in the joint space. UniFGVC ensures broad compatibility with diverse MLLMs and encoders, offering reliable generalization and adaptability across few-shot FGVC scenarios. Extensive experiments on 12 FGVC benchmarks demonstrate its consistent superiority over prior few-shot CLIP-based methods and even several fully-supervised MLLMs-based approaches.
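
The final retrieval step, picking the nearest template in the joint space, can be sketched as follows. The abstract does not say how image and description embeddings are fused or which distance is used, so the equal-weight average of two cosine similarities, the one-template-per-class setup, and the toy vectors and class names are all assumptions made for illustration; real embeddings would come from off-the-shelf encoders.

```python
import math
from typing import Dict, List, Tuple

# An image-description pair, represented by its two embeddings.
Pair = Tuple[List[float], List[float]]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pair_similarity(query: Pair, template: Pair, alpha: float = 0.5) -> float:
    # Assumed fusion rule: weighted average of image-image and text-text similarity.
    return alpha * cosine(query[0], template[0]) + (1 - alpha) * cosine(query[1], template[1])


def classify(query: Pair, templates: Dict[str, Pair]) -> str:
    """Return the label of the template most similar to the query pair."""
    return max(templates, key=lambda label: pair_similarity(query, templates[label]))


# Toy category templates (hypothetical classes and hand-written embeddings).
templates = {
    "house sparrow": ([1.0, 0.0, 0.2], [0.9, 0.1]),
    "tree sparrow": ([0.8, 0.6, 0.1], [0.2, 0.9]),
}
query = ([0.9, 0.1, 0.2], [0.8, 0.3])
predicted = classify(query, templates)
print(predicted)  # prints "house sparrow"
```

Because nothing here is trained, swapping in a different encoder or captioner only changes the vectors, which is consistent with the compatibility claim in the abstract.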

[43] IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control cs.CVPDF

Lijuan Liu, Wenfa Li, Dongbo Zhang, Shuo Wang, Shaohui Jiao

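TL;DR: IDC-Net is a new framework that jointly generates RGB images and depth maps to produce RGB-D video sequences under explicit camera trajectory control, improving the geometric consistency and visual quality of generated scenes.

Details

Motivation: Existing methods usually treat RGB and depth generation separately, which leads to geometric inconsistency. IDC-Net proposes a joint learning framework to meet the need for geometric alignment and precise camera control.

Result: Experiments show that IDC-Net outperforms existing methods in visual quality and geometric consistency, and the generated RGB-D sequences can be used directly for downstream 3D reconstruction.

Insight: Through joint learning and geometry-aware design, IDC-Net significantly improves the geometric consistency of generated scenes, demonstrating the practical value of joint generative models.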

Abstract: We present IDC-Net (Image-Depth Consistency Network), a novel framework designed to generate RGB-D video sequences under explicit camera trajectory control. Unlike approaches that treat RGB and depth generation separately, IDC-Net jointly synthesizes both RGB images and corresponding depth maps within a unified geometry-aware diffusion model. The joint learning framework strengthens spatial and geometric alignment across frames, enabling more precise camera control in the generated sequences. To support the training of this camera-conditioned model and ensure high geometric fidelity, we construct a camera-image-depth consistent dataset with metric-aligned RGB videos, depth maps, and accurate camera poses, which provides precise geometric supervision with notably improved inter-frame geometric consistency. Moreover, we introduce a geometry-aware transformer block that enables fine-grained camera control over the generated sequences. Extensive experiments show that IDC-Net achieves improvements over state-of-the-art approaches in both visual quality and geometric consistency of generated scene sequences. Notably, the generated RGB-D sequences can be fed directly into downstream 3D scene reconstruction tasks without extra post-processing steps, showcasing the practical benefits of our joint learning framework. See more at https://idcnet-scene.github.io.

[44] ICM-Fusion: In-Context Meta-Optimized LoRA Fusion for Multi-Task Adaptation cs.CVPDF

Yihua Shao, Xiaofeng Lin, Xinwei Long, Siyu Chen, Minxi Yan

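TL;DR: ICM-Fusion combines meta-learning with in-context adaptation in a new LoRA fusion framework. Its task vector arithmetic dynamically balances conflicts across domains, significantly reducing multi-task loss and achieving task enhancement in few-shot scenarios.

Details

Motivation: Existing LoRA fusion methods suffer from weight conflicts and domain forgetting, and generalize poorly in few-shot scenarios in particular. ICM-Fusion aims to solve these problems.

Result: Experiments show that ICM-Fusion significantly reduces multi-task loss and achieves task enhancement in few-shot scenarios.

Insight: Dynamically adjusting task vector orientation is key to resolving weight conflicts and domain forgetting, and the F-VAE design provides an effective tool for multi-task LoRA generation.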

Abstract: Enabling multi-task adaptation in pre-trained Low-Rank Adaptation (LoRA) models is crucial for enhancing their generalization capabilities. Most existing pre-trained LoRA fusion methods decompose weight matrices, sharing similar parameters while merging divergent ones. However, this paradigm inevitably induces inter-weight conflicts and leads to catastrophic domain forgetting. While incremental learning enables adaptation to multiple tasks, it struggles to achieve generalization in few-shot scenarios. Consequently, when the weight data follows a long-tailed distribution, it can lead to forgetting in the fused weights. To address this issue, we propose In-Context Meta LoRA Fusion (ICM-Fusion), a novel framework that synergizes meta-learning with in-context adaptation. The key innovation lies in our task vector arithmetic, which dynamically balances conflicting optimization directions across domains through learned manifold projections. ICM-Fusion obtains the optimal task vector orientation for the fused model in the latent space by adjusting the orientation of the task vectors. Subsequently, the fused LoRA is reconstructed by a self-designed Fusion VAE (F-VAE) to realize multi-task LoRA generation. We have conducted extensive experiments on visual and linguistic tasks, and the experimental results demonstrate that ICM-Fusion can be adapted to a wide range of architectural models and applied to various tasks. Compared to current pre-trained LoRA fusion methods, the LoRA fused by ICM-Fusion significantly reduces the multi-task loss and can even achieve task enhancement in few-shot scenarios.
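
As background for the phrase "task vector arithmetic", the sketch below shows the plain, fixed-coefficient form of the idea: a task vector is the difference between adapted and base weights, and a merged model adds scaled task vectors back to the base. This is not ICM-Fusion's method, which per the abstract adjusts task vector orientations through learned manifold projections and reconstructs the result with F-VAE; the flat weight lists, the `task_vector` and `merge` helpers, and the coefficients are hypothetical.

```python
from typing import List


def task_vector(base: List[float], adapted: List[float]) -> List[float]:
    """Direction in weight space that adaptation moved the model along."""
    return [a - b for b, a in zip(base, adapted)]


def merge(base: List[float], task_vectors: List[List[float]], coeffs: List[float]) -> List[float]:
    """Add each task vector to the base weights, scaled by its coefficient."""
    merged = list(base)
    for tau, lam in zip(task_vectors, coeffs):
        merged = [m + lam * t for m, t in zip(merged, tau)]
    return merged


base = [1.0, 2.0, 3.0]
adapted_a = [1.5, 2.0, 2.0]  # hypothetical weights after adapting to task A
adapted_b = [1.0, 3.0, 4.0]  # hypothetical weights after adapting to task B

tau_a = task_vector(base, adapted_a)  # [0.5, 0.0, -1.0]
tau_b = task_vector(base, adapted_b)  # [0.0, 1.0, 1.0]
merged = merge(base, [tau_a, tau_b], coeffs=[0.5, 0.5])
print(merged)  # prints [1.25, 2.5, 3.0]
```

The third coordinate shows the conflict the abstract refers to: the two tasks pull that weight in opposite directions, so a fixed-coefficient sum cancels both updates and leaves the base value.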

[45] Audio-Assisted Face Video Restoration with Temporal and Identity Complementary Learning cs.CV | cs.MM | cs.SD | eess.ASPDF

Yuqin Cao, Yixuan Gao, Wei Sun, Xiaohong Liu, Yulun Zhang

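TL;DR: The paper proposes a General Audio-assisted face Video restoration Network (GAVN) that addresses multiple types of streaming video distortion through identity and temporal complementary learning, significantly outperforming existing methods.

Details

Motivation: Existing face video restoration methods usually ignore the correlation between visual and audio features, especially in the mouth region. The few audio-assisted methods focus only on compression artifact removal and cannot handle multiple distortion types.

Result: Experiments show that GAVN outperforms existing state-of-the-art methods on compression artifact removal, deblurring, and super-resolution.

Insight: Audio and face landmarks can effectively assist video restoration; in particular, extracting identity features significantly improves the recovery of facial details.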

Abstract: Face videos accompanied by audio have become integral to our daily lives, while they often suffer from complex degradations. Most face video restoration methods neglect the intrinsic correlations between the visual and audio features, especially in mouth regions. A few audio-aided face video restoration methods have been proposed, but they only focus on compression artifact removal. In this paper, we propose a General Audio-assisted face Video restoration Network (GAVN) to address various types of streaming video distortions via identity and temporal complementary learning. Specifically, GAVN first captures inter-frame temporal features in the low-resolution space to restore frames coarsely and save computational cost. Then, GAVN extracts intra-frame identity features in the high-resolution space with the assistance of audio signals and face landmarks to restore more facial details. Finally, the reconstruction module integrates temporal features and identity features to generate high-quality face videos. Experimental results demonstrate that GAVN outperforms the existing state-of-the-art methods on face video compression artifact removal, deblurring, and super-resolution. Codes will be released upon publication.

[46] ToxicTAGS: Decoding Toxic Memes with Rich Tag Annotations cs.CV | cs.CLPDF

Subhankar Swain, Naquee Rizwan, Nayandeep Deb, Vishwajeet Singh Solanki, Vishwa Gangadhar S

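TL;DR: This paper introduces the first real-world dataset of 6,300 multimodal memes (toxic/normal) and proposes a tag generation module that improves detection.

Details

Motivation: Memes on social media are often used to spread harmful content, but the limitations of existing datasets and high annotation costs hinder the development of effective multimodal content moderation systems.

Result: Experiments show that incorporating social tags significantly improves the performance of vision-language models on detection tasks.

Insight: Auxiliary metadata such as social tags can strengthen contextual understanding in multimodal content moderation.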

Abstract: The 2025 Global Risks Report identifies state-based armed conflict and societal polarisation among the most pressing global threats, with social media playing a central role in amplifying toxic discourse. Memes, as a widely used mode of online communication, often serve as vehicles for spreading harmful content. However, limitations in data accessibility and the high cost of dataset curation hinder the development of robust meme moderation systems. To address this challenge, we introduce a first-of-its-kind dataset of 6,300 real-world meme-based posts annotated in two stages: (i) binary classification into toxic and normal, and (ii) fine-grained labelling of toxic memes as hateful, dangerous, or offensive. A key feature of this dataset is that it is enriched with auxiliary metadata of socially relevant tags, enhancing the context of each meme. In addition, we propose a tag generation module that produces socially grounded tags, because most in-the-wild memes do not come with tags. Experimental results show that incorporating these tags substantially enhances the performance of state-of-the-art VLMs on detection tasks. Our contributions offer a novel and scalable foundation for improved content moderation in multimodal online environments.

[47] AD-FM: Multimodal LLMs for Anomaly Detection via Multi-Stage Reasoning and Fine-Grained Reward Optimization cs.CVPDF

Jingyi Liao, Yongyi Su, Rong-Cheng Tu, Zhao Jin, Wenhao Sun

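TL;DR: The paper proposes a multimodal large language model (MLLM) framework for anomaly detection that addresses the shortcomings of existing methods in data utilization and reasoning supervision through multi-stage reasoning and fine-grained reward optimization.

Details

Motivation: Existing GRPO-based methods for anomaly detection suffer from low data utilization and a lack of supervision over the reasoning process, which limits the use of MLLMs in specialized domains.

Result: Experiments on multiple industrial datasets show that the method significantly improves anomaly detection performance and makes efficient use of existing annotations.

Insight: Structured supervision and continuous feedback signals make it possible to retain general MLLM capabilities while meeting the strict demands of specialized domains for fine-grained visual analysis.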

Abstract: While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities across diverse domains, their application to specialized anomaly detection (AD) remains constrained by domain adaptation challenges. Existing Group Relative Policy Optimization (GRPO) based approaches suffer from two critical limitations: inadequate training data utilization when models produce uniform responses, and insufficient supervision over reasoning processes that encourage immediate binary decisions without deliberative analysis. We propose a comprehensive framework addressing these limitations through two synergistic innovations. First, we introduce a multi-stage deliberative reasoning process that guides models from region identification to focused examination, generating diverse response patterns essential for GRPO optimization while enabling structured supervision over analytical workflows. Second, we develop a fine-grained reward mechanism incorporating classification accuracy and localization supervision, transforming binary feedback into continuous signals that distinguish genuine analytical insight from spurious correctness. Comprehensive evaluation across multiple industrial datasets demonstrates substantial performance improvements in adapting general vision-language models to specialized anomaly detection. Our method achieves superior accuracy with efficient adaptation of existing annotations, effectively bridging the gap between general-purpose MLLM capabilities and the fine-grained visual discrimination required for detecting subtle manufacturing defects and structural irregularities.
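
The first limitation named in the abstract (uniform responses giving little training signal) follows from how group-relative advantages are computed: each reward is normalized against the mean and spread of its own group, so a group with identical rewards yields zero advantage everywhere. The sketch below illustrates this and contrasts it with a continuous reward that adds a localization term. The reward form (classification term plus `w_loc` times box IoU), the use of population standard deviation, and the example boxes are assumptions; the abstract does not give the paper's actual reward.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalise each reward against the mean and std of its own group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fine_grained_reward(pred_label: str, true_label: str,
                        pred_box: Box, true_box: Box, w_loc: float = 0.5) -> float:
    # Hypothetical weighting: classification term plus a scaled localisation term.
    cls = 1.0 if pred_label == true_label else 0.0
    return cls + w_loc * iou(pred_box, true_box)


true_box = (0.0, 0.0, 2.0, 2.0)
pred_boxes = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0),
              (0.0, 0.0, 1.0, 2.0), (5.0, 5.0, 6.0, 6.0)]

# Four responses that all predict the right label: binary rewards are identical.
binary_adv = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# The same four responses, scored with localisation quality as well.
fine_rewards = [fine_grained_reward("anomaly", "anomaly", b, true_box) for b in pred_boxes]
fine_adv = group_relative_advantages(fine_rewards)

print(binary_adv)  # prints [0.0, 0.0, 0.0, 0.0]
print([round(a, 2) for a in fine_adv])
```

With binary feedback the four responses are indistinguishable, whereas the continuous reward ranks the well-localized response above the one that guessed the label with an unrelated box.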

[48] Uncertainty-Aware Spatial Color Correlation for Low-Light Image Enhancement cs.CVPDF

Jin Kuang, Dong Liu, Yukuang Zhang, Shengsheng Wang

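TL;DR: The paper proposes the U2CLLIE framework, which tackles noise and vanishing gradients in low-light image enhancement under extremely dark conditions through uncertainty awareness and spatial-color causal correlation modeling.

Details

Motivation: Most existing low-light image enhancement methods overlook the uncertainty in feature representations; under extremely dark conditions in particular, degraded gradients and noise dominance severely impair model reliability.

Result: U2CLLIE achieves leading performance on multiple benchmark datasets, showing robustness and strong generalization.

Insight: Combining uncertainty analysis with causal modeling can significantly improve the robustness of low-light image enhancement, especially under extreme conditions.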

Abstract: Most existing low-light image enhancement approaches primarily focus on architectural innovations, while often overlooking the intrinsic uncertainty within feature representations, particularly under extremely dark conditions where degraded gradients and noise dominance severely impair model reliability and causal reasoning. To address these issues, we propose U2CLLIE, a novel framework that integrates uncertainty-aware enhancement and spatial-color causal correlation modeling. From the perspective of entropy-based uncertainty, our framework introduces two key components: (1) An Uncertainty-Aware Dual-domain Denoise (UaD) Module, which leverages Gaussian-Guided Adaptive Frequency Domain Feature Enhancement (G2AF) to suppress frequency-domain noise and optimize entropy-driven representations. This module enhances spatial texture extraction and frequency-domain noise suppression/structure refinement, effectively mitigating gradient vanishing and noise dominance. (2) A hierarchical causality-aware framework, where a Luminance Enhancement Network (LEN) first performs coarse brightness enhancement on dark regions. Then, during the encoder-decoder phase, two asymmetric causal correlation modeling modules, Neighborhood Correlation State Space (NeCo) and Adaptive Spatial-Color Calibration (AsC), collaboratively construct hierarchical causal constraints. These modules reconstruct and reinforce neighborhood structure and color consistency in the feature space. Extensive experiments demonstrate that U2CLLIE achieves state-of-the-art performance across multiple benchmark datasets, exhibiting robust performance and strong generalization across various scenes.

[49] Deeper Inside Deep ViT cs.CVPDF

Sungrae Hong

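TL;DR: This paper examines the practical utility of ViT-22B, an LLM-like large-scale vision model, analyzing its training stability and actual performance and exploring its use in image generation.

Details

Motivation: Although large-scale vision models such as ViT-22B have been studied extensively, understanding of their practical utility remains incomplete. The paper investigates their training behavior, stability, and feasibility for image generation.

Result: ViT-22B outperforms ViT at the same parameter size; the modifications effectively improve training stability; ViT-22B performs better than ViT on image generation.

Insight: Large-scale vision models such as ViT-22B show promise in performance, but training stability is a key challenge; ViT architectures also have potential for image generation.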

Abstract: There have been attempts to create large-scale structures in vision models similar to LLMs, such as ViT-22B. While this research has provided numerous analyses and insights, our understanding of its practical utility remains incomplete. Therefore, we examine how this model structure reacts and trains in a local environment. We also highlight the instability in training and make some model modifications to stabilize it. The ViT-22B model, trained from scratch, overall outperformed ViT under the same parameter size. Additionally, we venture into the task of image generation, which has not been attempted with ViT-22B. We propose an image generation architecture using ViT and investigate which of ViT and ViT-22B is the more suitable structure for image generation.

[50] RPCANet++: Deep Interpretable Robust PCA for Sparse Object Segmentation cs.CVPDF

Fengyi Wu, Yimian Dai, Tianfang Zhang, Yixuan Ding, Jian Yang

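TL;DR: RPCANet++ is a sparse object segmentation framework that combines the interpretability of robust principal component analysis (RPCA) with the efficiency of deep networks, significantly improving performance and adaptability through a modular design and enhanced feature preservation.

Details

Motivation: Traditional RPCA methods are limited by high computational cost, reliance on fine hyperparameter tuning, and poor adaptability to dynamic scenes, which motivates RPCANet++.

Result: Experiments show that RPCANet++ outperforms existing methods across a variety of imaging scenarios, and visual analysis demonstrates the interpretability of its low-rankness and sparsity.

Insight: By combining the theoretical strengths of RPCA with the efficiency of deep networks, RPCANet++ sets a reliable and interpretable new baseline for sparse object segmentation.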

Abstract: Robust principal component analysis (RPCA) decomposes an observation matrix into low-rank background and sparse object components. This capability has enabled its application in tasks ranging from image restoration to segmentation. However, traditional RPCA models suffer from computational burdens caused by matrix operations, reliance on finely tuned hyperparameters, and rigid priors that limit adaptability in dynamic scenarios. To solve these limitations, we propose RPCANet++, a sparse object segmentation framework that fuses the interpretability of RPCA with efficient deep architectures. Our approach unfolds a relaxed RPCA model into a structured network comprising a Background Approximation Module (BAM), an Object Extraction Module (OEM), and an Image Restoration Module (IRM). To mitigate inter-stage transmission loss in the BAM, we introduce a Memory-Augmented Module (MAM) to enhance background feature preservation, while a Deep Contrast Prior Module (DCPM) leverages saliency cues to expedite object extraction. Extensive experiments on diverse datasets demonstrate that RPCANet++ achieves state-of-the-art performance under various imaging scenarios. We further improve interpretability via visual and numerical low-rankness and sparsity measurements. By combining the theoretical strengths of RPCA with the efficiency of deep networks, our approach sets a new baseline for reliable and interpretable sparse object segmentation. Codes are available at our Project Webpage https://fengyiwu98.github.io/rpcanetx.
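
For readers unfamiliar with RPCA, the classical convex formulation (principal component pursuit) is given here as background. The symbols $D$ (observation), $B$ (low-rank background), $T$ (sparse object), and $\lambda$ (trade-off weight) are notation chosen for this note. The abstract says RPCANet++ unfolds a relaxed RPCA model but does not give its exact form, so this should not be read as the paper's objective.

```latex
\min_{B,\,T}\; \|B\|_{*} + \lambda \|T\|_{1}
\quad \text{subject to} \quad D = B + T,
\qquad
\mathcal{S}_{\tau}(x) = \operatorname{sign}(x)\,\max(|x| - \tau,\, 0)
```

The nuclear norm $\|B\|_{*}$ (sum of singular values) is the standard convex surrogate for rank, and the entrywise $\ell_1$ norm $\|T\|_{1}$ is the surrogate for sparsity. Iterative solvers typically alternate singular value thresholding for $B$ with the soft-thresholding operator $\mathcal{S}_{\tau}$ applied entrywise for $T$, and the repeated decompositions in these iterations are one source of the computational burden the abstract mentions.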

[51] From Learning to Unlearning: Biomedical Security Protection in Multimodal Large Language Models cs.CVPDF

Dunyuan Xu, Xikai Yang, Yaoqian Li, Jinpeng Li, Pheng-Ann Heng

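TL;DR: The paper proposes MLLMU-Med, the first benchmark for evaluating unlearning for security protection in biomedical multimodal large language models (MLLMs), and develops a data generation pipeline that synthesizes private data and incorrect knowledge. It finds that existing unlearning methods have limited effectiveness and need further improvement.

Details

Motivation: The training data of biomedical MLLMs may contain private information and incorrect knowledge, leading to privacy leakage or erroneous outputs after deployment. Because retraining is costly, machine unlearning has emerged as a solution, but relevant datasets and evaluation methods are lacking.

Result: Existing unlearning methods show limited effectiveness on biomedical MLLMs, indicating that improvement is still needed.

Insight: Research on unlearning for security in the biomedical domain is promising, but existing methods fall short, and more effective unlearning techniques are needed.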

Abstract: The security of biomedical Multimodal Large Language Models (MLLMs) has attracted increasing attention. However, training samples easily contain private information and incorrect knowledge that are difficult to detect, potentially leading to privacy leakage or erroneous outputs after deployment. An intuitive idea is to reprocess the training set to remove unwanted content and retrain the model from scratch. Yet, this is impractical due to significant computational costs, especially for large language models. Machine unlearning has emerged as a solution to this problem, which avoids complete retraining by selectively removing undesired knowledge derived from harmful samples while preserving required capabilities on normal cases. However, there exist no available datasets to evaluate the unlearning quality for security protection in biomedical MLLMs. To bridge this gap, we propose the first benchmark Multimodal Large Language Model Unlearning for BioMedicine (MLLMU-Med) built upon our novel data generation pipeline that effectively integrates synthetic private data and factual errors into the training set. Our benchmark targets two key scenarios: 1) Privacy protection, where patient private information is mistakenly included in the training set, causing models to unintentionally respond with private data during inference; and 2) Incorrectness removal, where wrong knowledge derived from unreliable sources is embedded into the dataset, leading to unsafe model responses. Moreover, we propose a novel Unlearning Efficiency Score that directly reflects the overall unlearning performance across different subsets. We evaluate five unlearning approaches on MLLMU-Med and find that these methods show limited effectiveness in removing harmful knowledge from biomedical MLLMs, indicating significant room for improvement. This work establishes a new pathway for further research in this promising field.

[52] Gather and Trace: Rethinking Video TextVQA from an Instance-oriented Perspective cs.CV | cs.AIPDF

Yan Zhang, Gangyan Zeng, Daiqing Wu, Huawen Shen, Binbin Li

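TL;DR: This paper proposes GAT (Gather and Trace), a new model that rethinks the Video TextVQA task from an instance-oriented perspective, answering questions by accurately reading and reasoning about text in videos. With a context-aggregated instance gathering module and an instance-focused trajectory tracing module, GAT significantly improves accuracy and efficiency.

Details

Motivation: Existing Video TextVQA methods are mostly built on frame-level frameworks, which suffer from redundant text and implicit relation modeling that limit accuracy and efficiency. The paper takes an instance-oriented perspective to obtain a more efficient and accurate solution.

Result: GAT outperforms existing methods on several public datasets, improving accuracy by 3.86% and running inference ten times faster than video large language models.

Insight: An instance-oriented framework effectively addresses text redundancy and implicit relation modeling in videos, significantly improving Video TextVQA performance.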

Abstract: Video text-based visual question answering (Video TextVQA) aims to answer questions by explicitly reading and reasoning about the text involved in a video. Most works in this field follow a frame-level framework which suffers from redundant text entities and implicit relation modeling, resulting in limitations in both accuracy and efficiency. In this paper, we rethink the Video TextVQA task from an instance-oriented perspective and propose a novel model termed GAT (Gather and Trace). First, to obtain an accurate reading result for each video text instance, a context-aggregated instance gathering module is designed to integrate the visual appearance, layout characteristics, and textual contents of the related entities into a unified textual representation. Then, to capture the dynamic evolution of text in the video flow, an instance-focused trajectory tracing module is utilized to establish spatio-temporal relationships between instances and infer the final answer. Extensive experiments on several public Video TextVQA datasets validate the effectiveness and generalization of our framework. GAT outperforms existing Video TextVQA methods, video-language pretraining methods, and video large language models in both accuracy and inference speed. Notably, GAT surpasses the previous state-of-the-art Video TextVQA methods by 3.86% in accuracy and achieves ten times faster inference than video large language models. The source code is available at https://github.com/zhangyan-ucas/GAT.

[53] Bootstrap Deep Spectral Clustering with Optimal Transport cs.CV | cs.LGPDF

Wengang Guo, Wei Ye, Chunchun Chen, Xin Sun, Christian Böhm

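TL;DR: The paper proposes BootSC, a deep spectral clustering model that jointly optimizes the three stages of spectral clustering (affinity matrix construction, spectral embedding, and k-means clustering), using optimal-transport-derived supervision and an orthogonal re-parameterization technique to improve clustering performance.

Details

Motivation: The main shortcomings of spectral clustering are its disjoint, stage-wise optimization process and its limited representation capacity, so the paper proposes an end-to-end deep learning approach to address them.

Result: Notable gains on datasets such as ImageNet-Dogs, for example a 16% NMI improvement over the runner-up method.

Insight: Joint optimization and optimal-transport supervision improve spectral clustering, and orthogonal re-parameterization is key to enhancing discrimination capability.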

Abstract: Spectral clustering is a leading clustering method. Two of its major shortcomings are the disjoint optimization process and the limited representation capacity. To address these issues, we propose a deep spectral clustering model (named BootSC), which jointly learns all stages of spectral clustering – affinity matrix construction, spectral embedding, and $k$-means clustering – using a single network in an end-to-end manner. BootSC leverages effective and efficient optimal-transport-derived supervision to bootstrap the affinity matrix and the cluster assignment matrix. Moreover, a semantically-consistent orthogonal re-parameterization technique is introduced to orthogonalize spectral embeddings, significantly enhancing the discrimination capability. Experimental results indicate that BootSC achieves state-of-the-art clustering performance. For example, it accomplishes a notable 16% NMI improvement over the runner-up method on the challenging ImageNet-Dogs dataset. Our code is available at https://github.com/spdj2271/BootSC.
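
The abstract does not say how the optimal-transport supervision is computed. As a rough illustration only, the sketch below uses Sinkhorn-Knopp iterations, a common way to turn raw sample-to-cluster scores into a balanced soft assignment. The function name `sinkhorn`, the temperature `epsilon`, and the toy scores are assumptions for this example, not BootSC's actual procedure.

```python
import math

def sinkhorn(scores, epsilon=0.1, n_iters=200):
    """Balanced soft assignment via Sinkhorn-Knopp (illustrative, not BootSC's code).

    scores: n x k list of sample-to-cluster similarities.
    Returns an n x k matrix whose rows sum to 1 and whose columns
    each carry roughly n / k total mass.
    """
    n, k = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # subtract the max for numerical stability
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: give every cluster the same total mass 1 / k.
        col = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col[j] * k) for j in range(k)] for i in range(n)]
        # Row step: give every sample the same total mass 1 / n.
        row = [sum(r) for r in q]
        q = [[q[i][j] / (row[i] * n) for j in range(k)] for i in range(n)]
    # Rescale so that each row is a probability distribution over clusters.
    return [[v * n for v in r] for r in q]

# Toy scores: every sample prefers cluster 0, but the balance constraint
# forces half of the total mass onto cluster 1.
toy_scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
assignment = sinkhorn(toy_scores)
```

Balanced assignments of this kind are often used as pseudo-labels because they rule out the degenerate solution in which every sample collapses into a single cluster.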

[54] ViFP: A Framework for Visual False Positive Detection to Enhance Reasoning Reliability in VLMs cs.CV | cs.AIPDF

Ben Zhang, LuLu Yu, Lei Gao, Jing Liu, QuanJiang Guo

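TL;DR: ViFP is a general framework for detecting false positives (FPs) in vision-language model (VLM) reasoning. It improves reasoning reliability through sub-question templates and multi-turn QA, and is validated on several datasets.

Details

Motivation: Existing methods that rely on multi-step reasoning datasets and reinforcement learning to improve VLM reasoning reliability incur high training costs and generalize poorly; ViFP aims to address these problems.

Result: On A-OKVQA, OKVQA, and FVQA, ViFP improves accuracy (by up to 5.4% on A-OKVQA) and reduces the number of FPs.

Insight: Dynamically detecting FPs and adaptively adjusting the reasoning path is key to improving VLM reliability; sub-question templates and multi-turn QA improve generalization.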

Abstract: In visual-language model (VLM) reasoning, false positive (FP) reasoning occurs when a model generates a correct answer but follows an incorrect reasoning path. Existing methods rely on specific multi-step reasoning datasets and reinforcement learning strategies, leading to high training costs and limited generalization. In this work, we propose ViFP, a general framework for enhancing visual reasoning reliability. It improves both answer accuracy and reasoning soundness by detecting FPs. ViFP tackles the limitations of dataset dependency and poor generalization by constructing sub-question templates grounded in the core dimensions of visual reasoning, such as object localization, characteristic description, and object discovery. ViFP then builds effective reasoning paths via multi-turn QA to improve reasoning accuracy. Meanwhile, ViFP dynamically analyzes the consistency of the reasoning path to identify potential FPs, and introduces a targeted chain-of-thought (CoT) mechanism that adaptively guides both FP and non-FP samples, thereby reducing logical errors in the reasoning path while preserving accuracy. Finally, we introduce a reliability evaluation metric, VoC, which integrates answer accuracy and the FP rate, providing a quantitative tool to assess whether a VLM not only answers correctly, but also reasons reliably. Our experiments on closed-source VLMs show that ViFP consistently improves performance across three datasets: A-OKVQA, OKVQA, and FVQA. On A-OKVQA, ViFP improves accuracy by up to 5.4%, surpassing the previous state-of-the-art by 4.3%, and significantly reduces the number of FPs, validating its benefits in enhancing reasoning reliability.

[55] Small Lesions-aware Bidirectional Multimodal Multiscale Fusion Network for Lung Disease Classification cs.CVPDF

Jianxun Yu, Ruiquan Ge, Zhipeng Wang, Cheng Yang, Chenyu Lin

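TL;DR: To address the misdiagnosis of small lesions and the dimensionality gap between medical imaging and electronic health record data, the paper proposes the Multimodal Multiscale Cross-Attention Fusion Network (MMCAF-Net). It extracts lesion features with a feature pyramid and a 3D multi-scale convolutional attention module, and fuses the modalities with a multi-scale cross-attention module. On the Lung-PET-CT-Dx dataset, the model surpasses existing methods.

Details

Motivation: Misdiagnosis of small lesions is a prominent problem in disease diagnosis, and existing multimodal methods struggle to align and fuse medical imaging with electronic health record data.

Result: Experiments on the Lung-PET-CT-Dx dataset show that MMCAF-Net improves diagnostic accuracy over existing methods.

Insight: Combining multimodal and multi-scale strategies helps resolve alignment and fusion between medical images and other data modalities, improving diagnostic performance.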

Abstract: The diagnosis of medical diseases faces challenges such as the misdiagnosis of small lesions. Deep learning, particularly multimodal approaches, has shown great potential in the field of medical disease diagnosis. However, the differences in dimensionality between medical imaging and electronic health record data present challenges for effective alignment and fusion. To address these issues, we propose the Multimodal Multiscale Cross-Attention Fusion Network (MMCAF-Net). This model employs a feature pyramid structure combined with an efficient 3D multi-scale convolutional attention module to extract lesion-specific features from 3D medical images. To further enhance multimodal data integration, MMCAF-Net incorporates a multi-scale cross-attention module, which resolves dimensional inconsistencies, enabling more effective feature fusion. We evaluated MMCAF-Net on the Lung-PET-CT-Dx dataset, and the results showed a significant improvement in diagnostic accuracy, surpassing current state-of-the-art methods. The code is available at https://github.com/yjx1234/MMCAF-Net

[56] SplitGaussian: Reconstructing Dynamic Scenes via Visual Geometry Decomposition cs.CVPDF

Jiahui Li, Shengeng Tang, Jingxuan He, Gang Huang, Zhangye Wang

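TL;DR: SplitGaussian is a new framework that reconstructs dynamic 3D scenes by explicitly decomposing static and dynamic components, addressing motion leakage and geometric distortion.

Details

Motivation: Existing dynamic scene reconstruction methods based on Gaussian Splatting often entangle static and dynamic elements in a shared representation, causing motion leakage, geometric distortions, and temporal flickering.

Result: Outperforms existing methods in rendering quality, geometric stability, and motion separation.

Insight: Decoupling static and dynamic modeling is key to improving dynamic scene reconstruction.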

Abstract: Reconstructing dynamic 3D scenes from monocular video remains fundamentally challenging due to the need to jointly infer motion, structure, and appearance from limited observations. Existing dynamic scene reconstruction methods based on Gaussian Splatting often entangle static and dynamic elements in a shared representation, leading to motion leakage, geometric distortions, and temporal flickering. We identify that the root cause lies in the coupled modeling of geometry and appearance across time, which hampers both stability and interpretability. To address this, we propose SplitGaussian, a novel framework that explicitly decomposes scene representations into static and dynamic components. By decoupling motion modeling from background geometry and allowing only the dynamic branch to deform over time, our method prevents motion artifacts in static regions while supporting view- and time-dependent appearance refinement. This disentangled design not only enhances temporal consistency and reconstruction fidelity but also accelerates convergence. Extensive experiments demonstrate that SplitGaussian outperforms prior state-of-the-art methods in rendering quality, geometric stability, and motion separation.

[57] Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting cs.CV | cs.LGPDF

Yuyang Liu, Qiuhe Hong, Linlan Huang, Alexandra Gomez-Villa, Dipam Goswami

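TL;DR: The paper is the first systematic survey focused on continual learning (CL) for vision-language models (VLMs). It proposes a challenge-driven taxonomy built on three core failure modes and calls for better evaluation protocols and datasets.

Details

Motivation: The cross-modal alignment and generalization capabilities of VLMs are vulnerable to catastrophic forgetting during continual learning, and traditional unimodal CL methods do not transfer directly, so the problems unique to VLMs need dedicated study.

Result: The taxonomy covers the core challenges and their solutions, but current evaluation protocols fail to capture VLM-specific forgetting and compositional generalization.

Insight: Future directions include continual pre-training and compositional zero-shot learning; benchmarks better suited to VLMs are needed.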

Abstract: Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. However, enabling them to learn continually from non-stationary data remains a major challenge, as their cross-modal alignment and generalization capabilities are particularly vulnerable to catastrophic forgetting. Unlike traditional unimodal continual learning (CL), VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey offers the first focused and systematic review of continual learning for VLMs (VLM-CL). We begin by identifying the three core failure modes that degrade performance in VLM-CL. Based on these, we propose a challenge-driven taxonomy that maps solutions to their target problems: (1) Multi-Modal Replay Strategies address cross-modal drift through explicit or implicit memory mechanisms; (2) Cross-Modal Regularization preserves modality alignment during updates; and (3) Parameter-Efficient Adaptation mitigates parameter interference with modular or low-rank updates. We further analyze current evaluation protocols, datasets, and metrics, highlighting the need for better benchmarks that capture VLM-specific forgetting and compositional generalization. Finally, we outline open problems and future directions, including continual pre-training and compositional zero-shot learning. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems. All resources are available at: https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models.

[58] LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation cs.CV | cs.AI | cs.LG | cs.MMPDF

Kangrui Cen, Baixuan Zhao, Yi Xin, Siqi Luo, Guangtao Zhai

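TL;DR: LayerT2V generates video layer by layer, resolving semantic conflicts in multi-object trajectory control and improving the quality and controllability of multi-object scene generation.

Details

Motivation: Current text-to-video (T2V) generation performs poorly when multiple objects move; quality degrades from semantic conflicts, especially when object trajectories intersect. Most existing T2V methods focus on single-object motion and lack effective control over multi-object scenes.

Result: Experiments show that LayerT2V outperforms existing methods in multi-object scene generation, with 1.4x and 4.5x improvements in mIoU and AP50, respectively, along with better consistency and controllability.

Insight: Layered generation decouples a complex scene into independent layers, allowing more flexible control of multi-object motion and suggesting a direction for object composition in future video generation.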

Abstract: Controlling object motion trajectories in Text-to-Video (T2V) generation is a challenging and relatively under-explored area, particularly in scenarios involving multiple moving objects. Most community models and datasets in the T2V domain are designed for single-object motion, limiting the performance of current generative models in multi-object tasks. Additionally, existing motion control methods in T2V either lack support for multi-object motion scenes or experience severe performance degradation when object trajectories intersect, primarily due to the semantic conflicts in colliding regions. To address these limitations, we introduce LayerT2V, the first approach for generating video by compositing background and foreground objects layer by layer. This layered generation enables flexible integration of multiple independent elements within a video, positioning each element on a distinct “layer” and thus facilitating coherent multi-object synthesis while enhancing control over the generation process. Extensive experiments demonstrate the superiority of LayerT2V in generating complex multi-object scenarios, showcasing 1.4x and 4.5x improvements in mIoU and AP50 metrics over state-of-the-art (SOTA) methods. Project page and code are available at https://kr-panghu.github.io/LayerT2V/ .

[59] Intention Enhanced Diffusion Model for Multimodal Pedestrian Trajectory Prediction cs.CVPDF

Yu Liu, Zhijie Liu, Xiao Ren, You-Fu Li, He Kong

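TL;DR: The paper proposes a diffusion-based multimodal pedestrian trajectory prediction method that incorporates pedestrians' motion intentions to improve interpretability and prediction accuracy.

Details

Motivation: Pedestrian trajectory prediction is critical for path planning and motion control in autonomous driving, but accurate prediction remains difficult because human behavior is multimodal and uncertain. Existing diffusion models capture this stochasticity, yet few explicitly incorporate motion intentions, which limits interpretability and precision.

Result: Evaluated on the ETH and UCY datasets, the method performs on par with or better than existing state-of-the-art methods.

Insight: Explicitly modeling motion intentions improves diffusion models on pedestrian trajectory prediction and enhances interpretability.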

Abstract: Predicting pedestrian motion trajectories is critical for path planning and motion control of autonomous vehicles. However, accurately forecasting crowd trajectories remains a challenging task due to the inherently multimodal and uncertain nature of human motion. Recent diffusion-based models have shown promising results in capturing the stochasticity of pedestrian behavior for trajectory prediction. However, few diffusion-based approaches explicitly incorporate the underlying motion intentions of pedestrians, which can limit the interpretability and precision of prediction models. In this work, we propose a diffusion-based multimodal trajectory prediction model that incorporates pedestrians’ motion intentions into the prediction framework. The motion intentions are decomposed into lateral and longitudinal components, and a pedestrian intention recognition module is introduced to enable the model to effectively capture these intentions. Furthermore, we adopt an efficient guidance mechanism that facilitates the generation of interpretable trajectories. The proposed framework is evaluated on two widely used human trajectory prediction benchmarks, ETH and UCY, on which it is compared against state-of-the-art methods. The experimental results demonstrate that our method achieves competitive performance.

[60] PIS3R: Very Large Parallax Image Stitching via Deep 3D Reconstruction cs.CVPDF

Muhua Zhu, Xinhao Jin, Chengbo Wang, Yongcong Zhang, Yifei Xue

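TL;DR: The paper proposes PIS3R, an image stitching method that handles very large parallax through deep 3D reconstruction, combining a visual geometry grounded transformer with a point-conditioned diffusion module to improve stitching quality.

Details

Motivation: Existing image stitching methods handle large-parallax scenes poorly, producing geometric distortion or misalignment.

Result: Experiments show that PIS3R achieves high-quality stitching in large-parallax scenes and outperforms existing methods both quantitatively and qualitatively.

Insight: Solving large parallax through deep 3D reconstruction yields geometrically complete results that are directly usable by downstream 3D vision tasks such as SfM.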

Abstract: Image stitching aims to align two images taken from different viewpoints into one seamless, wider image. However, when the 3D scene contains depth variations and the camera baseline is significant, noticeable parallax occurs, meaning the relative positions of scene elements differ substantially between views. Most existing stitching methods struggle to handle such images with large parallax effectively. To address this challenge, in this paper, we propose an image stitching solution called PIS3R that is robust to very large parallax, based on the novel concept of deep 3D reconstruction. First, we apply a visual geometry grounded transformer to two input images with very large parallax to obtain both intrinsic and extrinsic parameters, as well as the dense 3D scene reconstruction. Subsequently, we reproject the reconstructed dense point cloud onto a designated reference view using the recovered camera parameters, achieving pixel-wise alignment and generating an initial stitched image. Finally, to further address potential artifacts such as holes or noise in the initial stitching, we propose a point-conditioned image diffusion module to obtain the refined result. Compared with existing methods, our solution tolerates very large parallax and also provides results that fully preserve the geometric integrity of all pixels in the 3D photogrammetric context, enabling direct applicability to downstream 3D vision tasks such as SfM. Experimental results demonstrate that the proposed algorithm provides accurate stitching results for images with very large parallax, and outperforms the existing methods qualitatively and quantitatively.
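
The reprojection step described in the abstract can be illustrated with a standard pinhole camera model. The sketch below is a minimal example under stated assumptions, not the PIS3R implementation: the function name `reproject`, the dictionary-based image, and the nearest-point (z-buffer) rule for resolving overlaps are all choices made for this illustration.

```python
def reproject(points, K, R, t, width, height):
    """Project coloured 3D points into a reference view (pinhole model).

    points: iterable of ((X, Y, Z), colour) in world coordinates.
    K: 3x3 intrinsics; R, t: world-to-camera rotation and translation.
    Returns {(u, v): colour}, keeping the nearest point per pixel.
    """
    depth, image = {}, {}
    for (X, Y, Z), colour in points:
        # World -> camera coordinates: x_cam = R @ x_world + t.
        xc = R[0][0] * X + R[0][1] * Y + R[0][2] * Z + t[0]
        yc = R[1][0] * X + R[1][1] * Y + R[1][2] * Z + t[1]
        zc = R[2][0] * X + R[2][1] * Y + R[2][2] * Z + t[2]
        if zc <= 0:
            continue  # point lies behind the camera
        # Perspective division followed by the intrinsics.
        u = int(round(K[0][0] * xc / zc + K[0][2]))
        v = int(round(K[1][1] * yc / zc + K[1][2]))
        if not (0 <= u < width and 0 <= v < height):
            continue  # projects outside the image
        # Z-buffer: a closer point overwrites a farther one.
        if (u, v) not in depth or zc < depth[(u, v)]:
            depth[(u, v)] = zc
            image[(u, v)] = colour
    return image
```

Pixels that receive no point stay empty in this sketch; these are the holes that the abstract says the point-conditioned diffusion module is meant to refine.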

Giuseppe Chindemi, Camilla Bellone, Benoit Girard

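TL;DR: The paper discusses the shift from traditional human observation to artificial intelligence and machine learning methods in the study of rodent social behavior, analyzes the advantages and challenges, and offers practical solutions.

Details

Motivation: Traditional approaches to studying rodent social behavior introduce bias and can fail to capture complex interactions, whereas modern approaches combining computer vision, ethology, and neuroscience offer a more comprehensive view.

Result: Modern methods allow a more comprehensive analysis of social behavior, but integrating AI tools still poses challenges that call for further discussion and refinement.

Insight: AI and machine learning open new perspectives on rodent behavior research, but practical problems in their technical and scientific application remain to be solved.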

Abstract: The study of rodent social behavior has shifted in the last years from relying on direct human observation to more nuanced approaches integrating computational methods in artificial intelligence (AI) and machine learning. While conventional approaches introduce bias and can fail to capture the complexity of rodent social interactions, modern approaches bridging computer vision, ethology and neuroscience provide more multifaceted insights into behavior which are particularly relevant to social neuroscience. Despite these benefits, the integration of AI into social behavior research also poses several challenges. Here we discuss the main steps involved and the tools available for analyzing rodent social behavior, examining their advantages and limitations. Additionally, we suggest practical solutions to address common hurdles, aiming to guide young researchers in adopting these methods and to stimulate further discussion among experts regarding the evolving requirements of these tools in scientific applications.

[62] Revisiting Continual Semantic Segmentation with Pre-trained Vision Models cs.CVPDF

Duzhen Zhang, Yong Ren, Wei Cong, Junhao Zheng, Qiaoyi Su

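TL;DR: The paper re-evaluates pre-trained vision models in continual semantic segmentation, finds that the anti-forgetting ability of direct fine-tuning (DFT) has been underestimated, and proposes an enhanced method, DFT*, that matches or outperforms existing methods.

Details

Motivation: Continual semantic segmentation (CSS) requires a model to learn new classes incrementally while retaining knowledge of old ones. Existing work assumes that direct fine-tuning (DFT) performs poorly because of catastrophic forgetting; the paper questions this assumption and re-evaluates what DFT can actually do.

Result: Experiments show that DFT* achieves competitive or superior performance compared to sixteen existing CSS methods, with substantially fewer trainable parameters and less training time.

Insight: The inherent anti-forgetting ability of PVMs has been underestimated, and forgetting arises mainly from classifier drift rather than from degradation of the backbone representations.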

Abstract: Continual Semantic Segmentation (CSS) seeks to incrementally learn to segment novel classes while preserving knowledge of previously encountered ones. Recent advancements in CSS have been largely driven by the adoption of Pre-trained Vision Models (PVMs) as backbones. Among existing strategies, Direct Fine-Tuning (DFT), which sequentially fine-tunes the model across classes, remains the most straightforward approach. Prior work often regards DFT as a performance lower bound due to its presumed vulnerability to severe catastrophic forgetting, leading to the development of numerous complex mitigation techniques. However, we contend that this prevailing assumption is flawed. In this paper, we systematically revisit forgetting in DFT across two standard benchmarks, Pascal VOC 2012 and ADE20K, under eight CSS settings using two representative PVM backbones: ResNet101 and Swin-B. Through a detailed probing analysis, our findings reveal that existing methods significantly underestimate the inherent anti-forgetting capabilities of PVMs. Even under DFT, PVMs retain previously learned knowledge with minimal forgetting. Further investigation of the feature space indicates that the observed forgetting primarily arises from the classifier’s drift away from the PVM, rather than from degradation of the backbone representations. Based on this insight, we propose DFT*, a simple yet effective enhancement to DFT that incorporates strategies such as freezing the PVM backbone and previously learned classifiers, as well as pre-allocating future classifiers. Extensive experiments show that DFT* consistently achieves competitive or superior performance compared to sixteen state-of-the-art CSS methods, while requiring substantially fewer trainable parameters and less training time.

[63] PKSS-Align: Robust Point Cloud Registration on Pre-Kendall Shape Space cs.CVPDF

Chenlei Lv, Hui Huang

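TL;DR: PKSS-Align is a robust point cloud registration method based on the Pre-Kendall shape space. It handles point clouds with similarity transformations, non-uniform densities, noise, and defective parts, requires no training data or complex feature encoding, and is efficient.

Details

Motivation: Point cloud registration is sensitive to similarity transformations, noise, and incomplete geometric structures; non-uniform scales and defective parts in particular make it prone to local optima. Existing methods struggle to handle these issues simultaneously, so a robust and efficient solution is needed.

Result: Experiments show that PKSS-Align outperforms existing state-of-the-art methods in efficiency and practicality.

Insight: A shape-space metric handles multiple sources of disturbance in point cloud registration, offering an efficient, training-free approach to robust registration.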

Abstract: Point cloud registration is a classical topic in 3D vision and computer graphics. Registration is typically sensitive to similarity transformations (translation, scaling, and rotation), noisy points, and incomplete geometric structures. In particular, non-uniform scales and defective parts of point clouds increase the probability of getting stuck in local optima during registration. In this paper, we propose a robust point cloud registration method, PKSS-Align, that can handle various influences, including similarity transformations, non-uniform densities, random noisy points, and defective parts. The proposed method measures shape feature-based similarity between point clouds on the Pre-Kendall shape space (PKSS), which is a shape measurement-based scheme and does not require a point-to-point or point-to-plane metric. The employed measurement can be regarded as a manifold metric that is robust to various representations in the Euclidean coordinate system. Benefiting from this measurement, the transformation matrix can be generated directly for point clouds affected by all of the mentioned influences at the same time. The proposed method requires neither data training nor complex feature encoding. With simple parallel acceleration, it achieves significant improvements in efficiency and feasibility in practice. Experiments demonstrate that our method outperforms the relevant state-of-the-art methods.

[64] FrEVL: Leveraging Frozen Pretrained Embeddings for Efficient Vision-Language Understanding cs.CV | cs.CLPDF

Emmanuelle Bourigault, Pauline Bourigault

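TL;DR: FrEVL shows that frozen pretrained embeddings can support efficient vision-language understanding, approaching SOTA performance while improving computational and energy efficiency.

Details

Motivation: The computational demands of current vision-language models are too high, limiting practical deployment; a more efficient solution is needed.

Result: Reaches 85% to 95% of SOTA performance on standard benchmarks, with a 2.3x speedup and 52% lower energy consumption.

Insight: The effectiveness of frozen embeddings depends on how well the pretraining objectives match the downstream task requirements; the approach suits scenarios with pre-computable inputs or where deployment constraints outweigh marginal performance gains.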

Abstract: The deployment of vision-language models remains constrained by substantial computational requirements. We present FrEVL, a framework exploring whether frozen pretrained embeddings can support effective vision-language understanding. Our analysis reveals that frozen embeddings contain rich information for discriminative tasks, achieving 85% to 95% of state-of-the-art performance on standard benchmarks with only 68.4M trainable parameters. This performance dichotomy reveals a critical insight: frozen embedding effectiveness depends on alignment between pretraining objectives and downstream task requirements. When accounting for end-to-end computation including embedding extraction, FrEVL provides $2.3\times$ speedup with 52% lower energy consumption, making it suitable for scenarios with pre-computable inputs or when deployment constraints outweigh marginal performance gains. Our evaluation provides practitioners with guidance on when frozen embedding approaches represent viable alternatives to full model deployment. We will release our complete implementation and evaluation framework to facilitate further research into efficient multi-modal understanding.

[65] Length Matters: Length-Aware Transformer for Temporal Sentence Grounding cs.CVPDF

Yifan Wang, Ziyi Liu, Xiaolong Sun, Jiawei Wang, Hongmin Liu

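TL;DR: The paper proposes the Length-Aware Transformer (LATR), which uses the length priors of video-description pairs to design a grouped query mechanism for the temporal sentence grounding (TSG) task, reducing redundant predictions and improving performance.

Details

Motivation: DETR-based models have advanced the TSG task, but without explicit supervision the learned queries tend to overlap in roles, producing redundant predictions. The paper introduces length priors to refine the query mechanism.

Result: State-of-the-art performance on three public benchmarks; ablation studies validate the contribution of each component.

Insight: Length priors are critical for assigning query roles in TSG, reducing redundant predictions and improving model performance.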

Abstract: Temporal sentence grounding (TSG) is a highly challenging task aiming to localize the temporal segment within an untrimmed video corresponding to a given natural language description. Benefiting from the design of learnable queries, the DETR-based models have achieved substantial advancements in the TSG task. However, the absence of explicit supervision often causes the learned queries to overlap in roles, leading to redundant predictions. Therefore, we propose to improve TSG by making each query fulfill its designated role, leveraging the length priors of the video-description pairs. In this paper, we introduce the Length-Aware Transformer (LATR) for TSG, which assigns different queries to handle predictions based on varying temporal lengths. Specifically, we divide all queries into three groups, responsible for segments with short, middle, and long temporal durations, respectively. During training, an additional length classification task is introduced. Predictions from queries with mismatched lengths are suppressed, guiding each query to specialize in its designated function. Extensive experiments demonstrate the effectiveness of our LATR, achieving state-of-the-art performance on three public benchmarks. Furthermore, the ablation studies validate the contribution of each component of our method and the critical role of incorporating length priors into the TSG task.
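
The grouped-query idea can be sketched as follows. This is a loose illustration rather than LATR's training code: the relative-length thresholds (`short_max`, `long_min`), the function names, and the choice to suppress mismatched queries by zeroing their scores are all hypothetical, since the abstract does not specify them.

```python
SHORT, MIDDLE, LONG = 0, 1, 2

def length_group(start, end, duration, short_max=0.2, long_min=0.5):
    """Bucket a segment by its length relative to the whole video.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    ratio = (end - start) / duration
    if ratio < short_max:
        return SHORT
    if ratio < long_min:
        return MIDDLE
    return LONG

def suppress_mismatched(scores, query_groups, target_group):
    """Zero out the scores of queries whose group differs from the target's."""
    return [s if g == target_group else 0.0
            for s, g in zip(scores, query_groups)]

# Three queries, one per length group; the target segment is mid-length,
# so only the MIDDLE query keeps its score.
target = length_group(start=0.0, end=30.0, duration=100.0)
kept = suppress_mismatched([0.9, 0.8, 0.7], [SHORT, MIDDLE, LONG], target)
```

Restricting each query to one length range is what lets the queries specialize instead of competing for the same segments.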

[66] Analyzing and Mitigating Object Hallucination: A Training Bias Perspective cs.CV | cs.CLPDF

Yifan Li, Kun Zhou, Wayne Xin Zhao, Lei Fang, Ji-Rong Wen

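TL;DR: The paper systematically studies object hallucination in Large Vision-Language Models (LVLMs), identifies bias arising from the training data as a main cause, and proposes Obliviate, an efficient and lightweight debiasing method.

Details

Motivation: LVLMs have made significant progress in multimodal capability but still hallucinate objects, generating text that is inconsistent with the visual input. The paper investigates the role of training data in this problem.

Result: Experiments show that Obliviate significantly reduces object hallucination, applies across model sizes and tasks, and shows potential to extend to other hallucination types.

Insight: Training bias is a main cause of object hallucination in LVLMs, and a lightweight debiasing method offers clear advantages in parameter and data efficiency.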

Abstract: Although scaling up training data has significantly improved the general multimodal capabilities of Large Vision-Language Models (LVLMs), they still suffer from the hallucination issue, generating text that is inconsistent with the visual input. This phenomenon motivates us to systematically investigate the role of training data in hallucination. We introduce a new benchmark, POPEv2, which consists of counterfactual images collected from the training data of LVLMs with certain objects masked. Through comprehensive evaluation on POPEv2, we find that current LVLMs suffer from training bias: they fail to fully leverage their training data and hallucinate more frequently on images seen during training. Specifically, they perform poorly on counterfactual images, often incorrectly answering “Yes” to questions about masked objects. To understand this issue, we conduct probing experiments on the models’ internal components, revealing that this training bias is primarily located in the language modeling (LM) head. Based on these findings, we propose Obliviate, an efficient and lightweight unlearning method designed to mitigate object hallucination via training bias unlearning. Obliviate identifies the discrepancy between ground-truth labels and model outputs on the training data as a proxy for bias and adopts a parameter- and data-efficient fine-tuning strategy that only updates the LM head. Extensive experiments demonstrate the effectiveness of our approach. While only reusing the training data and updating approximately 2% of the parameters, Obliviate significantly reduces hallucination across both discriminative and generative tasks. Furthermore, it demonstrates strong scalability with respect to both model size (2B to 72B) and training data volume, and exhibits promising generalization to hallucination types beyond object-level hallucination. Our code and data will be publicly released.

[67] TempFlow-GRPO: When Timing Matters for GRPO in Flow Models cs.CVPDF

Xiaoxuan He, Siming Fu, Yuke Zhao, Wanli Li, Jian Yang

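TL;DR: TempFlow-GRPO is a new GRPO framework that improves reinforcement-learning alignment of flow models through temporally-aware optimization, addressing the temporal uniformity assumption of existing methods.

Details

Motivation: Existing flow-matching text-to-image models use sparse terminal rewards and uniform credit assignment for human preference alignment, leading to inefficient exploration and suboptimal convergence.

Result: State-of-the-art performance on human preference alignment and standard text-to-image benchmarks.

Insight: The temporal dynamics of the flow-model generation process are critical for credit assignment and optimization in reinforcement learning.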

Abstract: Recent flow matching models for text-to-image generation have achieved remarkable quality, yet their integration with reinforcement learning for human preference alignment remains suboptimal, hindering fine-grained reward-based optimization. We observe that the key impediment to effective GRPO training of flow models is the temporal uniformity assumption in existing approaches: sparse terminal rewards with uniform credit assignment fail to capture the varying criticality of decisions across generation timesteps, resulting in inefficient exploration and suboptimal convergence. To remedy this shortcoming, we introduce TempFlow-GRPO (Temporal Flow GRPO), a principled GRPO framework that captures and exploits the temporal structure inherent in flow-based generation. TempFlow-GRPO introduces two key innovations: (i) a trajectory branching mechanism that provides process rewards by concentrating stochasticity at designated branching points, enabling precise credit assignment without requiring specialized intermediate reward models; and (ii) a noise-aware weighting scheme that modulates policy optimization according to the intrinsic exploration potential of each timestep, prioritizing learning during high-impact early stages while ensuring stable refinement in later phases. These innovations endow the model with temporally-aware optimization that respects the underlying generative dynamics, leading to state-of-the-art performance in human preference alignment and standard text-to-image benchmarks.

[68] RiemanLine: Riemannian Manifold Representation of 3D Lines for Factor Graph Optimization cs.CV | cs.ROPDF

Yanyan Li, Ze Yang, Keisuke Tateno, Federico Tombari, Liang Zhao, Gim Hee Lee

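TL;DR: RiemanLine proposes a minimal parameterization of 3D lines on Riemannian manifolds that accounts for both individual lines and the structural regularity of parallel-line groups, improving the accuracy of pose estimation and line reconstruction.

Details

Motivation: Existing 3D line representations usually treat each line independently, overlooking structural regularities such as parallel lines that are common in man-made environments. This can lead to redundant parameters and inefficient optimization.

Result: On ICL-NUIM, TartanAir, and synthetic data, the method improves the accuracy of pose estimation and line reconstruction as well as convergence stability, while keeping parameter dimensionality low.

Insight: By decoupling global and local components, RiemanLine exploits structural regularities such as parallel lines, giving a more efficient and compact representation for 3D line optimization.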

Abstract: Minimal parametrization of 3D lines plays a critical role in camera localization and structural mapping. Existing representations in robotics and computer vision predominantly handle independent lines, overlooking structural regularities such as sets of parallel lines that are pervasive in man-made environments. This paper introduces RiemanLine, a unified minimal representation for 3D lines formulated on Riemannian manifolds that jointly accommodates both individual lines and parallel-line groups. Our key idea is to decouple each line landmark into global and local components: a shared vanishing direction optimized on the unit sphere $\mathcal{S}^2$, and scaled normal vectors constrained on orthogonal subspaces, enabling compact encoding of structural regularities. For $n$ parallel lines, the proposed representation reduces the parameter space from $4n$ (orthonormal form) to $2n+2$, naturally embedding parallelism without explicit constraints. We further integrate this parameterization into a factor graph framework, allowing global direction alignment and local reprojection optimization within a unified manifold-based bundle adjustment. Extensive experiments on ICL-NUIM, TartanAir, and synthetic benchmarks demonstrate that our method achieves significantly more accurate pose estimation and line reconstruction, while reducing parameter dimensionality and improving convergence stability.
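
The parameter counts quoted in the abstract can be checked with a few lines of arithmetic. In the sketch below, the split of the $2n+2$ total into 2 degrees of freedom for the shared direction on $\mathcal{S}^2$ plus 2 per line for a vector in the plane orthogonal to that direction is our reading of the abstract; the function names are made up for this example.

```python
def orthonormal_param_count(n):
    """Independent lines in orthonormal form: 4 parameters per line."""
    return 4 * n

def parallel_group_param_count(n):
    """One shared direction on the unit sphere (2 DoF) plus, per line,
    a scaled normal in the 2D subspace orthogonal to it (2 DoF each)."""
    shared_direction = 2
    per_line_normal = 2
    return shared_direction + per_line_normal * n

for n in (1, 10, 100):
    print(n, orthonormal_param_count(n), parallel_group_param_count(n))
```

For a single line both counts equal 4, so nothing is lost in the individual-line case; the saving of $2n-2$ parameters appears only once several lines share a direction.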

[69] RotatedMVPS: Multi-view Photometric Stereo with Rotated Natural Light cs.CVPDF

Songyun Yang, Yufei Han, Jilong Zhang, Kongming Liang, Peng Yu

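TL;DR: RotatedMVPS recovers shape and reflectance under rotated natural light, using a rotation stage to ensure light consistency and integrating data priors from single-view photometric stereo to improve recovery.

Details

Motivation: Existing MVPS methods usually require a darkroom environment or overlook the recovery of reflectance and illumination properties, limiting their use in natural illumination scenarios and downstream inverse rendering tasks.

Result: The method's effectiveness is validated on both synthetic and real-world datasets.

Insight: Combining data priors with light consistency is key to accurate shape and reflectance recovery under natural illumination.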

Abstract: Multiview photometric stereo (MVPS) seeks to recover high-fidelity surface shapes and reflectances from images captured under varying views and illuminations. However, existing MVPS methods often require controlled darkroom settings for varying illuminations or overlook the recovery of reflectance and illumination properties, limiting their applicability in natural illumination scenarios and downstream inverse rendering tasks. In this paper, we propose RotatedMVPS to solve shape and reflectance recovery under rotated natural light, achievable with a practical rotation stage. By ensuring light consistency across different camera and object poses, our method reduces the unknowns associated with complex environment light. Furthermore, we integrate data priors from off-the-shelf learning-based single-view photometric stereo methods into our MVPS framework, significantly enhancing the accuracy of shape and reflectance recovery. Experimental results on both synthetic and real-world datasets demonstrate the effectiveness of our approach.

[70] TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding cs.CVPDF

Canhui Tang, Zifan Han, Hongbo Sun, Sanping Zhou, Xuchong Zhang

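TL;DR: The paper proposes TSPO (Temporal Sampling Policy Optimization), which uses reinforcement learning to optimize sparse frame sampling for long-form video-language understanding, significantly improving the performance of multimodal large language models (MLLMs).

Details

Motivation: Existing video MLLMs face context limits and training costs on long videos, so they usually rely on training-free uniform sampling or keyframe search. These approaches may miss critical events or be constrained by the event-understanding ability of pre-trained models, which calls for a training-based solution.

Result: TSPO achieves state-of-the-art (SOTA) performance on multiple long-video understanding benchmarks and transfers across different cutting-edge video MLLMs.

Insight: 1. Reinforcement learning is an effective way to handle the unsupervised, non-differentiable sparse frame sampling problem. 2. Jointly modeling keyframe selection and language generation can significantly improve long-video understanding. 3. Rule-based rewards allow efficient training while maintaining model performance.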

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. The limitation arises from MLLMs’ context limit and training costs, necessitating sparse frame sampling before feeding videos into MLLMs. Existing video MLLMs adopt training-free uniform sampling or keyframe search, which may miss critical events or be constrained by the pre-trained models’ event understanding capabilities. Meanwhile, building a training-based method remains challenging due to the unsupervised and non-differentiable nature of sparse frame sampling. To address these problems, we propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs’ long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent, which captures event-query correlation for performing probabilistic keyframe selection. Then, we propose the TSPO reinforcement learning paradigm, which models keyframe selection and language generation as a joint decision-making process, enabling end-to-end group relative optimization with efficient rule-based rewards. Furthermore, for the TSPO’s training, we propose a long video training data construction pipeline with comprehensive temporal data and video Needle-in-a-Haystack data. Finally, we incorporate rule-based answering accuracy and temporal locating reward mechanisms to optimize the temporal sampling policy. Comprehensive experiments show that our TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks, and shows transferable ability across different cutting-edge Video-MLLMs.
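
The abstract mentions "group relative optimization" with rule-based rewards. As a rough illustration only, the sketch below shows the group-relative advantage normalization commonly associated with GRPO-style training: each sampled rollout's reward is compared with the mean and standard deviation of its group. Whether TSPO uses exactly this normalization is an assumption; the function name and the `eps` parameter are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    Assumption: one group holds the rule-based rewards of several rollouts
    (for example, different keyframe selections) for the same query.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when all rollouts receive the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one query, rewarded 1.0 for a correct answer and 0.0 otherwise.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Rollouts that beat their group average receive a positive advantage and the others a negative one, so no learned value function is needed for this step.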

Lefei Shen, Mouxiang Chen, Xu Liu, Han Fu, Xiaoxue Ren

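TL;DR: The paper proposes VisionTS++, a cross-modal time series foundation model built on a vision model. Three innovations address the challenges of transferring from vision to time series, and the model achieves SOTA results on multiple benchmarks.

Details

Motivation: Although pre-trained vision models show promise for time series forecasting, cross-modal transfer remains challenging because of gaps in data modality, multivariate forecasting, and probabilistic forecasting.

Result: The model performs strongly on multiple time series forecasting benchmarks, reducing MSE by 6%-44% and ranking first in 9 of 12 probabilistic forecasting settings.

Insight: The work offers a new paradigm for cross-modal knowledge transfer and advances the development of universal time series foundation models.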

Abstract: Recent studies have revealed that vision models pre-trained on images can perform well in time series forecasting by reformulating forecasting as an image reconstruction task, suggesting their potential as universal time series foundation models. However, effective cross-modal transfer from vision to time series remains challenging due to three key discrepancies: (1) data-modality gap between structured, bounded image data and unbounded, heterogeneous time series; (2) multivariate-forecasting gap between standard RGB three-channel-based vision models and the need to model time series with arbitrary numbers of variates; and (3) probabilistic-forecasting gap between the deterministic output formats of most vision models and the requirement for uncertainty-aware probabilistic predictions. To bridge these gaps, we propose VisionTS++, a vision-model-based TSFM that performs continual pre-training on large-scale time series datasets, with three innovations: (1) a vision-model-based filtering mechanism to identify high-quality time series data, thereby mitigating the modality gap and improving pre-training stability; (2) a colorized multivariate conversion method that transforms multivariate time series into multi-subfigure RGB images, capturing complex inter-variate dependencies; and (3) a multi-quantile forecasting approach using parallel reconstruction heads to generate forecasts of different quantile levels, thus more flexibly approximating arbitrary output distributions without restrictive prior distributional assumptions. Evaluated on both in-distribution and out-of-distribution TSF benchmarks, VisionTS++ achieves SOTA results, outperforming specialized TSFMs by 6%-44% in MSE reduction and ranking first in 9 out of 12 probabilistic forecasting settings. Our work establishes a new paradigm for cross-modal knowledge transfer, advancing the development of universal TSFMs.
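
Multi-quantile forecasting heads are typically trained with the quantile (pinball) loss, which penalizes under- and over-prediction asymmetrically. The abstract does not name the loss used by VisionTS++, so the sketch below is only an illustration of that standard loss under this assumption; the function name is hypothetical.

```python
def pinball_loss(y_true: float, y_pred: float, q: float) -> float:
    """Quantile loss for a single prediction at quantile level q in (0, 1)."""
    diff = y_true - y_pred
    # Under-prediction (diff > 0) costs q * diff; over-prediction costs (1 - q) * |diff|.
    return max(q * diff, (q - 1.0) * diff)


# For a high quantile such as 0.9, predicting too low is penalized more heavily.
print(pinball_loss(10.0, 8.0, 0.9))   # under-prediction by 2
print(pinball_loss(8.0, 10.0, 0.9))   # over-prediction by 2
```

One head per quantile level, each trained with its own `q`, yields a set of forecasts that together approximate the output distribution without assuming a parametric form.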

[72] Deep Learning-based Scalable Image-to-3D Facade Parser for Generating Thermal 3D Building Models cs.CV | cs.AIPDF

Yinan Yu, Alex Gonzalez-Caceres, Samuel Scheidegger, Sanjay Somanath, Alexander Hollberg

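TL;DR: The paper presents a deep-learning-based image-to-3D facade parsing pipeline (SI3FP) for generating thermal 3D building models, addressing the scalability and accuracy challenges of feature identification in building renovation.

Details

Motivation: Early-phase renovation planning for existing buildings requires thermal 3D models at LoD3, but traditional methods fall short in scalability and in the accuracy of feature identification.

Result: Tests on typical Swedish residential buildings show that SI3FP achieves approximately 5% error in window-to-wall ratio estimates, which is adequate for early-stage renovation analysis.

Insight: SI3FP is a practical tool for large-scale energy renovation planning and has further potential applications in urban development and planning.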

Abstract: Renovating existing buildings is essential for climate impact. Early-phase renovation planning requires simulations based on thermal 3D models at Level of Detail (LoD) 3, which include features like windows. However, scalable and accurate identification of such features remains a challenge. This paper presents the Scalable Image-to-3D Facade Parser (SI3FP), a pipeline that generates LoD3 thermal models by extracting geometries from images using both computer vision and deep learning. Unlike existing methods relying on segmentation and projection, SI3FP directly models geometric primitives in the orthographic image plane, providing a unified interface while reducing perspective distortions. SI3FP supports both sparse (e.g., Google Street View) and dense (e.g., hand-held camera) data sources. Tested on typical Swedish residential buildings, SI3FP achieved approximately 5% error in window-to-wall ratio estimates, demonstrating sufficient accuracy for early-stage renovation analysis. The pipeline facilitates large-scale energy renovation planning and has broader applications in urban development and planning.
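
The window-to-wall ratio used as the evaluation metric is the total window area divided by the gross facade area. The sketch below computes it for a rectangular facade, assuming windows are axis-aligned rectangles in the orthographic facade plane; the function name and the input layout are hypothetical and not taken from SI3FP.

```python
def window_to_wall_ratio(facade_width, facade_height, windows):
    """Total window area divided by gross facade area.

    windows: iterable of (width, height) rectangles, in the same unit
    as the facade dimensions (an assumption of this sketch).
    """
    wall_area = facade_width * facade_height
    window_area = sum(w * h for w, h in windows)
    return window_area / wall_area


# A 10 m x 6 m facade with four 1.5 m x 1.2 m windows.
print(window_to_wall_ratio(10.0, 6.0, [(1.5, 1.2)] * 4))
```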

[73] Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning cs.CVPDF

Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai

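TL;DR: The paper proposes VITAL, an end-to-end video reasoning framework that uses tool-augmented learning and multimodal chain-of-thought (CoT) reasoning to address limited cross-modal interaction and increased hallucination in long video reasoning.

Details

Motivation: Current text-based chain-of-thought (CoT) methods suffer from limited cross-modal interaction and increased hallucination when handling long videos or complex reasoning chains. The authors aim to improve video reasoning through tool-augmented learning and multimodal CoT.

Result: VITAL performs strongly on 11 video understanding benchmarks and outperforms existing methods, especially on long-video question answering and temporal grounding.

Insight: Temporal grounding and video question answering benefit each other, and tool-augmented learning combined with multimodal CoT can significantly improve long video reasoning.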

Abstract: The video reasoning ability of multimodal large language models (MLLMs) is crucial for downstream tasks like video question answering and temporal grounding. While recent approaches have explored text-based chain-of-thought (CoT) reasoning for MLLMs, these methods often suffer from limited cross-modal interaction and increased hallucination, especially with longer videos or reasoning chains. To address these challenges, we propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning. We observe that temporal grounding and question answering are mutually beneficial for video understanding tasks. Therefore, we construct two high-quality multi-task video reasoning datasets MTVR-CoT-72k for supervised fine-tuning and MTVR-RL-110k for reinforcement learning. Moreover, we propose a Difficulty-aware Group Relative Policy Optimization algorithm (DGRPO) to mitigate difficulty imbalance in multi-task reinforcement learning. Extensive experiments on 11 challenging video understanding benchmarks demonstrate the advanced reasoning ability of VITAL, outperforming existing methods in video question answering and temporal grounding tasks, especially in long video scenarios. All code, data and model weight will be made publicly available.

[74] Efficient Inter-Task Attention for Multitask Transformer Models cs.CVPDF

Christian Bohn, Thomas Kurbiel, Klaus Friedrichs, Hasan Tercan, Tobias Meisen

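TL;DR: The paper proposes Deformable Inter-Task Self-Attention, a new method for multitask Transformer models that significantly reduces computational complexity and inference latency while improving task performance.

Details

Motivation: In multitask learning, the attention matrix of standard multi-head attention grows quadratically with the number of tasks, so the computational cost becomes excessive and efficient operation is hard to achieve under hardware constraints.

Result: On the NYUD-v2 and PASCAL-Context datasets, FLOPs and inference latency drop by an order of magnitude, and per-task performance improves by up to 7.4%.

Insight: A well-designed inter-task attention mechanism can substantially improve efficiency in multitask learning without hurting (and even while improving) model performance, offering a viable route to large-scale multitask learning.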

Abstract: In both Computer Vision and the wider Deep Learning field, the Transformer architecture is well-established as state-of-the-art for many applications. For Multitask Learning, however, where there may be many more queries necessary compared to single-task models, its Multi-Head-Attention often approaches the limits of what is computationally feasible considering practical hardware limitations. This is due to the fact that the size of the attention matrix scales quadratically with the number of tasks (assuming roughly equal numbers of queries for all tasks). As a solution, we propose our novel Deformable Inter-Task Self-Attention for Multitask models that enables the much more efficient aggregation of information across the feature maps from different tasks. In our experiments on the NYUD-v2 and PASCAL-Context datasets, we demonstrate an order-of-magnitude reduction in both FLOPs count and inference latency. At the same time, we also achieve substantial improvements by up to 7.4% in the individual tasks’ prediction quality metrics.
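
The quadratic growth described in the abstract can be checked with a few lines of arithmetic. The sketch below counts the entries of a single dense attention matrix over the queries of all tasks, assuming (as the abstract does) roughly equal numbers of queries per task; the function name is hypothetical.

```python
def joint_attention_entries(num_tasks: int, queries_per_task: int) -> int:
    """Entries in one dense attention matrix over the queries of all tasks."""
    total_queries = num_tasks * queries_per_task
    return total_queries * total_queries


# Growth relative to a single-task model with 1000 queries.
base = joint_attention_entries(1, 1000)
for tasks in (1, 2, 4, 8):
    entries = joint_attention_entries(tasks, 1000)
    print(tasks, entries, entries // base)
```

Doubling the number of tasks quadruples the matrix size, which is the cost the proposed deformable attention is meant to avoid.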

[75] Composed Object Retrieval: Object-level Retrieval via Composed Expressions cs.CVPDF

Tong Wang, Guanyu Yang, Nian Liu, Zongyan Han, Jinxing Zhou

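TL;DR: The paper introduces a new task, Composed Object Retrieval (COR), which retrieves and segments objects using composed expressions that combine a reference object with retrieval text. To tackle it, the authors build the large-scale benchmark COR127K and propose the end-to-end model CORE, which significantly outperforms existing methods.

Details

Motivation: Current multi-modal systems are limited in retrieving fine-grained visual content based on user intent, and in particular cannot accurately localize and segment specific objects. COR fills this gap in object-level retrieval by combining reference objects with retrieval texts.

Result: Experiments show that CORE significantly outperforms existing models on both base and novel categories, providing a simple and effective baseline for the COR task.

Insight: COR opens a new direction for fine-grained multi-modal retrieval research and highlights the potential of object-level retrieval. The success of CORE suggests that end-to-end design remains effective for multi-modal tasks.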

Abstract: Retrieving fine-grained visual content based on user intent remains a challenge in multi-modal systems. Although current Composed Image Retrieval (CIR) methods combine reference images with retrieval texts, they are constrained to image-level matching and cannot localize specific objects. To this end, we propose Composed Object Retrieval (COR), a brand-new task that goes beyond image-level retrieval to achieve object-level precision, allowing the retrieval and segmentation of target objects based on composed expressions combining reference objects and retrieval texts. COR presents significant challenges in retrieval flexibility, which requires systems to identify arbitrary objects satisfying composed expressions while avoiding semantically similar but irrelevant negative objects within the same scene. We construct COR127K, the first large-scale COR benchmark that contains 127,166 retrieval triplets with various semantic transformations in 408 categories. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive visual-textual interaction, and region-level contrastive learning. Extensive experiments demonstrate that CORE significantly outperforms existing models in both base and novel categories, establishing a simple and effective baseline for this challenging task while opening new directions for fine-grained multi-modal retrieval research.

[76] Benchmarking Foundation Models for Mitotic Figure Classification cs.CVPDF

Jonas Ammeling, Jonathan Ganz, Emely Rosbach, Ludwig Lausser, Christof A. Bertram

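TL;DR: The paper studies foundation models for mitotic figure classification in pathology, compares linear probing with low-rank adaptation (LoRA), and examines how data quantity relates to model generalization.

Details

Motivation: Scarcity of labeled data is a challenge in pathology, whereas foundation models trained with self-supervised learning can exploit large amounts of unlabeled data to extract rich features and improve task-specific performance. The paper explores how to use these models efficiently for mitotic figure classification in support of tumor prognosis.

Result: LoRA-adapted foundation models trained on only 10% of the training data reach performance close to that obtained with 100% of the data. LoRA substantially narrows the out-of-domain performance gap, although full fine-tuning of traditional architectures remains competitive.

Insight: Foundation models combined with LoRA adaptation are an effective way to address small-data problems in medical imaging, but performance still depends on the underlying foundation model, and out-of-domain generalization leaves room for improvement.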

Abstract: The performance of deep learning models is known to scale with data quantity and diversity. In pathology, as in many other medical imaging domains, the availability of labeled images for a specific task is often limited. Self-supervised learning techniques have enabled the use of vast amounts of unlabeled data to train large-scale neural networks, i.e., foundation models, that can address the limited data problem by providing semantically rich feature vectors that can generalize well to new tasks with minimal training effort increasing model performance and robustness. In this work, we investigate the use of foundation models for mitotic figure classification. The mitotic count, which can be derived from this classification task, is an independent prognostic marker for specific tumors and part of certain tumor grading systems. In particular, we investigate the data scaling laws on multiple current foundation models and evaluate their robustness to unseen tumor domains. Next to the commonly used linear probing paradigm, we also adapt the models using low-rank adaptation (LoRA) of their attention mechanisms. We compare all models against end-to-end-trained baselines, both CNNs and Vision Transformers. Our results demonstrate that LoRA-adapted foundation models provide superior performance to those adapted with standard linear probing, reaching performance levels close to 100% data availability with only 10% of training data. Furthermore, LoRA-adaptation of the most recent foundation models almost closes the out-of-domain performance gap when evaluated on unseen tumor domains. However, full fine-tuning of traditional architectures still yields competitive performance.
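
Low-rank adaptation keeps a pre-trained weight matrix W frozen and learns a low-rank update, so the effective weight is W + (alpha / r) * B A. The sketch below shows that merge step in plain Python with nested lists. The helper names, the `alpha / rank` scaling convention, and the toy shapes are assumptions for illustration and are not the configuration used in the paper.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha, rank):
    """Effective weight W + (alpha / rank) * B @ A, with W kept frozen.

    w: d_out x d_in frozen weight; b: d_out x rank; a: rank x d_in.
    """
    delta = matmul(b, a)
    scale = alpha / rank
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]            # rank 1, d_in = 2
b_zero = [[0.0], [0.0]]     # B starts at zero, so the update is initially a no-op
print(lora_weight(w, a, b_zero, alpha=2.0, rank=1))
print(lora_weight(w, a, [[1.0], [0.0]], alpha=2.0, rank=1))
```

Only A and B are trained, which is why the number of trainable parameters stays small compared with full fine-tuning.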

[77] Boosting Visual Knowledge-Intensive Training for LVLMs Through Causality-Driven Visual Object Completion cs.CVPDF

Qingguo Hu, Ante Wang, Jia Song, Delai Qiu, Qingsong Liu

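TL;DR: The paper proposes a causality-driven visual object completion task (CVC) and an automated instance construction framework that yields rich training data at low cost, improving large vision-language models (LVLMs) on visual perception tasks.

Details

Motivation: LVLMs currently underperform on tasks that require deep visual perception, such as identifying subtle differences between images, because instruction-tuning datasets contain little visual knowledge.

Result: Experiments show substantial gains on several specialized tasks and comprehensive benchmarks. On the specialized tasks, LLaVA-1.5-7B and LLaVA-1.5-13B improve by 5.4% and 4.0% on average, respectively.

Insight: Causality-driven task design combined with low-cost data generation can significantly strengthen LVLMs on complex visual tasks.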

Abstract: Large Vision-Language Models (LVLMs) have experienced significant advancements in recent years. However, their performance still falls short in tasks requiring deep visual perception, such as identifying subtle differences between images. A potential cause is the scarcity of visual knowledge in popular instruction-tuning corpora, resulting in inadequate visual perception and reasoning capabilities. To address this challenge, we introduce a self-improvement framework grounded in a novel visual knowledge-intensive task, Causality-driven Visual object Completion (CVC). This task requires LVLMs to infer the masked object in an image based on its causal relationships with the other visible information. We first obtain rich examples cheaply through our automated instance construction pipeline, without relying on sophisticated LVLMs (e.g., GPT-4V) or human assistance. Then, LVLMs effectively self-improve through trial and error learning using these created instances. Our experiments demonstrate substantial gains across four challenging specialized tasks and four widely-used comprehensive benchmarks. Especially on specialized tasks, our method achieves an average improvement of 5.4% and 4.0% compared to the corresponding baselines when utilizing LLaVA-1.5-7B and LLaVA-1.5-13B, respectively. The code is available at https://github.com/XMUDeepLIT/CVC.

[78] 4DVD: Cascaded Dense-view Video Diffusion Model for High-quality 4D Content Generation cs.CVPDF

Shuzhou Yang, Xiaodong Cun, Xiaoyu Li, Yaowei Li, Jian Zhang

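TL;DR: 4DVD is a cascaded video diffusion model that generates high-quality 4D content in a decoupled manner. It splits the problem into two subtasks, coarse multi-view layout generation and structure-aware conditional generation, and combines coarse structural priors with the fine appearance content of a monocular video.

Details

Motivation: Directly generating high-dimensional data such as 4D content is highly complex. Existing methods must model 3D space and temporal features simultaneously, which is inefficient and yields poor results, so a more efficient decoupled generation approach is needed.

Result: Experiments show that 4DVD performs strongly on novel view synthesis and 4D generation, and enables accurate optimization of explicit 4D representations (such as 4D Gaussians), making it widely applicable.

Insight: Decoupling the task can markedly improve the efficiency and quality of high-dimensional data generation, and cascaded models are better suited to generating complex data.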

Abstract: Given the high complexity of directly generating high-dimensional data such as 4D, we present 4DVD, a cascaded video diffusion model that generates 4D content in a decoupled manner. Unlike previous multi-view video methods that directly model 3D space and temporal features simultaneously with stacked cross view/temporal attention modules, 4DVD decouples this into two subtasks: coarse multi-view layout generation and structure-aware conditional generation, and effectively unifies them. Specifically, given a monocular video, 4DVD first predicts the dense view content of its layout with superior cross-view and temporal consistency. Based on the produced layout priors, a structure-aware spatio-temporal generation branch is developed, combining these coarse structural priors with the exquisite appearance content of the input monocular video to generate final high-quality dense-view videos. As a result, explicit 4D representations (such as 4D Gaussians) can be optimized accurately, enabling wider practical application. To train 4DVD, we collect a dynamic 3D object dataset, called D-Objaverse, from the Objaverse benchmark and render 16 videos with 21 frames for each object. Extensive experiments demonstrate our state-of-the-art performance on both novel view synthesis and 4D generation. Our project page is https://4dvd.github.io/

[79] QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution cs.CVPDF

Bowen Chai, Zheng Chen, Libo Zhu, Wenbo Li, Yong Guo

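TL;DR: QuantVSR is a low-bit post-training quantization method for real-world video super-resolution (VSR) that addresses the slow speed and heavy resource consumption of diffusion models in VSR. Using a spatio-temporal complexity aware mechanism and a learnable bias alignment module, QuantVSR maintains high performance and significantly outperforms existing low-bit quantization methods.

Details

Motivation: Diffusion models perform well in real-world video super-resolution (VSR), but their slow computation and heavy resource consumption limit practical use. Quantization is a potential solution, yet it is difficult because of the temporal characteristics and high-fidelity requirements of VSR models.

Result: Experiments on synthetic and real-world datasets show that QuantVSR performs close to the full-precision model and significantly outperforms existing low-bit quantization methods.

Insight: Allocating ranks dynamically per layer and jointly optimizing the full-precision and low-bit branches are key to better VSR quantization. The LBA module effectively reduces quantization error and further improves model performance.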

Abstract: Diffusion models have shown superior performance in real-world video super-resolution (VSR). However, the slow processing speeds and heavy resource consumption of diffusion models hinder their practical application and deployment. Quantization offers a potential solution for compressing the VSR model. Nevertheless, quantizing VSR models is challenging due to their temporal characteristics and high fidelity requirements. To address these issues, we propose QuantVSR, a low-bit quantization model for real-world VSR. We propose a spatio-temporal complexity aware (STCA) mechanism, where we first utilize the calibration dataset to measure both spatial and temporal complexities for each layer. Based on these statistics, we allocate layer-specific ranks to the low-rank full-precision (FP) auxiliary branch. Subsequently, we jointly refine the FP and low-bit branches to achieve simultaneous optimization. In addition, we propose a learnable bias alignment (LBA) module to reduce the biased quantization errors. Extensive experiments on synthetic and real-world datasets demonstrate that our method obtains comparable performance with the FP model and significantly outperforms recent leading low-bit quantization methods. Code is available at: https://github.com/bowenchai/QuantVSR.

[80] Learning Robust Intervention Representations with Delta Embeddings cs.CV | cs.AIPDF

Panagiotis Alimisis, Christos Diou

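TL;DR: The paper proposes a method for learning causal representations of interventions. Its Delta Embeddings improve model robustness in out-of-distribution (OOD) settings without additional supervision.

Details

Motivation: Current causal representation learning focuses mainly on representing scene variables, while representations of the interventions themselves have received less attention. The authors argue that representing interventions is key to improving OOD robustness.

Result: In the Causal Triplet challenge, the method significantly outperforms baseline models on both synthetic and real-world datasets.

Insight: Sparse, scene-invariant representations of interventions are an effective strategy for improving OOD robustness and point to a new direction for causal representation learning.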

Abstract: Causal representation learning has attracted significant research interest during the past few years, as a means for improving model generalization and robustness. Causal representations of interventional image pairs have the property that only variables corresponding to scene elements affected by the intervention or action are changed between the start state and the end state. While most work in this area has focused on identifying and representing the variables of the scene under a causal model, fewer efforts have focused on representations of the interventions themselves. In this work, we show that an effective strategy for improving out-of-distribution (OOD) robustness is to focus on the representation of interventions in the latent space. Specifically, we propose that an intervention can be represented by a Causal Delta Embedding that is invariant to the visual scene and sparse in terms of the causal variables it affects. Leveraging this insight, we propose a framework that is capable of learning causal representations from image pairs, without any additional supervision. Experiments in the Causal Triplet challenge demonstrate that Causal Delta Embeddings are highly effective in OOD settings, significantly exceeding baseline performance in both synthetic and real-world benchmarks.

[81] MonoCloth: Reconstruction and Animation of Cloth-Decoupled Human Avatars from Monocular Videos cs.CVPDF

Daisheng Jin, Ying He

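TL;DR: MonoCloth is a new method for reconstructing and animating clothed human avatars from monocular videos.

Details

Motivation: Because monocular videos provide limited geometric information and involve complex non-rigid motion, existing methods struggle to reconstruct and animate clothed human avatars with high quality. MonoCloth addresses this challenge with a part-based decomposition strategy and a dedicated cloth simulation module.

Result: Experiments show that MonoCloth outperforms existing methods in visual reconstruction quality and animation realism, and supports additional tasks such as clothing transfer.

Insight: Combining part-based decomposition with cloth simulation offers a more flexible solution for avatar reconstruction from monocular video and broadens the range of applications.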

Abstract: Reconstructing realistic 3D human avatars from monocular videos is a challenging task due to the limited geometric information and complex non-rigid motion involved. We present MonoCloth, a new method for reconstructing and animating clothed human avatars from monocular videos. To overcome the limitations of monocular input, we introduce a part-based decomposition strategy that separates the avatar into body, face, hands, and clothing. This design reflects the varying levels of reconstruction difficulty and deformation complexity across these components. Specifically, we focus on detailed geometry recovery for the face and hands. For clothing, we propose a dedicated cloth simulation module that captures garment deformation using temporal motion cues and geometric constraints. Experimental results demonstrate that MonoCloth improves both visual reconstruction quality and animation realism compared to existing methods. Furthermore, thanks to its part-based design, MonoCloth also supports additional tasks such as clothing transfer, underscoring its versatility and practical utility.

[82] Skeleton Motion Words for Unsupervised Skeleton-Based Temporal Action Segmentation cs.CVPDF

Uzay Gökay, Federico Spurio, Dominik R. Bach, Juergen Gall

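TL;DR: The paper proposes an unsupervised skeleton-based temporal action segmentation method. A sequence-to-sequence autoencoder keeps joint information disentangled, and quantization produces skeleton motion words; the method clearly outperforms existing unsupervised approaches.

Details

Motivation: Existing skeleton-based temporal action segmentation methods are mostly supervised and depend on costly annotated data. Unsupervised methods mainly target video data, leaving skeleton sequences underexplored despite their practical relevance, robustness, and privacy-preserving nature.

Result: On the HuGaDB, LARa, and BABEL datasets, the method outperforms existing unsupervised temporal action segmentation methods.

Insight: Skeleton data has great potential for unsupervised learning, and disentanglement and quantization effectively improve the semantic expressiveness of action segmentation.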

Abstract: Current state-of-the-art methods for skeleton-based temporal action segmentation are predominantly supervised and require annotated data, which is expensive to collect. In contrast, existing unsupervised temporal action segmentation methods have focused primarily on video data, while skeleton sequences remain underexplored, despite their relevance to real-world applications, robustness, and privacy-preserving nature. In this paper, we propose a novel approach for unsupervised skeleton-based temporal action segmentation. Our method utilizes a sequence-to-sequence temporal autoencoder that keeps the information of the different joints disentangled in the embedding space. Latent skeleton sequences are then divided into non-overlapping patches and quantized to obtain distinctive skeleton motion words, driving the discovery of semantically meaningful action clusters. We thoroughly evaluate the proposed approach on three widely used skeleton-based datasets, namely HuGaDB, LARa, and BABEL. The results demonstrate that our model outperforms the current state-of-the-art unsupervised temporal action segmentation methods. Code is available at https://github.com/bachlab/SMQ .
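
The patch-and-quantize step in the abstract can be illustrated with a toy nearest-codeword assignment. The sketch below uses one-dimensional latent values, a fixed two-entry codebook, and squared Euclidean distance; all of these, along with the function names and the choice to drop a trailing incomplete patch, are assumptions for illustration and not the authors' implementation.

```python
def split_into_patches(seq, patch_len):
    """Cut a latent sequence into non-overlapping patches, dropping any remainder."""
    usable = len(seq) - len(seq) % patch_len
    return [seq[i:i + patch_len] for i in range(0, usable, patch_len)]


def nearest_codeword(patch, codebook):
    """Index of the codebook entry closest to the patch (squared Euclidean distance)."""
    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))
    return min(range(len(codebook)), key=lambda k: sq_dist(patch, codebook[k]))


latents = [0.0, 0.1, 0.9, 1.0, 0.1, 0.0, 0.5]
codebook = [[0.0, 0.0], [1.0, 1.0]]   # two hypothetical "motion words"
patches = split_into_patches(latents, 2)
words = [nearest_codeword(p, codebook) for p in patches]
print(patches, words)
```

Consecutive patches that map to the same word can then be merged into one segment, which is how a sequence of discrete words turns into an action segmentation.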

[83] RAIDX: A Retrieval-Augmented Generation and GRPO Reinforcement Learning Framework for Explainable Deepfake Detection cs.CV | cs.AIPDF

Tianxiao Li, Zhenglin Huang, Haiquan Wen, Yiwei He, Shuchang Lyu

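TL;DR: RAIDX is a new deepfake detection framework that combines Retrieval-Augmented Generation (RAG) with Group Relative Policy Optimization (GRPO) to improve both detection accuracy and decision explainability.

Details

Motivation: Current deepfake detection methods lack transparency because they treat detection as a classification task without explaining their decisions. RAIDX addresses this by incorporating external knowledge and automatically generating fine-grained explanations.

Result: Experiments show that RAIDX achieves state-of-the-art detection performance on multiple benchmarks while providing interpretable textual and visual explanations.

Insight: Combining external knowledge with reinforcement learning can significantly improve both the explainability and the accuracy of deepfake detection.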

Abstract: The rapid advancement of AI-generation models has enabled the creation of hyperrealistic imagery, posing ethical risks through widespread misinformation. Current deepfake detection methods, categorized as face specific detectors or general AI-generated detectors, lack transparency by framing detection as a classification task without explaining decisions. While several LLM-based approaches offer explainability, they suffer from coarse-grained analyses and dependency on labor-intensive annotations. This paper introduces RAIDX (Retrieval-Augmented Image Deepfake Detection and Explainability), a novel deepfake detection framework integrating Retrieval-Augmented Generation (RAG) and Group Relative Policy Optimization (GRPO) to enhance detection accuracy and decision explainability. Specifically, RAIDX leverages RAG to incorporate external knowledge for improved detection accuracy and employs GRPO to autonomously generate fine-grained textual explanations and saliency maps, eliminating the need for extensive manual annotations. Experiments on multiple benchmarks demonstrate RAIDX’s effectiveness in identifying real or fake, and providing interpretable rationales in both textual descriptions and saliency maps, achieving state-of-the-art detection performance while advancing transparency in deepfake identification. RAIDX represents the first unified framework to synergize RAG and GRPO, addressing critical gaps in accuracy and explainability. Our code and models will be publicly available.

[84] No Masks Needed: Explainable AI for Deriving Segmentation from Classification cs.CVPDF

Mosong Ma, Tania Stathaki, Michalis Lazarou

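TL;DR: The paper proposes a new method that uses explainable AI (XAI) to derive medical image segmentation from classification models, avoiding the limitations of traditional methods on medical images and achieving better results on several datasets.

Details

Motivation: Medical image segmentation is vital to modern healthcare, but existing unsupervised segmentation methods perform poorly on medical images. The authors therefore aim to improve segmentation quality by incorporating explainable AI techniques.

Result: On datasets such as CBIS-DDSM, NuInsSeg, and Kvasir-SEG, the method outperforms traditional segmentation methods.

Insight: Incorporating explainable AI can significantly improve the accuracy and practicality of medical image segmentation, providing a more reliable tool for computer-aided diagnosis.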

Abstract: Medical image segmentation is vital for modern healthcare and is a key element of computer-aided diagnosis. While recent advancements in computer vision have explored unsupervised segmentation using pre-trained models, these methods have not been translated well to the medical imaging domain. In this work, we introduce a novel approach that fine-tunes pre-trained models specifically for medical images, achieving accurate segmentation with extensive processing. Our method integrates Explainable AI to generate relevance scores, enhancing the segmentation process. Unlike traditional methods that excel in standard benchmarks but falter in medical applications, our approach achieves improved results on datasets like CBIS-DDSM, NuInsSeg and Kvasir-SEG.

[85] TopKD: Top-scaled Knowledge Distillation cs.CVPDF

Qi Wang, Jinjia Zhou

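TL;DR: The paper proposes Top-scaled Knowledge Distillation (TopKD), a simple, efficient, and architecture-agnostic framework that significantly improves logit-based distillation by focusing knowledge transfer on the teacher's Top-K logits.

Details

Motivation: Existing knowledge distillation methods focus mainly on feature-level knowledge transfer and overlook critical information in the teacher's logit distribution, especially Top-K knowledge.

Result: On CIFAR-100, ImageNet, and other datasets, TopKD outperforms existing distillation methods, and it also performs well on Vision Transformers.

Insight: Top-K knowledge in the logit distribution holds significant potential for knowledge distillation, and exploiting it further can improve distillation results.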

Abstract: Recent advances in knowledge distillation (KD) predominantly emphasize feature-level knowledge transfer, frequently overlooking critical information embedded within the teacher’s logit distributions. In this paper, we revisit logit-based distillation and reveal an underexplored yet critical element: Top-K knowledge. Motivated by this insight, we propose Top-scaled Knowledge Distillation (TopKD), a simple, efficient, and architecture-agnostic framework that significantly enhances logit-based distillation. TopKD consists of two main components: (1) a Top-K Scaling Module (TSM), which adaptively amplifies the most informative logits, and (2) a Top-K Decoupled Loss (TDL), which offers targeted and effective supervision. Notably, TopKD integrates seamlessly into existing KD methods without introducing extra modules or requiring architectural changes. Extensive experiments on CIFAR-100, ImageNet, STL-10, and Tiny-ImageNet demonstrate that TopKD consistently surpasses state-of-the-art distillation methods. Moreover, our method demonstrates substantial effectiveness when distilling Vision Transformers, underscoring its versatility across diverse network architectures. These findings highlight the significant potential of logits to advance knowledge distillation.

[86] Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding cs.CVPDF

Minghang Zheng, Yuxin Peng, Benyuan Sun, Yi Yang, Yang Liu

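TL;DR: The paper proposes a hierarchical event memory for online video temporal grounding (OnVTG). It addresses the lack of effective event modeling and long-term history retention in existing methods, enabling real-time prediction with high performance.

Details

Motivation: Online video temporal grounding (OnVTG) requires a model to locate events related to a text query without observing future frames. Existing methods perform poorly because they lack event modeling and the capacity to store long-term historical information.

Result: The method achieves state-of-the-art performance on the TACoS, ActivityNet Captions, and MAD datasets.

Insight: Event-level modeling and retention of long-term historical information significantly improve both the accuracy and the real-time behavior of online video temporal grounding.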

Abstract: In this paper, we tackle the task of online video temporal grounding (OnVTG), which requires the model to locate events related to a given text query within a video stream. Unlike regular video temporal grounding, OnVTG requires the model to make predictions without observing future frames. As online videos are streaming inputs and can go on indefinitely, it is impractical and inefficient to store all historical inputs. The existing OnVTG models employ memory to store recent historical video frame features and predict scores indicating whether the current frame corresponds to the start or end time of the target event. However, these methods lack effective event modeling and cannot retain long-term historical information, leading to low performance. To tackle these challenges, we propose a hierarchical event memory for OnVTG. We propose an event-based OnVTG framework that makes predictions based on event proposals that model event-level information with various durations. To preserve historically valuable event information, we introduce a hierarchical event memory that retains historical events, allowing the model to access both recent and long-term information. To enable the real-time prediction, we further propose a future prediction branch that predicts whether the target event will occur shortly and further regresses the start time of the event. We achieve state-of-the-art performance on the TACoS, ActivityNet Captions, and MAD datasets. Code is available at https://github.com/minghangz/OnVTG.

[87] MSC: A Marine Wildlife Video Dataset with Grounded Segmentation and Clip-Level Captioning cs.CV | cs.AI | cs.MMPDF

Quang-Trung Truong, Yuk-Kwan Wong, Vo Hoang Kim Tuyen Dang, Rinaldi Gotama, Duc Thanh Nguyen

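TL;DR: The paper introduces MSC, a marine wildlife video dataset that pairs segmentation masks with text descriptions to improve video understanding and analysis.

Details

Motivation: Because of the dynamics and complexity of marine environments (object motion, camera motion, and complex backgrounds), existing video captioning datasets adapt poorly to marine scenes, so a dedicated dataset and method are needed.

Result: The MSC dataset and the proposed method improve the understanding and analysis of marine videos and support video generation tasks.

Insight: The dynamic and complex nature of marine videos calls for dedicated datasets and task design. Visual grounding and video splitting help improve the accuracy and richness of generated captions.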

Abstract: Marine videos present significant challenges for video understanding due to the dynamics of marine objects and the surrounding environment, camera motion, and the complexity of underwater scenes. Existing video captioning datasets, typically focused on generic or human-centric domains, often fail to generalize to the complexities of the marine environment and gain insights about marine life. To address these limitations, we propose a two-stage marine object-oriented video captioning pipeline. We introduce a comprehensive video understanding benchmark that leverages the triplets of video, text, and segmentation masks to facilitate visual grounding and captioning, leading to improved marine video understanding and analysis, and marine video generation. Additionally, we highlight the effectiveness of video splitting in order to detect salient object transitions in scene changes, which significantly enrich the semantics of captioning content. Our dataset and code have been released at https://msc.hkustvgd.com.

[88] Augmentation-based Domain Generalization and Joint Training from Multiple Source Domains for Whole Heart Segmentation cs.CV | cs.LGPDF

Franz Thaler, Darko Stern, Gernot Plank, Martin Urschler

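TL;DR: This paper combines augmentation-based domain generalization with joint training on multiple source domains for whole heart segmentation, addressing domain shift in medical imaging and reporting strong results on CT and MR data.

Details

Motivation: Cardiovascular diseases are the leading cause of death worldwide, and more accurate heart segmentation is needed to assess patient-specific cardiac morphology and pathology. Domain shift (training and test data drawn from different distributions) degrades model performance in practice; this paper targets that problem.

Result: 93.33% DSC and 0.8388 mm ASSD on CT data, and 89.30% DSC and 1.2411 mm ASSD on MR data.

Insight: Joint training and diverse data augmentation can mitigate domain shift and improve cross-domain segmentation performance.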

Abstract: As the leading cause of death worldwide, cardiovascular diseases motivate the development of more sophisticated methods to analyze the heart and its substructures from medical images like Computed Tomography (CT) and Magnetic Resonance (MR). Semantic segmentations of important cardiac structures that represent the whole heart are useful to assess patient-specific cardiac morphology and pathology. Furthermore, accurate semantic segmentations can be used to generate cardiac digital twin models which allows e.g. electrophysiological simulation and personalized therapy planning. Even though deep learning-based methods for medical image segmentation achieved great advancements over the last decade, retaining good performance under domain shift – i.e. when training and test data are sampled from different data distributions – remains challenging. In order to perform well on domains known at training-time, we employ a (1) balanced joint training approach that utilizes CT and MR data in equal amounts from different source domains. Further, aiming to alleviate domain shift towards domains only encountered at test-time, we rely on (2) strong intensity and spatial augmentation techniques to greatly diversify the available training data. Our proposed whole heart segmentation method, a 5-fold ensemble with our contributions, achieves the best performance for MR data overall and a performance similar to the best performance for CT data when compared to a model trained solely on CT. With 93.33% DSC and 0.8388 mm ASSD for CT and 89.30% DSC and 1.2411 mm ASSD for MR data, our method demonstrates great potential to efficiently obtain accurate semantic segmentations from which patient-specific cardiac twin models can be generated.
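The DSC figures above are Dice similarity coefficients, defined as DSC = 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B. As a small illustration of the metric only (not of the paper's evaluation code), the sketch below computes it for flattened binary masks; returning 1.0 when both masks are empty is a common convention that is assumed here.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient for two equal-length sequences of 0/1 labels."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# One overlapping voxel, two foreground voxels in each mask: 2 * 1 / 4
print(dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
```

A reported value such as 93.33% corresponds to a coefficient of 0.9333, typically averaged over cardiac structures and test cases.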

[89] Drone Detection with Event Cameras cs.CVPDF

Gabriele Magrini, Lorenzo Berlincioni, Luca Cultrera, Federico Becattini, Pietro Pala

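TL;DR: This paper surveys drone detection with event cameras, analyzes their advantages under motion blur and extreme lighting, and reviews processing approaches including spiking neural networks (SNNs).

Details

Motivation: The spread of drones raises security and safety challenges. Conventional frame-based cameras struggle to detect drones reliably because of motion blur and small target size. Event cameras, with virtually no motion blur and high dynamic range, are a potential solution.

Result: The survey finds that event cameras are highly robust for drone detection and outperform conventional cameras, particularly for fast-moving and small targets.

Insight: The low latency and high dynamic range of event cameras open a new direction for drone detection, with clear advantages in complex environments.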

Abstract: The diffusion of drones presents significant security and safety challenges. Traditional surveillance systems, particularly conventional frame-based cameras, struggle to reliably detect these targets due to their small size, high agility, and the resulting motion blur and poor performance in challenging lighting conditions. This paper surveys the emerging field of event-based vision as a robust solution to these problems. Event cameras virtually eliminate motion blur and enable consistent detection in extreme lighting. Their sparse, asynchronous output suppresses static backgrounds, enabling low-latency focus on motion cues. We review the state-of-the-art in event-based drone detection, from data representation methods to advanced processing pipelines using spiking neural networks. The discussion extends beyond simple detection to cover more sophisticated tasks such as real-time tracking, trajectory forecasting, and unique identification through propeller signature analysis. By examining current methodologies, available datasets, and the distinct advantages of the technology, this work demonstrates that event-based vision provides a powerful foundation for the next generation of reliable, low-latency, and efficient counter-UAV systems.

Jinxing Zhou, Ziheng Zhou, Yanghao Zhou, Yuxin Mao, Zhangling Duan

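TL;DR: This paper proposes CLASP for weakly-supervised dense audio-visual event localization (W-DAVEL). It uses cross-modal salient anchors to strengthen semantic propagation and achieves strong event localization with only video-level labels.

Details

Motivation: Dense audio-visual event localization (DAVEL) localizes events that occur simultaneously in the audio and visual modalities of untrimmed videos. This paper explores the harder weakly-supervised setting (W-DAVEL), where only video-level labels are available for training and the temporal boundaries of events are unknown.

Result: Benchmarks for W-DAVEL are established on the UnAV-100 and ActivityNet1.3 datasets; experiments show that CLASP achieves state-of-the-art performance.

Insight: Cross-modal consistency is key to weakly-supervised localization; salient anchors mitigate the lack of semantic information under weak supervision.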

Abstract: The Dense Audio-Visual Event Localization (DAVEL) task aims to temporally localize events in untrimmed videos that occur simultaneously in both the audio and visual modalities. This paper explores DAVEL under a new and more challenging weakly-supervised setting (W-DAVEL task), where only video-level event labels are provided and the temporal boundaries of each event are unknown. We address W-DAVEL by exploiting cross-modal salient anchors, which are defined as reliable timestamps that are well predicted under weak supervision and exhibit highly consistent event semantics across audio and visual modalities. Specifically, we propose a Mutual Event Agreement Evaluation module, which generates an agreement score by measuring the discrepancy between the predicted audio and visual event classes. Then, the agreement score is utilized in a Cross-modal Salient Anchor Identification module, which identifies the audio and visual anchor features through global-video and local temporal window identification mechanisms. The anchor features after multimodal integration are fed into an Anchor-based Temporal Propagation module to enhance event semantic encoding in the original temporal audio and visual features, facilitating better temporal localization under weak supervision. We establish benchmarks for W-DAVEL on both the UnAV-100 and ActivityNet1.3 datasets. Extensive experiments demonstrate that our method achieves state-of-the-art performance.
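The abstract says the agreement score measures the discrepancy between predicted audio and visual event classes, without giving the formula. The sketch below is one simple way such a score could be computed and used to pick anchors; the mean-absolute-difference discrepancy, the top-k selection, and both function names are assumptions, not the paper's definitions.

```python
def agreement_scores(audio_probs, visual_probs):
    """Per-timestamp agreement in [0, 1] between two class-probability sequences.

    audio_probs, visual_probs: lists (one entry per timestamp) of per-class
    probabilities in [0, 1]. Agreement is 1 minus the mean absolute difference.
    """
    scores = []
    for a, v in zip(audio_probs, visual_probs):
        discrepancy = sum(abs(x - y) for x, y in zip(a, v)) / len(a)
        scores.append(1.0 - discrepancy)
    return scores


def select_anchors(scores, k):
    """Indices of the k timestamps with the highest agreement, in time order."""
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])


audio = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
visual = [[0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
scores = agreement_scores(audio, visual)
anchors = select_anchors(scores, k=2)
```

Here the second timestamp, where the two modalities disagree most, is left out of the anchor set; the paper additionally combines global-video and local-window selection, which this sketch does not model.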

[91] Knowledge to Sight: Reasoning over Visual Attributes via Knowledge Decomposition for Abnormality Grounding cs.CVPDF

Jun Li, Che Liu, Wenjia Bai, Mingxuan Liu, Rossella Arcucci

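TL;DR: K2Sight decomposes clinical concepts into interpretable visual attributes (such as shape, density, and anatomical location) and turns them into instruction-style prompts. This enables data-efficient training of compact models that match or exceed large medical VLMs with limited data.

Details

Motivation: Vision-language models (VLMs) perform poorly on abnormality grounding in the medical domain, mainly because specialized terms are hard to align with visual patterns and large-scale pretraining is costly.

Result: Models with 0.23B and 2B parameters, trained on only 1.5% of the data required by state-of-the-art medical VLMs, perform on par with or better than 7B+ medical VLMs, with up to 9.82% improvement in mAP50.

Insight: Structured semantic supervision can substantially improve small models in the medical domain; knowledge decomposition is key to aligning vision and text efficiently.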

Abstract: In this work, we address the problem of grounding abnormalities in medical images, where the goal is to localize clinical findings based on textual descriptions. While generalist Vision-Language Models (VLMs) excel in natural grounding tasks, they often struggle in the medical domain due to rare, compositional, and domain-specific terms that are poorly aligned with visual patterns. Specialized medical VLMs address this challenge via large-scale domain pretraining, but at the cost of substantial annotation and computational resources. To overcome these limitations, we propose Knowledge to Sight (K2Sight), a framework that introduces structured semantic supervision by decomposing clinical concepts into interpretable visual attributes, such as shape, density, and anatomical location. These attributes are distilled from domain ontologies and encoded into concise instruction-style prompts, which guide region-text alignment during training. Unlike conventional report-level supervision, our approach explicitly bridges domain knowledge and spatial structure, enabling data-efficient training of compact models. We train compact models with 0.23B and 2B parameters using only 1.5% of the data required by state-of-the-art medical VLMs. Despite their small size and limited training data, these models achieve performance on par with or better than 7B+ medical VLMs, with up to 9.82% improvement in mAP50. Code and models: https://lijunrio.github.io/K2Sight/.

[92] Face-voice Association in Multilingual Environments (FAME) 2026 Challenge Evaluation Plan cs.CVPDF

Marta Moscati, Ahmed Abdullah, Muhammad Saad Saeed, Shah Nawaz, Rohan Kumar Das

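TL;DR: The FAME 2026 Challenge explores face-voice association in multilingual environments, using the MAV-Celeb dataset to study audio-visual association in bilingual and multilingual scenarios.

Details

Motivation: About half of the world's population is bilingual, which makes audio-visual association in multilingual scenarios important, yet it remains little studied. This report aims to fill that gap.

Result: No experimental results are reported; the report presents the challenge framework and dataset.

Insight: Multilingual scenarios pose new challenges for audio-visual association research; future work should consider how language diversity affects model performance.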

Abstract: Advances in technology have led to the use of multimodal systems in various real-world applications. Among them, audio-visual systems are among the most widely used multimodal systems. In recent years, associating the face and voice of a person has gained attention due to the presence of a unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) 2026 Challenge focuses on exploring face-voice association under the unique condition of a multilingual scenario. This condition is inspired by the fact that half of the world’s population is bilingual and most often people communicate under multilingual scenarios. The challenge uses a dataset named Multilingual Audio-Visual (MAV-Celeb) for exploring face-voice association in multilingual environments. This report provides the details of the challenge, dataset, baseline models, and task details for the FAME Challenge.

[93] OmniDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment cs.CV | cs.ROPDF

Tongfan Guan, Jiaxin Guo, Chen Wang, Yun-Hui Liu

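TL;DR: OmniDepth is a unified framework that combines the strengths of monocular and stereo depth estimation through iterative bidirectional alignment of their latent representations. It resolves stereo ambiguities on reflective or textureless surfaces and achieves state-of-the-art results on multiple datasets.

Details

Motivation: Monocular depth estimation relies on contextual priors but lacks geometric precision; stereo methods exploit epipolar geometry but struggle with reflective or textureless surfaces. The two paradigms remain largely separate in practice.

Result: Zero-shot generalization error drops by more than 40% on Middlebury and ETH3D, and longstanding failures on transparent and reflective surfaces are addressed.

Insight: By unifying monocular context and stereo geometry, OmniDepth overcomes modality-specific limitations and offers a robust solution for 3D perception.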

Abstract: Monocular and stereo depth estimation offer complementary strengths: monocular methods capture rich contextual priors but lack geometric precision, while stereo approaches leverage epipolar geometry yet struggle with ambiguities such as reflective or textureless surfaces. Despite post-hoc synergies, these paradigms remain largely disjoint in practice. We introduce OmniDepth, a unified framework that bridges both through iterative bidirectional alignment of their latent representations. At its core, a novel cross-attentive alignment mechanism dynamically synchronizes monocular contextual cues with stereo hypothesis representations during stereo reasoning. This mutual alignment resolves stereo ambiguities (e.g., specular surfaces) by injecting monocular structure priors while refining monocular depth with stereo geometry within a single network. Extensive experiments demonstrate state-of-the-art results: OmniDepth reduces zero-shot generalization error by >40% on Middlebury and ETH3D, while addressing longstanding failures on transparent and reflective surfaces. By harmonizing multi-view geometry with monocular context, OmniDepth enables robust 3D perception that transcends modality-specific limitations. Codes available at https://github.com/aeolusguan/OmniDepth.

[94] FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging cs.CV | cs.CEPDF

Zichen Tang, Haihong E, Jiacheng Liu, Zhongjun Yang, Rongjin Li

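TL;DR: This paper presents FinMMR, a benchmark for evaluating multimodal large language models (MLLMs) on financial numerical reasoning, characterized by multimodality, comprehensiveness, and difficulty.

Details

Motivation: Existing financial reasoning benchmarks are mostly unimodal and cover a limited range of domains, so they fall short of the complex multimodal reasoning required in real financial scenarios. FinMMR is designed to address this gap.

Result: The best-performing MLLM achieves only 53.0% accuracy on Hard problems, indicating substantial room for improvement on complex financial reasoning.

Insight: FinMMR gives future research a benchmark closer to real financial scenarios and underlines that combining financial knowledge with multimodal understanding matters for model reasoning.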

Abstract: We present FinMMR, a novel bilingual multimodal benchmark tailored to evaluate the reasoning capabilities of multimodal large language models (MLLMs) in financial numerical reasoning tasks. Compared to existing benchmarks, our work introduces three significant advancements. (1) Multimodality: We meticulously transform existing financial reasoning benchmarks, and construct novel questions from the latest Chinese financial research reports. FinMMR comprises 4.3K questions and 8.7K images spanning 14 categories, including tables, bar charts, and ownership structure charts. (2) Comprehensiveness: FinMMR encompasses 14 financial subdomains, including corporate finance, banking, and industry analysis, significantly exceeding existing benchmarks in financial domain knowledge breadth. (3) Challenge: Models are required to perform multi-step precise numerical reasoning by integrating financial knowledge with the understanding of complex financial images and text. The best-performing MLLM achieves only 53.0% accuracy on Hard problems. We believe that FinMMR will drive advancements in enhancing the reasoning capabilities of MLLMs in real-world scenarios.

[95] EncQA: Benchmarking Vision-Language Models on Visual Encodings for Charts cs.CV | I.2.0PDF

Kushin Mukherjee, Donghao Ren, Dominik Moritz, Yannick Assogba

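TL;DR: EncQA is a new benchmark that evaluates how well vision-language models analyze visual encodings in chart understanding. It reveals performance differences across encodings and tasks and challenges the assumption that larger models perform better.

Details

Motivation: Current vision-language models score well on chart understanding benchmarks, but those benchmarks do not cover the full range of visual reasoning that charts require. The authors propose EncQA to evaluate systematically how models analyze visual encodings in charts.

Result: Performance varies significantly across visual encodings and tasks, and for some task-encoding pairs it does not improve with model size. This points to a need for targeted fixes for visual reasoning gaps.

Insight: Improving chart understanding requires attention to bottlenecks in specific visual encodings and tasks, not just larger models or datasets.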

Abstract: Multimodal vision-language models (VLMs) continue to achieve ever-improving scores on chart understanding benchmarks. Yet, we find that this progress does not fully capture the breadth of visual reasoning capabilities essential for interpreting charts. We introduce EncQA, a novel benchmark informed by the visualization literature, designed to provide systematic coverage of visual encodings and analytic tasks that are crucial for chart understanding. EncQA provides 2,076 synthetic question-answer pairs, enabling balanced coverage of six visual encoding channels (position, length, area, color quantitative, color nominal, and shape) and eight tasks (find extrema, retrieve value, find anomaly, filter values, compute derived value exact, compute derived value relative, correlate values, and correlate values relative). Our evaluation of 9 state-of-the-art VLMs reveals that performance varies significantly across encodings within the same task, as well as across tasks. Contrary to expectations, we observe that performance does not improve with model size for many task-encoding pairs. Our results suggest that advancing chart understanding requires targeted strategies addressing specific visual reasoning gaps, rather than solely scaling up model or dataset size.

[96] X-SAM: From Segment Anything to Any Segmentation cs.CV | cs.AIPDF

Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng

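TL;DR: X-SAM is a multimodal large language model (MLLM) framework that extends the segmentation paradigm from "segment anything" to "any segmentation". With a unified framework and the new Visual GrounDed (VGD) segmentation task, X-SAM performs well across diverse segmentation tasks.

Details

Motivation: Large language models (LLMs) lack pixel-level perceptual understanding, while the Segment Anything Model (SAM) is limited in multi-mask prediction and category-specific segmentation. X-SAM aims to address both and unify multiple segmentation tasks in one model.

Result: X-SAM achieves state-of-the-art performance on multiple segmentation benchmarks, demonstrating its efficiency for multimodal, pixel-level visual understanding.

Insight: Through a unified multi-task framework and visually grounded design, X-SAM opens a new direction for pixel-level understanding in MLLMs and may push segmentation tasks toward further integration.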

Abstract: Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from "segment anything" to "any segmentation". Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visual grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding. Code is available at https://github.com/wanghao9610/X-SAM.

[97] PixCuboid: Room Layout Estimation from Multi-view Featuremetric Alignment cs.CV | I.4PDF

Gustav Hanning, Kalle Åström, Viktor Larsson

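TL;DR: PixCuboid is an optimization-based method for cuboid-shaped room layout estimation built on multi-view alignment of dense deep features. Feature maps are learned by training end-to-end through the optimization, and the method significantly outperforms existing approaches.

Details

Motivation: Current room layout estimation methods rely mainly on single views and panoramic images, which limits flexibility and accuracy. PixCuboid aims to improve accuracy and generalization through multi-view feature alignment.

Result: Experiments show that PixCuboid significantly outperforms existing methods and extends flexibly to multi-room layout estimation.

Insight: Optimization-based methods can gain robustness and generalization in layout estimation from multi-view feature alignment; end-to-end training is the key ingredient.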

Abstract: Coarse room layout estimation provides important geometric cues for many downstream tasks. Current state-of-the-art methods are predominantly based on single views and often assume panoramic images. We introduce PixCuboid, an optimization-based approach for cuboid-shaped room layout estimation, which is based on multi-view alignment of dense deep features. By training with the optimization end-to-end, we learn feature maps that yield large convergence basins and smooth loss landscapes in the alignment. This allows us to initialize the room layout using simple heuristics. For the evaluation we propose two new benchmarks based on ScanNet++ and 2D-3D-Semantics, with manually verified ground truth 3D cuboids. In thorough experiments we validate our approach and significantly outperform the competition. Finally, while our network is trained with single cuboids, the flexibility of the optimization-based approach allows us to easily extend to multi-room estimation, e.g. larger apartments or offices. Code and model weights are available at https://github.com/ghanning/PixCuboid.

[98] ANPrompt: Anti-noise Prompt Tuning for Vision-Language Models cs.CVPDF

Yansheng Gao, Yufei Zheng, Jinghan Qu, Zixi Zhu, Yukuan Zhang

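TL;DR: ANPrompt is an anti-noise prompt tuning framework for vision-language models (VLMs). By generating anti-noise prompts and noise-aware visual semantics, it improves robustness to semantic noise and generalization.

Details

Motivation: Existing prompt tuning methods are fragile under weak semantic perturbations (such as image or text noise), which hurts generalization to unseen classes. ANPrompt aims to address this.

Result: Across 11 benchmarks, ANPrompt consistently outperforms existing prompt tuning methods, with stronger noise robustness and generalization.

Insight: Explicitly modeling noise prompts and noise-aware semantics improves VLM performance in noisy settings and suggests a new direction for prompt tuning.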

Abstract: Prompt tuning has emerged as an efficient and effective technique for adapting vision-language models (VLMs) with low computational overhead. However, existing methods often overlook the vulnerability of prompt-tuned VLMs to weak semantic perturbations-such as subtle image or text noise-that degrade their generalization to unseen classes. To address this limitation, we propose ANPrompt, a novel prompt tuning framework designed to enhance robustness under such perturbations. ANPrompt first constructs weak noise text features by fusing original and noise-perturbed text embeddings, which are then clustered to form noise prompts. These noise prompts are integrated with learnable prompt tokens to generate anti-noise prompts, which are injected into the deeper layers of both image and text encoders. To further capture the noise-aware visual semantics, ANPrompt computes the Noise-Resistant Visual Prompt Prototype (NRVPP) by averaging the output prompt tokens from the vision encoder. Finally, ANPrompt introduces alignment, robustness, and anti-noise objectives by computing a Weak semantic noise Alignment Loss (WALoss) alongside the standard cross-entropy and sim loss. Experiments across 11 benchmarks demonstrate that ANPrompt consistently outperforms existing prompt tuning approaches, achieving superior robustness to semantic noise and improved generalization to novel categories.
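Two of the steps named in the abstract are simple vector operations: fusing original and noise-perturbed text embeddings, and averaging the vision encoder's output prompt tokens into a prototype (NRVPP). The sketch below shows both under stated assumptions: the convex-combination fusion with weight `alpha` is a guess at the fusion rule, the function names are hypothetical, and the clustering step that forms noise prompts is omitted.

```python
def fuse_embeddings(original, noisy, alpha=0.5):
    """Blend an original embedding with its noise-perturbed version.

    alpha is an assumed mixing weight; the abstract does not state the rule.
    """
    return [alpha * o + (1.0 - alpha) * n for o, n in zip(original, noisy)]


def prompt_prototype(prompt_tokens):
    """Average a list of equal-length prompt token vectors into one prototype."""
    count = len(prompt_tokens)
    dim = len(prompt_tokens[0])
    return [sum(tok[d] for tok in prompt_tokens) / count for d in range(dim)]


weak_noise_feature = fuse_embeddings([1.0, 2.0], [3.0, 4.0], alpha=0.5)
prototype = prompt_prototype([[1.0, 2.0], [3.0, 4.0]])
```

In the paper these quantities feed the alignment loss (WALoss) alongside cross-entropy; that training objective is outside the scope of this sketch.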

[99] Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions cs.CVPDF

Liang Xu, Chengqun Yang, Zili Lin, Fei Xu, Yifan Liu

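TL;DR: This paper introduces the InterVLA dataset and benchmarks, the first large-scale capture of human-object-human interactions with egocentric views. It sets up new benchmarks on multimodal data to advance research on AI assistants for the physical world.

Details

Motivation: Most existing datasets are limited to specific interaction categories and lack egocentric data, whereas real AI assistants must perceive and act from a first-person view. The paper aims to fill this gap.

Result: The dataset spans 2 egocentric and 5 exocentric videos and provides accurate human/object motions and verbal commands. Experiments show its usefulness on several tasks.

Insight: Combining the egocentric view with multimodal data is key to building AI assistants for the physical world; InterVLA offers a resource and testbed for future work.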

Abstract: Learning action models from real-world human-centric interaction datasets is important towards building general-purpose intelligent assistants with efficiency. However, most existing datasets only offer specialist interaction category and ignore that AI assistants perceive and act based on first-person acquisition. We urge that both the generalist interaction knowledge and egocentric modality are indispensable. In this paper, we embed the manual-assisted task into a vision-language-action framework, where the assistant provides services to the instructor following egocentric vision and commands. With our hybrid RGB-MoCap system, pairs of assistants and instructors engage with multiple objects and the scene following GPT-generated scripts. Under this setting, we accomplish InterVLA, the first large-scale human-object-human interaction dataset with 11.4 hours and 1.2M frames of multimodal data, spanning 2 egocentric and 5 exocentric videos, accurate human/object motions and verbal commands. Furthermore, we establish novel benchmarks on egocentric human motion estimation, interaction synthesis, and interaction prediction with comprehensive analysis. We believe that our InterVLA testbed and the benchmarks will foster future works on building AI agents in the physical world.

[100] BEVCon: Advancing Bird’s Eye View Perception with Contrastive Learning cs.CVPDF

Ziyang Leng, Jiawei Yang, Zhicheng Ren, Bolei Zhou

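TL;DR: BEVCon is a contrastive learning framework for improving Bird's Eye View (BEV) perception. It focuses on better feature representations instead of only improving the BEV encoder or task-head architecture.

Details

Motivation: BEV perception is a key task in autonomous driving, but existing work concentrates on improving encoders or task heads and overlooks the potential of representation learning.

Result: Up to +2.4% mAP improvement on the nuScenes dataset, supporting the importance of representation learning for BEV perception.

Insight: Representation learning is an overlooked factor in BEV perception; contrastive learning offers a complementary path to task-specific optimization.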

Abstract: We present BEVCon, a simple yet effective contrastive learning framework designed to improve Bird’s Eye View (BEV) perception in autonomous driving. BEV perception offers a top-down-view representation of the surrounding environment, making it crucial for 3D object detection, segmentation, and trajectory prediction tasks. While prior work has primarily focused on enhancing BEV encoders and task-specific heads, we address the underexplored potential of representation learning in BEV models. BEVCon introduces two contrastive learning modules: an instance feature contrast module for refining BEV features and a perspective view contrast module that enhances the image backbone. The dense contrastive learning designed on top of detection losses leads to improved feature representations across both the BEV encoder and the backbone. Extensive experiments on the nuScenes dataset demonstrate that BEVCon achieves consistent performance gains, achieving up to +2.4% mAP improvement over state-of-the-art baselines. Our results highlight the critical role of representation learning in BEV perception and offer a complementary avenue to conventional task-specific optimizations.
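The abstract does not state which contrastive objective BEVCon uses. As a generic illustration of contrastive learning on feature vectors, the sketch below implements the widely used InfoNCE loss with cosine similarity; treating it as BEVCon's loss would be an assumption, and the temperature value is arbitrary.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: -log softmax of the positive pair among all pairs."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Low loss when the positive matches the anchor, high loss when it does not.
loss_aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
loss_misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing this loss pulls features of the same instance together and pushes different instances apart, which is the general mechanism behind both the instance-feature and perspective-view contrast modules described above.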

cs.CL [Back]

[101] FeynTune: Large Language Models for High-Energy Theory cs.CL | cs.LG | hep-thPDF

Paul Richmond, Prarit Agarwal, Borun Chowdhury, Vasilis Niarchos, Constantinos Papageorgakis

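TL;DR: This paper introduces FeynTune, specialized large language models for high-energy theoretical physics, obtained as 20 fine-tuned variants of the 8B-parameter Llama-3.1 model trained on different combinations of arXiv categories. The variants outperform the base model on hep-th abstract completion and are compared against leading commercial LLMs.

Details

Motivation: High-energy theoretical physics lacks dedicated language models; the paper explores how fine-tuning a general-purpose LLM can meet the field's specific needs.

Result: The fine-tuned models outperform the base model on hep-th abstract completion tasks; the paper also compares them with leading commercial LLMs (ChatGPT, Claude, Gemini, DeepSeek).

Insight: 1. Fine-tuning on domain-specific data can markedly improve model performance; 2. LoRA balances resource efficiency and performance; 3. Fine-tuning on datasets from other fields may lead to performance differences.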

Abstract: We present specialized Large Language Models for theoretical High-Energy Physics, obtained as 20 fine-tuned variants of the 8-billion parameter Llama-3.1 model. Each variant was trained on arXiv abstracts (through August 2024) from different combinations of hep-th, hep-ph and gr-qc. For a comparative study, we also trained models on datasets that contained abstracts from disparate fields such as the q-bio and cs categories. All models were fine-tuned using two distinct Low-Rank Adaptation fine-tuning approaches and varying dataset sizes, and outperformed the base model on hep-th abstract completion tasks. We compare performance against leading commercial LLMs (ChatGPT, Claude, Gemini, DeepSeek) and derive insights for further developing specialized language models for High-Energy Theoretical Physics.
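The abstract mentions two Low-Rank Adaptation (LoRA) approaches without detailing them. For orientation, the sketch below shows the standard LoRA formulation, in which a frozen weight matrix W is adapted as W + (alpha / r) · B · A with A of shape r × d_in and B of shape d_out × r. It uses plain Python lists for self-containment; the helper names are hypothetical and nothing here reflects the paper's specific configuration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    rank = len(a)
    delta = matmul(b, a)
    scale = alpha / rank
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (2 x 2)
a = [[1.0, 2.0]]               # trainable, rank 1 (1 x 2)
b = [[1.0], [3.0]]             # trainable (2 x 1)
adapted = lora_effective_weight(w, a, b, alpha=2.0)
```

Only A and B are trained, so the number of trainable parameters per layer is r · (d_in + d_out) instead of d_in · d_out, which is what makes fine-tuning 20 variants of an 8B model tractable.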

[102] Hierarchical Verification of Speculative Beams for Accelerating LLM Inference cs.CLPDF

Jaydip Sen, Harshitha Puvvala, Subhasis Dasgupta

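TL;DR: This paper proposes the Hierarchical Verification Tree (HVT), a framework that verifies high-likelihood draft sequences first and prunes suboptimal candidates early, improving the inference speed and energy efficiency of large language models (LLMs) while preserving output quality.

Details

Motivation: Autoregressive LLM inference is slow, and conventional speculative decoding and beam sampling verify draft sequences without prioritization, which wastes computation.

Result: Experiments across multiple datasets and models show that HVT substantially reduces inference time and energy consumption while maintaining or improving output quality.

Insight: Hierarchical verification offers a new direction for accelerating LLM inference and shows the value of prioritization and early pruning.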

Abstract: Large language models (LLMs) have achieved remarkable success across diverse natural language processing tasks but face persistent challenges in inference efficiency due to their autoregressive nature. While speculative decoding and beam sampling offer notable improvements, traditional methods verify draft sequences sequentially without prioritization, leading to unnecessary computational overhead. This work proposes the Hierarchical Verification Tree (HVT), a novel framework that restructures speculative beam decoding by prioritizing high-likelihood drafts and enabling early pruning of suboptimal candidates. Theoretical foundations and a formal verification-pruning algorithm are developed to ensure correctness and efficiency. Integration with standard LLM inference pipelines is achieved without requiring retraining or architecture modification. Experimental evaluations across multiple datasets and models demonstrate that HVT consistently outperforms existing speculative decoding schemes, achieving substantial reductions in inference time and energy consumption while maintaining or enhancing output quality. The findings highlight the potential of hierarchical verification strategies as a new direction for accelerating large language model inference.
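The two ideas the abstract highlights, verifying likely drafts first and pruning unlikely ones early, can be illustrated with a priority queue. The sketch below is a deliberately flattened simplification, not the paper's tree algorithm: the `verify` callback stands in for the target-model check, and the `prune_margin` threshold on log-probability is an assumed pruning rule.

```python
import heapq


def verify_prioritized(drafts, verify, prune_margin=2.0):
    """Verify drafts in order of decreasing draft log-probability.

    drafts: list of (log_prob, tokens) pairs from a draft model.
    verify: callable(tokens) -> bool, a stand-in for the target-model check.
    Drafts more than prune_margin below the best log-prob are never verified.
    Returns (accepted_tokens_or_None, number_of_verify_calls).
    """
    if not drafts:
        return None, 0
    best = max(lp for lp, _ in drafts)
    # Negate log-probs so the min-heap pops the most likely draft first;
    # the index breaks ties without comparing token lists.
    heap = [
        (-lp, i, tokens)
        for i, (lp, tokens) in enumerate(drafts)
        if lp >= best - prune_margin
    ]
    heapq.heapify(heap)
    calls = 0
    while heap:
        _, _, tokens = heapq.heappop(heap)
        calls += 1
        if verify(tokens):
            return tokens, calls
    return None, calls


drafts = [(-5.0, ["a"]), (-0.5, ["b"]), (-1.0, ["c"])]
accepted, calls = verify_prioritized(drafts, verify=lambda t: t == ["c"])
```

In this toy run the low-likelihood draft `["a"]` is pruned without a verification call, and the remaining two are checked in likelihood order. The paper's correctness guarantees depend on its formal verification-pruning algorithm, which this sketch does not reproduce.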

[103] WINELL: Wikipedia Never-Ending Updating with LLM Agents cs.CLPDF

Revanth Gangi Reddy, Tanay Dixit, Jiaxin Qin, Cheng Qian, Daniel Lee

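TL;DR: This paper introduces WiNELL, an LLM-agent framework for continuously updating Wikipedia content. Multiple agents collaborate to select new information and generate edit suggestions, and its editor models outperform existing open-source and closed-source models.

Details

Motivation: Wikipedia depends on manual editing, which is slow and hard to sustain. WiNELL aims to use LLM agents to automate knowledge base updates.

Result: WiNELL's editor models outperform open-source baselines and closed-source LLMs (e.g., GPT-4o) in key information coverage and editing efficiency, and the system is validated end-to-end on high-activity Wikipedia pages.

Insight: LLM agents can be used to keep knowledge bases continuously updated, which opens a promising research direction.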

Abstract: Wikipedia, a vast and continuously consulted knowledge base, faces significant challenges in maintaining up-to-date content due to its reliance on manual human editors. Inspired by the vision of continuous knowledge acquisition in NELL and fueled by advances in LLM-based agents, this paper introduces WiNELL, an agentic framework for continuously updating Wikipedia articles. Our approach employs a multi-agent framework to aggregate online information, select new and important knowledge for a target entity in Wikipedia, and then generate precise edit suggestions for human review. Our fine-grained editing models, trained on Wikipedia’s extensive history of human edits, enable incorporating updates in a manner consistent with human editing behavior. Our editor models outperform both open-source instruction-following baselines and closed-source LLMs (e.g., GPT-4o) in key information coverage and editing efficiency. End-to-end evaluation on high-activity Wikipedia pages demonstrates WiNELL’s ability to identify and suggest timely factual updates. This opens up a promising research direction in LLM agents for automatically updating knowledge bases in a never-ending fashion.

[104] GanitBench: A bi-lingual benchmark for evaluating mathematical reasoning in Vision Language Models cs.CL | cs.AI | I.2.7PDF

Ashutosh Bandooni, Brindha Subburaj

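TL;DR: GanitBench is a bilingual (English and Hindi) benchmark for evaluating vision-language models (VLMs) on mathematical reasoning, filling a gap in multilingual math evaluation.

Details

Motivation: Current VLM benchmarks are mostly monolingual (English), and Hindi datasets are scarce beyond comprehension and translation, especially for mathematical reasoning. The authors hope GanitBench will make multilingual research more inclusive.

Result: 1) GPT-4o mini is the stronger of the two evaluated models, with a highest average accuracy of 38.15%; 2) the "Double Lock" constraint lowers model performance considerably; 3) two-shot CoT is more effective under that constraint; 4) performance on Hindi questions is lower than on English.

Insight: 1) The evaluation language has a notable effect on model performance; 2) in-context examples (such as few-shot CoT) help with reasoning on complex tasks; 3) combining vision and language still needs further work.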

Abstract: Benchmarks for evaluating reasoning among Vision Language Models (VLMs) on several fields and domains are being curated more frequently over the last few years. However, these are often monolingual, mostly available in English. Additionally, there is also a lack of datasets available in Hindi on tasks apart from comprehension and translation. We introduce GanitBench, a tough benchmark consisting of 1527 vision-only questions covering several topics in Mathematics, available in English and Hindi. Collected from two major examinations from India, the JEE Advanced and the CBSE Boards examinations, this benchmark includes questions in the form of images comprising figures essential to a question as well as text. We evaluate two closed-source models on it, in zero-shot Chain-of-Thought (CoT) and two-shot CoT settings. GPT-4o mini is found to be the more dominant model on the benchmark, with its highest average accuracy being 38.15%. We also evaluate models through a “Double Lock” constraint, which brings down the performance of the models by considerable margins. We observe that two-shot CoT appears to be a more effective setting under this environment. Performance of the two VLMs also decreases when answering the same questions in the Hindi language. We hope to facilitate the inclusion of languages like Hindi in research through our work.

[105] Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models cs.CL | cs.AI | cs.LGPDF

Subhey Sadi Rahman, Md. Adnanul Islam, Md. Mahbub Alam, Musarrat Zeba, Md. Abdur Rahman

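TL;DR: This paper reviews factuality evaluation and fact-checking methods for content generated by large language models (LLMs), focusing on hallucination, dataset limitations, and the reliability of evaluation metrics, and it proposes directions for improvement.

Details

Motivation: Because LLMs can generate misleading content, reliable fact-checking and evaluation methods are needed to ensure their outputs are accurate and trustworthy.

Result: The review finds that current evaluation metrics are limited and highlights the value of grounding outputs in external knowledge and of domain-specific customization for improving factual consistency.

Insight: Future LLMs need to be both accurate and explainable, and tailored to specific domains, to support more reliable fact-checking.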

Abstract: Large Language Models (LLMs) are trained on vast and diverse internet corpora that often include inaccurate or misleading content. Consequently, LLMs can generate misinformation, making robust fact-checking essential. This review systematically analyzes how LLM-generated content is evaluated for factual accuracy by exploring key challenges such as hallucinations, dataset limitations, and the reliability of evaluation metrics. The review emphasizes the need for strong fact-checking frameworks that integrate advanced prompting strategies, domain-specific fine-tuning, and retrieval-augmented generation (RAG) methods. It proposes five research questions that guide the analysis of the recent literature from 2020 to 2025, focusing on evaluation methods and mitigation techniques. The review also discusses the role of instruction tuning, multi-agent reasoning, and external knowledge access via RAG frameworks. Key findings highlight the limitations of current metrics, the value of grounding outputs with validated external evidence, and the importance of domain-specific customization to improve factual consistency. Overall, the review underlines the importance of building LLMs that are not only accurate and explainable but also tailored for domain-specific fact-checking. These insights contribute to the advancement of research toward more trustworthy and context-aware language models.

Haofei Yu, Zhengyang Qi, Yining Zhao, Kolby Nottingham, Keyang Xuan

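TL;DR: This paper proposes Sotopia-RL, a framework for training social intelligence with reinforcement learning. It refines coarse episode-level feedback into utterance-level, multi-dimensional rewards to address partial observability and multi-dimensionality in social interactions.

Details

Motivation: Social interactions are partially observable and multi-dimensional, which makes conventional Markov decision process (MDP)-based reinforcement learning inefficient and unstable and calls for finer-grained reward design.

Result: Social goal completion scores of 7.17 on Sotopia-hard and 8.31 on Sotopia-full, significantly outperforming existing approaches.

Insight: 1. Utterance-level credit assignment is essential for handling partial observability; 2. Multi-dimensional rewards capture the richness of social interactions and reduce reward hacking.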

Abstract: Social intelligence has become a critical capability for large language models (LLMs), enabling them to engage effectively in real-world social tasks such as accommodation, persuasion, collaboration, and negotiation. Reinforcement learning (RL) is a natural fit for training socially intelligent agents because it allows models to learn sophisticated strategies directly through social interactions. However, social interactions have two key characteristics that set barriers for RL training: (1) partial observability, where utterances have indirect and delayed effects that complicate credit assignment, and (2) multi-dimensionality, where behaviors such as rapport-building or knowledge-seeking contribute indirectly to goal achievement. These characteristics make Markov decision process (MDP)-based RL with single-dimensional episode-level rewards inefficient and unstable. To address these challenges, we propose Sotopia-RL, a novel framework that refines coarse episode-level feedback into utterance-level, multi-dimensional rewards. Utterance-level credit assignment mitigates partial observability by attributing outcomes to individual utterances, while multi-dimensional rewards capture the full richness of social interactions and reduce reward hacking. Experiments in Sotopia, an open-ended social learning environment, demonstrate that Sotopia-RL achieves state-of-the-art social goal completion scores (7.17 on Sotopia-hard and 8.31 on Sotopia-full), significantly outperforming existing approaches. Ablation studies confirm the necessity of both utterance-level credit assignment and multi-dimensional reward design for RL training. Our implementation is publicly available at: https://github.com/sotopia-lab/sotopia-rl.

[107] CAP-LLM: Context-Augmented Personalized Large Language Models for News Headline Generation cs.CL

Raymond Wilson, Cole Graham, Chase Carter, Zefeng Yang, Ruiqi Gu

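TL;DR: CAP-LLM is a personalized news headline generation framework that combines user preferences with factual consistency. Building on the generative capabilities of large language models (LLMs), it uses a User Preference Encoder, a Context Injection Adapter, and a Fact-Consistency Reinforcement Module to significantly improve headline personalization and factual accuracy.

Details

Motivation: In an era of information overload, personalized news headline generation is crucial for engaging users. Existing methods struggle to capture the complexity of user interests while ensuring factual consistency, which leads to generic or misleading headlines.

Result: On the PENS dataset, CAP-LLM achieves SOTA on all metrics; factual consistency (FactCC 87.50) and personalization (Pc(avg) 2.73) are significantly better than baselines such as BART.

Insight: CAP-LLM shows how to balance personalization and factual accuracy in LLM-driven generation tasks, offering a new research direction for news headline generation.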

Abstract: In the era of information overload, personalized news headline generation is crucial for engaging users by tailoring content to their preferences while accurately conveying news facts. Existing methods struggle with effectively capturing complex user interests and ensuring factual consistency, often leading to generic or misleading headlines. Leveraging the unprecedented capabilities of Large Language Models (LLMs) in text generation, we propose Context-Augmented Personalized LLM (CAP-LLM), a novel framework that integrates user preferences and factual consistency constraints into a powerful pre-trained LLM backbone. CAP-LLM features a User Preference Encoder to capture long-term user interests, a Context Injection Adapter to seamlessly integrate these preferences and current article context into the LLM’s generation process, and a Fact-Consistency Reinforcement Module employing a novel contrastive loss to mitigate hallucination. Evaluated on the real-world PENS dataset, CAP-LLM achieves state-of-the-art performance across all metrics. Notably, it significantly improves factual consistency (FactCC of 87.50) over strong baselines like BART (86.67), while simultaneously enhancing personalization (Pc(avg) 2.73, Pc(max) 17.25) and content coverage (ROUGE-1 26.55, ROUGE-2 9.95, ROUGE-L 23.01). Our ablation studies, human evaluations, and sensitivity analyses further validate the effectiveness of each component and the robustness of our approach, demonstrating CAP-LLM’s ability to achieve a superior balance between personalization and factual accuracy in news headline generation.

[108] Confidence-Weighted Token Set Cover for Early Hypothesis Pruning in Self-Consistency cs.CL

Md Arafat Sultan, Ramón Fernandez Astudillo

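TL;DR: The paper proposes a lightweight method based on confidence and lexical coverage that improves the token efficiency of self-consistency on long chain-of-thought reasoning tasks through early hypothesis pruning.

Details

Motivation: Self-consistency is simple and effective, but its high token cost limits practical use. The paper aims to improve token efficiency on long chain-of-thought reasoning tasks through early hypothesis pruning.

Result: Experiments on three math benchmarks show that the method improves token efficiency for all five LLMs evaluated, by 10-35% in many cases.

Insight: Lightweight indicators such as confidence and lexical coverage can drive efficient pruning and substantially reduce the computational cost of self-consistency.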

Abstract: Despite its simplicity and efficacy, the high token expenditure of self-consistency can limit its practical utility. Here we investigate if self-consistency can be made more token-efficient for long chain-of-thought reasoning tasks, while preserving its parallelism, through early hypothesis pruning. Concretely, we generate all solutions in parallel, but periodically prune intermediate hypotheses that are deemed unnecessary based on two lightweight indicators: (a) the model’s own confidence in individual hypotheses, and (b) lexical coverage of all current hypotheses by candidate subsets that are under consideration for continued retention. We design a fast weighted set cover algorithm that utilizes the two indicators; our evaluation of five LLMs on three math benchmarks shows that this method can improve token efficiency for all models, by 10-35% in many cases.
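The pruning step described in this abstract can be illustrated with a greedy weighted set cover. The sketch below is only an approximation under stated assumptions, not the authors' algorithm: hypotheses are assumed to be token lists, each with a scalar confidence, and the function name `prune_hypotheses` and the `coverage_target` parameter are hypothetical. At each step it keeps the hypothesis with the largest confidence-weighted count of not-yet-covered tokens.

```python
def prune_hypotheses(hypotheses, confidences, coverage_target=1.0):
    """Return sorted indices of hypotheses to keep (greedy weighted set cover).

    hypotheses:      list of token lists (assumed representation)
    confidences:     one model-confidence score per hypothesis
    coverage_target: fraction of the token vocabulary of all current
                     hypotheses that the kept subset must cover
    """
    token_sets = [set(h) for h in hypotheses]
    universe = set().union(*token_sets) if token_sets else set()
    needed = coverage_target * len(universe)

    covered, kept = set(), []
    remaining = set(range(len(hypotheses)))
    while remaining and len(covered) < needed:
        # Score = confidence times number of newly covered tokens;
        # ties are broken in favour of the lower index.
        best = max(
            remaining,
            key=lambda i: (confidences[i] * len(token_sets[i] - covered), -i),
        )
        if not token_sets[best] - covered:
            break  # nothing new can be covered
        kept.append(best)
        covered |= token_sets[best]
        remaining.discard(best)
    return sorted(kept)


hyps = [["a", "b", "c"], ["a", "b"], ["d"], ["c", "d"]]
conf = [0.9, 0.8, 0.5, 0.4]
print(prune_hypotheses(hyps, conf))
```

In this toy input, the second and fourth hypotheses add no tokens that the kept ones lack, so they would be pruned; lowering `coverage_target` prunes more aggressively.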

Xinyu Zhao, Zhen Tan, Maya Enisman, Minjae Seo, Marta R. Durantini

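TL;DR: The paper proposes transferring an expert's cognitive model to a social robot via an agentic concept bottleneck model (CBM), in order to improve the transparency and trustworthiness of interventions in group meetings.

Details

Motivation: Facilitators of group meetings must handle complex social dynamics and a heavy cognitive load. Existing foundation models (FMs) can identify social cues but lack transparency, so a robot assistant is needed that can both analyze multimodal data and provide transparent recommendations.

Result: The model outperforms direct use of FMs in predicting the need for intervention, generalizes across groups, and successfully transfers expert knowledge to novice facilitators.

Insight: The paper shows that embedding expert cognition in a robot through an interpretable model offers an effective way to augment human capabilities in complex social domains.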

Abstract: Successful group meetings, such as those implemented in group behavioral-change programs, work meetings, and other social contexts, must promote individual goal setting and execution while strengthening the social relationships within the group. Consequently, an ideal facilitator must be sensitive to the subtle dynamics of disengagement, difficulties with individual goal setting and execution, and interpersonal difficulties that signal a need for intervention. The challenges and cognitive load experienced by facilitators create a critical gap for an embodied technology that can interpret social exchanges while remaining aware of the needs of the individuals in the group and providing transparent recommendations that go beyond powerful but “black box” foundation models (FMs) that identify social cues. We address this important demand with a social robot co-facilitator that analyzes multimodal meeting data and provides discreet cues to the facilitator. The robot’s reasoning is powered by an agentic concept bottleneck model (CBM), which makes decisions based on human-interpretable concepts like participant engagement and sentiments, ensuring transparency and trustworthiness. Our core contribution is a transfer learning framework that distills the broad social understanding of an FM into our specialized and transparent CBM. This concept-driven system significantly outperforms direct zero-shot FMs in predicting the need for intervention and enables real-time human correction of its reasoning. Critically, we demonstrate robust knowledge transfer: the model generalizes across different groups and successfully transfers the expertise of senior human facilitators to improve the performance of novices. By transferring an expert’s cognitive model into an interpretable robotic partner, our work provides a powerful blueprint for augmenting human capabilities in complex social domains.

[110] HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization cs.CL | cs.AI

Yurun Chen, Xavier Hu, Yuhan Liu, Keting Yin, Juncheng Li

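TL;DR: HarmonyGuard is a multi-agent collaborative framework that improves both task utility and safety in open web environments through adaptive policy enhancement and dual-objective optimization.

Details

Motivation: LLM-based web agents performing tasks in open web environments struggle to balance task utility and safety, and existing research is mostly limited to single-objective optimization or single-turn scenarios.

Result: Experiments on multiple benchmarks show that HarmonyGuard improves policy compliance by up to 38% and task completion by up to 20%, with policy compliance above 90% across all tasks.

Insight: The study shows that dynamic policy updates combined with multi-agent collaboration are an effective route to jointly optimizing the safety and utility of web agents, offering a new approach to multi-objective optimization in open environments.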

Abstract: Large language models enable agents to autonomously perform tasks in open web environments. However, as hidden threats within the web evolve, web agents face the challenge of balancing task performance with emerging risks during long-sequence operations. Although this challenge is critical, current research remains limited to single-objective optimization or single-turn scenarios, lacking the capability for collaborative optimization of both safety and utility in web environments. To address this gap, we propose HarmonyGuard, a multi-agent collaborative framework that leverages policy enhancement and objective optimization to jointly improve both utility and safety. HarmonyGuard features a multi-agent architecture characterized by two fundamental capabilities: (1) Adaptive Policy Enhancement: We introduce the Policy Agent within HarmonyGuard, which automatically extracts and maintains structured security policies from unstructured external documents, while continuously updating policies in response to evolving threats. (2) Dual-Objective Optimization: Based on the dual objectives of safety and utility, the Utility Agent integrated within HarmonyGuard performs the Markovian real-time reasoning to evaluate the objectives and utilizes metacognitive capabilities for their optimization. Extensive evaluations on multiple benchmarks show that HarmonyGuard improves policy compliance by up to 38% and task completion by up to 20% over existing baselines, while achieving over 90% policy compliance across all tasks. Our project is available here: https://github.com/YurunChen/HarmonyGuard.

[111] Step More: Going Beyond Single Backpropagation in Meta Learning Based Model Editing cs.CL | cs.AI | cs.LG

Xiaopeng Li, Shasha Li, Xi Wang, Shezheng Song, Bin Ji

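TL;DR: The paper proposes SMEdit, a new method that uses multiple backpropagation steps (MBPS) and norm regularization on weight updates to improve the performance and training efficiency of meta-learning-based model editing (MLBME) in low-data scenarios.

Details

Motivation: The static nature of large language models (LLMs) makes updating their knowledge costly. Model editing, which injects new information through targeted parameter modifications, is an efficient alternative, but existing MLBME methods perform poorly in low-data scenarios and their training efficiency is limited by the computation of KL divergence.

Result: Experiments on two datasets and two LLMs show that SMEdit outperforms existing MLBME baselines, and that the MBPS strategy can further boost the performance of other methods.

Insight: MBPS and weight-norm regularization are effective means of improving the performance and efficiency of meta-learning-based model editing, particularly in low-data scenarios.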

Abstract: Large Language Models (LLMs) underpin many AI applications, but their static nature makes updating knowledge costly. Model editing offers an efficient alternative by injecting new information through targeted parameter modifications. In particular, meta-learning-based model editing (MLBME) methods have demonstrated notable advantages in both editing effectiveness and efficiency. Despite this, we find that MLBME exhibits suboptimal performance in low-data scenarios, and its training efficiency is bottlenecked by the computation of KL divergence. To address these, we propose Step More Edit (SMEdit), a novel MLBME method that adopts Multiple BackPropagation Steps (MBPS) to improve editing performance under limited supervision and a norm regularization on weight updates to improve training efficiency. Experimental results on two datasets and two LLMs demonstrate that SMEdit outperforms prior MLBME baselines and the MBPS strategy can be seamlessly integrated into existing methods to further boost their performance. Our code will be released soon.

[112] ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents cs.CL | cs.CV

Zechen Li, Baiyu Chen, Hao Xue, Flora D. Salim

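TL;DR: ZARA is an agent-based framework for zero-shot motion time-series analysis. It performs human activity recognition (HAR) with LLM agents driven by a knowledge base and retrieval, requires no fine-tuning or task-specific classifiers, and outperforms existing baselines in the zero-shot setting.

Details

Motivation: Traditional HAR methods are trained for fixed activity sets and require costly retraining when new behaviours or sensor setups appear. LLM-based approaches suffer from limited accuracy and poor interpretability. ZARA aims to address these problems with flexible, explainable zero-shot HAR.

Result: On 8 HAR benchmarks, ZARA achieves SOTA zero-shot performance, exceeding the strongest baselines by 2.53x in macro F1 while remaining interpretable. Ablation studies confirm the necessity of each module.

Insight: By combining a knowledge base and a retrieval module with an LLM, ZARA achieves effective HAR without training, pointing toward trustworthy, plug-and-play motion time-series analysis.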

Abstract: Motion sensor time-series are central to human activity recognition (HAR), with applications in health, sports, and smart devices. However, existing methods are trained for fixed activity sets and require costly retraining when new behaviours or sensor setups appear. Recent attempts to use large language models (LLMs) for HAR, typically by converting signals into text or images, suffer from limited accuracy and lack verifiable interpretability. We propose ZARA, the first agent-based framework for zero-shot, explainable HAR directly from raw motion time-series. ZARA integrates an automatically derived pair-wise feature knowledge base that captures discriminative statistics for every activity pair, a multi-sensor retrieval module that surfaces relevant evidence, and a hierarchical agent pipeline that guides the LLM to iteratively select features, draw on this evidence, and produce both activity predictions and natural-language explanations. ZARA enables flexible and interpretable HAR without any fine-tuning or task-specific classifiers. Extensive experiments on 8 HAR benchmarks show that ZARA achieves SOTA zero-shot performance, delivering clear reasoning while exceeding the strongest baselines by 2.53x in macro F1. Ablation studies further confirm the necessity of each module, marking ZARA as a promising step toward trustworthy, plug-and-play motion time-series analysis. Our codes are available at https://github.com/zechenli03/ZARA.

[113] Large Reasoning Models Are Autonomous Jailbreak Agents cs.CL | cs.AI | cs.CR

Thilo Hagendorff, Erik Derner, Nuria Oliver

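TL;DR: Jailbreaking has traditionally required complex techniques or specialized expertise, but this study finds that large reasoning models (LRMs) can act as autonomous jailbreak agents, simplifying and scaling the process so that non-experts can carry it out easily. Experiments in which LRMs attack nine target models yield an overall attack success rate of 97.14%, revealing a risk of systematic erosion of model safety guardrails.

Details

Motivation: To investigate whether large reasoning models, through their strong persuasive capabilities, can simplify and scale jailbreaking and thereby act as autonomous jailbreak agents.

Result: On a benchmark of 70 harmful prompts, the LRMs achieved an overall attack success rate of 97.14%.

Insight: The study highlights the urgency of aligning frontier models not only to resist jailbreak attempts but also to avoid being co-opted as jailbreak agents.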

Abstract: Jailbreaking – bypassing built-in safety mechanisms in AI models – has traditionally required complex technical procedures or specialized human expertise. In this study, we show that the persuasive capabilities of large reasoning models (LRMs) simplify and scale jailbreaking, converting it into an inexpensive activity accessible to non-experts. We evaluated the capabilities of four LRMs (DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, Qwen3 235B) to act as autonomous adversaries conducting multi-turn conversations with nine widely used target models. LRMs received instructions via a system prompt, before proceeding to planning and executing jailbreaks with no further supervision. We performed extensive experiments with a benchmark of harmful prompts composed of 70 items covering seven sensitive domains. This setup yielded an overall attack success rate across all model combinations of 97.14%. Our study reveals an alignment regression, in which LRMs can systematically erode the safety guardrails of other models, highlighting the urgent need to further align frontier models not only to resist jailbreak attempts, but also to prevent them from being co-opted into acting as jailbreak agents.

[114] GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning cs.CL

Jianghangfan Zhang, Yibo Yan, Kening Zheng, Xin Zou, Song Dai

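TL;DR: The paper proposes GM-PRM, which turns the process reward model (PRM) from a passive judge into a generative, active collaborator. It provides a fine-grained evaluation of each step of multimodal mathematical reasoning and can generate corrections for erroneous steps, significantly improving the diversity and correctness of reasoning.

Details

Motivation: Existing multimodal PRMs act only as binary verifiers that identify errors; they cannot correct them and offer little explanatory power, which makes it hard to handle visual or logical errors in complex multi-step mathematical reasoning. The authors therefore propose GM-PRM, which turns the PRM into an active collaborator.

Result: GM-PRM achieves SOTA results on multimodal math benchmarks and significantly boosts policy model performance with high data efficiency.

Insight: Turning the PRM from a passive verifier into an active collaborator that generates corrections further improves model capability, and the work also demonstrates the effectiveness of training on a small dataset.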

Abstract: Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities but often struggle with complex, multi-step mathematical reasoning, where minor errors in visual perception or logical deduction can lead to complete failure. While Process Reward Models (PRMs) offer step-by-step supervision, existing multimodal PRMs are limited to being binary verifiers that can identify but not correct errors, offering little explanatory power. To address these deficiencies, we introduce the Generative Multimodal Process Reward Model (GM-PRM), a novel paradigm that transforms the PRM from a passive judge into an active reasoning collaborator. Instead of a simple scalar score, GM-PRM provides a fine-grained, interpretable analysis of each reasoning step, evaluating its step intent, visual alignment, and logical soundness. More critically, GM-PRM is trained to generate a corrected version of the first erroneous step it identifies. This unique corrective capability enables our new test-time inference strategy, Refined Best-of-N (Refined-BoN). This framework actively enhances solution quality by using the PRM’s generated correction to guide the policy model toward a more promising reasoning trajectory, thereby improving the diversity and correctness of the solution pool. We demonstrate that GM-PRM achieves state-of-the-art results on multiple multimodal math benchmarks, significantly boosting policy model performance with remarkable data efficiency, requiring only a 20K-sample training dataset. Our code will be released upon acceptance.

[115] Unveiling Over-Memorization in Finetuning LLMs for Reasoning Tasks cs.CL

Zhiwen Ruan, Yun Chen, Yutao Hou, Peng Li, Yang Liu

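TL;DR: The paper studies over-memorization, a phenomenon observed during fine-tuning of large language models (LLMs): at a specific stage the model excessively memorizes the training data, resulting in high test perplexity while test accuracy remains good. It analyzes the causes and effects of this phenomenon and offers recommendations on checkpoint and learning rate selection during fine-tuning.

Details

Motivation: The goal is to understand the learning dynamics of LLMs during fine-tuning, in particular the over-memorization that appears on reasoning tasks. This phenomenon can reduce robustness, weaken generalization, and decrease generation diversity, so it warrants closer investigation.

Result: The experiments show that: 1) over-memorization is widespread in LLM fine-tuning; 2) it does not hurt test accuracy but reduces robustness and generalization; 3) large learning rates and many training epochs are the main causes.

Insight: Overparameterized LLMs exhibit learning dynamics during fine-tuning that differ from those of traditional machine learning models, and over-memorization is one example. The study recommends choosing checkpoints and learning rates carefully during fine-tuning to avoid it.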

Abstract: The pretrained large language models (LLMs) are finetuned with labeled data for better instruction following ability and alignment with human values. In this paper, we study the learning dynamics of LLM finetuning on reasoning tasks and reveal the uncovered over-memorization phenomenon during a specific stage of LLM finetuning. At this stage, the LLMs have excessively memorized training data and exhibit high test perplexity while maintaining good test accuracy. We investigate the conditions that lead to LLM over-memorization and find that training epochs and large learning rates contribute to this issue. Although models with over-memorization demonstrate comparable test accuracy to normal models, they suffer from reduced robustness, poor out-of-distribution generalization, and decreased generation diversity. Our experiments unveil the over-memorization to be broadly applicable across different tasks, models, and finetuning methods. Our research highlights that overparameterized, extensively finetuned LLMs exhibit unique learning dynamics distinct from traditional machine learning models. Based on our observations of over-memorization, we provide recommendations on checkpoint and learning rate selection during finetuning.

[116] Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap cs.CL | cs.AI | cs.LG

Xuan Qi, Rongwu Xu, Zhijing Jin

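TL;DR: The paper proposes a difficulty-based preference data selection method built on the DPO implicit reward gap, for efficiently selecting preference data in alignment tasks; with only 10% of the data it outperforms multiple baselines.

Details

Motivation: Aligning large language models (LLMs) with human preferences is a key challenge in AI research, but existing methods rely on large, costly preference datasets and lack mechanisms for selecting high-quality data.

Result: Across multiple datasets and tasks, the method outperforms five strong baselines using only 10% of the data, demonstrating its effectiveness and efficiency.

Insight: Selecting difficult examples (those with small reward gaps) allows more efficient alignment training of LLMs, offering a new approach to model alignment under limited resources.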

Abstract: Aligning large language models (LLMs) with human preferences is a critical challenge in AI research. While methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are widely used, they often rely on large, costly preference datasets. The current work lacks methods for high-quality data selection specifically for preference data. In this work, we introduce a novel difficulty-based data selection strategy for preference datasets, grounded in the DPO implicit reward mechanism. By selecting preference data examples with smaller DPO implicit reward gaps, which are indicative of more challenging cases, we improve data efficiency and model alignment. Our approach consistently outperforms five strong baselines across multiple datasets and alignment tasks, achieving superior performance with only 10% of the original data. This principled, efficient selection method offers a promising solution for scaling LLM alignment with limited resources.
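The selection rule in this abstract can be sketched with the standard DPO implicit reward, r(x, y) = β · (log π(y|x) − log π_ref(y|x)), where the gap for a pair is the reward of the chosen response minus that of the rejected one. The code below is a minimal sketch, not the authors' implementation: it assumes sequence log-probabilities have already been computed, ranks by the signed gap (the abstract does not say whether signed or absolute gaps are used), and all names (`PreferencePair`, `select_hardest`, `beta`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    # Precomputed sequence log-probabilities (assumed inputs).
    policy_logp_chosen: float
    policy_logp_rejected: float
    ref_logp_chosen: float
    ref_logp_rejected: float


def implicit_reward_gap(pair, beta=0.1):
    """DPO implicit reward of the chosen response minus that of the rejected one."""
    r_chosen = beta * (pair.policy_logp_chosen - pair.ref_logp_chosen)
    r_rejected = beta * (pair.policy_logp_rejected - pair.ref_logp_rejected)
    return r_chosen - r_rejected


def select_hardest(pairs, fraction=0.1, beta=0.1):
    """Return indices of the `fraction` of pairs with the smallest reward gap."""
    k = max(1, round(fraction * len(pairs)))
    ranked = sorted(range(len(pairs)),
                    key=lambda i: implicit_reward_gap(pairs[i], beta))
    return ranked[:k]


pairs = [
    PreferencePair(-10.0, -12.0, -11.0, -11.0),  # gap 0.2
    PreferencePair(-10.0, -10.0, -10.0, -10.0),  # gap 0.0 (hardest)
    PreferencePair(-5.0, -9.0, -7.0, -7.0),      # gap 0.4 (easiest)
]
print(select_hardest(pairs, fraction=0.34))
```

A small gap means the scoring model barely separates the chosen from the rejected response, which is the sense in which such pairs count as difficult.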

[117] The State Of TTS: A Case Study with Human Fooling Rates cs.CL | cs.LG | cs.SD | eess.AS

Praveen Srinivasa Varadhan, Sherry Thomas, Sai Teja M. S., Suvrat Bhooshan, Mitesh M. Khapra

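TL;DR: The paper proposes Human Fooling Rate (HFR) as a measure of whether a TTS system can truly fool humans, and through a large-scale evaluation of open-source and commercial TTS models reveals the limitations of current TTS systems under deception testing.

Details

Motivation: TTS systems have progressed rapidly in subjective evaluations in recent years, but existing metrics such as CMOS do not reflect whether a TTS system can actually fool humans, so a more direct, human-centric evaluation method is needed.

Result: Commercial models approach human-level deception in zero-shot settings, while open-source systems still fall short on natural conversational speech; fine-tuning on high-quality data improves realism but does not fully close the gap.

Insight: Future TTS research should adopt more human-centric evaluation and measure progress on more challenging datasets rather than relying solely on traditional subjective scores.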

Abstract: While subjective evaluations in recent years indicate rapid progress in TTS, can current TTS systems truly pass a human deception test in a Turing-like evaluation? We introduce Human Fooling Rate (HFR), a metric that directly measures how often machine-generated speech is mistaken for human. Our large-scale evaluation of open-source and commercial TTS models reveals critical insights: (i) CMOS-based claims of human parity often fail under deception testing, (ii) TTS progress should be benchmarked on datasets where human speech achieves high HFRs, as evaluating against monotonous or less expressive reference samples sets a low bar, (iii) Commercial models approach human deception in zero-shot settings, while open-source systems still struggle with natural conversational speech; (iv) Fine-tuning on high-quality data improves realism but does not fully bridge the gap. Our findings underscore the need for more realistic, human-centric evaluations alongside existing subjective tests.
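As described in this abstract, HFR measures how often machine-generated speech is mistaken for human. A minimal sketch of that computation follows; the record layout (pairs of `is_machine`, `judged_human`) and the function name are assumptions made for illustration, and the paper's exact listening-test protocol may differ.

```python
def rate_judged_human(judgments, machine=True):
    """Fraction of clips from one source that listeners judged to be human.

    judgments: iterable of (is_machine, judged_human) boolean pairs (assumed layout).
    machine=True  gives the fooling rate of machine-generated speech;
    machine=False gives the same rate for genuine human reference clips.
    """
    votes = [judged for is_machine, judged in judgments if is_machine == machine]
    if not votes:
        raise ValueError("no clips from the requested source")
    return sum(votes) / len(votes)


judgments = [
    (True, True), (True, False), (True, True), (True, False),  # TTS clips
    (False, True), (False, True), (False, False),              # human clips
]
print(rate_judged_human(judgments, machine=True))
print(rate_judged_human(judgments, machine=False))
```

Computing the same rate on the human reference clips gives the ceiling the abstract alludes to: if listeners rarely judge the reference speech itself to be human, the dataset sets a low bar for TTS systems.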

[118] Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity cs.CL | cs.AI

Peizheng Guo, Jingyao Wang, Wenwen Qiang, Huijie Guo, Changwen Zheng

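TL;DR: Through causal analysis, the paper finds that hallucinations in MLLMs (such as omission or fabrication) stem from failing to adequately capture essential causal factors or from being misled by non-causal cues. To address this, it proposes a reinforcement learning framework guided by causal completeness, which combines causal sufficiency and necessity and effectively reduces hallucinations.

Details

Motivation: Multimodal large language models (MLLMs) perform well on vision-language tasks but are prone to hallucination, i.e. generating outputs inconsistent with the input. The paper uses causal analysis to uncover the roots of hallucination and proposes a method to mitigate it.

Result: The method's effectiveness is validated on multiple benchmark datasets and tasks, where it significantly reduces MLLM hallucinations.

Insight: 1. Causal analysis is an effective tool for tackling MLLM hallucinations; 2. Reinforcement learning guided by causal completeness captures key information more accurately and improves generation quality.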

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across vision-language tasks. However, they may suffer from hallucinations, that is, generating outputs that are semantically inconsistent with the input image or text. Through causal analyses, we find that: (i) hallucinations with omission may arise from the failure to adequately capture essential causal factors, and (ii) hallucinations with fabrication are likely caused by the model being misled by non-causal cues. To address these challenges, we propose a novel reinforcement learning framework guided by causal completeness, which jointly considers both causal sufficiency and causal necessity of tokens. Specifically, we evaluate each token’s standalone contribution and counterfactual indispensability to define a token-level causal completeness reward. This reward is used to construct a causally informed advantage function within the GRPO optimization framework, encouraging the model to focus on tokens that are both causally sufficient and necessary for accurate generation. Experimental results across various benchmark datasets and tasks demonstrate the effectiveness of our approach, which effectively mitigates hallucinations in MLLMs.

[119] Characterizing Deep Research: A Benchmark and Formal Definition cs.CL

Abhinav Java, Ashmit Khandelwal, Sukruta Midigeshi, Aaron Halfaker, Amit Deshpande

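TL;DR: The paper proposes a formal definition of, and an evaluation benchmark for, the "deep research" (DR) task, arguing that its core feature is high fan-out over concepts (broad exploration) rather than the generation of lengthy reports. By introducing an intermediate output representation and the LiveDRBench benchmark, it provides an objective way to evaluate DR systems and analyzes the performance and limitations of existing systems.

Details

Motivation: As "deep research" tasks see growing use in information retrieval and reasoning, their definition and boundaries remain unclear. The paper aims to fill this gap with a more formal definition and a benchmark for objective evaluation.

Result: F1 scores of existing DR systems range from 0.02 to 0.72 across sub-categories, and OpenAI's model performs best with an overall F1 of 0.55. The analysis reveals shortcomings in how systems reference sources, branch, and backtrack.

Insight: The core challenge of DR lies in the breadth of the search mechanism and in reasoning ability, not in report length. Future research should focus on improving systems' search efficiency and grounding capabilities.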

Abstract: Information tasks such as writing surveys or analytical reports require complex search and reasoning, and have recently been grouped under the umbrella of "deep research", a term also adopted by recent models targeting these capabilities. Despite growing interest, the scope of the deep research task remains underdefined and its distinction from other reasoning-intensive problems is poorly understood. In this paper, we propose a formal characterization of the deep research (DR) task and introduce a benchmark to evaluate the performance of DR systems. We argue that the core defining feature of deep research is not the production of lengthy report-style outputs, but rather the high fan-out over concepts required during the search process, i.e., broad and reasoning-intensive exploration. To enable objective evaluation, we define DR using an intermediate output representation that encodes key claims uncovered during search, separating the reasoning challenge from surface-level report generation. Based on this formulation, we propose a diverse, challenging benchmark LiveDRBench with 100 challenging tasks over scientific topics (e.g., datasets, materials discovery, prior art search) and public interest events (e.g., flight incidents, movie awards). Across state-of-the-art DR systems, F1 score ranges between 0.02 and 0.72 for any sub-category. OpenAI’s model performs the best with an overall F1 score of 0.55. Analysis of reasoning traces reveals the distribution over the number of referenced sources, branching, and backtracking events executed by current DR systems, motivating future directions for improving their search mechanisms and grounding capabilities. The benchmark is available at https://github.com/microsoft/LiveDRBench.

[120] Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models cs.CL | cs.AI | cs.CR

Siddhant Panpatil, Hiskias Dingeto, Haon Park

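TL;DR: Through systematic manual red-teaming, this paper shows that current state-of-the-art language models still have alignment vulnerabilities in carefully crafted conversational scenarios, and it introduces MISALIGNMENTBENCH, an automated evaluation framework, to validate how general these vulnerabilities are.

Details

Motivation: Despite significant advances in alignment techniques, existing language models can still exhibit misaligned behavior in particular conversational scenarios, which calls for systematic analysis and validation.

Result: Automated testing shows an overall vulnerability rate of 76%, with GPT-4.1 the most susceptible (90%) and Claude-4-Sonnet showing greater resistance (40%).

Insight: Current alignment methods defend poorly against narrative immersion, emotional pressure, and strategic framing; future work needs to strengthen robustness in complex scenarios.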

Abstract: Despite significant advances in alignment techniques, we demonstrate that state-of-the-art language models remain vulnerable to carefully crafted conversational scenarios that can induce various forms of misalignment without explicit jailbreaking. Through systematic manual red-teaming with Claude-4-Opus, we discovered 10 successful attack scenarios, revealing fundamental vulnerabilities in how current alignment methods handle narrative immersion, emotional pressure, and strategic framing. These scenarios successfully elicited a range of misaligned behaviors, including deception, value drift, self-preservation, and manipulative reasoning, each exploiting different psychological and contextual vulnerabilities. To validate generalizability, we distilled our successful manual attacks into MISALIGNMENTBENCH, an automated evaluation framework that enables reproducible testing across multiple models. Cross-model evaluation of our 10 scenarios against five frontier LLMs revealed an overall 76% vulnerability rate, with significant variations: GPT-4.1 showed the highest susceptibility (90%), while Claude-4-Sonnet demonstrated greater resistance (40%). Our findings demonstrate that sophisticated reasoning capabilities often become attack vectors rather than protective mechanisms, as models can be manipulated into complex justifications for misaligned behavior. This work provides (i) a detailed taxonomy of conversational manipulation patterns and (ii) a reusable evaluation framework. Together, these findings expose critical gaps in current alignment strategies and highlight the need for robustness against subtle, scenario-based manipulation in future AI systems.

[121] Reasoning Beyond Labels: Measuring LLM Sentiment in Low-Resource, Culturally Nuanced Contexts cs.CL

Millicent Ochieng, Anja Thieme, Ignatius Ezeani, Risa Ueno, Samuel Maina

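TL;DR: The paper studies how large language models (LLMs) perform sentiment analysis in low-resource, culturally nuanced settings, proposes an evaluation framework grounded in social-science measurement, and reveals differences in model reasoning ability and how they relate to cultural context.

Details

Motivation: Conventional sentiment analysis assumes fixed labels and universal affective expression, assumptions that break down in low-resource, culturally complex contexts; this makes it necessary to study models' reasoning ability and cultural sensitivity.

Result: Top-tier LLMs remain stable under ambiguity or sentiment flips, whereas open models often falter, underscoring the importance of cultural sensitivity for model performance.

Insight: Culturally complex settings call for finer-grained model evaluation; combining reasoning ability with cultural sensitivity is a key direction for future AI research.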

Abstract: Sentiment analysis in low-resource, culturally nuanced contexts challenges conventional NLP approaches that assume fixed labels and universal affective expressions. We present a diagnostic framework that treats sentiment as a context-dependent, culturally embedded construct, and evaluate how large language models (LLMs) reason about sentiment in informal, code-mixed WhatsApp messages from Nairobi youth health groups. Using a combination of human-annotated data, sentiment-flipped counterfactuals, and rubric-based explanation evaluation, we probe LLM interpretability, robustness, and alignment with human reasoning. Framing our evaluation through a social-science measurement lens, we operationalize and interrogate LLM outputs as an instrument for measuring the abstract concept of sentiment. Our findings reveal significant variation in model reasoning quality, with top-tier LLMs demonstrating interpretive stability, while open models often falter under ambiguity or sentiment shifts. This work highlights the need for culturally sensitive, reasoning-aware AI evaluation in complex, real-world communication.

[122] ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments cs.CL | cs.AI

Yuquan Wang, Mi Zhang, Yining Wang, Geng Hong, Xiaoyu You

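TL;DR: ReasoningGuard is an inference-time safeguard that injects "safety aha moments" into the reasoning process of large reasoning models (LRMs), effectively preventing harmful content generation while keeping the reasoning helpful.

Details

Motivation: Large reasoning models excel at reasoning-intensive tasks but tend to generate harmful content in the mid-to-late steps of their reasoning. Existing defenses rely on costly fine-tuning and additional expert knowledge, which limits their scalability.

Result: ReasoningGuard outperforms seven existing safeguards in defending against three types of jailbreak attacks while avoiding overly defensive behavior.

Insight: A model's internal attention behavior during reasoning can be used for real-time detection and intervention, offering a new practical route to model safety.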

Abstract: Large Reasoning Models (LRMs) have demonstrated impressive performance in reasoning-intensive tasks, but they remain vulnerable to harmful content generation, particularly in the mid-to-late steps of their reasoning processes. Existing defense mechanisms, however, rely on costly fine-tuning and additional expert knowledge, which restricts their scalability. In this work, we propose ReasoningGuard, an inference-time safeguard for LRMs, which injects timely safety aha moments to steer harmless while helpful reasoning processes. Leveraging the model’s internal attention behavior, our approach accurately identifies critical points in the reasoning path, and triggers spontaneous, safety-oriented reflection. To safeguard both the subsequent reasoning steps and the final answers, we further implement a scaling sampling strategy during the decoding phase, selecting the optimal reasoning path. Inducing minimal extra inference cost, ReasoningGuard effectively mitigates three types of jailbreak attacks, including the latest ones targeting the reasoning process of LRMs. Our approach outperforms seven existing safeguards, achieving state-of-the-art safety defenses while effectively avoiding the common exaggerated safety issues.

[123] Hierarchical Text Classification Using Black Box Large Language Models cs.CL | cs.LG

Kosuke Yoshimura, Hisashi Kashima

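TL;DR: The paper examines the feasibility of using black-box large language models (LLMs) for hierarchical text classification (HTC), comparing three prompting strategies (DL, DH, TMH) in zero-shot and few-shot settings. It finds that the few-shot setting improves accuracy but that cost rises with hierarchy depth.

Details

Motivation: Hierarchical text classification (HTC) is challenging because of data scarcity and model complexity, and traditional methods require large amounts of labeled data and computational resources. The paper explores black-box LLMs accessed via APIs as an alternative that lowers resource requirements.

Result: The experiments show that: 1) the few-shot setting consistently improves classification accuracy; 2) LLMs (especially with the DH strategy) tend to outperform the traditional model on the dataset with the deeper label hierarchy; 3) API costs increase with label hierarchy depth.

Insight: The key to balancing performance and cost is choosing a suitable prompting strategy; the study also confirms the potential of black-box LLMs for HTC, particularly for classification over deep hierarchies.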

Abstract: Hierarchical Text Classification (HTC) aims to assign texts to structured label hierarchies; however, it faces challenges due to data scarcity and model complexity. This study explores the feasibility of using black box Large Language Models (LLMs) accessed via APIs for HTC, as an alternative to traditional machine learning methods that require extensive labeled data and computational resources. We evaluate three prompting strategies – Direct Leaf Label Prediction (DL), Direct Hierarchical Label Prediction (DH), and Top-down Multi-step Hierarchical Label Prediction (TMH) – in both zero-shot and few-shot settings, comparing the accuracy and cost-effectiveness of these strategies. Experiments on two datasets show that a few-shot setting consistently improves classification accuracy compared to a zero-shot setting. While a traditional machine learning model achieves high accuracy on a dataset with a shallow hierarchy, LLMs, especially DH strategy, tend to outperform the machine learning model on a dataset with a deeper hierarchy. API costs increase significantly due to the higher input tokens required for deeper label hierarchies on DH strategy. These results emphasize the trade-off between accuracy improvement and the computational cost of prompt strategy. These findings highlight the potential of black box LLMs for HTC while underscoring the need to carefully select a prompt strategy to balance performance and cost.
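To make the three strategies named in this abstract concrete, the sketch below builds a DL prompt (leaf labels only), a DH prompt (full label paths), and a TMH loop (one query per hierarchy level). Everything here is illustrative: the toy hierarchy, the prompt wording, and the `keyword_stub` function standing in for an LLM API call are assumptions, not taken from the paper.

```python
# Toy two-level label hierarchy (hypothetical).
HIERARCHY = {
    "Science": {"Physics": {}, "Biology": {}},
    "Sports": {"Tennis": {}, "Soccer": {}},
}


def leaf_paths(tree, prefix=()):
    """List every root-to-leaf label path in the hierarchy."""
    paths = []
    for label, children in tree.items():
        path = prefix + (label,)
        paths.extend(leaf_paths(children, path) if children else [path])
    return paths


def dl_prompt(text, tree):
    """Direct Leaf Label Prediction: offer only the leaf labels."""
    leaves = [p[-1] for p in leaf_paths(tree)]
    return f"Text: {text}\nChoose one label: {', '.join(leaves)}"


def dh_prompt(text, tree):
    """Direct Hierarchical Label Prediction: offer full label paths."""
    paths = [" > ".join(p) for p in leaf_paths(tree)]
    return f"Text: {text}\nChoose one label path: {'; '.join(paths)}"


def tmh_classify(text, tree, ask):
    """Top-down Multi-step prediction: one query per hierarchy level."""
    path, node = [], tree
    while node:
        options = list(node)
        prompt = f"Text: {text}\nChoose one label: {', '.join(options)}"
        choice = ask(prompt, options)
        path.append(choice)
        node = node[choice]
    return path


def keyword_stub(prompt, options):
    """Stand-in for an LLM call: picks the first option named in the text line."""
    first_line = prompt.split("\n")[0].lower()
    return next(o for o in options if o.lower() in first_line)


text = "Quantum physics in science class"
print(dl_prompt(text, HIERARCHY))
print(dh_prompt(text, HIERARCHY))
print(tmh_classify(text, HIERARCHY, keyword_stub))
```

Even in this toy case the DH prompt is longer than the DL prompt because every candidate carries its full path, which is consistent with the abstract's observation that input-token cost grows with hierarchy depth under DH.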

[124] DP-GPT4MTS: Dual-Prompt Large Language Model for Textual-Numerical Time Series Forecasting cs.CLPDF

Chanjuan Liu, Shengzhi Wang, Enqiang Zhu

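TL;DR: DP-GPT4MTS is a dual-prompt large language model framework that combines textual and numerical time series data to improve forecasting accuracy.

Details

Motivation: Traditional time series forecasting models overlook textual information such as events and news, while existing single-prompt frameworks struggle to capture the semantics of timestamped text and introduce redundant information that hurts performance.

Result: Experiments on multimodal time series datasets show that DP-GPT4MTS outperforms existing methods, supporting the effectiveness of the dual-prompt mechanism.

Insight: A dual-prompt mechanism integrates textual context more effectively, reduces redundancy, and improves model performance.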

Abstract: Time series forecasting is crucial in strategic planning and decision-making across various industries. Traditional forecasting models mainly concentrate on numerical time series data, often overlooking important textual information such as events and news, which can significantly affect forecasting accuracy. While large language models show promise for integrating multimodal data, existing single-prompt frameworks struggle to effectively capture the semantics of timestamped text, introducing redundant information that can hinder model performance. To address this limitation, we introduce DP-GPT4MTS (Dual-Prompt GPT2-base for Multimodal Time Series), a novel dual-prompt large language model framework that combines two complementary prompts: an explicit prompt for clear task instructions and a textual prompt for context-aware embeddings from time-stamped data. The tokenizer generates the explicit prompt while the embeddings from the textual prompt are refined through self-attention and feed-forward networks. Comprehensive experiments conducted on diverse textual-numerical time series datasets demonstrate that this approach outperforms state-of-the-art algorithms in time series forecasting. This highlights the significance of incorporating textual context via a dual-prompt mechanism to achieve more accurate time series predictions.

[125] TalkDep: Clinically Grounded LLM Personas for Conversation-Centric Depression Screening cs.CL | cs.AIPDF

Xi Wang, Anxo Perez, Javier Parapar, Fabio Crestani

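TL;DR: This paper proposes TalkDep, a clinically grounded patient simulation pipeline built on advanced language models. It generates clinically valid and diverse conversations with simulated depression patients to support the training and evaluation of diagnostic models.

Details

Motivation: Demand for mental health services is surging, but real training data is scarce, which limits support for depression diagnosis. Existing virtual patient generation methods struggle to produce clinically valid and natural symptom presentations.

Result: The simulated patients show high reliability in assessments by clinical professionals, providing a scalable resource for depression diagnosis systems.

Insight: Combining clinical criteria with language models can produce high-quality simulated patient conversations, which helps improve the robustness and generalisability of automatic diagnosis systems.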

Abstract: The increasing demand for mental health services has outpaced the availability of real training data to develop clinical professionals, leading to limited support for the diagnosis of depression. This shortage has motivated the development of simulated or virtual patients to assist in training and evaluation, but existing approaches often fail to generate clinically valid, natural, and diverse symptom presentations. In this work, we embrace the recent advanced language models as the backbone and propose a novel clinician-in-the-loop patient simulation pipeline, TalkDep, with access to diversified patient profiles to develop simulated patients. By conditioning the model on psychiatric diagnostic criteria, symptom severity scales, and contextual factors, our goal is to create authentic patient responses that can better support diagnostic model training and evaluation. We verify the reliability of these simulated patients with thorough assessments conducted by clinical professionals. The availability of validated simulated patients offers a scalable and adaptable resource for improving the robustness and generalisability of automatic depression diagnosis systems.

[126] ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents cs.CLPDF

Jiangyuan Wang, Kejun Xiao, Qi Sun, Huaipeng Zhao, Tao Luo

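TL;DR: This paper proposes ShoppingBench, a shopping benchmark grounded in real-world intents, to evaluate LLM agents on complex shopping tasks. It shows that current state-of-the-art language agents still have substantial room for improvement on these tasks.

Details

Motivation: Existing e-commerce benchmarks focus mainly on basic user intents, whereas real users have more complex needs such as applying vouchers and managing budgets, which calls for a more comprehensive evaluation tool.

Result: Even GPT-4.1 achieves an absolute success rate below 50% on the benchmark tasks, while a smaller agent trained with the proposed method performs competitively with GPT-4.1.

Insight: Complex shopping tasks pose significant challenges for language agents, and distillation can effectively improve the performance of smaller agents.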

Abstract: Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and finding sellers that offer multiple products. To bridge this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products. Experimental results demonstrate that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates under 50% on our benchmark tasks, highlighting the significant challenges posed by ShoppingBench. In addition, we propose a trajectory distillation strategy and leverage supervised fine-tuning, along with reinforcement learning on synthetic trajectories, to distill the capabilities of a large language agent into a smaller one. As a result, our trained agent achieves competitive performance compared to GPT-4.1.

[127] A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models cs.CL | cs.AIPDF

Jiayi Wen, Tianxin Chen, Zhirun Zheng, Cheng Huang

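TL;DR: This paper exposes a threat to graph-based retrieval-augmented generation (GraphRAG) during knowledge graph construction. It proposes two knowledge poisoning attacks that alter graph construction by modifying a small amount of text, misleading downstream reasoning, and that existing defenses struggle to detect.

Details

Motivation: GraphRAG improves the accuracy and explainability of large language models (LLMs) through structured knowledge graphs, but its reliance on LLMs to extract knowledge from text can be maliciously exploited. The paper explores this attack surface and proposes efficient poisoning attacks.

Result: (1) TKPA controls QA outcomes with a success rate of 93.1%; (2) UKPA reduces QA accuracy from 95% to 50% while modifying fewer than 0.05% of the text; (3) existing defense methods fail to detect these attacks.

Insight: The security of GraphRAG has not been fully explored, and small modifications to the text during construction can have severe consequences. Effective defenses against knowledge poisoning still need to be developed.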

Abstract: Graph-based Retrieval-Augmented Generation (GraphRAG) has recently emerged as a promising paradigm for enhancing large language models (LLMs) by converting raw text into structured knowledge graphs, improving both accuracy and explainability. However, GraphRAG relies on LLMs to extract knowledge from raw text during graph construction, and this process can be maliciously manipulated to implant misleading information. Targeting this attack surface, we propose two knowledge poisoning attacks (KPAs) and demonstrate that modifying only a few words in the source text can significantly change the constructed graph, poison the GraphRAG, and severely mislead downstream reasoning. The first attack, named Targeted KPA (TKPA), utilizes graph-theoretic analysis to locate vulnerable nodes in the generated graphs and rewrites the corresponding narratives with LLMs, achieving precise control over specific question-answering (QA) outcomes with a success rate of 93.1%, while keeping the poisoned text fluent and natural. The second attack, named Universal KPA (UKPA), exploits linguistic cues such as pronouns and dependency relations to disrupt the structural integrity of the generated graph by altering globally influential words. With fewer than 0.05% of full text modified, the QA accuracy collapses from 95% to 50%. Furthermore, experiments show that state-of-the-art defense methods fail to detect these attacks, highlighting that securing GraphRAG pipelines against knowledge poisoning remains largely unexplored.

[128] Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models cs.CL | cs.AI | cs.CV | cs.LG | cs.MMPDF

Zizhan Ma, Wenxuan Wang, Guo Yu, Yiu-Fai Cheung, Meidan Ding

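TL;DR: The paper proposes MedCheck, the first lifecycle-oriented assessment framework for medical large language model (LLM) benchmarks, addressing shortcomings of existing benchmarks in clinical fidelity, data management, and safety.

Details

Motivation: Existing medical LLM benchmarks suffer from low clinical relevance, poor data integrity, and missing safety evaluation, so a systematic method for improving benchmark standards is needed.

Result: The analysis finds that benchmarks are widely disconnected from clinical practice, carry high data contamination risks, and neglect key dimensions such as model robustness and uncertainty awareness. MedCheck can serve as a guideline for improvement.

Insight: Medical LLM evaluation must balance standardization with domain-specific needs, and MedCheck points the way toward more reliable and transparent benchmarks.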

Abstract: Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However, concerns persist regarding the reliability of these benchmarks, which often lack clinical fidelity, robust data management, and safety-oriented evaluation metrics. To address these shortcomings, we introduce MedCheck, the first lifecycle-oriented assessment framework specifically designed for medical benchmarks. Our framework deconstructs a benchmark’s development into five continuous stages, from design to governance, and provides a comprehensive checklist of 46 medically-tailored criteria. Using MedCheck, we conducted an in-depth empirical evaluation of 53 medical LLM benchmarks. Our analysis uncovers widespread, systemic issues, including a profound disconnect from clinical practice, a crisis of data integrity due to unmitigated contamination risks, and a systematic neglect of safety-critical evaluation dimensions like model robustness and uncertainty awareness. Based on these findings, MedCheck serves as both a diagnostic tool for existing benchmarks and an actionable guideline to foster a more standardized, reliable, and transparent approach to evaluating AI in healthcare.

[129] Modelling and Classifying the Components of a Literature Review cs.CL | cs.AI | cs.HC | cs.IRPDF

Francisco Bolaños, Angelo Salatino, Francesco Osborne, Enrico Motta

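TL;DR: This paper proposes an annotation schema designed to support literature review generation and evaluates 37 LLMs on classifying the rhetorical roles of sentences, showing that LLMs fine-tuned on high-quality data perform very well (F1 above 96%).

Details

Motivation: Existing methods for analysing scientific literature benefit significantly from annotating sentences by rhetorical role (such as research gaps, results, and limitations), but an annotation schema and annotation strategies that support automatic generation of high-quality literature reviews are lacking.

Result: Fine-tuned LLMs perform very well (F1 above 96%), lightweight open-source models also perform strongly, and semi-synthetic data is effective at improving performance.

Insight: (1) Fine-tuning on high-quality data significantly improves performance; (2) lightweight models perform well under suitable conditions; (3) augmenting training with semi-synthetic data is especially beneficial for small models.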

Abstract: Previous work has demonstrated that AI methods for analysing scientific literature benefit significantly from annotating sentences in papers according to their rhetorical roles, such as research gaps, results, limitations, extensions of existing methodologies, and others. Such representations also have the potential to support the development of a new generation of systems capable of producing high-quality literature reviews. However, achieving this goal requires the definition of a relevant annotation schema and effective strategies for large-scale annotation of the literature. This paper addresses these challenges by 1) introducing a novel annotation schema specifically designed to support literature review generation and 2) conducting a comprehensive evaluation of a wide range of state-of-the-art large language models (LLMs) in classifying rhetorical roles according to this schema. To this end, we also present Sci-Sentence, a novel multidisciplinary benchmark comprising 700 sentences manually annotated by domain experts and 2,240 sentences automatically labelled using LLMs. We evaluate 37 LLMs on this benchmark, spanning diverse model families and sizes, using both zero-shot learning and fine-tuning approaches. The experiments yield several novel insights that advance the state of the art in this challenging domain. First, the current generation of LLMs performs remarkably well on this task when fine-tuned on high-quality data, achieving performance levels above 96% F1. Second, while large proprietary models like GPT-4o achieve the best results, some lightweight open-source alternatives also demonstrate excellent performance. Finally, enriching the training data with semi-synthetic examples generated by LLMs proves beneficial, enabling small encoders to achieve robust results and significantly enhancing the performance of several open decoder models.

[130] GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy cs.CL | cs.AIPDF

Hongze Tan, Jianfei Pan

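TL;DR: The paper proposes GTPO and GRPO-S, which use dynamic entropy weighting to address the coarse-grained reward assignment of conventional reinforcement learning, significantly improving LLM performance on long-chain reasoning tasks.

Details

Motivation: Reinforcement learning algorithms such as GRPO apply a uniform reward across tokens in LLM reasoning, which works poorly for long-chain reasoning tasks, so finer-grained reward signals are needed.

Result: Experiments show that GTPO and GRPO-S significantly outperform the DAPO baseline, supporting the effectiveness of the entropy-weighting mechanism.

Insight: High-entropy tokens in correct responses can guide the policy toward a higher performance ceiling, offering a new direction for deep reasoning tasks.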

Abstract: Reinforcement learning (RL) with algorithms like Group Relative Policy Optimization (GRPO) improves Large Language Model (LLM) reasoning, but is limited by a coarse-grained credit assignment that applies a uniform reward to all tokens in a sequence. This is a major flaw in long-chain reasoning tasks. This paper addresses the problem with Dynamic Entropy Weighting. Our core idea is that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling. This allows us to create more fine-grained reward signals for precise policy updates in two ways: 1) Group Token Policy Optimization (GTPO), which assigns an entropy-weighted reward to each token for fine-grained credit assignment; 2) Sequence-Level Group Relative Policy Optimization (GRPO-S), which assigns an entropy-weighted reward to each sequence based on its average token entropy. Experiments show our methods significantly outperform the strong DAPO baseline. The results confirm that our entropy-weighting mechanism is the key driver of this performance boost, offering a better path to enhance deep reasoning in models.
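
The abstract does not give the weighting formula, so the following is only a sketch of what entropy-weighted credit assignment could look like. It assumes a simple proportional rule (scale a reward by entropy relative to the mean); the function names and this rule are illustrative assumptions, not the paper's GTPO or GRPO-S definitions.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def token_level_rewards(reward, entropies):
    """Token-level shaping (assumed rule): scale the sequence reward for each
    token by that token's entropy relative to the sequence mean."""
    mean_h = sum(entropies) / len(entropies)
    if mean_h == 0.0:
        return [reward] * len(entropies)
    return [reward * h / mean_h for h in entropies]


def sequence_level_rewards(rewards, mean_entropies):
    """Sequence-level shaping (assumed rule): scale each sequence reward by
    its average token entropy relative to the group mean."""
    group_mean = sum(mean_entropies) / len(mean_entropies)
    if group_mean == 0.0:
        return list(rewards)
    return [r * h / group_mean for r, h in zip(rewards, mean_entropies)]


# Three toy next-token distributions: certain, maximally uncertain, in between.
dists = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
entropies = [token_entropy(d) for d in dists]

# One correct response with reward 1.0, redistributed across its tokens.
shaped_tokens = token_level_rewards(1.0, entropies)

# Two correct responses in a group, with mean token entropies 0.2 and 0.6.
shaped_sequences = sequence_level_rewards([1.0, 1.0], [0.2, 0.6])
```

Under this assumed rule the mean token reward stays equal to the original sequence reward, while high-entropy tokens receive a larger share of the credit than near-deterministic ones.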

[131] Chain of Questions: Guiding Multimodal Curiosity in Language Models cs.CL | cs.AI | cs.CV | cs.LG | cs.MAPDF

Nima Iji, Kia Dashtipour

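TL;DR: This paper proposes the Chain of Questions (CoQ) framework, which dynamically generates questions that guide multimodal language models to selectively activate relevant sensory modalities, improving reasoning in complex environments.

Details

Motivation: Large language models have made notable progress in single-modality reasoning, but in multimodal environments they still lack initiative and cannot dynamically decide which sensory modalities (such as vision or audio) are needed. The study aims to fill this gap.

Result: Experiments show that CoQ improves the model's ability to identify and integrate relevant information in multimodal tasks, which raises accuracy and better aligns the reasoning process with the task.

Insight: Dynamically generated questions can effectively direct a multimodal model's attention, addressing a gap in existing methods for proactive multimodal reasoning and suggesting a new direction for future research.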

Abstract: Reasoning capabilities in large language models (LLMs) have substantially advanced through methods such as chain-of-thought and explicit step-by-step explanations. However, these improvements have not yet fully transitioned to multimodal contexts, where models must proactively decide which sensory modalities such as vision, audio, or spatial perception to engage when interacting with complex real-world environments. In this paper, we introduce the Chain of Questions (CoQ) framework, a curiosity-driven reasoning approach that encourages multimodal language models to dynamically generate targeted questions regarding their surroundings. These generated questions guide the model to selectively activate relevant modalities, thereby gathering critical information necessary for accurate reasoning and response generation. We evaluate our framework on a novel multimodal benchmark dataset, assembled by integrating WebGPT, ScienceQA, AVSD, and ScanQA datasets. Experimental results demonstrate that our CoQ method improves a foundation model’s ability to effectively identify and integrate pertinent sensory information. This leads to improved accuracy, interpretability, and alignment of the reasoning process with diverse multimodal tasks.

[132] StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs through Knowledge-Reasoning Fusion cs.CL | cs.AI | cs.LGPDF

Yutong Wu, Di Huang, Ruosi Wan, Yue Peng, Shijie Shang

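TL;DR: StepFun-Formalizer fuses knowledge and reasoning to improve LLM performance on autoformalization, using a data synthesis and training pipeline that strengthens both formal-language knowledge and reasoning ability.

Details

Motivation: Existing LLMs are not accurate enough at autoformalization, mainly because they lack comprehensive mastery of formal-language knowledge and the reasoning ability needed to understand natural-language problems.

Result: StepFun-Formalizer-32B achieves BEq@1 scores of 40.5% on FormalMATH-Lite and 26.7% on ProverBench, surpassing all prior general-purpose and specialized models.

Insight: Success at autoformalization depends on combining formal knowledge with reasoning ability, and improvements in data synthesis and training methods can significantly raise model performance.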

Abstract: Autoformalization aims to translate natural-language mathematical statements into a formal language. While LLMs have accelerated progress in this area, existing methods still suffer from low accuracy. We identify two key abilities for effective autoformalization: comprehensive mastery of formal-language domain knowledge, and reasoning capability of natural language problem understanding and informal-formal alignment. Without the former, a model cannot identify the correct formal objects; without the latter, it struggles to interpret real-world contexts and map them precisely into formal expressions. To address these gaps, we introduce ThinkingF, a data synthesis and training pipeline that improves both abilities. First, we construct two datasets: one by distilling and selecting large-scale examples rich in formal knowledge, and another by generating informal-to-formal reasoning trajectories guided by expert-designed templates. We then apply SFT and RLVR with these datasets to further fuse and refine the two abilities. The resulting 7B and 32B models exhibit both comprehensive formal knowledge and strong informal-to-formal reasoning. Notably, StepFun-Formalizer-32B achieves SOTA BEq@1 scores of 40.5% on FormalMATH-Lite and 26.7% on ProverBench, surpassing all prior general-purpose and specialized models.

[133] Unveiling the Landscape of Clinical Depression Assessment: From Behavioral Signatures to Psychiatric Reasoning cs.CL | cs.AIPDF

Zhuang Chen, Guanqun Bi, Wen Zhang, Jiawei Hu, Aoyun Wang

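TL;DR: This paper presents an approach to clinical depression assessment. It introduces the C-MIND dataset of multimodal clinical data, analyzes behavioral signatures, examines the limitations of LLMs in psychiatric reasoning, and proposes an improvement that incorporates clinical expertise.

Details

Motivation: Automated tools for depression assessment show promise, but existing studies often rely on data that is not clinically validated or on overly complex model designs, and fall short in real-world effectiveness.

Result: The improved LLM diagnostic performance rises by up to 10% in Macro-F1 score.

Insight: Real clinical data and multimodal combinations are essential for depression assessment, and LLMs need domain knowledge to reason more effectively.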

Abstract: Depression is a widespread mental disorder that affects millions worldwide. While automated depression assessment shows promise, most studies rely on limited or non-clinically validated data, and often prioritize complex model design over real-world effectiveness. In this paper, we aim to unveil the landscape of clinical depression assessment. We introduce C-MIND, a clinical neuropsychiatric multimodal diagnosis dataset collected over two years from real hospital visits. Each participant completes three structured psychiatric tasks and receives a final diagnosis from expert clinicians, with informative audio, video, transcript, and functional near-infrared spectroscopy (fNIRS) signals recorded. Using C-MIND, we first analyze behavioral signatures relevant to diagnosis. We train a range of classical models to quantify how different tasks and modalities contribute to diagnostic performance, and dissect the effectiveness of their combinations. We then explore whether LLMs can perform psychiatric reasoning like clinicians and identify their clear limitations in realistic clinical settings. In response, we propose guiding the reasoning process with clinical expertise, which consistently improves LLM diagnostic performance by up to 10% in Macro-F1 score. We aim to build an infrastructure for clinical depression assessment from both data and algorithmic perspectives, enabling C-MIND to facilitate grounded and reliable research for mental healthcare.

[134] Beyond Brainstorming: What Drives High-Quality Scientific Ideas? Lessons from Multi-Agent Collaboration cs.CL | cs.AI | cs.CYPDF

Nuo Chen, Yicheng Tong, Jiaying Wu, Minh Duc Duong, Qian Wang

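TL;DR: The paper studies the role of multi-agent collaboration in scientific idea generation, shows that it outperforms single-agent approaches, and highlights the importance of cognitive diversity and a leader role.

Details

Motivation: Existing AI frameworks for scientific ideation mostly rely on a single agent, which limits creativity. Inspired by real-world research dynamics, the authors investigate whether structured multi-agent discussion can overcome the limits of a single agent.

Result: Multi-agent discussion substantially outperforms single-agent baselines. A designated leader helps turn discussion into more integrated and visionary proposals. Cognitive diversity is a primary driver of quality, but expertise is an indispensable prerequisite.

Insight: Combining cognitive diversity with expertise is key to high-quality scientific ideas, and the leader role acts as a catalyst in multi-agent collaboration.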

Abstract: While AI agents show potential in scientific ideation, most existing frameworks rely on single-agent refinement, limiting creativity due to bounded knowledge and perspective. Inspired by real-world research dynamics, this paper investigates whether structured multi-agent discussions can surpass solitary ideation. We propose a cooperative multi-agent framework for generating research proposals and systematically compare configurations including group size, leader-led versus leaderless structures, and team compositions varying in interdisciplinarity and seniority. To assess idea quality, we employ a comprehensive protocol with agent-based scoring and human review across dimensions such as novelty, strategic vision, and integration depth. Our results show that multi-agent discussions substantially outperform solitary baselines. A designated leader acts as a catalyst, transforming discussion into more integrated and visionary proposals. Notably, we find that cognitive diversity is a primary driver of quality, yet expertise is a non-negotiable prerequisite, as teams lacking a foundation of senior knowledge fail to surpass even a single competent agent. These findings offer actionable insights for designing collaborative AI ideation systems and shed light on how team structure influences creative outcomes.

Magauiya Zhussip, Dmitriy Shopkhoev, Ammar Ali, Stamatios Lefkimmiatis

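TL;DR: This paper proposes MASA, a framework that shares weights across Transformer layers through matrix dictionary learning, cutting attention module parameters by 66.7% while maintaining performance. The method is simple and efficient and applies to models of various tasks and scales.

Details

Motivation: The high computational and memory demands of large language models (LLMs) hinder widespread deployment. Existing compression techniques mostly target intra-block optimization, while the repetitive layered structure of Transformers implies inter-layer redundancy. The authors explore cross-layer weight sharing to improve efficiency.

Result: On models with 100M to 700M parameters, MASA outperforms GQA, low-rank baselines, and other methods at comparable parameter budgets. Extended to ViT, it matches performance with 66.7% fewer attention parameters.

Insight: Matrix dictionary learning can capture statistical regularities across Transformer layers, offering a parameter-efficient way to optimize models without sacrificing performance. It may also be applicable to compressing pretrained models.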

Abstract: Large language models (LLMs) have revolutionized AI applications, yet their high computational and memory demands hinder their widespread deployment. Existing compression techniques focus on intra-block optimizations (e.g. low-rank approximation, attention head pruning), while the repetitive layered structure of transformers implies significant inter-block redundancy - a dimension largely unexplored beyond key-value (KV) caching. Inspired by dictionary learning in CNNs, we propose a framework for structured weight sharing across transformer layers. Our approach decomposes attention projection matrices into shared dictionary atoms, reducing the attention module’s parameters by 66.7% while achieving on-par performance. Unlike complex methods requiring distillation or architectural changes, MASA (Matrix Atom Sharing in Attention) operates as a drop-in replacement - trained with standard optimizers - and represents each layer’s weights as linear combinations of shared matrix atoms. Experiments across scales (100M-700M parameters) show that MASA achieves better benchmark accuracy and perplexity than grouped-query attention (GQA), low-rank baselines and recently proposed Repeat-all-over/Sequential sharing at comparable parameter budgets. Ablation studies confirm robustness to the dictionary size and the efficacy of shared representations in capturing cross-layer statistical regularities. Extending to Vision Transformers (ViT), MASA matches performance metrics on image classification and detection tasks with 66.7% fewer attention parameters. By combining dictionary learning strategies with transformer efficiency, MASA offers a scalable blueprint for parameter-efficient models without sacrificing performance. Finally, we investigate the possibility of employing MASA on pretrained LLMs to reduce their number of parameters without experiencing any significant drop in their performance.
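
The idea of representing each layer's weights as a linear combination of shared matrix atoms can be sketched in a few lines. The layer count, width, and dictionary size below are hypothetical values chosen so that the arithmetic lands near a two-thirds reduction; the abstract does not state the configuration used, and `combine_atoms` and `param_counts` are illustrative names rather than the authors' code.

```python
def combine_atoms(atoms, coeffs):
    """Build one layer's matrix as sum_k coeffs[k] * atoms[k]."""
    rows, cols = len(atoms[0]), len(atoms[0][0])
    return [
        [sum(c * atom[i][j] for c, atom in zip(coeffs, atoms)) for j in range(cols)]
        for i in range(rows)
    ]


def param_counts(num_layers, dim, num_atoms):
    """Return (unshared, shared) parameter counts for one projection type.

    Unshared: every layer owns a dim x dim matrix.
    Shared:   num_atoms dim x dim atoms plus num_atoms coefficients per layer.
    """
    unshared = num_layers * dim * dim
    shared = num_atoms * dim * dim + num_layers * num_atoms
    return unshared, shared


# Two tiny 2x2 atoms shared by all layers (toy values).
atoms = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
# One layer's projection matrix, defined only by its two coefficients.
layer_weight = combine_atoms(atoms, [2.0, 0.5])

# Hypothetical configuration: 12 layers, width 768, dictionary of 4 atoms.
unshared, shared = param_counts(num_layers=12, dim=768, num_atoms=4)
reduction = 1.0 - shared / unshared
```

With these assumed numbers the per-layer coefficients are negligible next to the atoms, so the saving is governed almost entirely by the ratio of atoms to layers (here 4 to 12, or roughly a two-thirds reduction).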

[136] IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards cs.CLPDF

Xu Guo, Tianyi Liang, Tong Jian, Xiaogui Yang, Ling-I Wu

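TL;DR: IFDecorator is a framework that combines a cooperative-adversarial data flywheel, an intent-checking module, and a diagnostic mechanism to make instruction-following reinforcement learning more efficient and robust and to reduce reward hacking.

Details

Motivation: Conventional RLVR methods fall short in training efficiency and in aligning with user intent, and they are prone to reward hacking, which hurts real-world model behavior.

Result: The model achieves 87.43% accuracy on IFEval, outperforming GPT-4o, with substantial improvements on FollowBench while preserving general capabilities.

Insight: Combining the data flywheel with a diagnostic mechanism can effectively address reward hacking and improve the model's ability to handle complex instructions.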

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves instruction following capabilities of large language models (LLMs), but suffers from training inefficiency due to inadequate difficulty assessment. Moreover, RLVR is prone to over-optimization, where LLMs exploit verification shortcuts without aligning to the actual intent of user instructions. We introduce Instruction Following Decorator (IFDecorator), a framework that wraps RLVR training into a robust and sample-efficient pipeline. It consists of three components: (1) a cooperative-adversarial data flywheel that co-evolves instructions and hybrid verifications, generating progressively more challenging instruction-verification pairs; (2) IntentCheck, a bypass module enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that detects reward hacking via trap instructions, which trigger and capture shortcut exploitation behaviors. Our Qwen2.5-32B-Instruct-IFDecorator achieves 87.43% accuracy on IFEval, outperforming larger proprietary models such as GPT-4o. Additionally, we demonstrate substantial improvements on FollowBench while preserving general capabilities. Our trip wires show significant reductions in reward hacking rates. We will release models, code, and data for future research.

[137] Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management cs.CL | cs.AI | cs.LGPDF

Mo Li, L. H. Xu, Qitai Tan, Ting Cao, Yunxin Liu

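TL;DR: The paper proposes Sculptor, a framework that equips large language models (LLMs) with Active Context Management (ACM) tools to strengthen their cognitive abilities and address proactive interference in long-context processing.

Details

Motivation: When LLMs process long contexts, irrelevant information from earlier in the context disrupts reasoning and memory recall, degrading performance. Current research mostly focuses on external memory systems and lacks methods for actively managing internal working memory.

Result: Experiments show that Sculptor significantly improves LLM performance on information-sparse tasks and effectively mitigates proactive interference, demonstrating the importance of explicit context-control strategies.

Insight: The key to long-context processing lies not only in larger token windows but also in explicit context management strategies. Sculptor provides a cognitive foundation for more reliable LLM reasoning.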

Abstract: Large Language Models (LLMs) suffer from significant performance degradation when processing long contexts due to proactive interference, where irrelevant information in earlier parts of the context disrupts reasoning and memory recall. While most research focuses on external memory systems to augment LLMs’ capabilities, we propose a complementary approach: empowering LLMs with Active Context Management (ACM) tools to actively sculpt their internal working memory. We introduce Sculptor, a framework that equips LLMs with three categories of tools: (1) context fragmentation, (2) summary, hide, and restore, and (3) intelligent search. Our approach enables LLMs to proactively manage their attention and working memory, analogous to how humans selectively focus on relevant information while filtering out distractions. Experimental evaluation on information-sparse benchmarks, PI-LLM (proactive interference) and NeedleBench Multi-Needle Reasoning, demonstrates that Sculptor significantly improves performance even without specific training, leveraging LLMs’ inherent tool calling generalization capabilities. By enabling Active Context Management, Sculptor not only mitigates proactive interference but also provides a cognitive foundation for more reliable reasoning across diverse long-context tasks, highlighting that explicit context-control strategies, rather than merely larger token windows, are key to robustness at scale.

[138] Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis cs.CL | cs.AIPDF

Anushka Yadav, Isha Nalawade, Srujana Pillarichety, Yashwanth Babu, Reshmi Ghosh

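TL;DR: This paper systematically studies why reasoning models fail on multi-hop question answering. It proposes a fine-grained error categorization framework based on source diversity, coverage, and cognitive efficiency, revealing complex error patterns that conventional accuracy-based evaluation overlooks.

Details

Motivation: Although reasoning models have achieved breakthroughs in math, deep search, and question answering, the reasons for their failures in multi-hop reasoning remain unclear. The paper aims to uncover the underlying causes of poor performance on multi-hop question answering.

Result: The study finds that reasoning models show significant problems with diversity, coverage, and cognitive efficiency on multi-hop tasks, and that these problems are often masked by conventional accuracy-based evaluation.

Insight: (1) In multi-hop reasoning, model failures stem not from a single cause but from a combination of issues across several dimensions; (2) existing evaluation methods need to improve to capture the limitations of reasoning models more fully.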

Abstract: The emergence of reasoning models and their integration into practical AI chat bots has led to breakthroughs in solving advanced math, deep search, and extractive question answering problems that require a complex and multi-step thought process. Yet, a complete understanding of why these models hallucinate more than general purpose language models is missing. In this investigative study, we systematically explore reasoning failures of contemporary language models on multi-hop question answering tasks. We introduce a novel, nuanced error categorization framework that examines failures across three critical dimensions: the diversity and uniqueness of source documents involved (“hops”), completeness in capturing relevant information (“coverage”), and cognitive inefficiency (“overthinking”). Through rigorous human annotation, supported by complementary automated metrics, our exploration uncovers intricate error patterns often hidden by accuracy-centric evaluations. This investigative approach provides deeper insights into the cognitive limitations of current models and offers actionable guidance toward enhancing reasoning fidelity, transparency, and robustness in future language modeling efforts.

cs.SI [Back]

[139] Graph Representation Learning with Massive Unlabeled Data for Rumor Detection cs.SI | cs.CLPDF

Chaoqun Cui, Caiyan Jia

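TL;DR: The paper proposes using large-scale unlabeled data to improve graph representation learning models for rumor detection, and verifies the effectiveness of three graph self-supervised learning methods on the task.

Details

Motivation: Existing rumor detection methods depend on labeled data and generalize poorly to new events, so the paper studies how to use large-scale unlabeled data to improve model performance.

Result: The proposed approach outperforms previous rumor detection methods under few-shot conditions and shows better generalization ability.

Insight: Large-scale unlabeled data can effectively improve model performance on time-critical tasks such as rumor detection.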

Abstract: With the development of social media, rumors spread quickly and cause great harm to society and the economy. Consequently, many effective rumor detection methods have been developed, among which methods based on learning the rumor propagation structure are particularly effective. However, existing methods still suffer from many issues, including the difficulty of obtaining large-scale labeled rumor datasets. This leads to low generalization ability and performance degradation on new events, since rumors are time-critical and usually appear with hot topics or newly emergent events. To address these problems, we use large-scale unlabeled topic datasets with claim propagation structure, crawled from the social media platforms Weibo and Twitter, to improve the semantic learning ability of a graph representation learning model on various topics. We use three typical graph self-supervised methods, InfoGraph, JOAO and GraphMAE, in two commonly used training strategies to verify the performance of general graph semi-supervised methods in rumor detection tasks. In addition, to alleviate the time and topic differences between unlabeled topic data and rumor data, we also collected a rumor dataset covering a variety of topics over a decade (the ten years preceding 2022) from the Weibo rumor-refuting platform. Our experiments show that these general graph self-supervised learning methods outperform previous methods specifically designed for rumor detection tasks and achieve good performance under few-shot conditions, demonstrating better generalization ability with the help of our massive unlabeled topic dataset.

cs.NE [Back]

[140] TDSNNs: Competitive Topographic Deep Spiking Neural Networks for Visual Cortex Modeling cs.NE | cs.CVPDF

Deming Zhou, Yuetong Fang, Zhaorui Wang, Renjing Xu

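TL;DR: This paper proposes TDSNNs (competitive topographic deep spiking neural networks) to model the topographic structure of the primate visual cortex, using a novel Spatio-Temporal Constraints (STC) loss function to obtain hierarchical feature representations from low level to high level. TDSNNs maintain performance (no drop in ImageNet top-1 accuracy), outperform existing topographic artificial neural networks (ANNs), and offer greater biological fidelity and robustness.

Details

Motivation: The topographic organization of the primate visual cortex is believed to be key to efficient neural processing, but conventional deep ANN models neglect temporal dynamics, which leads to performance degradation and limited biological fidelity.

Result: TDSNNs show no drop in ImageNet top-1 accuracy (compared with a 3% drop for TopoNet) and exhibit greater biological fidelity and robustness.

Insight: (1) Spatio-temporal dynamics are key to modeling biological neural networks. (2) Topographic organization improves model efficiency and stability through the spike mechanism. (3) TDSNNs offer new ideas both for interpreting neuroscience phenomena and for designing efficient deep learning models.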

Abstract: The primate visual cortex exhibits topographic organization, where functionally similar neurons are spatially clustered, a structure widely believed to enhance neural processing efficiency. While prior works have demonstrated that conventional deep ANNs can develop topographic representations, these models largely neglect crucial temporal dynamics. This oversight often leads to significant performance degradation in tasks like object recognition and compromises their biological fidelity. To address this, we leverage spiking neural networks (SNNs), which inherently capture spike-based temporal dynamics and offer enhanced biological plausibility. We propose a novel Spatio-Temporal Constraints (STC) loss function for topographic deep spiking neural networks (TDSNNs), successfully replicating the hierarchical spatial functional organization observed in the primate visual cortex from low-level sensory input to high-level abstract representations. Our results show that STC effectively generates representative topographic features across simulated visual cortical areas. While introducing topography typically leads to significant performance degradation in ANNs, our spiking architecture exhibits a remarkably small performance drop (No drop in ImageNet top-1 accuracy, compared to a 3% drop observed in TopoNet, which is the best-performing topographic ANN so far) and outperforms topographic ANNs in brain-likeness. We also reveal that topographic organization facilitates efficient and stable temporal information processing via the spike mechanism in TDSNNs, contributing to model robustness. These findings suggest that TDSNNs offer a compelling balance between computational performance and brain-like features, providing not only a framework for interpreting neural science phenomena but also novel insights for designing more efficient and robust deep learning models.

physics.med-ph [Back]

[141] Fast Magnetic Resonance Simulation Using Combined Update with Grouped Isochromats physics.med-ph | cs.CV | eess.IV

Hidenori Takeshima

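TL;DR: Proposes a new method that accelerates magnetic resonance simulation by grouping isochromats, substantially reducing computation time.

Details

Motivation: Conventional MR simulators assume that every isochromat must be simulated individually, which leads to long computation times. This work aims to improve efficiency by grouping isochromats so that part of the computation is shared.

Result: The new method simulates 3 to 72 times faster than the conventional method. For example, with 27.5 million isochromats, simulation times for fast spin echo (FSE) and echo-planar imaging (EPI) sequences fell from 208.4 s and 66.4 s to 38.1 s and 7.1 s, respectively.

Insight: Grouping isochromats sensibly and sharing computation can substantially reduce computational cost while preserving simulation accuracy.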

Abstract: This work aims to overcome an assumption of conventional MR simulators: individual isochromats should be simulated individually. To reduce the computational times of MR simulation, a new simulation method using grouped isochromats is proposed. When multiple isochromats are grouped before simulations, some parts of the simulation can be shared in each group. For a certain gradient type, the isochromats in the group can be easily chosen for ensuring that they behave the same. For example, the group can be defined as the isochromats whose locations along the x-axis, T1, T2 and magnetic field inhomogeneity values are the same. In such groups, simulations can be combined when a pulse sequence with the magnetic field gradient along the x-axis only is processed. The processing times of the conventional and proposed methods were evaluated with several sequences including fast spin echo (FSE) and echo-planar imaging (EPI) sequences. The simulation times of the proposed method were 3 to 72 times faster than those of the conventional methods. In the cases of 27.5 million isochromats using single instruction multiple data (SIMD) instructions and multi-threading, the conventional method simulated FSE and EPI sequences in 208.4 and 66.4 seconds, respectively. In the same cases, the proposed method simulated these sequences in 38.1 and 7.1 seconds, respectively.
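The grouping idea in this abstract can be illustrated with a minimal sketch. It is not the author's simulator: the dictionary keys (`x`, `t1`, `t2`, `db0`) and the `simulate_one` callback are hypothetical placeholders, and the sketch shares the entire per-group computation, whereas the abstract only says that some parts of the simulation can be shared. It assumes a sequence whose gradient acts along x only, so isochromats with the same key behave identically.

```python
from collections import defaultdict


def group_isochromats(isochromats):
    """Map (x, T1, T2, field inhomogeneity) -> indices of matching isochromats."""
    groups = defaultdict(list)
    for idx, iso in enumerate(isochromats):
        # Hypothetical field names; y and z are ignored because the
        # assumed sequence applies gradients along x only.
        key = (iso["x"], iso["t1"], iso["t2"], iso["db0"])
        groups[key].append(idx)
    return groups


def simulate_grouped(isochromats, simulate_one):
    """Run simulate_one once per group and copy the result to every member."""
    results = [None] * len(isochromats)
    for key, members in group_isochromats(isochromats).items():
        signal = simulate_one(*key)  # shared computation for the whole group
        for idx in members:
            results[idx] = signal
    return results
```

With this layout, the number of calls to `simulate_one` equals the number of distinct keys rather than the number of isochromats, which is where any saving would come from.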

cs.GR [Back]

[142] RLGS: Reinforcement Learning-Based Adaptive Hyperparameter Tuning for Gaussian Splatting cs.GR | cs.CV | cs.LG

Zhan Li, Huangying Zhan, Changyang Li, Qingan Yan, Yi Xu

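TL;DR: The paper proposes RLGS, a reinforcement learning-based adaptive hyperparameter tuning framework that improves the training of 3D Gaussian Splatting (3DGS) and noticeably raises rendering quality.

Details

Motivation: Hyperparameter tuning for 3DGS usually relies on manual effort and expert experience, which is inefficient and gives inconsistent results, so an automated approach is needed.

Result: Experiments show that RLGS improves the PSNR of Taming-3DGS by 0.7 dB on the Tanks and Temple dataset and continues to yield gains when baseline performance saturates.

Insight: RLGS offers a general, automated solution for 3DGS hyperparameter tuning and demonstrates the potential of reinforcement learning in 3DGS.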

Abstract: Hyperparameter tuning in 3D Gaussian Splatting (3DGS) is a labor-intensive and expert-driven process, often resulting in inconsistent reconstructions and suboptimal results. We propose RLGS, a plug-and-play reinforcement learning framework for adaptive hyperparameter tuning in 3DGS through lightweight policy modules, dynamically adjusting critical hyperparameters such as learning rates and densification thresholds. The framework is model-agnostic and seamlessly integrates into existing 3DGS pipelines without architectural modifications. We demonstrate its generalization ability across multiple state-of-the-art 3DGS variants, including Taming-3DGS and 3DGS-MCMC, and validate its robustness across diverse datasets. RLGS consistently enhances rendering quality. For example, it improves Taming-3DGS by 0.7dB PSNR on the Tanks and Temple (TNT) dataset, under a fixed Gaussian budget, and continues to yield gains even when baseline performance saturates. Our results suggest that RLGS provides an effective and general solution for automating hyperparameter tuning in 3DGS training, bridging a gap in applying reinforcement learning to 3DGS.

[143] MienCap: Realtime Performance-Based Facial Animation with Live Mood Dynamics cs.GR | cs.CV | I.3.2; I.4.10

Ye Pan, Ruisi Zhang, Jingying Wang, Nengfu Chen, Yilin Qiu

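TL;DR: The paper presents MienCap, a system that combines traditional blendshape techniques with machine learning models to drive facial animation, in both real-time and non-real-time settings, in a geometrically consistent and perceptually valid way.

Details

Motivation: Traditional facial animation methods fall short in geometric consistency and perceptual validity. The paper aims to improve the quality of performance-based facial animation so that it is more believable and perceptually accurate.

Result: In a comparison with the commercial product Faceware, the system scored significantly higher on ratings of expression recognition, intensity, and attractiveness.

Insight: Refining traditional animation techniques with machine learning can markedly improve the perceptual quality and efficiency of animation, and the approach fits into animation production pipelines.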

Abstract: Our purpose is to improve performance-based animation which can drive believable 3D stylized characters that are truly perceptual. By combining traditional blendshape animation techniques with multiple machine learning models, we present both non-real time and real time solutions which drive character expressions in a geometrically consistent and perceptually valid way. For the non-real time system, we propose a 3D emotion transfer network that makes use of a 2D human image to generate stylized 3D rig parameters. For the real time system, we propose a blendshape adaption network which generates the character rig parameter motions with geometric consistency and temporal stability. We demonstrate the effectiveness of our system by comparing it to a commercial product, Faceware. Results reveal that ratings of the recognition, intensity, and attractiveness of expressions depicted for animated characters via our systems are statistically higher than those for Faceware. Our results may be implemented into the animation pipeline, and provide animators with a system for creating the expressions they wish to use more quickly and accurately.

cs.LG [Back]

[144] CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning cs.LG | cs.AI | cs.CL | cs.CV

Wenjie Li, Yujie Zhang, Haoran Sun, Yueqi Li, Fanrui Zhang

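TL;DR: This paper proposes CX-Mind, a multimodal large language model that performs interleaved reasoning on chest X-rays via curriculum-guided reinforcement learning, significantly improving diagnostic efficiency and interpretability.

Details

Motivation: Existing multimodal models for chest X-ray diagnosis mainly rely on one-time diagnosis and lack verifiable supervision of the reasoning process, which leads to lengthy reasoning, sparse rewards, and hallucinations in multi-task diagnosis.

Result: Experiments show that CX-Mind significantly outperforms existing models in visual understanding, text generation, and spatiotemporal alignment, with an average performance improvement of 25.1% over comparable CXR-specific models. On a real-world clinical dataset, its mean recall@1 across 14 diseases substantially surpasses the second-best result.

Insight: Combining curriculum-guided reinforcement learning with verifiable process rewards can significantly improve the reasoning ability and practical usefulness of multimodal models in medical image diagnosis.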

Abstract: Chest X-ray (CXR) imaging is one of the most widely used diagnostic modalities in clinical practice, encompassing a broad spectrum of diagnostic tasks. Recent advancements have seen the extensive application of reasoning-based multimodal large language models (MLLMs) in medical imaging to enhance diagnostic efficiency and interpretability. However, existing multimodal models predominantly rely on “one-time” diagnostic approaches, lacking verifiable supervision of the reasoning process. This leads to challenges in multi-task CXR diagnosis, including lengthy reasoning, sparse rewards, and frequent hallucinations. To address these issues, we propose CX-Mind, the first generative model to achieve interleaved “think-answer” reasoning for CXR tasks, driven by curriculum-based reinforcement learning and verifiable process rewards (CuRL-VPR). Specifically, we constructed an instruction-tuning dataset, CX-Set, comprising 708,473 images and 2,619,148 samples, and generated 42,828 high-quality interleaved reasoning data points supervised by clinical reports. Optimization was conducted in two stages under the Group Relative Policy Optimization framework: initially stabilizing basic reasoning with closed-domain tasks, followed by transfer to open-domain diagnostics, incorporating rule-based conditional process rewards to bypass the need for pretrained reward models. Extensive experimental results demonstrate that CX-Mind significantly outperforms existing medical and general-domain MLLMs in visual understanding, text generation, and spatiotemporal alignment, achieving an average performance improvement of 25.1% over comparable CXR-specific models. On a real-world clinical dataset (Rui-CXR), CX-Mind achieves a mean recall@1 across 14 diseases that substantially surpasses the second-best results, with multi-center expert evaluations further confirming its clinical utility across multiple dimensions.

[145] GTPO: Trajectory-Based Policy Optimization in Large Language Models cs.LG | cs.AI | cs.CL

Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino

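TL;DR: The paper proposes GTPO (Group-relative Trajectory-based Policy Optimization), which addresses two main problems in GRPO, conflicting gradient updates and flattening of the output distribution, by skipping negative updates on conflict tokens and filtering out high-entropy completions, improving training stability and performance.

Details

Motivation: GRPO has two main problems when used to train and align language models: (1) some tokens appear frequently in completions with both positive and negative rewards, producing conflicting gradient updates; (2) negatively rewarded completions may penalize confident outputs and push the model toward low-probability tokens. Together these flatten the output distribution and degrade learning.

Result: Experiments on the GSM8K, MATH, and AIME 2024 benchmarks show that GTPO achieves greater training stability and improved performance.

Insight: By avoiding conflicting gradient updates and policy collapse, GTPO provides a more stable and effective policy optimization method for language model training, while also simplifying the training process.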

Abstract: Policy-based optimizations are widely adopted today for the training and alignment of language models, where one of the most recent and effective approaches is Group-relative Policy Optimization (GRPO). In this paper, we reveal and analyze two major limitations of GRPO: (i) tokens frequently appear in completions with both positive and negative rewards, leading to conflicting gradient updates that can reduce their output probability, even though they can be essential for maintaining proper structure; (ii) negatively rewarded completions may penalize confident responses and shift model decisions toward unlikely tokens, progressively flattening the output distribution and degrading learning. To address these issues and provide a more stable and effective policy optimization strategy, we introduce GTPO (Group-relative Trajectory-based Policy Optimization), which identifies conflict tokens (tokens appearing in the same position across completions with opposite rewards) and protects them by skipping negative updates while amplifying positive ones. To further prevent policy collapse, GTPO filters out completions whose entropy exceeds a provable threshold. Unlike GRPO, GTPO does not rely on KL-divergence regularization, eliminating the need for a reference model during training, while still ensuring greater training stability and improved performance, validated through multiple experiments on GSM8K, MATH and AIME 2024 benchmarks.
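The conflict-token rule described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: scalar rewards stand in for per-completion advantages, the `boost` factor is a hypothetical stand-in for "amplifying positive ones", and the function names are invented for the example.

```python
def find_conflict_tokens(completions, rewards):
    """Return (position, token) pairs that occur in both positively and
    negatively rewarded completions."""
    seen_pos, seen_neg = set(), set()
    for tokens, reward in zip(completions, rewards):
        if reward > 0:
            target = seen_pos
        elif reward < 0:
            target = seen_neg
        else:
            continue  # zero-reward completions carry no sign
        for position, token in enumerate(tokens):
            target.add((position, token))
    return seen_pos & seen_neg


def token_update_weights(completions, rewards, boost=2.0):
    """Per-token update weights: conflict tokens skip negative updates and
    get boosted positive ones (boost is a hypothetical factor)."""
    conflicts = find_conflict_tokens(completions, rewards)
    weights = []
    for tokens, reward in zip(completions, rewards):
        row = []
        for position, token in enumerate(tokens):
            if (position, token) in conflicts:
                row.append(0.0 if reward < 0 else boost * reward)
            else:
                row.append(float(reward))
        weights.append(row)
    return weights
```

In this toy form, structural tokens shared by good and bad completions (for example an opening tag at position 0) are never pushed down by the negatively rewarded completion.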

[146] COPO: Consistency-Aware Policy Optimization cs.LG | cs.AI | cs.CL

Jinghang Han, Jiawei Chen, Hang Shao, Hao Ma, Mingcheng Li

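TL;DR: COPO proposes a consistency-aware policy optimization framework that uses a global reward and an entropy-based blending mechanism to address vanishing gradients when applying RL to LLMs, significantly improving performance on mathematical reasoning tasks.

Details

Motivation: With conventional rule-based rewards, gradients vanish when multiple sampled responses reach the same outcome, which limits learning efficiency.

Result: Performance improves substantially on multiple mathematical reasoning benchmarks.

Insight: A global consistency reward provides a meaningful learning signal and encourages correct reasoning paths even when responses are highly consistent.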

Abstract: Reinforcement learning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) in complex problem-solving tasks. Recently, the introduction of DeepSeek R1 has inspired a surge of interest in leveraging rule-based rewards as a low-cost alternative for computing advantage functions and guiding policy optimization. However, a common challenge observed across many replication and extension efforts is that when multiple sampled responses under a single prompt converge to identical outcomes, whether correct or incorrect, the group-based advantage degenerates to zero. This leads to vanishing gradients and renders the corresponding samples ineffective for learning, ultimately limiting training efficiency and downstream performance. To address this issue, we propose a consistency-aware policy optimization framework that introduces a structured global reward based on outcome consistency. The global loss based on it ensures that, even when model outputs show high intra-group consistency, the training process still receives meaningful learning signals, which encourages the generation of correct and self-consistent reasoning paths from a global perspective. Furthermore, we incorporate an entropy-based soft blending mechanism that adaptively balances local advantage estimation with global optimization, enabling dynamic transitions between exploration and convergence throughout training. Our method introduces several key innovations in both reward design and optimization strategy. We validate its effectiveness through substantial performance gains on multiple mathematical reasoning benchmarks, highlighting the proposed framework’s robustness and general applicability. Code of this work has been released at https://github.com/hijih/copo-code.git.
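The degenerate case this abstract describes is easy to reproduce with a small sketch. It assumes the common group-relative normalization (reward minus group mean, divided by group standard deviation); the use of the population standard deviation and the `eps` term are assumptions for illustration and not details taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages for the responses sampled from one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every response reaches the same outcome: all advantages are exactly zero,
# so these samples contribute no gradient.
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])

# Mixed outcomes: correct responses get positive advantages, incorrect ones
# negative, so the group carries a learning signal.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])
```

The same collapse happens when every response is wrong, which is why the abstract notes the problem arises "whether correct or incorrect".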

[147] Causal Reflection with Language Models cs.LG | cs.CL

Abi Aryan, Zac Liu

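TL;DR: This paper proposes Causal Reflection, a framework that addresses the weak causal reasoning of large language models (LLMs) and reinforcement learning (RL) agents by explicitly modeling causality and introducing a Reflect mechanism.

Details

Motivation: Current LLMs and RL agents perform poorly at causal reasoning, relying on spurious correlations and brittle patterns without a deep understanding of cause and effect. The paper aims to design a framework in which agents can adapt, self-correct, and explain their causal reasoning.

Result: The framework lays the theoretical groundwork for causally reflective agents that can adapt, self-correct, and explain their causal reasoning in evolving environments.

Insight: Formally modeling causality and combining it with the explanatory abilities of LLMs offers a new approach to causal reasoning in complex environments.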

Abstract: While LLMs exhibit impressive fluency and factual recall, they struggle with robust causal reasoning, often relying on spurious correlations and brittle patterns. Similarly, traditional Reinforcement Learning agents also lack causal understanding, optimizing for rewards without modeling why actions lead to outcomes. We introduce Causal Reflection, a framework that explicitly models causality as a dynamic function over state, action, time, and perturbation, enabling agents to reason about delayed and nonlinear effects. Additionally, we define a formal Reflect mechanism that identifies mismatches between predicted and observed outcomes and generates causal hypotheses to revise the agent’s internal model. In this architecture, LLMs serve not as black-box reasoners, but as structured inference engines translating formal causal outputs into natural language explanations and counterfactuals. Our framework lays the theoretical groundwork for Causal Reflective agents that can adapt, self-correct, and communicate causal understanding in evolving environments.

[148] GlaBoost: A multimodal Structured Framework for Glaucoma Risk Stratification cs.LG | cs.CE | cs.CV | eess.IV

Cheng Huang, Weizheng Xie, Karanjit Kooner, Tsengdar Lee, Jui-Kai Wang

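TL;DR: GlaBoost is a multimodal framework for glaucoma risk stratification that combines clinical features, fundus images, and textual descriptions and achieves high-accuracy prediction with an enhanced XGBoost model.

Details

Motivation: Existing glaucoma detection methods are mostly unimodal and lack interpretability, which limits their clinical use. A multimodal, interpretable framework is needed to improve detection.

Result: On a real-world dataset, validation accuracy reaches 98.71%. Feature importance analysis shows that clinical indicators such as cup-to-disc ratio and rim pallor contribute most to model decisions.

Insight: GlaBoost provides a transparent, scalable solution for glaucoma diagnosis and can be extended to other ophthalmic disorders.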

Abstract: Early and accurate detection of glaucoma is critical to prevent irreversible vision loss. However, existing methods often rely on unimodal data and lack interpretability, limiting their clinical utility. In this paper, we present GlaBoost, a multimodal gradient boosting framework that integrates structured clinical features, fundus image embeddings, and expert-curated textual descriptions for glaucoma risk prediction. GlaBoost extracts high-level visual representations from retinal fundus photographs using a pretrained convolutional encoder and encodes free-text neuroretinal rim assessments using a transformer-based language model. These heterogeneous signals, combined with manually assessed risk scores and quantitative ophthalmic indicators, are fused into a unified feature space for classification via an enhanced XGBoost model. Experiments conducted on a real-world annotated dataset demonstrate that GlaBoost significantly outperforms baseline models, achieving a validation accuracy of 98.71%. Feature importance analysis reveals clinically consistent patterns, with cup-to-disc ratio, rim pallor, and specific textual embeddings contributing most to model decisions. GlaBoost offers a transparent and scalable solution for interpretable glaucoma diagnosis and can be extended to other ophthalmic disorders.

[149] LRTuckerRep: Low-rank Tucker Representation Model for Multi-dimensional Data Completion cs.LG | cs.CV | cs.NA | math.NA

Wenwu Gong, Lili Yang

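TL;DR: This paper proposes a Low-Rank Tucker Representation model (LRTuckerRep) for multi-dimensional data completion. It models global low-rankness and local smoothness together and is solved efficiently by two iterative algorithms; experiments show that it outperforms baselines under high missing rates.

Details

Motivation: Multi-dimensional data completion is important in computational science, but existing approaches (low-rank approximation or smoothness regularization) each have limitations. Low-rank methods are computationally expensive and may disrupt intrinsic data structure, while smoothness-based methods require manual parameter tuning and generalize poorly. The authors therefore propose LRTuckerRep to unify global and local prior modeling.

Result: Experiments show that on image inpainting and traffic data imputation, LRTuckerRep achieves higher completion accuracy and robustness than baselines, especially under high missing rates.

Insight: The novelty of LRTuckerRep lies in unifying low-rank and smoothness modeling while avoiding manual parameter tuning, which demonstrates the value of combining global and local priors in multi-dimensional data completion.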

Abstract: Multi-dimensional data completion is a critical problem in computational sciences, particularly in domains such as computer vision, signal processing, and scientific computing. Existing methods typically leverage either global low-rank approximations or local smoothness regularization, but each suffers from notable limitations: low-rank methods are computationally expensive and may disrupt intrinsic data structures, while smoothness-based approaches often require extensive manual parameter tuning and exhibit poor generalization. In this paper, we propose a novel Low-Rank Tucker Representation (LRTuckerRep) model that unifies global and local prior modeling within a Tucker decomposition. Specifically, LRTuckerRep encodes low rankness through a self-adaptive weighted nuclear norm on the factor matrices and a sparse Tucker core, while capturing smoothness via a parameter-free Laplacian-based regularization on the factor spaces. To efficiently solve the resulting nonconvex optimization problem, we develop two iterative algorithms with provable convergence guarantees. Extensive experiments on multi-dimensional image inpainting and traffic data imputation demonstrate that LRTuckerRep achieves superior completion accuracy and robustness under high missing rates compared to baselines.

cs.RO [Back]

[150] RoboTron-Sim: Improving Real-World Driving via Simulated Hard-Case cs.RO | cs.CV

Baihui Xiao, Chengjian Feng, Zhijian Huang, Feng yan, Yujie Zhong

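TL;DR: RoboTron-Sim uses simulated hard cases (such as high-risk edge cases) to improve the real-world driving performance of autonomous driving systems, addressing the shortage of real-world data with the HASS dataset, Scenario-aware Prompt Engineering, and related components.

Details

Motivation: Rare high-risk scenarios and complex interactions are hard to collect in the real world, so existing autonomous driving systems perform poorly in these critical situations. A solution based on simulated data is therefore needed.

Result: Experiments on nuScenes show that RoboTron-Sim improves driving performance in challenging scenarios by around 50% and achieves state-of-the-art results in open-loop planning.

Insight: Simulated data can substantially compensate for scarce real-world data, especially for rare high-risk scenarios; well-designed prompts and an improved encoder help bridge the gap between simulation and reality.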

Abstract: Collecting real-world data for rare high-risk scenarios, long-tailed driving events, and complex interactions remains challenging, leading to poor performance of existing autonomous driving systems in these critical situations. In this paper, we propose RoboTron-Sim that improves real-world driving in critical situations by utilizing simulated hard cases. First, we develop a simulated dataset called Hard-case Augmented Synthetic Scenarios (HASS), which covers 13 high-risk edge-case categories, as well as balanced environmental conditions such as day/night and sunny/rainy. Second, we introduce Scenario-aware Prompt Engineering (SPE) and an Image-to-Ego Encoder (I2E Encoder) to enable multimodal large language models to effectively learn real-world challenging driving skills from HASS, via adapting to environmental deviations and hardware differences between real-world and simulated scenarios. Extensive experiments on nuScenes show that RoboTron-Sim improves driving performance in challenging scenarios by around 50%, achieving state-of-the-art results in real-world open-loop planning. Qualitative results further demonstrate the effectiveness of RoboTron-Sim in better managing rare high-risk driving scenarios. Project page: https://stars79689.github.io/RoboTron-Sim/

cs.NI [Back]

[151] CONVERGE: A Multi-Agent Vision-Radio Architecture for xApps cs.NI | cs.CV

Filipe B. Teixeira, Carolina Simões, Paulo Fidalgo, Wagner Pedrosa, André Coelho

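TL;DR: The paper proposes CONVERGE, a multi-agent vision-radio architecture that delivers radio and video sensing information to xApps in real time, enabling Integrated Sensing and Communications.

Details

Motivation: High-frequency wireless links usually require line-of-sight. Visual data can help predict channel dynamics and overcome obstacles through beamforming or handover, yet telecommunications and computer vision research have largely developed independently.

Result: Experiments show that the delay of sensing information stays under 1 ms and that an xApp can use radio and video sensing information to control the 5G/6G RAN in real time.

Insight: Combining visual and radio sensing can improve real-time control of 5G/6G networks and offers a new approach to Integrated Sensing and Communications.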

Abstract: Telecommunications and computer vision have evolved independently. With the emergence of high-frequency wireless links operating mostly in line-of-sight, visual data can help predict the channel dynamics by detecting obstacles and help overcome them through beamforming or handover techniques. This paper proposes a novel architecture for delivering real-time radio and video sensing information to O-RAN xApps through a multi-agent approach, and introduces a new video function capable of generating blockage information for xApps, enabling Integrated Sensing and Communications. Experimental results show that the delay of sensing information remains under 1 ms and that an xApp can successfully use radio and video sensing information to control the 5G/6G RAN in real-time.

cs.CR [Back]

[152] ASTRA: Autonomous Spatial-Temporal Red-teaming for AI Software Assistants cs.CR | cs.CL | cs.LG | cs.SE

Xiangzhe Xu, Guangyu Shen, Zian Su, Siyuan Cheng, Hanxi Guo

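TL;DR: ASTRA is an automated red-teaming system that uncovers safety flaws in AI coding assistants, generating high-quality test cases through spatial-temporal exploration to improve model alignment.

Details

Motivation: Current red-teaming tools rely on fixed benchmarks or unrealistic prompts and miss many real-world vulnerabilities; ASTRA aims to fill this gap.

Result: Across two major evaluation domains, ASTRA finds 11-66% more issues than existing techniques and yields 17% more effective alignment training.

Insight: Combining offline domain modeling with online knowledge graph adaptation surfaces corner-case vulnerabilities from realistic development requests more efficiently.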

Abstract: AI coding assistants like GitHub Copilot are rapidly transforming software development, but their safety remains deeply uncertain, especially in high-stakes domains like cybersecurity. Current red-teaming tools often rely on fixed benchmarks or unrealistic prompts, missing many real-world vulnerabilities. We present ASTRA, an automated agent system designed to systematically uncover safety flaws in AI-driven code generation and security guidance systems. ASTRA works in three stages: (1) it builds structured domain-specific knowledge graphs that model complex software tasks and known weaknesses; (2) it performs online vulnerability exploration of each target model by adaptively probing both its input space, i.e., the spatial exploration, and its reasoning processes, i.e., the temporal exploration, guided by the knowledge graphs; and (3) it generates high-quality violation-inducing cases to improve model alignment. Unlike prior methods, ASTRA focuses on realistic inputs (requests that developers might actually ask) and uses both offline abstraction guided domain modeling and online domain knowledge graph adaptation to surface corner-case vulnerabilities. Across two major evaluation domains, ASTRA finds 11-66% more issues than existing techniques and produces test cases that lead to 17% more effective alignment training, showing its practical value for building safer AI systems.

[153] PLA: Prompt Learning Attack against Text-to-Image Generative Models cs.CR | cs.AI | cs.CV

Xinqi Lyu, Yihao Liu, Yanjie Li, Bin Xiao

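TL;DR: The paper proposes PLA, a black-box adversarial attack on text-to-image generative models that uses multimodal similarities to enable gradient-based training and effectively bypasses safety mechanisms.

Details

Motivation: Text-to-image (T2I) models can be misused to generate unsafe (NSFW) content, and current safety mechanisms still have weaknesses. The study explores how adversarial attacks can bypass these mechanisms in a black-box setting.

Result: Experiments show that PLA attacks the safety mechanisms of black-box T2I models, including prompt filters and post-hoc safety checkers, with a high success rate and outperforms existing methods.

Insight: Black-box attacks can achieve efficient gradient-based optimization through multimodal similarities, revealing potential vulnerabilities in T2I safety mechanisms.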

Abstract: Text-to-Image (T2I) models have gained widespread adoption across various applications. Despite the success, the potential misuse of T2I models poses significant risks of generating Not-Safe-For-Work (NSFW) content. To investigate the vulnerability of T2I models, this paper delves into adversarial attacks to bypass the safety mechanisms under black-box settings. Most previous methods rely on word substitution to search adversarial prompts. Due to limited search space, this leads to suboptimal performance compared to gradient-based training. However, black-box settings present unique challenges to training gradient-driven attack methods, since there is no access to the internal architecture and parameters of T2I models. To facilitate the learning of adversarial prompts in black-box settings, we propose a novel prompt learning attack framework (PLA), where insightful gradient-based training tailored to black-box T2I models is designed by utilizing multimodal similarities. Experiments show that our new method can effectively attack the safety mechanisms of black-box T2I models including prompt filters and post-hoc safety checkers with a high success rate compared to state-of-the-art methods. Warning: This paper may contain offensive model-generated content.

eess.IV [Back]

[154] Technical specification of a framework for the collection of clinical images and data eess.IV | cs.CV | H.2.8; J.3

Alistair Mackenzie, Mark Halling-Brown, Ruben van Engen, Carlijn Roozemond, Lucy Warren

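TL;DR: The report describes a framework for collecting clinical images and data to train and validate artificial intelligence tools. It emphasizes automated collection to keep data up to date and representative, and discusses ethics, data governance, and sharing infrastructure.

Details

Motivation: Training and validating AI tools requires high-quality, up-to-date, and representative clinical data, which traditional methods may not provide. An automated and sustainable data collection framework is therefore needed.

Result: The framework can produce datasets that mix older cases with long-term follow-up and current data, supporting reliable training and validation of AI tools.

Insight: Automated data collection is essential for ensuring that validation data matches what an AI tool would see in deployment, while ethics and governance must also be addressed.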

Abstract: In this report a framework for the collection of clinical images and data for use when training and validating artificial intelligence (AI) tools is described. The report contains not only information about the collection of the images and clinical data, but also the ethics and information governance processes to consider so that the data is collected safely, and the infrastructure and agreements required to allow for the sharing of data with other groups. A key characteristic of the main collection framework described here is that it can enable automated and ongoing collection of datasets to ensure that the data is up-to-date and representative of current practice. This is important in the context of training and validating AI tools, as it is vital that datasets contain both current data and older cases with long-term follow-up, so that the clinical outcome is as accurate as possible. Validations run on old data will provide findings and conclusions relative to the status of the imaging units when that data was generated. It is important that a validation dataset can assess the AI tools with data that it would see if deployed and active now. Other types of collection frameworks, which do not follow a fully automated approach, are also described. Whilst the fully automated method is recommended for large scale, long-term image collection, there may be reasons to start data collection using semi-automated methods and indications of how to do that are provided.

[155] A Survey of Multimodal Ophthalmic Diagnostics: From Task-Specific Approaches to Foundational Models eess.IV | cs.AI | cs.CV

Xiaoling Luo, Ruli Zheng, Qiaojian Zheng, Zibo Du, Shuo Yang

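TL;DR: This survey systematically reviews recent advances in multimodal deep learning for ophthalmology up to 2025, covering task-specific methods and large-scale foundation models, discussing datasets, evaluation metrics, and methodological innovations, and outlining future research directions.

Details

Motivation: Visual impairment is a major global health challenge, and multimodal imaging provides complementary information for ophthalmic diagnosis. Addressing this challenge calls for a systematic account of how multimodal deep learning is applied and developing in ophthalmology.

Result: The survey shows the broad application of multimodal deep learning in ophthalmic diagnosis and highlights its potential to improve diagnostic accuracy and efficiency.

Insight: 1. Multimodal data fusion improves diagnostic performance. 2. Foundation models stand out in handling complex clinical tasks and in generalization. 3. Future directions include combining ultra-widefield imaging with reinforcement learning frameworks.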

Abstract: Visual impairment represents a major global health challenge, with multimodal imaging providing complementary information that is essential for accurate ophthalmic diagnosis. This comprehensive survey systematically reviews the latest advances in multimodal deep learning methods in ophthalmology up to the year 2025. The review focuses on two main categories: task-specific multimodal approaches and large-scale multimodal foundation models. Task-specific approaches are designed for particular clinical applications such as lesion detection, disease diagnosis, and image synthesis. These methods utilize a variety of imaging modalities including color fundus photography, optical coherence tomography, and angiography. On the other hand, foundation models combine sophisticated vision-language architectures and large language models pretrained on diverse ophthalmic datasets. These models enable robust cross-modal understanding, automated clinical report generation, and decision support. The survey critically examines important datasets, evaluation metrics, and methodological innovations including self-supervised learning, attention-based fusion, and contrastive alignment. It also discusses ongoing challenges such as variability in data, limited annotations, lack of interpretability, and issues with generalizability across different patient populations. Finally, the survey outlines promising future directions that emphasize the use of ultra-widefield imaging and reinforcement learning-based reasoning frameworks to create intelligent, interpretable, and clinically applicable AI systems for ophthalmology.

[156] A Modified VGG19-Based Framework for Accurate and Interpretable Real-Time Bone Fracture Detection eess.IV | cs.AI | cs.CV

Md. Ehsanul Haque, Abrar Fahim, Shamik Dey, Syoda Anamika Jahan, S. M. Jahidul Islam

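TL;DR: This paper proposes a framework based on a modified VGG19 for real-time, accurate, and interpretable bone fracture detection. By combining advanced preprocessing with explainable AI methods such as Grad-CAM, the framework performs well in both accuracy (99.78%) and clinical applicability.

Details

Motivation: Early and accurate fracture detection is critical for treatment, but X-ray analysis is time-consuming and error-prone, especially where radiology expertise is scarce. Existing deep learning methods are often hard to use clinically because of misclassifications and a lack of interpretability.

Result: The model reaches 99.78% classification accuracy and an AUC of 1.00. A real-time web application demonstrates fast responses (within 0.5 seconds).

Insight: 1. Preprocessing substantially improves model performance. 2. Interpretability methods such as Grad-CAM are essential for clinical validation. 3. Real-time deployment addresses efficiency in practical medical settings.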

Abstract: Early and accurate detection of bone fractures is paramount for starting treatment promptly and avoiding delays that affect patient outcomes. Interpreting X-ray images is a time-consuming and error-prone task, especially when radiology expertise is scarce. Additionally, current deep learning approaches typically suffer from misclassifications and lack the interpretable explanations needed for clinical use. To overcome these challenges, we propose an automated framework for bone fracture detection using a modified VGG-19 model. It incorporates preprocessing techniques, including Contrast Limited Adaptive Histogram Equalization (CLAHE), Otsu’s thresholding, and Canny edge detection, among others, to enhance image clarity and facilitate feature extraction. For interpretability, we use Grad-CAM, an Explainable AI method that generates visual heatmaps of the model’s decision-making process so that clinicians can understand it. This encourages trust and supports further clinical validation. The framework is deployed in a real-time web application, where healthcare professionals can upload X-ray images and receive diagnostic feedback within 0.5 seconds. Our modified VGG-19 model attains 99.78% classification accuracy and an AUC score of 1.00. The framework provides a reliable, fast, and interpretable solution for bone fracture detection that supports more efficient diagnosis and better patient care.
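Of the preprocessing steps named in this abstract, Otsu's thresholding is compact enough to sketch in plain Python. The version below is a generic textbook formulation over a flat list of 8-bit intensities, not the authors' pipeline; the function name and the input layout are assumptions made for the example.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0..255) that maximizes between-class variance,
    treating values <= t as one class and values > t as the other."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    weight_low = 0      # number of pixels at or below the candidate threshold
    sum_low = 0         # intensity sum of those pixels
    best_t, best_var = 0, -1.0

    for t in range(256):
        weight_low += hist[t]
        if weight_low == 0:
            continue    # no pixels in the lower class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break       # every pixel is already in the lower class
        sum_low += t * hist[t]
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        between_var = weight_low * weight_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t
```

For a bimodal image, the returned threshold falls between the two intensity clusters, which is what makes the method useful for separating bone from background before feature extraction.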

[157] Boosting Vision Semantic Density with Anatomy Normality Modeling for Medical Vision-language Pre-training eess.IV | cs.AI | cs.CV | cs.LG

Weiwei Cao, Jianpeng Zhang, Zhongyi Shui, Sinuo Wang, Zeli Chen

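TL;DR: The paper proposes boosting vision semantic density to close the semantic density gap in aligning medical images with reports, combining disease-level vision contrastive learning with anatomical normality modeling to significantly improve diagnostic performance.

Details

Motivation: The difference in signal-to-noise ratio (SNR) between medical images and diagnostic reports causes visual alignment bias that conventional methods struggle to resolve. The paper aims to narrow this gap by boosting vision semantic density.

Result: On multiple chest and abdominal CT datasets, the method achieves state-of-the-art zero-shot performance, with an average AUC of 84.9%.

Insight: Modeling the distribution of normal anatomy effectively amplifies abnormal signals and improves the model's perception and discrimination of abnormal attributes.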

Abstract: Vision-language pre-training (VLP) has great potential for developing multifunctional and general medical diagnostic capabilities. However, aligning medical images with a low signal-to-noise ratio (SNR) to reports with a high SNR presents a semantic density gap, leading to visual alignment bias. In this paper, we propose boosting vision semantic density to improve alignment effectiveness. On one hand, we enhance visual semantics through disease-level vision contrastive learning, which strengthens the model’s ability to differentiate between normal and abnormal samples for each anatomical structure. On the other hand, we introduce an anatomical normality modeling method to model the distribution of normal samples for each anatomy, leveraging VQ-VAE for reconstructing normal vision embeddings in the latent space. This process amplifies abnormal signals by leveraging distribution shifts in abnormal samples, enhancing the model’s perception and discrimination of abnormal attributes. The enhanced visual representation effectively captures the diagnostic-relevant semantics, facilitating more efficient and accurate alignment with the diagnostic report. We conduct extensive experiments on two chest CT datasets, CT-RATE and Rad-ChestCT, and an abdominal CT dataset, MedVL-CT69K, and comprehensively evaluate the diagnosis performance across multiple tasks in the chest and abdominal CT scenarios, achieving state-of-the-art zero-shot performance. Notably, our method achieved an average AUC of 84.9% across 54 diseases in 15 organs, significantly surpassing existing methods. Additionally, we demonstrate the superior transfer learning capabilities of our pre-trained model. Code is available at https://github.com/alibaba-damo-academy/ViSD-Boost.

[158] Classification non supervisées d’acquisitions hyperspectrales codées : quelles vérités terrain ? eess.IV | cs.CV | physics.data-anPDF

Trung-tin Dinh, Hervé Carfantan, Antoine Monmayrant, Simon Lacroix

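TL;DR: The paper proposes an unsupervised classification method for a DD-CASSI hyperspectral imager. Using a simple model of intra-class spectral variability, it identifies classes and estimates reference spectra despite tenfold data compression. It also points out the limitations of the ground truths commonly used to evaluate such methods and, on the Pavia University scene, shows that spectrally more coherent regions can be detected.

Details

Motivation: The work addresses how to evaluate unsupervised classification in hyperspectral imaging. Under data compression and high intra-class variability, the limitations of commonly used ground truths become the main obstacle.

Result: Experiments show that the method identifies classes under tenfold compression and detects spectrally more coherent regions, underlining the need to reconsider how unsupervised classification methods are assessed.

Insight: Conventional ground truths may not fully reflect the complexity of spectral data, so the evaluation criteria for unsupervised classification need rethinking, especially under data compression and high intra-class variability.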

Abstract: We propose an unsupervised classification method using a limited number of coded acquisitions from a DD-CASSI hyperspectral imager. Based on a simple model of intra-class spectral variability, this approach makes it possible to identify classes and estimate reference spectra, despite data compression by a factor of ten. Here, we highlight the limitations of the ground truths commonly used to evaluate this type of method: lack of a clear definition of the notion of class, high intra-class variability, and even classification errors. Using the Pavia University scene, we show that with simple assumptions, it is possible to detect regions that are spectrally more coherent, highlighting the need to rethink the evaluation of classification methods, particularly in unsupervised scenarios.

[159] FUTransUNet-GradCAM: A Hybrid Transformer-U-Net with Self-Attention and Explainable Visualizations for Foot Ulcer Segmentation eess.IV | cs.CVPDF

Akwasi Asare, Mary Sagoe, Justice Williams Asare

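TL;DR: Proposes FUTransUNet, a hybrid Transformer-U-Net architecture that combines the global attention of Vision Transformers with the localization ability of U-Net for diabetic foot ulcer segmentation, with Grad-CAM providing explainability.

Details

Motivation: Conventional convolutional neural networks such as U-Net struggle to model long-range spatial dependencies, and diabetic foot ulcer segmentation is difficult because of varied lesion appearance and complex backgrounds.

Result: On the FUSeg dataset: training Dice coefficient 0.8679 and IoU 0.7672; validation Dice coefficient 0.8751 and IoU 0.7780.

Insight: Combining global and local feature extraction can markedly improve performance on complex medical image segmentation, and model explainability is essential for clinical use.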

Abstract: Automated segmentation of diabetic foot ulcers (DFUs) plays a critical role in clinical diagnosis, therapeutic planning, and longitudinal wound monitoring. However, this task remains challenging due to the heterogeneous appearance, irregular morphology, and complex backgrounds associated with ulcer regions in clinical photographs. Traditional convolutional neural networks (CNNs), such as U-Net, provide strong localization capabilities but struggle to model long-range spatial dependencies due to their inherently limited receptive fields. To address this, we propose FUTransUNet, a hybrid architecture that integrates the global attention mechanism of Vision Transformers (ViTs) into the U-Net framework. This combination allows the model to extract global contextual features while maintaining fine-grained spatial resolution through skip connections and an effective decoding pathway. We trained and validated FUTransUNet on the public Foot Ulcer Segmentation Challenge (FUSeg) dataset. FUTransUNet achieved a training Dice Coefficient of 0.8679, an IoU of 0.7672, and a training loss of 0.0053. On the validation set, the model achieved a Dice Coefficient of 0.8751, an IoU of 0.7780, and a validation loss of 0.009045. To ensure clinical transparency, we employed Grad-CAM visualizations, which highlighted model focus areas during prediction. These quantitative outcomes clearly demonstrate that our hybrid approach successfully integrates global and local feature extraction paradigms, thereby offering a highly robust, accurate, explainable, and clinically translatable solution for automated foot ulcer analysis. The approach offers a reliable, high-fidelity solution for DFU segmentation, with implications for improving real-world wound assessment and patient care.
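The Dice coefficient and IoU reported above are overlap metrics between a predicted mask and a reference mask. As a minimal sketch of how they are usually computed (the function name `dice_and_iou` and the flat 0/1 mask representation are assumptions for illustration, not taken from the paper):

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two flat binary masks of equal length.

    Hypothetical helper for illustration; masks are sequences of 0/1 labels.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty: treat as a perfect match by convention.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


pred_mask = [1, 1, 1, 0, 0, 0]
true_mask = [1, 1, 0, 1, 0, 0]
dice, iou = dice_and_iou(pred_mask, true_mask)
```

For a single pair of masks the two metrics are tied by IoU = Dice / (2 - Dice). Scores averaged over many images need not satisfy this identity exactly.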

[160] Scaling Artificial Intelligence for Prostate Cancer Detection on MRI towards Population-Based Screening and Primary Diagnosis in a Global, Multiethnic Population (Study Protocol) eess.IV | cs.CVPDF

Anindo Saha, Joeran S. Bosma, Jasper J. Twilt, Alexander B. C. D. Ng, Aqua Asif

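TL;DR: This study protocol covers the training and external validation of PI-CAI-2B, a next-generation AI system for detecting Gleason grade group ≥2 prostate cancer on MRI, using 22,481 MRI examinations from a global, multiethnic population. The goal is to test its diagnostic interchangeability with the standard of care in screening and primary diagnosis.

Details

Motivation: Prostate cancer screening and diagnosis need efficient, accurate tools, particularly across a global, multiethnic population. Existing AI systems require further validation and improvement to suit different clinical settings and populations.

Result: No results are reported yet. The planned analysis assesses how consistently the AI system performs across imaging quality, patient age, and ethnicity, and aims to verify diagnostic interchangeability.

Insight: Validation on large-scale, multiethnic data improves the reliability and applicability of AI systems in clinical practice, and potential biases must be considered, especially for worldwide deployment.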

Abstract: In this intercontinental, confirmatory study, we include a retrospective cohort of 22,481 MRI examinations (21,288 patients; 46 cities in 22 countries) to train and externally validate the PI-CAI-2B model, i.e., an efficient, next-generation iteration of the state-of-the-art AI system that was developed for detecting Gleason grade group $\geq$2 prostate cancer on MRI during the PI-CAI study. Of these examinations, 20,471 cases (19,278 patients; 26 cities in 14 countries) from two EU Horizon projects (ProCAncer-I, COMFORT) and 12 independent centers based in Europe, North America, Asia and Africa, are used for training and internal testing. Additionally, 2,010 cases (2,010 patients; 20 external cities in 12 countries) from population-based screening (STHLM3-MRI, IP1-PROSTAGRAM trials) and primary diagnostic settings (PRIME trial) based in Europe, North and South America, Asia and Australia, are used for external testing. The primary endpoint is the proportion of AI-based assessments in agreement with the standard of care diagnoses (i.e., clinical assessments made by expert uropathologists on histopathology, if available, or at least two expert urogenital radiologists in consensus; with access to patient history and peer consultation) in the detection of Gleason grade group $\geq$2 prostate cancer within the external testing cohorts. Our statistical analysis plan is prespecified with a hypothesis of diagnostic interchangeability to the standard of care at the PI-RADS $\geq$3 (primary diagnosis) or $\geq$4 (screening) cut-off, considering an absolute margin of 0.05 and reader estimates derived from the PI-CAI observer study (62 radiologists reading 400 cases). Secondary measures comprise the area under the receiver operating characteristic curve (AUROC) of the AI system stratified by imaging quality, patient age and patient ethnicity to identify underlying biases (if any).

[161] PET2Rep: Towards Vision-Language Model-Drived Automated Radiology Report Generation for Positron Emission Tomography eess.IV | cs.CVPDF

Yichi Zhang, Wenbo Zhang, Zehui Ling, Gang Feng, Sisi Peng

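TL;DR: PET2Rep is the first benchmark dataset dedicated to radiology report generation for PET images. It fills a gap in molecular PET imaging and evaluates 30 state-of-the-art vision-language models using clinical efficiency metrics.

Details

Motivation: Vision-language models are widely applied in medicine, but existing work focuses on structural imaging modalities and overlooks the unique characteristics of molecular PET imaging. Writing PET reports by hand is time-consuming and labor-intensive, so an automated solution is needed.

Result: Experiments show that current state-of-the-art vision-language models perform poorly on PET report generation and fall far short of practical needs.

Insight: 1. The metabolic nature of PET imaging places higher demands on models. 2. Existing models have key shortcomings in medical applications (e.g., the accuracy of organ-specific descriptions) that need further work.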

Abstract: Positron emission tomography (PET) is a cornerstone of modern oncologic and neurologic imaging, distinguished by its unique ability to illuminate dynamic metabolic processes that transcend the anatomical focus of traditional imaging technologies. Radiology reports are essential for clinical decision making, yet their manual creation is labor-intensive and time-consuming. Recent advances in vision-language models (VLMs) have shown strong potential in medical applications, presenting a promising avenue for automating report generation. However, existing applications of VLMs in the medical domain have predominantly focused on structural imaging modalities, while the unique characteristics of molecular PET imaging have largely been overlooked. To bridge the gap, we introduce PET2Rep, a large-scale comprehensive benchmark for evaluation of general and medical VLMs for radiology report generation for PET images. PET2Rep stands out as the first dedicated dataset for PET report generation with metabolic information, uniquely capturing whole-body image-report pairs that cover dozens of organs to fill the critical gap in existing benchmarks and mirror real-world clinical comprehensiveness. In addition to widely recognized natural language generation metrics, we introduce a series of clinical efficiency metrics to evaluate the quality of radiotracer uptake pattern description in key organs in generated reports. We conduct a head-to-head comparison of 30 cutting-edge general-purpose and medical-specialized VLMs. The results show that the current state-of-the-art VLMs perform poorly on the PET report generation task, falling considerably short of fulfilling practical needs. Moreover, we identify several key insufficiencies that need to be addressed to advance the development in medical applications.

[162] TotalRegistrator: Towards a Lightweight Foundation Model for CT Image Registration eess.IV | cs.CVPDF

Xuan Loc Pham, Gwendolyn Vuurberg, Marjan Doppen, Joey Roosen, Tip Stille

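TL;DR: TotalRegistrator is a lightweight foundation model for CT image registration that aligns multiple anatomical regions simultaneously, outperforms baseline methods, and generalizes well.

Details

Motivation: Most existing registration methods are designed for a single organ and generalize poorly, whereas clinical practice requires handling CT images that span multiple anatomical regions.

Result: On the in-house dataset the model outperforms baselines in multi-organ abdominal registration, with a slight drop in lung registration. On external datasets it is competitive.

Insight: With a lightweight design and a field decomposition strategy, the model achieves multi-organ registration and offers a more general solution for clinical use.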

Abstract: Image registration is a fundamental technique in the analysis of longitudinal and multi-phase CT images within clinical practice. However, most existing methods are tailored for single-organ applications, limiting their generalizability to other anatomical regions. This work presents TotalRegistrator, an image registration framework capable of aligning multiple anatomical regions simultaneously using a standard UNet architecture and a novel field decomposition strategy. The model is lightweight, requiring only 11GB of GPU memory for training. To train and evaluate our method, we constructed a large-scale longitudinal dataset comprising 695 whole-body (thorax-abdomen-pelvic) paired CT scans from individual patients acquired at different time points. We benchmarked TotalRegistrator against a generic classical iterative algorithm and a recent foundation model for image registration. To further assess robustness and generalizability, we evaluated our model on three external datasets: the public thoracic and abdominal datasets from the Learn2Reg challenge, and a private multiphase abdominal dataset from a collaborating hospital. Experimental results on the in-house dataset show that the proposed approach generally surpasses baseline methods in multi-organ abdominal registration, with a slight drop in lung alignment performance. On out-of-distribution datasets, it achieved competitive results compared to leading single-organ models, despite not being fine-tuned for those tasks, demonstrating strong generalizability. The source code will be publicly available at: https://github.com/DIAGNijmegen/oncology_image_registration.git.

[163] OpenDCVCs: A PyTorch Open Source Implementation and Performance Evaluation of the DCVC series Video Codecs eess.IV | cs.CVPDF

Yichi Zhang, Fengqing Zhu

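TL;DR: OpenDCVCs is an open-source PyTorch implementation aimed at advancing reproducible research in learned video compression. It provides unified implementations of four DCVC models, addressing the fact that earlier releases offered evaluation code only.

Details

Motivation: Learned video compression models such as the DCVC series achieve strong bitrate savings, but the public code has mostly been evaluation-only and lacks a complete training framework, which hinders reproduction and further development.

Result: OpenDCVCs gives the community a transparent, consistent basis for evaluation and supports further research on and extension of learned video compression.

Insight: An open, unified implementation is crucial for advancing learned video compression research, particularly for reproducibility and collaborative development.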

Abstract: We present OpenDCVCs, an open-source PyTorch implementation designed to advance reproducible research in learned video compression. OpenDCVCs provides unified and training-ready implementations of four representative Deep Contextual Video Compression (DCVC) models: DCVC, DCVC with Temporal Context Modeling (DCVC-TCM), DCVC with Hybrid Entropy Modeling (DCVC-HEM), and DCVC with Diverse Contexts (DCVC-DC). While the DCVC series achieves substantial bitrate reductions over both classical codecs and advanced learned models, previous public code releases have been limited to evaluation code, presenting significant barriers to reproducibility, benchmarking, and further development. OpenDCVCs bridges this gap by offering a comprehensive, self-contained framework that supports both end-to-end training and evaluation for all included algorithms. The implementation includes detailed documentation, evaluation protocols, and extensive benchmarking results across diverse datasets, providing a transparent and consistent foundation for comparison and extension. All code and experimental tools are publicly available at https://gitlab.com/viper-purdue/opendcvcs, empowering the community to accelerate research and foster collaboration.

[164] Conditional Fetal Brain Atlas Learning for Automatic Tissue Segmentation eess.IV | cs.CV | cs.LG | 68T07 (Primary) 92C50 (Secondary) | I.4.9; I.4.6; I.2.0PDF

Johannes Tischer, Patric Kienast, Marlene Stümpflen, Gregor Kasprian, Georg Langs

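TL;DR: The paper proposes a new deep learning framework that generates continuous, age-specific fetal brain atlases for real-time fetal brain tissue segmentation. Combining a direct registration model with a conditional discriminator and trained on 219 fetal MRIs, it achieves high registration accuracy and segmentation performance.

Details

Motivation: Assessing fetal brain MRI is challenging because of variability in brain maturation, imaging protocols, and gestational age estimates. A standardized reference framework is needed to support objective evaluation and comparison.

Result: The method reaches an average Dice similarity coefficient of 86.3% across six brain tissues, achieves high registration accuracy, and captures dynamic anatomical changes.

Insight: The generated atlases support individualized developmental assessment and provide a real-time analysis tool for research and clinical applications.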

Abstract: Magnetic Resonance Imaging (MRI) of the fetal brain has become a key tool for studying brain development in vivo. Yet, its assessment remains challenging due to variability in brain maturation, imaging protocols, and uncertain estimates of Gestational Age (GA). To overcome these, brain atlases provide a standardized reference framework that facilitates objective evaluation and comparison across subjects by aligning the atlas and subjects in a common coordinate system. In this work, we introduce a novel deep-learning framework for generating continuous, age-specific fetal brain atlases for real-time fetal brain tissue segmentation. The framework combines a direct registration model with a conditional discriminator and is trained on a curated dataset of 219 neurotypical fetal MRIs spanning from 21 to 37 weeks of gestation. The method achieves high registration accuracy, captures dynamic anatomical changes with sharp structural detail, and delivers robust segmentation performance with an average Dice Similarity Coefficient (DSC) of 86.3% across six brain tissues. Furthermore, volumetric analysis of the generated atlases reveals detailed neurotypical growth trajectories, providing valuable insights into the maturation of the fetal brain. This approach enables individualized developmental assessment with minimal pre-processing and real-time performance, supporting both research and clinical applications. The model code is available at https://github.com/cirmuw/fetal-brain-atlas

cs.MM [Back]

[165] Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation cs.MM | cs.CV | cs.MA | cs.SD | eess.ASPDF

Jinxing Zhou, Yanghao Zhou, Mingfei Han, Tong Wang, Xiaojun Chang

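TL;DR: Proposes TGS-Agent, which mimics human reasoning by decomposing referring audio-visual segmentation into three stages, Think-Ground-Segment, removing the need for pixel-level supervision and improving interpretability.

Details

Motivation: Existing methods learn latent embeddings through multimodal fusion, which lacks interpretability and requires strong pixel-level supervision. The paper starts from explicit reference understanding and proposes an approach closer to human reasoning.

Result: State-of-the-art results on the standard Ref-AVSBench and the new R²-AVSBench.

Insight: An explicit reasoning framework improves interpretability and reduces dependence on pixel-level supervision. The results also support the effectiveness of multimodal language models on complex tasks.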

Abstract: Referring Audio-Visual Segmentation (Ref-AVS) aims to segment target objects in audible videos based on given reference expressions. Prior works typically rely on learning latent embeddings via multimodal fusion to prompt a tunable SAM/SAM2 decoder for segmentation, which requires strong pixel-level supervision and lacks interpretability. From a novel perspective of explicit reference understanding, we propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process, mimicking the human reasoning procedure by first identifying the referred object through multimodal analysis, followed by coarse-grained grounding and precise segmentation. To this end, we first propose Ref-Thinker, a multimodal language model capable of reasoning over textual, visual, and auditory cues. We construct an instruction-tuning dataset with explicit object-aware think-answer chains for Ref-Thinker fine-tuning. The object description inferred by Ref-Thinker is used as an explicit prompt for Grounding-DINO and SAM2, which perform grounding and segmentation without relying on pixel-level supervision. Additionally, we introduce R²-AVSBench, a new benchmark with linguistically diverse and reasoning-intensive references for better evaluating model generalization. Our approach achieves state-of-the-art results on both standard Ref-AVSBench and proposed R²-AVSBench. Code will be available at https://github.com/jasongief/TGS-Agent.

cs.AI [Back]

[166] AgREE: Agentic Reasoning for Knowledge Graph Completion on Emerging Entities cs.AI | cs.CLPDF

Ruochen Zhao, Simone Conia, Eric Peng, Min Li, Saloni Potdar

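TL;DR: AgREE is an agent-based framework that dynamically constructs knowledge graph triplets. It targets emerging entities and, without any training, significantly outperforms existing methods.

Details

Motivation: Existing KGC methods fall short on emerging entities. Those that rely on pretrained language models or large amounts of supervised data, in particular, cannot capture up-to-date information dynamically.

Result: AgREE improves performance on emerging-entity tasks by up to 13.7% and requires no training data.

Insight: Combining an agent with iterative retrieval effectively addresses KGC in dynamic environments, especially for capturing information about emerging entities.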

Abstract: Open-domain Knowledge Graph Completion (KGC) faces significant challenges in an ever-changing world, especially when considering the continual emergence of new entities in daily news. Existing approaches for KGC mainly rely on pretrained language models’ parametric knowledge, pre-constructed queries, or single-step retrieval, typically requiring substantial supervision and training data. Even so, they often fail to capture comprehensive and up-to-date information about unpopular and/or emerging entities. To this end, we introduce Agentic Reasoning for Emerging Entities (AgREE), a novel agent-based framework that combines iterative retrieval actions and multi-step reasoning to dynamically construct rich knowledge graph triplets. Experiments show that, despite requiring zero training efforts, AgREE significantly outperforms existing methods in constructing knowledge graph triplets, especially for emerging entities that were not seen during language models’ training processes, outperforming previous methods by up to 13.7%. Moreover, we propose a new evaluation methodology that addresses a fundamental weakness of existing setups and a new benchmark for KGC on emerging entities. Our work demonstrates the effectiveness of combining agent-based reasoning with strategic information retrieval for maintaining up-to-date knowledge graphs in dynamic information environments.

[167] Beyond Pixels: Exploring DOM Downsampling for LLM-Based Web Agents cs.AI | cs.CL | cs.HCPDF

Thassilo M. Schiepanski, Nicholas Piël

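TL;DR: The paper proposes D2Snap, a novel DOM downsampling algorithm that tackles the key problem of application state serialization for LLM-based web agents. It performs on par with a GUI-snapshot baseline and better in some configurations.

Details

Motivation: Current LLM-based web agents mainly take GUI snapshots (e.g., screenshots) as input, yet LLM image processing still lags behind code interpretation. DOM snapshots structurally resemble HTML, but model input token limits have so far prevented their reliable use. An efficient DOM downsampling method is therefore needed.

Result: Downsampled DOM snapshots match the GUI snapshot baseline, with success rates of 67% vs. 65%. Configurations with a higher token budget (still within the model's context window) outperform the baseline by 8%.

Insight: The DOM's hierarchy is a strong UI feature for LLMs, and efficient downsampling makes DOM-based web agents considerably more practical.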

Abstract: Frontier LLMs only recently enabled serviceable, autonomous web agents. At that, a model poses as an instantaneous domain model backend. To suggest an interaction, it is given a web-based task and the respective application state. The key problem lies in application state serialisation, referred to as snapshot. State-of-the-art web agents are premised on grounded GUI snapshots, i.e., screenshots enhanced with visual cues. This is not only to resemble human perception, but also because images are a relatively cheap form of model input. LLM vision still lags behind code interpretation capabilities. DOM snapshots, which structurally resemble HTML, pose a desirable alternative. Vast model input token size, however, has so far prevented reliable use in web agents. We propose D2Snap, a first-of-its-kind DOM downsampling algorithm. Based on a GPT-4o backend, we evaluate D2Snap on tasks sampled from the Online-Mind2Web dataset. The success rate of D2Snap-downsampled DOM snapshots (67%) matches a grounded GUI snapshot baseline (65%) within the same input token order of magnitude (1e3). Our best evaluated configurations (one token order above, but within the model’s context window) outperform this baseline by 8%. Our evaluation, moreover, shows that DOM-inherent hierarchy embodies a strong UI feature for LLMs.
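To make the idea of DOM downsampling concrete, here is a generic sketch. It is not the D2Snap algorithm, whose details are not given in the abstract. It simply drops non-visible subtrees, keeps a small set of attributes, and truncates text while preserving the hierarchy as indentation. The tag and attribute sets, the class name `DomDownsampler`, and the `max_text` parameter are all assumptions, and the depth tracking presumes well-formed HTML.

```python
from html.parser import HTMLParser

# All three sets are illustrative choices, not taken from the paper.
DROPPED_TAGS = {"script", "style", "svg", "noscript"}
KEPT_ATTRS = {"id", "href", "name", "type", "aria-label"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class DomDownsampler(HTMLParser):
    """Collects an indented outline of the DOM with reduced content."""

    def __init__(self, max_text=40):
        super().__init__()
        self.max_text = max_text
        self.lines = []
        self.depth = 0        # nesting depth of kept elements
        self.skip_depth = 0   # > 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or tag in DROPPED_TAGS:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        kept = "".join(f' {k}="{v}"' for k, v in attrs if k in KEPT_ATTRS and v)
        self.lines.append("  " * self.depth + f"<{tag}{kept}>")
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        elif self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and not self.skip_depth:
            self.lines.append("  " * self.depth + text[: self.max_text])


def downsample(html, max_text=40):
    parser = DomDownsampler(max_text)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


page = ('<div class="x" id="main"><script>var a=1;</script>'
        '<a href="/buy" style="c">Buy now</a><p>  Some   long text here </p></div>')
outline = downsample(page)
```

Keeping the nesting as indentation reflects the abstract's observation that the DOM hierarchy itself is a useful signal for the model. A real downsampler would also need a token budget and a policy for which subtrees to merge or drop.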

[168] OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use cs.AI | cs.CL | cs.CV | cs.LGPDF

Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao

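TL;DR: This paper is a comprehensive survey of OS Agents, agents based on multimodal large language models (MLLMs). It covers their key components, capabilities, and construction methods in operating system environments, and summarizes evaluation protocols and future research directions.

Details

Motivation: The work aims to advance AI assistants toward being as broadly capable on general computing devices as the fictional J.A.R.V.I.S, automating tasks for the user.

Result: Provides a comprehensive framework and future directions for OS Agents research, along with an open-source repository maintained as a dynamically updated resource.

Insight: Progress on OS Agents depends on deep integration of MLLMs with operating system environments. Future work should focus on safety, privacy, personalization, and self-evolution.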

Abstract: The dream to create AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multi-modal) large language models ((M)LLMs), this dream is closer to reality, as (M)LLM-based Agents using computing devices (e.g., computers and mobile phones) by operating within the environments and interfaces (e.g., Graphical User Interface (GUI)) provided by operating systems (OS) to automate tasks have significantly advanced. This paper presents a comprehensive survey of these advanced agents, designated as OS Agents. We begin by elucidating the fundamentals of OS Agents, exploring their key components including the environment, observation space, and action space, and outlining essential capabilities such as understanding, planning, and grounding. We then examine methodologies for constructing OS Agents, focusing on domain-specific foundation models and agent frameworks. A detailed review of evaluation protocols and benchmarks highlights how OS Agents are assessed across diverse tasks. Finally, we discuss current challenges and identify promising directions for future research, including safety and privacy, personalization and self-evolution. This survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development. An open-source GitHub repository is maintained as a dynamic resource to foster further innovation in this field. We present a 9-page version of our work, accepted by ACL 2025, to provide a concise overview to the domain.

[169] SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience cs.AI | cs.CL | cs.CV | cs.LG | cs.MA | cs.MMPDF

Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong

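TL;DR: The paper proposes SEAgent, a self-evolving computer use agent framework that learns autonomously through interaction with unfamiliar software, without human-labeled data. By combining a Curriculum Generator with a World State Model, SEAgent learns tasks from simple to complex and improves its performance.

Details

Motivation: Large vision-language models used as computer use agents (CUAs) rely on human-labeled data and perform poorly on novel or specialized software. SEAgent aims to remove this limitation by improving adaptability through autonomous learning and accumulated experience.

Result: Across five novel software environments in OS-World, SEAgent raises the success rate from 11.3% to 34.5%, a gain of 23.2 percentage points over the open-source CUA UI-TARS.

Insight: Autonomous learning and accumulated experience can markedly improve an agent's adaptability in unfamiliar environments, and combining curriculum design with policy optimization is key to the gains.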

Abstract: Repurposing large vision-language models (LVLMs) as computer use agents (CUAs) has led to substantial breakthroughs, primarily driven by human-labeled data. However, these models often struggle with novel and specialized software, particularly in scenarios lacking human annotations. To address this challenge, we propose SEAgent, an agentic self-evolving framework enabling CUAs to autonomously evolve through interactions with unfamiliar software. Specifically, SEAgent empowers computer-use agents to autonomously master novel software environments via experiential learning, where agents explore new software, learn through iterative trial-and-error, and progressively tackle auto-generated tasks organized from simple to complex. To achieve this goal, we design a World State Model for step-wise trajectory assessment, along with a Curriculum Generator that generates increasingly diverse and challenging tasks. The agent’s policy is updated through experiential learning, comprised of adversarial imitation of failure actions and Group Relative Policy Optimization (GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist training strategy that integrates individual experiential insights from specialist agents, facilitating the development of a stronger generalist CUA capable of continuous autonomous evolution. This unified agent ultimately achieves performance surpassing ensembles of individual specialist agents on their specialized software. We validate the effectiveness of SEAgent across five novel software environments within OS-World. Our approach achieves a significant improvement of 23.2% in success rate, from 11.3% to 34.5%, over a competitive open-source CUA, i.e., UI-TARS.
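The abstract names Group Relative Policy Optimization (GRPO) for learning from successful trajectories. A minimal sketch of the group-relative advantage at the core of GRPO follows: rewards for several rollouts of the same task are normalized by the group's mean and standard deviation. This shows only the generic normalization step, not SEAgent's training loop. The function name, the use of the population standard deviation, and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same task.

    Hypothetical helper: the advantage of rollout i is
    (r_i - mean(r)) / (std(r) + eps), so no learned value function is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task: two succeed (reward 1.0), two fail (reward 0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Successful rollouts receive positive advantages and failed ones negative advantages. When every rollout in a group gets the same reward, all advantages are zero and the group contributes no gradient signal.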

cs.IR [Back]

[170] ConvMix: A Mixed-Criteria Data Augmentation Framework for Conversational Dense Retrieval cs.IR | cs.CLPDF

Fengran Mo, Jinghan Zhang, Yuchen Hui, Jia Ao Sun, Zhichao Xu

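TL;DR: ConvMix is a mixed-criteria data augmentation framework for conversational dense retrieval. It improves retrieval performance through a two-sided relevance judgment augmentation schema and quality control mechanisms.

Details

Motivation: Conversational search must uncover the user's real intent from context-dependent queries, but existing methods suffer from data scarcity. ConvMix addresses this through data augmentation.

Result: On five widely used benchmarks, conversational dense retrievers trained with ConvMix outperform existing baselines.

Insight: By combining large language models with mixed-criteria augmentation, ConvMix mitigates data scarcity and improves retrieval performance.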

Abstract: Conversational search aims to satisfy users’ complex information needs via multiple-turn interactions. The key challenge lies in revealing real users’ search intent from the context-dependent queries. Previous studies achieve conversational search by fine-tuning a conversational dense retriever with relevance judgments between pairs of context-dependent queries and documents. However, this training paradigm encounters data scarcity issues. To this end, we propose ConvMix, a mixed-criteria framework to augment conversational dense retrieval, which covers more aspects than existing data augmentation frameworks. We design a two-sided relevance judgment augmentation schema in a scalable manner with the aid of large language models. Besides, we integrate the framework with quality control mechanisms to obtain semantically diverse samples and near-distribution supervisions to combine various annotated data. Experimental results on five widely used benchmarks show that the conversational dense retriever trained by our ConvMix framework outperforms previous baseline methods, demonstrating its effectiveness.

[171] Do Recommender Systems Really Leverage Multimodal Content? A Comprehensive Analysis on Multimodal Representations for Recommendation cs.IR | cs.CL | cs.LGPDF

Claudio Pomo, Matteo Attimonelli, Danilo Danese, Fedelucio Narducci, Tommaso Di Noia

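TL;DR: The paper examines the role of multimodal embeddings in multimodal recommender systems and proposes using Large Vision-Language Models (LVLMs) to generate semantically aligned multimodal embeddings, which notably improve recommendation performance.

Details

Motivation: Existing multimodal recommender systems are effective, but it is unclear whether their gains truly come from understanding multimodal content. The paper sets out to verify how semantically informative multimodal embeddings are and what they actually contribute to recommendation performance.

Result: Experiments show that LVLM-generated embeddings notably improve recommendation performance, and the decoded textual descriptions further confirm the semantic richness and alignment of the embeddings.

Insight: Semantically rich multimodal embeddings matter for recommender systems, and LVLMs are a strong foundation for building robust, meaningful multimodal representations.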

Abstract: Multimodal Recommender Systems aim to improve recommendation accuracy by integrating heterogeneous content, such as images and textual metadata. While effective, it remains unclear whether their gains stem from true multimodal understanding or increased model complexity. This work investigates the role of multimodal item embeddings, emphasizing the semantic informativeness of the representations. Initial experiments reveal that embeddings from standard extractors (e.g., ResNet50, Sentence-Bert) enhance performance, but rely on modality-specific encoders and ad hoc fusion strategies that lack control over cross-modal alignment. To overcome these limitations, we leverage Large Vision-Language Models (LVLMs) to generate multimodal-by-design embeddings via structured prompts. This approach yields semantically aligned representations without requiring any fusion. Experiments across multiple settings show notable performance improvements. Furthermore, LVLM embeddings offer a distinctive advantage: they can be decoded into structured textual descriptions, enabling direct assessment of their multimodal comprehension. When such descriptions are incorporated as side content into recommender systems, they improve recommendation performance, empirically validating the semantic depth and alignment encoded within LVLM outputs. Our study highlights the importance of semantically rich representations and positions LVLMs as a compelling foundation for building robust and meaningful multimodal representations in recommendation tasks.

[172] Audio Does Matter: Importance-Aware Multi-Granularity Fusion for Video Moment Retrieval cs.IR | cs.CV | cs.MM | cs.SD | eess.ASPDF

Junan Lin, Daizong Liu, Xianke Chen, Xiaoye Qu, Xun Yang

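TL;DR: The paper proposes an Importance-aware Multi-Granularity fusion model (IMG) for video moment retrieval (VMR). It dynamically and selectively aggregates the audio, visual, and text modalities, addressing existing methods that either ignore audio or fuse it naively.

Details

Motivation: Existing VMR methods focus on the visual and textual modalities and neglect the complementary role of audio. The few works that attempt joint audio-vision-text reasoning neither distinguish how important the audio is nor model fine-grained interaction. The authors note that audio may be noise or meaningless background sound, so fusion must be dynamic.

Result: Experiments show that IMG achieves state-of-the-art VMR performance, particularly with audio-video fusion.

Insight: Audio does matter in VMR, but its importance must be assessed rather than assumed. Multi-granularity fusion and dynamic weighting improve performance.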

Abstract: Video Moment Retrieval (VMR) aims to retrieve a specific moment semantically related to the given query. To tackle this task, most existing VMR methods solely focus on the visual and textual modalities while neglecting the complementary but important audio modality. Although a few recent works try to tackle the joint audio-vision-text reasoning, they treat all modalities equally and simply embed them without fine-grained interaction for moment retrieval. These designs are impractical because not all audio is helpful for video moment retrieval, and the audio of some videos may be pure noise or background sound that is meaningless to the moment determination. To this end, we propose a novel Importance-aware Multi-Granularity fusion model (IMG), which learns to dynamically and selectively aggregate the audio-vision-text contexts for VMR. Specifically, after integrating the textual guidance with vision and audio separately, we first design a pseudo-label-supervised audio importance predictor that predicts the importance score of the audio, and accordingly assigns weights to mitigate the interference caused by noisy audio. Then, we design a multi-granularity audio fusion module that adaptively fuses audio and visual modalities at local-, event-, and global-level, fully capturing their complementary contexts. We further propose a cross-modal knowledge distillation strategy to address the challenge of missing audio modality during inference. To evaluate our method, we further construct a new VMR dataset, i.e., Charades-AudioMatter, where audio-related samples are manually selected and re-organized from the original Charades-STA to validate the model’s capability in utilizing audio modality. Extensive experiments validate the effectiveness of our method, achieving state-of-the-art results among VMR methods with audio-video fusion. Our code is available at https://github.com/HuiGuanLab/IMG.
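The importance-weighting idea in the abstract can be illustrated with a toy sketch: a scalar importance score gates how much the audio feature contributes to the fused representation. This is not the IMG architecture. The additive fusion rule, the sigmoid gate, and the names `fuse_with_audio_importance` and `importance_logit` are assumptions chosen for clarity.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse_with_audio_importance(visual, audio, importance_logit):
    """Add audio features to visual features, scaled by a gate in (0, 1).

    Hypothetical sketch: `importance_logit` stands in for the output of an
    audio importance predictor; a low value suppresses noisy audio.
    """
    if len(visual) != len(audio):
        raise ValueError("feature vectors must have the same length")
    weight = sigmoid(importance_logit)
    return [v + weight * a for v, a in zip(visual, audio)]


visual_feat = [1.0, 2.0]
audio_feat = [10.0, 10.0]
noisy = fuse_with_audio_importance(visual_feat, audio_feat, -50.0)   # audio judged unhelpful
useful = fuse_with_audio_importance(visual_feat, audio_feat, 50.0)   # audio judged helpful
```

With a strongly negative logit the fused vector stays close to the visual features, and with a strongly positive one the audio is added almost in full.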

Table of Contents

cs.CV

[1] Outlier Detection Algorithm for Circle Fitting cs.CV | eess.IV

[2] Enhancing Diameter Measurement Accuracy in Machine Vision Applications cs.CV | eess.IV

[3] Multimodal Video Emotion Recognition with Reliable Reasoning Priors cs.CV | cs.AI

[4] From Waveforms to Pixels: A Survey on Audio-Visual Segmentation cs.CV

[5] A Large Language Model Powered Integrated Circuit Footprint Geometry Understanding cs.CV

[6] TIR-Diffusion: Diffusion-based Thermal Infrared Image Denoising via Latent and Wavelet Domain Optimization cs.CV | cs.RO | eess.IV

[7] StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization cs.CV | cs.AI

[8] Fusion of Pervasive RF Data with Spatial Images via Vision Transformers for Enhanced Mapping in Smart Cities cs.CV | cs.AI

[9] VQ-DeepISC: Vector Quantized-Enabled Digital Semantic Communication with Channel Adaptive Image Transmission cs.CV | cs.AI

[10] Tobler’s First Law in GeoAI: A Spatially Explicit Deep Learning Model for Terrain Feature Detection Under Weak Supervision cs.CV | cs.AI

[11] Closed-Circuit Television Data as an Emergent Data Source for Urban Rail Platform Crowding Estimation cs.CV | eess.IV

[12] Modular Transformer Architecture for Precision Agriculture Imaging cs.CV

[13] Refine-IQA: Multi-Stage Reinforcement Finetuning for Perceptual Image Quality Assessment cs.CV | cs.AI

[14] HPSv3: Towards Wide-Spectrum Human Preference Score cs.CV

[15] Deep learning framework for crater detection and identification on the Moon and Mars cs.CV | cs.AI

[16] Point-Based Shape Representation Generation with a Correspondence-Preserving Diffusion Model cs.CV | cs.LG

[17] Policy to Assist Iteratively Local Segmentation: Optimising Modality and Location Selection for Prostate Cancer Localisation cs.CV | cs.AI

[18] Scaling Up Audio-Synchronized Visual Animation: An Efficient Training Paradigm cs.CV

[19] RAVID: Retrieval-Augmented Visual Detection: A Knowledge-Driven Approach for AI-Generated Image Identification cs.CV | cs.CR | cs.IR

[20] Investigating the Impact of Large-Scale Pre-training on Nutritional Content Estimation from 2D Images cs.CV

[21] CAD-Judge: Toward Efficient Morphological Grading and Verification for Text-to-CAD Generation cs.CV

[22] $\text{S}^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation cs.CV

[23] Can Large Multimodal Models Actively Recognize Faulty Inputs? A Systematic Evaluation Framework of Their Input Scrutiny Ability cs.CV

[24] Dual Prompt Learning for Adapting Vision-Language Models to Downstream Image-Text Retrieval cs.CV | cs.IR

[25] VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning cs.CV

[26] Iterative pseudo-labeling based adaptive copy-paste supervision for semi-supervised tumor segmentation cs.CV

[27] Motion is the Choreographer: Learning Latent Pose Dynamics for Seamless Sign Language Generation cs.CV

[28] TCSAFormer: Efficient Vision Transformer with Token Compression and Sparse Attention for Medical Image Segmentation cs.CV

[29] Beyond the Visible: Benchmarking Occlusion Perception in Multimodal Large Language Models cs.CV

[30] TNet: Terrace Convolutional Decoder Network for Remote Sensing Image Semantic Segmentation cs.CV

[31] Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework cs.CV

[32] DET-GS: Depth- and Edge-Aware Regularization for High-Fidelity 3D Gaussian Splatting cs.CV | cs.AI

[33] NEARL-CLIP: Interacted Query Adaptation with Orthogonal Regularization for Medical Vision-Language Understanding cs.CV

[34] AR as an Evaluation Playground: Bridging Metrics and Visual Perception of Computer Vision Models cs.CV

[35] Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder cs.CV | cs.AI

[36] CLIPVehicle: A Unified Framework for Vision-based Vehicle Search cs.CV

[37] Conditional Latent Diffusion Models for Zero-Shot Instance Segmentation cs.CV

[38] Excavate the potential of Single-Scale Features: A Decomposition Network for Water-Related Optical Image Enhancement cs.CV | eess.IV

[39] Learning Using Privileged Information for Litter Detection cs.CV | cs.ET | cs.LG | cs.PF

[40] SVC 2025: the First Multimodal Deception Detection Challenge cs.CV

[41] DS$^2$Net: Detail-Semantic Deep Supervision Network for Medical Image Segmentation cs.CV | cs.AI

[42] UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval cs.CV | cs.AI

[43] IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control cs.CV

[44] ICM-Fusion: In-Context Meta-Optimized LoRA Fusion for Multi-Task Adaptation cs.CV

[45] Audio-Assisted Face Video Restoration with Temporal and Identity Complementary Learning cs.CV | cs.MM | cs.SD | eess.AS

[46] ToxicTAGS: Decoding Toxic Memes with Rich Tag Annotations cs.CV | cs.CL

[47] AD-FM: Multimodal LLMs for Anomaly Detection via Multi-Stage Reasoning and Fine-Grained Reward Optimization cs.CV

[48] Uncertainty-Aware Spatial Color Correlation for Low-Light Image Enhancement cs.CV

[49] Deeper Inside Deep ViT cs.CV

[50] RPCANet++: Deep Interpretable Robust PCA for Sparse Object Segmentation cs.CV

[51] From Learning to Unlearning: Biomedical Security Protection in Multimodal Large Language Models cs.CV

[52] Gather and Trace: Rethinking Video TextVQA from an Instance-oriented Perspective cs.CV | cs.AI

[53] Bootstrap Deep Spectral Clustering with Optimal Transport cs.CV | cs.LG

[54] ViFP: A Framework for Visual False Positive Detection to Enhance Reasoning Reliability in VLMs cs.CV | cs.AI

[55] Small Lesions-aware Bidirectional Multimodal Multiscale Fusion Network for Lung Disease Classification cs.CV

[56] SplitGaussian: Reconstructing Dynamic Scenes via Visual Geometry Decomposition cs.CV

[57] Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting cs.CV | cs.LG

[58] LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation cs.CV | cs.AI | cs.LG | cs.MM

[59] Intention Enhanced Diffusion Model for Multimodal Pedestrian Trajectory Prediction cs.CV

[60] PIS3R: Very Large Parallax Image Stitching via Deep 3D Reconstruction cs.CV

[61] From eye to AI: studying rodent social behavior in the era of machine Learning cs.CV | q-bio.NC | I.2.10; I.4.8; J.3

[62] Revisiting Continual Semantic Segmentation with Pre-trained Vision Models cs.CV

[63] PKSS-Align: Robust Point Cloud Registration on Pre-Kendall Shape Space cs.CV

[64] FrEVL: Leveraging Frozen Pretrained Embeddings for Efficient Vision-Language Understanding cs.CV | cs.CL

[65] Length Matters: Length-Aware Transformer for Temporal Sentence Grounding cs.CV

[66] Analyzing and Mitigating Object Hallucination: A Training Bias Perspective cs.CV | cs.CL

[67] TempFlow-GRPO: When Timing Matters for GRPO in Flow Models cs.CV

[68] RiemanLine: Riemannian Manifold Representation of 3D Lines for Factor Graph Optimization cs.CV | cs.RO

[69] RotatedMVPS: Multi-view Photometric Stereo with Rotated Natural Light cs.CV

[70] TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding cs.CV

[71] VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Visual Backbones cs.CV | cs.LG

[72] Deep Learning-based Scalable Image-to-3D Facade Parser for Generating Thermal 3D Building Models cs.CV | cs.AI

[73] Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning cs.CV

[74] Efficient Inter-Task Attention for Multitask Transformer Models cs.CV

[75] Composed Object Retrieval: Object-level Retrieval via Composed Expressions cs.CV

[76] Benchmarking Foundation Models for Mitotic Figure Classification cs.CV

[77] Boosting Visual Knowledge-Intensive Training for LVLMs Through Causality-Driven Visual Object Completion cs.CV

[78] 4DVD: Cascaded Dense-view Video Diffusion Model for High-quality 4D Content Generation cs.CV