cs.CV

[1] CorrDetail: Visual Detail Enhanced Self-Correction for Face Forgery Detection cs.CV | cs.AI

Binjia Zhou, Hengrui Lou, Lizhe Chen, Haoyuan Li, Dawei Luo

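TL;DR: This paper proposes CorrDetail, a visual detail enhanced self-correction framework for interpretable face forgery detection. It improves detection performance by correcting forgery details and enhancing fine-grained visual details.

Details

Motivation: With the rapid development of image generation technology, the widespread emergence of facial deepfakes poses significant challenges to security. Existing forgery detection methods either lack clear explanations of forgery details or are prone to hallucination, so a more effective and interpretable detection method is needed.

Result: Experiments show that CorrDetail achieves state-of-the-art performance compared with the latest methods, while accurately identifying forged details and generalizing robustly.

Insight: Enhancing visual detail and adding a self-correction mechanism can improve both the interpretability and the performance of forgery detection, which matters for security applications.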

Abstract: With the swift progression of image generation technology, the widespread emergence of facial deepfakes poses significant challenges to the field of security, thus amplifying the urgent need for effective deepfake detection. Existing techniques for face forgery detection can broadly be categorized into two primary groups: visual-based methods and multimodal approaches. The former often lacks clear explanations for forgery details, while the latter, which merges visual and linguistic modalities, is more prone to the issue of hallucinations. To address these shortcomings, we introduce a visual detail enhanced self-correction framework, designated CorrDetail, for interpretable face forgery detection. CorrDetail is meticulously designed to rectify authentic forgery details when provided with error-guided questioning, with the aim of fostering the ability to uncover forgery details rather than yielding hallucinated responses. Additionally, to bolster the reliability of its findings, a visual fine-grained detail enhancement module is incorporated, supplying CorrDetail with more precise visual forgery details. Ultimately, a fusion decision strategy is devised to further augment the model’s discriminative capacity in handling extreme samples, through the integration of visual information compensation and model bias reduction. Experimental results demonstrate that CorrDetail not only achieves state-of-the-art performance compared to the latest methodologies but also excels in accurately identifying forged details, all while exhibiting robust generalization capabilities.


[2] pFedMMA: Personalized Federated Fine-Tuning with Multi-Modal Adapter for Vision-Language Models cs.CV | cs.LG

Sajjad Ghiasvand, Mahnoosh Alizadeh, Ramtin Pedarsani

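TL;DR: pFedMMA is a personalized federated learning framework that uses multi-modal adapters to adapt vision-language models. It balances local adaptation against global generalization and stays communication-efficient by exchanging only a shared projection layer.

Details

Motivation: Existing federated learning methods struggle to balance personalization and generalization, and perform poorly on unseen classes or domains in particular. pFedMMA aims to address this problem.

Result: Experiments on 11 datasets show that pFedMMA achieves a better personalization-generalization trade-off than existing federated prompt tuning methods.

Insight: The shared projection layer is key to both communication efficiency and global generalization, and the asymmetric optimization strategy helps balance local and global performance.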

Abstract: Vision-Language Models (VLMs) like CLIP have demonstrated remarkable generalization in zero- and few-shot settings, but adapting them efficiently to decentralized, heterogeneous data remains a challenge. While prompt tuning has emerged as a popular parameter-efficient approach in personalized federated learning, existing methods often sacrifice generalization in favor of personalization, struggling particularly on unseen classes or domains. In this work, we propose pFedMMA, the first personalized federated learning framework that leverages multi-modal adapters for vision-language tasks. Each adapter contains modality-specific up- and down-projection layers alongside a globally shared projection that aligns cross-modal features. Our asymmetric optimization strategy allows clients to locally adapt to personalized data distributions while collaboratively training the shared projection to improve global generalization. This design is also communication-efficient, as only the shared component is exchanged during rounds. Through extensive experiments across eleven datasets, including domain- and label-shift scenarios, we show that pFedMMA achieves state-of-the-art trade-offs between personalization and generalization, outperforming recent federated prompt tuning methods. The code is available at https://github.com/sajjad-ucsb/pFedMMA.
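
The communication pattern described in the pFedMMA abstract (modality-specific projections stay on the client, only the shared projection is exchanged) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the class name `ClientAdapter`, the square shared block, the random initial weights standing in for locally trained ones, and the plain unweighted averaging on the server.

```python
import random


def _random_matrix(rng, rows, cols):
    """Small random matrix as a list of rows (stands in for trained weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ClientAdapter:
    """Hypothetical per-client adapter: local per-modality projections plus one shared block."""

    def __init__(self, dim, bottleneck, seed):
        rng = random.Random(seed)
        # Modality-specific down- and up-projections never leave the client.
        self.down = {m: _random_matrix(rng, bottleneck, dim) for m in ("image", "text")}
        self.up = {m: _random_matrix(rng, dim, bottleneck) for m in ("image", "text")}
        # The shared projection is the only part sent to the server.
        self.shared = _random_matrix(rng, bottleneck, bottleneck)


def average_matrices(matrices):
    """Element-wise mean of equally shaped matrices (assumed unweighted averaging)."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[r][c] for m in matrices) / n for c in range(cols)] for r in range(rows)]


def communication_round(clients):
    """Average the shared projections and send the result back to every client."""
    global_shared = average_matrices([c.shared for c in clients])
    for client in clients:
        client.shared = [row[:] for row in global_shared]
    return global_shared


clients = [ClientAdapter(dim=8, bottleneck=2, seed=s) for s in range(3)]
global_shared = communication_round(clients)
print(len(global_shared) * len(global_shared[0]), "values exchanged per client")
```

In this toy setup each client uploads only the 2x2 shared block, while the 8-dimensional modality-specific projections remain personalized; local training between rounds is omitted.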


[3] Neural-Driven Image Editing cs.CV

Pengfei Zhou, Jie Xia, Xiaopeng Peng, Wangbo Zhao, Zilong Ye

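TL;DR: LoongX is a hands-free image editing method driven by multimodal neurophysiological signals (such as EEG, fNIRS, and PPG). It uses diffusion models and contrastive learning to align user intent with edit semantics, performs comparably to text-driven methods, and shows promise when combined with speech.

Details

Motivation: Traditional image editing relies on manual prompting, which is unfriendly to people with limited motor control or language abilities. Combining brain-computer interfaces with generative models offers a more intuitive and accessible way to edit.

Result: LoongX performs on par with text-driven methods (CLIP-I: 0.6605 vs. 0.6558) and outperforms them when neural signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549).

Insight: Neural-driven generative models open new directions for accessible image editing and cognitive-driven technologies; fusing multimodal signals can improve the accuracy of intent understanding.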

Abstract: Traditional image editing typically relies on manual prompting, making it labor-intensive and inaccessible to individuals with limited motor control or language abilities. Leveraging recent advances in brain-computer interfaces (BCIs) and generative models, we propose LoongX, a hands-free image editing approach driven by multimodal neurophysiological signals. LoongX utilizes state-of-the-art diffusion models trained on a comprehensive dataset of 23,928 image editing pairs, each paired with synchronized electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), photoplethysmography (PPG), and head motion signals that capture user intent. To effectively address the heterogeneity of these signals, LoongX integrates two key modules. The cross-scale state space (CS3) module encodes informative modality-specific features. The dynamic gated fusion (DGF) module further aggregates these features into a unified latent space, which is then aligned with edit semantics via fine-tuning on a diffusion transformer (DiT). Additionally, we pre-train the encoders using contrastive learning to align cognitive states with semantic intentions from embedded natural language. Extensive experiments demonstrate that LoongX achieves performance comparable to text-driven methods (CLIP-I: 0.6605 vs. 0.6558; DINO: 0.4812 vs. 0.4636) and outperforms them when neural signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549). These results highlight the promise of neural-driven generative models in enabling accessible, intuitive image editing and open new directions for cognitive-driven creative technologies. Datasets and code will be released to support future work and foster progress in this emerging area.


[4] Motion Generation: A Survey of Generative Approaches and Benchmarks cs.CV | cs.LG

Aliasghar Khani, Arianna Rampini, Bruno Roy, Larasika Nadela, Noa Kaplan

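TL;DR: A survey of motion generation that categorizes generative approaches, focusing on papers from top-tier venues since 2023. It summarizes architectural principles, evaluation metrics, and datasets to give researchers a reference and to identify open challenges.

Details

Motivation: Motion generation has important applications in computer vision, graphics, and robotics, but the diversity of existing methods makes comprehensive review and comparison difficult, so a systematic survey of recent progress is needed.

Result: The paper provides a comprehensive survey of motion generation methods, highlights the strengths and weaknesses of different approaches, and points out future research directions.

Insight: The rapid development of motion generation calls for more standardized evaluation metrics and datasets to make methods easier to compare and improve.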

Abstract: Motion generation, the task of synthesizing realistic motion sequences from various conditioning inputs, has become a central problem in computer vision, computer graphics, and robotics, with applications ranging from animation and virtual agents to human-robot interaction. As the field has rapidly progressed with the introduction of diverse modeling paradigms including GANs, autoencoders, autoregressive models, and diffusion-based techniques, each approach brings its own advantages and limitations. This growing diversity has created a need for a comprehensive and structured review that specifically examines recent developments from the perspective of the generative approach employed. In this survey, we provide an in-depth categorization of motion generation methods based on their underlying generative strategies. Our main focus is on papers published in top-tier venues since 2023, reflecting the most recent advancements in the field. In addition, we analyze architectural principles, conditioning mechanisms, and generation settings, and compile a detailed overview of the evaluation metrics and datasets used across the literature. Our objective is to enable clearer comparisons and identify open challenges, thereby offering a timely and foundational reference for researchers and practitioners navigating the rapidly evolving landscape of motion generation.


[5] OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts cs.CV

Shiting Xiao, Rishabh Kabra, Yuhang Li, Donghyun Lee, Joao Carreira

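TL;DR: OpenWorldSAM extends SAM2 by integrating multi-modal embeddings from a lightweight vision-language model (VLM), enabling universal image segmentation from open-vocabulary language prompts. Its core strengths are unified prompting, efficiency, instance awareness, and strong generalization, and it performs well on multiple benchmarks.

Details

Motivation: Existing image segmentation models have limited capability in open-vocabulary settings, and segmentation from language prompts in particular still needs improvement. OpenWorldSAM tackles this by combining multi-modal embeddings with pre-trained models to make segmentation more flexible and general.

Result: On benchmarks including ADE20k, PASCAL, ScanNet, and SUN-RGBD, OpenWorldSAM achieves state-of-the-art performance on semantic, instance, and panoptic segmentation.

Insight: With efficient multi-modal embeddings and a lightweight design, OpenWorldSAM shows that precise open-vocabulary segmentation and zero-shot generalization are achievable while keeping the model simple.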

Abstract: The ability to segment objects based on open-ended language prompts remains a critical challenge, requiring models to ground textual semantics into precise spatial masks while handling diverse and unseen categories. We present OpenWorldSAM, a framework that extends the prompt-driven Segment Anything Model v2 (SAM2) to open-vocabulary scenarios by integrating multi-modal embeddings extracted from a lightweight vision-language model (VLM). Our approach is guided by four key principles: i) Unified prompting: OpenWorldSAM supports a diverse range of prompts, including category-level and sentence-level language descriptions, providing a flexible interface for various segmentation tasks. ii) Efficiency: By freezing the pre-trained components of SAM2 and the VLM, we train only 4.5 million parameters on the COCO-stuff dataset, achieving remarkable resource efficiency. iii) Instance Awareness: We enhance the model’s spatial understanding through novel positional tie-breaker embeddings and cross-attention layers, enabling effective segmentation of multiple instances. iv) Generalization: OpenWorldSAM exhibits strong zero-shot capabilities, generalizing well on unseen categories and an open vocabulary of concepts without additional training. Extensive experiments demonstrate that OpenWorldSAM achieves state-of-the-art performance in open-vocabulary semantic, instance, and panoptic segmentation across multiple benchmarks, including ADE20k, PASCAL, ScanNet, and SUN-RGBD.


[6] Robotic System with AI for Real Time Weed Detection, Canopy Aware Spraying, and Droplet Pattern Evaluation cs.CV | cs.AI

Inayat Rasool, Pappu Kumar Yadav, Amee Parmar, Hasan Mirzakhaninafchi, Rikesh Budhathoki

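TL;DR: The paper presents an AI-based real-time weed detection and variable-rate spraying system. A lightweight YOLO11n model running on embedded hardware detects weeds and regulates spraying in real time, with the aim of reducing herbicide waste.

Details

Motivation: Excessive herbicide use in modern agriculture raises costs, pollutes the environment, and leads to herbicide-resistant weeds, which calls for an intelligent solution.

Result: The YOLO11n model reaches a mAP@50 of 0.98, average spray coverage is 24.22% in zones where canopy is present, and the system adjusts spray output dynamically according to canopy size.

Insight: Combining real-time deep learning with low-cost embedded hardware enables selective herbicide application; future work needs to extend detection to more weed species and validate the system in the field.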

Abstract: Uniform and excessive herbicide application in modern agriculture contributes to increased input costs, environmental pollution, and the emergence of herbicide resistant weeds. To address these challenges, we developed a vision guided, AI-driven variable rate sprayer system capable of detecting weed presence, estimating canopy size, and dynamically adjusting nozzle activation in real time. The system integrates lightweight YOLO11n and YOLO11n-seg deep learning models, deployed on an NVIDIA Jetson Orin Nano for onboard inference, and uses an Arduino Uno-based relay interface to control solenoid actuated nozzles based on canopy segmentation results. Indoor trials were conducted using 15 potted Hibiscus rosa sinensis plants of varying canopy sizes to simulate a range of weed patch scenarios. The YOLO11n model achieved a mean average precision (mAP@50) of 0.98, with a precision of 0.99 and a recall close to 1.0. The YOLO11n-seg segmentation model achieved a mAP@50 of 0.48, precision of 0.55, and recall of 0.52. System performance was validated using water sensitive paper, which showed an average spray coverage of 24.22% in zones where canopy was present. An upward trend in mean spray coverage from 16.22% for small canopies to 21.46% and 21.65% for medium and large canopies, respectively, demonstrated the system’s capability to adjust spray output based on canopy size in real time. These results highlight the potential of combining real time deep learning with low-cost embedded hardware for selective herbicide application. Future work will focus on expanding the detection capabilities to include three common weed species in South Dakota: water hemp (Amaranthus tuberculatus), kochia (Bassia scoparia), and foxtail (Setaria spp.), followed by further validation in both indoor and field trials within soybean and corn production systems.


[7] Driving as a Diagnostic Tool: Scenario-based Cognitive Assessment in Older Drivers From Driving Video cs.CV | cs.AI

Md Zahid Hasan, Guillermo Basulto-Elias, Jun Ha Chang, Sahuna Hallmark, Matthew Rizzo

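TL;DR: The paper proposes scenario-based cognitive status identification from naturalistic driving videos and large vision models. By analyzing the everyday driving behavior of older drivers, it aims to detect cognitive decline (such as Alzheimer's disease and mild cognitive impairment) early and to support proactive intervention strategies.

Details

Motivation: Current diagnostic methods for cognitive decline are time-consuming and costly, so many cases go undetected. Analyzing driving behavior as an observable indicator of cognitive status could yield a non-invasive, scalable tool for early detection.

Result: The method identifies early warning signs of functional decline and supports the development of early intervention strategies.

Insight: Driving behavior is an effective observable indicator of cognitive status. Combined with large vision models, it enables non-invasive, scalable early detection that can ease the societal and economic burden of cognitive decline in an aging society.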

Abstract: We introduce scenario-based cognitive status identification in older drivers from naturalistic driving videos and large vision models. Cognitive decline, including Alzheimer’s disease (AD) and mild cognitive impairment (MCI), is often underdiagnosed due to the time-consuming and costly nature of current diagnostic methods. By analyzing real-world driving behavior captured through in-vehicle systems, this research aims to extract “digital fingerprints” that correlate with functional decline and clinical features of MCI and AD. Moreover, modern large vision models can draw meaningful insights from the everyday driving patterns of older patients to detect cognitive decline early. We propose a framework that uses large vision models and naturalistic driving videos to analyze driver behavior, classify cognitive status, and predict disease progression. We leverage the strong relationship between real-world driving behavior and the current cognitive status of the drivers, so that the vehicle can be utilized as a “diagnostic tool”. Our method identifies early warning signs of functional impairment, contributing to proactive intervention strategies. This work enhances early detection and supports the development of scalable, non-invasive monitoring systems to mitigate the growing societal and economic burden of cognitive decline in the aging population.


[8] Cloud Diffusion Part 1: Theory and Motivation cs.CV | cs.AI | cs.LG

Andrew Randono

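TL;DR: The paper proposes a "Cloud Diffusion Model" that replaces the usual white noise with a scale-invariant noise distribution, aiming to improve the inference speed, high-frequency detail, and controllability of diffusion models.

Details

Motivation: Conventional diffusion models use white noise, yet the low-order statistics of natural images are scale invariant. The paper argues that a scale-invariant noise distribution matches natural images better and can therefore improve model performance.

Result: The paper argues that Cloud Diffusion Models can improve on white-noise diffusion models in inference speed, high-frequency detail, and controllability; experimental results are deferred to a follow-up paper.

Insight: By exploiting the scale invariance of natural images, Cloud Diffusion Models are in theory closer to the true image distribution and may therefore perform better on generation tasks.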

Abstract: Diffusion models for image generation function by progressively adding noise to an image set and training a model to separate out the signal from the noise. The noise profile used by these models is white noise – that is, noise based on independent normal distributions at each point whose mean and variance are independent of the scale. By contrast, most natural image sets exhibit a type of scale invariance in their low-order statistical properties characterized by a power-law scaling. Consequently, natural images are closer (in a quantifiable sense) to a different probability distribution that emphasizes large scale correlations and de-emphasizes small scale correlations. These scale invariant noise profiles can be incorporated into diffusion models in place of white noise to form what we will call a “Cloud Diffusion Model”. We argue that these models can lead to faster inference, improved high-frequency details, and greater controllability. In a follow-up paper, we will build and train a Cloud Diffusion Model that uses scale invariance at a fundamental level and compare it to classic, white noise diffusion models.
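
To make the contrast between a flat noise spectrum and a power-law one concrete, here is a minimal one-dimensional sketch. It is not the paper's construction: the function names, the random-phase cosine synthesis, and the exponent `alpha` are all assumptions chosen for illustration, and the paper concerns images rather than 1D signals. With `alpha = 0` every frequency carries the same power (a flat, white-noise-like spectrum); with `alpha > 0` power falls off as `1 / k**alpha`, so large-scale (low-frequency) structure dominates.

```python
import math
import random


def power_law_noise(n, alpha, seed=0):
    """1D noise whose power at frequency k is proportional to 1 / k**alpha."""
    rng = random.Random(seed)
    signal = [0.0] * n
    for k in range(1, n // 2):
        amplitude = k ** (-alpha / 2.0)  # power is amplitude squared
        phase = rng.uniform(0.0, 2.0 * math.pi)
        for t in range(n):
            signal[t] += amplitude * math.cos(2.0 * math.pi * k * t / n + phase)
    return signal


def power_at(signal, k):
    """Squared magnitude of the discrete Fourier coefficient at frequency k."""
    n = len(signal)
    re = sum(x * math.cos(2.0 * math.pi * k * t / n) for t, x in enumerate(signal))
    im = sum(x * math.sin(2.0 * math.pi * k * t / n) for t, x in enumerate(signal))
    return re * re + im * im


flat = power_law_noise(64, alpha=0.0, seed=1)
cloud = power_law_noise(64, alpha=2.0, seed=1)
print("flat  power ratio k=1 vs k=8:", round(power_at(flat, 1) / power_at(flat, 8), 3))
print("cloud power ratio k=1 vs k=8:", round(power_at(cloud, 1) / power_at(cloud, 8), 3))
```

The flat signal gives a ratio of about 1, while the `alpha = 2` signal gives about 64, since power scales as `k**-2` between `k = 1` and `k = 8`.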


[9] LoomNet: Enhancing Multi-View Image Generation via Latent Space Weaving cs.CV

Giulio Federico, Fabio Carrara, Claudio Gennaro, Giuseppe Amato, Marco Di Benedetto

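TL;DR: LoomNet is a novel multi-view diffusion architecture that generates 16 consistent views through a shared latent space, improving both multi-view image quality and 3D reconstruction.

Details

Motivation: Generating consistent multi-view images from a single image is challenging, and a lack of spatial consistency degrades the surface reconstruction of 3D meshes.

Result: LoomNet generates 16 high-quality, consistent views in 15 seconds. Experiments show it outperforms existing methods on image quality and reconstruction metrics and produces diverse, plausible novel views.

Insight: With a shared latent space and collaborative inference, LoomNet achieves higher spatial consistency and efficiency in multi-view generation.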

Abstract: Generating consistent multi-view images from a single image remains challenging. Lack of spatial consistency often degrades 3D mesh quality in surface reconstruction. To address this, we propose LoomNet, a novel multi-view diffusion architecture that produces coherent images by applying the same diffusion model multiple times in parallel to collaboratively build and leverage a shared latent space for view consistency. Each viewpoint-specific inference generates an encoding representing its own hypothesis of the novel view from a given camera pose, which is projected onto three orthogonal planes. For each plane, encodings from all views are fused into a single aggregated plane. These aggregated planes are then processed to propagate information and interpolate missing regions, combining the hypotheses into a unified, coherent interpretation. The final latent space is then used to render consistent multi-view images. LoomNet generates 16 high-quality and coherent views in just 15 seconds. In our experiments, LoomNet outperforms state-of-the-art methods on both image quality and reconstruction metrics, also showing creativity by producing diverse, plausible novel views from the same input.


[10] Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model cs.CV | cs.AI

Mengyao Xu, Gabriel Moreira, Ronay Ak, Radek Osmulski, Yauhen Babakhin

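TL;DR: The paper introduces Llama Nemoretriever Colembed, a multimodal retrieval model that achieves top text-image retrieval performance by replacing causal attention in the NVIDIA Eagle2 VLM with bidirectional attention and integrating a ColBERT-style late interaction mechanism.

Details

Motivation: With growing demand for cross-modal retrieval systems, the authors aim to build a unified text-image retrieval model that achieves the best results across multiple benchmarks.

Result: The 3B model scores NDCG@5 of 91.0 on ViDoRe V1 and 63.5 on ViDoRe V2, placing first on both leaderboards as of June 27, 2025.

Insight: Bidirectional attention and late interaction substantially improve retrieval performance, at a cost in storage and computational efficiency.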

Abstract: Motivated by the growing demand for retrieval systems that operate across modalities, we introduce llama-nemoretriever-colembed, a unified text-image retrieval model that delivers state-of-the-art performance across multiple benchmarks. We release two model variants, 1B and 3B. The 3B model achieves state of the art performance, scoring NDCG@5 91.0 on ViDoRe V1 and 63.5 on ViDoRe V2, placing first on both leaderboards as of June 27, 2025. Our approach leverages the NVIDIA Eagle2 Vision-Language model (VLM), modifies its architecture by replacing causal attention with bidirectional attention, and integrates a ColBERT-style late interaction mechanism to enable fine-grained multimodal retrieval in a shared embedding space. While this mechanism delivers superior retrieval accuracy, it introduces trade-offs in storage and efficiency. We provide a comprehensive analysis of these trade-offs. Additionally, we adopt a two-stage training strategy to enhance the model’s retrieval capabilities.
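
The ColBERT-style late interaction mentioned in the abstract scores a query against a document by keeping one embedding per token and summing, over query tokens, the best match among the document's tokens (often called MaxSim). The sketch below shows that scoring rule on toy 2D vectors; the function names and the example embeddings are illustrative assumptions and say nothing about how the model itself produces its embeddings.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: for each query token take its best-matching doc token, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rank_documents(query_tokens, docs):
    """Return document names sorted by descending MaxSim score."""
    scores = {name: maxsim_score(query_tokens, tokens) for name, tokens in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy token embeddings (hypothetical); a real system stores one vector per text token or image patch.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "page_a": [[1.0, 0.0], [0.0, 1.0]],  # covers both query tokens
    "page_b": [[1.0, 0.0], [1.0, 0.0]],  # covers only the first
}
print(rank_documents(query, docs))
```

Because every document keeps a vector per token rather than a single pooled vector, the index grows with document length, which is the storage and efficiency trade-off the abstract refers to.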


[11] ReLayout: Integrating Relation Reasoning for Content-aware Layout Generation with Multi-modal Large Language Models cs.CV | cs.LG

Jiaxu Tian, Xuehui Yu, Yaoxing Wang, Pan Wang, Guangqian Guo

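TL;DR: ReLayout is a relation-reasoning-based method for content-aware layout generation. By introducing explicit definitions of relations between elements and a layout prototype rebalance sampler, it addresses the weak spatial-relationship understanding of existing LLM-based methods.

Details

Motivation: Existing LLM-based layout generation methods fail to adequately understand the spatial relationships between visual themes and design elements, so the generated layouts lack structure and diversity.

Result: Experiments show that ReLayout outperforms baselines, producing layouts that are better aligned with human aesthetics and more explainable.

Insight: Relation reasoning and prototype rebalancing are key to improving the structure and diversity of generated layouts.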

Abstract: Content-aware layout aims to arrange design elements appropriately on a given canvas to convey information effectively. Recently, the trend for this task has been to leverage large language models (LLMs) to generate layouts automatically, achieving remarkable performance. However, existing LLM-based methods fail to adequately interpret spatial relationships among visual themes and design elements, leading to problems with structure and diversity in layout generation. To address this issue, we introduce ReLayout, a novel method that leverages relation-CoT to generate more reasonable and aesthetically coherent layouts by starting from fundamental design concepts. Specifically, we enhance layout annotations by introducing explicit relation definitions, such as region, salient, and margin between elements, with the goal of decomposing the layout into smaller, structured, and recursive layouts, thereby enabling the generation of more structured layouts. Furthermore, based on these defined relationships, we introduce a layout prototype rebalance sampler, which defines layout prototype features across three dimensions and quantifies distinct layout styles. This sampler addresses uniformity issues in generation that arise from data bias in the prototype distribution balance process. Extensive experimental results verify that ReLayout outperforms baselines and can generate structural and diverse layouts that are more aligned with human aesthetics and more explainable.


[12] Multi-Modal Face Anti-Spoofing via Cross-Modal Feature Transitions cs.CV

Jun-Xiong Chong, Fang-Yu Hsu, Ming-Tsung Hsu, Yi-Ting Lin, Kai-Heng Chien

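TL;DR: The paper proposes a Cross-modal Transition-guided Network (CTNet) to address domain discrepancy and missing modalities in multi-modal face anti-spoofing (FAS). It improves multi-modal FAS performance by learning how cross-modal feature transitions differ between live and spoof samples.

Details

Motivation: Multi-modal FAS suffers from unstable performance because of cross-modal distribution discrepancies and missing modalities. The paper builds its solution on the differing feature-transition behavior of live and spoof samples.

Result: Experiments show that CTNet outperforms existing two-class multi-modal FAS methods across most protocols.

Insight: The difference in feature transitions between live and spoof samples is key to improving multi-modal FAS, and generating complementary features from the RGB modality effectively mitigates missing modalities.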

Abstract: Multi-modal face anti-spoofing (FAS) aims to detect genuine human presence by extracting discriminative liveness cues from multiple modalities, such as RGB, infrared (IR), and depth images, to enhance the robustness of biometric authentication systems. However, because data from different modalities are typically captured by various camera sensors and under diverse environmental conditions, multi-modal FAS often exhibits significantly greater distribution discrepancies across training and testing domains compared to single-modal FAS. Furthermore, during the inference stage, multi-modal FAS confronts even greater challenges when one or more modalities are unavailable or inaccessible. In this paper, we propose a novel Cross-modal Transition-guided Network (CTNet) to tackle the challenges in the multi-modal FAS task. Our motivation stems from the observation that, within a single modality, the visual differences among live faces are typically much smaller than those among spoof faces. Additionally, feature transitions across modalities are more consistent for the live class compared to those between live and spoof classes. Building on this insight, we first propose learning consistent cross-modal feature transitions among live samples to construct a generalized feature space. Next, we introduce learning the inconsistent cross-modal feature transitions between live and spoof samples to effectively detect out-of-distribution (OOD) attacks during inference. To further address the issue of missing modalities, we propose learning complementary infrared (IR) and depth features from the RGB modality as auxiliary modalities. Extensive experiments demonstrate that the proposed CTNet outperforms previous two-class multi-modal FAS methods across most protocols.


[13] GSVR: 2D Gaussian-based Video Representation for 800+ FPS with Hybrid Deformation Field cs.CV

Zhizhuo Pang, Zhihui Ke, Xiaobo Zhou, Tie Qiu

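TL;DR: GSVR is a 2D Gaussian-based video representation that combines a hybrid deformation field with a dynamic-aware time slicing strategy, reaching 800+ FPS decoding and 35+ PSNR with only 2 seconds of training per frame.

Details

Motivation: Existing implicit neural representations for video rely mainly on convolutional networks and suffer from slow decoding and long training times. GSVR aims to solve these problems with an efficient video representation and decoding scheme.

Result: On Bunny, GSVR reaches 800+ FPS decoding and 35+ PSNR with 2 seconds of training per frame. On the Bunny and UVG datasets it decodes 10x faster than other methods, performs comparably to SOTA on video interpolation, and compresses video better than NeRV.

Insight: Combining 2D Gaussians with a hybrid deformation field greatly improves the efficiency and decoding speed of video representation, suggesting a new approach to real-time high-definition video processing.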

Abstract: Implicit neural representations for video have been recognized as a novel and promising form of video representation. Existing works pay more attention to improving video reconstruction quality but little attention to the decoding speed. However, the high computational cost of the convolutional networks used in existing methods leads to low decoding speed. Moreover, these convolution-based video representation methods also suffer from long training time, about 14 seconds per frame to achieve 35+ PSNR on Bunny. To solve the above problems, we propose GSVR, a novel 2D Gaussian-based video representation, which achieves 800+ FPS and 35+ PSNR on Bunny, only needing a training time of 2 seconds per frame. Specifically, we propose a hybrid deformation field to model the dynamics of the video, which combines two motion patterns, namely the tri-plane motion and the polynomial motion, to deal with the coupling of camera motion and object motion in the video. Furthermore, we propose a Dynamic-aware Time Slicing strategy to adaptively divide the video into multiple groups of pictures (GOP) based on the dynamic level of the video in order to handle large camera motion and non-rigid movements. Finally, we propose quantization-aware fine-tuning to avoid performance reduction after quantization and utilize image codecs to compress Gaussians to achieve a compact representation. Experiments on the Bunny and UVG datasets confirm that our method converges much faster than existing methods and also has 10x faster decoding speed compared to other methods. Our method has comparable performance in the video interpolation task to SOTA and attains better video compression performance than NeRV.


[14] PaddleOCR 3.0 Technical Report cs.CV

Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang

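TL;DR: PaddleOCR 3.0 is an open-source toolkit for OCR and document parsing. To meet document understanding needs in the era of large language models, it offers three main solutions, covering multilingual text recognition, hierarchical document parsing, and key information extraction, while remaining efficient and lightweight.

Details

Motivation: To meet the growing demand for document understanding in the era of large language models, PaddleOCR 3.0 aims to provide an efficient, lightweight, and versatile toolkit for OCR and document parsing.

Result: PaddleOCR 3.0's models, with fewer than 100 million parameters, achieve accuracy competitive with mainstream vision-language models while remaining efficient.

Insight: Through careful design and hardware acceleration, lightweight models can rival much larger models on document understanding tasks while being better suited to real-world deployment.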

Abstract: This technical report introduces PaddleOCR 3.0, an Apache-licensed open-source toolkit for OCR and document parsing. To address the growing demand for document understanding in the era of large language models, PaddleOCR 3.0 presents three major solutions: (1) PP-OCRv5 for multilingual text recognition, (2) PP-StructureV3 for hierarchical document parsing, and (3) PP-ChatOCRv4 for key information extraction. Compared to mainstream vision-language models (VLMs), these models with fewer than 100 million parameters achieve competitive accuracy and efficiency, rivaling billion-parameter VLMs. In addition to offering a high-quality OCR model library, PaddleOCR 3.0 provides efficient tools for training, inference, and deployment, supports heterogeneous hardware acceleration, and enables developers to easily build intelligent document applications.


[15] Rethinking Layered Graphic Design Generation with a Top-Down Approach cs.CV

Jingye Chen, Zhaowen Wang, Nanxuan Zhao, Li Zhang, Difan Liu

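TL;DR: The paper proposes Accordion, a graphic design generation framework that makes a first attempt at converting AI-generated images into editable layered designs and refines nonsensical generated text using user prompts. It works top-down, using a visually harmonious reference image as global guidance for generating the layers.

Details

Motivation: Existing AI-generated designs offer high-quality pixel images but lack editability. Non-layered designs, though hard to edit, still inspire designers' choices of layout and text style. Accordion aims to combine the two by converting AI-generated designs into editable layered designs.

Result: Experiments and user studies show that Accordion performs well on the DesignIntention benchmark, including text-to-template, adding text to background, and text de-rendering tasks, and that it is effective at creating design variations.

Insight: A top-down approach gives better global coherence in graphic design generation, and combining a VLM with multiple expert models improves the editability and practicality of generated designs. User prompts further improve the quality of the generated content.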

Abstract: Graphic design is crucial for conveying ideas and messages. Designers usually organize their work into objects, backgrounds, and vectorized text layers to simplify editing. However, this workflow demands considerable expertise. With the rise of GenAI methods, an endless supply of high-quality graphic designs in pixel format has become more accessible, though these designs often lack editability. Despite this, non-layered designs still inspire human designers, influencing their choices in layouts and text styles, ultimately guiding the creation of layered designs. Motivated by this observation, we propose Accordion, a graphic design generation framework taking the first attempt to convert AI-generated designs into editable layered designs, meanwhile refining nonsensical AI-generated text with meaningful alternatives guided by user prompts. It is built around a vision language model (VLM) playing distinct roles in three curated stages. For each stage, we design prompts to guide the VLM in executing different tasks. Distinct from existing bottom-up methods (e.g., COLE and Open-COLE) that gradually generate elements to create layered designs, our approach works in a top-down manner by using the visually harmonious reference image as global guidance to decompose each layer. Additionally, it leverages multiple vision experts such as SAM and element removal models to facilitate the creation of graphic layers. We train our method using the in-house graphic design dataset Design39K, augmented with AI-generated design images coupled with refined ground truth created by a customized inpainting model. Experimental results and user studies by designers show that Accordion generates favorable results on the DesignIntention benchmark, including tasks such as text-to-template, adding text to background, and text de-rendering, and also excels in creating design variations.


[16] OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval cs.CV

Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Xuemeng Song

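TL;DR: OFFSET is a segmentation-based focus shift revision method for composed image retrieval that addresses visual noise interference and the neglected priority of text. It extracts features through dominant portion segmentation and dual focus mapping, and adds a textually guided focus revision module to improve retrieval.

Details

Motivation: Existing composed image retrieval methods ignore the inhomogeneity between dominant and noisy portions of visual data, which degrades query features. They also overlook the priority of text in the image modification process, which causes a visual focus bias. OFFSET aims to solve both problems.

Result: Experiments on four benchmark datasets demonstrate the superiority of OFFSET on composed image retrieval.

Insight: Segmentation effectively reduces noise interference, and textually guided focus revision improves the model's ability to understand and capture the requested modification.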

Abstract: Composed Image Retrieval (CIR) represents a novel retrieval paradigm that is capable of expressing users’ intricate retrieval requirements flexibly. It enables the user to give a multimodal query, comprising a reference image and a modification text, and subsequently retrieve the target image. Notwithstanding the considerable advances made by prevailing methodologies, CIR remains in its nascent stages due to two limitations: 1) inhomogeneity between dominant and noisy portions in visual data is ignored, leading to query feature degradation, and 2) the priority of textual data in the image modification process is overlooked, which leads to a visual focus bias. To address these two limitations, this work presents a focus mapping-based feature extractor, which consists of two modules: dominant portion segmentation and dual focus mapping. It is designed to identify significant dominant portions in images and guide the extraction of visual and textual data features, thereby reducing the impact of noise interference. Subsequently, we propose a textually guided focus revision module, which can utilize the modification requirements implied in the text to perform adaptive focus revision on the reference image, thereby enhancing the perception of the modification focus on the composed features. The aforementioned modules collectively constitute the segmentatiOn-based Focus shiFt reviSion nETwork (OFFSET), and comprehensive experiments on four benchmark datasets substantiate the superiority of our proposed method. The codes and data are available on https://zivchen-ty.github.io/OFFSET.github.io/


[17] Knowledge-guided Complex Diffusion Model for PolSAR Image Classification in Contourlet Domain cs.CV | eess.IV

Junfei Shi, Yu Cheng, Haiyan Jin, Junhuai Li, Zhaolin Xiao

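TL;DR: The paper proposes a knowledge-guided complex-valued diffusion model in the Contourlet domain for PolSAR image classification. By combining multiscale and multidirectional information, it improves classification accuracy and edge preservation.

Details

Motivation: Traditional real-valued diffusion models struggle to capture complex-valued phase information in PolSAR data and tend to lose fine structural detail. The Contourlet transform provides rich multiscale and multidirectional representations that suit PolSAR imagery.

Result: On three real-world PolSAR datasets, the method surpasses existing methods in classification accuracy, particularly in edge preservation and region homogeneity.

Insight: Combining the Contourlet transform with a complex-valued diffusion model is an effective way to handle PolSAR data, and guidance from structural information improves the model's ability to retain detail.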

Abstract: Diffusion models have demonstrated exceptional performance across various domains due to their ability to model and generate complicated data distributions. However, when applied to PolSAR data, traditional real-valued diffusion models face challenges in capturing complex-valued phase information. Moreover, these models often struggle to preserve fine structural details. To address these limitations, we leverage the Contourlet transform, which provides rich multiscale and multidirectional representations well-suited for PolSAR imagery. We propose a structural knowledge-guided complex diffusion model for PolSAR image classification in the Contourlet domain. Specifically, the complex Contourlet transform is first applied to decompose the data into low- and high-frequency subbands, enabling the extraction of statistical and boundary features. A knowledge-guided complex diffusion network is then designed to model the statistical properties of the low-frequency components. During the process, structural information from high-frequency coefficients is utilized to guide the diffusion process, improving edge preservation. Furthermore, multiscale and multidirectional high-frequency features are jointly learned to further boost classification accuracy. Experimental results on three real-world PolSAR datasets demonstrate that our approach surpasses state-of-the-art methods, particularly in preserving edge details and maintaining region homogeneity in complex terrain.


[18] Dynamic Rank Adaptation for Vision-Language Models cs.CVPDF

Jiahui Wang, Qin Xu, Bo Jiang, Bin Luo

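TL;DR: The paper proposes Dynamic Rank Adaptation (DRA), a new adapter method that dynamically allocates adaptation ranks according to feature importance to improve the generalization of pre-trained vision-language models (VLMs) to new classes.

Details

Motivation: Existing prompt-based and adapter-based methods treat all tokens of the image and text encoders equally when fine-tuning VLMs, which leads to overfitting on irrelevant features and hurts recognition of novel concepts.

Result: Experiments show that DRA outperforms existing methods on several benchmarks, including base-to-new classes, cross-dataset evaluation, and domain generalization.

Insight: Dynamically adjusting for feature importance effectively avoids overfitting and improves generalization to new classes.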

Abstract: Pre-trained large vision-language models (VLMs) like CLIP demonstrate impressive generalization ability. Existing prompt-based and adapter-based works have made significant progress in fine-tuning VLMs but still face the challenges of maintaining strong generalization abilities, particularly towards unseen new classes. This limitation partly arises from these methods treating all tokens of the image and text encoder equally, which can lead to overfitting on less informative features (e.g., background noise, template words) and degrade the general representations that are crucial for novel concept recognition. To address this issue, we propose Dynamic Rank Adaptation (DRA), a novel adapter variant method, designed specifically to enhance new class generalization. DRA dynamically allocates adaptation ranks based on the importance of features during training to preserve general knowledge. DRA first employs token importance grouping, using sequence attention to evaluate and group tokens by their importance. Then, we adopt rank adaptation according to the importance of each token group dynamically by assigning higher feature ranks to the more important tokens. Also, we design a new channel response mechanism to prioritize the preservation and adaptation of feature channels identified as the most informative for each instance. In addition, an L1 regularization term is introduced to stabilize the training. Extensive experiments demonstrate the effectiveness and superiority of our proposed DRA over existing works, especially on enhancing the performance of new classes on various benchmarks, including base-new classes, cross-dataset evaluation and domain generalization. The source code will be published after the paper is received.
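The grouping-then-rank-assignment idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the equal-size grouping, and the linear rank schedule are all assumptions made for illustration, and the importance scores stand in for whatever the sequence attention would produce.

```python
def group_tokens_by_importance(scores, num_groups):
    """Split token indices into groups, most important group first.

    `scores` is a hypothetical per-token importance list (for example,
    attention mass); equal-size grouping is an assumption of this sketch.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    group_size = -(-len(order) // num_groups)  # ceiling division
    return [order[g * group_size:(g + 1) * group_size] for g in range(num_groups)]


def allocate_ranks(num_groups, max_rank, min_rank=1):
    """Assign a rank to each group, decreasing linearly from max_rank to min_rank.

    The linear schedule is an assumption; the abstract only states that
    more important tokens receive higher ranks.
    """
    if num_groups == 1:
        return [max_rank]
    step = (max_rank - min_rank) / (num_groups - 1)
    return [round(max_rank - g * step) for g in range(num_groups)]


groups = group_tokens_by_importance([0.1, 0.9, 0.5, 0.3], num_groups=2)
ranks = allocate_ranks(num_groups=2, max_rank=8, min_rank=2)
```

Here the two highest-scoring tokens land in the first group and receive the larger rank, which mirrors the stated goal of spending adaptation capacity on informative tokens.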


[19] Modeling and Reversing Brain Lesions Using Diffusion Models cs.CVPDF

Omar Zamzam, Haleh Akrami, Anand Joshi, Richard Leahy

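TL;DR: The paper proposes a diffusion model-based framework for analyzing and reversing the brain lesion process: it segments abnormal regions, estimates and reverses tissue deformation, and finally inpaints the core lesion area to estimate the healthy pre-lesion brain.

Details

Motivation: Existing brain lesion segmentation methods do not distinguish damaged tissue from deformed tissue, which makes the analysis inaccurate. The study uses diffusion models to address this and to provide more precise lesion analysis and reversal.

Result: Compared with traditional methods, the approach is more accurate in lesion segmentation, characterization, and brain labeling.

Insight: By reversing the lesion process, the method improves segmentation accuracy and offers a reliable tool for clinical and research lesion analysis. Because real pre-lesion data are unavailable, simulating a forward model provides a new route to validation.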

Abstract: Brain lesions are abnormalities or injuries in brain tissue that are often detectable using magnetic resonance imaging (MRI), which reveals structural changes in the affected areas. This broad definition of brain lesions includes areas of the brain that are irreversibly damaged, as well as areas of brain tissue that are deformed as a result of lesion growth or swelling. Despite the importance of differentiating between damaged and deformed tissue, existing lesion segmentation methods overlook this distinction, labeling both of them as a single anomaly. In this work, we introduce a diffusion model-based framework for analyzing and reversing the brain lesion process. Our pipeline first segments abnormal regions in the brain, then estimates and reverses tissue deformations by restoring displaced tissue to its original position, isolating the core lesion area representing the initial damage. Finally, we inpaint the core lesion area to arrive at an estimation of the pre-lesion healthy brain. This proposed framework reverses a forward lesion growth process model that is well-established in biomechanical studies that model brain lesions. Our results demonstrate improved accuracy in lesion segmentation, characterization, and brain labeling compared to traditional methods, offering a robust tool for clinical and research applications in brain lesion analysis. Since pre-lesion healthy versions of abnormal brains are not available in any public dataset for validation of the reverse process, we simulate a forward model to synthesize multiple lesioned brain images.


[20] R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding cs.CVPDF

Joonhyung Park, Peng Tang, Sagnik Das, Srikar Appalaraju, Kunwar Yashraj Singh

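TL;DR: R-VLM is a region-aware vision-language model that localizes GUI elements precisely by zooming in on region proposals and, combined with an IoU-aware loss function, improves the accuracy and generalization of GUI automation tasks.

Details

Motivation: In GUI automation, existing vision models ground elements directly from large, cluttered screenshots, which limits accuracy, and the cross-entropy loss they use does not measure grounding quality effectively.

Result: Grounding accuracy improves by 13% on the ScreenSpot and AgentStudio benchmarks, and the AITW and Mind2Web navigation tasks see absolute gains of 3.2-9.7%.

Insight: Combining vision-language models with object detection techniques addresses GUI element grounding more effectively and offers a new direction for GUI automation.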

Abstract: Visual agent models for automating human activities on Graphical User Interfaces (GUIs) have emerged as a promising research direction, driven by advances in large Vision Language Models (VLMs). A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms. Existing vision-only GUI agents directly ground elements from large and cluttered screenshots, requiring them to process substantial irrelevant information that compromises their accuracy. In addition, these approaches typically employ basic cross-entropy loss for learning grounding objectives, which fails to effectively capture grounding quality compared to established object detection metrics like Intersection-over-Union (IoU). To address these issues, we introduce R-VLM, a novel GUI grounding approach that leverages zoomed-in region proposals for precise element localization. We also propose an IoU-aware objective function that facilitates model convergence toward high IoU predictions. Our approach bridges the gap between VLMs and conventional object detection techniques, improving the state-of-the-art grounding accuracy by 13% across diverse GUI platforms on the GUI grounding benchmarks ScreenSpot and AgentStudio. In addition, our R-VLM approach shows 3.2-9.7% absolute accuracy improvements in GUI navigation tasks on the AITW and Mind2Web benchmarks.
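For readers unfamiliar with the metric named above, the following minimal sketch computes Intersection-over-Union for two axis-aligned boxes. It shows the standard definition only; the paper's IoU-aware objective is not reproduced here, and the `(x1, y1, x2, y2)` box format is an assumption.

```python
def box_iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# A predicted GUI element box versus a ground-truth box.
iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Unlike token-level cross-entropy on coordinate strings, this score changes smoothly with how well the predicted box overlaps the target, which is the property the abstract appeals to.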


[21] MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos cs.CV | cs.AIPDF

Rongsheng Wang, Junying Chen, Ke Ji, Zhenyang Cai, Shunian Chen

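TL;DR: The paper introduces MedVideoCap-55K, the first large-scale, high-quality dataset for medical video generation, and builds the MedGen model on it, achieving leading performance in both visual quality and medical accuracy.

Details

Motivation: Medical video generation is valuable for clinical training, education, and simulation, but existing generative models lack high-quality medical datasets, so their output is inaccurate or unrealistic.

Result: MedGen outperforms open-source models on multiple benchmarks and is on par with commercial systems.

Insight: High-quality datasets are key to improving generative models in the medical domain, and MedVideoCap-55K is an important resource for medical video generation research.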

Abstract: Recent advances in video generation have shown remarkable progress in open-domain settings, yet medical video generation remains largely underexplored. Medical videos are critical for applications such as clinical training, education, and simulation, requiring not only high visual fidelity but also strict medical accuracy. However, current models often produce unrealistic or erroneous content when applied to medical prompts, largely due to the lack of large-scale, high-quality datasets tailored to the medical domain. To address this gap, we introduce MedVideoCap-55K, the first large-scale, diverse, and caption-rich dataset for medical video generation. It comprises over 55,000 curated clips spanning real-world medical scenarios, providing a strong foundation for training generalist medical video generation models. Built upon this dataset, we develop MedGen, which achieves leading performance among open-source models and rivals commercial systems across multiple benchmarks in both visual quality and medical accuracy. We hope our dataset and model can serve as a valuable resource and help catalyze further research in medical video generation. Our code and data are available at https://github.com/FreedomIntelligence/MedGen


[22] Integrated Structural Prompt Learning for Vision-Language Models cs.CVPDF

Jiahui Wang, Qin Xu, Bo Jiang, Bin Luo

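TL;DR: The paper proposes Integrated Structural Prompt (ISP) learning to strengthen information interaction between the text and image modalities of vision-language models (VLMs). Self-structural and cross-structural prompt modules model the relationships between learnable prompts and frozen tokens, and a sample probing module dynamically adjusts loss coefficients to improve generalization to new classes.

Details

Motivation: Existing methods do not fully exploit the structural relationships between learnable prompts and tokens within and across modalities, and they struggle to balance base-class and new-class performance. A more efficient method is needed to strengthen cross-modal interaction and generalization.

Result: Across three settings (base-to-new generalization, cross-dataset evaluation, and domain generalization), ISP performs competitively with state-of-the-art methods.

Insight: Structural relationships within and across modalities are crucial to model performance, and dynamically adjusting loss coefficients helps balance learning between base and new classes and improves generalization.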

Abstract: Prompt learning methods have significantly extended the transferability of pre-trained Vision-Language Models (VLMs) like CLIP for various downstream tasks. These methods adopt handcrafted templates or learnable vectors to provide text or image instructions in fine-tuning VLMs. However, most existing works ignore the structural relationships between learnable prompts and tokens within and between modalities. Moreover, balancing the performance of base and new classes remains a significant challenge. In this paper, we propose an Integrated Structural Prompt (ISP) for VLMs to enhance the interaction of information representations between the text and image branches. ISP introduces self-structural and cross-structural prompt modules to model the structural relationships between learnable prompts and frozen tokens within and across modalities. This enables efficient information transfer while preserving feature stability. Additionally, we propose a sample probing module that dynamically adjusts loss coefficients based on sample difficulty, preventing the model from overfitting to simple samples and improving generalization ability to new classes. Extensive experiments on three widely used settings: base-to-new generalization, cross-dataset evaluation, and domain generalization demonstrate that the proposed ISP achieves competitive performance against state-of-the-art methods.


[23] LiON-LoRA: Rethinking LoRA Fusion to Unify Controllable Spatial and Temporal Generation for Video Diffusion cs.CVPDF

Yisu Zhang, Chenjie Cao, Chaohui Yu, Jianke Zhu

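TL;DR: LiON-LoRA is a new framework that rethinks LoRA fusion through linear scalability, orthogonality, and norm consistency to unify control over spatial and temporal generation in video diffusion models.

Details

Motivation: Existing LoRA methods struggle to control camera trajectories and object motion precisely at the same time in video diffusion models, mainly because of unstable fusion and non-linear scalability.

Result: LiON-LoRA outperforms existing methods in trajectory control accuracy and motion strength adjustment, and generalizes well with little training data.

Insight: Orthogonality and norm consistency of LoRA features are key to spatial and temporal control in video diffusion models.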

Abstract: Video Diffusion Models (VDMs) have demonstrated remarkable capabilities in synthesizing realistic videos by learning from large-scale data. Although vanilla Low-Rank Adaptation (LoRA) can learn specific spatial or temporal movement to drive VDMs with constrained data, achieving precise control over both camera trajectories and object motion remains challenging due to the unstable fusion and non-linear scalability. To address these issues, we propose LiON-LoRA, a novel framework that rethinks LoRA fusion through three core principles: Linear scalability, Orthogonality, and Norm consistency. First, we analyze the orthogonality of LoRA features in shallow VDM layers, enabling decoupled low-level controllability. Second, norm consistency is enforced across layers to stabilize fusion during complex camera motion combinations. Third, a controllable token is integrated into the diffusion transformer (DiT) to linearly adjust motion amplitudes for both cameras and objects with a modified self-attention mechanism to ensure decoupled control. Additionally, we extend LiON-LoRA to temporal generation by leveraging static-camera videos, unifying spatial and temporal controllability. Experiments demonstrate that LiON-LoRA outperforms state-of-the-art methods in trajectory control accuracy and motion strength adjustment, achieving superior generalization with minimal training data. Project Page: https://fuchengsu.github.io/lionlora.github.io/
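Two of the principles named above, orthogonality and norm consistency, can be made concrete with a toy sketch on flat vectors. This is only an illustration under stated assumptions: the helper names are hypothetical, the updates are plain lists rather than LoRA feature tensors, and matching every update to the mean norm is one possible reading of "norm consistency", not necessarily the paper's.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rescale_to_norm(u, target_norm):
    """Scale a vector so its Euclidean norm equals target_norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    if norm_u == 0.0:
        return list(u)
    return [a * (target_norm / norm_u) for a in u]


def fuse_with_matched_norms(updates, weights):
    """Weighted sum of updates after rescaling each to the mean norm (assumed scheme)."""
    norms = [math.sqrt(sum(a * a for a in u)) for u in updates]
    target = sum(norms) / len(norms)
    matched = [rescale_to_norm(u, target) for u in updates]
    return [sum(w * m[i] for w, m in zip(weights, matched)) for i in range(len(updates[0]))]


fused = fuse_with_matched_norms([[3.0, 4.0], [0.0, 10.0]], weights=[1.0, 1.0])
```

With matched norms, scaling a weight changes one update's contribution in proportion, without a larger-norm update dominating the sum.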


[24] Event-RGB Fusion for Spacecraft Pose Estimation Under Harsh Lighting cs.CV | cs.ROPDF

Mohsi Jawaid, Marcus Märtens, Tat-Jun Chin

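TL;DR: The paper proposes a fusion method that combines RGB and event sensors to make spacecraft pose estimation more robust under extreme lighting. A beam-splitter prism provides precise alignment, a RANSAC-based fusion technique is developed, and the dataset will be released publicly to support community research.

Details

Motivation: Spacecraft pose estimation is crucial for autonomous in-space operations, but conventional RGB sensors perform poorly under extreme lighting, while event sensors offer high dynamic range but suffer from limited resolution and signal-to-noise ratio. The paper fuses the two sensors so that their strengths complement each other.

Result: Experiments show that the fusion method makes pose estimation significantly more robust under extreme lighting and supports the use of event sensors for spacecraft pose estimation.

Insight: Event sensors are promising under extreme lighting, and fusing them with RGB sensors improves performance further, which informs sensor selection for future space missions.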

Abstract: Spacecraft pose estimation is crucial for autonomous in-space operations, such as rendezvous, docking and on-orbit servicing. Vision-based pose estimation methods, which typically employ RGB imaging sensors, are a compelling solution for spacecraft pose estimation, but are challenged by harsh lighting conditions, which produce imaging artifacts such as glare, over-exposure, blooming and lens flare. Due to their much higher dynamic range, neuromorphic or event sensors are more resilient to extreme lighting conditions. However, event sensors generally have lower spatial resolution and suffer from reduced signal-to-noise ratio during periods of low relative motion. This work addresses these individual sensor limitations by introducing a sensor fusion approach combining RGB and event sensors. A beam-splitter prism was employed to achieve precise optical and temporal alignment. Then, a RANSAC-based technique was developed to fuse the information from the RGB and event channels to achieve pose estimation that leveraged the strengths of the two modalities. The pipeline was complemented by dropout uncertainty estimation to detect extreme conditions that affect either channel. To benchmark the performance of the proposed event-RGB fusion method, we collected a comprehensive real dataset of RGB and event data for satellite pose estimation in a laboratory setting under a variety of challenging illumination conditions. Encouraging results on the dataset demonstrate the efficacy of our event-RGB fusion approach and further support the usage of event sensors for spacecraft pose estimation. To support community research on this topic, our dataset will be released publicly.


[25] Hyperspectral Anomaly Detection Methods: A Survey and Comparative Study cs.CV | cs.AIPDF

Aayushma Pant, Arbind Agrahari Baniya, Tsz-Kwan Lee, Sunil Aryal

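TL;DR: The paper surveys hyperspectral anomaly detection (HAD) methods, comparing statistical models, representation-based methods, classical machine learning, and deep learning. It evaluates them on 17 benchmark datasets and finds that deep models achieve the highest detection accuracy while statistical models are the fastest.

Details

Motivation: Hyperspectral images are widely used in agriculture, the military, and other fields, but existing anomaly detection methods face high computational complexity and sensitivity to noise, so a systematic comparison and evaluation is needed.

Result: Experiments show that deep learning models achieve the highest detection accuracy, while statistical models are the fastest to compute.

Insight: Future work could combine the accuracy of deep learning with the speed of statistical models, while also addressing noise sensitivity and generalization.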

Abstract: Hyperspectral images are high-dimensional datasets consisting of hundreds of contiguous spectral bands, enabling detailed material and surface analysis. Hyperspectral anomaly detection (HAD) refers to the technique of identifying and locating anomalous targets in such data without prior information about a hyperspectral scene or target spectrum. This technology has seen rapid advancements in recent years, with applications in agriculture, defence, military surveillance, and environmental monitoring. Despite this significant progress, existing HAD methods continue to face challenges such as high computational complexity, sensitivity to noise, and limited generalisation across diverse datasets. This study presents a comprehensive comparison of various HAD techniques, categorising them into statistical models, representation-based methods, classical machine learning approaches, and deep learning models. We evaluated these methods across 17 benchmarking datasets using different performance metrics, such as ROC, AUC, and separability map to analyse detection accuracy, computational efficiency, their strengths, limitations, and directions for future research. The research shows that deep learning models achieved the highest detection accuracy, while statistical models demonstrated exceptional speed across all datasets. This study aims to provide valuable insights for researchers and practitioners working to advance the field of hyperspectral anomaly detection methods.
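As a reminder of what the AUC metric mentioned above measures, here is a minimal sketch that computes the area under the ROC curve as the probability that an anomalous pixel scores higher than a background pixel. The function name and the 0/1 label convention are assumptions, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    `scores` are detector outputs (higher means more anomalous) and
    `labels` are 1 for anomaly pixels and 0 for background pixels.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


auc = roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

A value of 1.0 means every anomaly outranks every background pixel, and 0.5 corresponds to a detector that ranks at chance level.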


[26] SenseShift6D: Multimodal RGB-D Benchmarking for Robust 6D Pose Estimation across Environment and Sensor Variations cs.CVPDF

Yegyu Han, Taegyoon Yoon, Dayeon Woo, Sojeong Kim, Hyung-Sin Kim

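TL;DR: SenseShift6D is the first RGB-D dataset focused on testing how variations in illumination and sensor settings affect 6D pose estimation. It provides multiple sensor configurations and lighting conditions and shows the advantage of real-time sensor control at test time.

Details

Motivation: Existing 6D pose estimation datasets (such as LM-O, YCB-V, and T-Less) were captured under fixed illumination and camera settings and do not reflect real-world lighting and sensor variation. The authors introduce a new dataset, SenseShift6D, to fill this gap.

Result: Experiments show that test-time sensor control is more effective than data augmentation, and that jointly adapting the RGB and depth sensor configurations improves performance further.

Insight: The work shifts the 6D pose evaluation paradigm from data-centered to sensor-aware robustness, laying a foundation for adaptive perception systems in real-world environments.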

Abstract: Recent advances in 6D object-pose estimation have achieved high performance on representative benchmarks such as LM-O, YCB-V, and T-Less. However, these datasets were captured under fixed illumination and camera settings, leaving the impact of real-world variations in illumination, exposure, gain or depth-sensor mode - and the potential of test-time sensor control to mitigate such variations - largely unexplored. To bridge this gap, we introduce SenseShift6D, the first RGB-D dataset that physically sweeps 13 RGB exposures, 9 RGB gains, auto-exposure, 4 depth-capture modes, and 5 illumination levels. For three common household objects (spray, pringles, and tincase), we acquire 101.9k RGB and 10k depth images, which can provide 1,380 unique sensor-lighting permutations per object pose. Experiments with state-of-the-art models on our dataset show that applying sensor control during test-time induces greater performance improvement over digital data augmentation, achieving performance comparable to or better than costly increases in real-world training data quantity and diversity. Adapting either RGB or depth sensors individually is effective, while jointly adapting multimodal RGB-D configurations yields even greater improvements. SenseShift6D extends the 6D-pose evaluation paradigm from data-centered to sensor-aware robustness, laying a foundation for adaptive, self-tuning perception systems capable of operating robustly in uncertain real-world environments. Our dataset is available at: huggingface.co/datasets/Yegyu/SenseShift6D. Associated scripts can be found at: github.com/yegyu-han/SenseShift6D


[27] Normal Patch Retinex Robust Alghoritm for White Balancing in Digital Microscopy cs.CVPDF

Radoslaw Roszczyk, Artur Krupa, Izabella Antoniuk

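TL;DR: The paper proposes an automatic white balance algorithm called Normal Patch Retinex, designed for colour correction of microscope images, and validates it experimentally.

Details

Motivation: To address the difficulty of colour balancing during microscope image acquisition, especially for the stained specimens common in pathology.

Result: The algorithm was validated on 200 microscope images and outperformed classical white balance algorithms from photography.

Insight: The method is particularly suited to stained images in pathology and offers a more effective solution for microscope image processing.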

Abstract: The acquisition of accurately coloured, balanced images in an optical microscope can be a challenge even for experienced microscope operators. This article presents an entirely automatic mechanism for balancing the white level that allows the correction of the microscopic colour images adequately. The results of the algorithm have been confirmed experimentally on a set of two hundred microscopic images. The images contained scans of three microscopic specimens commonly used in pathomorphology. Also, the results achieved were compared with other commonly used white balance algorithms in digital photography. The algorithm applied in this work is more effective than the classical algorithms used in colour photography for microscopic images stained with hematoxylin-phloxine-saffron and for immunohistochemical staining images.
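The abstract does not describe the Normal Patch Retinex algorithm itself, so the sketch below shows only the classical White Patch Retinex baseline of the kind the paper compares against. It assumes that the brightest value in each channel should map to white; the pixel format (a list of RGB tuples) and the function name are illustrative choices.

```python
def white_patch_balance(pixels, white=255):
    """Classical White Patch Retinex on a list of (R, G, B) tuples.

    Each channel is scaled so that its maximum value maps to `white`.
    This is the textbook baseline, not the paper's Normal Patch variant.
    """
    channel_max = [max(p[c] for p in pixels) for c in range(3)]
    balanced = []
    for p in pixels:
        balanced.append(tuple(
            # Leave a channel untouched if it is entirely zero.
            min(white, round(p[c] * white / channel_max[c])) if channel_max[c] > 0 else p[c]
            for c in range(3)
        ))
    return balanced


# A warm colour cast: red is strongest, blue weakest.
corrected = white_patch_balance([(200, 100, 50), (40, 20, 10)])
```

The baseline breaks down when no truly white region is present, which is common in stained specimens and is one plausible reason a microscopy-specific variant is useful.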


[28] DreamArt: Generating Interactable Articulated Objects from a Single Image cs.CVPDF

Ruijie Lu, Yu Liu, Jiaxiang Tang, Junfeng Ni, Yuxiang Wang

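TL;DR: DreamArt proposes a new framework for generating interactable articulated objects from a single image, using a three-stage pipeline to produce high-quality articulated 3D assets.

Details

Motivation: Existing methods focus on surface geometry and texture and neglect part decomposition and articulation modeling, while neural reconstruction methods depend on multi-view or interaction data and scale poorly. DreamArt aims to generate high-fidelity, interactable articulated assets from single-view images.

Result: Experiments show that DreamArt generates high-quality articulated objects with accurate part shapes, faithful appearance, and plausible motion.

Insight: DreamArt shows how combining generative models with optimization can produce complex articulated objects from a single image, offering a scalable solution for AR/VR and embodied AI.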

Abstract: Generating articulated objects, such as laptops and microwaves, is a crucial yet challenging task with extensive applications in Embodied AI and AR/VR. Current image-to-3D methods primarily focus on surface geometry and texture, neglecting part decomposition and articulation modeling. Meanwhile, neural reconstruction approaches (e.g., NeRF or Gaussian Splatting) rely on dense multi-view or interaction data, limiting their scalability. In this paper, we introduce DreamArt, a novel framework for generating high-fidelity, interactable articulated assets from single-view images. DreamArt employs a three-stage pipeline. First, it reconstructs part-segmented and complete 3D object meshes through a combination of image-to-3D generation, mask-prompted 3D segmentation, and part amodal completion. Second, we fine-tune a video diffusion model to capture part-level articulation priors, leveraging movable part masks as prompt and amodal images to mitigate ambiguities caused by occlusion. Finally, DreamArt optimizes the articulation motion, represented by a dual quaternion, and conducts global texture refinement and repainting to ensure coherent, high-quality textures across all parts. Experimental results demonstrate that DreamArt effectively generates high-quality articulated objects, possessing accurate part shape, high appearance fidelity, and plausible articulation, thereby providing a scalable solution for articulated asset generation. Our project page is available at https://dream-art-0.github.io/DreamArt/.


[29] TalkFashion: Intelligent Virtual Try-On Assistant Based on Multimodal Large Language Model cs.CVPDF

Yujie Hu, Xuanyu Zhang, Weiqi Li, Jian Zhang

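TL;DR: TalkFashion proposes an intelligent virtual try-on assistant based on a multimodal large language model. It performs multifunctional virtual try-on from text instructions, including full outfit changes and local editing, addressing the inflexibility of earlier methods.

Details

Motivation: Earlier virtual try-on methods relied mainly on end-to-end networks for a single task and lacked versatility and flexibility. The paper uses the comprehension ability of multimodal large language models to enable multifunctional try-on guided by text instructions alone.

Result: Experiments show better semantic consistency and visual quality than existing methods.

Insight: Multimodal large language models can make virtual try-on considerably more flexible and automated while reducing the need for manual user input.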

Abstract: Virtual try-on has made significant progress in recent years. This paper addresses how to achieve multifunctional virtual try-on guided solely by text instructions, including full outfit change and local editing. Previous methods primarily relied on end-to-end networks to perform single try-on tasks, lacking versatility and flexibility. We propose TalkFashion, an intelligent try-on assistant that leverages the powerful comprehension capabilities of large language models to analyze user instructions and determine which task to execute, thereby activating different processing pipelines accordingly. Additionally, we introduce an instruction-based local repainting model that eliminates the need for users to manually provide masks. With the help of multi-modal models, this approach achieves fully automated local editing, enhancing the flexibility of editing tasks. The experimental results demonstrate better semantic consistency and visual quality compared to the current methods.


[30] SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning cs.CVPDF

Xin Hu, Ke Qin, Guiduo Duan, Ming Li, Yuan-Fang Li

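TL;DR: SPADE proposes a spatial-aware denoising network that combines long-range and local context reasoning to improve open-vocabulary panoptic scene graph generation.

Details

Motivation: Existing VLM-based methods for open-vocabulary panoptic scene graph generation are limited in spatial relation reasoning, which leads to poor relation prediction.

Result: On the PSG and Visual Genome datasets, SPADE outperforms existing methods in both closed- and open-set scenarios, particularly on spatial relationship prediction.

Insight: The inversion process of diffusion models preserves spatial structure effectively; combined with a transformer's long-range and local reasoning, it significantly improves relation prediction in the open-vocabulary setting.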

Abstract: Panoptic Scene Graph Generation (PSG) integrates instance segmentation with relation understanding to capture pixel-level structural relationships in complex scenes. Although recent approaches leveraging pre-trained vision-language models (VLMs) have significantly improved performance in the open-vocabulary setting, they commonly ignore the inherent limitations of VLMs in spatial relation reasoning, such as difficulty in distinguishing object relative positions, which results in suboptimal relation prediction. Motivated by the denoising diffusion model’s inversion process in preserving the spatial structure of input images, we propose the SPADE (SPatial-Aware Denoising-nEtwork) framework, a novel approach for open-vocabulary PSG. SPADE consists of two key steps: (1) inversion-guided calibration for the UNet adaptation, and (2) spatial-aware context reasoning. In the first step, we calibrate a general pre-trained teacher diffusion model into a PSG-specific denoising network with cross-attention maps derived during inversion through a lightweight LoRA-based fine-tuning strategy. In the second step, we develop a spatial-aware relation graph transformer that captures both local and long-range contextual information, facilitating the generation of high-quality relation queries. Extensive experiments on benchmark PSG and Visual Genome datasets demonstrate that SPADE outperforms state-of-the-art methods in both closed- and open-set scenarios, particularly for spatial relationship prediction.


[31] DREAM: Document Reconstruction via End-to-end Autoregressive Model cs.CVPDF

Xin Li, Mingming Gong, Yunfei Wu, Jianxin Dai, Antai Guo

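TL;DR: The paper proposes DREAM, an end-to-end autoregressive model for document reconstruction that addresses error propagation and missing layout information in existing methods and performs well on several subtasks.

Details

Motivation: Document reconstruction is an important task in document analysis and recognition, but current multi-stage methods suffer from error propagation and existing end-to-end methods do not preserve layout information. This motivates a new solution.

Result: Experiments show that DREAM achieves the best performance on document reconstruction and performs strongly on subtasks such as layout analysis and text recognition.

Insight: An end-to-end autoregressive model can integrate document element information effectively, and standardized task definitions and evaluation metrics help move the field forward.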

Abstract: Document reconstruction constitutes a significant facet of document analysis and recognition, a field that has been progressively accruing interest within the scholarly community. A multitude of these researchers employ an array of document understanding models to generate predictions on distinct subtasks, subsequently integrating their results into a holistic document reconstruction format via heuristic principles. Nevertheless, these multi-stage methodologies are hindered by the phenomenon of error propagation, resulting in suboptimal performance. Furthermore, contemporary studies utilize generative models to extract the logical sequence of plain text, tables and mathematical expressions in an end-to-end process. However, this approach is deficient in preserving the information related to element layouts, which are vital for document reconstruction. To surmount these limitations, we present an innovative autoregressive model specifically designed for document reconstruction, referred to as Document Reconstruction via End-to-end Autoregressive Model (DREAM). DREAM transmutes the text image into a sequence of document reconstruction in a comprehensive, end-to-end process, encapsulating a broader spectrum of document element information. In addition, we establish a standardized definition of the document reconstruction task, and introduce a novel Document Similarity Metric (DSM) and DocRec1K dataset for assessing the performance of the task. Empirical results substantiate that our methodology attains unparalleled performance in the realm of document reconstruction. Furthermore, the results on a variety of subtasks, encompassing document layout analysis, text recognition, table structure recognition, formula recognition and reading order detection, indicate that our model is competitive and compatible with various tasks.


[32] Towards Solar Altitude Guided Scene Illumination cs.CV | cs.AIPDF

Samed Doğan, Maximilian Hoh, Nico Leuze, Nicolas R. -Peña, Alfred Schöttl

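TL;DR: The paper proposes using solar altitude to guide scene illumination, generating synthetic camera sensor data with a global conditioning variable to address the scarcity of labeled data for daytime lighting variation.

Details

Motivation: Real-world data acquisition is costly and constrained, and labeled data covering daytime lighting variation is scarce, so synthetic data is needed to fill the gap.

Result: The method accurately captures lighting characteristics and illumination-dependent image noise in the context of diffusion models.

Insight: Solar altitude is a simple global conditioning variable that needs no extra labeling and can effectively guide illumination in synthetic data generation.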

Abstract: The development of safe and robust autonomous driving functions is heavily dependent on large-scale, high-quality sensor data. However, real-world data acquisition demands intensive human labor and is strongly limited by factors such as labeling cost, driver safety protocols and diverse scenario coverage. Thus, multiple lines of work focus on the conditional generation of synthetic camera sensor data. We identify a significant gap in research regarding daytime variation, presumably caused by the scarcity of available labels. Consequently, we present the solar altitude as a global conditioning variable. It is readily computable from latitude-longitude coordinates and local time, eliminating the need for extensive manual labeling. Our work is complemented by a tailored normalization approach, targeting the sensitivity of daylight towards small numeric changes in altitude. We demonstrate its ability to accurately capture lighting characteristics and illumination-dependent image noise in the context of diffusion models.
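To make the claim that solar altitude is computable from coordinates and time concrete, here is a rough sketch using a common textbook approximation. It is not the paper's implementation: the declination formula is a simple cosine fit, the equation of time and atmospheric refraction are ignored (so results can be off by a degree or more), and the function name is hypothetical. The paper's normalization step is not reproduced.

```python
import math
from datetime import datetime, timezone


def solar_altitude_deg(lat_deg, lon_deg, when_utc):
    """Approximate solar altitude in degrees for a UTC datetime.

    Assumes a simple cosine model for solar declination and uses
    longitude-corrected mean solar time (no equation of time).
    """
    day_of_year = when_utc.timetuple().tm_yday
    # Declination: about -23.44 deg near the December solstice.
    decl = math.radians(-23.44) * math.cos(2.0 * math.pi / 365.0 * (day_of_year + 10))

    utc_hours = when_utc.hour + when_utc.minute / 60.0 + when_utc.second / 3600.0
    solar_hours = utc_hours + lon_deg / 15.0  # 15 degrees of longitude per hour
    hour_angle = math.radians(15.0 * (solar_hours - 12.0))

    lat = math.radians(lat_deg)
    sin_alt = (math.sin(lat) * math.sin(decl)
               + math.cos(lat) * math.cos(decl) * math.cos(hour_angle))
    return math.degrees(math.asin(max(-1.0, min(1.0, sin_alt))))


# Equator at solar noon near the March equinox: the sun is almost overhead.
noon = solar_altitude_deg(0.0, 0.0, datetime(2024, 3, 20, 12, 0, tzinfo=timezone.utc))
```

Because only a timestamp and a GPS position are needed, a value like this can be attached to every frame of a driving dataset without manual labeling, which is the property the abstract relies on.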


[33] Empowering Bridge Digital Twins by Bridging the Data Gap with a Unified Synthesis Framework cs.CV | cs.AIPDF

Wang Wang, Mingyu Shi, Jun Jiang, Wenqian Ma, Chong Liu

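TL;DR: The paper proposes a systematic framework for generating bridge digital twin data with complete point clouds, supporting the training of semantic segmentation and point cloud completion models and performing well in real-world bridge analysis.

Details

Motivation: Bridges, as critical transportation infrastructure, face aging and deterioration. Manual inspection is inefficient, and existing 3D point cloud techniques are limited by missing data and occlusion, so a solution that generates complete, richly annotated data is needed.

Result: PointNet++ trained on the synthetic data reaches 84.2% mIoU on real-world bridge semantic segmentation, and a fine-tuned KT-Net performs strongly on component completion.

Insight: The study provides a new methodology and a foundational dataset for 3D visual analysis of bridge structures, advancing automated infrastructure management and maintenance.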

Abstract: As critical transportation infrastructure, bridges face escalating challenges from aging and deterioration, while traditional manual inspection methods suffer from low efficiency. Although 3D point cloud technology provides a new data-driven paradigm, its application potential is often constrained by the incompleteness of real-world data, which results from missing labels and scanning occlusions. To overcome the bottleneck of insufficient generalization in existing synthetic data methods, this paper proposes a systematic framework for generating 3D bridge data. This framework can automatically generate complete point clouds featuring component-level instance annotations, high-fidelity color, and precise normal vectors. It can be further extended to simulate the creation of diverse and physically realistic incomplete point clouds, designed to support the training of segmentation and completion networks, respectively. Experiments demonstrate that a PointNet++ model trained with our synthetic data achieves a mean Intersection over Union (mIoU) of 84.2% in real-world bridge semantic segmentation. Concurrently, a fine-tuned KT-Net exhibits superior performance on the component completion task. This research offers an innovative methodology and a foundational dataset for the 3D visual analysis of bridge structures, holding significant implications for advancing the automated management and maintenance of infrastructure.
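The mIoU figure quoted above is a per-class average. The following minimal sketch shows one common way to compute it from per-point labels; the function name is illustrative, and skipping classes that appear in neither prediction nor ground truth is an assumption, since evaluation protocols differ on that point.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union over classes for per-point labels.

    `pred` and `target` are equal-length sequences of integer class ids.
    Classes absent from both are skipped (an assumed convention).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Four points, two bridge component classes (for example deck = 0, pier = 1).
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
```

Averaging per class rather than per point keeps small components from being swamped by large ones such as the deck.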


[34] Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models cs.CV | CS | I.2.10PDF

Léa Dubois, Klaus Schmidt, Chengyu Wang, Ji-Hoon Park, Lin Wang

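TL;DR: The paper proposes a new framework that fuses a vision foundation model (VFM) with a large language model (LLM) to tackle high-level video cognition tasks such as causal reasoning and future prediction, addressing current models' lack of commonsense world knowledge.

Details

Motivation: Current video understanding models are good at recognizing "what is happening" but fall short on high-level cognitive tasks such as causal reasoning and future prediction, mainly because they lack commonsense world knowledge.

Result: The model achieves state-of-the-art performance on several challenging benchmarks and shows strong zero-shot generalization.

Insight: The work moves machine perception from simple recognition toward genuine cognitive understanding, paving the way for more capable AI systems in robotics, human-computer interaction, and related areas.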

Abstract: Current video understanding models excel at recognizing “what” is happening but fall short in high-level cognitive tasks like causal reasoning and future prediction, a limitation rooted in their lack of commonsense world knowledge. To bridge this cognitive gap, we propose a novel framework that synergistically fuses a powerful Vision Foundation Model (VFM) for deep visual perception with a Large Language Model (LLM) serving as a knowledge-driven reasoning core. Our key technical innovation is a sophisticated fusion module, inspired by the Q-Former architecture, which distills complex spatiotemporal and object-centric visual features into a concise, language-aligned representation. This enables the LLM to effectively ground its inferential processes in direct visual evidence. The model is trained via a two-stage strategy, beginning with large-scale alignment pre-training on video-text data, followed by targeted instruction fine-tuning on a curated dataset designed to elicit advanced reasoning and prediction skills. Extensive experiments demonstrate that our model achieves state-of-the-art performance on multiple challenging benchmarks. Notably, it exhibits remarkable zero-shot generalization to unseen reasoning tasks, and our in-depth ablation studies validate the critical contribution of each architectural component. This work pushes the boundary of machine perception from simple recognition towards genuine cognitive understanding, paving the way for more intelligent and capable AI systems in robotics, human-computer interaction, and beyond.


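[35] D-FCGS: Feedforward Compression of Dynamic Gaussian Splatting for Free-Viewpoint Videos cs.CV | cs.MM

Wenkang Zhang, Yan Zhao, Qiang Wang, Li Song, Zhengxue Cheng

TL;DR: D-FCGS is a feedforward framework for compressing dynamic Gaussian Splatting. It uses I-P frame coding with sparse control points to extract inter-frame motion and a dual prior-aware entropy model for efficient compression, with no per-scene optimization.

Details

Motivation: Free-viewpoint video (FVV) requires efficient compression of dynamic 3D representations, but existing methods often couple scene reconstruction with optimization-dependent coding, which limits generalizability.

Result: It achieves over 40 times compression in under 2 seconds while preserving visual quality across viewpoints, matching the rate-distortion performance of optimization-based methods.

Insight: Feedforward methods show promise for compressing dynamic 3D representations and offer a scalable solution for FVV transmission and storage.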

Abstract: Free-viewpoint video (FVV) enables immersive 3D experiences, but efficient compression of dynamic 3D representations remains a major challenge. Recent advances in 3D Gaussian Splatting (3DGS) and its dynamic extensions have enabled high-fidelity scene modeling. However, existing methods often couple scene reconstruction with optimization-dependent coding, which limits generalizability. This paper presents Feedforward Compression of Dynamic Gaussian Splatting (D-FCGS), a novel feedforward framework for compressing temporally correlated Gaussian point cloud sequences. Our approach introduces a Group-of-Frames (GoF) structure with I-P frame coding, where inter-frame motions are extracted via sparse control points. The resulting motion tensors are compressed in a feedforward manner using a dual prior-aware entropy model that combines hyperprior and spatial-temporal priors for accurate rate estimation. For reconstruction, we perform control-point-guided motion compensation and employ a refinement network to enhance view-consistent fidelity. Trained on multi-view video-derived Gaussian frames, D-FCGS generalizes across scenes without per-scene optimization. Experiments show that it matches the rate-distortion performance of optimization-based methods, achieving over 40 times compression in under 2 seconds while preserving visual quality across viewpoints. This work advances feedforward compression for dynamic 3DGS, paving the way for scalable FVV transmission and storage in immersive applications.


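[36] GeoMag: A Vision-Language Model for Pixel-level Fine-Grained Remote Sensing Image Parsing cs.CV

Xianzhi Ma, Jianhui Li, Changhua Pei, Hao Liu

TL;DR: GeoMag is an end-to-end, general-purpose vision-language framework for multi-granularity parsing of remote sensing images. It dynamically adjusts its attention scope and applies adaptive cropping to improve small-object recognition and reduce computational cost.

Details

Motivation: Existing remote sensing vision-language models perform poorly on pixel-level tasks and small-object recognition, and they are computationally expensive on high-resolution images, so a more efficient approach is needed.

Result: Across 10 benchmarks, GeoMag excels at pixel-level tasks while remaining competitive on tasks of other granularities.

Insight: Dynamic attention and adaptive cropping let the model process high-resolution remote sensing images more efficiently while improving small-object recognition.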

Abstract: The application of Vision-Language Models (VLMs) in remote sensing (RS) image understanding has achieved notable progress, demonstrating the basic ability to recognize and describe geographical entities. However, existing RS-VLMs are mostly limited to image-level and region-level tasks, lacking the capability to handle pixel-level tasks and performing poorly in small-object recognition scenarios. Moreover, RS-VLMs consume significant computational resources when processing high-resolution RS images, further restricting their practical applicability. In this context, we propose GeoMag (Geographical Magnifier), an end-to-end general-purpose large model framework for RS. GeoMag dynamically focuses the attention scope based on prompt semantics to effectively perform remote sensing image parsing across multiple levels of granularity. This method introduces Task-driven Multi-granularity Resolution Adjustment (TMRA) and Prompt-guided Semantic-aware Cropping (PSC), which adaptively reduce the spatial resolution of task-irrelevant regions while enhancing the visual representation of task-relevant areas. This approach improves the model’s perception of critical target regions, suppresses background redundancy, and reduces the computational cost of interpreting high-resolution RS imagery. Extensive comparative experiments on 10 benchmarks demonstrate that GeoMag not only excels in handling pixel-level tasks but also maintains competitive performance across tasks of other granularities compared to existing RS-VLMs.


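[37] What You Have is What You Track: Adaptive and Robust Multimodal Tracking cs.CV

Yuedong Tan, Jiawei Shao, Eduard Zamfir, Ruanjun Li, Zhaochong An

TL;DR: The paper studies the role of multimodal data in visual tracking and proposes a flexible framework that handles missing data. It combines a Heterogeneous Mixture-of-Experts fusion mechanism with adaptive complexity and a video-level masking strategy to achieve robust multimodal tracking.

Details

Motivation: Multimodal data improves robustness in visual tracking, but sensor synchronization problems often cause data to be missing. Existing trackers have rigid architectures that cannot adapt to missing modalities, so their performance degrades significantly.

Result: The model achieves SOTA performance on 9 benchmarks, in both complete and missing-modality settings.

Insight: A tracker should not only adapt to missing data but also adjust dynamically to scene complexity; combining the mixture-of-experts mechanism with the masking strategy is key.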

Abstract: Multimodal data is known to be helpful for visual tracking by improving robustness to appearance variations. However, sensor synchronization challenges often compromise data availability, particularly in video settings where shortages can be temporal. Despite its importance, this area remains underexplored. In this paper, we present the first comprehensive study on tracker performance with temporally incomplete multimodal data. Unsurprisingly, under such a circumstance, existing trackers exhibit significant performance degradation, as their rigid architectures lack the adaptability needed to effectively handle missing modalities. To address these limitations, we propose a flexible framework for robust multimodal tracking. We venture that a tracker should dynamically activate computational units based on missing data rates. This is achieved through a novel Heterogeneous Mixture-of-Experts fusion mechanism with adaptive complexity, coupled with a video-level masking strategy that ensures both temporal consistency and spatial completeness which is critical for effective video tracking. Surprisingly, our model not only adapts to varying missing rates but also adjusts to scene complexity. Extensive experiments show that our model achieves SOTA performance across 9 benchmarks, excelling in both conventional complete and missing modality settings. The code and benchmark will be publicly available at https://github.com/supertyd/FlexTrack/tree/main.


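[38] On the Effectiveness of Methods and Metrics for Explainable AI in Remote Sensing Image Scene Classification cs.CV | cs.AI

Jonas Klotz, Tom Burgert, Begüm Demir

TL;DR: The paper studies the effectiveness of explainable AI (xAI) methods and evaluation metrics in remote sensing image scene classification, analyzes the limitations of five feature attribution methods and ten evaluation metrics, and provides selection guidelines for remote sensing scenes.

Details

Motivation: In remote sensing scene classification, most xAI methods and evaluation metrics are borrowed directly from the natural-image domain, yet their suitability has not been verified. The paper aims to fill this gap by analyzing how effective these methods and metrics are on remote sensing images.

Result: Perturbation-based methods (such as Occlusion and LIME) depend on the perturbation baseline and on the spatial characteristics of the scene; gradient-based methods (such as GradCAM) struggle in multi-label scenes; some metrics (such as localization and complexity metrics) are unreliable for classes with a large spatial extent. Robustness and randomization metrics are more stable.

Insight: xAI methods and metrics transferred directly from natural images may not suit remote sensing scenes, so methods should be chosen according to the characteristics of remote sensing imagery (such as spatial distribution and multiple labels). Robustness and randomization metrics are the more reliable choice.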

Abstract: The development of explainable artificial intelligence (xAI) methods for scene classification problems has attracted great attention in remote sensing (RS). Most xAI methods and the related evaluation metrics in RS are initially developed for natural images considered in computer vision (CV), and their direct usage in RS may not be suitable. To address this issue, in this paper, we investigate the effectiveness of explanation methods and metrics in the context of RS image scene classification. In detail, we methodologically and experimentally analyze ten explanation metrics spanning five categories (faithfulness, robustness, localization, complexity, randomization), applied to five established feature attribution methods (Occlusion, LIME, GradCAM, LRP, and DeepLIFT) across three RS datasets. Our methodological analysis identifies key limitations in both explanation methods and metrics. The performance of perturbation-based methods, such as Occlusion and LIME, heavily depends on perturbation baselines and spatial characteristics of RS scenes. Gradient-based approaches like GradCAM struggle when multiple labels are present in the same image, while some relevance propagation methods (LRP) can distribute relevance disproportionately relative to the spatial extent of classes. Analogously, we find limitations in evaluation metrics. Faithfulness metrics share the same problems as perturbation-based methods. Localization metrics and complexity metrics are unreliable for classes with a large spatial extent. In contrast, robustness metrics and randomization metrics consistently exhibit greater stability. Our experimental results support these methodological findings. Based on our analysis, we provide guidelines for selecting explanation methods, metrics, and hyperparameters in the context of RS image scene classification.
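The baseline dependence of perturbation-based attribution noted above can be seen in a minimal Occlusion sketch. This is an illustration under assumptions, not the paper's experimental setup: `mean_intensity` is a hypothetical stand-in for a classifier score, and the patch size and baselines are arbitrary.

```python
from typing import Callable, List

Image = List[List[float]]


def mean_intensity(image: Image) -> float:
    """Toy stand-in for a classifier score (hypothetical, for illustration only)."""
    values = [v for row in image for v in row]
    return sum(values) / len(values)


def occlusion_attribution(
    image: Image,
    score_fn: Callable[[Image], float],
    patch: int = 2,
    baseline: float = 0.0,
) -> Image:
    """Attribute to each pixel the score drop caused by occluding its patch."""
    height, width = len(image), len(image[0])
    base_score = score_fn(image)
    attribution = [[0.0] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            rows = range(top, min(top + patch, height))
            cols = range(left, min(left + patch, width))
            # Replace one patch with the baseline value and re-score.
            occluded = [row[:] for row in image]
            for y in rows:
                for x in cols:
                    occluded[y][x] = baseline
            drop = base_score - score_fn(occluded)
            for y in rows:
                for x in cols:
                    attribution[y][x] = drop
    return attribution


image = [[1.0, 1.0, 0.0, 0.0],
         [1.0, 1.0, 0.0, 0.0]]
black = occlusion_attribution(image, mean_intensity, baseline=0.0)
white = occlusion_attribution(image, mean_intensity, baseline=1.0)
```

With a black baseline the bright left patch receives all the attribution, while with a white baseline the same patch receives none and the dark right patch gets a negative value, so the explanation changes with the baseline even though the image and the scoring function do not.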


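[39] High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning cs.CV

Xinyu Huang, Yuhao Dong, Weiwei Tian, Bo Li, Rui Feng

TL;DR: To address visual redundancy when large multi-modal models (LMMs) process high-resolution images, the paper proposes MGPO, a reinforcement learning method built on a multi-turn conversation framework. It improves reasoning by automatically cropping key visual regions, without costly grounding annotations.

Details

Motivation: Existing LMMs are inefficient on high-resolution images because the inputs yield too many visual tokens, most of which are irrelevant to the task. In addition, supervised fine-tuning requires costly annotated data, which limits scalability.

Result: Trained on standard visual question answering data, MGPO markedly improves grounding ability over GRPO, with gains of 5.4% on MME-Realworld and 5.2% on V* Bench. After MGPO post-training, Qwen2.5-VL-7B surpasses OpenAI's o1 and GPT-4o on the OOD V* Bench.

Insight: Reinforcement learning can improve the visual grounding ability of LMMs without additional annotations; the multi-turn conversation framework effectively addresses the cold start problem.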

Abstract: State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images, as these inputs are converted into an enormous number of visual tokens, many of which are irrelevant to the downstream task. In this paper, we propose Multi-turn Grounding-based Policy Optimization (MGPO), an end-to-end reinforcement learning (RL) framework that enables LMMs to iteratively focus on key visual regions by automatically cropping sub-images, based on model-predicted grounding coordinates within a multi-turn conversation framework. Compared to supervised fine-tuning (SFT), which requires costly additional grounding annotations, our approach shows that robust grounding abilities can emerge in LMMs during RL training, leveraging only a binary reward function derived from the correctness of the final answer. Additionally, we observe that LMMs struggle to autonomously trigger visual grounding during the rollout process. To address this cold start problem, we design a multi-turn conversational template and restrict policy loss computation to model outputs generated across multiple dialogue rounds, thereby promoting stable optimization. Extensive experiments demonstrate that, when trained on standard visual-question-short answering data without grounding annotations, MGPO effectively elicits stronger grounding capabilities compared to GRPO, leading to a 5.4% improvement on in-distribution MME-Realworld and a 5.2% improvement on the challenging out-of-distribution (OOD) V* Bench. Notably, MGPO post-training on Qwen2.5-VL-7B with 21K samples surpasses OpenAI’s o1 and GPT-4o models on the OOD V* Bench. Code is available at https://github.com/EvolvingLMMs-Lab/MGPO.
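Two pieces of the loop described above are simple enough to sketch: cropping a sub-image from predicted grounding coordinates and scoring the final answer with a binary reward. This is a minimal sketch under assumptions, not the MGPO implementation: the `(x0, y0, x1, y1)` pixel box format, the nested-list image, and the case-insensitive exact-match rule are all hypothetical choices.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed format: (x0, y0, x1, y1) in pixels
Image = List[List[int]]


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a predicted box to the image bounds and order its corners."""
    x0, y0, x1, y1 = box
    x0, x1 = sorted((max(0, min(x0, width)), max(0, min(x1, width))))
    y0, y1 = sorted((max(0, min(y0, height)), max(0, min(y1, height))))
    return x0, y0, x1, y1


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image selected by the (clamped) grounding box."""
    height = len(image)
    width = len(image[0]) if image else 0
    x0, y0, x1, y1 = clamp_box(box, width, height)
    return [row[x0:x1] for row in image[y0:y1]]


def binary_reward(predicted: str, reference: str) -> float:
    """1.0 if the final answer matches the reference, else 0.0 (assumed matching rule)."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


image = [[r * 10 + c for c in range(4)] for r in range(3)]
sub_image = crop(image, (1, 0, 3, 2))
```

Note that the reward looks only at the final answer, which is why no grounding annotations are needed: a poorly placed box is penalized only indirectly, when it leads to a wrong answer.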


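[40] Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation cs.CV

Quanzhu Niu, Yikang Zhou, Shihao Chen, Tao Zhang, Shunping Ji

TL;DR: The paper introduces geometric awareness (depth estimation) to make video instance segmentation (VIS) more robust. It studies three depth integration methods, two of which (EDC and SV) significantly improve performance; EDC sets a new state of the art of 56.2 AP on the OVIS benchmark.

Details

Motivation: VIS struggles with occlusion, motion blur, and appearance variation; the paper aims to improve segmentation robustness with the geometric information that depth estimation provides.

Result: Experiments show that EDC and SV significantly improve VIS robustness. With a Swin-L backbone, EDC reaches 56.2 AP on the OVIS benchmark, a new state-of-the-art result.

Insight: Depth information (geometric cues) is a key factor for robust video understanding, especially under challenges such as occlusion and motion blur.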

Abstract: Video Instance Segmentation (VIS) fundamentally struggles with pervasive challenges including object occlusions, motion blur, and appearance variations during temporal association. To overcome these limitations, this work introduces geometric awareness to enhance VIS robustness by strategically leveraging monocular depth estimation. We systematically investigate three distinct integration paradigms. The Expanding Depth Channel (EDC) method concatenates the depth map as an input channel to segmentation networks; Sharing ViT (SV) designs a uniform ViT backbone shared between the depth estimation and segmentation branches; Depth Supervision (DS) uses depth prediction as an auxiliary training guide for feature learning. Though DS exhibits limited effectiveness, benchmark evaluations demonstrate that EDC and SV significantly enhance the robustness of VIS. With a Swin-L backbone, our EDC method achieves 56.2 AP, setting a new state-of-the-art result on the OVIS benchmark. This work establishes depth cues as critical enablers for robust video understanding.
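The EDC idea of feeding depth as an extra input channel can be illustrated with plain nested lists. This is only a sketch of the channel concatenation step; the function name, the height-by-width-by-channel layout, and the absence of any depth normalization are assumptions made here, not details from the paper.

```python
from typing import List, Sequence

RGBImage = List[List[Sequence[float]]]  # height x width x 3
DepthMap = List[List[float]]            # height x width


def expand_depth_channel(rgb: RGBImage, depth: DepthMap) -> List[List[List[float]]]:
    """Append the depth value of each pixel as a fourth channel (RGB -> RGB-D)."""
    same_height = len(rgb) == len(depth)
    same_width = all(len(r) == len(d) for r, d in zip(rgb, depth))
    if not (same_height and same_width):
        raise ValueError("RGB image and depth map must have the same spatial size")
    return [
        [list(pixel) + [d] for pixel, d in zip(rgb_row, depth_row)]
        for rgb_row, depth_row in zip(rgb, depth)
    ]


rgb = [[(1, 2, 3), (4, 5, 6)]]
depth = [[0.5, 0.25]]
rgbd = expand_depth_channel(rgb, depth)
```

A segmentation network consuming this input would need its first layer widened from three to four input channels; how the paper initializes the extra weights is not stated in the abstract.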


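[41] High-Fidelity and Generalizable Neural Surface Reconstruction with Sparse Feature Volumes cs.CV

Aoxiang Fan, Corentin Dumery, Nicolas Talabot, Hieu Le, Pascal Fua

TL;DR: The paper proposes a neural surface reconstruction method based on sparse feature volumes that substantially improves reconstruction resolution and memory efficiency, needs no per-scene optimization, and achieves higher accuracy than existing methods on public datasets.

Details

Motivation: Current neural surface reconstruction methods based on dense 3D feature volumes hit memory and compute bottlenecks as voxel resolution increases, which limits reconstruction quality. The paper addresses this with a sparse representation.

Result: On public datasets, the method reduces storage requirements by more than 50 times, supports reconstruction at $512^3$ resolution, and achieves higher reconstruction accuracy than existing methods.

Insight: Sparse representations are an effective route to high-resolution neural surface reconstruction, and custom algorithms can overcome dense-volume assumptions to improve memory and compute efficiency.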

Abstract: Generalizable neural surface reconstruction has become a compelling technique to reconstruct from few images without per-scene optimization, where dense 3D feature volume has proven effective as a global representation of scenes. However, the dense representation does not scale well to increasing voxel resolutions, severely limiting the reconstruction quality. We thus present a sparse representation method that maximizes memory efficiency and enables significantly higher resolution reconstructions on standard hardware. We implement this through a two-stage approach: first training a network to predict voxel occupancies from posed images and associated depth maps, then computing features and performing volume rendering only in voxels with sufficiently high occupancy estimates. To support this sparse representation, we developed custom algorithms for efficient sampling, feature aggregation, and querying from sparse volumes, overcoming the dense-volume assumptions inherent in existing works. Experiments on public datasets demonstrate that our approach reduces storage requirements by more than 50 times without performance degradation, enabling reconstructions at $512^3$ resolution compared to the typical $128^3$ on similar hardware, and achieving superior reconstruction accuracy over current state-of-the-art methods.
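The second stage, computing features only where the predicted occupancy is high, can be sketched with a dictionary keyed by voxel index. This is a toy illustration under assumptions: the threshold of 0.5, the `feature_fn` callback, and the dictionary storage are hypothetical and stand in for the custom sparse data structures the abstract mentions.

```python
from typing import Callable, Dict, List, Tuple

Voxel = Tuple[int, int, int]


def build_sparse_volume(
    occupancy: List[List[List[float]]],
    feature_fn: Callable[[int, int, int], float],
    threshold: float = 0.5,
) -> Dict[Voxel, float]:
    """Store a feature only for voxels whose occupancy estimate passes the threshold."""
    sparse: Dict[Voxel, float] = {}
    for i, plane in enumerate(occupancy):
        for j, row in enumerate(plane):
            for k, probability in enumerate(row):
                if probability >= threshold:
                    sparse[(i, j, k)] = feature_fn(i, j, k)
    return sparse


def dense_voxel_ratio(high: int, low: int) -> int:
    """How many times more voxels a dense grid needs at the higher resolution."""
    return high ** 3 // low ** 3


occupancy = [
    [[0.9, 0.1], [0.2, 0.7]],
    [[0.0, 0.0], [0.6, 0.4]],
]
volume = build_sparse_volume(occupancy, lambda i, j, k: float(i + j + k))
ratio = dense_voxel_ratio(512, 128)
```

The ratio shows why dense grids stop scaling: going from $128^3$ to $512^3$ multiplies the voxel count by 64, whereas the sparse store grows only with the number of voxels near the surface. This voxel-count ratio is a separate quantity from the storage reduction reported in the abstract.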


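[42] Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation cs.CV

Zhenghao Zhang, Junchao Liao, Xiangyu Meng, Long Qin, Weizhi Wang

TL;DR: Tora2 is an upgraded version of Tora that uses a decoupled personalization extractor and a gated self-attention mechanism to customize both appearance and motion in multi-entity video generation.

Details

Motivation: Existing video generation methods preserve too little detail and align multimodal conditions imprecisely when customizing multiple entities; Tora2 aims to solve these problems.

Result: Tora2 performs well in multi-entity video generation, supports advanced motion control, and is competitive with SOTA customization methods.

Insight: Decoupling and gating are the key innovations for multi-entity customized video generation, and the jointly optimized contrastive loss further improves generation quality.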

Abstract: Recent advances in diffusion transformer models for motion-guided video generation, such as Tora, have shown significant progress. In this paper, we present Tora2, an enhanced version of Tora, which introduces several design improvements to expand its capabilities in both appearance and motion customization. Specifically, we introduce a decoupled personalization extractor that generates comprehensive personalization embeddings for multiple open-set entities, better preserving fine-grained visual details compared to previous methods. Building on this, we design a gated self-attention mechanism to integrate trajectory, textual description, and visual information for each entity. This innovation significantly reduces misalignment in multimodal conditioning during training. Moreover, we introduce a contrastive loss that jointly optimizes trajectory dynamics and entity consistency through explicit mapping between motion and personalization embeddings. Tora2 is, to our best knowledge, the first method to achieve simultaneous multi-entity customization of appearance and motion for video generation. Experimental results demonstrate that Tora2 achieves competitive performance with state-of-the-art customization methods while providing advanced motion control capabilities, which marks a critical advancement in multi-condition video generation. Project page: https://github.com/alibaba/Tora .


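[43] Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval cs.CV

Haiwen Li, Delong Liu, Zhaohui Hou, Zhicheng Zhao, Fei Su

TL;DR: The paper proposes a pipeline that automatically generates high-quality triplet data, builds the synthetic dataset CIRHS, and introduces a new CIR framework, CoAlign. It demonstrates for the first time that CIR models can be trained on a fully synthetic dataset, with strong results in both zero-shot and supervised settings.

Details

Motivation: Existing Composed Image Retrieval (CIR) methods rely on costly, manually labeled triplets, which limits their scalability and zero-shot capability. To address this, the paper proposes automatic triplet generation.

Result: CoAlign achieves strong zero-shot performance on three commonly used benchmarks and, under supervised training, outperforms all existing supervised CIR methods.

Insight: Synthetic datasets can replace manually labeled data for efficient, scalable CIR training; combining global alignment with local reasoning improves the robustness and expressiveness of the model.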

Abstract: As a challenging vision-language (VL) task, Composed Image Retrieval (CIR) aims to retrieve target images using multimodal (image+text) queries. Although many existing CIR methods have attained promising performance, their reliance on costly, manually labeled triplets hinders scalability and zero-shot capability. To address this issue, we propose a scalable pipeline for automatic triplet generation, along with a fully synthetic dataset named Composed Image Retrieval on High-quality Synthetic Triplets (CIRHS). Our pipeline leverages a large language model (LLM) to generate diverse prompts, controlling a text-to-image generative model to produce image pairs with identical elements in each pair, which are then filtered and reorganized to form the CIRHS dataset. In addition, we introduce Hybrid Contextual Alignment (CoAlign), a novel CIR framework, which can accomplish global alignment and local reasoning within a broader context, enabling the model to learn more robust and informative representations. By utilizing the synthetic CIRHS dataset, CoAlign achieves outstanding zero-shot performance on three commonly used benchmarks, demonstrating for the first time the feasibility of training CIR models on a fully synthetic dataset. Furthermore, under supervised training, our method outperforms all the state-of-the-art supervised CIR approaches, validating the effectiveness of our proposed retrieval framework. The code and the CIRHS dataset will be released soon.


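[44] Exploring Partial Multi-Label Learning via Integrating Semantic Co-occurrence Knowledge cs.CV | cs.AI

Xin Wu, Fei Teng, Yue Feng, Kaibo Shi, Zhuosheng Lin

TL;DR: The paper proposes the Semantic Co-occurrence Insight Network (SCINet), a new partial multi-label learning framework that introduces semantic co-occurrence knowledge and a cross-modality fusion module to handle incompletely annotated data.

Details

Motivation: Partial multi-label learning must handle incompletely annotated data, which contains known correct labels, known incorrect labels, and unknown labels. The core challenge is accurately identifying the ambiguous relationships between labels and instances.

Result: Experiments on four benchmark datasets show that SCINet surpasses existing state-of-the-art methods.

Insight: Semantic co-occurrence patterns between labels and instances are key to partial multi-label learning, and cross-modality fusion together with augmentation through diverse image transformations can significantly improve performance.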

Abstract: Partial multi-label learning aims to extract knowledge from incompletely annotated data, which includes known correct labels, known incorrect labels, and unknown labels. The core challenge lies in accurately identifying the ambiguous relationships between labels and instances. In this paper, we emphasize that matching co-occurrence patterns between labels and instances is key to addressing this challenge. To this end, we propose Semantic Co-occurrence Insight Network (SCINet), a novel and effective framework for partial multi-label learning. Specifically, SCINet introduces a bi-dominant prompter module, which leverages an off-the-shelf multimodal model to capture text-image correlations and enhance semantic alignment. To reinforce instance-label interdependencies, we develop a cross-modality fusion module that jointly models inter-label correlations, inter-instance relationships, and co-occurrence patterns across instance-label assignments. Moreover, we propose an intrinsic semantic augmentation strategy that enhances the model’s understanding of intrinsic data semantics by applying diverse image transformations, thereby fostering a synergistic relationship between label confidence and sample difficulty. Extensive experiments on four widely-used benchmark datasets demonstrate that SCINet surpasses state-of-the-art methods.


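[45] Ensemble-Based Deepfake Detection using State-of-the-Art Models with Robust Cross-Dataset Generalisation cs.CV

Haroon Wahab, Hassan Ugail, Lujain Jaleel

TL;DR: The paper proposes an ensemble-based approach that combines the prediction probabilities of several state-of-the-art deepfake detection models to improve generalization across diverse datasets.

Details

Motivation: Existing deepfake detection models perform well on benchmark datasets, but their performance drops significantly on out-of-distribution data. To address this, the authors investigate ensemble learning to improve cross-dataset generalization.

Result: Experiments show that no single model performs best in every setting, whereas the ensemble delivers more stable and reliable performance across all tested scenarios.

Insight: Asymmetric ensembling is a scalable and robust solution, particularly for real-world settings where the forgery type or quality is unknown.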

Abstract: Machine learning-based Deepfake detection models have achieved impressive results on benchmark datasets, yet their performance often deteriorates significantly when evaluated on out-of-distribution data. In this work, we investigate an ensemble-based approach for improving the generalization of deepfake detection systems across diverse datasets. Building on a recent open-source benchmark, we combine prediction probabilities from several state-of-the-art asymmetric models proposed at top venues. Our experiments span two distinct out-of-domain datasets and demonstrate that no single model consistently outperforms others across settings. In contrast, ensemble-based predictions provide more stable and reliable performance in all scenarios. Our results suggest that asymmetric ensembling offers a robust and scalable solution for real-world deepfake detection where prior knowledge of forgery type or quality is often unavailable.
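Combining prediction probabilities from several detectors can be sketched as follows. The abstract does not state the combination rule, so the unweighted mean and the 0.5 decision threshold used here are assumptions for illustration, and the probabilities are made-up values rather than results from the paper.

```python
from typing import List, Sequence


def ensemble_probability(probabilities: Sequence[float]) -> float:
    """Unweighted mean of the per-model fake probabilities for one sample."""
    if not probabilities:
        raise ValueError("at least one model probability is required")
    return sum(probabilities) / len(probabilities)


def ensemble_predict(per_model: List[List[float]], threshold: float = 0.5) -> List[bool]:
    """Label each sample as fake (True) when the averaged probability passes the threshold."""
    # zip(*per_model) regroups the values so each tuple holds one sample's scores.
    return [ensemble_probability(scores) >= threshold for scores in zip(*per_model)]


# Rows are models, columns are samples (illustrative values only).
per_model = [
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.3],
    [0.8, 0.1, 0.5],
]
labels = ensemble_predict(per_model)
```

On the third sample one model alone would flag a fake (0.6), but the ensemble does not, which illustrates how averaging damps the errors of any single detector.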


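[46] TextPixs: Glyph-Conditioned Diffusion with Character-Aware Attention and OCR-Guided Supervision cs.CV | cs.AI | cs.LG

Syeda Anshrah Gillani, Mirza Samad Ahmed Baig, Osama Ahmed Khan, Shahid Munir Shah, Umema Mujeeb

TL;DR: The paper proposes Glyph-Conditioned Diffusion with Character-Aware Attention (GCDA), a diffusion framework that addresses unreadable and misspelled text in images produced by existing text-to-image models. With a dual-stream text encoder, a character-aware attention mechanism, and OCR-guided supervision, GCDA reports new SOTA results for text rendering and image synthesis quality.

Details

Motivation: Existing text-to-image diffusion models cannot generate readable, correctly spelled text in images, which limits their use in practical applications such as advertising, education, and creative design. The paper aims to solve this problem.

Result: On datasets such as MARIO-10M and T2I-CompBench, GCDA reaches SOTA levels in text rendering (Character Error Rate: 0.08 vs 0.21; Word Error Rate: 0.15 vs 0.25) and human perception, with comparable high-fidelity image synthesis quality (FID: 14.3).

Insight: 1. Explicitly modeling glyph information is crucial for text generation; 2. character-aware attention effectively prevents text distortion; 3. OCR supervision can serve as an effective optimization objective for generative models.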

Abstract: The boom in modern text-to-image diffusion models has opened a new era in digital content production, demonstrating a previously unseen ability to produce photorealistic and stylistically diverse imagery from natural-language descriptions. However, a persistent weakness of these models is that they cannot generate readable, meaningful, and correctly spelled text in generated images, which significantly limits their use for practical purposes such as advertising, learning, and creative design. This paper introduces a new framework, Glyph-Conditioned Diffusion with Character-Aware Attention (GCDA), which extends a typical diffusion backbone with three purpose-built modules. First, the model has a dual-stream text encoder that encodes both semantic contextual information and explicit glyph representations, yielding a rich, character-aware representation of the input text. Second, a character-aware attention mechanism is proposed, together with a new attention segregation loss that constrains the attention distribution of each character independently in order to avoid distortion artifacts. Lastly, GCDA has an OCR-in-the-loop fine-tuning phase, in which a full-text perceptual loss directly optimises the model for legibility and accurate spelling. Large-scale experiments on benchmark datasets such as MARIO-10M and T2I-CompBench show that GCDA sets a new state of the art on all metrics, with better character-based text rendering metrics (Character Error Rate: 0.08 vs 0.21 for the previous best; Word Error Rate: 0.15 vs 0.25), better human perception scores, and comparable high-fidelity image synthesis quality (FID: 14.3).
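For readers unfamiliar with the two text rendering metrics quoted above, the usual definitions divide an edit distance by the reference length. The sketch below follows those standard definitions; whether the paper applies extra normalization (case folding, punctuation stripping) is not stated, and the example strings are invented.

```python
from typing import Sequence


def edit_distance(reference: Sequence, hypothesis: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_item != hyp_item),   # substitution or match
            ))
        previous = current
    return previous[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    reference_words = reference.split()
    if not reference_words:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference_words, hypothesis.split()) / len(reference_words)


cer = char_error_rate("OPEN", "OPEM")
wer = word_error_rate("grand opening sale today", "grand openning sale today")
```

In a text rendering evaluation the hypothesis would typically come from running an OCR model on the generated image, with the prompt's target text as the reference.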


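[47] VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis cs.CV | cs.AI

Alexandre Symeonidis-Herzig, Özge Mercanoğlu Sincan, Richard Bowden

TL;DR: VisualSpeaker proposes a new method that uses photorealistic differentiable rendering supervised by visual speech recognition, significantly improving the quality and perceptual realism of 3D facial animation.

Details

Motivation: Existing 3D facial animation methods operate mainly in the mesh domain and cannot fully exploit the rapid visual innovations in 2D computer vision and graphics.

Result: On the MEAD dataset, the Lip Vertex Error metric improves by 56.1% while the controllability of the animation is retained.

Insight: A perception-driven lip-reading loss produces more accurate mouthings, which are essential for the expressiveness of sign language avatars.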

Abstract: Realistic, high-fidelity 3D facial animations are crucial for expressive avatar systems in human-computer interaction and accessibility. Although prior methods show promising quality, their reliance on the mesh domain limits their ability to fully leverage the rapid visual innovations seen in 2D computer vision and graphics. We propose VisualSpeaker, a novel method that bridges this gap using photorealistic differentiable rendering, supervised by visual speech recognition, for improved 3D facial animation. Our contribution is a perceptual lip-reading loss, derived by passing photorealistic 3D Gaussian Splatting avatar renders through a pre-trained Visual Automatic Speech Recognition model during training. Evaluation on the MEAD dataset demonstrates that VisualSpeaker improves both the standard Lip Vertex Error metric by 56.1% and the perceptual quality of the generated animations, while retaining the controllability of mesh-driven animation. This perceptual focus naturally supports accurate mouthings, essential cues that disambiguate similar manual signs in sign language avatars.


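[48] MEDTalk: Multimodal Controlled 3D Facial Animation with Dynamic Emotions by Disentangled Embedding cs.CV | cs.MM

Chang Liu, Ye Pan, Chenyang Ding, Susanto Rahardja, Xiaokang Yang

TL;DR: MEDTalk is a novel framework for multimodally controlled 3D facial animation that disentangles content and emotion embedding spaces to control dynamic emotions and precise lip synchronization.

Details

Motivation: Existing methods usually rely on static, predefined emotion labels, which limits the diversity and naturalness of the generated expressions. MEDTalk aims to solve this with finer-grained dynamic emotion control and multimodal inputs.

Result: The generated 3D facial animations express dynamic emotions naturally with precise lip synchronization and fit industrial production pipelines (such as MetaHuman).

Insight: Through disentanglement and multimodal inputs, MEDTalk markedly improves the dynamism of emotional expression and the degree of user control, offering a new approach to 3D animation generation.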

Abstract: Audio-driven emotional 3D facial animation aims to generate synchronized lip movements and vivid facial expressions. However, most existing approaches focus on static and predefined emotion labels, limiting their diversity and naturalness. To address these challenges, we propose MEDTalk, a novel framework for fine-grained and dynamic emotional talking head generation. Our approach first disentangles content and emotion embedding spaces from motion sequences using a carefully designed cross-reconstruction process, enabling independent control over lip movements and facial expressions. Beyond conventional audio-driven lip synchronization, we integrate audio and speech text, predicting frame-wise intensity variations and dynamically adjusting static emotion features to generate realistic emotional expressions. Furthermore, to enhance control and personalization, we incorporate multimodal inputs, including text descriptions and reference expression images, to guide the generation of user-specified facial expressions. With MetaHuman as the primary target, our generated results can be conveniently integrated into industrial production pipelines.


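[49] MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding cs.CV

Tongtong Cheng, Rongzhen Li, Yixin Xiong, Tao Zhang, Jing Wang

TL;DR: MCAM is a multimodal causal analysis model that addresses three problems of existing methods in driving video understanding: shallow causal reasoning, spurious cross-modal correlations, and neglect of ego-vehicle-level causality modeling.

Details

Motivation: Autonomous driving video understanding requires accurate behavior recognition and reasoning, but existing methods struggle with spurious correlations across modalities and with modeling causality at the ego-vehicle level.

Result: MCAM achieves SOTA performance on the BDD-X and CoVLA datasets, demonstrating its effectiveness for autonomous driving applications.

Insight: Through dynamic causal modeling and cross-modal alignment, MCAM captures causal relationships in video sequences more effectively and improves understanding of autonomous driving scenes.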

Abstract: Accurate driving behavior recognition and reasoning are critical for autonomous driving video understanding. However, existing methods often capture only shallow causal relationships, fail to address spurious correlations across modalities, and ignore ego-vehicle-level causality modeling. To overcome these limitations, we propose a novel Multimodal Causal Analysis Model (MCAM) that constructs latent causal structures between visual and language modalities. First, we design a multi-level feature extractor to capture long-range dependencies. Second, we design a causal analysis module that dynamically models driving scenarios using a directed acyclic graph (DAG) of driving states. Third, we utilize a vision-language transformer to align critical visual features with their corresponding linguistic expressions. Extensive experiments on the BDD-X and CoVLA datasets demonstrate that MCAM achieves SOTA performance in visual-language causal relationship learning. Furthermore, the model exhibits superior capability in capturing causal characteristics within video sequences, showcasing its effectiveness for autonomous driving applications. The code is available at https://github.com/SixCorePeach/MCAM.


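[50] Discontinuity-aware Normal Integration for Generic Central Camera Models cs.CV

Francesco Milano, Manuel López-Antequera, Naina Dhingra, Roland Siegwart, Robert Thiel

TL;DR: The paper proposes a new normal integration method that handles depth discontinuities explicitly and supports generic central camera models; based on a local planarity assumption, it achieves more accurate 3D surface reconstruction.

Details

Motivation: Existing normal integration methods usually handle depth discontinuities only implicitly and are limited to orthographic or ideal pinhole cameras, which restricts their applicability to complex scenes and generic cameras.

Result: The method achieves SOTA results on the standard normal integration benchmark and is the first to directly support generic central camera models.

Insight: Constraining the relationship between surface normals and ray directions, and modeling discontinuities explicitly, approximates the relation between depth and normals more accurately and broadens the range of applications of normal integration.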

Abstract: Recovering a 3D surface from its surface normal map, a problem known as normal integration, is a key component for photometric shape reconstruction techniques such as shape-from-shading and photometric stereo. The vast majority of existing approaches for normal integration handle depth discontinuities only implicitly and are limited to orthographic or ideal pinhole cameras. In this paper, we propose a novel formulation that allows modeling discontinuities explicitly and handling generic central cameras. Our key idea is based on a local planarity assumption, which we model through constraints between surface normals and ray directions. Compared to existing methods, our approach more accurately approximates the relation between depth and surface normals, achieves state-of-the-art results on the standard normal integration benchmark, and is the first to directly handle generic central camera models.
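As a hedged illustration of what a local planarity constraint between normals and ray directions can look like (the notation is assumed here and is not taken from the paper): let pixel $a$ have ray direction $\mathbf{r}_a$, depth $z_a$ along that ray, and surface normal $\mathbf{n}_a$, so that its 3D point is $\mathbf{p}_a = z_a \mathbf{r}_a$. If a neighbouring pixel $b$ lies on the same local plane, then

$$
\mathbf{n}_a^\top \left( z_b \mathbf{r}_b - z_a \mathbf{r}_a \right) = 0
\quad\Longrightarrow\quad
z_b = z_a \, \frac{\mathbf{n}_a^\top \mathbf{r}_a}{\mathbf{n}_a^\top \mathbf{r}_b},
\qquad \mathbf{n}_a^\top \mathbf{r}_b \neq 0 .
$$

Because only ray directions appear, the relation does not depend on a particular projection model, which is what makes a formulation of this kind applicable to generic central cameras. Across a depth discontinuity the two pixels do not share a plane, so such a constraint would have to be relaxed or down-weighted there.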


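[51] CAST-Phys: Contactless Affective States Through Physiological signals Database cs.CV

Joaquim Comas, Alexander Joel Vera, Xavier Vives, Eleonora De Filippi, Alexandre Pereda

TL;DR: The paper presents CAST-Phys, a new multimodal dataset for contactless emotion recognition. It targets the scarcity of genuine emotion data in affective computing and the interference caused by contact-based devices, and it demonstrates the potential of physiological signals for contactless emotion recognition.

Details

Motivation: In affective computing, existing datasets are mostly collected with contact-based devices, which can interfere with genuine emotional responses, and the shortage of multimodal data limits the accuracy of emotion recognition systems.

Result: The study shows that physiological signals are crucial for emotion recognition in realistic scenarios and that fusing modalities improves the accuracy of contactless emotion recognition.

Insight: Combining physiological signals with facial features compensates for the weaknesses of any single modality and opens a new research direction for contactless emotion recognition.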

Abstract: In recent years, affective computing and its applications have become a fast-growing research topic. Despite significant advancements, the lack of affective multi-modal datasets remains a major bottleneck in developing accurate emotion recognition systems. Furthermore, the use of contact-based devices during emotion elicitation often unintentionally influences the emotional experience, reducing or altering the genuine spontaneous emotional response. This limitation highlights the need for methods capable of extracting affective cues from multiple modalities without physical contact, such as remote physiological emotion recognition. To address this, we present the Contactless Affective States Through Physiological Signals Database (CAST-Phys), a novel high-quality dataset explicitly designed for multi-modal remote physiological emotion recognition using facial and physiological cues. The dataset includes diverse physiological signals, such as photoplethysmography (PPG), electrodermal activity (EDA), and respiration rate (RR), alongside high-resolution uncompressed facial video recordings, enabling the potential for remote signal recovery. Our analysis highlights the crucial role of physiological signals in realistic scenarios where facial expressions alone may not provide sufficient emotional information. Furthermore, we demonstrate the potential of remote multi-modal emotion recognition by evaluating the impact of individual and fused modalities, showcasing its effectiveness in advancing contactless emotion recognition technologies.


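[52] Tile-Based ViT Inference with Visual-Cluster Priors for Zero-Shot Multi-Species Plant Identification cs.CV | cs.IR | cs.LG

Murilo Gustineli, Anthony Miyaguchi, Adrian Cheung, Divyansh Khattak

TL;DR: The paper presents a ViT-based approach to zero-shot multi-species plant identification that combines a tiling strategy with visual-cluster priors, placing second in the PlantCLEF 2025 challenge.

Details

Motivation: The goal is to solve zero-shot multi-species plant identification without additional training, improving classification using only visual clustering and geolocation filtering.

Result: The approach reaches a macro-averaged F1 of 0.348 on the private leaderboard.

Insight: Visual clustering and geographic information can significantly improve zero-shot classification, and the tiling strategy makes effective use of the ViT's receptive field.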

Abstract: We describe DS@GT’s second-place solution to the PlantCLEF 2025 challenge on multi-species plant identification in vegetation quadrat images. Our pipeline combines (i) a fine-tuned Vision Transformer ViTD2PC24All for patch-level inference, (ii) a 4x4 tiling strategy that aligns patch size with the network’s 518x518 receptive field, and (iii) domain-prior adaptation through PaCMAP + K-Means visual clustering and geolocation filtering. Tile predictions are aggregated by majority vote and re-weighted with cluster-specific Bayesian priors, yielding a macro-averaged F1 of 0.348 (private leaderboard) while requiring no additional training. All code, configuration files, and reproducibility scripts are publicly available at https://github.com/dsgt-arc/plantclef-2025.
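The tiling and aggregation steps can be sketched in a few lines. This is a rough sketch under assumptions, not the authors' code (which is in the linked repository): the 2072x2072 input size is chosen only so that each of the 4x4 tiles is 518x518, and multiplying vote counts by a per-species cluster prior is one plausible reading of "re-weighted with cluster-specific Bayesian priors".

```python
from collections import Counter
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def tile_boxes(width: int, height: int, grid: int = 4) -> List[Box]:
    """Split an image into a grid x grid set of tile boxes, row by row."""
    return [
        (
            col * width // grid,
            row * height // grid,
            (col + 1) * width // grid,
            (row + 1) * height // grid,
        )
        for row in range(grid)
        for col in range(grid)
    ]


def aggregate(
    tile_predictions: List[str],
    cluster_prior: Dict[str, float],
    default_prior: float = 1.0,
) -> Dict[str, float]:
    """Count per-tile votes, weight them by a cluster prior, and normalize."""
    votes = Counter(tile_predictions)
    weighted = {
        species: count * cluster_prior.get(species, default_prior)
        for species, count in votes.items()
    }
    total = sum(weighted.values())
    if total == 0:
        return {species: 0.0 for species in weighted}
    return {species: weight / total for species, weight in weighted.items()}


boxes = tile_boxes(2072, 2072)
scores = aggregate(["a", "a", "b", "c"], {"a": 0.5, "b": 1.0, "c": 0.0})
```

In the example, species "a" wins the raw vote but its lower prior brings it level with "b", and "c" is removed entirely by a zero prior, which mirrors how a domain prior can suppress species that are implausible for a given cluster.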


[53] Reflections Unlock: Geometry-Aware Reflection Disentanglement in 3D Gaussian Splatting for Photorealistic Scenes Rendering cs.CVPDF

Jiayi Song, Zihan Ye, Qingyuan Zhou, Weidong Yang, Ben Fei

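TL;DR: The paper proposes Ref-Unlock, a geometry-aware reflection disentanglement framework built on 3D Gaussian Splatting that addresses geometric consistency in scenes with complex reflections and significantly outperforms classical GS-based methods.

Details

Motivation: Existing methods such as NeRF and 3DGS tend to misinterpret reflections as physical geometry when handling reflective surfaces, which degrades reconstruction quality. Prior geometric constraints are incomplete and generalize poorly, which makes the problem worse.

Result: In experiments, Ref-Unlock significantly outperforms classical Gaussian Splatting reflection methods, is competitive with NeRF-based models, and supports flexible reflection editing.

Insight: Explicitly disentangling the reflected component and adding geometric constraints is key to improving reconstruction quality in scenes with complex reflections.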

Abstract: Accurately rendering scenes with reflective surfaces remains a significant challenge in novel view synthesis, as existing methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) often misinterpret reflections as physical geometry, resulting in degraded reconstructions. Previous methods rely on incomplete and non-generalizable geometric constraints, leading to misalignment between the positions of Gaussian splats and the actual scene geometry. When dealing with real-world scenes containing complex geometry, the accumulation of Gaussians further exacerbates surface artifacts and results in blurred reconstructions. To address these limitations, in this work, we propose Ref-Unlock, a novel geometry-aware reflection modeling framework based on 3D Gaussian Splatting, which explicitly disentangles transmitted and reflected components to better capture complex reflections and enhance geometric consistency in real-world scenes. Our approach employs a dual-branch representation with high-order spherical harmonics to capture high-frequency reflective details, alongside a reflection removal module providing pseudo reflection-free supervision to guide clean decomposition. Additionally, we incorporate pseudo-depth maps and a geometry-aware bilateral smoothness constraint to enhance 3D geometric consistency and stability in decomposition. Extensive experiments demonstrate that Ref-Unlock significantly outperforms classical GS-based reflection methods and achieves competitive results with NeRF-based models, while enabling flexible vision foundation models (VFMs) driven reflection editing. Our method thus offers an efficient and generalizable solution for realistic rendering of reflective scenes. Our code is available at https://ref-unlock.github.io/.


[54] Omni-Video: Democratizing Unified Video Understanding and Generation cs.CVPDF

Zhiyu Tan, Hao Yang, Luozheng Qin, Jia Gong, Mengping Yang

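TL;DR: The paper presents Omni-Video, an efficient unified framework for video understanding, generation, and instruction-based editing, which teaches multimodal large language models (MLLMs) to produce visual clues that are fed to diffusion decoders.

Details

Motivation: Current foundation models focus mainly on images, while unified models for video understanding and generation lag behind; this work aims to fill that gap.

Result: The model generalizes well across video generation, editing, and understanding tasks.

Insight: Building on the capabilities of existing MLLMs allows unified modeling of video tasks to be achieved efficiently, with reduced data and compute requirements.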

Abstract: Notable breakthroughs in unified understanding and generation modeling have led to remarkable advancements in image understanding, reasoning, production and editing, yet current foundational models predominantly focus on processing images, creating a gap in the development of unified models for video understanding and generation. This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, as well as instruction-based editing. Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that are used as the input of diffusion decoders, which produce high-quality videos conditioned on these visual clues. To fully unlock the potential of our system for unified video modeling, we integrate several technical improvements: 1) a lightweight architectural design that attaches a vision head on top of the MLLM and an adapter before the input of the diffusion decoder, where the former produces visual tokens and the latter adapts these visual tokens to the conditional space of the diffusion decoder; and 2) an efficient multi-stage training scheme that facilitates a fast connection between MLLMs and diffusion decoders with limited data and computational resources. We empirically demonstrate that our model exhibits satisfactory generalization abilities across video generation, editing and understanding tasks.


[55] Prompt-Free Conditional Diffusion for Multi-object Image Augmentation cs.CVPDF

Haoyu Wang, Lei Zhang, Wei Wei, Chen Ding, Yanning Zhang

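TL;DR: The paper proposes a prompt-free conditional diffusion framework for multi-object image augmentation. It uses local-global semantic fusion and LoRA-based knowledge injection to address the limited diversity and category deviation of existing methods.

Details

Motivation: Existing multi-object image generation methods either depend too heavily on text conditions, which causes category deviation, or rely too much on the original images, which limits diversity. This paper aims to solve both problems at once.

Result: Experiments show that the method outperforms existing baselines in diversity and downstream task performance, and exhibits strong out-of-domain generalization.

Insight: Semantic fusion and count constraints make it possible to generate diverse images that match the original data distribution without relying on text prompts.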

Abstract: Diffusion models have underpinned many recent advances in dataset augmentation across computer vision tasks. However, when generating multi-object images as in real scenarios, most existing methods either rely entirely on text conditions, resulting in a deviation between the generated objects and the original data, or rely too much on the original images, resulting in a lack of diversity in the generated images, which is of limited help to downstream tasks. To mitigate both problems at once, we propose a prompt-free conditional diffusion framework for multi-object image augmentation. Specifically, we introduce a local-global semantic fusion strategy to extract semantics from images to replace text, and inject knowledge into the diffusion model through LoRA to alleviate the category deviation between the original model and the target dataset. In addition, we design a reward-model-based counting loss to assist the traditional reconstruction loss for model training. By constraining the object counts of each category instead of imposing pixel-by-pixel constraints, this loss bridges the quantity deviation between the generated data and the original data while improving the diversity of the generated data. Experimental results demonstrate the superiority of the proposed method over several representative state-of-the-art baselines and showcase strong downstream task gain and out-of-domain generalization capabilities. Code is available at https://github.com/00why00/PFCD.


[56] SoftReMish: A Novel Activation Function for Enhanced Convolutional Neural Networks for Visual Recognition Performance cs.CV | cs.AI | cs.NEPDF

Mustafa Bayram Gücen

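TL;DR: SoftReMish is a new activation function intended to improve CNN performance on image classification; experiments show that it outperforms ReLU, Tanh, and Mish on MNIST.

Details

Motivation: Existing activation functions (such as ReLU, Tanh, and Mish) still leave room for improvement in CNNs, motivating the development of a more effective activation function.

Result: On MNIST, SoftReMish achieved the lowest training loss (3.14e-8) and the highest validation accuracy (99.41%).

Insight: SoftReMish shows better convergence behavior and generalization in these experiments, making it a candidate for visual recognition tasks.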

Abstract: In this study, SoftReMish, a new activation function designed to improve the performance of convolutional neural networks (CNNs) in image classification tasks, is proposed. Using the MNIST dataset, a standard CNN architecture consisting of two convolutional layers, max pooling, and fully connected layers was implemented. SoftReMish was evaluated against popular activation functions including ReLU, Tanh, and Mish by replacing the activation function in all trainable layers. The model performance was assessed in terms of minimum training loss and maximum validation accuracy. Results showed that SoftReMish achieved a minimum loss (3.14e-8) and a validation accuracy (99.41%), outperforming all other functions tested. These findings demonstrate that SoftReMish offers better convergence behavior and generalization capability, making it a promising candidate for visual recognition tasks.


[57] Normalizing Diffusion Kernels with Optimal Transport cs.CVPDF

Nathan Kessler, Robin Magnet, Jean Feydy

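TL;DR: The paper proposes a method that uses a symmetric variant of the Sinkhorn algorithm to normalize general similarity or adjacency matrices into diffusion-like operators, enabling Laplacian-like smoothing on unstructured data such as point clouds and sparse voxel grids.

Details

Motivation: Smoothing signals is a core operation in machine learning and geometry processing, but classical Laplace operators require a carefully defined domain structure, while simple convolution kernels and message-passing layers are biased near domain boundaries. The paper aims to bridge this gap.

Result: The normalized operators not only approximate heat diffusion but also retain spectral information from the Laplacian, with applications to shape analysis and matching.

Insight: Optimal transport theory makes Laplacian-like smoothing possible on domains without strict structure, broadening the applicability of classical methods.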

Abstract: Smoothing a signal based on local neighborhoods is a core operation in machine learning and geometry processing. On well-structured domains such as vector spaces and manifolds, the Laplace operator derived from differential geometry offers a principled approach to smoothing via heat diffusion, with strong theoretical guarantees. However, constructing such Laplacians requires a carefully defined domain structure, which is not always available. Most practitioners thus rely on simple convolution kernels and message-passing layers, which are biased against the boundaries of the domain. We bridge this gap by introducing a broad class of smoothing operators, derived from general similarity or adjacency matrices, and demonstrate that they can be normalized into diffusion-like operators that inherit desirable properties from Laplacians. Our approach relies on a symmetric variant of the Sinkhorn algorithm, which rescales positive smoothing operators to match the structural behavior of heat diffusion. This construction enables Laplacian-like smoothing and processing of irregular data such as point clouds, sparse voxel grids or mixture of Gaussians. We show that the resulting operators not only approximate heat diffusion but also retain spectral information from the Laplacian itself, with applications to shape analysis and matching.
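To make the idea of a symmetric Sinkhorn rescaling concrete, here is a minimal pure-Python sketch. It assumes uniform point masses and a Gaussian similarity kernel, and it uses the well-known damped fixed-point update d <- sqrt(d / (K d)) to find a diagonal scaling such that diag(d) K diag(d) has unit row sums. The function names and the kernel choice are illustrative assumptions; the paper's exact normalization (for example, how it handles non-uniform masses) may differ.

```python
import math


def gaussian_kernel(points, sigma=1.0):
    """Dense Gaussian similarity matrix between points (illustrative choice)."""
    n = len(points)
    return [
        [
            math.exp(
                -sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                / (2 * sigma ** 2)
            )
            for j in range(n)
        ]
        for i in range(n)
    ]


def symmetric_sinkhorn(kernel, n_iters=100):
    """Rescale a symmetric positive kernel K into S = diag(d) K diag(d).

    Assumption: uniform masses, so the target is unit row (and column) sums.
    The damped update d <- sqrt(d / (K d)) has d_i * (K d)_i = 1 as its
    fixed point, which is exactly the unit-row-sum condition.
    """
    n = len(kernel)
    d = [1.0] * n
    for _ in range(n_iters):
        kd = [sum(kernel[i][j] * d[j] for j in range(n)) for i in range(n)]
        d = [math.sqrt(d[i] / kd[i]) for i in range(n)]
    return [[d[i] * kernel[i][j] * d[j] for j in range(n)] for i in range(n)]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
smoother = symmetric_sinkhorn(gaussian_kernel(points))

# Applying the operator to a signal: a constant signal is left unchanged,
# which plain kernel averaging without normalization would not guarantee.
signal = [1.0, 1.0, 1.0, 1.0]
smoothed = [sum(row[j] * signal[j] for j in range(4)) for row in smoother]
```

Unlike plain row normalization, this scaling keeps the operator symmetric, so it is both row- and column-normalized, which is one of the Laplacian-like properties the abstract alludes to.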


[58] OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion cs.CVPDF

Yunhan Yang, Yufan Zhou, Yuan-Chen Guo, Zi-Xin Zou, Yukun Huang

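TL;DR: OmniPart is a novel framework for generating 3D assets with explicit, editable parts. It uses a two-stage approach to achieve semantic decoupling and structural cohesion, and supports user-defined part granularity.

Details

Motivation: Most current generative methods produce only monolithic shapes without editable part structure, which limits interactive applications. A method that generates 3D objects with clear part structure is therefore needed.

Result: Experiments show that OmniPart achieves state-of-the-art performance and supports diverse downstream applications.

Insight: By combining 2D guidance with 3D generation, OmniPart achieves high semantic decoupling and structural cohesion, opening a path toward editable and interpretable 3D content generation.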

Abstract: The creation of 3D assets with explicit, editable part structures is crucial for advancing interactive applications, yet most generative methods produce only monolithic shapes, limiting their utility. We introduce OmniPart, a novel framework for part-aware 3D object generation designed to achieve high semantic decoupling among components while maintaining robust structural cohesion. OmniPart uniquely decouples this complex task into two synergistic stages: (1) an autoregressive structure planning module generates a controllable, variable-length sequence of 3D part bounding boxes, critically guided by flexible 2D part masks that allow for intuitive control over part decomposition without requiring direct correspondences or semantic labels; and (2) a spatially-conditioned rectified flow model, efficiently adapted from a pre-trained holistic 3D generator, synthesizes all 3D parts simultaneously and consistently within the planned layout. Our approach supports user-defined part granularity, precise localization, and enables diverse downstream applications. Extensive experiments demonstrate that OmniPart achieves state-of-the-art performance, paving the way for more interpretable, editable, and versatile 3D content.


[59] Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling cs.CVPDF

Prahitha Movva, Naga Harshita Marupaka

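TL;DR: The paper proposes an approach to improving scientific visual question answering (VQA) through multimodal reasoning and ensemble modeling, and reports strong results on the SciVQA 2025 shared task.

Details

Motivation: Charts and figures in scientific literature must be interpreted with high precision, yet existing VQA methods perform poorly on numerical values and reasoning consistency.

Result: InternVL3 achieved ROUGE-1 and ROUGE-L F1 scores of 0.740 and a BERTScore of 0.983 on the SciVQA test split. The ensemble model improved on most individual models, though InternVL3 remained the strongest standalone model.

Insight: Prompt optimization, multi-step reasoning, and ensembling are important for scientific VQA.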

Abstract: Technical reports and articles often contain valuable information in the form of semi-structured data like charts and figures. Interpreting these and using the information from them is essential for downstream tasks such as question answering (QA). Current approaches to visual question answering often struggle with the precision required for scientific data interpretation, particularly in handling numerical values, multi-step reasoning over visual elements, and maintaining consistency between visual observation and textual reasoning. We present our approach to the SciVQA 2025 shared task, focusing on answering visual and non-visual questions grounded in scientific figures from scholarly articles. We conducted a series of experiments using models with 5B to 8B parameters. Our strongest individual model, InternVL3, achieved ROUGE-1 and ROUGE-L F1 scores of 0.740 and a BERTScore of 0.983 on the SciVQA test split. We also developed an ensemble model with multiple vision language models (VLMs). Through error analysis on the validation split, our ensemble approach improved performance compared to most individual models, though InternVL3 remained the strongest standalone performer. Our findings underscore the effectiveness of prompt optimization, chain-of-thought reasoning and ensemble modeling in improving the model’s ability in visual question answering.
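For readers unfamiliar with the reported metric, the following sketch shows how a ROUGE-1 F1 score is computed from unigram overlap. It assumes lowercased whitespace tokenization, which is a simplification; the shared task's official scorer may tokenize or stem differently, so numbers from this sketch are not directly comparable to the 0.740 reported above.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a predicted and a reference answer.

    Assumption: lowercased whitespace tokens, no stemming.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


exact = rouge1_f1("the value is 0.5", "the value is 0.5")
verbose = rouge1_f1("about 0.5", "0.5")
wrong = rouge1_f1("red", "blue")
```

The `verbose` case illustrates why short numeric answers are sensitive to extra words: one surplus token halves precision even though the number itself is correct.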


[60] CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions cs.CV | cs.CLPDF

Yuchen Huang, Zhiyuan Fan, Zhitao He, Sandeep Polisetty, Wenyan Li

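TL;DR: CultureCLIP strengthens CLIP's cultural awareness using the synthetic cultural dataset CulTwin and customized contrastive learning, improving recognition of fine-grained cultural concepts while preserving the model's generalization ability.

Details

Motivation: Existing vision-language models such as CLIP perform poorly on culture-related tasks, mainly because of the scarcity of high-quality cultural datasets, insufficient contextual knowledge, and the difficulty of distinguishing visually similar but culturally distinct concepts.

Result: On culture-related benchmarks, CultureCLIP improves fine-grained concept recognition over the base CLIP by up to 5.49% on certain tasks, while preserving generalization performance.

Insight: Combining synthetic data with contextually enhanced captions can improve a model's sensitivity to subtle cultural distinctions without sacrificing generalization.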

Abstract: Pretrained vision-language models (VLMs) such as CLIP excel in multimodal understanding but struggle with contextually relevant fine-grained visual features, making it difficult to distinguish visually similar yet culturally distinct concepts. This limitation stems from the scarcity of high-quality culture-specific datasets, the lack of integrated contextual knowledge, and the absence of hard negatives highlighting subtle distinctions. To address these challenges, we first design a data curation pipeline that leverages open-sourced VLMs and text-to-image diffusion models to construct CulTwin, a synthetic cultural dataset. This dataset consists of paired concept-caption-image triplets, where concepts visually resemble each other but represent different cultural contexts. Then, we fine-tune CLIP on CulTwin to create CultureCLIP, which aligns cultural concepts with contextually enhanced captions and synthetic images through customized contrastive learning, enabling finer cultural differentiation while preserving generalization capabilities. Experiments on culturally relevant benchmarks show that CultureCLIP outperforms the base CLIP, achieving up to a notable 5.49% improvement in fine-grained concept recognition on certain tasks, while preserving CLIP’s original generalization ability, validating the effectiveness of our data synthesis and VLM backbone training paradigm in capturing subtle cultural distinctions.


[61] Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion cs.CVPDF

Aleksandar Jevtić, Christoph Reich, Felix Wimbauer, Oliver Hahn, Christian Rupprecht

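TL;DR: SceneDINO is an unsupervised semantic scene completion method that uses self-supervised learning and 2D scene understanding techniques to infer 3D geometry and semantics without annotated data, reaching state-of-the-art performance among unsupervised approaches.

Details

Motivation: Existing semantic scene completion methods depend on expensive annotated data; this paper explores a solution in the unsupervised setting.

Result: SceneDINO reaches state-of-the-art segmentation accuracy in both 3D and 2D unsupervised scene understanding, and linear probing of its 3D features matches a current supervised approach.

Insight: SceneDINO demonstrates the potential of unsupervised methods for 3D scene understanding and lays groundwork for single-image 3D scene understanding.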

Abstract: Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single image 3D scene understanding.


[62] RSRefSeg 2: Decoupling Referring Remote Sensing Image Segmentation with Foundation Models cs.CVPDF

Keyan Chen, Chenyang Liu, Bowen Chen, Jiafan Zhang, Zhengxia Zou

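TL;DR: RSRefSeg 2 proposes a decoupling paradigm that combines the foundation-model capabilities of CLIP and SAM to improve segmentation accuracy and semantic understanding for remote sensing images.

Details

Motivation: Existing methods are limited in handling complex semantic relationships and cross-modal alignment, mainly because their coupled processing conflates target localization with boundary delineation.

Result: Experiments show that RSRefSeg 2 outperforms existing methods in segmentation accuracy (about +3% gIoU) and in complex semantic understanding.

Insight: The decoupled design reduces error propagation, while collaboration between foundation models improves generalization and interpretability.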

Abstract: Referring Remote Sensing Image Segmentation provides a flexible and fine-grained framework for remote sensing scene analysis via vision-language collaborative interpretation. Current approaches predominantly utilize a three-stage pipeline encompassing dual-modal encoding, cross-modal interaction, and pixel decoding. These methods demonstrate significant limitations in managing complex semantic relationships and achieving precise cross-modal alignment, largely due to their coupled processing mechanism that conflates target localization with boundary delineation. This architectural coupling amplifies error propagation under semantic ambiguity while restricting model generalizability and interpretability. To address these issues, we propose RSRefSeg 2, a decoupling paradigm that reformulates the conventional workflow into a collaborative dual-stage framework: coarse localization followed by fine segmentation. RSRefSeg 2 integrates CLIP’s cross-modal alignment strength with SAM’s segmentation generalizability through strategic foundation model collaboration. Specifically, CLIP is employed as the dual-modal encoder to activate target features within its pre-aligned semantic space and generate localization prompts. To mitigate CLIP’s misactivation challenges in multi-entity scenarios described by referring texts, a cascaded second-order prompter is devised, which enhances precision through implicit reasoning via decomposition of text embeddings into complementary semantic subspaces. These optimized semantic prompts subsequently direct the SAM to generate pixel-level refined masks, thereby completing the semantic transmission pipeline. Extensive experiments (RefSegRS, RRSIS-D, and RISBench) demonstrate that RSRefSeg 2 surpasses contemporary methods in segmentation accuracy (+~3% gIoU) and complex semantic interpretation. Code is available at: https://github.com/KyanChen/RSRefSeg2.


[63] Learning to Track Any Points from Human Motion cs.CVPDF

Inès Hyeonsu Kim, Seokju Cho, Jahyeok Koo, Junghyun Park, Jiahui Huang

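TL;DR: The paper proposes AnthroTAP, a pipeline that uses the SMPL model to automatically generate pseudo-labeled data for training point-tracking models. It achieves state-of-the-art performance on the TAP-Vid benchmark with far less data and compute.

Details

Motivation: Human motion data is well suited to training point trackers, but acquiring large-scale annotated data is costly. The paper addresses this by generating pseudo-labeled data automatically.

Result: On the TAP-Vid benchmark, a point-tracking model trained with AnthroTAP surpasses other models trained on real videos, while using 10,000 times less data and only 1 day of training on 4 GPUs.

Insight: Synthetic or pseudo-labeled data can substantially reduce the training cost of point tracking while matching or exceeding models trained on real-video data.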

Abstract: Human motion, with its inherent complexities, such as non-rigid deformations, articulated movements, clothing distortions, and frequent occlusions caused by limbs or other individuals, provides a rich and challenging source of supervision that is crucial for training robust and generalizable point trackers. Despite the suitability of human motion, acquiring extensive training data for point tracking remains difficult due to laborious manual annotation. Our pipeline, AnthroTAP, addresses this by automatically generating pseudo-labeled training data, leveraging the Skinned Multi-Person Linear (SMPL) model. We first fit the SMPL model to detected humans in video frames, project the resulting 3D mesh vertices onto 2D image planes to generate pseudo-trajectories, handle occlusions using ray-casting, and filter out unreliable tracks based on optical flow consistency. A point tracking model trained on the AnthroTAP-annotated dataset achieves state-of-the-art performance on the TAP-Vid benchmark, surpassing other models trained on real videos while using 10,000 times less data and only 1 day on 4 GPUs, compared to the 256 GPUs used in recent state-of-the-art work.
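Two of the pipeline steps named in the abstract, projecting 3D mesh vertices to the image plane and filtering tracks by optical flow consistency, can be sketched as follows. The sketch assumes a simple pinhole camera with vertices already in camera coordinates and a fixed pixel tolerance; the function names, the `tol` threshold, and the camera parameters are hypothetical, and SMPL fitting and ray-casting are omitted entirely.

```python
def project_points(vertices, focal, cx, cy):
    """Project camera-space 3D points (x, y, z) to pixel coordinates.

    Assumption: ideal pinhole camera, no lens distortion.
    Points at or behind the camera plane are returned as None.
    """
    pixels = []
    for x, y, z in vertices:
        if z <= 0:
            pixels.append(None)
        else:
            pixels.append((focal * x / z + cx, focal * y / z + cy))
    return pixels


def flow_consistent(p_t, p_next, flow, tol=2.0):
    """Check a pseudo-track step against optical flow (assumed criterion).

    p_t, p_next: projected pixel positions in consecutive frames.
    flow: optical-flow vector sampled at p_t.
    tol: hypothetical tolerance in pixels.
    """
    dx = p_next[0] - p_t[0] - flow[0]
    dy = p_next[1] - p_t[1] - flow[1]
    return (dx * dx + dy * dy) ** 0.5 <= tol


pixels = project_points(
    [(0.0, 0.0, 2.0), (1.0, -1.0, 2.0), (0.0, 0.0, -1.0)],
    focal=100.0, cx=50.0, cy=40.0,
)
keep = flow_consistent((10.0, 10.0), (13.0, 14.0), flow=(3.0, 4.0))
drop = flow_consistent((10.0, 10.0), (13.0, 14.0), flow=(0.0, 0.0))
```

The consistency check is what lets mesh-derived trajectories be used as labels: a step is kept only when the displacement implied by the fitted body agrees with the motion observed in the video.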


cs.CL [Back]

[64] User Behavior Prediction as a Generic, Robust, Scalable, and Low-Cost Evaluation Strategy for Estimating Generalization in LLMs cs.CL | cs.AIPDF

Sougata Saha, Monojit Choudhury

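TL;DR: This paper proposes user behavior prediction as a generic, robust, scalable, and low-cost way to measure the generalization ability of large language models, avoiding the limitations of knowledge-retrieval and reasoning tasks, and validates it on several models.

Details

Motivation: Measuring the generalization ability of large language models (LLMs) is challenging because of data contamination. Knowledge-retrieval and reasoning tasks are not ideal for this purpose, since LLMs are not trained for specific tasks.

Result: The experimental results align with the framework's predictions: GPT-4o outperforms the other models, but all models, especially Llama, have room for improvement.

Insight: User behavior prediction is a more general and practical evaluation strategy for measuring LLM generalization, and it can avoid the data contamination problem.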

Abstract: Measuring the generalization ability of Large Language Models (LLMs) is challenging due to data contamination. As models grow and computation becomes cheaper, ensuring tasks and test cases are unseen during training phases will become nearly impossible. We argue that knowledge-retrieval and reasoning tasks are not ideal for measuring generalization, as LLMs are not trained for specific tasks. Instead, we propose user behavior prediction, also a key aspect of personalization, as a theoretically sound, scalable, and robust alternative. We introduce a novel framework for this approach and test it on movie and music recommendation datasets for GPT-4o, GPT-4o-mini, and Llama-3.1-8B-Instruct. Results align with our framework’s predictions, showing GPT-4o outperforms GPT-4o-mini and Llama, though all models have much room for improvement, especially Llama.


[65] Beyond classical and contemporary models: a transformative ai framework for student dropout prediction in distance learning using rag, prompt engineering, and cross-modal fusion cs.CL | cs.AI | cs.CY | cs.IR | I.2.7; I.2.1; K.3.1PDF

Miloud Mihoubi, Meriem Zerkouk, Belkacem Chikhaoui

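TL;DR: This paper proposes an AI framework for predicting student dropout in distance learning that combines RAG, prompt engineering, and cross-modal fusion to improve prediction accuracy and interpretability.

Details

Motivation: Student dropout in distance learning has profound societal and economic consequences, and classical machine learning models struggle to capture the emotional and contextual factors in student interactions, so a more advanced solution is needed.

Result: On a dataset of 4,423 students, the framework achieves 89% accuracy and an F1-score of 0.88, outperforming conventional models by 7% and reducing false negatives by 21%.

Insight: Beyond improving prediction, the framework generates interpretable intervention strategies, offering a scalable approach to dropout in education systems worldwide.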

Abstract: Student dropout in distance learning remains a critical challenge, with profound societal and economic consequences. While classical machine learning models leverage structured socio-demographic and behavioral data, they often fail to capture the nuanced emotional and contextual factors embedded in unstructured student interactions. This paper introduces a transformative AI framework that redefines dropout prediction through three synergistic innovations: Retrieval-Augmented Generation (RAG) for domain-specific sentiment analysis, prompt engineering to decode academic stressors, and cross-modal attention fusion to dynamically align textual, behavioral, and socio-demographic insights. By grounding sentiment analysis in a curated knowledge base of pedagogical content, our RAG-enhanced BERT model interprets student comments with unprecedented contextual relevance, while optimized prompts isolate indicators of academic distress (e.g., “isolation,” “workload anxiety”). A cross-modal attention layer then fuses these insights with temporal engagement patterns, creating holistic risk profiles. Evaluated on a longitudinal dataset of 4,423 students, the framework achieves 89% accuracy and an F1-score of 0.88, outperforming conventional models by 7% and reducing false negatives by 21%. Beyond prediction, the system generates interpretable interventions by retrieving contextually aligned strategies (e.g., mentorship programs for isolated learners). This work bridges the gap between predictive analytics and actionable pedagogy, offering a scalable solution to mitigate dropout risks in global education systems.


[66] LCDS: A Logic-Controlled Discharge Summary Generation System Supporting Source Attribution and Expert Review cs.CL | cs.AIPDF

Cheng Yuan, Xinkai Rui, Yongqi Fan, Yawei Fan, Boyang Zhong

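TL;DR: The paper presents LCDS, a system that uses logical control and a source mapping table to address LLM hallucination in automated discharge summary generation, and supports expert review and feedback.

Details

Motivation: LLMs hallucinate and produce inaccurate content when generating discharge summaries, and it is hard to attribute generated content to sources in long-form electronic medical records.

Result: LCDS generates more reliable discharge summaries, supports efficient expert review and feedback, and records the corrected summaries for incremental fine-tuning of LLMs.

Insight: Logical control and a source mapping table are effective means of mitigating hallucination in medical text generation, and expert involvement improves the reliability of the generated content.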

Abstract: Despite the remarkable performance of Large Language Models (LLMs) in automated discharge summary generation, they still suffer from hallucination issues, such as generating inaccurate content or fabricating information without valid sources. In addition, electronic medical records (EMRs) typically consist of long-form data, making it challenging for LLMs to attribute the generated content to the sources. To address these challenges, we propose LCDS, a Logic-Controlled Discharge Summary generation system. LCDS constructs a source mapping table by calculating textual similarity between EMRs and discharge summaries to constrain the scope of summarized content. Moreover, LCDS incorporates a comprehensive set of logical rules, enabling it to generate more reliable silver discharge summaries tailored to different clinical fields. Furthermore, LCDS supports source attribution for generated content, allowing experts to efficiently review, provide feedback, and rectify errors. The resulting golden discharge summaries are subsequently recorded for incremental fine-tuning of LLMs. Our project and demo video are in the GitHub repository https://github.com/ycycyc02/LCDS.
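The notion of a source mapping table built from textual similarity can be illustrated with a small sketch. It assumes token-level Jaccard similarity and a fixed threshold, neither of which is specified in the abstract; the section names, the `threshold` value, and the functions `jaccard` and `build_source_map` are hypothetical and do not reflect the actual LCDS implementation.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity (stand-in for the unspecified measure)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def build_source_map(emr_sections, summary_fields, threshold=0.2):
    """Map each summary field to the EMR sections it most resembles.

    emr_sections: dict of section name -> section text.
    summary_fields: dict of field name -> text from a reference summary.
    threshold: hypothetical cutoff below which a section is ignored.
    """
    mapping = {}
    for field, text in summary_fields.items():
        scored = [(jaccard(text, body), name) for name, body in emr_sections.items()]
        mapping[field] = [
            name for score, name in sorted(scored, reverse=True) if score >= threshold
        ]
    return mapping


emr = {
    "admission_note": "patient admitted with chest pain and fever",
    "lab_report": "white blood cell count elevated",
    "nursing_log": "patient slept well",
}
summary = {"chief_complaint": "chest pain and fever on admission"}
source_map = build_source_map(emr, summary)
```

A table like this restricts which record sections a generator may draw on for each field, and it also gives reviewers a direct pointer from a summary field back to its candidate sources.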


[67] MindFlow: Revolutionizing E-commerce Customer Support with Multimodal LLM Agents cs.CL | cs.AIPDF

Ming Gong, Xucheng Huang, Chenghan Yang, Xianhan Peng, Haoxin Wang

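TL;DR: MindFlow is a multimodal LLM agent for e-commerce customer service that integrates memory, decision-making, and action modules, improving complex query handling and user satisfaction while reducing operational costs.

Details

Motivation: Existing LLMs remain limited in complex multimodal e-commerce scenarios; a stronger solution is needed to improve customer service efficiency and user experience.

Result: In real-world deployment, MindFlow shows substantial gains, with a 93.53% relative improvement.

Insight: Modularity and multimodal reasoning are key to improving LLM performance in complex scenarios.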

Abstract: Recent advances in large language models (LLMs) have enabled new applications in e-commerce customer service. However, their capabilities remain constrained in complex, multimodal scenarios. We present MindFlow, the first open-source multimodal LLM agent tailored for e-commerce. Built on the CoALA framework, it integrates memory, decision-making, and action modules, and adopts a modular “MLLM-as-Tool” strategy for effective visual-textual reasoning. Evaluated via online A/B testing and simulation-based ablation, MindFlow demonstrates substantial gains in handling complex queries, improving user satisfaction, and reducing operational costs, with a 93.53% relative improvement observed in real-world deployments.


[68] LoRA-Augmented Generation (LAG) for Knowledge-Intensive Language Tasks cs.CL | cs.AI | cs.LGPDF

William Fleshman, Benjamin Van Durme

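TL;DR: The paper proposes LoRA-Augmented Generation (LAG), a method for efficiently selecting and combining task-specific LoRA adapters without additional training or data access, which outperforms existing data-free methods on knowledge-intensive tasks.

Details

Motivation: As fine-tuned language model experts for specific tasks and domains proliferate, efficient methods for selecting and combining them are needed. LAG offers one such solution.

Result: LAG outperforms existing data-free methods on knowledge-intensive tasks and is compatible with alternative solutions such as RAG.

Insight: LAG provides a flexible and efficient way to combine multiple task-specific experts, broadening the range of applications for language models.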

Abstract: The proliferation of fine-tuned language model experts for specific tasks and domains signals the need for efficient selection and combination methods. We propose LoRA-Augmented Generation (LAG) for leveraging large libraries of knowledge and task-specific LoRA adapters. LAG requires no additional training or access to data, and efficiently filters, retrieves, and applies experts on a per-token and layer basis. We evaluate LAG on various knowledge-intensive tasks, achieving superior performance over existing data-free methods. We explore scenarios where additional data is available, demonstrating LAG’s compatibility with alternative solutions such as retrieval-augmented generation (RAG).


[69] On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning: A Shortest-Path Case Study cs.CL | cs.AI | cs.LGPDF

Riccardo Alberghi, Elizaveta Demyanenko, Luca Biggio, Luca Saglietti

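TL;DR: This paper studies, on a shortest-path task, how language models systematically favor inefficient reasoning traces. It finds that training on redundant but coherent traces improves generalization more than training on optimal traces.

Details

Motivation: Although large language models (LLMs) perform well on reasoning tasks, there is a tension between test-time compute efficiency and systematic reasoning. Using controlled shortest-path experiments, the paper studies how redundant reasoning traces affect generalization.

Result: With the same training-token budget, models trained on inefficient but coherent traces generalize better than those trained on optimal traces, whereas arbitrary redundancy fails to help and can hurt performance.

Insight: The coherence and local incrementality of reasoning traces matter more for the optimization signal than their optimality, which suggests new directions for improving reasoning in language models.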

Abstract: Recent advances in natural language processing highlight two key factors for improving reasoning in large language models (LLMs): (i) allocating more test-time compute tends to help on harder problems but often introduces redundancy in the reasoning trace, and (ii) compute is most effective when reasoning is systematic and incremental, forming structured chains of thought (CoTs) akin to human problem-solving. To study these factors in isolation, we introduce a controlled setting based on shortest-path tasks in layered graphs. We train decoder-only transformers on question-trace-answer triples using a custom tokenizer, comparing models trained on optimal bottom-up dynamic programming traces with those trained on longer, valid traces involving backtracking. Surprisingly, with the same training-token budget, models trained on inefficient traces generalize better to unseen graphs. This benefit is not due to length alone: injecting arbitrary redundancy into reasoning traces fails to help and can even hurt performance. Instead, we find that generalization correlates with the model’s confidence in next-token prediction, suggesting that long, coherent, and locally incremental traces make the training signal easier to optimize.
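As a point of reference for the "optimal bottom-up dynamic programming traces" mentioned in the abstract, the sketch below solves a shortest-path problem on a small layered graph and records the order in which node values are computed. The graph encoding, the trace format, and the function name are assumptions made for illustration; the paper's tokenized trace format is not described in the abstract.

```python
def shortest_path_layered(layers, cost):
    """Bottom-up dynamic programming on a layered graph.

    layers: list of lists of node names; edges only connect layer k to k+1.
    cost: dict mapping (u, v) -> edge weight.
    Returns (total cost, path, trace), where trace lists (node, best cost
    to the final layer) in the order the values were computed.
    """
    inf = float("inf")
    best = {node: 0.0 for node in layers[-1]}  # final layer costs nothing more
    successor = {}
    trace = []
    for depth in range(len(layers) - 2, -1, -1):
        for u in layers[depth]:
            best[u] = inf
            for v in layers[depth + 1]:
                if (u, v) in cost and cost[(u, v)] + best[v] < best[u]:
                    best[u] = cost[(u, v)] + best[v]
                    successor[u] = v
            trace.append((u, best[u]))
    start = min(layers[0], key=lambda node: best[node])
    path = [start]
    while path[-1] in successor:
        path.append(successor[path[-1]])
    return best[start], path, trace


layers = [["s"], ["a", "b"], ["t"]]
cost = {("s", "a"): 1, ("s", "b"): 4, ("a", "t"): 5, ("b", "t"): 1}
total, path, trace = shortest_path_layered(layers, cost)
```

Each node appears exactly once in this trace, which is what makes it "optimal" in length; a backtracking search over the same graph would revisit nodes and produce the longer, still valid traces that the paper compares against.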


[70] EduCoder: An Open-Source Annotation System for Education Transcript Data cs.CLPDF

Guanzhong Pan, Mei Tan, Hyunji Nam, Lucía Langlois, James Malamut

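TL;DR: EduCoder is an open-source tool for annotating educational dialogue data. It is designed to handle the complexity of dialogue annotation in education and supports collaborative definition of complex codebooks and multiple annotation types.

Details

Motivation: Existing general-purpose text annotation tools do not meet the needs of educational dialogue data, such as defining codebooks for complex pedagogical features, supporting multiple annotation types, and integrating contextual information.

Result: EduCoder provides a working open-source system that supports reliable annotation of educational dialogue data and improves data quality through annotator calibration.

Insight: Dialogue annotation in education needs specialized tools that support complex feature descriptions and collaborative annotation; EduCoder fills this gap.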

Abstract: We introduce EduCoder, a domain-specialized tool designed to support utterance-level annotation of educational dialogue. While general-purpose text annotation tools for NLP and qualitative research abound, few address the complexities of coding education dialogue transcripts – with diverse teacher-student and peer interactions. Common challenges include defining codebooks for complex pedagogical features, supporting both open-ended and categorical coding, and contextualizing utterances with external features, such as the lesson’s purpose and the pedagogical value of the instruction. EduCoder is designed to address these challenges by providing a platform for researchers and domain experts to collaboratively define complex codebooks based on observed data. It incorporates both categorical and open-ended annotation types along with contextual materials. Additionally, it offers a side-by-side comparison of multiple annotators’ responses, allowing comparison and calibration of annotations with others to improve data reliability. The system is open-source, with a demo video available.


[71] The Generalization Ridge: Information Flow in Natural Language Generation cs.CLPDF

Ruidi Chang, Chunyuan Deng, Hanjie Chen

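TL;DR: The paper proposes InfoRidge, a framework for studying how task-relevant information flows across the layers of Transformer models. It finds that predictive information peaks in the upper-middle layers (forming a "generalization ridge"), revealing the key role of those layers in generalization.

Details

Motivation: Although Transformer-based language models perform strongly on natural language generation, the mechanisms by which they synthesize information remain unclear, in particular how task-relevant information flows across layers.

Result: Experiments show that predictive information peaks in the upper-middle layers (the "generalization ridge") and that, under distribution shift, models rely more heavily on these layers, highlighting their role in generalization.

Insight: Intermediate Transformer layers play a key role in generalization, while final layers lean toward memorization, offering a new perspective for model design.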

Abstract: Transformer-based language models have achieved state-of-the-art performance in natural language generation (NLG) tasks, yet their internal mechanisms for synthesizing task-relevant information remain insufficiently understood. While prior studies suggest that intermediate layers often yield more generalizable representations than final layers, how this generalization ability emerges and propagates across layers during training remains unclear. To address this gap, we propose InfoRidge, an information-theoretic framework, to characterize how predictive information (the mutual information between hidden representations and target outputs) varies across depth. Estimating this quantity enables us to trace the flow of task-relevant information throughout the model during training. Our experiments across various models and datasets reveal a consistent non-monotonic trend: predictive information peaks in upper-middle layers, forming a generalization ridge, before declining in final layers, reflecting a transition between generalization and memorization. To further investigate this phenomenon, we introduce residual scaling coefficients (trainable scalar parameters applied to each residual block), which serve as functional probes for assessing the relative importance of individual transformer layers. These coefficients reveal that, under distribution shift, models downweight final layers and increasingly rely on ridge layers, highlighting their role in generalization. Together, these findings offer new insights into the internal mechanisms of transformers and underscore the critical role of intermediate layers in supporting generalization.


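[72] Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning cs.CL | cs.AI | cs.LG

Jaedong Hwang, Kumar Tanmay, Seok-Jin Lee, Ayush Agrawal, Hamid Palangi

TL;DR: The paper introduces the GeoFact-X benchmark and the BRIDGE training method to address language bias in multilingual reasoning, using a language-consistency reward to improve reasoning.

Details

Motivation: Current multilingual LLMs reason poorly in low-resource languages and tend to default to high-resource languages such as English, which undermines factual accuracy and trust.

Result: BRIDGE significantly improves multilingual reasoning fidelity, showing that reasoning-aware multilingual reinforcement learning is crucial for cross-lingual generalization.

Insight: The language-consistency reward is key to better multilingual reasoning, and the automatic evaluation protocol offers a finer-grained analysis tool for multilingual tasks.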

Abstract: Large Language Models (LLMs) have achieved strong performance in domains like mathematics, factual QA, and code generation, yet their multilingual reasoning capabilities in these tasks remain underdeveloped. Especially for low-resource languages such as Swahili or Thai, LLMs can often misinterpret prompts or default to reasoning in English. This implicit bias toward high-resource languages undermines factual accuracy, interpretability, and trust. Current multilingual benchmarks focus only on final answers, overlooking whether models actually reason in the target language. To address this gap, we introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark with annotated reasoning traces in five languages: English, Hindi, Japanese, Swahili, and Thai. We further propose BRIDGE, a novel training method that guides supervised fine-tuning and test-time reinforcement learning with a language-consistency reward to align reasoning with the input language. Finally, we develop an automatic evaluation protocol using LLM-as-a-judge to assess answer correctness and the quality and language consistency of reasoning traces, enabling nuanced and scalable analysis beyond surface-level metrics. Our results show that BRIDGE significantly enhances multilingual reasoning fidelity, demonstrating that reasoning-aware multilingual reinforcement learning is crucial for robust cross-lingual generalization. https://jd730.github.io/projects/GeoFact-X_BRIDGE
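
As a rough illustration of what a language-consistency reward could look like, the sketch below scores a reasoning trace by the share of its letters that belong to the expected script. This is an assumption for illustration only: the abstract does not specify how BRIDGE computes its reward, and a script check cannot separate languages that share a script (for example English and Swahili). The names `SCRIPT_PREFIXES` and `language_consistency_reward` are hypothetical.

```python
import unicodedata

# Hypothetical mapping from language code to Unicode character-name prefixes.
SCRIPT_PREFIXES = {
    "en": ("LATIN",),
    "sw": ("LATIN",),
    "hi": ("DEVANAGARI",),
    "ja": ("HIRAGANA", "KATAKANA", "CJK"),
    "th": ("THAI",),
}


def language_consistency_reward(text, lang):
    """Return the fraction of alphabetic characters written in the script expected for `lang`."""
    prefixes = SCRIPT_PREFIXES[lang]
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    matches = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith(prefixes)
    )
    return matches / len(letters)


# An English trace scored against a Thai prompt gets no reward.
print(language_consistency_reward("The capital is Bangkok", "th"))
```

A production reward would more plausibly rely on a language identification model; the script heuristic is used here only because it needs nothing beyond the standard library.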


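[73] “Lost-in-the-Later”: Framework for Quantifying Contextual Grounding in Large Language Models cs.CL | cs.AI

Yufei Tao, Adam Hiatt, Rahul Seetharaman, Ameeta Agrawal

TL;DR: This paper proposes CoPE, a framework for evaluating how large language models (LLMs) integrate contextual and parametric knowledge. It reveals a "lost-in-the-later" phenomenon, in which models tend to overlook or deprioritize information that appears later in the context. Experiments show that chain-of-thought (CoT) prompting degrades contextual grounding, and the authors propose prompt-based methods to improve it.

Details

Motivation: LLMs can draw on both contextual and parametric knowledge, but how they integrate the two is not well understood. The authors use a systematic evaluation framework to quantify the use of each kind of knowledge and to expose biases in how models process context.

Result: 1. LLMs exhibit the "lost-in-the-later" phenomenon: they overlook information that appears later in the context.
2. CoT prompting leads to lower recall and shorter responses, degrading contextual grounding.
3. Context-informed prompting improves factual grounding and reduces hallucination in a summarization case study.

Insight: 1. Context use is subject to positional bias; models need further work to treat information evenly across positions.
2. CoT prompting is not always beneficial and should be adapted to the task.
3. Prompt engineering is an effective way to improve model performance.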

Abstract: Large language models are capable of leveraging both contextual and parametric knowledge but how they prioritize and integrate these sources remains underexplored. We introduce CoPE, a novel evaluation framework that systematically measures contextual knowledge (CK) and parametric knowledge (PK) across models and languages. Using our MultiWikiAtomic dataset in English, Spanish, and Danish, we analyze how large language models (LLMs) integrate context, prioritize information, and incorporate PK in open-ended question answering. Our analysis uncovers a phenomenon we call lost-in-the-later, where LLMs tend to overlook or deprioritize information that appears later in a given context, revealing a strong positional bias that affects contextual grounding. We further find that reasoning models, as well as non-reasoning models prompted with chain-of-thought (CoT), use context even less than non-reasoning models without CoT and fail to mitigate the lost-in-the-later effect. CoT prompting, in particular, results in lower recall and shorter responses, leading to degraded contextual grounding. Based on these insights, we design prompt-based methods to effectively leverage input context. A case study applying CoPE to summarization demonstrates that CK-informed prompting improves factual grounding and reduces hallucination.
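
One simple way to picture the positional bias described above is to measure recall of context facts separately for the earlier and later parts of the context. The sketch below is a toy stand-in under loud assumptions: facts are matched by case-insensitive substring, the context is split into equal-sized bins, and `recall_by_position` is a hypothetical helper, not part of CoPE.

```python
def recall_by_position(facts, response, n_bins=2):
    """Return per-bin recall of context facts, with bins ordered from early to late."""
    hits = [0] * n_bins
    totals = [0] * n_bins
    lowered = response.lower()
    for i, fact in enumerate(facts):
        bin_index = i * n_bins // len(facts)  # position of the fact in the context
        totals[bin_index] += 1
        if fact.lower() in lowered:
            hits[bin_index] += 1
    return [h / t if t else 0.0 for h, t in zip(hits, totals)]


facts = ["born in 1950", "studied physics", "moved to Oslo", "won a prize"]
response = "She was born in 1950, studied physics, and later moved to Oslo."
print(recall_by_position(facts, response))
```

A lower value in the last bin than in the first, averaged over many examples, would be consistent with a lost-in-the-later effect.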


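[74] Gendered Divides in Online Discussions about Reproductive Rights cs.CL | cs.CY

Ashwin Rao, Sze Yuh Nina Wang, Kristina Lerman

TL;DR: This paper studies gender differences in discussions of reproductive rights on X (formerly Twitter) after the U.S. Supreme Court's 2022 Dobbs ruling, showing how gender and local political context shape public discourse.

Details

Motivation: The authors explore how gender and local political context jointly shape public attitudes and emotional expression on abortion, filling a gap in research on the interaction between gender and place.

Result: Gender has a stronger effect on abortion attitudes in conservative regions, and the leak of the Dobbs draft opinion further intensified online engagement, disproportionately mobilizing women who support abortion rights.

Insight: Abortion discourse is not only ideologically polarized but also deeply structured by gender and place, highlighting the central role of identity in political expression during moments of institutional disruption.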

Abstract: The U.S. Supreme Court’s 2022 ruling in Dobbs v. Jackson Women’s Health Organization marked a turning point in the national debate over reproductive rights. While the ideological divide over abortion is well documented, less is known about how gender and local sociopolitical contexts interact to shape public discourse. Drawing on nearly 10 million abortion-related posts on X (formerly Twitter) from users with inferred gender, ideology and location, we show that gender significantly moderates abortion attitudes and emotional expression, particularly in conservative regions, and independently of ideology. This creates a gender gap in abortion attitudes that grows more pronounced in conservative regions. The leak of the Dobbs draft opinion further intensified online engagement, disproportionately mobilizing pro-abortion women in areas where access was under threat. These findings reveal that abortion discourse is not only ideologically polarized but also deeply structured by gender and place, highlighting the central role of identity in shaping political expression during moments of institutional disruption.


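[75] Enhancing Test-Time Scaling of Large Language Models with Hierarchical Retrieval-Augmented MCTS cs.CL

Alex ZH Dou, Zhongwei Wan, Dongfei Cui, Xin Wang, Jing Xiong

TL;DR: The paper proposes R2-LLMs, a hierarchical retrieval-augmented reasoning framework that combines coarse- and fine-grained retrieval-based in-context learning with Monte Carlo Tree Search (MCTS) and a process reward model (PRM) to improve the reasoning performance of large language models.

Details

Motivation: Test-time scaling is a paradigm that uses additional compute at inference time to improve language model performance. Existing methods, however, typically need to distill chain-of-thought (CoT) training data from more advanced models. This work aims to lift that limitation with an approach that requires no such distillation.

Result: Experiments on MATH500, GSM8K, and OlympiadBench-TO show a relative improvement of up to 16% over baselines with LLaMA-3.1-8B.

Insight: The framework shows the potential of combining hierarchical retrieval with tree search for reasoning tasks, and suggests a way to avoid distilling CoT training data from stronger models.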

Abstract: Test-time scaling has emerged as a promising paradigm in language modeling, leveraging additional computational resources at inference time to enhance model performance. In this work, we introduce R2-LLMs, a novel and versatile hierarchical retrieval-augmented reasoning framework designed to improve test-time scaling in large language models (LLMs) without requiring distillation from more advanced models to obtain chain-of-thought (CoT) training data. R2-LLMs enhances inference-time generalization by integrating dual-level retrieval-based in-context learning: (1) At the coarse level, our approach extracts abstract templates from complex reasoning problems and retrieves similar problem-answer pairs to facilitate high-level in-context learning; (2) At the fine level, during Monte Carlo Tree Search (MCTS), R2-LLMs efficiently retrieves analogous intermediate solution steps from reference mathematical problem datasets, refining step-wise reasoning with the aid of a process reward model (PRM) for scoring. R2-LLMs is a robust hierarchical reasoning-augmentation method that enhances in-context-level reasoning while seamlessly integrating with step-level tree search methods. Utilizing PRM, it refines both candidate generation and decision-making for improved reasoning accuracy. Empirical evaluations on the MATH500, GSM8K, and OlympiadBench-TO datasets achieve substantial relative improvement with an increase of up to 16% using LLaMA-3.1-8B compared to the baselines, showcasing the effectiveness of our approach in complex reasoning tasks.


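[76] Self-Review Framework for Enhancing Instruction Following Capability of LLM cs.CL | cs.AI

Sihyun Park

TL;DR: Re5 is a self-evaluation and revision framework designed to improve LLM instruction following while preserving the quality of generated content. Through task and constraint extraction, structural evaluation, and selective revision, Re5 achieves results comparable to models trained on data from a high-performance model, using little data and limited external supervision.

Details

Motivation: Existing iterative revision methods improve instruction following, but costs rise sharply as the amount of data and the number of revision iterations grow. Open-source LLMs have limited self-evaluation ability, and excessive revision degrades output quality. An efficient, quality-preserving method is therefore needed.

Result: With a small amount of data, Re5 reaches instruction-following performance comparable to models trained on data generated by GPT-4o-mini, while maintaining a 64.24% win rate over the non-revised initial responses.

Insight: Combining self-evaluation, structural analysis, and selective revision can substantially improve LLM instruction following with little data and low cost while avoiding quality degradation. The framework offers a new approach to efficient LLM optimization.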

Abstract: Various techniques have been proposed to improve the adherence of large language models (LLMs) to formatting and instruction constraints. One of the most effective approaches involves utilizing high-quality data generated by powerful models. However, such models often fail to fully comply with complex instructions in a single generation. To address this limitation, iterative revision methods have been introduced. Nevertheless, as the number of data points and revision iterations increases, the associated monetary costs grow significantly. As a resource-efficient alternative, methods have been proposed that leverage high-performance evaluation tools to compensate for the limited self-evaluation capabilities of open-source LLMs. However, these approaches often lead to a degradation in output quality due to excessive revision. To overcome these challenges, we propose Re5, a self-evaluation and revision framework designed to enhance instruction-following performance while preserving the quality of the generated content. Re5 extracts task and constraint components from user instructions, performs structural evaluations to prevent error accumulation, and applies fine-grained constraint-specific content evaluations followed by selective revisions. This process ensures precise and quality-preserving improvements. The final high-quality outputs are used for alignment tuning, enabling long-term alignment improvements through a data-centric iterative refinement loop. Experimental results demonstrate that Re5 achieves instruction-following performance comparable to models trained on data generated by GPT-4o-mini, a high-performance model, even with a small amount of data, while maintaining response quality with a 64.24% win rate over the non-revised initial responses. These results validate Re5 as an efficient and effective solution for enhancing instruction adherence with minimal external supervision.


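[77] Flipping Knowledge Distillation: Leveraging Small Models’ Expertise to Enhance LLMs in Text Matching cs.CL

Mingzhe Li, Jing Xiang, Qishen Zhang, Kaiyang Wan, Xiuying Chen

TL;DR: This paper proposes a flipped knowledge distillation approach in which a smaller language model (SLM) transfers knowledge to a large language model (LLM), using the SLM's expertise in text matching to improve the LLM.

Details

Motivation: Knowledge distillation usually transfers knowledge from an LLM to an SLM, but in text matching, fine-tuned small models are often better at domain-specific representation learning. Flipped knowledge distillation is proposed to combine the strengths of both.

Result: Effectiveness is confirmed on financial and healthcare benchmarks and in real-world applications, and the model has been deployed in an online environment.

Insight: Small models can outperform large ones on specific tasks. Flipped knowledge distillation exploits this and offers a new route to improving LLM performance.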

Abstract: Knowledge distillation typically involves transferring knowledge from a Large Language Model (LLM) to a Smaller Language Model (SLM). However, in tasks such as text matching, fine-tuned smaller models often yield more effective domain-specific representations, as they focus on optimizing the similarity of input pairs. To leverage both the specialized strengths of small models and the rich semantic understanding of LLMs, we introduce a flipped knowledge distillation paradigm, where LLM learns from SLM. Specifically, we address the architectural gap between decoder-only LLMs and smaller encoder-based models by reinterpreting LLMs in an encoder-decoder manner using LoRA. The encoder generates compressed representations, while the decoder maps them to the output space. During training, the encoder produces representations and their similarities, which are then aligned with the similarity scores produced by the teacher, using our proposed Margin-aware Contrastive Learning (MCL) approach. The MCL ensures accurate similarity for both positive and negative pairs, and adaptively handles the internal differences within positive and negative samples. Our paradigm requires only a reasonably good-performing SLM, allowing the LLM to achieve improved performance. Experiments on financial and healthcare benchmarks, as well as real-world applications, confirm its effectiveness, and the model has been fully deployed in an online environment.


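[78] ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues? cs.CL

Haoxin Wang, Xianhan Peng, Xucheng Huang, Yizhe Huang, Ming Gong

TL;DR: ECom-Bench is the first benchmark framework for evaluating LLM agents with multimodal capabilities in e-commerce customer support. It is built on real user dialogues and dynamic user simulation, its tasks cover complex scenarios, and even GPT-4o performs poorly on it.

Details

Motivation: There is no evaluation benchmark for LLM agents in e-commerce customer support. Real scenarios are highly complex and call for systematic testing of multimodal capabilities.

Result: Advanced models such as GPT-4o achieve only 10-20% on the pass^3 metric, underscoring how challenging e-commerce scenarios are.

Insight: E-commerce customer support places high demands on the multimodal abilities of LLM agents and on their handling of complex scenarios, and current performance leaves considerable room for improvement.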

Abstract: In this paper, we introduce ECom-Bench, the first benchmark framework for evaluating LLM agents with multimodal capabilities in the e-commerce customer support domain. ECom-Bench features dynamic user simulation based on persona information collected from real e-commerce customer interactions and a realistic task dataset derived from authentic e-commerce dialogues. These tasks, covering a wide range of business scenarios, are designed to reflect real-world complexities, making ECom-Bench highly challenging. For instance, even advanced models like GPT-4o achieve only a 10-20% pass^3 metric in our benchmark, highlighting the substantial difficulties posed by complex e-commerce scenarios. Upon publication, the code and data will be open-sourced to facilitate further research and development in this domain.
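
The abstract reports a pass^3 score without defining it. Assuming it follows the pass^k convention used in some agent benchmarks (the chance that all k independent trials of a task succeed, estimated from n trials with c successes as C(c, k) / C(n, k) and averaged over tasks), it could be computed as in the sketch below. The function name `pass_hat_k` and the input layout are hypothetical.

```python
from math import comb


def pass_hat_k(results, k):
    """Average over tasks of C(c, k) / C(n, k), where each task has n trials and c successes.

    `results` is a list of (n_trials, n_successes) pairs, one per task.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    total = 0.0
    for n_trials, n_successes in results:
        if k > n_trials:
            raise ValueError("k cannot exceed the number of trials")
        total += comb(n_successes, k) / comb(n_trials, k)
    return total / len(results)


# One task solved in all 3 trials, one solved in only 2 of 3.
print(pass_hat_k([(3, 3), (3, 2)], k=3))
```

Under this definition a single failed trial out of three gives a task no credit, which is one reason such a metric is much stricter than average success rate.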


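[79] Smoothie-Qwen: Post-Hoc Smoothing to Reduce Language Bias in Multilingual LLMs cs.CL

SeungWon Ji, Jungyup Lee, Jemin Kim, Sang Park, SeungJae Lee

TL;DR: This paper proposes Smoothie-Qwen, a lightweight post-hoc method that reduces language bias in multilingual LLMs without retraining. It selectively adjusts token-level output probabilities to suppress generation in undesired languages.

Details

Motivation: Multilingual LLMs often exhibit language confusion, responding in a dominant language regardless of the language of the prompt. This limits their reliability and controllability in global applications.

Result: Applied to the Qwen model, Smoothie-Qwen reduces unintended Chinese output by over 95% while preserving accuracy on multilingual benchmarks.

Insight: The method is an efficient and practical way to improve the language controllability of multilingual LLMs, making them better suited to global applications.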

Abstract: Multilingual large language models (LLMs) often exhibit language confusion, a tendency to generate responses in a dominant language irrespective of the prompt’s language. To address this, we propose Smoothie-Qwen, a lightweight, post-hoc method that mitigates language bias without retraining. This technique selectively adjusts token-level output probabilities to effectively suppress undesired language generation. Applied to the Qwen model, our method reduces unintended Chinese output by over 95% while preserving task accuracy on multilingual benchmarks. This work provides a practical and efficient solution for enhancing the language controllability of LLMs, making them more reliable for global applications.
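
To make the idea of adjusting token-level output probabilities concrete, the sketch below scales down the probability of tokens that contain CJK ideographs and renormalizes. It is an illustration under stated assumptions, not the method from the paper: the abstract does not say how tokens are selected or how strongly they are down-weighted, and `contains_cjk` and `smooth_probs` are hypothetical names.

```python
def contains_cjk(token):
    """Return True if the token contains a character in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in token)


def smooth_probs(probs, factor):
    """Scale probabilities of CJK tokens by `factor` (0 to 1) and renormalize to sum to 1."""
    scaled = {
        token: (p * factor if contains_cjk(token) else p)
        for token, p in probs.items()
    }
    total = sum(scaled.values())
    if total == 0.0:
        raise ValueError("all probability mass was suppressed")
    return {token: p / total for token, p in scaled.items()}


next_token_probs = {"hello": 0.5, "\u4f60\u597d": 0.5}
print(smooth_probs(next_token_probs, factor=0.5))
```

A factor of 0 removes the targeted tokens entirely, while intermediate values only make them less likely, which leaves the model able to produce them when the prompt calls for it.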


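[80] Agentic-R1: Distilled Dual-Strategy Reasoning cs.CL | cs.AI | cs.LG

Weihua Du, Pranjal Aggarwal, Sean Welleck, Yiming Yang

TL;DR: The DualDistill framework distills complementary reasoning strategies from multiple teacher models into a student model, Agentic-R1, which dynamically chooses between tool execution and text-based reasoning to improve task accuracy.

Details

Motivation: Long chain-of-thought models excel at mathematical reasoning but rely on slow, error-prone natural language traces, while tool-augmented agents often falter on complex logical tasks.

Result: Accuracy improves across a range of tasks, including both computation-intensive and standard benchmarks.

Insight: Multi-strategy distillation enables efficient and robust reasoning, and dynamic strategy selection outperforms single-strategy models.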

Abstract: Current long chain-of-thought (long-CoT) models excel at mathematical reasoning but rely on slow and error-prone natural language traces. Tool-augmented agents address arithmetic via code execution, but often falter on complex logical tasks. We introduce a fine-tuning framework, DualDistill, that distills complementary reasoning strategies from multiple teachers into a unified student model. Using this approach, we train Agentic-R1, which dynamically selects the optimal strategy for each query, invoking tools for arithmetic and algorithmic problems, and using text-based reasoning for abstract ones. Our method improves accuracy across a range of tasks, including both computation-intensive and standard benchmarks, demonstrating the effectiveness of multi-strategy distillation in achieving robust and efficient reasoning. Our project is available at https://github.com/StigLidu/DualDistill


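[81] HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation cs.CL | cs.AI

YiHan Jiao, ZheHao Tan, Dan Yang, DuoLin Sun, Jie Feng

TL;DR: HIRAG is a new instruction fine-tuning method for retrieval-augmented generation (RAG). By introducing a multi-level progressive chain of thought and hierarchical abilities, it significantly improves the model's ability to handle real-time information and domain-specific problems.

Details

Motivation: Traditional RAG systems rely on the in-context learning ability of the LLM itself, but the specific abilities a RAG generation model needs have not been studied in depth, which leaves problems with inconsistent document quality and imperfect retrieval systems.

Result: HIRAG significantly improves model performance on datasets such as RGB, PopQA, MuSiQue, HotpotQA, and PubmedQA.

Insight: Hierarchical ability design combined with a progressive chain of thought handles information processing and reasoning in RAG tasks more effectively.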

Abstract: Retrieval-augmented generation (RAG) has become a fundamental paradigm for addressing the challenges faced by large language models in handling real-time information and domain-specific problems. Traditional RAG systems primarily rely on the in-context learning (ICL) capabilities of the large language model itself. Still, in-depth research on the specific capabilities needed by the RAG generation model is lacking, leading to challenges with inconsistent document quality and retrieval system imperfections. Even the limited studies that fine-tune RAG generative models often lack a granular focus on RAG tasks or a deeper utilization of chain-of-thought processes. To address this, we propose that RAG models should possess three progressively hierarchical abilities: (1) Filtering: the ability to select relevant information; (2) Combination: the ability to combine semantic information across paragraphs; and (3) RAG-specific reasoning: the ability to further process external knowledge using internal knowledge. Thus, we introduce our new RAG instruction fine-tuning method, Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation (HIRAG), which incorporates a “think before answering” strategy. This method enhances the model’s open-book examination capability by utilizing multi-level progressive chain-of-thought. Experiments show that the HIRAG training strategy significantly improves the model’s performance on datasets such as RGB, PopQA, MuSiQue, HotpotQA, and PubmedQA.


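[82] Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition cs.CL | cs.AI | cs.LG | cs.SD | eess.AS

Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly

TL;DR: This paper proposes the Omni-router Transformer, which shares routing decisions across layers in a sparse mixture-of-experts (MoE) architecture to increase cooperation and specialization among experts, yielding better automatic speech recognition (ASR) performance.

Details

Motivation: In traditional MoE architectures each layer routes independently, with little cooperation among experts in different layers, which limits performance. The authors share routing decisions to strengthen cross-layer cooperation and improve robustness and performance.

Result: On a large-scale pseudo-labeled dataset and 10 diverse ASR benchmarks, the Omni-router Transformer achieves lower training loss and reduces average word error rate by 11.2% and 8.2% relative to dense and Switch Transformer models, respectively.

Insight: Sharing routing decisions strengthens cooperation among experts and improves robustness and generalization, offering a new direction for MoE architecture design.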

Abstract: Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model the Omni-router Transformer. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.
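
The difference between per-layer routers and a shared router can be sketched in a few lines. The toy below assumes top-1 routing with a linear router and residual expert updates; these details, and all names (`route`, `forward`, `zero_expert`, `bump_second`), are illustrative assumptions rather than the architecture from the paper. The only point it makes is that every MoE layer consults the same router weights.

```python
def route(hidden, router_weights):
    """Top-1 routing: return the index of the expert with the highest linear score."""
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    return max(range(len(scores)), key=scores.__getitem__)


def forward(hidden, layers, shared_router):
    """Pass `hidden` through MoE layers that all use the same router weights."""
    choices = []
    for experts in layers:
        index = route(hidden, shared_router)  # same router for every layer
        update = experts[index](hidden)
        hidden = [h + u for h, u in zip(hidden, update)]
        choices.append(index)
    return hidden, choices


def zero_expert(hidden):
    """Toy expert that contributes nothing."""
    return [0.0 for _ in hidden]


def bump_second(hidden):
    """Toy expert that adds 3 to the second feature."""
    return [0.0, 3.0]


shared_router = [[1.0, 0.0], [0.0, 1.0]]  # one row of weights per expert
layers = [[bump_second, zero_expert], [zero_expert, zero_expert]]
print(forward([2.0, 1.0], layers, shared_router))
```

With independent routing, each entry of `layers` would carry its own router weights; sharing them ties the expert choice in every layer to the same learned partition of the hidden space.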


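[83] GPTKB v1.5: A Massive Knowledge Base for Exploring Factual LLM Knowledge cs.CL

Yujia Hu, Tuan-Phong Nguyen, Shrestha Ghosh, Moritz Müller, Simon Razniewski

TL;DR: GPTKB v1.5 is a densely interlinked knowledge base of 100 million triples built from GPT-4.1 for systematically analyzing and exploring the factual knowledge of large language models (LLMs). It supports link traversal, SPARQL queries, and comparative study of the strengths and weaknesses of LLM knowledge.

Details

Motivation: The factual knowledge of language models is still poorly understood and is hard to access through ad-hoc browsing or scalable statistical analysis. GPTKB v1.5 aims to fill this gap.

Result: A densely interlinked knowledge base was built, providing a practical tool for studying LLM knowledge, with its functionality demonstrated through three use cases.

Insight: Massive-recursive LLM knowledge materialization is a groundbreaking opportunity both for the systematic analysis of LLM knowledge and for automated knowledge base construction.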

Abstract: Language models are powerful tools, yet their factual knowledge is still poorly understood, and inaccessible to ad-hoc browsing and scalable statistical analysis. This demonstration introduces GPTKB v1.5, a densely interlinked 100-million-triple knowledge base (KB) built for $14,000 from GPT-4.1, using the GPTKB methodology for massive-recursive LLM knowledge materialization (Hu et al., ACL 2025). The demonstration experience focuses on three use cases: (1) link-traversal-based LLM knowledge exploration, (2) SPARQL-based structured LLM knowledge querying, (3) comparative exploration of the strengths and weaknesses of LLM knowledge. Massive-recursive LLM knowledge materialization is a groundbreaking opportunity both for the research area of systematic analysis of LLM knowledge, as well as for automated KB construction. The GPTKB demonstrator is accessible at https://gptkb.org.


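[84] DocTalk: Scalable Graph-based Dialogue Synthesis for Enhancing LLM Conversational Capabilities cs.CL

Jing Yang Lee, Hamed Bonab, Nasser Zalmout, Ming Zeng, Sanket Lokegaonkar

TL;DR: This paper presents DocTalk, a multi-turn dialogue corpus synthesized from text documents, to strengthen the conversational abilities of large language models (LLMs). Experiments show that pre-training with DocTalk improves multi-turn dialogue performance.

Details

Motivation: LLM pre-training data consists mainly of continuous prose, whereas conversational tasks require multi-turn interaction, so the training data does not match the task. The paper aims to close this gap.

Result: Incorporating DocTalk during pre-training yields a gain of up to 40% in context memory and understanding without compromising base performance.

Insight: Structured synthesis of dialogue data can bridge the gap between pre-training data and the demands of conversational tasks, improving the conversational abilities of LLMs.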

Abstract: Large Language Models (LLMs) are increasingly employed in multi-turn conversational tasks, yet their pre-training data predominantly consists of continuous prose, creating a potential mismatch between required capabilities and training paradigms. We introduce a novel approach to address this discrepancy by synthesizing conversational data from existing text corpora. We present a pipeline that transforms a cluster of multiple related documents into an extended multi-turn, multi-topic information-seeking dialogue. Applying our pipeline to Wikipedia articles, we curate DocTalk, a multi-turn pre-training dialogue corpus consisting of over 730k long conversations. We hypothesize that exposure to such synthesized conversational structures during pre-training can enhance the fundamental multi-turn capabilities of LLMs, such as context memory and understanding. Empirically, we show that incorporating DocTalk during pre-training results in up to 40% gain in context memory and understanding, without compromising base performance. DocTalk is available at https://huggingface.co/datasets/AmazonScience/DocTalk.


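[85] Bridging Perception and Language: A Systematic Benchmark for LVLMs’ Understanding of Amodal Completion Reports cs.CL

Amane Watahiki, Tomoki Doi, Taiga Shinozaki, Satoshi Nishida, Takuya Niikawa

TL;DR: The paper builds a benchmark grounded in Basic Formal Ontology to systematically evaluate how well large vision-language models (LVLMs) understand reports of amodal completion, and finds that some models perform poorly for certain object categories under Japanese prompting.

Details

Motivation: The study fills a gap regarding the ability of LVLMs to understand and reason about texts describing amodal completion, and examines their cross-lingual behavior.

Result: Some LVLMs (such as LLaVA-NeXT variants and Claude 3.5 Sonnet) perform worse for certain object categories under Japanese prompting, with lower accuracy on the original images than on blank stimuli lacking visual content.

Insight: Some LVLMs have deficiencies in Japanese, and linguistic competence may affect their performance on multimodal tasks.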

Abstract: One of the main objectives in developing large vision-language models (LVLMs) is to engineer systems that can assist humans with multimodal tasks, including interpreting descriptions of perceptual experiences. A central phenomenon in this context is amodal completion, in which people perceive objects even when parts of those objects are hidden. Although numerous studies have assessed whether computer-vision algorithms can detect or reconstruct occluded regions, the inferential abilities of LVLMs on texts related to amodal completion remain unexplored. To address this gap, we constructed a benchmark grounded in Basic Formal Ontology to achieve a systematic classification of amodal completion. Our results indicate that while many LVLMs achieve human-comparable performance overall, their accuracy diverges for certain types of objects being completed. Notably, in certain categories, some LLaVA-NeXT variants and Claude 3.5 Sonnet exhibit lower accuracy on original images compared to blank stimuli lacking visual content. Intriguingly, this disparity emerges only under Japanese prompting, suggesting a deficiency in Japanese-specific linguistic competence among these models.


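[86] Remember Past, Anticipate Future: Learning Continual Multimodal Misinformation Detectors cs.CL | cs.MM

Bing Wang, Ximing Li, Mengzhe Ye, Changchun Li, Bo Fu

TL;DR: The paper proposes DAEDCMD, a method for continual multimodal misinformation detection (MMD) that isolates event-specific parameters and learns a continuous-time dynamics model to address forgetting of past knowledge and poor generalization to future data.

Details

Motivation: New events keep emerging in the real world, so MMD models trained offline become outdated, while existing methods cannot effectively handle knowledge forgetting and environmental change in continual learning.

Result: DAEDCMD consistently and significantly outperforms six MMD baselines and three continual learning methods.

Insight: Dynamically isolating event-specific parameters and anticipating environmental change can mitigate knowledge forgetting and improve generalization to future data, which may carry over to other continual learning tasks.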

Abstract: Nowadays, misinformation articles, especially multimodal ones, are widely spread on social media platforms and cause serious negative effects. To control their propagation, Multimodal Misinformation Detection (MMD) becomes an active topic in the community to automatically identify misinformation. Previous MMD methods focus on supervising detectors by collecting offline data. However, in real-world scenarios, new events always continually emerge, making MMD models trained on offline data consistently outdated and ineffective. To address this issue, training MMD models under online data streams is an alternative, inducing an emerging task named continual MMD. Unfortunately, it is hindered by two major challenges. First, training on new data consistently decreases the detection performance on past data, named past knowledge forgetting. Second, the social environment constantly evolves over time, affecting the generalization on future data. To alleviate these challenges, we propose to remember past knowledge by isolating interference between event-specific parameters with a Dirichlet process-based mixture-of-expert structure, and anticipate future environmental distributions by learning a continuous-time dynamics model. Accordingly, we induce a new continual MMD method DAEDCMD. Extensive experiments demonstrate that DAEDCMD can consistently and significantly outperform the compared methods, including six MMD baselines and three continual learning methods.


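[87] DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations cs.CL

Nicholas Popovič, Ashish Kangen, Tim Schopf, Michael Färber

TL;DR: The paper presents a fully automatic, LLM-based pipeline for synthetic data generation and in-context learning for document-level entity and relation extraction, removing the need for manual annotation, and evaluates it in a zero-shot setting.

Details

Motivation: High-quality annotated data is scarce for document-level entity and relation extraction in zero-shot or few-shot settings. Existing methods rely on manually annotated demonstrations or direct zero-shot inference, which limits scalability and performance.

Result: In-context joint entity and relation extraction at the document level remains challenging even for state-of-the-art large language models.

Insight: Fully automatic synthetic data generation is a viable direction for zero-shot and few-shot information extraction, but complex document-level tasks still require more capable models.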

Abstract: Large, high-quality annotated corpora remain scarce in document-level entity and relation extraction in zero-shot or few-shot settings. In this paper, we present a fully automatic, LLM-based pipeline for synthetic data generation and in-context learning for document-level entity and relation extraction. In contrast to existing approaches that rely on manually annotated demonstrations or direct zero-shot inference, our method combines synthetic data generation with retrieval-based in-context learning, using a reasoning-optimized language model. This allows us to build a high-quality demonstration database without manual annotation and to dynamically retrieve relevant examples at inference time. Based on our approach we produce a synthetic dataset of over 5k Wikipedia abstracts with approximately 59k entities and 30k relation triples. Finally, we evaluate in-context learning performance on the DocIE shared task, extracting entities and relations from long documents in a zero-shot setting. We find that in-context joint entity and relation extraction at document-level remains a challenging task, even for state-of-the-art large language models.
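
Retrieval-based in-context learning, as described above, needs a step that picks the most relevant demonstrations for each input. The sketch below is a minimal stand-in that ranks demonstrations by word-level Jaccard overlap; the paper does not state its retriever in the abstract, so the similarity measure, the `text` field, and the function names are assumptions.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings (case-insensitive)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def retrieve_demonstrations(query, demonstrations, k=2):
    """Return the k demonstrations whose `text` field is most similar to the query."""
    ranked = sorted(
        demonstrations,
        key=lambda demo: jaccard(query, demo["text"]),
        reverse=True,
    )
    return ranked[:k]


demos = [
    {"text": "Paris is the capital of France", "triples": [("Paris", "capital_of", "France")]},
    {"text": "The cat sat on the mat", "triples": []},
]
best = retrieve_demonstrations("What is the capital of France", demos, k=1)
print(best[0]["text"])
```

The retrieved demonstrations would then be placed in the prompt ahead of the document to be processed; a dense embedding retriever could replace the Jaccard score without changing the surrounding logic.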


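[88] Conditional Multi-Stage Failure Recovery for Embodied Agents cs.CL

Youmna Farag, Svetlana Stoyanchev, Mohan Li, Simon Keizer, Rama Doddipatla

TL;DR: The paper proposes a conditional multi-stage failure recovery framework based on zero-shot chain prompting to make embodied agents more robust when executing complex tasks. The framework has four error-handling stages and uses the reasoning ability of LLMs to analyze execution challenges in their environmental context and devise solutions. Experiments show strong results on the TfD benchmark.

Details

Motivation: Embodied agents are prone to failure when executing complex tasks and need effective error recovery mechanisms.

Result: On the TfD benchmark of the TEACH dataset, the method outperforms a baseline without error recovery by 11.5% and surpasses the strongest existing model by 19%.

Insight: Combining multi-stage error handling with LLM reasoning can markedly improve the task robustness of embodied agents, and zero-shot prompting provides a flexible, efficient tool for doing so.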

Abstract: Embodied agents performing complex tasks are susceptible to execution failures, motivating the need for effective failure recovery mechanisms. In this work, we introduce a conditional multistage failure recovery framework that employs zero-shot chain prompting. The framework is structured into four error-handling stages, with three operating during task execution and one functioning as a post-execution reflection phase. Our approach utilises the reasoning capabilities of LLMs to analyse execution challenges within their environmental context and devise strategic solutions. We evaluate our method on the TfD benchmark of the TEACH dataset and achieve state-of-the-art performance, outperforming a baseline without error recovery by 11.5% and surpassing the strongest existing model by 19%.


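[89] Coding Triangle: How Does Large Language Model Understand Code? cs.CL | cs.AI

Taolin Zhang, Zihan Ma, Maosong Cao, Junnan Liu, Songyang Zhang

TL;DR: The paper proposes the Code Triangle framework, which evaluates the programming ability of large language models (LLMs) along three dimensions: editorial analysis, code implementation, and test case generation. It reveals shortcomings in diversity and robustness and shows that incorporating human-generated content and model mixtures improves performance.

Details

Motivation: Although LLMs have made remarkable progress in code generation, their true programming competence remains underexplored. The paper systematically evaluates how LLMs perform on programming tasks and how far they are from human experts.

Result: Experiments show that although LLMs can form a self-consistent system across the three dimensions, their solutions are less diverse and robust than those of human programmers. Incorporating human-generated editorials, solutions, and diverse test cases, together with model mixtures, substantially improves performance and robustness.

Insight: There is a significant distribution shift between model cognition and human expertise, and model errors tend to cluster because of training data biases and limited reasoning transfer. The study points to directions for developing more powerful coding models.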

Abstract: Large language models (LLMs) have achieved remarkable progress in code generation, yet their true programming competence remains underexplored. We introduce the Code Triangle framework, which systematically evaluates LLMs across three fundamental dimensions: editorial analysis, code implementation, and test case generation. Through extensive experiments on competitive programming benchmarks, we reveal that while LLMs can form a self-consistent system across these dimensions, their solutions often lack the diversity and robustness of human programmers. We identify a significant distribution shift between model cognition and human expertise, with model errors tending to cluster due to training data biases and limited reasoning transfer. Our study demonstrates that incorporating human-generated editorials, solutions, and diverse test cases, as well as leveraging model mixtures, can substantially enhance both the performance and robustness of LLMs. Furthermore, we reveal both the consistency and inconsistency in the cognition of LLMs that may facilitate self-reflection and self-improvement, providing a potential direction for developing more powerful coding models.


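[90] Skywork-R1V3 Technical Report cs.CL | cs.CV

Wei Shen, Jiangbo Pei, Yi Peng, Xuchen Song, Yang Liu

TL;DR: Skywork-R1V3 is an advanced open-source vision-language model (VLM) that transfers reasoning skills from text-only large language models (LLMs) to visual tasks. Its key innovation is a post-training reinforcement learning (RL) framework that activates the model's reasoning ability without additional continued pre-training. It reaches 76.0% on the MMMU benchmark, matching entry-level human capabilities.

Details

Motivation: Traditional VLMs are limited on cross-modal reasoning tasks. The paper aims to improve visual reasoning by leveraging the text reasoning ability of LLMs through an RL framework, advancing open-source VLMs.

Result: 1. Skywork-R1V3 improves on MMMU from 64.3% to 76.0%, matching entry-level human capabilities; 2. the 38B-parameter model rivals top closed-source VLMs; 3. mathematical reasoning transfers successfully to other subjects.

Insight: 1. RL is a powerful tool for improving open-source VLMs; 2. cross-modal alignment is critical for multimodal reasoning; 3. the entropy of critical reasoning tokens serves as an effective quantitative indicator of reasoning capability.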

Abstract: We introduce Skywork-R1V3, an advanced, open-source vision-language model (VLM) that pioneers a new approach to visual reasoning. Its key innovation lies in effectively transferring reasoning skills from text-only Large Language Models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 primarily stems from our elaborate post-training RL framework, which effectively activates and enhances the model’s reasoning ability, without the need for additional continued pre-training. Through this framework, we further uncover the fundamental role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, we introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which has proven highly effective for checkpoint selection during RL training. Skywork-R1V3 achieves state-of-the-art results on MMMU, significantly improving from 64.3% to 76.0%. This performance matches entry-level human capabilities. Remarkably, our RL-powered post-training approach enables even the 38B parameter model to rival top closed-source VLMs. The implementation successfully transfers mathematical reasoning to other subject-related reasoning tasks. We also include an analysis of curriculum learning and reinforcement finetuning strategies, along with a broader discussion on multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal reasoning, showcasing RL as a powerful engine for advancing open-source VLM capabilities.
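
The entropy indicator mentioned in the abstract can be illustrated with the standard Shannon entropy of a next-token distribution, averaged over a chosen set of positions. How the report identifies "critical reasoning tokens" is not stated in the abstract, so the sketch simply takes their positions as an input; `token_entropy` and `mean_entropy` are hypothetical names.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions, critical_positions):
    """Average entropy over the given positions (assumed to mark critical reasoning tokens)."""
    if not critical_positions:
        raise ValueError("critical_positions must not be empty")
    total = sum(token_entropy(distributions[i]) for i in critical_positions)
    return total / len(critical_positions)


# Position 0 is maximally uncertain over two tokens; position 1 is deterministic.
distributions = [[0.5, 0.5], [1.0, 0.0]]
print(mean_entropy(distributions, [0, 1]))
```

Tracking such a quantity across checkpoints is one plausible way an entropy signal could support checkpoint selection, though the report's exact procedure is not given here.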


[91] CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization cs.CLPDF

Zhongyuan Peng, Yifan Yao, Kaijing Ma, Shuyue Guo, Yizhe Li

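TL;DR: CriticLean proposes a critic-guided reinforcement learning framework that turns the critic from a passive validator into an active learning component, substantially improving the semantic fidelity of mathematical formalization. The method outperforms baseline models on benchmarks.

Details

Motivation: Prior work focuses on generation and compilation success in mathematical formalization and neglects the critic phase, that is, checking whether a generated formalization truly captures the semantic intent of the original problem.

Result: CriticLeanGPT significantly outperforms open- and closed-source baselines on the benchmark, and the authors build FineLeanCorpus, a dataset of over 285K problems.

Insight: Optimizing the critic phase is essential for producing reliable formalizations and opens a new direction for formal mathematical reasoning.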

Abstract: Translating natural language mathematical statements into formal, executable code is a fundamental challenge in automated theorem proving. While prior work has focused on generation and compilation success, little attention has been paid to the critic phase, that is, the evaluation of whether generated formalizations truly capture the semantic intent of the original problem. In this paper, we introduce CriticLean, a novel critic-guided reinforcement learning framework that elevates the role of the critic from a passive validator to an active learning component. Specifically, first, we propose CriticLeanGPT, trained via supervised fine-tuning and reinforcement learning, to rigorously assess the semantic fidelity of Lean 4 formalizations. Then, we introduce CriticLeanBench, a benchmark designed to measure models’ ability to distinguish semantically correct from incorrect formalizations, and demonstrate that our trained CriticLeanGPT models can significantly outperform strong open- and closed-source baselines. Building on the CriticLean framework, we construct FineLeanCorpus, a dataset comprising over 285K problems that exhibits rich domain diversity, broad difficulty coverage, and high correctness based on human evaluation. Overall, our findings highlight that optimizing the critic phase is essential for producing reliable formalizations, and we hope our CriticLean will provide valuable insights for future advances in formal mathematical reasoning.


[92] DS@GT at CheckThat! 2025: Detecting Subjectivity via Transfer-Learning and Corrective Data Augmentation cs.CL | cs.AIPDF

Maximilian Heil, Dionne Bang

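TL;DR: This paper describes an approach to CheckThat! 2025 Task 1 that uses transfer learning and stylistic data augmentation to improve classification of subjective and objective sentences in news text. Transfer learning from specialized encoders outperforms fine-tuning general-purpose encoders, and carefully curated data augmentation significantly improves model robustness. The official submission ranked 16th.

Details

Motivation: The goal is to improve classification of subjective and objective sentences in news text and to explore the potential of transfer learning and data augmentation for this task.

Result: Transfer learning from specialized encoders outperforms fine-tuning general-purpose encoders, and carefully curated data augmentation significantly improves robustness in detecting subjective content. The official submission ranked 16th of 24 teams.

Insight: The study highlights the value of combining encoder specialization with label-consistent data augmentation for subjectivity detection.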

Abstract: This paper presents our submission to Task 1, Subjectivity Detection, of the CheckThat! Lab at CLEF 2025. We investigate the effectiveness of transfer-learning and stylistic data augmentation to improve classification of subjective and objective sentences in English news text. Our approach contrasts fine-tuning of pre-trained encoders and transfer-learning of fine-tuned transformer on related tasks. We also introduce a controlled augmentation pipeline using GPT-4o to generate paraphrases in predefined subjectivity styles. To ensure label and style consistency, we employ the same model to correct and refine the generated samples. Results show that transfer-learning of specified encoders outperforms fine-tuning general-purpose ones, and that carefully curated augmentation significantly enhances model robustness, especially in detecting subjective content. Our official submission placed us $16^{th}$ of 24 participants. Overall, our findings underscore the value of combining encoder specialization with label-consistent augmentation for improved subjectivity detection. Our code is available at https://github.com/dsgt-arc/checkthat-2025-subject.


[93] DS@GT at CheckThat! 2025: Evaluating Context and Tokenization Strategies for Numerical Fact Verification cs.CLPDF

Maximilian Heil, Aleksandar Pramov

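TL;DR: The study evaluates context and tokenization strategies for numerical fact verification. Right-to-left (R2L) tokenization does not improve natural language inference, longer context windows do not improve performance either, and evidence quality is the key bottleneck.

Details

Motivation: Numerical claims (such as quantities and comparisons) pose unique challenges for automated fact-checking systems. The study explores modeling strategies to improve accuracy on such claims.

Result: R2L tokenization yields no gain, and a longer context window does not help either; the best system reaches a macro-average F1 score of 0.57 (Top 4 in CheckThat! 2025 Task 3).

Insight: In numerical fact verification, evidence quality matters more than tokenization direction or context length.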

Abstract: Numerical claims, statements involving quantities, comparisons, and temporal references, pose unique challenges for automated fact-checking systems. In this study, we evaluate modeling strategies for veracity prediction of such claims using the QuanTemp dataset and building our own evidence retrieval pipeline. We investigate three key factors: (1) the impact of more evidence with longer input context windows using ModernBERT, (2) the effect of right-to-left (R2L) tokenization, and (3) their combined influence on classification performance. Contrary to prior findings in arithmetic reasoning tasks, R2L tokenization does not boost natural language inference (NLI) of numerical tasks. A longer context window does not enhance veracity performance either, highlighting evidence quality as the dominant bottleneck. Our best-performing system achieves a competitive macro-average F1 score of 0.57 and places us among the Top-4 submissions in Task 3 of CheckThat! 2025. Our code is available at https://github.com/dsgt-arc/checkthat-2025-numerical.
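The headline metric above is macro-average F1, which averages per-class F1 scores so that rare classes count as much as frequent ones. A minimal sketch of the computation follows; the class labels `A`, `B`, `C` and the toy predictions are illustrative placeholders, not data from the paper.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with three hypothetical veracity classes.
y_true = ["A", "B", "C", "A"]
y_pred = ["A", "B", "A", "A"]
result = macro_f1(y_true, y_pred)  # per-class F1: A = 0.8, B = 1.0, C = 0.0
```

Because class `C` is never predicted correctly, it pulls the macro average down to 0.6 even though three of four predictions are right, which is why the metric is favored for imbalanced label sets.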


[94] UQLM: A Python Package for Uncertainty Quantification in Large Language Models cs.CL | cs.AI | cs.LGPDF

Dylan Bouchard, Mohit Singh Chauhan, David Skarbrevik, Ho-Kyeong Ra, Viren Bajaj

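TL;DR: UQLM is a Python toolkit that detects hallucinations in large language models (LLMs) using uncertainty quantification (UQ) techniques. It provides confidence scores from 0 to 1 and aims to improve the reliability of LLM outputs.

Details

Motivation: Hallucinated content (false or misleading information) generated by LLMs undermines the safety and trustworthiness of downstream applications, so efficient detection methods are needed.

Result: UQLM provides easily integrated confidence scoring that helps improve the trustworthiness of LLM outputs.

Insight: UQ techniques can be used to detect LLM hallucinations, giving practitioners a new tool for assessing model trustworthiness.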

Abstract: Hallucinations, defined as instances where Large Language Models (LLMs) generate false or misleading content, pose a significant challenge that impacts the safety and trust of downstream applications. We introduce UQLM, a Python package for LLM hallucination detection using state-of-the-art uncertainty quantification (UQ) techniques. This toolkit offers a suite of UQ-based scorers that compute response-level confidence scores ranging from 0 to 1. This library provides an off-the-shelf solution for UQ-based hallucination detection that can be easily integrated to enhance the reliability of LLM outputs.
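To make the idea of a response-level confidence score in [0, 1] concrete, here is a minimal sketch of one generic UQ idea: sample the model several times and measure how often the samples agree with the original answer. This is not the UQLM API, and the abstract does not say which scorers the package implements; the function name and the exact-match comparison are assumptions chosen for illustration.

```python
def consistency_confidence(original, samples):
    """Fraction of sampled responses that match the original after normalization.

    Hypothetical helper, not part of UQLM. Exact-match comparison is a
    deliberate simplification; real scorers typically use softer similarity.
    """
    def normalize(text):
        # Lowercase and collapse whitespace so trivial variations still match.
        return " ".join(text.lower().split())

    if not samples:
        raise ValueError("need at least one sampled response")
    target = normalize(original)
    matches = sum(1 for sample in samples if normalize(sample) == target)
    return matches / len(samples)


# Toy example: three of four resampled answers agree with the original.
confidence = consistency_confidence("Paris", ["paris", " Paris ", "Lyon", "PARIS"])
```

A low score signals that the model gives unstable answers to the same prompt, which is the kind of evidence a hallucination detector can threshold on.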


[95] A Survey on Latent Reasoning cs.CLPDF

Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang

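TL;DR: This survey covers the emerging field of latent reasoning, examining how multi-step reasoning can be carried out in a model's continuous hidden states to overcome the dependence of explicit chain-of-thought (CoT) reasoning on natural language. It analyzes the foundational role of neural network layers in reasoning and discusses a range of latent reasoning methods and frontier paradigms.

Details

Motivation: Explicit CoT improves interpretability and accuracy, but its reliance on intermediate steps expressed in natural language limits the model's expressive bandwidth. Latent reasoning aims to remove this bottleneck by reasoning in hidden states.

Result: By performing multi-step reasoning in hidden states, latent reasoning can achieve higher expressive efficiency and global consistency while removing the dependence on explicit reasoning traces.

Insight: Latent reasoning offers a new research direction for LLM reasoning, and reasoning through continuous hidden states may play an important role in future cognitive models.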

Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, especially when guided by explicit chain-of-thought (CoT) reasoning that verbalizes intermediate steps. While CoT improves both interpretability and accuracy, its dependence on natural language reasoning limits the model’s expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model’s continuous hidden state, eliminating token-level supervision. To advance latent reasoning research, this survey provides a comprehensive overview of the emerging field of latent reasoning. We begin by examining the foundational role of neural network layers as the computational substrate for reasoning, highlighting how hierarchical representations support complex transformations. Next, we explore diverse latent reasoning methodologies, including activation-based recurrence, hidden state propagation, and fine-tuning strategies that compress or internalize explicit reasoning traces. Finally, we discuss advanced paradigms such as infinite-depth latent reasoning via masked diffusion models, which enable globally consistent and reversible reasoning processes. By unifying these perspectives, we aim to clarify the conceptual landscape of latent reasoning and chart future directions for research at the frontier of LLM cognition. An associated GitHub repository collecting the latest papers and repos is available at: https://github.com/multimodal-art-projection/LatentCoT-Horizon/.


cs.HC [Back]

[96] NRXR-ID: Two-Factor Authentication (2FA) in VR Using Near-Range Extended Reality and Smartphones cs.HC | cs.CV | cs.GRPDF

Aiur Nanzatov, Lourdes Peña-Castillo, Oscar Meruvia-Pastor

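TL;DR: NRXR-ID combines near-range extended reality (XR) with smartphones to provide two-factor authentication (2FA) in VR, letting users complete an authentication challenge without removing their headset.

Details

Motivation: VR users wearing a head-mounted display cannot see their real-world surroundings, which makes conventional 2FA hard to carry out. NRXR-ID addresses this problem.

Result: The checkers-style visual matching challenge was the most appropriate for VR, followed by a digital PIN challenge submitted via the smartphone and answered within the VR environment.

Insight: Combining smartphones with XR offers an innovative solution for secure authentication in VR; the checkers-style challenge performed best, which the summary attributes to its intuitiveness.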

Abstract: Two-factor authentication (2FA) has become widely adopted as an efficient and secure way to validate someone’s identity online. Two-factor authentication is difficult in virtual reality (VR) because users are usually wearing a head-mounted display (HMD) which does not allow them to see their real-world surroundings. We present NRXR-ID, a technique to implement two-factor authentication while using extended reality systems and smartphones. The proposed method allows users to complete an authentication challenge using their smartphones without removing their HMD. We performed a user study where we explored four types of challenges for users, including a novel checkers-style challenge. Users responded to these challenges under three different configurations, including a technique that uses the smartphone to support gaze-based selection without the use of VR controllers. A 4X3 within-subjects design allowed us to study all the variations proposed. We collected performance metrics and performed user experience questionnaires to collect subjective impressions from 30 participants. Results suggest that the checkers-style visual matching challenge was the most appropriate option, followed by entering a digital PIN challenge submitted via the smartphone and answered within the VR environment.


eess.AS [Back]

[97] ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark eess.AS | cs.CL | cs.SDPDF

He Wang, Linhan Ma, Dake Guo, Xiong Wang, Lei Xie

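TL;DR: The paper presents ContextASR-Bench, a large-scale benchmark for contextual speech recognition that fills a gap in evaluating the context-modeling and world-knowledge reasoning abilities of ASR systems.

Details

Motivation: Conventional ASR evaluation is limited to context-free settings, while recent progress in LLMs and Large Audio Language Models (LALMs) makes it pressing to evaluate the generality and intelligence of ASR systems.

Result: Experiments show that LALMs with strong world knowledge and context learning capabilities outperform conventional ASR models by a large margin.

Insight: The advantage of LALMs in contextual speech recognition underlines the importance of context modeling and world knowledge, pointing the way for future ASR system design.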

Abstract: Automatic Speech Recognition (ASR) has been extensively investigated, yet prior evaluative efforts have largely been restricted to contextless paradigms. This constraint stems from the limited proficiency of conventional ASR models in context modeling and their deficiency in memory and reasoning based on world knowledge. Recent breakthroughs in the development of Large Language Models (LLMs) and corresponding Large Audio Language Models (LALMs) have markedly enhanced the visibility of general artificial intelligence capabilities. Consequently, there exists a compelling need for a benchmark that can evaluate both the generality and intelligence of ASR systems. To address this gap, we propose ContextASR-Bench: a comprehensive, large-scale benchmark designed to assess contextual speech recognition. This benchmark encompasses up to 40,000 data entries across over 10 domains, enabling a thorough evaluation of model performance in scenarios that omit or incorporate coarse-grained or fine-grained contextual information. Moreover, diverging from conventional ASR evaluations, our benchmark includes an analysis of model efficacy in recognizing named entities mentioned within the auditory input. Our extensive evaluation highlights that LALMs, with strong world knowledge and context learning capabilities, outperform conventional ASR models by a large margin. The dataset and evaluation code have been released at https://github.com/MrSupW/ContextASR-Bench.


cs.GR [Back]

[98] Self-Attention Based Multi-Scale Graph Auto-Encoder Network of 3D Meshes cs.GR | cs.AI | cs.CVPDF

Saqib Nazir, Olivier Lézoray, Sébastien Bougleux

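TL;DR: The paper proposes 3DGeoMeshNet, a new graph convolutional network (GCN)-based framework for efficient reconstruction of 3D mesh data, which uses anisotropic convolution layers to learn global and local features directly in the spatial domain.

Details

Motivation: 3D mesh data is non-Euclidean, so conventional convolutional neural networks (CNNs) cannot process it directly. Existing graph convolution methods mostly rely on isotropic filters or spectral decomposition and struggle to capture local and global features at the same time.

Result: Experiments on the COMA human face dataset show that 3DGeoMeshNet achieves strong reconstruction accuracy.

Insight: Operating directly on the 3D mesh rather than converting it to an intermediate representation preserves geometric detail more accurately; anisotropic convolution combined with a multi-scale design is an effective way to handle non-Euclidean data.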

Abstract: 3D meshes are fundamental data representations for capturing complex geometric shapes in computer vision and graphics applications. While Convolutional Neural Networks (CNNs) have excelled in structured data like images, extending them to irregular 3D meshes is challenging due to the non-Euclidean nature of the data. Graph Convolutional Networks (GCNs) offer a solution by applying convolutions to graph-structured data, but many existing methods rely on isotropic filters or spectral decomposition, limiting their ability to capture both local and global mesh features. In this paper, we introduce 3D Geometric Mesh Network (3DGeoMeshNet), a novel GCN-based framework that uses anisotropic convolution layers to effectively learn both global and local features directly in the spatial domain. Unlike previous approaches that convert meshes into intermediate representations like voxel grids or point clouds, our method preserves the original polygonal mesh format throughout the reconstruction process, enabling more accurate shape reconstruction. Our architecture features a multi-scale encoder-decoder structure, where separate global and local pathways capture both large-scale geometric structures and fine-grained local details. Extensive experiments on the COMA dataset containing human faces demonstrate the efficiency of 3DGeoMeshNet in terms of reconstruction accuracy.


cs.RO [Back]

[99] Evaluation of Habitat Robotics using Large Language Models cs.RO | cs.CLPDF

William Li, Lei Hamilton, Kaise Al-natour, Sanjeev Mohindra

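TL;DR: The paper evaluates large language models on embodied robotic tasks and finds that reasoning models (such as OpenAI o3-mini) outperform non-reasoning models (such as GPT-4o and Llama 3) on the Meta PARTNR benchmark.

Details

Motivation: The study explores how large language models perform on collaborative robot tasks, specifically in simplified but randomized indoor kitchen environments, to open new research directions for embodied robotics.

Result: OpenAI o3-mini outperforms GPT-4o and Llama 3 across all configurations (centralized, decentralized, full observability, and partial observability), suggesting a promising avenue for embodied robotics research.

Insight: The study shows the importance of reasoning ability in collaborative robot tasks and gives direction for developing more effective language models for robotics.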

Abstract: This paper focuses on evaluating the effectiveness of Large Language Models at solving embodied robotic tasks using the Meta PARTNR benchmark. Meta PARTNR provides simplified environments and robotic interactions within randomized indoor kitchen scenes. Each randomized kitchen scene is given a task where two robotic agents cooperatively work together to solve the task. We evaluated multiple frontier models on Meta PARTNR environments. Our results indicate that reasoning models like OpenAI o3-mini outperform non-reasoning models like OpenAI GPT-4o and Llama 3 when operating in PARTNR’s robotic embodied environments. o3-mini outperformed the other models across centralized, decentralized, full observability, and partial observability configurations. This provides a promising avenue of research for embodied robotic development.


[100] 3DGS_LSR:Large_Scale Relocation for Autonomous Driving Based on 3D Gaussian Splatting cs.RO | cs.CVPDF

Haitao Lu, Haijier Chen, Haoze Liu, Shoujian Zhang, Bo Xu

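TL;DR: 3DGS-LSR is a large-scale relocalization framework based on 3D Gaussian Splatting that achieves centimeter-level localization from a single monocular RGB image, targeting autonomous driving. On the KITTI dataset its positioning accuracy clearly exceeds that of other methods.

Details

Motivation: In complex urban environments, GNSS positioning often becomes unreliable because of signal occlusion and multipath effects. Traditional mapping approaches are hard to deploy on resource-constrained robotic platforms because of their storage and computation costs. An efficient, accurate localization solution is therefore needed.

Result: On the KITTI dataset, 3DGS-LSR achieves average positioning accuracies of 0.026 m, 0.029 m, and 0.081 m on town roads, boulevard roads, and traffic-dense highways, respectively.

Insight: High-accuracy localization is achievable with 3D Gaussian Splatting and monocular RGB input alone, addressing unreliable GNSS and giving autonomous driving a dependable localization option.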

Abstract: In autonomous robotic systems, precise localization is a prerequisite for safe navigation. However, in complex urban environments, GNSS positioning often suffers from signal occlusion and multipath effects, leading to unreliable absolute positioning. Traditional mapping approaches are constrained by storage requirements and computational inefficiency, limiting their applicability to resource-constrained robotic platforms. To address these challenges, we propose 3DGS-LSR: a large-scale relocalization framework leveraging 3D Gaussian Splatting (3DGS), enabling centimeter-level positioning using only a single monocular RGB image on the client side. We combine multi-sensor data to construct high-accuracy 3DGS maps in large outdoor scenes, while the robot-side localization requires just a standard camera input. Using SuperPoint and SuperGlue for feature extraction and matching, our core innovation is an iterative optimization strategy that refines localization results through step-by-step rendering, making it suitable for real-time autonomous navigation. Experimental validation on the KITTI dataset demonstrates our 3DGS-LSR achieves average positioning accuracies of 0.026m, 0.029m, and 0.081m in town roads, boulevard roads, and traffic-dense highways respectively, significantly outperforming other representative methods while requiring only monocular RGB input. This approach provides autonomous robots with reliable localization capabilities even in challenging urban environments where GNSS fails.


cs.DC [Back]

[101] ECORE: Energy-Conscious Optimized Routing for Deep Learning Models at the Edge cs.DC | cs.CVPDF

Daghash K. Alqahtani, Maria A. Rodriguez, Muhammad Aamir Cheema, Hamid Rezatofighi, Adel N. Toosi

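TL;DR: The paper proposes ECORE, a framework that uses dynamic routing strategies to optimize deep learning inference on edge devices, substantially reducing energy consumption and latency while maintaining detection accuracy.

Details

Motivation: Edge devices are resource-constrained in real-time vision analytics such as object detection, so energy consumption must be balanced against detection accuracy.

Result: Experiments with models such as YOLO and SSD on several edge platforms show that ECORE reduces energy consumption and latency by 45% and 49%, respectively, at a cost of only 2% in detection accuracy.

Insight: Dynamic routing strategies can effectively address edge resource constraints and suit real-time vision analytics.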

Abstract: Edge computing enables data processing closer to the source, significantly reducing latency, an essential requirement for real-time vision-based analytics such as object detection in surveillance and smart city environments. However, these tasks place substantial demands on resource-constrained edge devices, making the joint optimization of energy consumption and detection accuracy critical. To address this challenge, we propose ECORE, a framework that integrates multiple dynamic routing strategies, including estimation-based techniques and a greedy selection algorithm, to direct image processing requests to the most suitable edge device-model pair. ECORE dynamically balances energy efficiency and detection performance based on object characteristics. We evaluate our approach through extensive experiments on real-world datasets, comparing the proposed routers against widely used baseline techniques. The evaluation leverages established object detection models (YOLO, SSD, EfficientDet) and diverse edge platforms, including Jetson Orin Nano, Raspberry Pi 4 and 5, and TPU accelerators. Results demonstrate that our proposed context-aware routing strategies can reduce energy consumption and latency by 45% and 49%, respectively, while incurring only a 2% loss in detection accuracy compared to accuracy-centric methods.
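The abstract mentions a greedy selection algorithm that routes each request to a device-model pair but gives no details. The following is a minimal sketch of one way such a router could work, assuming each pair comes with an estimated energy cost and accuracy. The `Candidate` type, the accuracy-floor rule, and every number below are hypothetical illustrations, not ECORE's actual design or measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    device: str       # edge platform name (illustrative)
    model: str        # detector name (illustrative)
    energy_j: float   # assumed estimated energy per request, in joules
    accuracy: float   # assumed estimated detection accuracy in [0, 1]


def greedy_route(candidates, min_accuracy):
    """Pick the lowest-energy pair that meets the accuracy floor.

    If no pair qualifies, fall back to the most accurate one.
    """
    eligible = [c for c in candidates if c.accuracy >= min_accuracy]
    if not eligible:
        return max(candidates, key=lambda c: c.accuracy)
    return min(eligible, key=lambda c: c.energy_j)


# Made-up estimates for three device-model pairs.
candidates = [
    Candidate("jetson", "yolo-large", energy_j=5.0, accuracy=0.90),
    Candidate("pi5", "ssd-lite", energy_j=1.2, accuracy=0.70),
    Candidate("pi4", "yolo-nano", energy_j=0.8, accuracy=0.55),
]
easy = greedy_route(candidates, min_accuracy=0.65)  # cheap pair suffices
hard = greedy_route(candidates, min_accuracy=0.95)  # nothing qualifies
```

In a context-aware setting, `min_accuracy` would presumably vary with the request, for example rising for images estimated to contain many small objects, so that easy inputs go to cheaper pairs.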


eess.IV [Back]

[102] Learning Segmentation from Radiology Reports eess.IV | cs.CVPDF

Pedro R. A. S. Bassi, Wenxuan Li, Jieneng Chen, Zheren Zhu, Tianyu Lin

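TL;DR: The paper proposes a report-supervision loss (R-Super) that uses radiology reports to provide voxel-wise supervision for tumor segmentation AI, substantially improving segmentation performance, especially when annotated masks are scarce.

Details

Motivation: Tumor segmentation in CT scans is critical, but annotated masks are scarce and time-consuming to create. Radiology reports are plentiful yet underused, so a method is needed to turn reports into a supervision signal for segmentation models.

Result: Experiments show that R-Super substantially improves tumor segmentation, raising the F1 score by up to 16%, with the largest gains when annotated masks are scarce.

Insight: Radiology reports can serve as an effective supervision signal that supplements scarce annotations, offering a new way to exploit data in medical image segmentation.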

Abstract: Tumor segmentation in CT scans is key for diagnosis, surgery, and prognosis, yet segmentation masks are scarce because their creation requires time and expertise. Public abdominal CT datasets have from dozens to a couple thousand tumor masks, but hospitals have hundreds of thousands of tumor CTs with radiology reports. Thus, leveraging reports to improve segmentation is key for scaling. In this paper, we propose a report-supervision loss (R-Super) that converts radiology reports into voxel-wise supervision for tumor segmentation AI. We created a dataset with 6,718 CT-Report pairs (from the UCSF Hospital), and merged it with public CT-Mask datasets (from AbdomenAtlas 2.0). We used our R-Super to train with these masks and reports, and strongly improved tumor segmentation in internal and external validation: the F1 score increased by up to 16% with respect to training with masks only. By leveraging readily available radiology reports to supplement scarce segmentation masks, R-Super strongly improves AI performance both when very few training masks are available (e.g., 50), and when many masks are available (e.g., 1.7K). Project: https://github.com/MrGiovanni/R-Super


[103] Diffusion-Based Limited-Angle CT Reconstruction under Noisy Conditions eess.IV | cs.CVPDF

Jiaqi Guo, Santiago López-Tapia

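TL;DR: The paper proposes a diffusion-based method for limited-angle CT reconstruction. Using an MR-SDE framework and the RNSD⁺ noise-aware rectification mechanism, it addresses image reconstruction under noisy conditions and significantly improves data consistency and perceptual quality.

Details

Motivation: In limited-angle CT (LACT), missing projection angles cause severe artifacts in the reconstructed images. Most existing methods assume ideal, noise-free conditions and ignore the effect of real measurement noise.

Result: Experiments show that the method outperforms baseline models in data consistency and perceptual quality and generalizes well across noise intensities and acquisition scenarios.

Insight: Combining diffusion models with a noise-aware mechanism offers a new approach to robustly solving complex inverse problems.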

Abstract: Limited-Angle Computed Tomography (LACT) is a challenging inverse problem where missing angular projections lead to incomplete sinograms and severe artifacts in the reconstructed images. While recent learning-based methods have demonstrated effectiveness, most of them assume ideal, noise-free measurements and fail to address the impact of measurement noise. To overcome this limitation, we treat LACT as a sinogram inpainting task and propose a diffusion-based framework that completes missing angular views using a Mean-Reverting Stochastic Differential Equation (MR-SDE) formulation. To improve robustness under realistic noise, we propose RNSD$^+$, a novel noise-aware rectification mechanism that explicitly models inference-time uncertainty, enabling reliable and robust reconstruction. Extensive experiments demonstrate that our method consistently surpasses baseline models in data consistency and perceptual quality, and generalizes well across varying noise intensity and acquisition scenarios.
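For readers unfamiliar with the term, a mean-reverting SDE is commonly written in the general form below. This is the standard textbook formulation, not necessarily the exact parameterization used in the paper; the symbols $\theta_t$, $\sigma_t$, and $\mu$ are notation assumed here.

```latex
\mathrm{d}x = \theta_t \, (\mu - x) \, \mathrm{d}t + \sigma_t \, \mathrm{d}w
```

Here $x$ is the evolving state, $\mu$ is the value the process drifts toward, $\theta_t > 0$ sets the speed of mean reversion, $\sigma_t$ scales the injected noise, and $w$ is a standard Wiener process. One plausible reading for sinogram inpainting, stated as an assumption, is that the forward process drifts from a complete sinogram toward a degraded limited-angle one, and the learned reverse-time process recovers the missing angular views.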


[104] A novel framework for fully-automated co-registration of intravascular ultrasound and optical coherence tomography imaging data eess.IV | cs.CVPDF

Xingwei He, Kit Mills Bransby, Ahmet Emir Ulutas, Thamil Kumaran, Nathan Angelo Lecaros Yap

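TL;DR: The paper proposes a new deep-learning framework for fully automated longitudinal and circumferential co-registration of intravascular ultrasound (IVUS) and optical coherence tomography (OCT) images, with performance comparable to expert analysts and fast processing.

Details

Motivation: In multimodality imaging studies, co-registering IVUS and OCT images usually requires manual work, which is slow and inefficient. The paper develops an automated method to improve registration efficiency and accuracy.

Result: The concordance correlation coefficient is >0.99 for longitudinal and >0.90 for circumferential co-registration; the Williams Index is 0.96 and 0.97, respectively; processing time is <90 s per vessel.

Insight: Deep learning performs well in multimodal medical image registration and can greatly speed up analysis of large-scale data, providing a useful tool for studying plaque composition.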

Abstract: Aims: To develop a deep-learning (DL) framework that will allow fully automated longitudinal and circumferential co-registration of intravascular ultrasound (IVUS) and optical coherence tomography (OCT) images. Methods and results: Data from 230 patients (714 vessels) with acute coronary syndrome that underwent near-infrared spectroscopy (NIRS)-IVUS and OCT imaging in their non-culprit vessels were included in the present analysis. The lumen borders annotated by expert analysts in 61,655 NIRS-IVUS and 62,334 OCT frames, and the side branches and calcific tissue identified in 10,000 NIRS-IVUS frames and 10,000 OCT frames, were used to train DL solutions for the automated extraction of these features. The trained DL solutions were used to process NIRS-IVUS and OCT images and their output was used by a dynamic time warping algorithm to co-register longitudinally the NIRS-IVUS and OCT images, while the circumferential registration of the IVUS and OCT was optimized through dynamic programming. On a test set of 77 vessels from 22 patients, the DL method showed high concordance with the expert analysts for the longitudinal and circumferential co-registration of the two imaging sets (concordance correlation coefficient >0.99 for the longitudinal and >0.90 for the circumferential co-registration). The Williams Index was 0.96 for longitudinal and 0.97 for circumferential co-registration, indicating a comparable performance to the analysts. The time needed for the DL pipeline to process imaging data from a vessel was <90s. Conclusion: The fully automated, DL-based framework introduced in this study for the co-registration of IVUS and OCT is fast and provides estimations that compare favorably to the expert analysts. These features render it useful in research in the analysis of large-scale data collected in studies that incorporate multimodality imaging to characterize plaque composition.
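The longitudinal step above relies on dynamic time warping (DTW), which aligns two sequences that trace the same shape at different sampling rates. A minimal sketch of classic DTW on two 1-D signals follows. Treating each pullback as a single per-frame scalar (for example a lumen measurement) is an assumption made for illustration; the abstract does not say which features are fed to the algorithm or how the cost is defined.

```python
def dtw_align(a, b):
    """Classic DTW: return (total cost, list of aligned index pairs)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = abs(a[i - 1] - b[j - 1])
            cost[i][j] = dist + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # advance only a
                cost[i][j - 1],      # advance only b
            )
    # Backtrack from the end to recover the frame-to-frame matching.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Toy signals: the first repeats one sample, as if acquired at a higher frame rate.
seq_a = [1.0, 2.0, 3.0, 3.0, 4.0]
seq_b = [1.0, 2.0, 3.0, 4.0]
total, path = dtw_align(seq_a, seq_b)
```

In the toy example both repeated samples in `seq_a` map to the same sample in `seq_b`, which is the behavior needed when two pullbacks cover the same vessel segment with different numbers of frames.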


[105] Enhancing Synthetic CT from CBCT via Multimodal Fusion and End-To-End Registration eess.IV | cs.AI | cs.CVPDF

Maximilian Tschuchnig, Lukas Lamminger, Philipp Steininger, Michael Gadermayr

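TL;DR: The paper improves the quality of synthetic CT generated from CBCT through multimodal fusion and end-to-end registration.

Details

Motivation: Cone-beam CT (CBCT) is widely used for intraoperative imaging because of its fast acquisition and low radiation dose, but it suffers from artifacts and lower visual quality. Conventional synthetic CT generation methods do not make full use of multimodal data, and the misalignment between modalities has not been handled effectively.

Result: Experiments show that combining multimodal fusion with registration outperforms baseline methods in 79 of 90 evaluation settings, with the largest gains when CBCT quality is low and the preoperative CT is moderately misaligned.

Insight: The registration module is essential in multimodal synthetic CT (sCT) generation and significantly improves image quality, especially when data quality is poor.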

Abstract: Cone-Beam Computed Tomography (CBCT) is widely used for intraoperative imaging due to its rapid acquisition and low radiation dose. However, CBCT images typically suffer from artifacts and lower visual quality compared to conventional Computed Tomography (CT). A promising solution is synthetic CT (sCT) generation, where CBCT volumes are translated into the CT domain. In this work, we enhance sCT generation through multimodal learning by jointly leveraging intraoperative CBCT and preoperative CT data. To overcome the inherent misalignment between modalities, we introduce an end-to-end learnable registration module within the sCT pipeline. This model is evaluated on a controlled synthetic dataset, allowing precise manipulation of data quality and alignment parameters. Further, we validate its robustness and generalizability on two real-world clinical datasets. Experimental results demonstrate that integrating registration in multimodal sCT generation improves sCT quality, outperforming baseline multimodal methods in 79 out of 90 evaluation settings. Notably, the improvement is most significant in cases where CBCT quality is low and the preoperative CT is moderately misaligned.


[106] LangMamba: A Language-driven Mamba Framework for Low-dose CT Denoising with Vision-language Models eess.IV | cs.AI | cs.CVPDF

Zhihao Chen, Tao Chen, Chenhui Wang, Qi Gao, Huidong Xie

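TL;DR: LangMamba is a novel framework that combines vision-language models (VLMs) with the efficient Mamba mechanism for low-dose CT (LDCT) denoising, significantly improving image quality and explainability.

Details

Motivation: Low-dose CT reduces radiation exposure but degrades image quality, and conventional deep learning methods overlook the potential benefits of high-level semantic information. Advances in vision-language models open new opportunities to use language as a supervision signal.

Result: LangMamba outperforms existing methods on two public datasets, improving detail preservation and visual fidelity. LangAE generalizes strongly to unseen datasets, and the LangDA loss improves model explainability.

Insight: Language can serve as an effective supervision signal; combining semantic information with an efficient modeling mechanism such as Mamba can significantly improve the performance and generality of medical image reconstruction.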

Abstract: Low-dose computed tomography (LDCT) reduces radiation exposure but often degrades image quality, potentially compromising diagnostic accuracy. Existing deep learning-based denoising methods focus primarily on pixel-level mappings, overlooking the potential benefits of high-level semantic guidance. Recent advances in vision-language models (VLMs) suggest that language can serve as a powerful tool for capturing structured semantic information, offering new opportunities to improve LDCT reconstruction. In this paper, we introduce LangMamba, a Language-driven Mamba framework for LDCT denoising that leverages VLM-derived representations to enhance supervision from normal-dose CT (NDCT). LangMamba follows a two-stage learning strategy. First, we pre-train a Language-guided AutoEncoder (LangAE) that leverages frozen VLMs to map NDCT images into a semantic space enriched with anatomical information. Second, we synergize LangAE with two key components to guide LDCT denoising: Semantic-Enhanced Efficient Denoiser (SEED), which enhances NDCT-relevant local semantics while capturing global features with an efficient Mamba mechanism, and Language-engaged Dual-space Alignment (LangDA) Loss, which ensures that denoised images align with NDCT in both perceptual and semantic spaces. Extensive experiments on two public datasets demonstrate that LangMamba outperforms conventional state-of-the-art methods, significantly improving detail preservation and visual fidelity. Remarkably, LangAE exhibits strong generalizability to unseen datasets, thereby reducing training costs. Furthermore, LangDA loss improves explainability by integrating language-guided insights into image reconstruction and can be used in a plug-and-play fashion. Our findings shed new light on the potential of language as a supervisory signal to advance LDCT denoising. The code is publicly available on https://github.com/hao1635/LangMamba.


cs.IR [Back]

[107] A Survey on Proactive Defense Strategies Against Misinformation in Large Language Models cs.IR | cs.AI | cs.CLPDF

Shuliang Liu, Hongyi Liu, Aiwei Liu, Bingchen Duan, Qi Zheng

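TL;DR: The paper examines how proactive defense strategies can counter misinformation generated by large language models (LLMs). It proposes a Three Pillars framework built on knowledge credibility, inference reliability, and input robustness, and reports up to 63% improvement over conventional methods.

Details

Motivation: The widespread deployment of LLMs has amplified the societal risk of algorithmically generated misinformation. Traditional detection methods struggle with its self-reinforcing, highly plausible, and multilingual nature, so a shift to proactive defense is needed.

Result: Proactive defense strategies improve misinformation prevention by up to 63% over conventional methods, but they incur computational overhead and face generalization challenges.

Insight: Future research should focus on co-designing knowledge foundations, reasoning certification, and attack-resistant interfaces to strengthen LLMs against misinformation across domains.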

Abstract: The widespread deployment of large language models (LLMs) across critical domains has amplified the societal risks posed by algorithmically generated misinformation. Unlike traditional false content, LLM-generated misinformation can be self-reinforcing, highly plausible, and capable of rapid propagation across multiple languages, which traditional detection methods fail to mitigate effectively. This paper introduces a proactive defense paradigm, shifting from passive post hoc detection to anticipatory mitigation strategies. We propose a Three Pillars framework: (1) Knowledge Credibility, fortifying the integrity of training and deployed data; (2) Inference Reliability, embedding self-corrective mechanisms during reasoning; and (3) Input Robustness, enhancing the resilience of model interfaces against adversarial attacks. Through a comprehensive survey of existing techniques and a comparative meta-analysis, we demonstrate that proactive defense strategies offer up to 63% improvement over conventional methods in misinformation prevention, despite non-trivial computational overhead and generalization challenges. We argue that future research should focus on co-designing robust knowledge foundations, reasoning certification, and attack-resistant interfaces to ensure LLMs can effectively counter misinformation across varied domains.


cs.AI [Back]

[108] Fine-Grained Vision-Language Modeling for Multimodal Training Assistants in Augmented Reality cs.AI | cs.CL | cs.CVPDF

Haochen Huang, Jiahuan Pei, Mohammad Aliannejadi, Xin Sun, Moonisa Ahsan

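TL;DR: The paper explores fine-grained vision-language models (VLMs) for augmented reality (AR) training. It introduces a dataset designed for AR training, evaluates nine state-of-the-art VLMs on it, and exposes the limitations of current models on fine-grained tasks.

Details

Motivation: AR training requires AI assistants with multimodal understanding, but current vision-language models still fall short on fine-grained tasks. The paper aims to fill this research gap and to give blind and visually impaired users equitable access to learning opportunities.

Result: Experiments show that even advanced models such as GPT-4o perform poorly on fine-grained tasks, reaching a maximum F1 score of only 40.54% on state detection, which indicates a need for better datasets and benchmarks.

Insight: The study underlines the importance of fine-grained vision-language alignment and the social value of equitable, AI-driven learning opportunities. The released resources should help future research.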

Abstract: Vision-language models (VLMs) are essential for enabling AI-powered smart assistants to interpret and reason in multimodal environments. However, their application in augmented reality (AR) training remains largely unexplored. In this work, we introduce a comprehensive dataset tailored for AR training, featuring systematized vision-language tasks, and evaluate nine state-of-the-art VLMs on it. Our results reveal that even advanced models, including GPT-4o, struggle with fine-grained assembly tasks, achieving a maximum F1 score of just 40.54% on state detection. These findings highlight the demand for enhanced datasets, benchmarks, and further research to improve fine-grained vision-language alignment. Beyond technical contributions, our work has broader social implications, particularly in empowering blind and visually impaired users with equitable access to AI-driven learning opportunities. We provide all related resources, including the dataset, source code, and evaluation results, to support the research community.


[109] MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation cs.AI | cs.CL

Fathinah Izzati, Xinyue Li, Yuxuan Wu, Gus Xia

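TL;DR: The paper proposes MusiScene, which fine-tunes MU-LLaMA to perform Music Scene Imagination (MSI) and uses the generated captions to enhance Video Background Music Generation (VBMG).

Details

Motivation: Humans can associate music with scenes, but existing music captioning models focus only on musical elements and lack cross-modal association.

Result: MusiScene's music scene captions match the video content more closely and outperform those of the music-only MU-LLaMA.

Insight: Cross-modal information (such as video-music association) improves a music captioning model's ability to imagine scenes, which in turn benefits downstream tasks.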

Abstract: Humans can imagine various atmospheres and settings when listening to music, envisioning movie scenes that complement each piece. For example, slow, melancholic music might evoke scenes of heartbreak, while upbeat melodies suggest celebration. This paper explores whether a Music Language Model, e.g. MU-LLaMA, can perform a similar task, called Music Scene Imagination (MSI), which requires cross-modal information from video and music to train. To improve upon existing music captioning models, which focus solely on musical elements, we introduce MusiScene, a music captioning model designed to imagine scenes that complement each piece of music. In this paper, (1) we construct a large-scale video-audio caption dataset with 3,371 pairs, (2) we finetune Music Understanding LLaMA for the MSI task to create MusiScene, and (3) we conduct comprehensive evaluations and show that MusiScene is more capable of generating contextually relevant captions than MU-LLaMA. We leverage the generated MSI captions to enhance Video Background Music Generation (VBMG) from text.


cs.LG

[110] Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training cs.LG | cs.AI | cs.CL

Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu

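TL;DR: This paper compares the effects of supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) on knowledge retention in continual post-training (CPT). It finds that RFT effectively mitigates forgetting and preserves model performance, whereas SFT leads to catastrophic forgetting.

Details

Motivation: To study how different learning paradigms (SFT and RFT) affect knowledge retention in continual post-training, and to explore how to adapt more effectively to ever-changing downstream tasks.

Result: RFT protects and even enhances the model's general knowledge, whereas SFT causes catastrophic forgetting and performance degradation. Further analysis shows that the implicit regularization of RFT is key to mitigating forgetting.

Insight: The implicit regularization of RFT plays an important role in continual learning, making RFT a robust paradigm choice for CPT.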

Abstract: Continual post-training (CPT) is a popular and effective technique for adapting foundation models like multimodal large language models to specific and ever-evolving downstream tasks. While existing research has primarily concentrated on methods like data replay, model expansion, or parameter regularization, the fundamental role of the learning paradigm within CPT remains largely unexplored. This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), investigating their respective impacts on knowledge retention during CPT. Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks, utilizing Qwen2.5-VL-7B-Instruct as the base model for continual post-training. The investigation yields two significant findings: (1) When continuously learning on downstream tasks, SFT leads to catastrophic forgetting of previously learned tasks. In contrast, RFT inherently preserves prior knowledge and achieves performance comparable to multi-task training. (2) RFT successfully protects and even enhances the model’s general knowledge on standard benchmarks (e.g., MMMU and MMLU-Pro). Conversely, SFT degrades general model capabilities severely. Further analysis shows that explicit mechanisms, such as KL penalty and chain-of-thought reasoning, are not the primary factors. Instead, we find that the implicit regularization inherent to RFT is a key factor in mitigating forgetting. Finally, we propose a rollout-based instance filtering algorithm to improve the stability and efficiency of RFT. Our comprehensive study demonstrates the superiority of RFT as a robust paradigm for continual post-training.


[111] AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs cs.LG | cs.CL

Shangzhan Li, Zefan Wang, Ye He, Yuxuan Li, Qi Shi

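TL;DR: AutoTriton uses reinforcement learning to automate and optimize Triton programming, combining supervised fine-tuning with the GRPO algorithm and rule-based and execution-based rewards to significantly improve GPU kernel performance.

Details

Motivation: Deep learning kernel development requires manually tuning critical parameters such as tile sizes and memory access patterns, which makes performance optimization difficult and time-consuming. AutoTriton aims to reduce manual intervention through automation.

Result: On TritonBench and KernelBench, the 8B model performs on par with mainstream large models such as Claude-4-Sonnet. Experiments confirm the crucial role of each module (SFT, RL, and reward design).

Insight: Reinforcement learning shows great promise for automatically generating high-performance kernels, laying an important foundation for building efficient AI systems.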

Abstract: Kernel development in deep learning requires optimizing computational units across hardware while balancing memory management, parallelism, and hardware-specific optimizations through extensive empirical tuning. Although domain-specific languages like Triton simplify GPU programming by abstracting low-level details, developers must still manually tune critical parameters such as tile sizes and memory access patterns through iterative experimentation, creating substantial barriers to optimal performance and wider adoption. In this work, we introduce AutoTriton, the first model dedicated to Triton programming powered by reinforcement learning (RL). AutoTriton first performs supervised fine-tuning (SFT) on data from a high-quality data gathering pipeline to acquire essential Triton programming expertise, and then conducts RL with the Group Relative Policy Optimization (GRPO) algorithm, combining a rule-based reward and an execution-based reward to further improve its Triton programming ability. Experiments across five evaluation channels of TritonBench and KernelBench illustrate that our 8B model AutoTriton achieves performance comparable to mainstream large models, including Claude-4-Sonnet and DeepSeek-R1-0528. Further experimental analysis demonstrates the crucial role of each module within AutoTriton, including the SFT stage, the RL stage, and the reward design strategy. These findings underscore the promise of RL for automatically generating high-performance kernels. Since high-performance kernels are core components of AI systems, this breakthrough establishes an important foundation for building more efficient AI systems. The model and code will be available at https://github.com/AI9Stars/AutoTriton.
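
GRPO scores each sampled completion relative to the other completions drawn for the same prompt. The following is a minimal sketch of that group-relative normalization, paired with a hypothetical weighted sum of the rule-based and execution-based rewards. The function names, the equal weights, and the boolean reward signals are illustrative assumptions; the abstract does not specify how AutoTriton combines the two rewards.

```python
import statistics


def combined_reward(rule_ok, exec_ok, w_rule=0.5, w_exec=0.5):
    """Hypothetical weighted sum of a rule-based and an execution-based check."""
    return w_rule * float(rule_ok) + w_exec * float(exec_ok)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). The
    population standard deviation is used here; implementations differ
    on this detail.
    """
    if not rewards:
        raise ValueError("rewards must not be empty")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled kernels for one prompt: (passes rule check, passes execution).
samples = [(True, True), (True, False), (False, False), (True, True)]
rewards = [combined_reward(rule, run) for rule, run in samples]
advantages = group_relative_advantages(rewards)
```

When every completion in a group earns the same reward, all advantages are zero, so that group contributes no policy-gradient signal under this normalization.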


[112] MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment cs.LG | cs.CL

Yucheng Shi, Wenhao Yu, Zaitang Li, Yonglin Wang, Hongming Zhang

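TL;DR: MobileGUI-RL proposes a reinforcement learning framework that trains GUI agents in an online environment, generating tasks through self-exploration and curriculum learning and adapting the GRPO algorithm to improve navigation efficiency.

Details

Motivation: Most existing GUI agents are trained in offline environments and suffer from poor scalability, overfitting to specific UI templates, and brittle policies. MobileGUI-RL aims to address these challenges through online training.

Result: It outperforms existing methods on three online mobile-agent benchmarks, validating its effectiveness.

Insight: Online training and curriculum learning effectively improve the generality and robustness of GUI agents, and composite reward design is crucial for balancing task success against efficiency.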

Abstract: Recently, there has been a surge of vision-based GUI agents designed to automate everyday mobile and web tasks. These agents interpret raw GUI screenshots and autonomously decide where to click, scroll, or type, which bypasses handcrafted rules and app-specific APIs. However, most existing methods train GUI agents in offline environments using pre-collected trajectories. This approach limits scalability, causes overfitting to specific UI templates, and leads to brittle policies when faced with unseen environments. We present MobileGUI-RL, a scalable framework that trains GUI agents in an online environment. MobileGUI-RL contains two key components. It (i) synthesizes a curriculum of learnable tasks through self-exploration and filtering, and (ii) adapts GRPO to GUI navigation with trajectory-aware advantages and composite rewards that balance task success and execution efficiency. Experiments on three online mobile-agent benchmarks show consistent gains, validating the effectiveness of our approach.
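
One way to picture a composite reward that balances task success against execution efficiency is sketched below. This is a hypothetical form, not the paper's formula: the function name, the `efficiency_weight` parameter, and the choice to grant the efficiency bonus only on success are all assumptions made for illustration.

```python
def composite_reward(success, steps_used, max_steps, efficiency_weight=0.2):
    """Hypothetical reward: 1 for success plus a bonus for finishing early.

    Failed trajectories receive 0, so the efficiency term can never make a
    fast failure look better than a slow success.
    """
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")
    if not success:
        return 0.0
    # Clamp the step count into [0, max_steps] before computing the bonus.
    steps_used = min(max(steps_used, 0), max_steps)
    efficiency = 1.0 - steps_used / max_steps
    return 1.0 + efficiency_weight * efficiency


# A task solved in 5 of 10 allowed steps versus one that used all 10.
fast = composite_reward(True, 5, 10)
slow = composite_reward(True, 10, 10)
```

Gating the bonus on success is a deliberate choice in this sketch: it keeps the agent from being rewarded for terminating quickly without completing the task.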


[113] Conditional Graph Neural Network for Predicting Soft Tissue Deformation and Forces cs.LG | cs.AI | cs.CV

Madina Kojanazarova, Florentin Bieder, Robin Sandkühler, Philippe C. Cattin

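TL;DR: The paper proposes a conditional graph neural network (cGNN) for predicting soft tissue deformation and forces in virtual environments. It addresses the challenges of high deformability and data scarcity, and improves model performance by fine-tuning on experimental data.

Details

Motivation: Soft tissue simulation in virtual environments is vital for medical applications, but high deformability and the complexity of precise force feedback pose challenges. Existing methods rely on segmentation, meshing, and stiffness estimation and struggle to meet these needs.

Result: The model predicts deformations with a distance error of 0.35±0.03 mm (for deformations up to 30 mm) and forces with an absolute error of 0.37±0.05 N (for forces up to 7.5 N), demonstrating high accuracy.

Insight: Combining a data-driven approach with transfer learning effectively handles the complexity of soft tissue simulation, and applies to medicine and other fields that need realistic soft tissue simulation.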

Abstract: Soft tissue simulation in virtual environments is becoming increasingly important for medical applications. However, the high deformability of soft tissue poses significant challenges. Existing methods rely on segmentation, meshing and estimation of stiffness properties of tissues. In addition, the integration of haptic feedback requires precise force estimation to enable a more immersive experience. We introduce a novel data-driven model, a conditional graph neural network (cGNN), to tackle this complexity. Our model takes surface points and the location of applied forces, and is specifically designed to predict the deformation of the points and the forces exerted on them. We trained our model on experimentally collected surface tracking data of a soft tissue phantom and used transfer learning to overcome the data scarcity by initially training it with mass-spring simulations and fine-tuning it with the experimental data. This approach improves the generalisation capability of the model and enables accurate predictions of tissue deformations and corresponding interaction forces. The results demonstrate that the model can predict deformations with a distance error of 0.35$\pm$0.03 mm for deformations up to 30 mm and forces with an absolute error of 0.37$\pm$0.05 N for forces up to 7.5 N. Our data-driven approach presents a promising solution to the intricate challenge of simulating soft tissues within virtual environments. Beyond its applicability in medical simulations, this approach holds the potential to benefit various fields where realistic soft tissue simulations are required.
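
The two error figures above can be read as averages over predicted surface points and predicted forces. The following is a minimal sketch, assuming the distance error is the mean Euclidean distance between predicted and tracked 3D points and the force error is the mean absolute difference in newtons. Both definitions, the function names, and the toy values are assumptions, since the abstract does not spell out the exact metrics.

```python
import math


def mean_distance_error(predicted, actual):
    """Mean Euclidean distance between paired 3D points (same units as input)."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(math.dist(p, a) for p, a in zip(predicted, actual)) / len(predicted)


def mean_absolute_error(predicted, actual):
    """Mean absolute difference between paired scalar values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Toy surface points in millimetres and forces in newtons.
pred_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
true_points = [(0.0, 0.0, 0.3), (1.0, 0.4, 0.0)]
point_error_mm = mean_distance_error(pred_points, true_points)
force_error_n = mean_absolute_error([2.0, 5.0], [2.5, 4.5])
```

Reporting each error alongside the maximum deformation or force, as the abstract does, gives a sense of scale for the absolute numbers.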


[114] Concept-Based Mechanistic Interpretability Using Structured Knowledge Graphs cs.LG | cs.AI | cs.CV

Sofiia Chorna, Kateryna Tarelkina, Eloïse Berthier, Gianni Franchi

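TL;DR: The paper proposes a mechanistic interpretability framework based on concepts and structured knowledge graphs for globally analyzing model behavior, revealing how concepts are represented, interact, and propagate inside the model.

Details

Motivation: Traditional concept-based interpretability methods focus mainly on local explanations. This work extends them to mechanistic interpretability, analyzing how high-level semantic concepts are represented and interact inside a model and revealing latent information flow and circuits.

Result: The authors developed BAGEL, an interactive tool that reveals spurious correlations and information flow in models, improving the understanding of how deep learning models generalize.

Insight: By analyzing concept interactions from a global perspective, the method not only identifies a model's decision mechanisms but also helps uncover how dataset biases affect model behavior.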

Abstract: While concept-based interpretability methods have traditionally focused on local explanations of neural network predictions, we propose a novel framework and interactive tool that extends these methods into the domain of mechanistic interpretability. Our approach enables a global dissection of model behavior by analyzing how high-level semantic attributes (referred to as concepts) emerge, interact, and propagate through internal model components. Unlike prior work that isolates individual neurons or predictions, our framework systematically quantifies how semantic concepts are represented across layers, revealing latent circuits and information flow that underlie model decision-making. A key innovation is our visualization platform that we named BAGEL (for Bias Analysis with a Graph for global Explanation Layers), which presents these insights in a structured knowledge graph, allowing users to explore concept-class relationships, identify spurious correlations, and enhance model trustworthiness. Our framework is model-agnostic, scalable, and contributes to a deeper understanding of how deep learning models generalize (or fail to) in the presence of dataset biases. The demonstration is available at https://knowledge-graph-ui-4a7cb5.gitlab.io/.


[115] Fair Domain Generalization: An Information-Theoretic View cs.LG | cs.CV

Tangzheng Lian, Guanyu Hu, Dimitrios Kollias, Xinyu Yang, Oya Celiktutan

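TL;DR: This paper is the first to study the combination of domain generalization (DG) and algorithmic fairness. It formulates the FairDG task, derives upper bounds on risk and fairness violations from an information-theoretic perspective, and proposes the PAFDG framework, whose advantage is validated on real-world datasets.

Details

Motivation: DG methods usually focus only on expected risk in the target domain and ignore algorithmic fairness, while fairness methods do not account for domain shift. A unified framework is therefore needed to address the dual challenge of domain generalization and fairness.

Result: On real-world vision and language datasets, PAFDG outperforms existing methods and achieves better utility-fairness trade-offs.

Insight: Information theory provides a theoretical basis for unifying domain generalization and fairness, and Pareto optimization is an effective way to balance the two.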

Abstract: Domain generalization (DG) and algorithmic fairness are two critical challenges in machine learning. However, most DG methods focus only on minimizing expected risk in the unseen target domain without considering algorithmic fairness. Conversely, fairness methods typically do not account for domain shifts, so the fairness achieved during training may not generalize to unseen test domains. In this work, we bridge these gaps by studying the problem of Fair Domain Generalization (FairDG), which aims to minimize both expected risk and fairness violations in unseen target domains. We derive novel mutual information-based upper bounds for expected risk and fairness violations in multi-class classification tasks with multi-group sensitive attributes. These bounds provide key insights for algorithm design from an information-theoretic perspective. Guided by these insights, we introduce PAFDG (Pareto-Optimal Fairness for Domain Generalization), a practical framework that solves the FairDG problem and models the utility-fairness trade-off through Pareto optimization. Experiments on real-world vision and language datasets show that PAFDG achieves superior utility-fairness trade-offs compared to existing methods.
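
A utility-fairness trade-off can be pictured as a Pareto front: the set of models that no other model beats on both expected risk and fairness violation at once. The following sketch illustrates that notion of Pareto optimality on toy numbers. It is not the PAFDG training procedure, and the function name and the example values are assumptions.

```python
def pareto_front(points):
    """Return the (risk, fairness_violation) pairs not dominated by any other.

    Lower is better on both axes. A point is dominated when another point is
    no worse on both axes and strictly better on at least one.
    """
    front = []
    for i, (risk_i, viol_i) in enumerate(points):
        dominated = any(
            risk_j <= risk_i and viol_j <= viol_i
            and (risk_j < risk_i or viol_j < viol_i)
            for j, (risk_j, viol_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((risk_i, viol_i))
    return front


# Four hypothetical models; (0.3, 0.3) is dominated by (0.2, 0.2).
candidates = [(0.1, 0.5), (0.2, 0.2), (0.3, 0.3), (0.5, 0.1)]
front = pareto_front(candidates)
```

Every point left on the front represents a different balance between utility and fairness, and choosing among them is a policy decision rather than an optimization result.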