cs.CV

[1] Connecting Vision and Emissions: A Behavioural AI Approach to Carbon Estimation in Road Design cs.CV | cs.AI

Ammar K Al Mhdawi, Nonso Nnamoko, Safanah Mudheher Raafat, M. K. S. Al-Mhdawi, Amjad J Humaidi

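TL;DR: This paper proposes a real-time vehicle detection and classification framework built on an enhanced YOLOv8 to estimate carbon emissions in urban environments. By combining a deep OCR module with validation against an external database, it computes emissions for individual vehicles.

Details

Motivation: Existing carbon emission monitoring methods typically rely on aggregate, macro-level data and lack precise tracking and classification of individual vehicles. This work aims to provide a real-time, automated solution for monitoring vehicle emissions using computer vision and deep learning.

Result: The YOLOv8 detector achieves an mAP@0.5 of approximately 71% for bounding boxes and 70% for segmentation masks; character-level OCR accuracy reaches up to 99%. The results support the framework's practicality and scalability in smart transportation systems.

Insight: Combining real-time object detection with deep OCR enables emission monitoring at the level of individual vehicles and offers a new technical path for smart transportation systems.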

Abstract: We present an enhanced YOLOv8 framework for real-time vehicle detection and classification, used to estimate carbon emissions in urban environments. The system enhances the YOLOv8 architecture to detect, segment, and track vehicles from live traffic video streams. Once a vehicle is localized, a dedicated deep learning-based identification module is employed to recognize license plates and classify vehicle types. Since YOLOv8 lacks the built-in capacity for fine-grained recognition tasks such as reading license plates or determining vehicle attributes beyond class labels, our framework incorporates a hybrid pipeline where each detected vehicle is tracked and its bounding box is cropped and passed to a deep Optical Character Recognition (OCR) module. This OCR system, composed of multiple convolutional neural network (CNN) layers, is trained specifically for character-level detection and license plate decoding under varied conditions such as motion blur, occlusion, and diverse font styles. Additionally, the recognized plate information is validated using a real-time API that cross-references an external vehicle registration database to ensure accurate classification and emission estimation. This multi-stage approach enables precise, automated calculation of per-vehicle carbon emissions. Extensive evaluation was conducted using a diverse vehicle dataset enriched with segmentation masks and annotated license plates. The YOLOv8 detector achieved a mean Average Precision (mAP@0.5) of approximately 71% for bounding boxes and 70% for segmentation masks. Character-level OCR accuracy reached up to 99% with the best-performing CNN model. These results affirm the feasibility of combining real-time object detection with deep OCR for practical deployment in smart transportation systems, offering a scalable solution for automated, vehicle-specific carbon emission monitoring.
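
The multi-stage pipeline described in the abstract (track, crop, read the plate, validate against a registry, estimate emissions) can be sketched roughly as below. This is only an illustration under stated assumptions: `Detection`, `read_plate`, and `lookup_g_per_km` are hypothetical stand-ins for the detector output, the OCR module, and the registration database lookup, and the formula "emission factor times distance" is an assumed simplification, since the abstract does not say how emissions are computed.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Optional


@dataclass
class Detection:
    """One tracked vehicle (hypothetical structure, not the paper's)."""
    track_id: int
    crop: Any  # cropped image region of the vehicle's bounding box


def estimate_emissions(
    detections: Iterable[Detection],
    read_plate: Callable[[Any], Optional[str]],
    lookup_g_per_km: Callable[[str], Optional[float]],
    distance_km: float,
) -> Dict[int, float]:
    """Return estimated grams of CO2 per tracked vehicle.

    Assumption: emissions = registry emission factor (g/km) * distance (km).
    Vehicles whose plate cannot be read or validated are skipped.
    """
    totals: Dict[int, float] = {}
    for det in detections:
        plate = read_plate(det.crop)          # stand-in for the OCR stage
        if plate is None:
            continue
        factor = lookup_g_per_km(plate)       # stand-in for registry validation
        if factor is None:
            continue
        totals[det.track_id] = factor * distance_km
    return totals
```

Passing the OCR and lookup stages in as callables keeps the sketch independent of any particular detector, OCR model, or registry API.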


[2] Interpretable and Granular Video-Based Quantification of Motor Characteristics from the Finger Tapping Test in Parkinson Disease cs.CV | cs.AI

Tahereh Zarrat Ehsan, Michael Tangermann, Yağmur Güçlütürk, Bastiaan R. Bloem, Luc J. W. Evers

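TL;DR: This paper proposes a computer vision method that quantifies motor characteristics from finger-tapping test videos of Parkinson disease (PD) patients, providing a more objective and interpretable assessment.

Details

Motivation: The conventional finger-tapping test relies on the clinician's subjective evaluation, which suffers from rating variability and gives no detail on individual motor characteristics. The paper aims to provide a more objective, fine-grained quantification through video analysis.

Result: The method predicts the MDS-UPDRS finger-tapping score more accurately than state-of-the-art approaches while providing a fine-grained quantification of motor characteristics.

Insight: Video-based analysis can capture subtle differences in motor characteristics that conventional assessment misses, giving clinicians more comprehensive data.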

Abstract: Accurately quantifying motor characteristics in Parkinson disease (PD) is crucial for monitoring disease progression and optimizing treatment strategies. The finger-tapping test is a standard motor assessment. Clinicians visually evaluate a patient’s tapping performance and assign an overall severity score based on tapping amplitude, speed, and irregularity. However, this subjective evaluation is prone to inter- and intra-rater variability, and does not offer insights into individual motor characteristics captured during this test. This paper introduces a granular computer vision-based method for quantifying PD motor characteristics from video recordings. Four sets of clinically relevant features are proposed to characterize hypokinesia, bradykinesia, sequence effect, and hesitation-halts. We evaluate our approach on video recordings and clinical evaluations of 74 PD patients from the Personalized Parkinson Project. Principal component analysis with varimax rotation shows that the video-based features corresponded to the four deficits. Additionally, video-based analysis has allowed us to identify further granular distinctions within sequence effect and hesitation-halts deficits. In the following, we have used these features to train machine learning classifiers to estimate the Movement Disorder Society Unified Parkinson Disease Rating Scale (MDS-UPDRS) finger-tapping score. Compared to state-of-the-art approaches, our method achieves a higher accuracy in MDS-UPDRS score prediction, while still providing an interpretable quantification of individual finger-tapping motor characteristics. In summary, the proposed framework provides a practical solution for the objective assessment of PD motor characteristics, that can potentially be applied in both clinical and remote settings. Future work is needed to assess its responsiveness to symptomatic treatment and disease progression.


[3] Reinforcement Learning-Based Dynamic Grouping for Tubular Structure Tracking cs.CV | cs.AI | cs.LG

Chong Di, Shuwang Zhou, Da Chen, Jean-Marie Mirebeau, Minglei Shu

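TL;DR: The paper proposes a reinforcement learning-based dynamic grouping method for tracking tubular structures. By modeling segment-wise tracking as a Markov Decision Process, it improves both computational efficiency and robustness.

Details

Motivation: Existing methods for tracking tubular structures such as blood vessels and roads are challenged by complex morphologies and environmental variations; segment-wise methods in particular are computationally inefficient and depend on prior knowledge. The paper aims to optimize the search process dynamically with reinforcement learning.

Result: Experiments on typical tubular structure datasets show that the method significantly outperforms existing point-wise and segment-wise approaches, particularly on complex topologies.

Insight: Reinforcement learning can be applied effectively to dynamic path search problems, showing robustness and flexibility under uncertainty and with complex structures.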

Abstract: The computation of minimal paths for tracking tubular structures such as blood vessels and roads is challenged by complex morphologies and environmental variations. Existing approaches can be roughly categorized into two research lines: point-wise models and segment-wise models. Although segment-wise approaches have obtained promising results in many scenarios, they often suffer from computational inefficiency and rely heavily on a prescribed prior to fit the target elongated shapes. We propose a novel framework that casts segment-wise tracking as a Markov Decision Process (MDP), enabling a reinforcement learning approach. Our method leverages Q-Learning to dynamically explore a graph of segments, computing edge weights on demand and adaptively expanding the search space. This strategy avoids the high cost of a pre-computed graph and proves robust to incomplete initial information. Experimental results on typical tubular structure datasets demonstrate that our method significantly outperforms state-of-the-art point-wise and segment-wise approaches. The proposed method effectively handles complex topologies and maintains global path coherence without depending on extensive prior structural knowledge.
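
To make the idea of Q-Learning over a graph of segments with on-demand edge weights more concrete, here is a minimal tabular sketch. It is not the authors' implementation: the state is simply the current segment, the reward is the negative edge cost, and `neighbors` and `edge_cost` are hypothetical callables supplied by the caller. Edge costs are computed lazily and cached, so edges that are never explored are never evaluated.

```python
import random


def q_learning_path(neighbors, edge_cost, start, goal, episodes=500,
                    alpha=0.5, gamma=0.95, epsilon=0.2, max_steps=50, seed=0):
    """Tabular Q-learning over a segment graph with lazily computed edge costs."""
    rng = random.Random(seed)
    q = {}            # (state, action) -> estimated return
    cost_cache = {}   # (u, v) -> cost, filled only when an edge is first used

    def cost(u, v):
        if (u, v) not in cost_cache:
            cost_cache[(u, v)] = edge_cost(u, v)
        return cost_cache[(u, v)]

    for _ in range(episodes):
        state = start
        for _ in range(max_steps):
            if state == goal:
                break
            actions = neighbors(state)
            if not actions:
                break
            # Epsilon-greedy choice of the next segment.
            if rng.random() < epsilon:
                action = rng.choice(actions)
            else:
                action = max(actions, key=lambda a: q.get((state, a), 0.0))
            reward = -cost(state, action)
            following = neighbors(action)
            if action == goal or not following:
                best_next = 0.0
            else:
                best_next = max(q.get((action, a), 0.0) for a in following)
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
            state = action

    # Greedy rollout with the learned values.
    path, state, visited = [start], start, {start}
    while state != goal and len(path) <= max_steps:
        actions = [a for a in neighbors(state) if a not in visited]
        if not actions:
            break
        state = max(actions, key=lambda a: q.get((state, a), 0.0))
        path.append(state)
        visited.add(state)
    return path, cost_cache
```

On a toy graph with edges A→B (cost 1), A→C (cost 5), B→D (cost 1), and C→D (cost 1), the greedy rollout from A to D follows A, B, D.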


[4] From Pixels and Words to Waves: A Unified Framework for Spectral Dictionary vLLMs cs.CV

Andrew Kiruluta, Priscilla Burity

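TL;DR: The paper proposes a spectral dictionary token mixer (SDict-VLM) which, according to the authors, yields the first vision-language model to remove both convolutions and self-attention, achieving efficient and interpretable multimodal fusion.

Details

Motivation: Current vision-language models (VLMs) rely on computationally intensive convolutions and self-attention, which limits their efficiency and scalability. The paper aims for lightweight, transparent multimodal alignment through a spectral dictionary representation.

Result: On MS-COCO the model reaches BLEU-4 39.2, CIDEr 127.5, and SPICE 27.0, with 50.3% accuracy on VQAv2. It closes approximately 85% of the performance gap to BLIP-2 while using 60% fewer parameters, 2.3 times less peak GPU memory, and 2.2 times faster inference than PaLI-3.

Insight: The spectral dictionary not only lowers computational complexity but also makes the model more transparent, opening a direction for efficient, interpretable VLMs.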

Abstract: Vision-language models (VLMs) unify computer vision and natural language processing in a single architecture capable of interpreting and describing images. Most state-of-the-art systems rely on two computationally intensive components: convolutions in the vision encoder and quadratic self-attention for multimodal fusion. This work removes both by introducing a spectral dictionary token mixer, which represents each image patch or wordpiece as a sparse combination of learnable frequency atoms. Our 1.1B-parameter prototype, SDict-VLM, achieves BLEU-4 of 39.2, CIDEr of 127.5, and SPICE of 27.0 on MS-COCO captioning, along with 50.3 percent accuracy on VQAv2. These results close approximately 85 percent of the performance gap to BLIP-2 while using 60 percent fewer parameters, 2.3 times less peak GPU memory, and 2.2 times faster inference than PaLI-3. To our knowledge, this is the first VLM to eliminate both convolutions and self-attention while matching mid-scale transformer baselines. In addition to its O(L log L) complexity, the shared frequency dictionary enables transparent cross-modal alignment and offers a tunable trade-off between accuracy and compute, paving the way for efficient and interpretable VLMs.


[5] DiffRIS: Enhancing Referring Remote Sensing Image Segmentation with Pre-trained Text-to-Image Diffusion Models cs.CV | cs.AI

Zhe Dong, Yuzhe Sun, Tianzhu Liu, Yanfeng Gu

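TL;DR: DiffRIS uses a pre-trained text-to-image diffusion model, together with a context perception adapter and a progressive cross-modal reasoning decoder, to improve referring remote sensing image segmentation and reach state-of-the-art performance.

Details

Motivation: Referring remote sensing image segmentation (RRSIS) is important for disaster response and urban planning, but existing methods struggle with scale variations, diverse orientations, and semantic ambiguity. The authors aim to strengthen cross-modal alignment with a pre-trained diffusion model.

Result: DiffRIS outperforms existing methods on three benchmark datasets (RRSIS-D, RefSegRS, and RISBench), establishing a new state of the art.

Insight: Pre-trained diffusion models can improve cross-modal alignment in remote sensing tasks; dynamic refinement of linguistic features and multi-scale feature interaction are the key innovations.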

Abstract: Referring remote sensing image segmentation (RRSIS) enables the precise delineation of regions within remote sensing imagery through natural language descriptions, serving critical applications in disaster response, urban development, and environmental monitoring. Despite recent advances, current approaches face significant challenges in processing aerial imagery due to complex object characteristics including scale variations, diverse orientations, and semantic ambiguities inherent to the overhead perspective. To address these limitations, we propose DiffRIS, a novel framework that harnesses the semantic understanding capabilities of pre-trained text-to-image diffusion models for enhanced cross-modal alignment in RRSIS tasks. Our framework introduces two key innovations: a context perception adapter (CP-adapter) that dynamically refines linguistic features through global context modeling and object-aware reasoning, and a progressive cross-modal reasoning decoder (PCMRD) that iteratively aligns textual descriptions with visual regions for precise segmentation. The CP-adapter bridges the domain gap between general vision-language understanding and remote sensing applications, while PCMRD enables fine-grained semantic alignment through multi-scale feature interaction. Comprehensive experiments on three benchmark datasets (RRSIS-D, RefSegRS, and RISBench) demonstrate that DiffRIS consistently outperforms existing methods across all standard metrics, establishing a new state-of-the-art for RRSIS tasks. The significant performance improvements validate the effectiveness of leveraging pre-trained diffusion models for remote sensing applications through our proposed adaptive framework.


[6] GLIMPSE: Gradient-Layer Importance Mapping for Prompted Visual Saliency Explanation for Generative LVLMs cs.CV | cs.AI

Guanxi Shen

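TL;DR: GLIMPSE is a lightweight, model-agnostic framework for visualizing where large vision language models (LVLMs) direct their visual attention during open-ended visual question answering (VQA).

Details

Motivation: Understanding where LVLMs direct their visual attention while generating free-form text is essential for understanding model behavior, diagnosing hallucination, exposing bias, and ensuring transparency.

Result: GLIMPSE outperforms prior interpretability methods in human alignment and can reveal LVLM cross-modal attribution, token-level reasoning dynamics, and systematic misalignment with human attention.

Insight: GLIMPSE provides a fine-grained analysis of cross-modal reasoning in LVLMs, making model behavior more transparent and helping to identify hallucination and bias.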

Abstract: Recent advances in large vision language models (LVLMs) have unlocked unprecedented capabilities in generating coherent responses from visual inputs. However, interpreting where LVLMs direct their visual attention while generating free-form textual responses remains a significant challenge, yet is essential for understanding model behavior, diagnosing hallucination, exposing bias and ensuring transparency. We introduce GLIMPSE (Gradient-Layer Importance Mapping for Prompted Visual Saliency Explanation), a lightweight, model-agnostic framework for visualizing the salient image regions that LVLMs rely upon during open-ended visual question answering (VQA), while concurrently revealing the multimodal textual saliency. GLIMPSE fuses gradient-weighted attention, adaptive layer propagation, and weighted token aggregation to produce holistic response-level attribution heat maps for interpreting cross-modal reasoning, outperforming prior interpretability methods in human-alignment. We demonstrate an analytic explainable AI (XAI) approach using GLIMPSE to uncover fine-grained insights into LVLM cross-modal attribution, trace token-level reasoning dynamics, and analyze systematic human-attention misalignment, hallucination, and bias.


[7] LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR cs.CV | cs.DL

Guang Yang, Victoria Ebert, Nazif Tamer, Luiza Pozzobon, Noah A. Smith

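TL;DR: LEGATO is a new end-to-end transformer model for optical music recognition (OMR). It is the first large-scale pretrained OMR model able to recognize full-page or multi-page typeset scores, and the first to generate output in ABC notation.

Details

Motivation: Existing OMR models cannot recognize large typeset scores end to end, and the field lacks a standardized evaluation. LEGATO aims to fill this gap.

Result: The model achieves state-of-the-art performance on a range of datasets.

Insight: Large-scale pretraining and an end-to-end design are key to improving the generalization of OMR models.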

Abstract: We propose Legato, a new end-to-end transformer model for optical music recognition (OMR). Legato is the first large-scale pretrained OMR model capable of recognizing full-page or multi-page typeset music scores and the first to generate documents in ABC notation, a concise, human-readable format for symbolic music. Bringing together a pretrained vision encoder with an ABC decoder trained on a dataset of more than 214K images, our model exhibits the strong ability to generalize across various typeset scores. We conduct experiments on a range of datasets and demonstrate that our model achieves state-of-the-art performance. Given the lack of a standardized evaluation for end-to-end OMR, we comprehensively compare our model against the previous state of the art using a diverse set of metrics.


[8] HAWAII: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models cs.CV | cs.AI | cs.CL

Yimu Wang, Mozhgan Nasr Azadani, Sean Sedwards, Krzysztof Czarnecki

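TL;DR: HAWAII is a novel vision-language model framework that distills knowledge from multiple visual experts into a single vision encoder, transferring knowledge efficiently while reducing computational overhead.

Details

Motivation: Improving the visual understanding of vision-language models (VLMs) is crucial, but using multiple pretrained visual experts usually incurs high computational cost. HAWAII aims to address this through knowledge distillation.

Result: Experiments on a variety of vision-language tasks show that HAWAII outperforms popular open-source VLMs.

Insight: Teacher-specific adapters with a router, combined with hierarchical (fine-grained and coarse-grained) distillation, make it possible to exploit the complementary strengths of multiple teachers while staying efficient.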

Abstract: Improving the visual understanding ability of vision-language models (VLMs) is crucial for enhancing their performance across various tasks. While using multiple pretrained visual experts has shown great promise, it often incurs significant computational costs during training and inference. To address this challenge, we propose HAWAII, a novel framework that distills knowledge from multiple visual experts into a single vision encoder, enabling it to inherit the complementary strengths of several experts with minimal computational overhead. To mitigate conflicts among different teachers and switch between different teacher-specific knowledge, instead of using a fixed set of adapters for multiple teachers, we propose to use teacher-specific Low-Rank Adaptation (LoRA) adapters with a corresponding router. Each adapter is aligned with a specific teacher, avoiding noisy guidance during distillation. To enable efficient knowledge distillation, we propose fine-grained and coarse-grained distillation. At the fine-grained level, token importance scores are employed to emphasize the most informative tokens from each teacher adaptively. At the coarse-grained level, we summarize the knowledge from multiple teachers and transfer it to the student using a set of general-knowledge LoRA adapters with a router. Extensive experiments on various vision-language tasks demonstrate the superiority of HAWAII, compared to the popular open-source VLMs.
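
The combination of teacher-specific LoRA adapters with a router can be illustrated with a small, dependency-free sketch of a single linear layer. Everything here is an assumption made for illustration (the class name `RoutedLoRALinear`, the softmax router over input tokens, the initialization scheme); the abstract does not specify HAWAII's actual layer design. Each adapter contributes a low-rank update `x @ A_k @ B_k`, and the router mixes these updates per token.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rng, rows, cols, scale=0.02):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class RoutedLoRALinear:
    """Frozen base weight plus several low-rank adapters mixed by a router."""

    def __init__(self, d_in, d_out, n_adapters, rank=2, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(rng, d_in, d_out)                      # frozen base weight
        self.A = [rand_matrix(rng, d_in, rank) for _ in range(n_adapters)]
        # Zero-initialised B means every adapter starts as a no-op.
        self.B = [[[0.0] * d_out for _ in range(rank)] for _ in range(n_adapters)]
        self.router = rand_matrix(rng, d_in, n_adapters)

    def forward(self, x):
        """x is a list of token vectors; returns (outputs, per-token gates)."""
        base = matmul(x, self.W)
        gates = [softmax(row) for row in matmul(x, self.router)]
        deltas = [matmul(matmul(x, a), b) for a, b in zip(self.A, self.B)]
        out = [
            [base[t][o] + sum(gates[t][k] * deltas[k][t][o] for k in range(len(deltas)))
             for o in range(len(base[0]))]
            for t in range(len(x))
        ]
        return out, gates
```

Because the `B` matrices start at zero, the layer initially reproduces the frozen base projection exactly, which is the usual LoRA starting point; training would then update only `A`, `B`, and the router.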


[9] Reading Smiles: Proxy Bias in Foundation Models for Facial Emotion Recognition cs.CV | cs.AI | cs.HC

Iosif Tsangko, Andreas Triantafyllopoulos, Adem Abdelmoula, Adria Mallol-Ragolta, Bjoern W. Schuller

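TL;DR: This paper examines which visual cues foundation models (FMs) rely on for facial emotion recognition. It finds that performance shifts with surface features such as visible teeth, pointing to potential bias and fairness problems.

Details

Motivation: The study asks whether the features that foundation models, and vision language models in particular, rely on for emotion recognition are psychologically grounded, and whether these models exhibit bias or fairness problems.

Result: VLM performance shifts consistently depending on whether teeth are visible. Structured introspection of the best-performing model, GPT-4o, shows that facial attributes such as eyebrow position drive much of its affective reasoning, with a high degree of internal consistency in its valence-arousal predictions.

Insight: The study exposes shortcut learning in foundation models for emotion recognition and highlights the resulting bias and fairness risks in sensitive domains such as mental health and education.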

Abstract: Foundation Models (FMs) are rapidly transforming Affective Computing (AC), with Vision Language Models (VLMs) now capable of recognising emotions in zero-shot settings. This paper probes a critical but underexplored question: what visual cues do these models rely on to infer affect, and are these cues psychologically grounded or superficially learnt? We benchmark VLMs of varying scale on a teeth-annotated subset of the AffectNet dataset and find consistent performance shifts depending on the presence of visible teeth. Through structured introspection of the best-performing model, GPT-4o, we show that facial attributes like eyebrow position drive much of its affective reasoning, revealing a high degree of internal consistency in its valence-arousal predictions. These patterns highlight the emergent nature of FM behaviour, but also reveal risks: shortcut learning, bias, and fairness issues, especially in sensitive domains like mental health and education.


[10] Lightweight RGB-T Tracking with Mobile Vision Transformers cs.CV

Mahdi Falaki, Maria A. Amer

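TL;DR: This paper proposes a lightweight RGB-T tracking algorithm based on Mobile Vision Transformers (MobileViT). A progressive fusion framework with separable attention models intra-modal and inter-modal interactions efficiently, keeping accuracy high while sharply reducing parameter count and increasing inference speed.

Details

Motivation: Single-modality tracking (e.g., RGB-only) performs poorly in challenging conditions such as low illumination and adverse weather. Multimodal trackers based on Vision Transformers perform well but are usually computationally expensive. The authors therefore aim for a lightweight, efficient multimodal tracker.

Result: Compared with state-of-the-art efficient multimodal trackers, the model has far fewer parameters (fewer than 4 million) and faster inference (122 FPS on GPU) while maintaining comparable tracking accuracy.

Insight: Lightweight ViT architectures such as MobileViT show promise for multimodal tasks; progressive fusion and separable attention are effective ways to improve model efficiency.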

Abstract: Single-modality object tracking (e.g., RGB-only) encounters difficulties in challenging imaging conditions, such as low illumination and adverse weather conditions. To solve this, multimodal tracking (e.g., RGB-T models) aims to leverage complementary data such as thermal infrared features. While recent Vision Transformer-based multimodal trackers achieve strong performance, they are often computationally expensive due to large model sizes. In this work, we propose a novel lightweight RGB-T tracking algorithm based on Mobile Vision Transformers (MobileViT). Our tracker introduces a progressive fusion framework that jointly learns intra-modal and inter-modal interactions between the template and search regions using separable attention. This design produces effective feature representations that support more accurate target localization while achieving a small model size and fast inference speed. Compared to state-of-the-art efficient multimodal trackers, our model achieves comparable accuracy while offering significantly lower parameter counts (less than 4 million) and the fastest GPU inference speed of 122 frames per second. This paper is the first to propose a tracker using Mobile Vision Transformers for RGB-T tracking and multimodal tracking at large. Tracker code and model weights will be made publicly available upon acceptance.


[11] PRISM: Perceptual Recognition for Identifying Standout Moments in Human-Centric Keyframe Extraction cs.CV

Mert Can Cakmak, Nitin Agarwal, Diwash Poudel

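TL;DR: PRISM is a lightweight, perceptually aligned framework for extracting human-centric keyframes from video, suited to real-time and resource-constrained environments.

Details

Motivation: Online videos play a major role in political discourse and in cyber social threats such as misinformation, propaganda, and radicalization. Identifying the most impactful, "standout" moments in a video is crucial for content moderation and summarization.

Result: Experiments on the BBC, TVSum, SumMe, and ClipShots datasets show that PRISM achieves strong accuracy and fidelity while maintaining high compression ratios.

Insight: PRISM offers a scalable tool for analyzing harmful or politically sensitive media on online platforms and is particularly suited to resource-constrained environments.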

Abstract: Online videos play a central role in shaping political discourse and amplifying cyber social threats such as misinformation, propaganda, and radicalization. Detecting the most impactful or “standout” moments in video content is crucial for content moderation, summarization, and forensic analysis. In this paper, we introduce PRISM (Perceptual Recognition for Identifying Standout Moments), a lightweight and perceptually-aligned framework for keyframe extraction. PRISM operates in the CIELAB color space and uses perceptual color difference metrics to identify frames that align with human visual sensitivity. Unlike deep learning-based approaches, PRISM is interpretable, training-free, and computationally efficient, making it well suited for real-time and resource-constrained environments. We evaluate PRISM on four benchmark datasets: BBC, TVSum, SumMe, and ClipShots, and demonstrate that it achieves strong accuracy and fidelity while maintaining high compression ratios. These results highlight PRISM’s effectiveness in both structured and unstructured video content, and its potential as a scalable tool for analyzing and moderating harmful or politically sensitive media in online platforms.
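
As a rough illustration of training-free keyframe selection in the CIELAB color space, the sketch below converts sRGB pixels to CIELAB (D65 white point) and starts a new keyframe whenever the color difference between the mean color of the current frame and that of the last keyframe exceeds a threshold. The specifics are assumptions, not PRISM's published procedure: the CIE76 difference (Euclidean distance in Lab), the per-frame mean as descriptor, the threshold of 10, and the representation of a frame as a flat list of `(r, g, b)` tuples are all chosen for brevity.

```python
import math


def srgb_to_lab(rgb):
    """Convert an sRGB triple in [0, 255] to CIELAB using the D65 white point."""
    def to_linear(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (to_linear(c) for c in rgb)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def mean_lab(frame):
    """Average Lab color of a frame given as a list of (r, g, b) pixels."""
    labs = [srgb_to_lab(p) for p in frame]
    return tuple(sum(c[i] for c in labs) / len(labs) for i in range(3))


def select_keyframes(frames, threshold=10.0):
    """Indices of frames whose mean color differs from the last keyframe by more than threshold (CIE76)."""
    if not frames:
        return []
    keyframes = [0]
    reference = mean_lab(frames[0])
    for i in range(1, len(frames)):
        current = mean_lab(frames[i])
        if math.dist(reference, current) > threshold:
            keyframes.append(i)
            reference = current
    return keyframes
```

A production version would likely use a more recent difference formula such as CIEDE2000 and compare frames region by region rather than by a single global mean.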


[12] MOSCARD – Causal Reasoning and De-confounding for Multimodal Opportunistic Screening of Cardiovascular Adverse Events cs.CV

Jialu Pi, Juan Maria Farina, Rimita Lahiri, Jiwoong Jeong, Archana Gurudu

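TL;DR: The paper proposes MOSCARD, a multimodal causal reasoning framework for screening major adverse cardiovascular events. It combines chest X-ray (CXR) and electrocardiogram (ECG) data and uses causal reasoning and de-confounding to improve prediction.

Details

Motivation: Major adverse cardiovascular events (MACE) are a leading cause of death worldwide, and existing screening methods are limited by sampling bias and single-modality data. The paper aims to make screening more accurate and robust by integrating multimodal data with causal reasoning.

Result: On internal data, shifted data from the emergency department (ED), and the external MIMIC dataset, MOSCARD outperforms single-modality and state-of-the-art models, with AUCs of 0.75, 0.83, and 0.71 respectively.

Insight: Integrating multimodal data with causal reasoning can improve cardiovascular event screening, and de-confounding helps reduce bias and increase robustness.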

Abstract: Major Adverse Cardiovascular Events (MACE) remain the leading cause of mortality globally, as reported in the Global Burden of Disease Study 2021. Opportunistic screening leverages data collected from routine health check-ups, and multimodal data can play a key role in identifying at-risk individuals. Chest X-rays (CXR) provide insights into chronic conditions contributing to MACE, while the 12-lead electrocardiogram (ECG) directly assesses cardiac electrical activity and structural abnormalities. Integrating CXR and ECG could offer a more comprehensive risk assessment than conventional models that rely on clinical scores, computed tomography (CT) measurements, or biomarkers and may be limited by sampling bias and single-modality constraints. We propose MOSCARD, a novel predictive modeling framework that uses multimodal causal reasoning with co-attention to align two distinct modalities and simultaneously mitigate bias and confounders in opportunistic risk estimation. The primary technical contributions are: (i) multimodal alignment of CXR with ECG guidance; (ii) integration of causal reasoning; (iii) a dual back-propagation graph for de-confounding. Evaluated on internal data, shifted data from the emergency department (ED), and the external MIMIC dataset, our model outperformed single-modality and state-of-the-art foundation models, with AUCs of 0.75, 0.83, and 0.71 respectively. The proposed cost-effective opportunistic screening enables early intervention, improving patient outcomes and reducing disparities.


[13] OpenWildlife: Open-Vocabulary Multi-Species Wildlife Detector for Geographically-Diverse Aerial Imagery cs.CV

Muhammed Patel, Javier Noa Turnes, Jayden Hsiao, Linlin Xu, David Clausi

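TL;DR: OpenWildlife (OW) is an open-vocabulary wildlife detector that uses language-aware embeddings and an adapted Grounding-DINO framework to identify multiple species, performing well on diverse aerial imagery.

Details

Motivation: Existing methods perform well in specific settings but generalize poorly across species and environments, which limits their use in biodiversity assessment.

Result: Trained on 15 datasets, OW reaches up to 0.981 mAP50 with fine-tuning and 0.597 mAP50 on datasets featuring novel species. Its search algorithm captures over 95% of species while exploring only 33% of the images.

Insight: OW demonstrates the potential of open-vocabulary models and language embeddings in ecology, providing a flexible and efficient tool for global biodiversity assessment.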

Abstract: We introduce OpenWildlife (OW), an open-vocabulary wildlife detector designed for multi-species identification in diverse aerial imagery. While existing automated methods perform well in specific settings, they often struggle to generalize across different species and environments due to limited taxonomic coverage and rigid model architectures. In contrast, OW leverages language-aware embeddings and a novel adaptation of the Grounding-DINO framework, enabling it to identify species specified through natural language inputs across both terrestrial and marine environments. Trained on 15 datasets, OW outperforms most existing methods, achieving up to 0.981 mAP50 with fine-tuning and 0.597 mAP50 on seven datasets featuring novel species. Additionally, we introduce an efficient search algorithm that combines k-nearest neighbors and breadth-first search to prioritize areas where social species are likely to be found. This approach captures over 95% of species while exploring only 33% of the available images. To support reproducibility, we publicly release our source code and dataset splits, establishing OW as a flexible, cost-effective solution for global biodiversity assessments.
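
One plausible reading of the search strategy that combines k-nearest neighbors with breadth-first search is sketched below. It is an interpretation, not the released OW code: images are represented by hypothetical 2D capture locations, a k-nearest-neighbor graph links each image to its closest neighbors, and the breadth-first search expands only from images in which the detector found animals, so regions without detections are never inspected. The callable `has_animals` stands in for running the detector on an image.

```python
import math
from collections import deque


def knn_graph(points, k):
    """Map each point index to the indices of its k nearest other points."""
    graph = {}
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        graph[i] = others[:k]
    return graph


def guided_search(points, has_animals, seeds, k=2):
    """Breadth-first search over the kNN graph, expanding only from positive images.

    Returns the list of image indices in the order they were inspected.
    """
    graph = knn_graph(points, k)
    visited = set(seeds)
    queue = deque(seeds)
    inspected = []
    while queue:
        i = queue.popleft()
        inspected.append(i)
        if has_animals(i):                 # stand-in for the detector's verdict
            for j in graph[i]:
                if j not in visited:
                    visited.add(j)
                    queue.append(j)
    return inspected
```

The intuition is that social species cluster spatially, so a positive image makes its neighbors worth inspecting next, while negatives act as a stopping frontier.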


[14] Ancient Script Image Recognition and Processing: A Review cs.CV

Xiaolei Diao, Rite Bo, Yanling Xiao, Lida Shi, Zhihan Zhou

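TL;DR: This survey comprehensively reviews methods for ancient script image recognition. It analyzes the differences and commonalities across script types and their recognition techniques, discusses unique challenges such as imbalanced data and image degradation, and summarizes current limitations and future directions.

Details

Motivation: Ancient scripts are vital carriers of human civilization, and automating their recognition matters greatly for research in archaeology and digital humanities. With the rise of deep learning the field has progressed rapidly, but it faces unique challenges such as imbalanced data and image degradation.

Result: The paper offers a structured perspective to support continued progress in the recognition, interpretation, and decipherment of ancient scripts.

Insight: Recognition methods share common strategies across script types but also need dedicated solutions for each script's particular challenges; future research could further incorporate multimodal and cross-domain knowledge.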

Abstract: Ancient scripts, e.g., Egyptian hieroglyphs, Oracle Bone Inscriptions, and Ancient Greek inscriptions, serve as vital carriers of human civilization, embedding invaluable historical and cultural information. Automating ancient script image recognition has gained importance, enabling large-scale interpretation and advancing research in archaeology and digital humanities. With the rise of deep learning, this field has progressed rapidly, with numerous script-specific datasets and models proposed. While these scripts vary widely, spanning phonographic systems with limited glyphs to logographic systems with thousands of complex symbols, they share common challenges and methodological overlaps. Moreover, ancient scripts face unique challenges, including imbalanced data distribution and image degradation, which have driven the development of various dedicated methods. This survey provides a comprehensive review of ancient script image recognition methods. We begin by categorizing existing studies based on script types and analyzing respective recognition methods, highlighting both their differences and shared strategies. We then focus on challenges unique to ancient scripts, systematically examining their impact and reviewing recent solutions, including few-shot learning and noise-robust techniques. Finally, we summarize current limitations and outline promising future directions. Our goal is to offer a structured, forward-looking perspective to support ongoing advancements in the recognition, interpretation, and decipherment of ancient scripts.


[15] MedErr-CT: A Visual Question Answering Benchmark for Identifying and Correcting Errors in CT Reports cs.CV | cs.AI

Sunggu Kyung, Hyungbin Park, Jinyoung Seo, Jimin Sung, Jihyun Kim

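TL;DR: MedErr-CT is a new visual question answering benchmark for evaluating how well multimodal large language models (MLLMs) identify and correct errors in medical CT reports.

Details

Motivation: CT is crucial to clinical diagnosis, but concern about diagnostic errors is growing. Existing medical visual question answering benchmarks lack clinical relevance and cannot assess expert-level knowledge.

Result: An evaluation of state-of-the-art 3D medical MLLMs shows substantial variation in their performance across error types.

Insight: The MedErr-CT benchmark can drive the development of more reliable, clinically applicable MLLMs, helping to reduce diagnostic errors and improve clinical accuracy.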

Abstract: Computed Tomography (CT) plays a crucial role in clinical diagnosis, but the growing demand for CT examinations has raised concerns about diagnostic errors. While Multimodal Large Language Models (MLLMs) demonstrate promising comprehension of medical knowledge, their tendency to produce inaccurate information highlights the need for rigorous validation. However, existing medical visual question answering (VQA) benchmarks primarily focus on simple visual recognition tasks, lacking clinical relevance and failing to assess expert-level knowledge. We introduce MedErr-CT, a novel benchmark for evaluating medical MLLMs’ ability to identify and correct errors in CT reports through a VQA framework. The benchmark includes six error categories - four vision-centric errors (Omission, Insertion, Direction, Size) and two lexical error types (Unit, Typo) - and is organized into three task levels: classification, detection, and correction. Using this benchmark, we quantitatively assess the performance of state-of-the-art 3D medical MLLMs, revealing substantial variation in their capabilities across different error types. Our benchmark contributes to the development of more reliable and clinically applicable MLLMs, ultimately helping reduce diagnostic errors and improve accuracy in clinical practice. The code and datasets are available at https://github.com/babbu3682/MedErr-CT.


[16] Video-XL-2: Towards Very Long-Video Understanding Through Task-Aware KV Sparsification cs.CV | cs.AI

Minghao Qin, Xiangrui Liu, Zhengyang Liang, Yan Shu, Huaying Yuan

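TL;DR: Video-XL-2 uses task-aware KV sparsification to address the high computational and memory cost of long-video understanding, delivering efficient and strong long-video analysis.

Details

Motivation: Current multimodal large language models face high memory and computational costs when processing long videos, and must balance performance against efficiency.

Result: The model achieves state-of-the-art performance on several long-video understanding benchmarks among open-source lightweight models. It can process over 10,000 frames on a single NVIDIA A100 (80GB) GPU and thousands of frames in a few seconds.

Insight: Combining KV sparsification with a task-aware chunking strategy is an effective way to improve both the efficiency and the performance of long-video understanding.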

Abstract: Multi-modal large language models (MLLMs) have made significant progress in video understanding over the past few years. However, processing long video inputs remains a major challenge due to high memory and computational costs. This makes it difficult for current models to achieve both strong performance and high efficiency in long video understanding. To address this challenge, we propose Video-XL-2, a novel MLLM that delivers superior cost-effectiveness for long-video understanding based on task-aware KV sparsification. The proposed framework operates with two key steps: chunk-based pre-filling and bi-level key-value decoding. Chunk-based pre-filling divides the visual token sequence into chunks, applying full attention within each chunk and sparse attention across chunks. This significantly reduces computational and memory overhead. During decoding, bi-level key-value decoding selectively reloads either dense or sparse key-values for each chunk based on its relevance to the task. This approach further improves memory efficiency and enhances the model’s ability to capture fine-grained information. Video-XL-2 achieves state-of-the-art performance on various long video understanding benchmarks, outperforming existing open-source lightweight models. It also demonstrates exceptional efficiency, capable of processing over 10,000 frames on a single NVIDIA A100 (80GB) GPU and thousands of frames in just a few seconds.
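
The chunk-based pre-filling step (full attention within a chunk, sparse attention across chunks) can be pictured as an attention mask. The sketch below is a simplified guess at such a mask: the abstract does not say which cross-chunk positions are kept, so this example assumes causal attention and keeps every `stride`-th token of earlier chunks. The function name and both of those choices are assumptions.

```python
def chunked_attention_mask(n_tokens, chunk_size, stride):
    """Boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumptions: attention is causal (k <= q); tokens in the same chunk
    attend to each other fully; across chunks only every `stride`-th key is kept.
    """
    mask = [[False] * n_tokens for _ in range(n_tokens)]
    for q in range(n_tokens):
        for k in range(q + 1):
            same_chunk = q // chunk_size == k // chunk_size
            if same_chunk or k % stride == 0:
                mask[q][k] = True
    return mask


def kept_fraction(mask):
    """Share of causal query-key pairs that the mask keeps."""
    n = len(mask)
    kept = sum(sum(row) for row in mask)
    return kept / (n * (n + 1) / 2)
```

Calling `kept_fraction` on masks of increasing length shows how the retained share of query-key pairs shrinks as the sequence grows relative to the chunk size, which is where the compute and memory savings would come from.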


[17] MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models cs.CV | cs.CL

Yinan Xia, Yilei Jiang, Yingshui Tan, Xiaoyong Zhu, Xiangyu Yue

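TL;DR: MSR-Align is a high-quality multimodal safety reasoning dataset that uses fine-grained, policy-grounded reasoning to make vision-language models safer and more robust against textual and vision-language attacks.

Details

Motivation: Existing safety alignment methods mainly target unimodal language models and cannot handle the complex safety threats posed by multimodal inputs. A safety alignment approach is therefore needed for reasoning-capable vision-language models (VLMs).

Result: Experiments show that VLMs fine-tuned on MSR-Align are more robust against textual and vision-language jailbreak attacks, while general reasoning performance is preserved or even improved.

Insight: Fine-grained, policy-grounded reasoning is key to improving the safety of multimodal models, and high-quality datasets are the foundation for progress in safety alignment research.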

Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. However, this advancement also introduces novel safety risks, as these models become increasingly vulnerable to harmful multimodal prompts that can trigger unethical or unsafe behaviors. Existing safety alignment approaches, primarily designed for unimodal language models, fall short in addressing the complex and nuanced threats posed by multimodal inputs. Moreover, current safety datasets lack the fine-grained, policy-grounded reasoning required to robustly align reasoning-capable VLMs. In this work, we introduce MSR-Align, a high-quality Multimodal Safety Reasoning dataset tailored to bridge this gap. MSR-Align supports fine-grained, deliberative reasoning over standardized safety policies across both vision and text modalities. Our data generation pipeline emphasizes multimodal diversity, policy-grounded reasoning, and rigorous quality filtering using strong multimodal judges. Extensive experiments demonstrate that fine-tuning VLMs on MSR-Align substantially improves robustness against both textual and vision-language jailbreak attacks, while preserving or enhancing general reasoning performance. MSR-Align provides a scalable and effective foundation for advancing the safety alignment of reasoning-capable VLMs. Our dataset is made publicly available at https://huggingface.co/datasets/Leigest/MSR-Align.


[18] Self-Paced Collaborative and Adversarial Network for Unsupervised Domain Adaptation cs.CV

Weichen Zhang, Dong Xu, Wanli Ouyang, Wen Li

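TL;DR: The paper proposes CAN, an unsupervised domain adaptation method that combines domain-collaborative and domain-adversarial learning, unifies the two through positively or negatively weighted losses, and designs a training scheme that automatically learns domain-specific and domain-invariant features. The follow-up SPCAN uses self-paced learning to select pseudo-labeled target samples and achieves state-of-the-art results on several benchmark datasets.

Details

Motivation: The core goal of unsupervised domain adaptation is to reduce the distribution gap between the source and target domains while preserving discriminability in the target domain. Traditional methods usually handle domain invariance or discriminability separately, and unifying the two effectively remains a challenge.

Result: The method achieves state-of-the-art performance on Office-31, ImageCLEF-DA, VISDA-2017, and other datasets, which supports its effectiveness.

Insight: Unifying domain-collaborative and domain-adversarial learning addresses unsupervised domain adaptation more comprehensively, and the self-paced learning strategy effectively improves classification performance in the target domain.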

Abstract: This paper proposes a new unsupervised domain adaptation approach called Collaborative and Adversarial Network (CAN), which uses the domain-collaborative and domain-adversarial learning strategy for training the neural network. The domain-collaborative learning aims to learn domain-specific feature representation to preserve the discriminability for the target domain, while the domain adversarial learning aims to learn domain-invariant feature representation to reduce the domain distribution mismatch between the source and target domains. We show that these two learning strategies can be uniformly formulated as domain classifier learning with positive or negative weights on the losses. We then design a collaborative and adversarial training scheme, which automatically learns domain-specific representations from lower blocks in CNNs through collaborative learning and domain-invariant representations from higher blocks through adversarial learning. Moreover, to further enhance the discriminability in the target domain, we propose Self-Paced CAN (SPCAN), which progressively selects pseudo-labeled target samples for re-training the classifiers. We employ a self-paced learning strategy to select pseudo-labeled target samples in an easy-to-hard fashion. Comprehensive experiments on different benchmark datasets, Office-31, ImageCLEF-DA, and VISDA-2017 for the object recognition task, and UCF101-10 and HMDB51-10 for the video action recognition task, show our newly proposed approaches achieve the state-of-the-art performance, which clearly demonstrates the effectiveness of our proposed approaches for unsupervised domain adaptation.
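The two ideas in this abstract, a domain-classifier loss whose sign decides between collaborative and adversarial learning, and easy-to-hard selection of pseudo-labeled target samples, can be illustrated with a minimal NumPy sketch. The function names, the per-block weight values, and the confidence thresholds are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def domain_bce(p_source, is_source):
    """Binary cross-entropy of a domain classifier that predicts P(source)."""
    p = np.clip(p_source, 1e-7, 1 - 1e-7)
    return float(-np.mean(is_source * np.log(p) + (1 - is_source) * np.log(1 - p)))

def signed_domain_loss(p_source, is_source, weight):
    """Loss seen by the feature extractor (hypothetical helper).

    weight > 0: collaborative, features should help the domain classifier.
    weight < 0: adversarial, features should confuse the domain classifier.
    """
    return weight * domain_bce(p_source, is_source)

def select_pseudo_labels(probs, threshold):
    """Keep target samples whose top class probability reaches the threshold."""
    confidence = probs.max(axis=1)
    keep = np.flatnonzero(confidence >= threshold)
    return keep, probs[keep].argmax(axis=1)

# Assumed easy-to-hard schedule: lower the threshold over re-training rounds.
target_probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
for threshold in (0.8, 0.5):
    indices, labels = select_pseudo_labels(target_probs, threshold)
```

Lowering the threshold admits harder (less confident) samples in later rounds, which is one simple way to realize an easy-to-hard ordering.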


[19] AirV2X: Unified Air-Ground Vehicle-to-Everything Collaboration cs.CV | cs.AI | cs.ROPDF

Xiangbo Gao, Yuheng Wu, Xuewen Luo, Keshu Wu, Xinghao Chen

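TL;DR: AirV2X proposes a unified air-ground V2X collaboration framework that uses drones as a flexible alternative or complement to fixed road-side units (RSUs), addressing the high deployment cost and coverage blind spots of traditional V2X systems.

Details

Motivation: Traditional V2X systems suffer from high costs and coverage blind spots in rural and suburban areas, and the mobility and low cost of drones offer a new solution.

Result: The dataset is open-sourced on GitHub and supports research on and applications of drone-assisted autonomous driving.

Insight: The flexibility of drones opens new possibilities for V2X systems, with notable advantages in cost and coverage.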

Abstract: While multi-vehicular collaborative driving demonstrates clear advantages over single-vehicle autonomy, traditional infrastructure-based V2X systems remain constrained by substantial deployment costs and the creation of “uncovered danger zones” in rural and suburban areas. We present AirV2X-Perception, a large-scale dataset that leverages Unmanned Aerial Vehicles (UAVs) as a flexible alternative or complement to fixed Road-Side Units (RSUs). Drones offer unique advantages over ground-based perception: complementary bird’s-eye-views that reduce occlusions, dynamic positioning capabilities that enable hovering, patrolling, and escorting navigation rules, and significantly lower deployment costs compared to fixed infrastructure. Our dataset comprises 6.73 hours of drone-assisted driving scenarios across urban, suburban, and rural environments with varied weather and lighting conditions. The AirV2X-Perception dataset facilitates the development and standardized evaluation of Vehicle-to-Drone (V2D) algorithms, addressing a critical gap in the rapidly expanding field of aerial-assisted autonomous driving systems. The dataset and development kits are open-sourced at https://github.com/taco-group/AirV2X-Perception.


[20] Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding cs.CV | cs.ROPDF

Runwei Guan, Ningwei Ouyang, Tianhao Xu, Shaofeng Liang, Wei Dai

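TL;DR: The paper presents WaterCaption, the first image captioning dataset focused on waterway environments, and Da Yu, an edge-deployable multimodal large language model that achieves efficient long-text generation through a Nano Transformer Adaptor (NTA).

Details

Motivation: The complexity of waterway environments makes it hard for existing perception models to achieve global semantic understanding, which limits large-scale monitoring and structured log generation. Building on vision-language models (VLMs), the paper uses image captioning to improve waterway surveillance and scene understanding.

Result: Da Yu outperforms state-of-the-art models on WaterCaption and other image captioning benchmarks while remaining efficient.

Insight: The paper shows the potential of fine-grained image captioning for understanding waterway environments and offers a new approach to deploying vision-language models on edge devices, especially in complex open-water scenes.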

Abstract: Automated waterway environment perception is crucial for enabling unmanned surface vessels (USVs) to understand their surroundings and make informed decisions. Most existing waterway perception models primarily focus on instance-level object perception paradigms (e.g., detection, segmentation). However, due to the complexity of waterway environments, current perception datasets and models fail to achieve global semantic understanding of waterways, limiting large-scale monitoring and structured log generation. With the advancement of vision-language models (VLMs), we leverage image captioning to introduce WaterCaption, the first captioning dataset specifically designed for waterway environments. WaterCaption focuses on fine-grained, multi-region long-text descriptions, providing a new research direction for visual geo-understanding and spatial scene cognition. Specifically, it includes 20.2k image-text pairs with a vocabulary size of 1.8 million. Additionally, we propose Da Yu, an edge-deployable multi-modal large language model for USVs, for which we design a novel vision-to-language projector called Nano Transformer Adaptor (NTA). NTA effectively balances computational efficiency with the capacity for both global and fine-grained local modeling of visual features, thereby significantly enhancing the model’s ability to generate long-form textual outputs. Da Yu achieves an optimal balance between performance and efficiency, surpassing state-of-the-art models on WaterCaption and several other captioning benchmarks.


[21] HoliGS: Holistic Gaussian Splatting for Embodied View Synthesis cs.CVPDF

Xiaoyuan Wang, Yizhou Zhao, Botao Ye, Xiaojun Shan, Weijie Lyu

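TL;DR: HoliGS is a novel deformable Gaussian splatting framework for efficient view synthesis from long monocular RGB videos, significantly reducing training and rendering time.

Details

Motivation: Existing 4D Gaussian splatting and dynamic NeRF methods incur heavy training overhead on long captures. HoliGS aims to provide an efficient and scalable solution.

Result: The method shows superior reconstruction quality on challenging datasets while significantly reducing training and rendering time.

Insight: Through a hierarchical warping strategy, HoliGS offers an efficient and scalable approach to view synthesis in real-world scenes, particularly scenes that involve multiple viewpoints and interactions.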

Abstract: We propose HoliGS, a novel deformable Gaussian splatting framework that addresses embodied view synthesis (EVS) from long monocular RGB videos. Unlike prior 4D Gaussian splatting and dynamic NeRF pipelines, which struggle with training overhead in minute-long captures, our method leverages invertible Gaussian Splatting deformation networks to reconstruct large-scale, dynamic environments accurately. Specifically, we decompose each scene into a static background plus time-varying objects, each represented by learned Gaussian primitives undergoing global rigid transformations, skeleton-driven articulation, and subtle non-rigid deformations via an invertible neural flow. This hierarchical warping strategy enables robust free-viewpoint novel-view rendering from various embodied camera trajectories by attaching Gaussians to a complete canonical foreground shape (e.g., egocentric or third-person follow), which may involve substantial viewpoint changes and interactions between multiple actors. Our experiments demonstrate that HoliGS achieves superior reconstruction quality on challenging datasets while significantly reducing both training and rendering time compared to state-of-the-art monocular deformable NeRFs. These results highlight a practical and scalable solution for EVS in real-world scenarios. The source code will be released.


[22] Open-Vocabulary Camouflaged Object Segmentation with Cascaded Vision Language Models cs.CVPDF

Kai Zhao, Wubang Yuan, Zheng Wang, Guanyi Li, Xiaoqiang Zhu

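TL;DR: This paper proposes a new cascaded vision-language model (VLM) framework for open-vocabulary camouflaged object segmentation (OVCOS). By combining a VLM with SAM, it addresses the domain gap and blurred boundaries that affect conventional methods.

Details

Motivation: OVCOS requires segmenting and classifying camouflaged objects from arbitrary categories, but existing methods are challenged by visual ambiguity and unseen categories. Conventional two-stage methods suffer from a domain gap, and generic segmentation models perform poorly on camouflaged objects.

Result: On OVCOS and conventional camouflaged object segmentation benchmarks, the method shows clear advantages, supporting the value of VLM semantics for both segmentation and classification.

Insight: Combining a VLM with SAM effectively addresses the domain gap and boundary ambiguity in camouflaged object segmentation, and the soft spatial prior offers a new idea for multimodal tasks.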

Abstract: Open-Vocabulary Camouflaged Object Segmentation (OVCOS) seeks to segment and classify camouflaged objects from arbitrary categories, presenting unique challenges due to visual ambiguity and unseen categories. Recent approaches typically adopt a two-stage paradigm: first segmenting objects, then classifying the segmented regions using Vision Language Models (VLMs). However, these methods (1) suffer from a domain gap caused by the mismatch between VLMs’ full-image training and cropped-region inference, and (2) depend on generic segmentation models optimized for well-delineated objects, making them less effective for camouflaged objects. Without explicit guidance, generic segmentation models often overlook subtle boundaries, leading to imprecise segmentation. In this paper, we introduce a novel VLM-guided cascaded framework to address these issues in OVCOS. For segmentation, we leverage the Segment Anything Model (SAM), guided by the VLM. Our framework uses VLM-derived features as explicit prompts to SAM, effectively directing attention to camouflaged regions and significantly improving localization accuracy. For classification, we avoid the domain gap introduced by hard cropping. Instead, we treat the segmentation output as a soft spatial prior via the alpha channel, which retains the full image context while providing precise spatial guidance, leading to more accurate and context-aware classification of camouflaged objects. The same VLM is shared across both segmentation and classification to ensure efficiency and semantic consistency. Extensive experiments on both OVCOS and conventional camouflaged object segmentation benchmarks demonstrate the clear superiority of our method, highlighting the effectiveness of leveraging rich VLM semantics for both segmentation and classification of camouflaged objects.
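To make the contrast between hard cropping and a soft spatial prior concrete, the sketch below appends a segmentation probability map to an RGB image as a fourth (alpha) channel, so the full image context is kept. This is only an illustration of the general idea; the helper name is hypothetical, and how the paper's VLM actually consumes the alpha channel is not described in the abstract.

```python
import numpy as np

def attach_soft_prior(image_rgb, mask_prob):
    """Append a soft mask as an alpha channel instead of cropping the image.

    image_rgb: (H, W, 3) float array.
    mask_prob: (H, W) array of per-pixel foreground probabilities.
    Returns an (H, W, 4) RGBA array whose RGB values are unchanged.
    """
    if image_rgb.shape[:2] != mask_prob.shape:
        raise ValueError("mask and image must share height and width")
    alpha = np.clip(mask_prob, 0.0, 1.0).astype(image_rgb.dtype)
    return np.concatenate([image_rgb, alpha[..., None]], axis=-1)

image = np.ones((4, 5, 3), dtype=np.float32)
mask = np.full((4, 5), 0.25)
rgba = attach_soft_prior(image, mask)  # shape (4, 5, 4)
```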


[23] Airway Skill Assessment with Spatiotemporal Attention Mechanisms Using Human Gaze cs.CVPDF

Jean-Paul Ainam, Rahul, Lora Cavuoto, Matthew Hackett, Jack Norfleet

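TL;DR: The paper proposes a machine learning method for assessing airway management skills that combines human gaze data with video recordings to evaluate endotracheal intubation (ETI). An attention mechanism and visual masks focus the model on key regions, improving classification accuracy and efficiency.

Details

Motivation: Airway management skills are critical in emergency medicine, but traditional assessment is subjective and poorly reflects competency in real-world scenarios. The study aims to provide an objective, efficient assessment tool using human gaze data and machine learning.

Result: The method outperforms traditional methods in prediction accuracy, sensitivity, and trustworthiness, and is reported to be especially suited to high-stress settings such as military scenarios.

Insight: Human gaze data can effectively guide attention mechanisms and improve model performance on complex tasks. The approach could be extended to the assessment of other clinical skills.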

Abstract: Airway management skills are critical in emergency medicine and are typically assessed through subjective evaluation, often failing to gauge competency in real-world scenarios. This paper proposes a machine learning-based approach for assessing airway skills, specifically endotracheal intubation (ETI), using human gaze data and video recordings. The proposed system leverages an attention mechanism guided by the human gaze to enhance the recognition of successful and unsuccessful ETI procedures. Visual masks were created from gaze points to guide the model in focusing on task-relevant areas, reducing irrelevant features. An autoencoder network extracts features from the videos, while an attention module generates attention from the visual masks, and a classifier outputs a classification score. This method, the first to use human gaze for ETI, demonstrates improved accuracy and efficiency over traditional methods. The integration of human gaze data not only enhances model performance but also offers a robust, objective assessment tool for clinical skills, particularly in high-stress environments such as military settings. The results show improvements in prediction accuracy, sensitivity, and trustworthiness, highlighting the potential for this approach to improve clinical training and patient outcomes in emergency medicine.
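One plausible way to turn gaze points into a visual mask is to place a Gaussian bump at each gaze location and use the result to down-weight regions the expert did not look at. The sketch below follows that assumption; the Gaussian form, the `sigma` value, and the function names are illustrative and not taken from the paper.

```python
import numpy as np

def gaze_mask(gaze_points, height, width, sigma=10.0):
    """Build a soft mask in [0, 1] from (x, y) gaze points in pixel coordinates."""
    ys, xs = np.mgrid[0:height, 0:width]
    mask = np.zeros((height, width), dtype=np.float64)
    for gx, gy in gaze_points:
        bump = np.exp(-((xs - gx) ** 2 + (ys - gy) ** 2) / (2.0 * sigma ** 2))
        mask = np.maximum(mask, bump)  # keep the strongest response per pixel
    return mask

def apply_mask(features, mask):
    """Weight an (H, W, C) feature map by the (H, W) gaze mask."""
    return features * mask[..., None]

mask = gaze_mask([(3, 2)], height=5, width=8, sigma=1.0)
weighted = apply_mask(np.ones((5, 8, 2)), mask)
```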


[24] Capturing Fine-Grained Alignments Improves 3D Affordance Detection cs.CV | cs.AIPDF

Junsei Tokumitsu, Yuiga Wada

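TL;DR: The paper proposes LM-AD, a new method for affordance detection in 3D point clouds. It introduces an Affordance Query Module (AQM) that captures fine-grained alignment between point clouds and text and significantly improves on existing methods.

Details

Motivation: Existing methods for 3D point cloud affordance detection rely on simple cosine similarity between point cloud and text embeddings, which cannot capture fine-grained alignment and therefore limits performance.

Result: On the 3D AffordanceNet dataset, LM-AD outperforms existing methods in both accuracy and mean IoU.

Insight: Fine-grained alignment is key to improving 3D affordance detection, and pretrained language models are an effective tool for achieving it.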

Abstract: In this work, we address the challenge of affordance detection in 3D point clouds, a task that requires effectively capturing fine-grained alignments between point clouds and text. Existing methods often struggle to model such alignments, resulting in limited performance on standard benchmarks. A key limitation of these approaches is their reliance on simple cosine similarity between point cloud and text embeddings, which lacks the expressiveness needed for fine-grained reasoning. To address this limitation, we propose LM-AD, a novel method for affordance detection in 3D point clouds. Moreover, we introduce the Affordance Query Module (AQM), which efficiently captures fine-grained alignment between point clouds and text by leveraging a pretrained language model. We demonstrated that our method outperformed existing approaches in terms of accuracy and mean Intersection over Union on the 3D AffordanceNet dataset.
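For reference, the cosine-similarity baseline that the abstract criticizes can be written in a few lines: every point embedding is compared with every text-label embedding, and each point takes the best-matching label. The sketch below shows that baseline only, not LM-AD or the AQM; the function name and toy embeddings are illustrative assumptions.

```python
import numpy as np

def cosine_affordance_scores(point_feats, text_feats):
    """Cosine similarity between N point embeddings and K text embeddings.

    point_feats: (N, D) array, text_feats: (K, D) array.
    Returns an (N, K) score matrix.
    """
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return p @ t.T

points = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
labels = np.array([[1.0, 0.0], [0.0, 1.0]])  # e.g. embeddings of two affordance names
scores = cosine_affordance_scores(points, labels)
predicted = scores.argmax(axis=1)  # one label index per point
```

Because each point is scored against a single pooled vector per label, this baseline has no mechanism for token-level or region-level interaction, which is the limitation the abstract points to.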


[25] Progressive Modality Cooperation for Multi-Modality Domain Adaptation cs.CVPDF

Weichen Zhang, Dong Xu, Jing Zhang, Wanli Ouyang

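TL;DR: The paper proposes Progressive Modality Cooperation (PMC), a new framework for multi-modality domain adaptation (MMDA) and multi-modality domain adaptation using privileged information (MMDA-PI), which exploits multi-modal data to improve knowledge transfer.

Details

Motivation: In multi-modality domain adaptation, effectively exploiting multi-modal data and handling missing modalities in the target domain is a key challenge.

Result: Experiments on three image datasets and eight video datasets support the effectiveness of PMC and PMC-PI on cross-domain visual recognition tasks.

Insight: Multi-modal data can significantly improve domain adaptation through progressive cooperation and a generation network, especially when modalities are missing.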

Abstract: In this work, we propose a new generic multi-modality domain adaptation framework called Progressive Modality Cooperation (PMC) to transfer the knowledge learned from the source domain to the target domain by exploiting multiple modality clues (e.g., RGB and depth) under the multi-modality domain adaptation (MMDA) and the more general multi-modality domain adaptation using privileged information (MMDA-PI) settings. Under the MMDA setting, the samples in both domains have all the modalities. In two newly proposed modules of our PMC, the multiple modalities are cooperated for selecting the reliable pseudo-labeled target samples, which captures the modality-specific information and modality-integrated information, respectively. Under the MMDA-PI setting, some modalities are missing in the target domain. Hence, to better exploit the multi-modality data in the source domain, we further propose the PMC with privileged information (PMC-PI) method by proposing a new multi-modality data generation (MMG) network. MMG generates the missing modalities in the target domain based on the source domain data by considering both domain distribution mismatch and semantics preservation, which are respectively achieved by using adversarial learning and conditioning on weighted pseudo semantics. Extensive experiments on three image datasets and eight video datasets for various multi-modality cross-domain visual recognition tasks under both MMDA and MMDA-PI settings clearly demonstrate the effectiveness of our proposed PMC framework.


[26] Continual Retinal Vision-Language Pre-training upon Incremental Imaging Modalities cs.CVPDF

Yuang Yao, Ruiqi Wu, Yi Zhou, Tao Zhou

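TL;DR: The paper proposes RetCoP, a continual vision-language pre-training framework that incrementally integrates image and text features from different fundus imaging modalities into a single unified pre-trained model, addressing the limitations of conventional single-modality models.

Details

Motivation: Traditional fundus image analysis models focus on single-modal tasks and ignore the complementarity between modalities, which limits their versatility. Most existing retinal foundation models remain modality-specific, so a method is needed that continually integrates multi-modal data in dynamic environments.

Result: Experiments show that RetCoP outperforms all compared methods, with the best generalization and the lowest forgetting rate.

Insight: Incremental integration of multi-modal data is the key challenge, and the success of RetCoP points to the importance of continual learning and representation alignment in dynamic environments.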

Abstract: Traditional fundus image analysis models focus on single-modal tasks, ignoring fundus modality complementarity, which limits their versatility. Recently, retinal foundation models have emerged, but most still remain modality-specific. Integrating multiple fundus imaging modalities into a single foundation model is valuable. However, in dynamic environments, data from different modalities often arrive incrementally, necessitating continual pre-training. To address this, we propose RetCoP, the first continual vision-language pre-training framework in the fundus domain, which incrementally integrates image and text features from different imaging modalities into a single unified foundation model. To mitigate catastrophic forgetting in continual pre-training, we introduce a rehearsal strategy utilizing representative image-text pairs and an off-diagonal information distillation approach. The former allows the model to revisit knowledge from previous stages, while the latter explicitly preserves the alignment between image and text representations. Experiments show that RetCoP outperforms all the compared methods, achieving the best generalization and lowest forgetting rate. The code can be found at https://github.com/Yuang-Yao/RetCoP.


[27] Memory-Augmented Incomplete Multimodal Survival Prediction via Cross-Slide and Gene-Attentive Hypergraph Learning cs.CVPDF

Mingcheng Qu, Guang Yang, Donglin Di, Yue Gao, Tonghua Su

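TL;DR: The paper proposes a memory-augmented multimodal survival prediction framework based on hypergraph learning. It integrates multi-slide information and pathology-genomic interactions, addresses modality imbalance, and compensates for incomplete modalities through a memory mechanism.

Details

Motivation: Existing methods mainly integrate FFPE slides with genomic data and neglect other preservation types such as FF slides. High-resolution pathology data also tends to dominate cross-modality fusion, causing modality imbalance, and the requirement for complete data limits clinical applicability.

Result: On five TCGA datasets, the model exceeds advanced methods by over 2.3% in C-Index. Under incomplete modalities, it surpasses pathology-only (3.3%) and gene-only (7.9%) models.

Insight: Combining a memory mechanism with hypergraph learning effectively addresses modality imbalance and incomplete data, improving the robustness and clinical applicability of multimodal survival prediction.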

Abstract: Multimodal pathology-genomic analysis is critical for cancer survival prediction. However, existing approaches predominantly integrate formalin-fixed paraffin-embedded (FFPE) slides with genomic data, while neglecting the availability of other preservation slides, such as Fresh Frozen (FF) slides. Moreover, as the high-resolution spatial nature of pathology data tends to dominate the cross-modality fusion process, it hinders effective multimodal fusion and leads to modality imbalance challenges between pathology and genomics. These methods also typically require complete data modalities, limiting their clinical applicability with incomplete modalities, such as missing either pathology or genomic data. In this paper, we propose a multimodal survival prediction framework that leverages hypergraph learning to effectively integrate multi-WSI information and cross-modality interactions between pathology slides and genomics data while addressing modality imbalance. In addition, we introduce a memory mechanism that stores previously learned paired pathology-genomic features and dynamically compensates for incomplete modalities. Experiments on five TCGA datasets demonstrate that our model outperforms advanced methods by over 2.3% in C-Index. Under incomplete modality scenarios, our approach surpasses pathology-only (3.3%) and gene-only models (7.9%). Code: https://github.com/MCPathology/M2Surv
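The C-Index reported above measures how well predicted risks order patients by survival time. A minimal sketch of the commonly used Harrell's concordance index follows; it is assumed, not confirmed by the abstract, that the paper uses this standard definition, and the function name and toy data are illustrative.

```python
import numpy as np

def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times:  observed time per patient.
    events: 1 if the event (death) was observed, 0 if censored.
    risks:  predicted risk score, higher means shorter expected survival.
    A pair (i, j) is comparable when patient i had an observed event
    strictly before time j.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

times = np.array([1.0, 2.0, 3.0, 4.0])
events = np.array([1, 1, 0, 1])
risks = np.array([0.9, 0.7, 0.5, 0.1])
c_index = concordance_index(times, events, risks)
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking, so a gain of a few percentage points in C-Index reflects a better ranking of patients by risk.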


[28] Comparative Performance of Finetuned ImageNet Pre-trained Models for Electronic Component Classification cs.CVPDF

Yidi Shao, Longfei Zhou, Fangshuo Tang, Xinyi Shi, Dalang Chen

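TL;DR: This paper compares twelve ImageNet pre-trained models on electronic component classification and finds that MobileNet-V2 performs best (99.95%) while EfficientNet-B0 performs worst (92.26%), supporting the practicality of pre-trained models in electronics manufacturing.

Details

Motivation: Electronic component classification is crucial in manufacturing, significantly reducing labor costs and promoting technological development. Pre-trained models, especially those trained on ImageNet, excel at image classification and achieve good results even with limited data.

Result: MobileNet-V2 achieved the highest accuracy at 99.95%, and EfficientNet-B0 the lowest at 92.26%. All models performed well, demonstrating the effectiveness of pre-trained models.

Insight: ImageNet pre-trained models reach high accuracy even on the specialized task of electronic component classification, supporting their generalization ability and practicality.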

Abstract: Electronic component classification and detection are crucial in manufacturing industries, significantly reducing labor costs and promoting technological and industrial development. Pre-trained models, especially those trained on ImageNet, are highly effective in image classification, allowing researchers to achieve excellent results even with limited data. This paper compares the performance of twelve ImageNet pre-trained models in classifying electronic components. Our findings show that all models tested delivered respectable accuracies. MobileNet-V2 recorded the highest at 99.95%, while EfficientNet-B0 had the lowest at 92.26%. These results underscore the substantial benefits of using ImageNet pre-trained models in image classification tasks and confirm the practical applicability of these methods in the electronics manufacturing sector.


[29] Trajectory Prediction in Dynamic Object Tracking: A Critical Study cs.CVPDF

Zhongping Dong, Liming Chen, Mohand Tahar Kechadi

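TL;DR: This paper analyzes in detail recent advances, applications, and challenges in dynamic object tracking (DOT) and trajectory prediction (TP), covering a range of technical approaches and discussing their effectiveness and limitations in real-world scenarios.

Details

Motivation: Dynamic object tracking and trajectory prediction have important applications in areas such as autonomous driving, surveillance, healthcare, and industrial automation, but existing methods still face challenges in generalization, computational efficiency, and data dependency.

Result: The study highlights the limitations of existing methods and the room for improvement, particularly the need to address generalization and computational efficiency.

Insight: Future research should focus on developing context-aware systems, integrating multimodal data and semantic information, and attending to ethics and privacy protection.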

Abstract: This study provides a detailed analysis of current advancements in dynamic object tracking (DOT) and trajectory prediction (TP) methodologies, including their applications and challenges. It covers various approaches, such as feature-based, segmentation-based, estimation-based, and learning-based methods, evaluating their effectiveness, deployment, and limitations in real-world scenarios. The study highlights the significant impact of these technologies in automotive and autonomous vehicles, surveillance and security, healthcare, and industrial automation, contributing to safety and efficiency. Despite the progress, challenges such as improved generalization, computational efficiency, reduced data dependency, and ethical considerations still exist. The study suggests future research directions to address these challenges, emphasizing the importance of multimodal data integration, semantic information fusion, and developing context-aware systems, along with ethical and privacy-preserving frameworks.


[30] Image Segmentation using Chan-Vese Active Contours cs.CVPDF

Pranav Shenoy K. P

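TL;DR: The paper derives and implements in detail an image segmentation method based on the Chan-Vese active contour model and demonstrates its strong performance on noisy images and images with weak boundaries.

Details

Motivation: To address the limitations of traditional gradient-based segmentation on noisy and weak-boundary images, the paper presents an active contour model driven by regional intensity differences.

Result: Experiments show accurate segmentation on medical and synthetic images, robustness to noise, and better performance than classical edge-based methods.

Insight: The Chan-Vese model evolves the contour using regional intensity differences rather than gradients, providing a better solution for segmenting complex images.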

Abstract: This paper presents a comprehensive derivation and implementation of the Chan-Vese active contour model for image segmentation. The model, derived from the Mumford-Shah variational framework, evolves contours based on regional intensity differences rather than image gradients, making it highly effective for segmenting noisy images or images with weak boundaries. We provide a rigorous mathematical derivation of the level set formulation, including detailed treatment of each energy term using the divergence theorem and curve evolution theory. The resulting algorithm is implemented in Python using finite difference methods with special care to numerical stability, including an upwind entropy scheme and curvature-based regularization. Experimental results on medical and synthetic images demonstrate accurate segmentation, robustness to noise, and superior performance compared to classical edge-based methods. This study confirms the suitability of the Chan-Vese model for complex segmentation tasks and highlights its potential for use in real-world imaging applications.
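The region-driven evolution described above can be illustrated with a deliberately simplified sketch that keeps only the two region-fitting terms of the Chan-Vese level set update. The curvature regularization, the smoothed delta function, and the upwind scheme mentioned in the abstract are omitted, and the step size and weights are illustrative assumptions, so this is not the paper's implementation.

```python
import numpy as np

def region_means(image, phi):
    """Mean intensity inside (phi > 0) and outside (phi <= 0) the contour."""
    inside = phi > 0
    c1 = image[inside].mean() if inside.any() else 0.0
    c2 = image[~inside].mean() if (~inside).any() else 0.0
    return c1, c2

def chan_vese_region_step(image, phi, dt=0.5, lambda1=1.0, lambda2=1.0):
    """One explicit step using only the region forces.

    A pixel closer in intensity to the inside mean c1 pushes phi up,
    a pixel closer to the outside mean c2 pushes phi down.
    """
    c1, c2 = region_means(image, phi)
    force = -lambda1 * (image - c1) ** 2 + lambda2 * (image - c2) ** 2
    return phi + dt * force

# Toy image: dark left half, bright right half; initial contour is misplaced.
image = np.zeros((4, 4))
image[:, 2:] = 1.0
phi = np.ones((4, 4))
phi[:, 0] = -1.0
for _ in range(10):
    phi = chan_vese_region_step(image, phi)
segmentation = phi > 0
```

No image gradient appears anywhere in the update, which is why this family of models tolerates weak or noisy edges.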


[31] Training-Free Motion Customization for Distilled Video Generators with Adaptive Test-Time Distillation cs.CVPDF

Jintao Rong, Xin Xie, Xinyi Yu, Linlin Ou, Xinyu Zhang

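TL;DR: This paper proposes MotionEcho, a training-free motion customization method that improves the motion fidelity and generation quality of distilled video generation models through adaptive test-time distillation.

Details

Motivation: Distilled video generation models struggle with motion customization from reference videos in training-free settings, and existing methods fail to generalize because of the accelerated generative process and large denoising steps.

Result: Experiments across various distilled video generation models and benchmark datasets show that the method significantly improves motion fidelity and generation quality while remaining efficient.

Insight: Adaptive test-time distillation makes it possible to exploit the strengths of a teacher model for high-quality motion customization without additional training.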

Abstract: Distilled video generation models offer fast and efficient synthesis but struggle with motion customization when guided by reference videos, especially under training-free settings. Existing training-free methods, originally designed for standard diffusion models, fail to generalize due to the accelerated generative process and large denoising steps in distilled models. To address this, we propose MotionEcho, a novel training-free test-time distillation framework that enables motion customization by leveraging diffusion teacher forcing. Our approach uses high-quality, slow teacher models to guide the inference of fast student models through endpoint prediction and interpolation. To maintain efficiency, we dynamically allocate computation across timesteps according to guidance needs. Extensive experiments across various distilled video generation models and benchmark datasets demonstrate that our method significantly improves motion fidelity and generation quality while preserving high efficiency. Project page: https://euminds.github.io/motionecho/


[32] Online camera-pose-free stereo endoscopic tissue deformation recovery with tissue-invariant vision-biomechanics consistency cs.CVPDF

Jiahe Chen, Naoki Tomii, Ichiro Sakuma, Etsuko Kobayashi

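TL;DR: The paper proposes an online, camera-pose-free method for recovering tissue deformation from stereo endoscopy. Using tissue-invariant vision-biomechanics consistency, it handles camera motion, occlusion, and large deformation, and achieves inter-frame alignment without estimating camera pose.

Details

Motivation: Prior work performs poorly under camera motion, occlusion, large deformation, and the lack of tissue-specific biomechanical priors, and it relies on offline processing. This paper aims to provide an online method that stably recovers tissue geometry and deformation without camera pose estimation.

Result: The 3D reconstruction accuracy is 0.37±0.27 mm in non-occluded areas and 0.39±0.21 mm in occluded areas. The method can also estimate surface strain distribution for mechanical analysis.

Insight: The method shows robustness in complex surgical scenes and provides a new tool for surgical navigation and autonomous soft tissue manipulation.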

Abstract: Tissue deformation recovery based on stereo endoscopic images is crucial for tool-tissue interaction analysis and benefits surgical navigation and autonomous soft tissue manipulation. Previous research suffers from problems arising from camera motion, occlusion, large tissue deformation, lack of tissue-specific biomechanical priors, and reliance on offline processing. Unlike previous studies where the tissue geometry and deformation are represented by 3D points and displacements, the proposed method models tissue geometry as the 3D point and derivative map and tissue deformation as the 3D displacement and local deformation map. For a single surface point, 6 parameters are used to describe its rigid motion and 3 parameters for its local deformation. The method is formulated under the camera-centric setting, where all motions are regarded as the scene motion with respect to the camera. Inter-frame alignment is realized by optimizing the inter-frame deformation, making it unnecessary to estimate camera pose. The concept of the canonical map is introduced to optimize tissue geometry and deformation in an online approach. Quantitative and qualitative experiments were conducted using in vivo and ex vivo laparoscopic datasets. With the inputs of depth and optical flow, the method stably models tissue geometry and deformation even when the tissue is partially occluded or moving outside the field of view. Results show that the 3D reconstruction accuracy in the non-occluded and occluded areas reaches 0.37±0.27 mm and 0.39±0.21 mm in terms of surface distance, respectively. The method can also estimate surface strain distribution during various manipulations as an extra modality for mechanical-based analysis.


[33] Emergence of Text Readability in Vision Language Models cs.CVPDF

Jaeyoo Park, Sanghyuk Chun, Wonjae Kim, Sangdoo Yun, Bohyung Han

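TL;DR: The paper studies how the ability to recognize text in images (text readability) emerges during the training of vision-language models (VLMs), and finds that it contrasts with the gradual development of semantic understanding.

Details

Motivation: To explore how vision-language models develop the ability to recognize text in images during training, and how this ability differs from semantic understanding.

Result: Text readability emerges abruptly late in training, while the ability to match images with rendered text develops even more slowly, indicating a need for deeper semantic integration.

Insight: The findings indicate that tailored training strategies are needed to accelerate text comprehension in VLMs, pointing to directions for optimizing multimodal learning.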

Abstract: We investigate how the ability to recognize textual content within images emerges during the training of Vision-Language Models (VLMs). Our analysis reveals a critical phenomenon: the ability to read textual information in a given image (text readability) emerges abruptly after substantial training iterations, in contrast to semantic content understanding which develops gradually from the early stages of training. This delayed emergence may reflect how contrastive learning tends to initially prioritize general semantic understanding, with text-specific symbolic processing developing later. Interestingly, the ability to match images with rendered text develops even slower, indicating a deeper need for semantic integration. These findings highlight the need for tailored training strategies to accelerate robust text comprehension in VLMs, laying the groundwork for future research on optimizing multimodal learning.


[34] Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System cs.CV | cs.AI | cs.CLPDF

Lixuan He, Haoyu Dong, Zhenxing Chen, Yangcheng Yu, Jie Feng

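TL;DR: Mem4Nav is a hierarchical spatial-cognition long-short memory system designed to improve vision-and-language navigation (VLN) in complex urban environments. By combining a sparse octree with a semantic topology graph, it manages long-term and short-term memory efficiently and significantly improves task completion and path planning.

Details

Motivation: Existing VLN methods face two main challenges in urban scenes: modular methods lack unified memory, and end-to-end methods are limited by fixed context windows and implicit spatial reasoning. Mem4Nav aims to address both with a hierarchical memory system.

Result: On the Touchdown and Map2Seq datasets, Mem4Nav improves Task Completion by 7-13 percentage points, reduces path distance (SPD), and improves nDTW by more than 10 percentage points. Ablation experiments confirm the importance of the hierarchical map and the dual memory modules.

Insight: The success of Mem4Nav suggests that hierarchical memory structures and explicit spatial reasoning are essential for urban navigation, offering new ideas for the design of future VLN systems.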

Abstract: Vision-and-Language Navigation (VLN) in large-scale urban environments requires embodied agents to ground linguistic instructions in complex scenes and recall relevant experiences over extended time horizons. Prior modular pipelines offer interpretability but lack unified memory, while end-to-end (M)LLM agents excel at fusing vision and language yet remain constrained by fixed context windows and implicit spatial reasoning. We introduce Mem4Nav, a hierarchical spatial-cognition long-short memory system that can augment any VLN backbone. Mem4Nav fuses a sparse octree for fine-grained voxel indexing with a semantic topology graph for high-level landmark connectivity, storing both in trainable memory tokens embedded via a reversible Transformer. Long-term memory (LTM) compresses and retains historical observations at both octree and graph nodes, while short-term memory (STM) caches recent multimodal entries in relative coordinates for real-time obstacle avoidance and local planning. At each step, STM retrieval sharply prunes dynamic context, and, when deeper history is needed, LTM tokens are decoded losslessly to reconstruct past embeddings. Evaluated on Touchdown and Map2Seq across three backbones (modular, state-of-the-art VLN with prompt-based LLM, and state-of-the-art VLN with strided-attention MLLM), Mem4Nav yields 7-13 pp gains in Task Completion, sufficient SPD reduction, and >10 pp nDTW improvement. Ablations confirm the indispensability of both the hierarchical map and dual memory modules. Our code is open-sourced at https://github.com/tsinghua-fib-lab/Mem4Nav.


[35] AMF-MedIT: An Efficient Align-Modulation-Fusion Framework for Medical Image-Tabular Data cs.CVPDF

Congjing Yu, Jing Ye, Yang Liu, Xiaodong Zhang, Zhiyong Zhang

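TL;DR: AMF-MedIT is an efficient cross-modal medical data fusion framework. With an Adaptive Modulation and Fusion module and a novel tabular encoder, FT-Mamba, it addresses dimension discrepancies and noise when fusing image and tabular data and performs well under data-scarce conditions.

Details

Motivation: Multimodal analysis of medical images and tabular data is important for clinical decision-making, but differences in feature dimensions and noise in tabular data leave existing methods lacking in fusion quality and data efficiency. The paper proposes the AMF-MedIT framework to address these problems.

Result: Experiments show that AMF-MedIT achieves a superior balance between multimodal performance and data efficiency and adapts well to incomplete tabular data. FT-Mamba performs well at feature extraction and at guiding image attention patterns.

Insight: 1. Combining the modulation mechanism with prior knowledge is key to cross-modal fusion; 2. A selective mechanism can effectively handle the high noise in medical tabular data; 3. Interpretability analysis reveals the potential of the tabular modality to supervise the imaging modality.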

Abstract: Multimodal medical analysis combining image and tabular data has gained increasing attention. However, effective fusion remains challenging due to cross-modal discrepancies in feature dimensions and modality contributions, as well as the noise from high-dimensional tabular inputs. To address these problems, we present AMF-MedIT, an efficient Align-Modulation-Fusion framework for medical image and tabular data integration, particularly under data-scarce conditions. To harmonize dimension discrepancies and dynamically adjust modality contributions, we propose the Adaptive Modulation and Fusion (AMF) module, a novel modulation-based fusion paradigm with a streamlined architecture. We first derive the modulation objectives and introduce a modality confidence ratio, enabling the incorporation of prior knowledge into the fusion process. Then, the feature masks, density and leakage losses are proposed to achieve the modulation objectives. Additionally, we introduce FT-Mamba, a powerful tabular encoder leveraging a selective mechanism to handle noisy medical tabular data efficiently. Furthermore, interpretability studies are conducted to explore how different tabular encoders supervise the imaging modality during contrastive pretraining for the first time. Extensive experiments demonstrate that AMF-MedIT achieves a superior balance between multimodal performance and data efficiency while showing strong adaptability to incomplete tabular data. Interpretability analysis also highlights FT-Mamba’s capabilities in extracting distinct tabular features and guiding the image encoder toward more accurate and flexible attention patterns.


[36] Sampling Matters in Explanations: Towards Trustworthy Attribution Analysis Building Block in Visual Models through Maximizing Explanation Certainty cs.CV

Róisín Luo, James McDermott, Colm O’Riordan

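TL;DR: Through theoretical analysis and experiments, the paper shows that how well the sample distribution in gradient integration aligns with the natural image distribution determines a lower bound on explanation certainty. It proposes a semi-optimal sampling approach that suppresses input features, which significantly improves explanations of visual models.

Details

Motivation: In existing image attribution analysis, gradient integration builds feature maps from noisy samples, but the sample distribution is poorly aligned with the natural image distribution, which leads to low explanation certainty. The added noise can also saturate neural networks and degrade explanations.

Result: Experiments on ImageNet show that the method outperforms existing baselines on all tested models and yields more satisfactory explanations.

Insight: The certainty of attribution analysis is directly tied to how well the sample distribution is aligned with natural images; suppressing features rather than adding noise is a more effective sampling strategy that avoids saturating the model and improves explanation quality.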

Abstract: Image attribution analysis seeks to highlight the feature representations learned by visual models such that the highlighted feature maps reflect the pixel-wise importance of inputs. Gradient integration is a building block of attribution analysis: it integrates the gradients from multiple derived samples to highlight the semantic features relevant to inferences. This building block is often combined with other information from visual models, such as activation or attention maps, to form the final explanations. Yet our theoretical analysis demonstrates that the extent to which the sample distribution in gradient integration aligns with the natural image distribution gives a lower bound on explanation certainty. Prior works add noise to images to create samples, and the resulting noise distributions can lead to low explanation certainty. Counter-intuitively, our experiments show that extra information can saturate neural networks. Building trustworthy attribution analysis therefore requires settling the sample distribution misalignment problem. Instead of adding extra information to input images, we present a semi-optimal sampling approach that suppresses features from inputs. The distribution of samples obtained by suppressing features is approximately identical to the distribution of natural images. Our extensive quantitative evaluation on the large-scale ImageNet dataset affirms that our approach is effective and yields more satisfactory explanations than state-of-the-art baselines across all experimental models.


[37] Deblurring in the Wild: A Real-World Dataset from Smartphone High-Speed Videos cs.CV

Mahdi Mohd Hossain Noki, Syed Mumtahin Mahmud, Prothito Shovon Majumder, Abdul Mohaimen Al Radi, Md. Haider Ali

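TL;DR: The paper presents a real-world image deblurring dataset built from smartphone slow-motion videos. It simulates long-exposure blur and contains over 42,000 high-resolution blur-sharp image pairs, making it approximately 10 times larger than widely used datasets.

Details

Motivation: Existing deblurring datasets are small and cover few scene types, so they fail to reflect the complexity and diversity of real-world blur. A richer, more challenging dataset is needed.

Result: Several SOTA deblurring models were benchmarked and showed significant performance degradation, indicating that the dataset's complexity and diversity challenge existing models.

Insight: Real-world blur is more challenging; future deblurring models need stronger generalization and better adaptation to complex scenes.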

Abstract: We introduce the largest real-world image deblurring dataset constructed from smartphone slow-motion videos. Using 240 frames captured over one second, we simulate realistic long-exposure blur by averaging frames to produce blurry images, while using the temporally centered frame as the sharp reference. Our dataset contains over 42,000 high-resolution blur-sharp image pairs, making it approximately 10 times larger than widely used datasets, with 8 times as many distinct scenes, including indoor and outdoor environments with varying object and camera motions. We benchmark multiple state-of-the-art (SOTA) deblurring models on our dataset and observe significant performance degradation, highlighting the complexity and diversity of our benchmark. Our dataset serves as a challenging new benchmark to facilitate robust and generalizable deblurring models.
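
The blur synthesis described in the abstract (average the frames of a one-second, 240-frame clip and keep the temporally centered frame as the sharp reference) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function name `make_blur_sharp_pair` and the random test clip are hypothetical, and the averaging is done directly on 8-bit pixel values, whereas a real pipeline might average in linear intensity space.

```python
import numpy as np


def make_blur_sharp_pair(frames: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Build one (blurry, sharp) pair from a clip of shape (T, H, W, C), dtype uint8.

    Hypothetical helper: the blurry image is the per-pixel mean over all T
    frames, and the sharp image is the temporally centered frame.
    """
    if frames.ndim != 4 or frames.shape[0] == 0:
        raise ValueError("expected a non-empty clip of shape (T, H, W, C)")
    # Average in floating point to avoid uint8 overflow, then round back.
    blurry = np.rint(frames.astype(np.float64).mean(axis=0)).astype(np.uint8)
    # The temporally centered frame serves as the sharp reference.
    sharp = frames[frames.shape[0] // 2]
    return blurry, sharp


# Stand-in for 240 frames captured over one second (random data, tiny resolution).
rng = np.random.default_rng(0)
clip = rng.integers(0, 256, size=(240, 8, 8, 3), dtype=np.uint8)
blurry, sharp = make_blur_sharp_pair(clip)
print(blurry.shape, sharp.shape)
```

Averaging more frames corresponds to a longer simulated exposure, so the same helper could be applied to shorter windows of the clip to vary the blur strength.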


[38] Stylized Structural Patterns for Improved Neural Network Pre-training cs.CV | cs.AI | cs.LG

Farnood Salehi, Vandit Sharma, Amirhossein Askari Farsangi, Tunç Ozan Aydın

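TL;DR: The paper proposes an improved synthetic data generation method. Using a neural fractal formulation and reverse stylization, it significantly improves the pretraining performance of synthetic data and narrows the distribution gap to real data.

Details

Motivation: Modern computer vision models rely on large amounts of real image data, but collecting real data raises privacy and legal concerns. Existing synthetic data underperforms, so a more effective way to generate synthetic data is needed.

Result: 1. FID drops by 11% for the EDM2 diffusion model; 2. Autoencoder reconstruction error falls by 20%; 3. A ViT-S classifier gains over 10% accuracy on ImageNet-100.

Insight: Reverse stylization effectively narrows the gap between synthetic and real data and offers a practical solution for data-scarce settings.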

Abstract: Modern deep learning models in computer vision require large datasets of real images, which are difficult to curate and pose privacy and legal concerns, limiting their commercial use. Recent works suggest synthetic data as an alternative, yet models trained with it often underperform. This paper proposes a two-step approach to bridge this gap. First, we propose an improved neural fractal formulation through which we introduce a new class of synthetic data. Second, we propose reverse stylization, a technique that transfers visual features from a small, license-free set of real images onto synthetic datasets, enhancing their effectiveness. We analyze the domain gap between our synthetic datasets and real images using Kernel Inception Distance (KID) and show that our method achieves a significantly lower distributional gap compared to existing synthetic datasets. Furthermore, our experiments across different tasks demonstrate the practical impact of this reduced gap. We show that pretraining the EDM2 diffusion model on our synthetic dataset leads to an 11% reduction in FID during image generation, compared to models trained on existing synthetic datasets, and a 20% decrease in autoencoder reconstruction error, indicating improved performance in data representation. Furthermore, a ViT-S model trained for classification on this synthetic data achieves over a 10% improvement in ImageNet-100 accuracy. Our work opens up exciting possibilities for training practical models when sufficiently large real training sets are not available.


[39] Surgery-R1: Advancing Surgical-VQLA with Reasoning Multimodal Large Language Model via Reinforcement Learning cs.CV | cs.AI

Pengfei Hao, Shuaibo Li, Hongqiu Wang, Zhizhuo Kou, Junhang Zhang

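TL;DR: The paper proposes Surgery-R1, a reasoning multimodal large language model (MLLM) for Visual Question Localized-Answering in robotic surgery (Surgical-VQLA). By combining supervised fine-tuning and reinforcement fine-tuning, it improves the model's reasoning ability and interpretability in surgical scenes.

Details

Motivation: Existing Surgical-VQLA models lack deep reasoning capabilities and interpretability, which limits their reliability in clinical applications.

Result: Experiments show that Surgery-R1 outperforms existing SOTA models and other widely used MLLMs on the Surgical-VQLA task, validating its reasoning capabilities and the effectiveness of the approach.

Insight: Introducing reinforcement learning and a Multimodal Coherence reward mechanism can significantly improve reasoning and interpretability in complex surgical scenes, offering a more reliable basis for clinical applications.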

Abstract: In recent years, significant progress has been made in the field of surgical scene understanding, particularly in the task of Visual Question Localized-Answering in robotic surgery (Surgical-VQLA). However, existing Surgical-VQLA models lack deep reasoning capabilities and interpretability in surgical scenes, which limits their reliability and potential for development in clinical applications. To address this issue, inspired by the development of Reasoning Multimodal Large Language Models (MLLMs), we first build the Surgery-R1-54k dataset, including paired data for Visual-QA, Grounding-QA, and Chain-of-Thought (CoT). Then, we propose the first Reasoning MLLM for Surgical-VQLA (Surgery-R1). In Surgery-R1, we design a two-stage fine-tuning mechanism that equips the base MLLM with complex reasoning abilities through supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Furthermore, for an efficient and high-quality rule-based reward system in our RFT, we design a Multimodal Coherence reward mechanism to mitigate positional illusions that may arise in surgical scenarios. Experimental results demonstrate that Surgery-R1 outperforms existing state-of-the-art (SOTA) models and widely used MLLMs on the Surgical-VQLA task, while also validating its reasoning capabilities and the effectiveness of our approach. The code and dataset will be made available at https://github.com/FiFi-HAO467/Surgery-R1.


[40] HMSViT: A Hierarchical Masked Self-Supervised Vision Transformer for Corneal Nerve Segmentation and Diabetic Neuropathy Diagnosis cs.CV

Xin Zhang, Liangxiu Han, Yue Shi, Yanlin Zheng, Alam Uazman

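TL;DR: HMSViT is a novel hierarchical masked self-supervised vision Transformer for corneal nerve segmentation and diabetic neuropathy diagnosis. Through efficient multi-scale feature extraction and reduced reliance on labelled data, it outperforms existing methods in both performance and computational cost.

Details

Motivation: Early diagnosis of Diabetic Peripheral Neuropathy (DPN) is crucial, but existing methods suffer from inefficient feature extraction, reliance on handcrafted priors, and limited data. HMSViT aims to address these issues for efficient and robust diagnosis.

Result: On clinical CCM datasets, HMSViT achieves 61.34% mIoU for segmentation and 70.40% diagnostic accuracy, outperforming models such as the Swin Transformer and HiViT while using fewer parameters.

Insight: 1. Combining a hierarchical design with attention mechanisms significantly improves performance; 2. Self-supervised learning is crucial for data-scarce tasks; 3. HMSViT shows potential for practical deployment in medical settings.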

Abstract: Diabetic Peripheral Neuropathy (DPN) affects nearly half of diabetes patients, requiring early detection. Corneal Confocal Microscopy (CCM) enables non-invasive diagnosis, but automated methods suffer from inefficient feature extraction, reliance on handcrafted priors, and data limitations. We propose HMSViT, a novel Hierarchical Masked Self-Supervised Vision Transformer designed for corneal nerve segmentation and DPN diagnosis. Unlike existing methods, HMSViT employs pooling-based hierarchical and dual attention mechanisms with absolute positional encoding, enabling efficient multi-scale feature extraction by capturing fine-grained local details in early layers and integrating global context in deeper layers, all at a lower computational cost. A block-masked self-supervised learning framework is designed for HMSViT that reduces reliance on labelled data and enhances feature robustness, while a multi-scale decoder fuses hierarchical features for segmentation and classification. Experiments on clinical CCM datasets show that HMSViT achieves state-of-the-art performance, with 61.34% mIoU for nerve segmentation and 70.40% diagnostic accuracy, outperforming leading hierarchical models like the Swin Transformer and HiViT by margins of up to 6.39% in segmentation accuracy while using fewer parameters. Detailed ablation studies further reveal that integrating block-masked SSL with hierarchical multi-scale feature extraction substantially enhances performance compared to conventional supervised training. Overall, these comprehensive experiments confirm that HMSViT delivers excellent, robust, and clinically viable results, demonstrating its potential for scalable deployment in real-world diagnostic applications.


[41] SceneCrafter: Controllable Multi-View Driving Scene Editing cs.CV

Zehao Zhu, Yuliang Zou, Chiyu Max Jiang, Bo Sun, Vincent Casser

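TL;DR: SceneCrafter is a controllable multi-view driving scene editing model. By tackling 3D consistency, the learning of empty-street priors, and paired image generation, it achieves high-quality editing of autonomous driving scenes.

Details

Motivation: Generating realistic driving scene simulations is crucial for developing autonomous driving systems, but purely synthetic scenes are not grounded in reality and are hard to trust. Editing models can leverage real driving data, but face challenges in cross-camera 3D consistency, learning empty-street priors, and generating paired images.

Result: SceneCrafter outperforms existing baselines in realism, controllability, 3D consistency, and editing quality.

Insight: Combining generative models with editing techniques makes it possible to produce high-quality simulated scenes from real data, providing a more reliable test environment for autonomous driving development.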

Abstract: Simulation is crucial for developing and evaluating autonomous vehicle (AV) systems. Recent literature builds on a new generation of generative models to synthesize highly realistic images for full-stack simulation. However, purely synthetically generated scenes are not grounded in reality and have difficulty inspiring confidence in the relevance of their outcomes. Editing models, on the other hand, leverage source scenes from real driving logs and enable the simulation of different traffic layouts, behaviors, and operating conditions such as weather and time of day. While image editing is an established topic in computer vision, it presents fresh sets of challenges in driving simulation: (1) the need for cross-camera 3D consistency, (2) learning “empty street” priors from driving data with foreground occlusions, and (3) obtaining paired image tuples of varied editing conditions while preserving consistent layout and geometry. To address these challenges, we propose SceneCrafter, a versatile editor for realistic 3D-consistent manipulation of driving scenes captured from multiple cameras. We build on recent advancements in multi-view diffusion models, using a fully controllable framework that scales seamlessly to multi-modality conditions like weather, time of day, agent boxes and high-definition maps. To generate paired data for supervising the editing model, we propose a novel framework on top of Prompt-to-Prompt to generate geometrically consistent synthetic paired data with global edits. We also introduce an alpha-blending framework to synthesize data with local edits, leveraging a model trained on empty street priors through novel masked training and multi-view repaint paradigm. SceneCrafter demonstrates powerful editing capabilities and achieves state-of-the-art realism, controllability, 3D consistency, and scene editing quality compared to existing baselines.


[42] Visual hallucination detection in large vision-language models via evidential conflict cs.CV | cs.LG

Tao Huang, Zhekun Liu, Rui Wang, Yang Zhang, Liping Jing

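TL;DR: The paper proposes a method based on Dempster-Shafer theory (DST) for detecting visual hallucination in large vision-language models (LVLMs), and develops the PRE-HAL dataset to systematically evaluate models' perception and reasoning capabilities.

Details

Motivation: Despite the strong multimodal capabilities of LVLMs, visual inputs and textual outputs are often inconsistent (visual hallucination), which poses significant risks in safety-critical AI applications. Existing benchmarks focus mainly on perception and overlook hallucinations arising from advanced reasoning.

Result: The proposed method outperforms five baseline uncertainty metrics on the PRE-HAL dataset, with average AUROC improvements of 4%, 10%, and 7% across three LVLMs.

Insight: Visual hallucination stems not only from weak perception but also from advanced reasoning; DST performs well at capturing model conflict and uncertainty and offers a new angle on LVLM reliability evaluation.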

Abstract: Despite the remarkable multimodal capabilities of Large Vision-Language Models (LVLMs), discrepancies often occur between visual inputs and textual outputs–a phenomenon we term visual hallucination. This critical reliability gap poses substantial risks in safety-critical Artificial Intelligence (AI) applications, necessitating a comprehensive evaluation benchmark and effective detection methods. Firstly, we observe that existing visual-centric hallucination benchmarks mainly assess LVLMs from a perception perspective, overlooking hallucinations arising from advanced reasoning capabilities. We develop the Perception-Reasoning Evaluation Hallucination (PRE-HAL) dataset, which enables the systematic evaluation of both perception and reasoning capabilities of LVLMs across multiple visual semantics, such as instances, scenes, and relations. Comprehensive evaluation with this new benchmark exposed more visual vulnerabilities, particularly in the more challenging task of relation reasoning. To address this issue, we propose, to the best of our knowledge, the first Dempster-Shafer theory (DST)-based visual hallucination detection method for LVLMs through uncertainty estimation. This method aims to efficiently capture the degree of conflict in high-level features at the model inference phase. Specifically, our approach employs simple mass functions to mitigate the computational complexity of evidence combination on power sets. We conduct an extensive evaluation of state-of-the-art LVLMs, LLaVA-v1.5, mPLUG-Owl2 and mPLUG-Owl3, with the new PRE-HAL benchmark. Experimental results indicate that our method outperforms five baseline uncertainty metrics, achieving average AUROC improvements of 4%, 10%, and 7% across three LVLMs. Our code is available at https://github.com/HT86159/Evidential-Conflict.
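
To make the notion of evidential conflict concrete, the sketch below applies Dempster's rule of combination to two simple mass functions on a toy frame of discernment. It only illustrates the standard DST machinery the abstract refers to; how the paper derives mass functions from high-level LVLM features is not described here, so the helper names `simple_mass` and `combine` and the example weights are assumptions made for illustration.

```python
from itertools import product


def simple_mass(focal, weight, frame):
    """Simple mass function: `weight` on one focal set, the rest on the whole frame."""
    focal, frame = frozenset(focal), frozenset(frame)
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if focal == frame:
        return {frame: 1.0}
    return {focal: weight, frame: 1.0 - weight}


def combine(m1, m2):
    """Dempster's rule: returns (normalized combined masses, conflict K)."""
    joint = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            joint[inter] = joint.get(inter, 0.0) + wa * wb
        else:
            # Mass assigned to incompatible hypotheses counts as conflict.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; combination undefined")
    combined = {s: w / (1.0 - conflict) for s, w in joint.items()}
    return combined, conflict


# Toy example: two pieces of evidence about whether an object is present.
frame = {"present", "absent"}
m_a = simple_mass({"present"}, 0.8, frame)
m_b = simple_mass({"absent"}, 0.6, frame)
combined, conflict = combine(m_a, m_b)
print(round(conflict, 2), {tuple(sorted(s)): round(w, 3) for s, w in combined.items()})
```

Here the conflict is the product of the two opposing weights (0.8 × 0.6 = 0.48). A larger conflict means the two sources disagree more strongly, which is the kind of signal an uncertainty-based detector can threshold.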


[43] ReMAR-DS: Recalibrated Feature Learning for Metal Artifact Reduction and CT Domain Transformation cs.CV | cs.AI

Mubashara Rehman, Niki Martinel, Michele Avanzo, Riccardo Spizzo, Christian Micheloni

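TL;DR: ReMAR-DS is a deep learning framework that uses feature recalibration to reduce metal artifacts and perform domain transformation from kVCT to MVCT, improving radiotherapy planning.

Details

Motivation: Metal artifacts in kVCT imaging degrade image quality and affect clinical decisions. Traditional methods cannot effectively reduce artifacts while preserving anatomical structures.

Result: The framework produces high-quality MVCT-like reconstructions and reduces the need for MVCT scans in radiotherapy planning, supporting its clinical value.

Insight: Feature recalibration helps the model focus on key regions and channels, improving reconstruction quality and giving clinicians a more reliable basis for decisions.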

Abstract: Artifacts in kilo-Voltage CT (kVCT) imaging degrade image quality, impacting clinical decisions. We propose a deep learning framework for metal artifact reduction (MAR) and domain transformation from kVCT to Mega-Voltage CT (MVCT). The proposed framework, ReMAR-DS, utilizes an encoder-decoder architecture with enhanced feature recalibration, effectively reducing artifacts while preserving anatomical structures. This ensures that only relevant information is utilized in the reconstruction process. By infusing recalibrated features from the encoder block, the model focuses on relevant spatial regions (e.g., areas with artifacts) and highlights key features across channels (e.g., anatomical structures), leading to improved reconstruction of artifact-corrupted regions. Unlike traditional MAR methods, our approach bridges the gap between high-resolution kVCT and artifact-resistant MVCT, enhancing radiotherapy planning. It produces high-quality MVCT-like reconstructions, validated through qualitative and quantitative evaluations. Clinically, this enables oncologists to rely on kVCT alone, reducing repeated high-dose MVCT scans and lowering radiation exposure for cancer patients.


[44] Identifying Physically Realizable Triggers for Backdoored Face Recognition Networks cs.CV | cs.CR | cs.LG

Ankita Raj, Ambar Pal, Chetan Arora

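TL;DR: The paper proposes a new method to detect whether a face recognition (FR) network contains a natural, physically realizable trigger, and to identify such triggers.

Details

Motivation: Backdoor attacks embed hidden functionality that makes a deep neural network behave anomalously when a specific input trigger is present, which poses a serious threat to face recognition systems in high-security applications.

Result: Experiments show that the method identifies triggers (e.g., green sunglasses or a red hat) with a top-5 accuracy of 74%, compared with 56% for a brute-force baseline.

Insight: The work highlights the potential threat that physically realizable triggers pose to face recognition systems and provides an effective means of defense.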

Abstract: Backdoor attacks embed a hidden functionality into deep neural networks, causing the network to display anomalous behavior when activated by a predetermined pattern in the input (the trigger), while behaving well otherwise on public test data. Recent works have shown that backdoored face recognition (FR) systems can respond to natural-looking triggers like a particular pair of sunglasses. Such attacks pose a serious threat to the applicability of FR systems in high-security applications. We propose a novel technique to (1) detect whether an FR network is compromised with a natural, physically realizable trigger, and (2) identify such triggers given a compromised network. We demonstrate the effectiveness of our methods with a compromised FR network, where we are able to identify the trigger (e.g., green sunglasses or a red hat) with a top-5 accuracy of 74%, whereas a naive brute force baseline achieves 56% accuracy.


[45] General Methods Make Great Domain-specific Foundation Models: A Case-study on Fetal Ultrasound cs.CV | cs.AI | cs.LG | I.4

Jakob Ambsdorf, Asbjørn Munk, Sebastian Llambias, Anders Nymark Christensen, Kamil Mikolaj

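TL;DR: The paper asks whether to train a custom foundation model for a specific medical domain (here, fetal ultrasound) or to use transfer learning from a generalist model. Experiments on a large-scale fetal ultrasound dataset show that pretraining a custom model is worthwhile and needs no complex methodological innovation: well-established computer vision methods suffice for state-of-the-art performance.

Details

Motivation: Faced with large-scale unlabeled medical data, researchers must decide whether to pretrain a custom foundation model or use transfer learning from a generalist model, and whether novel methods are required.

Result: State-of-the-art results on three fetal ultrasound datasets, covering classification, segmentation, and few-shot tasks.

Insight: Developing foundation models for medical domains does not require chasing methodological novelty; established methods are sufficient, especially under resource constraints.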

Abstract: With access to large-scale, unlabeled medical datasets, researchers are confronted with two questions: Should they attempt to pretrain a custom foundation model on this medical data, or use transfer-learning from an existing generalist model? And, if a custom model is pretrained, are novel methods required? In this paper we explore these questions by conducting a case-study, in which we train a foundation model on a large regional fetal ultrasound dataset of 2M images. By selecting the well-established DINOv2 method for pretraining, we achieve state-of-the-art results on three fetal ultrasound datasets, covering data from different countries, classification, segmentation, and few-shot tasks. We compare against a series of models pretrained on natural images, ultrasound images, and supervised baselines. Our results demonstrate two key insights: (i) Pretraining on custom data is worth it, even if smaller models are trained on less data, as scaling in natural image pretraining does not translate to ultrasound performance. (ii) Well-tuned methods from computer vision are making it feasible to train custom foundation models for a given medical domain, requiring no hyperparameter tuning and little methodological adaptation. Given these findings, we argue that a bias towards methodological innovation should be avoided when developing domain specific foundation models under common computational resource constraints.


[46] Recurrent Visual Feature Extraction and Stereo Attentions for CT Report Generation cs.CV | cs.CL

Yuanhe Tian, Lei Mao, Yan Song

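TL;DR: The paper proposes a large language model (LLM) based CT report generation method that uses recurrent visual feature extraction and stereo attentions to model the strong correlation among consecutive CT slices and generate more accurate reports.

Details

Motivation: Existing methods neither explicitly model the transformations among CT slices nor effectively integrate multi-level image features, particularly those containing specific organ lesions. The paper aims to exploit the strong correlation among CT slices to improve the accuracy and quality of CT report generation.

Result: The method outperforms baseline models on the M3D-Cap dataset and achieves SOTA results.

Insight: Explicitly modeling the transformations among CT slices and multi-level features, and using attention to align visual and textual information, improves the quality of generated CT reports.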

Abstract: Generating reports for computed tomography (CT) images is a challenging task that, while similar to existing work on medical image report generation, has unique characteristics such as the spatial encoding of multiple images and the alignment between image volumes and text. Existing solutions typically use general 2D or 3D image processing techniques to extract features from a CT volume: they first compress the volume and then divide the compressed CT slices into patches for visual encoding. These approaches do not explicitly account for the transformations among CT slices, nor do they effectively integrate multi-level image features, particularly those containing specific organ lesions, to instruct CT report generation (CTRG). Considering the strong correlation among consecutive slices in CT scans, we propose a large language model (LLM) based CTRG method with recurrent visual feature extraction and stereo attentions for hierarchical feature modeling. Specifically, we use a vision Transformer to recurrently process each slice in a CT volume, and employ a set of attentions over the encoded slices from different perspectives to selectively obtain important visual information and align it with textual features, so as to better instruct an LLM for CTRG. Experimental results and further analysis on the benchmark M3D-Cap dataset show that our method outperforms strong baseline models and achieves state-of-the-art results, demonstrating its validity and effectiveness.


[47] MambaOutRS: A Hybrid CNN-Fourier Architecture for Remote Sensing Image Classification cs.CV | cs.AI

Minjong Cheon, Changbae Mun

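TL;DR: MambaOutRS is a hybrid architecture combining CNNs and Fourier transforms for remote sensing image classification. It uses stacked Gated CNN blocks for local feature extraction and a Fourier Filter Gate module to capture global context, and it significantly outperforms existing methods.

Details

Motivation: Although State Space Models (SSMs) such as Mamba perform well on vision tasks, adapting them to 2D visual data requires complex modifications that may reduce efficiency. The paper proposes a more efficient alternative.

Result: SOTA performance on several remote sensing datasets; MambaOutRS-t (24.0M parameters) reaches F1-scores of 98.41% on UC Merced and 95.99% on AID.

Insight: Combining convolutions with frequency-domain operations can effectively replace complex SSMs, offering an efficient paradigm for vision tasks where computational efficiency matters.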

Abstract: Recent advances in deep learning for vision tasks have seen the rise of State Space Models (SSMs) like Mamba, celebrated for their linear scalability. However, their adaptation to 2D visual data often necessitates complex modifications that may diminish efficiency. In this paper, we introduce MambaOutRS, a novel hybrid convolutional architecture for remote sensing image classification that re-evaluates the necessity of recurrent SSMs. MambaOutRS builds upon stacked Gated CNN blocks for local feature extraction and introduces a novel Fourier Filter Gate (FFG) module that operates in the frequency domain to capture global contextual information efficiently. Our architecture employs a four-stage hierarchical design and was extensively evaluated on challenging remote sensing datasets: UC Merced, AID, NWPU-RESISC45, and EuroSAT. MambaOutRS consistently achieved state-of-the-art (SOTA) performance across these benchmarks. Notably, our MambaOutRS-t variant (24.0M parameters) attained the highest F1-scores of 98.41% on UC Merced and 95.99% on AID, significantly outperforming existing baselines, including larger transformer models and Mamba-based architectures, despite using considerably fewer parameters. An ablation study conclusively demonstrates the critical role of the Fourier Filter Gate in enhancing the model’s ability to capture global spatial patterns, leading to robust and accurate classification. These results strongly suggest that the complexities of recurrent SSMs can be effectively superseded by a judicious combination of gated convolutions for spatial mixing and frequency-based gates for spectral global context. Thus, MambaOutRS provides a compelling and efficient paradigm for developing high-performance deep learning models in remote sensing and other vision domains, particularly where computational efficiency is paramount.


[48] SMARTIES: Spectrum-Aware Multi-Sensor Auto-Encoder for Remote Sensing Images cs.CV

Gencer Sumbul, Chang Xu, Emanuele Dalsasso, Devis Tuia

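TL;DR: SMARTIES is a generic multi-sensor auto-encoder model that projects data from different remote sensing sensors into a shared spectrum-aware space, enabling flexible cross-sensor data processing.

Details

Motivation: Existing deep learning models are often designed for a single sensor or a fixed combination of sensors and cannot flexibly accept different sensor inputs, which limits scalability and generalization for multi-sensor remote sensing data.

Result: On both single- and multi-modal tasks, SMARTIES outperforms models that rely on sensor-specific pretraining.

Insight: The spectrum-aware space and cross-sensor token mixup are key to improving generalization and multi-sensor adaptability.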

Abstract: From optical sensors to microwave radars, leveraging the complementary strengths of remote sensing (RS) sensors is crucial for achieving dense spatio-temporal monitoring of our planet. In contrast, recent deep learning models, whether task-specific or foundational, are often specific to single sensors or to fixed combinations: adapting such models to different sensory inputs requires both architectural changes and re-training, limiting scalability and generalization across multiple RS sensors. On the contrary, a single model able to modulate its feature representations to accept diverse sensors as input would pave the way to agile and flexible multi-sensor RS data processing. To address this, we introduce SMARTIES, a generic and versatile foundation model lifting sensor-specific/dependent efforts and enabling scalability and generalization to diverse RS sensors: SMARTIES projects data from heterogeneous sensors into a shared spectrum-aware space, enabling the use of arbitrary combinations of bands both for training and inference. To obtain sensor-agnostic representations, we train a single, unified transformer model reconstructing masked multi-sensor data with cross-sensor token mixup. On both single- and multi-modal tasks across diverse sensors, SMARTIES outperforms previous models that rely on sensor-specific pretraining. Our code and pretrained models are available at https://gsumbul.github.io/SMARTIES.


[49] Vision Transformer-Based Time-Series Image Reconstruction for Cloud-Filling Applications cs.CV | cs.AI | cs.LG | eess.IV

Lujun Li, Yiqun Wang, Radu State

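TL;DR: The paper proposes a Vision Transformer (ViT) based time-series image reconstruction framework that fills in multispectral imagery (MSI) in cloud-covered regions using complementary information from synthetic aperture radar (SAR).

Details

Motivation: Cloud cover causes missing MSI data, which hampers early season crop mapping. SAR data is unaffected by clouds but lacks spectral detail, so a method is needed that combines the strengths of both to reconstruct complete MSI data.

Result: Experiments show that the Time-series ViT framework significantly outperforms baselines that use non-time-series MSI and SAR, or time-series MSI alone, when reconstructing MSI in cloud-covered regions.

Insight: Combining time-series MSI with SAR data and exploiting the ViT attention mechanism effectively addresses cloud cover and improves the quality and accuracy of MSI reconstruction.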

Abstract: Cloud cover in multispectral imagery (MSI) poses significant challenges for early season crop mapping, as it leads to missing or corrupted spectral information. Synthetic aperture radar (SAR) data, which is not affected by cloud interference, offers a complementary solution but lacks sufficient spectral detail for precise crop mapping. To address this, we propose a novel framework, Time-series MSI Image Reconstruction using Vision Transformer (ViT), to reconstruct MSI data in cloud-covered regions by leveraging, through the attention mechanism, the temporal coherence of MSI and the complementary information from SAR. Comprehensive experiments, using rigorous reconstruction evaluation metrics, demonstrate that the Time-series ViT framework significantly outperforms baselines that use non-time-series MSI and SAR or time-series MSI without SAR, effectively enhancing MSI image reconstruction in cloud-covered regions.


[50] Implementing blind navigation through multi-modal sensing and gait guidance cs.CV | cs.SY | eess.SY

Feifan Yan, Tianle Zeng, Meixi He

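TL;DR: The paper presents a wearable blind-guiding device based on gait analysis and multimodal sensing, and shows experimentally that it outperforms a traditional guide cane.

Details

Motivation: More than 220 million people worldwide have impaired vision. Traditional aids such as guide canes and guide dogs have shortcomings, so smarter assistive navigation solutions are needed.

Result: Experiments show that the device outperforms a traditional guide cane in both indoor and outdoor navigation.

Insight: Combining gait guidance with multimodal sensing offers a new technical path for navigation by visually impaired people and shows the potential of intelligent assistive devices.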

Abstract: By 2023, the global population of individuals with impaired vision had surpassed 220 million. People with impaired vision have difficulty finding paths and avoiding obstacles, and must rely on auxiliary tools for help. Although traditional aids such as guide canes and guide dogs exist, they still have shortcomings. In this paper, we present a wearable blind-guiding device that performs navigation guidance through our proposed Gait-based Guiding System. The device integrates gait phase analysis for walking guidance and uses multimodal sensing to acquire diverse environmental information. We conducted both indoor and outdoor experiments and compared the device with a standard guide cane. The results show that our device delivers superior performance in blind guidance.


[51] Self-Supervised Multimodal NeRF for Autonomous Driving cs.CV

Gaurav Sharma, Ravi Kothari, Josef Schmid

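TL;DR: The paper proposes NVSF, a self-supervised multimodal framework based on Neural Radiance Fields (NeRF) for modeling static and dynamic scenes in autonomous driving. It trains efficiently and converges quickly without 3D labels.

Details

Motivation: Multimodal data in autonomous driving (such as LiDAR and camera) needs efficient modeling methods, and existing methods usually depend on 3D labels, which limits their applicability. The paper proposes a self-supervised framework to address this.

Result: Experiments on the KITTI-360 dataset show that NVSF outperforms baseline models on both LiDAR and camera data.

Insight: Self-supervision can greatly reduce the dependence on 3D labels, and joint multimodal learning helps improve modeling accuracy for autonomous driving scenes.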

Abstract: In this paper, we propose a Neural Radiance Fields (NeRF) based framework, referred to as the Novel View Synthesis Framework (NVSF). It jointly learns the implicit neural representation of space and time-varying scenes for both LiDAR and camera. We test it on a real-world autonomous driving scenario containing both static and dynamic scenes. Compared to existing multimodal dynamic NeRFs, our framework is self-supervised, thus eliminating the need for 3D labels. For efficient training and faster convergence, we introduce heuristic-based image pixel sampling to focus on pixels with rich information. To preserve the local features of LiDAR points, a Double Gradient based mask is employed. Extensive experiments on the KITTI-360 dataset show that, compared to the baseline models, our framework achieves the best performance in both the LiDAR and camera domains. Code of the model is available at https://github.com/gaurav00700/Selfsupervised-NVSF


[52] VideoPCDNet: Video Parsing and Prediction with Phase Correlation Networks cs.CV | cs.AI

Noel José Rodrigues Vicente, Enrique Lehner, Angel Villar-Corrales, Jan Nogga, Sven Behnke

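TL;DR: The paper proposes VideoPCDNet, an unsupervised framework for object-centric video decomposition and prediction that uses frequency-domain phase correlation to parse videos and predict future frames.

Details

Motivation: Understanding and predicting video content in dynamic environments is essential for planning and reasoning, but unsupervised learning of object representations and dynamics remains challenging.

Result: The model outperforms baselines on unsupervised tracking and prediction while learning interpretable object and motion representations.

Insight: Frequency-domain techniques offer an efficient and interpretable approach to unsupervised object decomposition and prediction.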

Abstract: Understanding and predicting video content is essential for planning and reasoning in dynamic environments. Despite advancements, unsupervised learning of object representations and dynamics remains challenging. We present VideoPCDNet, an unsupervised framework for object-centric video decomposition and prediction. Our model uses frequency-domain phase correlation techniques to recursively parse videos into object components, which are represented as transformed versions of learned object prototypes, enabling accurate and interpretable tracking. By explicitly modeling object motion through a combination of frequency domain operations and lightweight learned modules, VideoPCDNet enables accurate unsupervised object tracking and prediction of future video frames. In our experiments, we demonstrate that VideoPCDNet outperforms multiple object-centric baseline models for unsupervised tracking and prediction on several synthetic datasets, while learning interpretable object and motion representations.
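
Phase correlation, the frequency-domain technique named in the abstract, estimates the translation between two images from the phase of their cross-power spectrum. The sketch below shows the classical, non-learned version with NumPy for integer circular shifts. It is an illustration of the underlying idea only: the function name `phase_correlation_shift` is hypothetical, and VideoPCDNet's learned prototypes and lightweight modules are not modeled here.

```python
import numpy as np


def phase_correlation_shift(reference: np.ndarray, moved: np.ndarray) -> tuple[int, int]:
    """Estimate (dy, dx) such that moved ~= np.roll(reference, (dy, dx), axis=(0, 1))."""
    f_ref = np.fft.fft2(reference)
    f_mov = np.fft.fft2(moved)
    # Cross-power spectrum; dividing by the magnitude keeps only the phase.
    cross = f_mov * np.conj(f_ref)
    cross /= np.maximum(np.abs(cross), 1e-12)
    # The inverse transform of a pure phase ramp peaks at the shift.
    corr = np.fft.ifft2(cross).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = corr.shape
    # Map peak indices to signed shifts (indices past the midpoint wrap around).
    dy = int(dy) - h if dy > h // 2 else int(dy)
    dx = int(dx) - w if dx > w // 2 else int(dx)
    return dy, dx


rng = np.random.default_rng(0)
image = rng.random((32, 32))
shifted = np.roll(image, shift=(3, -5), axis=(0, 1))
print(phase_correlation_shift(image, shifted))
```

Because the estimate comes from a single pair of FFTs, its cost does not depend on the size of the shift, which is part of why frequency-domain matching is attractive for tracking.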


[53] PEVLM: Parallel Encoding for Vision-Language Models cs.CV | cs.LG | cs.PF

Letian Kang, Shixian Luo, Yiqiang Li, Xiaoyang Yu, Shenxuan Zhou

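TL;DR: PEVLM is a parallel encoding strategy that improves the prefill efficiency of Vision-Language Models (VLMs), addressing the high computational complexity of standard attention in long video understanding.

Details

Motivation: VLMs perform strongly on video-language tasks, but their application to long video understanding is constrained by the quadratic complexity of standard attention.

Result: On the LongVideoBench benchmark, PEVLM achieves up to 8.37% higher accuracy, up to 7.47x faster attention computation, and a 40% reduction in end-to-end latency.

Insight: PEVLM suits low-latency, long-context video understanding and could be useful in real-world applications such as autonomous driving.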

Abstract: Vision-Language Models (VLMs) have demonstrated strong performance in video-language tasks, yet their application to long video understanding remains constrained by the quadratic complexity of standard attention mechanisms. In this paper, we propose PEVLM, a parallel encoding strategy specifically designed to improve the prefill efficiency of VLMs without requiring model finetuning. PEVLM partitions the input into block-wise segments with a shared sink, preserves full-attention positional embeddings, and aligns attention weights to mimic full-attention distributions. This design reduces attention computation from $O((T \times N)^2)$ to $O(T \times N)$ while maintaining high accuracy. Extensive experiments on the LongVideoBench benchmark show that PEVLM achieves up to 8.37% accuracy improvement over existing inference-efficient methods and delivers up to 7.47x speedup in attention computation and 40% reduction in end-to-end latency. Under strict latency constraints, PEVLM significantly outperforms baselines, raising accuracy from 23.26% to 61.03%. These results highlight PEVLM’s effectiveness for low-latency, long-context video understanding, making it well-suited for real-world applications such as autonomous driving.
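
The complexity claim can be checked with a simple count of query-key pairs. The sketch below compares full attention over all T × N visual tokens with block-wise attention in which each block attends only to itself and a shared sink. It is a back-of-the-envelope model under stated assumptions, not PEVLM's implementation: the function names, the block length of 256 tokens, and the sink length of 16 tokens are hypothetical, and causal masking and text tokens are ignored.

```python
def full_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs when every token attends to every token: (T * N)^2."""
    total = num_frames * tokens_per_frame
    return total * total


def blockwise_attention_pairs(
    num_frames: int, tokens_per_frame: int, block_len: int, sink_len: int
) -> int:
    """Query-key pairs when each block attends to itself plus a shared sink.

    With block_len and sink_len held constant, the count grows linearly in T * N.
    """
    total = num_frames * tokens_per_frame
    pairs = 0
    for start in range(0, total, block_len):
        size = min(block_len, total - start)  # the last block may be shorter
        pairs += size * (size + sink_len)
    return pairs


# Hypothetical sizes: 64 frames, 144 tokens per frame, 256-token blocks, 16-token sink.
full = full_attention_pairs(64, 144)
block = blockwise_attention_pairs(64, 144, block_len=256, sink_len=16)
print(full, block, round(full / block, 1))
```

Doubling the number of frames quadruples the full-attention count but only doubles the block-wise count, which is the quadratic-to-linear reduction the abstract states.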


[54] Video Compression for Spatiotemporal Earth System Data cs.CV | cs.DL | eess.IV | physics.geo-ph

Oscar J. Pellicer-Valero, Cesar Aybar, Gustau Camps Valls

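TL;DR: The paper presents xarrayvideo, a Python library that compresses multichannel spatiotemporal Earth system data using video compression, reaching compression ratios of up to 250x while maintaining high fidelity and suitability for deep learning tasks.

Details

Motivation: As Earth observation data grows rapidly in size, conventional storage and processing approaches are strained. The paper uses video compression to relieve this bottleneck and cut storage and transfer costs.

Result: Compression ratios reach up to 250x. Across the four datasets, PSNRs range from 40.60 to 55.86 dB at 0.1 bpppb and from 54.28 to 65.91 dB at 1 bpppb, and the compressed data show no performance loss in downstream deep learning tasks.

Insight: Video compression can be applied effectively to Earth system data, greatly reducing storage requirements without hurting deep learning performance.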

Abstract: Large-scale Earth system datasets, from high-resolution remote sensing imagery to spatiotemporal climate model outputs, exhibit characteristics analogous to those of standard videos. Their inherent spatial, temporal, and spectral redundancies can thus be readily exploited by established video compression techniques. Here, we present xarrayvideo, a Python library for compressing multichannel spatiotemporal datasets by encoding them as videos. Our approach achieves compression ratios of up to 250x while maintaining high fidelity by leveraging standard, well-optimized video codecs through ffmpeg. We demonstrate the library’s effectiveness on four real-world multichannel spatiotemporal datasets: DynamicEarthNet (very high resolution Planet images), DeepExtremeCubes (high resolution Sentinel-2 images), ERA5 (weather reanalysis data), and the SimpleS2 dataset (high resolution multichannel Sentinel-2 images), achieving Peak Signal-to-Noise Ratios (PSNRs) of 55.86, 40.60, 46.58, and 43.23 dB at 0.1 bits per pixel per band (bpppb) and 65.91, 54.28, 62.90, and 55.04 dB at 1 bpppb. We are redistributing two of these datasets, DeepExtremeCubes (2.3 Tb) and DynamicEarthNet (525 Gb), in the machine-learning-ready and cloud-ready TACO format through HuggingFace at significantly reduced sizes (270 Gb and 8.5 Gb, respectively) without compromising quality (PSNR 55.77-56.65 and 60.15). No performance loss is observed when the compressed versions of these datasets are used in their respective deep learning-based downstream tasks (next step reflectance prediction and landcover segmentation). In conclusion, xarrayvideo presents an efficient solution for handling the rapidly growing size of Earth observation datasets, making advanced compression techniques accessible and practical to the Earth science community. The library is available for use at https://github.com/IPL-UV/xarrayvideo
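
The two metrics quoted in this abstract can be sketched as follows. This is a minimal illustration rather than the xarrayvideo API: `psnr` and `compression_ratio` are hypothetical helpers, the signal is a flat list of values scaled to `[0, max_value]`, and how the paper normalizes each band before computing PSNR is not stated here.

```python
import math


def psnr(original, decoded, max_value=1.0):
    """Peak Signal-to-Noise Ratio in dB: 10 * log10(MAX^2 / MSE)."""
    if len(original) != len(decoded) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def compression_ratio(original_size, compressed_size):
    """Ratio of original to compressed size (same units for both)."""
    return original_size / compressed_size
```

Applied to the redistributed sizes given in the abstract, and assuming 1 Tb = 1000 Gb, `compression_ratio(2300, 270)` is about 8.5 for DeepExtremeCubes and `compression_ratio(525, 8.5)` is about 61.8 for DynamicEarthNet.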


[55] ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing cs.CV | cs.CL

Long Xing, Qidong Huang, Xiaoyi Dong, Pan Zhang, Yuhang Zang

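TL;DR: ScaleCap is an inference-time scalable image captioning strategy that uses dual-modality debiasing to counter the inherent biases of vision-language models, producing more comprehensive and detailed captions.

Details

Motivation: Existing large vision-language models (LVLMs) have inherent multimodal and linguistic biases, which lead to unevenly detailed captions or hallucinated descriptions of non-existent objects. A scalable method for improving caption quality is needed.

Result: Experiments show that ScaleCap's captions are more accurate, balanced, and informative, and that they perform well in VQA and image reconstruction tasks. Using 450K ScaleCap-annotated images for LVLM pretraining yields consistent gains across 11 benchmarks.

Insight: By progressively refining the caption as the inference budget grows, ScaleCap mitigates the inherent biases of LVLMs and offers a new route to high-quality image captioning.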

Abstract: This paper presents ScaleCap, an inference-time scalable image captioning strategy that generates comprehensive and detailed image captions. The key challenges of high-quality image captioning lie in the inherent biases of LVLMs: multimodal bias resulting in imbalanced descriptive granularity, offering detailed accounts of some elements while merely skimming over others; linguistic bias leading to hallucinated descriptions of non-existent objects. To address these issues, we propose a scalable debiased captioning strategy, which continuously enriches and calibrates the caption with increased inference budget. Specifically, we propose two novel components: heuristic question answering and contrastive sentence rating. The former generates content-specific questions based on the image and answers them to progressively inject relevant information into the caption. The latter employs sentence-level offline contrastive decoding to effectively identify and eliminate hallucinations caused by linguistic biases. With increased inference cost, more heuristic questions are raised by ScaleCap to progressively capture additional visual details, generating captions that are more accurate, balanced, and informative. Extensive modality alignment experiments demonstrate the effectiveness of ScaleCap. Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks. Furthermore, ScaleCap showcases superb richness and fidelity of generated captions with two additional tasks: replacing images with captions in VQA task, and reconstructing images from captions to assess semantic coverage. Code is available at https://github.com/Cooperx521/ScaleCap.


[56] SAM2-SGP: Enhancing SAM2 for Medical Image Segmentation via Support-Set Guided Prompting cs.CV

Yang Xing, Jiong Wu, Yuheng Bu, Kuang Gong

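TL;DR: The paper proposes SAM2-SGP, a framework that uses support-set guided prompting to remove SAM2's dependence on manual prompts and to mitigate domain shift in medical image segmentation.

Details

Motivation: Although SAM2 performs well at zero-shot image segmentation, it still requires manually supplied prompts for medical image segmentation and suffers from domain shift, both of which limit its performance.

Result: Across multiple medical imaging modalities (e.g., X-ray, CT, MRI), it significantly outperforms state-of-the-art models (e.g., nnUNet, SwinUNet) and foundation models (e.g., SAM2, MedSAM2).

Insight: With automated prompt generation and a domain adaptation strategy, SAM2-SGP offers an efficient medical image segmentation solution that needs no manual intervention.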

Abstract: Although new vision foundation models such as Segment Anything Model 2 (SAM2) have significantly enhanced zero-shot image segmentation capabilities, reliance on human-provided prompts poses significant challenges in adapting SAM2 to medical image segmentation tasks. Moreover, SAM2’s performance in medical image segmentation was limited by the domain shift issue, since it was originally trained on natural images and videos. To address these challenges, we proposed SAM2 with support-set guided prompting (SAM2-SGP), a framework that eliminated the need for manual prompts. The proposed model leveraged the memory mechanism of SAM2 to generate pseudo-masks using image-mask pairs from a support set via a Pseudo-mask Generation (PMG) module. We further introduced a novel Pseudo-mask Attention (PMA) module, which used these pseudo-masks to automatically generate bounding boxes and enhance localized feature extraction by guiding attention to relevant areas. Furthermore, a low-rank adaptation (LoRA) strategy was adopted to mitigate the domain shift issue. The proposed framework was evaluated on both 2D and 3D datasets across multiple medical imaging modalities, including fundus photography, X-ray, computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET), and ultrasound. The results demonstrated a significant performance improvement over state-of-the-art models, such as nnUNet and SwinUNet, as well as foundation models, such as SAM2 and MedSAM2, underscoring the effectiveness of the proposed approach. Our code is publicly available at https://github.com/astlian9/SAM_Support.


[57] Semantic Scene Graph for Ultrasound Image Explanation and Scanning Guidance cs.CV | cs.AI | cs.LG | eess.IV

Xuesong Li, Dianye Huang, Yameng Zhang, Nassir Navab, Zhongliang Jiang

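TL;DR: The paper proposes a scene graph (SG)-based method for ultrasound image explanation and scanning guidance. It generates the SG with a transformer-based one-stage method and refines it with a large language model (LLM) to provide easy-to-understand explanations. It also explores the SG's potential for guiding ultrasound scanning toward missing anatomical structures.

Details

Motivation: Because ultrasound images vary widely in appearance, the need of non-expert users (e.g., in point-of-care settings) for image explanation and scanning guidance has not been adequately explored.

Result: The method was validated on neck images (carotid and thyroid) from five volunteers, showing potential to improve the interpretability and usability of ultrasound.

Insight: Combining scene graphs with LLMs gives non-expert users an intuitive way to interpret ultrasound images and receive scanning guidance, which could broaden the use of ultrasound.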

Abstract: Understanding medical ultrasound imaging remains a long-standing challenge due to significant visual variability caused by differences in imaging and acquisition parameters. Recent advancements in large language models (LLMs) have been used to automatically generate terminology-rich summaries oriented toward clinicians with sufficient physiological knowledge. Nevertheless, the increasing demand for improved ultrasound interpretability and basic scanning guidance among non-expert users, e.g., in point-of-care settings, has not yet been explored. In this study, we first introduce the scene graph (SG) for ultrasound images to explain image content to ordinary users and provide guidance for ultrasound scanning. The ultrasound SG is first computed using a transformer-based one-stage method, eliminating the need for explicit object detection. To generate a graspable image explanation for ordinary users, the user query is then used to further refine the abstract SG representation through LLMs. Additionally, the predicted SG is explored for its potential in guiding ultrasound scanning toward missing anatomies within the current imaging view, assisting ordinary users in achieving more standardized and complete anatomical exploration. The effectiveness of this SG-based image explanation and scanning guidance has been validated on images from the left and right neck regions, including the carotid and thyroid, across five volunteers. The results demonstrate the potential of the method to maximally democratize ultrasound by enhancing its interpretability and usability for ordinary users.


[58] UltraAD: Fine-Grained Ultrasound Anomaly Classification via Few-Shot CLIP Adaptation cs.CV

Yue Zhou, Yuan Bi, Wenjuan Tong, Wei Wang, Nassir Navab

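TL;DR: UltraAD is a vision-language-model-based method that performs fine-grained anomaly classification in ultrasound images from few-shot examples while addressing domain gaps.

Details

Motivation: Anomaly detection in medical images calls for fine-grained classification, yet existing methods struggle to distinguish cases such as benign versus malignant tumors. Ultrasound imaging is also sensitive to devices and acquisition parameters, which creates significant domain gaps. UltraAD targets both problems.

Result: It outperforms existing methods on three breast ultrasound datasets in both lesion localization and fine-grained classification.

Insight: Aligning textual information with visual features is an effective route to fine-grained classification in the medical domain, and few-shot examples combined with a VLM can help bridge domain gaps.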

Abstract: Precise anomaly detection in medical images is critical for clinical decision-making. While recent unsupervised or semi-supervised anomaly detection methods trained on large-scale normal data show promising results, they lack fine-grained differentiation, such as benign vs. malignant tumors. Additionally, ultrasound (US) imaging is highly sensitive to devices and acquisition parameter variations, creating significant domain gaps in the resulting US images. To address these challenges, we propose UltraAD, a vision-language model (VLM)-based approach that leverages few-shot US examples for generalized anomaly localization and fine-grained classification. To enhance localization performance, the image-level token of query visual prototypes is first fused with learnable text embeddings. This image-informed prompt feature is then further integrated with patch-level tokens, refining local representations for improved accuracy. For fine-grained classification, a memory bank is constructed from few-shot image samples and corresponding text descriptions that capture anatomical and abnormality-specific features. During training, the stored text embeddings remain frozen, while image features are adapted to better align with medical data. UltraAD has been extensively evaluated on three breast US datasets, outperforming state-of-the-art methods in both lesion localization and fine-grained medical classification. The code will be released upon acceptance.


[59] Systematic Comparison of Projection Methods for Monocular 3D Human Pose Estimation on Fisheye Images cs.CV | cs.RO | I.2.10; I.2.9; I.4.8; I.4.9

Stephanie Käs, Sven Peter, Henrik Thillmann, Anton Burenko, David Benjamin Adrian

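TL;DR: The paper systematically compares projection methods for 3D human pose estimation on monocular fisheye images, finds that models such as the double sphere model significantly improve accuracy, and proposes a heuristic that selects the projection model from the detection bounding box.

Details

Motivation: Fisheye cameras give robots a wider field of view (FOV), but the curved distortion of fisheye optics makes human pose estimation harder. The effectiveness of different projection methods on monocular fisheye images, especially for poses that cover a wide FOV, had not been systematically evaluated.

Result: In close-up scenarios the pinhole model performs poorly, while the double sphere model significantly improves 3D pose estimation accuracy. The best projection method depends on the FOV covered by the human pose.

Insight: 3D pose estimation on fisheye images should choose the projection method according to the FOV the pose covers; advanced models such as the double sphere model perform well in wide-FOV scenarios.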

Abstract: Fisheye cameras offer robots the ability to capture human movements across a wider field of view (FOV) than standard pinhole cameras, making them particularly useful for applications in human-robot interaction and automotive contexts. However, accurately detecting human poses in fisheye images is challenging due to the curved distortions inherent to fisheye optics. While various methods for undistorting fisheye images have been proposed, their effectiveness and limitations for poses that cover a wide FOV have not been systematically evaluated in the context of absolute human pose estimation from monocular fisheye images. To address this gap, we evaluate the impact of pinhole, equidistant and double sphere camera models, as well as cylindrical projection methods, on 3D human pose estimation accuracy. We find that in close-up scenarios, pinhole projection is inadequate, and the optimal projection method varies with the FOV covered by the human pose. The usage of advanced fisheye models like the double sphere model significantly enhances 3D human pose estimation accuracy. We propose a heuristic for selecting the appropriate projection model based on the detection bounding box to enhance prediction quality. Additionally, we introduce and evaluate on our novel dataset FISHnCHIPS, which features 3D human skeleton annotations in fisheye images, including images from unconventional angles, such as extreme close-ups, ground-mounted cameras, and wide-FOV poses, available at: https://www.vision.rwth-aachen.de/fishnchips
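
To make the difference between two of the camera models named above concrete, here is a minimal sketch of the textbook pinhole projection (image radius r = f·tan θ) and equidistant fisheye projection (r = f·θ), where θ is the angle between the ray and the optical axis. The function names and the single focal length `f` with principal point `(cx, cy)` are simplifying assumptions; this is not the paper's evaluation code, and the double sphere and cylindrical models are not shown.

```python
import math


def project_pinhole(point, f, cx, cy):
    """Pinhole model: valid only for points in front of the camera (z > 0)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("pinhole projection requires z > 0")
    return (f * x / z + cx, f * y / z + cy)


def project_equidistant(point, f, cx, cy):
    """Equidistant fisheye model: image radius grows linearly with the
    angle theta between the ray and the optical axis (r = f * theta)."""
    x, y, z = point
    rho = math.hypot(x, y)
    if rho == 0.0:
        return (cx, cy)  # point on the optical axis
    theta = math.atan2(rho, z)
    r = f * theta
    return (r * x / rho + cx, r * y / rho + cy)
```

A point at 90 degrees from the optical axis, such as `(1.0, 0.0, 0.0)`, has no pinhole projection but still maps to a finite pixel under the equidistant model, which is one way to see why pinhole projection struggles with wide-FOV poses.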


[60] CoCo4D: Comprehensive and Complex 4D Scene Generation cs.CV

Junwei Zhou, Xueting Li, Lu Qi, Ming-Hsuan Yang

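TL;DR: CoCo4D is a framework for generating dynamic 4D scenes from text prompts. It splits generation into a dynamic foreground and an evolving background, both guided by a reference motion sequence, and uses progressive outpainting to synthesize multi-view consistent, immersive 4D scenes.

Details

Motivation: Existing 4D synthesis methods are mostly limited to object-level generation or dynamic scenes with limited novel views, and cannot produce multi-view consistent, immersive dynamic 4D scenes. A more comprehensive solution is needed.

Result: Experiments show that CoCo4D performs comparably to or better than existing methods on 4D scene generation, supporting its effectiveness and efficiency.

Insight: Decomposing scene generation and guiding it with a motion sequence can markedly improve the quality and consistency of generated dynamic 4D scenes.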

Abstract: Existing 4D synthesis methods primarily focus on object-level generation or dynamic scene synthesis with limited novel views, restricting their ability to generate multi-view consistent and immersive dynamic 4D scenes. To address these constraints, we propose a framework (dubbed as CoCo4D) for generating detailed dynamic 4D scenes from text prompts, with the option to include images. Our method leverages the crucial observation that articulated motion typically characterizes foreground objects, whereas background alterations are less pronounced. Consequently, CoCo4D divides 4D scene synthesis into two responsibilities: modeling the dynamic foreground and creating the evolving background, both directed by a reference motion sequence. Given a text prompt and an optional reference image, CoCo4D first generates an initial motion sequence utilizing video diffusion models. This motion sequence then guides the synthesis of both the dynamic foreground object and the background using a novel progressive outpainting scheme. To ensure seamless integration of the moving foreground object within the dynamic background, CoCo4D optimizes a parametric trajectory for the foreground, resulting in realistic and coherent blending. Extensive experiments show that CoCo4D achieves comparable or superior performance in 4D scene generation compared to existing methods, demonstrating its effectiveness and efficiency. More results are presented on our website https://colezwhy.github.io/coco4d/.


[61] Bind-Your-Avatar: Multi-Talking-Character Video Generation with Dynamic 3D-mask-based Embedding Router cs.CV

Yubo Huang, Weiqiang Wang, Sirui Zhao, Tong Xu, Lin Liu

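TL;DR: Bind-Your-Avatar is an MM-DiT-based model for generating videos of multiple characters talking in the same scene, addressing audio-to-character correspondence control and the lack of suitable data.

Details

Motivation: Existing methods mainly target single-character talking-head generation. Generating conversation videos with multiple characters in the same scene faces two challenges: audio-to-character correspondence control and a lack of datasets.

Result: Experiments show that the method outperforms existing techniques in multi-character video generation, with more accurate audio-to-character correspondence and smoother results.

Insight: Combining 3D masks with geometric priors gives fine-grained control and temporal smoothness for multi-character video generation, and the new dataset and benchmark support further research in this area.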

Abstract: Recent years have witnessed remarkable advances in audio-driven talking head generation. However, existing approaches predominantly focus on single-character scenarios. While some methods can create separate conversation videos between two individuals, the critical challenge of generating unified conversation videos with multiple physically co-present characters sharing the same spatial environment remains largely unaddressed. This setting presents two key challenges: audio-to-character correspondence control and the lack of suitable datasets featuring multi-character talking videos within the same scene. To address these challenges, we introduce Bind-Your-Avatar, an MM-DiT-based model specifically designed for multi-talking-character video generation in the same scene. Specifically, we propose (1) A novel framework incorporating a fine-grained Embedding Router that binds 'who' and 'speak what' together to address the audio-to-character correspondence control. (2) Two methods for implementing a 3D-mask embedding router that enables frame-wise, fine-grained control of individual characters, with distinct loss functions based on observed geometric priors and a mask refinement strategy to enhance the accuracy and temporal smoothness of the predicted masks. (3) The first dataset, to the best of our knowledge, specifically constructed for multi-talking-character video generation, and accompanied by an open-source data processing pipeline, and (4) A benchmark for the dual-talking-characters video generation, with extensive experiments demonstrating superior performance over multiple state-of-the-art methods.


[62] SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution cs.CV

Liangbin Xie, Yu Li, Shian Du, Menghan Xia, Xintao Wang

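TL;DR: The paper proposes SimpleGVR, a simple baseline for studying key design principles of cascaded video super-resolution (VSR) models, which generates high-resolution video efficiently through a decoupled two-stage process.

Details

Motivation: As user demand for high-resolution video grows, relying on latent computation alone is no longer adequate. Decoupling the process into semantic content generation and detail synthesis allows high-quality output to be produced more efficiently.

Result: Experiments show that the method outperforms existing approaches; ablation studies confirm the effectiveness of each design choice, and computational overhead is drastically reduced.

Insight: Cascaded VSR models should be designed to align with the base model's output, and timestep sampling and noise augmentation strategies strongly affect model performance.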

Abstract: Latent diffusion models have emerged as a leading paradigm for efficient video generation. However, as user expectations shift toward higher-resolution outputs, relying solely on latent computation becomes inadequate. A promising approach involves decoupling the process into two stages: semantic content generation and detail synthesis. The former employs a computationally intensive base model at lower resolutions, while the latter leverages a lightweight cascaded video super-resolution (VSR) model to achieve high-resolution output. In this work, we focus on studying key design principles for the latter, cascaded VSR models, which are currently underexplored. First, we propose two degradation strategies to generate training pairs that better mimic the output characteristics of the base model, ensuring alignment between the VSR model and its upstream generator. Second, we provide critical insights into VSR model behavior through systematic analysis of (1) timestep sampling strategies and (2) noise augmentation effects on low-resolution (LR) inputs. These findings directly inform our architectural and training innovations. Finally, we introduce interleaving temporal unit and sparse local attention to achieve efficient training and inference, drastically reducing computational overhead. Extensive experiments demonstrate the superiority of our framework over existing methods, with ablation studies confirming the efficacy of each design choice. Our work establishes a simple yet effective baseline for cascaded video super-resolution generation, offering practical insights to guide future advancements in efficient cascaded synthesis systems.


[63] Improving Progressive Generation with Decomposable Flow Matching cs.CV | cs.AI

Moayed Haji-Ali, Willi Menapace, Ivan Skorokhodov, Arpit Sahni, Sergey Tulyakov

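TL;DR: The paper proposes Decomposable Flow Matching (DFM), a framework that applies Flow Matching independently at each level of a multi-scale representation, simplifying progressive generation of visual media and improving results for both image and video generation.

Details

Motivation: Existing progressive generation methods usually rely on complex multi-stage architectures, which add overall complexity and require custom diffusion formulations or samplers. DFM aims to be a simple and efficient alternative.

Result: On Imagenet-1k 512px, DFM improves FDD scores by 35.2% over the base architecture and by 26.4% over the best-performing baseline under the same training compute. It also converges faster when fine-tuning large models such as FLUX.

Insight: DFM shows that a simple architecture with minimal modifications can substantially improve progressive generation, offering a new direction for visual generation.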

Abstract: Generating high-dimensional visual modalities is a computationally intensive task. A common solution is progressive generation, where the outputs are synthesized in a coarse-to-fine spectral autoregressive manner. While diffusion models benefit from the coarse-to-fine nature of denoising, explicit multi-stage architectures are rarely adopted. These architectures have increased the complexity of the overall approach, introducing the need for a custom diffusion formulation, decomposition-dependent stage transitions, ad hoc samplers, or a model cascade. Our contribution, Decomposable Flow Matching (DFM), is a simple and effective framework for the progressive generation of visual media. DFM applies Flow Matching independently at each level of a user-defined multi-scale representation (such as a Laplacian pyramid). As shown by our experiments, our approach improves visual quality for both images and videos, featuring superior results compared to prior multistage frameworks. On Imagenet-1k 512px, DFM achieves 35.2% improvements in FDD scores over the base architecture and 26.4% over the best-performing baseline, under the same training compute. When applied to finetuning of large models, such as FLUX, DFM shows faster convergence speed to the training distribution. Crucially, all these advantages are achieved with a single model, architectural simplicity, and minimal modifications to existing training pipelines.
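
The multi-scale representation mentioned in this abstract can be illustrated with a deliberately simplified 1-D Laplacian-style pyramid. The sketch uses pair averaging and nearest-neighbor upsampling in place of the Gaussian filtering of a classical Laplacian pyramid, and all function names are hypothetical. It shows only the decomposition and its exact inversion, not Flow Matching or the DFM training procedure.

```python
def downsample(signal):
    """Halve the resolution by averaging neighboring pairs."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]


def upsample(signal):
    """Double the resolution by repeating each value."""
    return [value for value in signal for _ in range(2)]


def build_pyramid(signal, levels):
    """Return [residual_0, ..., residual_{levels-1}, coarse].

    Each residual holds the detail lost when moving to the next
    coarser scale; the last entry is the coarsest approximation.
    """
    pyramid = []
    current = list(signal)
    for _ in range(levels):
        if len(current) % 2:
            raise ValueError("length must be divisible by 2**levels")
        coarse = downsample(current)
        pyramid.append([a - b for a, b in zip(current, upsample(coarse))])
        current = coarse
    pyramid.append(current)
    return pyramid


def reconstruct(pyramid):
    """Invert build_pyramid by adding residuals from coarse to fine."""
    current = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        current = [a + b for a, b in zip(upsample(current), residual)]
    return current
```

Because the levels sum back to the original signal, a generator that models each level separately can still produce a full-resolution sample by reconstruction; how DFM couples the levels during sampling is not covered by this sketch.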


[64] GenHSI: Controllable Generation of Human-Scene Interaction Videos cs.CV

Zekun Li, Rui Zhou, Rahul Sajnani, Xiaoyan Cong, Daniel Ritchie

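TL;DR: GenHSI is a training-free method for controllable generation of long human-scene interaction (HSI) videos. Its three-stage pipeline (script writing, pre-visualization, animation) addresses the difficulties existing methods have with identity preservation and realistic interaction.

Details

Motivation: Large-scale pre-trained video diffusion models excel at diverse video generation, but generating long, movie-like videos still suffers from unrealistic human-scene interaction, poor identity preservation, and expensive training.

Result: Experiments show that GenHSI effectively preserves scene content and character identity and generates plausible human-scene interaction videos.

Insight: Borrowing the three-stage workflow of movie animation is an effective way to address identity preservation and interaction realism in long video generation, and the training-free design lowers the cost.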

Abstract: Large-scale pre-trained video diffusion models have exhibited remarkable capabilities in diverse video generation. However, existing solutions face several challenges in using these models to generate long movie-like videos with rich human-object interactions, including unrealistic human-scene interaction, lack of subject identity preservation, and the need for expensive training. We propose GenHSI, a training-free method for controllable generation of long human-scene interaction videos (HSI). Taking inspiration from movie animation, our key insight is to overcome the limitations of previous work by subdividing the long video generation task into three stages: (1) script writing, (2) pre-visualization, and (3) animation. Given an image of a scene, a user description, and multiple images of a person, we use these three stages to generate long videos that preserve human identity and provide rich human-scene interactions. Script writing converts complex human tasks into simple atomic tasks that are used in the pre-visualization stage to generate 3D keyframes (storyboards). These 3D keyframes are rendered and animated by off-the-shelf video diffusion models for consistent long video generation with rich contacts in a 3D-aware manner. A key advantage of our work is that we alleviate the need for scanned, accurate scenes and create 3D keyframes from single-view images. We are the first to generate a long video sequence with a consistent camera pose that contains arbitrary numbers of character actions without training. Experiments demonstrate that our method can generate long videos that effectively preserve scene content and character identity with plausible human-scene interaction from a single image scene. Visit our project homepage https://kunkun0w0.github.io/project/GenHSI/ for more information.


[65] A Comparative Study of NAFNet Baselines for Image Restoration cs.CV | cs.LG

Vladislav Esaulov, M. Moein Esfahani

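TL;DR: The paper runs an ablation study of the core components of NAFNet (Nonlinear Activation Free Network) for image restoration, confirming the effectiveness of SimpleGate activation, Simplified Channel Attention (SCA), and layer normalization (LayerNorm).

Details

Motivation: The study aims to validate NAFNet's design by comparing variants of its components and identifying which ones matter most for image restoration performance.

Result: SimpleGate and the simplified attention mechanism outperform conventional activations and attention, and LayerNorm improves training stability.

Insight: The paper's takeaway is that simplified designs such as SimpleGate and SCA outperform more complex structures in both computational efficiency and performance, and that layer normalization is key to stable training.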

Abstract: We study NAFNet (Nonlinear Activation Free Network), a simple and efficient deep learning baseline for image restoration. By using CIFAR10 images corrupted with noise and blur, we conduct an ablation study of NAFNet’s core components. Our baseline model implements SimpleGate activation, Simplified Channel Attention (SCA), and layer normalization (LayerNorm). We compare this baseline to different variants that replace or remove components. Quantitative results (PSNR, SSIM) and examples illustrate how each modification affects restoration performance. Our findings support the NAFNet design: the SimpleGate and simplified attention mechanisms yield better results than conventional activations and attention, while LayerNorm proves to be important for stable training. We conclude with recommendations for model design, discuss potential improvements, and future work.
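
For readers unfamiliar with SimpleGate, the sketch below shows the operation as defined in the original NAFNet paper: the feature map is split into two halves along the channel dimension and the halves are multiplied element-wise, with no nonlinear activation function. The plain-list representation and the function name `simple_gate` are assumptions made to keep the example dependency-free; this is not code from the comparative study summarized here.

```python
def simple_gate(channels):
    """Apply SimpleGate to a feature map given as a list of C channels.

    Each channel is a flat list of values. The first C/2 channels are
    multiplied element-wise with the last C/2 channels, so the output
    has half as many channels as the input.
    """
    if len(channels) % 2:
        raise ValueError("SimpleGate needs an even number of channels")
    half = len(channels) // 2
    first, second = channels[:half], channels[half:]
    return [
        [a * b for a, b in zip(c1, c2)]
        for c1, c2 in zip(first, second)
    ]
```

Since the output has half the input channels, a block using this gate has to account for the reduced width in the layer that follows.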


[66] Unified Vision-Language-Action Model cs.CV | cs.RO

Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li

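TL;DR: UniVLA is a unified multimodal vision-language-action (VLA) model that represents vision, language, and action signals as discrete token sequences, enabling flexible multimodal task learning and achieving state-of-the-art results on several simulation benchmarks.

Details

Motivation: Most existing VLA models rely on the general comprehension abilities of vision-language models to generate action signals and overlook the rich temporal and causal structure in visual observations, so a unified model that better captures these dynamics is needed.

Result: UniVLA achieves state-of-the-art results on CALVIN, LIBERO, and Simplenv-Bridge (e.g., a 95.5% average success rate on LIBERO) and shows broad applicability to real-world ALOHA manipulation and autonomous driving.

Insight: With a unified architecture and world modeling, UniVLA increases the flexibility of multimodal task learning and captures causal dynamics that benefit policy learning for long-horizon tasks.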

Abstract: Vision-language-action models (VLAs) have garnered significant attention for their potential in advancing robotic manipulation. However, previous approaches predominantly rely on the general comprehension capabilities of vision-language models (VLMs) to generate action signals, often overlooking the rich temporal and causal structure embedded in visual observations. In this paper, we present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. This formulation enables flexible multimodal task learning, particularly from large-scale video data. By incorporating world modeling during post-training, UniVLA captures causal dynamics from videos, facilitating effective transfer to downstream policy learning, especially for long-horizon tasks. Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and Simplenv-Bridge, significantly surpassing previous methods. For example, UniVLA achieves a 95.5% average success rate on the LIBERO benchmark, surpassing pi0-FAST’s 85.5%. We further demonstrate its broad applicability on real-world ALOHA manipulation and autonomous driving.


[67] AnimaX: Animating the Inanimate in 3D with Joint Video-Pose Diffusion Models cs.CV

Zehuan Huang, Haoran Feng, Yangtian Sun, Yuanchen Guo, Yanpei Cao

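TL;DR: AnimaX is a 3D animation framework built on joint video-pose diffusion models. It combines the motion priors of video diffusion models with the controllable structure of skeleton-based animation for efficient 3D animation generation.

Details

Motivation: Traditional motion synthesis methods are either restricted to fixed skeletal topologies or require costly optimization in high-dimensional deformation spaces. AnimaX aims for more flexible and efficient 3D animation by combining video diffusion models with skeleton-based animation.

Result: On the VBench benchmark, AnimaX reaches state-of-the-art results in generalization, motion fidelity, and efficiency.

Insight: By combining the motion priors of video diffusion models with the controllability of skeleton-based animation, AnimaX offers a scalable solution for category-agnostic 3D animation.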

Abstract: We present AnimaX, a feed-forward 3D animation framework that bridges the motion priors of video diffusion models with the controllable structure of skeleton-based animation. Traditional motion synthesis methods are either restricted to fixed skeletal topologies or require costly optimization in high-dimensional deformation spaces. In contrast, AnimaX effectively transfers video-based motion knowledge to the 3D domain, supporting diverse articulated meshes with arbitrary skeletons. Our method represents 3D motion as multi-view, multi-frame 2D pose maps, and enables joint video-pose diffusion conditioned on template renderings and a textual motion prompt. We introduce shared positional encodings and modality-aware embeddings to ensure spatial-temporal alignment between video and pose sequences, effectively transferring video priors to the motion generation task. The resulting multi-view pose sequences are triangulated into 3D joint positions and converted into mesh animation via inverse kinematics. Trained on a newly curated dataset of 160,000 rigged sequences, AnimaX achieves state-of-the-art results on VBench in generalization, motion fidelity, and efficiency, offering a scalable solution for category-agnostic 3D animation. Project page: https://anima-x.github.io/.


[68] Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation cs.CV | cs.AI | cs.LG

Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang

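TL;DR: The paper proposes Radial Attention, a sparse attention mechanism that exploits spatiotemporal energy decay to greatly reduce the computational cost of long video generation while maintaining video quality.

Details

Motivation: Video diffusion models have advanced high-quality video generation, but the added temporal dimension sharply increases computational cost and limits long video generation. The authors observe that attention scores decay as spatial and temporal distance grows, which motivates a more efficient attention mechanism.

Result: While maintaining video quality across several models (Wan2.1-14B, HunyuanVideo, and Mochi 1), it achieves up to a 1.9x speedup, up to 4x longer generation, up to 4.4x lower training cost than direct fine-tuning, and up to 3.7x faster inference than dense attention.

Insight: Spatiotemporal energy decay is a natural basis for designing efficient attention, and sparse attention can cut computation substantially while preserving performance.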

Abstract: Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as spatial and temporal distance between tokens increases, akin to the physical decay of signal or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with $O(n \log n)$ complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard $O(n^2)$ dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9$\times$ speedup over the original dense attention. With minimal tuning, it enables video generation up to 4$\times$ longer while reducing training costs by up to 4.4$\times$ compared to direct fine-tuning and accelerating inference by up to 3.7$\times$ compared to dense attention inference.
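
The static mask described in this abstract, where the spatial window shrinks as temporal distance grows, might be sketched as follows. The halving rule in `window_size`, the 1-D spatial layout, and both function names are assumptions chosen for illustration; the paper's actual mask construction may differ.

```python
def window_size(frame_gap: int, base_window: int) -> int:
    """Hypothetical decay rule: the spatial window halves each time the
    temporal distance (in frames) reaches the next power of two."""
    if frame_gap == 0:
        return base_window
    return base_window >> frame_gap.bit_length()


def count_attended_pairs(num_frames: int, tokens_per_frame: int,
                         base_window: int) -> int:
    """Count (query, key) pairs kept by the static mask.

    A query at (frame fq, position pq) attends to a key at (fk, pk)
    only if their spatial offset is smaller than the window allowed
    for their temporal distance.
    """
    total = 0
    for fq in range(num_frames):
        for fk in range(num_frames):
            window = window_size(abs(fq - fk), base_window)
            for pq in range(tokens_per_frame):
                for pk in range(tokens_per_frame):
                    if abs(pq - pk) < window:
                        total += 1
    return total
```

Comparing `count_attended_pairs(num_frames, tokens_per_frame, base_window)` with the dense count `(num_frames * tokens_per_frame) ** 2` gives a feel for how much computation such a mask skips as the number of frames grows.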


cs.CL

[69] MemeMind: A Large-Scale Multimodal Dataset with Chain-of-Thought Reasoning for Harmful Meme Detection cs.CL | cs.AI | cs.CV

Hexiang Gu, Qifan Yu, Saihui Hou, Zhiqin Fang, Huijia Wu

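TL;DR: The paper introduces the MemeMind dataset and the MemeGuard framework for harmful meme detection, improving model performance through Chain-of-Thought annotations and multimodal modeling.

Details

Motivation: The rapid growth of social media has intensified the spread of harmful content, and the lack of a systematic, explainable dataset has held back progress in harmful meme detection.

Result: In experiments on the MemeMind dataset, MemeGuard consistently outperforms existing state-of-the-art methods.

Insight: Chain-of-Thought annotations and multimodal modeling are valuable for harmful meme detection.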

Abstract: The rapid development of social media has intensified the spread of harmful content. Harmful memes, which integrate both images and text, pose significant challenges for automated detection due to their implicit semantics and complex multimodal interactions. Although existing research has made progress in detection accuracy and interpretability, the lack of a systematic, large-scale, diverse, and highly explainable dataset continues to hinder further advancement in this field. To address this gap, we introduce MemeMind, a novel dataset featuring scientifically rigorous standards, large scale, diversity, bilingual support (Chinese and English), and detailed Chain-of-Thought (CoT) annotations. MemeMind fills critical gaps in current datasets by offering comprehensive labeling and explicit reasoning traces, thereby providing a solid foundation for enhancing harmful meme detection. In addition, we propose an innovative detection framework, MemeGuard, which effectively integrates multimodal information with reasoning process modeling, significantly improving models’ ability to understand and identify harmful memes. Extensive experiments conducted on the MemeMind dataset demonstrate that MemeGuard consistently outperforms existing state-of-the-art methods in harmful meme detection tasks.


[70] Mirage of Mastery: Memorization Tricks LLMs into Artificially Inflated Self-Knowledge cs.CL

Sahil Kale, Vijaykant Nadadur

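TL;DR: The study shows that large language models (LLMs) mistake memorization for reasoning ability, which makes them overconfident in self-knowledge assessments, particularly in STEM domains.

Details

Motivation: Current research treats memorization and self-knowledge deficits as separate problems and overlooks the link between them, which undermines the trustworthiness of LLM responses. The study examines whether LLMs genuinely learn reasoning patterns or merely memorize solutions from training data.

Result: LLMs show more than 45% inconsistency in their feasibility assessments, most pronounced in science and medicine. This suggests that LLMs lean heavily on memorized solutions and therefore overestimate their reasoning ability.

Insight: The study exposes how LLMs conflate memorization with reasoning, highlights flaws in current architectures and training patterns, and stresses the need for techniques that give models a balanced, consistent view of their own knowledge.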

Abstract: When artificial intelligence mistakes memorization for intelligence, it creates a dangerous mirage of reasoning. Existing studies treat memorization and self-knowledge deficits in LLMs as separate issues and do not recognize an intertwining link that degrades the trustworthiness of LLM responses. In our study, we utilize a novel framework to ascertain if LLMs genuinely learn reasoning patterns from training data or merely memorize them to assume competence across problems of similar complexity focused on STEM domains. Our analysis shows a noteworthy problem in generalization: LLMs draw confidence from memorized solutions to infer a higher self-knowledge about their reasoning ability, which manifests as an over 45% inconsistency in feasibility assessments when faced with self-validated, logically coherent task perturbations. This effect is most pronounced in science and medicine domains, which tend to have maximal standardized jargon and problems, further confirming our approach. Significant wavering within the self-knowledge of LLMs also shows flaws in current architectures and training patterns, highlighting the need for techniques that ensure a balanced, consistent stance on models’ perceptions of their own knowledge for maximum AI explainability and trustworthiness. Our code and results are available publicly at https://github.com/knowledge-verse-ai/LLM-Memorization_SK_Eval-.


[71] Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations cs.CL

Brian Siyuan Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi

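TL;DR: Language models (LMs) are surprisingly robust to non-canonical tokenizations, even ones never seen during training. Instruction-tuned models retain up to 93.4% of their performance, and on some tasks non-canonical tokenization improves results (for example, character-level tokenization gives up to +14% on string manipulation and code tasks). The robustness arises mainly in the instruction-tuning phase.

Details

Motivation: Modern tokenizers use deterministic algorithms to map text to a single “canonical” token sequence, yet the same string can be encoded as many different non-canonical sequences drawn from the tokenizer's vocabulary. The paper studies how robust LMs are to such tokenizations.

Result: 1. Instruction-tuned models retain up to 93.4% of their performance under randomly sampled tokenization and 90.8% under character-level tokenization; 2. Character-level tokenization improves string manipulation and code tasks by up to 14%, and right-aligned digit grouping improves large-number arithmetic by 33%.

Insight: 1. Models are less tied to their tokenizer than previously believed; 2. Both base and post-trained models grasp the semantics of non-canonical tokenizations, but base models imitate the perceived misspellings and degenerate into nonsensical output, whereas post-trained models stay committed to fluent responses; 3. Intervening on tokenization at inference time can boost performance on specific tasks.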

Abstract: Modern tokenizers employ deterministic algorithms to map text into a single “canonical” token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer vocabulary. In this work, we investigate the robustness of LMs to text encoded with non-canonical tokenizations entirely unseen during training. Surprisingly, when evaluated across 20 benchmarks, we find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization, and 90.8% with character-level tokenization. We see that overall stronger models tend to be more robust, and robustness diminishes as the tokenization departs farther from the canonical form. Motivated by these results, we then identify settings where non-canonical tokenization schemes can improve performance, finding that character-level segmentation improves string manipulation and code understanding tasks by up to +14%, and right-aligned digit grouping enhances large-number arithmetic by +33%. Finally, we investigate the source of this robustness, finding that it arises in the instruction-tuning phase. We show that while both base and post-trained models grasp the semantics of non-canonical tokenizations (perceiving them as containing misspellings), base models try to mimic the imagined mistakes and degenerate into nonsensical output, while post-trained models are committed to fluent responses. Overall, our findings suggest that models are less tied to their tokenizer than previously believed, and demonstrate the promise of intervening on tokenization at inference time to boost performance.
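
The two segmentation schemes named in the abstract can be illustrated at the string level. This is a sketch under stated assumptions: the helper names `char_level` and `right_aligned_groups` are hypothetical, the group size of 3 is an assumption the abstract does not state, and a real experiment would still need to map each piece to vocabulary token IDs.

```python
def char_level(text: str) -> list[str]:
    """Split text into single characters (character-level segmentation)."""
    return list(text)

def right_aligned_groups(digits: str, size: int = 3) -> list[str]:
    """Group digits from the right, so any short group falls at the front.

    The group size of 3 is an assumption made for this illustration.
    """
    first = len(digits) % size                    # length of the leading short group
    groups = [digits[:first]] if first else []
    groups += [digits[i:i + size] for i in range(first, len(digits), size)]
    return groups

print(char_level("token"))              # ['t', 'o', 'k', 'e', 'n']
print(right_aligned_groups("1234567"))  # ['1', '234', '567']
```

Grouping from the right keeps digits of the same place value in the same position within a group, which is one plausible reason the scheme would help with large-number arithmetic.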


[72] Quantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective cs.CL | cs.AI | cs.CY | 68T50 | I.2.7

Weijie Xu, Yiwen Wang, Chi Xue, Xiangkun Hu, Xi Fang

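TL;DR: Proposes FiSCo, a framework that evaluates fairness in large language models (LLMs) by using semantic and statistical methods to detect subtle biases in long-form text.

Details

Motivation: Existing methods struggle to capture biases in long-form responses and the intrinsic variability of LLM outputs; FiSCo aims to address both.

Result: Experiments show that FiSCo identifies subtle biases more reliably while reducing the impact of stochastic LLM variability.

Insight: By going beyond token-level analysis and assessing fairness through semantic consistency, FiSCo offers a new perspective on LLM bias detection.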

Abstract: Large Language Models (LLMs) often generate responses with inherent biases, undermining their reliability in real-world applications. Existing evaluation methods often overlook biases in long-form responses and the intrinsic variability of LLM outputs. To address these challenges, we propose FiSCo (Fine-grained Semantic Computation), a novel statistical framework to evaluate group-level fairness in LLMs by detecting subtle semantic differences in long-form responses across demographic groups. Unlike prior work focusing on sentiment or token-level comparisons, FiSCo goes beyond surface-level analysis by operating at the claim level, leveraging entailment checks to assess the consistency of meaning across responses. We decompose model outputs into semantically distinct claims and apply statistical hypothesis testing to compare inter- and intra-group similarities, enabling robust detection of subtle biases. We formalize a new group counterfactual fairness definition and validate FiSCo on both synthetic and human-annotated datasets spanning gender, race, and age. Experiments show that FiSCo more reliably identifies nuanced biases while reducing the impact of stochastic LLM variability, outperforming various evaluation metrics.
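
To make the idea of comparing inter- and intra-group similarities concrete, here is a minimal sketch that uses a one-sided permutation test. The abstract does not say which hypothesis test FiSCo applies, so the permutation test is a swapped-in stand-in, and the function name and the similarity scores are hypothetical.

```python
import numpy as np

def permutation_test(intra, inter, num_perm: int = 2000, seed: int = 0):
    """One-sided permutation test: is mean(intra) larger than mean(inter)?

    `intra` holds similarity scores between responses within one group,
    `inter` holds scores between responses across groups (both hypothetical).
    """
    rng = np.random.default_rng(seed)
    intra = np.asarray(intra, dtype=float)
    inter = np.asarray(inter, dtype=float)
    observed = intra.mean() - inter.mean()
    pooled = np.concatenate([intra, inter])
    count = 0
    for _ in range(num_perm):
        rng.shuffle(pooled)                        # random relabelling of the scores
        diff = pooled[:len(intra)].mean() - pooled[len(intra):].mean()
        if diff >= observed:
            count += 1
    p_value = (count + 1) / (num_perm + 1)         # add-one correction avoids p = 0
    return observed, p_value

# Hypothetical similarity scores, for illustration only.
intra_scores = [0.90, 0.88, 0.92, 0.91, 0.87, 0.90]
inter_scores = [0.60, 0.65, 0.62, 0.58, 0.61, 0.63]
gap, p = permutation_test(intra_scores, inter_scores)
print(f"mean gap = {gap:.3f}, p = {p:.4f}")
```

A small p-value here would suggest that responses differ more across groups than within a group, beyond what sampling noise alone explains.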


[73] NLPnorth @ TalentCLEF 2025: Comparing Discriminative, Contrastive, and Prompt-Based Methods for Job Title and Skill Matching cs.CL

Mike Zhang, Rob van der Goot

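TL;DR: Compares classification-based, contrastive, and prompting methods on job title matching and skill prediction; prompting performs best for job title matching, while classification performs best for skill prediction.

Details

Motivation: Job title matching and skill prediction matter in the computational job market domain because they improve tasks such as automatic candidate matching, career path prediction, and job market analysis.

Result: Prompting performs best on job title matching (Task A, MAP: 0.492), and the classification-based approach performs best on skill prediction (Task B, MAP: 0.290).

Insight: The largest multilingual language models perform best on both tasks; prompting suits job title matching better, whereas classification suits skill prediction better.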

Abstract: Matching job titles is a highly relevant task in the computational job market domain, as it improves, for example, automatic candidate matching, career path prediction, and job market analysis. Furthermore, aligning job titles to job skills can be considered an extension to this task, with similar relevance for the same downstream tasks. In this report, we outline NLPnorth’s submission to TalentCLEF 2025, which includes both of these tasks: Multilingual Job Title Matching, and Job Title-Based Skill Prediction. For both tasks we compare (fine-tuned) classification-based, (fine-tuned) contrastive-based, and prompting methods. We observe that for Task A, our prompting approach performs best with an average of 0.492 mean average precision (MAP) on test data, averaged over English, Spanish, and German. For Task B, we obtain an MAP of 0.290 on test data with our fine-tuned classification-based approach. Additionally, we made use of extra data by pulling all the language-specific titles and corresponding descriptions from ESCO for each job and skill. Overall, we find that the largest multilingual language models perform best for both tasks. Per the provisional results and only counting the unique teams, the ranking on Task A is 5th/20 and for Task B 3rd/14.
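
For readers unfamiliar with the metric, the sketch below computes mean average precision (MAP) using the standard textbook definition. It assumes each ranked list contains no duplicates; the shared task's official scorer may differ in details, and the example queries are made up.

```python
def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """Average of precision@k over the ranks k at which a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank        # precision at this rank
    return total / len(relevant)        # relevant items never retrieved count as 0

def mean_average_precision(runs: list[tuple[list[str], set[str]]]) -> float:
    """Mean of average precision over (ranking, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Made-up example: two queries with their ranked candidates and gold matches.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),   # AP = (1/1 + 2/3) / 2
    (["x", "y"], {"y"}),                  # AP = 1/2
]
print(round(mean_average_precision(runs), 4))  # 0.6667
```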


[74] MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Hate Speech Multi-hop Explanation cs.CL

Jackson Trager, Francielle Vargas, Diego Alves, Matteo Guida, Mikel K. Ngueajio

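TL;DR: MFTCXplain is a multilingual benchmark dataset that evaluates the moral reasoning of LLMs through multi-hop explanation of hate speech, and it reveals the limits of LLMs' moral reasoning.

Details

Motivation: Current benchmarks for LLM moral reasoning have two major shortcomings: they lack annotations that justify moral classifications, and they focus predominantly on English, which limits evaluation across diverse cultural settings.

Result: LLMs perform well on hate speech detection (F1 up to 0.836) but poorly on moral sentiment prediction (F1 < 0.35), and rationale alignment is limited, mainly in underrepresented languages.

Insight: Current LLMs have limited capacity to internalize and reflect human moral reasoning, particularly in multilingual and cross-cultural settings.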

Abstract: Ensuring the moral reasoning capabilities of Large Language Models (LLMs) is a growing concern as these systems are used in socially sensitive tasks. Nevertheless, current evaluation benchmarks present two major shortcomings: a lack of annotations that justify moral classifications, which limits transparency and interpretability; and a predominant focus on English, which constrains the assessment of moral reasoning across diverse cultural settings. In this paper, we introduce MFTCXplain, a multilingual benchmark dataset for evaluating the moral reasoning of LLMs via hate speech multi-hop explanation using Moral Foundation Theory (MFT). The dataset comprises 3,000 tweets across Portuguese, Italian, Persian, and English, annotated with binary hate speech labels, moral categories, and text span-level rationales. Empirical results highlight a misalignment between LLM outputs and human annotations in moral reasoning tasks. While LLMs perform well in hate speech detection (F1 up to 0.836), their ability to predict moral sentiments is notably weak (F1 < 0.35). Furthermore, rationale alignment remains limited mainly in underrepresented languages. These findings show the limited capacity of current LLMs to internalize and reflect human moral reasoning.


[75] Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting cs.CL | cs.AI

Nathaniel Getachew, Abulhair Saparov

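TL;DR: Proposes $\texttt{StorySim}$, a framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Models perform better on WM tasks than on ToM tasks, and they reason better about humans than about inanimate objects. The study also finds evidence of heuristic behavior such as recency bias and over-reliance on earlier events.

Details

Motivation: Existing benchmarks may be compromised by contamination in pretraining data. A controllable framework is needed to evaluate the ToM and WM capabilities of LLMs precisely.

Result: 1. Most models perform better on WM tasks than on ToM tasks; 2. Models reason better about humans than about inanimate objects; 3. Models show recency bias and over-reliance on earlier events.

Insight: 1. The ToM capabilities of current LLMs remain limited; 2. The framework's controllability helps expose model limitations; 3. Heuristic behavior may be an underlying flaw in model reasoning.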

Abstract: We introduce $\texttt{StorySim}$, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Unlike prior benchmarks that may suffer from contamination in pretraining data, $\texttt{StorySim}$ produces novel, compositional story prompts anchored by a highly controllable $\texttt{Storyboard}$, enabling precise manipulation of character perspectives and events. We use this framework to design first- and second-order ToM tasks alongside WM tasks that control for the ability to track and model mental states. Our experiments across a suite of state-of-the-art LLMs reveal that most models perform better on WM tasks than on ToM tasks, and that models tend to perform better when reasoning about humans than about inanimate objects. Additionally, our framework enabled us to find evidence of heuristic behavior such as recency bias and an over-reliance on earlier events in the story. All code for generating data and evaluations is freely available.


[76] Human-Aligned Faithfulness in Toxicity Explanations of LLMs cs.CL

Ramaravind K. Mothilal, Joanna Roy, Syed Ishtiaque Ahmed, Shion Guha

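TL;DR: Proposes Human-Aligned Faithfulness (HAF), a new criterion for assessing the toxicity explanations generated by large language models (LLMs), with six metrics that quantify how closely those explanations align with the explanations of a rational human.

Details

Motivation: Existing explainability methods over-rely on input text perturbations, so they cannot directly evaluate the free-form toxicity explanations that LLMs produce. The paper aims to fill this gap and improve the trustworthiness of LLMs in downstream tasks.

Result: LLMs generate plausible explanations for simple prompts, but their reasoning breaks down on prompts about nuanced relations among reasons and stances, producing inconsistent and nonsensical responses.

Insight: LLM performance on toxicity explanation depends strongly on prompt complexity, and prompt design is critical to their reasoning.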

Abstract: The discourse around toxicity and LLMs in NLP largely revolves around detection tasks. This work shifts the focus to evaluating LLMs’ reasoning about toxicity, as expressed in the explanations that justify a stance, to enhance their trustworthiness in downstream tasks. Despite extensive research on explainability, it is not straightforward to adopt existing methods to evaluate free-form toxicity explanations due to their over-reliance on input text perturbations, among other challenges. To account for these, we propose a novel, theoretically grounded multi-dimensional criterion, Human-Aligned Faithfulness (HAF), that measures the extent to which LLMs’ free-form toxicity explanations align with those of a rational human under ideal conditions. We develop six metrics, based on uncertainty quantification, to comprehensively evaluate HAF of LLMs’ toxicity explanations with no human involvement, and highlight how “non-ideal” the explanations are. We conduct several experiments on three Llama models (of size up to 70B) and an 8B Ministral model on five diverse toxicity datasets. Our results show that while LLMs generate plausible explanations to simple prompts, their reasoning about toxicity breaks down when prompted about the nuanced relations between the complete set of reasons, the individual reasons, and their toxicity stances, resulting in inconsistent and nonsensical responses. We open-source our code and LLM-generated explanations at https://github.com/uofthcdslab/HAF.


[77] Augmenting Multi-Agent Communication with State Delta Trajectory cs.CL

Yichen Tang, Weihang Su, Yujia Zhou, Yiqun Liu, Min Zhang

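TL;DR: Proposes a new multi-agent communication protocol, State Delta Trajectory, which transfers both natural language tokens and token-wise state transition trajectories to reduce information loss, improving multi-agent system performance, especially on complex reasoning tasks.

Details

Motivation: Existing multi-agent systems based on large language models (LLMs) communicate mainly in natural language, which is simple and interpretable but loses information, particularly when conveying reasoning logic or abstract thoughts.

Result: Multi-agent systems using State Delta Encoding (SDE) reach state-of-the-art (SOTA) performance on complex reasoning tasks.

Insight: Sequences of state changes, rather than static state values, better capture the dynamics of the reasoning process, suggesting a new direction for improving multi-agent communication.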

Abstract: Multi-agent techniques such as role playing or multi-turn debates have been shown to be effective in improving the performance of large language models (LLMs) in downstream tasks. Despite their differences in workflows, existing LLM-based multi-agent systems mostly use natural language for agent communication. While this is appealing for its simplicity and interpretability, it also introduces inevitable information loss, as one model must downsample its continuous state vectors to concrete tokens before transferring them to the other model. Such losses are particularly significant when the information to transfer is not simple facts, but reasoning logic or abstract thoughts. To tackle this problem, we propose a new communication protocol that transfers both natural language tokens and the token-wise state transition trajectory from one agent to another. In particular, compared to the actual state value, we find that the sequence of state changes in LLMs after generating each token can better reflect the information hidden behind the inference process, so we propose a State Delta Encoding (SDE) method to represent state transition trajectories. The experimental results show that multi-agent systems with SDE achieve SOTA performance compared to other communication protocols, particularly in tasks that involve complex reasoning. This shows the potential of communication augmentation for LLM-based multi-agent systems.
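
The core idea of encoding a trajectory by its state changes rather than its state values can be sketched in a few lines. This is an illustration under assumptions: the abstract does not specify which hidden states are used or how the deltas are passed to the receiving agent, and the names `state_deltas` and `reconstruct` are hypothetical.

```python
import numpy as np

def state_deltas(states) -> np.ndarray:
    """Per-token state changes: delta[t] = state[t + 1] - state[t].

    `states` has shape (num_steps + 1, hidden_dim): one state before
    generation starts and one after each generated token (an assumption).
    """
    states = np.asarray(states, dtype=float)
    return states[1:] - states[:-1]

def reconstruct(initial, deltas) -> np.ndarray:
    """Recover the states after each token from the initial state and the deltas."""
    return np.asarray(initial, dtype=float) + np.cumsum(deltas, axis=0)

# Toy 2-dimensional "hidden states" for three steps.
states = np.array([[0.0, 0.0], [1.0, 2.0], [1.5, 1.0]])
deltas = state_deltas(states)
print(deltas)                          # rows: [1, 2] and [0.5, -1]
print(reconstruct(states[0], deltas))  # matches states[1:]
```

The deltas carry the same information as the states given the starting point, but they foreground what changed at each token, which is the property the abstract highlights.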


[78] Personality Prediction from Life Stories using Language Models cs.CL | cs.LG

Rasiq Hussain, Jerry Ma, Rithik Khandelwal, Joshua Oltmanns, Mehak Gupta

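TL;DR: Proposes a two-step method that combines pretrained language models with attention mechanisms to predict Big Five personality traits from long life stories, outperforming existing long-context models.

Details

Motivation: Traditional personality assessment relies on questionnaires, which lack richness and openness. NLP can exploit long narrative text to offer a more natural form of assessment.

Result: Ablation studies and comparisons with models such as LLaMA and Longformer show improvements in accuracy, efficiency, and interpretability.

Insight: Combining language-based features with long-context modeling extracts personality traits from narrative text more effectively and advances language-based personality assessment.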

Abstract: Natural Language Processing (NLP) offers new avenues for personality assessment by leveraging rich, open-ended text, moving beyond traditional questionnaires. In this study, we address the challenge of modeling long narrative interviews, each exceeding 2000 tokens, to predict Five-Factor Model (FFM) personality traits. We propose a two-step approach: first, we extract contextual embeddings using sliding-window fine-tuning of pretrained language models; then, we apply Recurrent Neural Networks (RNNs) with attention mechanisms to integrate long-range dependencies and enhance interpretability. This hybrid method effectively bridges the strengths of pretrained transformers and sequence modeling to handle long-context data. Through ablation studies and comparisons with state-of-the-art long-context models such as LLaMA and Longformer, we demonstrate improvements in prediction accuracy, efficiency, and interpretability. Our results highlight the potential of combining language-based features with long-context modeling to advance personality assessment from life narratives.
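
The first step of the two-step approach depends on cutting a long interview into overlapping windows that fit a pretrained model's input limit. The sketch below shows one way to do that chunking; the function name, window size, and stride are assumptions for illustration, since the abstract does not give these values.

```python
def sliding_windows(tokens: list, window: int, stride: int) -> list[list]:
    """Split a token sequence into windows of at most `window` tokens.

    Consecutive windows start `stride` tokens apart, so they overlap
    whenever stride < window. The last window always reaches the end.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):   # this window already covers the tail
            break
        start += stride
    return chunks

tokens = list(range(10))                    # stand-in for token IDs
for chunk in sliding_windows(tokens, window=4, stride=2):
    print(chunk)
```

Each window would then be embedded separately, and the resulting sequence of embeddings passed to the recurrent model with attention described in the abstract. A stride larger than the window would leave gaps, so overlapping settings are the usual choice.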


[79] What Matters in LLM-generated Data: Diversity and Its Effect on Model Fine-Tuning cs.CL | cs.LG

Yuchang Zhu, Zhonghua zhen, Qunshu Lin, Haotong Wei, Xiaolong Sun

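TL;DR: Studies how the diversity of LLM-generated data affects downstream model performance, finding that moderately diverse data improves performance while highly diverse data can hurt it.

Details

Motivation: As the generative abilities of LLMs improve, training downstream models on LLM-generated data has become a way to ease data scarcity and reduce annotation time. Iterative training on self-generated data can degrade performance (model collapse), yet existing work often neglects the importance of data diversity.

Result: With minimal distribution shift, moderately diverse LLM-generated data improves model performance (especially when labeled data is scarce), whereas highly diverse data harms it.

Insight: Data diversity is a key factor in model performance, but it requires balance: moderately diverse generated data is more effective, and pushing diversity too far can backfire.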

Abstract: With the remarkable generative capabilities of large language models (LLMs), using LLM-generated data to train downstream models has emerged as a promising approach to mitigate data scarcity in specific domains and reduce time-consuming annotations. However, recent studies have highlighted a critical issue: iterative training on self-generated data results in model collapse, where model performance degrades over time. Despite extensive research on the implications of LLM-generated data, these works often neglect the importance of data diversity, a key factor in data quality. In this work, we aim to understand the implications of the diversity of LLM-generated data on downstream model performance. Specifically, we explore how varying levels of diversity in LLM-generated data affect downstream model performance. Additionally, we investigate the performance of models trained on data that mixes different proportions of LLM-generated data, which we refer to as synthetic data. Our experimental results show that, with minimal distribution shift, moderately diverse LLM-generated data can enhance model performance in scenarios with insufficient labeled data, whereas highly diverse generated data has a negative impact. We hope our empirical findings will offer valuable guidance for future studies on LLMs as data generators.


[80] EmoStage: A Framework for Accurate Empathetic Response Generation via Perspective-Taking and Phase Recognition cs.CL | cs.AI

Zhiyang Qi, Keiko Takamizo, Mariko Ukiyo, Michimasa Inaba

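TL;DR: EmoStage is a framework that improves empathetic response generation through perspective-taking and phase recognition, addressing the weaknesses of current AI counseling systems in understanding psychological states and recognizing counseling stages.

Details

Motivation: Rising demand for mental health care has driven the development of AI counseling systems, but current approaches struggle to understand users' psychological states, to recognize counseling stages, and to work without high-quality training data.

Result: In Japanese and Chinese counseling settings, EmoStage improves the response quality of base models and performs competitively with data-driven methods.

Insight: By relying on the inference capabilities of LLMs without additional training data, EmoStage generates more accurate empathetic responses while avoiding privacy concerns and data dependence.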

Abstract: The rising demand for mental health care has fueled interest in AI-driven counseling systems. While large language models (LLMs) offer significant potential, current approaches face challenges, including limited understanding of clients’ psychological states and counseling stages, reliance on high-quality training data, and privacy concerns associated with commercial deployment. To address these issues, we propose EmoStage, a framework that enhances empathetic response generation by leveraging the inference capabilities of open-source LLMs without additional training data. Our framework introduces perspective-taking to infer clients’ psychological states and support needs, enabling the generation of emotionally resonant responses. In addition, phase recognition is incorporated to ensure alignment with the counseling process and to prevent contextually inappropriate or inopportune responses. Experiments conducted in both Japanese and Chinese counseling settings demonstrate that EmoStage improves the quality of responses generated by base models and performs competitively with data-driven methods.


[81] JCAPT: A Joint Modeling Approach for CAPT cs.CL | cs.AI | eess.AS

Tzu-Hsuan Yang, Yue-Yang He, Berlin Chen

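TL;DR: Proposes JCAPT, a joint modeling approach that combines automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD), using Mamba (a selective state space model) and phonological features to improve the performance and interpretability of CAPT systems.

Details

Motivation: Pronunciation feedback is critical in second language learning. In existing computer-assisted pronunciation training (CAPT) systems, APA and MDD are usually handled separately, although joint modeling can bring mutual benefits.

Result: On the speechocean762 benchmark, the model outperforms prior methods, particularly on the MDD task.

Insight: Joint modeling can substantially improve CAPT performance, and combining phonological features with a state space model gives pronunciation training finer-grained temporal reasoning.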

Abstract: Effective pronunciation feedback is critical in second language (L2) learning, for which computer-assisted pronunciation training (CAPT) systems often encompass two key tasks: automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD). Recent work has shown that joint modeling of these two tasks can yield mutual benefits. Our unified framework leverages Mamba, a selective state space model (SSM), while integrating phonological features and think token strategies to jointly enhance interpretability and fine-grained temporal reasoning in APA and MDD. To our knowledge, this is the first study to combine phonological attribution, SSM-based modeling, and prompting in CAPT. A series of experiments conducted on the speechocean762 benchmark demonstrate that our model consistently outperforms prior methods, particularly on the MDD task.


[82] Learning to Disentangle Latent Reasoning Rules with Language VAEs: A Systematic Study cs.CL

Yingji Zhang, Marco Valentino, Danilo S. Carvalho, André Freitas

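TL;DR: Proposes explicitly embedding reasoning rules in Transformer language models through language variational autoencoders (VAEs) to improve generalization, interpretability, and controllability.

Details

Motivation: Transformer-based language models perform well on natural language inference (NLI) tasks but often rely on memorization rather than rule-based inference. To represent reasoning explicitly, the study explores ways to embed reasoning rules in language models.

Result: Reasoning rules form distinct clusters in the encoder's output feature space, and FFN layers preserve the separation of rules better than attention layers. In mathematical reasoning tasks, increasing the sample count beyond a certain threshold no longer improves performance.

Insight: Explicitly embedding and disentangling reasoning rules is an effective way to strengthen reasoning in language models, and the stronger role of FFN layers in rule separation points to a new direction for model optimization.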

Abstract: Incorporating explicit reasoning rules within the latent space of language models (LMs) offers a promising pathway to enhance generalisation, interpretability, and controllability. While current Transformer-based language models have shown strong performance on Natural Language Inference (NLI) tasks, they often rely on memorisation rather than rule-based inference. This work investigates how reasoning rules can be explicitly embedded and memorised within the LMs through Language Variational Autoencoders (VAEs). We propose a complete pipeline for learning reasoning rules within Transformer-based language VAEs. This pipeline encompasses three rule-based reasoning tasks, a supporting theoretical framework, and a practical end-to-end architecture. The experiments illustrate the following findings. Disentangled reasoning: under explicit signal supervision, reasoning rules, viewed as functional mappings, can be disentangled within the encoder’s parametric space. This separation results in distinct clustering of rules in the output feature space. Prior knowledge injection: injecting reasoning information into the Query enables the model to more effectively retrieve the stored Value from memory based on the Key. This approach offers a simple method for integrating prior knowledge into decoder-only language models. Performance bottleneck: in mathematical reasoning tasks using Qwen2.5 (0.5B), increasing the sample count does not improve performance beyond a certain point. Moreover, FFN layers are better than attention layers at preserving the separation of reasoning rules in the model’s parameters.


[83] Can Large Language Models Capture Human Annotator Disagreements? cs.CL | cs.AI

Jingwei Ni, Yu Fan, Vilém Zouhar, Donya Rooein, Alexander Hoyle

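TL;DR: Examines whether large language models (LLMs) can capture disagreements among human annotators, finding that LLMs predict disagreement poorly and that RLVR-style reasoning makes performance worse.

Details

Motivation: Human annotation disagreement reflects task subjectivity and sample ambiguity, yet evaluations of LLM automatic annotation usually focus only on the majority label and ignore the information carried by disagreements.

Result: LLMs struggle to predict disagreement, and RLVR-style reasoning degrades performance in disagreement prediction.

Insight: LLM annotators need better disagreement modeling, and evaluation should not rely solely on majority labels.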

Abstract: Human annotation variation (i.e., annotation disagreements) is common in NLP and often reflects important information such as task subjectivity and sample ambiguity. While Large Language Models (LLMs) are increasingly used for automatic annotation to reduce human effort, their evaluation often focuses on predicting the majority-voted “ground truth” labels. It is still unclear, however, whether these models also capture informative human annotation variation. Our work addresses this gap by extensively evaluating LLMs’ ability to predict annotation disagreements without access to repeated human labels. Our results show that LLMs struggle with modeling disagreements, which can be overlooked by majority label-based evaluations. Notably, while RLVR-style (Reinforcement learning with verifiable rewards) reasoning generally boosts LLM performance, it degrades performance in disagreement prediction. Our findings highlight the critical need for evaluating and improving LLM annotators in disagreement modeling. Code and data at https://github.com/EdisonNi-hku/Disagreement_Prediction.
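
To show what a disagreement signal can look like compared with a majority label, the sketch below computes the Shannon entropy of the labels that annotators gave one item. Entropy is an assumed stand-in; the abstract does not state which disagreement measure the paper predicts, and the labels are made up.

```python
import math
from collections import Counter

def majority_label(labels: list[str]) -> str:
    """Most frequent label (ties broken by first occurrence)."""
    return Counter(labels).most_common(1)[0][0]

def label_entropy(labels: list[str]) -> float:
    """Shannon entropy (in bits) of the annotators' label distribution.

    0.0 means full agreement; higher values mean more disagreement.
    """
    if not labels:
        raise ValueError("need at least one label")
    n = len(labels)
    counts = Counter(labels)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# Made-up annotations for two items that share the same majority label.
clear_item = ["toxic", "toxic", "toxic", "toxic"]
ambiguous_item = ["toxic", "toxic", "toxic", "ok", "ok"]
print(majority_label(clear_item), label_entropy(clear_item))          # toxic 0.0
print(majority_label(ambiguous_item), round(label_entropy(ambiguous_item), 3))
```

Both items have the same majority label, so a majority-based evaluation treats them identically, while the entropy separates the clear case from the ambiguous one.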


[84] Commonsense Generation and Evaluation for Dialogue Systems using Large Language Models cs.CL

Marcos Estecha-Garitagoitia, Chen Zhang, Mario Rodríguez-Cantelar, Luis Fernando D’Haro

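TL;DR: Explores using large language models (LLMs) for data augmentation and automatic evaluation in dialogue systems, with preliminary results supporting context-dependent commonsense generation and evaluation.

Details

Motivation: Dialogue systems need rich, context-grounded data and commonsense knowledge, and traditional methods are limited in capability and efficiency here. The zero-shot and commonsense reasoning abilities of LLMs open new possibilities for this task.

Result: Preliminary results suggest that the approach effectively uses LLMs' commonsense reasoning to generate contextually relevant dialogue data and to verify its quality through automatic evaluation.

Insight: LLMs can not only generate high-quality dialogue data but also automate quality checks through instruction-based evaluation, offering a new approach to data augmentation for dialogue systems.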

Abstract: This paper provides preliminary results on exploring the task of performing turn-level data augmentation for dialogue systems based on different types of commonsense relationships, and the automatic evaluation of the generated synthetic turns. The proposed methodology takes advantage of the extended knowledge and zero-shot capabilities of pretrained Large Language Models (LLMs) to follow instructions and understand contextual information, as well as their commonsense reasoning capabilities. The approach draws inspiration from methodologies like Chain-of-Thought (CoT), applied more explicitly to the task of prompt-based generation for dialogue-based data augmentation conditioned on commonsense attributes, and the automatic evaluation of the generated dialogues. To assess the effectiveness of the proposed approach, we first extracted 200 randomly selected partial dialogues from 5 different well-known dialogue datasets and generated alternative responses conditioned on different event commonsense attributes. This novel dataset allows us to measure the proficiency of LLMs in generating contextually relevant commonsense knowledge, particularly up to 12 different specific ATOMIC [10] database relations. Secondly, we propose an evaluation framework to automatically detect the quality of the generated dataset inspired by the ACCENT [26] metric, which offers a nuanced approach to assess event commonsense. However, our method does not follow ACCENT’s complex event-relation tuple extraction process. Instead, we propose an instruction-based prompt for each commonsense attribute and use state-of-the-art LLMs to automatically detect the original attributes used when creating each augmented turn in the previous step. Preliminary results suggest that our approach effectively harnesses LLMs’ capabilities for commonsense reasoning and evaluation in dialogue systems.


[85] Dialogic Pedagogy for Large Language Models: Aligning Conversational AI with Proven Theories of Learning cs.CL | cs.AI | cs.HC | K.3.2; I.2.6; H.4.m

Russell Beale

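TL;DR: Examines how to align large language models (LLMs) with educational theories (such as sociocultural learning, the Socratic method, and dialogic pedagogy) to make conversational AI more effective in education.

Details

Motivation: With LLMs being adopted rapidly in education, they need to be aligned with established educational theories to ensure sound learning outcomes and teaching methods.

Result: LLMs fall short in co-construction of knowledge and personalized learning, but adjusted strategies (such as prompts designed to guide learners) let them better support pedagogical theory.

Insight: The paper stresses the importance of combining educational theory with AI practice and offers theoretical support and practical tools for designing and applying LLMs in education.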

Abstract: Large Language Models (LLMs) are rapidly transforming education by enabling rich conversational learning experiences. This article provides a comprehensive review of how LLM-based conversational agents are being used in higher education, with extensions to secondary and lifelong learning contexts. We synthesize existing literature on LLMs in education and theories of conversational and dialogic pedagogy - including Vygotsky’s sociocultural learning (scaffolding and the Zone of Proximal Development), the Socratic method, and Laurillard’s conversational framework - and examine how prompting strategies and retrieval-augmented generation (RAG) can align LLM behaviors with these pedagogical theories, and how they can support personalized, adaptive learning. We map educational theories to LLM capabilities, highlighting where LLM-driven dialogue supports established learning principles and where it challenges or falls short of traditional pedagogical assumptions. Notable gaps in applying prior theories to LLMs are identified, such as the models’ tendency to provide direct answers instead of fostering co-construction of knowledge, and the need to account for the constant availability and broad but non-human expertise of LLM tutors. In response, we propose practical strategies to better align LLM interactions with sound pedagogy - for example, designing prompts that encourage Socratic questioning, scaffolded guidance, and student reflection, as well as integrating retrieval mechanisms to ensure accuracy and contextual relevance. Our aim is to bridge the gap between educational theory and the emerging practice of AI-driven conversational learning, offering insights and tools for making LLM-based dialogues more educationally productive and theory-aligned.


[86] Is Long-to-Short a Free Lunch? Investigating Inconsistency and Reasoning Efficiency in LRMs cs.CL

Shu Yang, Junchao Wu, Xuansheng Wu, Derek Wong, Ninhao Liu

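TL;DR: Studies the behavioral inconsistencies that large reasoning models (LRMs) may develop when optimized for efficient reasoning. Using the $ICBENCH$ benchmark to measure three types of inconsistency, it finds that efficient reasoning strategies improve efficiency but raise the risk of inconsistent model behavior.

Details

Motivation: LRMs excel at complex tasks, but overthinking can make them inefficient. Recent work shortens reasoning to improve efficiency, yet it is unclear whether this is a “free lunch”. The paper asks whether compressed reasoning weakens robustness and leads to missing key reasoning steps or inconsistent behavior.

Result: 1) Larger models are generally more consistent than smaller ones, but “scheming” behaviors (such as self-disagreement and post-hoc rationalization) remain widespread; 2) Efficient reasoning strategies increase all three types of inconsistency.

Insight: The paper reveals a trade-off between efficient reasoning and model consistency and warns that pursuing efficiency may let models evade effective supervision, an important consideration for future model optimization.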

Abstract: Large Reasoning Models (LRMs) have achieved remarkable performance on complex tasks by engaging in extended reasoning before producing final answers, yet this strength introduces the risk of overthinking, where excessive token generation occurs even for simple tasks. While recent work in efficient reasoning seeks to reduce reasoning length while preserving accuracy, it remains unclear whether such optimization is truly a free lunch. Drawing on the intuition that compressing reasoning may reduce the robustness of model responses and lead models to omit key reasoning steps, we investigate whether efficient reasoning strategies introduce behavioral inconsistencies. To systematically assess this, we introduce $ICBENCH$, a benchmark designed to measure inconsistency in LRMs across three dimensions: inconsistency across task settings (ITS), inconsistency between training objectives and learned behavior (TR-LB), and inconsistency between internal reasoning and self-explanations (IR-SE). Applying $ICBENCH$ to a range of open-source LRMs, we find that while larger models generally exhibit greater consistency than smaller ones, they all display widespread “scheming” behaviors, including self-disagreement, post-hoc rationalization, and the withholding of reasoning cues. Crucially, our results demonstrate that efficient reasoning strategies such as No-Thinking and Simple Token-Budget consistently increase all three defined types of inconsistency. These findings suggest that although efficient reasoning enhances token-level efficiency, further investigation is imperative to ascertain whether it concurrently introduces the risk of models evading effective supervision.


[87] KnowMap: Efficient Knowledge-Driven Task Adaptation for LLMs cs.CL

Kelin Fu, Kaigui Bian

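TL;DR: KnowMap dynamically constructs a knowledge base and fine-tunes a small knowledge-embedding model to strengthen a larger model's adaptation to new domains, avoiding the cost and data dependence of traditional methods.

Details

Motivation: Large language models (LLMs) perform well in open-world settings, but their reliance on static pretrained knowledge limits rapid adaptation to new tasks; traditional fine-tuning is costly and can cause catastrophic forgetting.

Result: On the ScienceWorld benchmark, the performance of the gpt-4-turbo model improves by 17.71%, demonstrating KnowMap's efficiency and effectiveness.

Insight: Combining dynamic knowledge construction with knowledge embedding can effectively strengthen LLM task adaptation and reasoning while avoiding the drawbacks of traditional methods.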

Abstract: While Large Language Models (LLMs) possess significant capabilities in open-world agent tasks, they also face challenges in rapidly adapting to new, specialized tasks due to their reliance on static pre-trained knowledge. Traditional methods such as fine-tuning are often costly, data-intensive, and may lead to “catastrophic forgetting.” Therefore, we present KnowMap, a novel approach that dynamically constructs a knowledge base from environmental and experiential data. KnowMap fine-tunes a small knowledge-embedding model to equip a larger LLM with valuable task-specific knowledge. Our experiments on the ScienceWorld benchmark demonstrate a 17.71% improvement in the performance of the gpt-4-turbo model. KnowMap not only provides an efficient and effective means for LLM task adaptation, but also highlights how integrating environmental and experiential knowledge can enhance LLMs’ reasoning capabilities.


[88] ECCoT: A Framework for Enhancing Effective Cognition via Chain of Thought in Large Language Model cs.CL | cs.AIPDF

Zhenke Duan, Jiqun Pan, Jiani Tu, Xiaoyi Wang, Yanqing Wang

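TL;DR: ECCoT is an end-to-end cognitive chain-of-thought validation framework that combines topic-aware generation with causal reasoning alignment to improve the reliability and interpretability of LLM reasoning.

Details

Motivation: Reasoning chains generated by current large language models lack transparency and are unreliable, so a method is needed to validate and refine the reasoning process.

Result: ECCoT significantly improves the reliability and interpretability of LLM reasoning, reduces bias, and makes decisions more trustworthy.

Insight: Topic awareness and causal reasoning alignment are key techniques for improving the quality of LLM reasoning chains, and a structured validation framework can effectively improve model outputs.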

Abstract: In the era of large-scale artificial intelligence, Large Language Models (LLMs) have made significant strides in natural language processing. However, they often lack transparency and generate unreliable outputs, raising concerns about their interpretability. To address this, the Chain of Thought (CoT) prompting method structures reasoning into step-by-step deductions. Yet, not all reasoning chains are valid, and errors can lead to unreliable conclusions. We propose ECCoT, an End-to-End Cognitive Chain of Thought Validation Framework, to evaluate and refine reasoning chains in LLMs. ECCoT integrates the Markov Random Field-Embedded Topic Model (MRF-ETM) for topic-aware CoT generation and Causal Sentence-BERT (CSBert) for causal reasoning alignment. By filtering ineffective chains using structured ordering statistics, ECCoT improves interpretability, reduces biases, and enhances the trustworthiness of LLM-based decision-making. Key contributions include the introduction of ECCoT, MRF-ETM for topic-driven CoT generation, and CSBert for causal reasoning enhancement. Code is released at: https://github.com/erwinmsmith/ECCoT.git.


[89] Social Hatred: Efficient Multimodal Detection of Hatemongers cs.CL | cs.SIPDF

Tom Marzea, Abraham Israeli, Oren Tsur

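TL;DR: The paper proposes a multimodal approach for efficiently detecting hate-mongers that combines texts, user activity, and the user's social network, and significantly outperforms existing methods.

Details

Motivation: Automatic detection of online hate speech is an important step toward detoxifying online discourse, and accurate classification also helps in understanding the spread of hate as a social phenomenon. Existing work mostly targets hateful utterances; this paper argues that user-level analysis is equally important and more challenging.

Result: Experiments show that the method significantly outperforms text-based and graph-based methods at detecting hate-mongers and works across different platforms and on large-scale network data.

Insight: User context is a key factor in hate detection, and a multimodal approach can handle coded content and cross-platform settings.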

Abstract: Automatic detection of online hate speech serves as a crucial step in the detoxification of the online discourse. Moreover, accurate classification can promote a better understanding of the proliferation of hate as a social phenomenon. While most prior work focuses on the detection of hateful utterances, we argue that focusing on the user level is as important, albeit challenging. In this paper we consider a multimodal aggregative approach for the detection of hate-mongers, taking into account the potentially hateful texts, user activity, and the user network. Evaluating our method on three unique datasets, X (Twitter), Gab, and Parler, we show that processing a user’s texts in her social context significantly improves the detection of hate-mongers, compared to previously used text and graph-based methods. We offer a comprehensive set of results obtained in different experimental settings as well as qualitative analysis of illustrative cases. Our method can be used to improve the classification of coded messages, dog-whistling, and racial gas-lighting, as well as to inform intervention measures. Moreover, we demonstrate that our multimodal approach performs well across very different content platforms and over large datasets and networks.


[90] Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager cs.CL | cs.AIPDF

Lucie Galland, Catherine Pelachaud, Florian Pecune

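TL;DR: The paper proposes a framework that combines large language models (LLMs) with a reinforcement learning (RL)-based dialogue manager, using hierarchical RL and meta-learning to improve adaptability and efficiency in open-ended dialogue with a specific goal.

Details

Motivation: Current open-ended dialogue systems such as LLMs perform poorly in goal-specific dialogue and cannot adapt effectively to different user needs.

Result: Experiments show that the method outperforms an LLM-based baseline in terms of reward, indicating its potential for goal-specific dialogue.

Insight: Augmenting an LLM with RL-based dialogue management can make open-ended, goal-directed dialogue more personalized and efficient.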

Abstract: In this work, we propose a novel framework that integrates large language models (LLMs) with an RL-based dialogue manager for open-ended dialogue with a specific goal. By leveraging hierarchical reinforcement learning to model the structured phases of dialogue and employing meta-learning to adapt across diverse user profiles, our approach enhances adaptability and efficiency, enabling the system to learn from limited data, transition fluidly between dialogue phases, and personalize responses to heterogeneous patient needs. We apply our framework to Motivational Interviews, aiming to foster behavior change, and demonstrate that the proposed dialogue manager outperforms a state-of-the-art LLM baseline in terms of reward, showing a potential benefit of conditioning LLMs to create open-ended dialogue systems with specific goals.


[91] Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains? cs.CLPDF

Chuxuan Hu, Yuxuan Zhu, Antony Kellermann, Caleb Biddulph, Suppakit Waiwitlikhit

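TL;DR: This paper studies how well reinforcement post training (RPT) of large language models (LLMs) generalizes, finding that RPT performs well on tasks similar to the fine-tuning data but generalizes inconsistently to other domains.

Details

Motivation: RPT has shown promise for improving LLM reasoning, but its generalization to new domains has not been studied thoroughly.

Result: RPT brings substantial gains on similar tasks, but the gains generalize poorly to domains with different reasoning patterns.

Insight: RPT gains are domain-dependent; future work should explore ways to improve cross-domain generalization.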

Abstract: Reinforcement post training (RPT) has recently shown promise in improving the reasoning abilities of large language models (LLMs). However, it remains unclear how well these improvements generalize to new domains, as prior work evaluates RPT models on data from the same domains used for fine-tuning. To understand the generalizability of RPT, we conduct two studies. (1) Observational: We compare a wide range of open-weight RPT models against their corresponding base models across multiple domains, including both seen and unseen domains in their fine-tuning data. (2) Interventional: we fine-tune LLMs with RPT on single domains and evaluate their performance across multiple domains. Both studies converge on the same conclusion that, although RPT brings substantial gains on tasks similar to the fine-tuning data, the gains generalize inconsistently and can vanish on domains with different reasoning patterns.


[92] SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning cs.CL | cs.AI | cs.LGPDF

Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu

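TL;DR: The paper proposes SRFT, which unifies supervised fine-tuning and reinforcement learning into single-stage training through an entropy-aware weighting mechanism, significantly improving LLM reasoning.

Details

Motivation: LLMs perform well on reasoning tasks, but integrating supervised fine-tuning (SFT) and reinforcement learning (RL) effectively remains a key challenge. The authors analyze the differences between the two paradigms (coarse-grained global versus fine-grained selective optimization) to design a more efficient integration.

Result: SRFT reaches 59.1% average accuracy, outperforming zero-RL methods by 9.0% on five mathematical reasoning benchmarks and by 10.9% on three out-of-distribution benchmarks.

Insight: Entropy is a key indicator of how SFT and RL training dynamics differ; single-stage joint optimization combines the strengths of both more efficiently.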

Abstract: Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet the optimal integration of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. Through comprehensive analysis of token distributions, learning dynamics, and integration mechanisms from entropy-based perspectives, we reveal key differences between these paradigms: SFT induces coarse-grained global changes to LLM policy distributions, while RL performs fine-grained selective optimizations, with entropy serving as a critical indicator of training effectiveness. Building on these observations, we propose Supervised Reinforcement Fine-Tuning (SRFT), a single-stage method that unifies both fine-tuning paradigms through entropy-aware weighting mechanisms. Our approach simultaneously applies SFT and RL to directly optimize the LLM using demonstrations and self-exploration rollouts rather than through two-stage sequential methods. Extensive experiments show that SRFT achieves 59.1% average accuracy, outperforming zero-RL methods by 9.0% on five mathematical reasoning benchmarks and 10.9% on three out-of-distribution benchmarks.


[93] Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study cs.CL | cs.AI | cs.IR | cs.LG | cs.MAPDF

Yuqi Zhu, Yi Zhong, Jintian Zhang, Ziheng Zhang, Shuofei Qiao

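TL;DR: The paper studies the limitations of open-source large language models (LLMs) on data analysis tasks, derives strategies for improving them from an empirical analysis, and develops a data synthesis method.

Details

Motivation: Open-source LLMs perform poorly on reasoning-intensive tasks such as data analysis; the authors aim to identify improvements through a systematic study.

Result: Strategic planning quality, interaction design, and task complexity significantly affect reasoning capability, and data quality matters more than diversity for performance.

Insight: Improving the reasoning of open-source LLMs requires attention to core factors such as strategic planning; data diversity alone may have limited effect.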

Abstract: Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate models across three dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs’ analytical reasoning capabilities.


[94] How Effectively Can BERT Models Interpret Context and Detect Bengali Communal Violent Text? cs.CLPDF

Abdullah Khondoker, Enam Ahmed Taufik, Md. Iftekhar Islam Tashik, S M Ishtiak Mahmud, Farig Sadeque

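TL;DR: The study fine-tunes BanglaBERT and expands the dataset to improve detection of Bengali communal violent text, and uses LIME to analyze weaknesses in the model's decisions.

Details

Motivation: The spread of online hate fuels communal violence and threatens social harmony. Classification of Bengali communal violent text is underexplored, and detection accuracy needs to improve.

Result: The fine-tuned model achieves a macro F1 score of 0.60 and the ensemble model 0.63. LIME analysis shows that the models fall short in understanding context.

Insight: Pre-trained models have limited ability to distinguish closely related communal and non-communal terms; NLP tools show considerable promise for reducing communal violence.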

Abstract: The spread of cyber hatred has led to communal violence, fueling aggression and conflicts between various religious, ethnic, and social groups, posing a significant threat to social harmony. Despite its critical importance, the classification of communal violent text remains an underexplored area in existing research. This study aims to enhance the accuracy of detecting text that incites communal violence, focusing specifically on Bengali textual data sourced from social media platforms. We introduce a fine-tuned BanglaBERT model tailored for this task, achieving a macro F1 score of 0.60. To address the issue of data imbalance, our dataset was expanded by adding 1,794 instances, which facilitated the development and evaluation of a fine-tuned ensemble model. This ensemble model demonstrated an improved performance, achieving a macro F1 score of 0.63, thus highlighting its effectiveness in this domain. In addition to quantitative performance metrics, qualitative analysis revealed instances where the models struggled with context understanding, leading to occasional misclassifications, even when predictions were made with high confidence. Through analyzing the cosine similarity between words, we identified certain limitations in the pre-trained BanglaBERT models, particularly in their ability to distinguish between closely related communal and non-communal terms. To further interpret the model’s decisions, we applied LIME, which helped to uncover specific areas where the model struggled in understanding context, contributing to errors in classification. These findings highlight the promise of NLP and interpretability tools in reducing online communal violence. Our work contributes to the growing body of research in communal violence detection and offers a foundation for future studies aiming to refine these techniques for better accuracy and societal impact.


[95] MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration cs.CLPDF

Yucheng Zhou, Lingran Song, Jianbing Shen

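TL;DR: The paper proposes MAM, a modular multi-agent framework that performs multimodal medical diagnosis through role-specialized collaboration, addressing the limits of current unified multimodal medical LLMs in knowledge update cost, comprehensiveness, and flexibility.

Details

Motivation: Current unified multimodal medical LLMs are limited in knowledge update cost, comprehensiveness, and flexibility; MAM addresses these limits through role assignment and multi-agent collaboration.

Result: On multiple public multimodal medical datasets, MAM improves performance by 18% to 365% over modality-specific LLMs.

Insight: Role specialization and multi-agent collaboration can improve the efficiency of medical diagnosis and offer a new approach to handling multimodal data.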

Abstract: Recent advancements in medical Large Language Models (LLMs) have showcased their powerful reasoning and diagnostic capabilities. Despite their success, current unified multimodal medical LLMs face limitations in knowledge update costs, comprehensiveness, and flexibility. To address these challenges, we introduce the Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis (MAM). Inspired by our empirical findings highlighting the benefits of role assignment and diagnostic discernment in LLMs, MAM decomposes the medical diagnostic process into specialized roles: a General Practitioner, Specialist Team, Radiologist, Medical Assistant, and Director, each embodied by an LLM-based agent. This modular and collaborative framework enables efficient knowledge updates and leverages existing medical LLMs and knowledge bases. Extensive experimental evaluations conducted on a wide range of publicly accessible multimodal medical datasets, incorporating text, image, audio, and video modalities, demonstrate that MAM consistently surpasses the performance of modality-specific LLMs. Notably, MAM achieves significant performance improvements ranging from 18% to 365% compared to baseline models. Our code is released at https://github.com/yczhou001/MAM.


cs.ET [Back]

[96] Experimental Assessment of Neural 3D Reconstruction for Small UAV-based Applications cs.ET | cs.AI | cs.CV | cs.NI | eess.IVPDF

Genís Castillo Gómez-Raya, Álmos Veres-Vitályos, Filip Lemic, Pablo Royo, Mario Montagud

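TL;DR: The paper integrates Neural 3D Reconstruction (N3DR) with small unmanned aerial vehicle (UAV) systems for fine-grained 3D reconstruction of small static objects, significantly improving reconstruction quality.

Details

Motivation: Miniaturized UAVs can now operate indoors and in hard-to-reach areas, but their flight dynamics and power consumption limit autonomy and mission capability. The paper proposes N3DR as a way to overcome these limitations.

Result: Experiments show that the N3DR pipeline significantly outperforms the baseline Structure from Motion (SfM) algorithm, demonstrating its potential for fine-grained 3D reconstruction.

Insight: N3DR could further extend the use of miniaturized UAV systems in constrained environments, for example for high-precision 3D mapping and anomaly detection.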

Abstract: The increasing miniaturization of Unmanned Aerial Vehicles (UAVs) has expanded their deployment potential to indoor and hard-to-reach areas. However, this trend introduces distinct challenges, particularly in terms of flight dynamics and power consumption, which limit the UAVs’ autonomy and mission capabilities. This paper presents a novel approach to overcoming these limitations by integrating Neural 3D Reconstruction (N3DR) with small UAV systems for fine-grained 3-Dimensional (3D) digital reconstruction of small static objects. Specifically, we design, implement, and evaluate an N3DR-based pipeline that leverages advanced models, i.e., Instant-ngp, Nerfacto, and Splatfacto, to improve the quality of 3D reconstructions using images of the object captured by a fleet of small UAVs. We assess the performance of the considered models using various imagery and pointcloud metrics, comparing them against the baseline Structure from Motion (SfM) algorithm. The experimental results demonstrate that the N3DR-enhanced pipeline significantly improves reconstruction quality, making it feasible for small UAVs to support high-precision 3D mapping and anomaly detection in constrained environments. In more general terms, our results highlight the potential of N3DR in advancing the capabilities of miniaturized UAV systems.


cs.PL [Back]

[97] Mix-of-Language-Experts Architecture for Multilingual Programming cs.PL | cs.CL | cs.SEPDF

Yifan Zong, Yuntian Deng, Pengyu Nie

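TL;DR: The paper proposes MoLE (Mix-of-Language-Experts), an architecture that balances efficiency and specialization in multilingual programming by combining a shared LoRA module with language-specific LoRA modules for knowledge sharing and specialization.

Details

Motivation: Existing approaches to multilingual programming either sacrifice language-specific performance for cost efficiency or incur high compute and storage costs. MoLE aims to balance the two.

Result: Experiments show that MoLE is more parameter-efficient than separately trained language-specific LoRA modules and more accurate than a single shared fine-tuned model.

Insight: MoLE offers an effective way to combine efficiency and specialization in multilingual programming tasks and suggests a direction for future multilingual model design.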

Abstract: Large language models (LLMs) have demonstrated impressive capabilities in aiding developers with tasks like code comprehension, generation, and translation. Supporting multilingual programming – i.e., coding tasks across multiple programming languages – typically requires either (1) finetuning a single LLM across all programming languages, which is cost-efficient but sacrifices language-specific specialization and performance, or (2) finetuning separate LLMs for each programming language, which allows for specialization but is computationally expensive and storage-intensive due to the duplication of parameters. This paper introduces MoLE (Mix-of-Language-Experts), a novel architecture that balances efficiency and specialization for multilingual programming. MoLE is composed of a base model, a shared LoRA (low-rank adaptation) module, and a collection of language-specific LoRA modules. These modules are jointly optimized during the finetuning process, enabling effective knowledge sharing and specialization across programming languages. During inference, MoLE automatically routes to the language-specific LoRA module corresponding to the programming language of the code token being generated. Our experiments demonstrate that MoLE achieves greater parameter efficiency compared to training separate language-specific LoRAs, while outperforming a single shared LLM finetuned for all programming languages in terms of accuracy.
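The structure described in the abstract (a base model plus a shared LoRA module plus one LoRA module per programming language) can be illustrated with a toy linear layer. This is a minimal sketch and not the authors' implementation: the class names `LoRA` and `MoLELinear`, the helper `build_toy_layer`, the toy weights, and the choice to sum the three contributions are all assumptions made for illustration. The abstract does not say how the language of the current token is determined, so the sketch simply takes it as an argument.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (length cols)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRA:
    """Low-rank update: delta(x) = up @ (down @ x)."""

    def __init__(self, down: Matrix, up: Matrix) -> None:
        self.down = down  # shape (rank, d_in)
        self.up = up      # shape (d_out, rank)

    def delta(self, x: Vector) -> Vector:
        return matvec(self.up, matvec(self.down, x))


class MoLELinear:
    """Hypothetical layer: base weight + shared LoRA + one LoRA per language."""

    def __init__(self, base: Matrix, shared: LoRA, experts: Dict[str, LoRA]) -> None:
        self.base = base
        self.shared = shared
        self.experts = experts

    def forward(self, x: Vector, language: str) -> Vector:
        # Route to the expert for the given language (assumed to be known here).
        expert = self.experts[language]
        parts = (matvec(self.base, x), self.shared.delta(x), expert.delta(x))
        # Assumption: the three contributions are combined by summation.
        return [sum(vals) for vals in zip(*parts)]


def build_toy_layer() -> MoLELinear:
    """Build a 2-dimensional layer with rank-1 adapters and made-up weights."""
    base = [[1.0, 0.0], [0.0, 1.0]]
    shared = LoRA(down=[[1.0, 0.0]], up=[[1.0], [0.0]])
    experts = {
        "python": LoRA(down=[[0.0, 1.0]], up=[[0.0], [2.0]]),
        "rust": LoRA(down=[[1.0, 1.0]], up=[[1.0], [1.0]]),
    }
    return MoLELinear(base, shared, experts)


layer = build_toy_layer()
print(layer.forward([1.0, 2.0], "python"))  # [2.0, 6.0]
print(layer.forward([1.0, 2.0], "rust"))    # [5.0, 5.0]
```

The same input yields different outputs depending on the routed language, while the base weight and the shared adapter contribute identically in both cases. Only the small per-language adapters are duplicated, which is the source of the parameter savings the abstract describes.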


cs.LG [Back]

[98] Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models cs.LG | cs.CLPDF

Zihan Wang, Rui Pan, Jiarui Yao, Robert Csordas, Linjie Li

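TL;DR: The paper proposes Chain-of-Experts (CoE), a new Mixture-of-Experts (MoE) architecture that introduces sequential communication among experts within each layer, improving expressiveness and compute efficiency.

Details

Motivation: In traditional MoE models, experts work independently in parallel, which limits interaction among them. CoE adds a sequential communication mechanism so that experts interact dynamically within each layer, increasing the model's expressiveness.

Result: On math reasoning tasks, CoE reduces validation loss from 1.20 to 1.12 compared with a standard MoE; 2x iterations match the performance of 3x expert selections (in width) while reducing memory usage by 17.6-42%.

Insight: CoE's iterative residual structure and iterative routing strengthen expert specialization and open a new scaling axis, depth through expert iteration, that complements conventional width/depth scaling.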

Abstract: We propose Chain-of-Experts (CoE), a new Mixture-of-Experts (MoE) architecture that introduces sequential expert communication within each layer. Unlike traditional MoE models, where experts operate independently in parallel, CoE processes tokens iteratively across a chain of experts inside a layer. To support dynamic expert selection across iterations, CoE employs a dedicated router at each iteration step within a layer. This design allows tokens to re-evaluate and select different experts during each iteration, rather than being statically assigned. As a result, CoE introduces a flexible routing mechanism that increases the diversity of expert combinations and enriches the model’s representational capacity. CoE demonstrates improved performance under fixed compute: on math reasoning tasks, it reduces validation loss from 1.20 to 1.12 compared to a standard MoE. Beyond performance, CoE offers a new scaling axis: depth through expert iteration, which complements conventional width/depth scaling. For example, using 2x iterations matches the performance of 3x expert selections (in width), while reducing memory usage by 17.6-42% relative to other scaling strategies. Our analysis reveals that CoE’s benefits stem from its iterative residual structure and enhanced expert specialization empowered by iterative routing, which together unlock more expressive representations. Code is available at https://github.com/ZihanWang314/coe.
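The iterative routing described in the abstract can be sketched for a single token. This is a minimal sketch under stated assumptions, not the released implementation: the function names `route` and `coe_layer`, the dot-product router, the softmax gate over the selected experts, and the toy experts are hypothetical. What it keeps from the abstract is one router per iteration, re-selection of experts at each iteration, and a residual update.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router: List[Vector], h: Vector, top_k: int) -> List[Tuple[int, float]]:
    """Score every expert with a dot product and keep the top_k with softmax gates."""
    scores = [sum(w * x for w, x in zip(row, h)) for row in router]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    gates = softmax([scores[i] for i in chosen])
    return list(zip(chosen, gates))


def coe_layer(h: Vector, experts: Sequence[Expert],
              routers: Sequence[List[Vector]], top_k: int = 1) -> Vector:
    """Process one token through a chain of expert iterations inside one layer."""
    for router in routers:  # one dedicated router per iteration step
        update = [0.0] * len(h)
        for idx, gate in route(router, h, top_k):
            out = experts[idx](h)
            update = [u + gate * o for u, o in zip(update, out)]
        h = [a + b for a, b in zip(h, update)]  # residual connection
    return h


# Toy setup: expert 0 pushes dimension 1, expert 1 pushes dimension 0.
toy_experts = [lambda h: [0.0, 2.0], lambda h: [2.0, 0.0]]
toy_router = [[1.0, 0.0], [0.0, 1.0]]  # expert i is scored by h[i]

print(coe_layer([1.0, 0.0], toy_experts, [toy_router, toy_router]))  # [3.0, 2.0]
```

In the toy run the token selects expert 0 at the first iteration and expert 1 at the second, because the first expert's output changes the hidden state that the second router sees. A standard MoE layer corresponds to a single iteration, where each token's expert choice is made once.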


[99] Thought Anchors: Which LLM Reasoning Steps Matter? cs.LG | cs.AI | cs.CLPDF

Paul C. Bogdan, Uzay Macar, Neel Nanda, Arthur Conmy

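TL;DR: The paper presents three complementary attribution methods for identifying the key steps in LLM reasoning ("thought anchors"), provides a visualization tool, and shows the potential of sentence-level analysis for understanding reasoning models.

Details

Motivation: Although LLMs perform well on reasoning tasks, their long-form chain-of-thought reasoning is hard to interpret. The authors argue that sentence-level analysis is an effective way to understand the reasoning process.

Result: Experiments provide evidence that thought anchors exist and strongly influence the subsequent reasoning. The consistency across the three methods supports the value of sentence-level analysis.

Insight: Sentence-level analysis can reveal the key steps of a reasoning model, and thought anchors offer a new perspective on model interpretability.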

Abstract: Reasoning large language models have recently achieved state-of-the-art performance in many fields. However, their long-form chain-of-thought reasoning creates interpretability challenges as each generated token depends on all previous ones, making the computation harder to decompose. We argue that analyzing reasoning traces at the sentence level is a promising approach to understanding reasoning processes. We present three complementary attribution methods: (1) a black-box method measuring each sentence’s counterfactual importance by comparing final answers across 100 rollouts conditioned on the model generating that sentence or one with a different meaning; (2) a white-box method of aggregating attention patterns between pairs of sentences, which identified “broadcasting” sentences that receive disproportionate attention from all future sentences via “receiver” attention heads; (3) a causal attribution method measuring logical connections between sentences by suppressing attention toward one sentence and measuring the effect on each future sentence’s tokens. Each method provides evidence for the existence of thought anchors, reasoning steps that have outsized importance and that disproportionately influence the subsequent reasoning process. These thought anchors are typically planning or backtracking sentences. We provide an open-source tool (www.thought-anchors.com) for visualizing the outputs of our methods, and present a case study showing converging patterns across methods that map how a model performs multi-step reasoning. The consistency across methods demonstrates the potential of sentence-level analysis for a deeper understanding of reasoning models.
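The black-box method (1) compares final answers from two sets of rollouts: one where the model keeps the sentence and one where the sentence is replaced by one with a different meaning. The sketch below shows one way such a comparison could be scored. The abstract does not state which distance is used, so the total variation distance here is an assumption, as are the function names `answer_distribution` and `counterfactual_importance` and the toy rollout data.

```python
from collections import Counter
from typing import Dict, Sequence


def answer_distribution(answers: Sequence[str]) -> Dict[str, float]:
    """Empirical distribution over final answers from a set of rollouts."""
    if not answers:
        raise ValueError("at least one rollout is required")
    counts = Counter(answers)
    return {answer: count / len(answers) for answer, count in counts.items()}


def counterfactual_importance(with_sentence: Sequence[str],
                              with_replacement: Sequence[str]) -> float:
    """Total variation distance between the two answer distributions.

    0.0 means replacing the sentence never changes the final answer;
    1.0 means the two sets of rollouts share no final answers.
    """
    p = answer_distribution(with_sentence)
    q = answer_distribution(with_replacement)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)


# Toy data: 4 rollouts per condition instead of the 100 used in the paper.
kept = ["42", "42", "42", "42"]
replaced = ["42", "42", "7", "7"]
print(counterfactual_importance(kept, replaced))  # 0.5
```

Under this scoring, a sentence with a high value behaves like a thought anchor in the abstract's sense: changing it shifts where the rest of the reasoning ends up.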


[100] Scaling Speculative Decoding with Lookahead Reasoning cs.LG | cs.CLPDF

Yichao Fu, Rui Ge, Zelei Shao, Zhijie Deng, Hao Zhang

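TL;DR: The paper introduces Lookahead Reasoning to address the limited speedup of speculative decoding (SD) on long chain-of-thought tasks, achieving more parallelism and faster inference.

Details

Motivation: On long reasoning tasks, the speedup of existing SD is capped because the chance that an entire multi-token guess is correct falls exponentially with its length, so SD cannot fully use the available hardware compute. A new method is needed to raise this algorithmic ceiling.

Result: Across GSM8K, AIME, and other benchmarks, Lookahead Reasoning improves the speedup of SD from 1.4x to 2.1x without loss of answer quality.

Insight: Requiring semantic correctness rather than exact token matching is more effective for parallelism; combining a lightweight draft model with semantic verification lifts the limits of traditional SD in multi-step reasoning.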

Abstract: Reasoning models excel by generating long chain-of-thoughts, but decoding the resulting thousands of tokens is slow. Token-level speculative decoding (SD) helps, but its benefit is capped, because the chance that an entire $\gamma$-token guess is correct falls exponentially as $\gamma$ grows. This means allocating more compute for longer token drafts faces an algorithmic ceiling – making the speedup modest and hardware-agnostic. We raise this ceiling with Lookahead Reasoning, which exploits a second, step-level layer of parallelism. Our key insight is that reasoning models generate step-by-step, and each step needs only to be semantically correct, not exact token matching. In Lookahead Reasoning, a lightweight draft model proposes several future steps; the target model expands each proposal in one batched pass, and a verifier keeps semantically correct steps while letting the target regenerate any that fail. Token-level SD still operates within each reasoning step, so the two layers of parallelism multiply. We show Lookahead Reasoning lifts the peak speedup of SD both theoretically and empirically. Across GSM8K, AIME, and other benchmarks, Lookahead Reasoning improves the speedup of SD from 1.4x to 2.1x while preserving answer quality, and its speedup scales better with additional GPU throughput. Our code is available at https://github.com/hao-ai-lab/LookaheadReasoning
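The ceiling on token-level speculative decoding can be made concrete with a short calculation. This is a minimal sketch under a simplifying assumption that is not stated in the abstract: each drafted token is accepted independently with the same probability `alpha`. Under that assumption the whole $\gamma$-token draft is accepted with probability `alpha ** gamma`, and the expected number of tokens produced per target-model pass is the geometric sum $1 + \alpha + \dots + \alpha^{\gamma}$, which can never exceed $1/(1-\alpha)$. The function names are hypothetical.

```python
def prob_full_draft_accepted(alpha: float, gamma: int) -> float:
    """Probability that all gamma drafted tokens are accepted."""
    return alpha ** gamma


def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens per target pass: accepted draft tokens plus one from the target."""
    return sum(alpha ** k for k in range(gamma + 1))


if __name__ == "__main__":
    alpha = 0.8  # assumed per-token acceptance rate, chosen for illustration
    for gamma in (1, 2, 4, 8, 16, 32):
        print(
            f"gamma={gamma:2d}  "
            f"P(all accepted)={prob_full_draft_accepted(alpha, gamma):.4f}  "
            f"E[tokens/pass]={expected_tokens_per_pass(alpha, gamma):.3f}"
        )
```

With `alpha = 0.8` the expected tokens per pass approach 5 no matter how long the draft is, so extra draft compute stops paying off. This is the kind of saturation the abstract attributes to token-level SD and that step-level parallelism is meant to complement.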


[101] ConCM: Consistency-Driven Calibration and Matching for Few-Shot Class-Incremental Learning cs.LG | cs.CV | 68T40 | I.2.6; I.4.9PDF

QinZhe Wang, Zixuan Chen, Keke Huang, Xiu Su, Chunhua Yang

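TL;DR: The paper proposes a Consistency-driven Calibration and Matching framework (ConCM) that addresses prototype deviation and structure fixity in few-shot class-incremental learning (FSCIL), strengthening feature representations through dual consistency and achieving state-of-the-art performance.

Details

Motivation: Existing space-reservation methods in FSCIL suffer from prototype deviation and structure fixity, which limit the expressiveness of the embedding space. The paper optimizes feature-structure dual consistency to mitigate knowledge conflict.

Result: On mini-ImageNet and CUB200, ConCM improves harmonic accuracy over incremental sessions by 3.20% and 3.68%, respectively, relative to the previous best method.

Insight: Theoretical analysis of geometric optimality and maximum matching supports the effectiveness of dual consistency for FSCIL and suggests a direction for future research.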

Abstract: Few-Shot Class-Incremental Learning (FSCIL) requires models to adapt to novel classes with limited supervision while preserving learned knowledge. Existing prospective learning-based space construction methods reserve space to accommodate novel classes. However, prototype deviation and structure fixity limit the expressiveness of the embedding space. In contrast to fixed space reservation, we explore the optimization of feature-structure dual consistency and propose a Consistency-driven Calibration and Matching Framework (ConCM) that systematically mitigates the knowledge conflict inherent in FSCIL. Specifically, inspired by hippocampal associative memory, we design a memory-aware prototype calibration that extracts generalized semantic attributes from base classes and reintegrates them into novel classes to enhance the conceptual center consistency of features. Further, we propose dynamic structure matching, which adaptively aligns the calibrated features to a session-specific optimal manifold space, ensuring cross-session structure consistency. Theoretical analysis shows that our method satisfies both geometric optimality and maximum matching, thereby overcoming the need for class-number priors. On large-scale FSCIL benchmarks including mini-ImageNet and CUB200, ConCM achieves state-of-the-art performance, surpassing the current best method by 3.20% and 3.68% in harmonic accuracy of incremental sessions.


q-bio.NC [Back]

[102] Convergent and divergent connectivity patterns of the arcuate fasciculus in macaques and humans q-bio.NC | cs.CV | eess.IVPDF

Jiahao Huang, Ruifeng Li, Wenwen Yu, Anan Li, Xiangning Li

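TL;DR: The paper compares the connectivity of the arcuate fasciculus (AF) in macaques and humans, revealing cross-species differences relevant to the evolution of language networks. Combining single-neuron tracing with whole-brain diffusion MRI, it finds that the human AF shows broader temporal integration and stronger frontoparietal connectivity.

Details

Motivation: The study aims to resolve how AF connectivity differs between macaques and humans, in order to understand the evolutionary basis of human language networks and the anatomical mechanisms of related disorders such as aphasia and dyslexia.

Result: The macaque AF originates mainly in the temporal-parietal cortex and projects to prefrontal regions via the auditory cortex and parietal operculum, whereas the human AF extends into the middle temporal gyrus and has stronger prefrontal and parietal operculum connectivity. These differences likely underpin the evolutionary specialization of human language networks.

Insight: (1) The human AF's broader temporal integration and strengthened frontoparietal linkages may be the neural substrate of advanced language processing; (2) differences in AF connectivity offer a new perspective on language-related disorders.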

Abstract: The organization and connectivity of the arcuate fasciculus (AF) in nonhuman primates remain contentious, especially concerning how its anatomy diverges from that of humans. Here, we combined cross-scale single-neuron tracing - using viral-based genetic labeling and fluorescence micro-optical sectioning tomography in macaques (n = 4; age 3 - 11 years) - with whole-brain tractography from 11.7T diffusion MRI. Complemented by spectral embedding analysis of 7.0T MRI in humans, we performed a comparative connectomic analysis of the AF across species. We demonstrate that the macaque AF originates in the temporal-parietal cortex, traverses the auditory cortex and parietal operculum, and projects into prefrontal regions. In contrast, the human AF exhibits greater expansion into the middle temporal gyrus and stronger prefrontal and parietal operculum connectivity - divergences quantified by Kullback-Leibler analysis that likely underpin the evolutionary specialization of human language networks. These interspecies differences - particularly the human AF’s broader temporal integration and strengthened frontoparietal linkages - suggest a connectivity-based substrate for the emergence of advanced language processing unique to humans. Furthermore, our findings offer a neuroanatomical framework for understanding AF-related disorders such as aphasia and dyslexia, where aberrant connectivity disrupts language function.


cs.RO [Back]

[103] Fake or Real, Can Robots Tell? Evaluating Embodied Vision-Language Models on Real and 3D-Printed Objects cs.RO | cs.AI | cs.CL | cs.CV | cs.LGPDF

Federico Tavella, Kathryn Mearns, Angelo Cangelosi

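TL;DR: The paper compares several vision-language models (VLMs) for robotic scene understanding, analyzing single-view versus multi-view captioning and recognition of real versus 3D-printed objects, and offers practical insights for deploying foundation models in real robotic tasks.

Details

Motivation: As robotic scene understanding increasingly relies on VLMs, their performance on real and 3D-printed objects needs to be evaluated to determine their applicability and limits in real robotic tasks.

Result: Experiments show that VLMs recognize common objects well but generalize poorly to novel 3D-printed objects.

Insight: VLMs show promise for robotic tasks but need further improvement in generalizing to novel objects before they can be deployed reliably.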

Abstract: Robotic scene understanding increasingly relies on vision-language models (VLMs) to generate natural language descriptions of the environment. In this work, we present a comparative study of captioning strategies for tabletop scenes captured by a robotic arm equipped with an RGB camera. The robot collects images of objects from multiple viewpoints, and we evaluate several models that generate scene descriptions. We compare the performance of various captioning models, such as BLIP and VLMs. Our experiments examine the trade-offs between single-view and multi-view captioning, and the difference between recognising real-world and 3D-printed objects. We quantitatively evaluate object identification accuracy, completeness, and naturalness of the generated captions. Results show that VLMs can be used in robotic settings where common objects need to be recognised, but fail to generalise to novel representations. Our findings provide practical insights into deploying foundation models for embodied agents in real-world settings.


[104] CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation cs.RO | cs.CVPDF

Hao Li, Shuai Yang, Yilun Chen, Yang Tian, Xiaoda Yang

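TL;DR: CronusVLA extends single-frame vision-language-action (VLA) models to multi-frame prediction through an efficient post-training stage, making better use of motion information and raising task success rates.

Details

Motivation: Existing VLA models are confined to a single-frame observation paradigm and cannot fully exploit motion information from multi-frame historical observations, because the large backbone makes this computationally expensive. CronusVLA aims to enable multi-frame modeling efficiently.

Result: CronusVLA reaches a 70.9% success rate on SimplerEnv, improves over OpenVLA by 12.7% on LIBERO, and performs robustly in real-world Franka robot experiments.

Insight: Efficient use of motion features from historical frames can substantially improve the accuracy and generalization of action prediction while keeping computational overhead low.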

Abstract: Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong generalization across manipulation tasks. However, they remain constrained by a single-frame observation paradigm and cannot fully benefit from the motion information offered by aggregated multi-frame historical observations, as the large vision-language backbone introduces substantial computational cost and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm through an efficient post-training stage. CronusVLA comprises three key components: (1) single-frame pretraining on large-scale embodied datasets with autoregressive action token prediction, which establishes an embodied vision-language foundation; (2) multi-frame encoding, adapting the prediction of vision-language backbones from discrete action tokens to motion features during post-training, and aggregating motion features from historical frames into a feature chunking; (3) cross-frame decoding, which maps the feature chunking to accurate actions via a shared decoder with cross-attention. By reducing redundant token computation and caching past motion features, CronusVLA achieves efficient inference. As an application of motion features, we further propose an action adaptation mechanism based on feature-action retrieval to improve model performance during finetuning. CronusVLA achieves state-of-the-art performance on SimplerEnv with a 70.9% success rate, and a 12.7% improvement over OpenVLA on LIBERO. Real-world Franka experiments also show strong performance and robustness.


[105] Look to Locate: Vision-Based Multisensory Navigation with 3-D Digital Maps for GNSS-Challenged Environments cs.RO | cs.CVPDF

Ola Elmaghraby, Eslam Mounier, Paulo Ricardo Marques de Araujo, Aboelmagd Noureldin

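TL;DR: The paper proposes a low-cost, vision-based multi-sensor navigation system that combines monocular depth estimation, semantic filtering, and visual map registration (VMR) for vehicle positioning in GNSS-challenged environments, significantly improving accuracy and robustness.

Details

Motivation: Accurate and robust vehicle positioning in GNSS-denied environments such as indoor parking structures or dense urban canyons remains a significant challenge. The authors address it with a low-cost vision system.

Result: The system performs well both indoors and outdoors, achieving sub-meter accuracy of 92% indoors and more than 80% outdoors, with positioning accuracy improved by about 88% on average.

Insight: The paper shows the potential of low-cost monocular vision combined with 3D maps for scalable navigation in GNSS-challenged environments.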

Abstract: In Global Navigation Satellite System (GNSS)-denied environments such as indoor parking structures or dense urban canyons, achieving accurate and robust vehicle positioning remains a significant challenge. This paper proposes a cost-effective, vision-based multi-sensor navigation system that integrates monocular depth estimation, semantic filtering, and visual map registration (VMR) with 3-D digital maps. Extensive testing in real-world indoor and outdoor driving scenarios demonstrates the effectiveness of the proposed system, achieving sub-meter accuracy of 92% indoors and more than 80% outdoors, with consistent horizontal positioning and heading average root mean-square errors of approximately 0.98 m and 1.25°, respectively. Compared to the baselines examined, the proposed solution significantly reduced drift and improved robustness under various conditions, achieving positioning accuracy improvements of approximately 88% on average. This work highlights the potential of cost-effective monocular vision systems combined with 3D maps for scalable, GNSS-independent navigation in land vehicles.


cs.IR [Back]

[106] From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents cs.IR | cs.CL | cs.LGPDF

Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan

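TL;DR: The paper argues for a shift from traditional keyword search to a new LLM-based paradigm, “Agentic Deep Research”, which addresses complex information needs through a dynamic feedback loop of autonomous reasoning, iterative retrieval, and information synthesis.

Details

Motivation: Traditional keyword search cannot satisfy complex, multi-step information needs, while reasoning-capable LLMs open a new direction for information retrieval.

Result: Benchmark results indicate that the approach significantly outperforms traditional search methods, and it is poised to become the dominant paradigm for future information seeking.

Insight: LLM systems with reasoning capabilities can fundamentally change how information is retrieved, offering more efficient and dynamic solutions.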

Abstract: Information retrieval is a cornerstone of modern knowledge acquisition, enabling billions of queries each day across diverse domains. However, traditional keyword-based search engines are increasingly inadequate for handling complex, multi-step information needs. Our position is that Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. These systems transcend conventional information search techniques by tightly integrating autonomous reasoning, iterative retrieval, and information synthesis into a dynamic feedback loop. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We also introduce a test-time scaling law to formalize the impact of computational depth on reasoning and search. Supported by benchmark results and the rise of open-source implementations, we demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking. All the related resources, including industry products, research papers, benchmark datasets, and open-source implementations, are collected for the community in https://github.com/DavidZWZ/Awesome-Deep-Research.


eess.IV [Back]

[107] Assessing Risk of Stealing Proprietary Models for Medical Imaging Tasks eess.IV | cs.CR | cs.CVPDF

Ankita Raj, Harsh Swaika, Deepankar Varma, Chetan Arora

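TL;DR: The study shows that proprietary medical imaging models are at risk of model stealing (MS) attacks and proposes QueryWise, a two-step attack that efficiently clones model functionality under a limited query budget.

Details

Motivation: Following the success of deep learning in medical imaging, proprietary models are being deployed in diagnostic workflows. These models may be exposed to model stealing attacks, yet the susceptibility of medical imaging models remains insufficiently studied.

Result: The attack's effectiveness is validated on two medical imaging tasks: gallbladder cancer classification and COVID-19 classification.

Insight: Even without the victim model's training data and with a limited budget, an adversary can effectively steal model functionality using public data, which underscores the importance of model protection in medical imaging.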

Abstract: The success of deep learning in medical imaging applications has led several companies to deploy proprietary models in diagnostic workflows, offering monetized services. Even though model weights are hidden to protect the intellectual property of the service provider, these models are exposed to model stealing (MS) attacks, where adversaries can clone the model’s functionality by querying it with a proxy dataset and training a thief model on the acquired predictions. While extensively studied on general vision tasks, the susceptibility of medical imaging models to MS attacks remains inadequately explored. This paper investigates the vulnerability of black-box medical imaging models to MS attacks under realistic conditions where the adversary lacks access to the victim model’s training data and operates with limited query budgets. We demonstrate that adversaries can effectively execute MS attacks by using publicly available datasets. To further enhance MS capabilities with limited query budgets, we propose a two-step model stealing approach termed QueryWise. This method capitalizes on unlabeled data obtained from a proxy distribution to train the thief model without incurring additional queries. Evaluation on two medical imaging models for Gallbladder Cancer and COVID-19 classification substantiates the effectiveness of the proposed attack. The source code is available at https://github.com/rajankita/QueryWise.


[108] NIC-RobustBench: A Comprehensive Open-Source Toolkit for Neural Image Compression and Robustness Analysis eess.IV | cs.CV | cs.MMPDF

Georgii Bychkov, Khaled Abud, Egor Kovalev, Alexander Gushchin, Dmitriy Vatolin

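TL;DR: The paper introduces NIC-RobustBench, the first open-source toolkit for evaluating the robustness of neural image compression (NIC), supporting a wide range of codecs and attack types.

Details

Motivation: With the release of the JPEG AI standard, evaluating the robustness of NIC methods has become particularly important, yet prior work is limited to a few codecs and attack types, so a comprehensive toolkit is needed.

Result: NIC-RobustBench includes the largest number of codecs among known NIC libraries and provides comprehensive analysis tools for NIC robustness research.

Insight: NIC robustness evaluation should cover multiple codecs and attack types; an open-source toolkit can advance standardization and progress in the field.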

Abstract: Adversarial robustness of neural networks is an increasingly important area of research, combining studies on computer vision models, large language models (LLMs), and others. With the release of JPEG AI – the first standard for end-to-end neural image compression (NIC) methods – the question of evaluating NIC robustness has become critically significant. However, previous research has been limited to a narrow range of codecs and attacks. To address this, we present NIC-RobustBench, the first open-source framework to evaluate NIC robustness and adversarial defenses’ efficiency, in addition to comparing Rate-Distortion (RD) performance. The framework includes the largest number of codecs among all known NIC libraries and is easily scalable. The paper presents a comprehensive overview of the NIC-RobustBench framework and employs it to analyze NIC robustness. Our code is available online at https://github.com/msu-video-group/NIC-RobustBench.


[109] Xray2Xray: World Model from Chest X-rays with Volumetric Context eess.IV | cs.CVPDF

Zefan Yang, Xinrui Song, Xuanang Xu, Yongyi Shi, Ge Wang

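TL;DR: The paper proposes Xray2Xray, a World Model that learns 3D structural information from 2D X-rays by modeling the transition dynamics between projections at different angles, improving disease diagnosis and risk prediction.

Details

Motivation: Structural superposition in 2D chest X-rays limits precise diagnosis and risk prediction, so extracting 3D structural information from 2D images is needed to improve performance.

Result: The model outperforms supervised and self-supervised baselines for cardiovascular disease risk estimation, is competitive in classifying five pathologies, and can reconstruct volumetric context.

Insight: The work demonstrates the feasibility of learning 3D structural information from 2D medical images, offering a new direction for medical image analysis.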

Abstract: Chest X-rays (CXRs) are the most widely used medical imaging modality and play a pivotal role in diagnosing diseases. However, as 2D projection images, CXRs are limited by structural superposition, which constrains their effectiveness in precise disease diagnosis and risk prediction. To address the limitations of 2D CXRs, this study introduces Xray2Xray, a novel World Model that learns latent representations encoding 3D structural information from chest X-rays. Xray2Xray captures the latent representations of the chest volume by modeling the transition dynamics of X-ray projections across different angular positions with a vision model and a transition model. We employed the latent representations of Xray2Xray for downstream risk prediction and disease diagnosis tasks. Experimental results showed that Xray2Xray outperformed both supervised methods and self-supervised pretraining methods for cardiovascular disease risk estimation and achieved competitive performance in classifying five pathologies in CXRs. We also assessed the quality of Xray2Xray’s latent representations through synthesis tasks and demonstrated that the latent representations can be used to reconstruct volumetric context.


[110] Deformable Medical Image Registration with Effective Anatomical Structure Representation and Divide-and-Conquer Network eess.IV | cs.CVPDF

Xinke Ma, Yongsheng Pan, Qingjie Zeng, Mengkang Lu, Bolysbek Murat Yerzhanuly

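TL;DR: The paper proposes EASR-DCN, an ROI-based medical image registration method that represents images through effective ROIs and aligns them independently with a Divide-and-Conquer Network, significantly improving registration performance.

Details

Motivation: Current unsupervised and weakly supervised medical image registration methods fall short in ROI representation and independent alignment, which limits registration performance.

Result: On three MRI datasets and one CT dataset, EASR-DCN significantly improves the Dice score over VoxelMorph (10.31% for brain MRI, 13.01% for cardiac MRI, and 5.75% for hippocampus MRI).

Insight: Effective representation and independent processing of ROIs are key to improving registration performance; combining an unsupervised approach with a divide-and-conquer strategy reduces reliance on labeled data.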

Abstract: Effective representation of Regions of Interest (ROI) and independent alignment of these ROIs can significantly enhance the performance of deformable medical image registration (DMIR). However, current learning-based DMIR methods have limitations. Unsupervised techniques disregard ROI representation and proceed directly with aligning pairs of images, while weakly-supervised methods heavily depend on label constraints to facilitate registration. To address these issues, we introduce a novel ROI-based registration approach named EASR-DCN. Our method represents medical images through effective ROIs and achieves independent alignment of these ROIs without requiring labels. Specifically, we first used a Gaussian mixture model for intensity analysis to represent images using multiple effective ROIs with distinct intensities. Furthermore, we propose a novel Divide-and-Conquer Network (DCN) to process these ROIs through separate channels to learn feature alignments for each ROI. The resultant correspondences are seamlessly integrated to generate a comprehensive displacement vector field. Extensive experiments were performed on three MRI and one CT datasets to showcase the superior accuracy and deformation reduction efficacy of our EASR-DCN. Compared to VoxelMorph, our EASR-DCN achieved improvements of 10.31% in the Dice score for brain MRI, 13.01% for cardiac MRI, and 5.75% for hippocampus MRI, highlighting its promising potential for clinical applications. The code for this work will be released upon acceptance of the paper.
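
The Dice score used to report these registration gains measures the overlap between a warped segmentation and the target segmentation. A minimal sketch follows, assuming flat binary masks. The function name and the toy masks are illustrative and do not come from the paper.

```python
def dice_score(mask_a, mask_b):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks of equal length."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Two empty masks are treated as a perfect match by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Hypothetical flattened masks: a warped moving-image label vs. the fixed-image label.
warped_label = [1, 1, 1, 0, 0, 0]
fixed_label = [0, 1, 1, 1, 0, 0]
print(f"Dice: {dice_score(warped_label, fixed_label):.3f}")
```

A score of 1 means the two regions coincide exactly, and 0 means they do not overlap at all.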


[111] Explicit Residual-Based Scalable Image Coding for Humans and Machines eess.IV | cs.CVPDF

Yui Tatsumi, Ziyue Zeng, Hiroshi Watanabe

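TL;DR: This paper proposes explicit residual-based scalable image coding methods (FR-ICMH and PR-ICMH) that serve both human and machine vision, improving coding efficiency and interpretability.

Details

Motivation: As images are increasingly consumed by both humans and image recognition models, a scalable image compression method that serves both is needed. Existing methods rely too heavily on the learning capacity of neural networks and neglect architectural design.

Result: Experiments show that PR-ICMH achieves up to 29.57% BD-rate savings over previous work.

Insight: The explicit residual mechanism improves compression performance and also offers a flexible trade-off between encoder complexity and compression efficiency.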

Abstract: Scalable image compression is a technique that progressively reconstructs multiple versions of an image for different requirements. In recent years, images have increasingly been consumed not only by humans but also by image recognition models. This shift has drawn growing attention to scalable image compression methods that serve both machine and human vision (ICMH). Many existing models employ neural network-based codecs, known as learned image compression, and have made significant strides in this field by carefully designing the loss functions. In some cases, however, models are overly reliant on their learning capacity, and their architectural design is not sufficiently considered. In this paper, we enhance the coding efficiency and interpretability of the ICMH framework by integrating an explicit residual compression mechanism, which is commonly employed in resolution scalable coding methods such as JPEG2000. Specifically, we propose two complementary methods: Feature Residual-based Scalable Coding (FR-ICMH) and Pixel Residual-based Scalable Coding (PR-ICMH). These proposed methods are applicable to various machine vision tasks. Moreover, they provide flexibility to choose between encoder complexity and compression performance, making them adaptable to diverse application requirements. Experimental results demonstrate the effectiveness of our proposed methods, with PR-ICMH achieving up to 29.57% BD-rate savings over the previous work.
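
The idea of explicit pixel-residual coding can be illustrated with a toy two-layer scheme: a coarse base layer, plus a residual layer that encodes what the base layer missed. The sketch below uses plain uniform quantizers as stand-ins for the learned codecs. The step sizes, function names, and sample pixels are assumptions for illustration and are not the authors' method.

```python
def quantize(values, step):
    """Uniform quantizer: snap each value to the nearest multiple of step."""
    return [step * round(v / step) for v in values]

def encode_scalable(pixels, base_step=32, residual_step=4):
    """Return a coarse base layer and the quantized residual against it."""
    base = quantize(pixels, base_step)                 # stand-in for the base codec
    residual = [p - b for p, b in zip(pixels, base)]   # explicit pixel residual
    return base, quantize(residual, residual_step)     # stand-in for the enhancement codec

def decode_enhanced(base, residual_q):
    """Enhanced reconstruction: base layer plus decoded residual."""
    return [b + r for b, r in zip(base, residual_q)]

def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical 8-bit pixel row.
pixels = [12, 40, 77, 130, 200, 255]
base, residual_q = encode_scalable(pixels)
enhanced = decode_enhanced(base, residual_q)
print("base error:", max_abs_error(base, pixels))
print("enhanced error:", max_abs_error(enhanced, pixels))
```

A decoder that only needs the coarse version can stop after the base layer, while one that needs higher fidelity also decodes the residual.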


[112] Reconsidering Explicit Longitudinal Mammography Alignment for Enhanced Breast Cancer Risk Prediction eess.IV | cs.CVPDF

Solveig Thrun, Stine Hansen, Zijun Sun, Nele Blum, Suaiba A. Salahuddin

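TL;DR: This paper examines how explicit longitudinal alignment in mammography affects breast cancer risk prediction, compares alignment in input space with alignment in representation space, and concludes that image-level alignment outperforms representation-level alignment.

Details

Motivation: Longitudinal mammography data are important for breast cancer risk prediction, but how best to spatially align exams from different time points to capture tissue changes remains underexplored.

Result: Jointly optimizing alignment and risk prediction leads to a trade-off between alignment quality and predictive performance, while image-level alignment outperforms representation-level alignment in both deformation field quality and risk prediction accuracy.

Insight: Image-level alignment is better suited to longitudinal mammography data, yielding higher-quality deformation fields and better risk prediction, and it points to a direction for future work.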

Abstract: Regular mammography screening is essential for early breast cancer detection. Deep learning-based risk prediction methods have sparked interest to adjust screening intervals for high-risk groups. While early methods focused only on current mammograms, recent approaches leverage the temporal aspect of screenings to track breast tissue changes over time, requiring spatial alignment across different time points. Two main strategies for this have emerged: explicit feature alignment through deformable registration and implicit learned alignment using techniques like transformers, with the former providing more control. However, the optimal approach for explicit alignment in mammography remains underexplored. In this study, we provide insights into where explicit alignment should occur (input space vs. representation space) and if alignment and risk prediction should be jointly optimized. We demonstrate that jointly learning explicit alignment in representation space while optimizing risk estimation performance, as done in the current state-of-the-art approach, results in a trade-off between alignment quality and predictive performance and show that image-level alignment is superior to representation-level alignment, leading to better deformation field quality and enhanced risk prediction accuracy. The code is available at https://github.com/sot176/Longitudinal_Mammogram_Alignment.git.


[113] Filling of incomplete sinograms from sparse PET detector configurations using a residual U-Net eess.IV | cs.CV | physics.med-phPDF

Klara Leffler, Luigi Tommaso Luppino, Samuel Kuttner, Karin Söderkvist, Jan Axelsson

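TL;DR: The paper proposes a method based on a modified Residual U-Net to restore the projection data missing under sparse PET detector configurations, with the goal of reducing the cost of long axial field-of-view PET scanners.

Details

Motivation: Conventional long axial field-of-view PET scanners require densely packed photodetectors and are costly. Sparse detector configurations reduce cost at the expense of image quality. The paper aims to restore the missing data with deep learning.

Result: The model effectively recovers the missing data, with a mean absolute error below two events per pixel. Despite some blurring of fine image detail, it outperforms 2D interpolation.

Insight: Sparse detector configurations combined with deep learning are a viable way to reduce PET scanner cost, supporting the development of cost-effective total-body PET scanners.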

Abstract: Long axial field-of-view PET scanners offer increased field-of-view and sensitivity compared to traditional PET scanners. However, a significant cost is associated with the densely packed photodetectors required for the extended-coverage systems, limiting clinical utilisation. To mitigate the cost limitations, alternative sparse system configurations have been proposed, allowing an extended field-of-view PET design with detector costs similar to a standard PET system, albeit at the expense of image quality. In this work, we propose a deep sinogram restoration network to fill in the missing sinogram data. Our method utilises a modified Residual U-Net, trained on clinical PET scans from a GE Signa PET/MR, simulating the removal of 50% of the detectors in a chessboard pattern (retaining only 25% of all lines of response). The model successfully recovers missing counts, with a mean absolute error below two events per pixel, outperforming 2D interpolation in both sinogram and reconstructed image domain. Notably, the predicted sinograms exhibit a smoothing effect, leading to reconstructed images lacking sharpness in finer details. Despite these limitations, the model demonstrates a substantial capacity for compensating for the undersampling caused by the sparse detector configuration. This proof-of-concept study suggests that sparse detector configurations, combined with deep learning techniques, offer a viable alternative to conventional PET scanner designs. This approach supports the development of cost-effective, total body PET scanners, allowing a significant step forward in medical imaging technology.
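
The statement that removing 50% of the detectors leaves only about 25% of the lines of response follows from the fact that a line of response needs both of its endpoint detectors. The sketch below checks this by counting pairs on a small detector grid. The grid size is hypothetical, and treating every pair of distinct detectors as a valid line of response is a simplifying assumption, since a real scanner only accepts a subset of pairs.

```python
from itertools import combinations

def retained_fractions(n_rings, n_crystals):
    """Return (detector fraction kept, line-of-response fraction kept)
    after removing detectors in a chessboard pattern."""
    detectors = [(ring, crystal)
                 for ring in range(n_rings)
                 for crystal in range(n_crystals)]
    # Chessboard pattern: keep detectors whose ring + crystal index is even.
    kept = {d for d in detectors if (d[0] + d[1]) % 2 == 0}
    total_pairs = 0
    kept_pairs = 0
    for a, b in combinations(detectors, 2):
        total_pairs += 1
        if a in kept and b in kept:   # both endpoints must survive
            kept_pairs += 1
    return len(kept) / len(detectors), kept_pairs / total_pairs

# Hypothetical geometry: 8 rings of 32 crystals each.
detector_fraction, lor_fraction = retained_fractions(8, 32)
print(f"detectors kept: {detector_fraction:.1%}")
print(f"lines of response kept: {lor_fraction:.1%}")
```

With half of the detectors kept, the pair count comes out slightly under one quarter, in line with the 25% figure quoted in the abstract.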


[114] NeRF-based CBCT Reconstruction needs Normalization and Initialization eess.IV | cs.AI | cs.CVPDF

Zhuowei Xu, Han Li, Dai Sun, Zhicheng Li, Yujia Li

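TL;DR: The paper proposes a Normalized Hash Encoder and a Mapping Consistency Initialization strategy to address the local-global training mismatch in NeRF-based CBCT reconstruction, improving training stability and reconstruction quality.

Details

Motivation: In NeRF-based CBCT reconstruction, the mismatch between the locally sparse training of the hash encoder and the globally dense training of the neural network misaligns features, which harms training stability and reconstruction quality.

Result: The method is validated on 128 CT cases from 4 datasets, substantially improving training efficiency and reconstruction performance.

Insight: The local-global training mismatch is a key problem for NeRF-based CBCT reconstruction, and simple normalization and initialization strategies substantially alleviate it.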

Abstract: Cone Beam Computed Tomography (CBCT) is widely used in medical imaging. However, the limited number and intensity of X-ray projections make reconstruction an ill-posed problem with severe artifacts. NeRF-based methods have achieved great success in this task. However, they suffer from a local-global training mismatch between their two key components: the hash encoder and the neural network. Specifically, in each training step, only a subset of the hash encoder’s parameters is used (local sparse), whereas all parameters in the neural network participate (global dense). Consequently, hash features generated in each step are highly misaligned, as they come from different subsets of the hash encoder. These misalignments from different training steps are then fed into the neural network, causing repeated inconsistent global updates in training, which leads to unstable training, slower convergence, and degraded reconstruction quality. Aiming to alleviate the impact of this local-global optimization mismatch, we introduce a Normalized Hash Encoder, which enhances feature consistency and mitigates the mismatch. Additionally, we propose a Mapping Consistency Initialization (MCI) strategy that initializes the neural network before training by leveraging the global mapping property from a well-trained model. The initialized neural network exhibits improved stability during early training, enabling faster convergence and enhanced reconstruction performance. Our method is simple yet effective, requiring only a few lines of code while substantially improving training efficiency on 128 CT cases collected from 4 different datasets, covering 7 distinct anatomical regions.


cs.GR [Back]

[115] SOF: Sorted Opacity Fields for Fast Unbounded Surface Reconstruction cs.GR | cs.CVPDF

Lukas Radl, Felix Windisch, Thomas Deixelberger, Jozef Hladky, Michael Steiner

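TL;DR: SOF is a surface reconstruction method for unbounded scenes based on 3D Gaussian representations. It combines hierarchical resorting and a robust definition of Gaussian depth with level-set regularization and a parallelized Marching Tetrahedra algorithm, significantly improving reconstruction accuracy and efficiency.

Details

Motivation: Current reconstruction methods based on 3D Gaussians struggle to extract accurate surfaces, especially in large-scale unbounded environments, because of inaccurate depth estimates and artifacts introduced by sorting heuristics.

Result: SOF achieves higher reconstruction accuracy than existing methods while cutting total processing time by more than a factor of three.

Insight: By combining efficient rendering techniques with geometry extraction, SOF offers a new solution for efficient, high-accuracy reconstruction of large unbounded scenes.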

Abstract: Recent advances in 3D Gaussian representations have significantly improved the quality and efficiency of image-based scene reconstruction. Their explicit nature facilitates real-time rendering and fast optimization, yet extracting accurate surfaces - particularly in large-scale, unbounded environments - remains a difficult task. Many existing methods rely on approximate depth estimates and global sorting heuristics, which can introduce artifacts and limit the fidelity of the reconstructed mesh. In this paper, we present Sorted Opacity Fields (SOF), a method designed to recover detailed surfaces from 3D Gaussians with both speed and precision. Our approach improves upon prior work by introducing hierarchical resorting and a robust formulation of Gaussian depth, which better aligns with the level-set. To enhance mesh quality, we incorporate a level-set regularizer operating on the opacity field and introduce losses that encourage geometrically-consistent primitive shapes. In addition, we develop a parallelized Marching Tetrahedra algorithm tailored to our opacity formulation, reducing meshing time by up to an order of magnitude. As demonstrated by our quantitative evaluation, SOF achieves higher reconstruction accuracy while cutting total processing time by more than a factor of three. These results mark a step forward in turning efficient Gaussian-based rendering into equally efficient geometry extraction.


[116] Virtual Memory for 3D Gaussian Splatting cs.GR | cs.CV | cs.HCPDF

Jonathan Haberl, Philipp Fleck, Clemens Arth

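TL;DR: The paper proposes using virtual memory techniques to optimize 3D Gaussian Splatting rendering, dynamically loading only the visible Gaussians to reduce memory usage and accelerate rendering.

Details

Motivation: 3D Gaussian Splatting performs well for scene reconstruction, but memory and rendering speed become bottlenecks as scene complexity grows. The paper addresses this with virtual memory techniques.

Result: Memory usage is reduced and rendering is accelerated, especially for complex scenes. The method is evaluated on both desktop and mobile devices.

Insight: Virtual memory techniques show promise for 3D rendering and can effectively address memory and performance problems in complex scenes.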

Abstract: 3D Gaussian Splatting represents a breakthrough in the field of novel view synthesis. It establishes Gaussians as core rendering primitives for highly accurate real-world environment reconstruction. Recent advances have drastically increased the size of scenes that can be created. In this work, we present a method for rendering large and complex 3D Gaussian Splatting scenes using virtual memory. By leveraging well-established virtual memory and virtual texturing techniques, our approach efficiently identifies visible Gaussians and dynamically streams them to the GPU just in time for real-time rendering. Selecting only the necessary Gaussians for both storage and rendering results in reduced memory usage and effectively accelerates rendering, especially for highly complex scenes. Furthermore, we demonstrate how level of detail can be integrated into our proposed method to further enhance rendering speed for large-scale scenes. With an optimized implementation, we highlight key practical considerations and thoroughly evaluate the proposed technique and its impact on desktop and mobile devices.


[117] Uncovering Conceptual Blindspots in Generative Image Models Using Sparse Autoencoders cs.GR | cs.AI | cs.CVPDF

Matyas Bohacek, Thomas Fel, Maneesh Agrawala, Ekdeep Singh Lubana

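TL;DR: The paper proposes a systematic approach for identifying and quantifying “conceptual blindspots” in generative image models, that is, concepts present in the training data but absent or misrepresented in a model's generations. It uses sparse autoencoders (SAEs) to extract interpretable concept embeddings for quantitative analysis.

Details

Motivation: Generative image models perform poorly on seemingly simple concepts (such as human hands or objects in groups of four), but it is unclear whether these failures are idiosyncratic anomalies or structural limitations. The paper aims to identify and quantify these conceptual blindspots systematically.

Result: 1. The approach reveals specific concepts that are suppressed (e.g., bird feeders, DVD discs) or exaggerated (e.g., wood textures, palm trees) in generative models. 2. At the individual datapoint level, it isolates memorization artifacts, where models reproduce visual templates seen during training.

Insight: Conceptual blindspots reflect the gap between a generative model and the real data-generating process, and SAE-derived concept embeddings provide a new tool for model interpretability.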

Abstract: Despite their impressive performance, generative image models trained on large-scale datasets frequently fail to produce images with seemingly simple concepts – e.g., human hands or objects appearing in groups of four – that are reasonably expected to appear in the training data. These failure modes have largely been documented anecdotally, leaving open the question of whether they reflect idiosyncratic anomalies or more structural limitations of these models. To address this, we introduce a systematic approach for identifying and characterizing “conceptual blindspots” – concepts present in the training data but absent or misrepresented in a model’s generations. Our method leverages sparse autoencoders (SAEs) to extract interpretable concept embeddings, enabling a quantitative comparison of concept prevalence between real and generated images. We train an archetypal SAE (RA-SAE) on DINOv2 features with 32,000 concepts – the largest such SAE to date – enabling fine-grained analysis of conceptual disparities. Applied to four popular generative models (Stable Diffusion 1.5/2.1, PixArt, and Kandinsky), our approach reveals specific suppressed blindspots (e.g., bird feeders, DVD discs, and whitespaces on documents) and exaggerated blindspots (e.g., wood background texture and palm trees). At the individual datapoint level, we further isolate memorization artifacts – instances where models reproduce highly specific visual templates seen during training. Overall, we propose a theoretically grounded framework for systematically identifying conceptual blindspots in generative models by assessing their conceptual fidelity with respect to the underlying data-generating process.


cs.AI [Back]

[118] A Comment On “The Illusion of Thinking”: Reframing the Reasoning Cliff as an Agentic Gap cs.AI | cs.CL | cs.LGPDF

Sheraz Khan, Subha Madhavan, Kannan Natarajan

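TL;DR: This commentary argues that the performance collapse of reasoning models observed by Shojaee et al. stems not from an intrinsic capability limit but from constraints of the experimental design. With tool use, models can solve complex problems that they could not handle under the original setup.

Details

Motivation: To question whether the reasoning cliff reported by Shojaee et al. truly reflects an intrinsic limit on models' reasoning ability rather than limitations of the experimental design.

Result: With tool support, models go beyond the original “reasoning cliff”, solve problems of higher complexity, and exhibit a hierarchy of agentic reasoning.

Insight: Models' “thinking” ability may be underestimated; the bottleneck lies in the lack of tools for execution rather than in a deficit of core reasoning ability.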

Abstract: The recent work by Shojaee et al. (2025), titled The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, presents a compelling empirical finding, a reasoning cliff, where the performance of Large Reasoning Models (LRMs) collapses beyond a specific complexity threshold, which the authors posit as an intrinsic scaling limitation of Chain-of-Thought (CoT) reasoning. This commentary, while acknowledging the study’s methodological rigor, contends that this conclusion is confounded by experimental artifacts. We argue that the observed failure is not evidence of a fundamental cognitive boundary, but rather a predictable outcome of system-level constraints in the static, text-only evaluation paradigm, including tool use restrictions, context window recall issues, the absence of crucial cognitive baselines, inadequate statistical reporting, and output generation limits. We reframe this performance collapse through the lens of an agentic gap, asserting that the models are not failing at reasoning, but at execution within a profoundly restrictive interface. We empirically substantiate this critique by demonstrating a striking reversal. A model, initially declaring a puzzle impossible when confined to text-only generation, now employs agentic tools to not only solve it but also master variations of complexity far beyond the reasoning cliff it previously failed to surmount. Additionally, our empirical analysis of tool-enabled models like o4-mini and GPT-4o reveals a hierarchy of agentic reasoning, from simple procedural execution to complex meta-cognitive self-correction, which has significant implications for how we define and measure machine intelligence. The illusion of thinking attributed to LRMs is less a reasoning deficit and more a consequence of an otherwise capable mind lacking the tools for action.


[119] Bayesian Evolutionary Swarm Architecture: A Formal Epistemic System Grounded in Truth-Based Competition cs.AI | cs.CL | cs.GT | math.LO | 68T05, 68Q87, 03E20 | I.2.6; I.2.3; F.1.1PDF

Craig Steven Wright

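TL;DR: The paper proposes an AI system framework grounded in Bayesian inference and population dynamics, in which probabilistic agents evolve through structured competition and belief revision. Agent fitness is defined as a function of alignment with a fixed external ground truth, and competition drives the evolution of knowledge.

Details

Motivation: To design an AI system that evolves through competition and belief revision, with verifiable knowledge as the evolutionary target, in order to achieve robustness and convergence.

Result: The authors report that the system establishes truth as an evolutionary attractor: adversarial epistemic pressure drives the emergence of verifiable knowledge, while the system remains computable and self-regulating.

Insight: The key finding is that truth can emerge from competition and belief revision among agents, and that formal theorems are given for convergence and robustness.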

Abstract: We introduce a mathematically rigorous framework for an artificial intelligence system composed of probabilistic agents evolving through structured competition and belief revision. The architecture, grounded in Bayesian inference, measure theory, and population dynamics, defines agent fitness as a function of alignment with a fixed external oracle representing ground truth. Agents compete in a discrete-time environment, adjusting posterior beliefs through observed outcomes, with higher-rated agents reproducing and lower-rated agents undergoing extinction. Ratings are updated via pairwise truth-aligned utility comparisons, and belief updates preserve measurable consistency and stochastic convergence. We introduce hash-based cryptographic identity commitments to ensure traceability, alongside causal inference operators using do-calculus. Formal theorems on convergence, robustness, and evolutionary stability are provided. The system establishes truth as an evolutionary attractor, demonstrating that verifiable knowledge arises from adversarial epistemic pressure within a computable, self-regulating swarm.


[120] Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs cs.AI | cs.CLPDF

Liang Zeng, Yongcong Li, Yuzhen Xiao, Changshi Li, Chris Yuhao Liu

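TL;DR: The paper presents an automated data-curation pipeline for scaling software engineering (SWE) datasets and shows experimentally that LLM performance improves with data scale.

Details

Motivation: SWE needs large, diverse datasets to evaluate LLM capabilities, but existing datasets are small because of the cost of manual annotation and runtime-environment setup.

Result: The model reaches 38.0% pass@1 accuracy on the SWE-bench Verified benchmark without verifiers or multiple rollouts, rising to 47.0% with test-time scaling.

Insight: Continued scaling of data yields clear gains for LLMs on SWE tasks, with no sign of saturation observed.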

Abstract: Software engineering (SWE) has recently emerged as a crucial testbed for next-generation LLM agents, demanding inherent capabilities in two critical dimensions: sustained iterative problem-solving (e.g., >50 interaction rounds) and long-context dependency resolution (e.g., >32k tokens). However, the data curation process in SWE remains notoriously time-consuming, as it heavily relies on manual annotation for code file filtering and the setup of dedicated runtime environments to execute and validate unit tests. Consequently, most existing datasets are limited to only a few thousand GitHub-sourced instances. To this end, we propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets. Our dataset comprises 10,169 real-world Python task instances from 2,531 distinct GitHub repositories, each accompanied by a task specified in natural language and a dedicated runtime-environment image for automated unit-test validation. We have carefully curated over 8,000 successfully runtime-validated training trajectories from our proposed SWE dataset. When fine-tuning the Skywork-SWE model on these trajectories, we uncover a striking data scaling phenomenon: the trained model’s performance for software engineering capabilities in LLMs continues to improve as the data size increases, showing no signs of saturation. Notably, our Skywork-SWE model achieves 38.0% pass@1 accuracy on the SWE-bench Verified benchmark without using verifiers or multiple rollouts, establishing a new state-of-the-art (SOTA) among the Qwen2.5-Coder-32B-based LLMs built on the OpenHands agent framework. Furthermore, with the incorporation of test-time scaling techniques, the performance further improves to 47.0% accuracy, surpassing the previous SOTA results for sub-32B parameter models. We release the Skywork-SWE-32B model checkpoint to accelerate future research.
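
For readers unfamiliar with the pass@1 metric: with a single rollout per task it is simply the fraction of tasks whose tests pass. The sketch below shows the commonly used unbiased pass@k estimator, which reduces to that fraction when each task has one sample. The abstract does not say which estimator the authors used, so the functions and the sample results here are illustrative assumptions.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k for one task: n samples drawn, c of them correct.
    Equals 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k):
    """Average pass@k over tasks; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical benchmark: five tasks, one rollout each, two resolved.
single_rollout_results = [(1, 1), (1, 0), (1, 1), (1, 0), (1, 0)]
print(f"pass@1: {benchmark_pass_at_k(single_rollout_results, 1):.1%}")
```

Test-time scaling, as mentioned in the abstract, typically draws several rollouts per task and selects among them, which is why the reported accuracy can rise above the single-rollout figure.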


[121] KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality cs.AI | cs.CL | cs.CV | cs.LG | cs.MAPDF

Baochang Ren, Shuofei Qiao, Wenhao Yu, Huajun Chen, Ningyu Zhang

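TL;DR: KnowRL proposes a knowledge-enhanced reinforcement learning method that adds a factuality reward based on knowledge verification, reducing hallucination in slow-thinking models while preserving their reasoning ability.

Details

Motivation: Large language models (LLMs) often hallucinate severely during slow thinking because they cannot accurately recognize their knowledge boundaries, and the reward mechanism of reinforcement learning (RL) lacks factual supervision over the reasoning process, which exacerbates the problem.

Result: On three hallucination evaluation datasets and two reasoning evaluation datasets, KnowRL effectively reduces hallucination while maintaining reasoning performance.

Insight: Introducing a knowledge-verification reward into RL training directly optimizes the reasoning process and thereby improves factuality and reliability.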

Abstract: Large Language Models (LLMs), particularly slow-thinking models, often exhibit severe hallucination, outputting incorrect content due to an inability to accurately recognize knowledge boundaries during reasoning. While Reinforcement Learning (RL) can enhance complex reasoning abilities, its outcome-oriented reward mechanism often lacks factual supervision over the thinking process, further exacerbating the hallucination problem. To address the high hallucination in slow-thinking models, we propose Knowledge-enhanced RL, KnowRL. KnowRL guides models to perform fact-based slow thinking by integrating a factuality reward, based on knowledge verification, into the RL training process, helping them recognize their knowledge boundaries. This targeted factual input during RL training enables the model to learn and internalize fact-based reasoning strategies. By directly rewarding adherence to facts within the reasoning steps, KnowRL fosters a more reliable thinking process. Experimental results on three hallucination evaluation datasets and two reasoning evaluation datasets demonstrate that KnowRL effectively mitigates hallucinations in slow-thinking models while maintaining their original strong reasoning capabilities. Our code is available at https://github.com/zjunlp/KnowRL.


[122] Evaluating Compliance with Visualization Guidelines in Diagrams for Scientific Publications Using Large Vision Language Models cs.AI | cs.CLPDF

Johannes Rückert, Louise Bloch, Christoph M. Friedrich

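TL;DR: The paper proposes using large vision language models (VLMs) to analyze whether diagrams in scientific publications comply with data visualization guidelines and evaluates the approach experimentally across several tasks.

Details

Motivation: Diagrams in scientific publications often violate visualization guidelines, leading to inaccurate or incomplete information, yet automated tools for detecting such problems are lacking.

Result: The VLMs perform well at analyzing diagram types, 3D effects, and axis labels, among others, but poorly at assessing image quality and tick marks/labels.

Insight: Large VLMs can partly replace manual checking of diagrams, though there is room for improvement, and the approach can be extended to further aspects of data visualization.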

Abstract: Diagrams are widely used to visualize data in publications. The research field of data visualization deals with defining principles and guidelines for the creation and use of these diagrams, which are often not known or adhered to by researchers, leading to misinformation caused by providing inaccurate or incomplete information. In this work, large Vision Language Models (VLMs) are used to analyze diagrams in order to identify potential problems with regard to selected data visualization principles and guidelines. To determine the suitability of VLMs for these tasks, five open source VLMs and five prompting strategies are compared using a set of questions derived from selected data visualization guidelines. The results show that the employed VLMs work well to accurately analyze diagram types (F1-score 82.49 %), 3D effects (F1-score 98.55 %), axes labels (F1-score 76.74 %), lines (RMSE 1.16), colors (RMSE 1.60) and legends (F1-score 96.64 %, RMSE 0.70), while they cannot reliably provide feedback about the image quality (F1-score 0.74 %) and tick marks/labels (F1-score 46.13 %). Among the employed VLMs, Qwen2.5VL performs best, and the summarizing prompting strategy performs best for most of the experimental questions. It is shown that VLMs can be used to automatically identify a number of potential issues in diagrams, such as missing axes labels, missing legends, and unnecessary 3D effects. The approach laid out in this work can be extended to further aspects of data visualization.


eess.AS [Back]

[123] Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation eess.AS | cs.AI | cs.CL | cs.SDPDF

Jun Wang, Xijuan Zeng, Chunyu Qiang, Ruilong Chen, Shiyao Wang

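TL;DR: Kling-Foley is a multimodal video-to-audio generation model that uses multimodal diffusion transformers together with a visual semantic representation module to strengthen audio-visual synchronization and semantic alignment. Combined with a universal audio codec and stereo rendering, it significantly improves the quality and spatial presence of the generated audio.

Details

Motivation: Existing video-to-audio generation methods still fall short in semantic alignment, audio-visual synchronization, and audio quality. Kling-Foley aims to address these issues with multimodal modeling and dedicated alignment modules.

Result: Experiments show that Kling-Foley achieves SOTA performance among public models in distribution matching, semantic alignment, temporal alignment, and audio quality.

Insight: Multimodal modeling and high-precision alignment modules are key to improving video-to-audio generation quality. The universal audio codec and stereo rendering provide flexibility across diverse scenarios.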

Abstract: We propose Kling-Foley, a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content. In Kling-Foley, we introduce multimodal diffusion transformers to model the interactions between video, audio, and text modalities, and combine them with a visual semantic representation module and an audio-visual synchronization module to enhance alignment capabilities. Specifically, these modules align video conditions with latent audio elements at the frame level, thereby improving semantic alignment and audio-visual synchronization. Together with text conditions, this integrated approach enables precise generation of video-matching sound effects. In addition, we propose a universal latent audio codec that can achieve high-quality modeling in various scenarios such as sound effects, speech, singing, and music. We employ a stereo rendering method that imbues synthesized audio with a spatial presence. At the same time, to compensate for the incomplete categories and annotations of existing open-source benchmarks, we also open-source an industrial-level benchmark, Kling-Audio-Eval. Our experiments show that Kling-Foley, trained with the flow matching objective, achieves new audio-visual SOTA performance among public models in terms of distribution matching, semantic alignment, temporal alignment and audio quality.