cs.CV

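[1] LENS: Learning to Segment Anything with Unified Reinforced Reasoning cs.CV | cs.AI

Lianghui Zhu, Bin Ouyang, Yuxuan Zhang, Tianheng Cheng, Rui Hu

TL;DR: LENS is a reinforcement-learning framework that jointly optimizes segmentation and the reasoning process. Unified rewards spanning sentence-, box-, and segment-level cues encourage informative chain-of-thought (CoT) rationales and improve mask quality.

Details

Motivation: Existing supervised fine-tuning methods ignore explicit CoT reasoning at test time, which limits their ability to generalize to unseen prompts and domains.

Result: LENS reaches an average cIoU of 81.2% on RefCOCO, RefCOCO+, and RefCOCOg, outperforming GLaMM by up to 5.6%.

Insight: RL-driven CoT reasoning acts as a robust prior that makes text-prompted segmentation models more generalizable.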

Abstract: Text-prompted image segmentation enables fine-grained visual understanding and is critical for applications such as human-computer interaction and robotics. However, existing supervised fine-tuning methods typically ignore explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize to unseen prompts and domains. To address this issue, we introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. We propose unified reinforcement-learning rewards that span sentence-, box-, and segment-level cues, encouraging the model to generate informative CoT rationales while refining mask quality. Using a publicly available 3-billion-parameter vision-language model, i.e., Qwen2.5-VL-3B-Instruct, LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method, i.e., GLaMM, by up to 5.6%. These results demonstrate that RL-driven CoT reasoning serves as a robust prior for text-prompted segmentation and offers a practical path toward more generalizable Segment Anything models. Code is available at https://github.com/hustvl/LENS.
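For readers unfamiliar with the metric, cIoU (cumulative IoU) as commonly reported on RefCOCO-style benchmarks divides the total intersection area by the total union area over the whole evaluation set, instead of averaging per-sample IoU. The sketch below illustrates that definition on toy binary masks; the function name and the list-of-rows mask format are illustrative assumptions and are not taken from the LENS code base.

```python
def cumulative_iou(pred_masks, gt_masks):
    """cIoU: summed intersection divided by summed union over all samples.

    Each mask is a list of rows of 0/1 values (hypothetical toy format).
    """
    total_inter = 0
    total_union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                total_inter += 1 if (p and g) else 0
                total_union += 1 if (p or g) else 0
    # An empty evaluation set (no foreground anywhere) is scored as 0.0 here.
    return total_inter / total_union if total_union else 0.0
```

Because the sums are pooled before dividing, large objects weigh more than small ones, which is one reason cIoU and mean per-sample IoU can differ on the same predictions.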


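[2] RynnEC: Bringing MLLMs into Embodied World cs.CV | cs.AI | cs.RO

Ronghao Dang, Yuqian Yuan, Yunxuan Mao, Kehan Li, Jiangpin Liu

TL;DR: RynnEC is a video multimodal large language model for embodied cognition. A region encoder and a mask decoder enable flexible region-level video interaction, and despite its compact architecture it achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning.

Details

Motivation: Embodied agents need fine-grained perception of, and precise interaction with, the physical world, while annotated 3D data remain scarce.

Result: State-of-the-art performance on object property understanding, object segmentation, and spatial reasoning.

Insight: A region-centric video paradigm gives embodied agents finer-grained perception and supports the development of general-purpose cognitive cores.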

Abstract: We introduce RynnEC, a video multimodal large language model designed for embodied cognition. Built upon a general-purpose vision-language foundation model, RynnEC incorporates a region encoder and a mask decoder, enabling flexible region-level video interaction. Despite its compact architecture, RynnEC achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning. Conceptually, it offers a region-centric video paradigm for the brain of embodied agents, providing fine-grained perception of the physical world and enabling more precise interactions. To mitigate the scarcity of annotated 3D datasets, we propose an egocentric video based pipeline for generating embodied cognition data. Furthermore, we introduce RynnEC-Bench, a region-centered benchmark for evaluating embodied cognitive capabilities. We anticipate that RynnEC will advance the development of general-purpose cognitive cores for embodied agents and facilitate generalization across diverse embodied tasks. The code, model checkpoints, and benchmark are available at: https://github.com/alibaba-damo-academy/RynnEC


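[3] Local Scale Equivariance with Latent Deep Equilibrium Canonicalizer cs.CV | cs.GR | cs.LG

Md Ashiqur Rahman, Chiao-An Yang, Michael N. Cheng, Lim Jun Hao, Jeremiah Jiang

TL;DR: The paper proposes a deep equilibrium canonicalizer (DEC) that improves a model's local scale equivariance, addressing scale variation in computer vision. DEC is easy to integrate into existing networks and improves both performance and local scale consistency on the ImageNet benchmark.

Details

Motivation: Objects of the same class can appear at different scales because of their distance from the camera or their own size, and this variation is local to each object. Existing methods handle such local scale variation poorly, so a method that improves local scale equivariance is needed.

Result: On ImageNet, DEC improves the performance and local scale consistency of pre-trained models such as ViT, DeiT, Swin, and BEiT.

Insight: DEC can be incorporated into existing architectures and adapted to pre-trained models to make them more robust to local scale variation, which suggests that canonicalizers are a promising route to scale equivariance.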

Abstract: Scale variation is a fundamental challenge in computer vision. Objects of the same class can have different sizes, and their perceived size is further affected by the distance from the camera. These variations are local to the objects, i.e., different object sizes may change differently within the same image. To effectively handle scale variations, we present a deep equilibrium canonicalizer (DEC) to improve the local scale equivariance of a model. DEC can be easily incorporated into existing network architectures and can be adapted to a pre-trained model. Notably, we show that on the competitive ImageNet benchmark, DEC improves both model performance and local scale consistency across four popular pre-trained deep-nets, e.g., ViT, DeiT, Swin, and BEiT. Our code is available at https://github.com/ashiq24/local-scale-equivariance.


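[4] CLIPSym: Delving into Symmetry Detection with CLIP cs.CV

Tinghan Yang, Md Ashiqur Rahman, Raymond A. Yeh

TL;DR: CLIPSym combines CLIP's image and language encoders with a rotation-equivariant decoder and uses symmetry-related prompts to detect rotation and reflection symmetries in images. A new prompting technique, SAPG, improves performance, and experiments show that CLIPSym outperforms existing methods on several datasets.

Details

Motivation: Symmetry is a fundamental geometric cue in computer vision, yet detecting it remains challenging. The authors investigate whether a pre-trained CLIP model can improve detection by exploiting the symmetry cues found in natural image descriptions.

Result: CLIPSym outperforms existing methods on the DENDI, SDRW, and LDRS datasets. Ablations confirm the benefits of CLIP pre-training, the equivariant decoder, and SAPG.

Insight: Pre-trained vision-language models capture symmetry cues effectively, and aggregating a diverse set of prompts is important for the performance gain.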

Abstract: Symmetry is one of the most fundamental geometric cues in computer vision, and detecting it has been an ongoing challenge. With the recent advances in vision-language models, i.e., CLIP, we investigate whether a pre-trained CLIP model can aid symmetry detection by leveraging the additional symmetry cues found in the natural image descriptions. We propose CLIPSym, which leverages CLIP’s image and language encoders and a rotation-equivariant decoder based on a hybrid of Transformer and $G$-Convolution to detect rotation and reflection symmetries. To fully utilize CLIP’s language encoder, we have developed a novel prompting technique called Semantic-Aware Prompt Grouping (SAPG), which aggregates a diverse set of frequent object-based prompts to better integrate the semantic cues for symmetry detection. Empirically, we show that CLIPSym outperforms the current state-of-the-art on three standard symmetry detection datasets (DENDI, SDRW, and LDRS). Finally, we conduct detailed ablations verifying the benefits of CLIP’s pre-training, the proposed equivariant decoder, and the SAPG technique. The code is available at https://github.com/timyoung2333/CLIPSym.


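[5] A Survey on Video Anomaly Detection via Deep Learning: Human, Vehicle, and Environment cs.CV | cs.AI

Ghazal Alinezhad Noghre, Armin Danesh Pazho, Hamed Tabkhi

TL;DR: This survey systematically organizes deep-learning research on video anomaly detection (VAD), covering work at different supervision levels as well as adaptive learning methods. It analyzes three application categories (human-, vehicle-, and environment-centric) and identifies the contributions and limitations of current methods.

Details

Motivation: VAD is an important task in computer vision. Deep learning has driven progress, but the research remains fragmented and lacks systematic consolidation. The survey aims to give the community a structured overview that advances both theory and real-world application.

Result: The survey summarizes the state of VAD research across application scenarios, clarifies the strengths and weaknesses of current methods, and points to directions for future work.

Insight: VAD research needs to draw further on insights from multiple subfields, address the challenges of real-world deployment, and pay attention to the generalization and real-time performance of algorithms.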

Abstract: Video Anomaly Detection (VAD) has emerged as a pivotal task in computer vision, with broad relevance across multiple fields. Recent advances in deep learning have driven significant progress in this area, yet the field remains fragmented across domains and learning paradigms. This survey offers a comprehensive perspective on VAD, systematically organizing the literature across various supervision levels, as well as adaptive learning methods such as online, active, and continual learning. We examine the state of VAD across three major application categories: human-centric, vehicle-centric, and environment-centric scenarios, each with distinct challenges and design considerations. In doing so, we identify fundamental contributions and limitations of current methodologies. By consolidating insights from subfields, we aim to provide the community with a structured foundation for advancing both theoretical understanding and real-world applicability of VAD systems. This survey aims to support researchers by providing a useful reference, while also drawing attention to the broader set of open challenges in anomaly detection, including both fundamental research questions and practical obstacles to real-world deployment.


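[6] Accelerating Image Classification with Graph Convolutional Neural Networks using Voronoi Diagrams cs.CV | cs.LG

Mustafa Mohammadi Gharasuie, Luis Rueda

TL;DR: The paper proposes an image classification method that combines Voronoi diagrams with graph convolutional networks (GCNs). Images are represented as graphs, and a normalized Voronoi Graph Convolution Network (NVGCN) is introduced that improves both pre-processing time and classification accuracy.

Details

Motivation: Conventional convolutional neural networks (CNNs) perform poorly on complex scenes and fine-grained classification. The authors propose representing relationships within an image as a graph and using the geometric properties of Voronoi diagrams to improve computational efficiency.

Result: Experiments show that the method beats existing models in pre-processing time and classification accuracy, especially on complex scenes and fine-grained categories.

Insight: Combining graph structure with geometric partitioning (such as Voronoi diagrams) offers a new approach to image classification, and the NVGCN design may extend to other tasks on non-structured data.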

Abstract: Recent advances in image classification have been significantly propelled by the integration of Graph Convolutional Networks (GCNs), offering a novel paradigm for handling complex data structures. This study introduces an innovative framework that employs GCNs in conjunction with Voronoi diagrams to perform image classification, leveraging their exceptional capability to model relational data. Unlike conventional convolutional neural networks, our approach utilizes a graph-based representation of images, where pixels or regions are treated as vertices of a graph, which are then simplified in the form of the corresponding Delaunay triangulations. Our model yields significant improvement in pre-processing time and classification accuracy on several benchmark datasets, surpassing existing state-of-the-art models, especially in scenarios that involve complex scenes and fine-grained categories. The experimental results, validated via cross-validation, underscore the potential of integrating GCNs with Voronoi diagrams in advancing image classification tasks. This research contributes to the field by introducing a novel approach to image classification, while opening new avenues for developing graph-based learning paradigms in other domains of computer vision and non-structured data. In particular, we have proposed a new version of the GCN in this paper, namely normalized Voronoi Graph Convolution Network (NVGCN), which is faster than the regular GCN.


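[7] Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models cs.CV

Thanh-Dat Truong, Huu-Thien Tran, Tran Thai Son, Bhiksha Raj, Khoa Luu

TL;DR: The paper proposes Directed-Tokens, a multimodal alignment method that improves the robustness and generalization of large language-vision models by having them reconstruct the order of shuffled images and text.

Details

Motivation: Existing large multimodal models (LMMs) are limited in how robustly they align and correlate visual and textual features, which hurts generalization and reasoning.

Result: The method achieves state-of-the-art performance on academic task-oriented and instruction-following LMM benchmarks.

Insight: Order-reconstruction tasks together with the directed-token design improve multimodal alignment and visual understanding while making the model more robust.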

Abstract: Large multimodal models (LMMs) have gained impressive performance due to their outstanding capability in various understanding tasks. However, these models still suffer from some fundamental limitations related to robustness and generalization due to the alignment and correlation between visual and textual features. In this paper, we introduce a simple but efficient learning mechanism for improving the robust alignment between visual and textual modalities by solving shuffling problems. In particular, the proposed approach can improve reasoning capability, visual understanding, and cross-modality alignment by introducing two new tasks: reconstructing the image order and the text order into the LMM’s pre-training and fine-tuning phases. In addition, we propose a new directed-token approach to capture visual and textual knowledge, enabling the capability to reconstruct the correct order of visual inputs. Then, we introduce a new Image-to-Response Guided loss to further improve the visual understanding of the LMM in its responses. The proposed approach consistently achieves state-of-the-art (SoTA) performance compared with prior LMMs on academic task-oriented and instruction-following LMM benchmarks.
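The order-reconstruction idea can be pictured as a data-preparation step: shuffle a sequence of items (image patches or text segments) and keep, as the training target, the index sequence that restores the original order. The following is a minimal sketch under that assumption only; the helper names and the target encoding are hypothetical and do not reproduce the paper's implementation.

```python
import random


def make_shuffle_task(items, seed=0):
    """Return (shuffled_items, target), where target[i] is the original
    position of shuffled_items[i]."""
    rng = random.Random(seed)
    order = list(range(len(items)))
    rng.shuffle(order)
    shuffled = [items[i] for i in order]
    return shuffled, order


def restore(shuffled, target):
    """Undo the shuffle given the (predicted or ground-truth) positions."""
    restored = [None] * len(shuffled)
    for item, pos in zip(shuffled, target):
        restored[pos] = item
    return restored
```

A model trained on such pairs would be asked to predict `target` from `shuffled`; applying `restore` to a correct prediction recovers the original sequence.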


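[8] Multi-Rationale Explainable Object Recognition via Contrastive Conditional Inference cs.CV

Ali Rasekh, Sepehr Kazemi Ranjbar, Simon Gottschalk

TL;DR: The paper introduces a multi-rationale explainable object recognition benchmark and a contrastive conditional inference (CCI) framework, addressing the weak explanatory conditioning of existing CLIP-based methods.

Details

Motivation: Existing explainable object recognition methods built on vision-language models such as CLIP rely on prompt-based conditioning, which is limited by CLIP's text encoder and conditions only weakly on explanatory structure. In addition, prior datasets mostly contain single or noisy rationales and fail to capture the diversity of discriminative features.

Result: The approach achieves state-of-the-art results on the benchmark, with strong zero-shot performance and improvements in both classification accuracy and rationale quality.

Insight: Combining multi-rationale annotation with probabilistic modeling yields a more comprehensive evaluation standard for explainability, and being training-free makes the framework more broadly applicable.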

Abstract: Explainable object recognition using vision-language models such as CLIP involves predicting accurate category labels supported by rationales that justify the decision-making process. Existing methods typically rely on prompt-based conditioning, which suffers from limitations in CLIP’s text encoder and provides weak conditioning on explanatory structures. Additionally, prior datasets are often restricted to single, and frequently noisy, rationales that fail to capture the full diversity of discriminative image features. In this work, we introduce a multi-rationale explainable object recognition benchmark comprising datasets in which each image is annotated with multiple ground-truth rationales, along with evaluation metrics designed to offer a more comprehensive representation of the task. To overcome the limitations of previous approaches, we propose a contrastive conditional inference (CCI) framework that explicitly models the probabilistic relationships among image embeddings, category labels, and rationales. Without requiring any training, our framework enables more effective conditioning on rationales to predict accurate object categories. Our approach achieves state-of-the-art results on the multi-rationale explainable object recognition benchmark, including strong zero-shot performance, and sets a new standard for both classification accuracy and rationale quality. Together with the benchmark, this work provides a more complete framework for evaluating future models in explainable object recognition. The code will be made available online.


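[9] OccluNet: Spatio-Temporal Deep Learning for Occlusion Detection on DSA cs.CV | cs.AI

Anushka A. Kore, Frank G. te Nijenhuis, Matthijs van der Sluijs, Wim van Zwam, Charles Majoie

TL;DR: OccluNet is a spatio-temporal deep learning model that combines the YOLOX single-stage object detector with transformer-based temporal attention to automatically detect vascular occlusions in DSA sequences, significantly outperforming baseline models.

Details

Motivation: In the treatment of acute ischemic stroke, accurately detecting vascular occlusions is critical, but interpreting digital subtraction angiography (DSA) sequences manually is difficult because of anatomical complexity and time pressure.

Result: On DSA images from the MR CLEAN Registry, OccluNet achieves a precision of 89.02% and a recall of 74.87%, significantly outperforming the baselines.

Insight: Temporal attention captures temporally consistent features effectively and offers a new approach to detecting dynamic targets in medical imaging.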

Abstract: Accurate detection of vascular occlusions during endovascular thrombectomy (EVT) is critical in acute ischemic stroke (AIS). Interpretation of digital subtraction angiography (DSA) sequences poses challenges due to anatomical complexity and time constraints. This work proposes OccluNet, a spatio-temporal deep learning model that integrates YOLOX, a single-stage object detector, with transformer-based temporal attention mechanisms to automate occlusion detection in DSA sequences. We compared OccluNet with a YOLOv11 baseline trained on either individual DSA frames or minimum intensity projections. Two spatio-temporal variants were explored for OccluNet: pure temporal attention and divided space-time attention. Evaluation on DSA images from the MR CLEAN Registry revealed the model’s capability to capture temporally consistent features, achieving precision and recall of 89.02% and 74.87%, respectively. OccluNet significantly outperformed the baseline models, and both attention variants attained similar performance. Source code is available at https://github.com/anushka-kore/OccluNet.git
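As a reminder of what the two reported numbers measure, precision is the share of predicted occlusions that are correct and recall is the share of true occlusions that are found. The helper below is a generic sketch of these standard definitions; the counts used in any call are hypothetical, since the abstract reports only the final percentages.

```python
def precision_recall(tp, fp, fn):
    """Standard detection metrics from true-positive, false-positive,
    and false-negative counts. Empty denominators are scored as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

A precision well above recall, as reported here, means the detector raises relatively few false alarms but still misses a portion of the true occlusions.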


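[10] Pixels to Play: A Foundation Model for 3D Gameplay cs.CV | cs.AI | cs.LG

Yuguang Yue, Chris Green, Samuel Hunt, Irakli Salia, Wenzhe Shi

TL;DR: Pixels2Play-0.1 (P2P0.1) is a foundation model that learns from the pixel stream to play a wide range of 3D video games with recognizably human-like behavior.

Details

Motivation: Consumers and developers want AI teammates, controllable NPCs, personalized live-streamers, and similar applications. These require a model that relies only on the pixel stream visible to players and generalizes to new games.

Result: Qualitative results show competent play across simple Roblox and classic MS-DOS titles.

Insight: Combining labeled and unlabeled data with a latency-friendly model design lays the groundwork for expert-level, text-conditioned game control.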

Abstract: We introduce Pixels2Play-0.1 (P2P0.1), a foundation model that learns to play a wide range of 3D video games with recognizable human-like behavior. Motivated by emerging consumer and developer use cases - AI teammates, controllable NPCs, personalized live-streamers, assistive testers - we argue that an agent must rely on the same pixel stream available to players and generalize to new titles with minimal game-specific engineering. P2P0.1 is trained end-to-end with behavior cloning: labeled demonstrations collected from instrumented human game-play are complemented by unlabeled public videos, to which we impute actions via an inverse-dynamics model. A decoder-only transformer with auto-regressive action output handles the large action space while remaining latency-friendly on a single consumer GPU. We report qualitative results showing competent play across simple Roblox and classic MS-DOS titles, ablations on unlabeled data, and outline the scaling and evaluation steps required to reach expert-level, text-conditioned control.


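[11] MoVieDrive: Multi-Modal Multi-View Urban Scene Video Generation cs.CV

Guile Wu, David Huang, Dongfeng Bai, Bingbing Liu

TL;DR: The paper proposes MoVieDrive, a multi-modal multi-view video generation method for autonomous driving scenes. A unified diffusion transformer model overcomes the restriction of existing methods to RGB video generation.

Details

Motivation: Existing video generation methods for autonomous driving focus on RGB video and do not support multi-modal data such as depth maps and semantic maps. Multi-modal data are crucial for holistic scene understanding, but using multiple models makes deployment harder and cannot exploit complementary cues.

Result: Experiments on nuScenes show that the method generates multi-modal multi-view videos with high fidelity and controllability, surpassing existing methods.

Insight: Generating multiple modalities in one unified framework makes driving-scene video generation more comprehensive and practical while reducing deployment complexity.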

Abstract: Video generation has recently shown superiority in urban scene synthesis for autonomous driving. Existing video generation approaches to autonomous driving primarily focus on RGB video generation and lack the ability to support multi-modal video generation. However, multi-modal data, such as depth maps and semantic maps, are crucial for holistic urban scene understanding in autonomous driving. Although it is feasible to use multiple models to generate different modalities, this increases the difficulty of model deployment and does not leverage complementary cues for multi-modal data generation. To address this problem, in this work, we propose a novel multi-modal multi-view video generation approach to autonomous driving. Specifically, we construct a unified diffusion transformer model composed of modal-shared components and modal-specific components. Then, we leverage diverse conditioning inputs to encode controllable scene structure and content cues into the unified diffusion model for multi-modal multi-view video generation. In this way, our approach is capable of generating multi-modal multi-view driving scene videos in a unified framework. Our experiments on the challenging real-world autonomous driving dataset, nuScenes, show that our approach can generate multi-modal multi-view urban scene videos with high fidelity and controllability, surpassing the state-of-the-art methods.


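[12] Inter-Class Relational Loss for Small Object Detection: A Case Study on License Plates cs.CV | cs.AI

Dian Ning, Dong Seog Han

TL;DR: The paper proposes a new loss for small object detection, the inter-class relational (ICR) loss, which exploits spatial relationships between classes (such as a license plate and its car) to update small-object gradients more efficiently. It also releases the SVMLP dataset. Experiments show that the ICR loss clearly improves detection performance.

Details

Motivation: Conventional IoU-based losses update the gradients of small objects poorly, which leads to weak small-object detection. The paper addresses this by exploiting inter-class spatial relationships, such as the similar position at which a license plate is attached to a car.

Result: The ICR loss improves mAP$^{\text{test}}_{50}$ by 10.3% on YOLOv12-T and by 1.6% on UAV-DETR, without any additional hyperparameter tuning.

Insight: Inter-class spatial relationships can resolve the gradient-update problem in small object detection without sacrificing the learning efficiency of other objects, and the idea may carry over to other small object detection tasks.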

Abstract: In one-stage multi-object detection tasks, various intersection over union (IoU)-based solutions aim at smooth and stable convergence near the targets during training. However, IoU-based losses fail to correctly update the gradient of small objects due to an extremely flat gradient. During the update of multiple objects, the learning of small objects’ gradients suffers more because of insufficient gradient updates. Therefore, we propose an inter-class relational loss to efficiently update the gradient of small objects while not sacrificing the learning efficiency of other objects based on the simple fact that an object has a spatial relationship to another object (e.g., a car plate is attached to a car in a similar position). When the predicted car plate’s bounding box is not within its car, a loss punishment is added to guide the learning, which is inversely proportional to the overlapped area of the car’s and predicted car plate’s bounding box. By leveraging the spatial relationship at the inter-class level, the loss guides small object predictions using larger objects and enhances latent information in deeper feature maps. In this paper, we present twofold contributions using license plate detection as a case study: (1) a new small vehicle multi-license plate dataset (SVMLP), featuring diverse real-world scenarios with high-quality annotations; and (2) a novel inter-class relational loss function designed to promote effective detection performance. We highlight that the proposed ICR loss penalty can be easily added to existing IoU-based losses to enhance performance. These contributions improve the standard mean Average Precision (mAP) metric, achieving gains of 10.3% and 1.6% in mAP$^{\text{test}}_{50}$ for YOLOv12-T and UAV-DETR, respectively, without any additional hyperparameter tuning. Code and dataset will be available soon.
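The penalty described in the abstract can be sketched for a single plate/car pair of axis-aligned boxes. The abstract does not give the exact formula, so the code below assumes one simple choice that shrinks as the overlap grows: one minus the fraction of the predicted plate box lying inside the car box. The function names and the `(x1, y1, x2, y2)` box format are hypothetical, and the authors' actual loss may differ.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def overlap_area(a, b):
    """Area of the intersection of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return box_area((x1, y1, x2, y2))


def icr_penalty(plate_box, car_box):
    """Assumed penalty: 0 when the plate lies fully inside the car box,
    rising to 1 as the overlap shrinks to nothing."""
    plate_area = box_area(plate_box)
    if plate_area == 0.0:
        return 1.0  # degenerate prediction: treat as maximally penalized
    inside_fraction = overlap_area(plate_box, car_box) / plate_area
    return 1.0 - inside_fraction
```

In a training loop such a term would be added to an existing IoU-based box loss, which matches the abstract's remark that the penalty can be attached to existing losses.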


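[13] Deep Learning for Taxol Exposure Analysis: A New Cell Image Dataset and Attention-Based Baseline Model cs.CV

Sean Fletcher, Gabby Scott, Douglas Currie, Xin Zhang, Yuqi Song

TL;DR: The paper presents a new microscopy image dataset and an attention-based baseline model for analyzing the morphological effects of Taxol on cells. The dataset fills a gap in this area, and the ResAttention-KNN model combines ResNet-50, attention modules, and a k-NN classifier for Taxol concentration classification.

Details

Motivation: Existing methods for detecting the cellular effects of Taxol require specialized equipment and skilled personnel, making them expensive and unsuitable for high-throughput or real-time analysis. Deep learning can automate the analysis of cellular morphology, but no public dataset or benchmark model has been available.

Result: ResAttention-KNN serves as a baseline for Taxol concentration classification, and both the dataset and the implementation are publicly released to support reproduction and extension.

Insight: Attention-based refinement combined with a simple non-parametric k-NN classifier in the learned embedding space aims at robustness and interpretability, and the public dataset and code are an important resource for this area.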

Abstract: Monitoring the effects of the chemotherapeutic agent Taxol at the cellular level is critical for both clinical evaluation and biomedical research. However, existing detection methods require specialized equipment, skilled personnel, and extensive sample preparation, making them expensive, labor-intensive, and unsuitable for high-throughput or real-time analysis. Deep learning approaches have shown great promise in medical and biological image analysis, enabling automated, high-throughput assessment of cellular morphology. Yet, no publicly available dataset currently exists for automated morphological analysis of cellular responses to Taxol exposure. To address this gap, we introduce a new microscopy image dataset capturing C6 glioma cells treated with varying concentrations of Taxol. To provide an effective solution for Taxol concentration classification and establish a benchmark for future studies on this dataset, we propose a baseline model named ResAttention-KNN, which combines a ResNet-50 with Convolutional Block Attention Modules and uses a k-Nearest Neighbors classifier in the learned embedding space. This model integrates attention-based refinement and non-parametric classification to enhance robustness and interpretability. Both the dataset and implementation are publicly released to support reproducibility and facilitate future research in vision-based biomedical analysis.
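The final classification step, a k-Nearest Neighbors vote in a learned embedding space, can be sketched independently of the network that produces the embeddings. The example below assumes embeddings are already available as plain tuples and uses Euclidean distance with a majority vote; the function name, the distance choice, and the toy labels are illustrative assumptions rather than details from the released implementation.

```python
import math
from collections import Counter


def knn_predict(query, embeddings, labels, k=3):
    """Label a query embedding by majority vote among its k nearest
    reference embeddings (Euclidean distance)."""
    ranked = sorted(
        (math.dist(query, emb), label) for emb, label in zip(embeddings, labels)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because the classifier has no trainable parameters, new reference samples can be added without retraining the embedding network, which is one common motivation for pairing a learned backbone with k-NN.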


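[14] Taming Transformer for Emotion-Controllable Talking Face Generation cs.CV

Ziqi Zhang, Cheng Deng

TL;DR: The paper proposes a new method for emotion-controllable talking face generation. Pre-training strategies and an emotion-anchor (EA) representation are combined with an autoregressive Transformer to generate identity-preserving emotional videos.

Details

Motivation: Talking face generation currently faces two challenges: how to model the multimodal relationship tied to a specific emotion, and how to use that relationship to synthesize identity-preserving emotional videos. The paper addresses both.

Result: Experiments on the MEAD dataset show that the method outperforms existing approaches both qualitatively and quantitatively when generating emotion-controllable videos.

Insight: Quantizing video and disentangling audio, combined with the emotion-anchor representation and an autoregressive Transformer, yields identity-preserving and emotionally expressive talking face videos more effectively.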

Abstract: Talking face generation is a novel and challenging generation task, aiming at synthesizing a vivid speaking-face video given a specific audio. To fulfill emotion-controllable talking face generation, current methods need to overcome two challenges: One is how to effectively model the multimodal relationship related to the specific emotion, and the other is how to leverage this relationship to synthesize identity preserving emotional videos. In this paper, we propose a novel method to tackle the emotion-controllable talking face generation task discretely. Specifically, we employ two pre-training strategies to disentangle audio into independent components and quantize videos into combinations of visual tokens. Subsequently, we propose the emotion-anchor (EA) representation that integrates the emotional information into visual tokens. Finally, we introduce an autoregressive transformer to model the global distribution of the visual tokens under the given conditions and further predict the index sequence for synthesizing the manipulated videos. We conduct experiments on the MEAD dataset that controls the emotion of videos conditioned on multiple emotional audios. Extensive experiments demonstrate the superiorities of our method both qualitatively and quantitatively.
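Quantizing video features into visual tokens is commonly done by nearest-neighbor lookup in a learned codebook, as in VQ-VAE-style tokenizers. The abstract does not specify the tokenizer, so the sketch below only illustrates that general lookup step; the function name, the squared-Euclidean distance, and the toy codebook are assumptions.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry
    (squared Euclidean distance). The indices are the 'visual tokens'."""
    indices = []
    for feature in features:
        nearest = min(
            range(len(codebook)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(feature, codebook[i])),
        )
        indices.append(nearest)
    return indices
```

An autoregressive transformer, as described in the abstract, would then model and predict sequences of such indices, which a decoder maps back to frames.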


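[15] TCFNet: Bidirectional face-bone transformation via a Transformer-based coarse-to-fine point movement network cs.CV

Runshi Zhang, Bimeng Jie, Yang He, Junchen Wang

TL;DR: TCFNet is a Transformer-based coarse-to-fine point movement network that accurately simulates the bidirectional transformation between facial and skeletal point clouds, overcoming the limits of traditional and existing deep learning methods in computation time, accuracy, and applicability.

Details

Motivation: Traditional biomechanical simulation is slow, labor-intensive in data processing, and inaccurate. Existing deep learning methods cannot process large-scale point clouds, have limited receptive fields, and depend on complex pre- and post-processing, so a more efficient and accurate solution is needed.

Result: On the gathered datasets, TCFNet outperforms existing state-of-the-art methods in both evaluation metrics and visualization results.

Insight: (1) A staged coarse-to-fine design markedly improves the accuracy of point cloud transformation; (2) combining global and local information is key for dense point cloud transformation; (3) incorporating expert knowledge can further benefit medical-imaging tasks.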

Abstract: Computer-aided surgical simulation is a critical component of orthognathic surgical planning, where accurately simulating face-bone shape transformations is significant. The traditional biomechanical simulation methods are limited by their computational time consumption levels, labor-intensive data processing strategies and low accuracy. Recently, deep learning-based simulation methods have been proposed to view this problem as a point-to-point transformation between skeletal and facial point clouds. However, these approaches cannot process large-scale points, have limited receptive fields that lead to noisy points, and employ complex preprocessing and postprocessing operations based on registration. These shortcomings limit the performance and widespread applicability of such methods. Therefore, we propose a Transformer-based coarse-to-fine point movement network (TCFNet) to learn unique, complicated correspondences at the patch and point levels for dense face-bone point cloud transformations. This end-to-end framework adopts a Transformer-based network and a local information aggregation network (LIA-Net) in the first and second stages, respectively, which reinforce each other to generate precise point movement paths. LIA-Net can effectively compensate for the neighborhood precision loss of the Transformer-based network by modeling local geometric structures (edges, orientations and relative position features). The previous global features are employed to guide the local displacement using a gated recurrent unit. Inspired by deformable medical image registration, we propose an auxiliary loss that can utilize expert knowledge for reconstructing critical organs. Compared with the existing state-of-the-art (SOTA) methods on gathered datasets, TCFNet achieves outstanding evaluation metrics and visualization results. The code is available at https://github.com/Runshi-Zhang/TCFNet.


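[16] QuadINR: Hardware-Efficient Implicit Neural Representations Through Quadratic Activation cs.CV

Wenyong Zhou, Boyu Li, Jiachen Ren, Taiqiang Wu, Zhilin Ai

TL;DR: QuadINR is a hardware-efficient implicit neural representation that uses piecewise quadratic activation functions to cut hardware overhead while improving expressivity for high-frequency signals. Its efficiency is validated on FPGA and ASIC implementations.

Details

Motivation: Conventional implicit neural representations (INRs) use complex activation functions to mitigate spectral bias, at the cost of significant hardware overhead. QuadINR aims for an efficient hardware implementation through quadratic activation functions.

Result: Across image and video tasks, QuadINR improves PSNR by up to 2.06 dB over prior work, with an area of only 1914 μm² and a dynamic power of 6.14 mW, cutting resource and power consumption by up to 97% and latency by up to 93%.

Insight: Quadratic activation functions strike a good balance between hardware efficiency and high-frequency expressivity, offering a practical route to deploying INRs.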

Abstract: Implicit Neural Representations (INRs) encode discrete signals continuously while addressing spectral bias through activation functions (AFs). Previous approaches mitigate this bias by employing complex AFs, which often incur significant hardware overhead. To tackle this challenge, we introduce QuadINR, a hardware-efficient INR that utilizes piecewise quadratic AFs to achieve superior performance with dramatic reductions in hardware consumption. The quadratic functions encompass rich harmonic content in their Fourier series, delivering enhanced expressivity for high-frequency signals, as verified through Neural Tangent Kernel (NTK) analysis. We develop a unified $N$-stage pipeline framework that facilitates efficient hardware implementation of various AFs in INRs. We demonstrate FPGA implementations on the VCU128 platform and an ASIC implementation in a 28nm process. Experiments across images and videos show that QuadINR achieves up to 2.06dB PSNR improvement over prior work, with an area of only 1914$\mu$m$^2$ and a dynamic power of 6.14mW, reducing resource and power consumption by up to 97% and improving latency by up to 93% vs existing baselines.
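To put the reported PSNR gain in context, PSNR is defined as 10·log10(MAX² / MSE), so a fixed gain in dB corresponds to a fixed ratio of mean squared errors. The helper below is a generic sketch of that standard formula on flat lists of samples; the function name and the default peak value of 1.0 are illustrative choices, not details from the paper.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length signals."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Under this definition, a gain of 2.06 dB corresponds to an MSE ratio of 10^(-0.206) ≈ 0.62, that is, roughly 38% lower mean squared error at the same peak value.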


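[17] Img2ST-Net: Efficient High-Resolution Spatial Omics Prediction from Whole Slide Histology Images via Fully Convolutional Image-to-Image Learning cs.CV

Junchao Zhu, Ruining Deng, Junlin Guo, Tianyuan Yao, Juming Xiong

TL;DR: Img2ST-Net is an efficient framework for high-resolution spatial transcriptomics prediction. It uses fully convolutional image-to-image learning to generate dense gene expression maps from whole slide histology images, addressing the inefficiency and instability of existing methods.

Details

Motivation: Spatial transcriptomics (ST) data are expensive and time-consuming to acquire, and existing spot-by-spot inference becomes inefficient and unstable at very high resolution. The paper aims for an efficient, parallel prediction method.

Result: The proposed framework generates dense, high-resolution gene expression maps in parallel, improving computational efficiency over conventional spot-by-spot inference while better preserving spatial organization.

Insight: Recasting ST prediction as an image generation problem, together with an evaluation metric suited to high-resolution data, points the way for next-generation spatial transcriptomics modeling.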

Abstract: Recent advances in multi-modal AI have demonstrated promising potential for generating the currently expensive spatial transcriptomics (ST) data directly from routine histology images, offering a means to reduce the high cost and time-intensive nature of ST data acquisition. However, the increasing resolution of ST, particularly with platforms such as Visium HD achieving 8um or finer, introduces significant computational and modeling challenges. Conventional spot-by-spot sequential regression frameworks become inefficient and unstable at this scale, while the inherent extreme sparsity and low expression levels of high-resolution ST further complicate both prediction and evaluation. To address these limitations, we propose Img2ST-Net, a novel histology-to-ST generation framework for efficient and parallel high-resolution ST prediction. Unlike conventional spot-by-spot inference methods, Img2ST-Net employs a fully convolutional architecture to generate dense, HD gene expression maps in a parallelized manner. By modeling HD ST data as super-pixel representations, the task is reformulated from image-to-omics inference into a super-content image generation problem with hundreds or thousands of output channels. This design not only improves computational efficiency but also better preserves the spatial organization intrinsic to spatial omics data. To enhance robustness under sparse expression patterns, we further introduce SSIM-ST, a structural-similarity-based evaluation metric tailored for high-resolution ST analysis. We present a scalable, biologically coherent framework for high-resolution ST prediction. Img2ST-Net offers a principled solution for efficient and accurate ST inference at scale. Our contributions lay the groundwork for next-generation ST modeling that is robust and resolution-aware. The source code has been made publicly available at https://github.com/hrlblab/Img2ST-Net.


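[18] CTA-Flux: Integrating Chinese Cultural Semantics into High-Quality English Text-to-Image Communities cs.CV

Yue Gong, Shanyuan Liu, Liuzhuozheng Li, Jian Zhu, Bo Cheng

TL;DR: The paper proposes CTA-Flux, an adaptation method that uses a MultiModal Diffusion Transformer (MMDiT) to feed Chinese semantics directly into the English text-to-image model Flux. It addresses the weakness of existing methods on culturally specific semantics and improves the quality and cultural authenticity of generated images.

Details

Motivation: English text-to-image models such as Flux perform poorly on non-English prompts, Chinese in particular, mainly because of linguistic and cultural biases in their training data. Existing approaches, such as translation or bilingual fine-tuning, do not adequately capture culturally specific semantics, which degrades image quality.

Result: Experiments show that CTA-Flux supports both Chinese and English prompts and achieves superior image generation quality, visual realism, and faithful depiction of Chinese semantics.

Insight: Controlling the backbone directly, instead of relying on translation or bilingual fine-tuning, handles multilingual and culturally diverse prompts more efficiently while keeping the model lightweight and compatible with existing plugins.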

Abstract: We propose the Chinese Text Adapter-Flux (CTA-Flux), an adaptation method that fits Chinese text inputs to Flux, a powerful text-to-image (TTI) generative model initially trained on an English corpus. Despite the notable image generation ability conditioned on English text inputs, Flux performs poorly when processing non-English prompts, particularly due to linguistic and cultural biases inherent in predominantly English-centric training datasets. Existing approaches, such as translating non-English prompts into English or finetuning models for bilingual mappings, inadequately address culturally specific semantics, compromising image authenticity and quality. To address this issue, we introduce a novel method to bridge Chinese semantic understanding with compatibility in English-centric TTI model communities. Existing approaches relying on ControlNet-like architectures typically require a massive parameter scale and lack direct control over Chinese semantics. In comparison, CTA-Flux leverages MultiModal Diffusion Transformer (MMDiT) to control the Flux backbone directly, significantly reducing the number of parameters while enhancing the model’s understanding of Chinese semantics. This integration significantly improves the generation quality and cultural authenticity without extensive retraining of the entire model, thus maintaining compatibility with existing text-to-image plugins such as LoRA, IP-Adapter, and ControlNet. Empirical evaluations demonstrate that CTA-Flux supports Chinese and English prompts and achieves superior image generation quality, visual realism, and faithful depiction of Chinese semantics.


[19] MoCHA-former: Moiré-Conditioned Hybrid Adaptive Transformer for Video Demoiréing cs.CV

Jeahun Sung, Changhyun Roh, Chanho Eom, Jihyong Oh

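TL;DR: MoCHA-former is a hybrid adaptive Transformer for video demoiréing that decouples moiré from content and applies spatio-temporally adaptive processing, significantly improving demoiréing quality.

Details

Motivation: When portable imaging devices photograph screens, frequency aliasing between the camera's CFA and the display's sub-pixels produces moiré patterns that severely degrade image quality. Existing methods fall short in handling spatio-temporal variation, large-scale structures, and channel dependence.

Result: On two video datasets (RAW and sRGB), MoCHA-former outperforms existing methods on PSNR, SSIM, and LPIPS.

Insight: Decoupling moiré from content and combining it with spatio-temporally adaptive processing significantly improves demoiréing in complex scenes. Dropping the explicit alignment module simplifies the model structure.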

Abstract: Recent advances in portable imaging have made camera-based screen capture ubiquitous. Unfortunately, frequency aliasing between the camera’s color filter array (CFA) and the display’s sub-pixels induces moiré patterns that severely degrade captured photos and videos. Although various demoiréing models have been proposed to remove such moiré patterns, these approaches still suffer from several limitations: (i) spatially varying artifact strength within a frame, (ii) large-scale and globally spreading structures, (iii) channel-dependent statistics and (iv) rapid temporal fluctuations across frames. We address these issues with the Moiré Conditioned Hybrid Adaptive Transformer (MoCHA-former), which comprises two key components: Decoupled Moiré Adaptive Demoiréing (DMAD) and Spatio-Temporal Adaptive Demoiréing (STAD). DMAD separates moiré and content via a Moiré Decoupling Block (MDB) and a Detail Decoupling Block (DDB), then produces moiré-adaptive features using a Moiré Conditioning Block (MCB) for targeted restoration. STAD introduces a Spatial Fusion Block (SFB) with window attention to capture large-scale structures, and a Feature Channel Attention (FCA) to model channel dependence in RAW frames. To ensure temporal consistency, MoCHA-former performs implicit frame alignment without any explicit alignment module. We analyze moiré characteristics through qualitative and quantitative studies, and evaluate on two video datasets covering RAW and sRGB domains. MoCHA-former consistently surpasses prior methods across PSNR, SSIM, and LPIPS.


[20] HyperDiff: Hypergraph Guided Diffusion Model for 3D Human Pose Estimation cs.CV

Bing Han, Yuhua Huang, Pan Gao

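TL;DR: The paper proposes HyperDiff, which combines a diffusion model with HyperGCN to address depth ambiguity and occlusion in monocular 3D human pose estimation while balancing performance and efficiency.

Details

Motivation: Monocular 3D human pose estimation suffers from depth ambiguity and occlusion, and traditional methods may overlook multi-scale skeleton features. HyperDiff improves accuracy by combining a diffusion model with HyperGCN.

Result: It outperforms existing methods on the Human3.6M and MPI-INF-3DHP datasets and adapts flexibly to different computational budgets.

Insight: HyperGCN's multi-granularity structure effectively improves denoising for complex poses, offering a new direction for 3D pose estimation.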

Abstract: Monocular 3D human pose estimation (HPE) often encounters challenges such as depth ambiguity and occlusion during the 2D-to-3D lifting process. Additionally, traditional methods may overlook multi-scale skeleton features when utilizing skeleton structure information, which can negatively impact the accuracy of pose estimation. To address these challenges, this paper introduces a novel 3D pose estimation method, HyperDiff, which integrates diffusion models with HyperGCN. The diffusion model effectively captures data uncertainty, alleviating depth ambiguity and occlusion. Meanwhile, HyperGCN, serving as a denoiser, employs multi-granularity structures to accurately model high-order correlations between joints. This improves the model’s denoising capability especially for complex poses. Experimental results demonstrate that HyperDiff achieves state-of-the-art performance on the Human3.6M and MPI-INF-3DHP datasets and can flexibly adapt to varying computational resources to balance performance and efficiency.
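
As an illustration of the diffusion side of this design, the sketch below applies the standard DDPM forward (noising) step to a toy 3D pose. It is not taken from the paper: the linear beta schedule, the three-joint pose, and the function names `make_alpha_bars` and `noise_pose` are assumptions made for the example, and the HyperGCN denoiser that would reverse this process is not modeled.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_pose(pose, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps per coordinate."""
    signal, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [
        tuple(signal * c + sigma * rng.gauss(0.0, 1.0) for c in joint)
        for joint in pose
    ]


alpha_bars = make_alpha_bars(1000)
# Toy pose with three (x, y, z) joints; real skeletons have more joints.
pose = [(0.0, 0.0, 0.0), (0.1, -0.4, 0.05), (0.1, -0.8, 0.1)]
noisy = noise_pose(pose, alpha_bars[500], random.Random(0))
```

A denoiser is trained to recover the clean pose (or the added noise) from `noisy`; in HyperDiff that role is played by HyperGCN.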


[21] FOCUS: Frequency-Optimized Conditioning of DiffUSion Models for mitigating catastrophic forgetting during Test-Time Adaptation cs.CV

Gabriel Tjio, Jie Zhang, Xulei Yang, Yun Xing, Nhat Chung

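TL;DR: FOCUS addresses catastrophic forgetting in test-time adaptation through frequency-optimized conditioning of diffusion models. Combining the lightweight Y-FPN network with FrequencyMix data augmentation, it improves semantic segmentation and depth estimation.

Details

Motivation: In test-time adaptation, a model must balance adapting to new domains against retaining task-relevant knowledge, but existing methods are prone to catastrophic forgetting. FOCUS proposes a frequency-based solution.

Result: Across 15 corruption types and three datasets, FOCUS achieves state-of-the-art averaged performance on semantic segmentation and depth estimation and mitigates catastrophic forgetting.

Insight: Frequency decomposition is an effective way to mitigate catastrophic forgetting, and diffusion-model conditioning combines flexibly with existing adaptation methods.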

Abstract: Test-time adaptation enables models to adapt to evolving domains. However, balancing the tradeoff between preserving knowledge and adapting to domain shifts remains challenging for model adaptation methods, since adapting to domain shifts can induce forgetting of task-relevant knowledge. To address this problem, we propose FOCUS, a novel frequency-based conditioning approach within a diffusion-driven input-adaptation framework. Utilising learned, spatially adaptive frequency priors, our approach conditions the reverse steps during diffusion-driven denoising to preserve task-relevant semantic information for dense prediction. FOCUS leverages a trained, lightweight, Y-shaped Frequency Prediction Network (Y-FPN) that disentangles high and low frequency information from noisy images. This minimizes the computational costs involved in implementing our approach in a diffusion-driven framework. We train Y-FPN with FrequencyMix, a novel data augmentation method that perturbs the images across diverse frequency bands, which improves the robustness of our approach to diverse corruptions. We demonstrate the effectiveness of FOCUS for semantic segmentation and monocular depth estimation across 15 corruption types and three datasets, achieving state-of-the-art averaged performance. In addition to improving standalone performance, FOCUS complements existing model adaptation methods since we can derive pseudo labels from FOCUS-denoised images for additional supervision. Even under limited, intermittent supervision with the pseudo labels derived from the FOCUS denoised images, we show that FOCUS mitigates catastrophic forgetting for recent model adaptation methods.
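
To make the idea of separating low- and high-frequency content concrete, here is a minimal sketch that splits a tiny image with a fixed box blur. This is only a stand-in: in FOCUS the split is predicted by the learned Y-FPN from noisy images, whereas `box_blur` and `split_frequencies` are hypothetical helpers using a hand-picked filter.

```python
def box_blur(image, radius=1):
    """Low-pass filter: mean over a (2r+1)x(2r+1) window, truncated at the borders."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [image[j][i] for j in ys for i in xs]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def split_frequencies(image, radius=1):
    """Return (low, high) with high = image - low, so low + high rebuilds the image."""
    low = box_blur(image, radius)
    high = [[p - l for p, l in zip(prow, lrow)] for prow, lrow in zip(image, low)]
    return low, high


# Toy 3x4 grayscale image with a vertical edge between columns 1 and 2.
image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
low, high = split_frequencies(image)
```

The high band is zero in flat regions and non-zero near the edge, which is the kind of structure a frequency prior can use to decide what to preserve during denoising.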


[22] MUSE: Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion cs.CV

Fei Peng, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang

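TL;DR: MUSE is a unified multi-subject synthesis framework that performs text-to-image multi-subject synthesis through explicit layout semantic expansion, addressing the challenges existing methods face in spatial precision and identity consistency.

Details

Motivation: Existing text-to-image diffusion models struggle to satisfy both spatial control and identity preservation in multi-subject synthesis; MUSE aims to solve this.

Result: Experiments show that MUSE outperforms existing methods in zero-shot end-to-end generation, with higher spatial accuracy and identity consistency.

Insight: Explicit semantic space expansion and task decomposition improve both controllability and generation quality in multi-subject synthesis.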

Abstract: Existing text-to-image diffusion models have demonstrated remarkable capabilities in generating high-quality images guided by textual prompts. However, achieving multi-subject compositional synthesis with precise spatial control remains a significant challenge. In this work, we address the task of layout-controllable multi-subject synthesis (LMS), which requires both faithful reconstruction of reference subjects and their accurate placement in specified regions within a unified image. While recent advancements have separately improved layout control and subject synthesis, existing approaches struggle to simultaneously satisfy the dual requirements of spatial precision and identity preservation in this composite task. To bridge this gap, we propose MUSE, a unified synthesis framework that employs concatenated cross-attention (CCA) to seamlessly integrate layout specifications with textual guidance through explicit semantic space expansion. The proposed CCA mechanism enables bidirectional modality alignment between spatial constraints and textual descriptions without interference. Furthermore, we design a progressive two-stage training strategy that decomposes the LMS task into learnable sub-objectives for effective optimization. Extensive experiments demonstrate that MUSE achieves zero-shot end-to-end generation with superior spatial accuracy and identity consistency compared to existing solutions, advancing the frontier of controllable image synthesis. Our code and model are available at https://github.com/pf0607/MUSE.
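
The abstract does not spell out how concatenated cross-attention is implemented. One plausible reading, sketched below under that assumption, is that layout tokens are appended to the text tokens along the token axis so a single softmax covers both conditions. The learned projections are omitted, the context is used directly as keys and values, and all names here are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wi * v[d] for wi, v in zip(w, values)) for d in range(len(values[0]))]
        )
    return outputs, weights


def concatenated_cross_attention(image_tokens, text_tokens, layout_tokens):
    """Assumed CCA: extend the text context with layout tokens, then attend once."""
    context = text_tokens + layout_tokens
    return attend(image_tokens, context, context)


image_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[1.0, 0.0], [0.5, 0.5]]
layout_tokens = [[0.0, 1.0]]
out, weights = concatenated_cross_attention(image_tokens, text_tokens, layout_tokens)
```

Because text and layout share one softmax, each image token distributes a single attention budget across both sources instead of summing two separately normalized attention outputs.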


[23] Reconstruction Using the Invisible: Intuition from NIR and Metadata for Enhanced 3D Gaussian Splatting cs.CV

Gyusam Chang, Tuan-Anh Vu, Vivek Alumootil, Harris Song, Deanna Pham

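TL;DR: The paper proposes NIRSplat, a multimodal 3D Gaussian splatting method that combines near-infrared (NIR) imagery with textual metadata to tackle 3D reconstruction in agricultural scenes. With the new NIRPlant dataset and a cross-attention mechanism, it substantially improves reconstruction in complex agricultural environments.

Details

Motivation: Agricultural scenes suffer from uneven illumination, occlusion, and a limited field of view, so conventional 3D reconstruction methods perform poorly. NIR imagery and vegetation-index data remain underused.

Result: NIRSplat outperforms existing methods including 3DGS, CoR-GS, and InstantSplat, particularly in complex agricultural scenes.

Insight: NIR and vegetation-index data markedly improve the robustness of 3D reconstruction, and in agricultural scenes they provide botanical information beyond the visible spectrum.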

Abstract: While 3D Gaussian Splatting (3DGS) has rapidly advanced, its application in agriculture remains underexplored. Agricultural scenes present unique challenges for 3D reconstruction methods, particularly due to uneven illumination, occlusions, and a limited field of view. To address these limitations, we introduce NIRPlant, a novel multimodal dataset encompassing Near-Infrared (NIR) imagery, RGB imagery, textual metadata, Depth, and LiDAR data collected under varied indoor and outdoor lighting conditions. By integrating NIR data, our approach enhances robustness and provides crucial botanical insights that extend beyond the visible spectrum. Additionally, we leverage text-based metadata derived from vegetation indices, such as NDVI, NDWI, and the chlorophyll index, which significantly enriches the contextual understanding of complex agricultural environments. To fully exploit these modalities, we propose NIRSplat, an effective multimodal Gaussian splatting architecture employing a cross-attention mechanism combined with 3D point-based positional encoding, providing robust geometric priors. Comprehensive experiments demonstrate that NIRSplat outperforms existing landmark methods, including 3DGS, CoR-GS, and InstantSplat, highlighting its effectiveness in challenging agricultural scenarios. The code and dataset are publicly available at: https://github.com/StructuresComp/3D-Reconstruction-NIR
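
For readers unfamiliar with vegetation indices, the sketch below computes NDVI with its standard definition, (NIR - Red) / (NIR + Red), and turns the value into a short text label. The thresholds and wording in `describe_vegetation` are assumptions for illustration only; the paper's actual metadata format is not described in the abstract.

```python
def ndvi(nir, red, eps=1e-8):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    return (nir - red) / (nir + red + eps)


def describe_vegetation(value):
    """Map an NDVI value to a text label (hypothetical thresholds)."""
    if value > 0.6:
        return "dense vegetation"
    if value > 0.2:
        return "sparse vegetation"
    return "little or no vegetation"


leaf = ndvi(nir=0.8, red=0.1)   # healthy leaves reflect NIR strongly
soil = ndvi(nir=0.3, red=0.3)   # bare surfaces reflect both bands similarly
leaf_label = describe_vegetation(leaf)
soil_label = describe_vegetation(soil)
```

Text derived this way can be embedded and fused with image features, which is the general role the abstract assigns to vegetation-index metadata.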


[24] D^3-Talker: Dual-Branch Decoupled Deformation Fields for Few-Shot 3D Talking Head Synthesis cs.CV

Yuhang Guo, Kaijun Deng, Siyang Song, Jindong Xie, Wenhui Ma

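TL;DR: D^3-Talker proposes dual-branch decoupled deformation fields for few-shot 3D talking head synthesis. By separating the prediction of general and personalized deformations, it achieves better lip synchronization and image quality.

Details

Motivation: With only a small amount of training data, existing methods struggle to map audio accurately to lip motion on the target face, resulting in poor lip synchronization and image quality.

Result: Experiments show that D^3-Talker outperforms existing methods in high-fidelity rendering and lip synchronization.

Insight: Decoupling general and personalized deformation prediction is an effective way to improve few-shot 3D talking head synthesis.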

Abstract: A key challenge in 3D talking head synthesis lies in the reliance on a long-duration talking head video to train a new model for each target identity from scratch. Recent methods have attempted to address this issue by extracting general features from audio through pre-training models. However, since audio contains information irrelevant to lip motion, existing approaches typically struggle to map the given audio to realistic lip behaviors in the target face when trained on only a few frames, causing poor lip synchronization and talking head image quality. This paper proposes D^3-Talker, a novel approach that constructs a static 3D Gaussian attribute field and employs audio and Facial Motion signals to independently control two distinct Gaussian attribute deformation fields, effectively decoupling the predictions of general and personalized deformations. We design a novel similarity contrastive loss function during pre-training to achieve more thorough decoupling. Furthermore, we integrate a Coarse-to-Fine module to refine the rendered images, alleviating blurriness caused by head movements and enhancing overall image quality. Extensive experiments demonstrate that D^3-Talker outperforms state-of-the-art methods in both high-fidelity rendering and accurate audio-lip synchronization with limited training data. Our code will be provided upon acceptance.


[25] Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering cs.CV

Shanlin Sun, Yifan Wang, Hanwen Zhang, Yifeng Xiong, Qin Ren

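TL;DR: Ouroboros is a single-step diffusion framework that handles forward and inverse rendering together through mutual reinforcement, achieving cycle consistency and fast inference.

Details

Motivation: Existing multi-step diffusion models usually treat forward and inverse rendering independently, which causes cycle inconsistency and slow inference. This motivates a unified framework that addresses both problems.

Result: Experiments show that Ouroboros achieves state-of-the-art performance across diverse scenes with inference substantially faster than other diffusion-based methods. On video decomposition, it reduces temporal inconsistency without any training.

Insight: A cycle-consistent framework of single-step diffusion models handles forward and inverse rendering efficiently and transfers to tasks such as video decomposition in a training-free manner.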

Abstract: While multi-step diffusion models have advanced both forward and inverse rendering, existing approaches often treat these problems independently, leading to cycle inconsistency and slow inference speed. In this work, we present Ouroboros, a framework composed of two single-step diffusion models that handle forward and inverse rendering with mutual reinforcement. Our approach extends intrinsic decomposition to both indoor and outdoor scenes and introduces a cycle consistency mechanism that ensures coherence between forward and inverse rendering outputs. Experimental results demonstrate state-of-the-art performance across diverse scenes while achieving substantially faster inference speed compared to other diffusion-based methods. We also demonstrate that Ouroboros can transfer to video decomposition in a training-free manner, reducing temporal inconsistency in video sequences while maintaining high-quality per-frame inverse rendering.
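
The cycle consistency idea can be sketched with toy stand-ins for the two models. Everything below is an assumption for illustration: the image is a flat list of pixel intensities, the intrinsic decomposition is reduced to albedo times shading, and `toy_inverse_render` is a hand-written placeholder where the paper uses a single-step diffusion model.

```python
def forward_render(albedo, shading):
    """Toy forward rendering: per-pixel product of albedo and shading."""
    return [a * s for a, s in zip(albedo, shading)]


def toy_inverse_render(image, shading_guess=0.8):
    """Placeholder inverse rendering: assume constant shading, clip albedo to 1."""
    shading = [shading_guess] * len(image)
    albedo = [min(p / shading_guess, 1.0) for p in image]
    return albedo, shading


def cycle_loss(image, inverse_fn, forward_fn):
    """Mean absolute error between the image and forward(inverse(image))."""
    albedo, shading = inverse_fn(image)
    recon = forward_fn(albedo, shading)
    return sum(abs(r - p) for r, p in zip(recon, image)) / len(image)


consistent = cycle_loss([0.2, 0.4], toy_inverse_render, forward_render)
inconsistent = cycle_loss([0.2, 0.4, 0.9], toy_inverse_render, forward_render)
```

The bright pixel (0.9) cannot be explained with shading 0.8 and albedo at most 1, so the round trip loses information and the loss becomes positive. Penalizing this kind of discrepancy is what ties the forward and inverse models together during training.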


[26] DreamSwapV: Mask-guided Subject Swapping for Any Customized Video Editing cs.CV

Weitao Wang, Zichen Wang, Hongdeng Shen, Yulei Lu, Xirui Fan

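TL;DR: DreamSwapV is a mask-guided, subject-agnostic, end-to-end framework for swapping any subject in a video. It takes a user-specified mask and a reference image, and improves results through multi-condition fusion and an adaptive mask strategy.

Details

Motivation: With the rapid progress of video generation, demand for customized video editing is surging, yet subject swapping remains confined to narrow domains or relies on indirect editing paradigms, which limits practical use.

Result: It outperforms existing methods on VBench metrics and on DreamSwapV-Benchmark, validating its effectiveness and generalization.

Insight: Mask guidance and the adaptive mask strategy are key to efficient subject swapping, and the condition fusion module provides stronger control in complex scenes.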

Abstract: With the rapid progress of video generation, demand for customized video editing is surging, where subject swapping constitutes a key component yet remains under-explored. Prevailing swapping approaches either specialize in narrow domains (such as human-body animation or hand-object interaction) or rely on an indirect editing paradigm or ambiguous text prompts that compromise final fidelity. In this paper, we propose DreamSwapV, a mask-guided, subject-agnostic, end-to-end framework that swaps any subject in any video for customization with a user-specified mask and reference image. To inject fine-grained guidance, we introduce multiple conditions and a dedicated condition fusion module that integrates them efficiently. In addition, an adaptive mask strategy is designed to accommodate subjects of varying scales and attributes, further improving interactions between the swapped subject and its surrounding context. Through our elaborate two-phase dataset construction and training scheme, DreamSwapV outperforms existing methods, as validated by comprehensive experiments on VBench indicators and our newly introduced DreamSwapV-Benchmark.


[27] LookOut: Real-World Humanoid Egocentric Navigation cs.CV

Boxiao Pan, Adam W. Harley, C. Karen Liu, Leonidas J. Guibas

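TL;DR: The paper proposes a framework that predicts a sequence of future 6D head poses from egocentric video, and introduces a new dataset, AND, for learning real-world navigation behavior.

Details

Motivation: Predicting collision-free future trajectories is crucial in applications such as humanoid robotics, VR/AR, and assistive navigation. However, training data and effective models for humans' active information-gathering behavior (such as head turning) are lacking.

Result: The model learns human navigation behaviors (such as waiting, rerouting, and watching for traffic) and generalizes to unseen environments.

Insight: Modeling 3D features with geometric and semantic constraints better captures humans' active navigation behavior, and a real-world dataset is essential for training.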

Abstract: The ability to predict collision-free future trajectories from egocentric observations is crucial in applications such as humanoid robotics, VR / AR, and assistive navigation. In this work, we introduce the challenging problem of predicting a sequence of future 6D head poses from an egocentric video. In particular, we predict both head translations and rotations to learn the active information-gathering behavior expressed through head-turning events. To solve this task, we propose a framework that reasons over temporally aggregated 3D latent features, which models the geometric and semantic constraints for both the static and dynamic parts of the environment. Motivated by the lack of training data in this space, we further contribute a data collection pipeline using the Project Aria glasses, and present a dataset collected through this approach. Our dataset, dubbed Aria Navigation Dataset (AND), consists of 4 hours of recording of users navigating in real-world scenarios. It includes diverse situations and navigation behaviors, providing a valuable resource for learning real-world egocentric navigation policies. Extensive experiments show that our model learns human-like navigation behaviors such as waiting / slowing down, rerouting, and looking around for traffic while generalizing to unseen environments. Check out our project webpage at https://sites.google.com/stanford.edu/lookout.


[28] Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration cs.CV

Haoran Bai, Xiaoxu Chen, Canqian Yang, Zongyao He, Sibin Deng

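TL;DR: Vivid-VR is a DiT-based generative video restoration method that uses ControlNet to control generation and preserve content consistency. To counter the quality loss caused by conventional fine-tuning, it proposes a concept distillation training strategy and a strengthened control architecture, markedly improving texture realism and temporal consistency.

Details

Motivation: Existing T2V-based controllable video restoration methods suffer distribution drift during fine-tuning because of imperfect multimodal alignment, which harms texture realism and temporal consistency.

Result: It outperforms existing methods on synthetic and real-world benchmarks, achieving high texture realism and temporal consistency.

Insight: Leveraging the generative ability of the pretrained model together with dynamic control-signal modulation markedly improves the quality and controllability of video restoration.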

Abstract: We present Vivid-VR, a DiT-based generative video restoration method built upon an advanced T2V foundation model, where ControlNet is leveraged to control the generation process, ensuring content consistency. However, conventional fine-tuning of such controllable pipelines frequently suffers from distribution drift due to limitations in imperfect multimodal alignment, resulting in compromised texture realism and temporal coherence. To tackle this challenge, we propose a concept distillation training strategy that utilizes the pretrained T2V model to synthesize training samples with embedded textual concepts, thereby distilling its conceptual understanding to preserve texture and temporal quality. To enhance generation controllability, we redesign the control architecture with two key components: 1) a control feature projector that filters degradation artifacts from input video latents to minimize their propagation through the generation pipeline, and 2) a new ControlNet connector employing a dual-branch design. This connector synergistically combines MLP-based feature mapping with cross-attention mechanism for dynamic control feature retrieval, enabling both content preservation and adaptive control signal modulation. Extensive experiments show that Vivid-VR performs favorably against existing approaches on both synthetic and real-world benchmarks, as well as AIGC videos, achieving impressive texture realism, visual vividness, and temporal consistency. The codes and checkpoints are publicly available at https://github.com/csbhr/Vivid-VR.


[29] PB-IAD: Utilizing multimodal foundation models for semantic industrial anomaly detection in dynamic manufacturing environments cs.CV | cs.AI

Bernd Hofmann, Albert Scheck, Joerg Franke, Patrick Bruendl

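TL;DR: PB-IAD is an industrial anomaly detection framework built on multimodal foundation models. Using prompt templates and semantic instructions, it copes with the sparse data and high adaptability demands of dynamic manufacturing environments without requiring large amounts of annotated data.

Details

Motivation: Conventional industrial anomaly detection depends on large annotated datasets and lacks the flexibility to adapt to dynamic production environments. Advances in multimodal foundation models offer a new opportunity.

Result: PB-IAD outperforms state-of-the-art methods such as PatchCore in data-sparse and low-shot settings, achieving high performance through semantic instructions alone.

Insight: Combining prompt engineering with multimodal foundation models opens new possibilities for industrial anomaly detection, performing especially well when annotated data is scarce.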

Abstract: The detection of anomalies in manufacturing processes is crucial to ensure product quality and identify process deviations. Statistical and data-driven approaches remain the standard in industrial anomaly detection, yet their adaptability and usability are constrained by the dependence on extensive annotated datasets and limited flexibility under dynamic production conditions. Recent advances in the perception capabilities of foundation models provide promising opportunities for their adaptation to this downstream task. This paper presents PB-IAD (Prompt-based Industrial Anomaly Detection), a novel framework that leverages the multimodal and reasoning capabilities of foundation models for industrial anomaly detection. Specifically, PB-IAD addresses three key requirements of dynamic production environments: data sparsity, agile adaptability, and domain user centricity. In addition to the anomaly detection, the framework includes a prompt template that is specifically designed for iteratively implementing domain-specific process knowledge, as well as a pre-processing module that translates domain user inputs into effective system prompts. This user-centric design allows domain experts to customise the system flexibly without requiring data science expertise. The proposed framework is evaluated by utilizing GPT-4.1 across three distinct manufacturing scenarios, two data modalities, and an ablation study to systematically assess the contribution of semantic instructions. Furthermore, PB-IAD is benchmarked to state-of-the-art methods for anomaly detection such as PatchCore. The results demonstrate superior performance, particularly in data-sparse scenarios and low-shot settings, achieved solely through semantic instructions.


[30] Adversarial Generation and Collaborative Evolution of Safety-Critical Scenarios for Autonomous Vehicles cs.CV

Jiangfan Liu, Yongkang Guo, Fangzhi Zhong, Tianyuan Zhang, Zonglei Jing

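TL;DR: The paper proposes ScenGE, a framework that uses adversarial generation and collaborative evolution to produce diverse safety-critical scenarios for autonomous vehicles. It markedly increases the severity of collision cases, and experiments validate its effectiveness and practicality.

Details

Motivation: Safety evaluation of autonomous vehicles currently relies on predefined threat patterns or rule-based methods, which struggle to expose diverse and unforeseen failure modes. A method that generates more diverse and more severe safety-critical scenarios is therefore needed.

Result: Experiments show that ScenGE uncovers 31.96% more severe collision cases on average than existing methods, and adversarial training on its scenarios improves model robustness. Real-world vehicle tests and human evaluation confirm that the scenarios are plausible.

Insight: 1. Adversarial generation and collaborative evolution effectively produce diverse safety-critical scenarios; 2. Large language models show promise for scenario generation; 3. The generated scenarios have practical value for improving autonomous driving safety.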

Abstract: The generation of safety-critical scenarios in simulation has become increasingly crucial for safety evaluation in autonomous vehicles prior to road deployment in society. However, current approaches largely rely on predefined threat patterns or rule-based strategies, which limit their ability to expose diverse and unforeseen failure modes. To overcome these limitations, we propose ScenGE, a framework that can generate plentiful safety-critical scenarios by reasoning about novel adversarial cases and then amplifying them with complex traffic flows. Given a simple prompt of a benign scene, it first performs Meta-Scenario Generation, where a large language model, grounded in structured driving knowledge, infers an adversarial agent whose behavior poses a threat that is both plausible and deliberately challenging. This meta-scenario is then specified in executable code for precise in-simulator control. Subsequently, Complex Scenario Evolution uses background vehicles to amplify the core threat introduced by Meta-Scenario. It builds an adversarial collaborator graph to identify key agent trajectories for optimization. These perturbations are designed to simultaneously reduce the ego vehicle’s maneuvering space and create critical occlusions. Extensive experiments conducted on multiple reinforcement learning based AV models show that ScenGE uncovers more severe collision cases (+31.96%) on average than SoTA baselines. Additionally, ScenGE can be applied to large model based AV systems and deployed on different simulators; we further observe that adversarial training on our scenarios improves the model robustness. Finally, we validate our framework through real-world vehicle tests and human evaluation, confirming that the generated scenarios are both plausible and critical. We hope our paper can be a critical step towards building public trust in autonomous vehicles and ensuring their safe deployment.


[31] WISE-FUSE: Efficient Whole Slide Image Encoding via Coarse-to-Fine Patch Selection with VLM and LLM Knowledge Fusion cs.CV

Yonghan Shin, SeungKyu Kim, Won-Ki Jeong

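TL;DR: WISE-FUSE is an efficient whole slide image (WSI) encoding framework that combines knowledge from vision-language models (VLMs) and large language models (LLMs) to selectively process diagnostically relevant regions, markedly reducing computational burden and encoding time.

Details

Motivation: The gigapixel resolution of WSIs poses a major computational challenge. Conventional methods must process tens to hundreds of thousands of high-resolution patches per slide, so encoding cost and time are too high for efficient deployment in practice.

Result: Experiments show that WISE-FUSE reduces encoding time by more than threefold while matching or exceeding the diagnostic performance of exhaustive patch processing.

Insight: Combining multimodal (visual and textual) knowledge reduces computation while maintaining or even improving diagnostic performance, offering a practical solution for real-world computational pathology.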

Abstract: Whole slide images (WSIs) in computational pathology (CPath) pose a major computational challenge due to their gigapixel scale, often requiring the processing of tens to hundreds of thousands of high-resolution patches per slide. This results in prohibitive encoding costs, with preprocessing and training times extending to days or even weeks, making WSI encoding the most significant bottleneck in real-world deployment. In this work, we propose WISE-FUSE, an adaptive WSI encoding framework that leverages pathology-domain vision-language models and large language models to address this challenge by selectively processing diagnostically relevant regions. WISE-FUSE first computes similarity scores between low-resolution patches and class-specific textual descriptions using a knowledge distillation mechanism that preserves fine-grained diagnostic features. Based on these similarity scores, we select a small subset of informative regions for the target task, which quickly eliminates irrelevant patches at the coarse level. The corresponding high-resolution patches are then selectively encoded and fused with textual embeddings to reinforce diagnostic context. Extensive experiments demonstrate that WISE-FUSE reduces WSI encoding time by over threefold while achieving diagnostic performance comparable to or surpassing that of exhaustive patch processing, offering a scalable and practical solution for CPath.
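
The coarse selection step can be sketched as ranking low-resolution patch embeddings by their similarity to class-description embeddings and keeping the top few. The sketch assumes cosine similarity, a max over classes, and a fixed `keep` budget; the embeddings are toy 2-D vectors, and the knowledge distillation and fusion stages of the paper are not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_patches(patch_embeddings, text_embeddings, keep):
    """Score each patch by its best match to any class text, keep the top `keep`."""
    scores = [max(cosine(p, t) for t in text_embeddings) for p in patch_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep]), scores


# Toy low-resolution patch embeddings and one class-description embedding.
patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
class_texts = [[1.0, 0.0]]
selected, scores = select_patches(patches, class_texts, keep=2)
```

Only the high-resolution counterparts of `selected` would then be encoded, which is where the time savings come from.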


[32] Improving OCR using internal document redundancy cs.CV | cs.LG | eess.IV

Diego Belzarena, Seginus Mowlavi, Aitor Artola, Camilo Mariño, Marina Gardella

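TL;DR: The paper proposes an unsupervised method that exploits the redundancy of character shapes within a document to improve OCR output, using an extended Gaussian Mixture Model (GMM) and statistical testing to raise character recognition accuracy.

Details

Motivation: Current OCR systems perform poorly on low-quality data, especially printed documents, and do not fully exploit the redundancy within each document.

Result: The method shows improvements on documents with varying levels of degradation, such as Uruguayan military archives and historical European newspapers.

Insight: Redundancy within a document is an effective resource for improving OCR, especially on low-quality data.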

Abstract: Current OCR systems are based on deep learning models trained on large amounts of data. Although they have shown some ability to generalize to unseen data, especially in detection tasks, they can struggle with recognizing low-quality data. This is particularly evident for printed documents, where intra-domain data variability is typically low, but inter-domain data variability is high. In that context, current OCR methods do not fully exploit each document’s redundancy. We propose an unsupervised method that leverages the redundancy of character shapes within a document to correct imperfect outputs of a given OCR system and suggest better clustering. To this end, we introduce an extended Gaussian Mixture Model (GMM) by alternating an Expectation-Maximization (EM) algorithm with an intra-cluster realignment process and normality statistical testing. We demonstrate improvements in documents with various levels of degradation, including recovered Uruguayan military archives and 17th to mid-20th century European newspapers.
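
As background for the EM component, the sketch below fits a plain one-dimensional Gaussian mixture with textbook EM. It is deliberately simpler than the paper's method: the intra-cluster realignment and normality testing are not reproduced, each character is reduced to a single hypothetical scalar feature, and the initial means would in practice come from the OCR system's labels.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, init_means, iterations=50, var_floor=1e-3):
    """Textbook EM for a 1-D Gaussian mixture; returns (weights, means, variances)."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in data:
            likes = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(likes) or 1e-300
            resp.append([l / total for l in likes])
        # M-step: re-estimate parameters from the soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, var_floor)  # floor avoids collapsed clusters
            weights[j] = nj / len(data)
    return weights, means, variances


# Hypothetical scalar shape features for two character classes.
features = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
weights, means, variances = fit_gmm_1d(features, init_means=[0.0, 5.0])
```

Samples whose responsibility disagrees with their OCR label are candidates for correction, which is the intuition behind using within-document redundancy.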


[33] A Comprehensive Review of Agricultural Parcel and Boundary Delineation from Remote Sensing Images: Recent Progress and Future Perspectives cs.CV | eess.IV

Juepeng Zheng, Zi Ye, Yibin Wen, Jianxi Huang, Zhiwei Zhang

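TL;DR: This review summarizes research progress on agricultural parcel and boundary delineation (APBD) from remote sensing images, categorizes traditional image processing, machine learning, and deep learning methods, and discusses future directions.

Details

Motivation: The growth of high-resolution remote sensing imagery makes automated, efficient, and accurate agricultural parcel analysis possible, driving demand for APBD research. The review aims to systematically organize APBD methods and give researchers a clear knowledge map and view of trends.

Result: The literature review shows that deep learning methods dominate APBD, performing especially well on high-resolution remote sensing imagery. The review also summarizes comparisons among methods and the scenarios each suits.

Insight: 1. Deep learning has great potential for APBD, especially when combined with newer techniques such as Transformers; 2. Multi-sensor data and multi-task learning are key to improving APBD performance; 3. Future work could target automated annotation, few-shot learning, and cross-domain generalization.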

Abstract: Powered by advances in multiple remote sensing sensors, the production of high spatial resolution images provides great potential to achieve cost-efficient and high-accuracy agricultural inventory and analysis in an automated way. Many studies aiming to provide an inventory at the level of individual agricultural parcels have produced numerous methods for Agricultural Parcel and Boundary Delineation (APBD). This review covers APBD methods for detecting and delineating agricultural parcels and systematically reviews the past and present of APBD-related research applied to remote sensing images. With the goal of providing a clear knowledge map of existing APBD efforts, we conduct a comprehensive review of recent APBD papers to build a meta-data analysis, including the algorithm, the study site, the crop type, the sensor type, the evaluation method, etc. We categorize the methods into three classes: (1) traditional image processing methods (including pixel-based, edge-based and region-based); (2) traditional machine learning methods (such as random forest, decision tree); and (3) deep learning-based methods. With deep learning-oriented approaches accounting for the majority, we further discuss deep learning-based methods such as semantic segmentation-based, object detection-based and Transformer-based methods. In addition, we discuss five APBD-related issues to further comprehend the APBD domain using remote sensing data, such as multi-sensor data in the APBD task, comparisons between single-task learning and multi-task learning in the APBD domain, comparisons among different algorithms and different APBD tasks, etc. Finally, this review proposes some APBD-related applications and a few exciting prospects and potential hot topics in future APBD research. We hope this review helps researchers involved in the APBD domain keep track of its development and trends.


[34] Controllable Latent Space Augmentation for Digital Pathology cs.CV

Sofiène Boutaj, Marin Scalbert, Pierre Marza, Florent Couzinie-Devy, Maria Vakalopoulou

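TL;DR: The paper proposes HistAug, a generative model for controllable latent-space augmentation in digital pathology that addresses the shortcomings of conventional image augmentation and improves multiple instance learning (MIL) models.

Details

Motivation: Because whole slide images (WSIs) are extremely high-resolution and dense supervision signals are scarce, conventional image augmentation cannot efficiently increase data diversity and reduce overfitting in digital pathology.

Result: Experiments show that HistAug outperforms existing methods across multiple organs and in low-data tasks, improving MIL model performance.

Insight: The paper shows that learned transformations beat noise-based perturbations and highlights the importance of uniform WSI-wise augmentation.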

Abstract: Whole slide image (WSI) analysis in digital pathology presents unique challenges due to the gigapixel resolution of WSIs and the scarcity of dense supervision signals. While Multiple Instance Learning (MIL) is a natural fit for slide-level tasks, training robust models requires large and diverse datasets. Even though image augmentation techniques could be utilized to increase data variability and reduce overfitting, implementing them effectively is not a trivial task. Traditional patch-level augmentation is prohibitively expensive due to the large number of patches extracted from each WSI, and existing feature-level augmentation methods lack control over transformation semantics. We introduce HistAug, a fast and efficient generative model for controllable augmentations in the latent space for digital pathology. By conditioning on explicit patch-level transformations (e.g., hue, erosion), HistAug generates realistic augmented embeddings while preserving initial semantic information. Our method allows the processing of a large number of patches in a single forward pass efficiently, while at the same time consistently improving MIL model performance. Experiments across multiple slide-level tasks and diverse organs show that HistAug outperforms existing methods, particularly in low-data regimes. Ablation studies confirm the benefits of learned transformations over noise-based perturbations and highlight the importance of uniform WSI-wise augmentation. Code is available at https://github.com/MICS-Lab/HistAug.


[35] Reliable Smoke Detection via Optical Flow-Guided Feature Fusion and Transformer-Based Uncertainty Modeling cs.CV

Nitish Kumar Mahala, Muzammil Khan, Pushpendra Kumar

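TL;DR: The paper proposes a reliable smoke detection method based on optical flow-guided feature fusion and Transformer-based uncertainty modeling, using a Two-Phase Uncertainty-Aware Shifted Windows Transformer to improve robustness.

Details

Motivation: Smoke's complex spatiotemporal dynamics, illumination changes, and environmental noise make conventional detectors unreliable, so a high-accuracy early-warning method that avoids complex multi-sensor setups is needed.

Result: Experiments show that the method outperforms existing techniques on multiple metrics, with strong generalization and robustness.

Insight: By jointly modeling aleatoric and epistemic uncertainty, the model can assess prediction confidence more reliably, which suits industrial safety and surveillance settings.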

Abstract: Fire outbreaks pose critical threats to human life and infrastructure, necessitating high-fidelity early-warning systems that detect combustion precursors such as smoke. However, smoke plumes exhibit complex spatiotemporal dynamics influenced by illumination variability, flow kinematics, and environmental noise, undermining the reliability of traditional detectors. To address these challenges without the logistical complexity of multi-sensor arrays, we propose an information-fusion framework that integrates smoke feature representations extracted from monocular imagery. Specifically, we propose a Two-Phase Uncertainty-Aware Shifted Windows Transformer for robust and reliable smoke detection, leveraging a novel smoke segmentation dataset constructed via optical flow-based motion encoding. The optical flow estimation is performed with a four-color-theorem-inspired dual-phase level-set fractional-order variational model, which preserves motion discontinuities. The resulting color-encoded optical flow maps are fused with appearance cues via a Gaussian Mixture Model to generate binary segmentation masks of the smoke regions. These fused representations are fed into the novel Shifted-Windows Transformer, which is augmented with a multi-scale uncertainty estimation head and trained under a two-phase learning regimen. The first learning phase optimizes smoke detection accuracy, while during the second phase, the model learns to estimate plausibility confidence in its predictions by jointly modeling aleatoric and epistemic uncertainties. Extensive experiments using multiple evaluation metrics and comparative analysis with state-of-the-art approaches demonstrate superior generalization and robustness, offering a reliable solution for early fire detection in surveillance, industrial safety, and autonomous monitoring applications.
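
The abstract does not say how the two uncertainties are estimated. A common decomposition, shown below as an assumption rather than the paper's method, runs several stochastic forward passes that each predict a mean and a variance: the average predicted variance is read as aleatoric uncertainty and the spread of the predicted means as epistemic uncertainty. The sample values are made up for illustration.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(samples):
    """Split predictive variance given (mean, variance) pairs from repeated passes.

    Returns (aleatoric, epistemic, total).
    """
    means = [m for m, _ in samples]
    variances = [v for _, v in samples]
    aleatoric = fmean(variances)   # average noise the model attributes to the data
    epistemic = pvariance(means)   # disagreement between stochastic passes
    return aleatoric, epistemic, aleatoric + epistemic


# Hypothetical smoke probabilities and variances for one pixel over three passes.
samples = [(0.8, 0.02), (0.7, 0.04), (0.9, 0.03)]
aleatoric, epistemic, total = decompose_uncertainty(samples)
```

A detector can then flag pixels with high epistemic uncertainty as cases where the model itself is unsure, independently of how noisy the input is.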


[36] Incremental Object Detection with Prompt-based Methods cs.CV

Matthias Neuwirth-Trapp, Maarten Bieshaar, Danda Pani Paudel, Luc Van Gool

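TL;DR: This paper studies visual prompt-based methods for incremental object detection (IOD). It finds that they underperform in a complex domain-incremental learning setting, but that combining them with replay of a small amount of previous data gives the best results.

Details

Motivation: To examine whether visual prompt-based incremental learning methods generalize to object detection, a gap that prior work had left open.

Result: Experiments show that purely visual prompt-based methods underperform in IOD, but performance improves markedly when they are combined with data replay. Additional experiments on prompt length and initialization provide further insights.

Insight: Prompt-based methods in IOD need to be combined with other techniques, such as data replay, to perform at their best, which points to a direction for future research.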

Abstract: Visual prompt-based methods have seen growing interest in incremental learning (IL) for image classification. These approaches learn additional embedding vectors while keeping the model frozen, making them efficient to train. However, no prior work has applied such methods to incremental object detection (IOD), leaving their generalizability unclear. In this paper, we analyze three different prompt-based methods under a complex domain-incremental learning setting. We additionally provide a wide range of reference baselines for comparison. Empirically, we show that the prompt-based approaches we tested underperform in this setting. However, a strong yet practical method, combining visual prompts with replaying a small portion of previous data, achieves the best results. Together with additional experiments on prompt length and initialization, our findings offer valuable insights for advancing prompt-based IL in IOD.
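
To make "replaying a small portion of previous data" concrete, the sketch below mixes a fixed fraction of stored earlier-domain samples into each training batch. It is an illustration under stated assumptions: the function name, the `replay_fraction` parameter, and uniform sampling from the buffer are hypothetical and are not taken from the paper.

```python
import random


def build_training_batch(current_data, replay_buffer, batch_size,
                         replay_fraction=0.1, rng=None):
    """Mix samples from the current domain with a few replayed old samples.

    replay_fraction is a hypothetical knob: the share of each batch drawn
    from the buffer of previously seen domains.
    """
    rng = rng or random.Random()
    n_replay = min(len(replay_buffer), int(round(batch_size * replay_fraction)))
    n_current = min(len(current_data), batch_size - n_replay)
    batch = rng.sample(replay_buffer, n_replay) + rng.sample(current_data, n_current)
    rng.shuffle(batch)  # avoid a fixed old/new ordering inside the batch
    return batch
```

With `batch_size=10` and `replay_fraction=0.2`, each batch holds two replayed samples and eight from the current domain.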


[37] Virtual Community: An Open World for Humans, Robots, and Society cs.CV | cs.CL | cs.RO

Qinhong Zhou, Hongxin Zhang, Xiangye Lin, Zheyuan Zhang, Yutian Chen

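TL;DR: This paper presents Virtual Community, an open-world platform for studying the coexistence of humans and robots in society. It supports multi-agent cooperation and competition and introduces two new challenge tasks.

Details

Motivation: With rapid progress in AI and robotics, society may undergo a transformation in which humans and robots coexist, and the resulting opportunities and challenges need to be studied.

Result: Baseline experiments show that existing methods struggle with both high-level planning and low-level cooperative control on these tasks, demonstrating the platform's usefulness.

Insight: Virtual Community offers a new platform for studying the social intelligence of humans and robots in open worlds and advances research in related fields.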

Abstract: The rapid progress in AI and Robotics may lead to a profound societal transformation, as humans and robots begin to coexist within shared communities, introducing both opportunities and challenges. To explore this future, we present Virtual Community, an open-world platform for humans, robots, and society, built on a universal physics engine and grounded in real-world 3D scenes. With Virtual Community, we aim to study embodied social intelligence at scale: 1) How robots can intelligently cooperate or compete; 2) How humans develop social relations and build community; 3) More importantly, how intelligent robots and humans can co-exist in an open world. To support these, Virtual Community features: 1) An open-source multi-agent physics simulator that supports robots, humans, and their interactions within a society; 2) A large-scale, real-world aligned community generation pipeline, including vast outdoor space, diverse indoor scenes, and a community of grounded agents with rich characters and appearances. Leveraging Virtual Community, we propose two novel challenges. The Community Planning Challenge evaluates multi-agent reasoning and planning ability in open-world settings, such as cooperating to help agents with daily activities and efficiently connecting other agents. The Community Robot Challenge requires multiple heterogeneous robots to collaborate in solving complex open-world tasks. We evaluate various baselines on these tasks and demonstrate the challenges in both high-level open-world task planning and low-level cooperation controls. We hope that Virtual Community will unlock further study of human-robot coexistence within open-world environments.


[38] UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling cs.CV | cs.AI

Peiming Li, Ziyi Wang, Yulin Yuan, Hong Liu, Xiangming Meng

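TL;DR: UST-SSM proposes a unified spatio-temporal state space model for point cloud videos. It reorganizes unordered points with Spatial-Temporal Selection Scanning (STSS) and strengthens spatio-temporal features and temporal dependencies with Spatio-Temporal Structure Aggregation (STSA) and Temporal Interaction Sampling (TIS), enabling efficient dynamic 3D action recognition.

Details

Motivation: Point cloud videos capture dynamic 3D motion, which gives them an advantage in recognizing continuous and fine-grained human actions, but their spatio-temporal disorder prevents conventional selective state space models (SSMs) from being applied directly, so a new approach is needed.

Result: Experiments on the MSR-Action3D, NTU RGB+D, and Synthia 4D datasets validate the effectiveness of UST-SSM.

Insight: 1. The spatio-temporal disorder of point cloud videos can be addressed through semantic reorganization and feature aggregation.
2. Strengthening temporal interaction is crucial for dynamic 3D action recognition.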

Abstract: Point cloud videos capture dynamic 3D motion while reducing the effects of lighting and viewpoint variations, making them highly effective for recognizing subtle and continuous human actions. Although Selective State Space Models (SSMs) have shown good performance in sequence modeling with linear complexity, the spatio-temporal disorder of point cloud videos hinders their unidirectional modeling when directly unfolding the point cloud video into a 1D sequence through temporally sequential scanning. To address this challenge, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends the latest advancements in SSMs to point cloud videos. Specifically, we introduce Spatial-Temporal Selection Scanning (STSS), which reorganizes unordered points into semantic-aware sequences through prompt-guided clustering, thereby enabling the effective utilization of points that are spatially and temporally distant yet similar within the sequence. To compensate for missing 4D geometric and motion details, Spatio-Temporal Structure Aggregation (STSA) aggregates spatio-temporal features. To improve temporal interaction within the sampled sequence, Temporal Interaction Sampling (TIS) enhances fine-grained temporal dependencies through non-anchor frame utilization and expanded receptive fields. Experimental results on the MSR-Action3D, NTU RGB+D, and Synthia 4D datasets validate the effectiveness of our method. Our code is available at https://github.com/wangzy01/UST-SSM.


[39] SMTrack: End-to-End Trained Spiking Neural Networks for Multi-Object Tracking in RGB Videos cs.CV

Pengzhi Zhong, Xinzhe Wang, Dan Zeng, Qihua Zhou, Feixiang He

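TL;DR: SMTrack is the first framework to train a deep spiking neural network (SNN) end to end for multi-object tracking directly on standard RGB videos. Using an adaptive and scale-aware Normalized Wasserstein Distance loss (Asa-NWDLoss) and the TrackTrack identity module, it achieves performance comparable to mainstream artificial neural network (ANN)-based MOT methods.

Details

Motivation: Although spiking neural networks (SNNs) show promise for low-power computation, their use in vision has been largely confined to image classification, object detection, and event-based tracking. Direct training of SNNs for complex temporal tasks such as multi-object tracking (MOT) on standard RGB videos remains underexplored.

Result: Experiments on BEE24, MOT17, MOT20, and DanceTrack show that SMTrack performs on par with mainstream ANN-based MOT methods, demonstrating the capability of SNNs for efficient multi-object tracking in complex scenes.

Insight: Training an SNN directly on RGB videos for MOT demonstrates the potential of SNNs for complex temporal tasks and offers a new approach for low-power vision systems.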

Abstract: Brain-inspired Spiking Neural Networks (SNNs) exhibit significant potential for low-power computation, yet their application in visual tasks remains largely confined to image classification, object detection, and event-based tracking. In contrast, real-world vision systems still widely use conventional RGB video streams, where the potential of directly-trained SNNs for complex temporal tasks such as multi-object tracking (MOT) remains underexplored. To address this challenge, we propose SMTrack, the first directly trained deep SNN framework for end-to-end multi-object tracking on standard RGB videos. SMTrack introduces an adaptive and scale-aware Normalized Wasserstein Distance loss (Asa-NWDLoss) to improve detection and localization performance under varying object scales and densities. Specifically, the method computes the average object size within each training batch and dynamically adjusts the normalization factor, thereby enhancing sensitivity to small objects. For the association stage, we incorporate the TrackTrack identity module to maintain robust and consistent object trajectories. Extensive evaluations on BEE24, MOT17, MOT20, and DanceTrack show that SMTrack achieves performance on par with leading ANN-based MOT methods, advancing robust and accurate SNN-based tracking in complex scenarios.
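
The batch-adaptive normalization described above can be sketched as follows. The Normalized Wasserstein Distance between two boxes, each modeled as a 2D Gaussian, follows the formulation commonly used for tiny-object detection. How Asa-NWDLoss derives its normalization factor is not given in the abstract, so taking the mean of sqrt(w*h) over the batch is an assumption, and all function names are hypothetical.

```python
import math


def nwd(box_a, box_b, c):
    """Normalized Wasserstein Distance similarity for boxes (cx, cy, w, h).

    Each box is modeled as a Gaussian with mean (cx, cy) and standard
    deviations (w/2, h/2); c is the normalization constant.
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    w2_sq = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
             + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)


def batch_adaptive_constant(gt_boxes):
    """Assumed rule: normalization factor = mean object size in the batch."""
    sizes = [math.sqrt(w * h) for _, _, w, h in gt_boxes]
    return sum(sizes) / len(sizes)


def adaptive_nwd_loss_sketch(pred_boxes, gt_boxes):
    """Mean (1 - NWD) over matched prediction/ground-truth pairs."""
    c = batch_adaptive_constant(gt_boxes)
    losses = [1.0 - nwd(p, g, c) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(losses) / len(losses)
```

Because the constant shrinks when the batch contains small objects, the same pixel offset produces a larger loss there, which matches the stated goal of increasing sensitivity to small objects.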


[40] AnchorSync: Global Consistency Optimization for Long Video Editing cs.CV

Zichi Liu, Yinggui Wang, Tao Wei, Chao Ma

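TL;DR: AnchorSync is a diffusion-based video editing framework that achieves global consistency and temporal coherence in long videos through sparse anchor frame editing and intermediate frame interpolation, outperforming existing methods.

Details

Motivation: Long video editing faces the challenges of global structural drift and temporal inconsistency, and existing methods struggle to maintain high-quality edits over minute-long sequences.

Result: Experiments show that AnchorSync surpasses existing methods in visual quality and temporal stability, producing coherent, high-fidelity edits.

Insight: The key to long video editing lies in task decoupling and dynamic consistency constraints; combining diffusion models with multimodal guidance is an effective way to improve results.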

Abstract: Editing long videos remains a challenging task due to the need for maintaining both global consistency and temporal coherence across thousands of frames. Existing methods often suffer from structural drift or temporal artifacts, particularly in minute-long sequences. We introduce AnchorSync, a novel diffusion-based framework that enables high-quality, long-term video editing by decoupling the task into sparse anchor frame editing and smooth intermediate frame interpolation. Our approach enforces structural consistency through a progressive denoising process and preserves temporal dynamics via multimodal guidance. Extensive experiments show that AnchorSync produces coherent, high-fidelity edits, surpassing prior methods in visual quality and temporal stability.


[41] GeMS: Efficient Gaussian Splatting for Extreme Motion Blur cs.CV

Gopi Raju Matta, Trisha Reddypalli, Vemunuri Divya Madhuri, Kaushik Mitra

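TL;DR: GeMS is a 3D Gaussian Splatting (3DGS) framework for images with extreme motion blur that reconstructs scenes directly from blurred inputs without relying on sharp images. GeMS-E adds event-based refinement on top of it to further improve reconstruction.

Details

Motivation: Existing methods for extreme blur (such as ExBluRF and Deblur-GS) rely on sharp images for pose estimation and point cloud generation, while COLMAP-based methods (such as BAD-Gaussians) suffer from unreliable feature correspondences under severe blur. These assumptions do not hold in practice, so a solution that reconstructs scenes directly from blurred inputs is needed.

Result: GeMS and GeMS-E outperform existing methods on synthetic and real-world datasets and, according to the authors, are the first to reconstruct 3D scenes within 3DGS directly from extremely blurred inputs.

Insight: 1. Reconstruction directly from blurred inputs is feasible; 2. Event data (e.g., via EDI) can effectively improve reconstruction of blurred scenes; 3. 3DGS-MCMC provides a robust initialization method for Gaussian Splatting.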

Abstract: We introduce GeMS, a framework for 3D Gaussian Splatting (3DGS) designed to handle severely motion-blurred images. State-of-the-art deblurring methods for extreme blur, such as ExBluRF, as well as Gaussian Splatting-based approaches like Deblur-GS, typically assume access to sharp images for camera pose estimation and point cloud generation, an unrealistic assumption. Methods relying on COLMAP initialization, such as BAD-Gaussians, also fail due to unreliable feature correspondences under severe blur. To address these challenges, we propose GeMS, a 3DGS framework that reconstructs scenes directly from extremely blurred images. GeMS integrates: (1) VGGSfM, a deep learning-based Structure-from-Motion pipeline that estimates poses and generates point clouds directly from blurred inputs; (2) 3DGS-MCMC, which enables robust scene initialization by treating Gaussians as samples from a probability distribution, eliminating heuristic densification and pruning; and (3) joint optimization of camera trajectories and Gaussian parameters for stable reconstruction. While this pipeline produces strong results, inaccuracies may remain when all inputs are severely blurred. To mitigate this, we propose GeMS-E, which integrates a progressive refinement step using events: (4) Event-based Double Integral (EDI) deblurring restores sharper images that are then fed into GeMS, improving pose estimation, point cloud generation, and overall reconstruction. Both GeMS and GeMS-E achieve state-of-the-art performance on synthetic and real-world datasets. To our knowledge, this is the first framework to address extreme motion blur within 3DGS directly from severely blurred inputs.
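
For readers unfamiliar with Event-based Double Integral (EDI) deblurring, the per-pixel sketch below shows the idea as it is usually formulated in the event-vision literature: the blurred intensity is the time average of latent intensities over the exposure, and each latent intensity equals the sharp reference intensity scaled by exp(c * E(t)), where E(t) is the signed event count accumulated since the reference time and c is the contrast threshold. The discretization and names are assumptions of this sketch, not details from the GeMS-E paper.

```python
import math


def edi_sharp_intensity(blurred, cumulative_events, contrast_threshold):
    """Recover the sharp intensity at the reference time for one pixel.

    blurred: blurred pixel intensity (time average over the exposure).
    cumulative_events: signed event counts E(t_k) accumulated from the
        reference time to each of the sampled times t_k in the exposure.
    contrast_threshold: event camera contrast threshold c.
    """
    # Discrete version of (1/T) * integral of exp(c * E(t)) dt.
    mean_gain = sum(math.exp(contrast_threshold * e)
                    for e in cumulative_events) / len(cumulative_events)
    return blurred / mean_gain
```

With no events during the exposure the gain is 1 and the blurred value is returned unchanged, which is the expected behaviour for a static pixel.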


[42] Seeing Further on the Shoulders of Giants: Knowledge Inheritance for Vision Foundation Models cs.CV

Jiabo Huang, Chen Chen, Lingjuan Lyu

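TL;DR: The paper proposes a model-driven approach for training vision foundation models (VFMs). Through joint knowledge transfer and preservation, it uses the knowledge of multiple pretrained models to build a general-purpose VFM, avoiding the need to train on large-scale data.

Details

Motivation: Vision foundation models are currently developed mainly with data-centric methods that require large amounts of high-quality labeled data and compute, which is a barrier for most institutions. Meanwhile, many open-source domain-specific models already encode rich knowledge, and using these resources effectively is a key challenge.

Result: The method outperforms existing data-centric models on four fundamental tasks: image classification, object detection, semantic segmentation, and instance segmentation.

Insight: Joint knowledge transfer and preservation makes effective use of existing pretrained models, reduces reliance on large-scale data, and improves the generality and performance of the resulting model.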

Abstract: Vision foundation models (VFMs) are predominantly developed using data-centric methods. These methods require training on vast amounts of data usually with high-quality labels, which poses a bottleneck for most institutions that lack both large-scale data and high-end GPUs. On the other hand, many open-source vision models have been pretrained on domain-specific data, enabling them to distill and represent core knowledge in a form that is transferable across diverse applications. Even though these models are highly valuable assets, they remain largely under-explored in empowering the development of a general-purpose VFM. In this paper, we present a new model-driven approach for training VFMs through joint knowledge transfer and preservation. Our method unifies multiple pre-trained teacher models in a shared latent space to mitigate the "imbalanced transfer" issue caused by their distributional gaps. Besides, we introduce a knowledge preservation strategy to take a general-purpose teacher as a knowledge base for integrating knowledge from the remaining purpose-specific teachers using an adapter module. By unifying and aggregating existing models, we build a powerful VFM to inherit teachers’ expertise without needing to train on a large amount of labeled data. Our model not only provides generalizable visual features, but also inherently supports multiple downstream tasks. Extensive experiments demonstrate that our VFM outperforms existing data-centric models across four fundamental vision tasks, including image classification, object detection, semantic and instance segmentation.


[43] Multiscale Video Transformers for Class Agnostic Segmentation in Autonomous Driving cs.CV

Leila Cheshmi, Mennatullah Siam

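TL;DR: The paper proposes multiscale video transformers for class-agnostic segmentation in autonomous driving. They detect unknown objects from motion cues, avoiding the limitation of relying on known classes, and use an efficient decoder and memory design that preserves high-resolution information and captures multiscale features.

Details

Motivation: Safety in autonomous driving requires handling unknown objects and unforeseen scenarios. Existing video segmentation methods typically rely on training data with known classes and overlook novel categories. In addition, visual grounding methods based on large language models are computationally expensive and poorly suited to pixel-level output. An efficient class-agnostic segmentation method is therefore needed.

Result: Experiments on DAVIS’16, KITTI, and Cityscapes show that the method consistently outperforms multiscale baselines while being efficient in GPU memory and run time, making it suitable for real-time dense prediction.

Insight: The paper introduces a memory-centric design that preserves high-resolution information and, through multiscale features and motion cues, detects unknown objects, offering a new approach for safety-critical robotics tasks.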

Abstract: Ensuring safety in autonomous driving is a complex challenge requiring handling unknown objects and unforeseen driving scenarios. We develop multiscale video transformers capable of detecting unknown objects using only motion cues. Video semantic and panoptic segmentation often relies on known classes seen during training, overlooking novel categories. Recent visual grounding with large language models is computationally expensive, especially for pixel-level output. We propose an efficient video transformer trained end-to-end for class-agnostic segmentation without optical flow. Our method uses multi-stage multiscale query-memory decoding and a scale-specific random drop-token to ensure efficiency and accuracy, maintaining detailed spatiotemporal features with a shared, learnable memory module. Unlike conventional decoders that compress features, our memory-centric design preserves high-resolution information at multiple scales. We evaluate on DAVIS’16, KITTI, and Cityscapes. Our method consistently outperforms multiscale baselines while being efficient in GPU memory and run-time, demonstrating a promising direction for real-time, robust dense prediction in safety-critical robotics.
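
The abstract mentions a scale-specific random drop-token without giving details. One plausible reading, sketched below purely as an assumption, is that a random subset of tokens is kept at each feature scale, with finer scales (which have more tokens) dropping a larger share to save memory. The keep ratios, scale names, and function names are hypothetical.

```python
import random


def random_drop_tokens(tokens, keep_ratio, rng):
    """Keep a random subset of tokens, preserving their original order."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    kept = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept]


# Hypothetical per-scale keep ratios: finer scales drop more tokens.
KEEP_RATIO_BY_SCALE = {"1/4": 0.25, "1/8": 0.5, "1/16": 1.0}


def drop_per_scale(tokens_by_scale, seed=0):
    """Apply the scale-specific keep ratio to each scale's token list."""
    rng = random.Random(seed)
    return {
        scale: random_drop_tokens(tokens, KEEP_RATIO_BY_SCALE[scale], rng)
        for scale, tokens in tokens_by_scale.items()
    }
```

In this sketch 64 tokens at the finest scale are reduced to 16, while the coarsest scale is left untouched.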


[44] Fusing Monocular RGB Images with AIS Data to Create a 6D Pose Estimation Dataset for Marine Vessels cs.CV | cs.RO

Fabian Holst, Emre Gülsoylu, Simone Frintrop

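TL;DR: The paper proposes a new method for creating a 6D pose estimation dataset for marine vessels by fusing monocular RGB images with AIS data. It addresses the limitations of relying on AIS alone and produces a high-quality dataset without manual annotation.

Details

Motivation: Conventional approaches rely on AIS data for vessel positions, but AIS suffers from equipment reliability issues, data manipulation, and transmission delays. To overcome these limitations, the paper combines visual data with AIS data.

Result: The PnP method yields a significantly lower projection error than the homography method; YOLOX-X reaches an mAP of 0.80 at an IoU threshold of 0.5; and a dataset of 3753 annotated images is released.

Insight: Fusing visual and AIS data can efficiently generate pose datasets and reduce the need for manual annotation; PnP performs better for coordinate alignment.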

Abstract: The paper presents a novel technique for creating a 6D pose estimation dataset for marine vessels by fusing monocular RGB images with Automatic Identification System (AIS) data. The proposed technique addresses the limitations of relying purely on AIS for location information, caused by issues like equipment reliability, data manipulation, and transmission delays. By combining vessel detections from monocular RGB images, obtained using an object detection network (YOLOX-X), with AIS messages, the technique generates 3D bounding boxes that represent the vessels’ 6D poses, i.e., spatial and rotational dimensions. The paper evaluates different object detection models to locate vessels in image space. We also compare two transformation methods (homography and Perspective-n-Point) for aligning AIS data with image coordinates. The results of our work demonstrate that the Perspective-n-Point (PnP) method achieves a significantly lower projection error compared to the homography-based approaches used before, and the YOLOX-X model achieves a mean Average Precision (mAP) of 0.80 at an Intersection over Union (IoU) threshold of 0.5 for relevant vessel classes. Our results indicate that the approach allows the creation of a 6D pose estimation dataset without manual annotation. Additionally, we introduce the Boats on Nordelbe Kehrwieder (BONK-pose), a publicly available dataset comprising 3753 images with 3D bounding box annotations for pose estimation, created by our data fusion approach. This dataset can be used for training and evaluating 6D pose estimation networks. In addition, we introduce a set of 1000 images with 2D bounding box annotations for ship detection from the same scene.
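
The reported mAP is computed at an IoU threshold of 0.5, meaning a detection counts as correct only if its box overlaps a ground-truth box sufficiently. The sketch below shows the IoU computation for axis-aligned boxes; the `(x1, y1, x2, y2)` box format and the helper names are choices made for this illustration, and the full mAP procedure (ranking by confidence, per-class averaging) is not shown.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """In this sketch a detection matches when IoU reaches the threshold."""
    return box_iou(pred_box, gt_box) >= threshold
```

For example, two 2x2 boxes offset horizontally by one unit share an area of 2 out of a union of 6, giving an IoU of 1/3, which would not count as a match at the 0.5 threshold.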


[45] 6-DoF Object Tracking with Event-based Optical Flow and Frames cs.CV

Zhichao Li, Arren Glover, Chiara Bartolozzi, Lorenzo Natale

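TL;DR: The paper proposes a method that combines event-camera optical flow with global pose estimation from an RGB camera for 6-degree-of-freedom (6-DoF) pose tracking of fast-moving objects.

Details

Motivation: In highly dynamic scenes, conventional cameras struggle to track an object's 6-DoF pose in real time because of frame rate limits and motion blur. Event cameras offer high temporal resolution and low latency, while RGB cameras provide rich visual information.

Result: The method is validated on both synthetic and real-world data and is particularly effective in high-speed motion scenarios.

Insight: By exploiting the complementarity of event and RGB cameras, the method addresses the limitations of conventional cameras in highly dynamic scenes and provides more robust pose tracking for robotic interaction.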

Abstract: Tracking the position and orientation of objects in space (i.e., in 6-DoF) in real time is a fundamental problem in robotics for environment interaction. It becomes more challenging when objects move at high speed due to frame rate limitations in conventional cameras and motion blur. Event cameras are characterized by high temporal resolution, low latency, and high dynamic range, which can potentially overcome the impacts of motion blur. Traditional RGB cameras provide rich visual information that is more suitable for the challenging task of single-shot object pose estimation. In this work, we propose using event-based optical flow combined with an RGB-based global object pose estimator for 6-DoF pose tracking of objects at high speed, exploiting the core advantages of both types of vision sensors. Specifically, we propose an event-based optical flow algorithm for object motion measurement to implement an object 6-DoF velocity tracker. By integrating the tracked object 6-DoF velocity with the low-frequency estimated pose from the global pose estimator, the method can track pose when objects move at high speed. The proposed algorithm is tested and validated on both synthetic and real-world data, demonstrating its effectiveness, especially in high-speed motion scenarios.
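
The fusion idea in the last step (integrate a high-rate velocity estimate and correct it with occasional global pose estimates) can be sketched in a deliberately simplified planar form. The sketch uses a pose `(x, y, theta)` instead of a full 6-DoF pose, assumes velocities are expressed in the world frame, and ignores estimator latency; all of these are simplifying assumptions and the function name is hypothetical.

```python
def track_pose(initial_pose, velocity_samples, dt, global_poses=None):
    """Dead-reckon a planar pose from velocities, with occasional corrections.

    initial_pose: (x, y, theta) at the start.
    velocity_samples: list of (vx, vy, omega), one per high-rate step.
    dt: duration of one step in seconds.
    global_poses: optional dict mapping step index -> (x, y, theta) from a
        slower global pose estimator; when present, it replaces the
        integrated pose at that step.
    """
    global_poses = global_poses or {}
    pose = initial_pose
    trajectory = []
    for step, (vx, vy, omega) in enumerate(velocity_samples):
        if step in global_poses:
            pose = global_poses[step]  # low-frequency correction removes drift
        x, y, theta = pose
        pose = (x + vx * dt, y + vy * dt, theta + omega * dt)
        trajectory.append(pose)
    return trajectory
```

Between corrections the pose is carried forward by velocity alone, so tracking continues at the velocity tracker's rate even when the global estimator runs slowly.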


[46] MF-LPR$^2$: Multi-Frame License Plate Image Restoration and Recognition using Optical Flow cs.CV | cs.AI

Kihyun Na, Junseok Oh, Youngkwan Cho, Bumjin Kim, Sungmin Cho

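TL;DR: MF-LPR² proposes a multi-frame license plate image restoration and recognition framework that uses optical flow to align neighboring frames, improving restoration and recognition of low-quality images.

Details

Motivation: Existing generative models rely on pretrained priors and cannot reliably restore license plate images affected by low resolution, motion blur, and glare, often introducing severe distortions.

Result: MF-LPR² significantly outperforms eight restoration models in PSNR, SSIM, and LPIPS and achieves a recognition accuracy of 86.44%, surpassing all baseline models.

Insight: Multi-frame information fusion and optical flow error correction significantly improve license plate restoration and recognition, and the realistic RLPR dataset provides an important benchmark for future research.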

Abstract: License plate recognition (LPR) is important for traffic law enforcement, crime investigation, and surveillance. However, license plate areas in dash cam images often suffer from low resolution, motion blur, and glare, which make accurate recognition challenging. Existing generative models that rely on pretrained priors cannot reliably restore such poor-quality images, frequently introducing severe artifacts and distortions. To address this issue, we propose a novel multi-frame license plate restoration and recognition framework, MF-LPR$^2$, which addresses ambiguities in poor-quality images by aligning and aggregating neighboring frames instead of relying on pretrained knowledge. To achieve accurate frame alignment, we employ a state-of-the-art optical flow estimator in conjunction with carefully designed algorithms that detect and correct erroneous optical flow estimations by leveraging the spatio-temporal consistency inherent in license plate image sequences. Our approach enhances both image quality and recognition accuracy while preserving the evidential content of the input images. In addition, we constructed a novel Realistic LPR (RLPR) dataset to evaluate MF-LPR$^2$. The RLPR dataset contains 200 pairs of low-quality license plate image sequences and high-quality pseudo ground-truth images, reflecting the complexities of real-world scenarios. In experiments, MF-LPR$^2$ outperformed eight recent restoration models in terms of PSNR, SSIM, and LPIPS by significant margins. In recognition, MF-LPR$^2$ achieved an accuracy of 86.44%, outperforming both the best single-frame LPR (14.04%) and the multi-frame LPR (82.55%) among the eleven baseline models. The results of ablation studies confirm that our filtering and refinement algorithms significantly contribute to these improvements.


[47] Tinker: Diffusion’s Gift to 3D–Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization cs.CV

Canyu Zhao, Xiaoman Li, Tianjian Feng, Zhiyue Zhao, Hao Chen

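TL;DR: Tinker proposes a 3D editing framework that requires no per-scene optimization. It exploits the 3D awareness of pretrained diffusion models to produce multi-view consistent edits and generates high-quality results from only a few inputs.

Details

Motivation: Existing 3D editing techniques usually require extensive per-scene optimization or dozens of consistently edited input views, which is computationally expensive and hard to scale. Tinker aims to address this with a more efficient and general 3D editing solution.

Result: Tinker achieves state-of-the-art performance on editing, novel-view synthesis, and rendering enhancement, significantly lowering the barrier to generalizable 3D content creation.

Insight: By tapping the latent 3D awareness of diffusion models, Tinker shows the potential of 3D editing without per-scene optimization and points to a new direction for scalable zero-shot 3D editing.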

Abstract: We introduce Tinker, a versatile framework for high-fidelity 3D editing that operates in both one-shot and few-shot regimes without any per-scene finetuning. Unlike prior techniques that demand extensive per-scene optimization to ensure multi-view consistency or to produce dozens of consistent edited input views, Tinker delivers robust, multi-view consistent edits from as few as one or two images. This capability stems from repurposing pretrained diffusion models, which unlocks their latent 3D awareness. To drive research in this space, we curate the first large-scale multi-view editing dataset and data pipeline, spanning diverse scenes and styles. Building on this dataset, we develop our framework capable of generating multi-view consistent edited views without per-scene training, which consists of two novel components: (1) Referring multi-view editor: Enables precise, reference-driven edits that remain coherent across all viewpoints. (2) Any-view-to-video synthesizer: Leverages spatial-temporal priors from video diffusion to perform high-quality scene completion and novel-view generation even from sparse inputs. Through extensive experiments, Tinker significantly reduces the barrier to generalizable 3D content creation, achieving state-of-the-art performance on editing, novel-view synthesis, and rendering enhancement tasks. We believe that Tinker represents a key step towards truly scalable, zero-shot 3D editing. Project webpage: https://aim-uofa.github.io/Tinker


[48] Repeating Words for Video-Language Retrieval with Coarse-to-Fine Objectives cs.CV

Haoyu Zhao, Jiaxi Gu, Shicong Wang, Xing Zhang, Hang Xu

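TL;DR: The paper proposes a novel video-text retrieval framework that uses coarse-to-fine objectives and keyword repetition to significantly improve retrieval performance while reducing training cost.

Details

Motivation: The explosive growth of video streaming makes it challenging to achieve both high accuracy and low training cost in video-text retrieval. Existing methods rely on large-scale pretraining, which is computationally expensive, and fine-grained information remains underexplored.

Result: The method performs well on four benchmarks, with Recall@1 gains of 2.1% on MSR-VTT and 1.6% on DiDeMo.

Insight: Keyword repetition can strengthen video-text alignment, and improvements to the inference pipeline can significantly boost performance without additional training.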

Abstract: The explosive growth of video streaming presents challenges in achieving high accuracy and low training costs for video-language retrieval. However, existing methods rely on large-scale pre-training to improve video retrieval performance, resulting in significant computational demands. Additionally, the fine-grained information in videos and texts remains underexplored. To alleviate these problems, we propose a novel framework to learn fine-grained features for better alignment and introduce an inference pipeline to improve performance without additional training. Specifically, we employ coarse-to-fine objectives to understand the semantic information of video-text pairs, including contrastive and matching learning. The fine-grained data used for training is obtained through the Granularity-Aware Representation module, which is designed based on similarity analysis between video frames and words in captions. Furthermore, we observe that the repetition of keywords in the original captions, referred to as “Repetition”, can enhance retrieval performance and improve alignment between video and text. Based on this insight, we propose a novel and effective inference pipeline that incorporates a voting mechanism and a new Matching Entropy metric to achieve better retrieval performance without requiring additional pre-training. Experimental results on four benchmarks demonstrate that the proposed method outperforms previous approaches. Additionally, our inference pipeline achieves significant performance improvements, with a 2.1% increase in Recall@1 on the MSR-VTT dataset and a 1.6% increase on the DiDeMo dataset.
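
The gains above are reported in Recall@1, the fraction of text queries whose ground-truth video is ranked first. A minimal sketch of Recall@K from a text-to-video similarity matrix follows; it assumes that query `i` is paired with video `i`, which is the usual convention for paired retrieval benchmarks but is an assumption of this example. The paper's voting mechanism and Matching Entropy metric are not reproduced here.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose paired item is among the top-k results.

    similarity[i][j] is the score between text query i and video j;
    the correct video for query i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)
```

A 2.1% increase in Recall@1 therefore means that about 21 more queries per thousand retrieve the correct video at rank one.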


[49] EventSSEG: Event-driven Self-Supervised Segmentation with Probabilistic Attention cs.CV

Lakshmi Annamalai, Chetan Singh Thakur

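TL;DR: EventSSEG proposes an event-camera-based self-supervised learning framework for road segmentation. With a probabilistic attention mechanism and event-only computation, it reduces the need for labeled data and achieves low latency and low computational overhead.

Details

Motivation: Conventional frame-camera-based road segmentation suffers from high latency and high compute requirements. Event cameras are a promising low-power alternative, but the lack of pretrained weights and labeled data limits their use.

Result: Experiments on DSEC-Semantic and DDD17 show that EventSSEG achieves state-of-the-art performance with very little labeled data.

Insight: Self-supervised learning for event cameras is an effective way to address the scarcity of labeled data, and probabilistic attention can efficiently handle the temporal dynamics of event data.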

Abstract: Road segmentation is pivotal for autonomous vehicles, yet achieving low-latency and low-compute solutions using frame-based cameras remains a challenge. Event cameras offer a promising alternative. To leverage their low-power sensing, we introduce EventSSEG, a method for road segmentation that uses event-only computing and a probabilistic attention mechanism. Event-only computing poses a challenge in transferring pretrained weights from the conventional camera domain, requiring abundant labeled data, which is scarce. To overcome this, EventSSEG employs event-based self-supervised learning, eliminating the need for extensive labeled data. Experiments on DSEC-Semantic and DDD17 show that EventSSEG achieves state-of-the-art performance with minimal labeled events. This approach maximizes event cameras' capabilities and addresses the lack of labeled events.


[50] Lifespan Pancreas Morphology for Control vs Type 2 Diabetes using AI on Largescale Clinical Imaging cs.CV

Lucas W. Remedios, Chloe Cho, Trent M. Schwartz, Dingjie Su, Gaurav Rudravaram

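TL;DR: The paper uses AI to analyze large-scale clinical imaging data, characterizing how pancreas morphology changes with age from 0 to 90 years and comparing patients with type 2 diabetes to non-diabetic controls.

Details

Motivation: Changes in pancreas morphology are critical for early detection of type 2 diabetes and other pancreatic diseases, but systematic studies are lacking.

Result: After adjusting for confounders, the aging trends of 10 of 13 pancreas morphological features differ significantly between patients with type 2 diabetes and controls (p < 0.05). MRI and CT measurements also differ.

Insight: The pancreas is notably smaller in type 2 diabetes, and morphological features may serve as biomarkers for early diagnosis. The study also provides reference data on pancreas morphology in non-diabetic patients.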

Abstract: Purpose: Understanding how the pancreas changes is critical for detecting deviations in type 2 diabetes and other pancreatic diseases. We measure pancreas size and shape using morphological measurements from ages 0 to 90. Our goals are to 1) identify reliable clinical imaging modalities for AI-based pancreas measurement, 2) establish normative morphological aging trends, and 3) detect potential deviations in type 2 diabetes. Approach: We analyzed a clinically acquired dataset of 2533 patients imaged with abdominal CT or MRI. We resampled the scans to 3 mm isotropic resolution, segmented the pancreas using automated methods, and extracted 13 morphological pancreas features across the lifespan. First, we assessed CT and MRI measurements to determine which modalities provide consistent lifespan trends. Second, we characterized distributions of normative morphological patterns stratified by age group and sex. Third, we used GAMLSS regression to model pancreas morphology trends in 1350 patients matched for age, sex, and type 2 diabetes status to identify any deviations from normative aging associated with type 2 diabetes. Results: When adjusting for confounders, the aging trends for 10 of 13 morphological features were significantly different between patients with type 2 diabetes and non-diabetic controls (p < 0.05 after multiple comparisons corrections). Additionally, MRI appeared to yield different pancreas measurements than CT using our AI-based method. Conclusions: Using 675 control patients and 675 diabetes patients, we provide lifespan trends demonstrating that the size and shape of the pancreas are altered in type 2 diabetes. Moreover, our findings reinforce that the pancreas is smaller in type 2 diabetes. Additionally, we contribute a reference of lifespan pancreas morphology from a large cohort of non-diabetic control patients in a clinical setting.


[51] GaussianArt: Unified Modeling of Geometry and Motion for Articulated Objects cs.CV

Licheng Shen, Saining Zhang, Honghan Li, Peilin Yang, Zihao Huang

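TL;DR: GaussianArt proposes a unified model of geometry and motion for reconstructing objects with multi-part articulation, significantly improving robustness and scalability.

Details

Motivation: Existing methods typically decouple geometry and motion, which complicates the reconstruction pipeline and makes multi-part articulated objects hard to handle.

Result: In experiments on 90 articulated objects, the method performs strongly in both geometry reconstruction and motion estimation.

Insight: Unified representations hold more promise for complex articulated objects and apply to downstream tasks such as robotic simulation and human-scene interaction.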

Abstract: Reconstructing articulated objects is essential for building digital twins of interactive environments. However, prior methods typically decouple geometry and motion by first reconstructing object shape in distinct states and then estimating articulation through post-hoc alignment. This separation complicates the reconstruction pipeline and restricts scalability, especially for objects with complex, multi-part articulation. We introduce a unified representation that jointly models geometry and motion using articulated 3D Gaussians. This formulation improves robustness in motion decomposition and supports articulated objects with up to 20 parts, significantly outperforming prior approaches that often struggle beyond 2–3 parts due to brittle initialization. To systematically assess scalability and generalization, we propose MPArt-90, a new benchmark consisting of 90 articulated objects across 20 categories, each with diverse part counts and motion configurations. Extensive experiments show that our method consistently achieves superior accuracy in part-level geometry reconstruction and motion estimation across a broad range of object types. We further demonstrate applicability to downstream tasks such as robotic simulation and human-scene interaction modeling, highlighting the potential of unified articulated representations in scalable physical modeling.


cs.CL

[52] From Image Captioning to Visual Storytelling cs.CL | cs.CV

Admitos Passadakis, Yingjin Song, Albert Gatt

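TL;DR: The paper proposes a framework that combines image captioning with visual storytelling. A two-step approach (first generate image captions, then turn them into a coherent story) improves story quality and speeds up training. The authors also propose a new metric/tool, ideality, for simulating how far results are from an oracle model.

Details

Motivation: Visual storytelling is a multimodal task that requires generating a story that is both grounded in an image sequence and coherent. Existing methods usually generate stories directly and overlook the connection to image captioning. This work combines the two tasks to improve story quality and efficiency.

Result: Experiments show that the framework yields better story quality than existing methods with shorter training time. The ideality metric is applied to emulate human-likeness.

Insight: Decomposing a complex task into subtasks (captioning first, then storytelling) can improve results, and a unified framework design aids reusability and reproducibility.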

Abstract: Visual Storytelling is a challenging multimodal task between Vision & Language, where the purpose is to generate a story for a stream of images. Its difficulty lies in the fact that the story should be both grounded in the image sequence and narrative and coherent. The aim of this work is to balance these aspects by treating Visual Storytelling as a superset of Image Captioning, an approach quite different from most prior relevant studies. This means that we first employ a vision-to-language model to obtain captions of the input images, and then transform these captions into coherent narratives using language-to-language methods. Our multifarious evaluation shows that integrating captioning and storytelling under a unified framework has a positive impact on the quality of the produced stories. In addition, compared to numerous previous studies, this approach accelerates training time and makes our framework readily reusable and reproducible by anyone interested. Lastly, we propose a new metric/tool, named ideality, that can be used to simulate how far some results are from an oracle model, and we apply it to emulate human-likeness in visual storytelling.


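[53] Contrastive Analysis of Constituent Order Preferences Within Adverbial Roles in English and Chinese News: A Large-Language-Model-Driven Approach cs.CL

Yiran Rex Ma

TL;DR: Using comparable English and Chinese news corpora annotated by an LLM, the paper contrasts the ordering of functional chunks with adverbial roles in the two languages, revealing both systematic preferences and dynamic adaptability.

Details

Motivation: To study how the ordering of adverbial functional chunks differs between English and Chinese news, and thereby expose how the two languages differ in information structure.

Result: English news tends to put core information first and post-position adverbial chunks; Chinese news tends to put background first and pre-position them. Within SVO structures, the Chinese pre-positioning tendency is more pronounced, while the English post-positioning tendency is relatively mild. When chunks co-occur, both languages reorder them flexibly.

Insight: Word order reflects systematic preference but also adapts dynamically, driven by informational and pragmatic purposes.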

Abstract: Based on comparable English-Chinese news corpora annotated by a Large Language Model (LLM), this paper attempts to explore the differences in constituent order of English-Chinese news from the perspective of functional chunks with adverbial roles, and to analyze their typical positional preferences and distribution patterns. It is found that: (1) English news prefers a linear narrative with core information first, and functional chunks are mostly post-positioned, while Chinese news prefers an overall presentation mode with background first, and functional chunks are often pre-positioned; (2) in SVO structures, both English and Chinese news show differences in the distribution of functional chunks, but the tendency of Chinese pre-positioning is more significant, while that of English post-positioning is relatively mild; (3) when functional chunks co-occur, both English and Chinese news show high flexibility, and the order adjustment is driven by information and pragmatic purposes. The study reveals that word order has both systematic preference and dynamic adaptability, providing new empirical support for the contrastive study of English-Chinese information structure.


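[54] T-REX: Table – Refute or Entail eXplainer cs.CL | cs.AI

Tim Luka Horstmann, Baptiste Geisenberger, Mehwish Alam

TL;DR: T-REX is an interactive tool for verifying textual claims against multimodal, multilingual tables. It is built on instruction-tuned large language models (LLMs) and aims to give non-experts easy access to advanced fact-checking technology.

Details

Motivation: LLMs have advanced table fact-checking, but current solutions remain inaccessible to non-experts. The authors therefore built T-REX as a transparent, easy-to-use interactive tool.

Result: T-REX is openly available online and offers a transparent solution for table fact-checking.

Insight: With an interactive design and an interface aimed at non-experts, T-REX shows how the technical barrier of a complex task can be lowered while keeping accuracy and transparency as design goals.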

Abstract: Verifying textual claims against structured tabular data is a critical yet challenging task in Natural Language Processing with broad real-world impact. While recent advances in Large Language Models (LLMs) have enabled significant progress in table fact-checking, current solutions remain inaccessible to non-experts. We introduce T-REX (T-REX: Table – Refute or Entail eXplainer), the first live, interactive tool for claim verification over multimodal, multilingual tables using state-of-the-art instruction-tuned reasoning LLMs. Designed for accuracy and transparency, T-REX empowers non-experts by providing access to advanced fact-checking technology. The system is openly available online.


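[55] Confidence Estimation for Text-to-SQL in Large Language Models cs.CL | cs.DB

Sepideh Entezari Maleki, Mohammadreza Pourreza, Davood Rafiei

TL;DR: The paper studies confidence estimation for text-to-SQL generation with large language models (LLMs), covering both black-box and white-box strategies. Consistency-based methods and SQL-syntax-aware methods stand out, and a supplementary signal from executing the queries improves both.

Details

Motivation: In text-to-SQL, it is important to judge how reliable a generated SQL query is, especially when no gold answer is available. Access to LLM weights and gradients is often constrained, so methods that do not need model internals are required.

Result: On cross-domain text-to-SQL benchmarks, consistency-based methods perform best among black-box approaches, SQL-syntax-aware interpretation of LLM logits is advantageous in white-box settings, and execution-based grounding of queries further improves both.

Insight: (1) Black-box methods need no access to model internals and suit constrained environments. (2) White-box methods that account for SQL syntax interpret model outputs more precisely. (3) Query-execution feedback adds a further validation dimension to confidence estimation.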

Abstract: Confidence estimation for text-to-SQL aims to assess the reliability of model-generated SQL queries without having access to gold answers. We study this problem in the context of large language models (LLMs), where access to model weights and gradients is often constrained. We explore both black-box and white-box confidence estimation strategies, evaluating their effectiveness on cross-domain text-to-SQL benchmarks. Our evaluation highlights the superior performance of consistency-based methods among black-box models and the advantage of SQL-syntax-aware approaches for interpreting LLM logits in white-box settings. Furthermore, we show that execution-based grounding of queries provides a valuable supplementary signal, improving the effectiveness of both approaches.
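
A minimal sketch of the consistency idea behind the black-box setting: sample several SQL queries for the same question and use the agreement rate of the most common one as a confidence score. The normalization rule, the function names, and the example queries are assumptions made for illustration; the abstract does not specify how the authors measure consistency.

```python
import re
from collections import Counter


def normalize_sql(query: str) -> str:
    """Crude canonical form: lowercase, collapse whitespace, drop a trailing semicolon.

    Assumption: this is only a stand-in; lowercasing also alters string
    literals, so a real system would parse the SQL instead.
    """
    collapsed = re.sub(r"\s+", " ", query.strip().lower())
    return collapsed.rstrip(";").strip()


def consistency_confidence(samples: list[str]) -> tuple[str, float]:
    """Return the most frequent normalized query and the share of samples that agree with it."""
    if not samples:
        raise ValueError("need at least one sampled query")
    counts = Counter(normalize_sql(q) for q in samples)
    best, votes = counts.most_common(1)[0]
    return best, votes / len(samples)


# Hypothetical samples drawn from a model for one question.
samples = [
    "SELECT name FROM users WHERE age > 30;",
    "select name from users where age > 30",
    "SELECT name FROM users WHERE age >= 30",
    "SELECT  name FROM users WHERE age > 30",
]
query, confidence = consistency_confidence(samples)
print(query, confidence)
```

Grouping samples by their execution result on the target database, rather than by their text, would be one way to bring in the execution-based signal the abstract describes.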


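[56] Assessing and Mitigating Data Memorization Risks in Fine-Tuned Large Language Models cs.CL | cs.AI

Badrinath Ramakrishnan, Akshaya Balaji

TL;DR: The paper studies the privacy risk of data memorization when fine-tuning large language models (LLMs), proposes a multi-layered privacy protection framework, and shows that four complementary methods effectively reduce data leakage.

Details

Motivation: LLMs tend to memorize training data during fine-tuning, which raises the risk of privacy leakage. The paper aims to quantify this risk and propose mitigations.

Result: Fine-tuning with repeated sensitive data raises privacy leakage rates from a baseline of 0-5% to 60-75%. The proposed framework reduces leakage to 0% while retaining 94.7% of the original model utility.

Insight: Repeated sensitive data in the fine-tuning set is a major source of leakage, and layered privacy protection can balance privacy against utility.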

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, but their tendency to memorize training data poses significant privacy risks, particularly during fine-tuning processes. This paper presents a comprehensive empirical analysis of data memorization in fine-tuned LLMs and introduces a novel multi-layered privacy protection framework. Through controlled experiments on modern LLM architectures including GPT-2, Phi-3, and Gemma-2, we demonstrate that fine-tuning with repeated sensitive data increases privacy leakage rates from baseline levels of 0-5% to 60-75%, representing a 64.2% average increase across tested models. We propose and rigorously evaluate four complementary privacy protection methods: semantic data deduplication, differential privacy during generation, entropy-based filtering, and pattern-based content filtering. Our experimental results show that these techniques can reduce data leakage to 0% while maintaining 94.7% of original model utility.
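
Of the four protection methods listed, pattern-based content filtering is the simplest to illustrate. The sketch below redacts matches of a few regular expressions from generated text. The pattern set, placeholder format, and function name are assumptions for illustration only and are not taken from the paper.

```python
import re

# Hypothetical patterns; a real deployment would need a broader, audited set.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> tuple[str, int]:
    """Replace every pattern match with a placeholder; return (clean_text, number_of_redactions)."""
    total = 0
    for label, pattern in SENSITIVE_PATTERNS.items():
        text, n = pattern.subn(f"[{label.upper()}]", text)
        total += n
    return text, total


clean, hits = redact("Contact jane.doe@example.com, SSN 123-45-6789.")
print(clean, hits)
```

A filter like this only catches strings with a recognizable surface form, which is presumably why the abstract pairs it with deduplication, differential privacy, and entropy-based filtering.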


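[57] Punctuation and Predicates in Language Models cs.CL | cs.LG

Sonakshi Chauhan, Maheep Chaudhary, Koby Choy, Samuel Nellessen, Nandi Schoots

TL;DR: The paper examines the role of punctuation and other input components in large language models (LLMs) and how their information propagates through the layers. Models differ in how much they rely on punctuation, and logical rules such as conditionals and universal quantification are processed differently.

Details

Motivation: To understand where information is collected and how it is propagated in LLMs, in particular how punctuation and other components (subjects, adjectives, full sentences) are handled across layers and whether different reasoning rules (e.g., conditionals) are processed differently.

Result: For GPT-2, punctuation is both necessary and sufficient in multiple layers; this holds far less in DeepSeek and not at all in Gemma. Conditional statements and universal quantification are processed very differently.

Insight: How LLMs process punctuation and logical rules is model-specific, and the experiments probe whether component information is summarized early and reused or remains sensitive to changes across layers. Both points bear on interpretability.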

Abstract: In this paper we explore where information is collected and how it is propagated throughout layers in large language models (LLMs). We begin by examining the surprising computational importance of punctuation tokens, which previous work has identified as attention sinks and memory aids. Using intervention-based techniques, we evaluate the necessity and sufficiency (for preserving model performance) of punctuation tokens across layers in GPT-2, DeepSeek, and Gemma. Our results show stark model-specific differences: for GPT-2, punctuation is both necessary and sufficient in multiple layers, while this holds far less in DeepSeek and not at all in Gemma. Extending beyond punctuation, we ask whether LLMs process different components of input (e.g., subjects, adjectives, punctuation, full sentences) by forming early static summaries reused across the network, or if the model remains sensitive to changes in these components across layers. We further investigate whether different reasoning rules are processed differently by LLMs. In particular, through interchange intervention and layer-swapping experiments, we find that conditional statements (if, then), and universal quantification (for all) are processed very differently. Our findings offer new insight into the internal mechanisms of punctuation usage and reasoning in LLMs and have implications for interpretability.


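[58] MMReview: A Multidisciplinary and Multimodal Benchmark for LLM-Based Peer Review Automation cs.CL

Xian Gao, Jiacheng Ruan, Zongyun Zhang, Jingsheng Gao, Ting Liu

TL;DR: The paper proposes MMReview, a benchmark for evaluating LLM-based, multimodal peer review automation, covering 17 research domains and multiple content modalities.

Details

Motivation: As academic publications grow rapidly, peer review has become heavy and time-consuming, yet current LLM-based review tasks lack a unified benchmark, particularly for multimodal content such as figures and tables.

Result: Experiments on 16 open-source and 5 closed-source models demonstrate the thoroughness of the benchmark, which lays a foundation for developing automated review systems.

Insight: MMReview fills a gap in evaluating multimodal review generation and may help standardize the development of automated peer review systems.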

Abstract: With the rapid growth of academic publications, peer review has become an essential yet time-consuming responsibility within the research community. Large Language Models (LLMs) have increasingly been adopted to assist in the generation of review comments; however, current LLM-based review tasks lack a unified evaluation benchmark to rigorously assess the models’ ability to produce comprehensive, accurate, and human-aligned assessments, particularly in scenarios involving multimodal content such as figures and tables. To address this gap, we propose MMReview, a comprehensive benchmark that spans multiple disciplines and modalities. MMReview includes multimodal content and expert-written review comments for 240 papers across 17 research domains within four major academic disciplines: Artificial Intelligence, Natural Sciences, Engineering Sciences, and Social Sciences. We design a total of 13 tasks grouped into four core categories, aimed at evaluating the performance of LLMs and Multimodal LLMs (MLLMs) in step-wise review generation, outcome formulation, alignment with human preferences, and robustness to adversarial input manipulation. Extensive experiments conducted on 16 open-source models and 5 advanced closed-source models demonstrate the thoroughness of the benchmark. We envision MMReview as a critical step toward establishing a standardized foundation for the development of automated peer review systems.


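[59] Disentangling concept semantics via multilingual averaging in Sparse Autoencoders cs.CL | cs.AI

Cliff O’Reilly, Ernesto Jimenez-Ruiz, Tillman Weyde

TL;DR: The paper proposes isolating concept semantics in large language models by averaging sparse-autoencoder concept activations across several languages, and finds that the averaged activations align better with the true relationships between concepts.

Details

Motivation: Connecting LLMs with formal knowledge representation is promising, but the semantics in embeddings and sparse-autoencoder features are entangled with syntactic and language-specific information.

Result: The conceptual averages give a strong indication of aligning with the ground-truth relationships between ontology classes, compared with any single language by itself.

Insight: Combining multiple language views can disentangle concept semantics more accurately and hints at a new technique for mechanistic interpretation of internal network states.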

Abstract: Connecting LLMs with formal knowledge representation and reasoning is a promising approach to address their shortcomings. Embeddings and sparse autoencoders are widely used to represent textual content, but the semantics are entangled with syntactic and language-specific information. We propose a method that isolates concept semantics in Large Language Models by averaging concept activations derived via Sparse Autoencoders. We create English text representations from OWL ontology classes, translate the English into French and Chinese, and then pass these texts as prompts to the Gemma 2B LLM. Using the open source Gemma Scope suite of Sparse Autoencoders, we obtain concept activations for each class and language version. We average the different language activations to derive a conceptual average. We then correlate the conceptual averages with a ground truth mapping between ontology classes. Our results give a strong indication that the conceptual average aligns to the true relationship between classes when compared with a single language by itself. The result hints at a new technique which enables mechanistic interpretation of internal network states with higher accuracy.
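
The averaging step can be sketched in a few lines. The toy vectors below are invented for illustration: the first two dimensions stand in for language-specific features and the third for a shared concept feature. Real Gemma Scope activations are high-dimensional and sparse, and the paper's final correlation against the ontology mapping is not reproduced here.

```python
import math


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of equally sized activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical per-language activations for two ontology classes.
activations = {
    "Dog": {"en": [1.0, 0.0, 2.0], "fr": [0.0, 1.0, 2.0]},
    "Animal": {"en": [1.0, 0.0, 1.8], "fr": [0.0, 1.0, 1.8]},
}

# Averaging over languages spreads the language-specific dimensions evenly,
# so the shared concept dimension dominates the comparison.
concept_avg = {
    name: mean_vector(list(by_lang.values())) for name, by_lang in activations.items()
}
similarity = cosine(concept_avg["Dog"], concept_avg["Animal"])
print(concept_avg, similarity)
```

In this toy setup, comparing the English "Dog" vector with the French "Animal" vector gives a lower similarity than comparing the two averages, which is the kind of effect the abstract attributes to multilingual averaging.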


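[60] GRILE: A Benchmark for Grammar Reasoning and Explanation in Romanian LLMs cs.CL | cs.CY

Adrian-Marius Dumitran, Alexandra-Mihaela Danila, Angela-Liliana Dumitran

TL;DR: GRILE is the first open benchmark for grammar reasoning and explanation in Romanian, with 1,151 multiple-choice questions for evaluating LLMs in a low-resource language. Gemini 2.5 Pro reaches 83% accuracy, but most open-weight models perform poorly and their explanations contain many flaws.

Details

Motivation: To probe the grammar reasoning and explanation abilities of large language models in a low-resource language (Romanian), where their pedagogical value remains unclear.

Result: Gemini 2.5 Pro reaches 83% accuracy, most open-weight models stay below 65%, and 48% of their explanations contain factual or pedagogical flaws according to expert review.

Insight: (1) LLMs still have considerable room to improve in low-resource languages. (2) Morphology and the latest orthographic norms are systematic weak points. (3) GRILE provides a new test-bed for controllable explanation generation and evaluation.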

Abstract: LLMs (Large language models) have revolutionized NLP (Natural Language Processing), yet their pedagogical value for low-resource languages remains unclear. We present GRILE (Grammar Romanian Inference and Language Explanations), the first open benchmark of 1,151 multiple-choice questions harvested from Romanian high-stakes exams (National Evaluation, Baccalaureate, university admissions). GRILE enables us to probe two complementary abilities of seven state-of-the-art multilingual and Romanian-specific LLMs: (i) selecting the correct answer, and (ii) producing linguistically accurate explanations. While Gemini 2.5 Pro reaches 83% accuracy, most open-weight models stay below 65%, and 48% of their explanations contain factual or pedagogical flaws according to expert review. A detailed error analysis pinpoints systematic weaknesses in morphology and in applying the latest DOOM3 orthographic norms. All data, code and a public web demo are released to catalyze future research. Our findings expose open challenges for trustworthy educational NLP in low-resource settings and establish GRILE as a new test-bed for controllable explanation generation and evaluation.


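[61] Tokens with Meaning: A Hybrid Tokenization Approach for NLP cs.CL | 68T50 | I.2.7; I.2.6; H.3.1

M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri

TL;DR: The paper proposes a hybrid tokenization method that combines rule-based morphological analysis with statistical subword segmentation, yielding more linguistically meaningful tokens for morphologically rich languages such as Turkish.

Details

Motivation: Standard subword tokenizers such as BPE and WordPiece struggle with morphologically rich and agglutinative languages because they rely on frequency rather than linguistic structure.

Result: On the TR-MMLU benchmark, the tokenizer achieves the highest Turkish Token Percentage (90.29%) and Pure Token Percentage (85.8%), and produces more linguistically meaningful tokens than the LLaMA, Gemma, and GPT tokenizers.

Insight: The approach is language-independent and can be adapted to other morphologically rich languages, offering a more interpretable and effective tokenization option for multilingual NLP.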

Abstract: Tokenization plays a pivotal role in natural language processing (NLP), shaping how text is segmented and interpreted by language models. While subword methods such as Byte Pair Encoding (BPE) and WordPiece have been effective, they often struggle with morphologically rich and agglutinative languages because they rely on frequency rather than linguistic structure. We introduce a hybrid tokenization framework that combines rule-based morphological analysis with statistical subword segmentation. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency. It assigns shared identifiers to phonologically variant affixes (e.g., -ler and -lar) and altered root forms (e.g., kitap vs. kitabı), reducing redundancy while maintaining semantic integrity. Special tokens are added for whitespace and case, including an UPPERCASE marker to avoid vocabulary inflation from capitalization. BPE is integrated for out-of-vocabulary coverage without harming morphological coherence. On the TR-MMLU benchmark, the tokenizer achieves the highest Turkish Token Percentage (90.29%) and Pure Token Percentage (85.8%). Comparisons with tokenizers from LLaMA, Gemma, and GPT show more linguistically meaningful and coherent tokens. Although demonstrated on Turkish, the approach is language-independent and adaptable to other languages, offering a practical path toward more interpretable and effective multilingual NLP systems.
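
The shared-identifier idea can be illustrated with a toy dictionary-based segmenter. Everything in this sketch is an assumption: the tiny lexicon, the token names, the greedy longest-match strategy, and the choice to emit the UPPERCASE marker for a capitalized first letter. The `<UNK:...>` token merely stands in for the BPE fallback the abstract mentions.

```python
# Hypothetical mini-lexicon: variant surface forms map to one shared identifier.
ROOTS = {"kitap": "ROOT_BOOK", "kitab": "ROOT_BOOK", "ev": "ROOT_HOUSE"}
AFFIXES = {"ler": "PL", "lar": "PL", "ı": "ACC", "i": "ACC"}


def tokenize_word(word: str) -> list[str]:
    """Greedy longest-match segmentation into a root identifier plus affix identifiers."""
    tokens = []
    if word[:1].isupper():
        tokens.append("<UPPERCASE>")
        # Note: str.lower() is not Turkish-aware (it maps "I" to "i", not "ı").
        word = word.lower()
    root = max((r for r in ROOTS if word.startswith(r)), key=len, default=None)
    if root is None:
        return tokens + [f"<UNK:{word}>"]
    tokens.append(ROOTS[root])
    rest = word[len(root):]
    while rest:
        affix = max((a for a in AFFIXES if rest.startswith(a)), key=len, default=None)
        if affix is None:
            tokens.append(f"<UNK:{rest}>")  # stand-in for a subword fallback
            break
        tokens.append(AFFIXES[affix])
        rest = rest[len(affix):]
    return tokens


for w in ["Kitaplar", "evler", "kitabı"]:
    print(w, tokenize_word(w))
```

Both plural variants map to `PL` and both root variants map to `ROOT_BOOK`, so the vocabulary holds one entry per morpheme rather than one per surface form.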


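[62] A Joint Multitask Model for Morpho-Syntactic Parsing cs.CL

Demian Inostroza, Mel Mistica, Ekaterina Vylomova, Chris Guest, Kemal Kurniawan

TL;DR: The paper presents a joint multitask model that predicts morphological and syntactic analyses together. It achieves the best overall performance in the UniDive 2025 shared task, with an average MSLAS of 78.7%.

Details

Motivation: To predict morphological and syntactic analyses jointly across typologically diverse languages with a single model, rather than with separate single-task systems.

Result: On the shared task, the model reaches an average MSLAS of 78.7%, LAS of 80.1%, and Feats F1 of 90.3%.

Insight: The model struggles with core grammatical cases (particularly Nom-Acc) and nominal features, which points to directions for future improvement.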

Abstract: We present a joint multitask model for the UniDive 2025 Morpho-Syntactic Parsing shared task, where systems predict both morphological and syntactic analyses following a novel UD annotation scheme. Our system uses a shared XLM-RoBERTa encoder with three specialized decoders for content word identification, dependency parsing, and morphosyntactic feature prediction. Our model achieves the best overall performance on the shared task’s leaderboard covering nine typologically diverse languages, with an average MSLAS score of 78.7 percent, LAS of 80.1 percent, and Feats F1 of 90.3 percent. Our ablation studies show that matching the task’s gold tokenization and content word identification are crucial to model performance. Error analysis reveals that our model struggles with core grammatical cases (particularly Nom-Acc) and nominal features across languages.


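[63] ZPD-SCA: Unveiling the Blind Spots of LLMs in Assessing Students’ Cognitive Abilities cs.CL | cs.AI | cs.CY

Wenhan Dong, Zhen Sun, Yuemeng Zhao, Zifan Peng, Jun Wu

TL;DR: The paper introduces ZPD-SCA, a benchmark for evaluating how well large language models (LLMs) match the difficulty of reading materials to students’ cognitive abilities. LLMs perform poorly zero-shot but improve with in-context examples.

Details

Motivation: Grounded in the educational principle of the Zone of Proximal Development (ZPD), the study fills a gap in evaluating whether LLMs can align reading materials with students’ cognitive abilities in Chinese language education.

Result: In zero-shot settings LLMs perform poorly, with some even falling below random guessing. With in-context examples performance improves substantially, but systematic directional biases and significant differences across genres remain.

Insight: LLMs show emerging ability to assess reading difficulty, but their current training is limited for educationally aligned judgment, and accuracy on such alignment tasks needs further improvement.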

Abstract: Large language models (LLMs) have demonstrated potential in educational applications, yet their capacity to accurately assess the cognitive alignment of reading materials with students’ developmental stages remains insufficiently explored. This gap is particularly critical given the foundational educational principle of the Zone of Proximal Development (ZPD), which emphasizes the need to match learning resources with Students’ Cognitive Abilities (SCA). Despite the importance of this alignment, there is a notable absence of comprehensive studies investigating LLMs’ ability to evaluate reading comprehension difficulty across different student age groups, especially in the context of Chinese language education. To fill this gap, we introduce ZPD-SCA, a novel benchmark specifically designed to assess stage-level Chinese reading comprehension difficulty. The benchmark is annotated by 60 Special Grade teachers, a group that represents the top 0.15% of all in-service teachers nationwide. Experimental results reveal that LLMs perform poorly in zero-shot learning scenarios, with Qwen-max and GLM even falling below the probability of random guessing. When provided with in-context examples, LLM performance improves substantially, with some models achieving nearly double the accuracy of their zero-shot baselines. These results reveal that LLMs possess emerging abilities to assess reading difficulty, while also exposing limitations in their current training for educationally aligned judgment. Notably, even the best-performing models display systematic directional biases, suggesting difficulties in accurately aligning material difficulty with SCA. Furthermore, significant variations in model performance across different genres underscore the complexity of the task. We envision that ZPD-SCA can provide a foundation for evaluating and improving LLMs in cognitively aligned educational applications.


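[64] Credence Calibration Game? Calibrating Large Language Models through Structured Play cs.CL | cs.AI

Ke Fang, Tianyi Zhao, Lu Cheng

TL;DR: The paper proposes a prompt-based calibration framework that uses a structured interaction loop and feedback-driven prompting to dynamically improve the confidence calibration of large language models (LLMs).

Details

Motivation: Existing calibration methods mostly rely on post-hoc adjustment or auxiliary model training, and many of them need additional supervision or parameter updates.

Result: Experiments across models and game configurations show consistent improvements in the evaluation metrics.

Insight: Game-based prompting is an effective calibration strategy for LLMs that requires no parameter updates.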

Abstract: As Large Language Models (LLMs) are increasingly deployed in decision-critical domains, it becomes essential to ensure that their confidence estimates faithfully correspond to their actual correctness. Existing calibration methods have primarily focused on post-hoc adjustments or auxiliary model training; however, many of these approaches necessitate additional supervision or parameter updates. In this work, we propose a novel prompt-based calibration framework inspired by the Credence Calibration Game. Our method establishes a structured interaction loop wherein LLMs receive feedback based on the alignment of their predicted confidence with correctness. Through feedback-driven prompting and natural language summaries of prior performance, our framework dynamically improves model calibration. Extensive experiments across models and game configurations demonstrate consistent improvements in evaluation metrics. Our results highlight the potential of game-based prompting as an effective strategy for LLM calibration. Code and data are available at https://anonymous.4open.science/r/LLM-Calibration/.
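
A minimal sketch of such a feedback loop, under stated assumptions: the logarithmic scoring rule, the point scale, and the wording of the summary are illustrative choices, not the paper's specification. The score is zero for a 50% guess, positive for a confident correct answer, and sharply negative for a confident wrong one.

```python
import math


def log_score(confidence: float, correct: bool) -> float:
    """Points relative to a 50% guess under an assumed logarithmic scoring rule."""
    p = confidence if correct else 1.0 - confidence
    p = min(max(p, 1e-6), 1.0)  # clamp to avoid log(0) on a fully confident miss
    return 100.0 * (1.0 + math.log2(p))


def feedback_summary(history: list[tuple[float, bool]]) -> str:
    """Turn past (confidence, correct) rounds into a natural-language note for the next prompt."""
    total = sum(log_score(c, ok) for c, ok in history)
    accuracy = sum(ok for _, ok in history) / len(history)
    mean_conf = sum(c for c, _ in history) / len(history)
    trend = "overconfident" if mean_conf > accuracy else "not overconfident"
    return (
        f"Score so far: {total:.1f}. Accuracy {accuracy:.0%}, "
        f"mean stated confidence {mean_conf:.0%}. You have been {trend}."
    )


# Hypothetical history of two rounds.
history = [(0.9, True), (0.9, False)]
summary = feedback_summary(history)
print(summary)
```

The returned string would be prepended to the next question's prompt, so the model sees its running score and calibration trend without any parameter update.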


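[65] DEPTH: Hallucination-Free Relation Extraction via Dependency-Aware Sentence Simplification and Two-tiered Hierarchical Refinement cs.CL | cs.AI

Yupei Yang, Fan Feng, Lin Yang, Wanxi Deng, Lin Qu

TL;DR: DEPTH is a framework that combines dependency-aware sentence simplification with two-tiered hierarchical refinement to reduce hallucinations in relation extraction, substantially improving performance.

Details

Motivation: LLM-based relation extraction methods tend to produce spurious predictions (hallucinations) on sentences with complex structure or intricate semantics, which degrades the accuracy of knowledge graphs. DEPTH targets this problem.

Result: Across six benchmarks, DEPTH reduces the average hallucination rate to 7.0% and improves average F1 by 17.2% over state-of-the-art baselines.

Insight: Dependency paths and hierarchical refinement effectively reduce hallucinations, and a causality-driven reward model supports robust reinforcement-learning fine-tuning.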

Abstract: Relation extraction enables the construction of structured knowledge for many downstream applications. While large language models (LLMs) have shown great promise in this domain, most existing methods concentrate on relation classification, which predicts the semantic relation type between a related entity pair. However, we observe that LLMs often struggle to reliably determine whether a relation exists, especially in cases involving complex sentence structures or intricate semantics, which leads to spurious predictions. Such hallucinations can introduce noisy edges in knowledge graphs, compromising the integrity of structured knowledge and downstream reliability. To address these challenges, we propose DEPTH, a framework that integrates Dependency-aware sEntence simPlification and Two-tiered Hierarchical refinement into the relation extraction pipeline. Given a sentence and its candidate entity pairs, DEPTH operates in two stages: (1) the Grounding module extracts relations for each pair by leveraging their shortest dependency path, distilling the sentence into a minimal yet coherent relational context that reduces syntactic noise while preserving key semantics; (2) the Refinement module aggregates all local predictions and revises them based on a holistic understanding of the sentence, correcting omissions and inconsistencies. We further introduce a causality-driven reward model that mitigates reward hacking by disentangling spurious correlations, enabling robust fine-tuning via reinforcement learning with human feedback. Experiments on six benchmarks demonstrate that DEPTH reduces the average hallucination rate to 7.0% while achieving a 17.2% improvement in average F1 score over state-of-the-art baselines.
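
The Grounding module relies on the shortest dependency path between two entities. A minimal sketch of that step, assuming the dependency edges are already available: the hand-written edges below are an illustrative guess at a parse, not parser output, and the function name is hypothetical.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Breadth-first search over the dependency tree, treated as an undirected graph."""
    graph = {}
    for head, dependent in edges:
        graph.setdefault(head, set()).add(dependent)
        graph.setdefault(dependent, set()).add(head)
    if source not in graph or target not in graph:
        return None
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbor in sorted(graph[path[-1]]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Assumed (head, dependent) edges for:
# "Marie Curie, who was born in Warsaw, discovered polonium."
# Words serve as node ids here; real code should use token indices to
# keep repeated words apart.
edges = [
    ("discovered", "Curie"), ("Curie", "Marie"), ("Curie", "born"),
    ("born", "who"), ("born", "was"), ("born", "in"),
    ("in", "Warsaw"), ("discovered", "polonium"),
]
path = shortest_dependency_path(edges, "Curie", "polonium")
print(path)
```

The path between "Curie" and "polonium" skips the relative clause about Warsaw entirely, which is the kind of syntactic noise reduction the abstract describes.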


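[66] Cognitive Surgery: The Awakening of Implicit Territorial Awareness in LLMs cs.CL | cs.AI

Yinghan Zhou, Weifeng Zhu, Juan Wen, Wanli Peng, Zhengxian Wu

TL;DR: The paper studies why large language models (LLMs) struggle to recognize their own generated text under the Individual Presentation Paradigm (IPP) and proposes Cognitive Surgery (CoSur), a method that awakens Implicit Territorial Awareness (ITA) and significantly improves IPP performance.

Details

Motivation: LLMs reliably identify their own text under the Pair Presentation Paradigm (PPP), but performance drops sharply under IPP. The paper investigates the cause and proposes a remedy.

Result: CoSur improves IPP performance on three different LLMs, reaching average accuracies of 83.25%, 66.19%, and 88.01%, respectively.

Insight: LLMs can distinguish self-generated from other-generated text in representation space, but this latent ability is not expressed in their output behavior and must be "awakened" by a dedicated method.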

Abstract: Large language models (LLMs) have been shown to possess a degree of self-recognition capability, that is, the ability to identify whether a given text was generated by themselves. Prior work has demonstrated that this capability is reliably expressed under the Pair Presentation Paradigm (PPP), where the model is presented with two texts and asked to choose which one it authored. However, performance deteriorates sharply under the Individual Presentation Paradigm (IPP), where the model is given a single text to judge authorship. Although this phenomenon has been observed, its underlying causes have not been systematically analyzed. In this paper, we first replicate existing findings to confirm that LLMs struggle to distinguish self- from other-generated text under IPP. We then investigate the reasons for this failure and attribute it to a phenomenon we term Implicit Territorial Awareness (ITA): the model’s latent ability to distinguish self- and other-texts in representational space, which remains unexpressed in its output behavior. To awaken the ITA of LLMs, we propose Cognitive Surgery (CoSur), a novel framework comprising four main modules: representation extraction, territory construction, authorship discrimination and cognitive editing. Experimental results demonstrate that our proposed method improves the performance of three different LLMs in the IPP scenario, achieving average accuracies of 83.25%, 66.19%, and 88.01%, respectively.


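[67] Knowledge Graph-Infused Fine-Tuning for Structured Reasoning in Large Language Models cs.CL

Wuyang Zhang, Yexin Tian, Xiandong Meng, Mengjie Wang, Junliang Du

TL;DR: The paper proposes a fine-tuning framework based on knowledge graph injection to address missing reasoning chains and weak entity-level semantic understanding when large language models handle tasks that need structured knowledge. A graph neural network builds a graph-based semantic representation that is modeled jointly with the language model's representations, improving semantic reasoning and entity prediction.

Details

Motivation: Large language models often underperform on tasks that need structured knowledge because reasoning chains are missing and entity-level semantics are poorly understood. The work studies how a knowledge graph can strengthen reasoning and semantic representation.

Result: Across entity recognition, question answering, and language generation, the method significantly improves the model's ability to represent complex semantic units, with better semantic consistency and contextual logic modeling.

Insight: Dynamically balancing linguistic semantics against structural knowledge mitigates conflicts between representation spaces and improves reasoning and semantic understanding. The approach may transfer to other tasks that need structured knowledge.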

Abstract: This paper addresses the problems of missing reasoning chains and insufficient entity-level semantic understanding in large language models when dealing with tasks that require structured knowledge. It proposes a fine-tuning algorithm framework based on knowledge graph injection. The method builds on pretrained language models and introduces structured graph information for auxiliary learning. A graph neural network is used to encode entities and their relations, constructing a graph-based semantic representation. A fusion mechanism is then designed to jointly model the knowledge graph embeddings with the contextual representations from the language model. To enhance the robustness of knowledge integration, a gating mechanism is introduced to dynamically balance the contributions of linguistic semantics and structural knowledge. This effectively mitigates conflicts between different representational spaces. During training, a joint loss function is constructed to account for both task performance and structural alignment objectives. This helps improve the accuracy of entity prediction and semantic reasoning. The study also includes a series of systematic sensitivity experiments. It evaluates the effects of learning rate, graph coverage, and structural perturbations on model performance. The results further validate the effectiveness and stability of the proposed method across tasks such as entity recognition, question answering, and language generation. Experimental findings show that the proposed structure-aware fine-tuning framework significantly enhances the model’s ability to represent complex semantic units. It demonstrates better semantic consistency and contextual logic modeling in scenarios involving structural reasoning and entity extraction.
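
The gating mechanism can be sketched as a learned interpolation between the language-model representation and the graph embedding. The scalar gate, the parameter names, and the toy values below are assumptions for illustration; the abstract does not give the paper's exact formulation, and a trained model would learn the gate parameters and likely use a per-dimension gate.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(text_vec, graph_vec, gate_weights, gate_bias):
    """Fuse two equally sized vectors with a scalar gate.

    g = sigmoid(w . [h; e] + b), fused = g * h + (1 - g) * e,
    where h is the contextual (text) vector and e the graph embedding.
    """
    concat = list(text_vec) + list(graph_vec)
    g = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    fused = [g * h + (1.0 - g) * e for h, e in zip(text_vec, graph_vec)]
    return fused, g


# Toy inputs with hypothetical gate parameters.
h = [1.0, 3.0]  # contextual representation from the language model
e = [3.0, 1.0]  # entity embedding from the graph encoder
fused, gate = gated_fusion(h, e, gate_weights=[0.0, 0.0, 0.0, 0.0], gate_bias=0.0)
print(fused, gate)
```

With zero weights the gate sits at 0.5 and the output is the plain average; a large positive bias pushes the fusion toward the text vector, and a large negative one toward the graph embedding.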


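[68] NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model cs.CL | cs.AI | cs.LG

NVIDIA: Aarti Basant, Abhijit Khairnar, Abhijit Paithankar

TL;DR: The paper introduces Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to raise throughput for reasoning workloads while reaching state-of-the-art accuracy among similarly sized models. Replacing the majority of the Transformer's self-attention layers with Mamba-2 layers speeds up inference when generating long reasoning traces.

Details

Motivation: Reasoning requires generating long thinking traces, which makes inference speed a bottleneck for standard Transformer models. The work aims for higher throughput without giving up accuracy by combining Mamba-2 and self-attention layers.

Result: Compared with similarly sized models (e.g., Qwen3-8B), Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks and up to 6x higher inference throughput in settings such as 8k input and 16k output tokens.

Insight: The hybrid Mamba-Transformer architecture performs well on reasoning tasks, suggesting that combining state space layers with self-attention is an effective direction.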

Abstract: We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.


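[69] Reasoning is about giving reasons cs.CL

Krunal Shah, Dan Roth

TL;DR: The paper proposes an intermediate representation, the Representation of the Logical Structure (RLS), that captures the logical structure of a natural language argument and thereby supports many forms of deterministic reasoning.

Details

Motivation: Current rule-chaining approaches are limited in interpretability and extensibility: a model trained to chain rules cannot support abduction or identify contradictions.

Result: On three popular reasoning datasets, the logical structure is identified and extracted with high accuracy, supporting explanation generation and significantly extending reasoning capabilities.

Insight: An explicit representation of logical structure is key to making reasoning models more interpretable and flexible.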

Abstract: Convincing someone of the truth value of a premise requires understanding and articulating the core logical structure of the argument which proves or disproves the premise. Understanding the logical structure of an argument refers to understanding the underlying “reasons” which make up the proof or disproof of the premise - as a function of the “logical atoms” in the argument. While it has been shown that transformers can “chain” rules to derive simple arguments, the challenge of articulating the “reasons” remains. Not only do current approaches to chaining rules suffer in terms of their interpretability, they are also quite constrained in their ability to accommodate extensions to theoretically equivalent reasoning tasks - a model trained to chain rules cannot support abduction or identify contradictions. In this work we suggest addressing these shortcomings by identifying an intermediate representation (which we call the Representation of the Logical Structure (RLS) of the argument) that possesses an understanding of the logical structure of a natural language argument - the logical atoms in the argument and the rules incorporating them. Given the logical structure, reasoning is deterministic and easy to compute. Therefore, our approach supports all forms of reasoning that depend on the logical structure of the natural language argument, including arbitrary depths of reasoning, on-the-fly mistake rectification and interactive discussion with respect to an argument. We show that we can identify and extract the logical structure of natural language arguments in three popular reasoning datasets with high accuracies, thus supporting explanation generation and extending the reasoning capabilities significantly.
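
The claim that reasoning becomes deterministic once the logical structure is known can be illustrated with plain forward chaining over atoms and rules. This is a generic textbook procedure shown under toy assumptions (hypothetical atoms, Horn-style rules), not the paper's RLS extraction or its reasoning engine.

```python
def forward_chain(facts, rules):
    """Derive every reachable atom and record, for each, the premises that produced it.

    rules is a list of (premises, conclusion) pairs, with premises a tuple of atoms.
    """
    derived = {fact: "given" for fact in facts}
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in derived and all(p in derived for p in premises):
                derived[conclusion] = premises
                changed = True
    return derived


def explain(goal, derived):
    """List the reasons for a goal, premises first."""
    if goal not in derived:
        return [f"{goal}: not derivable"]
    reason = derived[goal]
    if reason == "given":
        return [f"{goal}: given"]
    steps = []
    for premise in reason:
        steps.extend(explain(premise, derived))
    steps.append(f"{goal}: from {', '.join(reason)}")
    return steps


# Hypothetical logical atoms and rules extracted from an argument.
facts = {"rains"}
rules = [(("rains",), "wet_ground"), (("wet_ground",), "slippery")]
derived = forward_chain(facts, rules)
print(explain("slippery", derived))
```

Because each derived atom stores the premises that produced it, the same structure that answers the query also yields the "reasons" as an explanation.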


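[70] ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine cs.CL | cs.AI | cs.CV | cs.LG | cs.MM

Junying Chen, Zhenyang Cai, Zhiheng Liu, Yunjin Yang, Rongsheng Wang

TL;DR: The paper presents ShizhenGPT, the first multimodal large language model tailored for Traditional Chinese Medicine (TCM). It addresses data scarcity and the multimodal nature of TCM diagnosis, and performs strongly across multiple tasks.

Details

Motivation: TCM diagnostics draw on multiple sensory modalities (looking, listening, smelling, and pulse-taking) that conventional LLMs cannot handle, and high-quality TCM data is scarce.

Result: ShizhenGPT outperforms comparable-scale LLMs, competes with larger proprietary models, and leads existing multimodal LLMs in TCM visual understanding.

Insight: Multimodal LLMs have great potential in TCM and can pave the way toward holistic multimodal perception and diagnosis.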

Abstract: Despite the success of large language models (LLMs) in various domains, their potential in Traditional Chinese Medicine (TCM) remains largely underexplored due to two critical barriers: (1) the scarcity of high-quality TCM data and (2) the inherently multimodal nature of TCM diagnostics, which involve looking, listening, smelling, and pulse-taking. These sensory-rich modalities are beyond the scope of conventional LLMs. To address these challenges, we present ShizhenGPT, the first multimodal LLM tailored for TCM. To overcome data scarcity, we curate the largest TCM dataset to date, comprising 100GB+ of text and 200GB+ of multimodal data, including 1.2M images, 200 hours of audio, and physiological signals. ShizhenGPT is pretrained and instruction-tuned to achieve deep TCM knowledge and multimodal reasoning. For evaluation, we collect recent national TCM qualification exams and build a visual benchmark for Medicinal Recognition and Visual Diagnosis. Experiments demonstrate that ShizhenGPT outperforms comparable-scale LLMs and competes with larger proprietary models. Moreover, it leads in TCM visual understanding among existing multimodal LLMs and demonstrates unified perception across modalities like sound, pulse, smell, and vision, paving the way toward holistic multimodal perception and diagnosis in TCM. Datasets, models, and code are publicly available. We hope this work will inspire further exploration in this field.


[71] The Digital Sous Chef – A Comparative Study on Fine-Tuning Language Models for Recipe Generation cs.CL

Shubham Pundhir, Ganesh Bagler

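TL;DR: "The Digital Sous Chef" compares a fine-tuned GPT-2 large model against GPT-2 small and traditional LSTM/RNN baselines, and proposes a targeted tokenization strategy for recipe generation that improves generation quality.

Details

Motivation: Recipe generation is a fundamental natural language generation task, but generic tokenizers fail to preserve recipe structure and precise numerical quantities, which limits generation quality.

Result: The large model achieves a >20% relative improvement in BERTScore (F1) over the best recurrent baseline (0.92 vs 0.72) and reduces perplexity by 69.8%.

Insight: A targeted tokenization strategy improves domain specificity for recipe generation, while factual accuracy remains an open challenge for future work.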

Abstract: We established a rigorous benchmark for text-based recipe generation, a fundamental task in natural language generation. We present a comprehensive comparative study contrasting a fine-tuned GPT-2 large (774M) model against the GPT-2 small (124M) model and traditional LSTM/RNN baselines on the 5-cuisine corpus from RecipeDB. Our key contribution is a targeted tokenization strategy that augments the vocabulary with 23 common fraction tokens and custom structural markers. This approach addresses a critical limitation of generic tokenizers by preserving essential recipe structures and precise numerical quantities, thereby enhancing domain specificity. Performance is evaluated using a comprehensive suite of seven automatic metrics spanning fluency (BLEU-4, METEOR), coherence (ROUGE-L), semantic relevance (BERTScore), and diversity. Our experiments show that the large transformer-based approach yields a >20% relative improvement in BERTScore (F1) (0.92 vs 0.72) over the best recurrent baseline, while reducing perplexity by 69.8%. We conclude with a discussion of remaining challenges, particularly regarding factual accuracy, and outline how this foundational study paves the way for integrating real-world constraints and multi-modal inputs in advanced recipe generation research.
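
To make the tokenization idea concrete, the sketch below keeps fractions and structural markers as single tokens in a toy regex tokenizer. It is an assumption-laden illustration: the fraction list and the `<INGR>`/`<STEP>` markers are hypothetical stand-ins, and the paper's actual 23 fraction tokens and marker names are not given in the abstract.

```python
import re

# Hypothetical subset of fraction tokens and structural markers.
FRACTIONS = ["1/2", "1/3", "1/4", "3/4"]
MARKERS = ["<INGR>", "<STEP>"]

# Try the special tokens first so they are never split into pieces.
_special = sorted(FRACTIONS + MARKERS, key=len, reverse=True)
_pattern = re.compile(
    "|".join(re.escape(t) for t in _special) + r"|\w+|[^\w\s]"
)


def tokenize(text):
    """Split text, keeping listed fractions and markers as single tokens."""
    return _pattern.findall(text)


tokens = tokenize("<INGR> 3/4 cup sugar, 1/2 tsp salt")
print(tokens)
```

A fraction that is not in the list, such as `5/8`, falls back to the generic rules and is split into three tokens, which is the loss of numerical precision the targeted vocabulary is meant to avoid.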


[72] Transplant Then Regenerate: A New Paradigm for Text Data Augmentation cs.CL | cs.AI

Guangzhan Wang, Hongyu Zhang, Beijun Shen, Xiaodong Gu

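TL;DR: The paper proposes LMTransplant, a new text data augmentation paradigm that uses large language models (LLMs) with a transplant-then-regenerate strategy to produce more diverse and creative text variants while preserving the core attributes of the original text.

Details

Motivation: Traditional augmentation methods such as back-translation mainly produce variants with the same semantics, while LLM-based augmentation makes the style and structure of outputs hard to control without meticulous prompt engineering. The paper aims to exploit the "knowledge emergence" capability of LLMs in a more flexible augmentation method.

Result: LMTransplant outperforms existing text augmentation methods across various text-related tasks and scales well as the amount of augmented data grows.

Insight: Embedding the seed text in LLM-expanded context and regenerating from that context yields more creative content-level variants while reducing reliance on prompt engineering.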

Abstract: Data augmentation is a critical technique in deep learning. Traditional methods like Back-translation typically focus on lexical-level rephrasing, which primarily produces variations with the same semantics. While large language models (LLMs) have enhanced text augmentation by their “knowledge emergence” capability, controlling the style and structure of these outputs remains challenging and requires meticulous prompt engineering. In this paper, we propose LMTransplant, a novel text augmentation paradigm leveraging LLMs. The core idea of LMTransplant is transplant-then-regenerate: incorporating seed text into a context expanded by LLM, and asking the LLM to regenerate a variant based on the expanded context. This strategy allows the model to create more diverse and creative content-level variants by fully leveraging the knowledge embedded in LLMs, while preserving the core attributes of the original text. We evaluate LMTransplant across various text-related tasks, demonstrating its superior performance over existing text augmentation methods. Moreover, LMTransplant demonstrates exceptional scalability as the size of augmented data grows.
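
The two-step idea can be sketched as a small pipeline around a generic text-generation callable. Everything specific here is an assumption made for illustration: the prompts, the `[MISSING]` placeholder, and the function names are hypothetical and are not taken from LMTransplant, and `stub_llm` merely stands in for a real model so the sketch runs offline.

```python
from typing import Callable


def transplant_then_regenerate(seed: str, llm: Callable[[str], str]) -> str:
    """Sketch: expand a context around the seed, then regenerate the seed slot."""
    # Step 1 (transplant): ask the model for surrounding context for the seed.
    before = llm(f"Write a passage that could precede this text:\n{seed}")
    after = llm(f"Write a passage that could follow this text:\n{seed}")
    # Step 2 (regenerate): hide the seed and ask for new text that fits the context.
    prompt = (
        f"{before}\n[MISSING]\n{after}\n"
        "Fill in [MISSING] with new text that fits the context."
    )
    return llm(prompt)


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical, for demonstration only)."""
    if "[MISSING]" in prompt:
        return "VARIANT"
    return "CONTEXT"


variant = transplant_then_regenerate("The battery lasts all day.", stub_llm)
print(variant)
```

In this sketch the regenerated text is conditioned only on the expanded context, which is one plausible reading of how the variant can differ in content while staying compatible with the seed's setting.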


[73] Evaluating Multilingual and Code-Switched Alignment in LLMs via Synthetic Natural Language Inference cs.CL | cs.AI

Samir Abdaljalil, Erchin Serpedin, Khalid Qaraqe, Hasan Kurban

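TL;DR: The paper proposes a framework that evaluates the logical consistency and cross-lingual alignment of multilingual LLMs using synthetic natural language inference (NLI) tasks. It finds that code-switching does not degrade performance and can even improve it.

Details

Motivation: The cross-lingual logical consistency and alignment of multilingual LLMs have not been systematically evaluated, and their behavior under code-switching in particular remains underexplored.

Result: Code-switching does not degrade performance and can even improve it; translation-induced lexical variation may act as a regularization signal.

Insight: Cross-lingual alignment remains brittle, and code-switching may be a promising lever for improving multilingual robustness.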

Abstract: Large language models (LLMs) are increasingly applied in multilingual contexts, yet their capacity for consistent, logically grounded alignment across languages remains underexplored. We present a controlled evaluation framework for multilingual natural language inference (NLI) that generates synthetic, logic-based premise-hypothesis pairs and translates them into a typologically diverse set of languages. This design enables precise control over semantic relations and allows testing in both monolingual and mixed-language (code-switched) conditions. Surprisingly, code-switching does not degrade, and can even improve, performance, suggesting that translation-induced lexical variation may serve as a regularization signal. We validate semantic preservation through embedding-based similarity analyses and cross-lingual alignment visualizations, confirming the fidelity of translated pairs. Our findings expose both the potential and the brittleness of current LLM cross-lingual reasoning, and identify code-switching as a promising lever for improving multilingual robustness. Code available at: https://github.com/KurbanIntelligenceLab/nli-stress-testing


[74] TransLLM: A Unified Multi-Task Foundation Framework for Urban Transportation via Learnable Prompting cs.CL | cs.AI

Jiaming Leng, Yunying Bi, Chuan Qin, Bing Yin, Yanyong Zhang

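TL;DR: TransLLM is a unified multi-task foundation framework that combines spatiotemporal modeling with large language models (LLMs) through learnable prompt composition, targeting generalization across diverse urban transportation tasks.

Details

Motivation: Small-scale deep learning models are task-specific and data-hungry, which limits their generalizability, while LLMs struggle with structured spatiotemporal data and numerical reasoning. TransLLM aims to address both limitations.

Result: Experiments on seven datasets and three tasks show that TransLLM is effective in both supervised and zero-shot settings and delivers competitive performance against ten baseline models.

Insight: Dynamic, instance-level prompt routing improves the model's adaptability and generalization in multi-task and cross-task settings.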

Abstract: Urban transportation systems encounter diverse challenges across multiple tasks, such as traffic forecasting, electric vehicle (EV) charging demand prediction, and taxi dispatch. Existing approaches suffer from two key limitations: small-scale deep learning models are task-specific and data-hungry, limiting their generalizability across diverse scenarios, while large language models (LLMs), despite offering flexibility through natural language interfaces, struggle with structured spatiotemporal data and numerical reasoning in transportation domains. To address these limitations, we propose TransLLM, a unified foundation framework that integrates spatiotemporal modeling with large language models through learnable prompt composition. Our approach features a lightweight spatiotemporal encoder that captures complex dependencies via dilated temporal convolutions and dual-adjacency graph attention networks, seamlessly interfacing with LLMs through structured embeddings. A novel instance-level prompt routing mechanism, trained via reinforcement learning, dynamically personalizes prompts based on input characteristics, moving beyond fixed task-specific templates. The framework operates by encoding spatiotemporal patterns into contextual representations, dynamically composing personalized prompts to guide LLM reasoning, and projecting the resulting representations through specialized output layers to generate task-specific predictions. Experiments across seven datasets and three tasks demonstrate the exceptional effectiveness of TransLLM in both supervised and zero-shot settings. Compared to ten baseline models, it delivers competitive performance on both regression and planning problems, showing strong generalization and cross-task adaptability. Our code is available at https://github.com/BiYunying/TransLLM.


[75] Evaluating Retrieval-Augmented Generation vs. Long-Context Input for Clinical Reasoning over EHRs cs.CL | cs.AI

Skatje Myers, Dmitriy Dligach, Timothy A. Miller, Samantha Barr, Yanjun Gao

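TL;DR: The paper compares retrieval-augmented generation (RAG) with long-context input for clinical reasoning over electronic health records (EHRs). RAG matches or exceeds using only the most recent notes and approaches full-context performance while requiring far fewer input tokens.

Details

Motivation: EHRs are long, noisy, and often redundant, which makes them hard for clinicians to navigate. LLMs offer a promising way to extract and reason over this text, but clinical notes often exceed even extended context windows.

Result: RAG closely matches or exceeds the performance of using recent notes and approaches the performance of using the models' full context, with drastically fewer input tokens.

Insight: RAG remains a competitive and efficient approach even as newer models handle increasingly long inputs.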

Abstract: Electronic health records (EHRs) are long, noisy, and often redundant, posing a major challenge for the clinicians who must navigate them. Large language models (LLMs) offer a promising solution for extracting and reasoning over this unstructured text, but the length of clinical notes often exceeds even state-of-the-art models’ extended context windows. Retrieval-augmented generation (RAG) offers an alternative by retrieving task-relevant passages from across the entire EHR, potentially reducing the amount of required input tokens. In this work, we propose three clinical tasks designed to be replicable across health systems with minimal effort: 1) extracting imaging procedures, 2) generating timelines of antibiotic use, and 3) identifying key diagnoses. Using EHRs from actual hospitalized patients, we test three state-of-the-art LLMs with varying amounts of provided context, using either targeted text retrieval or the most recent clinical notes. We find that RAG closely matches or exceeds the performance of using recent notes, and approaches the performance of using the models’ full context while requiring drastically fewer input tokens. Our results suggest that RAG remains a competitive and efficient approach even as newer models become capable of handling increasingly longer amounts of text.
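
The two context-selection strategies being compared, targeted retrieval and most-recent notes, can be contrasted with a toy example. This is a sketch under assumptions: term overlap stands in for whatever retriever the study actually used, and the notes and function names are invented for illustration.

```python
def overlap_score(query, note):
    """Count distinct query terms that also appear in the note."""
    return len(set(query.lower().split()) & set(note.lower().split()))


def retrieve(query, notes, k):
    """Return the k notes with the highest term overlap (ties keep note order)."""
    ranked = sorted(notes, key=lambda note: overlap_score(query, note), reverse=True)
    return ranked[:k]


def most_recent(notes, k):
    """Return the last k notes, assuming notes are in chronological order."""
    return notes[-k:]


# Invented, chronologically ordered notes for one hypothetical admission.
notes = [
    "CT abdomen performed on admission",
    "Patient started on vancomycin",
    "Physical therapy consult",
    "Discharge planning discussed",
]
query = "imaging CT performed"
print(retrieve(query, notes, 1))
print(most_recent(notes, 1))
```

With the same one-note budget, retrieval surfaces the early imaging note while the recency baseline returns only the discharge note, which is the kind of difference the three clinical tasks are designed to expose.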


[76] Long Chain-of-Thought Reasoning Across Languages cs.CL | cs.AI | cs.LG

Josh Barua, Seun Eisape, Kayo Yin, Alane Suhr

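TL;DR: The paper studies long chain-of-thought (CoT) reasoning across languages. Using translated reasoning datasets and fine-tuned Qwen 2.5 and Qwen 3 models, it shows that the effectiveness of English as a pivot language varies by language and that the trade-off between data quality and scale is language dependent.

Details

Motivation: Long CoT reasoning in LLMs is almost exclusively English-centric, and reasoning in other languages is understudied. The paper addresses this gap with a systematic study across French, Japanese, Latvian, and Swahili.

Result: 1) The efficacy of English as a pivot language varies by language; 2) multilingual pretraining narrows but does not eliminate the cross-lingual performance gap; 3) the trade-off between data quality and scale is language dependent.

Insight: Multilingual reasoning depends on language-specific data as well as on the model: small, carefully curated datasets suffice for English and French, whereas larger but noisier corpora work better for Swahili and Latvian.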

Abstract: Scaling inference through long chains-of-thought (CoTs) has unlocked impressive reasoning capabilities in large language models (LLMs), yet the reasoning process remains almost exclusively English-centric. We construct translated versions of two popular English reasoning datasets, fine-tune Qwen 2.5 (7B) and Qwen 3 (8B) models, and present a systematic study of long CoT generation across French, Japanese, Latvian, and Swahili. Our experiments reveal three key findings. First, the efficacy of using English as a pivot language varies by language: it provides no benefit for French, improves performance when used as the reasoning language for Japanese and Latvian, and proves insufficient for Swahili where both task comprehension and reasoning remain poor. Second, extensive multilingual pretraining in Qwen 3 narrows but does not eliminate the cross-lingual performance gap. A lightweight fine-tune using only 1k traces still improves performance by over 30% in Swahili. Third, data quality versus scale trade-offs are language dependent: small, carefully curated datasets suffice for English and French, whereas larger but noisier corpora prove more effective for Swahili and Latvian. Together, these results clarify when and why long CoTs transfer across languages and provide translated datasets to foster equitable multilingual reasoning research.


[77] MedReseacher-R1: Expert-Level Medical Deep Researcher via A Knowledge-Informed Trajectory Synthesis Framework cs.CL

Ailing Yu, Lan Yao, Jingnan Liu, Zhe Chen, Jiajun Yin

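TL;DR: The paper presents MedResearcher-R1, a medical deep research agent trained with a knowledge-informed trajectory synthesis framework. By combining medical knowledge graphs with a specialized retrieval engine, it addresses the limitations of general-purpose LLM agents in the medical domain.

Details

Motivation: General-purpose deep research agents struggle in the medical domain, mainly because they lack dense medical knowledge and specialized retrieval tools. MedResearcher-R1 targets both problems with domain-specific innovations.

Result: MedResearcher-R1-32B sets new state-of-the-art results on medical benchmarks while remaining competitive on general deep research tasks, showing that a smaller open-source model can outperform much larger proprietary systems in a specialized domain.

Insight: Domain-specific architecture, tool design, and training data construction are key to letting smaller models outperform larger ones in specialized domains.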

Abstract: Recent developments in Large Language Model (LLM)-based agents have shown impressive capabilities spanning multiple domains, exemplified by deep research systems that demonstrate superior performance on complex information-seeking and synthesis tasks. While general-purpose deep research agents have shown impressive capabilities, they struggle significantly with medical domain challenges, as evidenced by leading proprietary systems achieving limited accuracy on complex medical benchmarks. The key limitations are: (1) the model lacks sufficient dense medical knowledge for clinical reasoning, and (2) the framework is constrained by the absence of specialized retrieval tools tailored for medical contexts. We present a medical deep research agent that addresses these challenges through two core innovations. First, we develop a novel data synthesis framework using medical knowledge graphs, extracting the longest chains from subgraphs around rare medical entities to generate complex multi-hop question-answer pairs. Second, we integrate a custom-built private medical retrieval engine alongside general-purpose tools, enabling accurate medical information synthesis. Our approach generates 2100+ diverse trajectories across 12 medical specialties, each averaging 4.2 tool interactions. Through a two-stage training paradigm combining supervised fine-tuning and online reinforcement learning with composite rewards, our MedResearcher-R1-32B model demonstrates exceptional performance, establishing new state-of-the-art results on medical benchmarks while maintaining competitive performance on general deep research tasks. Our work demonstrates that strategic domain-specific innovations in architecture, tool design, and training data construction can enable smaller open-source models to outperform much larger proprietary systems in specialized domains.


[78] Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs cs.CL | cs.AI

Haokun Lin, Haobo Xu, Yichen Wu, Ziyu Guo, Renrui Zhang

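TL;DR: The paper presents the first systematic study of post-training quantization (PTQ) for diffusion large language models (dLLMs), identifying the impact of activation outliers and evaluating quantization behavior under different configurations.

Details

Motivation: dLLMs show promise for natural language generation, but their massive parameter scale and resource demands hinder deployment on edge devices. PTQ is widely used to compress autoregressive LLMs, but its applicability to dLLMs remains largely unexplored.

Result: Activation outliers dominate the dynamic range, making it difficult for low-bit quantization to preserve precision for the majority of values. Quantization behavior depends on task type and model configuration.

Insight: Quantizing dLLMs calls for choices tailored to the task and model type, and future work should focus on handling activation outliers to improve low-bit quantization.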

Abstract: Recent advances in diffusion large language models (dLLMs) have introduced a promising alternative to autoregressive (AR) LLMs for natural language generation tasks, leveraging full attention and denoising-based decoding strategies. However, the deployment of these models on edge devices remains challenging due to their massive parameter scale and high resource demands. While post-training quantization (PTQ) has emerged as a widely adopted technique for compressing AR LLMs, its applicability to dLLMs remains largely unexplored. In this work, we present the first systematic study on quantizing diffusion-based language models. We begin by identifying the presence of activation outliers, characterized by abnormally large activation values that dominate the dynamic range. These outliers pose a key challenge to low-bit quantization, as they make it difficult to preserve precision for the majority of values. More importantly, we implement state-of-the-art PTQ methods and conduct a comprehensive evaluation across multiple task types and model variants. Our analysis is structured along four key dimensions: bit-width, quantization method, task category, and model type. Through this multi-perspective evaluation, we offer practical insights into the quantization behavior of dLLMs under different configurations. We hope our findings provide a foundation for future research in efficient dLLM deployment. All codes and experimental setups will be released to support the community.
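
Why a single outlier hurts low-bit quantization can be shown with a few lines of arithmetic. The sketch below uses plain symmetric absmax quantization on made-up values; it is a generic illustration, not one of the PTQ methods evaluated in the paper, and the function names are hypothetical.

```python
def quantize_absmax(values, bits):
    """Symmetric uniform quantization with one scale set by the largest magnitude."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    # Round to the nearest integer level, then map back to real values.
    return [round(v / scale) * scale for v in values]


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


normal = [0.1 * i for i in range(-10, 11)]  # typical activations in [-1, 1]
with_outlier = normal + [100.0]             # one abnormally large activation

err_clean = mean_abs_error(normal, quantize_absmax(normal, 4))
dequantized = quantize_absmax(with_outlier, 4)
err_outlier = mean_abs_error(normal, dequantized[: len(normal)])
print(err_clean, err_outlier)
```

At 4 bits the outlier stretches the step size to about 14.3, so every typical value rounds to zero and the error on the majority of values grows sharply compared with the outlier-free case.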


cs.AI [Back]

[79] Who Sees What? Structured Thought-Action Sequences for Epistemic Reasoning in LLMs cs.AI | cs.CL | cs.HC | I.2.9; I.2.10; I.2.7; J.4

Luca Annese, Sabrina Patania, Silvia Serino, Tom Foulsham, Silvia Rossi

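TL;DR: The paper investigates whether structured thought-action sequences improve LLM performance on epistemic reasoning tasks, and finds that structured examples alone are insufficient for robust perspective-taking.

Details

Motivation: Current LLM-based systems struggle with tasks that involve active perception, collaborative reasoning, and perspective-taking; the authors test whether structured examples can help.

Result: L-type examples slightly reduce clarification requests and overall action steps but do not yield consistent improvements. Agents succeed at basic attentional filtering but struggle when they must reason about occluded spaces or weigh the costs of epistemic actions.

Insight: Structured examples contribute little to perspective-taking; explicit belief tracking, cost modelling, and richer environments are needed for socially grounded collaboration in LLM-based agents.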

Abstract: Recent advances in large language models (LLMs) and reasoning frameworks have opened new possibilities for improving the perspective-taking capabilities of autonomous agents. However, tasks that involve active perception, collaborative reasoning, and perspective-taking (understanding what another agent can see or knows) pose persistent challenges for current LLM-based systems. This study investigates the potential of structured examples derived from transformed solution graphs generated by the Fast Downward planner to improve the performance of LLM-based agents within a ReAct framework. We propose a structured solution-processing pipeline that generates three distinct categories of examples: optimal goal paths (G-type), informative node paths (E-type), and step-by-step optimal decision sequences contrasting alternative actions (L-type). These solutions are further converted into “thought-action” examples by prompting an LLM to explicitly articulate the reasoning behind each decision. While L-type examples slightly reduce clarification requests and overall action steps, they do not yield consistent improvements. Agents are successful in tasks requiring basic attentional filtering but struggle in scenarios that require mentalising about occluded spaces or weighing the costs of epistemic actions. These findings suggest that structured examples alone are insufficient for robust perspective-taking, underscoring the need for explicit belief tracking, cost modelling, and richer environments to enable socially grounded collaboration in LLM-based agents.


[80] MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers cs.AI | cs.CL

Ziyang Luo, Zhiqi Shen, Wenzhuo Yang, Zirui Zhao, Prathyusha Jwalapuram

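TL;DR: The paper introduces MCP-Universe, the first comprehensive benchmark designed to evaluate LLMs through interaction with real-world Model Context Protocol (MCP) servers. It covers 6 core domains and 11 MCP servers and exposes the limitations of current SOTA models in long-horizon reasoning and unfamiliar tool spaces.

Details

Motivation: Existing benchmarks are overly simplistic and fail to capture real application challenges such as long-horizon reasoning and large, unfamiliar tool spaces. MCP-Universe is built to fill this gap.

Result: Even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%), and Claude-4.0-Sonnet (29.44%) show significant performance limitations. The benchmark also exposes long-context and unknown-tools challenges for LLM agents.

Insight: 1. Current LLMs fall short in long-horizon reasoning and unfamiliar tool spaces; 2. enterprise-level agents such as Cursor do not outperform standard ReAct frameworks; 3. the open-source, extensible evaluation framework is intended to foster innovation in the MCP ecosystem.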

Abstract: The Model Context Protocol has emerged as a transformative standard for connecting large language models to external data sources and tools, rapidly gaining adoption across major AI providers and development platforms. However, existing benchmarks are overly simplistic and fail to capture real application challenges such as long-horizon reasoning and large, unfamiliar tool spaces. To address this critical gap, we introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real-world MCP servers. Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. To ensure rigorous evaluation, we implement execution-based evaluators, including format evaluators for agent format compliance, static evaluators for time-invariant content matching, and dynamic evaluators that automatically retrieve real-time ground truth for temporally sensitive tasks. Through extensive evaluation of leading LLMs, we find that even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%) and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations. In addition, our benchmark poses a significant long-context challenge for LLM agents, as the number of input tokens increases rapidly with the number of interaction steps. Moreover, it introduces an unknown-tools challenge, as LLM agents often lack familiarity with the precise usage of the MCP servers. Notably, enterprise-level agents like Cursor cannot achieve better performance than standard ReAct frameworks. Beyond evaluation, we open-source our extensible evaluation framework with UI support, enabling researchers and practitioners to seamlessly integrate new agents and MCP servers while fostering innovation in the rapidly evolving MCP ecosystem.


eess.IV [Back]

[81] 3D Cardiac Anatomy Generation Using Mesh Latent Diffusion Models eess.IV | cs.CV | cs.LG | q-bio.TO

Jolanta Mozyrska, Marcel Beetz, Luke Melas-Kyriazi, Vicente Grau, Abhirup Banerjee

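TL;DR: The paper proposes MeshLDM, a latent diffusion model (LDM) architecture for generating 3D cardiac anatomy meshes. Evaluated on left ventricular anatomies from patients with acute myocardial infarction, the generated meshes differ from the gold standard by 2.4% in population mean.

Details

Motivation: Applications of diffusion models to 3D medical imaging are scarce, especially in cardiology, yet generating diverse, realistic cardiac anatomies is crucial for in silico trials, electromechanical simulations, and data augmentation.

Result: MeshLDM performs well on clinical and 3D mesh reconstruction metrics, with a 2.4% difference in population mean compared to the gold standard.

Insight: Diffusion models show promise for 3D medical image generation, particularly in cardiology, and the approach could be extended to other organs or pathologies.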

Abstract: Diffusion models have recently gained immense interest for their generative capabilities, specifically the high quality and diversity of the synthesized data. However, examples of their applications in 3D medical imaging are still scarce, especially in cardiology. Generating diverse realistic cardiac anatomies is crucial for applications such as in silico trials, electromechanical computer simulations, or data augmentations for machine learning models. In this work, we investigate the application of Latent Diffusion Models (LDMs) for generating 3D meshes of human cardiac anatomies. To this end, we propose a novel LDM architecture – MeshLDM. We apply the proposed model on a dataset of 3D meshes of left ventricular cardiac anatomies from patients with acute myocardial infarction and evaluate its performance in terms of both qualitative and quantitative clinical and 3D mesh reconstruction metrics. The proposed MeshLDM successfully captures characteristics of the cardiac shapes at end-diastolic (relaxation) and end-systolic (contraction) cardiac phases, generating meshes with a 2.4% difference in population mean compared to the gold standard.


[82] Automated surgical planning with nnU-Net: delineation of the anatomy in hepatobiliary phase MRI eess.IV | cs.AI | cs.CV

Karin A. Olthof, Matteo Fusagli, Bianca Güttner, Tiziano Natali, Bram Westerink

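TL;DR: The study develops an nnU-Net-based deep learning method that automatically segments hepatic anatomy from hepatobiliary phase MRI to streamline preoperative planning.

Details

Motivation: Preoperative planning is essential in liver surgery, but manual segmentation of hepatic anatomy is time-consuming and subjective. The study aims to reduce this clinical burden through automation.

Result: The model performs well on the test set, especially for liver parenchyma (DSC 0.97). In clinical assessment only minor adjustments were needed, and the model detected three additional tumors initially missed by radiologists.

Insight: Automated segmentation has practical clinical value: it can complement manual review and offers a standardized tool for preoperative planning in hepatobiliary surgery.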

Abstract: Background: The aim of this study was to develop and evaluate a deep learning-based automated segmentation method for hepatic anatomy (i.e., parenchyma, tumors, portal vein, hepatic vein and biliary tree) from the hepatobiliary phase of gadoxetic acid-enhanced MRI. This method should ease the clinical workflow of preoperative planning. Methods: Manual segmentation was performed on hepatobiliary phase MRI scans from 90 consecutive patients who underwent liver surgery between January 2020 and October 2023. A deep learning network (nnU-Net v1) was trained on 72 patients with an extra focus on thin structures and topography preservation. Performance was evaluated on an 18-patient test set by comparing automated and manual segmentations using Dice similarity coefficient (DSC). Following clinical integration, 10 segmentations (assessment dataset) were generated using the network and manually refined for clinical use to quantify required adjustments using DSC. Results: In the test set, DSCs were 0.97+/-0.01 for liver parenchyma, 0.80+/-0.04 for hepatic vein, 0.79+/-0.07 for biliary tree, 0.77+/-0.17 for tumors, and 0.74+/-0.06 for portal vein. Average tumor detection rate was 76.6+/-24.1%, with a median of one false-positive per patient. The assessment dataset showed minor adjustments were required for clinical use of the 3D models, with high DSCs for parenchyma (1.00+/-0.00), portal vein (0.98+/-0.01) and hepatic vein (0.95+/-0.07). Tumor segmentation exhibited greater variability (DSC 0.80+/-0.27). During prospective clinical use, the model detected three additional tumors initially missed by radiologists. Conclusions: The proposed nnU-Net-based segmentation method enables accurate and automated delineation of hepatic anatomy. This enables 3D planning to be applied efficiently as a standard-of-care for every patient undergoing liver surgery.
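
Since every result above is reported as a Dice similarity coefficient (DSC), a minimal computation may help when reading the numbers. The sketch represents each segmentation as a set of foreground voxel indices; returning 1.0 when both masks are empty is a convention assumed here, not something stated in the study.

```python
def dice(pred, truth):
    """Dice similarity coefficient for two sets of foreground voxel indices."""
    if not pred and not truth:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * len(pred & truth) / (len(pred) + len(truth))


# Toy 2D example: the masks share two of their three voxels.
automated = {(0, 0), (0, 1), (1, 1)}
manual = {(0, 1), (1, 1), (1, 0)}
score = dice(automated, manual)
print(round(score, 3))
```

DSC is 2|A ∩ B| / (|A| + |B|), so it ranges from 0 (no overlap) to 1 (identical masks); thin structures such as vessels tend to score lower because a small boundary shift removes a large share of the overlap.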


[83] A Systematic Study of Deep Learning Models and xAI Methods for Region-of-Interest Detection in MRI Scans eess.IV | cs.AI | cs.CV

Justin Yiu, Kushank Arora, Daniel Steinberg, Rohit Ghiya

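TL;DR: The paper systematically evaluates deep learning architectures combined with explainable AI (xAI) methods for automated region-of-interest (ROI) detection in knee MRI scans, and finds that ResNet50 performs best in classification and ROI identification.

Details

Motivation: Manual MRI interpretation is time-consuming and prone to inter-observer variability, so automated ROI detection is needed to improve efficiency and accuracy.

Result: ResNet50 performs best in classification and ROI identification; Grad-CAM provides the most clinically meaningful explanations; transformer models are constrained by the dataset size and do not reach their full potential.

Insight: 1. CNN-based transfer learning is the most effective approach on this dataset. 2. Transformer models may need larger-scale pretraining to reach their potential. 3. Grad-CAM is the most suitable xAI tool among those tested.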

Abstract: Magnetic Resonance Imaging (MRI) is an essential diagnostic tool for assessing knee injuries. However, manual interpretation of MRI slices remains time-consuming and prone to inter-observer variability. This study presents a systematic evaluation of various deep learning architectures combined with explainable AI (xAI) techniques for automated region of interest (ROI) detection in knee MRI scans. We investigate both supervised and self-supervised approaches, including ResNet50, InceptionV3, Vision Transformers (ViT), and multiple U-Net variants augmented with multi-layer perceptron (MLP) classifiers. To enhance interpretability and clinical relevance, we integrate xAI methods such as Grad-CAM and Saliency Maps. Model performance is assessed using AUC for classification and PSNR/SSIM for reconstruction quality, along with qualitative ROI visualizations. Our results demonstrate that ResNet50 consistently excels in classification and ROI identification, outperforming transformer-based models under the constraints of the MRNet dataset. While hybrid U-Net + MLP approaches show potential for leveraging spatial features in reconstruction and interpretability, their classification performance remains lower. Grad-CAM consistently provided the most clinically meaningful explanations across architectures. Overall, CNN-based transfer learning emerges as the most effective approach for this dataset, while future work with larger-scale pretraining may better unlock the potential of transformer models.


[84] Fine-grained Image Quality Assessment for Perceptual Image Restoration eess.IV | cs.CV | cs.MM

Xiangfei Sheng, Xiaofeng Pan, Zhichao Yang, Pengfei Chen, Leida Li

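TL;DR: The paper contributes FGRestore, a fine-grained image quality assessment (IQA) dataset, and proposes FGResQ, a new IQA model designed specifically for image restoration.

Details

Motivation: Existing IQA metrics perform poorly on image restoration, particularly when distinguishing fine-grained quality differences among restored images. The authors propose a new dataset and model to address this.

Result: Experiments show that FGResQ significantly outperforms state-of-the-art IQA metrics on image restoration.

Insight: Score-based IQA metrics can be inconsistent with fine-grained restoration quality; fine-grained evaluation reflects restoration quality more faithfully.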

Abstract: Recent years have witnessed remarkable achievements in perceptual image restoration (IR), creating an urgent demand for accurate image quality assessment (IQA), which is essential for both performance comparison and algorithm optimization. Unfortunately, the existing IQA metrics exhibit inherent weaknesses for IR tasks, particularly when distinguishing fine-grained quality differences among restored images. To address this dilemma, we contribute the first-of-its-kind fine-grained image quality assessment dataset for image restoration, termed FGRestore, comprising 18,408 restored images across six common IR tasks. Beyond conventional scalar quality scores, FGRestore was also annotated with 30,886 fine-grained pairwise preferences. Based on FGRestore, a comprehensive benchmark was conducted on the existing IQA metrics, which reveals significant inconsistencies between score-based IQA evaluations and the fine-grained restoration quality. Motivated by these findings, we further propose FGResQ, a new IQA model specifically designed for image restoration, which features both coarse-grained score regression and fine-grained quality ranking. Extensive experiments and comparisons demonstrate that FGResQ significantly outperforms state-of-the-art IQA metrics. Codes and model weights have been released at https://pxf0429.github.io/FGResQ/
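
The inconsistency between scalar scores and pairwise preferences can be made concrete by checking how often a metric orders a pair the way annotators did. The sketch below is a generic illustration with invented scores and preferences; the function name and data are hypothetical, and this is not the FGRestore benchmark protocol.

```python
def pairwise_accuracy(scores, preferences):
    """Fraction of (better, worse) pairs that the metric ranks in the same order."""
    correct = sum(
        1 for better, worse in preferences if scores[better] > scores[worse]
    )
    return correct / len(preferences)


# Invented metric scores for three restored images.
scores = {"a": 0.81, "b": 0.80, "c": 0.60}
# Invented human preferences, written as (preferred image, other image).
preferences = [("b", "a"), ("a", "c"), ("b", "c")]

accuracy = pairwise_accuracy(scores, preferences)
print(round(accuracy, 3))
```

The metric separates the clearly worse image `c` but misorders the close pair `a` and `b`, so it agrees with two of the three preferences; fine-grained pairs of this kind are where score-based evaluation tends to break down.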


[85] From Slices to Structures: Unsupervised 3D Reconstruction of Female Pelvic Anatomy from Freehand Transvaginal Ultrasound eess.IV | cs.CV

Max Krähenmann, Sergio Tascon-Morales, Fabian Laumer, Julia E. Vogt, Ece Ozkan

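TL;DR: The paper proposes an unsupervised framework that reconstructs 3D female pelvic anatomy from freehand 2D transvaginal ultrasound (TVS) sweeps without external tracking or learned pose estimators. It introduces a slice-aware, differentiable rasterizer tailored to the physics and geometry of ultrasound imaging to achieve 3D reconstruction with high spatial fidelity.

Details

Motivation: Conventional 3D ultrasound depends on specialized hardware and restrictive acquisition protocols, which limits adoption. The paper aims to reconstruct 3D structure from 2D ultrasound images by purely computational means as a scalable alternative.

Result: The method produces a compact, flexible volumetric representation with high spatial fidelity, opening opportunities for AI-assisted analysis and diagnosis.

Insight: 3D reconstruction through purely computational means is feasible, which reduces hardware dependence and opens new directions for AI in ultrasound.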

Abstract: Volumetric ultrasound has the potential to significantly improve diagnostic accuracy and clinical decision-making, yet its widespread adoption remains limited by dependence on specialized hardware and restrictive acquisition protocols. In this work, we present a novel unsupervised framework for reconstructing 3D anatomical structures from freehand 2D transvaginal ultrasound (TVS) sweeps, without requiring external tracking or learned pose estimators. Our method adapts the principles of Gaussian Splatting to the domain of ultrasound, introducing a slice-aware, differentiable rasterizer tailored to the unique physics and geometry of ultrasound imaging. We model anatomy as a collection of anisotropic 3D Gaussians and optimize their parameters directly from image-level supervision, leveraging sensorless probe motion estimation and domain-specific geometric priors. The result is a compact, flexible, and memory-efficient volumetric representation that captures anatomical detail with high spatial fidelity. This work demonstrates that accurate 3D reconstruction from 2D ultrasound images can be achieved through purely computational means, offering a scalable alternative to conventional 3D systems and enabling new opportunities for AI-assisted analysis and diagnosis.


[86] Virtual Multiplex Staining for Histological Images using a Marker-wise Conditioned Diffusion Model eess.IV | cs.CV

Hyun-Jic Oh, Junsik Kim, Zhiyi Shi, Yichen Wu, Yu-An Chen

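TL;DR: The paper proposes a framework based on a marker-wise conditioned diffusion model that generates virtual multiplex images from H&E images. It addresses the cost and complexity of multiplex imaging and substantially increases the number of marker types that can be generated, with improved accuracy.

Details

Motivation: Multiplex imaging plays an important role in pathology, but its complexity and cost limit adoption. Most existing large H&E repositories lack corresponding multiplex images, which limits multimodal analysis.

Result: Validated on two public datasets, the framework generates up to 18 marker types with improved accuracy, compared with the 2-3 marker types achieved in previous approaches.

Insight: The method bridges H&E and multiplex imaging and could enable retrospective studies and large-scale analyses of existing H&E image repositories.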

Abstract: Multiplex imaging is revolutionizing pathology by enabling the simultaneous visualization of multiple biomarkers within tissue samples, providing molecular-level insights that traditional hematoxylin and eosin (H&E) staining cannot provide. However, the complexity and cost of multiplex data acquisition have hindered its widespread adoption. Additionally, most existing large repositories of H&E images lack corresponding multiplex images, limiting opportunities for multimodal analysis. To address these challenges, we leverage recent advances in latent diffusion models (LDMs), which excel at modeling complex data distributions utilizing their powerful priors for fine-tuning to a target domain. In this paper, we introduce a novel framework for virtual multiplex staining that utilizes pretrained LDM parameters to generate multiplex images from H&E images using a conditional diffusion model. Our approach enables marker-by-marker generation by conditioning the diffusion model on each marker, while sharing the same architecture across all markers. To tackle the challenge of varying pixel value distributions across different marker stains and to improve inference speed, we fine-tune the model for single-step sampling, enhancing both color contrast fidelity and inference efficiency through pixel-level loss functions. We validate our framework on two publicly available datasets, notably demonstrating its effectiveness in generating up to 18 different marker types with improved accuracy, a substantial increase over the 2-3 marker types achieved in previous approaches. This validation highlights the potential of our framework, pioneering virtual multiplex staining. Finally, this paper bridges the gap between H&E and multiplex imaging, potentially enabling retrospective studies and large-scale analyses of existing H&E image repositories.


cs.GR

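[87] A Real-world Display Inverse Rendering Dataset cs.GR | cs.CV

Seokjun Choi, Hoon-Gyu Chung, Yujin Jeon, Giljoo Nam, Seung-Hwan Baek

TL;DR: This paper presents the first real-world inverse rendering dataset captured with a display-camera system, filling a data gap in the field and supporting the development and evaluation of display-based inverse rendering methods.

Details

Motivation: Existing inverse rendering datasets are mostly captured with setups such as light stages, and no public dataset captured with a display-camera system exists. This gap hinders the development of display-based methods, so the authors build and release the first such dataset.

Result: The dataset supports synthesizing images under arbitrary display patterns and noise levels, and the simple baseline proposed in the paper outperforms state-of-the-art inverse rendering methods.

Insight: Display-camera systems offer unique advantages for inverse rendering: each pixel can act as a programmable point light source, and the polarized light emitted by LCD displays facilitates diffuse-specular separation. A public dataset should help advance research in this area.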

Abstract: Inverse rendering aims to reconstruct geometry and reflectance from captured images. Display-camera imaging systems offer unique advantages for this task: each pixel can easily function as a programmable point light source, and the polarized light emitted by LCD displays facilitates diffuse-specular separation. Despite these benefits, there is currently no public real-world dataset captured using display-camera systems, unlike other setups such as light stages. This absence hinders the development and evaluation of display-based inverse rendering methods. In this paper, we introduce the first real-world dataset for display-based inverse rendering. To achieve this, we construct and calibrate an imaging system comprising an LCD display and stereo polarization cameras. We then capture a diverse set of objects with diverse geometry and reflectance under one-light-at-a-time (OLAT) display patterns. We also provide high-quality ground-truth geometry. Our dataset enables the synthesis of captured images under arbitrary display patterns and different noise levels. Using this dataset, we evaluate the performance of existing photometric stereo and inverse rendering methods, and provide a simple, yet effective baseline for display inverse rendering, outperforming state-of-the-art inverse rendering methods. Code and dataset are available on our project page at https://michaelcsj.github.io/DIR/


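[88] MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds cs.GR | cs.CV

Bingquan Dai, Li Ray Luo, Qihong Tang, Jie Wang, Xinyu Lian

TL;DR: MeshCoder is a framework that reconstructs complex 3D objects from point clouds into editable Blender Python scripts, using a multimodal large language model for shape-to-code translation.

Details

Motivation: Existing methods rely on limited domain-specific languages (DSLs) and small-scale datasets, which restricts their ability to model complex geometry and structure. MeshCoder aims to remove these limits.

Result: MeshCoder achieves superior performance on shape-to-code reconstruction and supports intuitive geometric and topological editing through code modifications.

Insight: A code-based representation improves LLM reasoning on 3D shape understanding tasks and offers a flexible solution for programmatic 3D shape reconstruction.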

Abstract: Reconstructing 3D objects into editable programs is pivotal for applications like reverse engineering and shape editing. However, existing methods often rely on limited domain-specific languages (DSLs) and small-scale datasets, restricting their ability to model complex geometries and structures. To address these challenges, we introduce MeshCoder, a novel framework that reconstructs complex 3D objects from point clouds into editable Blender Python scripts. We develop a comprehensive set of expressive Blender Python APIs capable of synthesizing intricate geometries. Leveraging these APIs, we construct a large-scale paired object-code dataset, where the code for each object is decomposed into distinct semantic parts. Subsequently, we train a multimodal large language model (LLM) that translates 3D point clouds into executable Blender Python scripts. Our approach not only achieves superior performance in shape-to-code reconstruction tasks but also facilitates intuitive geometric and topological editing through convenient code modifications. Furthermore, our code-based representation enhances the reasoning capabilities of LLMs in 3D shape understanding tasks. Together, these contributions establish MeshCoder as a powerful and flexible solution for programmatic 3D shape reconstruction and understanding.


cs.IR

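[89] FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering cs.IR | cs.AI | cs.CL

Chanyeol Choi, Jihoon Kwon, Alejandro Lopez-Lira, Chaewoon Kim, Minjae Kim

TL;DR: FinAgentBench is the first large-scale benchmark for retrieval with multi-step reasoning in finance, designed to evaluate the retrieval ability of LLM agents in financial question answering.

Details

Motivation: Traditional retrieval methods fall short in retrieval accuracy in the financial domain and lack multi-step reasoning, and no dedicated benchmark exists to evaluate such capabilities. FinAgentBench fills this gap.

Result: The authors evaluate a suite of state-of-the-art models and show that targeted fine-tuning can significantly improve agentic retrieval performance.

Insight: Finance calls for agentic retrieval with multi-step reasoning, and FinAgentBench provides a foundation for studying LLM behavior in complex, domain-specific tasks.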

Abstract: Accurate information retrieval (IR) is critical in the financial domain, where investors must identify relevant information from large collections of documents. Traditional IR methods, whether sparse or dense, often fall short in retrieval accuracy, as the task requires not only capturing semantic similarity but also performing fine-grained reasoning over document structure and domain-specific knowledge. Recent advances in large language models (LLMs) have opened up new opportunities for retrieval with multi-step reasoning, where the model ranks passages through iterative reasoning about which information is most relevant to a given query. However, there exists no benchmark to evaluate such capabilities in the financial domain. To address this gap, we introduce FinAgentBench, the first large-scale benchmark for evaluating retrieval with multi-step reasoning in finance, a setting we term agentic retrieval. The benchmark consists of 3,429 expert-annotated examples on S&P-100 listed firms and assesses whether LLM agents can (1) identify the most relevant document type among candidates, and (2) pinpoint the key passage within the selected document. Our evaluation framework explicitly separates these two reasoning steps to address context limitations. This design provides a quantitative basis for understanding retrieval-centric LLM behavior in finance. We evaluate a suite of state-of-the-art models and further demonstrate how targeted fine-tuning can significantly improve agentic retrieval performance. Our benchmark provides a foundation for studying retrieval-centric LLM behavior in complex, domain-specific tasks for finance. We will release the dataset publicly upon acceptance of the paper and plan to expand and share the dataset for the full S&P 500 and beyond.


cs.LG

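[90] GLASS: Test-Time Acceleration for LLMs via Global-Local Neural Importance Aggregation cs.LG | cs.AI | cs.CL

Amirmohsen Sattarifard, Sepehr Lavasani, Ehsan Imani, Kunlin Zhang, Hanlin Xu

TL;DR: The paper proposes GLASS (Global-Local Neural Importance Aggregation), which dynamically selects FFN units by combining prompt-local statistics with model-intrinsic global statistics, markedly improving LLM inference acceleration, especially on long-form generation tasks.

Details

Motivation: Deploying LLMs on edge hardware requires efficient dynamic pruning. Static or predictor-based schemes either lock in a fixed sparsity pattern or incur extra runtime overhead, while zero-shot methods perform poorly with short prompts or long generations.

Result: Across multiple LLMs and benchmarks, GLASS significantly outperforms existing zero-shot pruning methods, most notably on long-form generation.

Insight: Dynamic pruning that combines local and global statistics adapts better to different tasks without extra training or runtime overhead, suggesting a practical route to efficient LLM deployment on edge devices.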

Abstract: Deploying Large Language Models (LLMs) on edge hardware demands aggressive, prompt-aware dynamic pruning to reduce computation without degrading quality. Static or predictor-based schemes either lock in a single sparsity pattern or incur extra runtime overhead, and recent zero-shot methods that rely on statistics from a single prompt fail in short-prompt and/or long-generation scenarios. We introduce A/I-GLASS: Activation- and Impact-based Global-Local neural importance Aggregation for feed-forward network SparSification, two training-free methods that dynamically select FFN units using a rank-aggregation of prompt-local and model-intrinsic global neuron statistics. Empirical results across multiple LLMs and benchmarks demonstrate that GLASS significantly outperforms prior training-free methods, particularly in challenging long-form generation scenarios, without relying on auxiliary predictors or adding any inference overhead.
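
The rank aggregation of local and global neuron statistics mentioned in the abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names `ranks` and `select_units`, the mixing weight `lam`, and the weighted rank sum itself. The paper's actual aggregation rule and importance statistics may differ.

```python
def ranks(scores):
    """Rank each entry by importance: rank 0 goes to the largest score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def select_units(local_scores, global_scores, keep, lam=0.5):
    """Pick `keep` FFN unit indices from a weighted sum of two rankings.

    local_scores:  per-unit importance measured on the current prompt
    global_scores: per-unit importance that is intrinsic to the model
    lam:           hypothetical weight on the local ranking (assumption)
    """
    local_r = ranks(local_scores)
    global_r = ranks(global_scores)
    combined = [lam * l + (1.0 - lam) * g for l, g in zip(local_r, global_r)]
    # A lower combined rank means a more important unit.
    order = sorted(range(len(combined)), key=lambda i: combined[i])
    return sorted(order[:keep])


local_scores = [0.9, 0.1, 0.5, 0.3]
global_scores = [0.2, 0.8, 0.6, 0.1]
kept = select_units(local_scores, global_scores, keep=2)
```

Setting `lam=1.0` falls back to prompt-only statistics, which is the regime the abstract describes as unreliable for short prompts; `lam=0.0` ignores the prompt entirely.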


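[91] DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization cs.LG | cs.CL

Shuaijie She, Yu Bao, Yu Lu, Lu Xu, Tao Li

TL;DR: DuPO is a dual learning-based preference optimization framework that uses a generalized duality to enable reliable LLM self-verification without annotated feedback, improving performance on tasks such as translation and mathematical reasoning.

Details

Motivation: RLVR relies on costly labels and applies only to verifiable tasks, while traditional dual learning applies only to strictly dual task pairs. DuPO addresses both limitations with an annotation-free optimization framework that also covers tasks that are not strictly dual.

Result: DuPO improves translation quality by an average of 2.13 COMET across 756 directions, improves accuracy by an average of 6.4 points on three mathematical reasoning benchmarks, and gains 9.3 points when used as an inference-time reranker.

Insight: DuPO shows that LLMs can improve themselves through self-supervised dual tasks, offering a scalable and general approach to LLM optimization.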

Abstract: We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)’s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning’s restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task’s input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs’ ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it enhances the average translation quality by 2.13 COMET over 756 directions, boosts the mathematical reasoning accuracy by an average of 6.4 points on three challenge benchmarks, and enhances performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.
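
The reconstruction-based reward can be illustrated with a toy arithmetic task. This is only a sketch under stated assumptions: in DuPO both the primal and the dual task are carried out by the LLM, whereas here the dual task is a hand-written function, and the names `dual_reconstruct` and `reconstruction_reward` are hypothetical. The primal task is "given a known `a` and an unknown `b`, output `a + b`"; the dual task recovers `b` from the output and `a`.

```python
def dual_reconstruct(primal_output, known_a):
    """Dual task for the toy problem: recover the hidden addend."""
    return primal_output - known_a


def reconstruction_reward(primal_output, known_a, hidden_b):
    """Reward 1.0 when the dual task recovers the hidden input component."""
    return 1.0 if dual_reconstruct(primal_output, known_a) == hidden_b else 0.0


# Three candidate answers to "3 + 5"; no answer label is consulted,
# only whether the hidden component of the input can be reconstructed.
candidates = [7, 8, 9]
rewards = [reconstruction_reward(c, known_a=3, hidden_b=5) for c in candidates]
best = candidates[rewards.index(max(rewards))]
```

The same scores can be used to rank sampled candidates at inference time, which mirrors the reranker use described in the abstract.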


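[92] STAS: Spatio-Temporal Adaptive Computation Time for Spiking Transformers cs.LG | cs.AI | cs.CV | cs.NE

Donghwa Kang, Doohyun Kim, Sang-Ki Ko, Jinkyu Lee, Brent ByungHoon Kang

TL;DR: STAS is a spatio-temporal adaptive computation time framework that co-designs the static architecture and the dynamic computation policy to address the high latency and computational overhead of SNNs, improving both energy efficiency and accuracy.

Details

Motivation: Spiking neural networks (SNNs) are more energy-efficient than conventional artificial neural networks (ANNs), but their multi-timestep operation causes high latency and computational overhead. Existing methods do not address spatial and temporal redundancy in a unified way, so an integrated approach is needed.

Result: On CIFAR-10, CIFAR-100, and ImageNet, STAS reduces energy consumption by up to 45.9%, 43.8%, and 30.1%, respectively, while improving accuracy over state-of-the-art models.

Insight: By co-designing the static architecture and the dynamic computation policy, STAS shows that spatio-temporal redundancy in SNNs can be handled efficiently, pointing to a direction for future energy-efficient computing.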

Abstract: Spiking neural networks (SNNs) offer energy efficiency over artificial neural networks (ANNs) but suffer from high latency and computational overhead due to their multi-timestep operational nature. While various dynamic computation methods have been developed to mitigate this by targeting spatial, temporal, or architecture-specific redundancies, they remain fragmented. While the principles of adaptive computation time (ACT) offer a robust foundation for a unified approach, its application to SNN-based vision Transformers (ViTs) is hindered by two core issues: the violation of its temporal similarity prerequisite and a static architecture fundamentally unsuited for its principles. To address these challenges, we propose STAS (Spatio-Temporal Adaptive computation time for Spiking transformers), a framework that co-designs the static architecture and dynamic computation policy. STAS introduces an integrated spike patch splitting (I-SPS) module to establish temporal stability by creating a unified input representation, thereby solving the architectural problem of temporal dissimilarity. This stability, in turn, allows our adaptive spiking self-attention (A-SSA) module to perform two-dimensional token pruning across both spatial and temporal axes. Implemented on spiking Transformer architectures and validated on CIFAR-10, CIFAR-100, and ImageNet, STAS reduces energy consumption by up to 45.9%, 43.8%, and 30.1%, respectively, while simultaneously improving accuracy over SOTA models.


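[93] Organ-Agents: Virtual Human Physiology Simulator via LLMs cs.LG | cs.AI | cs.CV

Rihao Chang, He Jiao, Weizhi Nie, Honglin Guo, Keliang Xie

TL;DR: The paper proposes Organ-Agents, a multi-agent framework that uses large language models (LLMs) to simulate human physiological systems. Trained with supervised fine-tuning on system-specific data followed by reinforcement-guided coordination, it achieves high simulation accuracy, holds up under external validation, is judged realistic by physicians, and supports clinical decision tasks.

Details

Motivation: Advances in large language models make it possible to simulate complex physiological systems. The authors aim to build a credible, interpretable, and generalizable digital twin for precision diagnosis and treatment simulation in critical care.

Result: 1. On 4,509 held-out patients, the per-system mean squared error (MSE) is below 0.16.
2. External validation on ICU patients from two hospitals shows stable simulation with moderate degradation under distribution shift.
3. Counterfactual simulations produce trajectories aligned with matched real-world patients.
4. Evaluation by 15 critical care physicians confirmed realism and physiological plausibility (mean Likert ratings 3.9 and 3.7).

Insight: 1. A multi-agent framework can simulate the dynamics of complex physiological systems effectively.
2. Beyond high simulation accuracy, the model preserves decision-relevant patterns and supports downstream clinical tasks.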

Abstract: Recent advances in large language models (LLMs) have enabled new possibilities in simulating complex physiological systems. We introduce Organ-Agents, a multi-agent framework that simulates human physiology via LLM-driven agents. Each Simulator models a specific system (e.g., cardiovascular, renal, immune). Training consists of supervised fine-tuning on system-specific time-series data, followed by reinforcement-guided coordination using dynamic reference selection and error correction. We curated data from 7,134 sepsis patients and 7,895 controls, generating high-resolution trajectories across 9 systems and 125 variables. Organ-Agents achieved high simulation accuracy on 4,509 held-out patients, with per-system MSEs <0.16 and robustness across SOFA-based severity strata. External validation on 22,689 ICU patients from two hospitals showed moderate degradation under distribution shifts with stable simulation. Organ-Agents faithfully reproduces critical multi-system events (e.g., hypotension, hyperlactatemia, hypoxemia) with coherent timing and phase progression. Evaluation by 15 critical care physicians confirmed realism and physiological plausibility (mean Likert ratings 3.9 and 3.7). Organ-Agents also enables counterfactual simulations under alternative sepsis treatment strategies, generating trajectories and APACHE II scores aligned with matched real-world patients. In downstream early warning tasks, classifiers trained on synthetic data showed minimal AUROC drops (<0.04), indicating preserved decision-relevant patterns. These results position Organ-Agents as a credible, interpretable, and generalizable digital twin for precision diagnosis, treatment simulation, and hypothesis testing in critical care.


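[94] Understanding Data Influence with Differential Approximation cs.LG | cs.CV

Haoru Tan, Sitong Wu, Xiuzhe Wu, Wang Wang, Bo Zhao

TL;DR: The paper proposes Diff-In, a new method for quantifying data influence that approximates a sample's influence by accumulating the differences in influence across consecutive training steps. Using second-order approximations, it improves accuracy without requiring model convexity, keeps computational complexity comparable to first-order methods, and performs well across several data-centric tasks.

Details

Motivation: Existing data influence analysis tools lack accuracy because of restrictive assumptions such as a convex loss function, which makes it hard for them to support data utilization during model training. The paper aims for a more accurate and efficient way to quantify data influence.

Result: Theoretical analysis shows that Diff-In has significantly lower approximation error than existing estimators. Experiments confirm its advantage in data cleaning, data deletion, and coreset selection, and it performs particularly well in data pruning for large-scale vision-language pre-training.

Insight: 1. Approximating influence by accumulating differences gives a more flexible and efficient solution.
2. A second-order method can be made as computationally efficient as first-order methods, which broadens its applicability.
3. Diff-In offers a practical approach to data influence analysis for non-convex models.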

Abstract: Data plays a pivotal role in the groundbreaking advancements in artificial intelligence. The quantitative analysis of data significantly contributes to model training, enhancing both the efficiency and quality of data utilization. However, existing data analysis tools often lag in accuracy. For instance, many of these tools even assume that the loss function of neural networks is convex. These limitations make it challenging to implement current methods effectively. In this paper, we introduce a new formulation to approximate a sample’s influence by accumulating the differences in influence between consecutive learning steps, which we term Diff-In. Specifically, we formulate the sample-wise influence as the cumulative sum of its changes/differences across successive training iterations. By employing second-order approximations, we approximate these difference terms with high accuracy while eliminating the need for model convexity required by existing methods. Despite being a second-order method, Diff-In maintains computational complexity comparable to that of first-order methods and remains scalable. This efficiency is achieved by computing the product of the Hessian and gradient, which can be efficiently approximated using finite differences of first-order gradients. We assess the approximation accuracy of Diff-In both theoretically and empirically. Our theoretical analysis demonstrates that Diff-In achieves significantly lower approximation error compared to existing influence estimators. Extensive experiments further confirm its superior performance across multiple benchmark datasets in three data-centric tasks: data cleaning, data deletion, and coreset selection. Notably, our experiments on data pruning for large-scale vision-language pre-training show that Diff-In can scale to millions of data points and outperforms strong baselines.
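
The abstract notes that the Hessian-gradient product can be approximated with finite differences of first-order gradients. A minimal sketch of that generic identity, Hv ≈ (∇L(θ + εv) − ∇L(θ − εv)) / (2ε), follows. It is not the Diff-In estimator itself; the toy quadratic loss, the function names `grad_loss` and `hvp_finite_difference`, and the step size `eps` are assumptions chosen so the result can be checked by hand.

```python
def grad_loss(theta):
    """Gradient of the toy loss L(theta) = 0.5 * theta^T A theta,
    with A = [[2, 1], [1, 3]], so the gradient is A @ theta."""
    x, y = theta
    return [2.0 * x + y, x + 3.0 * y]


def hvp_finite_difference(grad_fn, theta, v, eps=1e-4):
    """Approximate the Hessian-vector product H @ v using two gradient
    evaluations (central difference), without ever forming H."""
    plus = grad_fn([t + eps * vi for t, vi in zip(theta, v)])
    minus = grad_fn([t - eps * vi for t, vi in zip(theta, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]


theta = [1.0, -2.0]
v = [1.0, 0.5]
hv = hvp_finite_difference(grad_loss, theta, v)
# For this quadratic loss the Hessian is A, so H @ v = [2.5, 2.5].
```

Each product costs two gradient evaluations, which is why a second-order quantity of this kind can stay within a constant factor of first-order cost.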


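[95] Squeezed Diffusion Models cs.LG | cs.CV

Jyotirmai Singh, Samar Khanna, James Burgess

TL;DR: The paper proposes Squeezed Diffusion Models (SDM), which replace conventional isotropic Gaussian noise with data-dependent, anisotropic noise scaling to improve generation. Experiments show that mild antisqueezing, that is, increasing variance on the principal axis, noticeably improves sample quality.

Details

Motivation: Diffusion models typically use isotropic Gaussian noise and ignore structure in the data. Inspired by how quantum squeezed states redistribute uncertainty according to the Heisenberg uncertainty principle, the authors use data-dependent noise scaling to improve generative performance.

Result: On CIFAR-10/100 and CelebA-64, mild antisqueezing (increasing variance on the principal axis) improves FID by up to 15% and shifts the precision-recall frontier toward higher recall.

Insight: Simple, data-aware noise shaping, with no architectural changes, can deliver clear gains for diffusion models, and antisqueezing works counterintuitively well.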

Abstract: Diffusion models typically inject isotropic Gaussian noise, disregarding structure in the data. Motivated by the way quantum squeezed states redistribute uncertainty according to the Heisenberg uncertainty principle, we introduce Squeezed Diffusion Models (SDM), which scale noise anisotropically along the principal component of the training distribution. As squeezing enhances the signal-to-noise ratio in physics, we hypothesize that scaling noise in a data-dependent manner can better assist diffusion models in learning important data features. We study two configurations: (i) a Heisenberg diffusion model that compensates the scaling on the principal axis with inverse scaling on orthogonal directions and (ii) a standard SDM variant that scales only the principal axis. Counterintuitively, on CIFAR-10/100 and CelebA-64, mild antisqueezing - i.e. increasing variance on the principal axis - consistently improves FID by up to 15% and shifts the precision-recall frontier toward higher recall. Our results demonstrate that simple, data-aware noise shaping can deliver robust generative gains without architectural changes.
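
A minimal sketch of the standard SDM variant, which scales noise only along the principal axis. The function name `squeeze_noise`, the scale parameter `s`, and this exact parameterization are illustrative assumptions, not the paper's formulation; `v` is assumed to be a unit-length principal direction of the training data. With `s > 1` the variance along `v` increases, which corresponds to the antisqueezing setting described in the abstract.

```python
def squeeze_noise(eps, v, s):
    """Scale the component of the noise vector `eps` along unit vector `v`
    by `s`, leaving the orthogonal component unchanged:

        eps' = eps + (s - 1) * (v . eps) * v
    """
    proj = sum(e * vi for e, vi in zip(eps, v))
    return [e + (s - 1.0) * proj * vi for e, vi in zip(eps, v)]


# Principal axis aligned with the first coordinate: only that entry changes.
axis_aligned = squeeze_noise([2.0, 3.0], [1.0, 0.0], s=1.5)

# General unit direction (0.6, 0.8): the projection onto v doubles.
v = [0.6, 0.8]
scaled = squeeze_noise([1.0, 1.0], v, s=2.0)
proj_after = sum(e * vi for e, vi in zip(scaled, v))
```

Setting `s = 1` recovers ordinary isotropic noise, so the change is confined to the noise sampler and the model architecture stays untouched.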


cs.NI

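[96] OmniSense: Towards Edge-Assisted Online Analytics for 360-Degree Videos cs.NI | cs.CV | cs.MM | eess.IV

Miao Zhang, Yifei Zhu, Linfeng Shen, Fangxin Wang, Jiangchuan Liu

TL;DR: OmniSense is an edge-assisted online analytics framework for 360-degree videos that achieves low latency and high accuracy through a lightweight spherical region of interest (SRoI) prediction algorithm and dynamic model scaling.

Details

Motivation: As 360-degree videos become more common, extracting useful information from them efficiently is a challenge. OmniSense targets the computation and network resource constraints involved, aiming for low-latency, high-accuracy online video analytics.

Result: Compared with resource-agnostic baselines, OmniSense improves accuracy by 19.8% to 114.6% at similar end-to-end latency, and achieves 2.0x to 2.4x speedups while keeping accuracy on par with the best baseline.

Insight: Focusing analysis on spherical regions of interest (SRoIs) and dynamically optimizing resource allocation can substantially improve 360-degree video analytics.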

Abstract: With the reduced hardware costs of omnidirectional cameras and the proliferation of various extended reality applications, more and more $360^\circ$ videos are being captured. To fully unleash their potential, advanced video analytics is expected to extract actionable insights and situational knowledge without blind spots from the videos. In this paper, we present OmniSense, a novel edge-assisted framework for online immersive video analytics. OmniSense achieves both low latency and high accuracy, combating the significant computation and network resource challenges of analyzing $360^\circ$ videos. Motivated by our measurement insights into $360^\circ$ videos, OmniSense introduces a lightweight spherical region of interest (SRoI) prediction algorithm to prune redundant information in $360^\circ$ frames. Incorporating the video content and network dynamics, it then smartly scales vision models to analyze the predicted SRoIs with optimized resource utilization. We implement a prototype of OmniSense with commodity devices and evaluate it on diverse real-world collected $360^\circ$ videos. Extensive evaluation results show that compared to resource-agnostic baselines, it improves the accuracy by $19.8\%$ – $114.6\%$ with similar end-to-end latencies. Meanwhile, it hits $2.0\times$ – $2.4\times$ speedups while keeping the accuracy on par with the highest accuracy of baselines.


q-bio.QM

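[97] High-Throughput Low-Cost Segmentation of Brightfield Microscopy Live Cell Images q-bio.QM | cs.AI | cs.CV | eess.IV

Surajit Das, Gourav Roy, Pavel Zun

TL;DR: The paper proposes a low-cost, high-throughput CNN pipeline for segmenting unstained live cells in bright-field microscopy images, overcoming challenges such as low contrast, noise, and motion blur.

Details

Motivation: Segmenting live cells in bright-field microscopy images is essential in biomedical research, but existing methods struggle to maintain high throughput and accuracy under low contrast, noise, and motion blur.

Result: The model reaches 93% test accuracy and an average F1-score of 89% on a public dataset, and generalizes across imaging modalities.

Insight: The model performs well with minimal compute, which makes it suitable for real laboratory deployment, and it adapts well when training data are limited.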

Abstract: Live cell culture is crucial in biomedical studies for analyzing cell properties and dynamics in vitro. This study focuses on segmenting unstained live cells imaged with bright-field microscopy. While many segmentation approaches exist for microscopic images, none consistently address the challenges of bright-field live-cell imaging with high throughput, where temporal phenotype changes, low contrast, noise, and motion-induced blur from cellular movement remain major obstacles. We developed a low-cost CNN-based pipeline incorporating comparative analysis of frozen encoders within a unified U-Net architecture enhanced with attention mechanisms, instance-aware systems, adaptive loss functions, hard instance retraining, dynamic learning rates, progressive mechanisms to mitigate overfitting, and an ensemble technique. The model was validated on a public dataset featuring diverse live cell variants, showing consistent competitiveness with state-of-the-art methods, achieving 93% test accuracy and an average F1-score of 89% (std. 0.07) on low-contrast, noisy, and blurry images. Notably, the model was trained primarily on bright-field images with limited exposure to phase-contrast microscopy (<10%), yet it generalized effectively to the phase-contrast LIVECell dataset, demonstrating cross-modality robustness and strong performance. This highlights its potential for real-world laboratory deployment across imaging conditions. The model requires minimal compute power and is adaptable using basic deep learning setups such as Google Colab, making it practical for training on other cell variants. Our pipeline outperforms existing methods in robustness and precision for bright-field microscopy segmentation. The code and dataset are available for reproducibility.


eess.AS

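[98] RAG-Boost: Retrieval-Augmented Generation Enhanced LLM-based Speech Recognition eess.AS | cs.CL

Pengcheng Wang, Sheng Li, Takahiro Shinozaki

TL;DR: RAG-Boost improves an LLM-based speech recognition system with a retrieval-augmented generation module that retrieves audio-text pairs and domain terms on the fly to fix recognition errors.

Details

Motivation: Conventional LLM-based speech recognition systems can perform poorly on domain-specific terms or when context is insufficient. RAG-Boost addresses this by augmenting generation with on-the-fly retrieval.

Result: According to the summary, experiments show that RAG-Boost improves recognition accuracy, especially for domain-specific terms and in cases with insufficient context.

Insight: Retrieval-augmented generation can compensate for the limitations of LLMs in live speech recognition, and is particularly suited to scenarios that demand high accuracy and domain adaptation.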

Abstract: In this paper, we propose RAG-Boost (ST-ShinozakiLab Task I system), which enhances the baseline LLM-based ASR system of the MLC-SLM Challenge (task I) with a retrieval-augmented generation (RAG) module on the fly. Each partial ASR hypothesis queries a vector store of audio-text pairs and domain terms, and the retrieved results are fused with the live ASR hypotheses to fix recognition errors. The fused hypotheses are passed to the LLM, yielding improved responses.


cs.CR

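[99] MultiFuzz: A Dense Retrieval-based Multi-Agent System for Network Protocol Fuzzing cs.CR | cs.CL | cs.MA | cs.NI

Youssef Maklad, Fares Wael, Ali Hamdi, Wael Elsersy, Khaled Shaban

TL;DR: MultiFuzz is a dense retrieval-based multi-agent system for network protocol fuzzing. It combines semantic-aware context retrieval, specialized agents, and structured tool-assisted reasoning to make fuzzing markedly more effective.

Details

Motivation: Traditional fuzzing techniques, such as AFL-based systems, are limited in their understanding of complex protocol grammars and in their seed mutation strategies. Recent work such as ChatAFL brings in large language models (LLMs), but still suffers from unreliable output, LLM hallucinations, and unwarranted assumptions about the LLM's knowledge of protocol specifications.

Result: Experiments on the Real-Time Streaming Protocol (RTSP) show that MultiFuzz significantly outperforms existing fuzzers such as NSFuzz, AFLNet, and ChatAFL in branch coverage and protocol state exploration.

Insight: By combining dense retrieval, agentic coordination, and language model reasoning, MultiFuzz offers a scalable and extensible foundation for autonomous protocol fuzzing and points to a direction for future research on agent-based fuzzing systems.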

Abstract: Traditional protocol fuzzing techniques, such as those employed by AFL-based systems, often lack effectiveness due to a limited semantic understanding of complex protocol grammars and rigid seed mutation strategies. Recent works, such as ChatAFL, have integrated Large Language Models (LLMs) to guide protocol fuzzing and address these limitations, pushing protocol fuzzers to wider exploration of the protocol state space. However, ChatAFL still faces issues such as unreliable output, LLM hallucinations, and assumptions of LLM knowledge about protocol specifications. This paper introduces MultiFuzz, a novel dense retrieval-based multi-agent system designed to overcome these limitations by integrating semantic-aware context retrieval, specialized agents, and structured tool-assisted reasoning. MultiFuzz utilizes agentic chunks of protocol documentation (RFC Documents) to build embeddings in a vector database for a retrieval-augmented generation (RAG) pipeline, enabling agents to generate more reliable and structured outputs, enhancing the fuzzer in mutating protocol messages with enhanced state coverage and adherence to syntactic constraints. The framework decomposes the fuzzing process into modular groups of agents that collaborate through chain-of-thought reasoning to dynamically adapt fuzzing strategies based on the retrieved contextual knowledge. Experimental evaluations on the Real-Time Streaming Protocol (RTSP) demonstrate that MultiFuzz significantly improves branch coverage and explores deeper protocol states and transitions over state-of-the-art (SOTA) fuzzers such as NSFuzz, AFLNet, and ChatAFL. By combining dense retrieval, agentic coordination, and language model reasoning, MultiFuzz establishes a new paradigm in autonomous protocol fuzzing, offering a scalable and extensible foundation for future research in intelligent agentic-based fuzzing systems.