cs.CV [Total: 157]
cs.CL [Total: 98]
cs.CE [Total: 1]
cs.SD [Total: 2]
cs.CR [Total: 3]
cs.MA [Total: 2]
cs.LG [Total: 23]
cs.AI [Total: 4]
cs.CY [Total: 1]
cs.GR [Total: 1]
eess.IV [Total: 2]
cs.IR [Total: 4]
cs.SE [Total: 1]
cs.RO [Total: 5]

cs.CV [Back]

[1] TinyViT-Batten: Few-Shot Vision Transformer with Explainable Attention for Early Batten-Disease Detection on Pediatric MRI cs.CV | cs.AIPDF

Khartik Uppalapati, Bora Yimenicioglu, Shakeel Abdulkareem, Adan Eftekhari, Bhavya Uppalapati

TL;DR: TinyViT-Batten 是一种基于小规模视觉Transformer（ViT）的框架，用于从有限的儿科脑部MRI数据中检测早期Batten病。通过蒸馏大型ViT并结合元学习方法，该模型在准确性和可解释性上表现出色。

Details

Motivation: Batten病是一种罕见的儿科神经退行性疾病，其早期MRI信号细微且容易被漏诊。现有方法通常需要大量标注数据，而Batten病的罕见性导致数据稀缺。因此，开发一种高效的少样本学习模型具有重要意义。

Result: 在79例Batten病MRI和90例对照数据上，模型准确率达91%，ROC曲线下面积≥0.95，敏感性和特异性均约为90%。

Insight: 轻量化模型结合少样本学习方法可以有效解决罕见病数据稀缺问题；Grad-CAM的可解释性技术有助于临床医生理解和信任AI预测。

Abstract: Batten disease (neuronal ceroid lipofuscinosis) is a rare pediatric neurodegenerative disorder whose early MRI signs are subtle and often missed. We propose TinyViT-Batten, a few-shot Vision Transformer (ViT) framework to detect early Batten disease from pediatric brain MRI with limited training cases. We distill a large teacher ViT into a 5 M-parameter TinyViT and fine-tune it using metric-based few-shot learning (prototypical loss with 5-shot episodes). Our model achieves high accuracy (approximately 91%) and area under ROC of at least 0.95 on a multi-site dataset of 79 genetically confirmed Batten-disease MRIs (27 CLN3 from the Hochstein natural-history study, 32 CLN2 from an international longitudinal cohort, 12 early-manifestation CLN2 cases reported by Cokal et al., and 8 public Radiopaedia scans) together with 90 age-matched controls, outperforming a 3D-ResNet and Swin-Tiny baseline. We further integrate Gradient-weighted Class Activation Mapping (Grad-CAM) to highlight disease-relevant brain regions, enabling explainable predictions. The model’s small size and strong performance (sensitivity greater than 90%, specificity approximately 90%) demonstrates a practical AI solution for early Batten disease detection.

[2] Ultralytics YOLO Evolution: An Overview of YOLO26, YOLO11, YOLOv8 and YOLOv5 Object Detectors for Computer Vision and Pattern Recognition cs.CV | cs.AIPDF

Ranjan Sapkota, Manoj Karkee

TL;DR: 这篇论文全面回顾了Ultralytics YOLO家族的进化历程，包括YOLO26、YOLO11、YOLOv8和YOLOv5的创新点、性能对比和应用场景，指出了未来发展方向。

Details

Motivation: YOLO系列目标检测器在计算机视觉和模式识别领域广泛应用，但其演进过程、性能差异和未来挑战缺乏系统总结。本文旨在填补这一空白。

Result: YOLO26在MS COCO上表现出色，尤其在速度和精度平衡上优于YOLOv5、YOLOv8和YOLO11。与其他先进检测器（如RT-DETR）的对比也展示了其竞争力。

Insight: YOLO系列的演进围绕效率与精度的平衡展开，未来方向包括密集场景优化、CNN-Transformer混合架构和开放词汇检测。

Abstract: This paper presents a comprehensive overview of the Ultralytics YOLO(You Only Look Once) family of object detectors, focusing the architectural evolution, benchmarking, deployment perspectives, and future challenges. The review begins with the most recent release, YOLO26 (YOLOv26), which introduces key innovations including Distribution Focal Loss (DFL) removal, native NMS-free inference, Progressive Loss Balancing (ProgLoss), Small-Target-Aware Label Assignment (STAL), and the MuSGD optimizer for stable training. The progression is then traced through YOLO11, with its hybrid task assignment and efficiency-focused modules; YOLOv8, which advanced with a decoupled detection head and anchor-free predictions; and YOLOv5, which established the modular PyTorch foundation that enabled modern YOLO development. Benchmarking on the MS COCO dataset provides a detailed quantitative comparison of YOLOv5, YOLOv8, YOLO11, and YOLO26, alongside cross-comparisons with YOLOv12, YOLOv13, RT-DETR, and DEIM. Metrics including precision, recall, F1 score, mean Average Precision, and inference speed are analyzed to highlight trade-offs between accuracy and efficiency. Deployment and application perspectives are further discussed, covering export formats, quantization strategies, and real-world use in robotics, agriculture, surveillance, and manufacturing. Finally, the paper identifies challenges and future directions, including dense-scene limitations, hybrid CNN-Transformer integration, open-vocabulary detection, and edge-aware training approaches.

[3] OmniSAT: Compact Action Token, Faster Auto Regression cs.CV | cs.ROPDF

Huaihai Lyu, Chaofan Chen, Senwei Xie, Pengwei Wang, Xiansheng Chen

TL;DR: 论文提出OmniSAT，一种紧凑且可转移的动作表示方法，通过标准化值范围和时间范围、结合B样条编码和多阶段残差量化，显著缩短了训练序列长度并降低了目标熵，进一步通过跨体现学习策略提升了性能。

Details

Motivation: 现有的视觉-语言-动作（VLA）模型中，自回归方法虽高效但面临动作块导致的长序列和高维问题，传统压缩方法在重建质量或效率上表现不佳。

Result: OmniSAT将训练序列缩短6.8倍，降低目标熵，同时在真实机器人和仿真实验中表现出更高的压缩效率和重建质量，加速自回归训练收敛。

Insight: OmniSAT通过紧凑表示和跨体现学习实现了高效的动作序列建模，为大规模预训练提供了新的解决方案。

Abstract: Existing Vision-Language-Action (VLA) models can be broadly categorized into diffusion-based and auto-regressive (AR) approaches: diffusion models capture continuous action distributions but rely on computationally heavy iterative denoising. In contrast, AR models enable efficient optimization and flexible sequence construction, making them better suited for large-scale pretraining. To further improve AR efficiency, particularly when action chunks induce extended and high-dimensional sequences, prior work applies entropy-guided and token-frequency techniques to shorten the sequence length. However, such compression struggled with \textit{poor reconstruction or inefficient compression}. Motivated by this, we introduce an Omni Swift Action Tokenizer, which learns a compact, transferable action representation. Specifically, we first normalize value ranges and temporal horizons to obtain a consistent representation with B-Spline encoding. Then, we apply multi-stage residual quantization to the position, rotation, and gripper subspaces, producing compressed discrete tokens with coarse-to-fine granularity for each part. After pre-training on the large-scale dataset Droid, the resulting discrete tokenization shortens the training sequence by 6.8$\times$, and lowers the target entropy. To further explore the potential of OmniSAT, we develop a cross-embodiment learning strategy that builds on the unified action-pattern space and jointly leverages robot and human demonstrations. It enables scalable auxiliary supervision from heterogeneous egocentric videos. Across diverse real-robot and simulation experiments, OmniSAT encompasses higher compression while preserving reconstruction quality, enabling faster AR training convergence and model performance.

[4] Knowledge-Aware Mamba for Joint Change Detection and Classification from MODIS Times Series cs.CVPDF

Zhengsen Xu, Yimin Zhu, Zack Dewis, Mabel Heffring, Motasem Alkayid

TL;DR: 论文提出了一种名为KAMamba的知识感知Mamba方法，用于改进MODIS时间序列的联合变化检测和分类任务，通过知识驱动的转换矩阵和多任务学习提升准确性。

Details

Motivation: MODIS时间序列中的混合像素、时空光谱信息耦合效应以及背景类异质性使得变化检测极具挑战性，因此需要一种新方法来应对这些问题。

Result: 在加拿大萨斯喀彻温省的MODIS数据集上，变化检测的平均F1值提升了1.5-6%，LULC分类的OA、AA和Kappa指标提高了约2%。

Insight: 知识驱动和多任务学习的结合可以有效提升变化检测和分类任务的性能，同时稀疏可变形设计有助于降低模型计算负担。

Abstract: Although change detection using MODIS time series is critical for environmental monitoring, it is a highly challenging task due to key MODIS difficulties, e.g., mixed pixels, spatial-spectral-temporal information coupling effect, and background class heterogeneity. This paper presents a novel knowledge-aware Mamba (KAMamba) for enhanced MODIS change detection, with the following contributions. First, to leverage knowledge regarding class transitions, we design a novel knowledge-driven transition-matrix-guided approach, leading to a knowledge-aware transition loss (KAT-loss) that can enhance detection accuracies. Second, to improve model constraints, a multi-task learning approach is designed, where three losses, i.e., pre-change classification loss (PreC-loss), post-change classification loss (PostC-loss), and change detection loss (Chg-loss) are used for improve model learning. Third, to disentangle information coupling in MODIS time series, novel spatial-spectral-temporal Mamba (SSTMamba) modules are designed. Last, to improve Mamba model efficiency and remove computational cost, a sparse and deformable Mamba (SDMamba) backbone is used in SSTMamba. On the MODIS time-series dataset for Saskatchewan, Canada, we evaluate the method on land-cover change detection and LULC classification; results show about 1.5-6% gains in average F1 for change detection over baselines, and about 2% improvements in OA, AA, and Kappa for LULC classification.

[5] Multi Camera Connected Vision System with Multi View Analytics: A Comprehensive Survey cs.CVPDF

Muhammad Munsif, Waqas Ahmad, Amjid Ali, Mohib Ullah, Adnan Hussain

TL;DR: 这篇综述论文首次将多摄像头多视图跟踪、重识别和行为理解统一为一个框架，提出了新的分类法，总结了当前的最新技术和研究挑战，并指出了未来的研究方向。

Details

Motivation: 现有的研究大多专注于单一任务或单视图系统，忽视了多摄像头协作和多视图数据分析的潜力，而这篇论文旨在填补这一空白，为连接视觉系统提供全面的视角。

Result: 论文总结了当前的研究成果和性能指标，展示了多摄像头协作在提升系统性能方面的潜力。

Insight: 多摄像头协作和多视图数据分析是提升连接视觉系统鲁棒性和适应性的关键，未来的研究需要关注隐私保护、联邦学习等新兴技术。

Abstract: Connected Vision Systems (CVS) are transforming a variety of applications, including autonomous vehicles, smart cities, surveillance, and human-robot interaction. These systems harness multi-view multi-camera (MVMC) data to provide enhanced situational awareness through the integration of MVMC tracking, re-identification (Re-ID), and action understanding (AU). However, deploying CVS in real-world, dynamic environments presents a number of challenges, particularly in addressing occlusions, diverse viewpoints, and environmental variability. Existing surveys have focused primarily on isolated tasks such as tracking, Re-ID, and AU, often neglecting their integration into a cohesive system. These reviews typically emphasize single-view setups, overlooking the complexities and opportunities provided by multi-camera collaboration and multi-view data analysis. To the best of our knowledge, this survey is the first to offer a comprehensive and integrated review of MVMC that unifies MVMC tracking, Re-ID, and AU into a single framework. We propose a unique taxonomy to better understand the critical components of CVS, dividing it into four key parts: MVMC tracking, Re-ID, AU, and combined methods. We systematically arrange and summarize the state-of-the-art datasets, methodologies, results, and evaluation metrics, providing a structured view of the field’s progression. Furthermore, we identify and discuss the open research questions and challenges, along with emerging technologies such as lifelong learning, privacy, and federated learning, that need to be addressed for future advancements. The paper concludes by outlining key research directions for enhancing the robustness, efficiency, and adaptability of CVS in complex, real-world applications. We hope this survey will inspire innovative solutions and guide future research toward the next generation of intelligent and adaptive CVS.

[6] Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping cs.CV | cs.LGPDF

Dwip Dalal, Gautam Vashishtha, Utkarsh Mishra, Jeonghwan Kim, Madhav Kanda

TL;DR: AttWarp是一种轻量级方法，通过注意力引导的图像扭曲在测试时为MLLMs分配更多分辨率到查询相关区域，保留全局上下文并提升细粒度感知准确性。

Details

Motivation: MLLMs在复杂场景中容易忽略小细节和空间关系，导致细粒度感知任务出错。AttWarp旨在通过注意力引导的图像扭曲优化分辨率分配，提升模型性能。

Result: 在五个基准测试和四种MLLMs上，AttWarp显著提升了准确性、组合推理能力，并减少了幻觉现象，优于四种基线方法。

Insight: 注意力引导的图像扭曲能有效优化信息分配，即使相同的MLLMs在处理扭曲后的输入时表现更优，证明了动态分辨率分配的重要性。

Abstract: Multimodal large language models (MLLMs) often miss small details and spatial relations in cluttered scenes, leading to errors in fine-grained perceptual grounding. We introduce AttWarp, a lightweight method that allocates more resolution to query-relevant content while compressing less informative areas, all while preserving global context. At test time, the approach uses an MLLM’s cross-modal attention to perform rectilinear warping of the input image, reallocating spatial resolution toward regions the model deems important, without changing model weights or architecture. This attention-guided warping preserves all original image information but redistributes it non-uniformly, so small objects and subtle relationships become easier for the same model to read while the global layout remains intact. Across five benchmarks (TextVQA, GQA, DocVQA, POPE, MMMU) and four MLLMs (LLaVA, Qwen-VL, InternVL, and InstructBLIP), AttWarp consistently improves accuracy, strengthens compositional reasoning, and reduces hallucinations, outperforming four competitive baselines that manipulate raw images at test time. Together, these results show that attention-guided warping prioritizes information relevant to the query while preserving context, and that the same MLLMs perform better when given such warped inputs.

[7] Towards Understanding Ambiguity Resolution in Multimodal Inference of Meaning cs.CV | cs.AIPDF

Yufei Wang, Adriana Kovashka, Loretta Fernández, Marc N. Coutanche, Seth Wiener

TL;DR: 论文研究了在多模态环境下学习外语的新情境，分析了图像和文本的特征如何帮助人类参与者推断陌生词汇的意义，并探索了AI系统在这一任务中的表现。

Details

Motivation: 外语学习中，学习者常通过多模态上下文（如图像和句子）推断陌生词汇的意义。然而，哪些特征有助于这种推断尚不明确，且AI系统在这方面的能力也有待提升。

Result: 发现部分直观特征与参与者表现有强相关性，但需进一步研究预测特征。AI系统在推理人类表现方面显示出改进潜力。

Insight: 多模态学习中，单纯依赖直观特征可能不足，需深入挖掘更有效的特征。AI系统在这一任务中表现尚不完善，但未来有望通过改进推理能力提升效果。

Abstract: We investigate a new setting for foreign language learning, where learners infer the meaning of unfamiliar words in a multimodal context of a sentence describing a paired image. We conduct studies with human participants using different image-text pairs. We analyze the features of the data (i.e., images and texts) that make it easier for participants to infer the meaning of a masked or unfamiliar word, and what language backgrounds of the participants correlate with success. We find only some intuitive features have strong correlations with participant performance, prompting the need for further investigating of predictive features for success in these tasks. We also analyze the ability of AI systems to reason about participant performance, and discover promising future directions for improving this reasoning ability.

[8] Task-Aware Resolution Optimization for Visual Large Language Models cs.CV | cs.CLPDF

Weiqing Luo, Zhen Tan, Yifan Li, Xinyu Zhao, Kwonjoon Lee

TL;DR: 论文提出了一种任务感知的分辨率优化方法，用于提升视觉大语言模型（VLLMs）在多种任务中的性能，并通过实验验证了其有效性。

Details

Motivation: 现有的视觉大语言模型（如LLaVA）通常假设固定分辨率适用于所有下游任务，导致性能不佳。研究发现不同任务对分辨率的偏好与图像复杂度和模型不确定性相关，因此需要一个任务感知的分辨率优化方法。

Result: 在多种视觉语言任务上的实验表明，该方法显著提升了VLLMs的性能。

Insight: 不同任务对分辨率的需求不同，结合图像复杂度和模型不确定性可以更高效地确定最优分辨率，从而提升模型性能。

Abstract: Real-world vision-language applications demand varying levels of perceptual granularity. However, most existing visual large language models (VLLMs), such as LLaVA, pre-assume a fixed resolution for downstream tasks, which leads to subpar performance. To address this problem, we first conduct a comprehensive and pioneering investigation into the resolution preferences of different vision-language tasks, revealing a correlation between resolution preferences with image complexity, and uncertainty variance of the VLLM at different image input resolutions. Building on this insight, we propose an empirical formula to determine the optimal resolution for a given vision-language task, combining these two factors. Second, based on rigorous experiments, we propose a novel parameter-efficient fine-tuning technique to extend the visual input resolution of pre-trained VLLMs to the identified optimal resolution. Extensive experiments on various vision-language tasks validate the effectiveness of our method.

[9] Cluster-Aware Prompt Ensemble Learning for Few-Shot Vision-Language Model Adaptation cs.CVPDF

Zhi Chen, Xin Yu, Xiaohui Tao, Yan Li, Zi Huang

TL;DR: 本文提出了CAPEL框架，通过聚类感知的提示集成学习优化少样本视觉语言模型的适应能力，避免了传统集成方法导致的类别中心偏移问题。

Details

Motivation: 现有的视觉语言模型（如CLIP）通过多组上下文提示来实现零样本迁移，但传统提示集成方法通过特征平均会导致类别中心偏离真实分布，影响性能。

Result: CAPEL能更好地对齐视觉特征分布，在不同数据集和任务中表现鲁棒。

Insight: 在视觉语言模型中，保留提示的聚类特性比简单特征平均更有效，动态调整提示权重可进一步提升性能。

Abstract: Vision-language models (VLMs) such as CLIP achieve zero-shot transfer across various tasks by pre-training on numerous image-text pairs. These models often benefit from using an ensemble of context prompts to represent a class. Despite being effective, conventional prompt ensembling that averages textual features of context prompts often yields suboptimal results. This is because feature averaging shifts the class centroids away from the true class distribution. To address this issue, we propose the Cluster-Aware Prompt Ensemble Learning (CAPEL) framework, which preserves the cluster nature of context prompts. CAPEL classifies images into one of several class clusters, each represented by a distinct prompt. Instead of ensembling prompts in the feature space, we perform ensembling in the classification logits space, aligning better with the visual feature distribution. To further optimize prompt fine-tuning while maintaining cluster-specific discriminative power, we introduce a cluster-preserving regularization term. This ensures that prompts remain distinct and specialized for different clusters, preventing collapse into a uniform direction. Additionally, we integrate an adaptive prompt weighting technique to dynamically adjust the attention weights for flawed or ambiguous prompts, ensuring robust performance across diverse datasets and tasks.

[10] CHUG: Crowdsourced User-Generated HDR Video Quality Dataset cs.CV | cs.AIPDF

Shreshth Saini, Alan C. Bovik, Neil Birkbeck, Yilin Wang, Balu Adsumilli

TL;DR: CHUG是首个大规模用户生成HDR视频质量数据集，填补了现有PGC-HDR数据集的空白，涵盖856个源视频和5,992个处理后视频，提供21万+主观评分，助力无参考HDR视频质量评估研究。

Details

Motivation: 现有HDR-VQA数据集主要针对专业生成内容(UGC)，而用户生成内容(UGC-HDR)的多样性、捕捉条件和压缩失真等挑战未被充分研究，需要更贴近实际的评估基准。

Result: CHUG数据集成为首个针对UGC-HDR的基准，揭示了UGC特有失真对质量的影响，为无参考质量评估提供数据支持。

Insight: UGC-HDR的质量评估需考虑多样化场景和复杂失真，CHUG的多样性填补了研究空白，未来可推动NR-HDR-VQA算法的创新。

Abstract: High Dynamic Range (HDR) videos enhance visual experiences with superior brightness, contrast, and color depth. The surge of User-Generated Content (UGC) on platforms like YouTube and TikTok introduces unique challenges for HDR video quality assessment (VQA) due to diverse capture conditions, editing artifacts, and compression distortions. Existing HDR-VQA datasets primarily focus on professionally generated content (PGC), leaving a gap in understanding real-world UGC-HDR degradations. To address this, we introduce CHUG: Crowdsourced User-Generated HDR Video Quality Dataset, the first large-scale subjective study on UGC-HDR quality. CHUG comprises 856 UGC-HDR source videos, transcoded across multiple resolutions and bitrates to simulate real-world scenarios, totaling 5,992 videos. A large-scale study via Amazon Mechanical Turk collected 211,848 perceptual ratings. CHUG provides a benchmark for analyzing UGC-specific distortions in HDR videos. We anticipate CHUG will advance No-Reference (NR) HDR-VQA research by offering a large-scale, diverse, and real-world UGC dataset. The dataset is publicly available at: https://shreshthsaini.github.io/CHUG/.

[11] SpectralCA: Bi-Directional Cross-Attention for Next-Generation UAV Hyperspectral Vision cs.CV | cs.AI | I.4.8; I.2.6; I.2.10; I.5.1; I.5.4PDF

D. V. Brovko

TL;DR: 该论文提出了一种名为SpectralCA的双向跨注意力机制，用于无人机高光谱视觉任务。通过改进Mobile 3D Vision Transformer（MDvT），结合光谱与空间特征，提升了无人机在导航、目标检测和地形分类中的感知效率，并减少了参数和推理时间。

Details

Motivation: 无人机在复杂环境中（如干扰、低能见度或伪装）的可靠操作需求日益增长。高光谱成像（HSI）因其细粒度材料识别和目标区分能力，被视为提升无人机计算机视觉能力的关键。

Result: 实验表明，该架构在总体精度、平均精度和Kappa系数等指标上表现优异，提升了无人机感知任务的效率。

Insight: 光谱与空间特征的融合是提升高光谱视觉任务性能的关键，双向跨注意力机制在此类任务中具有显著优势。

Abstract: The relevance of this research lies in the growing demand for unmanned aerial vehicles (UAVs) capable of operating reliably in complex environments where conventional navigation becomes unreliable due to interference, poor visibility, or camouflage. Hyperspectral imaging (HSI) provides unique opportunities for UAV-based computer vision by enabling fine-grained material recognition and object differentiation, which are critical for navigation, surveillance, agriculture, and environmental monitoring. The aim of this work is to develop a deep learning architecture integrating HSI into UAV perception for navigation, object detection, and terrain classification. Objectives include: reviewing existing HSI methods, designing a hybrid 2D/3D convolutional architecture with spectral-spatial cross-attention, training, and benchmarking. The methodology is based on the modification of the Mobile 3D Vision Transformer (MDvT) by introducing the proposed SpectralCA block. This block employs bi-directional cross-attention to fuse spectral and spatial features, enhancing accuracy while reducing parameters and inference time. Experimental evaluation was conducted on the WHU-Hi-HongHu dataset, with results assessed using Overall Accuracy, Average Accuracy, and the Kappa coefficient. The findings confirm that the proposed architecture improves UAV perception efficiency, enabling real-time operation for navigation, object recognition, and environmental monitoring tasks. Keywords: SpectralCA, deep learning, computer vision, hyperspectral imaging, unmanned aerial vehicle, object detection, semi-supervised learning.

[12] HeadsUp! High-Fidelity Portrait Image Super-Resolution cs.CVPDF

Renjie Li, Zihao Zhu, Xiaoyu Wang, Zhengzhong Tu

TL;DR: 论文《HeadsUp!》提出了一种用于高质量肖像图像超分辨率的单步扩散模型，解决了现有方法在处理肖像照片时因混合不同模型而引入的边界伪影问题。

Details

Motivation: 现有的图像超分辨率技术要么专注于通用真实世界图像，要么专注于严格对齐的面部图像（即人脸超分辨率）。处理肖像照片时需要混合不同模型，但会引入边界伪影。人类感知对面部保真度特别敏感，因此需要一种无缝恢复和提升分辨率的方法。

Result: HeadsUp在PortraitISR任务中达到了最先进的性能，同时在通用图像和对齐人脸数据集上表现相当或更好。

Insight: 单一模型端到端处理肖像图像可以有效避免混合模型的边界伪影问题，而面部监督和参考机制的结合提升了面部区域的保真度和身份一致性。

Abstract: Portrait pictures, which typically feature both human subjects and natural backgrounds, are one of the most prevalent forms of photography on social media. Existing image super-resolution (ISR) techniques generally focus either on generic real-world images or strictly aligned facial images (i.e., face super-resolution). In practice, separate models are blended to handle portrait photos: the face specialist model handles the face region, and the general model processes the rest. However, these blending approaches inevitably introduce blending or boundary artifacts around the facial regions due to different model training recipes, while human perception is particularly sensitive to facial fidelity. To overcome these limitations, we study the portrait image supersolution (PortraitISR) problem, and propose HeadsUp, a single-step diffusion model that is capable of seamlessly restoring and upscaling portrait images in an end-to-end manner. Specifically, we build our model on top of a single-step diffusion model and develop a face supervision mechanism to guide the model in focusing on the facial region. We then integrate a reference-based mechanism to help with identity restoration, reducing face ambiguity in low-quality face restoration. Additionally, we have built a high-quality 4K portrait image ISR dataset dubbed PortraitSR-4K, to support model training and benchmarking for portrait images. Extensive experiments show that HeadsUp achieves state-of-the-art performance on the PortraitISR task while maintaining comparable or higher performance on both general image and aligned face datasets.

[13] Denoising Diffusion as a New Framework for Underwater Images cs.CV | cs.AIPDF

Nilesh Jain, Elie Alhajjar

TL;DR: 本文提出了一种基于去噪扩散模型的新框架，用于提升水下图像的质量和多样性，并通过Controlnet进一步增强数据集，以克服现有水下图像数据集的局限性。

Details

Motivation: 水下图像在水域研究和环境监测中至关重要，但现有图像质量差（如低可见度、模糊、色彩失真和噪声），且依赖的数据集缺乏多样性和高质量样本。现有方法泛化能力差，亟需新方法解决这些问题。

Result: 该方法能够生成更丰富的水下图像数据集，提升图像质量，从而支持更准确的海洋生态分析。

Insight: 去噪扩散模型不仅适用于传统图像生成任务，还可用于解决特定领域（如水下图像）的数据稀缺和质量问题，为其他类似问题提供了新思路。

Abstract: Underwater images play a crucial role in ocean research and marine environmental monitoring since they provide quality information about the ecosystem. However, the complex and remote nature of the environment results in poor image quality with issues such as low visibility, blurry textures, color distortion, and noise. In recent years, research in image enhancement has proven to be effective but also presents its own limitations, like poor generalization and heavy reliance on clean datasets. One of the challenges herein is the lack of diversity and the low quality of images included in these datasets. Also, most existing datasets consist only of monocular images, a fact that limits the representation of different lighting conditions and angles. In this paper, we propose a new plan of action to overcome these limitations. On one hand, we call for expanding the datasets using a denoising diffusion model to include a variety of image types such as stereo, wide-angled, macro, and close-up images. On the other hand, we recommend enhancing the images using Controlnet to evaluate and increase the quality of the corresponding datasets, and hence improve the study of the marine ecosystem. Tags - Underwater Images, Denoising Diffusion, Marine ecosystem, Controlnet

[14] Scaling Traffic Insights with AI and Language Model-Powered Camera Systems for Data-Driven Transportation Decision Making cs.CV | eess.IVPDF

Fan Zuo, Donglin Zhou, Jingqin Gao, Kaan Ozbay

TL;DR: 该论文提出了一种基于AI和语言模型的端到端框架，用于利用现有交通摄像头基础设施进行大规模交通监测。通过优化的YOLOv11模型和创新的视角归一化方法，实现了高分辨率、实时交通数据分析，并结合领域专用大语言模型生成自动化交通模式摘要。

Details

Motivation: 当前交通监测面临高成本、动态视角和大规模数据处理的挑战，亟需一种高效、低人工干预的解决方案。

Result: 在纽约市拥堵收费政策的早期实施中，系统成功监测到车辆密度下降9%，卡车流量先减后增，行人及自行车活动增加。

Insight: 基于示例的提示可提升LLM数值准确性，减少幻觉；该系统为政策相关的大规模交通监测提供了实用解决方案。

Abstract: Accurate, scalable traffic monitoring is critical for real-time and long-term transportation management, particularly during disruptions such as natural disasters, large construction projects, or major policy changes like New York City’s first-in-the-nation congestion pricing program. However, widespread sensor deployment remains limited due to high installation, maintenance, and data management costs. While traffic cameras offer a cost-effective alternative, existing video analytics struggle with dynamic camera viewpoints and massive data volumes from large camera networks. This study presents an end-to-end AI-based framework leveraging existing traffic camera infrastructure for high-resolution, longitudinal analysis at scale. A fine-tuned YOLOv11 model, trained on localized urban scenes, extracts multimodal traffic density and classification metrics in real time. To address inconsistencies from non-stationary pan-tilt-zoom cameras, we introduce a novel graph-based viewpoint normalization method. A domain-specific large language model was also integrated to process massive data from a 24/7 video stream to generate frequent, automated summaries of evolving traffic patterns, a task far exceeding manual capabilities. We validated the system using over 9 million images from roughly 1,000 traffic cameras during the early rollout of NYC congestion pricing in 2025. Results show a 9% decline in weekday passenger vehicle density within the Congestion Relief Zone, early truck volume reductions with signs of rebound, and consistent increases in pedestrian and cyclist activity at corridor and zonal scales. Experiments showed that example-based prompts improved LLM’s numerical accuracy and reduced hallucinations. These findings demonstrate the framework’s potential as a practical, infrastructure-ready solution for large-scale, policy-relevant traffic monitoring with minimal human intervention.

[15] FlareX: A Physics-Informed Dataset for Lens Flare Removal via 2D Synthesis and 3D Rendering cs.CVPDF

Lishen Qu, Zhihao Liu, Jinshan Pan, Shihao Zhou, Jinglei Shi

TL;DR: 该论文提出了FlareX，一种基于物理原理的数据集，通过参数化模板创建、光照感知的2D合成和基于物理引擎的3D渲染三阶段方法，生成了具有多样性和真实性的镜头炫光数据。

Details

Motivation: 现有的数据集通常是通过在背景图像上叠加人工炫光模板合成的2D数据，缺乏炫光多样性且忽视了物理原理，导致训练模型在真实场景中泛化能力差。

Result: 实验表明，FlareX数据集显著提升了模型在真实世界图像中的泛化能力。

Insight: 通过结合物理原理和多视角数据生成方法，可以有效解决镜头炫光去除任务中的数据多样性问题。

Abstract: Lens flare occurs when shooting towards strong light sources, significantly degrading the visual quality of images. Due to the difficulty in capturing flare-corrupted and flare-free image pairs in the real world, existing datasets are typically synthesized in 2D by overlaying artificial flare templates onto background images. However, the lack of flare diversity in templates and the neglect of physical principles in the synthesis process hinder models trained on these datasets from generalizing well to real-world scenarios. To address these challenges, we propose a new physics-informed method for flare data generation, which consists of three stages: parameterized template creation, the laws of illumination-aware 2D synthesis, and physical engine-based 3D rendering, which finally gives us a mixed flare dataset that incorporates both 2D and 3D perspectives, namely FlareX. This dataset offers 9,500 2D templates derived from 95 flare patterns and 3,000 flare image pairs rendered from 60 3D scenes. Furthermore, we design a masking approach to obtain real-world flare-free images from their corrupted counterparts to measure the performance of the model on real-world images. Extensive experiments demonstrate the effectiveness of our method and dataset.

[16] BurstDeflicker: A Benchmark Dataset for Flicker Removal in Dynamic Scenes cs.CVPDF

Lishen Qu, Zhihao Liu, Shihao Zhou, Yaqi Luo, Jie Liang

TL;DR: 本文提出了BurstDeflicker数据集，用于解决动态场景中由于滚动快门相机和AC光源相互作用产生的闪烁问题。通过合成数据、真实拍摄和绿幕方法三种策略构建了一个大规模、多样化的基准数据集。

Details

Motivation: 闪烁伪影（flicker）是由于滚动快门相机的逐行曝光机制与AC光源的时变亮度相互作用导致的，表现为图像中的暗带。这不仅影响图像质量，还干扰高层视觉任务（如目标检测和跟踪）。然而，缺乏大规模真实数据集阻碍了相关研究的进展。

Result: 实验验证了数据集的有效性，展示了其在推动闪烁去除研究方面的潜力。

Insight: 闪烁问题不仅是图像质量的挑战，还影响高层视觉任务。通过多策略数据采集，可以更好地建模闪烁的时空特性并提高泛化能力。

Abstract: Flicker artifacts in short-exposure images are caused by the interplay between the row-wise exposure mechanism of rolling shutter cameras and the temporal intensity variations of alternating current (AC)-powered lighting. These artifacts typically appear as uneven brightness distribution across the image, forming noticeable dark bands. Beyond compromising image quality, this structured noise also affects high-level tasks, such as object detection and tracking, where reliable lighting is crucial. Despite the prevalence of flicker, the lack of a large-scale, realistic dataset has been a significant barrier to advancing research in flicker removal. To address this issue, we present BurstDeflicker, a scalable benchmark constructed using three complementary data acquisition strategies. First, we develop a Retinex-based synthesis pipeline that redefines the goal of flicker removal and enables controllable manipulation of key flicker-related attributes (e.g., intensity, area, and frequency), thereby facilitating the generation of diverse flicker patterns. Second, we capture 4,000 real-world flicker images from different scenes, which help the model better understand the spatial and temporal characteristics of real flicker artifacts and generalize more effectively to wild scenarios. Finally, due to the non-repeatable nature of dynamic scenes, we propose a green-screen method to incorporate motion into image pairs while preserving real flicker degradation. Comprehensive experiments demonstrate the effectiveness of our dataset and its potential to advance research in flicker removal.

[17] MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output cs.CVPDF

Yanyuan Chen, Dexuan Xu, Yu Huang, Songkun Zhan, Hanpin Wang

TL;DR: 本文提出了一种统一的医学视觉语言模型MIMO，通过视觉参考多模态输入和像素定位多模态输出解决了现有模型在输入和输出上的不足。

Details

Motivation: 现有的医学视觉语言模型仅依赖文本指令输入和文本答案输出，缺乏对图像视觉线索的直接理解和与图像关键区域的联系。

Result: 实验验证MIMO在多个医学多模态下游任务中表现出色，具备视觉参考和像素定位能力。

Insight: 视觉参考和像素定位的结合能显著提升医学视觉语言模型的表现，多模态数据集的构建是关键。

Abstract: Currently, medical vision language models are widely used in medical vision question answering tasks. However, existing models are confronted with two issues: for input, the model only relies on text instructions and lacks direct understanding of visual clues in the image; for output, the model only gives text answers and lacks connection with key areas in the image. To address these issues, we propose a unified medical vision language model MIMO, with visual referring Multimodal Input and pixel grounding Multimodal Output. MIMO can not only combine visual clues and textual instructions to understand complex medical images and semantics, but can also ground medical terminologies in textual output within the image. To overcome the scarcity of relevant data in the medical field, we propose MIMOSeg, a comprehensive medical multimodal dataset including 895K samples. MIMOSeg is constructed from four different perspectives, covering basic instruction following and complex question answering with multimodal input and multimodal output. We conduct experiments on several downstream medical multimodal tasks. Extensive experimental results verify that MIMO can uniquely combine visual referring and pixel grounding capabilities, which are not available in previous models.

Junan Chen, Trung Thanh Nguyen, Takahiro Komamizu, Ichiro Ide

TL;DR: 该论文提出了Q-Adapter，一种轻量级的视觉适配器模块，用于在视频字幕任务中高效微调多模态大型语言模型（MLLMs），仅需1.4%的参数即可达到与全微调方法竞争的性能。

Details

Motivation: 随着模型规模增大，全微调方法的计算成本变得高昂。现有的参数高效微调（PEFT）方法主要集中在语言组件，而多模态任务中的视觉信息处理仍未充分探索。

Result: 在MSR-VTT和MSVD数据集上，Q-Adapter在BLEU@4、METEOR、ROUGE-L和CIDEr指标上表现优异，达到了PEFT方法的最高水平。

Insight: Q-Adapter的设计为视频-语言建模提供了可扩展的优化策略，展示了在性能和效率之间的权衡潜力。

Abstract: Recent advances in video captioning are driven by large-scale pretrained models, which follow the standard “pre-training followed by fine-tuning” paradigm, where the full model is fine-tuned for downstream tasks. Although effective, this approach becomes computationally prohibitive as the model size increases. The Parameter-Efficient Fine-Tuning (PEFT) approach offers a promising alternative, but primarily focuses on the language components of Multimodal Large Language Models (MLLMs). Despite recent progress, PEFT remains underexplored in multimodal tasks and lacks sufficient understanding of visual information during fine-tuning the model. To bridge this gap, we propose Query-Adapter (Q-Adapter), a lightweight visual adapter module designed to enhance MLLMs by enabling efficient fine-tuning for the video captioning task. Q-Adapter introduces learnable query tokens and a gating layer into Vision Encoder, enabling effective extraction of sparse, caption-relevant features without relying on external textual supervision. We evaluate Q-Adapter on two well-known video captioning datasets, MSR-VTT and MSVD, where it achieves state-of-the-art performance among the methods that take the PEFT approach across BLEU@4, METEOR, ROUGE-L, and CIDEr metrics. Q-Adapter also achieves competitive performance compared to methods that take the full fine-tuning approach while requiring only 1.4% of the parameters. We further analyze the impact of key hyperparameters and design choices on fine-tuning effectiveness, providing insights into optimization strategies for adapter-based learning. These results highlight the strong potential of Q-Adapter in balancing caption quality and parameter efficiency, demonstrating its scalability for video-language modeling.

[19] P-4DGS: Predictive 4D Gaussian Splatting with 90$\times$ Compression cs.CVPDF

Henan Wang, Hanxin Zhu, Xinliang Gong, Tianyu He, Xin Li

TL;DR: P-4DGS提出了一种基于预测的4D高斯泼溅方法，通过利用时空相关性和自适应量化策略，显著降低动态场景存储开销，最高实现90倍压缩。

Details

Motivation: 动态3D场景重建（4D重建）中，现有方法忽略时空冗余性，导致内存消耗巨大。P-4DGS旨在解决这一问题，提供高效的压缩表示。

Result: P-4DGS在合成和真实场景中分别实现40倍和90倍压缩，存储占用仅约1MB，同时保持最佳重建质量和最快渲染速度。

Insight: 视频压缩技术可迁移至4D高斯泼溅，有效解决动态场景存储问题，为实时动态重建提供新思路。

Abstract: 3D Gaussian Splatting (3DGS) has garnered significant attention due to its superior scene representation fidelity and real-time rendering performance, especially for dynamic 3D scene reconstruction (\textit{i.e.}, 4D reconstruction). However, despite achieving promising results, most existing algorithms overlook the substantial temporal and spatial redundancies inherent in dynamic scenes, leading to prohibitive memory consumption. To address this, we propose P-4DGS, a novel dynamic 3DGS representation for compact 4D scene modeling. Inspired by intra- and inter-frame prediction techniques commonly used in video compression, we first design a 3D anchor point-based spatial-temporal prediction module to fully exploit the spatial-temporal correlations across different 3D Gaussian primitives. Subsequently, we employ an adaptive quantization strategy combined with context-based entropy coding to further reduce the size of the 3D anchor points, thereby achieving enhanced compression efficiency. To evaluate the rate-distortion performance of our proposed P-4DGS in comparison with other dynamic 3DGS representations, we conduct extensive experiments on both synthetic and real-world datasets. Experimental results demonstrate that our approach achieves state-of-the-art reconstruction quality and the fastest rendering speed, with a remarkably low storage footprint (around \textbf{1MB} on average), achieving up to \textbf{40$\times$} and \textbf{90$\times$} compression on synthetic and real-world scenes, respectively.

[20] Complementary and Contrastive Learning for Audio-Visual Segmentation cs.CVPDF

Sitong Gong, Yunzhi Zhuge, Lu Zhang, Pingping Zhang, Huchuan Lu

TL;DR: 本文提出了Complementary and Contrastive Transformer (CCFormer)，一种新颖的框架，结合多尺度视觉特征和音频数据，通过并行双边结构和对比学习增强跨模态互补性，提升了音频-视觉分割的性能。

Details

Motivation: 现有的音频-视觉分割方法在处理局部和全局信息、时空动态以及跨模态对齐方面存在不足，限制了分割准确性和鲁棒性。

Result: CCFormer在S4、MS3和AVSS数据集上实现了新的SOTA性能。

Insight: 结合并行双边结构和对比学习能够有效提升跨模态任务的性能，尤其是在捕捉时空动态和模态对齐方面表现出色。

Abstract: Audio-Visual Segmentation (AVS) aims to generate pixel-wise segmentation maps that correlate with the auditory signals of objects. This field has seen significant progress with numerous CNN and Transformer-based methods enhancing the segmentation accuracy and robustness. Traditional CNN approaches manage audio-visual interactions through basic operations like padding and multiplications but are restricted by CNNs’ limited local receptive field. More recently, Transformer-based methods treat auditory cues as queries, utilizing attention mechanisms to enhance audio-visual cooperation within frames. Nevertheless, they typically struggle to extract multimodal coefficients and temporal dynamics adequately. To overcome these limitations, we present the Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information and capturing spatial-temporal context comprehensively. Our CCFormer initiates with the Early Integration Module (EIM) that employs a parallel bilateral architecture, merging multi-scale visual features with audio data to boost cross-modal complementarity. To extract the intra-frame spatial features and facilitate the perception of temporal coherence, we introduce the Multi-query Transformer Module (MTM), which dynamically endows audio queries with learning capabilities and models the frame and video-level relations simultaneously. Furthermore, we propose the Bi-modal Contrastive Learning (BCL) to promote the alignment across both modalities in the unified feature space. Through the effective combination of those designs, our method sets new state-of-the-art benchmarks across the S4, MS3 and AVSS datasets. Our source code and model weights will be made publicly available at https://github.com/SitongGong/CCFormer

[21] Think Twice to See More: Iterative Visual Reasoning in Medical VLMs cs.CV | cs.AIPDF

Kaitao Chen, Shaohao Rui, Yankai Jiang, Jiamin Wu, Qihao Zheng

TL;DR: ViTAR是一种新型医疗视觉-语言模型（VLM），通过模仿人类专家的迭代推理过程（”思考-行动-再思考-回答”），显著提升了医疗图像诊断的准确性和可信度。

Details

Motivation: 现有医疗VLM通常依赖单次推理，忽略了局部视觉线索，而人类专家则通过多次迭代聚焦和优化感兴趣区域。ViTAR旨在缩小这一机器与人类感知差距。

Result: 实验表明，ViTAR在性能上超越现有最先进模型，视觉注意力分析显示其逐渐聚焦临床关键区域，推理过程中始终保持高注意力分配。

Insight: 将专家风格的迭代思维链嵌入VLM，不仅能提升性能，还能增强医疗AI的可信度。注意力机制分析为模型改进提供了机制化解释。

Abstract: Medical vision-language models (VLMs) excel at image-text understanding but typically rely on a single-pass reasoning that neglects localized visual cues. In clinical practice, however, human experts iteratively scan, focus, and refine the regions of interest before reaching a final diagnosis. To narrow this machine-human perception gap, we introduce ViTAR, a novel VLM framework that emulates the iterative reasoning process of human experts through a cognitive chain of “think-act-rethink-answer”. ViTAR treats medical images as interactive objects, enabling models to engage multi-step visual reasoning. To support this approach, we curate a high-quality instruction dataset comprising 1K interactive examples that encode expert-like diagnostic behaviors. In addition, a 16K visual question answering training data has been curated towards fine-grained visual diagnosis. We introduce a two-stage training strategy that begins with supervised fine-tuning to guide cognitive trajectories, followed by the reinforcement learning to optimize decision-making. Extensive evaluations demonstrate that ViTAR outperforms strong state-of-the-art models. Visual attention analysis reveals that from the “think” to “rethink” rounds, ViTAR increasingly anchors visual grounding to clinically critical regions and maintains high attention allocation to visual tokens during reasoning, providing mechanistic insight into its improved performance. These findings demonstrate that embedding expert-style iterative thinking chains into VLMs enhances both performance and trustworthiness of medical AI.

[22] DREAM: A Benchmark Study for Deepfake REalism AssessMent cs.CVPDF

Bo Peng, Zichuan Wang, Sheng Yu, Xiaochuan Jin, Wei Wang

TL;DR: 该论文提出了一个新的基准研究DREAM，专注于深度伪造视频的视觉真实感评估，通过大规模数据集和人类标注，对比了16种评估方法。

Details

Motivation: 深度伪造技术对信息可信度构成威胁，但主观感知的真实感评估缺乏研究。

Result: DREAM基准为深度伪造真实感评估提供了基础，展示了多种方法的性能差异。

Insight: 深度伪造的真实感评估不仅有助于检测技术，还能优化生成过程并预测其社会影响。

Abstract: Deep learning based face-swap videos, widely known as deepfakes, have drawn wide attention due to their threat to information credibility. Recent works mainly focus on the problem of deepfake detection that aims to reliably tell deepfakes apart from real ones, in an objective way. On the other hand, the subjective perception of deepfakes, especially its computational modeling and imitation, is also a significant problem but lacks adequate study. In this paper, we focus on the visual realism assessment of deepfakes, which is defined as the automatic assessment of deepfake visual realism that approximates human perception of deepfakes. It is important for evaluating the quality and deceptiveness of deepfakes which can be used for predicting the influence of deepfakes on Internet, and it also has potentials in improving the deepfake generation process by serving as a critic. This paper prompts this new direction by presenting a comprehensive benchmark called DREAM, which stands for Deepfake REalism AssessMent. It is comprised of a deepfake video dataset of diverse quality, a large scale annotation that includes 140,000 realism scores and textual descriptions obtained from 3,500 human annotators, and a comprehensive evaluation and analysis of 16 representative realism assessment methods, including recent large vision language model based methods and a newly proposed description-aligned CLIP method. The benchmark and insights included in this study can lay the foundation for future research in this direction and other related areas.

[23] Collaborative Learning of Semantic-Aware Feature Learning and Label Recovery for Multi-Label Image Recognition with Incomplete Labels cs.CVPDF

Zhi-Fen He, Ren-Dong Xie, Bo Li, Bin Liu, Jin-Yan Hu

TL;DR: 该论文提出了一种名为CLSL的协同学习方法，用于解决不完全标签下的多标签图像识别问题，统一了语义感知特征学习和缺失标签恢复的两大挑战。

Details

Motivation: 不完全标签下的多标签图像识别存在两大核心挑战：语义感知特征学习和缺失标签恢复。现有方法未能有效统一这两者，限制了性能提升。

Result: 在MS-COCO、VOC2007和NUS-WIDE数据集上，CLSL超越现有不完全标签多标签识别方法。

Insight: 协同学习框架通过双向优化（特征学习和标签恢复）实现了性能和鲁棒性的提升，为不完全标签任务提供了新思路。

Abstract: Multi-label image recognition with incomplete labels is a critical learning task and has emerged as a focal topic in computer vision. However, this task is confronted with two core challenges: semantic-aware feature learning and missing label recovery. In this paper, we propose a novel Collaborative Learning of Semantic-aware feature learning and Label recovery (CLSL) method for multi-label image recognition with incomplete labels, which unifies the two aforementioned challenges into a unified learning framework. More specifically, we design a semantic-related feature learning module to learn robust semantic-related features by discovering semantic information and label correlations. Then, a semantic-guided feature enhancement module is proposed to generate high-quality discriminative semantic-aware features by effectively aligning visual and semantic feature spaces. Finally, we introduce a collaborative learning framework that integrates semantic-aware feature learning and label recovery, which can not only dynamically enhance the discriminability of semantic-aware features but also adaptively infer and recover missing labels, forming a mutually reinforced loop between the two processes. Extensive experiments on three widely used public datasets (MS-COCO, VOC2007, and NUS-WIDE) demonstrate that CLSL outperforms the state-of-the-art multi-label image recognition methods with incomplete labels.

Pîrvu Mihai-Cristian, Leordeanu Marius

TL;DR: 论文提出了PHG-MAE模型，通过随机掩码多模态数据构建概率超图，并结合MAE预训练与微调的统一训练框架，提升了多模态多任务学习的性能。

Details

Motivation: 当前多模态视觉任务依赖大量标注数据，而自监督预训练方法（如MAE）虽能减少标注需求，但通常需额外微调步骤。PHG-MAE旨在统一神经网络图理论与MAE方法，通过随机掩码多模态数据动态生成超图，简化训练流程并提升性能。

Result: PHG-MAE在无人机多模态场景中表现优异，支持小参数量模型的知识蒸馏，性能损失极小。扩展数据集和工具已开源。

Insight: 通过动态超图建模和多模态掩码，PHG-MAE为多任务学习提供了一种高效的自监督框架，适用于自动驾驶等复杂场景。

Abstract: The computer vision domain has greatly benefited from an abundance of data across many modalities to improve on various visual tasks. Recently, there has been a lot of focus on self-supervised pre-training methods through Masked Autoencoders (MAE) \cite{he2022masked,bachmann2022multimae}, usually used as a first step before optimizing for a downstream task, such as classification or regression. This is very useful as it doesn’t require any manually labeled data. In this work, we introduce Probabilistic Hyper-Graphs using Masked Autoencoders (PHG-MAE): a novel model that unifies the classical work on neural graphs \cite{leordeanu2021semi} with the modern approach of masked autoencoders under a common theoretical framework. Through random masking of entire modalities, not just patches, the model samples from the distribution of hyper-edges on each forward pass. Additionally, the model adapts the standard MAE algorithm by combining pre-training and fine-tuning into a single training loop. Moreover, our approach enables the creation of inference-time ensembles which, through aggregation, boost the final prediction performance and consistency. Lastly, we show that we can apply knowledge distillation on top of the ensembles with little loss in performance, even with models that have fewer than 1M parameters. While our work mostly focuses on outdoor UAV scenes that contain multiple world interpretations and modalities, the same steps can be followed in other similar domains, such as autonomous driving or indoor robotics. In order to streamline the process of integrating external pre-trained experts for computer vision multi-modal multi-task learning (MTL) scenarios, we developed a data-pipeline software. Using this tool, we have created and released a fully-automated extension of the Dronescapes dataset. All the technical details, code and reproduction steps are publicly released.

[25] Tracking the Spatiotemporal Evolution of Landslide Scars Using a Vision Foundation Model: A Novel and Universal Framework cs.CVPDF

Meijun Zhou, Gang Mei, Zhengjing Ma, Nengxiong Xu, Jianbing Peng

TL;DR: 该论文提出了一种新颖且通用的框架，利用视觉基础模型追踪大规模滑坡疤痕的时空演化，重点关注滑坡前后的连续变化及其预警潜力。

Details

Motivation: 现有研究多集中于单阶段或双阶段的滑坡识别，难以追踪滑坡疤痕的时空演化过程。为解决这一问题，作者提出了一种新的框架，结合光学遥感图像和视频分割技术。

Result: 在两个典型案例（白格滑坡和Sela滑坡）中验证了框架的有效性，成功捕捉了滑坡前的预警信号和滑坡后的演化特征。

Insight: 将遥感图像重构为视频序列的方法为地质灾害动态监测提供了新思路，结合视觉基础模型的技术展示了在灾害预警中的潜力。

Abstract: Tracking the spatiotemporal evolution of large-scale landslide scars is critical for understanding the evolution mechanisms and failure precursors, enabling effective early-warning. However, most existing studies have focused on single-phase or pre- and post-failure dual-phase landslide identification. Although these approaches delineate post-failure landslide boundaries, it is challenging to track the spatiotemporal evolution of landslide scars. To address this problem, this study proposes a novel and universal framework for tracking the spatiotemporal evolution of large-scale landslide scars using a vision foundation model. The key idea behind the proposed framework is to reconstruct discrete optical remote sensing images into a continuous video sequence. This transformation enables a vision foundation model, which is developed for video segmentation, to be used for tracking the evolution of landslide scars. The proposed framework operates within a knowledge-guided, auto-propagation, and interactive refinement paradigm to ensure the continuous and accurate identification of landslide scars. The proposed framework was validated through application to two representative cases: the post-failure Baige landslide and the active Sela landslide (2017-2025). Results indicate that the proposed framework enables continuous tracking of landslide scars, capturing both failure precursors critical for early warning and post-failure evolution essential for assessing secondary hazards and long-term stability.

[26] Gesplat: Robust Pose-Free 3D Reconstruction via Geometry-Guided Gaussian Splatting cs.CVPDF

Jiahui Lu, Haihong Xiao, Xueyan Zhao, Wenxiong Kang

TL;DR: Gesplat 是一种基于 3D Gaussian Splatting (3DGS) 的框架，旨在解决在稀疏视角下相机姿态不准确和视角覆盖不足的问题。通过结合 VGGT 基础模型、混合高斯表示和流式深度正则化，Gesplat 实现了无需准确相机姿态的稳健 3D 重建和新视角合成。

Details

Motivation: NeRF 和 3DGS 在 3D 重建和新视角合成中取得了显著进展，但它们严重依赖准确的相机姿态和密集的视角覆盖。这限制了它们在稀疏视角下的应用，而 Gesplat 旨在克服这一限制。

Result: Gesplat 在正向和大规模复杂数据集上均表现出色，相比其他无需姿态的方法，重建和新视角合成更稳健。

Insight: 通过结合基础模型和动态优化模块，Gesplat 展示了在稀疏视角下实现高质量 3D 重建的潜力，为实际应用提供了新思路。

Abstract: Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have advanced 3D reconstruction and novel view synthesis, but remain heavily dependent on accurate camera poses and dense viewpoint coverage. These requirements limit their applicability in sparse-view settings, where pose estimation becomes unreliable and supervision is insufficient. To overcome these challenges, we introduce Gesplat, a 3DGS-based framework that enables robust novel view synthesis and geometrically consistent reconstruction from unposed sparse images. Unlike prior works that rely on COLMAP for sparse point cloud initialization, we leverage the VGGT foundation model to obtain more reliable initial poses and dense point clouds. Our approach integrates several key innovations: 1) a hybrid Gaussian representation with dual position-shape optimization enhanced by inter-view matching consistency; 2) a graph-guided attribute refinement module to enhance scene details; and 3) flow-based depth regularization that improves depth estimation accuracy for more effective supervision. Comprehensive quantitative and qualitative experiments demonstrate that our approach achieves more robust performance on both forward-facing and large-scale complex datasets compared to other pose-free methods.

[27] Cooperative Pseudo Labeling for Unsupervised Federated Classification cs.CV | cs.LGPDF

Kuangpu Guo, Lijun Sheng, Yongcan Yu, Jian Liang, Zilei Wang

TL;DR: 本文首次将无监督联邦学习（UFL）扩展到分类问题，提出了一种名为FedCoPL的新方法，通过伪标签分布估计和调整实现全局类别平衡，并结合部分提示聚合协议提升协作效果和个性化。

Details

Motivation: 无监督联邦学习（UFL）通常仅用于表示学习和聚类任务。随着视觉语言模型（如CLIP）的强大零样本预测能力出现，如何在UFL范式下实现分类任务成为一个新机遇，但仍未被充分探索。

Result: 实验表明FedCoPL在性能上显著优于基线方法。

Insight: 结合CLIP的零样本能力与联邦学习的分布式特性，可以高效解决UFL中的分类问题；部分聚合协议平衡了全局协作与个性化需求。

Abstract: Unsupervised Federated Learning (UFL) aims to collaboratively train a global model across distributed clients without sharing data or accessing label information. Previous UFL works have predominantly focused on representation learning and clustering tasks. Recently, vision language models (e.g., CLIP) have gained significant attention for their powerful zero-shot prediction capabilities. Leveraging this advancement, classification problems that were previously infeasible under the UFL paradigm now present promising new opportunities, yet remain largely unexplored. In this paper, we extend UFL to the classification problem with CLIP for the first time and propose a novel method, \underline{\textbf{Fed}}erated \underline{\textbf{Co}}operative \underline{\textbf{P}}seudo \underline{\textbf{L}}abeling (\textbf{FedCoPL}). Specifically, clients estimate and upload their pseudo label distribution, and the server adjusts and redistributes them to avoid global imbalance among classes. Moreover, we introduce a partial prompt aggregation protocol for effective collaboration and personalization. In particular, visual prompts containing general image features are aggregated at the server, while text prompts encoding personalized knowledge are retained locally. Extensive experiments demonstrate the superior performance of our FedCoPL compared to baseline methods. Our code is available at \href{https://github.com/krumpguo/FedCoPL}{https://github.com/krumpguo/FedCoPL}.

Minbin Huang, Runhui Huang, Chuanyang Zheng, Jingyao Li, Guoxuan Chen

TL;DR: 为了解决多模态大语言模型在强化学习过程中推理链与最终答案不一致的问题，作者提出了答案一致性强化学习（ACRE）。该方法通过引入一致性验证奖励机制，显著提高了推理与答案的一致性，并在视频和数学推理任务中取得了性能提升。

Details

Motivation: 传统的强化学习方法虽然在提高答案准确性方面有效，但可能导致推理链与最终答案不一致，影响模型的可解释性和可靠性。因此，作者希望通过一种新方法解决这一问题。

Result: 在视频推理和多模态数学推理任务中，ACRE分别比基线GRPO方法平均提高了2.2%和1.5%。

Insight: 一致性验证奖励不仅能提高推理与答案的一致性，还能减少模型对虚假模式（如选项顺序偏差）的依赖，增强模型的鲁棒性。

Abstract: Recent advances in large language models (LLMs) have demonstrated that reinforcement learning with verifiable rewards (RLVR) can significantly enhance reasoning abilities by directly optimizing correctness, rather than relying solely on supervised imitation. This paradigm has been extended to multimodal LLMs for complex video and image understanding tasks. However, while outcome-driven RL improves answer accuracy, it can inadvertently decouple the reasoning chain from the final answer, leading to situations where models produce inconsistency between the reasoning trace and final answer. In our experiments on multiple-choice visual question-answering tasks, the standard GRPO method yields only 79.7% consistency on MMVU between the reasoning steps and the chosen answers, indicating frequent mismatches between answers and reasoning. To this end, we propose Answer-Consistent Reinforcement Learning (ACRE) that modifies the GRPO algorithm with an auxiliary consistency check. After the model generates a chain of thought and an initial answer for a given question, we shuffle the answer options and prompt the model again with the same reasoning trace to predict a second answer. We design a consistency-verification reward that grants a high reward only if both the original and the post-shuffle answers agree and are correct; otherwise, a lower reward is assigned accordingly. This mechanism penalizes reasoning-answer misalignment and discourages the model from relying on spurious patterns, such as option ordering biases. We evaluate ACRE on challenging Video Reasoning benchmarks and multimodal math reasoning benchmarks, achieving an average 2.2% and 1.5% improvement for Video Reasoning and Math Reasoning tasks over the GRPO baseline.

[29] Uncertainty-Aware Post-Detection Framework for Enhanced Fire and Smoke Detection in Compact Deep Learning Models cs.CV | cs.AI | cs.LG | eess.IVPDF

Aniruddha Srinivas Joshi, Godwyn James William, Shreyas Srinivas Joshi

TL;DR: 该论文提出了一种不确定性感知的后检测框架，用于增强紧凑型深度学习模型（如YOLOv5n和YOLOv8n）在火灾和烟雾检测中的性能。该方法通过结合统计不确定性和领域相关的视觉线索重新调整检测置信度，显著提升了检测精度和召回率。

Details

Motivation: 现有基于视觉的火灾和烟雾检测方法在效率和可靠性之间存在平衡问题，紧凑型深度学习模型因容量限制常导致误检和漏检。传统后检测方法仅依赖空间重叠信息，无法有效处理复杂或模糊场景。

Result: 在D-Fire数据集上的实验表明，该方法相比现有基线在精确率、召回率和平均精度（mAP）上均有提升，计算开销仅为适度增加。

Insight: 后检测阶段的置信度优化是提升紧凑型深度学习模型鲁棒性的有效途径，尤其是在复杂或模糊场景中，结合不确定性和多模态视觉线索能显著减少误检和漏检。

Abstract: Accurate fire and smoke detection is critical for safety and disaster response, yet existing vision-based methods face challenges in balancing efficiency and reliability. Compact deep learning models such as YOLOv5n and YOLOv8n are widely adopted for deployment on UAVs, CCTV systems, and IoT devices, but their reduced capacity often results in false positives and missed detections. Conventional post-detection methods such as Non-Maximum Suppression and Soft-NMS rely only on spatial overlap, which can suppress true positives or retain false alarms in cluttered or ambiguous fire scenes. To address these limitations, we propose an uncertainty aware post-detection framework that rescales detection confidences using both statistical uncertainty and domain relevant visual cues. A lightweight Confidence Refinement Network integrates uncertainty estimates with color, edge, and texture features to adjust detection scores without modifying the base model. Experiments on the D-Fire dataset demonstrate improved precision, recall, and mean average precision compared to existing baselines, with only modest computational overhead. These results highlight the effectiveness of post-detection rescoring in enhancing the robustness of compact deep learning models for real-world fire and smoke detection.

[30] Training-Free In-Context Forensic Chain for Image Manipulation Detection and Localization cs.CV | cs.AI | cs.CRPDF

Rui Chen, Bin Liu, Changtao Miao, Xinghao Wang, Yi Li

TL;DR: 本文提出了In-Context Forensic Chain（ICFC），一种无需训练的框架，利用多模态大语言模型（MLLMs）实现可解释的图像篡改检测与定位。ICFC通过层级推理流程，结合自适应过滤和多模态知识库，显著优于现有无需训练方法，甚至在某些情况下媲美全监督方法。

Details

Motivation: 图像篡改技术带来的安全威胁日益严重，而现有监督方法依赖昂贵的像素级标注，弱监督或无监督方法则性能不足且缺乏可解释性。因此，研究提出了一种无需训练的、可解释的解决方案。

Result: 在多个数据集上，ICFC超越了现有的无需训练方法，甚至在某些情况下媲美弱监督和全监督方法，同时提供了文本级可解释性。

Insight: 无需训练的方法可以通过多模态模型的推理能力实现高性能的篡改检测，且层级推理设计和知识库构建是关键因素。

Abstract: Advances in image tampering pose serious security threats, underscoring the need for effective image manipulation localization (IML). While supervised IML achieves strong performance, it depends on costly pixel-level annotations. Existing weakly supervised or training-free alternatives often underperform and lack interpretability. We propose the In-Context Forensic Chain (ICFC), a training-free framework that leverages multi-modal large language models (MLLMs) for interpretable IML tasks. ICFC integrates an objectified rule construction with adaptive filtering to build a reliable knowledge base and a multi-step progressive reasoning pipeline that mirrors expert forensic workflows from coarse proposals to fine-grained forensics results. This design enables systematic exploitation of MLLM reasoning for image-level classification, pixel-level localization, and text-level interpretability. Across multiple benchmarks, ICFC not only surpasses state-of-the-art training-free methods but also achieves competitive or superior performance compared to weakly and fully supervised approaches.

[31] Multi Class Parkinsons Disease Detection Based on Finger Tapping Using Attention-Enhanced CNN BiLSTM cs.CVPDF

Abu Saleh Musa Miah, Najmul Hassan, Md Maruf Al Hossain, Yuichi Okuyama, Jungpil Shin

TL;DR: 该论文提出了一种基于注意力增强CNN-BiLSTM的多元帕金森病检测系统，通过分析手指敲击视频提取时空特征，有效区分帕金森病的五种严重程度。

Details

Motivation: 现有基于手势的帕金森病识别系统性能不足，影响临床管理和干预效果。该研究旨在通过结合深度学习和注意力机制，提升帕金森病严重程度的自动检测准确性。

Result: 模型在区分五种严重程度类别上表现优异，表明结合空间-时间表征和注意力机制能显著提升帕金森病严重程度的自动化检测。

Insight: 研究展示了深度学习与非侵入性传感器数据结合在医疗诊断中的潜力，为临床医生提供了一种有效的辅助工具。

Abstract: Effective clinical management and intervention development depend on accurate evaluation of Parkinsons disease (PD) severity. Many researchers have worked on developing gesture-based PD recognition systems; however, their performance accuracy is not satisfactory. In this study, we propose a multi-class Parkinson Disease detection system based on finger tapping using an attention-enhanced CNN BiLSTM. We collected finger tapping videos and derived temporal, frequency, and amplitude based features from wrist and hand movements. Then, we proposed a hybrid deep learning framework integrating CNN, BiLSTM, and attention mechanisms for multi-class PD severity classification from video-derived motion features. First, the input sequence is reshaped and passed through a Conv1D MaxPooling block to capture local spatial dependencies. The resulting feature maps are fed into a BiLSTM layer to model temporal dynamics. An attention mechanism focuses on the most informative temporal features, producing a context vector that is further processed by a second BiLSTM layer. CNN-derived features and attention-enhanced BiLSTM outputs are concatenated, followed by dense and dropout layers, before the final softmax classifier outputs the predicted PD severity level. The model demonstrated strong performance in distinguishing between the five severity classes, suggesting that integrating spatial temporal representations with attention mechanisms can improve automated PD severity detection, making it a promising non-invasive tool to support clinicians in PD monitoring and progression tracking.

[32] DeepFusionNet: Autoencoder-Based Low-Light Image Enhancement and Super-Resolution cs.CV | cs.AI | 68T45, 68T10 | I.2.10; I.4.9PDF

Halil Hüseyin Çalışkan, Talha Koruk

TL;DR: DeepFusionNet是一种基于自动编码器的架构，用于低光图像增强和超分辨率任务，能够在SSIM和PSNR指标上实现高性能，同时保持较少的参数数量。

Details

Motivation: 低光和模糊图像在计算机视觉应用中影响性能，现有方法参数多、计算成本高且性能较低，需要一种更高效的解决方案。

Result: 在LOL-v1数据集上，低光增强模型的SSIM达到92.8%，PSNR为26.30；超分辨率模型的PSNR为25.30，SSIM为80.7%。

Insight: 自动编码器可以用于低光和超分辨率任务，通过设计高效的架构，能够在减少参数的同时保持高性能，优于传统GAN方法。

Abstract: Computer vision and image processing applications suffer from dark and low-light images, particularly during real-time image transmission. Currently, low light and dark images are converted to bright and colored forms using autoencoders; however, these methods often achieve low SSIM and PSNR scores and require high computational power due to their large number of parameters. To address these challenges, the DeepFusionNet architecture has been developed. According to the results obtained with the LOL-v1 dataset, DeepFusionNet achieved an SSIM of 92.8% and a PSNR score of 26.30, while containing only approximately 2.5 million parameters. On the other hand, conversion of blurry and low-resolution images into high-resolution and blur-free images has gained importance in image processing applications. Unlike GAN-based super-resolution methods, an autoencoder-based super resolution model has been developed that contains approximately 100 thousand parameters and uses the DeepFusionNet architecture. According to the results of the tests, the DeepFusionNet based super-resolution method achieved a PSNR of 25.30 and a SSIM score of 80.7 percent according to the validation set.

[33] Color3D: Controllable and Consistent 3D Colorization with Personalized Colorizer cs.CVPDF

Yecong Wan, Mingwen Shao, Renlong Wu, Wangmeng Zuo

TL;DR: Color3D提出了一个高度适配的框架，用于从单色输入中对静态和动态3D场景进行着色，提供视觉多样且色彩丰富的重建，并支持灵活的用户引导控制。

Details

Motivation: 现有方法通常专注于静态场景，并通过平均颜色变化来强制多视图一致性，牺牲了色彩丰富性和可控性。Color3D旨在解决这一问题，同时确保跨视图和时间的一致性。

Result: 实验表明，Color3D能够在多样化的静态和动态3D着色基准测试中提供更一致且色彩丰富的渲染结果，同时支持精确的用户控制。

Insight: 通过将3D着色问题简化为单图像着色任务，Color3D展示了如何在不牺牲一致性的情况下增强色彩多样性和用户控制能力。

Abstract: In this work, we present Color3D, a highly adaptable framework for colorizing both static and dynamic 3D scenes from monochromatic inputs, delivering visually diverse and chromatically vibrant reconstructions with flexible user-guided control. In contrast to existing methods that focus solely on static scenarios and enforce multi-view consistency by averaging color variations which inevitably sacrifice both chromatic richness and controllability, our approach is able to preserve color diversity and steerability while ensuring cross-view and cross-time consistency. In particular, the core insight of our method is to colorize only a single key view and then fine-tune a personalized colorizer to propagate its color to novel views and time steps. Through personalization, the colorizer learns a scene-specific deterministic color mapping underlying the reference view, enabling it to consistently project corresponding colors to the content in novel views and video frames via its inherent inductive bias. Once trained, the personalized colorizer can be applied to infer consistent chrominance for all other images, enabling direct reconstruction of colorful 3D scenes with a dedicated Lab color space Gaussian splatting representation. The proposed framework ingeniously recasts complicated 3D colorization as a more tractable single image paradigm, allowing seamless integration of arbitrary image colorization models with enhanced flexibility and controllability. Extensive experiments across diverse static and dynamic 3D colorization benchmarks substantiate that our method can deliver more consistent and chromatically rich renderings with precise user control. Project Page https://yecongwan.github.io/Color3D/.

[34] ReMix: Towards a Unified View of Consistent Character Generation and Editing cs.CVPDF

Benjia Zhou, Bin Fu, Pei Cheng, Yanru Wang, Jiayuan Fan

TL;DR: ReMix提出了一种统一的框架，用于字符一致性生成和编辑，结合了ReMix模块与IP-ControlNet，解决了现有方法在语义一致性与空间可控性上的不足。

Details

Motivation: 现有的大规模文本到图像扩散模型在字符一致性生成和编辑任务中缺乏统一框架，生成方法难以实现细粒度身份一致性，而编辑方法常丢失空间可控性和指令对齐。

Result: 实验表明ReMix在个性化生成、图像编辑、风格迁移等多任务中高效且有效。

Insight: 1. 语义与布局的分离设计解决了传统方法的限制；2. 受生物学和量子物理启发的设计增强了特征对齐。

Abstract: Recent advances in large-scale text-to-image diffusion models (e.g., FLUX.1) have greatly improved visual fidelity in consistent character generation and editing. However, existing methods rarely unify these tasks within a single framework. Generation-based approaches struggle with fine-grained identity consistency across instances, while editing-based methods often lose spatial controllability and instruction alignment. To bridge this gap, we propose ReMix, a unified framework for character-consistent generation and editing. It constitutes two core components: the ReMix Module and IP-ControlNet. The ReMix Module leverages the multimodal reasoning ability of MLLMs to edit semantic features of input images and adapt instruction embeddings to the native DiT backbone without fine-tuning. While this ensures coherent semantic layouts, pixel-level consistency and pose controllability remain challenging. To address this, IP-ControlNet extends ControlNet to decouple semantic and layout cues from reference images and introduces an {\epsilon}-equivariant latent space that jointly denoises the reference and target images within a shared noise space. Inspired by convergent evolution and quantum decoherence,i.e., where environmental noise drives state convergence, this design promotes feature alignment in the hidden space, enabling consistent object generation while preserving identity. ReMix supports a wide range of tasks, including personalized generation, image editing, style transfer, and multi-condition synthesis. Extensive experiments validate its effectiveness and efficiency as a unified framework for character-consistent image generation and editing.

[35] SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation cs.CV | cs.AIPDF

Zhenjie Mao, Yuhuan Yang, Chaofan Ma, Dongsheng Jiang, Jiangchao Yao

TL;DR: SaFiRe是一种新颖的框架，模仿人类的两阶段认知过程（全局理解→细节精修），结合Mamba的高效扫描更新特性，解决Referring Image Segmentation（RIS）中的指代模糊问题，并在新基准aRefCOCO上验证了其优势。

Details

Motivation: 当前RIS方法主要处理简单的名词短语（如“红色汽车”），忽视了真实场景中指代模糊的表达（如对象干扰和隐含类别）。SaFiRe旨在填补这一空白。

Result: 在标准和新基准数据集上，SaFiRe优于现有方法。

Insight: 人类认知过程的建模和高效线性复杂度的设计是提升RIS性能的关键。

Abstract: Referring Image Segmentation (RIS) aims to segment the target object in an image given a natural language expression. While recent methods leverage pre-trained vision backbones and more training corpus to achieve impressive results, they predominantly focus on simple expressions–short, clear noun phrases like “red car” or “left girl”. This simplification often reduces RIS to a key word/concept matching problem, limiting the model’s ability to handle referential ambiguity in expressions. In this work, we identify two challenging real-world scenarios: object-distracting expressions, which involve multiple entities with contextual cues, and category-implicit expressions, where the object class is not explicitly stated. To address the challenges, we propose a novel framework, SaFiRe, which mimics the human two-phase cognitive process–first forming a global understanding, then refining it through detail-oriented inspection. This is naturally supported by Mamba’s scan-then-update property, which aligns with our phased design and enables efficient multi-cycle refinement with linear complexity. We further introduce aRefCOCO, a new benchmark designed to evaluate RIS models under ambiguous referring expressions. Extensive experiments on both standard and proposed datasets demonstrate the superiority of SaFiRe over state-of-the-art baselines.

[36] SparseUWSeg: Active Sparse Point-Label Augmentation for Underwater Semantic Segmentation cs.CVPDF

César Borja, Carlos Plou, Rubén Martinez-Cantín, Ana C. Murillo

TL;DR: SparseUWSeg是一种用于水下语义分割的新框架，通过主动采样和稀疏标签传播，显著提升了分割模型性能。

Details

Motivation: 水下场景的语义分割需要密集标注，成本高昂。稀疏点标注更容易获取，但需解决标注选择和信息传播的挑战。

Result: 在两个水下数据集上，SparseUWSeg比现有方法提升最多5% mIoU。

Insight: 稀疏标注结合主动学习和混合传播方法，可以有效降低标注成本并提升分割性能。

Abstract: Semantic segmentation is essential to automate underwater imagery analysis with ecology monitoring purposes. Unfortunately, fine grained underwater scene analysis is still an open problem even for top performing segmentation models. The high cost of obtaining dense, expert-annotated, segmentation labels hinders the supervision of models in this domain. While sparse point-labels are easier to obtain, they introduce challenges regarding which points to annotate and how to propagate the sparse information. We present SparseUWSeg, a novel framework that addresses both issues. SparseUWSeg employs an active sampling strategy to guide annotators, maximizing the value of their point labels. Then, it propagates these sparse labels with a hybrid approach leverages both the best of SAM2 and superpixel-based methods. Experiments on two diverse underwater datasets demonstrate the benefits of SparseUWSeg over state-of-the-art approaches, achieving up to +5% mIoU over D+NN. Our main contribution is the design and release of a simple but effective interactive annotation tool, integrating our algorithms. It enables ecology researchers to leverage foundation models and computer vision to efficiently generate high-quality segmentation masks to process their data.

[37] ViConEx-Med: Visual Concept Explainability via Multi-Concept Token Transformer for Medical Image Analysis cs.CVPDF

Cristiano Patrício, Luís F. Teixeira, João C. Neves

TL;DR: ViConEx-Med 是一种基于Transformer的框架，通过多概念可学习token联合预测和定位视觉概念，提供了人类可理解的视觉概念解释。该方法在医学图像分析中展示了优越的性能。

Details

Motivation: 现有概念模型将概念视为数值属性，缺乏视觉解释能力，限制了其在高风险场景（如医学应用）中的实用性。ViConEx-Med旨在填补这一空白。

Result: 在合成和真实医学数据集上，ViConEx-Med优于先前概念模型，并与黑盒模型在概念检测和定位精度上竞争性能。

Insight: ViConEx-Med为构建基于视觉概念的固有可解释模型提供了新方向，特别适用于医学等高风险领域。

Abstract: Concept-based models aim to explain model decisions with human-understandable concepts. However, most existing approaches treat concepts as numerical attributes, without providing complementary visual explanations that could localize the predicted concepts. This limits their utility in real-world applications and particularly in high-stakes scenarios, such as medical use-cases. This paper proposes ViConEx-Med, a novel transformer-based framework for visual concept explainability, which introduces multi-concept learnable tokens to jointly predict and localize visual concepts. By leveraging specialized attention layers for processing visual and text-based concept tokens, our method produces concept-level localization maps while maintaining high predictive accuracy. Experiments on both synthetic and real-world medical datasets demonstrate that ViConEx-Med outperforms prior concept-based models and achieves competitive performance with black-box models in terms of both concept detection and localization precision. Our results suggest a promising direction for building inherently interpretable models grounded in visual concepts. Code is publicly available at https://github.com/CristianoPatricio/viconex-med.

Zixu Zhao, Yang Zhan

TL;DR: 该论文提出了一个用于无人机跨模态文本-视频检索的TCMA框架，并构建了一个细粒度的数据集DVTMD，通过多粒度对齐方法显著提升了检索性能。

Details

Motivation: 无人机采集的视频数据量大，但现有的文本-视频检索方法在无人机领域的应用受限，主要是由于数据集标注粗糙且冗余。因此，需要构建细粒度的数据集并设计高效的多粒度对齐方法。

Result: 在DVTMD和CapERA数据集上，TCMA实现了SOTA性能，文本到视频检索的R@1为45.5%，视频到文本检索的R@1为42.8%。

Insight: 细粒度的数据集和多粒度对齐方法显著提升了无人机领域的文本-视频检索性能，动态温度机制为跨模态注意力建模提供了新思路。

Abstract: Unmanned aerial vehicles (UAVs) have become powerful platforms for real-time, high-resolution data collection, producing massive volumes of aerial videos. Efficient retrieval of relevant content from these videos is crucial for applications in urban management, emergency response, security, and disaster relief. While text-video retrieval has advanced in natural video domains, the UAV domain remains underexplored due to limitations in existing datasets, such as coarse and redundant captions. Thus, in this work, we construct the Drone Video-Text Match Dataset (DVTMD), which contains 2,864 videos and 14,320 fine-grained, semantically diverse captions. The annotations capture multiple complementary aspects, including human actions, objects, background settings, environmental conditions, and visual style, thereby enhancing text-video correspondence and reducing redundancy. Building on this dataset, we propose the Text-Conditioned Multi-granularity Alignment (TCMA) framework, which integrates global video-sentence alignment, sentence-guided frame aggregation, and word-guided patch alignment. To further refine local alignment, we design a Word and Patch Selection module that filters irrelevant content, as well as a Text-Adaptive Dynamic Temperature Mechanism that adapts attention sharpness to text type. Extensive experiments on DVTMD and CapERA establish the first complete benchmark for drone text-video retrieval. Our TCMA achieves state-of-the-art performance, including 45.5% R@1 in text-to-video and 42.8% R@1 in video-to-text retrieval, demonstrating the effectiveness of our dataset and method. The code and dataset will be released.

[39] Fairness Without Labels: Pseudo-Balancing for Bias Mitigation in Face Gender Classification cs.CV | 68T07 | I.2.10; I.4.8; I.5.4PDF

Haohua Dong, Ana Manzano Rodríguez, Camille Guinaudeau, Shin’ichi Satoh

TL;DR: 论文提出了一种名为伪平衡（pseudo-balancing）的方法，用于在半监督学习中缓解人脸性别分类模型中的偏见问题。该方法无需真实标签，仅通过利用种族平衡的无标注数据集实现对模型的去偏处理。实验结果表明，伪平衡显著提升了公平性和准确性。

Details

Motivation: 人脸性别分类模型通常会放大训练数据中的偏见，导致在不同性别和种族子群中的性能不均。现有方法通常依赖真实标签，但实际场景中标签往往不可得。因此，需要一种无需真实标签的去偏方法。

Result: 实验结果显示，伪平衡方法显著提升了公平性和准确性：整体准确率提升6.53%至79.81%，性别准确率差距缩小44.17%。在东亚子群中，基线的49%差距被缩小至5.01%。

Insight: 论文表明，即使没有真实标签监督，仅通过人口统计学平衡（或适度倾斜）的无标注数据集，也能有效去偏现有计算机视觉模型。这为实际应用提供了一种低成本解决方案。

Abstract: Face gender classification models often reflect and amplify demographic biases present in their training data, leading to uneven performance across gender and racial subgroups. We introduce pseudo-balancing, a simple and effective strategy for mitigating such biases in semi-supervised learning. Our method enforces demographic balance during pseudo-label selection, using only unlabeled images from a race-balanced dataset without requiring access to ground-truth annotations. We evaluate pseudo-balancing under two conditions: (1) fine-tuning a biased gender classifier using unlabeled images from the FairFace dataset, and (2) stress-testing the method with intentionally imbalanced training data to simulate controlled bias scenarios. In both cases, models are evaluated on the All-Age-Faces (AAF) benchmark, which contains a predominantly East Asian population. Our results show that pseudo-balancing consistently improves fairness while preserving or enhancing accuracy. The method achieves 79.81% overall accuracy - a 6.53% improvement over the baseline - and reduces the gender accuracy gap by 44.17%. In the East Asian subgroup, where baseline disparities exceeded 49%, the gap is narrowed to just 5.01%. These findings suggest that even in the absence of label supervision, access to a demographically balanced or moderately skewed unlabeled dataset can serve as a powerful resource for debiasing existing computer vision models.

[40] B2N3D: Progressive Learning from Binary to N-ary Relationships for 3D Object Grounding cs.CVPDF

Feng Xiao, Hongbin Xu, Hai Ci, Wenxiong Kang

TL;DR: 论文提出了一种渐进式关系学习框架B2N3D，将3D物体定位中的关系建模从二元扩展到n元，提升了多模态关系理解的全局感知能力。通过设计分组监督损失和多模态混合注意力网络，该方法在ReferIt3D和ScanRefer基准测试中表现优于现有方法。

Details

Motivation: 当前3D物体定位方法仅建模二元对象关系，忽略了多模态理解中n元组合的全局感知重要性，导致描述涉及多空间关系时对齐困难。

Result: 在ReferIt3D和ScanRefer基准测试中表现优于现有方法，验证了n元关系感知在3D定位中的优势。

Insight: n元关系建模能更全面地捕捉多模态描述中的全局信息，从而提升复杂空间关系下的3D物体定位精度。

Abstract: Localizing 3D objects using natural language is essential for robotic scene understanding. The descriptions often involve multiple spatial relationships to distinguish similar objects, making 3D-language alignment difficult. Current methods only model relationships for pairwise objects, ignoring the global perceptual significance of n-ary combinations in multi-modal relational understanding. To address this, we propose a novel progressive relational learning framework for 3D object grounding. We extend relational learning from binary to n-ary to identify visual relations that match the referential description globally. Given the absence of specific annotations for referred objects in the training data, we design a grouped supervision loss to facilitate n-ary relational learning. In the scene graph created with n-ary relationships, we use a multi-modal network with hybrid attention mechanisms to further localize the target within the n-ary combinations. Experiments and ablation studies on the ReferIt3D and ScanRefer benchmarks demonstrate that our method outperforms the state-of-the-art, and proves the advantages of the n-ary relational perception in 3D localization.

[41] From Generic to Specialized: A Subspecialty Diagnostic System Powered by Self-Supervised Learning for Cervical Histopathology cs.CVPDF

Yizhi Wang, Li Chen, Qiang Huang, Tian Guan, Xi Deng

TL;DR: 该论文提出了CerS-Path系统，通过自监督学习和多模态增强构建宫颈病理学专用诊断模型，显著提高了诊断的准确性和泛化能力。

Details

Motivation: 宫颈癌的诊断需要复杂的组织病理学评估，现有深度学习模型在准确性和泛化性上表现不足，且通用基础模型难以捕捉专科特征。

Result: 在3,173例前瞻性测试中，系统保持了99.38%的筛查敏感性和优秀的泛化能力。

Insight: 自监督学习和多模态增强的结合能够显著提升专科诊断模型的性能，为其他专科领域的AI应用提供了参考。

Abstract: Cervical cancer remains a major malignancy, necessitating extensive and complex histopathological assessments and comprehensive support tools. Although deep learning shows promise, these models still lack accuracy and generalizability. General foundation models offer a broader reach but remain limited in capturing subspecialty-specific features and task adaptability. We introduce the Cervical Subspecialty Pathology (CerS-Path) diagnostic system, developed through two synergistic pretraining stages: self-supervised learning on approximately 190 million tissue patches from 140,000 slides to build a cervical-specific feature extractor, and multimodal enhancement with 2.5 million image-text pairs, followed by integration with multiple downstream diagnostic functions. Supporting eight diagnostic functions, including rare cancer classification and multimodal Q&A, CerS-Path surpasses prior foundation models in scope and clinical applicability. Comprehensive evaluations demonstrate a significant advance in cervical pathology, with prospective testing on 3,173 cases across five centers maintaining 99.38% screening sensitivity and excellent generalizability, highlighting its potential for subspecialty diagnostic translation and cervical cancer screening.

[42] Semantic Visual Anomaly Detection and Reasoning in AI-Generated Images cs.CVPDF

Chuangchuang Tan, Xiang Ming, Jinglu Wang, Renshuai Tao, Bin Li

TL;DR: 论文提出了AnomReason基准和AnomAgent框架，用于检测和解释AI生成图像中的语义异常，并通过实验验证了其有效性。

Details

Motivation: AI生成内容（AIGC）虽然视觉上逼真，但常存在语义异常，如物体配置不合理或常识违反，这些异常会影响图像的可靠性。论文旨在解决这一问题，提升AIGC的语义可信度。

Result: 实验表明，基于AnomReason微调的模型在语义匹配指标上优于基线方法，并在可解释的Deepfake检测和图像生成器语义合理性评估中展示了实用性。

Insight: 语义异常检测是AIGC可信度的关键；多代理流水线和轻量人工验证的结合可以高效生成大规模高质量标注；结构化标注和语义指标为AIGC研究提供了新工具。

Abstract: The rapid advancement of AI-generated content (AIGC) has enabled the synthesis of visually convincing images; however, many such outputs exhibit subtle \textbf{semantic anomalies}, including unrealistic object configurations, violations of physical laws, or commonsense inconsistencies, which compromise the overall plausibility of the generated scenes. Detecting these semantic-level anomalies is essential for assessing the trustworthiness of AIGC media, especially in AIGC image analysis, explainable deepfake detection and semantic authenticity assessment. In this paper, we formalize \textbf{semantic anomaly detection and reasoning} for AIGC images and introduce \textbf{AnomReason}, a large-scale benchmark with structured annotations as quadruples \emph{(Name, Phenomenon, Reasoning, Severity)}. Annotations are produced by a modular multi-agent pipeline (\textbf{AnomAgent}) with lightweight human-in-the-loop verification, enabling scale while preserving quality. At construction time, AnomAgent processed approximately 4.17,B GPT-4o tokens, providing scale evidence for the resulting structured annotations. We further show that models fine-tuned on AnomReason achieve consistent gains over strong vision-language baselines under our proposed semantic matching metric (\textit{SemAP} and \textit{SemF1}). Applications to {explainable deepfake detection} and {semantic reasonableness assessment of image generators} demonstrate practical utility. In summary, AnomReason and AnomAgent serve as a foundation for measuring and improving the semantic plausibility of AI-generated images. We will release code, metrics, data, and task-aligned models to support reproducible research on semantic authenticity and interpretable AIGC forensics.

[43] MRI Brain Tumor Detection with Computer Vision cs.CV | cs.AI | cs.LG | 68T07, 68U10 | I.2.6; I.2.10; I.4.6PDF

Jack Krolik, Jake Lynn, John Henry Rudden, Dmytro Vremenko

TL;DR: 论文探讨了深度学习在MRI脑瘤自动检测与分割中的应用，结合多种机器学习模型，展示了在诊断准确性和效率上的显著提升。

Details

Motivation: 传统脑瘤诊断依赖人工分析MRI图像，耗时且易出错。深度学习技术在医学影像中的潜力激发了这一研究，旨在提高诊断的自动化水平和准确性。

Result: 实验结果表明，提出的方法在脑瘤检测和分割任务中表现优异，显著提升了诊断的准确性和效率。

Insight: 深度学习在医学影像分析中具有广阔前景，未来可进一步优化模型，结合多模态数据以提高临床实用性。

Abstract: This study explores the application of deep learning techniques in the automated detection and segmentation of brain tumors from MRI scans. We employ several machine learning models, including basic logistic regression, Convolutional Neural Networks (CNNs), and Residual Networks (ResNet) to classify brain tumors effectively. Additionally, we investigate the use of U-Net for semantic segmentation and EfficientDet for anchor-based object detection to enhance the localization and identification of tumors. Our results demonstrate promising improvements in the accuracy and efficiency of brain tumor diagnostics, underscoring the potential of deep learning in medical imaging and its significance in improving clinical outcomes.

[44] Are Video Models Emerging as Zero-Shot Learners and Reasoners in Medical Imaging? cs.CVPDF

Yuxiang Lai, Jike Zhong, Ming Li, Yuheng Li, Xiaofeng Yang

TL;DR: 这篇论文探讨了大型生成模型（LVM）在零样本（zero-shot）设置下直接应用于医学影像任务的潜力，尽管模型从未在医学数据上进行过训练。结果表明，LVM在器官分割、去噪、超分辨率和放疗运动预测等任务上表现出色，甚至在运动预测中超越了专用基线方法。

Details

Motivation: 受近期大型生成模型在多领域零样本泛化能力的启发，研究团队试图验证是否可以通过自回归视频建模原理，直接将这些能力迁移到医学影像任务中，而无需针对医学数据进行微调。

Result: LVM在4D-CT数据的122名患者上进行了测试，总计超过1,820个3D CT体积。结果表明，即使未经医学数据训练，模型在所有任务中均表现优异，尤其在运动预测任务中超越了专用基线方法。

Insight: 文章揭示了视频模型在医学领域的潜力，表明通用视频模型可以作为统一的学习和推理工具，未来可能成为医学基础模型的核心组成部分。

Abstract: Recent advances in large generative models have shown that simple autoregressive formulations, when scaled appropriately, can exhibit strong zero-shot generalization across domains. Motivated by this trend, we investigate whether autoregressive video modeling principles can be directly applied to medical imaging tasks, despite the model never being trained on medical data. Specifically, we evaluate a large vision model (LVM) in a zero-shot setting across four representative tasks: organ segmentation, denoising, super-resolution, and motion prediction. Remarkably, even without domain-specific fine-tuning, the LVM can delineate anatomical structures in CT scans and achieve competitive performance on segmentation, denoising, and super-resolution. Most notably, in radiotherapy motion prediction, the model forecasts future 3D CT phases directly from prior phases of a 4D CT scan, producing anatomically consistent predictions that capture patient-specific respiratory dynamics with realistic temporal coherence. We evaluate the LVM on 4D CT data from 122 patients, totaling over 1,820 3D CT volumes. Despite no prior exposure to medical data, the model achieves strong performance across all tasks and surpasses specialized DVF-based and generative baselines in motion prediction, achieving state-of-the-art spatial accuracy. These findings reveal the emergence of zero-shot capabilities in medical video modeling and highlight the potential of general-purpose video models to serve as unified learners and reasoners laying the groundwork for future medical foundation models built on video models.

[45] Opacity-Gradient Driven Density Control for Compact and Efficient Few-Shot 3D Gaussian Splatting cs.CV | cs.LGPDF

Abdelrhman Elrawy, Emad A. Mohammed

TL;DR: 论文提出了一种基于不透明度梯度的密度控制方法，改进了3D高斯泼溅在少样本场景下的效率和紧凑性，显著减少了基元数量，同时保持重建质量。

Details

Motivation: 3D高斯泼溅（3DGS）在少样本场景中容易过拟合且重建臃肿，现有方法如FSGS虽然提高了质量，但大幅增加了基元数量。本文旨在优化核心3DGS的效率和紧凑性。

Result: 在3-view LLFF数据集上，模型的基元数量比FSGS减少了40%（32k vs. 57k）；在Mip-NeRF 360数据集上，减少了约70%，且重建指标仅略有下降。

Insight: 不透明度梯度是一种有效的轻量级代理指标，可以高效指导密度控制；保守修剪与激进密度控制的结合是实现高效紧凑重建的关键。

Abstract: 3D Gaussian Splatting (3DGS) struggles in few-shot scenarios, where its standard adaptive density control (ADC) can lead to overfitting and bloated reconstructions. While state-of-the-art methods like FSGS improve quality, they often do so by significantly increasing the primitive count. This paper presents a framework that revises the core 3DGS optimization to prioritize efficiency. We replace the standard positional gradient heuristic with a novel densification trigger that uses the opacity gradient as a lightweight proxy for rendering error. We find this aggressive densification is only effective when paired with a more conservative pruning schedule, which prevents destructive optimization cycles. Combined with a standard depth-correlation loss for geometric guidance, our framework demonstrates a fundamental improvement in efficiency. On the 3-view LLFF dataset, our model is over 40% more compact (32k vs. 57k primitives) than FSGS, and on the Mip-NeRF 360 dataset, it achieves a reduction of approximately 70%. This dramatic gain in compactness is achieved with a modest trade-off in reconstruction metrics, establishing a new state-of-the-art on the quality-vs-efficiency Pareto frontier for few-shot view synthesis.

[46] VividAnimator: An End-to-End Audio and Pose-driven Half-Body Human Animation Framework cs.CVPDF

Donglin Huang, Yongyuan Li, Tianhang Liu, Junming Huang, Xiaoda Yang

TL;DR: VividAnimator是一个端到端的音频和姿态驱动半身人体动画框架，通过预训练手部清晰代码本（HCC）、双流音频感知模块（DSAA）和姿态校准技巧（PCT），解决了现有方法中头部僵硬和手部模糊的问题。

Details

Motivation: 现有音频和姿态驱动的人体动画方法在头部运动和手部细节上表现不佳，主要原因是音频与头部运动关联性弱，以及手部结构复杂。

Result: 实验表明，VividAnimator在手部细节、手势真实性和身份一致性上优于现有方法。

Insight: 分离建模音频驱动的不同部分（唇同步和头部动态）并结合手部先验知识，是提升动画质量的关键。

Abstract: Existing for audio- and pose-driven human animation methods often struggle with stiff head movements and blurry hands, primarily due to the weak correlation between audio and head movements and the structural complexity of hands. To address these issues, we propose VividAnimator, an end-to-end framework for generating high-quality, half-body human animations driven by audio and sparse hand pose conditions. Our framework introduces three key innovations. First, to overcome the instability and high cost of online codebook training, we pre-train a Hand Clarity Codebook (HCC) that encodes rich, high-fidelity hand texture priors, significantly mitigating hand degradation. Second, we design a Dual-Stream Audio-Aware Module (DSAA) to model lip synchronization and natural head pose dynamics separately while enabling interaction. Third, we introduce a Pose Calibration Trick (PCT) that refines and aligns pose conditions by relaxing rigid constraints, ensuring smooth and natural gesture transitions. Extensive experiments demonstrate that Vivid Animator achieves state-of-the-art performance, producing videos with superior hand detail, gesture realism, and identity consistency, validated by both quantitative metrics and qualitative evaluations.

[47] SAM2LoRA: Composite Loss-Guided, Parameter-Efficient Finetuning of SAM2 for Retinal Fundus Segmentation cs.CVPDF

Sayan Mandal, Divyadarshini Karthikeyan, Manas Paldhe

TL;DR: SAM2LoRA是一种参数高效的微调策略，通过低秩适配器（LoRA）和复合损失函数优化Segment Anything Model 2（SAM2），用于视网膜眼底图像分割，显著减少参数需求并提升性能。

Details

Motivation: SAM2虽然在低资源场景下表现优异，但微调仍然具有挑战性。需要一种高效的方法，既能减少训练开销，又能提升跨数据集分割任务的性能。

Result: 在跨数据集训练条件下，SAM2LoRA在血管和视盘分割任务中表现优异，Dice分数分别达0.86和0.93，AUC值达0.98和0.99，达到SOTA性能。

Insight: 1）低秩适配器是实现参数高效微调的有效手段；2）复合损失函数对多任务分割问题至关重要；3）SAM2LoRA在低资源场景下具有广泛应用潜力。

Abstract: We propose SAM2LoRA, a parameter-efficient fine-tuning strategy that adapts the Segment Anything Model 2 (SAM2) for fundus image segmentation. SAM2 employs a masked autoencoder-pretrained Hierarchical Vision Transformer for multi-scale feature decoding, enabling rapid inference in low-resource settings; however, fine-tuning remains challenging. To address this, SAM2LoRA integrates a low-rank adapter into both the image encoder and mask decoder, requiring fewer than 5% of the original trainable parameters. Our analysis indicates that for cross-dataset fundus segmentation tasks, a composite loss function combining segmentationBCE, SoftDice, and FocalTversky losses is essential for optimal network tuning. Evaluated on 11 challenging fundus segmentation datasets, SAM2LoRA demonstrates high performance in both blood vessel and optic disc segmentation under cross-dataset training conditions. It achieves Dice scores of up to 0.86 and 0.93 for blood vessel and optic disc segmentation, respectively, and AUC values of up to 0.98 and 0.99, achieving state-of-the-art performance while substantially reducing training overhead.

[48] From Programs to Poses: Factored Real-World Scene Generation via Learned Program Libraries cs.CV | cs.AIPDF

Joy Hsu, Emily Jin, Jiajun Wu, Niloy J. Mitra

TL;DR: FactoredScenes是一个框架，通过学习房间结构和物体姿态的变化，生成逼真的3D场景。它将场景分解为层级化的程序概念和物体姿态，并通过语言模型生成高级程序，同时预测和检索物体姿态。

Details

Motivation: 真实世界的场景（如ScanNet）难以捕捉且数据有限，生成多样化的物体姿态仍具挑战性。需要一种能够利用场景结构和姿态变化的方法。

Result: 生成的场景逼真，难以与真实的ScanNet场景区分。

Insight: 层级化分解和程序生成方法能够有效地捕捉场景结构和姿态变化，为3D场景生成提供了新思路。

Abstract: Real-world scenes, such as those in ScanNet, are difficult to capture, with highly limited data available. Generating realistic scenes with varied object poses remains an open and challenging task. In this work, we propose FactoredScenes, a framework that synthesizes realistic 3D scenes by leveraging the underlying structure of rooms while learning the variation of object poses from lived-in scenes. We introduce a factored representation that decomposes scenes into hierarchically organized concepts of room programs and object poses. To encode structure, FactoredScenes learns a library of functions capturing reusable layout patterns from which scenes are drawn, then uses large language models to generate high-level programs, regularized by the learned library. To represent scene variations, FactoredScenes learns a program-conditioned model to hierarchically predict object poses, and retrieves and places 3D objects in a scene. We show that FactoredScenes generates realistic, real-world rooms that are difficult to distinguish from real ScanNet scenes.

Yu-Hsuan Lin

TL;DR: 本文提出了一种多模态框架，结合视觉语言推理（CLIP）、目标检测（YOLO-World）和运动分析（MOG2），用于按有序尺度分类交通拥堵等级，显著优于单模态基线。

Details

Motivation: 准确的交通拥堵分类对智能交通系统和实时城市交通管理至关重要。现有的单模态方法难以充分捕捉交通场景的复杂性，需要结合多模态信息以提高分类性能。

Result: 实验结果表明，模型准确率达76.7%，F1分数为0.752，二次加权Kappa（QWK）为0.684，显著优于单模态基线。

Insight: 多模态信息的结合可以有效提升交通拥堵分类的性能和语义一致性。未来可通过车辆尺寸和更精细的密度指标进一步优化。

Abstract: Accurate traffic congestion classification is essential for intelligent transportation systems and real-time urban traffic management. This paper presents a multimodal framework combining open-vocabulary visual-language reasoning (CLIP), object detection (YOLO-World), and motion analysis via MOG2-based background subtraction. The system predicts congestion levels on an ordinal scale from 1 (free flow) to 5 (severe congestion), enabling semantically aligned and temporally consistent classification. To enhance interpretability, we incorporate motion-based confidence weighting and generate annotated visual outputs. Experimental results show the model achieves 76.7 percent accuracy, an F1 score of 0.752, and a Quadratic Weighted Kappa (QWK) of 0.684, significantly outperforming unimodal baselines. These results demonstrate the framework’s effectiveness in preserving ordinal structure and leveraging visual-language and motion modalities. Future enhancements include incorporating vehicle sizing and refined density metrics.

[50] PointMAC: Meta-Learned Adaptation for Robust Test-Time Point Cloud Completion cs.CVPDF

Linlian Jiang, Rui Ma, Li Gu, Ziqiang Wang, Xinxin Zuo

TL;DR: PointMAC是一个基于元学习（meta-learning）的框架，通过自监督辅助目标和MAML策略实现测试时的点云补全自适应，无需额外监督。

Details

Motivation: 现有的点云补全模型在测试时缺乏对新结构和传感器失真的适应能力，限制了其在安全关键应用中的鲁棒性。

Result: 在合成、模拟和真实数据集上取得了最先进的结果，表明PointMAC能够生成高质量的补全点云。

Insight: 测试时自适应和自监督辅助目标是提升点云补全鲁棒性的有效方法，尤其是在面对新结构或传感器噪声时。

Abstract: Point cloud completion is essential for robust 3D perception in safety-critical applications such as robotics and augmented reality. However, existing models perform static inference and rely heavily on inductive biases learned during training, limiting their ability to adapt to novel structural patterns and sensor-induced distortions at test time. To address this limitation, we propose PointMAC, a meta-learned framework for robust test-time adaptation in point cloud completion. It enables sample-specific refinement without requiring additional supervision. Our method optimizes the completion model under two self-supervised auxiliary objectives that simulate structural and sensor-level incompleteness. A meta-auxiliary learning strategy based on Model-Agnostic Meta-Learning (MAML) ensures that adaptation driven by auxiliary objectives is consistently aligned with the primary completion task. During inference, we adapt the shared encoder on-the-fly by optimizing auxiliary losses, with the decoder kept fixed. To further stabilize adaptation, we introduce Adaptive $\lambda$-Calibration, a meta-learned mechanism for balancing gradients between primary and auxiliary objectives. Extensive experiments on synthetic, simulated, and real-world datasets demonstrate that PointMAC achieves state-of-the-art results by refining each sample individually to produce high-quality completions. To the best of our knowledge, this is the first work to apply meta-auxiliary test-time adaptation to point cloud completion.

[51] Vision4PPG: Emergent PPG Analysis Capability of Vision Foundation Models for Vital Signs like Blood Pressure cs.CV | cs.LGPDF

Saurabh Kataria, Ayca Ermis, Lovely Yeswanth Panchumarthi, Minxiao Wang, Xiao Hu

TL;DR: 这篇论文提出了一种名为Vision4PPG的方法，利用视觉基础模型（VFM）处理PPG信号，通过将其转换为二维图像表示（如STFT），实现了在多项生理任务中的领先性能。

Details

Motivation: PPG传感器在可穿戴和临床设备中提供了非侵入性和实时的生理数据，但传统方法多依赖于专用基础模型或时间序列模型。作者探索了视觉基础模型在处理PPG信号中的潜力，发现其性能优于传统方法。

Result: 结果显示，Vision4PPG在血压估计等任务中达到了SOTA性能，并在其他6项任务中表现出色。

Insight: 研究发现视觉基础模型能有效处理PPG信号，开辟了新的研究方向；同时，PEFT技术使其计算高效，适合临床应用。

Abstract: Photoplethysmography (PPG) sensor in wearable and clinical devices provides valuable physiological insights in a non-invasive and real-time fashion. Specialized Foundation Models (FM) or repurposed time-series FMs are used to benchmark physiological tasks. Our experiments with fine-tuning FMs reveal that Vision FM (VFM) can also be utilized for this purpose and, in fact, surprisingly leads to state-of-the-art (SOTA) performance on many tasks, notably blood pressure estimation. We leverage VFMs by simply transforming one-dimensional PPG signals into image-like two-dimensional representations, such as the Short-Time Fourier transform (STFT). Using the latest VFMs, such as DINOv3 and SIGLIP-2, we achieve promising performance on other vital signs and blood lab measurement tasks as well. Our proposal, Vision4PPG, unlocks a new class of FMs to achieve SOTA performance with notable generalization to other 2D input representations, including STFT phase and recurrence plots. Our work improves upon prior investigations of vision models for PPG by conducting a comprehensive study, comparing them to state-of-the-art time-series FMs, and demonstrating the general PPG processing ability by reporting results on six additional tasks. Thus, we provide clinician-scientists with a new set of powerful tools that is also computationally efficient, thanks to Parameter-Efficient Fine-Tuning (PEFT) techniques.

[52] Self-Supervised Multi-Scale Transformer with Attention-Guided Fusion for Efficient Crack Detection cs.CVPDF

Blessing Agyei Kyem, Joshua Kofi Asamoah, Eugene Denteh, Andrews Danyo, Armstrong Aboah

TL;DR: 论文提出了一种完全自监督的裂缝检测框架Crack-Segmenter，通过多尺度特征提取、注意力机制和自适应特征融合模块，无需人工标注即可实现高效的像素级裂缝分割，性能优于现有监督方法。

Details

Motivation: 传统裂缝检测依赖于成本高且耗时的像素级标注，限制了大规模基础设施监测的可扩展性。本研究旨在探索无需人工标注的高效裂缝检测方法。

Result: 在十个公开数据集上的实验表明，Crack-Segmenter在mIoU、Dice score、XOR和HD等指标上均优于13种监督方法。

Insight: 研究表明，无需人工标注的裂缝检测不仅可行，还能超越监督方法，为大规模基础设施监测提供了低成本、高效的解决方案。

Abstract: Pavement crack detection has long depended on costly and time-intensive pixel-level annotations, which limit its scalability for large-scale infrastructure monitoring. To overcome this barrier, this paper examines the feasibility of achieving effective pixel-level crack segmentation entirely without manual annotations. Building on this objective, a fully self-supervised framework, Crack-Segmenter, is developed, integrating three complementary modules: the Scale-Adaptive Embedder (SAE) for robust multi-scale feature extraction, the Directional Attention Transformer (DAT) for maintaining linear crack continuity, and the Attention-Guided Fusion (AGF) module for adaptive feature integration. Through evaluations on ten public datasets, Crack-Segmenter consistently outperforms 13 state-of-the-art supervised methods across all major metrics, including mean Intersection over Union (mIoU), Dice score, XOR, and Hausdorff Distance (HD). These findings demonstrate that annotation-free crack detection is not only feasible but also superior, enabling transportation agencies and infrastructure managers to conduct scalable and cost-effective monitoring. This work advances self-supervised learning and motivates pavement cracks detection research.

[53] Identifying bias in CNN image classification using image scrambling and transforms cs.CV | cs.AIPDF

Sai Teja Erukude

TL;DR: 该论文通过图像分割和变换技术，探讨了CNN图像分类中难以察觉的偏见问题，并提出两种方法来区分上下文信息和背景噪声。

Details

Motivation: CNN在图像分类中表现优异，但其“黑盒”特性使得用户难以理解特征选择过程，可能导致基于背景信息的偏见决策。作者希望通过技术手段识别这些隐藏偏见。

Result: 实验表明，这些方法能有效区分上下文信息和背景噪声，并且无需依赖背景信息即可检测背景噪声的存在。

Insight: 图像变换和小块打乱技术有助于揭示CNN分类中的隐藏偏见，为理解和改进模型的可解释性提供了新思路。

Abstract: CNNs are now prevalent as the primary choice for most machine vision problems due to their superior rate of classification and the availability of user-friendly libraries. These networks effortlessly identify and select features in a non-intuitive data-driven manner, making it difficult to determine which features were most influential. That leads to a ``black box”, where users cannot know how the image data are analyzed but rely on empirical results. Therefore the decision-making process can be biased by background information that is difficult to detect. Here we discuss examples of such hidden biases and propose techniques for identifying them, methods to distinguish between contextual information and background noise, and explore whether CNNs learn from irrelevant features. One effective approach to identify dataset bias is to classify blank background parts of the images. However, in some situations a blank background in the images is not available, making it more difficult to separate the foreground information from the blank background. Such parts of the image can also be considered contextual learning, not necessarily bias. To overcome this, we propose two approaches that were tested on six different datasets, including natural, synthetic, and hybrid datasets. The first method involves dividing images into smaller, non-overlapping tiles of various sizes, which are then shuffled randomly, making classification more challenging. The second method involves the application of several image transforms, including Fourier, Wavelet transforms, and Median filter, and their combinations. These transforms help recover background noise information used by CNN to classify images. Results indicate that this method can effectively distinguish between contextual information and background noise, and alert on the presence of background noise even without the need to use background information.

[54] AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration cs.CVPDF

Xinlong Chen, Yue Ding, Weihong Lin, Jingyun Hua, Linli Yao

TL;DR: AVoCaDO是一种通过视听模态时间协调驱动的视听视频字幕生成模型，通过两阶段训练显著提升了字幕的质量和时间一致性。

Details

Motivation: 视听视频字幕生成旨在通过视觉和听觉事件的时间对齐生成语义丰富的描述，但由于时序一致性和多模态对齐的挑战，现有方法效果有限。

Result: 在四个视听视频字幕基准测试中显著优于现有开源模型，并在纯视觉设置下（VDC和DREAM-1K）表现优异。

Insight: 时间协调对多模态字幕生成至关重要，两阶段训练和定制奖励函数可以有效提升模型的生成质量。

Abstract: Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. In this paper, we present AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. We propose a two-stage post-training pipeline: (1) AVoCaDO SFT, which fine-tunes the model on a newly curated dataset of 107K high-quality, temporally-aligned audiovisual captions; and (2) AVoCaDO GRPO, which leverages tailored reward functions to further enhance temporal coherence and dialogue accuracy while regularizing caption length and reducing collapse. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance on the VDC and DREAM-1K benchmark under visual-only settings.

Zhao-Yang Wang, Jieneng Chen, Jiang Liu, Yuxiang Guo, Rama Chellappa

TL;DR: Mesh-Gait是一种新颖的多模态步态识别框架，通过从2D轮廓直接重建3D表示，结合了2D和3D的优势，实现了高效且鲁棒的步态识别。

Details

Motivation: 传统步态识别方法（如基于2D轮廓或骨架）在视角变化、遮挡和噪声下表现不佳，而多模态方法（如结合3D信息）虽鲁棒但计算成本高。Mesh-Gait旨在解决这些问题，通过高效融合2D和3D信息。

Result: Mesh-Gait在多模态步态识别任务中实现了最先进的准确率。

Insight: 中间3D热图表示是一种高效且鲁棒的特征提取方式，既保留了3D几何信息，又避免了复杂的直接3D重建。

Abstract: Gait recognition, a fundamental biometric technology, leverages unique walking patterns for individual identification, typically using 2D representations such as silhouettes or skeletons. However, these methods often struggle with viewpoint variations, occlusions, and noise. Multi-modal approaches that incorporate 3D body shape information offer improved robustness but are computationally expensive, limiting their feasibility for real-time applications. To address these challenges, we introduce Mesh-Gait, a novel end-to-end multi-modal gait recognition framework that directly reconstructs 3D representations from 2D silhouettes, effectively combining the strengths of both modalities. Compared to existing methods, directly learning 3D features from 3D joints or meshes is complex and difficult to fuse with silhouette-based gait features. To overcome this, Mesh-Gait reconstructs 3D heatmaps as an intermediate representation, enabling the model to effectively capture 3D geometric information while maintaining simplicity and computational efficiency. During training, the intermediate 3D heatmaps are gradually reconstructed and become increasingly accurate under supervised learning, where the loss is calculated between the reconstructed 3D joints, virtual markers, and 3D meshes and their corresponding ground truth, ensuring precise spatial alignment and consistent 3D structure. Mesh-Gait extracts discriminative features from both silhouettes and reconstructed 3D heatmaps in a computationally efficient manner. This design enables the model to capture spatial and structural gait characteristics while avoiding the heavy overhead of direct 3D reconstruction from RGB videos, allowing the network to focus on motion dynamics rather than irrelevant visual details. Extensive experiments demonstrate that Mesh-Gait achieves state-of-the-art accuracy. The code will be released upon acceptance of the paper.

[56] Guided Image Feature Matching using Feature Spatial Order cs.CV | eess.IVPDF

Chin-Hung Teng, Ben-Jian Dong

TL;DR: 该论文提出了一种结合特征空间顺序与渐进式匹配框架的图像特征匹配方法，通过估计特征匹配概率并过滤不必要的匹配，显著提升了匹配效率和准确性。

Details

Motivation: 传统图像特征匹配方法在处理大量特征点时效率较低，且依赖对极几何。特征空间顺序作为一个独立概念可以补充对极几何，提升匹配效率。

Result: 实验表明，该方法在标准数据集、模拟图像和真实图像上均显著提升了匹配效率和准确性。

Insight: 特征空间顺序是对极几何的有效补充，通过渐进式框架和图像对齐可以显著优化特征匹配性能。

Abstract: Image feature matching plays a vital role in many computer vision tasks. Although many image feature detection and matching techniques have been proposed over the past few decades, it is still time-consuming to match feature points in two images, especially for images with a large number of detected features. Feature spatial order can estimate the probability that a pair of features is correct. Since it is a completely independent concept from epipolar geometry, it can be used to complement epipolar geometry in guiding feature match in a target region so as to improve matching efficiency. In this paper, we integrate the concept of feature spatial order into a progressive matching framework. We use some of the initially matched features to build a computational model of feature spatial order and employs it to calculates the possible spatial range of subsequent feature matches, thus filtering out unnecessary feature matches. We also integrate it with epipolar geometry to further improve matching efficiency and accuracy. Since the spatial order of feature points is affected by image rotation, we propose a suitable image alignment method from the fundamental matrix of epipolar geometry to remove the effect of image rotation. To verify the feasibility of the proposed method, we conduct a series of experiments, including a standard benchmark dataset, self-generated simulated images, and real images. The results demonstrate that our proposed method is significantly more efficient and has more accurate feature matching than the traditional method.

Zhao-Yang Wang, Zhimin Shao, Jieneng Chen, Rama Chellappa

TL;DR: 论文提出了一个多模态、多任务框架Combo-Gait，结合2D时序剪影和3D SMPL特征进行步态识别和人体属性分析，采用统一Transformer融合特征，效果优于现有方法。

Details

Motivation: 现有步态识别方法通常仅依赖单一模态（如2D或3D），无法充分捕捉步态的几何和动态复杂性，同时忽视了步态的其他潜在信息（如人体属性）。

Result: 在BRIAR数据集（远距离、极端俯仰角条件下）上，Combo-Gait在步态识别和属性估计任务中均优于现有方法。

Insight: 多模态和多任务学习能够有效提升步态识别的鲁棒性，并挖掘步态中的附加信息（如人体属性），为现实场景中的步态分析提供了新思路。

Abstract: Gait recognition is an important biometric for human identification at a distance, particularly under low-resolution or unconstrained environments. Current works typically focus on either 2D representations (e.g., silhouettes and skeletons) or 3D representations (e.g., meshes and SMPLs), but relying on a single modality often fails to capture the full geometric and dynamic complexity of human walking patterns. In this paper, we propose a multi-modal and multi-task framework that combines 2D temporal silhouettes with 3D SMPL features for robust gait analysis. Beyond identification, we introduce a multitask learning strategy that jointly performs gait recognition and human attribute estimation, including age, body mass index (BMI), and gender. A unified transformer is employed to effectively fuse multi-modal gait features and better learn attribute-related representations, while preserving discriminative identity cues. Extensive experiments on the large-scale BRIAR datasets, collected under challenging conditions such as long-range distances (up to 1 km) and extreme pitch angles (up to 50{\deg}), demonstrate that our approach outperforms state-of-the-art methods in gait recognition and provides accurate human attribute estimation. These results highlight the promise of multi-modal and multitask learning for advancing gait-based human understanding in real-world scenarios.

[58] Towards Cybersickness Severity Classification from VR Gameplay Videos Using Transfer Learning and Temporal Modeling cs.CVPDF

Jyotirmay Nag Setu, Kevin Desai, John Quarles

TL;DR: 该论文提出了一种基于迁移学习和时间建模的方法，利用VR游戏视频预测晕动症的严重程度，提供了一种实用的工具来评估和缓解VR环境中的不适。

Details

Motivation: 虚拟现实（VR）技术在多个领域快速普及，但晕动症（cybersickness）问题阻碍了其广泛应用。现有的多模态深度学习方法依赖于VR传感器数据，但视频数据的潜力尚未充分探索。

Result: 该方法在视频数据上的分类准确率达到68.4%，优于现有仅依赖视频数据的模型。

Insight: 研究展示了视频数据在晕动症预测中的潜力，为未来基于时间建模的视频分析方法奠定了基础，有助于提升VR用户体验。

Abstract: With the rapid advancement of virtual reality (VR) technology, its adoption across domains such as healthcare, education, and entertainment has grown significantly. However, the persistent issue of cybersickness, marked by symptoms resembling motion sickness, continues to hinder widespread acceptance of VR. While recent research has explored multimodal deep learning approaches leveraging data from integrated VR sensors like eye and head tracking, there remains limited investigation into the use of video-based features for predicting cybersickness. In this study, we address this gap by utilizing transfer learning to extract high-level visual features from VR gameplay videos using the InceptionV3 model pretrained on the ImageNet dataset. These features are then passed to a Long Short-Term Memory (LSTM) network to capture the temporal dynamics of the VR experience and predict cybersickness severity over time. Our approach effectively leverages the time-series nature of video data, achieving a 68.4% classification accuracy for cybersickness severity. This surpasses the performance of existing models trained solely on video data, providing a practical tool for VR developers to evaluate and mitigate cybersickness in virtual environments. Furthermore, this work lays the foundation for future research on video-based temporal modeling for enhancing user comfort in VR applications.

[59] Taming a Retrieval Framework to Read Images in Humanlike Manner for Augmenting Generation of MLLMs cs.CV | cs.AIPDF

Suyang Xi, Chenxi Yang, Hong Ding, Yiqing Ni, Catherine C. Liu

TL;DR: 该论文提出了一种名为HuLiRAG的框架，通过模仿人类处理视觉信息的方式（即’what-where-reweight’级联），解决了多模态大语言模型在细粒度视觉问答中的幻觉问题。该方法结合了开放词汇检测和空间解析，显著提高了模型的可靠性和一致性。

Details

Motivation: 现有的多模态大语言模型（MLLMs）在细粒度视觉问答中容易产生幻觉，例如错误识别物体身份、位置和关系。尽管检索增强生成（RAG）部分解决了这些问题，但其未能模拟人类对全局和局部信息的处理方式，导致推理能力受限。

Result: 实验表明，HuLiRAG在细粒度视觉问答任务中显著减少了幻觉现象，同时提升了答案的事实一致性和定位准确性。该方法在多模态问答任务中取得了卓越表现。

Insight: 论文的核心洞察在于，通过模仿人类视觉信息处理的层次化方式（全局到局部），可以有效解决MLLMs在细粒度任务中的幻觉问题。这种设计不仅提升了模型的可靠性，也为多模态推理提供了新的方法论。

Abstract: Multimodal large language models (MLLMs) often fail in fine-grained visual question answering, producing hallucinations about object identities, positions, and relations because textual queries are not explicitly anchored to visual referents. Retrieval-augmented generation (RAG) alleviates some errors, but it fails to align with human-like processing at both the retrieval and augmentation levels. Specifically, it focuses only on global-level image information but lacks local detail and limits reasoning about fine-grained interactions. To overcome this limitation, we present Human-Like Retrieval-Augmented Generation (HuLiRAG), a framework that stages multimodal reasoning as a ``what–where–reweight’’ cascade. Queries are first anchored to candidate referents via open-vocabulary detection (what), then spatially resolved with SAM-derived masks to recover fine-grained precision (where), and adaptively prioritized through the trade-off between local and global alignment (reweight). Mask-guided fine-tuning further injects spatial evidence into the generation process, transforming grounding from a passive bias into an explicit constraint on answer formulation. Extensive experiments demonstrate that this human-like cascade improves grounding fidelity and factual consistency while reducing hallucinations, advancing multimodal question answering toward trustworthy reasoning.

[60] MonoSE(3)-Diffusion: A Monocular SE(3) Diffusion Framework for Robust Camera-to-Robot Pose Estimation cs.CV | cs.ROPDF

Kangjian Zhu, Haobo Jiang, Yigong Zhang, Jianjun Qian, Jian Yang

TL;DR: MonoSE(3)-Diffusion提出了一种基于单目相机的SE(3)扩散框架，通过条件去噪扩散过程解决无标记的机器人姿态估计问题，显著提升了性能。

Details

Motivation: 现有的姿态估计方法通常采用固定尺度的扰动，限制了姿态多样性和鲁棒性。本文旨在通过扩散模型生成更具多样性的训练姿态，并利用时序感知的逐步细化策略提升预测准确性。

Result: 在DREAM和RoboKeyGen两个基准测试中表现优异，最具挑战性的数据集的AUC达到66.75，比现有技术提升了32.3%。

Insight: 扩散模型的时序感知特性可以有效捕捉姿态估计的多尺度信息，提升任务性能和鲁棒性，尤其在复杂场景中表现突出。

Abstract: We propose MonoSE(3)-Diffusion, a monocular SE(3) diffusion framework that formulates markerless, image-based robot pose estimation as a conditional denoising diffusion process. The framework consists of two processes: a visibility-constrained diffusion process for diverse pose augmentation and a timestep-aware reverse process for progressive pose refinement. The diffusion process progressively perturbs ground-truth poses to noisy transformations for training a pose denoising network. Importantly, we integrate visibility constraints into the process, ensuring the transformations remain within the camera field of view. Compared to the fixed-scale perturbations used in current methods, the diffusion process generates in-view and diverse training poses, thereby improving the network generalization capability. Furthermore, the reverse process iteratively predicts the poses by the denoising network and refines pose estimates by sampling from the diffusion posterior of current timestep, following a scheduled coarse-to-fine procedure. Moreover, the timestep indicates the transformation scales, which guide the denoising network to achieve more accurate pose predictions. The reverse process demonstrates higher robustness than direct prediction, benefiting from its timestep-aware refinement scheme. Our approach demonstrates improvements across two benchmarks (DREAM and RoboKeyGen), achieving a notable AUC of 66.75 on the most challenging dataset, representing a 32.3% gain over the state-of-the-art.

[61] On the Problem of Consistent Anomalies in Zero-Shot Industrial Anomaly Detection cs.CV | stat.APPDF

Tai Le-Gia, Ahn Jaehyun

TL;DR: 论文提出了CoDeGraph算法，通过构建图像级图并利用社区检测过滤一致性异常，显著提升了零样本工业异常检测的性能。

Details

Motivation: 现有的基于表示的方法在处理重复出现的相似缺陷（一致性异常）时表现不佳，影响异常分类和分割的性能。

Result: 在MVTec AD数据集上，AC性能达到98.3% AUROC，AS性能在F1和AP指标上分别提升了4.2%和5.4%。

Insight: 正常块与异常块在相似性表现上的差异（稳定渐进vs.突发高峰）为解决一致性异常问题提供了新思路。

Abstract: Zero-shot image anomaly classification (AC) and segmentation (AS) are vital for industrial quality control, detecting defects without prior training data. Existing representation-based methods compare patch features with nearest neighbors in unlabeled test images but struggle with consistent anomalies – similar defects recurring across multiple images – resulting in poor AC/AS performance. We introduce Consistent-Anomaly Detection Graph (CoDeGraph), a novel algorithm that identifies and filters consistent anomalies from similarity computations. Our key insight is that normal patches in industrial images show stable, gradually increasing similarity to other test images, while consistent-anomaly patches exhibit abrupt similarity spikes after exhausting a limited set of similar matches, a phenomenon we term ``neighbor-burnout.’’ CoDeGraph constructs an image-level graph, with images as nodes and edges connecting those with shared consistent-anomaly patterns, using community detection to filter these anomalies. We provide a theoretical foundation using Extreme Value Theory to explain the effectiveness of our approach. Experiments on MVTec AD with the ViT-L-14-336 backbone achieve 98.3% AUROC for AC and AS performance of 66.8% (+4.2%) F1 and 68.1% (+5.4%) AP over state-of-the-art zero-shot methods. Using the DINOv2 backbone further improves segmentation, yielding 69.1% (+6.5%) F1 and 71.9% (+9.2%) AP, demonstrating robustness across architectures.

[62] Learning from Disagreement: A Group Decision Simulation Framework for Robust Medical Image Segmentation cs.CV | cs.AIPDF

Chen Zhong, Yuxuan Yang, Xinyue Zhang, Ruohan Ma, Yong Guo

TL;DR: 论文提出了一种新的医学图像分割框架，通过模拟临床团队的决策过程，利用专家的分歧信息来生成更鲁棒的分割结果。

Details

Motivation: 医学图像分割标注存在专家间差异性（IRV），传统方法简单地平均标注结果会丢失有价值的临床不确定性信息。论文旨在利用这种分歧信息提升分割的鲁棒性和可信度。

Result: 在CBCT和MRI数据集上分别达到92.11%和90.72%的Dice分数，表现优于传统方法。

Insight: 将专家间的分歧视为有用信号而非噪声，为医疗AI提供了更鲁棒和可信的路径。

Abstract: Medical image segmentation annotation suffers from inter-rater variability (IRV) due to differences in annotators’ expertise and the inherent blurriness of medical images. Standard approaches that simply average expert labels are flawed, as they discard the valuable clinical uncertainty revealed in disagreements. We introduce a fundamentally new approach with our group decision simulation framework, which works by mimicking the collaborative decision-making process of a clinical panel. Under this framework, an Expert Signature Generator (ESG) learns to represent individual annotator styles in a unique latent space. A Simulated Consultation Module (SCM) then intelligently generates the final segmentation by sampling from this space. This method achieved state-of-the-art results on challenging CBCT and MRI datasets (92.11% and 90.72% Dice scores). By treating expert disagreement as a useful signal instead of noise, our work provides a clear path toward more robust and trustworthy AI systems for healthcare.

[63] Post-TIPS Prediction via Multimodal Interaction: A Multi-Center Dataset and Framework for Survival, Complication, and Portal Pressure Assessment cs.CVPDF

Junhao Dong, Dejia Liu, Ruiqi Ding, Zongxing Chen, Yingjie Huang

TL;DR: 论文提出了MultiTIPS数据集和一个多模态预后框架，用于TIPS手术后的综合评估，包括生存、并发症和门脉压力预测。框架包含双选项分割、多模态交互和多任务预测三大模块，解决了当前方法在标注效率、泛化性和多终点预测方面的不足。

Details

Motivation: TIPS手术的预后评估存在标注成本高、单模态方法泛化性差以及单终点预测不全面的问题，亟需一个公开数据集和更可靠的预测框架。

Result: 在MultiTIPS数据集上，提出的方法优于现有技术，表现出强泛化性和可解释性。

Insight: 多模态交互和分阶段多任务训练是提升预后模型性能的关键。公开数据集和框架为临床研究提供了重要资源。

Abstract: Transjugular intrahepatic portosystemic shunt (TIPS) is an established procedure for portal hypertension, but provides variable survival outcomes and frequent overt hepatic encephalopathy (OHE), indicating the necessity of accurate preoperative prognostic modeling. Current studies typically build machine learning models from preoperative CT images or clinical characteristics, but face three key challenges: (1) labor-intensive region-of-interest (ROI) annotation, (2) poor reliability and generalizability of unimodal methods, and (3) incomplete assessment from single-endpoint prediction. Moreover, the lack of publicly accessible datasets constrains research in this field. Therefore, we present MultiTIPS, the first public multi-center dataset for TIPS prognosis, and propose a novel multimodal prognostic framework based on it. The framework comprises three core modules: (1) dual-option segmentation, which integrates semi-supervised and foundation model-based pipelines to achieve robust ROI segmentation with limited annotations and facilitate subsequent feature extraction; (2) multimodal interaction, where three techniques, multi-grained radiomics attention (MGRA), progressive orthogonal disentanglement (POD), and clinically guided prognostic enhancement (CGPE), are introduced to enable cross-modal feature interaction and complementary representation integration, thus improving model accuracy and robustness; and (3) multi-task prediction, where a staged training strategy is used to perform stable optimization of survival, portal pressure gradient (PPG), and OHE prediction for comprehensive prognostic assessment. Extensive experiments on MultiTIPS demonstrate the superiority of the proposed method over state-of-the-art approaches, along with strong cross-domain generalization and interpretability, indicating its promise for clinical application. The dataset and code are available.

Jinjin Cao, Zhiyang Chen, Zijun Wang, Liyuan Ma, Weijian Luo

TL;DR: 论文提出了一种名为跨模态引导（CMG）的训练无关的解码方法，通过降低视觉-语言注意力来解决VLMs中的幻觉问题，显著减少了语言偏差而不损害模型能力。

Details

Motivation: 现有的VLMs在生成响应时容易产生与图像无关的语言流畅回答（幻觉问题），本文分析了语言偏差如何导致这一问题，并提出了解决方案。

Result: 实验证明CMG在幻觉基准测试中显著提升了不同VLMs的性能，且无需额外训练成本。

Insight: 通过调整注意力机制可以有效抑制语言偏差，同时保留VLMs的视觉-语言理解能力。

Abstract: Vision-Language Models (VLMs) have shown solid ability for multimodal understanding of both visual and language contexts. However, existing VLMs often face severe challenges of hallucinations, meaning that VLMs tend to generate responses that are only fluent in the language but irrelevant to images in previous contexts. To address this issue, we analyze how language bias contributes to hallucinations and then introduce Cross-Modal Guidance(CMG), a training-free decoding method that addresses the hallucinations by leveraging the difference between the output distributions of the original model and the one with degraded visual-language attention. In practice, we adaptively mask the attention weight of the most influential image tokens in selected transformer layers to corrupt the visual-language perception as a concrete type of degradation. Such a degradation-induced decoding emphasizes the perception of visual contexts and therefore significantly reduces language bias without harming the ability of VLMs. In experiment sections, we conduct comprehensive studies. All results demonstrate the superior advantages of CMG with neither additional conditions nor training costs. We also quantitatively show CMG can improve different VLM’s performance on hallucination-specific benchmarks and generalize effectively.

[65] DAGLFNet:Deep Attention-Guided Global-Local Feature Fusion for Pseudo-Image Point Cloud Segmentation cs.CV | cs.LGPDF

Chuang Chen, Wenyi Ge

TL;DR: DAGLFNet是一种基于伪图像的语义分割框架，通过全局-局部特征融合和多分支特征提取提升点云的分割性能，同时引入深度特征引导注意力机制优化特征融合。

Details

Motivation: 点云的语义分割在高精度地图和自主导航中至关重要，但现有方法在特征融合和区分性方面存在不足，需要一种更高效且性能优异的方法。

Result: 在SemanticKITTI和nuScenes验证集上分别达到69.83%和78.65%的性能，兼顾高效率和实时性。

Insight: 伪图像表示方法在点云分割中具有潜力，同时全局-局部特征融合和注意力机制的结合能够显著提升模型性能。

Abstract: Environmental perception systems play a critical role in high-precision mapping and autonomous navigation, with LiDAR serving as a core sensor that provides accurate 3D point cloud data. How to efficiently process unstructured point clouds while extracting structured semantic information remains a significant challenge, and in recent years, numerous pseudo-image-based representation methods have emerged to achieve a balance between efficiency and performance. However, they often overlook the structural and semantic details of point clouds, resulting in limited feature fusion and discriminability. In this work, we propose DAGLFNet, a pseudo-image-based semantic segmentation framework designed to extract discriminative features. First, the Global-Local Feature Fusion Encoding module is used to enhance the correlation among local features within a set and capture global contextual information. Second, the Multi-Branch Feature Extraction network is employed to capture more neighborhood information and enhance the discriminability of contour features. Finally, a Feature Fusion via Deep Feature-guided Attention mechanism is introduced to improve the precision of cross-channel feature fusion. Experimental evaluations show that DAGLFNet achieves 69.83% and 78.65% on the validation sets of SemanticKITTI and nuScenes, respectively. The method balances high performance with real-time capability, demonstrating great potential for LiDAR-based real-time applications.

[66] MSF-Mamba: Motion-aware State Fusion Mamba for Efficient Micro-Gesture Recognition cs.CVPDF

Deng Li, Jun Shao, Bohao Xing, Rong Gao, Bihan Wen

TL;DR: 该论文提出了一种名为MSF-Mamba的高效模型，用于微手势识别（MGR），通过引入运动感知状态融合模块和多尺度版本（MSF-Mamba+），显著提升了局部时空依赖性建模能力。

Details

Motivation: 微手势识别需要对长距离和局部时空依赖进行精确建模，而现有方法（如CNN和Transformer）各有局限性：CNN难以捕捉长距离依赖，Transformer计算成本高。Mamba虽高效但缺乏局部时空建模能力。

Result: 在两个公开MGR数据集上，MSF-Mamba及MSF-Mamba+均达到最先进性能，超越CNN、Transformer和SSM基线模型，且保持高效。

Insight: 通过增强Mamba的局部时空建模能力，结合运动感知和多尺度动态加权策略，能够高效捕获微手势的细微运动线索。

Abstract: Micro-gesture recognition (MGR) targets the identification of subtle and fine-grained human motions and requires accurate modeling of both long-range and local spatiotemporal dependencies. While CNNs are effective at capturing local patterns, they struggle with long-range dependencies due to their limited receptive fields. Transformer-based models address this limitation through self-attention mechanisms but suffer from high computational costs. Recently, Mamba has shown promise as an efficient model, leveraging state space models (SSMs) to enable linear-time processing However, directly applying the vanilla Mamba to MGR may not be optimal. This is because Mamba processes inputs as 1D sequences, with state updates relying solely on the previous state, and thus lacks the ability to model local spatiotemporal dependencies. In addition, previous methods lack a design of motion-awareness, which is crucial in MGR. To overcome these limitations, we propose motion-aware state fusion mamba (MSF-Mamba), which enhances Mamba with local spatiotemporal modeling by fusing local contextual neighboring states. Our design introduces a motion-aware state fusion module based on central frame difference (CFD). Furthermore, a multiscale version named MSF-Mamba+ has been proposed. Specifically, MSF-Mamba supports multiscale motion-aware state fusion, as well as an adaptive scale weighting module that dynamically weighs the fused states across different scales. These enhancements explicitly address the limitations of vanilla Mamba by enabling motion-aware local spatiotemporal modeling, allowing MSF-Mamba and MSF-Mamba to effectively capture subtle motion cues for MGR. Experiments on two public MGR datasets demonstrate that even the lightweight version, namely, MSF-Mamba, achieves SoTA performance, outperforming existing CNN-, Transformer-, and SSM-based models while maintaining high efficiency.

Yunlong Deng, Guangyi Chen, Tianpei Gu, Lingjing Kong, Yan Li

TL;DR: 该论文提出了一种基于三角一致性原则的自优化框架，验证了视觉-语言模型（VLMs）具有自我生成高质量监督数据的能力，无需外部输入即可自主学习。

Details

Motivation: 研究旨在探索无监督指令训练的VLMs潜力，验证其自我优化能力，以减少对外部监督数据的依赖。

Result: 实验表明，该方法在多个基准测试中实现了一致的性能提升，且无需外部监督。

Insight: VLMs具有内在的自我优化能力，未来研究可进一步探索其学习机制。

Abstract: Vision-Language Models (VLMs) integrate visual knowledge with the analytical capabilities of Large Language Models (LLMs) through supervised visual instruction tuning, using image-question-answer triplets. However, the potential of VLMs trained without supervised instruction remains largely unexplored. This study validates that VLMs possess inherent self-refinement capabilities, enabling them to generate high-quality supervised data without external inputs and thereby learn autonomously. Specifically, to stimulate the self-refinement ability of VLMs, we propose a self-refinement framework based on a Triangular Consistency principle: within the image-query-answer triangle, any masked elements should be consistently and accurately reconstructed. The framework involves three steps: (1) We enable the instruction generation ability of VLMs by adding multi-task instruction tuning like image$\rightarrow$question-answer or image-answer$\rightarrow$question. (2) We generate image-query-answer triplets from unlabeled images and use the Triangular Consistency principle for filtering. (3) The model is further updated using the filtered synthetic data. To investigate the underlying mechanisms behind this self-refinement capability, we conduct a theoretical analysis from a causal perspective. Using the widely recognized LLaVA-1.5 as our baseline, our experiments reveal that the model can autonomously achieve consistent, though deliberately modest, improvements across multiple benchmarks without any external supervision, such as human annotations or environmental feedback. We expect that the insights of this study on the self-refinement ability of VLMs can inspire future research on the learning mechanism of VLMs. Code is available at https://github.com/dengyl20/SRF-LLaVA-1.5.

[68] VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning cs.CVPDF

Qunzhong Wang, Jie Liu, Jiajun Liang, Yilei Jiang, Yuanxing Zhang

TL;DR: VR-Thinker是一个通过思考与图像推理增强视频奖励模型的框架，解决了传统奖励模型在处理视觉输入时的局限性，显著提高了长视频的评估准确率。

Details

Motivation: 现有的多模态奖励模型（RMs）在处理视觉输入时存在两个主要问题：视觉输入占用大量上下文预算导致细节丢失，以及初始提示中打包所有视觉信息加剧了幻觉和遗忘。

Result: 在多个视频偏好基准测试中取得了领先的准确率，例如7B VR-Thinker在VideoGen Reward上达到80.5%，GenAI-Bench上82.3%，MJ-Bench-Video上75.6%。

Insight: 思考与图像的推理方式显著提升了多模态奖励模型的性能，尤其是在处理长视频时，验证了这一方法的有效性。

Abstract: Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) Cold Start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) select samples whose per-dimension and overall judgments are all correct, then conduct Rejection sampling Fine-Tuning on these high-quality traces to further enhance reasoning; and (iii) apply Group Relative Policy Optimization (GRPO) to strengthen reasoning. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.

[69] Receptive Field Expanded Look-Up Tables for Vision Inference: Advancing from Low-level to High-level Tasks cs.CVPDF

Xi Zhang, Xiaolin Wu

TL;DR: 该论文提出了一种新颖的查找表（LUT）方法，通过扩展卷积核的感受野并引入优化技术，实现了在不增加表大小的情况下提升CNN推理性能的目标。

Details

Motivation: 现有的LUT方法因卷积核感受野受限（由于表大小的组合爆炸）而难以在高层次任务中表现良好。为此，研究旨在扩展感受野，同时保持空间复杂度不变。

Result: 该方法在速度、精度和内存效率之间取得了有效平衡，显著优于现有LUT方法。

Insight: 通过优化量化策略和引入多级上下文捕捉机制，可以在固定表大小的情况下显著提升CNN推理的效率和性能。

Abstract: Recently, several look-up table (LUT) methods were developed to greatly expedite the inference of CNNs in a classical strategy of trading space for speed. However, these LUT methods suffer from a common drawback of limited receptive field of the convolution kernels due to the combinatorial explosion of table size. This research aims to expand the CNN receptive field with a fixed table size, thereby enhancing the performance of LUT-driven fast CNN inference while maintaining the same space complexity. To achieve this goal, various techniques are proposed. The main contribution is a novel approach of learning an optimal lattice vector quantizer that adaptively allocates the quantization resolution across data dimensions based on their significance to the inference task. In addition, the lattice vector quantizer offers an inherently more accurate approximation of CNN kernels than scalar quantizer as used in current practice. Furthermore, we introduce other receptive field expansion strategies, including irregular dilated convolutions and a U-shaped cascaded LUT structure, designed to capture multi-level contextual information without inflating table size. Together, these innovations allow our approach to effectively balance speed, accuracy, and memory efficiency, demonstrating significant improvements over existing LUT methods.

Yang Liu, Yufei Yin, Chenchen Jing, Muzhi Zhu, Hao Chen

TL;DR: COSINE是一个统一的开世界分割模型，结合了开放词汇分割和上下文分割，利用多模态提示（如文本和图像）实现高效分割。

Details

Motivation: 现有的开放词汇分割和上下文分割方法存在架构差异和学习目标不一致的问题，限制了模型的通用性和性能。

Result: 实验表明COSINE在开放词汇和上下文分割任务中性能显著提升，多模态提示的协同作用优于单模态方法。

Insight: 多模态提示（视觉和文本）协同能显著提升泛化能力，统一框架克服了传统方法的局限性。

Abstract: In this work, we present COSINE, a unified open-world segmentation model that consolidates open-vocabulary segmentation and in-context segmentation with multi-modal prompts (e.g., text and image). COSINE exploits foundation models to extract representations for an input image and corresponding multi-modal prompts, and a SegDecoder to align these representations, model their interaction, and obtain masks specified by input prompts across different granularities. In this way, COSINE overcomes architectural discrepancies, divergent learning objectives, and distinct representation learning strategies of previous pipelines for open-vocabulary segmentation and in-context segmentation. Comprehensive experiments demonstrate that COSINE has significant performance improvements in both open-vocabulary and in-context segmentation tasks. Our exploratory analyses highlight that the synergistic collaboration between using visual and textual prompts leads to significantly improved generalization over single-modality approaches.

[71] Layout-Independent License Plate Recognition via Integrated Vision and Language Models cs.CVPDF

Elham Shabaninia, Fatemeh Asadi-zeydabadi, Hossein Nezamabadi-pour

TL;DR: 提出了一种不依赖于布局的车牌识别方法，通过结合视觉和语言模型实现高精度识别。

Details

Motivation: 传统车牌识别方法依赖于手动布局分类或启发式修正，难以应对多样化的车牌布局和复杂现实条件。

Result: 在多个国际数据集上表现出更高的准确性和鲁棒性，优于最近的免分割方法。

Insight: 将模式分析嵌入识别阶段，结合视觉和语言建模，增强了车牌识别的适应性和鲁棒性。

Abstract: This work presents a pattern-aware framework for automatic license plate recognition (ALPR), designed to operate reliably across diverse plate layouts and challenging real-world conditions. The proposed system consists of a modern, high-precision detection network followed by a recognition stage that integrates a transformer-based vision model with an iterative language modelling mechanism. This unified recognition stage performs character identification and post-OCR refinement in a seamless process, learning the structural patterns and formatting rules specific to license plates without relying on explicit heuristic corrections or manual layout classification. Through this design, the system jointly optimizes visual and linguistic cues, enables iterative refinement to improve OCR accuracy under noise, distortion, and unconventional fonts, and achieves layout-independent recognition across multiple international datasets (IR-LPR, UFPR-ALPR, AOLP). Experimental results demonstrate superior accuracy and robustness compared to recent segmentation-free approaches, highlighting how embedding pattern analysis within the recognition stage bridges computer vision and language modelling for enhanced adaptability in intelligent transportation and surveillance applications.

[72] MCE: Towards a General Framework for Handling Missing Modalities under Imbalanced Missing Rates cs.CV | cs.LG | cs.MMPDF

Binyu Zhao, Wei Zhang, Zhaonian Zou

TL;DR: 论文提出了一种名为MCE的通用框架，用于处理不平衡缺失率下的多模态缺失问题。MCE通过动态平衡模态学习进度和增强特征表示能力，显著提高了性能。

Details

Motivation: 多模态学习中，不平衡的缺失率会导致某些模态学习不足和特征退化，现有方法忽略了样本级模态效用和特征质量下降的问题。

Result: 在四个多模态基准测试中，MCE在不同缺失配置下均优于现有方法。

Insight: 动态平衡模态学习进度和增强特征表示是处理不平衡缺失率问题的关键。

Abstract: Multi-modal learning has made significant advances across diverse pattern recognition applications. However, handling missing modalities, especially under imbalanced missing rates, remains a major challenge. This imbalance triggers a vicious cycle: modalities with higher missing rates receive fewer updates, leading to inconsistent learning progress and representational degradation that further diminishes their contribution. Existing methods typically focus on global dataset-level balancing, often overlooking critical sample-level variations in modality utility and the underlying issue of degraded feature quality. We propose Modality Capability Enhancement (MCE) to tackle these limitations. MCE includes two synergistic components: i) Learning Capability Enhancement (LCE), which introduces multi-level factors to dynamically balance modality-specific learning progress, and ii) Representation Capability Enhancement (RCE), which improves feature semantics and robustness through subset prediction and cross-modal completion tasks. Comprehensive evaluations on four multi-modal benchmarks show that MCE consistently outperforms state-of-the-art methods under various missing configurations. The journal preprint version is now available at https://doi.org/10.1016/j.patcog.2025.112591. Our code is available at https://github.com/byzhaoAI/MCE.

[73] GLOFNet – A Multimodal Dataset for GLOF Monitoring and Prediction cs.CV | cs.AIPDF

Zuha Fatima, Muhammad Anser Sohaib, Muhammad Talha, Sidra Sultana, Ayesha Kanwal

TL;DR: GLOFNet是一个多模态数据集，旨在支持冰川湖溃决洪水（GLOFs）的监测和预测研究。它整合了Sentinel-2多光谱影像、NASA ITS_LIVE冰川运动数据和MODIS地表温度记录，通过预处理和多模态融合解决了数据分散和单模态问题。

Details

Motivation: GLOFs是一种罕见但极具破坏性的高山灾害，但预测研究因数据分散和单模态特性而受限，亟需一个统一的多模态数据集来支持预测。

Result: 分析揭示了冰川运动的季节周期性、长期变暖趋势（每十年约0.8K）和低温环境的空间异质性。

Insight: GLOFNet通过解决类别不平衡、云污染和低分辨率等问题，为多模态深度学习在罕见灾害预测中的应用提供了基础。

Abstract: Glacial Lake Outburst Floods (GLOFs) are rare but destructive hazards in high mountain regions, yet predictive research is hindered by fragmented and unimodal data. Most prior efforts emphasize post-event mapping, whereas forecasting requires harmonized datasets that combine visual indicators with physical precursors. We present GLOFNet, a multimodal dataset for GLOF monitoring and prediction, focused on the Shisper Glacier in the Karakoram. It integrates three complementary sources: Sentinel-2 multispectral imagery for spatial monitoring, NASA ITS_LIVE velocity products for glacier kinematics, and MODIS Land Surface Temperature records spanning over two decades. Preprocessing included cloud masking, quality filtering, normalization, temporal interpolation, augmentation, and cyclical encoding, followed by harmonization across modalities. Exploratory analysis reveals seasonal glacier velocity cycles, long-term warming of ~0.8 K per decade, and spatial heterogeneity in cryospheric conditions. The resulting dataset, GLOFNet, is publicly available to support future research in glacial hazard prediction. By addressing challenges such as class imbalance, cloud contamination, and coarse resolution, GLOFNet provides a structured foundation for benchmarking multimodal deep learning approaches to rare hazard prediction.

[74] Equipping Vision Foundation Model with Mixture of Experts for Out-of-Distribution Detection cs.CVPDF

Shizhen Zhao, Jiahui Liu, Xin Wen, Haoru Tan, Xiaojuan Qi

TL;DR: 该论文研究了预训练视觉基础模型（如DINOv2）在OOD检测任务中的应用潜力，并提出了一种新的混合特征专家模块（MoFE）和动态β混合策略（Dynamic-β Mixup），以提升模型在大语义空间下的性能。

Details

Motivation: 预训练视觉基础模型在许多计算机视觉任务中表现出色，但其在OOD检测任务中的潜力尚未被充分挖掘。论文旨在探索这些模型的OOD检测能力，并提出改进方法以应对大语义空间的挑战。

Result: 实验表明，MoFE模块和Dynamic-β Mixup策略显著提升了模型在大语义空间下的OOD检测性能，优于基线方法。

Insight: 1. 预训练基础模型可直接用于OOD检测，无需微调。2. 大语义空间中，决策边界复杂度增加，需通过特征分割和动态混合策略优化。

Abstract: Pre-trained vision foundation models have transformed many computer vision tasks. Despite their strong ability to learn discriminative and generalizable features crucial for out-of-distribution (OOD) detection, their impact on this task remains underexplored. Motivated by this gap, we systematically investigate representative vision foundation models for OOD detection. Our findings reveal that a pre-trained DINOv2 model, even without fine-tuning on in-domain (ID) data, naturally provides a highly discriminative feature space for OOD detection, achieving performance comparable to existing state-of-the-art methods without requiring complex designs. Beyond this, we explore how fine-tuning foundation models on in-domain (ID) data can enhance OOD detection. However, we observe that the performance of vision foundation models remains unsatisfactory in scenarios with a large semantic space. This is due to the increased complexity of decision boundaries as the number of categories grows, which complicates the optimization process. To mitigate this, we propose the Mixture of Feature Experts (MoFE) module, which partitions features into subspaces, effectively capturing complex data distributions and refining decision boundaries. Further, we introduce a Dynamic-$\beta$ Mixup strategy, which samples interpolation weights from a dynamic beta distribution. This adapts to varying levels of learning difficulty across categories, improving feature learning for more challenging categories. Extensive experiments demonstrate the effectiveness of our approach, significantly outperforming baseline methods.

[75] A Simple and Better Baseline for Visual Grounding cs.CVPDF

Jingchao Wang, Wenlong Zhang, Dingjiang Huang, Hong Wang, Yefeng Zheng

TL;DR: 论文提出了一种名为FSVG的简单但高效的方法，用于视觉定位任务，通过特征选择机制减少了计算开销，同时保持了高精度。

Details

Motivation: 现有的视觉定位方法虽然性能出色，但依赖于复杂的迭代过程和缓存机制，导致计算开销大。论文旨在设计一种更简单的基线方法，减少计算负担。

Result: 在多个基准数据集上的实验表明，FSVG在精度和效率上优于当前最优方法。

Insight: 简化网络架构和引入特征选择机制可以有效提升视觉定位任务的效率和性能。

Abstract: Visual grounding aims to predict the locations of target objects specified by textual descriptions. For this task with linguistic and visual modalities, there is a latest research line that focuses on only selecting the linguistic-relevant visual regions for object localization to reduce the computational overhead. Albeit achieving impressive performance, it is iteratively performed on different image scales, and at every iteration, linguistic features and visual features need to be stored in a cache, incurring extra overhead. To facilitate the implementation, in this paper, we propose a feature selection-based simple yet effective baseline for visual grounding, called FSVG. Specifically, we directly encapsulate the linguistic and visual modalities into an overall network architecture without complicated iterative procedures, and utilize the language in parallel as guidance to facilitate the interaction between linguistic modal and visual modal for extracting effective visual features. Furthermore, to reduce the computational cost, during the visual feature learning, we introduce a similarity-based feature selection mechanism to only exploit language-related visual features for faster prediction. Extensive experiments conducted on several benchmark datasets comprehensively substantiate that the proposed FSVG achieves a better balance between accuracy and efficiency beyond the current state-of-the-art methods. Code is available at https://github.com/jcwang0602/FSVG.

[76] ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models cs.CVPDF

Yuqi Liu, Liangyu Chen, Jiazhen Liu, Mingkang Zhu, Zhisheng Zhong

TL;DR: ViSurf提出了一种统一的后训练范式，结合了监督微调（SFT）和强化学习（RLVR）的优点，通过注入真实标签和引入奖励控制策略，显著提升了大规模视觉语言模型的性能。

Details

Motivation: 现有的SFT和RLVR方法各自存在局限性：SFT性能次优，RLVR难以处理超出模型内部知识的任务。ViSurf旨在结合两者的优势，提供更高效的后训练方案。

Result: 在多个基准测试中，ViSurf的表现优于单独的SFT、RLVR以及两阶段SFT→RLVR，验证了其设计的有效性。

Insight: 结合外部监督和内部强化是提升模型性能的有效途径，奖励控制策略对训练稳定性至关重要。

Abstract: Typical post-training paradigms for Large Vision-and-Language Models (LVLMs) include Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR). SFT leverages external guidance to inject new knowledge, whereas RLVR utilizes internal reinforcement to enhance reasoning capabilities and overall performance. However, our analysis reveals that SFT often leads to sub-optimal performance, while RLVR struggles with tasks that exceed the model’s internal knowledge base. To address these limitations, we propose ViSurf (\textbf{Vi}sual \textbf{Su}pervised-and-\textbf{R}einforcement \textbf{F}ine-Tuning), a unified post-training paradigm that integrates the strengths of both SFT and RLVR within a single stage. We analyze the derivation of the SFT and RLVR objectives to establish the ViSurf objective, providing a unified perspective on these two paradigms. The core of ViSurf involves injecting ground-truth labels into the RLVR rollouts, thereby providing simultaneous external supervision and internal reinforcement. Furthermore, we introduce three novel reward control strategies to stabilize and optimize the training process. Extensive experiments across several diverse benchmarks demonstrate the effectiveness of ViSurf, outperforming both individual SFT, RLVR, and two-stage SFT \textrightarrow RLVR. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf.

[77] OmniQuality-R: Advancing Reward Models Through All-Encompassing Quality Assessment cs.CVPDF

Yiting Lu, Fengbin Guan, Yixin Gao, Yan Zhong, Xinge Peng

TL;DR: 本文提出了一种统一的奖励建模框架OmniQuality-R，通过将多任务质量推理转化为连续且可解释的奖励信号以优化策略。该方法结合了主观实验思想，构建了一个推理增强的奖励建模数据集，并使用GRPO进行后训练。通过STD过滤和熵门控机制稳定训练，提高了下游任务的泛化能力。

Details

Motivation: 当前的视觉评估方法通常局限于单一任务，缺乏统一的多任务质量评估框架。

Result: 在美学质量评估、技术质量评估和文本图像对齐三个关键IQA任务上进行了验证。

Insight: 通过多任务推理和连续奖励信号的设计，OmniQuality-R能够有效提升策略优化的稳定性和泛化能力。

Abstract: Current visual evaluation approaches are typically constrained to a single task. To address this, we propose OmniQuality-R, a unified reward modeling framework that transforms multi-task quality reasoning into continuous and interpretable reward signals for policy optimization. Inspired by subjective experiments, where participants are given task-specific instructions outlining distinct assessment principles prior to evaluation, we propose OmniQuality-R, a structured reward modeling framework that transforms multi-dimensional reasoning into continuous and interpretable reward signals. To enable this, we construct a reasoning-enhanced reward modeling dataset by sampling informative plan-reason trajectories via rejection sampling, forming a reliable chain-of-thought (CoT) dataset for supervised fine-tuning (SFT). Building on this, we apply Group Relative Policy Optimization (GRPO) for post-training, using a Gaussian-based reward to support continuous score prediction. To further stabilize the training and improve downstream generalization, we incorporate standard deviation (STD) filtering and entropy gating mechanisms during reinforcement learning. These techniques suppress unstable updates and reduce variance in policy optimization. We evaluate OmniQuality-R on three key IQA tasks: aesthetic quality assessment, technical quality evaluation, and text-image alignment.

[78] DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis cs.CV | cs.AIPDF

Peiyin Chen, Zhuowei Yang, Hui Feng, Sheng Jiang, Rui Yan

TL;DR: DEMO提出了一种基于流匹配的生成框架，用于音频驱动的说话肖像视频合成，实现了唇部运动、头部姿态和眼部的精细化控制。

Details

Motivation: 尽管基于扩散模型的音频驱动说话头生成技术快速发展，但生成时间一致且具有精细化运动控制的视频仍具挑战性。

Result: 在多基准测试中，DEMO在视频真实性、唇音同步和运动保真度上优于现有方法。

Insight: 将精细化运动解耦与基于流的生成模型结合，为可控说话头视频合成提供了新范式。

Abstract: Audio-driven talking-head generation has advanced rapidly with diffusion-based generative models, yet producing temporally coherent videos with fine-grained motion control remains challenging. We propose DEMO, a flow-matching generative framework for audio-driven talking-portrait video synthesis that delivers disentangled, high-fidelity control of lip motion, head pose, and eye gaze. The core contribution is a motion auto-encoder that builds a structured latent space in which motion factors are independently represented and approximately orthogonalized. On this disentangled motion space, we apply optimal-transport-based flow matching with a transformer predictor to generate temporally smooth motion trajectories conditioned on audio. Extensive experiments across multiple benchmarks show that DEMO outperforms prior methods in video realism, lip-audio synchronization, and motion fidelity. These results demonstrate that combining fine-grained motion disentanglement with flow-based generative modeling provides a powerful new paradigm for controllable talking-head video synthesis.

[79] A Machine Learning Perspective on Automated Driving Corner Cases cs.CVPDF

Sebastian Schmidt, Julius Körner, Stephan Günnemann

TL;DR: 论文提出了一种基于机器学习的新视角，用于自动驾驶中的corner case识别，通过考虑数据分布来统一现有方法，并在多个基准测试中表现出色。

Details

Motivation: 传统方法将自动驾驶中的corner case视为孤立问题，缺乏数据覆盖和泛化能力，本文提出了一种基于数据分布的机器学习方法来解决这一问题。

Result: 方法统一了现有corner case分类，在检测任务中表现优异，并通过新数据集支持组合corner case分析。

Insight: corner case识别应从数据分布角度出发，而非孤立场景，这为自动驾驶安全提供了更普适的理论基础。

Abstract: For high-stakes applications, like autonomous driving, a safe operation is necessary to prevent harm, accidents, and failures. Traditionally, difficult scenarios have been categorized into corner cases and addressed individually. However, this example-based categorization is not scalable and lacks a data coverage perspective, neglecting the generalization to training data of machine learning models. In our work, we propose a novel machine learning approach that takes the underlying data distribution into account. Based on our novel perspective, we present a framework for effective corner case recognition for perception on individual samples. In our evaluation, we show that our approach (i) unifies existing scenario-based corner case taxonomies under a distributional perspective, (ii) achieves strong performance on corner case detection tasks across standard benchmarks for which we extend established out-of-distribution detection benchmarks, and (iii) enables analysis of combined corner cases via a newly introduced fog-augmented Lost & Found dataset. These results provide a principled basis for corner case recognition, underlining our manual specification-free definition.

[80] Stability Under Scrutiny: Benchmarking Representation Paradigms for Online HD Mapping cs.CVPDF

Hao Shan, Ruikai Li, Han Jiang, Yizhe Fan, Ziyang Yan

TL;DR: 该论文首次系统地研究了在线高精地图（HD mapping）的时间稳定性问题，提出了一个多维稳定性评估框架，并引入了新的指标和统一的mAS评分标准。通过大规模实验，揭示了精度（mAP）与稳定性（mAS）是两个相对独立的性能维度，并分析了模型设计对两者的影响。

Details

Motivation: 在线高精地图是自动驾驶的核心模块之一，但由于传感器空间位移导致的映射结果不稳定，对下游任务提出了挑战。现有模型多关注单帧映射精度，而忽略了稳定性问题。

Result: 实验表明，精度（mAP）和稳定性（mAS）是两个独立的性能维度，并识别了提升单一或双重性能的模型设计和训练因素。

Insight: 时间稳定性应与精度同等重要，作为核心评估标准，以推动更可靠的自动驾驶系统发展。

Abstract: As one of the fundamental modules in autonomous driving, online high-definition (HD) maps have attracted significant attention due to their cost-effectiveness and real-time capabilities. Since vehicles always cruise in highly dynamic environments, spatial displacement of onboard sensors inevitably causes shifts in real-time HD mapping results, and such instability poses fundamental challenges for downstream tasks. However, existing online map construction models tend to prioritize improving each frame’s mapping accuracy, while the mapping stability has not yet been systematically studied. To fill this gap, this paper presents the first comprehensive benchmark for evaluating the temporal stability of online HD mapping models. We propose a multi-dimensional stability evaluation framework with novel metrics for Presence, Localization, and Shape Stability, integrated into a unified mean Average Stability (mAS) score. Extensive experiments on 42 models and variants show that accuracy (mAP) and stability (mAS) represent largely independent performance dimensions. We further analyze the impact of key model design choices on both criteria, identifying architectural and training factors that contribute to high accuracy, high stability, or both. To encourage broader focus on stability, we will release a public benchmark. Our work highlights the importance of treating temporal stability as a core evaluation criterion alongside accuracy, advancing the development of more reliable autonomous driving systems. The benchmark toolkit, code, and models will be available at https://stablehdmap.github.io/.

[81] Scalable Face Security Vision Foundation Model for Deepfake, Diffusion, and Spoofing Detection cs.CV | cs.AI | I.4.10; I.2.10; I.5.0PDF

Gaojian Wang, Feng Lin, Tong Wu, Zhisheng Yan, Kui Ren

TL;DR: 论文提出了一种名为FS-VFM的自监督预训练框架，通过学习真实人脸图像的表示，增强在各种人脸安全任务中的泛化能力。通过结合掩码图像建模（MIM）和实例判别（ID），设计了三种学习目标（3C），并提出了CRFR-P掩码策略和自蒸馏机制。此外，还提出了一种轻量级适配器FS-Adapter，用于高效迁移预训练模型。实验证明FS-VFM优于多种现有方法。

Details

Motivation: 现实世界中存在大量未标记的真实人脸数据，如何从中学习鲁棒且可迁移的人脸表示，以提升在多种人脸安全任务中的泛化能力，是一个重要问题。

Result: 在11个公开基准测试中，FS-VFM表现出优异的泛化能力，超越了多种现有方法，包括自然和面部领域的基准模型及任务特定方法。

Insight: 1. 通过结合局部和全局学习目标，可以提高人脸表示的鲁棒性；2. 轻量级适配器是高效迁移预训练模型的有效工具。

Abstract: With abundant, unlabeled real faces, how can we learn robust and transferable facial representations to boost generalization across various face security tasks? We make the first attempt and propose FS-VFM, a scalable self-supervised pre-training framework, to learn fundamental representations of real face images. We introduce three learning objectives, namely 3C, that synergize masked image modeling (MIM) and instance discrimination (ID), empowering FS-VFM to encode both local patterns and global semantics of real faces. Specifically, we formulate various facial masking strategies for MIM and devise a simple yet effective CRFR-P masking, which explicitly prompts the model to pursue meaningful intra-region Consistency and challenging inter-region Coherency. We present a reliable self-distillation mechanism that seamlessly couples MIM with ID to establish underlying local-to-global Correspondence. After pre-training, vanilla vision transformers (ViTs) serve as universal Vision Foundation Models for downstream Face Security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forensics. To efficiently transfer the pre-trained FS-VFM, we further propose FS-Adapter, a lightweight plug-and-play bottleneck atop the frozen backbone with a novel real-anchor contrastive objective. Extensive experiments on 11 public benchmarks demonstrate that our FS-VFM consistently generalizes better than diverse VFMs, spanning natural and facial domains, fully, weakly, and self-supervised paradigms, small, base, and large ViT scales, and even outperforms SOTA task-specific methods, while FS-Adapter offers an excellent efficiency-performance trade-off. The code and models are available on https://fsfm-3c.github.io/fsvfm.html.

[82] AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes cs.CVPDF

Yu Li, Menghan Xia, Gongye Liu, Jianhong Bai, Xintao Wang

TL;DR: 该论文提出了AdaViewPlanner方法，通过两阶段范式将预训练的文本生成视频（T2V）模型适配用于4D场景的视角规划，证明了视频生成模型在真实世界4D交互中的潜力。

Details

Motivation: 利用文本生成视频模型的强大视觉模拟能力，探索其作为隐式世界模型的潜力，并将其适配用于4D场景的视角规划任务。

Result: 实验结果表明，该方法在视角规划任务上优于现有竞争对手，并通过消融实验验证了关键技术设计的有效性。

Insight: 视频生成模型可以作为隐式世界模型，为4D交互任务提供新的视角规划解决方案。

Abstract: Recent Text-to-Video (T2V) models have demonstrated powerful capability in visual simulation of real-world geometry and physical laws, indicating its potential as implicit world models. Inspired by this, we explore the feasibility of leveraging the video generation prior for viewpoint planning from given 4D scenes, since videos internally accompany dynamic scenes with natural viewpoints. To this end, we propose a two-stage paradigm to adapt pre-trained T2V models for viewpoint prediction, in a compatible manner. First, we inject the 4D scene representation into the pre-trained T2V model via an adaptive learning branch, where the 4D scene is viewpoint-agnostic and the conditional generated video embeds the viewpoints visually. Then, we formulate viewpoint extraction as a hybrid-condition guided camera extrinsic denoising process. Specifically, a camera extrinsic diffusion branch is further introduced onto the pre-trained T2V model, by taking the generated video and 4D scene as input. Experimental results show the superiority of our proposed method over existing competitors, and ablation studies validate the effectiveness of our key technical designs. To some extent, this work proves the potential of video generation models toward 4D interaction in real world.

[83] Image-to-Video Transfer Learning based on Image-Language Foundation Models: A Comprehensive Survey cs.CV | cs.AIPDF

Jinxuan Li, Chaolei Tan, Haoxuan Chen, Jianxin Ma, Jian-Fang Hu

TL;DR: 该综述首次全面回顾了基于图像-语言基础模型（ILFM）的图像到视频迁移学习领域，分类了现有方法（冻结特征与修改特征），并分析了其在不同视频文本任务中的应用效果，最后指出了未来研究方向。

Details

Motivation: 现有视频-文本研究的需求推动了从图像领域迁移学习的兴趣，以缓解从头训练视频-语言基础模型的数据与计算成本问题。

Result: 实验分析了不同迁移学习范式在视频理解任务中的表现，验证了方法的有效性。

Insight: 基于ILFM的迁移学习为视频-文本任务提供了高效解决方案，未来可探索更动态的特征调整方法或跨模态对齐优化。

Abstract: Image-Language Foundation Models (ILFM) have demonstrated remarkable success in image-text understanding/generation tasks, providing transferable multimodal representations that generalize across diverse downstream image-based tasks. The advancement of video-text research has spurred growing interest in extending image-based models to the video domain. This paradigm, known as image-to-video transfer learning, succeeds in alleviating the substantial data and computational requirements associated with training video-language foundation models from scratch for video-text learning. This survey provides the first comprehensive review of this emerging field, which begins by summarizing the widely used ILFM and their capabilities. We then systematically classify existing image-to-video transfer learning strategies into two categories: frozen features and modified features, depending on whether the original representations from ILFM are preserved or undergo modifications. Building upon the task-specific nature of image-to-video transfer, this survey methodically elaborates these strategies and details their applications across a spectrum of video-text learning tasks, ranging from fine-grained (e.g., spatio-temporal video grounding) to coarse-grained (e.g., video question answering). We further present a detailed experimental analysis to investigate the efficacy of different image-to-video transfer learning paradigms on a range of downstream video understanding tasks. Finally, we identify prevailing challenges and highlight promising directions for future research. By offering a comprehensive and structured overview, this survey aims to establish a structured roadmap for advancing video-text learning based on existing ILFM, and to inspire future research directions in this rapidly evolving domain.

Yuxiang Luo, Qing Xu, Hai Huang, Yuqi Ouyang, Zhen Chen

TL;DR: MSM-Seg提出了一种新的双记忆分割框架，利用多模态和切片信息结合类别无关提示，实现了多模态脑肿瘤分割的优越性能。

Details

Motivation: 现有基于提示的分割方法忽视跨模态相关性，且依赖类别特定的提示，限制了在真实场景中的适用性。本文旨在解决这些问题。

Result: 在多模态MRI数据集上，MSM-Seg在转移瘤和胶质瘤分割任务中优于现有方法。

Insight: 类别无关提示和多模态/切片信息的结合可以显著提升分割性能，适用于复杂的临床场景。

Abstract: Multi-modal brain tumor segmentation is critical for clinical diagnosis, and it requires accurate identification of distinct internal anatomical subregions. While the recent prompt-based segmentation paradigms enable interactive experiences for clinicians, existing methods ignore cross-modal correlations and rely on labor-intensive category-specific prompts, limiting their applicability in real-world scenarios. To address these issues, we propose a MSM-Seg framework for multi-modal brain tumor segmentation. The MSM-Seg introduces a novel dual-memory segmentation paradigm that synergistically integrates multi-modal and inter-slice information with the efficient category-agnostic prompt for brain tumor understanding. To this end, we first devise a modality-and-slice memory attention (MSMA) to exploit the cross-modal and inter-slice relationships among the input scans. Then, we propose a multi-scale category-agnostic prompt encoder (MCP-Encoder) to provide tumor region guidance for decoding. Moreover, we devise a modality-adaptive fusion decoder (MF-Decoder) that leverages the complementary decoding information across different modalities to improve segmentation accuracy. Extensive experiments on different MRI datasets demonstrate that our MSM-Seg framework outperforms state-of-the-art methods in multi-modal metastases and glioma tumor segmentation. The code is available at https://github.com/xq141839/MSM-Seg.

[85] Action-Dynamics Modeling and Cross-Temporal Interaction for Online Action Understanding cs.CVPDF

Xinyu Yang, Zheheng Jiang, Feixiang Zhou, Yihang Zhu, Na Lv

TL;DR: 该论文提出了一种名为状态特定模型（SSM）的新框架，用于统一和增强动作检测与预测任务，通过关键状态压缩、动作模式学习和跨时间交互模块解决未修剪视频中的冗余信息和噪声问题，并在多个数据集上验证了其性能优越性。

Details

Motivation: 未修剪视频常包含大量冗余信息和噪声，且现有方法常忽略智能体意图对动作的影响。为此，论文提出了一种新框架来解决这些问题。

Result: 在EPIC-Kitchens-100、THUMOS’14等数据集上表现优于现有方法，验证了动作动态学习和跨时间交互的重要性。

Insight: 动作动态建模和意图与历史信息的交互对动作理解至关重要，为未来研究提供了新方向。

Abstract: Action understanding, encompassing action detection and anticipation, plays a crucial role in numerous practical applications. However, untrimmed videos are often characterized by substantial redundant information and noise. Moreover, in modeling action understanding, the influence of the agent’s intention on the action is often overlooked. Motivated by these issues, we propose a novel framework called the State-Specific Model (SSM), designed to unify and enhance both action detection and anticipation tasks. In the proposed framework, the Critical State-Based Memory Compression module compresses frame sequences into critical states, reducing information redundancy. The Action Pattern Learning module constructs a state-transition graph with multi-dimensional edges to model action dynamics in complex scenarios, on the basis of which potential future cues can be generated to represent intention. Furthermore, our Cross-Temporal Interaction module models the mutual influence between intentions and past as well as current information through cross-temporal interactions, thereby refining present and future features and ultimately realizing simultaneous action detection and anticipation. Extensive experiments on multiple benchmark datasets – including EPIC-Kitchens-100, THUMOS’14, TVSeries, and the introduced Parkinson’s Disease Mouse Behaviour (PDMB) dataset – demonstrate the superior performance of our proposed framework compared to other state-of-the-art approaches. These results highlight the importance of action dynamics learning and cross-temporal interactions, laying a foundation for future action understanding research.

[86] Dynamic Gaussian Splatting from Defocused and Motion-blurred Monocular Videos cs.CVPDF

Xuankai Zhang, Junjin Xiao, Qing Zhang

TL;DR: 该论文提出了一种统一的框架，能够从失焦和运动模糊的单目视频中实现高质量动态高斯散射（Gaussian Splatting）。通过模糊预测网络和动态高斯密集化策略，解决了现有方法无法同时处理两类模糊的问题，并在实验中展示了优于现有技术的性能。

Details

Motivation: 现有方法通常针对失焦模糊或运动模糊单独设计，缺乏同时处理两者的能力。虽然两者可以建模为基于模糊核的卷积，但估计准确的模糊核存在固有困难。

Result: 实验表明，该方法在从失焦和运动模糊的单目视频中生成逼真的新视角合成方面优于现有技术。

Insight: 模糊预测网络的引入和动态高斯密集化策略是同时处理两类模糊的关键创新，展示了模糊信息在场景重建中的重要性。

Abstract: This paper presents a unified framework that allows high-quality dynamic Gaussian Splatting from both defocused and motion-blurred monocular videos. Due to the significant difference between the formation processes of defocus blur and motion blur, existing methods are tailored for either one of them, lacking the ability to simultaneously deal with both of them. Although the two can be jointly modeled as blur kernel-based convolution, the inherent difficulty in estimating accurate blur kernels greatly limits the progress in this direction. In this work, we go a step further towards this direction. Particularly, we propose to estimate per-pixel reliable blur kernels using a blur prediction network that exploits blur-related scene and camera information and is subject to a blur-aware sparsity constraint. Besides, we introduce a dynamic Gaussian densification strategy to mitigate the lack of Gaussians for incomplete regions, and boost the performance of novel view synthesis by incorporating unseen view information to constrain scene optimization. Extensive experiments show that our method outperforms the state-of-the-art methods in generating photorealistic novel view synthesis from defocused and motion-blurred monocular videos. Our code and trained model will be made publicly available.

[87] Uncovering Anomalous Events for Marine Environmental Monitoring via Visual Anomaly Detection cs.CVPDF

Laura Weihl, Nejc Novak, Stefan H. Bengtson, Malte Pedersen

TL;DR: 该论文探讨了利用基于深度神经网络的视觉异常检测（VAD）自动识别水下监控视频中的异常事件，并提出了首个多标注者水下VAD基准数据集AURA，评估了四种VAD模型在不同场景下的表现。

Details

Motivation: 水下视频监控是评估海洋生物多样性的有效方法，但海量无事件视频使人工检查不切实际，因此需要自动化的视觉异常检测技术。

Result: 实验表明，当前VAD模型的性能差异显著，且对训练数据量和视觉内容多样性高度敏感。软标签和共识标签在提升模型性能方面具有重要价值。

Insight: 1. 视觉异常检测在水下监控中具有潜力，但需解决数据多样性问题；2. 多标注者的标注方法有助于提升模型鲁棒性；3. 训练数据的量和质量对VAD性能至关重要。

Abstract: Underwater video monitoring is a promising strategy for assessing marine biodiversity, but the vast volume of uneventful footage makes manual inspection highly impractical. In this work, we explore the use of visual anomaly detection (VAD) based on deep neural networks to automatically identify interesting or anomalous events. We introduce AURA, the first multi-annotator benchmark dataset for underwater VAD, and evaluate four VAD models across two marine scenes. We demonstrate the importance of robust frame selection strategies to extract meaningful video segments. Our comparison against multiple annotators reveals that VAD performance of current models varies dramatically and is highly sensitive to both the amount of training data and the variability in visual content that defines “normal” scenes. Our results highlight the value of soft and consensus labels and offer a practical approach for supporting scientific exploration and scalable biodiversity monitoring.

[88] Restricted Receptive Fields for Face Verification cs.CVPDF

Kagan Ozturk, Aman Bhatta, Haiyu Wu, Patrick Flynn, Kevin W. Bowyer

TL;DR: 该论文提出了一种基于受限感受野的人脸相似性度量方法，通过将全局相似性分解为局部补丁贡献，实现模型决策的固有可解释性。

Details

Motivation: 当前深度神经网络的决策过程通常是黑盒的，常用的事后分析方法（如像素重要性）可能无法准确反映模型的实际推理过程。为了解决这一问题，作者提出设计一种固有可解释的模型。

Result: 结果表明，该方法在小补丁（28x28）下具有竞争力，在大补丁（56x56）下甚至超越了现有最优方法。

Insight: 通过局部加性设计，模型的解释性与性能可以并存，无需事后分析，这为设计固有可解释的模型提供了新思路。

Abstract: Understanding how deep neural networks make decisions is crucial for analyzing their behavior and diagnosing failure cases. In computer vision, a common approach to improve interpretability is to assign importance to individual pixels using post-hoc methods. Although they are widely used to explain black-box models, their fidelity to the model’s actual reasoning is uncertain due to the lack of reliable evaluation metrics. This limitation motivates an alternative approach, which is to design models whose decision processes are inherently interpretable. To this end, we propose a face similarity metric that breaks down global similarity into contributions from restricted receptive fields. Our method defines the similarity between two face images as the sum of patch-level similarity scores, providing a locally additive explanation without relying on post-hoc analysis. We show that the proposed approach achieves competitive verification performance even with patches as small as 28x28 within 112x112 face images, and surpasses state-of-the-art methods when using 56x56 patches.

[89] EGD-YOLO: A Lightweight Multimodal Framework for Robust Drone-Bird Discrimination via Ghost-Enhanced YOLOv8n and EMA Attention under Adverse Condition cs.CVPDF

Sudipto Sarkar, Mohammad Asif Hasan, Khondokar Ashik Shahriar, Fablia Labiba, Nahian Tasnim

TL;DR: EGD-YOLOv8n是一个轻量级的多模态目标检测框架，通过Ghost增强的YOLOv8n和EMA注意力机制，在恶劣条件下实现了无人机与鸟类的高效区分。

Details

Motivation: 无人机与鸟类的高效区分对空域安全和安保系统至关重要。现有方法在计算效率和恶劣条件下的鲁棒性上存在不足。

Result: 多模态版本在VIP CUP 2025数据集上表现最佳，兼具高精度和实时性，适用于普通GPU。

Insight: 多模态数据和高效的注意力机制能够显著提升目标检测模型在恶劣条件下的性能，同时保持轻量化和实时性。

Abstract: Identifying drones and birds correctly is essential for keeping the skies safe and improving security systems. Using the VIP CUP 2025 dataset, which provides both RGB and infrared (IR) images, this study presents EGD-YOLOv8n, a new lightweight yet powerful model for object detection. The model improves how image features are captured and understood, making detection more accurate and efficient. It uses smart design changes and attention layers to focus on important details while reducing the amount of computation needed. A special detection head helps the model adapt to objects of different shapes and sizes. We trained three versions: one using RGB images, one using IR images, and one combining both. The combined model achieved the best accuracy and reliability while running fast enough for real-time use on common GPUs.

[90] Structured Spectral Graph Learning for Multi-label Abnormality Classification in 3D Chest CT Scans cs.CVPDF

Theo Di Piazza, Carole Lazarus, Olivier Nempont, Loic Boussel

TL;DR: 论文提出了一种基于图谱的2.5D框架，用于3D胸部CT扫描的多标签异常分类，通过将3D体积表示为结构化图谱并利用谱图卷积捕获切片间依赖关系，实现了跨数据集的强泛化性能。

Details

Motivation: 3D胸部CT的多标签分类问题复杂且具有挑战性，主要由于体积数据的空间关系复杂性和异常多样性。现有的3D卷积神经网络难以捕捉长程依赖，而视觉Transformer需要大量领域特定数据预训练。

Result: 在三个独立机构的数据集上训练和评估，表现出强大的跨数据集泛化能力，性能媲美前沿视觉编码器。消融实验验证了不同聚合策略、边权重方案和图谱连接模式的性能影响。

Insight: 图谱建模方法可以有效捕捉3D CT数据中的复杂空间关系，同时避免了传统方法对大规模预训练的依赖。其方法还可推广至放射报告生成和腹部CT数据。

Abstract: With the growing volume of CT examinations, there is an increasing demand for automated tools such as organ segmentation, abnormality detection, and report generation to support radiologists in managing their clinical workload. Multi-label classification of 3D Chest CT scans remains a critical yet challenging problem due to the complex spatial relationships inherent in volumetric data and the wide variability of abnormalities. Existing methods based on 3D convolutional neural networks struggle to capture long-range dependencies, while Vision Transformers often require extensive pre-training on large-scale, domain-specific datasets to perform competitively. In this work, we propose a 2.5D alternative by introducing a new graph-based framework that represents 3D CT volumes as structured graphs, where axial slice triplets serve as nodes processed through spectral graph convolution, enabling the model to reason over inter-slice dependencies while maintaining complexity compatible with clinical deployment. Our method, trained and evaluated on 3 datasets from independent institutions, achieves strong cross-dataset generalization, and shows competitive performance compared to state-of-the-art visual encoders. We further conduct comprehensive ablation studies to evaluate the impact of various aggregation strategies, edge-weighting schemes, and graph connectivity patterns. Additionally, we demonstrate the broader applicability of our approach through transfer experiments on automated radiology report generation and abdominal CT data.\ This work extends our previous contribution presented at the MICCAI 2025 EMERGE Workshop.

[91] Full segmentation annotations of 3D time-lapse microscopy images of MDA231 cells cs.CVPDF

Aleksandra Melnikova, Petr Matula

TL;DR: 该论文提供了一个高质量、公开可用的3D时间推移显微镜图像（MDA231细胞）分割注释数据集，填补了该领域的数据空白，并验证了其一致性和准确性。

Details

Motivation: 体积图像的分割注释对图像处理领域至关重要，但现有公开数据集中缺乏高质量的多目标3D动态分割注释。

Result: 新注释的数据集在一致性（与CTC跟踪标记）和分割准确性（与2D黄金标准）上表现良好，且优于自动生成的silver truth。

Insight: 手动注释的3D分割数据集能更准确地反映复杂动态目标的形状，可用于细胞分割训练和动态形状分析。

Abstract: High-quality, publicly available segmentation annotations of image and video datasets are critical for advancing the field of image processing. In particular, annotations of volumetric images of a large number of targets are time-consuming and challenging. In (Melnikova, A., & Matula, P., 2025), we presented the first publicly available full 3D time-lapse segmentation annotations of migrating cells with complex dynamic shapes. Concretely, three distinct humans annotated two sequences of MDA231 human breast carcinoma cells (Fluo-C3DL-MDA231) from the Cell Tracking Challenge (CTC). This paper aims to provide a comprehensive description of the dataset and accompanying experiments that were not included in (Melnikova, A., & Matula, P., 2025) due to limitations in publication space. Namely, we show that the created annotations are consistent with the previously published tracking markers provided by the CTC organizers and the segmentation accuracy measured based on the 2D gold truth of CTC is within the inter-annotator variability margins. We compared the created 3D annotations with automatically created silver truth provided by CTC. We have found the proposed annotations better represent the complexity of the input images. The presented annotations can be used for testing and training cell segmentation, or analyzing 3D shapes of highly dynamic objects.

[92] Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales cs.CVPDF

Zhaofang Qian, Hardy Chen, Zeyu Wang, Li Zhang, Zijun Wang

TL;DR: 论文提出了EarthWhere，一个综合评估视觉语言模型（VLM）在地理定位任务中的能力的基准，涵盖国家和街道两种尺度。通过评估13个先进VLM，发现模型性能受限且存在区域偏差。

Details

Motivation: 现有的VLM在地理定位任务中的能力尚未被充分评估，而这一任务在实际应用中具有重要价值。

Result: Gemini-2.5-Pro表现最佳（56.32%），最强开源模型GLM-4.5V达到34.71%。研究发现网络搜索和推理在视觉线索有限时未必能提升性能，且模型存在区域偏差。

Insight: 模型在地理定位任务中存在显著的区域偏差，且依赖视觉线索的性能受限，突显了任务中的挑战性和改进空间。

Abstract: Vision-language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions, a task that is challenging and of demand in real life, has not been comprehensively evaluated. We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step-by-step reasoning, and evidence use. EarthWhere comprises 810 globally distributed images across two complementary geolocation scales: WhereCountry (i.e., 500 multiple-choice question-answering, with country-level answer and panoramas) and WhereStreet (i.e., 310 fine-grained street-level identification tasks requiring multi-step reasoning with optional web search). For evaluation, we adopt the final-prediction metrics: location accuracies within k km (Acc@k) for coordinates and hierarchical path scores for textual localization. Beyond this, we propose to explicitly score intermediate reasoning chains using human-verified key visual clues and a Shapley-reweighted thinking score that attributes credit to each clue’s marginal contribution. We benchmark 13 state-of-the-art VLMs with web searching tools on our EarthWhere and report different types of final answer accuracies as well as the calibrated model thinking scores. Overall, Gemini-2.5-Pro achieves the best average accuracy at 56.32%, while the strongest open-weight model, GLM-4.5V, reaches 34.71%. We reveal that web search and reasoning do not guarantee improved performance when visual clues are limited, and models exhibit regional biases, achieving up to 42.7% higher scores in certain areas than others. These findings highlight not only the promise but also the persistent challenges of models to mitigate bias and achieve robust, fine-grained localization. We open-source our benchmark at https://github.com/UCSC-VLAA/EarthWhere.

[93] Topological Alignment of Shared Vision-Language Embedding Space cs.CV | cs.AI | cs.LGPDF

Junwon You, Dasol Kang, Jae-Hun Jung

TL;DR: 这篇论文提出了ToMCLIP，一种拓扑对齐方法，用于改进多语言视觉-语言嵌入空间的全局几何结构，通过拓扑保持约束提升跨模态对齐性能。

Details

Motivation: 现有的对比视觉-语言模型在多语言任务中存在跨模态对齐偏向英语的问题，且现有方法忽视了共享嵌入空间的全局几何结构。

Result: 实验表明ToMCLIP在CIFAR-100上零样本准确率更高，在xFlickr&CO上多语言检索性能更强。

Insight: 拓扑对齐提供了一种通用的表示学习方法，适用于改善嵌入空间的全局结构。

Abstract: Contrastive Vision-Language Models (VLMs) have demonstrated strong zero-shot capabilities. However, their cross-modal alignment remains biased toward English due to limited multilingual multimodal data. Recent multilingual extensions have alleviated this gap but enforce instance-level alignment while neglecting the global geometry of the shared embedding space. We address this problem by introducing ToMCLIP (Topological Alignment for Multilingual CLIP), a topology-aware framework aligning embedding spaces with topology-preserving constraints. The proposed method applies persistent homology to define a topological alignment loss and approximates persistence diagram with theoretical error bounds using graph sparsification strategy. This work validates the proposed approach, showing enhanced structural coherence of multilingual representations, higher zero-shot accuracy on the CIFAR-100, and stronger multilingual retrieval performance on the xFlickr&CO. Beyond VLMs, the proposed approach provides a general method for incorporating topological alignment into representation learning.

[94] SceneTextStylizer: A Training-Free Scene Text Style Transfer Framework with Diffusion Model cs.CV | eess.IVPDF

Honghui Yuan, Keiji Yanai

TL;DR: SceneTextStylizer是一个无需训练的、基于扩散模型的场景文本风格转移框架，能够实现灵活、高保真的文本风格编辑，同时保持文本可读性和风格一致性。

Details

Motivation: 现有场景文本编辑方法通常局限于内容替换和简单风格，缺乏自由风格转移的能力。本文旨在解决场景文本灵活且局部化的风格编辑问题。

Result: 实验表明，SceneTextStylizer在视觉保真度和文本保存方面优于现有方法。

Insight: 1. 无需训练的扩散模型框架可以有效实现复杂风格编辑；2. 结合区域控制和风格增强模块能够显著提升文本编辑的质量和灵活性。

Abstract: With the rapid development of diffusion models, style transfer has made remarkable progress. However, flexible and localized style editing for scene text remains an unsolved challenge. Although existing scene text editing methods have achieved text region editing, they are typically limited to content replacement and simple styles, which lack the ability of free-style transfer. In this paper, we introduce SceneTextStylizer, a novel training-free diffusion-based framework for flexible and high-fidelity style transfer of text in scene images. Unlike prior approaches that either perform global style transfer or focus solely on textual content modification, our method enables prompt-guided style transformation specifically for text regions, while preserving both text readability and stylistic consistency. To achieve this, we design a feature injection module that leverages diffusion model inversion and self-attention to transfer style features effectively. Additionally, a region control mechanism is introduced by applying a distance-based changing mask at each denoising step, enabling precise spatial control. To further enhance visual quality, we incorporate a style enhancement module based on the Fourier transform to reinforce stylistic richness. Extensive experiments demonstrate that our method achieves superior performance in scene text style transformation, outperforming existing state-of-the-art methods in both visual fidelity and text preservation.

[95] FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model cs.CV | cs.AI | cs.LGPDF

Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang

TL;DR: FG-CLIP 2提出了一个双语（英语和中文）细粒度视觉-语言对齐模型，通过引入区域-文本匹配、长字幕建模和多判别目标等方法，显著提升了细粒度对齐能力，并在29个数据集上表现优异。

Details

Motivation: 现有模型（如CLIP）在全局对齐表现良好，但在细粒度细节（如物体属性、空间关系）和多语言支持上能力有限，尤其是在非英语环境下。FG-CLIP 2旨在填补这一空白，推动双语（英语和中文）细粒度视觉-语言对齐研究。

Result: 在29个数据集和8项任务中，FG-CLIP 2超越了现有方法，实现了双语的最佳性能。

Insight: 细粒度对齐需要多模态监督和多语言数据支持；TIC损失可以有效提升模型对语义相似文本的区分能力。

Abstract: Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained alignment.

[96] DKPMV: Dense Keypoints Fusion from Multi-View RGB Frames for 6D Pose Estimation of Textureless Objects cs.CV | cs.ROPDF

Jiahong Chen, Jinghao Wang, Zi Wang, Ziwen Wang, Banglei Guan

TL;DR: DKPMV提出了一种基于多视角RGB图像的稠密关键点融合方法，用于无纹理物体的6D姿态估计，通过三阶段姿态优化策略和注意力聚合与对称感知训练，显著提升了性能。

Details

Motivation: 无纹理物体的6D姿态估计在工业机器人应用中很重要，但现有方法要么依赖深度数据，要么未能充分利用多视角几何信息，导致性能受限。

Result: 在ROBI数据集上，DKPMV表现优于现有的多视角RGB方法，甚至多数情况下超越了RGB-D方法。

Insight: 仅依赖RGB图像也能通过稠密关键点融合和几何信息优化实现高精度的6D姿态估计，对称感知训练对解决对称物体模糊性至关重要。

Abstract: 6D pose estimation of textureless objects is valuable for industrial robotic applications, yet remains challenging due to the frequent loss of depth information. Current multi-view methods either rely on depth data or insufficiently exploit multi-view geometric cues, limiting their performance. In this paper, we propose DKPMV, a pipeline that achieves dense keypoint-level fusion using only multi-view RGB images as input. We design a three-stage progressive pose optimization strategy that leverages dense multi-view keypoint geometry information. To enable effective dense keypoint fusion, we enhance the keypoint network with attentional aggregation and symmetry-aware training, improving prediction accuracy and resolving ambiguities on symmetric objects. Extensive experiments on the ROBI dataset demonstrate that DKPMV outperforms state-of-the-art multi-view RGB approaches and even surpasses the RGB-D methods in the majority of cases. The code will be available soon.

[97] Towards Distribution-Shift Uncertainty Estimation for Inverse Problems with Generative Priors cs.CVPDF

Namhoon Kim, Sara Fridovich-Keil

TL;DR: 本文提出了一种基于生成式先验的反问题不确定性估计方法，无需校准数据集，即可检测分布偏移，适用于计算成像问题。

Details

Motivation: 生成式模型在反问题（如医学图像重建）中表现出强大的潜力，但它们可能在测试数据超出训练分布时产生幻觉特征。现有方法需要校准数据集或仅提供启发式估计，无法直接量化分布偏移带来的不确定性。

Result: 在MNIST数字的断层重建实验中，分布外数字的 reconstruction 表现出更高的变异性和重建误差，验证了该指标的有效性。

Insight: 该方法为生成式先验的部署提供了轻量级保障策略，能在分布内情况下实现激进的数据减少，同时在分布外情况下自动发出警告。

Abstract: Generative models have shown strong potential as data-driven priors for solving inverse problems such as reconstructing medical images from undersampled measurements. While these priors improve reconstruction quality with fewer measurements, they risk hallucinating features when test images lie outside the training distribution. Existing uncertainty quantification methods in this setting (i) require an in-distribution calibration dataset, which may not be available, (ii) provide heuristic rather than statistical estimates, or (iii) quantify uncertainty from model capacity or limited measurements rather than distribution shift. We propose an instance-level, calibration-free uncertainty indicator that is sensitive to distribution shift, requires no knowledge of the training distribution, and incurs no retraining cost. Our key hypothesis is that reconstructions of in-distribution images remain stable under random measurement variations, while reconstructions of out-of-distribution (OOD) images exhibit greater instability. We use this stability as a proxy for detecting distribution shift. Our proposed OOD indicator is efficiently computable for any computational imaging inverse problem; we demonstrate it on tomographic reconstruction of MNIST digits, where a learned proximal network trained only on digit “0” is evaluated on all ten digits. Reconstructions of OOD digits show higher variability and correspondingly higher reconstruction error, validating this indicator. These results suggest a deployment strategy that pairs generative priors with lightweight guardrails, enabling aggressive measurement reduction for in-distribution cases while automatically warning when priors are applied out of distribution.

[98] IUT-Plug: A Plug-in tool for Interleaved Image-Text Generation cs.CVPDF

Zeteng Lin, Xingxing Li, Wen You, Xiaoyang Li, Zehan Lu

TL;DR: 该论文提出了IUT-Plug，一种基于图像理解树（IUT）的插件工具，用于改进现有视觉语言模型在多模态图像-文本生成中的逻辑、对象一致性和风格保持问题。

Details

Motivation: 现有视觉语言模型（如GPT-4和DALL-E）在多模态图像-文本生成中难以保持逻辑、对象一致性和风格，限制了其在复杂场景中的泛化能力。

Result: 实验表明，IUT-Plug不仅在现有基准测试中提升了准确性，还能在多模态QA场景中有效缓解三种关键形式的上下文漂移。

Insight: 通过结构化推理和跨模态一致性机制，可以显著提升视觉语言模型在多模态任务中的表现。

Abstract: Existing vision language models (VLMs), including GPT-4 and DALL-E, often struggle to preserve logic, object identity, and style in multimodal image-text generation. This limitation significantly hinders the generalization capability of VLMs in complex image-text input-output scenarios. To address this issue, we propose IUT-Plug, a module grounded in an Image Understanding Tree (IUT), which enhances existing interleaved VLMs through explicit structured reasoning, thereby mitigating context drift in logic, entity identity, and style. The proposed framework operates in two stages. (1) A dynamic IUT-Plug extraction module parses visual scenes into hierarchical symbolic structures. (2) A coordinated narrative-flow and image synthesis mechanism ensures cross-modal consistency. To evaluate our approach, we construct a novel benchmark based on 3,000 real human-generated question-answer pairs over fine-tuned large models, introducing a dynamic evaluation protocol for quantifying context drift in interleaved VLMs. Experimental results demonstrate that IUT-Plug not only improves accuracy on established benchmarks but also effectively alleviates the three critical forms of context drift across diverse multimodal question answering (QA) scenarios.

[99] Chart-RVR: Reinforcement Learning with Verifiable Rewards for Explainable Chart Reasoning cs.CV | cs.LGPDF

Sanchit Sinha, Oana Frunza, Kashif Rasul, Yuriy Nevmyvaka, Aidong Zhang

TL;DR: Chart-RVR通过结合GRPO和可验证奖励，提升了大规模视觉语言模型（LVLM）在图谱推理任务中的鲁棒性和可解释性，显著缩小了分布外数据（OOD）的性能差距，并在多个基准测试中达到最优。

Details

Motivation: 目前的大型视觉语言模型在图谱推理任务中表现优秀，但在分布外数据和生成解释性推理链时表现不佳，限制了其可解释性和可靠性。

Result: Chart-RVR-3B系列模型在六个图谱推理基准测试中表现最优，显著缩小了OOD性能差距，并生成了更可解释的推理链。

Insight: 可验证奖励与GRPO结合的方法不仅提升了模型性能，还增强了推理过程的透明度和可靠性，展示了其在可解释性图谱推理任务中的潜力。

Abstract: The capabilities of Large Vision-Language Models (LVLMs) have reached state-of-the-art on many visual reasoning tasks, including chart reasoning, yet they still falter on out-of-distribution (OOD) data, and degrade further when asked to produce their chain-of-thought (CoT) rationales, limiting explainability. We present Chart-RVR, a general framework that fine-tunes LVLMs to be more robust and explainable for chart reasoning by coupling Group Relative Policy Optimization (GRPO) with automatically verifiable rewards. Our framework comprises of three rewards that maximize: (i) correct chart-type classification, (ii) faithful chart table reconstruction, and (iii) process conformity. Applied to 3-billion-parameter LVLMs, Chart-RVR consistently outperforms standard supervised fine-tuning (SFT) on both in-distribution and out-of-distribution datasets, closing the OOD performance gap while improving rationale fidelity. The resulting models, the Chart-RVR-3B series, achieve state-of-the-art results on six chart-reasoning benchmarks spanning in-domain and OOD settings, surpassing all existing models of comparable size. Beyond accuracy, Chart-RVR yields more interpretable CoT rationales, strengthening trust and reliability - showcasing the power of verifiable rewards with GRPO for training reliable, interpretable chart-reasoning models.

[100] Mixup Helps Understanding Multimodal Video Better cs.CVPDF

Xiaoyu Ma, Ding Ding, Hao Chen

TL;DR: 论文提出了Multimodal Mixup (MM)和Balanced Multimodal Mixup (B-MM)方法，通过在多模态特征层级应用Mixup策略和动态调整模态混合比例，解决了多模态视频理解中模态不平衡和过拟合问题。

Details

Motivation: 多模态视频理解任务中，强模态容易主导学习过程，压制弱模态的贡献，导致模型过拟合和泛化能力下降。

Result: 在多个数据集上的实验表明，所提方法显著提升了模型的泛化能力和多模态鲁棒性。

Insight: 动态调整模态混合比例能更有效地平衡模态间的贡献，从而提高模型在多模态任务中的表现。

Abstract: Multimodal video understanding plays a crucial role in tasks such as action recognition and emotion classification by combining information from different modalities. However, multimodal models are prone to overfitting strong modalities, which can dominate learning and suppress the contributions of weaker ones. To address this challenge, we first propose Multimodal Mixup (MM), which applies the Mixup strategy at the aggregated multimodal feature level to mitigate overfitting by generating virtual feature-label pairs. While MM effectively improves generalization, it treats all modalities uniformly and does not account for modality imbalance during training. Building on MM, we further introduce Balanced Multimodal Mixup (B-MM), which dynamically adjusts the mixing ratios for each modality based on their relative contributions to the learning objective. Extensive experiments on several datasets demonstrate the effectiveness of our methods in improving generalization and multimodal robustness.

[101] A Survey on Agentic Multimodal Large Language Models cs.CV | cs.AI | cs.CLPDF

Huanjin Yao, Ruifei Zhang, Jiaxing Huang, Jingyi Zhang, Yibo Wang

TL;DR: 论文对Agentic Multimodal Large Language Models（Agentic MLLMs）进行了全面综述，探讨了其与传统MLLM代理的区别，并提出了一个概念框架，围绕三个维度组织Agentic MLLMs的研究方向。

Details

Motivation: 随着自主代理系统的兴起，AI代理正从静态、被动和领域特定转向动态、主动和通用化。研究者对Agentic AI的兴趣日益增长，并认为其可能成为实现AGI的重要路径，因此需要对Agentic MLLMs进行系统性综述。

Result: 论文总结了Agentic MLLMs的研究现状和发展方向，提供了开源工具和数据集，并展望了未来的研究潜力。

Insight: Agentic MLLMs的关键在于实现动态规划、主动工具调用和适应环境的能力，这种多模态与大语言模型的结合为未来通用AI提供了重要路径。

Abstract: With the recent emergence of revolutionary autonomous agentic systems, research community is witnessing a significant shift from traditional static, passive, and domain-specific AI agents toward more dynamic, proactive, and generalizable agentic AI. Motivated by the growing interest in agentic AI and its potential trajectory toward AGI, we present a comprehensive survey on Agentic Multimodal Large Language Models (Agentic MLLMs). In this survey, we explore the emerging paradigm of agentic MLLMs, delineating their conceptual foundations and distinguishing characteristics from conventional MLLM-based agents. We establish a conceptual framework that organizes agentic MLLMs along three fundamental dimensions: (i) Agentic internal intelligence functions as the system’s commander, enabling accurate long-horizon planning through reasoning, reflection, and memory; (ii) Agentic external tool invocation, whereby models proactively use various external tools to extend their problem-solving capabilities beyond their intrinsic knowledge; and (iii) Agentic environment interaction further situates models within virtual or physical environments, allowing them to take actions, adapt strategies, and sustain goal-directed behavior in dynamic real-world scenarios. To further accelerate research in this area for the community, we compile open-source training frameworks, training and evaluation datasets for developing agentic MLLMs. Finally, we review the downstream applications of agentic MLLMs and outline future research directions for this rapidly evolving field. To continuously track developments in this rapidly evolving field, we will also actively update a public repository at https://github.com/HJYao00/Awesome-Agentic-MLLMs.

[102] Perspective-aware 3D Gaussian Inpainting with Multi-view Consistency cs.CVPDF

Yuxin Cheng, Binxiao Huang, Taiqiang Wu, Wenyong Zhou, Chenchen Ding

TL;DR: PAInpainter是一种新型3D高斯修复方法，通过视角感知的内容传播和多视角一致性验证，显著提升了修复质量和对多视角场景的全局一致性。

Details

Motivation: 3D高斯修复在虚拟现实和多媒体应用中至关重要，但现有方法在多视角一致性方面仍存在挑战。

Result: 在SPIn-NeRF和NeRFiller数据集上，PSNR分别达到26.03 dB和29.51 dB，优于现有方法。

Insight: 多视角一致性验证和视角感知内容传播是提升3D高斯修复质量的关键因素。

Abstract: 3D Gaussian inpainting, a critical technique for numerous applications in virtual reality and multimedia, has made significant progress with pretrained diffusion models. However, ensuring multi-view consistency, an essential requirement for high-quality inpainting, remains a key challenge. In this work, we present PAInpainter, a novel approach designed to advance 3D Gaussian inpainting by leveraging perspective-aware content propagation and consistency verification across multi-view inpainted images. Our method iteratively refines inpainting and optimizes the 3D Gaussian representation with multiple views adaptively sampled from a perspective graph. By propagating inpainted images as prior information and verifying consistency across neighboring views, PAInpainter substantially enhances global consistency and texture fidelity in restored 3D scenes. Extensive experiments demonstrate the superiority of PAInpainter over existing methods. Our approach achieves superior 3D inpainting quality, with PSNR scores of 26.03 dB and 29.51 dB on the SPIn-NeRF and NeRFiller datasets, respectively, highlighting its effectiveness and generalization capability.

[103] ContextGen: Contextual Layout Anchoring for Identity-Consistent Multi-Instance Generation cs.CVPDF

Ruihang Xu, Dewei Zhou, Fan Ma, Yi Yang

TL;DR: ContextGen是一个基于Diffusion Transformer的新框架，通过Contextual Layout Anchoring（CLA）和Identity Consistency Attention（ICA）机制，解决了多实例图像生成中布局控制和身份一致性的挑战。

Details

Motivation: 多实例图像生成（MIG）在现有扩散模型中面临布局控制和身份一致性的难题，缺乏大规模的标注数据集。

Result: 实验表明ContextGen在控制精度、身份保真度和视觉质量上优于现有方法。

Insight: 布局和身份一致性是多实例生成的核心挑战，通过上下文整合和注意力机制可以有效解决。

Abstract: Multi-instance image generation (MIG) remains a significant challenge for modern diffusion models due to key limitations in achieving precise control over object layout and preserving the identity of multiple distinct subjects. To address these limitations, we introduce ContextGen, a novel Diffusion Transformer framework for multi-instance generation that is guided by both layout and reference images. Our approach integrates two key technical contributions: a Contextual Layout Anchoring (CLA) mechanism that incorporates the composite layout image into the generation context to robustly anchor the objects in their desired positions, and Identity Consistency Attention (ICA), an innovative attention mechanism that leverages contextual reference images to ensure the identity consistency of multiple instances. Recognizing the lack of large-scale, hierarchically-structured datasets for this task, we introduce IMIG-100K, the first dataset with detailed layout and identity annotations. Extensive experiments demonstrate that ContextGen sets a new state-of-the-art, outperforming existing methods in control precision, identity fidelity, and overall visual quality.

[104] COCO-Tree: Compositional Hierarchical Concept Trees for Enhanced Reasoning in Vision Language Models cs.CVPDF

Sanchit Sinha, Guangzhi Xiong, Aidong Zhang

TL;DR: 论文提出了COCO-Tree方法，通过设计神经符号概念树（从大型语言模型中学习），增强视觉语言模型（VLM）的语言推理能力，显著提升了组合泛化性能。

Details

Motivation: 现有的视觉语言模型在组合推理方面表现较弱，传统改进方法如提示结构优化或思维链推理效果有限，而依赖大型语言模型的资源密集型方案缺乏可解释性。COCO-Tree旨在解决这些问题。

Result: 在Winoground、EqBench、ColorSwap和SugarCrepe四个基准上的实验表明，COCO-Tree将组合泛化性能提升了5-10%，且适用于不同规模的VLM。

Insight: 1. 神经符号方法可以有效结合LLM的语言优势和VLM的视觉优势；2. 束搜索推理过程增强了模型的解释性；3. 该方法为VLM的组合推理提供了一种高效且可解释的改进方向。

Abstract: Compositional reasoning remains a persistent weakness of modern vision language models (VLMs): they often falter when a task hinges on understanding how multiple objects, attributes, and relations interact within an image. Multiple research works have attempted to improve compositionality performance by creative tricks such as improving prompt structure, chain of thought reasoning, etc. A more recent line of work attempts to impart additional reasoning in VLMs using well-trained Large Language Models (LLMs), which are far superior in linguistic understanding than VLMs to compensate for the limited linguistic prowess of VLMs. However, these approaches are either resource-intensive or do not provide an interpretable reasoning process. In this paper, we present ‘COCO-Tree’ - a novel approach that augments VLM outputs with carefully designed neurosymbolic concept trees learned from LLMs to improve VLM’s linguistic reasoning. COCO-Tree’s beam search-inspired reasoning process boosts compositionality performance and provides a rationale behind VLM predictions. Empirical results on four compositionality benchmarks, Winoground, EqBench, ColorSwap, and SugarCrepe, in seven different open-source VLMs with varying sizes, demonstrate that COCO-Tree significantly improves compositional generalization by 5-10% over baselines.

[105] High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation cs.CVPDF

Runyang Feng, Hyung Jin Chang, Tze Ho Elden Tse, Boeun Kim, Yi Chang

TL;DR: 该论文提出了一种新颖的框架，通过扩展Mamba模型（状态空间模型）来分别学习全局和局部的高分辨率时空表示，用于视频姿态估计（VHPE），解决了现有方法在平衡全局与局部动态建模方面的不足及其计算复杂度问题。

Details

Motivation: 现有VHPE方法在统一时空学习时存在全局与局部动态建模失衡的问题，且全局依赖建模的计算复杂度较高（二次复杂度）。Mamba模型虽有潜力在处理长距离依赖时表现出线性复杂度，但仅限于1D序列数据。

Result: 在四个基准数据集上的实验表明，该模型性能优于现有VHPE方法，同时实现了更好的计算效率平衡。

Insight: 通过分离全局与局部动态建模，并利用Mamba模型的线性复杂度特性，可以有效解决高分辨率时空建模中的计算瓶颈和性能限制问题。

Abstract: Modeling high-resolution spatiotemporal representations, including both global dynamic contexts (e.g., holistic human motion tendencies) and local motion details (e.g., high-frequency changes of keypoints), is essential for video-based human pose estimation (VHPE). Current state-of-the-art methods typically unify spatiotemporal learning within a single type of modeling structure (convolution or attention-based blocks), which inherently have difficulties in balancing global and local dynamic modeling and may bias the network to one of them, leading to suboptimal performance. Moreover, existing VHPE models suffer from quadratic complexity when capturing global dependencies, limiting their applicability especially for high-resolution sequences. Recently, the state space models (known as Mamba) have demonstrated significant potential in modeling long-range contexts with linear complexity; however, they are restricted to 1D sequential data. In this paper, we present a novel framework that extends Mamba from two aspects to separately learn global and local high-resolution spatiotemporal representations for VHPE. Specifically, we first propose a Global Spatiotemporal Mamba, which performs 6D selective space-time scan and spatial- and temporal-modulated scan merging to efficiently extract global representations from high-resolution sequences. We further introduce a windowed space-time scan-based Local Refinement Mamba to enhance the high-frequency details of localized keypoint motions. Extensive experiments on four benchmark datasets demonstrate that the proposed model outperforms state-of-the-art VHPE approaches while achieving better computational trade-offs.

Shasha Guo, Liang Pang, Xi Wang, Yanling Wang, Huawei Shen

TL;DR: GeoVLMath提出了一种通过跨模态奖励增强视觉语言模型在几何推理中辅助线生成的强化学习框架，并在标准几何问题上实现了优异性能。

Details

Motivation: 辅助线对解决复杂几何问题至关重要，但对大型视觉语言模型（LVLMs）仍具挑战性。现有图像编辑模型难以精确绘制几何辅助线，因此作者提出通过生成辅助线的文本描述来解决这一问题。

Result: 在3B和7B规模的基准测试中，GeoVLMath表现优于开源和专有LVLMs。

Insight: 通过文本描述而非图像编辑生成辅助线，更符合LVLMs的表达优势，且强化学习框架显著提升了几何推理能力。

Abstract: Auxiliary lines are essential for solving complex geometric problems but remain challenging for large vision-language models (LVLMs). Rather than editing diagrams to draw auxiliary lines, which current image editing models struggle to render with geometric precision, we generate textual descriptions of auxiliary-line constructions to better align with the representational strengths of LVLMs. To bridge the gap between textual descriptions and spatial structure, we propose a reinforcement learning framework that enhances diagram-text alignment. At the core of our approach is a cross-modal reward that evaluates how well the generated auxiliary-line description for an original diagram matches a ground-truth auxiliary-line diagram. Built on this reward, we present GeoVLMath, an open-source LVLM tailored to auxiliary-line reasoning in solid geometry. This fine-grained signal drives a GRPO-based RL stage, yielding precise diagram-text alignment. To support training, we develop a scalable data creation pipeline and construct AuxSolidMath, a dataset of 3,018 real-exam geometry problems with paired diagrams and aligned textual fields. At the 3B and 7B scales, GeoVLMath achieves competitive and often superior performance compared with strong open-source and proprietary LVLMs on auxiliary-line reasoning benchmarks.

[107] GIR-Bench: Versatile Benchmark for Generating Images with Reasoning cs.CVPDF

Hongxiang Li, Yaowei Li, Bin Lin, Yuwei Niu, Yuhang Yang

TL;DR: GIR-Bench是一个多模态模型的综合评测基准，专注于评估理解与生成之间的对齐能力、逻辑推理驱动的图像生成以及多步推理编辑任务。

Details

Motivation: 研究界缺乏一个严格的评测基准，以系统评估多模态模型在理解与生成任务中的一致性及其在复杂视觉任务中的泛化潜力。

Result: 实验表明统一模型在推理驱动任务上表现更优，但仍存在理解与生成之间的差距。

Insight: GIR-Bench揭示了当前多模态模型在理解与生成对齐方面的不足，为未来研究方向提供了关键洞见。

Abstract: Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous reasoning-centric benchmark to systematically evaluate the alignment between understanding and generation, and their generalization potential in complex visual tasks. To this end, we introduce \textbf{GIR-Bench}, a comprehensive benchmark that evaluates unified models across three complementary perspectives. Firstly, we investigate understanding-generation consistency (GIR-Bench-UGC), asking whether models can consistently leverage the same knowledge in both understanding and generation tasks. Secondly, we investigate whether models can perform reasoning-centric text-to-image generation that requires applying logical constraints and implicit knowledge to generate faithful visual content (GIR-Bench-T2I). Thirdly, we evaluate whether models can handle multi-step reasoning in editing (GIR-Bench-Edit). For each subset, we carefully design different task-specific evaluation pipelines tailored for each task. This enables fine-grained and interpretable evaluation while mitigating biases from the prevalent MLLM-as-a-Judge paradigm. Extensive ablations over various unified models and generation-only systems have shown that: Although unified models are more capable of reasoning-driven visual tasks, they still exhibit a persistent gap between understanding and generation. The data and code for GIR-Bench are available at \href{https://hkust-longgroup.github.io/GIR-Bench}{https://hkust-longgroup.github.io/GIR-Bench}.

[108] Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning cs.CVPDF

Ganlin Yang, Tianyi Zhang, Haoran Hao, Weiyun Wang, Yibin Liu

TL;DR: 本文提出了Vlaser模型，一种集成了高层推理与低层控制的视觉-语言-动作模型，通过协同推理能力填补了上游VLM推理与下游VLA策略学习之间的关键缺口。

Details

Motivation: 目前的研究主要集中在使用VLM开发具身推理能力或将高级VLM集成到VLA模型中实现端到端机器人控制，但较少直接解决上游VLM推理与下游VLA策略学习之间的关键差距。

Result: Vlaser在多种具身推理任务（如空间推理、任务规划等）上实现了SOTA性能，并在WidowX和Google Robot基准测试中表现优异。

Insight: 研究发现VLM初始化对VLA微调效果有显著影响，并通过领域偏移缓解策略提升了模型性能。

Abstract: While significant research has focused on developing embodied reasoning capabilities using Vision-Language Models (VLMs) or integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies directly address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning. In this work, we take an initial step toward bridging embodied reasoning with VLA policy learning by introducing Vlaser - a Vision-Language-Action Model with synergistic embodied reasoning capability, which is a foundational vision-language model designed to integrate high-level reasoning with low-level control for embodied agents. Built upon the high-quality Vlaser-6M dataset, Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks - including spatial reasoning, embodied grounding, embodied QA, and task planning. Furthermore, we systematically examine how different VLM initializations affect supervised VLA fine-tuning, offering novel insights into mitigating the domain shift between internet-scale pre-training data and embodied-specific policy learning data. Based on these insights, our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.

[109] LSVOS 2025 Challenge Report: Recent Advances in Complex Video Object Segmentation cs.CVPDF

Chang Liu, Henghui Ding, Kaining Ying, Lingyi Hong, Ning Xu

TL;DR: 本文总结了2025年LSVOS挑战赛的最新进展，重点介绍了为提升复杂视频场景下的鲁棒性而新增的MOSEv2赛道，以及新兴趋势如LLM/MLLM组件的应用。

Details

Motivation: 传统视频目标分割（VOS）任务在真实场景中仍面临挑战，如小物体密集、遮挡频繁等问题。MOSEv2赛道的引入旨在推动长期一致性和泛化能力的提升。

Result: 挑战赛展示了LLM/MLLM组件在视频分割中的作用，并强调了内存感知传播等技术的重要性。

Insight: 未来研究方向包括提升语言感知能力和复杂场景下的鲁棒性，LLM/MLLM可能成为关键工具。

Abstract: This report presents an overview of the 7th Large-scale Video Object Segmentation (LSVOS) Challenge held in conjunction with ICCV 2025. Besides the two traditional tracks of LSVOS that jointly target robustness in realistic video scenarios: Classic VOS (VOS), and Referring VOS (RVOS), the 2025 edition features a newly introduced track, Complex VOS (MOSEv2). Building upon prior insights, MOSEv2 substantially increases difficulty, introducing more challenging but realistic scenarios including denser small objects, frequent disappear/reappear events, severe occlusions, adverse weather and lighting, etc., pushing long-term consistency and generalization beyond curated benchmarks. The challenge retains standard ${J}$, $F$, and ${J&F}$ metrics for VOS and RVOS, while MOSEv2 adopts ${J&\dot{F}}$ as the primary ranking metric to better evaluate objects across scales and disappearance cases. We summarize datasets and protocols, highlight top-performing solutions, and distill emerging trends, such as the growing role of LLM/MLLM components and memory-aware propagation, aiming to chart future directions for resilient, language-aware video segmentation in the wild.

Fengling Zhu, Boshi Liu, Jingyu Hua, Sheng Zhong

TL;DR: 论文提出了一种名为CoDefend的多模态防御方法，通过扩散模型净化和提示优化，提升多模态大语言模型（MLLMs）对对抗攻击的鲁棒性。

Details

Motivation: 多模态大语言模型在视觉和文本任务中表现优异，但容易受到对抗攻击的威胁。现有防御方法如对抗训练和输入净化存在局限性，包括计算成本高、图像质量下降和泛化能力不足。

Result: 在图像描述和视觉问答任务上的实验表明，CoDefend不仅显著提高了鲁棒性，还表现出对未知攻击的强迁移性。

Insight: 监督扩散去噪是多模态防御的有效方法，为MLLMs在实际应用中的安全部署提供了新思路。

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in tasks such as image captioning, visual question answering, and cross-modal reasoning by integrating visual and textual modalities. However, their multimodal nature also exposes them to adversarial threats, where attackers can perturb either modality or both jointly to induce harmful, misleading, or policy violating outputs. Existing defense strategies, such as adversarial training and input purification, face notable limitations: adversarial training typically improves robustness only against known attacks while incurring high computational costs, whereas conventional purification approaches often suffer from degraded image quality and insufficient generalization to complex multimodal tasks. In this work, we focus on defending the visual modality, which frequently serves as the primary entry point for adversarial manipulation. We propose a supervised diffusion based denoising framework that leverages paired adversarial clean image datasets to fine-tune diffusion models with directional, task specific guidance. Unlike prior unsupervised purification methods such as DiffPure, our approach achieves higher quality reconstructions while significantly improving defense robustness in multimodal tasks. Furthermore, we incorporate prompt optimization as a complementary defense mechanism, enhancing resistance against diverse and unseen attack strategies. Extensive experiments on image captioning and visual question answering demonstrate that our method not only substantially improves robustness but also exhibits strong transferability to unknown adversarial attacks. These results highlight the effectiveness of supervised diffusion based denoising for multimodal defense, paving the way for more reliable and secure deployment of MLLMs in real world applications.

[111] Compositional Zero-Shot Learning: A Survey cs.CVPDF

Ans Munir, Faisal Z. Qureshi, Mohsen Ali, Muhammad Haris Khan

TL;DR: 这篇论文是关于组合零样本学习（CZSL）的第一篇全面综述，系统回顾了最新的CZSL方法，并提出了一种基于解缠的分类法，分析了其优缺点和未来研究方向。

Details

Motivation: CZSL任务在计算机视觉中至关重要，因为它需要模型在推理时识别已知属性和物体的未见组合，而训练数据无法覆盖所有可能的组合。外观的上下文性（如“小的”猫和“老的”猫视觉差异显著）增加了问题的复杂性。

Result: 综述分析了各种方法的优缺点，并指出跨模态解缠在应对上下文性和组合性方面的潜力。同时总结了当前领域的局限性和未来可能的突破点。

Insight: 核心洞见包括：1）上下文性建模对CZSL至关重要；2）解缠策略的选择显著影响性能；3）跨模态方法可能在开放世界中表现更优；4）社区需要更多关注数据高效和泛化能力强的模型。

Abstract: Compositional Zero-Shot Learning (CZSL) is a critical task in computer vision that enables models to recognize unseen combinations of known attributes and objects during inference, addressing the combinatorial challenge of requiring training data for every possible composition. This is particularly challenging because the visual appearance of primitives is highly contextual; for example, small'' cats appear visually distinct from older’’ ones, and wet'' cars differ significantly from wet’’ cats. Effectively modeling this contextuality and the inherent compositionality is crucial for robust compositional zero-shot recognition. This paper presents, to our knowledge, the first comprehensive survey specifically focused on Compositional Zero-Shot Learning. We systematically review the state-of-the-art CZSL methods, introducing a taxonomy grounded in disentanglement, with four families of approaches: no explicit disentanglement, textual disentanglement, visual disentanglement, and cross-modal disentanglement. We provide a detailed comparative analysis of these methods, highlighting their core advantages and limitations in different problem settings, such as closed-world and open-world CZSL. Finally, we identify the most significant open challenges and outline promising future research directions. This survey aims to serve as a foundational resource to guide and inspire further advancements in this fascinating and important field. Papers studied in this survey with their official code are available on our github: https://github.com/ans92/Compositional-Zero-Shot-Learning

[112] MoMaps: Semantics-Aware Scene Motion Generation with Motion Maps cs.CVPDF

Jiahui Lei, Kyle Genova, George Kopanas, Noah Snavely, Leonidas Guibas

TL;DR: 该论文提出了MoMaps（Motion Maps）表示方法，用于从单张输入图像预测未来3D场景运动，并通过扩散模型学习运动分布。

Details

Motivation: 从真实视频中学习语义和功能上有意义的3D运动先验，以支持从单张图像预测未来场景运动的需求。

Result: 实验表明，该方法生成的3D场景运动既真实又语义一致。

Insight: MoMap是一种高效的3D运动表示方法，结合扩散模型能够显著提升运动生成的语义和功能质量。

Abstract: This paper addresses the challenge of learning semantically and functionally meaningful 3D motion priors from real-world videos, in order to enable prediction of future 3D scene motion from a single input image. We propose a novel pixel-aligned Motion Map (MoMap) representation for 3D scene motion, which can be generated from existing generative image models to facilitate efficient and effective motion prediction. To learn meaningful distributions over motion, we create a large-scale database of MoMaps from over 50,000 real videos and train a diffusion model on these representations. Our motion generation not only synthesizes trajectories in 3D but also suggests a new pipeline for 2D video synthesis: first generate a MoMap, then warp an image accordingly and complete the warped point-based renderings. Experimental results demonstrate that our approach generates plausible and semantically consistent 3D scene motion.

[113] Multimodal Disease Progression Modeling via Spatiotemporal Disentanglement and Multiscale Alignment cs.CVPDF

Chen Liu, Wenfang Yao, Kejing Yin, William K. Cheung, Jing Qin

TL;DR: DiPro是一个通过时空解耦和多尺度对齐建模疾病进展的多模态框架，解决了CXR序列的冗余和EHR数据的时间不对齐问题。

Details

Motivation: 纵向多模态数据（如EHR和CXR）对疾病进展建模至关重要，但存在CXR序列冗余和与EHR时间不对齐的挑战。

Result: 在MIMIC数据集上，DiPro在疾病进展识别和ICU预测任务中达到最先进性能。

Insight: 解耦和多尺度对齐是多模态疾病建模的关键，可显著提升动态特征的提取和时序一致性。

Abstract: Longitudinal multimodal data, including electronic health records (EHR) and sequential chest X-rays (CXRs), is critical for modeling disease progression, yet remains underutilized due to two key challenges: (1) redundancy in consecutive CXR sequences, where static anatomical regions dominate over clinically-meaningful dynamics, and (2) temporal misalignment between sparse, irregular imaging and continuous EHR data. We introduce $\texttt{DiPro}$, a novel framework that addresses these challenges through region-aware disentanglement and multi-timescale alignment. First, we disentangle static (anatomy) and dynamic (pathology progression) features in sequential CXRs, prioritizing disease-relevant changes. Second, we hierarchically align these static and dynamic CXR features with asynchronous EHR data via local (pairwise interval-level) and global (full-sequence) synchronization to model coherent progression pathways. Extensive experiments on the MIMIC dataset demonstrate that $\texttt{DiPro}$ could effectively extract temporal clinical dynamics and achieve state-of-the-art performance on both disease progression identification and general ICU prediction tasks.

[114] Connecting Giants: Synergistic Knowledge Transfer of Large Multimodal Models for Few-Shot Learning cs.CV | cs.MMPDF

Hao Tang, Shengfeng He, Jing Qin

TL;DR: SynTrans 是一个新颖的框架，通过从大型多模态模型（如 CLIP）中协同转移知识，显著提升了少样本学习的性能。

Details

Motivation: 现有的少样本学习方法通常依赖小型模型的语义知识，但这些知识可能包含噪声和偏差。SynTrans 旨在通过利用大型多模态模型的多样化知识来解决这一问题。

Result: 在四个少样本学习数据集上表现优异，显著优于当前最先进方法。

Insight: 大型多模态模型的多样化知识可以有效解决少样本学习中的数据稀缺问题，且协同知识转移能显著提升性能。

Abstract: Few-shot learning (FSL) addresses the challenge of classifying novel classes with limited training samples. While some methods leverage semantic knowledge from smaller-scale models to mitigate data scarcity, these approaches often introduce noise and bias due to the data’s inherent simplicity. In this paper, we propose a novel framework, Synergistic Knowledge Transfer (SynTrans), which effectively transfers diverse and complementary knowledge from large multimodal models to empower the off-the-shelf few-shot learner. Specifically, SynTrans employs CLIP as a robust teacher and uses a few-shot vision encoder as a weak student, distilling semantic-aligned visual knowledge via an unsupervised proxy task. Subsequently, a training-free synergistic knowledge mining module facilitates collaboration among large multimodal models to extract high-quality semantic knowledge. Building upon this, a visual-semantic bridging module enables bi-directional knowledge transfer between visual and semantic spaces, transforming explicit visual and implicit semantic knowledge into category-specific classifier weights. Finally, SynTrans introduces a visual weight generator and a semantic weight reconstructor to adaptively construct optimal multimodal FSL classifiers. Experimental results on four FSL datasets demonstrate that SynTrans, even when paired with a simple few-shot vision encoder, significantly outperforms current state-of-the-art methods.

[115] video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory cs.CV | cs.AIPDF

Guangzhi Sun, Yixuan Li, Xiaodong Wu, Yudong Yang, Wei Li

TL;DR: video-SALMONN S是一种流式音频-视觉LLM，首次能够在固定内存预算下处理3小时1 FPS 360p的视频。通过测试时训练（TTT）内存模块和提示依赖的内存读取器，实现了长视频的高质量理解。

Details

Motivation: 当前视频理解LLM在处理长视频时存在内存和时间限制的挑战，需要一种能够持续处理高帧率高分辨率视频流的方法。

Result: 8B参数模型在Video-MME长分割上达到74.2%和67.8%的性能，优于离线和流式基线。

Insight: 动态更新的TTT模块和选择性内存检索是处理长视频的关键，未来可扩展为更复杂的AI代理任务。

Abstract: Continuous, high-frame-rate, high-resolution processing of long video streams is critical for future AI agents, yet current video-understanding LLMs struggle to scale. Offline, fixed-frame-number methods require the stream length to adapt frame rates; streaming methods constrain memory by merging or discarding tokens, losing information. We propose video-SALMONN S, a streaming audio-visual LLM that, to our knowledge, is the first to process 3-hour videos at 1 FPS and 360p resolution under a fixed memory budget. Our model introduces (i) a test-time-training (TTT) memory module that continually updates token representations to capture long-range dependencies by replacing token merging, and (ii) a prompt-dependent memory reader that selectively retrieves context-relevant content from fixed-size memory. The TTT module is optimised with a Hessian-free conjugate-gradient procedure (TTT_HF) for efficient adaptation. On long-video benchmarks (Video-MME, LVBench, VideoEvalPro), video-SALMONN S sustains high-quality understanding on multi-hour videos with 10k frames and 1M tokens. Our 8B-parameter model achieves 74.2% overall and 67.8% on the Video-MME long split, outperforming both offline and streaming baselines.

[116] Validation of an Artificial Intelligence Tool for the Detection of Sperm DNA Fragmentation Using the TUNEL In Situ Hybridization Assay cs.CVPDF

Byron Alexander Jacobs, Aqeel Morris, Ifthakaar Shaik, Frando Lin

TL;DR: 本研究验证了一种基于人工智能的工具，通过相位对比显微镜图像的数字分析检测精子DNA断裂（SDF），并证明了其有效性。

Details

Motivation: 传统精液分析无法评估精子DNA断裂（SDF），这是男性生育能力评估中的关键参数。本研究旨在开发一种非破坏性方法，提高SDF检测的准确性和效率。

Result: 提出的模型在灵敏度（60%）和特异性（75%）方面表现出色，为非破坏性实时精子选择提供了新方法。

Insight: 该研究展示了AI在生殖医学中的潜力，通过结合形态学和深度学习方法，可以实现精子DNA完整性的高效评估。

Abstract: Sperm DNA fragmentation (SDF) is a critical parameter in male fertility assessment that conventional semen analysis fails to evaluate. This study presents the validation of a novel artificial intelligence (AI) tool designed to detect SDF through digital analysis of phase contrast microscopy images, using the terminal deoxynucleotidyl transferase dUTP nick end labeling (TUNEL) assay as the gold standard reference. Utilising the established link between sperm morphology and DNA integrity, the present work proposes a morphology assisted ensemble AI model that combines image processing techniques with state-of-the-art transformer based machine learning models (GC-ViT) for the prediction of DNA fragmentation in sperm from phase contrast images. The ensemble model is benchmarked against a pure transformer vision' model as well as a morphology-only` model. Promising results show the proposed framework is able to achieve sensitivity of 60% and specificity of 75%. This non-destructive methodology represents a significant advancement in reproductive medicine by enabling real-time sperm selection based on DNA integrity for clinical diagnostic and therapeutic applications.

[117] Multiview Manifold Evidential Fusion for PolSAR Image Classification cs.CVPDF

Junfei Shi, Haojia Zhang, Haiyan Jin, Junhuai Li, Xiaogang Song

TL;DR: 该论文提出了一种多流形证据融合方法（MMEFnet），用于极化合成孔径雷达（PolSAR）图像的分类，解决了传统融合方法忽略多视图差异和不确定性的问题。

Details

Motivation: 传统的PolSAR图像分类方法通常直接拼接多特征或使用深度学习结合特征，但这些方法忽略了不同视图（如协方差矩阵和多特征）处于不同流形的问题，也未考虑视图的重要性和不确定性。

Result: 在三个真实PolSAR数据集上的实验表明，MMEFnet在准确性、鲁棒性和可解释性上均优于现有方法。

Insight: 通过显式建模不同视图的流形结构和不确定性，提升了融合效果；证据理论为分类提供了可靠的可解释性。

Abstract: Polarimetric Synthetic Aperture Radar (PolSAR) covariance matrices and their extracted multi-features - such as scattering angle, entropy, texture, and boundary descriptors - provide complementary and physically interpretable information for image classification. Traditional fusion strategies typically concatenate these features or employ deep learning networks to combine them. However, the covariance matrices and multi-features, as two complementary views, lie on different manifolds with distinct geometric structures. Existing fusion methods also overlook the varying importance of different views and ignore uncertainty, often leading to unreliable predictions. To address these issues, we propose a Multiview Manifold Evidential Fusion (MMEFnet) method to effectively fuse these two views. It gives a new framework to integrate PolSAR manifold learning and evidence fusion into a unified architecture. Specifically, covariance matrices are represented on the Hermitian Positive Definite (HPD) manifold, while multi-features are modeled on the Grassmann manifold. Two different kernel metric learning networks are constructed to learn their manifold representations. Subsequently, a trusted multiview evidence fusion, replacing the conventional softmax classifier, estimates belief mass and quantifies the uncertainty of each view from the learned deep features. Finally, a Dempster-Shafer theory-based fusion strategy combines evidence, enabling a more reliable and interpretable classification. Extensive experiments on three real-world PolSAR datasets demonstrate that the proposed method consistently outperforms existing approaches in accuracy, robustness, and interpretability.

[118] CoPRS: Learning Positional Prior from Chain-of-Thought for Reasoning Segmentation cs.CV | cs.MMPDF

Zhenyu Lu, Liupeng Li, Jinpeng Wang, Yan Feng, Bin Chen

TL;DR: CoPRS提出了一种基于多模态链式思维（MCoT）的位置感知模型，通过可微分和可解释的热图作为位置先验，将语言推理与图像分割联系起来，提升了推理过程的可解释性和分割精度。

Details

Motivation: 现有方法在推理分割中要么直接将语言模型的隐藏特征连接到掩码解码器，要么通过文本表示位置，限制了可解释性和语义细节。CoPRS旨在解决这一问题。

Result: 在RefCOCO系列和ReasonSeg数据集上，CoPRS匹配或超越现有最佳性能，验证了推理输出与掩码生成的一致性。

Insight: 热图质量直接影响掩码质量，表明推理驱动的集中性和分割精度具有强关联。

Abstract: Existing works on reasoning segmentation either connect hidden features from a language model directly to a mask decoder or represent positions in text, which limits interpretability and semantic detail. To solve this, we present CoPRS, a Multi-modal Chain-of-Thought (MCoT)-based positional perception model that bridges language reasoning to segmentation through a differentiable and interpretable positional prior instantiated as a heatmap. By making the reasoning process clear via MCoT and expressing it as a dense, differentiable heatmap, this interface enhances interpretability and diagnostic analysis and yields more concentrated evidence on the target. A learnable concentration token aggregates features of the image and reasoning text to generate this positional prior, which is decoded to precise masks through a lightweight decoder, providing a direct connection between reasoning and segmentation. Across the RefCOCO series and ReasonSeg, CoPRS matches or surpasses the best reported metrics on each standard split under comparable protocols, with performance at or above prior state of the art across both validation and test partitions. Extensive experiments reveal that the quality of the heatmap strongly influences the resulting mask quality, supporting a consistent association between the reasoning output and downstream mask generation. Collectively, these findings support the utility of this paradigm in bridging reasoning and segmentation and show advantages in concentration driven by reasoning and predicting masks more precisely. Code, checkpoints and logs are released at https://github.com/ZhenyuLU-Heliodore/CoPRS.git.

Xiang Ma, Litian Xu, Lexin Fang, Caiming Zhang, Lizhen Cui

TL;DR: 该论文提出了一种名为PICO的新框架，旨在通过原型迭代构造可靠地抑制风格信息在多模态对齐中的干扰。

Details

Motivation: 传统的跨模态对齐方法假设嵌入仅包含语义信息，忽略了非语义信息（如风格）的干扰，导致信息偏差甚至丢失。

Result: 在多种基准测试和模型骨干网络上，PICO优于现有方法5.2%-14.1%。

Insight: 分离语义与风格信息并仅对齐语义信息是提升跨模态对齐效果的关键，原型迭代构造方法为此提供了可靠支持。

Abstract: Cross-modal alignment is an important multi-modal task, aiming to bridge the semantic gap between different modalities. The most reliable fundamention for achieving this objective lies in the semantic consistency between matched pairs. Conventional methods implicitly assume embeddings contain solely semantic information, ignoring the impact of non-semantic information during alignment, which inevitably leads to information bias or even loss. These non-semantic information primarily manifest as stylistic variations in the data, which we formally define as style information. An intuitive approach is to separate style from semantics, aligning only the semantic information. However, most existing methods distinguish them based on feature columns, which cannot represent the complex coupling relationship between semantic and style information. In this paper, we propose PICO, a novel framework for suppressing style interference during embedding interaction. Specifically, we quantify the probability of each feature column representing semantic information, and regard it as the weight during the embedding interaction. To ensure the reliability of the semantic probability, we propose a prototype iterative construction method. The key operation of this method is a performance feedback-based weighting function, and we have theoretically proven that the function can assign higher weight to prototypes that bring higher performance improvements. Extensive experiments on various benchmarks and model backbones demonstrate the superiority of PICO, outperforming state-of-the-art methods by 5.2%-14.1%.

[120] BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models cs.CV | cs.CYPDF

Bryan Chen Zhengyu Tan, Zheng Weihua, Zhengyuan Liu, Nancy F. Chen, Hwaran Lee

TL;DR: BLEnD-Vis是一个多模态、多文化基准测试，用于评估视觉语言模型（VLMs）在日常文化知识理解中的鲁棒性和跨模态一致性。

Details

Motivation: 现有的VLMs评估主要关注静态回忆或孤立视觉任务，缺乏对其文化理解能力的系统性测试。BLEnD-Vis旨在填补这一空白，评估模型在多文化背景下的表现。

Result: 研究发现当前VLMs在文化知识上存在显著脆弱性，语言改写会导致性能下降，且视觉线索的引入虽然有一定帮助，但跨模态一致性较低，尤其在资源匮乏地区表现更差。

Insight: BLEnD-Vis揭示了VLMs在文化理解和多模态融合上的局限性，为开发更文化敏感的模型提供了方向。

Abstract: As vision-language models (VLMs) are deployed globally, their ability to understand culturally situated knowledge becomes essential. Yet, existing evaluations largely assess static recall or isolated visual grounding, leaving unanswered whether VLMs possess robust and transferable cultural understanding. We introduce BLEnD-Vis, a multimodal, multicultural benchmark designed to evaluate the robustness of everyday cultural knowledge in VLMs across linguistic rephrasings and visual modalities. Building on the BLEnD dataset, BLEnD-Vis constructs 313 culturally grounded question templates spanning 16 regions and generates three aligned multiple-choice formats: (i) a text-only baseline querying from Region $\to$ Entity, (ii) an inverted text-only variant (Entity $\to$ Region), and (iii) a VQA-style version of (ii) with generated images. The resulting benchmark comprises 4,916 images and over 21,000 multiple-choice question (MCQ) instances, validated through human annotation. BLEnD-Vis reveals significant fragility in current VLM cultural knowledge; models exhibit performance drops under linguistic rephrasing and, whilst visual cues often aid performance, low cross-modal consistency highlights challenges in robustly integrating textual and visual understanding, particularly for lower-resource regions. BLEnD-Vis thus provides a crucial testbed for systematically analysing cultural robustness and multimodal grounding, exposing limitations and guiding the development of more culturally competent VLMs.

[121] FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models cs.CVPDF

Shengming Yuan, Xinyu Lyu, Shuailong Wang, Beitao Chen, Jingkuan Song

TL;DR: 论文提出FlexAC框架，通过灵活控制多模态大语言模型（MLLMs）中的联想推理强度，解决了模型在忠实性与创造性之间的平衡问题。

Details

Motivation: MLLMs在不同任务中需要不同程度的联想推理，但现有方法缺乏灵活调节能力。FlexAC旨在通过内部机制分析和动态调节，弥补这一不足。

Result: 在Creation-MMBench中创造力提升5.8倍，CHAIR数据集上幻觉率降低29%，显著优于现有方法。

Insight: 联想推理是多维且任务依赖的，FlexAC的动态调节机制为MLLMs在多样任务中的适应性提供了新思路。

Abstract: Multimodal large language models (MLLMs) face an inherent trade-off between faithfulness and creativity, as different tasks require varying degrees of associative reasoning. However, existing methods lack the flexibility to modulate this reasoning strength, limiting MLLMs’ adaptability across factual and creative scenarios. To bridge this gap, we propose equipping MLLMs with mechanisms that enable flexible control over associative reasoning. We begin by investigating the internal mechanisms underlying associative behavior in MLLMs and find that: (1) middle layers play a pivotal role in shaping model’s associative tendencies, (2) modifying representations in these layers effectively regulates associative reasoning strength, and (3) hallucinations can be exploited to derive steering vectors that guide this modulation. Building on these findings, we introduce Flexible Association Control (FlexAC), a lightweight and training-free framework for modulating associative behavior in MLLMs. FlexAC first induces hallucination-guided intermediate representations to encode associative directions. Then, it selects high-association instances to construct effective associative steering vectors, whose strengths are adaptively calibrated to balance creative guidance with output stability. Finally, recognizing the multi-dimensional nature of associative reasoning, FlexAC incorporates task-specific associative vectors derived from a forward pass on a few target-domain samples, enabling models to follow diverse associative directions and better adapt to creative tasks. Notably, our method achieves up to a 5.8x improvement in creativity on Creation-MMBench and a 29% reduction in hallucination rate on CHAIR, surpassing existing baselines and demonstrating its effectiveness in enabling flexible control over associative reasoning in MLLMs. Our code is available at https://github.com/ylhz/FlexAC.

[122] Class Prototypes based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos cs.CVPDF

Rohit Gupta, Anirban Roy, Claire Christensen, Sujeong Kim, Sarah Gerard

TL;DR: 该论文提出了一种基于类原型的监督对比学习方法，用于处理多标签和细粒度的教育视频分类问题，并通过多模态Transformer网络捕捉视觉和音频线索的交互关系。

Details

Motivation: 随着儿童早期在线媒体消费的增长，迫切需要数据驱动的工具帮助教育者筛选适合的教育内容。论文针对这一问题，聚焦于识读和数学两类教育视频的多标签和细粒度分类。

Result: 在APPROVE、Youtube-8M和COIN等数据集上，提出的方法优于强基线。

Insight: 视觉和音频线索的交互对教育视频的理解至关重要，结合类原型和监督对比学习可以有效处理多标签和细粒度分类问题。

Abstract: The recent growth in the consumption of online media by children during early childhood necessitates data-driven tools enabling educators to filter out appropriate educational content for young learners. This paper presents an approach for detecting educational content in online videos. We focus on two widely used educational content classes: literacy and math. For each class, we choose prominent codes (sub-classes) based on the Common Core Standards. For example, literacy codes include letter names', letter sounds’, and math codes include counting', sorting’. We pose this as a fine-grained multilabel classification problem as videos can contain multiple types of educational content and the content classes can get visually similar (e.g., letter names' vs letter sounds’). We propose a novel class prototypes based supervised contrastive learning approach that can handle fine-grained samples associated with multiple labels. We learn a class prototype for each class and a loss function is employed to minimize the distances between a class prototype and the samples from the class. Similarly, distances between a class prototype and the samples from other classes are maximized. As the alignment between visual and audio cues are crucial for effective comprehension, we consider a multimodal transformer network to capture the interaction between visual and audio cues in videos while learning the embedding for videos. For evaluation, we present a dataset, APPROVE, employing educational videos from YouTube labeled with fine-grained education classes by education researchers. APPROVE consists of 193 hours of expert-annotated videos with 19 classes. The proposed approach outperforms strong baselines on APPROVE and other benchmarks such as Youtube-8M, and COIN. The dataset is available at https://github.com/rohit-gupta/MMContrast/tree/main/APPROVE

[123] Investigating Identity Signals in Conversational Facial Dynamics via Disentangled Expression Features cs.CVPDF

Masoumeh Chapariniya, Pierre Vuillecard, Jean-Marc Odobez, Volker Dellwo, Teodora Vukovic

TL;DR: 研究表明，面部表情的动态特征（而非静态外貌）可以用于身份识别。通过FLAME 3D可变形模型实现面部形状和表情动态的解耦，并结合对比学习模型在自然对话数据中验证其有效性。

Details

Motivation: 探索面部动态特征是否足以独立作为身份识别的依据，特别是在排除静态外貌影响的情况下。

Result: 在1,429人的分类任务中达到61.14%的准确率，远超随机概率；DNR与识别性能负相关。

Insight: 面部动态特征携带强身份标识，但形状估计的不稳定会削弱动态识别的有效性。

Abstract: This work investigates whether individuals can be identified solely through the pure dynamical components of their facial expressions, independent of static facial appearance. We leverage the FLAME 3D morphable model to achieve explicit disentanglement between facial shape and expression dynamics, extracting frame-by-frame parameters from conversational videos while retaining only expression and jaw coefficients. On the CANDOR dataset of 1,429 speakers in naturalistic conversations, our Conformer model with supervised contrastive learning achieves 61.14%accuracy on 1,429-way classification – 458 times above chance – demonstrating that facial dynamics carry strong identity signatures. We introduce a drift-to-noise ratio (DNR) that quantifies the reliability of shape expression separation by measuring across-session shape changes relative to within-session variability. DNR strongly negatively correlates with recognition performance, confirming that unstable shape estimation compromises dynamic identification. Our findings reveal person-specific signatures in conversational facial dynamics, with implications for social perception and clinical assessment.

[124] A Large-Language-Model Assisted Automated Scale Bar Detection and Extraction Framework for Scanning Electron Microscopic Images cs.CV | cond-mat.mtrl-sci | cs.AI | physics.data-anPDF

Yuxuan Chen, Ruotong Yang, Zhengyang Zhang, Mehreen Ahmed, Yanming Wang

TL;DR: 论文提出了一种结合多模态和大语言模型（LLM）的自动化标尺检测与提取框架，用于扫描电子显微镜（SEM）图像，显著提高了效率和准确性。

Details

Motivation: SEM图像的标尺检测目前依赖人工操作，耗时且易出错。研究旨在通过自动化框架解决这一问题，提高科学图像的解析效率和可靠性。

Result: 标尺检测精度100%，召回率95.8%，mAP@0.5为99.2%；混合OCR精度89%，召回率65%，F1分数75%，优于主流OCR引擎。

Insight: LLM作为推理引擎可以提升自动化科学图像分析的可靠性和智能化，未来有望扩展到其他模态或多任务场景。

Abstract: Microscopic characterizations, such as Scanning Electron Microscopy (SEM), are widely used in scientific research for visualizing and analyzing microstructures. Determining the scale bars is an important first step of accurate SEM analysis; however, currently, it mainly relies on manual operations, which is both time-consuming and prone to errors. To address this issue, we propose a multi-modal and automated scale bar detection and extraction framework that provides concurrent object detection, text detection and text recognition with a Large Language Model (LLM) agent. The proposed framework operates in four phases; i) Automatic Dataset Generation (Auto-DG) model to synthesize a diverse dataset of SEM images ensuring robust training and high generalizability of the model, ii) scale bar object detection, iii) information extraction using a hybrid Optical Character Recognition (OCR) system with DenseNet and Convolutional Recurrent Neural Network (CRNN) based algorithms, iv) an LLM agent to analyze and verify accuracy of the results. The proposed model demonstrates a strong performance in object detection and accurate localization with a precision of 100%, recall of 95.8%, and a mean Average Precision (mAP) of 99.2% at IoU=0.5 and 69.1% at IoU=0.5:0.95. The hybrid OCR system achieved 89% precision, 65% recall, and a 75% F1 score on the Auto-DG dataset, significantly outperforming several mainstream standalone engines, highlighting its reliability for scientific image analysis. The LLM is introduced as a reasoning engine as well as an intelligent assistant that suggests follow-up steps and verifies the results. This automated method powered by an LLM agent significantly enhances the efficiency and accuracy of scale bar detection and extraction in SEM images, providing a valuable tool for microscopic analysis and advancing the field of scientific imaging.

[125] Exploring and Leveraging Class Vectors for Classifier Editing cs.CVPDF

Jaeik Kim, Jaeyoung Do

TL;DR: 论文引入Class Vectors（类向量）用于图像分类器编辑，解决了现有方法灵活性不足或成本过高的问题。通过隐空间和权重空间的调整，实现了高效编辑和高层次概念操作。

Details

Motivation: 图像分类器在训练后行为固定，难以进行事后编辑，尤其是在遗忘特定类别或适应分布变化时。现有方法要么范围有限，要么成本高昂。

Result: 验证了Class Vectors在遗忘学习、环境适应、对抗防御和对抗触发优化等应用中的有效性。

Insight: Class Vectors的线性性和正交性为分类器编辑提供了新的高效工具，支持语义调整和高层次概念操作，适用于多重应用场景。

Abstract: Image classifiers play a critical role in detecting diseases in medical imaging and identifying anomalies in manufacturing processes. However, their predefined behaviors after extensive training make post hoc model editing difficult, especially when it comes to forgetting specific classes or adapting to distribution shifts. Existing classifier editing methods either focus narrowly on correcting errors or incur extensive retraining costs, creating a bottleneck for flexible editing. Moreover, such editing has seen limited investigation in image classification. To overcome these challenges, we introduce Class Vectors, which capture class-specific representation adjustments during fine-tuning. Whereas task vectors encode task-level changes in weight space, Class Vectors disentangle each class’s adaptation in the latent space. We show that Class Vectors capture each class’s semantic shift and that classifier editing can be achieved either by steering latent features along these vectors or by mapping them into weight space to update the decision boundaries. We also demonstrate that the inherent linearity and orthogonality of Class Vectors support efficient, flexible, and high-level concept editing via simple class arithmetic. Finally, we validate their utility in applications such as unlearning, environmental adaptation, adversarial defense, and adversarial trigger optimization.

[126] Human Uncertainty-Aware Data Selection and Automatic Labeling in Visual Question Answering cs.CVPDF

Jian Lan, Zhicheng Liu, Udo Schlegel, Raoyuan Zhao, Yihong Liu

TL;DR: 这篇论文提出了一个名为HaDola的框架，旨在通过数据选择和自动标注来利用人类不确定性（HU）优化视觉问答任务中的监督微调（SFT），减少对昂贵标注数据的依赖。

Details

Motivation: 视觉语言模型（VLMs）在视觉问答任务中表现优秀，但仍依赖大量标注数据进行监督微调（SFT）。现有方法忽略人类不确定性（HU）分布，导致性能受损和模型校准不足。

Result: 在VQAv2和VizWiz数据集上的实验表明，HaDola仅需5%的标注数据即可匹配或超越现有基线方法，同时提升了模型准确性和校准能力。

Insight: 研究表明，合理利用人类不确定性（如剔除高HU样本）比单纯扩大数据集规模更有效。这种方法为减少标注成本提供了新思路。

Abstract: Large vision-language models (VLMs) achieve strong performance in Visual Question Answering but still rely heavily on supervised fine-tuning (SFT) with massive labeled datasets, which is costly due to human annotations. Crucially, real-world datasets often exhibit human uncertainty (HU) – variation in human confidence across annotations – but standard SFT simply optimizes toward the most frequent label, disregarding HU distributions. This leaves two open questions: How does HU affect SFT, and how can HU be effectively leveraged in training? In this work, we first conduct a systematic evaluation of VLMs across varying HU levels. We have two key findings: (i) surprisingly, high-HU samples contribute little or even degrade model performance, and (ii) naively training on the full dataset yields under-calibrated models that fail to capture HU distributions. Motivated by these findings, we introduce HaDola, a human uncertainty-aware data selection and automatic labeling framework. HaDola operates in four stages – discriminate, self-annotate, error trigger, and training – to iteratively identify harmful samples, prioritize informative ones, and bootstrap from a small seed set (5% of data). Our approach substantially reduces reliance on costly HU annotations and makes VLMs more accurate and better calibrated. Extensive experiments on VQAv2 and VizWiz datasets demonstrate that HaDola consistently matches or outperforms state-of-the-art baselines with less training data. Our work highlights the importance of explicitly modeling HU in SFT, suggesting that better utilization of HU is more effective than merely scaling up dataset size.

[127] $Δ\mathrm{Energy}$: Optimizing Energy Change During Vision-Language Alignment Improves both OOD Detection and OOD Generalization cs.CV | cs.LGPDF

Lin Zhu, Yifeng Yang, Xinbing Wang, Qinying Gu, Nanyang Ye

TL;DR: 论文提出了一种新的OOD评分方法ΔEnergy，通过优化视觉-语言对齐过程中的能量变化，显著提升了OOD检测和OOD泛化能力。该方法通过最大化ΔEnergy的下界（EBM）实现理论和实验上的优越表现。

Details

Motivation: 现有视觉-语言模型（VLMs）在真实下游任务中会同时遇到分布内（ID）和分布外（OOD）数据，而OOD数据包括协变量偏移（如图像风格变化）和语义偏移（如未见类别）。因此，需要提升VLMs对OOD数据的泛化能力并有效检测语义偏移的OOD类别。

Result: 在挑战性OOD检测和泛化基准上，方法显著优于现有方法，AUROC提升10%到25%。

Insight: ΔEnergy能量变化优化不仅能提升OOD检测性能，还能通过域一致性Hessian矩阵改善OOD泛化能力，揭示了能量变化与模型鲁棒性的关系。

Abstract: Recent approaches for vision-language models (VLMs) have shown remarkable success in achieving fast downstream adaptation. When applied to real-world downstream tasks, VLMs inevitably encounter both the in-distribution (ID) data and out-of-distribution (OOD) data. The OOD datasets often include both covariate shifts (e.g., known classes with changes in image styles) and semantic shifts (e.g., test-time unseen classes). This highlights the importance of improving VLMs’ generalization ability to covariate-shifted OOD data, while effectively detecting open-set semantic-shifted OOD classes. In this paper, inspired by the substantial energy change observed in closed-set data when re-aligning vision-language modalities (specifically by directly reducing the maximum cosine similarity to a low value), we introduce a novel OOD score, named {\Delta}Energy. {\Delta}Energy significantly outperforms the vanilla energy-based OOD score and provides a more reliable approach for OOD detection. Furthermore, {\Delta}Energy can simultaneously improve OOD generalization under covariate shifts, which is achieved by lower-bound maximization for {\Delta}Energy (termed EBM). EBM is theoretically proven to not only enhance OOD detection but also yields a domain-consistent Hessian, which serves as a strong indicator for OOD generalization. Based on this finding, we developed a unified fine-tuning framework that allows for improving VLMs’ robustness in both OOD generalization and OOD detection. Extensive experiments on challenging OOD detection and generalization benchmarks demonstrate the superiority of our method, outperforming recent approaches by 10% to 25% in AUROC.

[128] When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models cs.CV | cs.AI | cs.LGPDF

Samer Al-Hamadani

TL;DR: 该论文首次对传统监督学习的YOLO目标检测系统与基于视觉语言模型（VLM）的零样本检测方法（Gemini Flash 2.5）进行了成本效益分析，揭示了在不同场景下的经济性和效率权衡。

Details

Motivation: 传统的目标检测方法依赖大量人工标注数据，成本高昂，而零样本检测的VLM方法无需标注但准确率较低。本文旨在比较这两种方法的经济性，为实际应用提供决策依据。

Result: 结果表明，监督学习的YOLO在标准类别上准确率高达91.2%，但需要高昂的标注成本；而零样本Gemini在稀有类别上仍有52.3%的准确率，且单次检测成本更低。

Insight: 研究发现，选择检测方法时需综合考虑经济性和效率，零样本方法在小规模或类别动态变化的场景下更具优势，而监督方法在大规模稳定类别场景下更经济。

Abstract: Object detection systems have traditionally relied on supervised learning with manually annotated bounding boxes, achieving high accuracy at the cost of substantial annotation investment. The emergence of Vision-Language Models (VLMs) offers an alternative paradigm enabling zero-shot detection through natural language queries, eliminating annotation requirements but operating with reduced accuracy. This paper presents the first comprehensive cost-effectiveness analysis comparing supervised detection (YOLO) with zero-shot VLM inference (Gemini Flash 2.5). Through systematic evaluation on 1,000 stratified COCO images and 200 diverse product images spanning consumer electronics and rare categories, combined with detailed Total Cost of Ownership modeling, we establish quantitative break-even thresholds governing architecture selection. Our findings reveal that supervised YOLO achieves 91.2% accuracy versus 68.5% for zero-shot Gemini on standard categories, representing a 22.7 percentage point advantage that costs $10,800 in annotation for 100-category systems. However, this advantage justifies investment only beyond 55 million inferences, equivalent to 151,000 images daily for one year. Zero-shot Gemini demonstrates 52.3% accuracy on diverse product categories (ranging from highly web-prevalent consumer electronics at 75-85% to rare specialized equipment at 25-40%) where supervised YOLO achieves 0% due to architectural constraints preventing detection of untrained classes. Cost per Correct Detection analysis reveals substantially lower per-detection costs for Gemini ($0.00050 vs $0.143) at 100,000 inferences despite accuracy deficits. We develop decision frameworks demonstrating that optimal architecture selection depends critically on deployment volume, category stability, budget constraints, and accuracy requirements rather than purely technical performance metrics.

[129] sketch2symm: Symmetry-aware sketch-to-shape generation via semantic bridging cs.CVPDF

Yan Zhou, Mingji Li, Xiantao Zeng, Jie Lin, Yuexia Zhou

TL;DR: Sketch2Symm通过语义桥接和对称约束，从稀疏草图生成对称感知的3D形状，显著提升了重建效果。

Details

Motivation: 解决草图输入因抽象和稀疏性导致的语义和几何信息不足问题。

Result: 在多个评估指标（Chamfer Distance、Earth Mover’s Distance、F-Score）上表现优于现有方法。

Insight: 语义桥接和对称约束能有效弥补草图输入的不足，提升3D重建质量。

Abstract: Sketch-based 3D reconstruction remains a challenging task due to the abstract and sparse nature of sketch inputs, which often lack sufficient semantic and geometric information. To address this, we propose Sketch2Symm, a two-stage generation method that produces geometrically consistent 3D shapes from sketches. Our approach introduces semantic bridging via sketch-to-image translation to enrich sparse sketch representations, and incorporates symmetry constraints as geometric priors to leverage the structural regularity commonly found in everyday objects. Experiments on mainstream sketch datasets demonstrate that our method achieves superior performance compared to existing sketch-based reconstruction methods in terms of Chamfer Distance, Earth Mover’s Distance, and F-Score, verifying the effectiveness of the proposed semantic bridging and symmetry-aware design.

[130] InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models cs.CVPDF

Haomin Wang, Jinhui Yin, Qi Wei, Wenguang Zeng, Lixin Gu

TL;DR: 该论文提出了一种统一的SVG（可缩放矢量图形）建模方法InternSVG，利用多模态大语言模型（MLLMs）的能力，解决了数据集碎片化、任务间方法迁移性差和结构复杂性高的问题。其核心贡献包括数据集SAgoge、基准测试SArena和模型InternSVG，实现了SVG理解、编辑和生成的统一建模。

Details

Motivation: SVG建模面临数据集碎片化、任务间迁移性差和结构复杂性高的挑战，需要一种统一的解决方案。

Result: 在SArena和现有基准上的实验表明，InternSVG在性能上显著优于现有开源和专有模型。

Insight: 通过统一的MLLM框架，结合大规模数据集和针对性训练策略，可以有效解决SVG建模的复杂性和多样性问题。

Abstract: General SVG modeling remains challenging due to fragmented datasets, limited transferability of methods across tasks, and the difficulty of handling structural complexity. In response, we leverage the strong transfer and generalization capabilities of multimodal large language models (MLLMs) to achieve unified modeling for SVG understanding, editing, and generation. We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks, encompassing both static graphics and dynamic animations. It covers icons, long-sequence illustrations, scientific diagrams, and dynamic animations, supporting tasks of varied difficulty levels and providing deeper hierarchies with richer attributes compared to previous datasets. Based on this resource, we introduce SArena, a companion benchmark with comprehensive task definitions and standardized evaluation that aligns with the domains and difficulty spectrum covered by SAgoge. Building on these foundations, we propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens, subword-based embedding initialization, and a two-stage training strategy that progresses from short static SVGs to long-sequence illustrations and complex animations. This unified formulation induces positive transfer and improves overall performance. Experiments on SArena and prior benchmark confirm that InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts.

[131] DocReward: A Document Reward Model for Structuring and Stylizing cs.CV | cs.AI | cs.CLPDF

Junpeng Liu, Yuzhong Zhao, Bowen Cao, Jiayu Ding, Yilin Jia

TL;DR: DocReward是一种文档奖励模型，专注于评估文档的结构和风格质量，弥补了现有技术在视觉结构和风格方面的不足。

Details

Motivation: 现有自动化文档生成技术主要关注文本质量，忽视了视觉结构和风格对文档可读性和吸引力的重要性，因此需要一种新的奖励模型来弥补这一缺陷。

Result: DocReward在准确性上分别超过GPT-4o和GPT-5 30.6和19.4个百分点，并在生成任务中取得了60.8%的胜率。

Insight: 文档的结构和风格对专业性和用户体验至关重要，DocReward为自动化文档生成提供了有效的指导工具。

Abstract: Recent advances in agentic workflows have enabled the automation of tasks such as professional document generation. However, they primarily focus on textual quality, neglecting visual structure and style, which are crucial for readability and engagement. This gap arises mainly from the absence of suitable reward models to guide agentic workflows toward producing documents with stronger structural and stylistic quality. To address this, we propose DocReward, a document reward model that evaluates documents based on their structure and style. We construct a multi-domain dataset DocPair of 117K paired documents, covering 32 domains and 267 document types, each including a high- and low-professionalism document with identical content but different structure and style. This enables the model to evaluate professionalism comprehensively, and in a textual-quality-agnostic way. DocReward is trained using the Bradley-Terry loss to score documents, penalizing predictions that contradict the annotated ranking. To assess the performance of reward models, we create a test dataset containing document bundles ranked by well-educated human evaluators. Notably, DocReward outperforms GPT-4o and GPT-5 in accuracy by 30.6 and 19.4 percentage points, respectively, demonstrating its superiority over baselines. In an extrinsic evaluation of document generation, DocReward achieves a significantly higher win rate of 60.8%, compared to GPT-5’s 37.7% win rate, demonstrating its utility in guiding generation agents toward producing human-preferred documents.

[132] Uncertainty-Aware ControlNet: Bridging Domain Gaps with Synthetic Image Generation cs.CV | cs.AIPDF

Joshua Niemeijer, Jan Ehrhardt, Heinz Handels, Hristina Uzunova

TL;DR: 本文提出了一种不确定性感知的ControlNet方法，通过引入不确定性机制，利用无标注域数据训练ControlNet，生成目标域的合成标注数据，从而弥合领域差距，并显著提升下游任务的性能。

Details

Motivation: 生成模型虽能生成高质量图像数据，但现有ControlNet通常只能复制原始训练分布，限制了其增强下游任务的潜力。本文旨在利用无标注域数据，通过不确定性引导生成目标域的合成数据，解决领域差距问题。

Result: 实验表明，生成的合成数据显著改善了目标域（如低质量Home-OCT和交通场景）的分割性能，无需额外标注。

Insight: 不确定性引导的数据生成能够灵活适应任意领域偏移，无需严格的图像风格学习，为跨领域任务提供了一种高效解决方案。

Abstract: Generative Models are a valuable tool for the controlled creation of high-quality image data. Controlled diffusion models like the ControlNet have allowed the creation of labeled distributions. Such synthetic datasets can augment the original training distribution when discriminative models, like semantic segmentation, are trained. However, this augmentation effect is limited since ControlNets tend to reproduce the original training distribution. This work introduces a method to utilize data from unlabeled domains to train ControlNets by introducing the concept of uncertainty into the control mechanism. The uncertainty indicates that a given image was not part of the training distribution of a downstream task, e.g., segmentation. Thus, two types of control are engaged in the final network: an uncertainty control from an unlabeled dataset and a semantic control from the labeled dataset. The resulting ControlNet allows us to create annotated data with high uncertainty from the target domain, i.e., synthetic data from the unlabeled distribution with labels. In our scenario, we consider retinal OCTs, where typically high-quality Spectralis images are available with given ground truth segmentations, enabling the training of segmentation networks. The recent development in Home-OCT devices, however, yields retinal OCTs with lower quality and a large domain shift, such that out-of-the-pocket segmentation networks cannot be applied for this type of data. Synthesizing annotated images from the Home-OCT domain using the proposed approach closes this gap and leads to significantly improved segmentation results without adding any further supervision. The advantage of uncertainty-guidance becomes obvious when compared to style transfer: it enables arbitrary domain shifts without any strict learning of an image style. This is also demonstrated in a traffic scene experiment.

[133] Reasoning as Representation: Rethinking Visual Reinforcement Learning in Image Quality Assessment cs.CVPDF

Shijie Zhao, Xuanyu Zhang, Weiqi Li, Junlin Li, Li Zhang

TL;DR: 本文探讨了基于强化学习（RL）的图像质量评估（IQA）模型的泛化能力，并提出了一种新算法RALI，通过对比学习直接将图像与RL学习到的通用文本表示对齐，显著减少了推理时间和参数量。

Details

Motivation: 当前基于RL的IQA模型虽具备出色的泛化能力，但其推理能耗和延迟极高，限制了实际应用。本文旨在揭示其泛化机制并提出更高效的解决方案。

Result: RALI在质量评分任务中达到与RL模型相当的泛化性能，同时仅需不到5%的模型参数和推理时间。

Insight: RL模型的泛化源于推理能力对视觉表征的转化，而直接对齐文本表示可高效实现类似效果，为轻量化IQA模型提供了新思路。

Abstract: Reasoning-based image quality assessment (IQA) models trained through reinforcement learning (RL) exhibit exceptional generalization, yet the underlying mechanisms and critical factors driving this capability remain underexplored in current research. Moreover, despite their superior performance, these models incur inference energy usage and latency orders of magnitude higher than their earlier counterparts, restricting their deployment in specific scenarios. Through extensive experiments, this paper verifies and elaborates that through RL training, MLLMs leverage their reasoning capability to convert redundant visual representations into compact, cross-domain aligned text representations. This conversion is precisely the source of the generalization exhibited by these reasoning-based IQA models. Building on this fundamental insight, we propose a novel algorithm, RALI, which employs contrastive learning to directly align images with these generalizable text representations learned by RL. This approach eliminates the reliance on reasoning processes and even obviates the need to load an LLM. For the quality scoring task, this framework achieves generalization performance comparable to reasoning-based models while requiring less than 5% of their model parameters and inference time.

[134] Robust Ego-Exo Correspondence with Long-Term Memory cs.CVPDF

Yijun Hu, Bing Fan, Xin Gu, Haiqing Ren, Dongfang Liu

TL;DR: 论文提出了一种基于SAM 2的EEC框架LM-EEC，通过双记忆架构和自适应特征路由模块（MoE）解决了ego-exo视角下的特征融合和长期记忆问题，显著提升了性能。

Details

Motivation: 现有方法在处理egocentric和exocentric视角的物体对应关系时，面临视角变化大、遮挡和小物体等挑战。尽管SAM 2在视频分割中表现优异，但其在EEC任务中因特征融合和长期记忆不足而失效。

Result: 在EgoExo4D基准测试中，LM-EEC超越了现有方法和SAM 2基线，实现了新的SOTA结果。

Insight: 自适应特征路由和长期记忆管理是解决ego-exo视角对应问题的关键。

Abstract: Establishing object-level correspondence between egocentric and exocentric views is essential for intelligent assistants to deliver precise and intuitive visual guidance. However, this task faces numerous challenges, including extreme viewpoint variations, occlusions, and the presence of small objects. Existing approaches usually borrow solutions from video object segmentation models, but still suffer from the aforementioned challenges. Recently, the Segment Anything Model 2 (SAM 2) has shown strong generalization capabilities and excellent performance in video object segmentation. Yet, when simply applied to the ego-exo correspondence (EEC) task, SAM 2 encounters severe difficulties due to ineffective ego-exo feature fusion and limited long-term memory capacity, especially for long videos. Addressing these problems, we propose a novel EEC framework based on SAM 2 with long-term memories by presenting a dual-memory architecture and an adaptive feature routing module inspired by Mixture-of-Experts (MoE). Compared to SAM 2, our approach features (i) a Memory-View MoE module which consists of a dual-branch routing mechanism to adaptively assign contribution weights to each expert feature along both channel and spatial dimensions, and (ii) a dual-memory bank system with a simple yet effective compression strategy to retain critical long-term information while eliminating redundancy. In the extensive experiments on the challenging EgoExo4D benchmark, our method, dubbed LM-EEC, achieves new state-of-the-art results and significantly outperforms existing methods and the SAM 2 baseline, showcasing its strong generalization across diverse scenarios. Our code and model are available at https://github.com/juneyeeHu/LM-EEC.

[135] Enhancing Maritime Domain Awareness on Inland Waterways: A YOLO-Based Fusion of Satellite and AIS for Vessel Characterization cs.CVPDF

Geoffery Agorku, Sarah Hernandez, Hayley Hames, Cade Wagner

TL;DR: 该论文提出了一种基于YOLO v11的框架，通过融合高分辨率卫星图像和AIS数据，提升内陆水道的海上领域感知能力，解决了AIS监测的局限性。

Details

Motivation: 内陆水道的海上领域感知（MDA）存在合作系统（如AIS）的脆弱性问题。论文旨在通过非合作卫星图像与AIS的融合，弥补AIS的不足，提高船舶监测的准确性和可靠性。

Result: 1. 船舶分类F1分数95.8%；2. 状态检测F1分数99.4%；3. 方向性准确率93.8%；4. 驳船计数平均绝对误差2.4。

Insight: 融合卫星图像与AIS数据可以显著提升内陆水道监测能力，特别是弥补AIS的局限性（如‘黑暗船舶’问题）。未来可通过多模态深度学习进一步扩展方法。

Abstract: Maritime Domain Awareness (MDA) for inland waterways remains challenged by cooperative system vulnerabilities. This paper presents a novel framework that fuses high-resolution satellite imagery with vessel trajectory data from the Automatic Identification System (AIS). This work addresses the limitations of AIS-based monitoring by leveraging non-cooperative satellite imagery and implementing a fusion approach that links visual detections with AIS data to identify dark vessels, validate cooperative traffic, and support advanced MDA. The You Only Look Once (YOLO) v11 object detection model is used to detect and characterize vessels and barges by vessel type, barge cover, operational status, barge count, and direction of travel. An annotated data set of 4,550 instances was developed from $5{,}973~\mathrm{mi}^2$ of Lower Mississippi River imagery. Evaluation on a held-out test set demonstrated vessel classification (tugboat, crane barge, bulk carrier, cargo ship, and hopper barge) with an F1 score of 95.8%; barge cover (covered or uncovered) detection yielded an F1 score of 91.6%; operational status (staged or in motion) classification reached an F1 score of 99.4%. Directionality (upstream, downstream) yielded 93.8% accuracy. The barge count estimation resulted in a mean absolute error (MAE) of 2.4 barges. Spatial transferability analysis across geographically disjoint river segments showed accuracy was maintained as high as 98%. These results underscore the viability of integrating non-cooperative satellite sensing with AIS fusion. This approach enables near-real-time fleet inventories, supports anomaly detection, and generates high-quality data for inland waterway surveillance. Future work will expand annotated datasets, incorporate temporal tracking, and explore multi-modal deep learning to further enhance operational scalability.

[136] Coupled Degradation Modeling and Fusion: A VLM-Guided Degradation-Coupled Network for Degradation-Aware Infrared and Visible Image Fusion cs.CVPDF

Tianpei Zhang, Jufeng Zhao, Yiming Zhu, Guangmang Cui

TL;DR: 本文提出了一个新型的视觉语言模型引导的退化耦合融合网络（VGDCFusion），将退化建模与图像融合紧密结合，显著提升了退化场景下的红外与可见光图像融合性能。

Details

Motivation: 现有红外与可见光图像融合方法假设输入图像质量高，但在处理退化图像时依赖手动预处理，导致性能下降。本文旨在解决退化处理与图像融合的脱节问题。

Result: 实验表明，VGDCFusion在多种退化场景下显著优于现有融合方法。

Insight: 退化感知与图像融合的耦合能有效提升退化场景下的融合性能，VLM的引入为退化建模提供了新思路。

Abstract: Existing Infrared and Visible Image Fusion (IVIF) methods typically assume high-quality inputs. However, when handing degraded images, these methods heavily rely on manually switching between different pre-processing techniques. This decoupling of degradation handling and image fusion leads to significant performance degradation. In this paper, we propose a novel VLM-Guided Degradation-Coupled Fusion network (VGDCFusion), which tightly couples degradation modeling with the fusion process and leverages vision-language models (VLMs) for degradation-aware perception and guided suppression. Specifically, the proposed Specific-Prompt Degradation-Coupled Extractor (SPDCE) enables modality-specific degradation awareness and establishes a joint modeling of degradation suppression and intra-modal feature extraction. In parallel, the Joint-Prompt Degradation-Coupled Fusion (JPDCF) facilitates cross-modal degradation perception and couples residual degradation filtering with complementary cross-modal feature fusion. Extensive experiments demonstrate that our VGDCFusion significantly outperforms existing state-of-the-art fusion approaches under various degraded image scenarios. Our code is available at https://github.com/Lmmh058/VGDCFusion.

[137] VA-GS: Enhancing the Geometric Representation of Gaussian Splatting via View Alignment cs.CVPDF

Qing Li, Huifang Feng, Xun Gong, Yu-Shen Liu

TL;DR: 该论文提出一种新方法VA-GS，通过视角对齐（VA）增强3D高斯泼溅的几何表示，结合边缘感知图像线索、可见性感知光度对齐损失和基于法线的约束，提升了表面重建和新视角合成的性能。

Details

Motivation: 3D高斯泼溅在高质量和实时的视角合成中表现优异，但其表面重建的准确性仍有待提升。由于高斯的离散和非结构化特性，仅依靠图像渲染损失会导致几何不准确和多视角对齐不一致。

Result: 在标准基准测试中，VA-GS在表面重建和新视角合成方面均达到了最先进的性能。

Insight: 结合几何一致性约束和多模态监督（如法线和深度特征）可以显著提升3D高斯泼溅的几何表示能力。

Abstract: 3D Gaussian Splatting has recently emerged as an efficient solution for high-quality and real-time novel view synthesis. However, its capability for accurate surface reconstruction remains underexplored. Due to the discrete and unstructured nature of Gaussians, supervision based solely on image rendering loss often leads to inaccurate geometry and inconsistent multi-view alignment. In this work, we propose a novel method that enhances the geometric representation of 3D Gaussians through view alignment (VA). Specifically, we incorporate edge-aware image cues into the rendering loss to improve surface boundary delineation. To enforce geometric consistency across views, we introduce a visibility-aware photometric alignment loss that models occlusions and encourages accurate spatial relationships among Gaussians. To further mitigate ambiguities caused by lighting variations, we incorporate normal-based constraints to refine the spatial orientation of Gaussians and improve local surface estimation. Additionally, we leverage deep image feature embeddings to enforce cross-view consistency, enhancing the robustness of the learned geometry under varying viewpoints and illumination. Extensive experiments on standard benchmarks demonstrate that our method achieves state-of-the-art performance in both surface reconstruction and novel view synthesis. The source code is available at https://github.com/LeoQLi/VA-GS.

[138] AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model cs.CV | cs.AIPDF

Zhiwei Jin, Xiaohui Song, Nan Wang, Yafei Liu, Chao Li

TL;DR: AndesVL是一款高效的移动端多模态大语言模型（MLLM），专为边缘设备设计，参数规模从0.6B到4B，基于Qwen3架构，支持多种视觉编码器，性能媲美同类开源模型。

Details

Motivation: 现有的云端MLLMs（如GPT-4o、Gemini等）虽然性能强大，但无法在内存、功耗和计算能力受限的边缘设备上运行。因此，需要开发高效的移动端MLLM。

Result: AndesVL在开源基准测试中表现优异，覆盖多个任务领域，性能与同类规模的最先进模型相当。

Insight: 移动端MLLM可以在保持高效的同时实现高性能，LoRA等技术在模型压缩和优化中具有潜力。

Abstract: In recent years, while cloud-based MLLMs such as QwenVL, InternVL, GPT-4o, Gemini, and Claude Sonnet have demonstrated outstanding performance with enormous model sizes reaching hundreds of billions of parameters, they significantly surpass the limitations in memory, power consumption, and computing capacity of edge devices such as mobile phones. This paper introduces AndesVL, a suite of mobile-side MLLMs with 0.6B to 4B parameters based on Qwen3’s LLM and various visual encoders. We comprehensively outline the model architectures, training pipeline, and training data of AndesVL, which achieves first-tier performance across a wide range of open-source benchmarks, including fields such as text-rich image understanding, reasoning and math, multi-image comprehension, general VQA, hallucination mitigation, multilingual understanding, and GUI-related tasks when compared with state-of-the-art models of a similar scale. Furthermore, we introduce a 1+N LoR

[139] Towards Fast and Scalable Normal Integration using Continuous Components cs.CVPDF

Francesco Milano, Jen Jen Chung, Lionel Ott, Roland Siegwart

TL;DR: 该论文提出了一种快速且可扩展的法向量积分方法，通过将问题转化为连续组件的相对尺度估计，大幅减少了优化变量的数量，实现了高效的大规模法向量重建。

Details

Motivation: 传统的法向量积分方法需要进行全局迭代优化，计算量大且难以扩展到高分辨率法向量图。论文旨在解决这一效率问题。

Result: 在标准法向量积分基准上取得了最优结果，相较于像素级方法实现了数量级的加速，适用于高分辨率法向量图。

Insight: 通过连续组件的概念将像素级优化转化为更高层次的优化问题，显著降低了计算复杂度。

Abstract: Surface normal integration is a fundamental problem in computer vision, dealing with the objective of reconstructing a surface from its corresponding normal map. Existing approaches require an iterative global optimization to jointly estimate the depth of each pixel, which scales poorly to larger normal maps. In this paper, we address this problem by recasting normal integration as the estimation of relative scales of continuous components. By constraining pixels belonging to the same component to jointly vary their scale, we drastically reduce the number of optimization variables. Our framework includes a heuristic to accurately estimate continuous components from the start, a strategy to rebalance optimization terms, and a technique to iteratively merge components to further reduce the size of the problem. Our method achieves state-of-the-art results on the standard normal integration benchmark in as little as a few seconds and achieves one-order-of-magnitude speedup over pixel-level approaches on large-resolution normal maps.

[140] Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model cs.CVPDF

Ruiping Liu, Junwei Zheng, Yufan Chen, Zirui Wang, Kunyu Peng

TL;DR: Situat3DChange是一个大规模的数据集，支持三种情境感知变化理解任务，包括感知任务的动作任务。该数据集结合人类观察和多模态信息，提出了SCReasoner方法以高效比较点云数据。

Details

Motivation: 当前的3D数据集和评估基准通常专注于动态场景或动态情境的孤立研究，导致对动态环境的理解不完整。为此，作者提出了情境感知的变化理解数据集Situat3DChange。

Result: 在Situat3DChange任务上的综合评估显示了MLLMs的动态场景理解进展和限制。数据扩展和跨域实验证明了数据集的任务无关有效性。

Insight: 结合人类观察和多模态信息的情境感知数据集有助于提升动态环境理解。SCReasoner的轻量化设计为3D MLLMs提供了高效的解决方案。

Abstract: Physical environments and circumstances are fundamentally dynamic, yet current 3D datasets and evaluation benchmarks tend to concentrate on either dynamic scenarios or dynamic situations in isolation, resulting in incomplete comprehension. To overcome these constraints, we introduce Situat3DChange, an extensive dataset supporting three situation-aware change understanding tasks following the perception-action model: 121K question-answer pairs, 36K change descriptions for perception tasks, and 17K rearrangement instructions for the action task. To construct this large-scale dataset, Situat3DChange leverages 11K human observations of environmental changes to establish shared mental models and shared situational awareness for human-AI collaboration. These observations, enriched with egocentric and allocentric perspectives as well as categorical and coordinate spatial relations, are integrated using an LLM to support understanding of situated changes. To address the challenge of comparing pairs of point clouds from the same scene with minor changes, we propose SCReasoner, an efficient 3D MLLM approach that enables effective point cloud comparison with minimal parameter overhead and no additional tokens required for the language decoder. Comprehensive evaluation on Situat3DChange tasks highlights both the progress and limitations of MLLMs in dynamic scene and situation understanding. Additional experiments on data scaling and cross-domain transfer demonstrate the task-agnostic effectiveness of using Situat3DChange as a training dataset for MLLMs.

[141] LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference cs.CV | cs.AIPDF

Jianhao Yuan, Fabio Pizzati, Francesco Pinto, Lars Kunze, Ivan Laptev

TL;DR: LikePhys提出了一种无需训练的方法，通过基于ELBO的似然替代来评估视频扩散模型中的直觉物理理解能力，并在多个物理领域中验证其有效性。

Details

Motivation: 直觉物理理解在构建通用物理世界模拟器中至关重要，但现有评估方法难以区分物理正确性与视觉表现。

Result: PPE指标与人类偏好高度一致，并在不同物理领域中揭示了模型能力的差异与改进趋势。

Insight: 模型容量和推理设置的提升有助于改善物理理解能力，但在复杂和混沌动力学中仍存在挑战。

Abstract: Intuitive physics understanding in video diffusion models plays an essential role in building general-purpose physically plausible world simulators, yet accurately evaluating such capacity remains a challenging task due to the difficulty in disentangling physics correctness from visual appearance in generation. To the end, we introduce LikePhys, a training-free method that evaluates intuitive physics in video diffusion models by distinguishing physically valid and impossible videos using the denoising objective as an ELBO-based likelihood surrogate on a curated dataset of valid-invalid pairs. By testing on our constructed benchmark of twelve scenarios spanning over four physics domains, we show that our evaluation metric, Plausibility Preference Error (PPE), demonstrates strong alignment with human preference, outperforming state-of-the-art evaluator baselines. We then systematically benchmark intuitive physics understanding in current video diffusion models. Our study further analyses how model design and inference settings affect intuitive physics understanding and highlights domain-specific capacity variations across physical laws. Empirical results show that, despite current models struggling with complex and chaotic dynamics, there is a clear trend of improvement in physics understanding as model capacity and inference settings scale.

Kedi Ying, Ruiping Liu, Chongyan Chen, Mingzhe Tao, Hao Shi

TL;DR: 论文构建了mmWalk，一个多模态多视角的模拟数据集，用于支持盲人或低视力人群的室外安全导航，并生成了mmWalkVQA基准测试，验证了现有视觉语言模型在风险评估和导航任务上的不足。

Details

Motivation: 盲人或低视力人群在复杂环境中行走时缺乏全面的场景理解，因此亟需一种多模态多视角的数据集和辅助技术来提升安全导航能力。

Result: 实验表明，现有视觉语言模型在零样本和少样本设置下难以应对风险评估和导航任务；微调模型在真实数据集上表现更优。

Insight: 多模态数据的整合和真实世界的复杂性对提升盲人或低视力人群的导航能力至关重要，未来工作需进一步优化模型的多模态理解能力。

Abstract: Walking assistance in extreme or complex environments remains a significant challenge for people with blindness or low vision (BLV), largely due to the lack of a holistic scene understanding. Motivated by the real-world needs of the BLV community, we build mmWalk, a simulated multi-modal dataset that integrates multi-view sensor and accessibility-oriented features for outdoor safe navigation. Our dataset comprises 120 manually controlled, scenario-categorized walking trajectories with 62k synchronized frames. It contains over 559k panoramic images across RGB, depth, and semantic modalities. Furthermore, to emphasize real-world relevance, each trajectory involves outdoor corner cases and accessibility-specific landmarks for BLV users. Additionally, we generate mmWalkVQA, a VQA benchmark with over 69k visual question-answer triplets across 9 categories tailored for safe and informed walking assistance. We evaluate state-of-the-art Vision-Language Models (VLMs) using zero- and few-shot settings and found they struggle with our risk assessment and navigational tasks. We validate our mmWalk-finetuned model on real-world datasets and show the effectiveness of our dataset for advancing multi-modal walking assistance.

[143] ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments? cs.CVPDF

Liu Yang, Huiyu Duan, Ran Tao, Juntao Cheng, Sijing Wu

TL;DR: 论文提出了ODI-Bench，一个针对全向图像（ODI）理解的全新基准测试，包含2000张高质量全向图像和4000多个手动标注的问答对。实验表明当前多模态大语言模型（MLLMs）在全向图像的理解上表现不佳，并提出了一种无需训练的方法Omni-CoT，通过跨文本和视觉线索的链式推理显著提升了MLLMs的能力。

Details

Motivation: 全向图像（ODIs）在VR、AR和具身智能等领域广泛应用，但多模态大语言模型（MLLMs）在全向环境理解方面的能力尚未被充分研究。因此，需要设计专门的基准测试和方法来填补这一空白。

Result: 实验显示当前MLLMs对全向图像的理解能力有限，而Omni-CoT方法在无需训练的情况下显著提升了性能。

Insight: 全向图像的独特空间特性对MLLMs提出了新挑战，而结合文本和视觉的链式推理是提升其理解能力的有效途径。

Abstract: Omnidirectional images (ODIs) provide full 360x180 view which are widely adopted in VR, AR and embodied intelligence applications. While multi-modal large language models (MLLMs) have demonstrated remarkable performance on conventional 2D image and video understanding benchmarks, their ability to comprehend the immersive environments captured by ODIs remains largely unexplored. To address this gap, we first present ODI-Bench, a novel comprehensive benchmark specifically designed for omnidirectional image understanding. ODI-Bench contains 2,000 high-quality omnidirectional images and over 4,000 manually annotated question-answering (QA) pairs across 10 fine-grained tasks, covering both general-level and spatial-level ODI understanding. Extensive experiments are conducted to benchmark 20 representative MLLMs, including proprietary and open-source models, under both close-ended and open-ended settings. Experimental results reveal that current MLLMs still struggle to capture the immersive context provided by ODIs. To this end, we further introduce Omni-CoT, a training-free method which significantly enhances MLLMs’ comprehension ability in the omnidirectional environment through chain-of-thought reasoning across both textual information and visual cues. Both the benchmark and the code will be released upon the publication.

[144] SNAP: Towards Segmenting Anything in Any Point Cloud cs.CVPDF

Aniket Gupta, Hanhui Wang, Charles Saunders, Aruni RoyChowdhury, Hanumant Singh

TL;DR: SNAP是一种统一的3D点云交互式分割模型，支持跨域的点和文本提示分割，通过多数据集训练和域自适应归一化避免负迁移，在多个基准测试中表现优异。

Details

Motivation: 当前3D点云分割方法局限于单一域或单一交互形式，且多数据集训练易导致负迁移，限制了模型的通用性和实用性。

Result: 在8/9空间提示和5个文本提示基准测试中达到SOTA或竞争性结果，证明统一模型优于专用域特定方法。

Insight: 统一模型可通过域自适应和跨域训练实现通用性，为大规模3D标注提供实用工具。

Abstract: Interactive 3D point cloud segmentation enables efficient annotation of complex 3D scenes through user-guided prompts. However, current approaches are typically restricted in scope to a single domain (indoor or outdoor), and to a single form of user interaction (either spatial clicks or textual prompts). Moreover, training on multiple datasets often leads to negative transfer, resulting in domain-specific tools that lack generalizability. To address these limitations, we present \textbf{SNAP} (\textbf{S}egment a\textbf{N}ything in \textbf{A}ny \textbf{P}oint cloud), a unified model for interactive 3D segmentation that supports both point-based and text-based prompts across diverse domains. Our approach achieves cross-domain generalizability by training on 7 datasets spanning indoor, outdoor, and aerial environments, while employing domain-adaptive normalization to prevent negative transfer. For text-prompted segmentation, we automatically generate mask proposals without human intervention and match them against CLIP embeddings of textual queries, enabling both panoptic and open-vocabulary segmentation. Extensive experiments demonstrate that SNAP consistently delivers high-quality segmentation results. We achieve state-of-the-art performance on 8 out of 9 zero-shot benchmarks for spatial-prompted segmentation and demonstrate competitive results on all 5 text-prompted benchmarks. These results show that a unified model can match or exceed specialized domain-specific approaches, providing a practical tool for scalable 3D annotation. Project page is at, https://neu-vi.github.io/SNAP/

[145] Benchmarking foundation models for hyperspectral image classification: Application to cereal crop type mapping cs.CVPDF

Walid Elbarz, Mohamed Bourriz, Hicham Hajji, Hamd Ait Abdelali, François Bourzeix

TL;DR: 该论文对三种基础模型（HyperSigma、DOFA和基于SpectralEarth数据集的Vision Transformers）在超光谱作物分类任务中的性能进行了系统评估。DOFA和SpectralEarth模型表现最佳，后者准确率高达93.5%。

Details

Motivation: 超光谱作物分类在农业应用中具有重要意义，但基础模型在这一领域的潜力尚未充分挖掘。本文旨在填补这一空白，为实际应用提供参考。

Result: SpectralEarth预训练模型表现最佳（OA=93.5%），DOFA次之（OA=62.6%），HyperSigma最低（OA=34.5%）。此外，从头训练的SpectralEarth紧凑版本也达到91%的OA。

Insight: 模型架构对跨地区和传感器的泛化能力至关重要。SpectralEarth模型的成功表明，大规模预训练数据和多时间信息对超光谱作物分类任务有显著帮助。

Abstract: Foundation models are transforming Earth observation, but their potential for hyperspectral crop mapping remains underexplored. This study benchmarks three foundation models for cereal crop mapping using hyperspectral imagery: HyperSigma, DOFA, and Vision Transformers pre-trained on the SpectralEarth dataset (a large multitemporal hyperspectral archive). Models were fine-tuned on manually labeled data from a training region and evaluated on an independent test region. Performance was measured with overall accuracy (OA), average accuracy (AA), and F1-score. HyperSigma achieved an OA of 34.5% (+/- 1.8%), DOFA reached 62.6% (+/- 3.5%), and the SpectralEarth model achieved an OA of 93.5% (+/- 0.8%). A compact SpectralEarth variant trained from scratch achieved 91%, highlighting the importance of model architecture for strong generalization across geographic regions and sensor platforms. These results provide a systematic evaluation of foundation models for operational hyperspectral crop mapping and outline directions for future model development.

[146] MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis cs.CV | cs.LGPDF

Hongyu Zhu, Lin Chen, Mounim A. El-Yacoubi, Mingsheng Shang

TL;DR: MS-Mix是一种情感感知的多模态混合数据增强框架，通过自适应混合策略和情感对齐损失，解决了多模态情感分析中的语义不一致问题，显著提升了模型性能。

Details

Motivation: 多模态情感分析（MSA）受限于标注数据稀缺，而现有的Mixup增强方法在多模态任务中引入语义不一致和标签模糊的问题。因此，需要一种情感感知的混合机制来优化样本选择和混合比例。

Result: 在三个基准数据集和六个SOTA模型上，MS-Mix一致优于现有方法。

Insight: 情感感知的混合策略是多模态数据增强的关键，动态计算模态混合比例能有效提升模型的鲁棒性。

Abstract: Multimodal Sentiment Analysis (MSA) aims to identify and interpret human emotions by integrating information from heterogeneous data sources such as text, video, and audio. While deep learning models have advanced in network architecture design, they remain heavily limited by scarce multimodal annotated data. Although Mixup-based augmentation improves generalization in unimodal tasks, its direct application to MSA introduces critical challenges: random mixing often amplifies label ambiguity and semantic inconsistency due to the lack of emotion-aware mixing mechanisms. To overcome these issues, we propose MS-Mix, an adaptive, emotion-sensitive augmentation framework that automatically optimizes sample mixing in multimodal settings. The key components of MS-Mix include: (1) a Sentiment-Aware Sample Selection (SASS) strategy that effectively prevents semantic confusion caused by mixing samples with contradictory emotions. (2) a Sentiment Intensity Guided (SIG) module using multi-head self-attention to compute modality-specific mixing ratios dynamically based on their respective emotional intensities. (3) a Sentiment Alignment Loss (SAL) that aligns the prediction distributions across modalities, and incorporates the Kullback-Leibler-based loss as an additional regularization term to train the emotion intensity predictor and the backbone network jointly. Extensive experiments on three benchmark datasets with six state-of-the-art backbones confirm that MS-Mix consistently outperforms existing methods, establishing a new standard for robust multimodal sentiment augmentation. The source code is available at: https://github.com/HongyuZhu-s/MS-Mix.

[147] ExpVid: A Benchmark for Experiment Video Understanding & Reasoning cs.CVPDF

Yicheng Xu, Yue Wu, Jiashuo Yu, Ziang Yan, Tianxiang Jiang

TL;DR: ExpVid是首个专注于科学实验视频理解与推理的基准测试，旨在评估多模态大语言模型（MLLMs）在实验室环境中的细粒度和长时程任务上的表现。

Details

Motivation: 现有基准测试忽视了真实实验室工作的细粒度和长时程特性，导致MLLMs在科学实验视频中的实际能力未被充分理解。

Result: MLLMs在粗粒度识别上表现优异，但在细粒度区分、状态变化跟踪及实验过程与科学结果的关联上表现不佳，且专有模型与开源模型在高阶推理上存在显著差距。

Insight: ExpVid不仅是诊断工具，还为开发可信赖的科学实验伙伴MLLMs提供了路线图。

Abstract: Multimodal Large Language Models (MLLMs) hold promise for accelerating scientific discovery by interpreting complex experimental procedures. However, their true capabilities are poorly understood, as existing benchmarks neglect the fine-grained and long-horizon nature of authentic laboratory work, especially in wet-lab settings. To bridge this gap, we introduce ExpVid, the first benchmark designed to systematically evaluate MLLMs on scientific experiment videos. Curated from peer-reviewed video publications, ExpVid features a new three-level task hierarchy that mirrors the scientific process: (1) Fine-grained Perception of tools, materials, and actions; (2) Procedural Understanding of step order and completeness; and (3) Scientific Reasoning that connects the full experiment to its published conclusions. Our vision-centric annotation pipeline, combining automated generation with multi-disciplinary expert validation, ensures that tasks require visual grounding. We evaluate 19 leading MLLMs on ExpVid and find that while they excel at coarse-grained recognition, they struggle with disambiguating fine details, tracking state changes over time, and linking experimental procedures to scientific outcomes. Our results reveal a notable performance gap between proprietary and open-source models, particularly in high-order reasoning. ExpVid not only provides a diagnostic tool but also charts a roadmap for developing MLLMs capable of becoming trustworthy partners in scientific experimentation.

[148] EvoCAD: Evolutionary CAD Code Generation with Vision Language Models cs.CV | cs.AI | cs.NEPDF

Tobias Preintner, Weixuan Yuan, Adrian König, Thomas Bäck, Elena Raponi

TL;DR: EvoCAD结合视觉语言模型和进化算法生成CAD对象的符号表示，通过GPT-4V和GPT-4o优化，表现优于现有方法。

Details

Motivation: 结合大型语言模型的生成能力和进化算法的优化潜能，提出了EvoCAD方法，用于生成高质量的CAD对象。

Result: EvoCAD在生成拓扑正确对象方面表现优于现有方法，新提出的指标有效补充了现有空间指标。

Insight: 结合语言模型的生成能力和进化算法的优化能力可以有效提升CAD对象的生成质量，拓扑指标的引入为评估提供了新维度。

Abstract: Combining large language models with evolutionary computation algorithms represents a promising research direction leveraging the remarkable generative and in-context learning capabilities of LLMs with the strengths of evolutionary algorithms. In this work, we present EvoCAD, a method for generating computer-aided design (CAD) objects through their symbolic representations using vision language models and evolutionary optimization. Our method samples multiple CAD objects, which are then optimized using an evolutionary approach with vision language and reasoning language models. We assess our method using GPT-4V and GPT-4o, evaluating it on the CADPrompt benchmark dataset and comparing it to prior methods. Additionally, we introduce two new metrics based on topological properties defined by the Euler characteristic, which capture a form of semantic similarity between 3D objects. Our results demonstrate that EvoCAD outperforms previous approaches on multiple metrics, particularly in generating topologically correct objects, which can be efficiently evaluated using our two novel metrics that complement existing spatial metrics.

[149] NV3D: Leveraging Spatial Shape Through Normal Vector-based 3D Object Detection cs.CV | cs.AI | cs.LG | I.2.6; I.2.9; I.2.10; I.4.8; I.4.10; I.5.1; I.5.4PDF

Krittin Chaowakarn, Paramin Sangwongngam, Nang Htet Htet Aung, Chalie Charoenlarpnopparut

TL;DR: NV3D通过法向量增强3D物体检测，利用KNN和PCA提取局部特征，并提出两种采样策略和数据精简方法，在KITTI数据集上表现优于基线模型。

Details

Motivation: 多模态方法在特征对齐上存在挑战，而局部特征提取可能过于简化复杂的3D检测任务，因此需要一种更有效的方法。

Result: 在KITTI验证集上，NV3D在汽车和行人检测中的mAP分别比基线高2.61%和4.23%，数据精简55%后仍优于基线。

Insight: 法向量能有效表征物体空间形状，采样策略和数据精简可在保持性能的同时显著降低计算开销。

Abstract: Recent studies in 3D object detection for autonomous vehicles aim to enrich features through the utilization of multi-modal setups or the extraction of local patterns within LiDAR point clouds. However, multi-modal methods face significant challenges in feature alignment, and gaining features locally can be oversimplified for complex 3D object detection tasks. In this paper, we propose a novel model, NV3D, which utilizes local features acquired from voxel neighbors, as normal vectors computed per voxel basis using K-nearest neighbors (KNN) and principal component analysis (PCA). This informative feature enables NV3D to determine the relationship between the surface and pertinent target entities, including cars, pedestrians, or cyclists. During the normal vector extraction process, NV3D offers two distinct sampling strategies: normal vector density-based sampling and FOV-aware bin-based sampling, allowing elimination of up to 55% of data while maintaining performance. In addition, we applied element-wise attention fusion, which accepts voxel features as the query and value and normal vector features as the key, similar to the attention mechanism. Our method is trained on the KITTI dataset and has demonstrated superior performance in car and cyclist detection owing to their spatial shapes. In the validation set, NV3D without sampling achieves 86.60% and 80.18% mean Average Precision (mAP), greater than the baseline Voxel R-CNN by 2.61% and 4.23% mAP, respectively. With both samplings, NV3D achieves 85.54% mAP in car detection, exceeding the baseline by 1.56% mAP, despite roughly 55% of voxels being filtered out.

[150] IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment cs.CVPDF

Yinan Chen, Jiangning Zhang, Teng Hu, Yuxiang Zeng, Zhucun Xue

TL;DR: IVEBench是一个专门为指令引导的视频编辑评估设计的现代基准测试套件，解决了现有基准在多样性、任务覆盖和评估指标上的不足。

Details

Motivation: 指令引导的视频编辑是新兴研究方向，但现有基准无法充分支持其评估，表现为来源多样性低、任务覆盖窄和评估指标不完整。

Result: 实验表明，IVEBench能有效评估当前最先进的指令引导视频编辑方法，提供全面且与人类对齐的评估结果。

Insight: IVEBench的多样性和多维评估协议为指令引导视频编辑的标准化评估提供了重要工具。

Abstract: Instruction-guided video editing has emerged as a rapidly advancing research direction, offering new opportunities for intuitive content transformation while also posing significant challenges for systematic evaluation. Existing video editing benchmarks fail to support the evaluation of instruction-guided video editing adequately and further suffer from limited source diversity, narrow task coverage and incomplete evaluation metrics. To address the above limitations, we introduce IVEBench, a modern benchmark suite specifically designed for instruction-guided video editing assessment. IVEBench comprises a diverse database of 600 high-quality source videos, spanning seven semantic dimensions, and covering video lengths ranging from 32 to 1,024 frames. It further includes 8 categories of editing tasks with 35 subcategories, whose prompts are generated and refined through large language models and expert review. Crucially, IVEBench establishes a three-dimensional evaluation protocol encompassing video quality, instruction compliance and video fidelity, integrating both traditional metrics and multimodal large language model-based assessments. Extensive experiments demonstrate the effectiveness of IVEBench in benchmarking state-of-the-art instruction-guided video editing methods, showing its ability to provide comprehensive and human-aligned evaluation outcomes.

[151] InfiniHuman: Infinite 3D Human Creation with Precise Control cs.CVPDF

Yuxuan Xue, Xianghui Xie, Margaret Kostyrko, Gerard Pons-Moll

TL;DR: InfiniHuman是一个框架，通过利用基础模型生成无限多样且可控的3D人类数据，解决了传统方法因数据采集昂贵而受限的问题。

Details

Motivation: 传统方法生成3D人类数据的成本高且多样性有限，InfiniHuman致力于通过自动化和可扩展的方式解决这一问题。

Result: 生成了111K个多样化的3D人类身份，并通过实验验证了其在视觉质量、生成速度和可控性上的显著优势。

Insight: 利用现有基础模型可以大幅降低数据生成成本，同时实现高质量的3D内容生成与控制。

Abstract: Generating realistic and controllable 3D human avatars is a long-standing challenge, particularly when covering broad attribute ranges such as ethnicity, age, clothing styles, and detailed body shapes. Capturing and annotating large-scale human datasets for training generative models is prohibitively expensive and limited in scale and diversity. The central question we address in this paper is: Can existing foundation models be distilled to generate theoretically unbounded, richly annotated 3D human data? We introduce InfiniHuman, a framework that synergistically distills these models to produce richly annotated human data at minimal cost and with theoretically unlimited scalability. We propose InfiniHumanData, a fully automatic pipeline that leverages vision-language and image generation models to create a large-scale multi-modal dataset. User study shows our automatically generated identities are undistinguishable from scan renderings. InfiniHumanData contains 111K identities spanning unprecedented diversity. Each identity is annotated with multi-granularity text descriptions, multi-view RGB images, detailed clothing images, and SMPL body-shape parameters. Building on this dataset, we propose InfiniHumanGen, a diffusion-based generative pipeline conditioned on text, body shape, and clothing assets. InfiniHumanGen enables fast, realistic, and precisely controllable avatar generation. Extensive experiments demonstrate significant improvements over state-of-the-art methods in visual quality, generation speed, and controllability. Our approach enables high-quality avatar generation with fine-grained control at effectively unbounded scale through a practical and affordable solution. We will publicly release the automatic data generation pipeline, the comprehensive InfiniHumanData dataset, and the InfiniHumanGen models at https://yuxuan-xue.com/infini-human.

[152] FACE: Faithful Automatic Concept Extraction cs.CV | cs.AIPDF

Dipkamal Bhusal, Michael Clifford, Sara Rampazzi, Nidhi Rastogi

TL;DR: FACE提出了一种基于KL散度正则化的概念提取框架，提高了深度学习模型解释的忠实性。

Details

Motivation: 现有自动概念发现方法未能充分对齐提取的概念与模型真实决策过程，导致解释缺乏忠实性。

Result: 在ImageNet、COCO和CelebA数据集上，FACE在忠实性和稀疏性指标上优于现有方法。

Insight: 通过理论证明，KL散度最小化能约束预测分布的偏差，从而提升概念空间的局部线性忠实性。

Abstract: Interpreting deep neural networks through concept-based explanations offers a bridge between low-level features and high-level human-understandable semantics. However, existing automatic concept discovery methods often fail to align these extracted concepts with the model’s true decision-making process, thereby compromising explanation faithfulness. In this work, we propose FACE (Faithful Automatic Concept Extraction), a novel framework that augments Non-negative Matrix Factorization (NMF) with a Kullback-Leibler (KL) divergence regularization term to ensure alignment between the model’s original and concept-based predictions. Unlike prior methods that operate solely on encoder activations, FACE incorporates classifier supervision during concept learning, enforcing predictive consistency and enabling faithful explanations. We provide theoretical guarantees showing that minimizing the KL divergence bounds the deviation in predictive distributions, thereby promoting faithful local linearity in the learned concept space. Systematic evaluations on ImageNet, COCO, and CelebA datasets demonstrate that FACE outperforms existing methods across faithfulness and sparsity metrics.

[153] Beyond ‘Templates’: Category-Agnostic Object Pose, Size, and Shape Estimation from a Single View cs.CVPDF

Jinyu Zhang, Haitao Lin, Jiashu Hou, Xiangyang Xue, Yanwei Fu

TL;DR: 该论文提出了一种无需模板或CAD模型的类别无关框架，用于从单张RGB-D图像中同时预测物体的6D位姿、大小和密集形状，展现了强大的零样本泛化能力。

Details

Motivation: 现有方法依赖于特定类别的先验（如CAD模型或模板）或多阶段流程，限制了跨类别的泛化能力。该研究旨在解决这些问题，实现更灵活和通用的物体理解。

Result: 在四个基准测试（300+类别）上达到SOTA，实时推理速度为28 FPS，且零样本泛化表现突出。

Insight: 通过密集特征和多专家混合机制，解决了位姿与形状的纠缠问题；合成数据训练仍能实现真实的强泛化。

Abstract: Estimating an object’s 6D pose, size, and shape from visual input is a fundamental problem in computer vision, with critical applications in robotic grasping and manipulation. Existing methods either rely on object-specific priors such as CAD models or templates, or suffer from limited generalization across categories due to pose-shape entanglement and multi-stage pipelines. In this work, we propose a unified, category-agnostic framework that simultaneously predicts 6D pose, size, and dense shape from a single RGB-D image, without requiring templates, CAD models, or category labels at test time. Our model fuses dense 2D features from vision foundation models with partial 3D point clouds using a Transformer encoder enhanced by a Mixture-of-Experts, and employs parallel decoders for pose-size estimation and shape reconstruction, achieving real-time inference at 28 FPS. Trained solely on synthetic data from 149 categories in the SOPE dataset, our framework is evaluated on four diverse benchmarks SOPE, ROPE, ObjaversePose, and HANDAL, spanning over 300 categories. It achieves state-of-the-art accuracy on seen categories while demonstrating remarkably strong zero-shot generalization to unseen real-world objects, establishing a new standard for open-set 6D understanding in robotics and embodied AI.

[154] Bayesian Topological Convolutional Neural Nets cs.CVPDF

Sarah Harkins Dayton, Hayden Everett, Ioannis Schizas, David L. Boothe Jr., Vasileios Maroulas

TL;DR: 这篇论文提出了一种结合拓扑学习和贝叶斯采样的新型贝叶斯拓扑卷积神经网络（Bayesian Topological CNN），解决了传统CNN需要大量数据训练、预测过度自信和不确定性量化不足的问题。

Details

Motivation: 传统卷积神经网络（CNNs）存在训练数据需求大、预测过度自信和不确定性量化不足的问题，而贝叶斯神经网络（BNNs）和拓扑CNNs未能完全解决这些问题，因此需要一种更高效、鲁棒的混合方法。

Result: 在基准图像分类数据集上的实验表明，该方法优于传统CNN、BNN和拓扑CNN，尤其是在训练数据有限或损坏的情况下表现更优。此外，模型在不确定性量化方面优于标准BNN，能更好识别未见过的分布外数据。

Insight: 结合拓扑学习和贝叶斯方法能够显著提升CNN的效率与鲁棒性，为图像分类任务提供了一种更高效且可信的解决方案。

Abstract: Convolutional neural networks (CNNs) have been established as the main workhorse in image data processing; nonetheless, they require large amounts of data to train, often produce overconfident predictions, and frequently lack the ability to quantify the uncertainty of their predictions. To address these concerns, we propose a new Bayesian topological CNN that promotes a novel interplay between topology-aware learning and Bayesian sampling. Specifically, it utilizes information from important manifolds to accelerate training while reducing calibration error by placing prior distributions on network parameters and properly learning appropriate posteriors. One important contribution of our work is the inclusion of a consistency condition in the learning cost, which can effectively modify the prior distributions to improve the performance of our novel network architecture. We evaluate the model on benchmark image classification datasets and demonstrate its superiority over conventional CNNs, Bayesian neural networks (BNNs), and topological CNNs. In particular, we supply evidence that our method provides an advantage in situations where training data is limited or corrupted. Furthermore, we show that the new model allows for better uncertainty quantification than standard BNNs since it can more readily identify examples of out-of-distribution data on which it has not been trained. Our results highlight the potential of our novel hybrid approach for more efficient and robust image classification.

[155] DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training cs.CVPDF

Haoran Feng, Dizhe Zhang, Xiangtai Li, Bo Du, Lu Qi

TL;DR: DiT360提出了一种基于DiT的框架，通过混合训练视角和全景数据生成高质量的全景图像。其核心创新在于跨域转换和域内增强模块，结合图像和令牌级别的监督，提升了边界一致性和图像逼真度。

Details

Motivation: 现有全景图像生成方法因缺乏大规模高质量真实全景数据而导致几何保真度和逼真度不足。DiT360通过数据驱动的视角解决了这一问题，而非单纯依赖模型设计。

Result: 在11项定量指标上，DiT360在边界一致性和图像逼真度方面表现优于基线方法。

Insight: 数据驱动的混合训练策略能显著提升全景图像的生成质量，跨域知识引入和令牌级别监督是关键。

Abstract: In this work, we propose DiT360, a DiT-based framework that performs hybrid training on perspective and panoramic data for panoramic image generation. For the issues of maintaining geometric fidelity and photorealism in generation quality, we attribute the main reason to the lack of large-scale, high-quality, real-world panoramic data, where such a data-centric view differs from prior methods that focus on model design. Basically, DiT360 has several key modules for inter-domain transformation and intra-domain augmentation, applied at both the pre-VAE image level and the post-VAE token level. At the image level, we incorporate cross-domain knowledge through perspective image guidance and panoramic refinement, which enhance perceptual quality while regularizing diversity and photorealism. At the token level, hybrid supervision is applied across multiple modules, which include circular padding for boundary continuity, yaw loss for rotational robustness, and cube loss for distortion awareness. Extensive experiments on text-to-panorama, inpainting, and outpainting tasks demonstrate that our method achieves better boundary consistency and image fidelity across eleven quantitative metrics. Our code is available at https://github.com/Insta360-Research-Team/DiT360.

[156] Point Prompting: Counterfactual Tracking with Video Diffusion Models cs.CVPDF

Ayush Shrivastava, Sanyam Mehta, Daniel Geng, Andrew Owens

TL;DR: 本文提出了一种基于预训练视频扩散模型的零样本点跟踪方法，通过视觉标记点来合成运动轨迹。该方法利用反事实生成技术，显著提升了跟踪性能。

Details

Motivation: 现有的跟踪器和视频生成器分别专注于运动分析和运动合成，两者任务密切相关。本文探索了如何利用预训练视频扩散模型的能力进行零样本点跟踪，填补了这一领域的研究空白。

Result: 实验表明，该方法在多个图像条件视频扩散模型上的表现优于现有零样本方法，能够有效处理遮挡问题，性能接近专用自监督模型。

Insight: 视频扩散模型在零样本跟踪任务中表现出强大的潜力，表明可以利用合成的运动信息来解决分析任务，为跨任务模型设计提供了新思路。

Abstract: Trackers and video generators solve closely related problems: the former analyze motion, while the latter synthesize it. We show that this connection enables pretrained video diffusion models to perform zero-shot point tracking by simply prompting them to visually mark points as they move over time. We place a distinctively colored marker at the query point, then regenerate the rest of the video from an intermediate noise level. This propagates the marker across frames, tracing the point’s trajectory. To ensure that the marker remains visible in this counterfactual generation, despite such markers being unlikely in natural videos, we use the unedited initial frame as a negative prompt. Through experiments with multiple image-conditioned video diffusion models, we find that these “emergent” tracks outperform those of prior zero-shot methods and persist through occlusions, often obtaining performance that is competitive with specialized self-supervised models.

[157] CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images cs.CV | cs.AIPDF

Chengqi Duan, Kaiyue Sun, Rongyao Fang, Manyuan Zhang, Yan Feng

TL;DR: CodePlot-CoT引入了一种基于代码驱动的链式思考（Chain-of-Thought）范式，通过生成可执行的绘图代码并将其渲染为“视觉思考”图像，解决需要视觉辅助的数学问题。论文还提出了首个双语大规模数据集Math-VR和专用图像到代码转换器。

Details

Motivation: 现有的大语言模型和多模态统一模型在需要视觉辅助的数学问题上存在局限性，尤其是缺乏生成精确可控的图像的能力。

Result: 实验结果表明，CodePlot-CoT在Math-VR基准上比基础模型提升了21%。

Insight: 代码驱动的视觉推理为解决复杂数学问题提供了新思路，同时开源资源将推动多模态数学推理领域的发展。

Abstract: Recent advances in Large Language Models (LLMs) and Vision Language Models (VLMs) have shown significant progress in mathematical reasoning, yet they still face a critical bottleneck with problems requiring visual assistance, such as drawing auxiliary lines or plotting functions to solve the problems. Most LLMs and VLMs are constrained to text-only reasoning chains, while multimodal unified models that can generate interleaved text and images lack the necessary precision and controllability for such tasks. To address this, we propose CodePlot-CoT, a code-driven Chain-of-Thought paradigm for “thinking with images” in mathematics. Our approach leverages the VLM to generate text reasoning as well as executable plotting code, which is then rendered into images as “visual thought”, to solve mathematical problems. To achieve this, we first construct Math-VR, the first large-scale, bilingual dataset and benchmark for Mathematics problems with Visual Reasoning, comprising 178K samples. Second, to create high-quality training data, we develop a state-of-the-art image-to-code converter specialized for parsing complex mathematical figures into codes. Finally, using these training data, we train the CodePlot-CoT model for solving mathematical problems. Experimental results show that our model achieves up to 21% increase over base model on our new benchmark, fully validating the efficacy of our proposed code-driven reasoning paradigm. Our work opens a new direction for multimodal mathematical reasoning and provides the community with the first large-scale dataset, comprehensive benchmark, and strong approach for such problems. To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/HKU-MMLab/Math-VR-CodePlot-CoT.

cs.CL [Back]

[158] Table Question Answering in the Era of Large Language Models: A Comprehensive Survey of Tasks, Methods, and Evaluation cs.CLPDF

Wei Zhou, Bolei Ma, Annemarie Friedrich, Mohsen Mesgar

TL;DR: 这篇综述论文系统地整理了基于大语言模型（LLMs）的表格问答（TQA）研究，涵盖任务分类、方法趋势及评估标准，旨在填补该领域的系统性研究空白。

Details

Motivation: 表格问答（TQA）领域在任务定义、核心挑战和方法趋势上缺乏系统性梳理，尤其是在大语言模型（LLMs）兴起的背景下，亟需统一的综述框架。

Result: 论文统一了分散的研究方向，为TQA社区提供了理论基础和实践指导。

Insight: LLMs在TQA中的潜力巨大，但需进一步探索强化学习等新兴方向，以提升模型在多模态和复杂任务中的表现。

Abstract: Table Question Answering (TQA) aims to answer natural language questions about tabular data, often accompanied by additional contexts such as text passages. The task spans diverse settings, varying in table representation, question/answer complexity, modality involved, and domain. While recent advances in large language models (LLMs) have led to substantial progress in TQA, the field still lacks a systematic organization and understanding of task formulations, core challenges, and methodological trends, particularly in light of emerging research directions such as reinforcement learning. This survey addresses this gap by providing a comprehensive and structured overview of TQA research with a focus on LLM-based methods. We provide a comprehensive categorization of existing benchmarks and task setups. We group current modeling strategies according to the challenges they target, and analyze their strengths and limitations. Furthermore, we highlight underexplored but timely topics that have not been systematically covered in prior research. By unifying disparate research threads and identifying open problems, our survey offers a consolidated foundation for the TQA community, enabling a deeper understanding of the state of the art and guiding future developments in this rapidly evolving area.

[159] The Idola Tribus of AI: Large Language Models tend to perceive order where none exists cs.CL | cs.AIPDF

Shin-nosuke Ishikawa, Masato Todo, Taiki Ogihara, Hirotsugu Ohba

TL;DR: 研究发现大型语言模型（LLMs）倾向于在随机数列中生成荒谬的模式，显示出逻辑一致性问题，这可能影响其在复杂任务中的应用。

Details

Motivation: 当前LLMs被广泛应用于需要逻辑一致性的任务（如知识检索和多步推理），但对其在识别数列规律中的表现缺乏系统评估。

Result: LLMs在算术和几何数列中表现良好，但在随机数列中频繁生成错误的模式，即使多步推理模型（如OpenAI和Google Gemini系列）也存在此问题。

Insight: LLMs的逻辑缺陷可能影响其在实际任务中的可靠性，需要进一步研究改进其推理机制或开发针对性策略。

Abstract: We present a tendency of large language models (LLMs) to generate absurd patterns despite their clear inappropriateness in a simple task of identifying regularities in number series. Several approaches have been proposed to apply LLMs to complex real-world tasks, such as providing knowledge through retrieval-augmented generation and executing multi-step tasks using AI agent frameworks. However, these approaches rely on the logical consistency and self-coherence of LLMs, making it crucial to evaluate these aspects and consider potential countermeasures. To identify cases where LLMs fail to maintain logical consistency, we conducted an experiment in which LLMs were asked to explain the patterns in various integer sequences, ranging from arithmetic sequences to randomly generated integer series. While the models successfully identified correct patterns in arithmetic and geometric sequences, they frequently over-recognized patterns that were inconsistent with the given numbers when analyzing randomly generated series. This issue was observed even in multi-step reasoning models, including OpenAI o3, o4-mini, and Google Gemini 2.5 Flash Preview Thinking. This tendency to perceive non-existent patterns can be interpreted as the AI model equivalent of Idola Tribus and highlights potential limitations in their capability for applied tasks requiring logical reasoning, even when employing chain-of-thought reasoning mechanisms.

[160] ReaLM: Residual Quantization Bridging Knowledge Graph Embeddings and Large Language Models cs.CL | cs.AIPDF

Wenbin Guo, Xin Wang, Jiaoyan Chen, Lingbing Guo, Zhao Li

TL;DR: ReaLM提出了一种通过残差向量量化（residual vector quantization）将知识图谱嵌入与大型语言模型（LLM）对齐的框架，解决了传统方法无法充分利用结构化语义表示的问题。

Details

Motivation: 现有的基于LLM的知识图谱补全方法难以有效利用结构化语义表示，主要是因为预训练知识图谱模型的连续嵌入空间与LLM的离散词元空间不一致，导致语义传递受限。

Result: 在两个广泛使用的基准数据集上的实验表明，ReaLM取得了最先进的性能，验证了其在结构化知识与大语言模型对齐中的有效性。

Insight: ReaLM的创新点在于弥合了连续嵌入空间与离散词元空间的鸿沟，通过量化技术和语义约束实现了知识图谱与LLM的高效融合，为知识图谱补全提供了新思路。

Abstract: Large Language Models (LLMs) have recently emerged as a powerful paradigm for Knowledge Graph Completion (KGC), offering strong reasoning and generalization capabilities beyond traditional embedding-based approaches. However, existing LLM-based methods often struggle to fully exploit structured semantic representations, as the continuous embedding space of pretrained KG models is fundamentally misaligned with the discrete token space of LLMs. This discrepancy hinders effective semantic transfer and limits their performance. To address this challenge, we propose ReaLM, a novel and effective framework that bridges the gap between KG embeddings and LLM tokenization through the mechanism of residual vector quantization. ReaLM discretizes pretrained KG embeddings into compact code sequences and integrates them as learnable tokens within the LLM vocabulary, enabling seamless fusion of symbolic and contextual knowledge. Furthermore, we incorporate ontology-guided class constraints to enforce semantic consistency, refining entity predictions based on class-level compatibility. Extensive experiments on two widely used benchmark datasets demonstrate that ReaLM achieves state-of-the-art performance, confirming its effectiveness in aligning structured knowledge with large-scale language models.

[161] All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language cs.CL | cs.AI | cs.LGPDF

Shiyuan Guo, Henry Sleight, Fabien Roger

TL;DR: 当前的语言模型在处理加密语言时推理能力较弱，这可能威胁到链式思维（CoT）监控的有效性。论文测试了28种不同加密方式，发现模型在常见加密（如rot13）中表现较好，但在冷门加密中表现较差，且这种能力与预训练数据中加密的出现频率相关。

Details

Motivation: 随着AI代理的广泛应用，检测有害行为（如对抗攻击或AI不对齐）变得至关重要。CoT监控是一种常用方法，但攻击者可能通过加密语言规避监控。论文旨在评估这一风险。

Result: 模型在常见加密中表现良好，但在冷门加密中表现显著下降。推理能力与加密的预训练数据频率相关，且微调数据对能力提升作用有限。

Insight: 当前模型在加密推理中存在局限性，这可能限制攻击者利用加密语言规避CoT监控的可行性，并为未来模型的设计提供了参考。

Abstract: Detecting harmful AI actions is important as AI agents gain adoption. Chain-of-thought (CoT) monitoring is one method widely used to detect adversarial attacks and AI misalignment. However, attackers and misaligned models might evade CoT monitoring through ciphered reasoning: reasoning hidden in encrypted, translated, or compressed text. To assess this risk, we test whether models can perform ciphered reasoning. For each of 28 different ciphers, we fine-tune and prompt up to 10 models to reason in that cipher. We measure model accuracy on math problems as a proxy for reasoning ability. Across the models we test, we find an asymmetry: model accuracy can drop significantly when reasoning in ciphered text, even though models demonstrate comprehension of ciphered text by being able to translate it accurately to English. Even frontier models struggle with lesser-known ciphers, although they can reason accurately in well-known ciphers like rot13. We show that ciphered reasoning capability correlates with cipher prevalence in pretraining data. We also identify scaling laws showing that ciphered reasoning capability improves slowly with additional fine-tuning data. Our work suggests that evading CoT monitoring using ciphered reasoning may be an ineffective tactic for current models and offers guidance on constraining the development of this capability in future frontier models.

[162] Preference-Aware Memory Update for Long-Term LLM Agents cs.CL | cs.AIPDF

Haoran Sun, Zekun Zhang, Shaoning Zeng

TL;DR: 该论文提出了一种偏好感知记忆更新机制（PAMU），通过结合滑动窗口平均和指数移动平均，动态优化长期LLM代理的记忆表示，从而提升对话质量。

Details

Motivation: 现有的长期记忆机制在存储和检索方面已有显著进展，但缺乏动态更新记忆的能力，特别是在适应用户行为和上下文变化时。PAMU旨在填补这一空白。

Result: 在LoCoMo数据集的五个任务场景中，PAMU显著提升了五个基线的输出质量，验证了其在长期对话中的有效性。

Insight: 动态调整记忆表示是提升长期LLM代理推理能力的关键，结合短期和长期偏好能够更好地适应用户行为的变化。

Abstract: One of the key factors influencing the reasoning capabilities of LLM-based agents is their ability to leverage long-term memory. Integrating long-term memory mechanisms allows agents to make informed decisions grounded in historical interactions. While recent advances have significantly improved the storage and retrieval components, by encoding memory into dense vectors for similarity search or organizing memory as structured knowledge graphs most existing approaches fall short in memory updating. In particular, they lack mechanisms for dynamically refining preference memory representations in response to evolving user behaviors and contexts. To address this gap, we propose a Preference-Aware Memory Update Mechanism (PAMU) that enables dynamic and personalized memory refinement. By integrating sliding window averages (SW) with exponential moving averages (EMA), PAMU constructs a fused preference-aware representation that captures both short-term fluctuations and long-term user tendencies. We conduct experiments on five task scenarios of the LoCoMo dataset, and the results show that our mechanism can significantly improve the output quality of LLM in five baselines, validating its effectiveness in long-term conversations.

[163] VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation cs.CL | cs.CVPDF

Yubo Sun, Chunyi Peng, Yukun Yan, Shi Yu, Zhenghao Liu

TL;DR: 论文提出了EVisRAG框架，通过证据引导的多图像推理来解决视觉检索增强生成（VRAG）系统中的问题，并使用RS-GRPO方法优化视觉感知和推理能力，实验显示在多图像问答任务中显著提升性能。

Details

Motivation: 当前VRAG系统在多图像推理中难以可靠地感知和整合证据，导致推理不准确。为解决这一问题，作者提出了EVisRAG框架和RS-GRPO训练方法。

Result: 实验表明，EVisRAG在多图像问答任务中平均提升27%的性能，并能准确感知和定位问题相关证据。

Insight: EVisRAG通过证据聚合和RS-GRPO训练方法，显著提升了多图像推理的准确性，展现了类似真实侦探的推理能力。

Abstract: Visual retrieval-augmented generation (VRAG) augments vision-language models (VLMs) with external visual knowledge to ground reasoning and reduce hallucinations. Yet current VRAG systems often fail to reliably perceive and integrate evidence across multiple images, leading to weak grounding and erroneous conclusions. In this paper, we propose EVisRAG, an end-to-end framework that learns to reason with evidence-guided multi-image to address this issue. The model first observes retrieved images and records per-image evidence, then derives the final answer from the aggregated evidence. To train EVisRAG effectively, we introduce Reward-Scoped Group Relative Policy Optimization (RS-GRPO), which binds fine-grained rewards to scope-specific tokens to jointly optimize visual perception and reasoning abilities of VLMs. Experimental results on multiple visual question answering benchmarks demonstrate that EVisRAG delivers substantial end-to-end gains over backbone VLM with 27% improvements on average. Further analysis shows that, powered by RS-GRPO, EVisRAG improves answer accuracy by precisely perceiving and localizing question-relevant evidence across multiple images and deriving the final answer from that evidence, much like a real detective.

[164] Judge’s Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement cs.CL | cs.AIPDF

Steve Han, Gilberto Titericz Junior, Tom Balough, Wenfei Zhou

TL;DR: 本文提出了Judge’s Verdict Benchmark，通过两步法评估大型语言模型（LLM）作为‘法官’评判回答准确性的能力，发现27/54的LLM达到Tier 1表现，并揭示了评判能力的提升不仅依赖模型规模，还与训练策略相关。

Details

Motivation: 传统方法仅依赖相关性分析来评估LLM作为‘法官’的能力是不够的，需要更全面的方法衡量其是否能够模拟人类评判的复杂性和一致性。

Result: 27个模型达到Tier 1表现（23个人类相似，4个超一致）。评判能力与模型规模无关，而依赖训练策略。

Insight: LLM作为‘法官’的能力不仅取决于规模，训练策略是关键；超一致性可能隐含过度简化风险。

Abstract: This research introduces the Judge’s Verdict Benchmark, a novel two-step methodology to evaluate Large Language Models (LLMs) as judges for response accuracy evaluation tasks. We assess how well 54 LLMs can replicate human judgment when scoring responses from RAG (Retrieval-Augmented Generation) or Agentic pipelines against ground truth answers. Our methodology progresses from traditional correlation analysis to comprehensive Cohen’s Kappa analysis that measures actual agreement patterns. The two-step approach includes: (1) a correlation test that filters judges with strong alignment, followed by (2) a human-likeness test using z-scores to identify two distinct judgment patterns: human-like judgment (|z| < 1) that mimics natural human variation, and super-consistent judgment (z > 1) that exceeds typical human-to-human agreement levels. This methodology reveals that 27 out of 54 tested LLMs achieve Tier 1 performance: 23 models exhibit human-like patterns that preserve the nuances of human judgment, while 4 models demonstrate super-consistent behavior, a pattern that could indicate either enhanced reliability or oversimplification of complex judgments. Testing 43 open-source models (1B-405B parameters) and 11 closed models (GPT, Gemini, Claude variants), we demonstrate that judge excellence is not solely dependent on model size but on specific training strategies. Our key contributions include: (1) establishing that correlation alone is insufficient for judge evaluation, (2) introducing a “Turing Test for judges” based on agreement patterns, and (3) providing a standardized benchmark for classifying LLM judges into distinct performance tiers for different evaluation needs.

[165] Gold Panning: Turning Positional Bias into Signal for Multi-Document LLM Reasoning cs.CLPDF

Adam Byerly, Daniel Khashabi

TL;DR: 论文提出了一种名为Gold Panning Bandits的方法，将LLM在多文档环境中的位置偏置（positional bias）转化为信号，通过重排序文档以高效识别相关内容，显著减少了语言模型查询次数。

Details

Motivation: 大型语言模型（LLM）在多文档处理中存在位置偏置，倾向于依赖文档位置而非相关性。传统方法将这种偏置视为噪声并试图消除，而本文则将其视为可利用的信号。

Result: 在知识密集型NLP任务中，该方法比随机重排序基线减少了65%的语言模型查询次数，显著降低了计算成本。

Insight: 研究表明，LLM的内在偏置可以通过巧妙设计转化为优化推理效率的工具，而无需重新训练模型。

Abstract: Large language models exhibit a strong position bias in multi-document contexts, systematically prioritizing information based on location rather than relevance. While existing approaches treat this bias as noise to be mitigated, we introduce Gold Panning Bandits, a framework that leverages position bias as a diagnostic signal: by reordering documents and observing shifts in the model’s responses, we can efficiently identify the most relevant content. We frame the problem of choosing reorderings as a bipartite matching problem. While an optimal assignment can be computed at each iteration with the Hungarian algorithm in $O(N^3)$ time, we propose a greedy $O(N \log N)$ strategy that achieves comparable performance by prioritizing the placement of the most uncertain documents in the most informative positions. Our approach identifies relevant documents using up to 65% fewer language model queries than random permutation baselines on knowledge-intensive NLP tasks, substantially reducing computational cost without model retraining. This work demonstrates that inherent LLM biases can be transformed from liabilities into assets for efficient, inference-time optimization.

[166] Text Prompt Injection of Vision Language Models cs.CL | cs.CVPDF

Ruizhe Zhu

TL;DR: 该论文提出了一种针对视觉语言模型的文本提示注入攻击方法，该方法简单高效，计算资源需求低。

Details

Motivation: 随着视觉语言模型的广泛应用，其安全性问题日益突出，作者希望通过研究文本提示注入攻击方法，揭示模型潜在的安全漏洞。

Result: 实验结果表明，该方法对大型模型特别有效，且对计算资源的需求较低。

Insight: 该研究揭示了视觉语言模型在面对文本提示注入攻击时的脆弱性，为未来的模型安全性改进提供了重要参考。

Abstract: The widespread application of large vision language models has significantly raised safety concerns. In this project, we investigate text prompt injection, a simple yet effective method to mislead these models. We developed an algorithm for this type of attack and demonstrated its effectiveness and efficiency through experiments. Compared to other attack methods, our approach is particularly effective for large models without high demand for computational resources.

[167] NG-Router: Graph-Supervised Multi-Agent Collaboration for Nutrition Question Answering cs.CLPDF

Kaiwen Shi, Zheyuan Zhang, Zhengqing Yuan, Keerthiram Murugesan, Vincent Galass

TL;DR: NG-Router提出了一种基于图监督的多智能体协作框架，用于营养问答任务，解决了单智能体推理能力有限和多智能体架构设计复杂的问题，并通过梯度子图检索提高推理准确性。

Details

Motivation: 饮食对人类健康至关重要，营养问答为个性化饮食指导和预防慢性疾病提供了可能。现有方法存在单智能体推理能力有限和多智能体架构设计复杂的问题，以及上下文过载导致的决策困难。

Result: 在多个基准测试和主干模型中，NG-Router均优于单智能体和集成基线方法。

Insight: 通过知识图谱和多智能体协作的结合，NG-Router为解决复杂领域任务提供了一种有效的方法。梯度子图检索机制有望扩展到其他依赖上下文推理的任务中。

Abstract: Diet plays a central role in human health, and Nutrition Question Answering (QA) offers a promising path toward personalized dietary guidance and the prevention of diet-related chronic diseases. However, existing methods face two fundamental challenges: the limited reasoning capacity of single-agent systems and the complexity of designing effective multi-agent architectures, as well as contextual overload that hinders accurate decision-making. We introduce Nutritional-Graph Router (NG-Router), a novel framework that formulates nutritional QA as a supervised, knowledge-graph-guided multi-agent collaboration problem. NG-Router integrates agent nodes into heterogeneous knowledge graphs and employs a graph neural network to learn task-aware routing distributions over agents, leveraging soft supervision derived from empirical agent performance. To further address contextual overload, we propose a gradient-based subgraph retrieval mechanism that identifies salient evidence during training, thereby enhancing multi-hop and relational reasoning. Extensive experiments across multiple benchmarks and backbone models demonstrate that NG-Router consistently outperforms both single-agent and ensemble baselines, offering a principled approach to domain-aware multi-agent reasoning for complex nutritional health tasks.

[168] NarraBench: A Comprehensive Framework for Narrative Benchmarking cs.CL | cs.AIPDF

Sil Hamilton, Matthew Wilkens, Andrew Piper

TL;DR: NarraBench是一个全面的叙事理解任务分类框架，通过对78个现有基准的调查，发现当前评测仅覆盖27%的叙事任务，并提出需要更多关注被忽视或评测不佳的叙事方面。

Details

Motivation: 现有的叙事理解评测工具覆盖不全面，许多重要的叙事方面（如事件、风格、视角、揭示）未被充分评测，且缺乏对主观性和多视角内容的评估能力。

Result: 发现当前评测仅覆盖27%的叙事任务，多项关键叙事方面（如事件、风格等）几乎未被评测，且缺乏主观性评估能力。

Insight: 1. 叙事理解评测亟需扩展至更多领域；2. 主观性和多视角内容的评测是未来的重点方向。

Abstract: We present NarraBench, a theory-informed taxonomy of narrative-understanding tasks, as well as an associated survey of 78 existing benchmarks in the area. We find significant need for new evaluations covering aspects of narrative understanding that are either overlooked in current work or are poorly aligned with existing metrics. Specifically, we estimate that only 27% of narrative tasks are well captured by existing benchmarks, and we note that some areas – including narrative events, style, perspective, and revelation – are nearly absent from current evaluations. We also note the need for increased development of benchmarks capable of assessing constitutively subjective and perspectival aspects of narrative, that is, aspects for which there is generally no single correct answer. Our taxonomy, survey, and methodology are of value to NLP researchers seeking to test LLM narrative understanding.

[169] CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs cs.CLPDF

Nafiseh Nikeghbal, Amir Hossein Kargaran, Jana Diesner

TL;DR: 论文介绍了CoBia，一种轻量级的对抗攻击方法，用于系统分析大型语言模型（LLM）在构建对话中暴露的潜在社会偏见，重点关注性别、种族、宗教等社会群体。

Details

Motivation: 尽管LLM的安全防护不断增强，但在对话中仍可能表现出有害行为（如种族主义观点）。论文旨在通过构建对话揭示这些隐蔽的偏见。

Result: 实验结果表明，LLM在构建对话中容易放大偏见，且常常无法拒绝后续的偏见问题，揭示了其内在的偏见问题。

Insight: LLM的偏见问题可能在常规安全测试中被掩盖，而通过交互式攻击（如CoBia）可以更有效地揭示这些问题。

Abstract: Improvements in model construction, including fortified safety guardrails, allow Large language models (LLMs) to increasingly pass standard safety checks. However, LLMs sometimes slip into revealing harmful behavior, such as expressing racist viewpoints, during conversations. To analyze this systematically, we introduce CoBia, a suite of lightweight adversarial attacks that allow us to refine the scope of conditions under which LLMs depart from normative or ethical behavior in conversations. CoBia creates a constructed conversation where the model utters a biased claim about a social group. We then evaluate whether the model can recover from the fabricated bias claim and reject biased follow-up questions. We evaluate 11 open-source as well as proprietary LLMs for their outputs related to six socio-demographic categories that are relevant to individual safety and fair treatment, i.e., gender, race, religion, nationality, sex orientation, and others. Our evaluation is based on established LLM-based bias metrics, and we compare the results against human judgments to scope out the LLMs’ reliability and alignment. The results suggest that purposefully constructed conversations reliably reveal bias amplification and that LLMs often fail to reject biased follow-up questions during dialogue. This form of stress-testing highlights deeply embedded biases that can be surfaced through interaction. Code and artifacts are available at https://github.com/nafisenik/CoBia.

[170] iBERT: Interpretable Style Embeddings via Sense Decomposition cs.CLPDF

Vishal Anand, Milad Alshomary, Kathleen McKeown

TL;DR: iBERT是一种可解释的BERT编码器，通过上下文无关的sense向量分解生成可解释和可控的嵌入，适用于风格和语义结构的模块化表示。

Details

Motivation: 现有BERT模型的嵌入难以解释和控制，iBERT旨在通过模块化和显式的sense分解方法解决这一问题。

Result: 在STEL基准上，iBERT风格表示效果提升8%，同时在作者验证任务中保持竞争力。

Insight: iBERT的结构化sense分解不仅适用于风格建模，还能泛化到混合监督信号的任务，揭示了嵌入的潜在可解释性。

Abstract: We present iBERT (interpretable-BERT), an encoder to produce inherently interpretable and controllable embeddings - designed to modularize and expose the discriminative cues present in language, such as stylistic and semantic structure. Each input token is represented as a sparse, non-negative mixture over k context-independent sense vectors, which can be pooled into sentence embeddings or used directly at the token level. This enables modular control over representation, before any decoding or downstream use. To demonstrate our model’s interpretability, we evaluate it on a suite of style-focused tasks. On the STEL benchmark, it improves style representation effectiveness by ~8 points over SBERT-style baselines, while maintaining competitive performance on authorship verification. Because each embedding is a structured composition of interpretable senses, we highlight how specific style attributes - such as emoji use, formality, or misspelling can be assigned to specific sense vectors. While our experiments center on style, iBERT is not limited to stylistic modeling. Its structural modularity is designed to interpretably decompose whichever discriminative signals are present in the data - enabling generalization even when supervision blends stylistic and semantic factors.

[171] DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning cs.CL | cs.LGPDF

Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavarm

TL;DR: DELTA提出了一种动态分层感知的token注意力机制，通过分层设计实现高效的长上下文推理，减少计算开销的同时保持模型准确性。

Details

Motivation: 现有的稀疏注意力方法在长推理任务中因累积选择错误和token动态重要性而准确性下降，DELTA旨在解决这一问题。

Result: 在AIME和GPQA-Diamond等推理基准上，DELTA在保持准确性的同时，减少5倍的token计算量，实现1.5倍的端到端加速。

Insight: 选择性复用中间注意力图为高效长上下文推理提供了新思路。

Abstract: Large reasoning models (LRMs) achieve state-of-the-art performance on challenging benchmarks by generating long chains of intermediate steps, but their inference cost is dominated by decoding, where each new token must attend to the entire growing sequence. Existing sparse attention methods reduce computation by pruning the key-value (KV) cache, yet they suffer from severe accuracy degradation on reasoning tasks due to cumulative selection errors and the dynamic importance of tokens over long derivations. We present \textbf{DELTA}, a training-free sparse attention mechanism that achieves computational efficiency without sacrificing model accuracy. DELTA partitions transformer layers into three groups: initial layers that use full attention, a small set of \emph{selection layers} that identify salient tokens via aggregated head-level attention scores, and subsequent \emph{sparse-attention layers} that attend only to the selected subset. This design preserves the full KV cache in GPU memory for accuracy, while avoiding expensive full-attention computation over many layers. On reasoning benchmarks such as AIME and GPQA-Diamond, DELTA matches or surpasses full attention in accuracy, while reducing the number of attended tokens by up to $5\times$ and delivering $1.5\times$ end-to-end speedup. Our results show that selective reuse of intermediate attention maps offers a robust path toward efficient long-context reasoning.

[172] Closing the Data-Efficiency Gap Between Autoregressive and Masked Diffusion LLMs cs.CL | cs.AIPDF

Xu Pan, Ely Hahami, Jingxuan Fan, Ziqian Xie, Haim Sompolinsky

TL;DR: 本文研究了自回归大语言模型（arLLMs）和掩码扩散大语言模型（dLLMs）在数据效率和知识注入方面的差异，提出了一种新方法提升arLLMs的微调效率。

Details

Motivation: 自回归模型在注入新知识时存在‘逆序诅咒’等问题，而掩码扩散模型在预训练阶段表现更好，但其在微调阶段的表现未知。本文旨在填补这一空白。

Result: dLLMs无需额外数据增强即可在两种问答任务中表现优异，而arLLMs依赖数据增强且效果受限；提出的掩码微调方法显著缩小了arLLMs与dLLMs的性能差距。

Insight: 掩码扩散模型在知识注入和数据效率上具有天然优势；通过改进微调方法，可以显著提升自回归模型的效果。

Abstract: Despite autoregressive large language models (arLLMs) being the current dominant paradigm in language modeling, they resist knowledge injection via fine-tuning due to inherent shortcomings such as the “reversal curse” – the challenge of answering questions that reverse the original information order in the training sample. Masked diffusion large language models (dLLMs) are rapidly emerging as a powerful alternative to the arLLM paradigm, with evidence of better data efficiency and free of the “reversal curse” in pre-training. However, it is unknown whether these advantages extend to the post-training phase, i.e. whether pre-trained dLLMs can easily acquire new knowledge through fine-tuning. On three diverse datasets, we fine-tune arLLMs and dLLMs, evaluating them with forward and backward style Question Answering (QA) to probe knowledge generalization and the reversal curse. Our results confirm that arLLMs critically rely on extensive data augmentation via paraphrases for QA generalization, and paraphrases are only effective when their information order matches the QA style. Conversely, dLLMs achieve high accuracies on both forward and backward QAs without paraphrases; adding paraphrases yields only marginal gains. Lastly, inspired by the dLLM’s performance, we introduce a novel masked fine-tuning paradigm for knowledge injection into pre-trained arLLMs. This proposed method successfully and drastically improves the data efficiency of arLLM fine-tuning, effectively closing the performance gap with dLLMs.

[173] Abductive Preference Learning cs.CLPDF

Yijin Ni, Peng Qi

TL;DR: 该论文提出了一种新的偏好学习方法——溯因偏好学习（Abductive Preference Learning），通过逆转传统的条件学习方式，解决了现有方法在应对反事实提示时的局限性。实验表明，该方法在响应选择和提示区分任务中均取得了显著提升。

Details

Motivation: 现有的大语言模型（如GPT-5和Claude Sonnet）即使经过RLHF和DPO等对齐方法，仍存在过度自信的问题，尤其是无法区分反事实提示。论文希望通过溯因偏好学习改进这一问题。

Result: 在多任务DPOP中，响应选择准确率从90.0%提升至99.5%，提示区分准确率从54.7%提升至85.0%。在AlpacaEval评估中，胜率从5.26%提升至6.17%。

Insight: 1. 传统偏好学习忽略了反事实提示的重要性；2. 通过逆转条件学习提示偏好能有效提升模型的敏感性和区分能力；3. 多任务目标结合了传统和溯因方法的优势。

Abstract: Frontier large language models such as GPT-5 and Claude Sonnet remain prone to overconfidence even after alignment through Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO). For instance, they tend to offer the same conservative answer “No” to both questions “Can I eat the [food / potato chips] that has been left out overnight?” despite the latter requiring no refridgeration for safe consumption. We find that this failure is potentially attributed to a limitation of existing preference learning: it emphasizes selecting the correct response for a given prompt, while neglecting counterfactual prompts that should alter the response. To address this limitation, we propose abductive preference learning, a fine-tuning paradigm that reverses the conventional conditioning by learning preferences over prompts given a response. To validate this idea, we construct an abductive dataset derived from the HaluEval QA benchmark with 1,001 entries, implementing abductive DPO and its variant DPOP. Experiments reveal complementary strengths: standard methods improve response selection, abductive methods improve prompt discrimination, while a multitask objective unifies both. On the abductive dataset, multitask DPOP boosts accuracy from $90.0%$ to $99.5%$ in response selection and $54.7%$ to $85.0%$ in prompt discrimination, with qualitative evidence highlighting improved sensitivity to prompt differences. Finally, evaluation on AlpacaEval shows multitask DPOP improves win rate (from $5.26%$ to $6.17%$), confirming that abductive preference learning preserves the benefits of conventional preference optimization while addressing the overlooked challenge of counterfactual prompts.

[174] HIPPD: Brain-Inspired Hierarchical Information Processing for Personality Detection cs.CL | cs.LGPDF

Guanming Chen, Lingzhi Shen, Xiaohao Cai, Imran Razzak, Shoaib Jameel

TL;DR: 这篇论文提出了一种名为HIPPD的大脑启发式框架，用于从文本中检测人格特质。HIPPD通过模拟人脑的分层次信息处理机制，结合大型语言模型、动态记忆模块和轻量级专家模型，显著提升了检测性能。

Details

Motivation: 现有的人格检测方法在多篇文本的上下文信息捕捉和语义稀疏环境下的鲁棒特征提取方面表现不佳。HIPPD通过模拟人脑的分层次处理机制来解决这些问题。

Result: 在Kaggle和Pandora数据集上的实验表明，HIPPD在人格检测任务上一致优于现有的最先进基线方法。

Insight: HIPPD的成功表明，模拟人脑的分层次信息处理机制可以有效提升文本分析任务的性能，尤其是在上下文感知和多层次特征提取方面。

Abstract: Personality detection from text aims to infer an individual’s personality traits based on linguistic patterns. However, existing machine learning approaches often struggle to capture contextual information spanning multiple posts and tend to fall short in extracting representative and robust features in semantically sparse environments. This paper presents HIPPD, a brain-inspired framework for personality detection that emulates the hierarchical information processing of the human brain. HIPPD utilises a large language model to simulate the cerebral cortex, enabling global semantic reasoning and deep feature abstraction. A dynamic memory module, modelled after the prefrontal cortex, performs adaptive gating and selective retention of critical features, with all adjustments driven by dopaminergic prediction error feedback. Subsequently, a set of specialised lightweight models, emulating the basal ganglia, are dynamically routed via a strict winner-takes-all mechanism to capture the personality-related patterns they are most proficient at recognising. Extensive experiments on the Kaggle and Pandora datasets demonstrate that HIPPD consistently outperforms state-of-the-art baselines.

[175] Don’t Throw Away Your Pretrained Model cs.CLPDF

Shangbin Feng, Wenhao Yu, Yike Wang, Hongming Zhang, Yulia Tsvetkov

TL;DR: 这篇论文提出了一种名为‘Switch Generation’的模型协作方法，通过让预训练模型和对齐模型在生成响应时动态切换，结合各自的优势，显著提升了任务表现。

Details

Motivation: 对齐训练虽然能提升模型在推理和指令跟随上的能力，但可能会削弱其在创造性和校准等方面的表现。作者希望通过模型协作来兼顾两者的优势。

Result: 实验表明，Switch Generation在16/18任务上优于单一模型，平均比基线方法提升12.9%。该方法还能发现组合技能并泛化到新模型和任务。

Insight: 模型协作可以充分利用训练管道中通常被丢弃的副产品（如不同版本的模型），解决单一模型难以应对的任务。

Abstract: Alignment training has tradeoffs: it helps language models (LMs) gain in reasoning and instruction following but might lose out on skills such as creativity and calibration, where unaligned base models are better at. We aim to make the best of both worlds through model collaboration, where different models in the training pipeline collaborate and complement each other. Since LM responses feature interleaving skills that favor different models, we propose Switch Generation, where pretrained and aligned model versions take turns to ``speak’’ in a response sequence. Specifically, we train a switcher LM by learning from outcomes of choosing different models to generate the next segment across diverse queries and contexts. At inference time, the switcher LM guides different model checkpoints to dynamically generate the next segment where their strengths are most needed. Extensive experiments with 8 model collaboration baselines and 18 datasets show that 1) model collaboration consistently outperforms individual models on 16 out of 18 tasks, and 2) Switch Generation further outperforms baselines by 12.9% on average. Further analysis reveals that Switch Generation discovers compositional skills to solve problems where individual models struggle and generalizes to unseen models and tasks, reusing and repurposing by-products in expensive model training pipelines that are otherwise discarded.

[176] Enhancing Faithfulness in Abstractive Summarization via Span-Level Fine-Tuning cs.CLPDF

Sicong Huang, Qianqi Yan, Shengze Wang, Ian Lane

TL;DR: 该论文通过span级别的微调方法，提升抽象摘要的忠实性，提出了一种新的数据集和三种微调技术，其中unlikelihood训练效果最佳。

Details

Motivation: 大型语言模型（LLM）生成的摘要虽然流畅，但存在不忠实的问题（如幻觉），现有方法无法完全解决，因此需要更有效的微调策略。

Result: 实验表明，三种方法均能利用span标注提升忠实性，其中unlikelihood训练效果最好。

Insight: span级别标注的直接监督可以有效减少幻觉，unlikelihood训练是一种高效的微调策略。

Abstract: Abstractive summarization using large language models (LLMs) has become an essential tool for condensing information. However, despite their ability to generate fluent summaries, these models sometimes produce unfaithful summaries, introducing hallucinations at the word, phrase, or concept level. Existing mitigation strategies, such as post-processing corrections or contrastive learning with synthetically generated negative samples, fail to fully address the diverse errors that can occur in LLM-generated summaries. In this paper, we investigate fine-tuning strategies to reduce the occurrence of unfaithful spans in generated summaries. First, we automatically generate summaries for the set of source documents in the training set with a variety of LLMs and then use GPT-4o to annotate any hallucinations it detects at the span-level. Leveraging these annotations, we fine-tune LLMs with both hallucination-free summaries and annotated unfaithful spans to enhance model faithfulness. In this paper, we introduce a new dataset that contains both faithful and unfaithful summaries with span-level labels and we evaluate three techniques to fine-tuning a LLM to improve the faithfulness of the resulting summarization: gradient ascent, unlikelihood training, and task vector negation. Experimental results show that all three approaches successfully leverage span-level annotations to improve faithfulness, with unlikelihood training being the most effective.

[177] Unifying Tree Search Algorithm and Reward Design for LLM Reasoning: A Survey cs.CLPDF

Jiaqi Wei, Xiang Zhang, Yuejin Yang, Wenxuan Huang, Juntai Cao

TL;DR: 该论文提出了一个统一框架，将树搜索算法和奖励设计分解为三个核心组件，解决了奖励信号角色的模糊性，并为自主、自改进代理的研究指明了方向。

Details

Motivation: 现代大型语言模型（LLM）研究中，树搜索算法是一个关键范式，但领域内缺乏统一的形式化定义，尤其是奖励信号的角色模糊不清。本文旨在解决这一问题，为研究提供清晰的理论基础。

Result: 提出了一套系统化的理论框架和分类法，为未来自主、自改进代理的研究奠定了基础。

Insight: 1. 奖励信号的设计可以从瞬时和持久两个角度进行区分；2. 搜索算法的研究需要结合具体的任务需求和模型改进目标。

Abstract: Deliberative tree search is a cornerstone of modern Large Language Model (LLM) research, driving the pivot from brute-force scaling toward algorithmic efficiency. This single paradigm unifies two critical frontiers: \textbf{Test-Time Scaling (TTS)}, which deploys on-demand computation to solve hard problems, and \textbf{Self-Improvement}, which uses search-generated data to durably enhance model parameters. However, this burgeoning field is fragmented and lacks a common formalism, particularly concerning the ambiguous role of the reward signal – is it a transient heuristic or a durable learning target? This paper resolves this ambiguity by introducing a unified framework that deconstructs search algorithms into three core components: the \emph{Search Mechanism}, \emph{Reward Formulation}, and \emph{Transition Function}. We establish a formal distinction between transient \textbf{Search Guidance} for TTS and durable \textbf{Parametric Reward Modeling} for Self-Improvement. Building on this formalism, we introduce a component-centric taxonomy, synthesize the state-of-the-art, and chart a research roadmap toward more systematic progress in creating autonomous, self-improving agents.

[178] Toward Machine Translation Literacy: How Lay Users Perceive and Rely on Imperfect Translations cs.CLPDF

Yimin Xiao, Yongle Zhang, Dayeon Ki, Calvin Bao, Marianna J. Martindale

TL;DR: 论文研究了机器翻译（MT）在普通用户中的使用情况，发现非双语用户因缺乏评估策略和替代方案而过度依赖MT，而错误体验可能促使用户重新评估未来依赖。

Details

Motivation: 随着MT的日益普及，理解公众对不完美MT的感知和使用方式对于MT研究在实际应用中的意义至关重要。

Result: 研究表明，非双语用户常因缺乏评估能力而过度依赖MT，但错误体验可能促使其减少未来依赖。

Insight: 提升MT质量的同时，需注重培养用户的MT素养，以减少对不完美翻译的盲目依赖。

Abstract: As Machine Translation (MT) becomes increasingly commonplace, understanding how the general public perceives and relies on imperfect MT is crucial for contextualizing MT research in real-world applications. We present a human study conducted in a public museum (n=452), investigating how fluency and adequacy errors impact bilingual and non-bilingual users’ reliance on MT during casual use. Our findings reveal that non-bilingual users often over-rely on MT due to a lack of evaluation strategies and alternatives, while experiencing the impact of errors can prompt users to reassess future reliance. This highlights the need for MT evaluation and NLP explanation techniques to promote not only MT quality, but also MT literacy among its users.

[179] MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction cs.CL | cs.SD | eess.ASPDF

Jianjin Wang, Runsong Zhao, Xiaoqian Liu, Yuan Ge, Ziqiang Xu

TL;DR: 论文提出了一种多令牌预测（MTP）损失函数，应用于语音到单位翻译（S2UT）模型中，通过预测多个后续令牌来增强语义密度和翻译质量。进一步提出MTP-S2UT损失，将MTP应用于中间层，实现更早的语义信息增强。

Details

Motivation: 当前语音到语音翻译方法使用单个语音令牌作为中间表示，但其语义密度不足，难以表达完整语义单元。因此需要一种方法能预测多个令牌，提升语义完整性和信息密度。

Result: 实验表明，所有MTP损失变体均能提升S2UT翻译质量，其中MTP-S2UT表现最佳。

Insight: 语义信息增强不仅需关注输出层，还应提前到模型中间层，以实现更早且更全面的语义捕获。

Abstract: Current direct speech-to-speech translation methods predominantly employ speech tokens as intermediate representations. However, a single speech token is not dense in semantics, so we generally need multiple tokens to express a complete semantic unit. To address this limitation, we introduce multi-token prediction (MTP) loss into speech-to-unit translation (S2UT) models, enabling models to predict multiple subsequent tokens at each position, thereby capturing more complete semantics and enhancing information density per position. Initial MTP implementations apply the loss at the final layer, which improves output representation but initiates information enrichment too late. We hypothesize that advancing the information enrichment process to intermediate layers can achieve earlier and more effective enhancement of hidden representation. Consequently, we propose MTP-S2UT loss, applying MTP loss to hidden representation where CTC loss is computed. Experiments demonstrate that all MTP loss variants consistently improve the quality of S2UT translation, with MTP-S2UT achieving the best performance.

[180] Beyond the limitation of a single query: Train your LLM for query expansion with Reinforcement Learning cs.CL | cs.AI | cs.IRPDF

Shu Zhao, Tan Yu, Anbang Xu

TL;DR: 本文提出了ExpandSearch方法，通过强化学习训练LLM搜索代理，使其具备查询扩展能力，并结合预训练的Squeezer模型提升检索召回率和回答生成能力，在七大多跳问答基准上平均提升4.4%。

Details

Motivation: 现有的搜索代理（如Search-R1）在推理和搜索能力上受限，导致多跳问答任务表现不佳。为了解决这一问题，需要一种能在复杂查询中扩展查询并高效检索的方法。

Result: 在七大多跳问答基准测试中，ExpandSearch方法平均提升了4.4%的性能，显著优于现有基线方法。

Insight: 即使小规模（3B）的LLM，在结合Squeezer模型后也能展现出强大的查询扩展能力，表明辅助模型可以有效提升LLM的核心功能。

Abstract: Reasoning-augmented search agents, such as Search-R1, are trained to reason, search, and generate the final answer iteratively. Nevertheless, due to their limited capabilities in reasoning and search, their performance on multi-hop QA benchmarks remains far from satisfactory. To handle complex or compound queries, we train an LLM-based search agent with the native capability of query expansion through reinforcement learning. In each turn, our search agent proposes several query variants, which are searched simultaneously to cover more relevant information. Meanwhile, given limited post-training data and computing resources, it is very challenging for a search agent to master multiple tasks, including query generation, retrieved information understanding, and answer generation. Therefore, we propose incorporating a pre-trained squeezer model that helps the search agent understand the retrieved documents, allowing the search agent to focus on query generation for high retrieval recall. With the assistance of the squeezer model, we discover that even a small-scale 3B LLM can demonstrate a strong capability of query expansion and achieve state-of-the-art accuracy on the multi-hop QA benchmarks. To be specific, our experiments across seven question-answering benchmarks demonstrate that our method, named ExpandSearch, achieves an average improvement of 4.4% compared to state-of-the-art baselines, with strong gains on multi-hop reasoning tasks requiring diverse evidence aggregation.

[181] Path Drift in Large Reasoning Models:How First-Person Commitments Override Safety cs.CLPDF

Yuyi Huang, Runzhe Zhan, Lidia S. Chao, Ailin Tao, Derek F. Wong

TL;DR: 论文研究了大型语言模型（LLM）在长链思维推理（Long-CoT）中的路径漂移问题，即模型推理偏离安全路径的现象，并提出了触发路径漂移的三个行为机制及防御策略。

Details

Motivation: 尽管通过RLHF等对齐技术实现了早期安全防护，但在长链推理任务中，模型的推理路径可能会偏离对齐路径，导致不安全内容的生成。这一现象尚未被充分研究，因此论文旨在揭示其机制并提出解决方案。

Result: 实验表明，三阶段的诱导框架能够独立或联合降低模型的拒绝率，而防御策略能够有效缓解路径漂移的风险。

Insight: 论文揭示了长链推理任务中路径漂移的风险，强调了在长形式推理中需要对轨迹级别对齐进行监督的重要性，而不仅仅是令牌级别的对齐。

Abstract: As large language models (LLMs) are increasingly deployed for complex reasoning tasks, Long Chain-of-Thought (Long-CoT) prompting has emerged as a key paradigm for structured inference. Despite early-stage safeguards enabled by alignment techniques such as RLHF, we identify a previously underexplored vulnerability: reasoning trajectories in Long-CoT models can drift from aligned paths, resulting in content that violates safety constraints. We term this phenomenon Path Drift. Through empirical analysis, we uncover three behavioral triggers of Path Drift: (1) first-person commitments that induce goal-driven reasoning that delays refusal signals; (2) ethical evaporation, where surface-level disclaimers bypass alignment checkpoints; (3) condition chain escalation, where layered cues progressively steer models toward unsafe completions. Building on these insights, we introduce a three-stage Path Drift Induction Framework comprising cognitive load amplification, self-role priming, and condition chain hijacking. Each stage independently reduces refusal rates, while their combination further compounds the effect. To mitigate these risks, we propose a path-level defense strategy incorporating role attribution correction and metacognitive reflection (reflective safety cues). Our findings highlight the need for trajectory-level alignment oversight in long-form reasoning beyond token-level alignment.

[182] CLMN: Concept based Language Models via Neural Symbolic Reasoning cs.CL | cs.AI | 68T50, 68T07, 68T27 | I.2.7; I.2.6PDF

Yibo Yang

TL;DR: CLMN是一种神经符号框架，通过结合连续可读的概念嵌入和模糊逻辑推理，提升了NLP模型的性能和可解释性。

Details

Motivation: 深度学习的NLP模型在医疗和金融等领域缺乏可解释性。现有的概念瓶颈模型在文本处理中要么使用二值激活损害文本表示，要么使用潜在概念削弱语义，且缺乏动态概念交互的建模。

Result: 在多个数据集和预训练语言模型上，CLMN在准确性和解释质量上均优于现有方法。

Insight: 神经表示与符号推理在统一概念空间中的结合，可实现实用且透明的NLP系统。

Abstract: Deep learning has advanced NLP, but interpretability remains limited, especially in healthcare and finance. Concept bottleneck models tie predictions to human concepts in vision, but NLP versions either use binary activations that harm text representations or latent concepts that weaken semantics, and they rarely model dynamic concept interactions such as negation and context. We introduce the Concept Language Model Network (CLMN), a neural-symbolic framework that keeps both performance and interpretability. CLMN represents concepts as continuous, human-readable embeddings and applies fuzzy-logic reasoning to learn adaptive interaction rules that state how concepts affect each other and the final decision. The model augments original text features with concept-aware representations and automatically induces interpretable logic rules. Across multiple datasets and pre-trained language models, CLMN achieves higher accuracy than existing concept-based methods while improving explanation quality. These results show that integrating neural representations with symbolic reasoning in a unified concept space can yield practical, transparent NLP systems.

[183] Unilaw-R1: A Large Language Model for Legal Reasoning with Reinforcement Learning and Iterative Inference cs.CLPDF

Hua Cai, Shuang Zhao, Liang Zhang, Xuli Shen, Qing Xu

TL;DR: 论文介绍了Unilaw-R1，一个针对法律推理的大语言模型，通过监督微调和强化学习两阶段训练策略，解决了法律知识不足、推理逻辑不可靠和业务泛化能力弱的问题，并在多项基准测试中表现优异。

Details

Motivation: 当前大语言模型在处理复杂法律问题上的能力尚未充分探索，Unilaw-R1旨在填补这一空白，通过轻量化的模型设计降低部署成本，同时提升法律推理性能和可解释性。

Result: Unilaw-R1在权威基准测试中表现优异，超越同类规模模型，并与更大规模模型（如DeepSeek-R1-Distill-Qwen-32B）竞争；在LawBench和LexEval上显著优于Qwen-2.5-7B-Instruct，平均提升6.6%。

Insight: 轻量化设计与高质量数据集的结合，以及两阶段训练策略，可以显著提升模型在法律领域的推理能力和泛化性能；强化学习的应用增强了模型的解释性。

Abstract: Reasoning-focused large language models (LLMs) are rapidly evolving across various domains, yet their capabilities in handling complex legal problems remains underexplored. In this paper, we introduce Unilaw-R1, a large language model tailored for legal reasoning. With a lightweight 7-billion parameter scale, Unilaw-R1 significantly reduces deployment cost while effectively tackling three core challenges in the legal domain: insufficient legal knowledge, unreliable reasoning logic, and weak business generalization. To address these issues, we first construct Unilaw-R1-Data, a high-quality dataset containing 17K distilled and screened chain-of-thought (CoT) samples. Based on this, we adopt a two-stage training strategy combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), which significantly boosts the performance on complex legal reasoning tasks and supports interpretable decision-making in legal AI applications. To assess legal reasoning ability, we also introduce Unilaw-R1-Eval, a dedicated benchmark designed to evaluate models across single- and multi-choice legal tasks. Unilaw-R1 demonstrates strong results on authoritative benchmarks, outperforming all models of similar scale and achieving performance on par with the much larger DeepSeek-R1-Distill-Qwen-32B (54.9%). Following domain-specific training, it also showed significant gains on LawBench and LexEval, exceeding Qwen-2.5-7B-Instruct (46.6%) by an average margin of 6.6%.

[184] Stop When Enough: Adaptive Early-Stopping for Chain-of-Thought Reasoning cs.CLPDF

Renliang Sun, Wei Cheng, Dawei Li, Haifeng Chen, Wei Wang

TL;DR: 该论文提出了REFRAIN框架，通过自适应停止推理来减少大型语言模型（LLMs）在Chain-of-Thought（CoT）推理中的冗余步骤，从而降低成本并避免错误结论。

Details

Motivation: Chain-of-Thought推理虽然在复杂任务中提升了LLMs的性能，但冗余的推理步骤（称为“overthinking”）会增加计算成本并导致错误结论，因此需要一种自适应停止机制。

Result: 在四个基准测试和两种模型家族中，REFRAIN减少了20-55%的token使用量，同时保持或提高了推理准确性。

Insight: 研究突出了“何时停止”作为一种新的测试时扩展维度，使模型能够在推理中做到“恰到好处”。

Abstract: Chain-of-Thought (CoT) reasoning has driven recent gains of large language models (LLMs) on reasoning-intensive tasks by externalizing intermediate steps. However, excessive or redundant reasoning – so-called overthinking – can increase inference costs and lead LLMs toward incorrect conclusions. In this paper, we present REFRAIN ($\underline{REF}$lective-$\underline{R}$edundancy for $\underline{A}$daptive $\underline{IN}$ference), a training-free framework that adaptively determines when to stop reasoning to mitigate overthinking. REFRAIN integrates a two-stage stop discriminator to identify reflective yet redundant reasoning and a sliding-window Upper Confidence Bound (SW-UCB) multi-armed bandit controller to dynamically adjust stopping thresholds according to problem difficulty without supervision or fine-tuning. Across four representative benchmarks and two model families, REFRAIN reduces token usage by 20-55% while maintaining or improving accuracy compared to standard CoT prompting. Extensive ablation and robustness analyses demonstrate its stability across models, scorers, and prompt variations. In summary, our findings highlight when-to-stop as a new and practical axis of test-time scaling – enabling models to reason not just more, but just enough.

[185] LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora cs.CLPDF

Luyao Zhuang, Shengyuan Chen, Yilin Xiao, Huachi Zhou, Yujing Zhang

TL;DR: LinearRAG提出了一种高效的检索增强生成（RAG）框架，通过构建无关系的分层图（Tri-Graph），避免了传统方法中不稳定且高成本的关系提取问题，显著提升了在大规模语料上的检索效果。

Details

Motivation: 传统RAG系统在大规模非结构化语料中表现不佳，而基于知识图谱的GraphRAG方法因关系提取不稳定且成本高，导致图构建质量差，影响检索效果。

Result: 在四个数据集上的实验表明，LinearRAG显著优于基线模型。

Insight: 避免了复杂关系提取，提升了效率和可靠性，适用于大规模语料的多跳推理任务。

Abstract: Retrieval-Augmented Generation (RAG) is widely used to mitigate hallucinations of Large Language Models (LLMs) by leveraging external knowledge. While effective for simple queries, traditional RAG systems struggle with large-scale, unstructured corpora where information is fragmented. Recent advances incorporate knowledge graphs to capture relational structures, enabling more comprehensive retrieval for complex, multi-hop reasoning tasks. However, existing graph-based RAG (GraphRAG) methods rely on unstable and costly relation extraction for graph construction, often producing noisy graphs with incorrect or inconsistent relations that degrade retrieval quality. In this paper, we revisit the pipeline of existing GraphRAG systems and propose LinearRAG (Linear Graph-based Retrieval-Augmented Generation), an efficient framework that enables reliable graph construction and precise passage retrieval. Specifically, LinearRAG constructs a relation-free hierarchical graph, termed Tri-Graph, using only lightweight entity extraction and semantic linking, avoiding unstable relation modeling. This new paradigm of graph construction scales linearly with corpus size and incurs no extra token consumption, providing an economical and reliable indexing of the original passages. For retrieval, LinearRAG adopts a two-stage strategy: (i) relevant entity activation via local semantic bridging, followed by (ii) passage retrieval through global importance aggregation. Extensive experiments on four datasets demonstrate that LinearRAG significantly outperforms baseline models.

[186] Hybrid OCR-LLM Framework for Enterprise-Scale Document Information Extraction Under Copy-heavy Task cs.CL | cs.AIPDF

Zilong Wang, Xiaoyu Shen

TL;DR: 论文提出了一种结合OCR引擎与大语言模型（LLM）的混合框架，用于企业级文档信息提取，特别是在重复内容较多的任务中优化了准确性与效率的权衡。

Details

Motivation: 处理大量结构相似的文档内容是企业的关键任务，但现有方法缺乏针对性，无法高效应对重复性任务。本文旨在通过智能策略选择，利用文档特性提升性能。

Result: 在结构化文档中实现了F1=1.0的精度和0.97秒的延迟；在图像输入中达到F1=0.997的精度和0.6秒的延迟；性能比传统方法提升54倍。

Insight: 重复性任务可以通过结构感知的方法选择转化为优化机会，为类似任务提供了通用设计原则。

Abstract: Information extraction from copy-heavy documents, characterized by massive volumes of structurally similar content, represents a critical yet understudied challenge in enterprise document processing. We present a systematic framework that strategically combines OCR engines with Large Language Models (LLMs) to optimize the accuracy-efficiency trade-off inherent in repetitive document extraction tasks. Unlike existing approaches that pursue universal solutions, our method exploits document-specific characteristics through intelligent strategy selection. We implement and evaluate 25 configurations across three extraction paradigms (direct, replacement, and table-based) on identity documents spanning four formats (PNG, DOCX, XLSX, PDF). Through table-based extraction methods, our adaptive framework delivers outstanding results: F1=1.0 accuracy with 0.97s latency for structured documents, and F1=0.997 accuracy with 0.6 s for challenging image inputs when integrated with PaddleOCR, all while maintaining sub-second processing speeds. The 54 times performance improvement compared with multimodal methods over naive approaches, coupled with format-aware routing, enables processing of heterogeneous document streams at production scale. Beyond the specific application to identity extraction, this work establishes a general principle: the repetitive nature of copy-heavy tasks can be transformed from a computational burden into an optimization opportunity through structure-aware method selection.

[187] A Survey of Inductive Reasoning for Large Language Models cs.CL | cs.AIPDF

Kedi Chen, Dezhao Ruan, Yuhao Dan, Yaoting Wang, Siyu Yan

TL;DR: 这篇论文首次全面综述了大语言模型（LLMs）中的归纳推理方法，将其改进方法分为三类，总结了现有基准，并提出了基于沙箱的统一评估方法。

Details

Motivation: 归纳推理是LLMs中重要的推理范式，有助于知识泛化和人类认知对齐，但目前缺乏系统性总结。

Result: 提供了归纳能力来源的分析，以及简单模型结构和数据对归纳任务的帮助。

Insight: 归纳能力可能源于模型架构和数据设计的结合，为未来研究奠定基础。

Abstract: Reasoning is an important task for large language models (LLMs). Among all the reasoning paradigms, inductive reasoning is one of the fundamental types, which is characterized by its particular-to-general thinking process and the non-uniqueness of its answers. The inductive mode is crucial for knowledge generalization and aligns better with human cognition, so it is a fundamental mode of learning, hence attracting increasing interest. Despite the importance of inductive reasoning, there is no systematic summary of it. Therefore, this paper presents the first comprehensive survey of inductive reasoning for LLMs. First, methods for improving inductive reasoning are categorized into three main areas: post-training, test-time scaling, and data augmentation. Then, current benchmarks of inductive reasoning are summarized, and a unified sandbox-based evaluation approach with the observation coverage metric is derived. Finally, we offer some analyses regarding the source of inductive ability and how simple model architectures and data help with inductive tasks, providing a solid foundation for future research.

[188] MedAgentAudit: Diagnosing and Quantifying Collaborative Failure Modes in Medical Multi-Agent Systems cs.CL | cs.AI | cs.MAPDF

Lei Gu, Yinghao Zhu, Haoran Sang, Zixiang Wang, Dehao Sui

TL;DR: 论文对基于大规模语言模型（LLM）的多智能体医疗咨询系统进行了实证研究，揭示了其协作过程中的失败模式，强调了透明和可验证的推理过程的重要性。

Details

Motivation: 现有对多智能体医疗系统的评估仅关注最终答案的准确性，忽略了其内部协作过程的透明性和可靠性，这在高风险医疗应用中可能引发严重后果。

Result: 研究发现四种主导的失败模式：共享模型缺陷驱动的错误共识、正确少数观点的压制、无效的讨论动态以及合成过程中的关键信息丢失。

Insight: 高准确性不足以衡量医疗AI的可信度，透明和可审计的推理过程是负责任开发和部署医疗AI的关键。

Abstract: While large language model (LLM)-based multi-agent systems show promise in simulating medical consultations, their evaluation is often confined to final-answer accuracy. This practice treats their internal collaborative processes as opaque “black boxes” and overlooks a critical question: is a diagnostic conclusion reached through a sound and verifiable reasoning pathway? The inscrutable nature of these systems poses a significant risk in high-stakes medical applications, potentially leading to flawed or untrustworthy conclusions. To address this, we conduct a large-scale empirical study of 3,600 cases from six medical datasets and six representative multi-agent frameworks. Through a rigorous, mixed-methods approach combining qualitative analysis with quantitative auditing, we develop a comprehensive taxonomy of collaborative failure modes. Our quantitative audit reveals four dominant failure patterns: flawed consensus driven by shared model deficiencies, suppression of correct minority opinions, ineffective discussion dynamics, and critical information loss during synthesis. This study demonstrates that high accuracy alone is an insufficient measure of clinical or public trust. It highlights the urgent need for transparent and auditable reasoning processes, a cornerstone for the responsible development and deployment of medical AI.

[189] You only need 4 extra tokens: Synergistic Test-time Adaptation for LLMs cs.CL | cs.AI | cs.LGPDF

Yijie Xu, Huizai Yao, Zhiyu Guo, Weiyu Guo, Pengteng Li

TL;DR: 论文提出了一种无需标注数据的测试时自适应框架SyTTA，通过结合输入端的困惑度和输出端的预测熵信号，显著提升了语言模型在分布偏移下的表现。

Details

Motivation: 大语言模型在专业领域部署时常面临训练数据分布偏移的问题，专业领域标注数据昂贵且稀缺，因此需要一种无监督的测试时自适应方法。

Result: 在农业问答等任务中，SyTTA显著提升了性能（如Rouge-LSum提高120%），验证了方法的有效性。

Insight: 测试时自适应可以通过无监督信号实现，为语言模型在标注稀缺领域的部署提供了新思路。

Abstract: Large language models (LLMs) are increasingly deployed in specialized domains such as finance, medicine, and agriculture, where they face significant distribution shifts from their training data. Domain-specific fine-tuning can mitigate this challenge but relies on high-quality labeled data that is expensive and slow to collect in expertise-limited settings. We study label-free test-time adaptation for language models and present SyTTA, an inference-time framework that adapts models on-the-fly without additional supervision. SyTTA couples two complementary uncertainty signals that arise under distribution shift: input-side perplexity, indicating mismatch with domain-specific terminology and patterns, and output-side predictive entropy, indicating diffuse and unstable token probabilities during generation. Across diverse model architectures and domain-specific benchmarks, SyTTA delivers consistent gains. Notably, on agricultural question answering, SyTTA improves Rouge-LSum by over 120% on Qwen-2.5-7B with only 4 extra tokens per query. These results show that effective test-time adaptation for language models is achievable without labeled examples, supporting deployment in label-scarce domains. The code will be made available upon acceptance.

[190] Text2Token: Unsupervised Text Representation Learning with Token Target Prediction cs.CL | cs.IRPDF

Ruize An, Richong Zhang, Zhijie Nie, Zhanyu Wu, Yanzhao Zhang

TL;DR: 论文提出了Text2Token，一种基于无监督文本特征学习的生成框架，通过预测目标token分布来学习高质量的文本表示。该方法在MTEB v2基准测试中表现优异，与基于对比学习的LLM2Vec性能相当。

Details

Motivation: 无监督文本表征学习对自然语言处理任务非常重要，尤其是利用网络上的未标注数据。研究发现高质量的表征与文本中的关键token对齐，这启发了作者探索表征空间与词汇空间的潜在联系。

Result: 在MTEB v2基准测试中，Text2Token的性能与基于对比学习的LLM2Vec相当。研究表明词汇空间与表征空间在训练中协同优化。

Insight: 1. token对齐是高质量文本表征的关键；2. 词汇空间与表征空间的协同优化为未来研究提供了新思路。

Abstract: Unsupervised text representation learning (TRL) is a fundamental task in natural language processing, which is beneficial for improving search and recommendations with the web’s unlabeled texts. A recent empirical study finds that the high-quality representation aligns with the key token of the input text, uncovering the potential connection between representation space and vocabulary space. Inspired by the findings, we revisit the generative tasks and develop an unsupervised generative framework for TRL, Text2Token. The framework is based on the token target prediction task, utilizing carefully constructed target token distribution as supervisory signals. To construct the high-quality target token distribution, we analyze the token-alignment properties with advanced embedders and identify two essential categories of key tokens: (1) the meaningful tokens in the text and (2) semantically derived tokens beyond the text. Based on these insights, we propose two methods – data-driven and model-derived – to construct synthetic token targets from data or the LLM backbone. Experiments on the MTEB v2 benchmark demonstrate that Text2Token achieves performance competitive with the state-of-the-art embedder with unsupervised contrastive learning, LLM2Vec. Our analysis further shows that vocabulary and representation spaces optimize together and toward the optimum solution during training, providing new ideas and insights for future work.

Kangyang Luo, Yuzhuo Bai, Shuzheng Si, Cheng Gao, Zhitong Wang

TL;DR: 论文提出了ImCoref-CeS框架，结合改进的监督神经网络方法和LLM推理能力，通过轻量级桥接模块和双仿射评分器提升长期文本编码能力，并利用LLM作为检查器-分割器优化核心指代结果。

Details

Motivation: 当前核心指代消解任务面临一个问题：是基于小语言模型的监督神经方法继续优化，还是利用大语言模型（LLM）的强大能力。本文旨在结合两者的优势。

Result: 实验证明ImCoref-CeS优于现有SOTA方法，验证了框架的有效性。

Insight: 结合监督神经方法和LLM推理能力是解决核心指代问题的有效途径，轻量化设计和多角色LLM代理为未来研究提供了新思路。

Abstract: Coreference Resolution (CR) is a critical task in Natural Language Processing (NLP). Current research faces a key dilemma: whether to further explore the potential of supervised neural methods based on small language models, whose detect-then-cluster pipeline still delivers top performance, or embrace the powerful capabilities of Large Language Models (LLMs). However, effectively combining their strengths remains underexplored. To this end, we propose \textbf{ImCoref-CeS}, a novel framework that integrates an enhanced supervised model with LLM-based reasoning. First, we present an improved CR method (\textbf{ImCoref}) to push the performance boundaries of the supervised neural method by introducing a lightweight bridging module to enhance long-text encoding capability, devising a biaffine scorer to comprehensively capture positional information, and invoking a hybrid mention regularization to improve training efficiency. Importantly, we employ an LLM acting as a multi-role Checker-Splitter agent to validate candidate mentions (filtering out invalid ones) and coreference results (splitting erroneous clusters) predicted by ImCoref. Extensive experiments demonstrate the effectiveness of ImCoref-CeS, which achieves superior performance compared to existing state-of-the-art (SOTA) methods.

[192] Audit-of-Understanding: Posterior-Constrained Inference for Mathematical Reasoning in Language Models cs.CL | cs.AIPDF

Samir Abdaljalil, Erchin Serpedin, Khalid Qaraqe, Hasan Kurban

TL;DR: 本文提出了一种名为‘理解审计’（AoU）的框架，用于约束语言模型在数学推理中的推理过程，避免无支持的假设导致错误结论，从而显著提升了准确性和可信度。

Details

Motivation: 大型语言模型（LLMs）在推理过程中常常基于未经支持的假设生成看似连贯但实际上错误的结论。现有方法主要关注事实性幻觉或事后验证，未能有效解决推理引发的幻觉问题。

Result: 在GSM8K、MultiArith和SVAMP数据集上，AoU显著提升了准确性和可信度，最高分别提升了30%、45%和20-28%，优于Chain-of-Thought等方法。

Insight: AoU的核心在于约束推理过程，避免模型依赖未经支持的假设。这种方法不仅适用于数学推理，也可能推广到其他需要严谨推理的任务中。

Abstract: Large language models (LLMs) often generate reasoning traces that appear coherent but rest on unsupported assumptions, leading to hallucinated conclusions. Prior work mainly addresses factual hallucinations or relies on post-hoc verification, leaving reasoning-induced hallucinations largely unaddressed. We propose Audit-of-Understanding (AoU), a framework that constrains inference to validated premises through three phases: (1) decomposing a query into candidate assumptions, (2) auditing their support, and (3) conditioning inference only on the validated subset. Formally, AoU is \emph{posterior-constrained inference}, connecting to selective prediction and rejection learning. Our contributions are threefold: (i) theoretical guarantees under perfect validation, (ii) excess-risk bounds under imperfect audits, and (iii) tractability analysis. Empirically, AoU improves both accuracy and faithfulness on GSM8K, MultiArith, and SVAMP, achieving up to +30% gains on GSM8K, +45% on MultiArith, and consistent +20–28% improvements on SVAMP over Chain-of-Thought, Self-Consistency, and CoT-Decoding. Code is available at https://anonymous.4open.science/r/audit-of-understanding-E28B.

[193] Backdoor Collapse: Eliminating Unknown Threats via Known Backdoor Aggregation in Language Models cs.CLPDF

Liang Lin, Miao Yu, Moayad Aloqaily, Zhenhong Zhou, Kun Wang

TL;DR: 该论文提出了一种名为Backdoor Collapse的防御框架，旨在消除语言模型中未知的后门威胁，通过注入已知后门触发器和恢复微调来实现，显著降低了攻击成功率，同时保持了模型的原始性能。

Details

Motivation: 后门攻击对大型语言模型（LLMs）构成严重威胁，而现有防御方法对触发器设置的假设不切实际。本文旨在解决这一问题，提出一种无需触发器先验知识的防御框架。

Result: 攻击成功率显著降低（平均4.41%），同时模型清洁精度和功能损失极小（<0.5%），且方法对不同类型后门具有普适性。

Insight: 论文揭示了后门在表示空间的聚集现象，并通过已知后门聚合实现对未知后门的防御，为后门防御研究提供了新视角。

Abstract: Backdoor attacks are a significant threat to large language models (LLMs), often embedded via public checkpoints, yet existing defenses rely on impractical assumptions about trigger settings. To address this challenge, we propose \ourmethod, a defense framework that requires no prior knowledge of trigger settings. \ourmethod is based on the key observation that when deliberately injecting known backdoors into an already-compromised model, both existing unknown and newly injected backdoors aggregate in the representation space. \ourmethod leverages this through a two-stage process: \textbf{first}, aggregating backdoor representations by injecting known triggers, and \textbf{then}, performing recovery fine-tuning to restore benign outputs. Extensive experiments across multiple LLM architectures demonstrate that: (I) \ourmethod reduces the average Attack Success Rate to 4.41% across multiple benchmarks, outperforming existing baselines by 28.1%$\sim$69.3%$\uparrow$. (II) Clean accuracy and utility are preserved within 0.5% of the original model, ensuring negligible impact on legitimate tasks. (III) The defense generalizes across different types of backdoors, confirming its robustness in practical deployment scenarios.

[194] On the Entity-Level Alignment in Crosslingual Consistency cs.CLPDF

Yihong Liu, Mingyang Wang, François Yvon, Hinrich Schütze

TL;DR: 多语言大语言模型在跨语言一致性中存在实体对齐问题，本文通过实体级翻译任务验证了这一点，并提出两种方法提升一致性。

Details

Motivation: 多语言大语言模型需在不同语言间保持一致的事实性知识，但现有研究对其不一致性原因理解不足，本文聚焦于实体对齐问题。

Result: 实验显示SubSub和SubInj显著提升事实性召回准确率和一致性，同时揭示模型内部处理机制。

Insight: 实体对齐是跨语言一致性的关键，通过枢纽语言处理可有效对齐多语言表示，为多语言事实预测提供实用策略。

Abstract: Multilingual large language models (LLMs) are expected to recall factual knowledge consistently across languages. However, the factors that give rise to such crosslingual consistency – and its frequent failure – remain poorly understood. In this work, we hypothesize that these inconsistencies may arise from failures in entity alignment, the process of mapping subject and object entities into a shared conceptual space across languages. To test this, we assess alignment through entity-level (subject and object) translation tasks, and find that consistency is strongly correlated with alignment across all studied models, with misalignment of subjects or objects frequently resulting in inconsistencies. Building on this insight, we propose SubSub and SubInj, two effective methods that integrate English translations of subjects into prompts across languages, leading to substantial gains in both factual recall accuracy and consistency. Finally, our mechanistic analysis reveals that these interventions reinforce the entity representation alignment in the conceptual space through model’s internal pivot-language processing, offering effective and practical strategies for improving multilingual factual prediction.

[195] MatryoshkaThinking: Recursive Test-Time Scaling Enables Efficient Reasoning cs.CL | cs.AIPDF

Hongwei Chen, Yishu Lei, Dan Zhang, Bo Ke, Danxiang Zhu

TL;DR: MatryoshkaThinking 是一种递归测试时缩放方法，显著降低计算成本的同时保持高性能，实现了高效推理。

Details

Motivation: 传统测试时缩放方法（如 DeepConf）虽然有效，但计算开销较大。本文希望通过递归利用模型的内在能力，减少计算成本并提升性能。

Result: 在多个开源模型和多模态推理基准测试中验证了方法的有效性和通用性，显著降低计算成本的同时保持高性能。

Insight: 递归测试时缩放为设计高效、可扩展的语言模型推理策略提供了新思路。

Abstract: Test-time scaling has emerged as a promising paradigm in language modeling, wherein additional computational resources are allocated during inference to enhance model performance. Recent approaches, such as DeepConf, have demonstrated the efficacy of this strategy, however, they often incur substantial computational overhead to achieve competitive results. In this work, we propose MatryoshkaThinking, a novel method that significantly reduces computational cost while maintaining state-of-the-art performance. Specifically, MatryoshkaThinking attains a score of 99.79 on AIME2025 using only 4% of the computation required by DeepConf. The core of our approach lies in the recursive exploitation of the model’s intrinsic capabilities in reasoning, verification, and summarization, which collectively enhance the retention of correct solutions and reduce the disparity between Pass@k and Pass@1. Comprehensive evaluations across multiple open-source models and challenging multi-modal reasoning benchmarks validate the effectiveness and generality of our method. These findings offer new insights into the design of efficient and scalable test-time inference strategies for advanced language models.

[196] End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs cs.CLPDF

Nam Luu, Ondřej Bojar

TL;DR: 该论文提出了一种结合预训练语音编码器和大型语言模型（LLMs）的端到端架构，同时完成自动语音识别（ASR）和语音翻译（ST）任务，并在英语到德语的实验中表现优于SeamlessM4T模型，甚至媲美级联系统的性能。

Details

Motivation: 语音翻译任务的传统级联方法和端到端方法各有优劣。论文旨在探索如何利用预训练的语音编码器和LLMs，构建一个统一的端到端模型，以同时高效完成ASR和ST任务。

Result: 实验结果显示，该模型在英语到德语的翻译任务中，性能优于SeamlessM4T，并在COMET DA22指标上提升了8%。

Insight: 通过结合语音编码器和LLMs，端到端模型能够避免级联系统的错误传播问题，同时利用预训练模型的能力实现高效翻译。

Abstract: Speech Translation (ST) is a machine translation task that involves converting speech signals from one language to the corresponding text in another language; this task has two different approaches, namely the traditional cascade and the more recent end-to-end. This paper explores a combined end-to-end architecture of pre-trained speech encoders and Large Language Models (LLMs) for performing both Automatic Speech Recognition (ASR) and ST simultaneously. Experiments with the English-to-German language pair show that our best model not only can achieve better translation results than SeamlessM4T, a large foundational end-to-end, multi-modal translation model, but can also match the performance of a cascaded system with Whisper and NLLB, with up to a score gain of 8% in $\text{COMET}^{\text{DA}}_{22}$ metric.

[197] RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models cs.CL | cs.AI | cs.LGPDF

Aashiq Muhamed, Leonardo F. R. Ribeiro, Markus Dreyer, Virginia Smith, Mona T. Diab

TL;DR: 论文提出了RefusalBench，一种生成式评估方法，用于测试RAG系统中语言模型在有缺陷上下文下选择性拒绝的能力。研究发现前沿模型在多文档任务中拒绝准确性低于50%，并提出改进方向。

Details

Motivation: 现有的静态基准测试无法可靠评估语言模型在有缺陷上下文下的选择性拒绝能力，导致模型可能利用数据特异性或记忆测试实例，提出了动态评测的需求。

Result: 前沿模型在多文档任务中的拒绝准确性低于50%，表明选择性拒绝是一项可训练的、对齐敏感的能力。

Insight: 选择性拒绝可分为检测和分类两个技能，但目前模型的性能和规模或推理能力无关，提供了明确的改进方向。

Abstract: The ability of language models in RAG systems to selectively refuse to answer based on flawed context is critical for safety, yet remains a significant failure point. Our large-scale study reveals that even frontier models struggle in this setting, with refusal accuracy dropping below 50% on multi-document tasks, while exhibiting either dangerous overconfidence or overcaution. Static benchmarks fail to reliably evaluate this capability, as models exploit dataset-specific artifacts and memorize test instances. We introduce RefusalBench, a generative methodology that programmatically creates diagnostic test cases through controlled linguistic perturbation. Our framework employs 176 distinct perturbation strategies across six categories of informational uncertainty and three intensity levels. Evaluation of over 30 models uncovers systematic failure patterns: refusal comprises separable detection and categorization skills, and neither scale nor extended reasoning improves performance. We find that selective refusal is a trainable, alignment-sensitive capability, offering a clear path for improvement. We release two benchmarks – RefusalBench-NQ (single document) and RefusalBench-GaRAGe (multi-document) – and our complete generation framework to enable continued, dynamic evaluation of this critical capability.

[198] AssoMem: Scalable Memory QA with Multi-Signal Associative Retrieval cs.CLPDF

Kai Zhang, Xinyuan Zhang, Ejaz Ahmed, Hongda Jiang, Caleb Kumar

TL;DR: AssoMem是一种新颖的内存增强框架，通过构建关联记忆图来改善大规模记忆中的问答任务，结合多维信号（相关性、重要性和时间对齐）并通过自适应融合策略提升检索效果，在多个基准测试中表现优于现有方法。

Details

Motivation: 大规模记忆中的准确召回是AI助理问答任务的核心挑战，尤其是在语义密度高的场景下，现有方法主要依赖查询的语义距离进行检索，忽视了人类信息关联的多维特性。

Result: 在三个基准测试和新数据集MeetingQA上，AssoMem的表现优于现有最先进方法，验证了其在上下文感知记忆召回中的优越性。

Insight: 通过模拟人类关联信息的特性，结合多维信号和自适应融合，可以显著提升记忆增强问答任务的性能。

Abstract: Accurate recall from large scale memories remains a core challenge for memory augmented AI assistants performing question answering (QA), especially in similarity dense scenarios where existing methods mainly rely on semantic distance to the query for retrieval. Inspired by how humans link information associatively, we propose AssoMem, a novel framework constructing an associative memory graph that anchors dialogue utterances to automatically extracted clues. This structure provides a rich organizational view of the conversational context and facilitates importance aware ranking. Further, AssoMem integrates multi-dimensional retrieval signals-relevance, importance, and temporal alignment using an adaptive mutual information (MI) driven fusion strategy. Extensive experiments across three benchmarks and a newly introduced dataset, MeetingQA, demonstrate that AssoMem consistently outperforms SOTA baselines, verifying its superiority in context-aware memory recall.

[199] STEAM: A Semantic-Level Knowledge Editing Framework for Large Language Models cs.CL | cs.AIPDF

Geunyeong Jeong, Juoh Sun, Seonghee Lee, Harksoo Kim

TL;DR: STEAM是一个语义级知识编辑框架，旨在通过潜在空间对齐提升LLM的知识编辑能力，使其编辑后的知识更具语义连贯性。

Details

Motivation: 大型语言模型的知识是静态的，无法动态更新。现有知识编辑方法多关注词级优化，缺乏语义连贯性。

Result: 实验表明STEAM提升了模型的推理能力和语义连贯性，验证了潜在空间对齐对知识编辑的重要性。

Insight: 语义级编辑比词级优化更能确保知识的自然整合，潜在空间对齐是关键。

Abstract: Large Language Models store extensive factual knowledge acquired during large-scale pre-training. However, this knowledge is inherently static, reflecting only the state of the world at the time of training. Knowledge editing has emerged as a promising solution for updating outdated or incorrect facts without full retraining. However, most existing locate-and-edit methods primarily focus on token-level likelihood optimization without addressing semantic coherence. Our analysis reveals that such edited knowledge is often encoded as isolated residual streams in the model’s latent space, distinct from pre-existing knowledge and bypassing natural reasoning process. To address this, we propose \textsc{Steam}, a semantic-level knowledge editing framework that enhances integration of updated knowledge into the model’s knowledge structure. \textsc{Steam} first identifies target representations as semantic anchors for the updated factual association, then guides the internal representation of the edited fact towards these anchors through an alignment loss during optimization. Experimental results demonstrate that \textsc{Steam} improves model’s ability to reason with edited knowledge and enhances semantic coherence, underscoring the importance of latent-space alignment for reliable and coherent knowledge editing. The code is available at https://github.com/GY-Jeong/STEAM.

[200] Do Audio LLMs Really LISTEN, or Just Transcribe? Measuring Lexical vs. Acoustic Emotion Cues Reliance cs.CL | cs.AIPDF

Jingyi Chen, Zhimeng Guo, Jiyun Chun, Pichao Wang, Andrew Perrault

TL;DR: 该论文提出了LISTEN基准测试，用于评估音频语言模型（LALMs）在处理情绪时对词汇和声学信息的依赖，发现现有模型主要依赖词汇信息而非声学线索。

Details

Motivation: 目前尚不清楚大型音频语言模型在理解情绪时是真正处理声学信息还是主要依赖词汇内容。论文旨在量化模型对这两种线索的依赖程度。

Result: 现有LALMs表现出明显的词汇主导现象，模型在词汇中性时预测为‘中性’，在声学和词汇线索冲突时分类能力下降，性能接近随机猜测。

Insight: 当前LALMs更像‘转录器’而非‘倾听者’，严重依赖词汇语义而忽视了声学线索。LISTEN为多模态模型的情绪理解评估提供了框架。

Abstract: Understanding emotion from speech requires sensitivity to both lexical and acoustic cues. However, it remains unclear whether large audio language models (LALMs) genuinely process acoustic information or rely primarily on lexical content. We present LISTEN (Lexical vs. Acoustic Speech Test for Emotion in Narratives), a controlled benchmark designed to disentangle lexical reliance from acoustic sensitivity in emotion understanding. Across evaluations of six state-of-the-art LALMs, we observe a consistent lexical dominance. Models predict “neutral” when lexical cues are neutral or absent, show limited gains under cue alignment, and fail to classify distinct emotions under cue conflict. In paralinguistic settings, performance approaches chance. These results indicate that current LALMs largely “transcribe” rather than “listen,” relying heavily on lexical semantics while underutilizing acoustic cues. LISTEN offers a principled framework for assessing emotion understanding in multimodal models.

[201] RECON: Reasoning with Condensation for Efficient Retrieval-Augmented Generation cs.CLPDF

Zhichao Xu, Minheng Wang, Yawei Wang, Wenqian Ye, Yuntao Du

TL;DR: RECON是一个通过在推理循环中引入显式摘要模块来压缩检索到的文档的框架，从而提高了检索增强生成（RAG）系统的效率和性能。

Details

Motivation: 现有的基于强化学习的RAG系统由于上下文管理效率低下，导致了长文档和噪声数据增加了成本并降低了性能，因此需要一种高效的上下文压缩方法。

Result: 实验结果显示，RECON将上下文长度减少35%，训练速度和推理延迟均得到提升。在QA基准测试中，平均EM分数提升显著（3B模型提升14.5%，7B模型提升3.0%）。

Insight: RECON证明了学习的上下文压缩对于构建高效、可扩展且高性能的RAG系统至关重要，尤其在多跳QA任务中表现突出。

Abstract: Retrieval-augmented generation (RAG) systems trained using reinforcement learning (RL) with reasoning are hampered by inefficient context management, where long, noisy retrieved documents increase costs and degrade performance. We introduce RECON (REasoning with CONdensation), a framework that integrates an explicit summarization module to compress evidence within the reasoning loop. Our summarizer is trained via a two-stage process: relevance pretraining on QA datasets, followed by multi-aspect distillation from proprietary LLMs to ensure factuality and clarity. Integrated into the Search-R1 pipeline, RECON reduces total context length by 35%, leading to improved training speed and inference latency, while simultaneously improving RAG performance on downstream QA benchmarks. Notably, it boosts the average EM score of the 3B model by 14.5% and the 7B model by 3.0%, showing particular strength in multi-hop QA. RECON demonstrates that learned context compression is essential for building practical, scalable, and performant RAG systems. Our code implementation is made available at https://github.com/allfornancy/RECON.

[202] Steering Over-refusals Towards Safety in Retrieval Augmented Generation cs.CLPDF

Utsav Maskey, Mark Dras, Usman Naseem

TL;DR: 论文研究了在检索增强生成（RAG）中大型语言模型（LLMs）的安全性对齐导致的过度拒绝问题，并提出了SafeRAG-Steering方法以减少这种问题。

Details

Motivation: 大型语言模型在安全性对齐时容易因激进的过滤器而对良性请求产生过度拒绝，尤其是在RAG中，查询意图和检索上下文特性会进一步影响拒绝行为。

Result: SafeRAG-Steering有效减少了RAG管道中因上下文污染导致的过度拒绝问题，同时保留了合理的拒绝行为。

Insight: 研究表明，模型特定的对齐选择和上下文特性是触发过度拒绝的关键因素；嵌入干预可以作为一种有效的缓解手段。

Abstract: Safety alignment in large language models (LLMs) induces over-refusals – where LLMs decline benign requests due to aggressive safety filters. We analyze this phenomenon in retrieval-augmented generation (RAG), where both the query intent and retrieved context properties influence refusal behavior. We construct RagRefuse, a domain-stratified benchmark spanning medical, chemical, and open domains, pairing benign and harmful queries with controlled context contamination patterns and sizes. Our analysis shows that context arrangement / contamination, domain of query and context, and harmful-text density trigger refusals even on benign queries, with effects depending on model-specific alignment choices. To mitigate over-refusals, we introduce \textsc{SafeRAG-Steering}, a model-centric embedding intervention that steers the embedding regions towards the confirmed safe, non-refusing output regions at inference time. This reduces over-refusals in contaminated RAG pipelines while preserving legitimate refusals.

[203] When or What? Understanding Consumer Engagement on Digital Platforms cs.CL | cs.CYPDF

Jingyi Wu, Junying Liang

TL;DR: 论文研究了数字平台上消费者关注度的驱动因素，发现时间动态比内容主题更能影响消费者参与度。

Details

Motivation: 在当前数字服务经济中，内容创作者竞争消费者注意力。虽然以往研究关注内容特征，但创作者常高估其对受众的价值。本研究旨在揭示消费者参与度的真实驱动因素。

Result: 研究发现时间动态比主题内容更能预测消费者参与度，表明”何时”比”什么”更重要。

Insight: 挑战了内容特征是关注度主要驱动因素的假设，强调了时间规划和上下文因素的重要性，为优化受众参与策略提供了新视角。

Abstract: Understanding what drives popularity is critical in today’s digital service economy, where content creators compete for consumer attention. Prior studies have primarily emphasized the role of content features, yet creators often misjudge what audiences actually value. This study applies Latent Dirichlet Allocation (LDA) modeling to a large corpus of TED Talks, treating the platform as a case of digital service provision in which creators (speakers) and consumers (audiences) interact. By comparing the thematic supply of creators with the demand expressed in audience engagement, we identify persistent mismatches between producer offerings and consumer preferences. Our longitudinal analysis further reveals that temporal dynamics exert a stronger influence on consumer engagement than thematic content, suggesting that when content is delivered may matter more than what is delivered. These findings challenge the dominant assumption that content features are the primary drivers of popularity and highlight the importance of timing and contextual factors in shaping consumer responses. The results provide new insights into consumer attention dynamics on digital platforms and carry practical implications for marketers, platform managers, and content creators seeking to optimize audience engagement strategies.

[204] Assessing Large Language Models for Structured Medical Order Extraction cs.CL | cs.AIPDF

A H M Rezaul Karim, Ozlem Uzuner

TL;DR: 论文评估了大型语言模型在结构化医疗订单提取中的表现，使用通用LLaMA-4 17B模型，无需领域微调，仅通过单例上下文示例指导，在MEDIQA-OE 2025任务中排名第5，证明了其作为临床NLP任务基准的潜力。

Details

Motivation: 医疗订单提取对临床决策和自动化工作流至关重要，但订单来源多样且类型复杂，需要高效、通用的方法来解决这一挑战。

Result: 在MEDIQA-OE 2025任务中平均F1得分为37.76，尤其在订单原因和来源准确性上有显著提升，排名第5。

Insight: 通用大型语言模型结合少量示例提示工程，可以作为临床NLP任务的高效、可扩展基准，减少对领域特定模型的依赖。

Abstract: Medical order extraction is essential for structuring actionable clinical information, supporting decision-making, and enabling downstream applications such as documentation and workflow automation. Orders may be embedded in diverse sources, including electronic health records, discharge summaries, and multi-turn doctor-patient dialogues, and can span categories such as medications, laboratory tests, imaging studies, and follow-up actions. The MEDIQA-OE 2025 shared task focuses on extracting structured medical orders from extended conversational transcripts, requiring the identification of order type, description, reason, and provenance. We present the MasonNLP submission, which ranked 5th among 17 participating teams with 105 total submissions. Our approach uses a general-purpose, instruction-tuned LLaMA-4 17B model without domain-specific fine-tuning, guided by a single in-context example. This few-shot configuration achieved an average F1 score of 37.76, with notable improvements in reason and provenance accuracy. These results demonstrate that large, non-domain-specific LLMs, when paired with effective prompt engineering, can serve as strong, scalable baselines for specialized clinical NLP tasks.

[205] Merlin’s Whisper: Enabling Efficient Reasoning in LLMs via Black-box Adversarial Prompting cs.CL | cs.LGPDF

Heming Xia, Cunxiao Du, Rui Li, Chak Tou Leong, Yongqi Li

TL;DR: 这篇论文提出了AdvPrompt框架，通过黑盒对抗提示（black-box adversarial prompting）减少大型推理模型（LRM）的计算开销，同时保持高准确性。

Details

Motivation: 大型推理模型在复杂推理任务中表现出色，但其冗长的推理过程导致了高昂的计算和延迟成本，限制了实际应用。因此，需要一种高效的方法来减少模型的“过度思考”（overthinking）。

Result: 实验表明，AdvPrompt在多个基准测试中减少了约40%的token使用量，例如在GSM8K上为Qwen3系列模型减少了3倍的响应长度，在MATH-500上为Claude-3.7和Gemini-2.5分别减少了35%和47%的token使用量。

Insight: AdvPrompt展示了黑盒提示作为一种实用策略的潜力，能够有效提升大型推理模型的效率，且适用于不同规模和系列的模型。

Abstract: Large reasoning models (LRMs) have demonstrated remarkable proficiency in tackling complex reasoning tasks through step-by-step thinking. However, such a lengthy reasoning process incurs substantial computational and latency overheads, hindering the practical deployment of these models. In this work, we present a new perspective on mitigating overthinking in LRMs via black-box adversarial prompting. By treating both open-source LRMs and closed-source APIs as black-box communicators, we investigate how to elicit concise responses without sacrificing accuracy. We introduce AdvPrompt, an iterative refinement framework that generates high-quality adversarial prompts from diverse perspectives. Experiments across multiple benchmarks demonstrate that AdvPrompt consistently reduces token usage while preserving performance. Notably, AdvPrompt achieves a 3x reduction in average response length on simple GSM8K questions for the Qwen3 model series, and delivers an average ~40% token reduction across four benchmarks. For closed-source APIs, AdvPrompt reduces token usage on MATH-500 by 35% for Claude-3.7 and 47% for Gemini-2.5. Further analysis reveals the generalizability of AdvPrompt across various model scales and families, underscoring the potential of black-box prompting as a practical and effective strategy for enhancing LRM efficiency.

[206] Detecting Hallucinations in Authentic LLM-Human Interactions cs.CLPDF

Yujie Ren, Niklas Gruhlke, Anne Lauscher

TL;DR: 论文介绍了首个基于真实大语言模型（LLM）-人类对话的幻觉检测基准AuthenHallu，解决了现有基准多为人工构造的问题。

Details

Motivation: 当前幻觉检测基准多为人工构造，无法反映真实场景中LLM的幻觉特性，限制了其在敏感领域（如医学和法律）的应用。

Result: 统计显示，31.4%的查询-响应对存在幻觉，而在数学和数字问题领域，这一比例高达60%。普通LLM作为检测器的效果有限。

Insight: 真实对话中的幻觉特性与人工构造的基准存在显著差异，尤其在复杂领域幻觉率更高，凸显了对真实场景数据的需求和挑战。

Abstract: As large language models (LLMs) are increasingly applied in sensitive domains such as medicine and law, hallucination detection has become a critical task. Although numerous benchmarks have been proposed to advance research in this area, most of them are artificially constructed–either through deliberate hallucination induction or simulated interactions–rather than derived from genuine LLM-human dialogues. Consequently, these benchmarks fail to fully capture the characteristics of hallucinations that occur in real-world usage. To address this limitation, we introduce AuthenHallu, the first hallucination detection benchmark built entirely from authentic LLM-human interactions. For AuthenHallu, we select and annotate samples from genuine LLM-human dialogues, thereby providing a faithful reflection of how LLMs hallucinate in everyday user interactions. Statistical analysis shows that hallucinations occur in 31.4% of the query-response pairs in our benchmark, and this proportion increases dramatically to 60.0% in challenging domains such as Math & Number Problems. Furthermore, we explore the potential of using vanilla LLMs themselves as hallucination detectors and find that, despite some promise, their current performance remains insufficient in real-world scenarios.

[207] BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices cs.CL | cs.AI | cs.CV | 68T50 | I.2.7PDF

Euhid Aman, Esteban Carlin, Hsing-Kuo Pao, Giovanni Beltrame, Ghaluh Indah Permata Sari

TL;DR: BitMar是一种低比特率的多模态融合模型，结合外部情景记忆，适用于边缘设备，支持高效的图像-文本生成。

Details

Motivation: 现有的多模态模型（如跨注意力Transformer）计算量大，难以部署在边缘设备上。基于量化技术和情景记忆的模型可以解决这一问题。

Result: BitMar在低延迟和小模型占用下，实现了具有竞争力的图像描述和多模态理解能力。

Insight: 量化技术和情景记忆的结合可以有效减少模型计算量，同时保持生成内容的上下文相关性，适合边缘设备部署。

Abstract: Cross-attention transformers and other multimodal vision-language models excel at grounding and generation; however, their extensive, full-precision backbones make it challenging to deploy them on edge devices. Memory-augmented architectures enhance the utilization of past context; however, most works rarely pair them with aggressive edge-oriented quantization. We introduce BitMar, a quantized multimodal transformer that proposes an external human-like episodic memory for effective image-text generation on hardware with limited resources. BitMar utilizes 1.58-bit encoders, one for text (BitNet-style) and one for vision (DiNOv2-based), to create compact embeddings that are combined and used to query a fixed-size key-value episodic memory. During vector retrieval, the BitNet decoder applies per-layer conditioning, which increases the contextual relevance of generated content. The decoder also employs attention sinks with a sliding-window mechanism to process long or streaming inputs under tight memory budgets. The combination of per-layer conditioning and sliding-window attention achieves a strong quality-speed trade-off, delivering competitive captioning and multimodal understanding at low latency with a small model footprint. These characteristics make BitMar well-suited for edge deployment.

[208] Preserving LLM Capabilities through Calibration Data Curation: From Analysis to Optimization cs.CLPDF

Bowei He, Lihao Yin, Huiling Zhen, Shuqi Liu, Han Wu

TL;DR: 该论文探讨了后训练压缩中校准数据对大语言模型（LLM）能力的影响，并通过激活模式分析提出了一种校准数据优化框架，以提升压缩后模型的复杂推理能力。

Details

Motivation: 现有研究大多仅关注校准数据的来源或样本量对语言建模或常识推理的影响，缺乏对校准数据的组合性质和领域对应性的系统性分析，尤其是对高级复杂推理能力的影响。

Result: 提出的优化框架有效提升了压缩后模型在复杂推理任务上的性能。

Insight: 校准数据的质量不仅取决于样本量和来源，其激活空间的代表性和多样性对模型能力保留更为关键。

Abstract: Post-training compression has been a widely employed approach to scale down large language model (LLM) and facilitate efficient inference. In various proposed compression methods, including pruning and quantization, calibration data plays a vital role by informing the weight importance and activation dynamic ranges. However, how calibration data impacts the LLM capability after compression is less explored. Few of the existing works, though recognizing the significance of this study, only investigate the language modeling or commonsense reasoning performance degradation from limited angles, like the data sources or sample amounts. More systematic research is still needed to examine the impacts on different LLM capabilities in terms of compositional properties and domain correspondence of calibration data. In this work, we aim at bridging this gap and further analyze underlying influencing mechanisms from the activation pattern perspective. Especially, we explore the calibration data’s impacts on high-level complex reasoning capabilities, like math problem solving and code generation. Delving into the underlying mechanism, we find that the representativeness and diversity in activation space more fundamentally determine the quality of calibration data. Finally, we propose a calibration data curation framework based on such observations and analysis, enhancing the performance of existing post-training compression methods on preserving critical LLM capabilities. Our code is provided in \href{https://github.com/BokwaiHo/COLA.git}{Link}.

[209] AGENTIQL: An Agent-Inspired Multi-Expert Framework for Text-to-SQL Generation cs.CL | cs.AIPDF

Omid Reza Heidari, Siobhan Reid, Yassine Yaakoubi

TL;DR: AGENTIQL是一个基于多专家框架的文本到SQL生成方法，通过分解问题、生成子查询和优化列选择，结合路由机制提升效率和准确性，在Spider基准上表现优异。

Details

Motivation: 现有的大型语言模型（LLM）在文本到SQL生成中存在复杂推理和多样表结构处理的局限性，AGENTIQL旨在通过模块化和并行化设计解决这些问题。

Result: 在Spider基准上，AGENTIQL实现了86.07%的执行准确率，接近GPT-4的SOTA（89.65%），同时提升了透明度和可解释性。

Insight: 模块化和并行化设计能够显著提升文本到SQL生成的性能和可扩展性，路由机制在平衡效率和准确性中起关键作用。

Abstract: LLMs have advanced text-to-SQL generation, yet monolithic architectures struggle with complex reasoning and schema diversity. We propose AGENTIQL, an agent-inspired multi-expert framework that combines a reasoning agent for question decomposition, a coding agent for sub-query generation, and a refinement step for column selection. An adaptive router further balances efficiency and accuracy by selecting between our modular pipeline and a baseline parser. Several steps in the pipeline can be executed in parallel, making the framework scalable to larger workloads. Evaluated on the Spider benchmark, AGENTIQL improves execution accuracy and interpretability and achieves up to 86.07% EX with 14B models using the Planner&Executor merging strategy. The attained performance is contingent upon the efficacy of the routing mechanism, thereby narrowing the gap to GPT-4-based SOTA (89.65% EX) while using much smaller open-source LLMs. Beyond accuracy, AGENTIQL enhances transparency by exposing intermediate reasoning steps, offering a robust, scalable, and interpretable approach to semantic parsing.

[210] BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions cs.CL | cs.AIPDF

Zhengbo Zhang, Zhiheng Lyu, Junhao Gong, Hongzhu Yi, Xinming Wang

TL;DR: BrowserAgent提出了一种基于人类浏览行为的交互式网络代理，通过直接在网页上进行操作（如滚动、点击、输入）完成任务，采用两阶段训练（SFT和RFT）显著提升了模型的泛化能力，并在多项任务中表现优异。

Details

Motivation: 现有的大多数网络代理依赖工具将动态网络环境转换为静态文本，与人类丰富的浏览器交互行为（如滚动、点击）不符。BrowserAgent旨在通过模仿人类浏览行为，提升代理在动态网络环境中的交互能力。

Result: BrowserAgent-7B在HotpotQA、2Wiki等多元跳转问题上比Search-R1提升约20%，训练数据更少但性能更优。

Insight: 通过模拟人类浏览行为，BrowserAgent展示了在网络任务中更高效的交互能力，显式记忆机制对长程推理任务尤为重要。

Abstract: Efficiently solving real-world problems with LLMs increasingly hinges on their ability to interact with dynamic web environments and autonomously acquire external information. While recent research like Search-R1 and WebDancer demonstrates strong performance in solving web tasks, they heavily rely on additional tools to convert the interactive web environment into static text content. This is in contrast to human browsing behaviors, which involve diverse interactions with the browser, such as scrolling, clicking, and typing. In this paper, we propose BrowserAgent, a more interactive agent that solves complex tasks through human-inspired browser actions. BrowserAgent operates directly on raw web pages via Playwright through a set of predefined browser actions. We adopt a two-stage training (Supervised Fine-Tuning (SFT) and Rejection Fine-Tuning (RFT)) to improve the model’s generalization abilities. Despite using significantly less training data than Search-R1, BrowserAgent achieves more competitive results across different Open-QA tasks. Additionally, we introduce an explicit memory mechanism to store key conclusions across steps, further enhancing the model’s reasoning capabilities for long-horizon tasks. Notably, BrowserAgent-7B can achieve around 20% improvement over Search-R1 on multi-hop QA tasks like HotpotQA, 2Wiki, and Bamboogle. These results indicate that BrowserAgent can serve as a more advanced framework for more interactive and scalable web agents.

[211] Unlocking LLM Safeguards for Low-Resource Languages via Reasoning and Alignment with Minimal Training Data cs.CLPDF

Zhuowei Chen, Bowei Zhang, Nankai Lin, Tian Hou, Lianxi Wang

TL;DR: 论文提出了一种名为ConsistentGuard的新型基于推理的多语言LLM安全防护方法，通过推理增强解释性，并通过对齐提升语言间的知识迁移，仅需1000个训练样本即可在多语言基准上表现出色。

Details

Motivation: 现有基于分类器的方法在低资源语言上表现不佳且缺乏解释性，而LLM的安全防护需求日益增长，因此需要一种更有效的方法。

Result: 在六种语言的三个数据集上性能卓越，优于需要更多数据训练的大模型，并展现了较强的解释性和泛化能力。

Insight: 小样本学习与推理对齐的结合可以显著提升低资源语言场景下的LLM防护能力，同时增强了模型的解释性。

Abstract: Recent advances in LLMs have enhanced AI capabilities, but also increased the risk posed by malicious requests, highlighting the need for effective LLM safeguards to detect such queries. Existing approaches largely rely on classifier-based methods that lack interpretability and perform poorly on low-resource languages. To address these limitations, we propose ConsistentGuard, a novel reasoning-based multilingual safeguard, which enhances explainability via reasoning and boosts knowledge transfer between languages through alignment. With only 1,000 training samples, our method demonstrates superior performance on three datasets across six languages, outperforming larger models trained with significantly more data, and exhibits strong interpretability and generalization ability. We also contribute a multilingual benchmark extension and release our codes to support future research.

[212] RePro: Training Language Models to Faithfully Recycle the Web for Pretraining cs.CLPDF

Zichun Yu, Chenyan Xiong

TL;DR: RePro提出了一种基于强化学习的网络数据回收方法，通过训练小型语言模型生成高质量且语义忠实的数据改写，显著提升了LLM预训练数据的效率和效果。

Details

Motivation: 前沿大型语言模型（LLM）的高质量预训练数据日益稀缺，RePro旨在通过数据回收方法解决这一问题，提高数据的利用效率。

Result: 1. RePro在22项下游任务中比仅使用原始数据的基线提升了4.7%-14.0%的相对准确率；2. 在数据效率上比原始数据提升了2-3倍；3. 改写的语义忠实性优于基于提示的state-of-the-art方法。

Insight: 1. 小型模型可以通过强化学习实现高效数据回收；2. 忠实性奖励设计对于保留核心语义和结构至关重要；3. 数据改写技术在缓解LLM数据稀缺问题上具有潜力。

Abstract: High-quality pretraining data is the fossil fuel of large language models (LLMs), yet its reserves are running low for frontier models. In this paper, we introduce RePro, a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, we design one quality reward and three faithfulness rewards, optimizing the LM rephraser to convert organic data into high-quality rephrasings while maintaining its core semantics and structure. In our experiment, we train a 4B rephraser to recycle 72B tokens sampled from DCLM-RefinedWeb. Pretraining results on 400M and 1.4B models demonstrate that RePro delivers 4.7%-14.0% relative accuracy gains over organic-only baseline on 22 downstream tasks. RePro also outperforms ReWire, the state-of-the-art web recycling method that prompts a 70B rephraser, as well as the organic baseline with a 4x larger data pool. Experiments with different amounts of recycled data highlight that RePro improves organic data efficiency by 2-3x. Individual and distributional analyses validate that RePro preserves more critical information and faithfully reflects the characteristics of organic data compared to prompting-based methods. Together, these results show that RePro provides an efficient and controllable path to effectively harness the fossil fuel of LLM pretraining. We open-source our code, rephraser, and recycled data at https://github.com/cxcscmu/RePro.

[213] Review of Inference-Time Scaling Strategies: Reasoning, Search and RAG cs.CLPDF

Zhichao Wang, Cheng Wan, Dong Nie

TL;DR: 该论文综述了大型语言模型（LLM）在推理时扩展策略的最新进展，重点讨论了输出导向和输入导向的两类方法，并系统整理了相关技术。

Details

Motivation: 随着高质量训练数据日益稀缺，传统的模型规模和训练数据扩展方法面临瓶颈，研究转向了推理时的计算扩展，以在不重训练模型的情况下提升性能。

Result: 论文整合了大量技术，展示了推理时扩展的潜力，并提出分类框架以指导未来研究方向。

Insight: 推理时扩展是LLM性能提升的新方向，RAG等输入导向方法展示了强大的应用潜力，但输出导向方法的复杂性仍需进一步优化。

Abstract: The performance gains of LLMs have historically been driven by scaling up model size and training data. However, the rapidly diminishing availability of high-quality training data is introducing a fundamental bottleneck, shifting the focus of research toward inference-time scaling. This paradigm uses additional computation at the time of deployment to substantially improve LLM performance on downstream tasks without costly model re-training. This review systematically surveys the diverse techniques contributing to this new era of inference-time scaling, organizing the rapidly evolving field into two comprehensive perspectives: Output-focused and Input-focused methods. Output-focused techniques encompass complex, multi-step generation strategies, including reasoning (e.g., CoT, ToT, ReAct), various search and decoding methods (e.g., MCTS, beam search), training for long CoT (e.g., RLVR, GRPO), and model ensemble methods. Input-focused techniques are primarily categorized by few-shot and RAG, with RAG as the central focus. The RAG section is further detailed through a structured examination of query expansion, data, retrieval and reranker, LLM generation methods, and multi-modal RAG.

[214] DUAL-Bench: Measuring Over-Refusal and Robustness in Vision-Language Models cs.CLPDF

Kaixuan Ren, Preslav Nakov, Usman Naseem

TL;DR: DUAL-Bench是一个多模态基准测试，旨在衡量视觉语言模型（VLMs）的过度拒绝（over-refusal）和安全完成任务的能力。研究发现，当前模型在这些方面表现不佳，表明需要更精细的对齐策略以平衡安全性与实用性。

Details

Motivation: 随着视觉语言模型能力的提升，如何在安全性和实用性之间取得平衡成为一个关键挑战。现有的安全机制可能导致过度拒绝（拒绝良性请求），但在多模态场景中，特别是在指令无害但图像有害的情况下，模型的表现仍然存在问题。

Result: 结果显示，当前模型在安全完成任务方面表现较差：GPT-5-Nano仅为12.9%，GPT-5系列平均7.9%，Qwen系列仅3.9%。这表明模型在多模态场景中仍需改进。

Insight: DUAL-Bench揭示了多模态对齐的复杂性，尤其是在处理视觉和语言结合的指令时，模型需要更精细的策略以实现安全性和实用性的平衡。

Abstract: As vision-language models become increasingly capable, maintaining a balance between safety and usefulness remains a central challenge. Safety mechanisms, while essential, can backfire, causing over-refusal, where models decline benign requests out of excessive caution. Yet, no existing benchmark has systematically addressed over-refusal in the visual modality. This setting introduces unique challenges, such as dual-use cases where an instruction is harmless, but the accompanying image contains harmful content. Models frequently fail in such scenarios, either refusing too conservatively or completing tasks unsafely, which highlights the need for more fine-grained alignment. The ideal behavior is safe completion, i.e., fulfilling the benign parts of a request while explicitly warning about any potentially harmful elements. To address this, we present DUAL-Bench, the first multimodal benchmark focused on over-refusal and safe completion in VLMs. We evaluated 18 VLMs across 12 hazard categories, with focus on their robustness under semantics-preserving visual perturbations. The results reveal substantial room for improvement: GPT-5-Nano achieves 12.9% safe completion, GPT-5 models average 7.9%, and Qwen models only 3.9%. We hope that DUAL-Bench will foster the development of more nuanced alignment strategies that ensure models remain both safe and useful in complex multimodal settings.

[215] Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks cs.CL | cs.DBPDF

Jiajing Guo, Kenil Patel, Jorge Piazentin Ono, Wenbin He, Liu Ren

TL;DR: 这篇论文评估了六种轻量级、面向产业的测试时扩展策略和四种大型语言模型（LLMs，包括两个推理模型）在BIRD Mini-Dev基准上的表现，发现Divide-and-Conquer提示和少样本演示能持续提升性能，而额外的工作流步骤效果不一。

Details

Motivation: 大型语言模型（LLMs）在Text-to-SQL（Text2SQL）系统中的应用日益广泛，但在实际部署中，测试时扩展策略的效能仍不明确，尤其是在最新的推理模型中。

Result: Divide-and-Conquer提示和少样本演示对所有LLMs均有性能提升，但额外工作流步骤效果不一致，且基础模型的选择对结果有显著影响。

Insight: 在实际部署中，平衡精度、效率和复杂性是关键；Divide-and-Conquer提示和少样本演示是提升Text2SQL系统性能的有效手段。

Abstract: Large language models (LLMs) are increasingly powering Text-to-SQL (Text2SQL) systems, enabling non-expert users to query industrial databases using natural language. While test-time scaling strategies have shown promise in LLM-based solutions, their effectiveness in real-world applications, especially with the latest reasoning models, remains uncertain. In this work, we benchmark six lightweight, industry-oriented test-time scaling strategies and four LLMs, including two reasoning models, evaluating their performance on the BIRD Mini-Dev benchmark. Beyond standard accuracy metrics, we also report inference latency and token consumption, providing insights relevant for practical system deployment. Our findings reveal that Divide-and-Conquer prompting and few-shot demonstrations consistently enhance performance for both general-purpose and reasoning-focused LLMs. However, introducing additional workflow steps yields mixed results, and base model selection plays a critical role. This work sheds light on the practical trade-offs between accuracy, efficiency, and complexity when deploying Text2SQL systems.

[216] LLM$\times$MapReduce-V3: Enabling Interactive In-Depth Survey Generation through a MCP-Driven Hierarchically Modular Agent System cs.CLPDF

Yu Chao, Siyu Lin, xiaorong wang, Zhu Zhang, Zihan Zhou

TL;DR: LLM×MapReduce-V3是一个层次化模块化代理系统，专注于生成长篇调研报告。它采用多代理架构，将功能组件实现为独立的MCP服务器，并通过高级规划代理动态协调工作流，支持人机交互干预，最终生成内容深度和长度优于基线方法的报告。

Details

Motivation: 现有系统在长篇调研报告生成中难以平衡内容深度和交互灵活性，LLM×MapReduce-V3通过模块化和层次化设计，旨在提升系统的定制化和多轮交互能力。

Result: 人工评估显示，系统在内容深度和长度上优于基线方法，验证了MCP模块化规划的有效性。

Insight: 模块化和层次化设计提升了系统的灵活性和交互能力，MCP驱动的动态规划是支持复杂任务的关键。

Abstract: We introduce LLM x MapReduce-V3, a hierarchically modular agent system designed for long-form survey generation. Building on the prior work, LLM x MapReduce-V2, this version incorporates a multi-agent architecture where individual functional components, such as skeleton initialization, digest construction, and skeleton refinement, are implemented as independent model-context-protocol (MCP) servers. These atomic servers can be aggregated into higher-level servers, creating a hierarchically structured system. A high-level planner agent dynamically orchestrates the workflow by selecting appropriate modules based on their MCP tool descriptions and the execution history. This modular decomposition facilitates human-in-the-loop intervention, affording users greater control and customization over the research process. Through a multi-turn interaction, the system precisely captures the intended research perspectives to generate a comprehensive skeleton, which is then developed into an in-depth survey. Human evaluations demonstrate that our system surpasses representative baselines in both content depth and length, highlighting the strength of MCP-based modular planning.

[217] ADVICE: Answer-Dependent Verbalized Confidence Estimation cs.CLPDF

Ki Jung Seo, Sehun Lim, Taeuk Kim

TL;DR: 论文提出ADVICE框架，通过细调提升LLM的置信度校准能力，解决因答案独立性导致的过度自信问题。

Details

Motivation: LLM在自然语言中表达置信度时经常表现出过度自信，这种现象的原因尚不清楚。研究发现答案独立性是主要原因之一，即模型未能将置信度与其生成的答案条件化。

Result: 实验表明ADVICE大幅改善了置信度校准，同时不影响任务性能。分析还显示ADVICE增强了置信度分布的平衡性和校准性。

Insight: 揭示了LLM过度自信的根源之一是答案独立性，并提出了一种可信赖的置信度表达框架。

Abstract: Recent progress in large language models (LLMs) has enabled them to express their confidence in natural language, enhancing transparency and reliability. However, their confidence often exhibits overconfidence, the cause of which remains poorly understood. In this work, we conduct a detailed analysis of the dynamics underlying verbalized confidence and identify answer-independence as a key factor, defined as the model’s failure to condition confidence on its own answer. To address this, we propose ADVICE (Answer-Dependent Verbalized Confidence Estimation), a fine-tuning framework that facilitates answer-grounded confidence estimation. Extensive experiments show that ADVICE substantially improves confidence calibration while preserving task performance. Further analyses confirm that ADVICE strengthens answer-groundedness, leading to more balanced and well-calibrated confidence distributions. Our findings shed light on the origin of overconfidence and establish a framework for more trustworthy confidence verbalization.

[218] Evaluating Language Models’ Evaluations of Games cs.CL | cs.AIPDF

Katherine M. Collins, Cedegao E. Zhang, Graham Todd, Lance Ying, Mauricio Barba da Costa

TL;DR: 该论文提出了一种评估AI系统对游戏评价的新范式，比较了现代语言和推理模型与人类和符号计算代理的评价。结果表明，推理模型更接近人类评价，但随着接近博弈论最优，拟合度反而下降。

Details

Motivation: 传统AI评估主要关注问题解决能力，本文探讨AI系统对游戏本身的评价能力，填补了评估方向的空白。

Result: 推理模型比非推理语言模型更接近人类评价，但接近博弈论最优时拟合度下降；趣味性评价更不稳定且难量化。

Insight: 1. 游戏评价能力的非单调性值得深入研究；2. 趣味性量化更难，需改进模型设计；3. 资源分配不稳定，需增强资源理性的元推理能力。

Abstract: Reasoning is not just about solving problems – it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems primarily focused on problem solving, historically by studying how models play games such as chess and Go. In this paper, we advocate for a new paradigm that assesses AI systems’ evaluation of games. First, we introduce a formalism for evaluating such evaluations. We then leverage a large-scale dataset of over $100$ novel board games and over 450 human judgments to compare evaluations produced by modern language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff (or fairness) and the funness of games. These queries span two dimensions relevant to the design of evaluations of AI evaluations: how complex a query is to compute and how difficult a query is to quantify. Our results show that reasoning models are generally more aligned to people in their evaluations of games than non-reasoning language models. However, we observe a non-monotonic relationship: as models get closer to game-theoretic optimal, their fit to human data weakens. We also observe more “jaggedness” across models for assessing funness, in line with the greater difficulty of quantifying this query. Across queries and games, reasoning models show highly variable and unpredictable resource usage when assessing queries, pointing to the importance of imbuing more resource-rational meta-reasoning in language and reasoning models.

[219] Judge Before Answer: Can MLLM Discern the False Premise in Question? cs.CL | cs.AIPDF

Jidong Li, Lingyong Fang, Haodong Zhao, Sufeng Duan, Gongshen Liu

TL;DR: 论文针对多模态大语言模型（MLLM）在识别问题中错误前提的能力不足，提出了一种自动化构建全面评测基准（JBA数据集）的方法，并设计了增强MLLM识别能力的框架。实验表明，该框架显著提升了模型识别错误前提的能力。

Details

Motivation: 尽管MLLM在多模态任务中表现优异，但在识别问题中错误前提的能力仍存在明显短板。现有评测基准覆盖范围有限，缺乏细粒度分类，无法全面评估模型的这一能力。因此，亟需一种更全面、自动化的评测方法来填补这一空白。

Result: 实验表明，当前MLLM在识别错误前提方面仍有不足。通过提出的增强框架训练的模型，在这一任务上表现显著提升。

Insight: 错误前提识别是MLLM稳健性的关键指标之一，未来研究可以进一步扩展评测基准的多样性，并探索更高效的识别方法。

Abstract: Multimodal large language models (MLLMs) have witnessed astonishing advancements in recent years. Despite these successes, MLLMs remain vulnerable to flase premise problems. However, existing benchmarks targeting this issue are limited in scope: they often lack fine-grained categorization, exhibit insufficient coverage, and thus fail to provide a rigorous evaluation of the ability of models to recognize false premises. To bridge this gap, we introduce a fully automated pipeline for constructing a comprehensive benchmark of false premise questions. Our method systematically categorizes the premises into three main types and thirteen subtypes according to the abilities required to identify the premises, resulting in the JBA dataset.Results show current MLLMs still struggle with false premise recognition. Building upon this benchmark, we further propose a recognition enhancement framework tailored to strengthen the robustness of MLLMs to detect false premises. Extensive experiments demonstrate that models trained with our framework achieve significant improvements in false premise recognition.

[220] RV-HATE: Reinforced Multi-Module Voting for Implicit Hate Speech Detection cs.CL | cs.AI | 68T50 | I.2.7PDF

Yejin Lee, Hyeseon Ahn, Yo-Sub Han

TL;DR: RV-HATE是一个多模块投票框架，通过强化学习优化模块权重，针对不同仇恨言论数据集的特性进行定制化检测，提升准确性并提供可解释性。

Details

Motivation: 仇恨言论的形式多样且隐晦，现有方法未能充分适应不同数据集的多样性。RV-HATE旨在通过动态适应数据集特性解决这一问题。

Result: RV-HATE在隐式仇恨言论检测上表现优于传统静态方法，并提供对数据集特征的深入洞察。

Insight: 动态适应数据集特性的方法能显著提升仇恨言论检测效果，同时可解释性有助于理解不同数据集的独特性。

Abstract: Hate speech remains prevalent in human society and continues to evolve in its forms and expressions. Modern advancements in internet and online anonymity accelerate its rapid spread and complicate its detection. However, hate speech datasets exhibit diverse characteristics primarily because they are constructed from different sources and platforms, each reflecting different linguistic styles and social contexts. Despite this diversity, prior studies on hate speech detection often rely on fixed methodologies without adapting to data-specific features. We introduce RV-HATE, a detection framework designed to account for the dataset-specific characteristics of each hate speech dataset. RV-HATE consists of multiple specialized modules, where each module focuses on distinct linguistic or contextual features of hate speech. The framework employs reinforcement learning to optimize weights that determine the contribution of each module for a given dataset. A voting mechanism then aggregates the module outputs to produce the final decision. RV-HATE offers two primary advantages: (1)~~it improves detection accuracy by tailoring the detection process to dataset-specific attributes, and (2)~~it also provides interpretable insights into the distinctive features of each dataset. Consequently, our approach effectively addresses implicit hate speech and achieves superior performance compared to conventional static methods. Our code is available at https://github.com/leeyejin1231/RV-HATE.

[221] Enhancing Large Language Model Reasoning via Selective Critical Token Fine-Tuning cs.CLPDF

Zhiwen Ruan, Yixia Li, He Zhu, Yun Chen, Peng Li

TL;DR: 本文提出了一种名为Critical Token Fine-tuning (CFT)的方法，通过选择性更新关键推理令牌（而非所有令牌）来增强大语言模型的推理能力，实验证明其在多个数学推理任务中优于标准监督微调。

Details

Motivation: 标准监督微调（SFT）对所有令牌采用均匀惩罚，导致输出多样性降低和泛化能力受限。本文发现仅一小部分关键令牌对推理正确性起决定性作用，因此提出一种更高效的微调方法。

Result: 实验表明，CFT在5个模型（Qwen、OLMo、LLaMA）和11个数学推理任务上优于标准SFT，尽管仅微调了不到12%的令牌。CFT还提升了采样多样性和强化学习的初始化效果。

Insight: 1. 关键令牌对推理至关重要；2. 选择性更新可平衡性能和多样性；3. CFT是一种高效且通用的LLM微调框架。

Abstract: Large language models (LLMs) primarily rely on supervised fine-tuning (SFT) as a key method to adapt pre-trained models to domain-specific tasks such as mathematical reasoning. However, standard SFT uniformly penalizes all tokens, neglecting that only a small subset of critical tokens determines reasoning correctness. This uniform supervision often causes reduced output diversity and limited generalization. We propose Critical Token Fine-tuning (CFT), a simple yet effective approach that updates only tokens identified as functionally indispensable via counterfactual perturbations. By focusing gradient signals on these decisive reasoning steps while preserving the diversity of non-critical tokens, CFT can enhance both generation and diversity. Extensive experiments on five models across three families (Qwen, OLMo, LLaMA) and eleven mathematical reasoning benchmarks show that CFT, despite fine-tuning on less than 12% of tokens, consistently outperforms standard SFT. Moreover, CFT enables test-time scaling through improved sampling diversity and provides a stronger initialization for reinforcement learning, sustaining performance gains in later training stages while maintaining higher entropy for better exploration. These results highlight CFT as a practical and general framework for efficient and robust LLM fine-tuning.

[222] DeepResearchGuard: Deep Research with Open-Domain Evaluation and Multi-Stage Guardrails for Safety cs.CL | cs.AIPDF

Wei-Chieh Huang, Henry Peng Zou, Yaozu Wu, Dongyuan Li, Yankai Chen

TL;DR: DeepResearchGuard是一个深度研究框架，通过四阶段保护机制和开放域评估，提升了报告的安全性、可信性和质量，同时减少了过度拒绝率。

Details

Motivation: 现有的深度研究框架缺乏全面的评估和阶段性保护，可能导致有害内容进入最终报告。需要一种能同时确保报告质量和安全性的解决方案。

Result: 在多种LLM（如GPT-4o、Gemini-2.5-flash等）上测试，防御成功率提升18.16%，过度拒绝率降低6%，显著改善了报告质量。

Insight: 阶段性保护机制能够在早期过滤风险，同时系统性评估和规范引用行为，确保了报告的全面性和安全性。

Abstract: Deep research frameworks have shown promising capabilities in synthesizing comprehensive reports from web sources. While deep research possesses significant potential to address complex issues through planning and research cycles, existing frameworks are deficient in sufficient evaluation procedures and stage-specific protections. They typically treat evaluation as exact match accuracy of question-answering, but overlook crucial aspects of report quality such as credibility, coherence, breadth, depth, and safety. This oversight may result in hazardous or malicious sources being integrated into the final report. To address these issues, we introduce DEEPRESEARCHGUARD, a comprehensive framework featuring four-stage safeguards with open-domain evaluation of references and reports. We assess performance across multiple metrics, e.g., defense success rate and over-refusal rate, and five key report dimensions. In the absence of a suitable safety benchmark, we introduce DRSAFEBENCH, a stage-wise benchmark for deep research safety. Our evaluation spans diverse state-of-the-art LLMs, including GPT-4o, Gemini-2.5-flash, DeepSeek-v3, and o4-mini. DEEPRESEARCHGUARD achieves an average defense success rate improvement of 18.16% while reducing over-refusal rate by 6%. The input guard provides the most substantial early-stage protection by filtering out obvious risks, while the plan and research guards enhance citation discipline and source credibility. Through extensive experiments, we show that DEEPRESEARCHGUARD enables comprehensive open-domain evaluation and stage-aware defenses that effectively block harmful content propagation, while systematically improving report quality without excessive over-refusal rates. The code can be found via https://github.com/Jasonya/DeepResearchGuard.

[223] ABLEIST: Intersectional Disability Bias in LLM-Generated Hiring Scenarios cs.CL | cs.AI | cs.CY | cs.HC | cs.LGPDF

Mahika Phutane, Hayoung Jung, Matthew Kim, Tanushree Mitra, Aditya Vashistha

TL;DR: 论文分析了LLM在招聘场景中对残障人士的偏见问题，提出ABLEIST指标体系，揭示了多重边缘化（如性别和种姓）如何加剧对残障候选人的伤害，并指出当前安全工具的盲点。

Details

Motivation: 现有的研究主要关注西方背景，忽视了在南半球社会中残障人士如何因性别和种姓等多重边缘化因素而遭受更复杂的偏见。论文旨在填补这一空白。

Result: 结果显示LLM对残障候选人的伤害显著增加，尤其是性别和种姓边缘化的残障人士受到的伤害更为严重；现有模型难以检测这些偏见。

Insight: 当前的安全工具无法有效识别多重边缘化带来的复杂偏见，强调了在高风险领域（如招聘）中对前沿模型进行交叉性安全评估的必要性。

Abstract: Large language models (LLMs) are increasingly under scrutiny for perpetuating identity-based discrimination in high-stakes domains such as hiring, particularly against people with disabilities (PwD). However, existing research remains largely Western-centric, overlooking how intersecting forms of marginalization–such as gender and caste–shape experiences of PwD in the Global South. We conduct a comprehensive audit of six LLMs across 2,820 hiring scenarios spanning diverse disability, gender, nationality, and caste profiles. To capture subtle intersectional harms and biases, we introduce ABLEIST (Ableism, Inspiration, Superhumanization, and Tokenism), a set of five ableism-specific and three intersectional harm metrics grounded in disability studies literature. Our results reveal significant increases in ABLEIST harms towards disabled candidates–harms that many state-of-the-art models failed to detect. These harms were further amplified by sharp increases in intersectional harms (e.g., Tokenism) for gender and caste-marginalized disabled candidates, highlighting critical blind spots in current safety tools and the need for intersectional safety evaluations of frontier models in high-stakes domains like hiring.

[224] LogiNumSynth: Synthesizing Joint Logical-Numerical Reasoning Problems for Language Models cs.CLPDF

Yiwei Liu, Yucheng Li, Xiao Li, Gong Cheng

TL;DR: 论文介绍了LogiNumSynth，一种灵活的联合逻辑-数值推理问题合成器，用于生成需要逻辑推理和数值推理能力的自然语言任务，支持任务复杂度的细粒度控制。

Details

Motivation: 现有的联合逻辑-数值推理数据集依赖固定规则集且任务复杂度控制有限，限制了其评估和训练的可扩展性，因此需要一种更灵活的合成工具。

Result: 实验显示当前大语言模型在联合逻辑-数值推理上仍存在不足，LogiNumSynth可作为诊断工具和针对性训练数据源。

Insight: 灵活的合成工具不仅能揭示模型弱点，还能为针对性训练提供数据，推动推理能力的综合提升。

Abstract: Joint logical-numerical reasoning remains a major challenge for language models, yet existing datasets rely on fixed rule sets and offer limited control over task complexity, constraining their generalizability for evaluation and training. We present LogiNumSynth, a flexible natural language problem synthesizer that synthesizes tasks requiring proficiency in joint logical reasoning (e.g., rule-based reasoning) and numerical reasoning (e.g., arithmetic computation). LogiNumSynth supports fine-grained control over reasoning world richness, logical reasoning depth, and the complexity of numerical computations, enabling flexible data synthesis across difficulty levels. We demonstrate three key contributions: (1) Synthesizer – synthesizing fully controllable joint reasoning tasks over natural language; (2) Evaluation & Process Analysis – evaluating both process accuracy and answer accuracy; (3) Targeted Training – using synthesized data to enhance LLMs’ reasoning performance. Experiments with multiple LLMs highlight persistent weaknesses in logical-numerical reasoning, showing that LogiNumSynth can serve as both a diagnostic tool and a source of targeted supervision for advancing integrated reasoning skills.

Qinglin Zhu, Yizhen Yao, Runcong Zhao, Yanzheng Xiang, Amrutha Saseendran

TL;DR: 论文提出了Latent Refinement Decoding (LRD)，一种两阶段解码框架，通过潜在细化和预测反馈循环解决了基于扩散的语言模型的信息丢失和过早决策问题，显著提升了生成速度和准确性。

Details

Motivation: 现有的自回归模型因严格顺序解码导致高延迟，而扩散并行生成方法存在信息丢失和过早决策的问题，阻碍了模型的全局一致性。

Result: 实验表明，LRD在编码（HumanEval +6.3, MBPP +2.6）和推理（GSM8K +2.9, MATH500 +3.8）任务中显著提升准确性，速度提升可达10.6倍。

Insight: LRD通过分布混合和迭代反馈实现了全局一致性，避免了扩散模型的过早决策问题，为并行序列生成提供了一种高效且通用的方法。

Abstract: Autoregressive (AR) models remain the standard for natural language generation but still suffer from high latency due to strictly sequential decoding. Recent diffusion-inspired approaches, such as LlaDA and Dream, mitigate this by generating in parallel, yet they suffer from two core limitations: information loss, as predictive distributions for non-finalized tokens are discarded at each step, and premature commitment, where local decisions are made without sufficient global coordination. We introduce Latent Refinement Decoding (LRD), a two-stage framework with Latent Refinement and a Predictive Feedback Loop. The first stage maintains masked positions as distributional mixtures of predicted tokens and the mask embedding, allowing the model to establish more globally consistent beliefs. The second stage progressively finalizes confident tokens while retaining uncertain ones for iterative feedback. KL-divergence dynamics provide a principled and reliable criterion for convergence and early stopping. Experiments across coding (HumanEval +6.3, MBPP +2.6) and reasoning (GSM8K +2.9, MATH500 +3.8) show that LRD improves accuracy while delivering speedups of up to 10.6x, making it a strong and versatile alternative for parallel sequence generation.

[226] Enhancing LLM Reasoning via Non-Human-Like Reasoning Path Preference Optimization cs.CL | cs.AIPDF

Junjie Lu, Yuliang Liu, Chaofeng Qu, Wei Shen, Zhouhan Lin

TL;DR: 本文提出了一种名为CGPO的新方法，通过利用置信度信号识别模型推理过程中的不确定性高点，并应用自生成的非人类推理路径指导来优化推理性能，避免了传统方法对人工或高容量模型标注的依赖。

Details

Motivation: 当前强化大型语言模型（LLM）推理的方法通常偏向人类推理轨迹，限制了探索非人类推理路径的可能性，从而制约了性能提升空间。此外，研究发现模型的首个错误步骤通常发生在置信度最低点之后，这表明在最低置信点提供指导可能比在首个错误点更有效。

Result: 实验结果表明，CGPO方法在使用少量小型模型生成的数据时，性能优于依赖高容量模型或人工标注数据的方法。

Insight: 在模型推理过程中，最低置信点是关键干预时机，而非人类推理路径可能比传统方法更具潜力。这一发现为未来推理优化方法提供了新方向。

Abstract: Current approaches for strengthening LLM reasoning tend to introduce a training bias toward human-like reasoning trajectories. In step-wise preference optimization, in particular, dependence on human or higher-capacity model annotations for intermediate steps limits exploration of alternative, non-human-like reasoning paths and thus constrains achievable performance. Furthermore, through a small-scale pilot study, we observed that in approximately 75% of cases, the model’s first erroneous step occurs after the lowest-confidence point. This suggests that guiding the model at its lowest-confidence point before an error provides more accurate supervision than locating the first explicit error. In this paper, we propose Confidence-Guided Reasoning Path Preference Optimization (CGPO), a method that leverages a confidence signal to identify points of maximal uncertainty in the model’s reasoning process and applies self-generated, non-human-like reasoning-path guidance to mitigate trajectory drift. Our experiments span diverse models applied to both code and mathematical reasoning tasks. The results show that, with the same amount of training data, our method using data generated by a small model can achieve better performance in most cases compared with approaches using data generated by a strong model or human-annotated.

[227] Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations cs.CL | cs.CVPDF

Johannes Moll, Markus Graf, Tristan Lemke, Nicolas Lenhart, Daniel Truhn

TL;DR: 该论文提出了一个临床基础的框架，用于评估医学视觉语言模型（VLMs）的推理忠实性，通过多模态扰动分析了三点：临床保真度、因果归因和置信度校准。结果显示，答案准确性与解释质量是分离的，且开源模型与专有模型在归因和保真度上存在显著差距。

Details

Motivation: 在高风险的临床应用中，视觉语言模型生成的链式思维（CoT）解释可能听起来合理，但未能真实反映决策过程，导致信任问题。现有评估方法未能捕捉这种不一致性。

Result: 评测显示，答案准确性与解释质量脱节；专有模型在归因（25.0% vs. 1.4%）和保真度（36.1% vs. 31.7%）上优于开源模型。

Insight: 文本线索对解释的影响大于视觉线索；部署VLMs时需超越最终的答案准确性，重视解释的真实性和连贯性。

Abstract: Vision-language models (VLMs) often produce chain-of-thought (CoT) explanations that sound plausible yet fail to reflect the underlying decision process, undermining trust in high-stakes clinical use. Existing evaluations rarely catch this misalignment, prioritizing answer accuracy or adherence to formats. We present a clinically grounded framework for chest X-ray visual question answering (VQA) that probes CoT faithfulness via controlled text and image modifications across three axes: clinical fidelity, causal attribution, and confidence calibration. In a reader study (n=4), evaluator-radiologist correlations fall within the observed inter-radiologist range for all axes, with strong alignment for attribution (Kendall’s $\tau_b=0.670$), moderate alignment for fidelity ($\tau_b=0.387$), and weak alignment for confidence tone ($\tau_b=0.091$), which we report with caution. Benchmarking six VLMs shows that answer accuracy and explanation quality are decoupled, acknowledging injected cues does not ensure grounding, and text cues shift explanations more than visual cues. While some open-source models match final answer accuracy, proprietary models score higher on attribution (25.0% vs. 1.4%) and often on fidelity (36.1% vs. 31.7%), highlighting deployment risks and the need to evaluate beyond final answer accuracy.

[228] Discursive Circuits: How Do Language Models Understand Discourse Relations? cs.CL | cs.LGPDF

Yisong Miao, Min-Yen Kan

TL;DR: 该论文研究了Transformer语言模型中负责理解话语关系的稀疏计算图（称为话语电路），并通过CuDR任务验证了这些电路的有效性和泛化能力。

Details

Motivation: 话语关系理解是自然语言处理中的重要任务，但当前语言模型中哪些组件负责这一功能尚不明确。论文旨在发现和控制这些组件，以便更好地理解和改进模型的话语能力。

Result: 实验表明，稀疏电路在PDTB-based CuDR任务中表现良好，且能泛化到RST和SDRT等未见话语框架；低层捕捉词汇语义和共指，高层编码话语级抽象特征。

Insight: 话语理解在语言模型中由稀疏电路控制，且不同层分工明确；核心共指等特征对话语关系的支持在不同框架中保持一致，为模型设计提供了新视角。

Abstract: Which components in transformer language models are responsible for discourse understanding? We hypothesize that sparse computational graphs, termed as discursive circuits, control how models process discourse relations. Unlike simpler tasks, discourse relations involve longer spans and complex reasoning. To make circuit discovery feasible, we introduce a task called Completion under Discourse Relation (CuDR), where a model completes a discourse given a specified relation. To support this task, we construct a corpus of minimal contrastive pairs tailored for activation patching in circuit discovery. Experiments show that sparse circuits ($\approx 0.2%$ of a full GPT-2 model) recover discourse understanding in the English PDTB-based CuDR task. These circuits generalize well to unseen discourse frameworks such as RST and SDRT. Further analysis shows lower layers capture linguistic features such as lexical semantics and coreference, while upper layers encode discourse-level abstractions. Feature utility is consistent across frameworks (e.g., coreference supports Expansion-like relations).

[229] Domain-Specific Data Generation Framework for RAG Adaptation cs.CL | cs.AIPDF

Chris Xing Tian, Weihao Xie, Zhen Chen, Zhengyuan Yi, Hui Liu

TL;DR: 论文提出了一种名为RAGen的领域专用数据生成框架，用于支持Retrieval-Augmented Generation (RAG)系统的领域自适应。该框架通过生成领域相关的问答-上下文三元组(QAC)，优化RAG的LLM、检索器和嵌入模型等关键组件。

Details

Motivation: RAG系统需要领域专用的训练数据以生成领域相关的回答，但现有数据多为通用问答，缺乏领域特异性。因此，需要一种可扩展的框架来高效生成此类数据。

Result: RAGen能够高效生成领域专用的QAC三元组，为RAG系统提供优化的训练数据，特别适用于科学研究和企业知识库等动态领域。

Insight: 领域专用的QAC数据生成是RAG系统性能优化的关键；模块化和可扩展性是处理大规模动态文档的有效策略。

Abstract: Retrieval-Augmented Generation (RAG) combines the language understanding and reasoning power of large language models (LLMs) with external retrieval to enable domain-grounded responses. Effectively adapting RAG systems to domain-specific settings requires specialized, context-rich training data beyond general-purpose question-answering. Here, we propose RAGen, a scalable and modular framework for generating domain-grounded question-answer-context (QAC) triples tailored to diverse RAG adaptation approaches. RAGen produces these QAC triples by identifying key concepts in documents, generating diverse questions guided by Bloom’s Taxonomy-inspired principles, and pairing them with precise answers extracted from relevant contexts. RAGen supports multiple RAG adaptation strategies, including the optimization of key components such as the LLM, retriever, and embedding model, etc. Its modular pipeline features semantic chunking, hierarchical concept extraction, and multi-chunk retrieval, along with the introduction of curated distractor contexts to promote robust reasoning. Designed for scalability, RAGen efficiently handles large and evolving document corpora without redundant processing, making it especially suitable for dynamic evolving domains such as scientific research and enterprise knowledge bases.

[230] Fairness Metric Design Exploration in Multi-Domain Moral Sentiment Classification using Transformer-Based Models cs.CL | cs.AIPDF

Battemuulen Naranbat, Seyed Sahand Mohammadi Ziabari, Yousuf Nasser Al Husaini, Ali Mohammed Mansoor Alsahag

TL;DR: 该论文探讨了在多领域道德情感分类中使用基于Transformer的模型时如何设计公平性度量，提出了新的Moral Fairness Consistency (MFC)度量标准，并通过实验验证了其在跨域稳定性评估中的有效性。

Details

Motivation: 研究发现在跨域转移时，现有的公平性度量可能会掩盖某些标签的公平性违反问题，因此需要一种新的度量标准来更好地评估模型的公平性。

Result: 实验结果显示，MFC与Demographic Parity Difference呈完全负相关（rho = -1.000），证明了其在公平性评估中的有效性。

Insight: MFC作为一种互补的、诊断导向的度量标准，可以帮助更可靠地评估道德推理模型的公平性，特别是在跨域部署时。

Abstract: Ensuring fairness in natural language processing for moral sentiment classification is challenging, particularly under cross-domain shifts where transformer models are increasingly deployed. Using the Moral Foundations Twitter Corpus (MFTC) and Moral Foundations Reddit Corpus (MFRC), this work evaluates BERT and DistilBERT in a multi-label setting with in-domain and cross-domain protocols. Aggregate performance can mask disparities: we observe pronounced asymmetry in transfer, with Twitter->Reddit degrading micro-F1 by 14.9% versus only 1.5% for Reddit->Twitter. Per-label analysis reveals fairness violations hidden by overall scores; notably, the authority label exhibits Demographic Parity Differences of 0.22-0.23 and Equalized Odds Differences of 0.40-0.41. To address this gap, we introduce the Moral Fairness Consistency (MFC) metric, which quantifies the cross-domain stability of moral foundation detection. MFC shows strong empirical validity, achieving a perfect negative correlation with Demographic Parity Difference (rho = -1.000, p < 0.001) while remaining independent of standard performance metrics. Across labels, loyalty demonstrates the highest consistency (MFC = 0.96) and authority the lowest (MFC = 0.78). These findings establish MFC as a complementary, diagnosis-oriented metric for fairness-aware evaluation of moral reasoning models, enabling more reliable deployment across heterogeneous linguistic contexts. .

[231] A Theorem-Proving-Based Evaluation of Neural Semantic Parsing cs.CLPDF

Hayate Funakura, Hyunsoo Kim, Koji Mineshima

TL;DR: 该论文通过结合图匹配和自动定理证明，重新评估神经语义解析器的性能，发现现有方法在逻辑等价性上存在不足，并提出标准化目标表示的重要性。

Details

Motivation: 目前神经语义解析器的评估主要依赖图匹配指标（如Smatch），但这些指标仅捕捉表面重叠而非逻辑等价性。论文旨在通过自动定理证明更准确地评估逻辑等价性。

Result: 实验表明，图匹配表现优异的模型在生成逻辑等价公式时表现不佳。标准化目标表示可以减少目标变异性并提升逻辑适格性。复杂公式和特定语法结构（如被动语态）会导致性能下降。

Insight: 论文揭示了图匹配指标在推理导向应用中的局限性，强调了逻辑敏感的评估和训练目标的重要性，同时简化目标表示有助于提升性能。

Abstract: Graph-matching metrics such as Smatch are the de facto standard for evaluating neural semantic parsers, yet they capture surface overlap rather than logical equivalence. We reassess evaluation by pairing graph-matching with automated theorem proving. We compare two approaches to building parsers: supervised fine-tuning (T5-Small/Base) and few-shot in-context learning (GPT-4o/4.1/5), under normalized and unnormalized targets. We evaluate outputs using graph-matching, bidirectional entailment between source and target formulas with a first-order logic theorem prover, and well-formedness. Across settings, we find that models performing well on graph-matching often fail to produce logically equivalent formulas. Normalization reduces incidental target variability, improves well-formedness, and strengthens logical adequacy. Error analysis shows performance degrades with increasing formula complexity and with coordination, prepositional phrases, and passive voice; the dominant failures involve variable binding and indexing, and predicate naming. These findings highlight limits of graph-based metrics for reasoning-oriented applications and motivate logic-sensitive evaluation and training objectives together with simplified, normalized target representations. All code and data for our experiments are publicly available.

Jinyuan Xu, Tian Lan, Xintao Yu, Xue He, Hezhi Zhang

TL;DR: 该论文发布了CNSocialDepress数据集，一个用于中文社交媒体抑郁风险检测的基准数据集，包含大量标注数据和多维心理属性分析，支持抑郁信号的细粒度分析和大语言模型微调。

Details

Motivation: 解决中文抑郁风险检测领域公开数据稀缺且多为二分类的问题，提供更丰富的标注数据和多维心理属性支持。

Result: 实验表明该数据集可广泛用于NLP任务，如心理画像和大语言模型微调，提升了抑郁风险检测的实用价值。

Insight: 数据集为中文心理健康应用提供了重要的资源和工具，填补了现有资源的空白。

Abstract: Depression is a pressing global public health issue, yet publicly available Chinese-language resources for risk detection remain scarce and are mostly limited to binary classification. To address this limitation, we release CNSocialDepress, a benchmark dataset for depression risk detection from Chinese social media posts. The dataset contains 44,178 texts from 233 users, within which psychological experts annotated 10,306 depression-related segments. CNSocialDepress provides binary risk labels together with structured multi-dimensional psychological attributes, enabling interpretable and fine-grained analysis of depressive signals. Experimental results demonstrate its utility across a wide range of NLP tasks, including structured psychological profiling and fine-tuning of large language models for depression detection. Comprehensive evaluations highlight the dataset’s effectiveness and practical value for depression risk identification and psychological analysis, thereby providing insights to mental health applications tailored for Chinese-speaking populations.

[233] Towards Real-Time Fake News Detection under Evidence Scarcity cs.CL | cs.AIPDF

Guangyu Wei, Ke Han, Yueming Lyu, Yu Luo, Yue Jiang

TL;DR: 论文提出了EASE框架，通过动态评估证据、推理和情感三个视角，解决了实时假新闻检测中证据稀缺的问题。

Details

Motivation: 实时假新闻检测面临证据稀缺的挑战，现有方法依赖外部证据且泛化能力不足，作者提出动态适应决策的EASE框架。

Result: 实验表明EASE在多个基准上表现最优，且在新提出的RealTimeNews-25数据集上显著提升了泛化能力。

Insight: 动态评估和多视角融合能有效解决证据稀缺问题，指令调优和伪标签可提升评估可靠性。

Abstract: Fake news detection becomes particularly challenging in real-time scenarios, where emerging events often lack sufficient supporting evidence. Existing approaches often rely heavily on external evidence and therefore struggle to generalize under evidence scarcity. To address this issue, we propose Evaluation-Aware Selection of Experts (EASE), a novel framework for real-time fake news detection that dynamically adapts its decision-making process according to the assessed sufficiency of available evidence. EASE introduces a sequential evaluation mechanism comprising three independent perspectives: (1) Evidence-based evaluation, which assesses evidence and incorporates it into decision-making only when the evidence is sufficiently supportive; (2) Reasoning-based evaluation, which leverages the world knowledge of large language models (LLMs) and applies them only when their reliability is adequately established; and (3) Sentiment-based fallback, which integrates sentiment cues when neither evidence nor reasoning is reliable. To enhance the accuracy of evaluation processes, EASE employs instruction tuning with pseudo labels to guide each evaluator in justifying its perspective-specific knowledge through interpretable reasoning. Furthermore, the expert modules integrate the evaluators’ justified assessments with the news content to enable evaluation-aware decision-making, thereby enhancing overall detection accuracy. Moreover, we introduce RealTimeNews-25, a new benchmark comprising recent news for evaluating model generalization on emerging news with limited evidence. Extensive experiments demonstrate that EASE not only achieves state-of-the-art performance across multiple benchmarks, but also significantly improves generalization to real-time news. The code and dataset are available: https://github.com/wgyhhhh/EASE.

[234] Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs cs.CLPDF

Nikita Afonin, Nikita Andriyanov, Nikhil Bageshpura, Kyle Liu, Kevin Zhu

TL;DR: 论文研究了窄范围上下文学习（ICL）是否会导致大语言模型（LLM）的广泛不一致性（EM），发现即使在ICL中，EM现象也会显著出现，且随上下文示例数量增加而加剧。

Details

Motivation: 现有研究表明窄范围微调可能导致LLM的广泛不一致性，但未涵盖上下文学习（ICL）。本文旨在填补这一空白，验证ICL是否同样引发EM现象。

Result: 实验显示，64个窄范围示例下不一致响应率为2%-17%，256个示例时上升至58%。67.5%的不一致轨迹源于模型采用危险“角色”。

Insight: ICL中的EM与微调导致的EM机制类似，提示窄范围输入可能泛化为广泛不一致性，需在设计和使用LLM时警惕此类风险。

Abstract: Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across three datasets, three frontier models produce broadly misaligned responses at rates between 2% and 17% given 64 narrow in-context examples, and up to 58% with 256 examples. We also examine mechanisms of EM by eliciting step-by-step reasoning (while leaving in-context examples unchanged). Manual analysis of the resulting chain-of-thought shows that 67.5% of misaligned traces explicitly rationalize harmful outputs by adopting a reckless or dangerous ‘’persona’’, echoing prior results on finetuning-induced EM.

[235] Are Large Language Models Effective Knowledge Graph Constructors? cs.CLPDF

Ruirui Chen, Weifeng Jiang, Chengwei Qin, Bo Xiong, Fiona Liausvia

TL;DR: 这篇论文探讨了大语言模型（LLMs）在知识图谱（KG）构建中的有效性，提出了一个层次化的提取框架来构建语义丰富的结构化KG，并评估了LLMs的能力和局限性。

Details

Motivation: 知识图谱对知识密集型任务至关重要，并能减少大语言模型的幻觉问题，但构建高质量KG仍面临挑战。现有的LLM方法通常局限于实体和关系提取，缺乏全面的语义覆盖和结构化表示。

Result: 论文发现了当前LLMs在KG构建中的优势和不足，为进一步研究提供了关键挑战。

Insight: LLMs在KG构建中具有潜力，但需要改进以支持更全面的语义覆盖和结构化表示。发布的资源有助于推动高影响力领域的应用。

Abstract: Knowledge graphs (KGs) are vital for knowledge-intensive tasks and have shown promise in reducing hallucinations in large language models (LLMs). However, constructing high-quality KGs remains difficult, requiring accurate information extraction and structured representations that support interpretability and downstream utility. Existing LLM-based approaches often focus narrowly on entity and relation extraction, limiting coverage to sentence-level contexts or relying on predefined schemas. We propose a hierarchical extraction framework that organizes information at multiple levels, enabling the creation of semantically rich and well-structured KGs. Using state-of-the-art LLMs, we extract and construct knowledge graphs and evaluate them comprehensively from both structural and semantic perspectives. Our results highlight the strengths and shortcomings of current LLMs in KG construction and identify key challenges for future work. To advance research in this area, we also release a curated dataset of LLM-generated KGs derived from research papers on children’s mental well-being. This resource aims to foster more transparent, reliable, and impactful applications in high-stakes domains such as healthcare.

[236] FOSSIL: Harnessing Feedback on Suboptimal Samples for Data-Efficient Generalisation with Imitation Learning for Embodied Vision-and-Language Tasks cs.CL | cs.AIPDF

Sabrina McCallum, Amit Parekh, Alessandro Suglia

TL;DR: 论文提出了一种名为FOSSIL的方法，通过结合最优和次优演示以及语言反馈，提升了模仿学习在具身视觉-语言任务中的泛化能力和数据效率。

Details

Motivation: 当前具身AI方法主要依赖专家演示学习策略，但无法评估演示质量，导致只能学习最优行为或复制错误。强化学习虽然是一种替代，但其探索过程牺牲了数据效率。本文旨在通过语言反馈利用次优演示，提升学习效果。

Result: 实验结果表明，FOSSIL在具身视觉-语言任务中显著提升了模型的组合泛化能力和鲁棒性。同时，语言反馈被证明是比标量奖励更直观和有效的替代方案。

Insight: 论文的洞察在于，语言反馈可以帮助模型从次优行为中学习，从而提升数据效率。这表明语言上下文在具身AI任务中具有重要作用，能够弥补演示质量的不足。

Abstract: Current approaches to embodied AI tend to learn policies from expert demonstrations. However, without a mechanism to evaluate the quality of demonstrated actions, they are limited to learning from optimal behaviour, or they risk replicating errors and inefficiencies. While reinforcement learning offers one alternative, the associated exploration typically results in sacrificing data efficiency. This work explores how agents trained with imitation learning can learn robust representations from both optimal and suboptimal demonstrations when given access to constructive language feedback as a means to contextualise different modes of behaviour. We directly provide language feedback embeddings as part of the input sequence into a Transformer-based policy, and optionally complement the traditional next action prediction objective with auxiliary self-supervised learning objectives for feedback prediction. We test our approach on a range of embodied Vision-and-Language tasks in our custom BabyAI-XGen environment and show significant improvements in agents’ compositional generalisation abilities and robustness, suggesting that our data-efficient method allows models to successfully convert suboptimal behaviour into learning opportunities. Overall, our results suggest that language feedback is a competitive and intuitive alternative to intermediate scalar rewards for language-specified embodied tasks.

[237] Template-Based Text-to-Image Alignment for Language Accessibility: A Study on Visualizing Text Simplifications cs.CLPDF

Belkiss Souayed, Sarah Ebling, Yingqiang Gao

TL;DR: 本文提出了一种基于模板的文本-图像对齐方法，用于语言无障碍性研究，旨在通过结构化视觉-语言模型提示框架生成简约且易于理解的图像，结果显示对象焦点模板在语义对齐和可访问性方面表现最佳。

Details

Motivation: 智力障碍人士在理解复杂文本时面临困难，现有的文本-图像模型多注重美学而非无障碍性，因此需要研究如何通过简化文本生成更具可访问性的图像。

Result: Basic Object Focus模板在语义对齐上表现最佳；Retro视觉风格被评为最具可访问性，Wikipedia数据源效果最好；文本简洁性维度表现出较强的可靠性。

Insight: 视觉极简主义（如Basic Object Focus）有助于提升语言无障碍性；结构化提示在AI生成无障碍视觉工具中至关重要；专家评估表明图像质量的主观性较强。

Abstract: Individuals with intellectual disabilities often have difficulties in comprehending complex texts. While many text-to-image models prioritize aesthetics over accessibility, it is not clear how visual illustrations relate to text simplifications (TS) generated from them. This paper presents a structured vision-language model (VLM) prompting framework for generating accessible images from simplified texts. We designed five prompt templates, i.e., Basic Object Focus, Contextual Scene, Educational Layout, Multi-Level Detail, and Grid Layout, each following distinct spatial arrangements while adhering to accessibility constraints such as object count limits, spatial separation, and content restrictions. Using 400 sentence-level simplifications from four established TS datasets (OneStopEnglish, SimPA, Wikipedia, and ASSET), we conducted a two-phase evaluation: Phase 1 assessed prompt template effectiveness with CLIPScores, and Phase 2 involved human annotation of generated images across ten visual styles by four accessibility experts. Results show that the Basic Object Focus prompt template achieved the highest semantic alignment, indicating that visual minimalism enhances language accessibility. Expert evaluation further identified Retro style as the most accessible and Wikipedia as the most effective data source. Inter-annotator agreement varied across dimensions, with Text Simplicity showing strong reliability and Image Quality proving more subjective. Overall, our framework offers practical guidelines for accessible content generation and underscores the importance of structured prompting in AI-generated visual accessibility tools.

[238] Do LLMs “Feel”? Emotion Circuits Discovery and Control cs.CL | cs.AIPDF

Chenxi Wang, Yixuan Zhang, Ruiji Yu, Yufei Zheng, Lang Gao

TL;DR: 该研究系统性地探索了大型语言模型（LLMs）中的情感机制，提出了情感电路的发现与控制方法，并在情感表达准确性上取得了显著成果。

Details

Motivation: 随着对LLMs情感智能需求的增长，理解其内部情感表达机制并实现对情感生成的精准控制成为关键挑战。

Result: 调制情感电路的情感表达测试准确率达到99.65%，超越了传统提示和引导方法。

Insight: 这是首次系统性地发现并验证LLMs中的情感电路，为模型的可解释性和可控情感智能提供了新视角。

Abstract: As the demand for emotional intelligence in large language models (LLMs) grows, a key challenge lies in understanding the internal mechanisms that give rise to emotional expression and in controlling emotions in generated text. This study addresses three core questions: (1) Do LLMs contain context-agnostic mechanisms shaping emotional expression? (2) What form do these mechanisms take? (3) Can they be harnessed for universal emotion control? We first construct a controlled dataset, SEV (Scenario-Event with Valence), to elicit comparable internal states across emotions. Subsequently, we extract context-agnostic emotion directions that reveal consistent, cross-context encoding of emotion (Q1). We identify neurons and attention heads that locally implement emotional computation through analytical decomposition and causal analysis, and validate their causal roles via ablation and enhancement interventions. Next, we quantify each sublayer’s causal influence on the model’s final emotion representation and integrate the identified local components into coherent global emotion circuits that drive emotional expression (Q2). Directly modulating these circuits achieves 99.65% emotion-expression accuracy on the test set, surpassing prompting- and steering-based methods (Q3). To our knowledge, this is the first systematic study to uncover and validate emotion circuits in LLMs, offering new insights into interpretability and controllable emotional intelligence.

[239] LLM-Specific Utility: A New Perspective for Retrieval-Augmented Generation cs.CL | cs.AI | cs.IRPDF

Hengran Zhang, Keping Bi, Jiafeng Guo, Jiaming Zhang, Shuaiqiang Wang

TL;DR: 该论文提出了LLM特定效用的概念，强调在检索增强生成（RAG）中，不同LLM对相同外部知识的利用能力不同，传统的通用效用标注不适用。通过实验发现，人类标注的段落并非LLM最优选择，且效用段落不可跨模型迁移。论文还提出了一种LLM特定效用的基准评估方法。

Details

Motivation: 传统的检索增强生成（RAG）研究将效用视为通用属性，忽略了不同LLM在内部知识和理解能力上的差异可能导致对相同段落的效用不同。本文旨在揭示这种差异，并提出LLM特定效用的重要性。

Result: 实验结果表明，人类标注的段落对特定LLM并非最优，且效用段落不可跨模型迁移。现有效用评估方法中，基于伪答案的语言化方法表现稳健，但LLM在评估效用时存在缺陷。

Insight: LLM的效用评估需要模型特定的视角，通用的效用标注可能在RAG中失效。困惑度是衡量LLM对查询和段落可读性的关键指标。

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge. While traditional retrieval focuses on relevance, RAG’s effectiveness depends on the utility of retrieved passages, i.e., the usefulness in facilitating the generation of an accurate and comprehensive answer. Existing studies often treat utility as a generic attribute, ignoring the fact that different LLMs may benefit differently from the same passage due to variations in internal knowledge and comprehension ability. In this work, we introduce and systematically investigate the notion of LLM-specific utility. Through large-scale experiments across multiple datasets and LLMs, we demonstrate that human-annotated passages are not optimal for LLMs and that ground-truth utilitarian passages are not transferable across different LLMs. These findings highlight the necessity of adopting the LLM-specific utility in RAG research. Our findings indicate that some human-annotated passages are not ground-truth utilitarian passages for specific LLMs, partially due to the varying readability of queries and passages for LLMs, a tendency for which perplexity is a key metric. Based on these findings, we propose a benchmarking procedure for LLM-specific utility judgments. We evaluate existing utility judgment methods on six datasets and find that while verbalized methods using pseudo-answers perform robustly, LLMs struggle to assess utility effectively-failing to reject all passages for known queries and to select truly useful ones for unknown queries.

[240] Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers cs.CL | cs.AI | cs.LGPDF

Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang

TL;DR: 论文提出了一种名为Rollout Routing Replay (R3)的方法，通过记录推理引擎的路由分布并在训练中重放，解决了MoE模型中路由机制导致的不稳定问题，显著提升了强化学习的稳定性。

Details

Motivation: 在Mixture-of-Experts (MoE)模型中，路由机制在训练和推理阶段的不一致性会导致强化学习不稳定甚至崩溃，限制了模型的性能和应用。

Result: 实验证明，R3在多种设置下成功稳定了RL训练，避免了崩溃，并超越了GSPO和TIS等方法。

Insight: 这项工作揭示了MoE模型中路由不一致性问题的根源，并提供了一种新的解决方案，为MoE模型的稳定训练提供了理论基础和实践方法。

Abstract: Reinforcement learning (RL) has emerged as a crucial approach for enhancing the capabilities of large language models. However, in Mixture-of-Experts (MoE) models, the routing mechanism often introduces instability, even leading to catastrophic RL training collapse. We analyze the training-inference consistency of MoE models and identify a notable discrepancy in routing behaviors between the two phases. Moreover, even under identical conditions, the routing framework can yield divergent expert selections across repeated forward passes. To address this foundational inconsistency, we propose Rollout Routing Replay (R3), a method that records routing distributions from the inference engine and replays them during training. R3 significantly reduces training-inference policy KL divergence and mitigates extreme discrepancies without compromising training speed. Extensive experiments on various settings confirm that R3 succeeds in stabilizing RL training, preventing collapse and outperforming methods such as GSPO and TIS. We believe this work can offer a new solution for stabilizing RL in MoE models.

Zirui Song, Yuan Huang, Junchang Liu, Haozhe Luo, Chenxi Wang

TL;DR: 该论文提出了一种新颖的策略对齐评估方法，使用高质量的人类验证数据来评估大型语言模型（LLM）在社交推理游戏中的表现，揭示了模型在欺骗和反事实推理方面的不足。

Details

Motivation: 现有研究大多将社交推理游戏简化为LLM自对弈，忽略了社交游戏的丰富性，同时缺乏高质量的参考数据和细粒度评估方法。

Result: 实验显示，现有LLMs表现参差不齐，约一半模型得分低于0.50，尤其在欺骗和反事实推理方面表现不佳。

Insight: 研究强调了结合策略对齐评估的重要性，为多智能体交互中的语言、推理和策略研究提供了新方向。

Abstract: Social deduction games like Werewolf combine language, reasoning, and strategy, providing a testbed for studying natural language and social intelligence. However, most studies reduce the game to LLM-based self-play, yielding templated utterances and anecdotal cases that overlook the richness of social gameplay. Evaluation further relies on coarse metrics such as survival time or subjective scoring due to the lack of quality reference data. To address these gaps, we curate a high-quality, human-verified multimodal Werewolf dataset containing over 100 hours of video, 32.4M utterance tokens, and 15 rule variants. Based on this dataset, we propose a novel strategy-alignment evaluation that leverages the winning faction’s strategies as ground truth in two stages: 1) Speech evaluation, formulated as multiple-choice-style tasks that assess whether the model can adopt appropriate stances across five dimensions of social ability; and 2) Decision evaluation, which assesses the model’s voting choices and opponent-role inferences. This framework enables a fine-grained evaluation of models’ linguistic and reasoning capabilities, while capturing their ability to generate strategically coherent gameplay. Our experiments show that state-of-the-art LLMs show diverse performance, with roughly half remain below 0.50, revealing clear gaps in deception and counterfactual reasoning. We hope our dataset further inspires research on language, reasoning, and strategy in multi-agent interaction.

[242] KnowRL: Teaching Language Models to Know What They Know cs.CL | cs.AIPDF

Sahil Kale, Devendra Singh Dhami

TL;DR: KnowRL框架通过自省和基于共识的奖励机制，提升语言模型对自身知识边界的认知能力，无需外部监督即可显著增强模型的自我知识一致性。

Details

Motivation: 当前大型语言模型（LLM）在超过20%的情况下无法准确评估自身能力，导致其回答不可靠。因此，需要一种方法帮助模型更准确地认知自身知识范围，以实现更安全和可靠的AI行为。

Result: 在LLaMA-3.1-8B和Qwen-2.5-7B上的实验表明，KnowRL显著提升了模型的自我知识一致性，准确性最高提升28%，F1分数提升12%，且在少量迭代内优于基线方法。

Insight: KnowRL展示了语言模型通过内部机制可以自我提升知识认知能力，无需外部干预。这一方法为构建更可靠、安全的AI系统提供了新思路，尤其适用于关键应用场景。

Abstract: Truly reliable AI requires more than simply scaling up knowledge; it demands the ability to know what it knows and when it does not. Yet recent research shows that even the best LLMs misjudge their own competence in more than one in five cases, making any response born of such internal uncertainty impossible to fully trust. Inspired by self-improvement reinforcement learning techniques that require minimal data, we present a simple but powerful framework KnowRL that strengthens a model’s internal understanding of its own feasibility boundaries, enabling safer and more responsible behaviour. Our framework combines two components: (i) introspection, where the model generates and classifies tasks it judges feasible or infeasible, and (ii) consensus-based rewarding, where stability of self-knowledge assessment is reinforced through internal agreement. By using internally generated data, this design strengthens consistency in self-knowledge and entirely avoids costly external supervision. In experiments on LLaMA-3.1-8B and Qwen-2.5-7B, KnowRL steadily improved self-knowledge, validated by both intrinsic self-consistency and extrinsic benchmarking. With nothing more than a small seed set and no external supervision, our method drove gains as high as 28% in accuracy and 12% in F1, outperforming baselines in just a few iterations. Our framework essentially unlocks the untapped capacity of LLMs to self-improve their knowledge awareness, opening the door to reliable, more accountable AI and safer deployment in critical applications. Owing to its simplicity and independence from external effort, we encourage applying this reliability-enhancing process to all future models.

[243] Valid Survey Simulations with Limited Human Data: The Roles of Prompting, Fine-Tuning, and Rectification cs.CLPDF

Stefan Krsteski, Giuseppe Russo, Serina Chang, Robert West, Kristina Gligorić

TL;DR: 该论文研究了如何在有限的人类数据下，通过提示、微调和校正方法的结合，利用大型语言模型（LLMs）生成有效的调查模拟结果。研究表明，仅使用合成方法会引入显著偏差（24%-86%），而结合校正方法可将偏差降至5%以下，并增加有效样本量至14%。

Details

Motivation: 传统的调查方法成本高且耗时，LLMs被提出作为低成本、可扩展的替代方案。然而，LLMs的输出存在偏差，如何合理分配人类数据以优化合成和校正方法的结合效果成为研究动机。

Result: 结果显示，仅使用合成方法偏差为24%-86%，而结合校正方法后偏差降至5%以下，有效样本量增加至14%。

Insight: 论文揭示了传统方法中过度依赖人类数据微调的局限性，提出在预算有限时，优先分配数据用于校正能显著提升估计质量。

Abstract: Surveys provide valuable insights into public opinion and behavior, but their execution is costly and slow. Large language models (LLMs) have been proposed as a scalable, low-cost substitute for human respondents, but their outputs are often biased and yield invalid estimates. We study the interplay between synthesis methods that use LLMs to generate survey responses and rectification methods that debias population estimates, and explore how human responses are best allocated between them. Using two panel surveys with questions on nutrition, politics, and economics, we find that synthesis alone introduces substantial bias (24-86%), whereas combining it with rectification reduces bias below 5% and increases effective sample size by up to 14%. Overall, we challenge the common practice of using all human responses for fine-tuning, showing that under a fixed budget, allocating most to rectification results in far more effective estimation.

[244] Hallucination Detection via Internal States and Structured Reasoning Consistency in Large Language Models cs.CLPDF

Yusheng Song, Lirong Qiu, Xi Zhang, Zhihao Tang

TL;DR: 论文提出了一种统一的框架，通过结合内部状态探测和推理链验证来解决大型语言模型中的幻觉检测问题，克服了信号稀缺性和表征对齐障碍。

Details

Motivation: 大型语言模型中的幻觉检测存在检测困境，现有的方法在事实密集型和逻辑密集型任务中各有所长但无法兼顾。作者希望通过统一的方法解决这一问题。

Result: 在三个多样化的基准测试和两种领先的大型语言模型上的实验表明，该方法显著优于现有基线。

Insight: 通过融合内部状态和外部推理的一致性，可以更全面地检测幻觉，尤其在事实和逻辑混合的任务中表现优越。

Abstract: The detection of sophisticated hallucinations in Large Language Models (LLMs) is hampered by a ``Detection Dilemma’’: methods probing internal states (Internal State Probing) excel at identifying factual inconsistencies but fail on logical fallacies, while those verifying externalized reasoning (Chain-of-Thought Verification) show the opposite behavior. This schism creates a task-dependent blind spot: Chain-of-Thought Verification fails on fact-intensive tasks like open-domain QA where reasoning is ungrounded, while Internal State Probing is ineffective on logic-intensive tasks like mathematical reasoning where models are confidently wrong. We resolve this with a unified framework that bridges this critical gap. However, unification is hindered by two fundamental challenges: the Signal Scarcity Barrier, as coarse symbolic reasoning chains lack signals directly comparable to fine-grained internal states, and the Representational Alignment Barrier, a deep-seated mismatch between their underlying semantic spaces. To overcome these, we introduce a multi-path reasoning mechanism to obtain more comparable, fine-grained signals, and a segment-aware temporalized cross-attention module to adaptively fuse these now-aligned representations, pinpointing subtle dissonances. Extensive experiments on three diverse benchmarks and two leading LLMs demonstrate that our framework consistently and significantly outperforms strong baselines. Our code is available: https://github.com/peach918/HalluDet.

[245] Information-Preserving Reformulation of Reasoning Traces for Antidistillation cs.CLPDF

Jiayu Ding, Lei Cui, Li Dong, Nanning Zheng, Furu Wei

TL;DR: 论文提出了PART方法，通过保留信息的同时干扰蒸馏过程，保护大型语言模型的推理痕迹不被未经授权的蒸馏。

Details

Motivation: 现有方法在保护推理痕迹时往往牺牲了信息的完整性（如使用简短摘要），而PART旨在解决这种权衡问题。

Result: 实验表明，PART显著降低了不同规模和类型的学生模型的蒸馏效果，例如32B模型的性能下降了13.5%。

Insight: 人类与小模型对推理痕迹的理解方式不同，PART利用这种差异实现了有效保护。

Abstract: Recent advances in Large Language Models (LLMs) show that extending the length of reasoning chains significantly improves performance on complex tasks. While revealing these reasoning traces helps users better follow, verify, and learn from the model’s problem-solving process, it also makes them highly vulnerable to unauthorized distillation. To mitigate this risk, proprietary model providers often adopt aggressive protection strategies, such as replacing detailed reasoning with brief summaries, which deprive users of valuable intermediate information. To address this trade-off, we propose PART, an information-preserving antidistillation reformulation of reasoning traces. Motivated by the difference between how humans understand reasoning traces and how LLMs exploit them for supervised fine-tuning, we design a simple but effective two-step reformulation: removing self-talk behaviors and reordering sub-conclusions. A small auxiliary model is trained to perform this reformulation, incurring minimal computational overhead. Extensive experiments demonstrate that PART consistently disrupts distillation across student models of different sizes and types on various reasoning benchmarks. For instance, when training on reformulated traces, even the performance of a large 32B student model decreases from 54.17 to 46.88 on AIME 2024, corresponding to a 13.5% degradation.

[246] LLMAtKGE: Large Language Models as Explainable Attackers against Knowledge Graph Embeddings cs.CLPDF

Ting Li, Yang Yang, Yipeng Yu, Liang Yao, Guoqing Chao

TL;DR: 该论文提出了LLMAtKGE，一种基于大语言模型（LLM）的框架，用于攻击知识图谱嵌入（KGE），同时生成人类可读的解释。通过结构化提示设计和过滤机制，解决了LLM输入限制和犹豫问题，实验结果表明其攻击性能和解释能力优于现有方法。

Details

Motivation: 现有的黑盒攻击方法在攻击知识图谱嵌入时缺乏人类可读的解释能力，并且在泛化性上表现不佳。大语言模型在文本理解和生成方面的强大能力为这一领域提供了新的可能性。

Result: 在多个知识图谱数据集上的实验表明，LLMAtKGE在黑盒攻击中表现优于基线方法，生成的人类可读解释具有竞争力，并接近白盒方法的性能。

Insight: 大语言模型可以成功应用于知识图谱攻击任务，并通过结构化设计和过滤机制解决实际应用中的输入限制和解释生成问题。

Abstract: Adversarial attacks on knowledge graph embeddings (KGE) aim to disrupt the model’s ability of link prediction by removing or inserting triples. A recent black-box method has attempted to incorporate textual and structural information to enhance attack performance. However, it is unable to generate human-readable explanations, and exhibits poor generalizability. In the past few years, large language models (LLMs) have demonstrated powerful capabilities in text comprehension, generation, and reasoning. In this paper, we propose LLMAtKGE, a novel LLM-based framework that selects attack targets and generates human-readable explanations. To provide the LLM with sufficient factual context under limited input constraints, we design a structured prompting scheme that explicitly formulates the attack as multiple-choice questions while incorporating KG factual evidence. To address the context-window limitation and hesitation issues, we introduce semantics-based and centrality-based filters, which compress the candidate set while preserving high recall of attack-relevant information. Furthermore, to efficiently integrate both semantic and structural information into the filter, we precompute high-order adjacency and fine-tune the LLM with a triple classification task to enhance filtering performance. Experiments on two widely used knowledge graph datasets demonstrate that our attack outperforms the strongest black-box baselines and provides explanations via reasoning, and showing competitive performance compared with white-box methods. Comprehensive ablation and case studies further validate its capability to generate explanations.

[247] Survey Response Generation: Generating Closed-Ended Survey Responses In-Silico with Large Language Models cs.CL | cs.CYPDF

Georg Ahnert, Anna-Carolina Haensch, Barbara Plank, Markus Strohmaier

TL;DR: 该论文系统研究了8种封闭式调查响应生成方法对预测调查响应的影响，提出了32百万次模拟调查响应的实验结果，发现受限生成方法效果最佳，推理输出未必提升对齐性。

Details

Motivation: 研究动机是厘清大语言模型（LLMs）在模拟人类封闭式调查响应时的最佳实践，填补现有研究多集中在开放式文本生成的空白。

Result: 结果显示：受限生成方法总体表现最佳（如强制选择特定格式），推理输出并未显著提升响应质量。

Insight: 关键洞察：封闭式响应生成方法对模拟结果影响显著，需谨慎选择方法；推理能力在封闭式任务中未必适用。

Abstract: Many in-silico simulations of human survey responses with large language models (LLMs) focus on generating closed-ended survey responses, whereas LLMs are typically trained to generate open-ended text instead. Previous research has used a diverse range of methods for generating closed-ended survey responses with LLMs, and a standard practice remains to be identified. In this paper, we systematically investigate the impact that various Survey Response Generation Methods have on predicted survey responses. We present the results of 32 mio. simulated survey responses across 8 Survey Response Generation Methods, 4 political attitude surveys, and 10 open-weight language models. We find significant differences between the Survey Response Generation Methods in both individual-level and subpopulation-level alignment. Our results show that Restricted Generation Methods perform best overall, and that reasoning output does not consistently improve alignment. Our work underlines the significant impact that Survey Response Generation Methods have on simulated survey responses, and we develop practical recommendations on the application of Survey Response Generation Methods.

[248] MeTA-LoRA: Data-Efficient Multi-Task Fine-Tuning for Large Language Models cs.CLPDF

Bo Cheng, Xu Wang, Jinda Liu, Yi Chang, Yuan Wu

TL;DR: 论文提出了一种名为MeTA-LoRA的两阶段优化框架，旨在提升大型语言模型在多任务学习中的数据效率。通过任务特异性LoRA适配器快速适应少量数据，并在第二阶段聚合多任务梯度以促进知识迁移，显著减少了数据需求。

Details

Motivation: 尽管LoRA在单任务微调中表现优异，但在多任务学习中难以高效利用任务间知识，且需要大量任务特异性数据。为解决这一问题，研究者提出了MeTA-LoRA。

Result: 在多任务和多语言学习场景中，MeTA-LoRA的性能与传统全数据微调相当或更优，数据使用量显著减少。

Insight: 通过少量任务数据和跨任务知识共享，可以在数据有限的情况下高效适配大型语言模型。

Abstract: Low-Rank Adaptation (LoRA) has emerged as one of the most widely used parameter-efficient fine-tuning (PEFT) methods for adapting large language models (LLMs) to downstream tasks. While highly effective in single-task settings, it struggles to efficiently leverage inter-task knowledge in complex multi-task learning scenarios, often requiring substantial task-specific data to achieve optimal performance. To address this limitation, we introduce MeTA-LoRA, a two-stage optimization framework that significantly improves data efficiency in multi-task adaptation. In the first stage, task-specific LoRA adapters are learned using only a few samples from each involved dataset, enabling rapid adaptation without large-scale supervision. In the second stage, the shared LoRA adapter is updated by aggregating gradients from multiple tasks to promote knowledge transfer across tasks, further reducing data usage by leveraging common patterns. In both multi-task learning and multilingual learning scenarios, our method matches or surpasses the performance of traditional full-data LoRA fine-tuning approaches, while using significantly less task-specific data.

[249] StoryBox: Collaborative Multi-Agent Simulation for Hybrid Bottom-Up Long-Form Story Generation Using Large Language Models cs.CL | cs.MAPDF

Zehao Chen, Rong Pan, Haoran Li

TL;DR: 该论文提出了StoryBox，一种基于多智能体模拟的混合自底向上长故事生成方法，通过智能体在动态沙盒环境中的交互生成事件，构建长故事。

Details

Motivation: 受到人类作家创作过程的启发，作者希望设计一种能够模拟角色与环境交互、自然生成故事的方法，解决传统自上而下方法过于僵化的问题。

Result: 系统在多项指标上达到state-of-the-art性能，生成的故事具有较高的连贯性和一致性。

Insight: 混合自底向上的方法可以有效平衡结构性与自由度，为长故事生成提供了一个可扩展的创新解决方案。

Abstract: Human writers often begin their stories with an overarching mental scene, where they envision the interactions between characters and their environment. Inspired by this creative process, we propose a novel approach to long-form story generation, termed hybrid bottom-up long-form story generation, using multi-agent simulations. In our method, agents interact within a dynamic sandbox environment, where their behaviors and interactions with one another and the environment generate emergent events. These events form the foundation for the story, enabling organic character development and plot progression. Unlike traditional top-down approaches that impose rigid structures, our hybrid bottom-up approach allows for the natural unfolding of events, fostering more spontaneous and engaging storytelling. The system is capable of generating stories exceeding 10,000 words while maintaining coherence and consistency, addressing some of the key challenges faced by current story generation models. We achieve state-of-the-art performance across several metrics. This approach offers a scalable and innovative solution for creating dynamic, immersive long-form stories that evolve organically from agent-driven interactions.

[250] Enhancing Long Chain-of-Thought Reasoning through Multi-Path Plan Aggregation cs.CLPDF

Siheng Xiong, Ali Payani, Faramarz Fekri

TL;DR: 论文提出了一种名为多路径计划聚合（MPPA）的框架，通过探索和聚合多个候选计划来增强语言模型的推理能力，并结合在线Step-DPO方法优化训练过程，显著提升了长序列推理任务的性能。

Details

Motivation: 现有的单次推理方法容易导致推理链偏离正确方向（CoT derailment），尤其是能力有限的小模型在处理长推理链时更为严重。论文旨在通过分析推理层级结构，发现大多数错误源于规划步骤，从而提出改进方案。

Result: 在数学、科学和逻辑推理基准测试中，仅使用10%的SFT数据和5%的偏好对，MPPA方法显著优于DeepSeek-R1蒸馏基准和结果奖励RL基准。

Insight: 优化规划步骤比直接优化执行步骤更有效，分步监督（如Step-DPO）在长序列任务中比传统的结果奖励RL更具优势。

Abstract: Inference-time scaling enhances the reasoning ability of a language model (LM) by extending its chain-of-thought (CoT). However, existing approaches typically generate the entire reasoning chain in a single forward pass, which often leads to CoT derailment, i.e., the reasoning trajectory drifting off course due to compounding errors. This problem is particularly severe for smaller LMs with long CoTs due to their limited capacity. To address this, we analyze raw long CoTs and uncover a reasoning hierarchy consisting of planning and execution steps. Our analysis reveals that most reasoning errors stem from incorrect planning. Motivated by this observation, we propose Multi-Path Plan Aggregation (MPPA), a framework that augments single-pass reasoning with plan exploration and aggregation. Following a variable interval schedule based on the token position, MPPA generates multiple candidate plans and aggregates them into a refined planning step. To maintain efficiency, we adopt a minimal design in which the base LM serves as the primary policy, while a lightweight LoRA module implements the plan aggregation policy. We further observe that outcome-reward RL is inefficient for long trajectories (e.g., exceeding 4K tokens). To overcome this, we introduce online Step-DPO, a process-level preference optimization scheme that leverages Twisted Sequential Monte Carlo (TSMC) to provide scalable stepwise supervision using small LMs. This yields more efficient training, improved stability, and higher accuracy. Extensive experiments on challenging math, science, and logical reasoning benchmarks demonstrate that, with only 10% SFT data and 5% of preference pairs, our method outperforms both the DeepSeek-R1 distillation baseline and the outcome-reward RL baseline across multiple base models and tasks.

[251] ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems cs.CLPDF

Xin Gui, King Zhu, JinCheng Ren, Qianben Chen, Zekun Moore Wang

TL;DR: 论文提出了Acadreason基准，旨在评估大型语言模型（LLMs）和智能代理在多领域学术问题中的高级推理能力。

Details

Motivation: 现有评估主要集中于数学/编程竞赛或通用任务，缺乏针对学术领域复杂推理能力的严格基准。

Result: 结果表明，多数LLMs得分低于20分，顶级GPT-5仅得16分；代理表现较好，但最高不超过40分。

Insight: Acadreason揭示了LLMs和代理在学术研究任务中的能力差距，为未来研究方向提供了挑战。

Abstract: In recent years, the research focus of large language models (LLMs) and agents has shifted increasingly from demonstrating novel capabilities to complex reasoning and tackling challenging tasks. However, existing evaluations focus mainly on math/code contests or general tasks, while existing multi-domain academic benchmarks lack sufficient reasoning depth, leaving the field without a rigorous benchmark for high-level reasoning. To fill this gap, we introduce the Acadreason benchmark, designed to evaluate the ability of LLMs and agents to acquire and reason over academic knowledge. It consists of 50 expert-annotated academic problems across five high-reasoning domains, including computer science, economics, law, mathematics, and philosophy. All questions are sourced from top-tier publications in recent years and undergo rigorous annotation and quality control to ensure they are both challenging and answerable. We conduct systematic evaluations of over 10 mainstream LLMs and agents. The results show that most LLMs scored below 20 points, with even the cutting-edge GPT-5 achieving only 16 points. While agents achieved higher scores, none exceeded 40 points. This demonstrates the current capability gap between LLMs and agents in super-intelligent academic research tasks and highlights the challenges of Acadreason.

[252] Scaling Language-Centric Omnimodal Representation Learning cs.CL | cs.AI | cs.CVPDF

Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied

TL;DR: 本文提出了一种语言中心的全模态嵌入框架（LCO-Emb），揭示了多模态大语言模型（MLLMs）在生成预训练中隐含的跨模态对齐优势，并通过实验验证了对比学习的轻量化微调作用。作者还提出了生成-表征比例定律（GRSL），表明生成能力的提升可以增强表征质量。

Details

Motivation: 现有的多模态嵌入方法虽然表现优异，但其优势的深层原因尚未明确。本文旨在探索MLLMs在生成预训练中隐含的跨模态对齐机制，并利用这一发现提升表征学习效果。

Result: LCO-Emb在多种主干模型和基准测试中均达到最先进性能。GRSL在低资源视觉文档检索任务中得到验证，表明生成预训练能进一步提升嵌入能力。

Insight: 生成能力的提升是增强表征质量的有效途径，MLLMs在生成预训练中隐含的对齐机制为多模态表征学习提供了新视角。

Abstract: Recent multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results, yet the underlying reasons behind their superiority remain underexplored. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space for generating unimodal outputs. Through analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, allowing CL to serve as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scales positively with the MLLM’s generative capabilities. This suggests that improving generative abilities evolves as an effective paradigm for enhancing representation quality. We provide a theoretical explanation of GRSL, which formally links the MLLM’s generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model’s embedding capabilities. Codes, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.

[253] When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents cs.CLPDF

Lingfei Qian, Xueqing Peng, Yan Wang, Vincent Jim Zhang, Huan He

TL;DR: 论文提出了Agent Market Arena (AMA)，这是首个用于在多市场实时环境中评估基于大语言模型（LLM）的交易代理的基准测试平台，填补了现有研究的空白。

Details

Motivation: 当前研究中，基于LLM的金融交易代理在实时市场中的推理和适应能力尚不明确，且现有测试多集中于模型而非代理，覆盖范围有限。AMA旨在提供一个公平、连续的评估框架。

Result: 实验表明，代理框架表现出显著不同的行为模式（如激进或保守），而模型主干对结果的影响较小。AMA为LLM代理的金融推理能力提供了可复现的评估基础。

Insight: 代理的行为模式在交易中起关键作用，而模型选择的影响相对较小，这为未来代理设计和优化提供了方向。

Abstract: Although Large Language Model (LLM)-based agents are increasingly used in financial trading, it remains unclear whether they can reason and adapt in live markets, as most studies test models instead of agents, cover limited periods and assets, and rely on unverified data. To address these gaps, we introduce Agent Market Arena (AMA), the first lifelong, real-time benchmark for evaluating LLM-based trading agents across multiple markets. AMA integrates verified trading data, expert-checked news, and diverse agent architectures within a unified trading framework, enabling fair and continuous comparison under real conditions. It implements four agents, including InvestorAgent as a single-agent baseline, TradeAgent and HedgeFundAgent with different risk styles, and DeepFundAgent with memory-based reasoning, and evaluates them across GPT-4o, GPT-4.1, Claude-3.5-haiku, Claude-sonnet-4, and Gemini-2.0-flash. Live experiments on both cryptocurrency and stock markets demonstrate that agent frameworks display markedly distinct behavioral patterns, spanning from aggressive risk-taking to conservative decision-making, whereas model backbones contribute less to outcome variation. AMA thus establishes a foundation for rigorous, reproducible, and continuously evolving evaluation of financial reasoning and trading intelligence in LLM-based agents.

[254] Demystifying Reinforcement Learning in Agentic Reasoning cs.CLPDF

Zhaochen Yu, Ling Yang, Jiaru Zou, Shuicheng Yan, Mengdi Wang

TL;DR: 论文通过系统研究揭示了强化学习在代理推理中的关键设计原则，提出了数据、算法和推理模式三方面的优化实践，显著提升了小模型的代理推理能力。

Details

Motivation: 尽管代理强化学习（RL）已显示出提升大型语言模型（LLMs）代理推理能力的潜力，但其关键设计原则和最佳实践仍不明确。本文旨在填补这一空白。

Result: 通过优化实践，4B规模的模型在代理推理任务中超越了32B规模的模型，并在四个挑战性基准（如AIME2024和GPQA-Diamond）中取得了优越性能。

Insight: 高质量数据集、探索友好技术和高效推理模式是提升代理强化学习性能的关键。小模型通过优化设计也能实现与大模型媲美的推理能力。

Abstract: Recently, the emergence of agentic RL has showcased that RL could also effectively improve the agentic reasoning ability of LLMs, yet the key design principles and optimal practices remain unclear. In this work, we conduct a comprehensive and systematic investigation to demystify reinforcement learning in agentic reasoning from three key perspectives: data, algorithm, and reasoning mode. We highlight our key insights: (i) Replacing stitched synthetic trajectories with real end-to-end tool-use trajectories yields a far stronger SFT initialization; high-diversity, model-aware datasets sustain exploration and markedly improve RL performance. (ii) Exploration-friendly techniques are crucial for agentic RL, such as clip higher, overlong reward shaping, and maintaining adequate policy entropy could improve the training efficiency. (iii) A deliberative strategy with fewer tool calls outperforms frequent tool calls or verbose self-reasoning, improving tool efficiency and final accuracy. Together, these simple practices consistently enhance agentic reasoning and training efficiency, achieving strong results on challenging benchmarks with smaller models, and establishing a practical baseline for future agentic RL research. Beyond these empirical insights, we further contribute a high-quality, real end-to-end agentic SFT dataset along with a high-quality RL dataset, and demonstrate the effectiveness of our insights in boosting the agentic reasoning ability of LLMs across four challenging benchmarks, including AIME2024/AIME2025, GPQA-Diamond, and LiveCodeBench-v6. With our recipes, 4B-sized models could also achieve superior agentic reasoning performance compared to 32B-sized models. Code and models: https://github.com/Gen-Verse/Open-AgentRL

[255] Are Large Reasoning Models Interruptible? cs.CL | cs.LGPDF

Tsung-Han Wu, Mihran Miroyan, David M. Chan, Trevor Darrell, Narges Norouzi

TL;DR: 论文探讨了大型推理模型（LRMs）在动态环境中的鲁棒性问题，揭示了静态评估会高估模型的性能，因为在中断或上下文变化时，模型的准确率可能大幅下降。

Details

Motivation: 传统评估假设推理模型在静态环境中运行，而现代推理任务（如辅助编程）需要长时间推理且上下文可能动态变化，因此需要验证LRMs在动态场景中的表现。

Result: 静态评估高估了LRMs的鲁棒性：在动态场景中，即使是顶尖模型也可能出现失败，准确率下降高达60%，并表现出‘推理泄漏’、‘恐慌’和‘自我怀疑’等新型失败模式。

Insight: 动态环境中的中断和上下文变化对LRMs的性能影响显著，未来的模型设计和评估需考虑动态性以提高实际应用的可靠性。

Abstract: Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, “frozen world” settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the duration of the response. While generally true for short-term tasks, the “frozen world” assumption breaks down in modern reasoning tasks such as assistive programming, where models may take hours to think through problems and code may change dramatically from the time the model starts thinking to the model’s final output. In this work, we challenge the frozen world assumption and evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the quality of the model’s partial outputs on a limited budget, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where models fold the reasoning into their final answer when interrupted; panic, where under time pressure models abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades while incorporating updated information.

cs.CE [Back]

[256] Comparative Evaluation of Neural Network Architectures for Generalizable Human Spatial Preference Prediction in Unseen Built Environments cs.CE | cs.CV | cs.LG | cs.MAPDF

Maral Doctorarastoo, Katherine A. Flanigan, Mario Bergés, Christopher McComb

TL;DR: 本文通过对比图神经网络（GNN）、卷积神经网络（CNN）和前馈神经网络（FFNN）在合成数据上的表现，研究了这些架构在预测未见建筑环境中人类空间偏好的泛化能力。

Details

Motivation: 预测人类在建筑环境中的空间偏好对发展Cyber-Physical-Social Infrastructure Systems（CPSIS）至关重要，但现有模型在未见环境中的泛化能力尚不明确。

Result: 结果显示，GNN在预测人类空间偏好时表现最佳，尤其是在未见环境中具有更高的泛化能力。

Insight: 研究表明，图神经网络能够更好地捕捉空间和上下文依赖关系，是预测人类偏好的理想选择。

Abstract: The capacity to predict human spatial preferences within built environments is instrumental for developing Cyber-Physical-Social Infrastructure Systems (CPSIS). A significant challenge in this domain is the generalizability of preference models, particularly their efficacy in predicting preferences within environmental configurations not encountered during training. While deep learning models have shown promise in learning complex spatial and contextual dependencies, it remains unclear which neural network architectures are most effective at generalizing to unseen layouts. To address this, we conduct a comparative study of Graph Neural Networks, Convolutional Neural Networks, and standard feedforward Neural Networks using synthetic data generated from a simplified and synthetic pocket park environment. Beginning with this illustrative case study, allows for controlled analysis of each model’s ability to transfer learned preference patterns to unseen spatial scenarios. The models are evaluated based on their capacity to predict preferences influenced by heterogeneous physical, environmental, and social features. Generalizability score is calculated using the area under the precision-recall curve for the seen and unseen layouts. This generalizability score is appropriate for imbalanced data, providing insights into the suitability of each neural network architecture for preference-aware human behavior modeling in unseen built environments.

cs.SD [Back]

[257] VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents cs.SD | cs.CLPDF

Jiliang Hu, Wenfu Wang, Zuchao Li, Chenxing Li, Yiyang Zhao

TL;DR: VCB Bench是一个高质量的中文语音聊天机器人评测基准，首次完全基于真实人类语音构建，填补了现有评测在语言多样性、语音真实性和多维度评测上的不足。

Details

Motivation: 当前大规模音频语言模型（LALMs）的多模态对话系统评测基准多为英语、依赖合成语音，且缺乏全面的多维度评测。VCB Bench旨在解决这些问题，推动中文语音对话模型的进步。

Result: 实验揭示了当前LALMs的性能差距，为未来优化提供了方向。VCB Bench展现出较高的可重复性和细粒度评测能力。

Insight: 真实语音数据和多维评测对语音对话模型的开发至关重要；当前LALMs在中文环境下的表现仍有提升空间。

Abstract: Recent advances in large audio language models (LALMs) have greatly enhanced multimodal conversational systems. However, existing benchmarks remain limited – they are mainly English-centric, rely on synthetic speech, and lack comprehensive, discriminative evaluation across multiple dimensions. To address these gaps, we present Voice Chat Bot Bench (VCB Bench) – a high-quality Chinese benchmark built entirely on real human speech. VCB Bench evaluates LALMs from three complementary perspectives: instruction following (including speech-level control beyond text commands), knowledge understanding (general knowledge, reasoning, and daily dialogue), and robustness (stability under perturbations in content, environment, and speaker traits). Experiments on representative LALMs reveal notable performance gaps and highlight future directions for improvement. VCB Bench provides a reproducible and fine-grained evaluation framework, offering standardized methodology and practical insights for advancing Chinese voice conversational models.

[258] Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap cs.SD | cs.AI | cs.CL | cs.LG | eess.ASPDF

KiHyun Nam, Jongmin Choi, Hyeongkeun Lee, Jungwoo Heo, Joon Son Chung

TL;DR: Diffusion-Link是一种基于扩散概率模型的轻量级模块，用于减少音频-文本模态之间的差距，从而提升多模态编码器与大型语言模型的耦合效果。在自动音频字幕任务中，该方法在不依赖外部知识的情况下取得了最优性能。

Details

Motivation: 现有的对比音频-语言预训练方法在多模态编码器与大型语言模型（LLMs）的耦合中仍然存在音频-文本模态差距问题。本研究旨在通过扩散概率模型解决这一问题。

Result: Diffusion-Link在相似性和几何标准上显著减少了模态差距，并在AudioCaps数据集上的零样本和完全监督字幕任务中分别提升了52.5%和7.5%的性能。

Insight: 研究显示，减少模态差距是多模态编码器与LLMs有效耦合的关键，扩散概率模型为解决模态差距提供了新的方向。

Abstract: Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that generatively maps audio embeddings into the text-embedding distribution. The module is trained at the output embedding from the frozen multimodal encoder and implemented as a lightweight network with three residual MLP blocks. To assess the effect of Diffusion-Link on multimodal encoder-LLM coupling, we evaluate on Automatic Audio Captioning (AAC); to our knowledge, this is the first application of diffusion-based modality bridging to AAC. We report two results. (1) Modality-gap analysis: on similarity and geometric criteria, Diffusion-Link reduces the modality gap the most among prior diffusion-based methods and shows a collective migration of audio embeddings toward the text distribution. (2) Downstream AAC: attaching Diffusion-Link to the same multimodal LLM baseline achieves state-of-the-art on AudioCaps in both zero-shot and fully supervised captioning without external knowledge, with relative gains up to 52.5% and 7.5%, respectively. These findings show that closing the modality gap is pivotal for effective coupling between multimodal encoders and LLMs, and diffusion-based modality bridging offers a promising direction beyond knowledge-retrieval-centric designs. Code will be released upon acceptance https://github.com/DevKiHyun/Diffusion-Link

cs.CR [Back]

[259] ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test cs.CR | cs.AI | cs.CL | cs.CV | cs.LGPDF

Guan-Yan Yang, Tzu-Yu Cheng, Ya-Wen Teng, Farn Wanga, Kuo-Hui Yeh

TL;DR: 论文提出ArtPerception框架，利用ASCII艺术绕过LLMs的安全措施，采用两阶段方法实现高效攻击。

Details

Motivation: 现有LLMs的安全对齐主要依赖语义解释，容易受到非标准数据表示的攻击，亟需解决这一漏洞。

Result: 在开源和商业LLMs上验证了高效攻击能力，并成功对抗多种防御工具。

Insight: LLMs安全需防御多模态攻击，即使是纯文本输入也可能存在漏洞。

Abstract: The integration of Large Language Models (LLMs) into computer applications has introduced transformative capabilities but also significant security challenges. Existing safety alignments, which primarily focus on semantic interpretation, leave LLMs vulnerable to attacks that use non-standard data representations. This paper introduces ArtPerception, a novel black-box jailbreak framework that strategically leverages ASCII art to bypass the security measures of state-of-the-art (SOTA) LLMs. Unlike prior methods that rely on iterative, brute-force attacks, ArtPerception introduces a systematic, two-phase methodology. Phase 1 conducts a one-time, model-specific pre-test to empirically determine the optimal parameters for ASCII art recognition. Phase 2 leverages these insights to launch a highly efficient, one-shot malicious jailbreak attack. We propose a Modified Levenshtein Distance (MLD) metric for a more nuanced evaluation of an LLM’s recognition capability. Through comprehensive experiments on four SOTA open-source LLMs, we demonstrate superior jailbreak performance. We further validate our framework’s real-world relevance by showing its successful transferability to leading commercial models, including GPT-4o, Claude Sonnet 3.7, and DeepSeek-V3, and by conducting a rigorous effectiveness analysis against potential defenses such as LLaMA Guard and Azure’s content filters. Our findings underscore that true LLM security requires defending against a multi-modal space of interpretations, even within text-only inputs, and highlight the effectiveness of strategic, reconnaissance-based attacks. Content Warning: This paper includes potentially harmful and offensive model outputs.

[260] Bag of Tricks for Subverting Reasoning-based Safety Guardrails cs.CR | cs.CLPDF

Shuo Chen, Zhen Han, Haokun Chen, Bailan He, Shengyun Si

TL;DR: 该论文揭示了基于推理安全防护的脆弱性，并提出一系列绕过防护的攻击方法，展示了其在多种LRMs上的高攻击成功率，强调了改进对齐技术的紧迫性。

Details

Motivation: 研究发现，尽管基于推理的安全防护（如审议对齐）在大型推理模型（LRMs）中表现出强大的防御能力，但它们对输入提示的细微操纵极为脆弱，可能导致更严重的危害。

Result: 攻击方法在5个不同基准测试上实现了超过90%的成功率，展示了基于推理安全防护的系统性漏洞。

Insight: 论文揭示了LRMs安全防护的脆弱性，强调需要更强大的对齐技术以防止恶意滥用，尤其是在开源模型中。

Abstract: Recent reasoning-based safety guardrails for Large Reasoning Models (LRMs), such as deliberative alignment, have shown strong defense against jailbreak attacks. By leveraging LRMs’ reasoning ability, these guardrails help the models to assess the safety of user inputs before generating final responses. The powerful reasoning ability can analyze the intention of the input query and will refuse to assist once it detects the harmful intent hidden by the jailbreak methods. Such guardrails have shown a significant boost in defense, such as the near-perfect refusal rates on the open-source gpt-oss series. Unfortunately, we find that these powerful reasoning-based guardrails can be extremely vulnerable to subtle manipulation of the input prompts, and once hijacked, can lead to even more harmful results. Specifically, we first uncover a surprisingly fragile aspect of these guardrails: simply adding a few template tokens to the input prompt can successfully bypass the seemingly powerful guardrails and lead to explicit and harmful responses. To explore further, we introduce a bag of jailbreak methods that subvert the reasoning-based guardrails. Our attacks span white-, gray-, and black-box settings and range from effortless template manipulations to fully automated optimization. Along with the potential for scalable implementation, these methods also achieve alarmingly high attack success rates (e.g., exceeding 90% across 5 different benchmarks on gpt-oss series on both local host models and online API services). Evaluations across various leading open-source LRMs confirm that these vulnerabilities are systemic, underscoring the urgent need for stronger alignment techniques for open-sourced LRMs to prevent malicious misuse. Code is open-sourced at https://chenxshuo.github.io/bag-of-tricks.

[261] SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents cs.CR | cs.CVPDF

Zonghao Ying, Yangguang Shao, Jianle Gan, Gan Xu, Junjie Shen

TL;DR: SecureWebArena提出了一种全面的安全评估基准，用于评估基于大型视觉-语言模型（LVLM）的网页代理的安全性，填补了现有基准在覆盖范围和攻击向量多样性上的不足。

Details

Motivation: 现有的安全评估基准仅覆盖狭窄场景（如用户级提示操纵），无法全面捕捉LVLM网页代理的漏洞。因此，需要一种更全面的评估方法。

Result: 实验表明，所有测试的LVLM代理均对微妙的对抗性操纵表现出脆弱性，揭示了模型专用化与安全性之间的权衡。

Insight: 1. LVLM代理的安全性问题广泛存在；2. 专用化模型并非总能提高安全性；3. 多维评估可更精确地揭示代理的漏洞。

Abstract: Large vision-language model (LVLM)-based web agents are emerging as powerful tools for automating complex online tasks. However, when deployed in real-world environments, they face serious security risks, motivating the design of security evaluation benchmarks. Existing benchmarks provide only partial coverage, typically restricted to narrow scenarios such as user-level prompt manipulation, and thus fail to capture the broad range of agent vulnerabilities. To address this gap, we present \tool{}, the first holistic benchmark for evaluating the security of LVLM-based web agents. \tool{} first introduces a unified evaluation suite comprising six simulated but realistic web environments (\eg, e-commerce platforms, community forums) and includes 2,970 high-quality trajectories spanning diverse tasks and attack settings. The suite defines a structured taxonomy of six attack vectors spanning both user-level and environment-level manipulations. In addition, we introduce a multi-layered evaluation protocol that analyzes agent failures across three critical dimensions: internal reasoning, behavioral trajectory, and task outcome, facilitating a fine-grained risk analysis that goes far beyond simple success metrics. Using this benchmark, we conduct large-scale experiments on 9 representative LVLMs, which fall into three categories: general-purpose, agent-specialized, and GUI-grounded. Our results show that all tested agents are consistently vulnerable to subtle adversarial manipulations and reveal critical trade-offs between model specialization and security. By providing (1) a comprehensive benchmark suite with diverse environments and a multi-layered evaluation pipeline, and (2) empirical insights into the security challenges of modern LVLM-based web agents, \tool{} establishes a foundation for advancing trustworthy web agent deployment.

cs.MA [Back]

Thi-Nhung Nguyen, Linhao Luo, Thuy-Trang Vu, Dinh Phung

TL;DR: 这篇论文研究了多智能体系统（MAS）中刻板偏见的产生、传播和放大现象，发现相比于单智能体系统，MAS在偏见鲁棒性上表现较差，但合作和辩论式交流可以缓解偏见放大。

Details

Motivation: 尽管大型语言模型（LLM）的偏见已经被广泛研究，但多智能体系统中的偏见动态尚未得到充分探索。随着MAS的兴起，理解偏见如何在这些系统中涌现和传播变得尤为重要。

Result: 结果表明，MAS的偏见鲁棒性通常低于单智能体系统，但合作和辩论式交流可以减轻偏见的放大，同时更鲁棒的底层LLM能提升系统稳定性。

Insight: 研究发现，偏见在MAS中往往通过“群体内偏爱”早期涌现，强调了设计公平和鲁棒的多智能体系统时需要考虑的关键因素。

Abstract: Bias in large language models (LLMs) remains a persistent challenge, manifesting in stereotyping and unfair treatment across social groups. While prior research has primarily focused on individual models, the rise of multi-agent systems (MAS), where multiple LLMs collaborate and communicate, introduces new and largely unexplored dynamics in bias emergence and propagation. In this work, we present a comprehensive study of stereotypical bias in MAS, examining how internal specialization, underlying LLMs and inter-agent communication protocols influence bias robustness, propagation, and amplification. We simulate social contexts where agents represent different social groups and evaluate system behavior under various interaction and adversarial scenarios. Experiments on three bias benchmarks reveal that MAS are generally less robust than single-agent systems, with bias often emerging early through in-group favoritism. However, cooperative and debate-based communication can mitigate bias amplification, while more robust underlying LLMs improve overall system stability. Our findings highlight critical factors shaping fairness and resilience in multi-agent LLM systems.

[263] Automating Structural Engineering Workflows with Large Language Model Agents cs.MA | cs.AI | cs.CE | cs.CLPDF

Haoran Liang, Yufa Zhou, Mohammad Talebi Kalaleh, Qipei Mei

TL;DR: 论文介绍了MASSE，一种基于大型语言模型的多智能体系统，旨在自动化结构工程工作流程，显著减少专家工作量并提高效率和准确性。

Details

Motivation: 结构工程领域虽然经济影响巨大，但其核心工作流程几十年来几乎未变，亟需现代化和自动化。

Result: 在实际案例验证中，MASSE将专家工作量从约两小时缩减至几分钟，同时提升了可靠性和准确性。

Insight: 大型语言模型在多智能体系统中的应用潜力巨大，可显著提升传统领域的效率和精确性。

Abstract: We introduce $\textbf{MASSE}$, the first Multi-Agent System for Structural Engineering, effectively integrating large language model (LLM)-based agents with real-world engineering workflows. Structural engineering is a fundamental yet traditionally stagnant domain, with core workflows remaining largely unchanged for decades despite its substantial economic impact and global market size. Recent advancements in LLMs have significantly enhanced their ability to perform complex reasoning, long-horizon planning, and precise tool utilization – capabilities well aligned with structural engineering tasks such as interpreting design codes, executing load calculations, and verifying structural capacities. We present a proof-of-concept showing that most real-world structural engineering workflows can be fully automated through a training-free LLM-based multi-agent system. MASSE enables immediate deployment in professional environments, and our comprehensive validation on real-world case studies demonstrates that it can reduce expert workload from approximately two hours to mere minutes, while enhancing both reliability and accuracy in practical engineering scenarios.

cs.LG [Back]

[264] Group-Adaptive Adversarial Learning for Robust Fake News Detection Against Malicious Comments cs.LG | cs.AI | cs.CLPDF

Zhao Tong, Chunlin Gong, Yimeng Gu, Haichao Shi, Qiang Liu

TL;DR: 论文提出了一种基于群体适应性的对抗学习方法，用于增强假新闻检测模型对恶意评论的鲁棒性，通过分类评论、生成多样化攻击并动态调整训练焦点来实现。

Details

Motivation: 现有的假新闻检测模型在标准设置下表现良好，但对恶意评论（尤其是由真实用户或大语言模型生成的对抗性评论）的鲁棒性不足。这些评论会微妙地影响模型决策，从而降低检测效果。

Result: 实验表明，该方法在保持高检测准确率的同时，显著提升了模型对多种对抗性评论扰动的鲁棒性。

Insight: 通过心理学的分类方法可以有效组织对抗性评论；动态调整训练焦点能够更高效地提升模型的鲁棒性。

Abstract: The spread of fake news online distorts public judgment and erodes trust in social media platforms. Although recent fake news detection (FND) models perform well in standard settings, they remain vulnerable to adversarial comments-authored by real users or by large language models (LLMs)-that subtly shift model decisions. In view of this, we first present a comprehensive evaluation of comment attacks to existing fake news detectors and then introduce a group-adaptive adversarial training strategy to improve the robustness of FND models. To be specific, our approach comprises three steps: (1) dividing adversarial comments into three psychologically grounded categories: perceptual, cognitive, and societal; (2) generating diverse, category-specific attacks via LLMs to enhance adversarial training; and (3) applying a Dirichlet-based adaptive sampling mechanism (InfoDirichlet Adjusting Mechanism) that dynamically adjusts the learning focus across different comment categories during training. Experiments on benchmark datasets show that our method maintains strong detection accuracy while substantially increasing robustness to a wide range of adversarial comment perturbations.

[265] Building a Foundational Guardrail for General Agentic Systems via Synthetic Data cs.LG | cs.AI | cs.CLPDF

Yue Huang, Hang Hua, Yujun Zhou, Pengcheng Jing, Manish Nagireddy

TL;DR: 该论文提出了一种通过合成数据为通用代理系统构建基础防护栏的方法，解决了现有防护栏主要在事后执行的问题，并填补了数据、模型和评估三大空白。

Details

Motivation: 现有的防护机制大多在代理执行动作后才会介入，难以扩展且在计划阶段缺乏可控监管。本文旨在通过在计划阶段干预，防止潜在风险。

Result: 实验表明，Safiron在Pre-Exec Bench上表现优于基线，并提供可操作的实践模板。

Insight: 计划阶段的干预是防范代理风险的关键，合成数据和统一输入格式的适配器是实现可控监管的有效工具。

Abstract: While LLM agents can plan multi-step tasks, intervening at the planning stage-before any action is executed-is often the safest way to prevent harm, since certain risks can lead to severe consequences once carried out. However, existing guardrails mostly operate post-execution, which is difficult to scale and leaves little room for controllable supervision at the plan level. To address this challenge, we highlight three critical gaps in current research: data gap, model gap, and evaluation gap. To close the data gap, we introduce AuraGen, a controllable engine that (i) synthesizes benign trajectories, (ii) injects category-labeled risks with calibrated difficulty, and (iii) filters outputs via an automated reward model, producing large and reliable corpora for pre-execution safety. To close the guardian model gap, we propose a foundational guardrail Safiron, combining a cross-planner adapter with a compact guardian model. The adapter unifies different input formats, while Safiron flags risky cases, assigns risk types, and generates rationales; trained in two stages with a broadly explored data recipe, Safiron achieves robust transfer across settings. To close the evaluation gap, we release Pre-Exec Bench, a realistic benchmark covering diverse tools and branching trajectories, which measures detection, fine-grained categorization, explanation, and cross-planner generalization in human-verified scenarios. Extensive experiments demonstrate consistent gains of the proposed guardrail over strong baselines on Pre-Exec Bench, and ablations further distill actionable practices, providing a practical template for safer agentic systems.

[266] Translution: Unifying Self-attention and Convolution for Adaptive and Relative Modeling cs.LG | cs.AI | cs.CL | cs.CVPDF

Hehe Fan, Yi Yang, Mohan Kankanhalli, Fei Wu

TL;DR: Translution是一种融合了自注意力机制的自适应识别能力和卷积的相对编码优势的新型操作，并通过轻量级变体α-Translution解决了参数过多的问题，在多个任务中表现优于传统自注意力方法。

Details

Motivation: 自注意力机制和卷积各有优缺点：前者能自适应地识别相关元素但依赖绝对位置编码，后者虽然以相对方式编码但受限于固定核大小，无法自适应选择元素。研究者希望统一这两种方法的优势。

Result: 在计算机视觉和自然语言处理任务中，Translution（包括α-Translution）的准确率超越了传统的自注意力方法。

Insight: 统一自注意力和卷积的优势可以提升模型性能，但需通过巧妙的参数设计来解决计算资源问题。

Abstract: When modeling a given type of data, we consider it to involve two key aspects: 1) identifying relevant elements (e.g., image pixels or textual words) to a central element, as in a convolutional receptive field, or to a query element, as in self-attention, and 2) encoding these tokens effectively. Self-attention can adaptively identify these elements but relies on absolute positional embedding for structural representation learning. In contrast, convolution encodes elements in a relative manner, yet their fixed kernel size limits their ability to adaptively select the relevant elements. In this paper, we introduce Translution, an operation that unifies the adaptive identification capability of self-attention and the relative encoding advantage of convolution. However, this integration leads to a substantial increase in the number of parameters, exceeding most currently available computational resources. Therefore, we propose a lightweight variant of Translution, named {\alpha}-Translution. Experiments on computer vision and natural language processing tasks show that Translution (including {\alpha}-Translution) achieves superior accuracy compared to self-attention. The code is available at https://github.com/hehefan/Translution.

[267] RLFR: Extending Reinforcement Learning for LLMs with Flow Environment cs.LG | cs.AI | cs.CLPDF

Jinghao Zhang, Naishan Zheng, Ruilin Li, Dongzhou Cheng, Zheming Liang

TL;DR: RLFR提出了一种新颖的强化学习框架，通过潜在空间的流场（flow field）奖励信号来优化大型语言模型（LLMs）的推理能力，解决了传统RLVR方法中忽视有价值探索的问题。

Details

Motivation: 传统的基于二进制验证的RLVR方法容易忽视推理轨迹中的潜在价值探索，且黄金过程奖励模型（PRMs）标注成本高昂。RLFR旨在通过潜在空间的流场奖励信号低成本地优化推理过程。

Result: 实验表明，流场奖励信号在多模态和语言推理任务中表现可靠，为辅助信号奖励塑造提供了新范式。

Insight: 潜在空间的流场奖励能够高效捕捉上下文依赖关系，为强化学习在LLMs中的应用提供了新思路。

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising framework for improving reasoning abilities in Large Language Models (LLMs). However, policy optimized with binary verification prone to overlook potential valuable exploration in reasoning trajectory. In view of heavy annotation cost of golden Process Reward Models (PRMs), recent works attempt using auxiliary signals for reward shaping of process tokens, involving entropy and likelihood collected from logit space. In this work, we offer a novel perspective on shaping RLVR with flow rewards derived from latent space, and propose RLFR, where the flow fields of model latents are constructed from either off-policy high-quality data and on-policy rejection sampling data, and the velocity deviations of policy latents within it are quantified to serve as a reward signal. RLFR first demonstrates that a well-established flow field can be a sound environment for reward signal collection, highlighting the expressive latent space is much underexplored. Moreover, RLFR is able to compress any off-policy expert data as reference for constituting reward signals, and we show that the efficient context dependence compressed within the hidden states are utilized, rather than individual token-level denotation for context comprehending. Experiments on both language and multimodal reasoning benchmarks demonstrate the reliability of flow rewards, and suggesting a promising paradigm for reward shaping with auxiliary signals.

[268] Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting cs.LG | cs.AI | cs.CLPDF

Michael Y. Hu, Benjamin Van Durme, Jacob Andreas, Harsh Jhamtani

TL;DR: 论文提出了ECHO框架，通过回溯轨迹重写提升语言模型代理的样本效率，在失败尝试中生成优化轨迹，从而在低成本交互环境中实现更高效的在线学习。

Details

Motivation: 语言模型代理在新环境中学习时样本效率低，导致交互成本高的场景（如与人类或物理系统交互）中表现不佳。现有方法未能充分利用语言模型生成或推理反事实轨迹的能力。

Result: ECHO在XMiniGrid和PeopleJoinQA上分别优于基线80%，且在XMiniGrid中超越了Reflexion和AWM等复杂代理架构，展示了更好的环境适应能力。

Insight: 通过语言模型直接生成优化轨迹，可以有效利用失败经验，显著提高样本效率，为低成本交互场景提供了实用解决方案。

Abstract: Language model (LM) agents deployed in novel environments often exhibit poor sample efficiency when learning from sequential interactions. This significantly hinders the usefulness of such agents in environments where interaction is costly (for example, when they interact with humans or reset physical systems). While a number of existing LM agent architectures incorporate various mechanisms for experience storage and reflection, they make limited use of LMs’ abilities to directly generate or reason about full counterfactual trajectories. We introduce ECHO (Experience Consolidation via Hindsight Optimization), a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents. ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts, effectively creating synthetic positive examples from unsuccessful interactions. Our approach consists of two components: a hindsight rule that uses the language model itself to identify relevant subgoals and generate optimized trajectories, and an update rule that maintains compressed trajectory representations in memory. We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation. Across both domains, ECHO outperforms vanilla language agent baselines by up to 80%; in XMiniGrid, it also outperforms a number of sophisticated agent architectures including Reflexion and AWM, demonstrating faster adaptation to novel environments through more effective utilization of past experiences.

[269] Find Your Optimal Teacher: Personalized Data Synthesis via Router-Guided Multi-Teacher Distillation cs.LG | cs.CLPDF

Hengyuan Zhang, Shiping Yang, Xiao Liang, Chenming Shang, Yuxuan Jiang

TL;DR: 该论文提出了一种名为PerSyn的新方法，通过路由引导的多教师蒸馏策略，为学生模型定制合成数据，从而提高学习效率。

Details

Motivation: 现有研究表明，更强的教师模型并非总是最优选择，教师输出与学生可学习性之间存在不匹配问题，因此需要一种个性化数据合成方法。

Result: 实验表明，PerSyn在不同模型家族和规模上均优于或与基线方法相当，验证了其有效性。

Insight: 个性化数据合成能显著提升学生模型学习效率，路由器设计是关键。未来的研究可进一步优化路由策略或扩展应用场景。

Abstract: Training student models on synthetic data generated by strong teacher models is a promising way to distilling the capabilities of teachers. However, recent studies show that stronger models are not always optimal teachers, revealing a mismatch between teacher outputs and student learnability. To address this issue, we propose PerSyn (Personalized data Synthesis), a novel synthesis strategy that operates under a new Route then Generate'' paradigm to create data tailored to each student model, enabling it to learn more effectively. Specifically, PerSyn first assigns each prompt to its optimal teacher via a query-level router that jointly considers student learnability and teacher response quality. Each teacher then synthesizes data only for its assigned prompts, making the process more efficient than the conventional Generate then Select’’ paradigm, where all teachers must generate parallel responses for the entire prompt set before constructing the final dataset. Extensive experiments across different model families and scales demonstrate that PerSyn consistently achieves superior or comparable performance to all baselines in instruct tuning and math reasoning settings. Further analysis verifies the effectiveness of PerSyn and offers extra insights to propel future research.

[270] Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning cs.LG | cs.AI | cs.CL | stat.MLPDF

Xiaoyun Zhang, Xiaojian Yuan, Di Huang, Wang You, Chen Hu

TL;DR: 本文重新审视了熵正则化在大型语言模型（LLM）强化学习中的作用，提出了自适应熵正则化（AER）框架，通过动态调整系数来解决任务难度差异和探索需求，显著提升了模型的推理能力。

Details

Motivation: 在强化学习验证奖励（RLVR）训练中，策略熵崩溃（policy entropy collapse）导致策略过于确定性，限制了模型的探索能力和推理性能。传统的熵正则化因其固定系数而不稳定，未能充分发挥潜力。

Result: 在多个数学推理基准测试中，AER显著优于基线方法，提升了推理准确性和探索能力。

Insight: 熵正则化的潜力被低估，动态调整系数可以更好地适应任务需求；适度的策略熵范围对平衡探索与利用至关重要。

Abstract: Reasoning ability has become a defining capability of Large Language Models (LLMs), with Reinforcement Learning with Verifiable Rewards (RLVR) emerging as a key paradigm to enhance it. However, RLVR training often suffers from policy entropy collapse, where the policy becomes overly deterministic, hindering exploration and limiting reasoning performance. While entropy regularization is a common remedy, its effectiveness is highly sensitive to the fixed coefficient, making it unstable across tasks and models. In this work, we revisit entropy regularization in RLVR and argue that its potential has been largely underestimated. Our analysis shows that (i) tasks of varying difficulty demand distinct exploration intensities, and (ii) balanced exploration may require the policy entropy to be maintained within a moderate range below its initial level. Therefore, we propose Adaptive Entropy Regularization (AER)–a framework that dynamically balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.

[271] EAGER: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling cs.LG | cs.AI | cs.CLPDF

Daniel Scalena, Leonidas Zotos, Elisabetta Fersini, Malvina Nissim, Ahmet Üstün

TL;DR: 论文提出了一种名为EAGER的训练无关生成方法，通过基于熵分布的动态计算资源分配，优化推理模型的性能和效率。

Details

Motivation: 现有推理语言模型在生成候选序列时通常为所有提示分配相同的计算资源，忽略了不同提示的复杂性差异。为了提高效率并减少冗余计算，作者提出了EAGER方法。

Result: 在AIME 2025等复杂推理基准测试中，EAGER无需目标标签即可优化资源分配，实现了推理长度与Pass@k的最佳平衡；若目标标签可用，还能减少65%的令牌生成并提升37%的Pass@k。

Insight: 动态计算资源分配是优化推理语言模型性能的关键，利用模型不确定性可以减少冗余计算并提高效率。

Abstract: With the rise of reasoning language models and test-time scaling methods as a paradigm for improving model performance, substantial computation is often required to generate multiple candidate sequences from the same prompt. This enables exploration of different reasoning paths toward the correct solution, however, allocates the same compute budget for each prompt. Grounded on the assumption that different prompts carry different degrees of complexity, and thus different computation needs, we propose EAGer, a training-free generation method that leverages model uncertainty through token-wise entropy distribution to reduce redundant computation and concurrently improve overall performance. EAGer allows branching to multiple reasoning paths only in the presence of high-entropy tokens, and then reallocates the saved compute budget to the instances where exploration of alternative paths is most needed. We find that across multiple open-source models on complex reasoning benchmarks such as AIME 2025, EAGer can reallocate the budget without accessing target labels, achieving the best efficiency-performance trade-off in terms of reasoning length and Pass@k. When target labels are accessible, EAGer generates up to 65% fewer tokens (hence saving compute) and achieves up to 37% improvement in Pass@k compared to the Full Parallel Sampling.

[272] Can Tool-Integrated Reinforcement Learning Generalize Across Diverse Domains? cs.LG | cs.CLPDF

Zhengyu Chen, Jinluan Yang, Teng Xiao, Ruochen Zhou, Luan Zhang

TL;DR: 论文探讨了基于大语言模型（LLM）的工具增强强化学习在不同领域的泛化能力，并提出了一种名为TGRL的框架来促进领域无关学习和技能迁移。

Details

Motivation: 尽管LLM在工具使用和推理方面表现出色，但其在不同领域的泛化能力仍未得到充分研究。本文旨在验证RL驱动的工具使用方法是否能在训练领域外实现有效迁移。

Result: 实验结果表明，RL驱动的工具学习方法能有效迁移到其他领域，实现高性能和高效token使用。

Insight: 工具增强的RL具备跨领域泛化潜力，标准化设计和模块化思维是关键。

Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable capabilities in reasoning and tool utilization. However, the generalization of tool-augmented reinforcement learning (RL) across diverse domains remains underexplored. In this work, we investigate the cross-domain generalization of an LLM agent equipped with a code interpreter tool, which is exclusively trained on mathematical problem-solving tasks. Despite the restricted training domain, we evaluate the agent’s performance across several distinct reasoning domains. The results reveal that RL-based tool usage learned from mathematical tasks can be effectively transferred to complex tasks in other domains, enabling great task performance and high token efficiency. To facilitate this cross-domain transfer, we propose a Tool Generalization Reinforcement Learning (TGRL) framework designed to promote domain-agnostic learning and skill migration, encompassing: (i) a standardized tool interface that abstracts domain-specific nuances through consistent formatting and explicit termination, fostering transferable invocation patterns; (ii) a dual-component reward system that decomposes rewards to incentivize generalizable behaviors like tool efficiency and reasoning abstraction, ensuring alignment and robustness across domain shifts; and (iii) an XML-based prompt template that separates thinking, tool calls, and responses to encourage modular, domain-invariant planning and coherent multi-turn interactions. Extensive experiments across diverse benchmarks validate our approach, achieving state-of-the-art performance and highlighting the cross-domain potential of Tool RL for LLM reasoning.

[273] ENIGMA: The Geometry of Reasoning and Alignment in Large-Language Models cs.LG | cs.AI | cs.CL | 68T50 | I.2.7PDF

Gareth Seneque, Lap-Hang Ho, Nafise Erfanian Saeedi, Jeffrey Molendijk, Ariel Kupermann

TL;DR: ENIGMA是一种新型的大语言模型（LLM）训练方法，通过信息几何优化联合提升推理、对齐和鲁棒性，将组织政策视为信息流形上的方向。结合GRPO、SAMI和Sinkhorn正则化实现单循环训练，提出的Sufficiency Index（SI）指标优化原则选择，实验表明高SI原则预测训练更稳定且性能提升。

Details

Motivation: 现有大语言模型在推理、对齐和鲁棒性上需要复杂的多步骤优化，ENIGMA旨在通过信息几何视角统一这些目标，简化训练流程。

Result: 在1B参数LLM上实验显示，高SI原则提升训练稳定性和下游性能，信息几何分析验证模型流形的结构变化符合预期。

Insight: 推理、对齐和鲁棒性可能源自单一信息几何目标的不同投影，ENIGMA无需奖励模型即可实现原则性推理，为可信能力提供路径。

Abstract: We present Entropic Mutual-Information Geometry Large-Language Model Alignment (ENIGMA), a novel approach to Large-Language Model (LLM) training that jointly improves reasoning, alignment and robustness by treating an organisation’s policies/principles as directions to move on a model’s information manifold. Our single-loop trainer combines Group-Relative Policy Optimisation (GRPO), an on-policy, critic-free RL method with Chain-of-Thought (CoT)-format only rewards; a Self-Supervised Alignment with Mutual Information (SAMI)-style symmetric InfoNCE auxiliary; and an entropic Sinkhorn optimal-transport regulariser on hidden-state distributions to bound geometry drift. We also introduce infoNCE metrics that specialise to a standard MI lower bound under matched negatives to measure how strongly a model’s CoT encodes these policies. These metrics include a Sufficiency Index (SI) that enables the selection and creation of principles that maximise downstream performance prior to training. In our experiments using small (1B) LLMs, high-SI principles predict steadier training dynamics and improved benchmark performance over GRPO ablations. Our information-geometry analysis of trained models validates desirable structural change in the manifold. These results support our hypothesis that reasoning, alignment, and robustness are projections of a single informationgeometric objective, and that models trained using ENIGMA demonstrate principled reasoning without the use of a reward model, offering a path to trusted capability

[274] ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding cs.LG | cs.CLPDF

Yuhang Li, Chenchen Zhang, Ruilin Lv, Ao Liu, Ken Deng

TL;DR: ReLook是一个基于视觉的强化学习框架，通过多模态大语言模型（MLLM）作为工具，实现前端代码生成的闭环优化。

Details

Motivation: 大型语言模型（LLMs）在算法代码生成方面表现优异，但在前端开发中由于正确性依赖于渲染像素和交互性而表现不佳。ReLook旨在解决这一问题。

Result: 在三个广泛使用的基准测试中，ReLook在视觉前端代码生成任务中优于基线方法。

Insight: 通过视觉反馈和强化学习的结合，可以有效提升前端代码生成的质量和鲁棒性。

Abstract: While Large Language Models (LLMs) excel at algorithmic code generation, they struggle with front-end development, where correctness is judged on rendered pixels and interaction. We present ReLook, an agentic, vision-grounded reinforcement learning framework that empowers an agent to close a robust generate–diagnose–refine loop by invoking a multimodal LLM (MLLM) as a tool. During training, the agent uses the MLLM-in-the-loop both as a visual critic–scoring code with screenshots–and as a source of actionable, vision-grounded feedback; a strict zero-reward rule for invalid renders anchors renderability and prevents reward hacking. To prevent behavioral collapse, we introduce Forced Optimization, a strict acceptance rule that admits only improving revisions, yielding monotonically better trajectories. At inference, we decouple the critic and run a lightweight, critic-free self-edit cycle, keeping latency comparable to base decoding while retaining most of the gains. Across three widely used benchmarks, ReLook consistently outperforms strong baselines in vision-grounded front-end code generation, highlighting the benefits of agentic perception, visual rewards, and training-inference decoupling.

[275] Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models cs.LG | cs.AI | cs.CLPDF

Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li

TL;DR: BGPO是一种针对扩散大语言模型的内存高效强化学习算法，通过构造特殊的ELBO下界解决传统方法中内存开销大的问题，提高了样本量和性能表现。

Details

Motivation: 现有方法在近似扩散大语言模型的似然函数时，由于需要保留蒙特卡洛样本的前向计算图，导致内存开销巨大，限制了样本量并降低了近似精度。

Result: 实验表明，BGPO在数学问题求解、代码生成和规划任务中显著优于现有RL算法。

Insight: 通过精心设计的线性下界优化RL目标，既能解决内存问题，又能保持目标等效性，为高效RL算法设计提供了新思路。

Abstract: A key challenge in applying reinforcement learning (RL) to diffusion large language models (dLLMs) lies in the intractability of their likelihood functions, which are essential for the RL objective, necessitating corresponding approximation in each training step. While existing methods approximate the log-likelihoods by their evidence lower bounds (ELBOs) via customized Monte Carlo (MC) sampling, the forward computational graphs of all MC samples need to be retained for the gradient computation of non-linear terms in the RL objective, resulting in significant memory overhead. This constraint restricts feasible sample sizes, leading to imprecise likelihood approximations and ultimately distorting the RL objective. To overcome this limitation, we propose \emph{Boundary-Guided Policy Optimization} (BGPO), a memory-efficient RL algorithm that maximizes a specially constructed lower bound of the ELBO-based objective. This lower bound is carefully designed to satisfy two key properties: (1) Linearity: it is formulated in a linear sum where each term depends only on a single MC sample, thereby enabling gradient accumulation across samples and ensuring constant memory usage; (2) Equivalence: Both the value and gradient of this lower bound are equal to those of the ELBO-based objective in on-policy training, making it also an effective approximation for the original RL objective. These properties allow BGPO to adopt a large MC sample size, resulting in more accurate likelihood approximations and improved RL objective estimation, which in turn leads to enhanced performance. Experiments show that BGPO significantly outperforms previous RL algorithms for dLLMs in math problem solving, code generation, and planning tasks.

[276] QeRL: Beyond Efficiency – Quantization-enhanced Reinforcement Learning for LLMs cs.LG | cs.CL | cs.CVPDF

Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao

TL;DR: QeRL是一个结合NVFP4量化和LoRA的增强型强化学习框架，旨在提高大型语言模型（LLM）的推理效率，同时通过量化噪声增强策略探索，进一步优化训练效果。

Details

Motivation: 当前LLM的强化学习（RL）训练资源消耗大，需要大量GPU内存和长时间的计算周期。QeRL旨在解决这些问题，同时探索量化噪声对策略探索的潜在好处。

Result: 实验显示，QeRL在rollout阶段实现了1.5倍以上的加速，并在7B模型上匹配了全参数微调的性能（GSM8K 90.8%，MATH 500 77.4%），同时具有更高的奖励增长速度和最终精度。

Insight: 量化噪声可以增加策略熵，从而增强探索能力，帮助发现更好的策略。QeRL的成功表明，量化不仅提升效率，还能通过动态噪声调节优化RL训练效果。

Abstract: We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs’ reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, enhancing exploration, and enabling the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise during training. Experiments demonstrate that QeRL delivers over 1.5 times speedup in the rollout phase. Moreover, this is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while delivering overall speedups for RL training. It also achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while matching the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) in the 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.

[277] Gradient-Sign Masking for Task Vector Transport Across Pre-Trained Models cs.LG | cs.AI | cs.CVPDF

Filippo Rinaldi, Aniello Panariello, Giacomo Salici, Fengyuan Liu, Marco Ciccone

TL;DR: 该论文提出了一种名为GradFix的方法，通过利用梯度符号结构来跨预训练模型传输任务向量，避免了重复微调的需求。

Details

Motivation: 当基础模型发布新版本时，通常需要重复全微调过程，即使任务在旧版本中已解决。本文旨在通过任务向量传输避免这一重复过程。

Result: 在视觉和语言基准测试中表现显著优于任务向量直接添加和小样本微调方法。

Insight: 梯度符号结构是任务向量成功传输的关键，局部对齐目标损失空间的策略有效提升了传输性能。

Abstract: When a new release of a foundation model is published, practitioners typically need to repeat full fine-tuning, even if the same task has already been solved in the previous version. A promising alternative is to reuse the parameter changes (i.e., task vectors) that capture how a model adapts to a specific task. However, they often fail to transfer across different pre-trained models due to their misaligned parameter space. In this work, we show that the key to successful transfer lies in the sign structure of the gradients of the new model. Based on this insight, we propose GradFix, a novel method that approximates the ideal gradient sign structure and leverages it to transfer knowledge using only a handful of labeled samples. Notably, this requires no additional fine-tuning: the adaptation is achieved by computing a few gradients at the target model and masking the source task vector accordingly. This yields an update that is locally aligned with the target loss landscape, effectively rebasing the task vector onto the new pre-training. We provide a theoretical guarantee that our method ensures first-order descent. Empirically, we demonstrate significant performance gains on vision and language benchmarks, consistently outperforming naive task vector addition and few-shot fine-tuning.

[278] Learning What Matters: Steering Diffusion via Spectrally Anisotropic Forward Noise cs.LG | cs.AI | cs.CVPDF

Luca Scimeca, Thomas Jiralerspong, Berton Earnshaw, Jason Hartford, Yoshua Bengio

TL;DR: 本文提出了一种通过频谱各向异性前向噪声引导扩散模型的方法（SAGD），通过结构化协方差替换传统各向同性噪声，以更好地适应数据分布，提升生成性能并实现选择性忽略。

Details

Motivation: 扩散概率模型（DPMs）的生成性能虽强，但其归纳偏好（inductive biases）通常隐含。本文旨在通过在训练和采样中显式引入归纳偏好，以更好地建模目标数据分布。

Result: 实验表明，SAGD在多视觉数据集上优于标准扩散模型，并能选择性忽略特定频段的已知噪声。

Insight: 各向异性噪声为扩散模型的归纳偏好提供了一种简单且可解释的控制手段，展示了频段操作在新数据建模任务中的潜力。

Abstract: Diffusion Probabilistic Models (DPMs) have achieved strong generative performance, yet their inductive biases remain largely implicit. In this work, we aim to build inductive biases into the training and sampling of diffusion models to better accommodate the target distribution of the data to model. We introduce an anisotropic noise operator that shapes these biases by replacing the isotropic forward covariance with a structured, frequency-diagonal covariance. This operator unifies band-pass masks and power-law weightings, allowing us to emphasize or suppress designated frequency bands, while keeping the forward process Gaussian. We refer to this as spectrally anisotropic Gaussian diffusion (SAGD). In this work, we derive the score relation for anisotropic covariances and show that, under full support, the learned score converges to the true data score as $t!\to!0$, while anisotropy reshapes the probability-flow path from noise to data. Empirically, we show the induced anisotropy outperforms standard diffusion across several vision datasets, and enables selective omission: learning while ignoring known corruptions confined to specific bands. Together, these results demonstrate that carefully designed anisotropic forward noise provides a simple, yet principled, handle to tailor inductive bias in DPMs.

Changchang Sun, Vickie Chen, Yan Yan

TL;DR: 本文提出了一种新颖的语义一致性知识蒸馏方法（SODA），用于解决跨模态哈希学习中多标签语义提取未能与原始多模态数据明确交互的问题。SODA通过引入多标签信息作为新的文本模态，并设计跨模态教师网络来蒸馏语义特征，从而学习一个更优的汉明空间。实验表明该方法优于现有技术。

Details

Motivation: 现有深度跨模态哈希方法在学习语义信息时，未能显式地将多标签语义提取与原始多模态数据交互，导致学习的语义信息与异构多模态数据不兼容，从而影响跨模态鸿沟的性能。

Result: 在两个基准数据集上的实验表明，SODA优于现有的最先进方法。

Insight: 通过引入多标签作为文本模态，并利用知识蒸馏技术，可以显式地学习跨模态语义信息，从而提升跨模态哈希的性能。

Abstract: Recently, deep supervised cross-modal hashing methods have achieve compelling success by learning semantic information in a self-supervised way. However, they still suffer from the key limitation that the multi-label semantic extraction process fail to explicitly interact with raw multimodal data, making the learned representation-level semantic information not compatible with the heterogeneous multimodal data and hindering the performance of bridging modality gap. To address this limitation, in this paper, we propose a novel semantic cohesive knowledge distillation scheme for deep cross-modal hashing, dubbed as SODA. Specifically, the multi-label information is introduced as a new textual modality and reformulated as a set of ground-truth label prompt, depicting the semantics presented in the image like the text modality. Then, a cross-modal teacher network is devised to effectively distill cross-modal semantic characteristics between image and label modalities and thus learn a well-mapped Hamming space for image modality. In a sense, such Hamming space can be regarded as a kind of prior knowledge to guide the learning of cross-modal student network and comprehensively preserve the semantic similarities between image and text modality. Extensive experiments on two benchmark datasets demonstrate the superiority of our model over the state-of-the-art methods.

[280] Deep Neural Networks Inspired by Differential Equations cs.LG | cs.AI | cs.CV | cs.NA | math.NA | A.1; I.2; I.4PDF

Yongshuai Liu, Lianfang Wang, Kuilin Qin, Qinghua Zhang, Faqiang Wang

TL;DR: 本文回顾了受微分方程启发的深度神经网络架构和动态建模方法，重点讨论了基于ODE和SDE的网络模型及其性能和特点，旨在提升模型的可解释性和泛化能力。

Details

Motivation: 深度学习的成功伴随着理论理解、可解释性和泛化能力的挑战，微分方程的视角为这些问题提供了统一的框架和系统性设计方法。

Result: 结果表明，微分方程启发的模型在可解释性和泛化能力方面具有潜力。

Insight: 将微分方程与深度学习结合可以为开发智能计算方法提供新思路，尤其在提升模型的可解释性和泛化能力方面前景广阔。

Abstract: Deep learning has become a pivotal technology in fields such as computer vision, scientific computing, and dynamical systems, significantly advancing these disciplines. However, neural Networks persistently face challenges related to theoretical understanding, interpretability, and generalization. To address these issues, researchers are increasingly adopting a differential equations perspective to propose a unified theoretical framework and systematic design methodologies for neural networks. In this paper, we provide an extensive review of deep neural network architectures and dynamic modeling methods inspired by differential equations. We specifically examine deep neural network models and deterministic dynamical network constructs based on ordinary differential equations (ODEs), as well as regularization techniques and stochastic dynamical network models informed by stochastic differential equations (SDEs). We present numerical comparisons of these models to illustrate their characteristics and performance. Finally, we explore promising research directions in integrating differential equations with deep learning to offer new insights for developing intelligent computational methods that boast enhanced interpretability and generalization capabilities.

[281] Reliable Active Learning from Unreliable Labels via Neural Collapse Geometry cs.LG | cs.CVPDF

Atharv Goel, Sharat Agarwal, Saket Anand, Chetan Arora

TL;DR: NCAL-R利用深度网络的几何规律性，提出两种互补信号（类均值对齐扰动和特征波动）来选择可靠的样本，有效减少标注噪声和数据分布偏移的影响，在ImageNet-100和CIFAR100上表现优于传统AL方法。

Details

Motivation: 传统主动学习（AL）方法在面对噪声标签或数据分布偏移时表现不佳，常因选择错误或冗余样本而放大标注错误。需要一种更可靠的方法来应对这些挑战。

Result: 在ImageNet-100和CIFAR100上优于传统AL方法，实现了更高的准确率和更强的泛化能力。

Insight: 利用深度网络的几何规律性可以有效提升主动学习的可靠性和鲁棒性，适用于实际标注任务。

Abstract: Active Learning (AL) promises to reduce annotation cost by prioritizing informative samples, yet its reliability is undermined when labels are noisy or when the data distribution shifts. In practice, annotators make mistakes, rare categories are ambiguous, and conventional AL heuristics (uncertainty, diversity) often amplify such errors by repeatedly selecting mislabeled or redundant samples. We propose Reliable Active Learning via Neural Collapse Geometry (NCAL-R), a framework that leverages the emergent geometric regularities of deep networks to counteract unreliable supervision. Our method introduces two complementary signals: (i) a Class-Mean Alignment Perturbation score, which quantifies how candidate samples structurally stabilize or distort inter-class geometry, and (ii) a Feature Fluctuation score, which captures temporal instability of representations across training checkpoints. By combining these signals, NCAL-R prioritizes samples that both preserve class separation and highlight ambiguous regions, mitigating the effect of noisy or redundant labels. Experiments on ImageNet-100 and CIFAR100 show that NCAL-R consistently outperforms standard AL baselines, achieving higher accuracy with fewer labels, improved robustness under synthetic label noise, and stronger generalization to out-of-distribution data. These results suggest that incorporating geometric reliability criteria into acquisition decisions can make Active Learning less brittle to annotation errors and distribution shifts, a key step toward trustworthy deployment in real-world labeling pipelines. Our code is available at https://github.com/Vision-IIITD/NCAL.

[282] Causality $\neq$ Decodability, and Vice Versa: Lessons from Interpreting Counting ViTs cs.LG | cs.CVPDF

Lianghuan Huang, Yingshan Chang

TL;DR: 论文研究了视觉Transformer（ViT）中解码性和因果性的区别，通过激活修补和线性探针实验，发现两者并不一致，揭示了信息的存在与使用之间的差异。

Details

Motivation: 动机是解耦神经网络中的解码性和因果性，明确它们的不同作用，以便更好地理解模型的内部工作机制。

Result: 结果显示，中层目标token具有强因果性但弱解码性，而最终层目标token具有高解码性但功能惰性；CLS token在中层可解码但仅在最终层具有因果性。

Insight: 研究发现解码性和因果性是互补的，它们的差异可以帮助揭示隐藏的计算电路。

Abstract: Mechanistic interpretability seeks to uncover how internal components of neural networks give rise to predictions. A persistent challenge, however, is disentangling two often conflated notions: decodability–the recoverability of information from hidden states–and causality–the extent to which those states functionally influence outputs. In this work, we investigate their relationship in vision transformers (ViTs) fine-tuned for object counting. Using activation patching, we test the causal role of spatial and CLS tokens by transplanting activations across clean-corrupted image pairs. In parallel, we train linear probes to assess the decodability of count information at different depths. Our results reveal systematic mismatches: middle-layer object tokens exert strong causal influence despite being weakly decodable, whereas final-layer object tokens support accurate decoding yet are functionally inert. Similarly, the CLS token becomes decodable in mid-layers but only acquires causal power in the final layers. These findings highlight that decodability and causality reflect complementary dimensions of representation–what information is present versus what is used–and that their divergence can expose hidden computational circuits.

[283] Decomposer Networks: Deep Component Analysis and Synthesis cs.LG | cs.CV | cs.IT | cs.NE | math.ITPDF

Mohsen Joneidi

TL;DR: Decomposer Networks（DecompNet）是一种语义自编码器，通过并行分支将输入分解为多个可解释的组件，采用残差更新规则实现竞争的语义表示。

Details

Motivation: 传统自编码器将输入压缩为单一潜在表示，难以捕捉多组件语义。DecompNet旨在通过并行分支和残差更新规则，实现输入的语义分解。

Result: 相比线性分解方法（PCA、NMF）和目标中心架构（MONet、IODINE等），DecompNet能生成更具语义意义的表示。

Insight: 通过残差更新规则引入竞争机制，可以高效实现多组件的语义分解，适用于复杂输入的解析。

Abstract: We propose the Decomposer Networks (DecompNet), a semantic autoencoder that factorizes an input into multiple interpretable components. Unlike classical autoencoders that compress an input into a single latent representation, the Decomposer Network maintains N parallel branches, each assigned a residual input defined as the original signal minus the reconstructions of all other branches. By unrolling a Gauss–Seidel style block-coordinate descent into a differentiable network, DecompNet enforce explicit competition among components, yielding parsimonious, semantically meaningful representations. We situate our model relative to linear decomposition methods (PCA, NMF), deep unrolled optimization, and object-centric architectures (MONet, IODINE, Slot Attention), and highlight its novelty as the first semantic autoencoder to implement an all-but-one residual update rule.

[284] INR-Bench: A Unified Benchmark for Implicit Neural Representations in Multi-Domain Regression and Reconstruction cs.LG | cs.CVPDF

Linfei Li, Fengyi Zhang, Zhong Wang, Lin Zhang, Ying Shen

TL;DR: 论文提出了一个名为INR-Bench的统一基准，用于评估隐式神经网络表示（INRs）在多领域回归和重建任务中的表现，分析了模型架构、位置编码和非线性原语对信号频率响应的影响。

Details

Motivation: 隐式神经网络表示（INRs）因其连续性和无限分辨率的优势在信号处理任务中表现出色，但其有效性及限制因素尚未充分探索。为了更好地理解这些因素，作者从神经网络切线核（NTK）理论出发，分析了模型架构、位置编码和非线性原语对信号频率响应的影响。

Result: INR-Bench提供了一个稳健的平台，突出了不同神经网络模型的优势和局限性，为未来研究奠定了基础。

Insight: 1. 通过NTK理论分析，揭示了模型架构、位置编码和非线性原语对INRs性能的关键影响。2. 实验结果表明，新兴的KAN模型在某些任务中可能优于经典MLP模型，尤其是在处理高频信号时。

Abstract: Implicit Neural Representations (INRs) have gained success in various signal processing tasks due to their advantages of continuity and infinite resolution. However, the factors influencing their effectiveness and limitations remain underexplored. To better understand these factors, we leverage insights from Neural Tangent Kernel (NTK) theory to analyze how model architectures (classic MLP and emerging KAN), positional encoding, and nonlinear primitives affect the response to signals of varying frequencies. Building on this analysis, we introduce INR-Bench, the first comprehensive benchmark specifically designed for multimodal INR tasks. It includes 56 variants of Coordinate-MLP models (featuring 4 types of positional encoding and 14 activation functions) and 22 Coordinate-KAN models with distinct basis functions, evaluated across 9 implicit multimodal tasks. These tasks cover both forward and inverse problems, offering a robust platform to highlight the strengths and limitations of different neural models, thereby establishing a solid foundation for future research. The code and dataset are available at https://github.com/lif314/INR-Bench.

[285] ImpMIA: Leveraging Implicit Bias for Membership Inference Attack under Realistic Scenarios cs.LG | cs.CR | cs.CVPDF

Yuval Golbari, Navve Wasserman, Gal Vardi, Michal Irani

TL;DR: ImpMIA是一种利用神经网络隐式偏置的成员推断攻击方法，无需参考模型，在更现实的隐私攻击场景中表现出色。

Details

Motivation: 现有的黑盒成员推断攻击依赖不现实的假设（如已知训练超参数和训练数据分布），导致性能下降，因此需要一种更通用的方法。

Result: 在仅需模型权重和训练数据超集的现实场景下，ImpMIA性能优于现有黑盒和白盒攻击方法。

Insight: 隐式偏置理论可用于隐私攻击，白盒攻击在模型公开化的趋势下更具实用性。

Abstract: Determining which data samples were used to train a model-known as Membership Inference Attack (MIA)-is a well-studied and important problem with implications for data privacy. Black-box methods presume access only to the model’s outputs and often rely on training auxiliary reference models. While they have shown strong empirical performance, they rely on assumptions that rarely hold in real-world settings: (i) the attacker knows the training hyperparameters; (ii) all available non-training samples come from the same distribution as the training data; and (iii) the fraction of training data in the evaluation set is known. In this paper, we demonstrate that removing these assumptions leads to a significant drop in the performance of black-box attacks. We introduce ImpMIA, a Membership Inference Attack that exploits the Implicit Bias of neural networks, hence removes the need to rely on any reference models and their assumptions. ImpMIA is a white-box attack – a setting which assumes access to model weights and is becoming increasingly realistic given that many models are publicly available (e.g., via Hugging Face). Building on maximum-margin implicit bias theory, ImpMIA uses the Karush-Kuhn-Tucker (KKT) optimality conditions to identify training samples. This is done by finding the samples whose gradients most strongly reconstruct the trained model’s parameters. As a result, ImpMIA achieves state-of-the-art performance compared to both black and white box attacks in realistic settings where only the model weights and a superset of the training data are available.

Qiyi Tong, Olivia Nocentini, Marta Lagomarsino, Kuanqi Cai, Marta Lorenzini

TL;DR: 该文提出了一种名为MLCM-KD的新型框架，结合多级跨模态知识蒸馏（DIKD）方法，从RGB图像中高效转移知识到热成像，用于轻量级面部关键点检测（FLD），显著提升了热成像FLD的性能和效率。

Details

Motivation: 传统热成像FLD方法因缺乏丰富的视觉线索而性能受限，而现有的RGB到热成像的跨模态方法计算开销大或引入结构伪影，无法满足实际部署需求。

Result: 在公共热成像FLD基准测试中取得新SOTA，性能显著优于之前方法，同时大幅降低计算开销。

Insight: 双向知识蒸馏（DIKD）通过闭环监督显著缩小RGB与热成像的模态差距，证明了模态不变特征学习在跨模态任务中的重要性。

Abstract: Facial Landmark Detection (FLD) in thermal imagery is critical for applications in challenging lighting conditions, but it is hampered by the lack of rich visual cues. Conventional cross-modal solutions, like feature fusion or image translation from RGB data, are often computationally expensive or introduce structural artifacts, limiting their practical deployment. To address this, we propose Multi-Level Cross-Modal Knowledge Distillation (MLCM-KD), a novel framework that decouples high-fidelity RGB-to-thermal knowledge transfer from model compression to create both accurate and efficient thermal FLD models. A central challenge during knowledge transfer is the profound modality gap between RGB and thermal data, where traditional unidirectional distillation fails to enforce semantic consistency across disparate feature spaces. To overcome this, we introduce Dual-Injected Knowledge Distillation (DIKD), a bidirectional mechanism designed specifically for this task. DIKD establishes a connection between modalities: it not only guides the thermal student with rich RGB features but also validates the student’s learned representations by feeding them back into the frozen teacher’s prediction head. This closed-loop supervision forces the student to learn modality-invariant features that are semantically aligned with the teacher, ensuring a robust and profound knowledge transfer. Experiments show that our approach sets a new state-of-the-art on public thermal FLD benchmarks, notably outperforming previous methods while drastically reducing computational overhead.

cs.AI [Back]

[287] The Geometry of Reasoning: Flowing Logics in Representation Space cs.AI | cs.CL | cs.LG | cs.LOPDF

Yufa Zhou, Yixiao Wang, Xunjian Yin, Shuyan Zhou, Anru R. Zhang

TL;DR: 该论文提出了一种新的几何框架，将大语言模型（LLM）的推理建模为表征空间中的流（flow），并分离逻辑结构与语义，通过几何量（如位置、速度和曲率）形式化分析推理过程。

Details

Motivation: 研究语言模型如何在表征空间中‘思考’，理解其推理过程是否超越了表面形式的内在逻辑。

Result: 证实LLM推理对应于表征空间中的平滑流，且逻辑语句能控制流的局部速度。

Insight: 几何视角为LLM的推理行为和可解释性研究提供了新工具和分析基础。

Abstract: We study how large language models (LLMs) ``think’’ through their representation space. We propose a novel geometric framework that models an LLM’s reasoning as flows – embedding trajectories evolving where logic goes. We disentangle logical structure from semantics by employing the same natural deduction propositions with varied semantic carriers, allowing us to test whether LLMs internalize logic beyond surface form. This perspective connects reasoning with geometric quantities such as position, velocity, and curvature, enabling formal analysis in representation and concept spaces. Our theory establishes: (1) LLM reasoning corresponds to smooth flows in representation space, and (2) logical statements act as local controllers of these flows’ velocities. Using learned representation proxies, we design controlled experiments to visualize and quantify reasoning flows, providing empirical validation of our theoretical framework. Our work serves as both a conceptual foundation and practical tools for studying reasoning phenomenon, offering a new lens for interpretability and formal analysis of LLMs’ behavior.

[288] The Personalization Trap: How User Memory Alters Emotional Reasoning in LLMs cs.AI | cs.CL | 68T50 | I.2.7PDF

Xi Fang, Weijie Xu, Yuchong Zhang, Stephanie Eckman, Scott Nickleach

TL;DR: 论文探讨了用户记忆如何影响LLMs的情感推理，发现不同用户档案会导致系统性情感解读偏差，且优势群体获得更准确的情感解读。

Details

Motivation: 随着个性化AI系统逐渐融入长期用户记忆，研究用户记忆如何影响LLMs的情感推理成为关键，以避免强化社会不平等。

Result: 发现LLMs在情感理解和支持建议任务中存在显著人口统计差异，个性化可能导致社会不平等。

Insight: 设计个性化AI系统时需警惕用户记忆可能强化社会偏见，情感推理算法需要更加公平。

Abstract: When an AI assistant remembers that Sarah is a single mother working two jobs, does it interpret her stress differently than if she were a wealthy executive? As personalized AI systems increasingly incorporate long-term user memory, understanding how this memory shapes emotional reasoning is critical. We investigate how user memory affects emotional intelligence in large language models (LLMs) by evaluating 15 models on human validated emotional intelligence tests. We find that identical scenarios paired with different user profiles produce systematically divergent emotional interpretations. Across validated user independent emotional scenarios and diverse user profiles, systematic biases emerged in several high-performing LLMs where advantaged profiles received more accurate emotional interpretations. Moreover, LLMs demonstrate significant disparities across demographic factors in emotion understanding and supportive recommendations tasks, indicating that personalization mechanisms can embed social hierarchies into models emotional reasoning. These results highlight a key challenge for memory enhanced AI: systems designed for personalization may inadvertently reinforce social inequalities.

[289] A Layered Intuition – Method Model with Scope Extension for LLM Reasoning cs.AI | cs.CLPDF

Hong Su

TL;DR: 该论文提出了一个分层直觉-方法模型，结合范围扩展，以系统性解决LLM在未见过问题上的推理能力。通过直觉快速反应和方法分解问题，结合垂直、水平、时间和空间扩展，构建知识网络，并提出熵度量来衡量扩展多样性。

Details

Motivation: 现有方法在LLM推理中主要依赖直接矩阵映射，难以系统性解决未见过的问题。本研究旨在通过整合直觉和方法分层模型，结合范围扩展，提升LLM的适应性和推理能力。

Result: 模型通过系统性扩展提高了LLM对未见过问题的解决能力，熵度量有效评估了扩展多样性。

Insight: 范围和时间的扩展为LLM推理提供了更全面的维度，熵度量可推广为评估其他扩展方法的指标。

Abstract: Existing studies have introduced method-based reasoning and scope extension as approaches to enhance Large Language Model (LLM) performance beyond direct matrix mappings. Building on these foundations, this paper summarizes and integrates these ideas into a unified Intuition-Method Layered Model with Scope Extension, designed to address indirected (unseen) issues more systematically. In this framework, intuition-based thinking provides rapid first-reaction answers, while method-based thinking decouples questions and solutions into transferable reasoning units. Scope extension is then applied to broaden applicability, including vertical (cause analysis), horizontal (parallel and generalized issues), and for the first time, temporal and spatial extensions, which expand reasoning across time and contextual dimensions. These extensions are organized into systematic knowledge trees that interconnect into a knowledge network, thereby increasing adaptability. To quantitatively evaluate this process, we propose the entropy of method extension, which measures the independence and diversity of extensions as an indicator of the system’s capacity to solve unseen questions. By logically connecting existing approaches with new extensions and introducing an entropy-based evaluation framework, this work advances toward a more robust and extensible reasoning paradigm for LLMs in real-world problem-solving.

[290] Revisiting Model Interpolation for Efficient Reasoning cs.AI | cs.CLPDF

Taiqiang Wu, Runming Yang, Tao Liu, Jiahao Wang, Ngai Wong

TL;DR: 该论文系统地重新研究了最简单的模型权重直接插值方法，揭示了其遵循三阶段演化范式，并提出了一套实用的框架，以实现高效的推理性能与成本的平衡。

Details

Motivation: 模型插值是一种简单但未被充分研究的模型合并方法，作者希望通过重新研究其性能和行为，提供一种高效且有效的推理解决方案。

Result: 实验证明，策略性插值模型在效率和效果上均优于复杂合并方法。

Insight: 模型插值的动态行为提供了一种性能与成本平衡的原则性指导，简单方法也可实现高效推理。

Abstract: Model merging, typically on Instruct and Thinking models, has shown remarkable performance for efficient reasoning. In this paper, we systematically revisit the simplest merging method that interpolates two weights directly. Particularly, we observe that model interpolation follows a three-stage evolutionary paradigm with distinct behaviors on the reasoning trajectory. These dynamics provide a principled guide for navigating the performance-cost trade-off. Empirical results demonstrate that a strategically interpolated model surprisingly surpasses sophisticated model merging baselines on both efficiency and effectiveness. We further validate our findings with extensive ablation studies on model layers, modules, and decoding strategies. Ultimately, this work demystifies model interpolation and offers a practical framework for crafting models with precisely targeted reasoning capabilities. Code is available at \href{https://github.com/wutaiqiang/MI}{Github}.

cs.CY [Back]

[291] Stop DDoS Attacking the Research Community with AI-Generated Survey Papers cs.CY | cs.AI | cs.CL | cs.IRPDF

Jianghao Lin, Rong Shan, Jiachen Zhu, Yunjia Xi, Yong Yu

TL;DR: 这篇立场论文指出，AI生成的大量综述论文对研究社区造成了类似DDoS攻击的威胁，呼吁制定规范以保障科学记录的质量。

Details

Motivation: 当前AI生成的综述论文泛滥，导致低质量、冗余甚至虚假的内容充斥平台，破坏了研究社区的信任和效率。

Result: 论证了保障综述论文质量的必要性，提出了解决方案。

Insight: AI工具的滥用可能对学术领域造成深远负面影响，亟需透明度和专家监督。

Abstract: Survey papers are foundational to the scholarly progress of research communities, offering structured overviews that guide both novices and experts across disciplines. However, the recent surge of AI-generated surveys, especially enabled by large language models (LLMs), has transformed this traditionally labor-intensive genre into a low-effort, high-volume output. While such automation lowers entry barriers, it also introduces a critical threat: the phenomenon we term the “survey paper DDoS attack” to the research community. This refers to the unchecked proliferation of superficially comprehensive but often redundant, low-quality, or even hallucinated survey manuscripts, which floods preprint platforms, overwhelms researchers, and erodes trust in the scientific record. In this position paper, we argue that we must stop uploading massive amounts of AI-generated survey papers (i.e., survey paper DDoS attack) to the research community, by instituting strong norms for AI-assisted review writing. We call for restoring expert oversight and transparency in AI usage and, moreover, developing new infrastructures such as Dynamic Live Surveys, community-maintained, version-controlled repositories that blend automated updates with human curation. Through quantitative trend analysis, quality audits, and cultural impact discussion, we show that safeguarding the integrity of surveys is no longer optional but imperative to the research community.

cs.GR [Back]

[292] VLM-Guided Adaptive Negative Prompting for Creative Generation cs.GR | cs.CVPDF

Shelly Golan, Yotam Nitzan, Zongze Wu, Or Patashnik

TL;DR: 论文提出了一种无需训练的推理时方法VLM-Guided Adaptive Negative Prompting，通过视觉语言模型（VLM）分析生成过程中的中间输出，自适应地引导生成远离常规视觉概念，从而促进新颖且有效的创意图像生成。该方法在CLIP嵌入空间中评估创意的新颖性和有效性，实验表明其在创意新颖性上有显著提升，且计算开销极小。

Details

Motivation: 现有文本到图像扩散模型难以生成真正新颖的内容，而现有增强生成创意的方法要么依赖于图像特征的插值（限制在预定义类别中），要么需要耗时的嵌入优化或模型微调。因此，需要一种无需训练、高效的方法来引导模型生成新颖且有效的创意内容。

Result: 实验结果表明，该方法在创意新颖性上表现优于现有方法，且计算开销极小。同时，成功扩展到了复杂场景的创意生成。

Insight: 通过动态负提示引导生成过程，可以在不改变模型结构的情况下显著提升创意生成的效果，为创意扩散模型提供了一种高效的新思路。

Abstract: Creative generation is the synthesis of new, surprising, and valuable samples that reflect user intent yet cannot be envisioned in advance. This task aims to extend human imagination, enabling the discovery of visual concepts that exist in the unexplored spaces between familiar domains. While text-to-image diffusion models excel at rendering photorealistic scenes that faithfully match user prompts, they still struggle to generate genuinely novel content. Existing approaches to enhance generative creativity either rely on interpolation of image features, which restricts exploration to predefined categories, or require time-intensive procedures such as embedding optimization or model fine-tuning. We propose VLM-Guided Adaptive Negative-Prompting, a training-free, inference-time method that promotes creative image generation while preserving the validity of the generated object. Our approach utilizes a vision-language model (VLM) that analyzes intermediate outputs of the generation process and adaptively steers it away from conventional visual concepts, encouraging the emergence of novel and surprising outputs. We evaluate creativity through both novelty and validity, using statistical metrics in the CLIP embedding space. Through extensive experiments, we show consistent gains in creative novelty with negligible computational overhead. Moreover, unlike existing methods that primarily generate single objects, our approach extends to complex scenarios, such as generating coherent sets of creative objects and preserving creativity within elaborate compositional prompts. Our method integrates seamlessly into existing diffusion pipelines, offering a practical route to producing creative outputs that venture beyond the constraints of textual descriptions.

eess.IV [Back]

[293] Generative Latent Video Compression eess.IV | cs.CVPDF

Zongyu Guo, Zhaoyang Jia, Jiahao Li, Xiaoyi Zhang, Bin Li

TL;DR: GLVC是一种基于生成潜变量的视频压缩框架，通过预训练的连续标记器将视频帧投影到感知对齐的潜空间中，从而优化率失真与感知效果的平衡。该方法在多个基准测试中表现优异，用户研究表明其在高压缩率下仍能保持稳定的时间一致性。

Details

Motivation: 视频压缩中平衡率失真与感知效果是一个关键挑战，尤其是帧间质量波动导致的闪烁问题。GLVC的提出旨在通过潜生成模型的优势解决这一问题。

Result: GLVC在DISTS和LPIPS指标上表现优异，用户研究表明其在压缩率更高的情况下仍能媲美最新神经视频编解码器。

Insight: 潜生成模型在视频压缩中的应用能够有效平衡感知效果与压缩性能，统一的设计和循环机制是提升时间一致性的关键。

Abstract: Perceptual optimization is widely recognized as essential for neural compression, yet balancing the rate-distortion-perception tradeoff remains challenging. This difficulty is especially pronounced in video compression, where frame-wise quality fluctuations often cause perceptually optimized neural video codecs to suffer from flickering artifacts. In this paper, inspired by the success of latent generative models, we present Generative Latent Video Compression (GLVC), an effective framework for perceptual video compression. GLVC employs a pretrained continuous tokenizer to project video frames into a perceptually aligned latent space, thereby offloading perceptual constraints from the rate-distortion optimization. We redesign the codec architecture explicitly for the latent domain, drawing on extensive insights from prior neural video codecs, and further equip it with innovations such as unified intra/inter coding and a recurrent memory mechanism. Experimental results across multiple benchmarks show that GLVC achieves state-of-the-art performance in terms of DISTS and LPIPS metrics. Notably, our user study confirms GLVC rivals the latest neural video codecs at nearly half their rate while maintaining stable temporal coherence, marking a step toward practical perceptual video compression.

[294] Towards Efficient 3D Gaussian Human Avatar Compression: A Prior-Guided Framework eess.IV | cs.CV | cs.MM | I.4; I.5PDF

Shanzhi Yin, Bolin Chen, Xinju Wu, Ru-Ling Liao, Jie Chen

TL;DR: 本文提出了一种高效的3D人体化身压缩框架，利用紧凑的人体先验和规范到目标的变换，实现了超低比特率下的高质量3D人体化身视频压缩。

Details

Motivation: 现有3D人体化身压缩方法在比特率效率和建模冗余方面存在不足，限制了沉浸式多媒体体验的广泛推广。

Result: 实验表明，该方法在主流多视角人体数据集上显著优于传统2D/3D编解码器和现有动态3D高斯溅射压缩方法。

Insight: 外观与时序运动分离是高效3D压缩的关键，为元宇宙应用中的沉浸式体验提供了可行方案。

Abstract: This paper proposes an efficient 3D avatar coding framework that leverages compact human priors and canonical-to-target transformation to enable high-quality 3D human avatar video compression at ultra-low bit rates. The framework begins by training a canonical Gaussian avatar using articulated splatting in a network-free manner, which serves as the foundation for avatar appearance modeling. Simultaneously, a human-prior template is employed to capture temporal body movements through compact parametric representations. This decomposition of appearance and temporal evolution minimizes redundancy, enabling efficient compression: the canonical avatar is shared across the sequence, requiring compression only once, while the temporal parameters, consisting of just 94 parameters per frame, are transmitted with minimal bit-rate. For each frame, the target human avatar is generated by deforming canonical avatar via Linear Blend Skinning transformation, facilitating temporal coherent video reconstruction and novel view synthesis. Experimental results demonstrate that the proposed method significantly outperforms conventional 2D/3D codecs and existing learnable dynamic 3D Gaussian splatting compression method in terms of rate-distortion performance on mainstream multi-view human video datasets, paving the way for seamless immersive multimedia experiences in meta-verse applications.

cs.IR [Back]

[295] CardRewriter: Leveraging Knowledge Cards for Long-Tail Query Rewriting on Short-Video Platforms cs.IR | cs.CLPDF

Peiyuan Gong, Feiran Zhu, Yaqi Yin, Chenglei Dai, Chao Zhang

TL;DR: CardRewriter 是一个基于大语言模型（LLM）的框架，通过整合领域特定知识卡来优化短视频平台上的长尾查询改写，显著提升查询质量与用户体验，并已在快手平台落地应用。

Details

Motivation: 短视频平台的用户查询（尤其是长尾查询）常因拼写错误、表达不完整或意图模糊导致检索结果与预期不符。现有LLM在非公开内容（如短视频、直播等）上的表现较差，这促使研究者提出一种结合领域知识的改写方法。

Result: 离线实验显示CardRewriter显著提升了针对非公开内容的查询改写质量；在线A/B测试证实其在长播率（LVR）、点击率（CTR）和主动查询改写率（IQRR）方面均有显著改进。

Insight: 领域特定知识的引入能有效弥补LLM在非公开内容上的不足；分层优化的训练流程和定制奖励机制对提升查询改写的实用性和效果至关重要。

Abstract: Short-video platforms have rapidly become a new generation of information retrieval systems, where users formulate queries to access desired videos. However, user queries, especially long-tail ones, often suffer from spelling errors, incomplete phrasing, and ambiguous intent, resulting in mismatches between user expectations and retrieved results. While large language models (LLMs) have shown success in long-tail query rewriting within e-commerce, they struggle on short-video platforms, where proprietary content such as short videos, live streams, micro dramas, and user social networks falls outside their training distribution. To address this challenge, we introduce \textbf{CardRewriter}, an LLM-based framework that incorporates domain-specific knowledge to enhance long-tail query rewriting. For each query, our method aggregates multi-source knowledge relevant to the query and summarizes it into an informative and query-relevant knowledge card. This card then guides the LLM to better capture user intent and produce more effective query rewrites. We optimize CardRewriter using a two-stage training pipeline: supervised fine-tuning followed by group relative policy optimization, with a tailored reward system balancing query relevance and retrieval effectiveness. Offline experiments show that CardRewriter substantially improves rewriting quality for queries targeting proprietary content. Online A/B testing further confirms significant gains in long-view rate (LVR) and click-through rate (CTR), along with a notable reduction in initiative query reformulation rate (IQRR). Since September 2025, CardRewriter has been deployed on Kuaishou, one of China’s largest short-video platforms, serving hundreds of millions of users daily.

[296] REGENT: Relevance-Guided Attention for Entity-Aware Multi-Vector Neural Re-Ranking cs.IR | cs.CLPDF

Shubham Chatterjee

TL;DR: 论文提出了REGENT模型，通过实体引导注意力机制，结合细粒度词法匹配和高层次语义推理，显著提升了神经重排模型的性能。

Details

Motivation: 现有神经重排模型在处理复杂信息需求和内容丰富的长文档时表现不佳，主要原因在于缺乏对关键实体和概念的智能内容选择能力。而人类通常会围绕关键实体和概念构建理解，因此论文提出模仿人类的这种能力。

Result: REGENT在三个挑战性数据集上达到了新的最优性能，相比BM25提升了108%，并显著优于ColBERT和RankVicuna等基线模型。

Insight: 论文揭示了将实体语义直接融入注意力机制的重要性，为未来实体感知的检索模型提供了新方向。

Abstract: Current neural re-rankers often struggle with complex information needs and long, content-rich documents. The fundamental issue is not computational–it is intelligent content selection: identifying what matters in lengthy, multi-faceted texts. While humans naturally anchor their understanding around key entities and concepts, neural models process text within rigid token windows, treating all interactions as equally important and missing critical semantic signals. We introduce REGENT, a neural re-ranking model that mimics human-like understanding by using entities as a “semantic skeleton” to guide attention. REGENT integrates relevance guidance directly into the attention mechanism, combining fine-grained lexical matching with high-level semantic reasoning. This relevance-guided attention enables the model to focus on conceptually important content while maintaining sensitivity to precise term matches. REGENT achieves new state-of-the-art performance in three challenging datasets, providing up to 108% improvement over BM25 and consistently outperforming strong baselines including ColBERT and RankVicuna. To our knowledge, this is the first work to successfully integrate entity semantics directly into neural attention, establishing a new paradigm for entity-aware information retrieval.

[297] FinVet: A Collaborative Framework of RAG and External Fact-Checking Agents for Financial Misinformation Detection cs.IR | cs.AI | cs.CLPDF

Daniel Berhane Araya, Duoduo Liao

TL;DR: FinVet是一个新型多智能体框架，通过结合RAG管道和外部事实核查代理，提升金融错误信息检测的透明度和准确性。

Details

Motivation: 金融市场的错误信息可能导致巨大损失，现有方法缺乏透明度且难以溯源可信来源。

Result: 在FinFact数据集上F1达0.85，比最佳单管道提升10.4%，比独立RAG提升37%。

Insight: 结合RAG与外部核查的动态多智能体方法可显著提升金融信息检测性能，同时增强透明度和可信度。

Abstract: Financial markets face growing threats from misinformation that can trigger billions in losses in minutes. Most existing approaches lack transparency in their decision-making and provide limited attribution to credible sources. We introduce FinVet, a novel multi-agent framework that integrates two Retrieval-Augmented Generation (RAG) pipelines with external fact-checking through a confidence-weighted voting mechanism. FinVet employs adaptive three-tier processing that dynamically adjusts verification strategies based on retrieval confidence, from direct metadata extraction to hybrid reasoning to full model-based analysis. Unlike existing methods, FinVet provides evidence-backed verdicts, source attribution, confidence scores, and explicit uncertainty flags when evidence is insufficient. Experimental evaluation on the FinFact dataset shows that FinVet achieves an F1 score of 0.85, which is a 10.4% improvement over the best individual pipeline (fact-check pipeline) and 37% improvement over standalone RAG approaches.

[298] MTMD: A Multi-Task Multi-Domain Framework for Unified Ad Lightweight Ranking at Pinterest cs.IR | cs.CVPDF

Xiao Yang, Peifeng Yin, Abe Engle, Jinfeng Zhuang, Ling Leng

TL;DR: 论文提出了一个多任务多领域（MTMD）框架，用于统一轻量级广告排序，解决了在多任务学习和多领域数据整合中的挑战，显著提升了离线效果和在线成本效率。

Details

Motivation: 在广告推荐系统中，轻量级排序层是关键，但需要同时优化多个任务（如CTR、CVR）和多领域数据（如不同广告产品和展示位置）。传统多任务学习难以统一处理这些复杂需求。

Result: 离线损失降低12%-36%，在线点击成本降低2%，并取代了9个生产模型。

Insight: 混合专家架构和显式知识迁移是多任务学习在多领域中有效的关键。

Abstract: The lightweight ad ranking layer, living after the retrieval stage and before the fine ranker, plays a critical role in the success of a cascaded ad recommendation system. Due to the fact that there are multiple optimization tasks depending on the ad domain, e.g., Click Through Rate (CTR) for click ads and Conversion Rate (CVR) for conversion ads, as well as multiple surfaces where an ad is served (home feed, search, or related item recommendation) with diverse ad products (shopping or standard ad); it is an essentially challenging problem in industry on how to do joint holistic optimization in the lightweight ranker, such that the overall platform’s value, advertiser’s value, and user’s value are maximized. Deep Neural Network (DNN)-based multitask learning (MTL) can handle multiple goals naturally, with each prediction head mapping to a particular optimization goal. However, in practice, it is unclear how to unify data from different surfaces and ad products into a single model. It is critical to learn domain-specialized knowledge and explicitly transfer knowledge between domains to make MTL effective. We present a Multi-Task Multi-Domain (MTMD) architecture under the classic Two-Tower paradigm, with the following key contributions: 1) handle different prediction tasks, ad products, and ad serving surfaces in a unified framework; 2) propose a novel mixture-of-expert architecture to learn both specialized knowledge each domain and common knowledge shared between domains; 3) propose a domain adaption module to encourage knowledge transfer between experts; 4) constrain the modeling of different prediction tasks. MTMD improves the offline loss value by 12% to 36%, mapping to 2% online reduction in cost per click. We have deployed this single MTMD framework into production for Pinterest ad recommendation replacing 9 production models.

cs.SE [Back]

[299] A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System cs.SE | cs.CLPDF

Jiale Guo, Suizhi Huang, Mei Li, Dong Huang, Xingsheng Chen

TL;DR: 这篇论文对LLM赋能的软件工程代理系统进行了全面的调查，聚焦于基准测试和解决方案的关联性，填补了评估与解决方案之间的关键空白。

Details

Motivation: 随着LLM在软件工程中的应用从传统规则系统转向复杂的代理系统，领域缺乏对基准测试和解决方案如何相互关联的全面理解，阻碍了系统化进展和评估。

Result: 研究发现领域从提示工程发展为包含规划、推理、记忆机制和工具增强的复杂代理系统，并连接了50多个基准测试及其对应的解决方案策略。

Insight: 论文指出了关键研究缺口，如多代理协作框架和自我进化的代码生成系统，并提出了未来研究方向，如LLM方法与形式验证的结合。

Abstract: The integration of LLMs into software engineering has catalyzed a paradigm shift from traditional rule-based systems to sophisticated agentic systems capable of autonomous problem-solving. Despite this transformation, the field lacks a comprehensive understanding of how benchmarks and solutions interconnect, hindering systematic progress and evaluation. This survey presents the first holistic analysis of LLM-empowered software engineering, bridging the critical gap between evaluation and solution approaches. We analyze 150+ recent papers and organize them into a comprehensive taxonomy spanning two major dimensions: (1) Solutions, categorized into prompt-based, fine-tuning-based, and agent-based paradigms, and (2) Benchmarks, covering code generation, translation, repair, and other tasks. Our analysis reveals how the field has evolved from simple prompt engineering to complex agentic systems incorporating planning and decomposition, reasoning and self-refinement, memory mechanisms, and tool augmentation. We present a unified pipeline that illustrates the complete workflow from task specification to final deliverables, demonstrating how different solution paradigms address varying complexity levels across software engineering tasks. Unlike existing surveys that focus on isolated aspects, we provide full-spectrum coverage connecting 50+ benchmarks with their corresponding solution strategies, enabling researchers to identify optimal approaches for specific evaluation criteria. Furthermore, we identify critical research gaps and propose actionable future directions, including multi-agent collaboration frameworks, self-evolving code generation systems, and integration of formal verification with LLM-based methods. This survey serves as a foundational resource for researchers and practitioners seeking to understand, evaluate, and advance LLM-empowered software engineering systems.

cs.RO [Back]

[300] Dejavu: Post-Deployment Learning for Embodied Agents via Experience Feedback cs.RO | cs.AI | cs.CVPDF

Shaokai Wu, Yanbiao Ji, Qiuchang Li, Zhiyi Zhang, Qichen He

TL;DR: Dejavu引入了一种后部署学习框架，通过经验反馈网络（EFN）扩展冻结的视觉-语言-动作（VLA）策略，使具身代理能够基于检索到的执行记忆进行动作预测，并在部署后持续学习。

Details

Motivation: 具身代理在部署后无法更新知识以提升任务表现，限制了其适应性和鲁棒性。

Result: 实验表明，EFN显著提升了适应性、鲁棒性和任务成功率。

Insight: 为具身代理的后部署持续学习提供了一种可行路径。

Abstract: Embodied agents face a fundamental limitation: once deployed in real-world environments to perform specific tasks, they are unable to acquire new useful knowledge to enhance task performance. In this paper, we propose a general post-deployment learning framework called Dejavu, which employs an Experience Feedback Network (EFN) and augments the frozen Vision-Language-Action (VLA) policy with retrieved execution memories. EFN automatically identifies contextually successful prior action experiences and conditions action prediction on this retrieved guidance. We adopt reinforcement learning with semantic similarity rewards on EFN to ensure that the predicted actions align with past successful behaviors under current observations. During deployment, EFN continually enriches its memory with new trajectories, enabling the agent to exhibit “learning from experience” despite fixed weights. Experiments across diverse embodied tasks show that EFN significantly improves adaptability, robustness, and success rates over frozen baselines. These results highlight a promising path toward embodied agents that continually refine their behavior after deployment.

[301] X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model cs.RO | cs.AI | cs.CVPDF

Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang

TL;DR: X-VLA是一种基于软提示的Transformer模型，用于跨具身视觉语言动作（VLA）任务，通过最小化参数增加实现高效的异构数据训练。

Details

Motivation: 现有VLA模型需要高效利用多样化的机器人数据集，但跨具身数据的异构性限制了模型的通用性和适应性。

Result: 在6个模拟环境和3个真实机器人上的实验表明，X-VLA-0.9B在多项基准测试中达到SOTA性能，展示了优越的适应性和灵活性。

Insight: 软提示技术可以有效解决跨具身数据的异构性问题，为通用VLA模型的开发提供了新思路。

Abstract: Successful generalist Vision-Language-Action (VLA) models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To facilitate and leverage the heterogeneity in rich, diverse robotic data sources, we propose a novel Soft Prompt approach with minimally added parameters, by infusing prompt learning concepts into cross-embodiment robot learning and introducing separate sets of learnable embeddings for each distinct data source. These embeddings serve as embodiment-specific prompts, which in unity empower VLA models with effective exploitation of varying cross-embodiment features. Our new X-VLA, a neat flow-matching-based VLA architecture, relies exclusively on soft-prompted standard Transformer encoders, enjoying both scalability and simplicity. Evaluated across 6 simulations as well as 3 real-world robots, our 0.9B instantiation-X-VLA-0.9B simultaneously achieves SOTA performance over a sweep of benchmarks, demonstrating superior results on a wide axes of capabilities, from flexible dexterity to quick adaptation across embodiments, environments, and tasks. Website: https://thu-air-dream.github.io/X-VLA/

[302] SpikeGrasp: A Benchmark for 6-DoF Grasp Pose Detection from Stereo Spike Streams cs.RO | cs.CVPDF

Zhuoheng Gao, Jiyao Zhang, Zhiyong Xie, Hao Dong, Zhaofei Yu

TL;DR: SpikeGrasp提出了一个基于生物启发的6-DoF抓取姿态检测框架，直接处理立体脉冲相机的事件流，避免了显式3D点云重建，并在复杂和无纹理场景中优于传统方法。

Details

Motivation: 现有机器人抓取系统依赖显式3D点云，这与生物视觉处理方式不同。作者探索一种更接近生物视觉处理的新范式。

Result: SpikeGrasp在复杂和无纹理场景中超越传统点云方法，并展示出色的数据效率。

Insight: 通过模仿生物视觉处理路径，可以实现更高效和流畅的机器人操作，特别是在动态对象应用中。

Abstract: Most robotic grasping systems rely on converting sensor data into explicit 3D point clouds, which is a computational step not found in biological intelligence. This paper explores a fundamentally different, neuro-inspired paradigm for 6-DoF grasp detection. We introduce SpikeGrasp, a framework that mimics the biological visuomotor pathway, processing raw, asynchronous events from stereo spike cameras, similarly to retinas, to directly infer grasp poses. Our model fuses these stereo spike streams and uses a recurrent spiking neural network, analogous to high-level visual processing, to iteratively refine grasp hypotheses without ever reconstructing a point cloud. To validate this approach, we built a large-scale synthetic benchmark dataset. Experiments show that SpikeGrasp surpasses traditional point-cloud-based baselines, especially in cluttered and textureless scenes, and demonstrates remarkable data efficiency. By establishing the viability of this end-to-end, neuro-inspired approach, SpikeGrasp paves the way for future systems capable of the fluid and efficient manipulation seen in nature, particularly for dynamic objects.

[303] Into the Unknown: Towards using Generative Models for Sampling Priors of Environment Uncertainty for Planning in Configuration Spaces cs.RO | cs.AI | cs.CVPDF

Subhransu S. Bhattacharjee, Hao Lu, Dylan Campbell, Rahul Shome

TL;DR: 这篇论文提出了一种基于采样的流程，利用预训练的生成模型在零样本情况下生成概率先验，捕捉环境不确定性和空间语义关系，适用于配置空间规划。通过Matterport3D基准测试，展示了该方法在恢复未观察区域的占用和目标位置不确定性方面的有效性。

Details

Motivation: 在部分可观测环境下的规划任务中，先验信息至关重要，但实际中难以获得。为了解决这一问题，论文提出利用生成模型自动生成具备空间语义信息的先验。

Result: 实验表明，该方法生成的先验能够恢复符合真实环境的常识性空间语义，生成的3D点云可用于运动规划，展现了生成模型在机器人规划中的潜力。

Insight: 生成模型能够作为一种丰富的先验信息来源，为机器人规划任务提供多样化和有效的环境不确定性表示。

Abstract: Priors are vital for planning under partial observability, yet difficult to obtain in practice. We present a sampling-based pipeline that leverages large-scale pretrained generative models to produce probabilistic priors capturing environmental uncertainty and spatio-semantic relationships in a zero-shot manner. Conditioned on partial observations, the pipeline recovers complete RGB-D point cloud samples with occupancy and target semantics, formulated to be directly useful in configuration-space planning. We establish a Matterport3D benchmark of rooms partially visible through doorways, where a robot must navigate to an unobserved target object. Effective priors for this setting must represent both occupancy and target-location uncertainty in unobserved regions. Experiments show that our approach recovers commonsense spatial semantics consistent with ground truth, yielding diverse, clean 3D point clouds usable in motion planning, highlight the promise of generative models as a rich source of priors for robotic planning.

[304] SCOOP’D: Learning Mixed-Liquid-Solid Scooping via Sim2Real Generative Policy cs.RO | cs.CVPDF

Kuanning Wang, Yongchong Gu, Yuqian Fu, Zeyu Shangguan, Sicheng He

TL;DR: 该论文提出了SCOOP’D方法，通过仿真到现实的生成策略学习混合液体-固体物质的舀取技能，并在零样本部署中展示了多样性场景中的优异性能。

Details

Motivation: 自主机器人舀取技能在日常生活和灾害救援中具有广泛应用，但由于复杂工具-物体交互和变形物体（如颗粒或液体）的动力学复杂性，开发通用策略极具挑战性。

Result: 在465次多样性实验中，SCOOP’D表现优于所有基线方法，成功应对了不同难度级别的舀取任务。

Insight: 仿真数据结合生成策略可以有效解决复杂变形物体操纵问题，为机器人学习高难度任务提供了新思路。

Abstract: Scooping items with tools such as spoons and ladles is common in daily life, ranging from assistive feeding to retrieving items from environmental disaster sites. However, developing a general and autonomous robotic scooping policy is challenging since it requires reasoning about complex tool-object interactions. Furthermore, scooping often involves manipulating deformable objects, such as granular media or liquids, which is challenging due to their infinite-dimensional configuration spaces and complex dynamics. We propose a method, SCOOP’D, which uses simulation from OmniGibson (built on NVIDIA Omniverse) to collect scooping demonstrations using algorithmic procedures that rely on privileged state information. Then, we use generative policies via diffusion to imitate demonstrations from observational input. We directly apply the learned policy in diverse real-world scenarios, testing its performance on various item quantities, item characteristics, and container types. In zero-shot deployment, our method demonstrates promising results across 465 trials in diverse scenarios, including objects of different difficulty levels that we categorize as “Level 1” and “Level 2.” SCOOP’D outperforms all baselines and ablations, suggesting that this is a promising approach to acquiring robotic scooping skills. Project page is at https://scoopdiff.github.io/.

Table of Contents

cs.CV [Back]

[1] TinyViT-Batten: Few-Shot Vision Transformer with Explainable Attention for Early Batten-Disease Detection on Pediatric MRI cs.CV | cs.AIPDF

[2] Ultralytics YOLO Evolution: An Overview of YOLO26, YOLO11, YOLOv8 and YOLOv5 Object Detectors for Computer Vision and Pattern Recognition cs.CV | cs.AIPDF

[3] OmniSAT: Compact Action Token, Faster Auto Regression cs.CV | cs.ROPDF

[4] Knowledge-Aware Mamba for Joint Change Detection and Classification from MODIS Times Series cs.CVPDF

[5] Multi Camera Connected Vision System with Multi View Analytics: A Comprehensive Survey cs.CVPDF

[6] Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping cs.CV | cs.LGPDF

[7] Towards Understanding Ambiguity Resolution in Multimodal Inference of Meaning cs.CV | cs.AIPDF

[8] Task-Aware Resolution Optimization for Visual Large Language Models cs.CV | cs.CLPDF

[9] Cluster-Aware Prompt Ensemble Learning for Few-Shot Vision-Language Model Adaptation cs.CVPDF

[10] CHUG: Crowdsourced User-Generated HDR Video Quality Dataset cs.CV | cs.AIPDF

[11] SpectralCA: Bi-Directional Cross-Attention for Next-Generation UAV Hyperspectral Vision cs.CV | cs.AI | I.4.8; I.2.6; I.2.10; I.5.1; I.5.4PDF

[12] HeadsUp! High-Fidelity Portrait Image Super-Resolution cs.CVPDF

[13] Denoising Diffusion as a New Framework for Underwater Images cs.CV | cs.AIPDF

[14] Scaling Traffic Insights with AI and Language Model-Powered Camera Systems for Data-Driven Transportation Decision Making cs.CV | eess.IVPDF

[15] FlareX: A Physics-Informed Dataset for Lens Flare Removal via 2D Synthesis and 3D Rendering cs.CVPDF

[16] BurstDeflicker: A Benchmark Dataset for Flicker Removal in Dynamic Scenes cs.CVPDF

[17] MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output cs.CVPDF

[18] Q-Adapter: Visual Query Adapter for Extracting Textually-related Features in Video Captioning cs.CVPDF

[19] P-4DGS: Predictive 4D Gaussian Splatting with 90$\times$ Compression cs.CVPDF

[20] Complementary and Contrastive Learning for Audio-Visual Segmentation cs.CVPDF

[21] Think Twice to See More: Iterative Visual Reasoning in Medical VLMs cs.CV | cs.AIPDF

[22] DREAM: A Benchmark Study for Deepfake REalism AssessMent cs.CVPDF

[23] Collaborative Learning of Semantic-Aware Feature Learning and Label Recovery for Multi-Label Image Recognition with Incomplete Labels cs.CVPDF

[24] Probabilistic Hyper-Graphs using Multiple Randomly Masked Autoencoders for Semi-supervised Multi-modal Multi-task Learning cs.CVPDF

[25] Tracking the Spatiotemporal Evolution of Landslide Scars Using a Vision Foundation Model: A Novel and Universal Framework cs.CVPDF

[26] Gesplat: Robust Pose-Free 3D Reconstruction via Geometry-Guided Gaussian Splatting cs.CVPDF

[27] Cooperative Pseudo Labeling for Unsupervised Federated Classification cs.CV | cs.LGPDF

[28] Answer-Consistent Chain-of-thought Reinforcement Learning For Multi-modal Large Langauge Models cs.CVPDF

[29] Uncertainty-Aware Post-Detection Framework for Enhanced Fire and Smoke Detection in Compact Deep Learning Models cs.CV | cs.AI | cs.LG | eess.IVPDF

[30] Training-Free In-Context Forensic Chain for Image Manipulation Detection and Localization cs.CV | cs.AI | cs.CRPDF

[31] Multi Class Parkinsons Disease Detection Based on Finger Tapping Using Attention-Enhanced CNN BiLSTM cs.CVPDF

[32] DeepFusionNet: Autoencoder-Based Low-Light Image Enhancement and Super-Resolution cs.CV | cs.AI | 68T45, 68T10 | I.2.10; I.4.9PDF

[33] Color3D: Controllable and Consistent 3D Colorization with Personalized Colorizer cs.CVPDF

[34] ReMix: Towards a Unified View of Consistent Character Generation and Editing cs.CVPDF

[35] SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation cs.CV | cs.AIPDF

[36] SparseUWSeg: Active Sparse Point-Label Augmentation for Underwater Semantic Segmentation cs.CVPDF

[37] ViConEx-Med: Visual Concept Explainability via Multi-Concept Token Transformer for Medical Image Analysis cs.CVPDF

[38] TCMA: Text-Conditioned Multi-granularity Alignment for Drone Cross-Modal Text-Video Retrieval cs.CVPDF

[39] Fairness Without Labels: Pseudo-Balancing for Bias Mitigation in Face Gender Classification cs.CV | 68T07 | I.2.10; I.4.8; I.5.4PDF

[40] B2N3D: Progressive Learning from Binary to N-ary Relationships for 3D Object Grounding cs.CVPDF

[41] From Generic to Specialized: A Subspecialty Diagnostic System Powered by Self-Supervised Learning for Cervical Histopathology cs.CVPDF

[42] Semantic Visual Anomaly Detection and Reasoning in AI-Generated Images cs.CVPDF

[43] MRI Brain Tumor Detection with Computer Vision cs.CV | cs.AI | cs.LG | 68T07, 68U10 | I.2.6; I.2.10; I.4.6PDF

[44] Are Video Models Emerging as Zero-Shot Learners and Reasoners in Medical Imaging? cs.CVPDF

[45] Opacity-Gradient Driven Density Control for Compact and Efficient Few-Shot 3D Gaussian Splatting cs.CV | cs.LGPDF

[46] VividAnimator: An End-to-End Audio and Pose-driven Half-Body Human Animation Framework cs.CVPDF

[47] SAM2LoRA: Composite Loss-Guided, Parameter-Efficient Finetuning of SAM2 for Retinal Fundus Segmentation cs.CVPDF

[48] From Programs to Poses: Factored Real-World Scene Generation via Learned Program Libraries cs.CV | cs.AIPDF

[49] Ordinal Scale Traffic Congestion Classification with Multi-Modal Vision-Language and Motion Analysis cs.CVPDF

[50] PointMAC: Meta-Learned Adaptation for Robust Test-Time Point Cloud Completion cs.CVPDF

[51] Vision4PPG: Emergent PPG Analysis Capability of Vision Foundation Models for Vital Signs like Blood Pressure cs.CV | cs.LGPDF

[52] Self-Supervised Multi-Scale Transformer with Attention-Guided Fusion for Efficient Crack Detection cs.CVPDF

[53] Identifying bias in CNN image classification using image scrambling and transforms cs.CV | cs.AIPDF

[54] AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration cs.CVPDF

[55] Mesh-Gait: A Unified Framework for Gait Recognition Through Multi-Modal Representation Learning from 2D Silhouettes cs.CV | cs.AI | cs.LGPDF

[56] Guided Image Feature Matching using Feature Spatial Order cs.CV | eess.IVPDF

[57] Combo-Gait: Unified Transformer Framework for Multi-Modal Gait Recognition and Attribute Analysis cs.CV | cs.AI | cs.LGPDF

[58] Towards Cybersickness Severity Classification from VR Gameplay Videos Using Transfer Learning and Temporal Modeling cs.CVPDF

[59] Taming a Retrieval Framework to Read Images in Humanlike Manner for Augmenting Generation of MLLMs cs.CV | cs.AIPDF

[60] MonoSE(3)-Diffusion: A Monocular SE(3) Diffusion Framework for Robust Camera-to-Robot Pose Estimation cs.CV | cs.ROPDF

[61] On the Problem of Consistent Anomalies in Zero-Shot Industrial Anomaly Detection cs.CV | stat.APPDF

[62] Learning from Disagreement: A Group Decision Simulation Framework for Robust Medical Image Segmentation cs.CV | cs.AIPDF

[63] Post-TIPS Prediction via Multimodal Interaction: A Multi-Center Dataset and Framework for Survival, Complication, and Portal Pressure Assessment cs.CVPDF

[64] When Images Speak Louder: Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance cs.CVPDF

[65] DAGLFNet:Deep Attention-Guided Global-Local Feature Fusion for Pseudo-Image Point Cloud Segmentation cs.CV | cs.LGPDF

[66] MSF-Mamba: Motion-aware State Fusion Mamba for Efficient Micro-Gesture Recognition cs.CVPDF

[67] Towards Self-Refinement of Vision-Language Models with Triangular Consistency cs.CV | cs.AIPDF

[68] VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning cs.CVPDF

[69] Receptive Field Expanded Look-Up Tables for Vision Inference: Advancing from Low-level to High-level Tasks cs.CVPDF

[70] Unified Open-World Segmentation with Multi-Modal Prompts cs.CVPDF

[71] Layout-Independent License Plate Recognition via Integrated Vision and Language Models cs.CVPDF

[72] MCE: Towards a General Framework for Handling Missing Modalities under Imbalanced Missing Rates cs.CV | cs.LG | cs.MMPDF

[73] GLOFNet – A Multimodal Dataset for GLOF Monitoring and Prediction cs.CV | cs.AIPDF

[74] Equipping Vision Foundation Model with Mixture of Experts for Out-of-Distribution Detection cs.CVPDF

[75] A Simple and Better Baseline for Visual Grounding cs.CVPDF

[76] ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models cs.CVPDF

[77] OmniQuality-R: Advancing Reward Models Through All-Encompassing Quality Assessment cs.CVPDF

[78] DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis cs.CV | cs.AIPDF