cs.CV

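[1] Hybrid Quantum-Classical Model for Image Classification cs.CV | cs.AI | cs.LG

Muhammad Adnan Shahzad

TL;DR: This paper systematically compares hybrid quantum-classical neural networks with purely classical models on three benchmark datasets in terms of performance, efficiency, and robustness. It finds that the hybrid models outperform the classical models in accuracy, training speed, and resource efficiency, with the largest gains on complex tasks.

Details

Motivation: The study investigates whether combining quantum computing with conventional deep learning can improve image classification, in particular whether it can surpass purely classical models in accuracy, efficiency, and robustness.

Result: The hybrid models are markedly more accurate than the classical models (e.g., a gain of 9.44 percentage points on CIFAR100), train 5–12× faster, and use fewer resources (lower memory and CPU usage), but their adversarial robustness on complex datasets is comparable to that of the classical models.

Insight: Hybrid quantum-classical models show clear advantages on complex vision tasks, but their robustness may depend on dataset complexity, which suggests that future work could refine quantum circuit design to improve robustness.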

Abstract: This study presents a systematic comparison between hybrid quantum-classical neural networks and purely classical models across three benchmark datasets (MNIST, CIFAR100, and STL10) to evaluate their performance, efficiency, and robustness. The hybrid models integrate parameterized quantum circuits with classical deep learning architectures, while the classical counterparts use conventional convolutional neural networks (CNNs). Experiments were conducted over 50 training epochs for each dataset, with evaluations on validation accuracy, test accuracy, training time, computational resource usage, and adversarial robustness (tested with $\epsilon=0.1$ perturbations). Key findings demonstrate that hybrid models consistently outperform classical models in final accuracy, achieving 99.38% (MNIST), 41.69% (CIFAR100), and 74.05% (STL10) validation accuracy, compared to classical benchmarks of 98.21%, 32.25%, and 63.76%, respectively. Notably, the hybrid advantage scales with dataset complexity, showing the most significant gains on CIFAR100 (+9.44%) and STL10 (+10.29%). Hybrid models also train 5–12$\times$ faster (e.g., 21.23s vs. 108.44s per epoch on MNIST) and use 6–32% fewer parameters while maintaining superior generalization to unseen test data. Adversarial robustness tests reveal that hybrid models are significantly more resilient on simpler datasets (e.g., 45.27% robust accuracy on MNIST vs. 10.80% for classical) but show comparable fragility on complex datasets like CIFAR100 ($\sim$1% robustness for both). Resource efficiency analyses indicate that hybrid models consume less memory (4–5GB vs. 5–6GB for classical) and lower CPU utilization (9.5% vs. 23.2% on average). These results suggest that hybrid quantum-classical architectures offer compelling advantages in accuracy, training efficiency, and parameter scalability, particularly for complex vision tasks.
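
The headline numbers in this abstract can be cross-checked with a few lines of arithmetic. The sketch below only recomputes the gaps and the MNIST speedup from the figures quoted above; the helper names `accuracy_gains` and `speedup` are illustrative and not from the paper.

```python
# Validation accuracies (%) quoted in the abstract, as (hybrid, classical).
val_acc = {
    "MNIST": (99.38, 98.21),
    "CIFAR100": (41.69, 32.25),
    "STL10": (74.05, 63.76),
}


def accuracy_gains(table):
    """Absolute gain of hybrid over classical, in percentage points."""
    return {name: round(h - c, 2) for name, (h, c) in table.items()}


def speedup(classical_seconds, hybrid_seconds):
    """Ratio of per-epoch training times (classical / hybrid)."""
    return classical_seconds / hybrid_seconds


gains = accuracy_gains(val_acc)
mnist_speedup = speedup(108.44, 21.23)  # per-epoch times on MNIST
print(gains)
print(round(mnist_speedup, 2))
```

The recomputed gaps match the quoted +9.44 and +10.29, which are differences in percentage points rather than relative improvements, and the MNIST example sits at the lower end of the reported 5–12× range.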


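[2] Research on Expressway Congestion Warning Technology Based on YOLOv11-DIoU and GRU-Attention cs.CV

Tong Yulin, Liang Xuechen

TL;DR: The paper proposes an expressway congestion warning technique based on YOLOv11-DIoU and GRU-Attention. By optimizing the object detection and long-sequence prediction models, it substantially improves detection precision and congestion warning accuracy.

Details

Motivation: Existing expressway congestion warning systems suffer from low vehicle perception accuracy and lose long-sequence dependencies, which severely degrades warning quality.

Result: YOLOv11-DIoU improves mAP by 6.5 percentage points, GRU-Attention reaches 99.7% test accuracy, and the congestion warning time error is ≤ 1 minute.

Insight: Combining object detection with sequence prediction can substantially improve congestion warning systems, with stable performance under occlusion and high traffic flow.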

Abstract: Expressway traffic congestion severely reduces travel efficiency and hinders regional connectivity. Existing “detection-prediction” systems have critical flaws: low vehicle perception accuracy under occlusion and loss of long-sequence dependencies in congestion forecasting. This study proposes an integrated technical framework to resolve these issues. For traffic flow perception, two baseline algorithms were optimized. Traditional YOLOv11 was upgraded to YOLOv11-DIoU by replacing GIoU Loss with DIoU Loss, and DeepSort was improved by fusing Mahalanobis (motion) and cosine (appearance) distances. Experiments on Chang-Shen Expressway videos showed YOLOv11-DIoU achieved 95.7% mAP (6.5 percentage points higher than baseline) with 5.3% occlusion miss rate. DeepSort reached 93.8% MOTA (11.3 percentage points higher than SORT) with only 4 ID switches. Using the Greenberg model (for 10-15 vehicles/km high-density scenarios), speed and density showed a strong negative correlation (r=-0.97), conforming to traffic flow theory. For congestion warning, a GRU-Attention model was built to capture congestion precursors. Trained 300 epochs with flow, density, and speed, it achieved 99.7% test accuracy (7-9 percentage points higher than traditional GRU). In 10-minute advance warnings for 30-minute congestion, time error was $\leq$ 1 minute. Validation with an independent video showed 95% warning accuracy, over 90% spatial overlap of congestion points, and stable performance in high-flow ($>$5 vehicles/second) scenarios. This framework provides quantitative support for expressway congestion control, with promising intelligent transportation applications.
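
For readers unfamiliar with the DIoU loss mentioned above, here is a minimal sketch of its standard textbook form, 1 - IoU + ρ²/c², where ρ is the distance between box centers and c is the diagonal of the smallest enclosing box. It is not taken from the paper's code; the function name `diou_loss` and the `(x1, y1, x2, y2)` box layout are assumptions, and boxes are assumed to have positive area.

```python
def diou_loss(box_a, box_b):
    """DIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection and union areas.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union

    # Squared distance between the two box centers.
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both.
    enc_w = max(ax2, bx2) - min(ax1, bx1)
    enc_h = max(ay2, by2) - min(ay1, by1)
    c2 = enc_w ** 2 + enc_h ** 2

    return 1.0 - iou + rho2 / c2
```

Unlike a plain IoU loss, the center-distance term still varies with box position when the boxes do not overlap at all, which is the usual motivation for using it.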


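[3] Parking Space Ground Truth Test Automation by Artificial Intelligence Using Convolutional Neural Networks cs.CV | 68U99 | J.2

Tony Rohe, Martin Margreiter, Markus Moertl

TL;DR: This paper proposes an automated testing method based on convolutional neural networks (CNNs) to streamline the ground truth test process of a real-time, cloud-based on-street parking service, greatly reducing the human time required.

Details

Motivation: Ground truth tests for existing on-street parking services rely heavily on manual work, which is inefficient and costly. The study aims to automate the tests with AI and improve service efficiency.

Result: Test results show that the automation tool cuts human resource consumption by up to 99.58%, substantially improving test efficiency and accuracy.

Insight: Machine learning techniques such as CNNs can efficiently replace manual testing tasks in domains such as parking services, but the models need further refinement to handle more complex real-world scenarios.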

Abstract: This research is part of a study of a real-time, cloud-based on-street parking service using crowd-sourced in-vehicle fleet data. The service provides real-time information about available parking spots by classifying crowd-sourced detections observed via ultrasonic sensors. The goal of this research is to optimize the current parking service quality by analyzing the automation of the existing test process for ground truth tests. Therefore, methods from the field of machine learning, especially image pattern recognition, are applied to enrich the database and substitute human engineering work in major areas of the analysis process. After an introduction into the related areas of machine learning, this paper explains the methods and implementations made to achieve a high level of automation, applying convolutional neural networks. Finally, predefined metrics present the performance level achieved, showing a time reduction of human resources up to 99.58 %. The overall improvements are discussed, summarized, and followed by an outlook for future development and potential application of the analysis automation tool.


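[4] An Empirical Analysis of VLM-based OOD Detection: Mechanisms, Advantages, and Sensitivity cs.CV | cs.AI

Yuxiao Lee, Xiaofeng Cao, Wei Ye, Jiangchao Yao, Jingkuan Song

TL;DR: This paper systematically analyzes the mechanisms, advantages, and sensitivity of zero-shot OOD detection based on vision-language models (VLMs), revealing their ability to exploit semantic novelty and their sensitivity to prompts.

Details

Motivation: Although VLMs perform well at zero-shot OOD detection, their working mechanisms, advantages, and robustness are not yet fully understood; a systematic analysis is needed to guide future research.

Result: By exploiting semantic novelty, VLMs clearly outperform single-modal methods, but they are highly sensitive to prompt choice.

Insight: The OOD detection ability of VLMs depends on the semantic properties of their embedding space; the influence of prompts cannot be ignored, and designs need to be made more robust to it.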

Abstract: Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot out-of-distribution (OOD) detection capabilities, vital for reliable AI systems. Despite this promising capability, a comprehensive understanding of (1) why they work so effectively, (2) what advantages they have over single-modal methods, and (3) how robust their behavior is remains notably incomplete within the research community. This paper presents a systematic empirical analysis of VLM-based OOD detection using in-distribution (ID) and OOD prompts. (1) Mechanisms: We systematically characterize and formalize key operational properties within the VLM embedding space that facilitate zero-shot OOD detection. (2) Advantages: We empirically quantify the superiority of these models over established single-modal approaches, attributing this distinct advantage to the VLM’s capacity to leverage rich semantic novelty. (3) Sensitivity: We uncover a significant and previously under-explored asymmetry in their robustness profile: while exhibiting resilience to common image noise, these VLM-based methods are highly sensitive to prompt phrasing. Our findings contribute a more structured understanding of the strengths and critical vulnerabilities inherent in VLM-based OOD detection, offering crucial, empirically-grounded guidance for developing more robust and reliable future designs.
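
To make "OOD detection using ID and OOD prompts" concrete, the sketch below shows one common scoring pattern: softmax over the cosine similarities between an image embedding and all prompt embeddings, with the probability mass on the OOD prompts used as the score. This is an illustrative assumption, not the formalization used in the paper. The embeddings here are toy vectors; in practice they would come from a VLM such as CLIP, and the function name `ood_score` and the temperature value are hypothetical.

```python
import numpy as np


def ood_score(image_emb, id_text_embs, ood_text_embs, temperature=0.01):
    """Probability mass assigned to the OOD prompts (higher = more OOD)."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    img = normalize(np.asarray(image_emb, dtype=float))
    texts = normalize(np.vstack([id_text_embs, ood_text_embs]).astype(float))

    # Cosine similarities scaled by a temperature, then a stable softmax.
    logits = texts @ img / temperature
    logits -= logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()

    # OOD prompts occupy the rows after the ID prompts.
    return float(probs[len(id_text_embs):].sum())


# Toy 2-D embeddings standing in for real text/image features.
id_prompts = [[1.0, 0.0], [0.0, 1.0]]
ood_prompts = [[-1.0, 0.0]]
print(ood_score([1.0, 0.1], id_prompts, ood_prompts))   # close to 0
print(ood_score([-1.0, 0.0], id_prompts, ood_prompts))  # close to 1
```

Because the score depends directly on the text embeddings, rewording a prompt moves the decision boundary, which is consistent with the prompt sensitivity the abstract reports.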


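[5] Curvature as a tool for evaluating dimensionality reduction and estimating intrinsic dimension cs.CV | cs.DM | cs.LG | 51K05 (primary) 57-08, 53Z50, 55U10 (secondary) | G.2.2

Charlotte Beylier, Parvaneh Joharinad, Jürgen Jost, Nahid Torbati

TL;DR: This paper proposes a curvature-based method for evaluating dimensionality reduction techniques and estimating the intrinsic dimension of datasets. Using an abstract notion of sectional curvature, it builds a geometric profile of discrete metric spaces and uses it to analyze the geometric properties of data representations.

Details

Motivation: Dimensionality reduction is widely used in practice, but tools for quantitatively evaluating it are lacking. The authors use geometric curvature to provide a new way to assess the geometric fidelity of dimensionality reduction results and, further, to estimate the intrinsic dimension of datasets.

Result: Experiments show that the method can both evaluate dimensionality reduction techniques and estimate the intrinsic dimension of datasets. It is also applicable to analyzing the geometric structure of large-scale networks.

Insight: Curvature, as a geometric tool, gives a better understanding of the geometric properties of data representations and offers a new perspective on evaluating dimensionality reduction and analyzing datasets.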

Abstract: Utilizing recently developed abstract notions of sectional curvature, we introduce a method for constructing a curvature-based geometric profile of discrete metric spaces. The curvature concept that we use here captures the metric relations between triples of points and other points. More significantly, based on this curvature profile, we introduce a quantitative measure to evaluate the effectiveness of data representations, such as those produced by dimensionality reduction techniques. Furthermore, our experiments demonstrate that this curvature-based analysis can be employed to estimate the intrinsic dimensionality of datasets. We use this to explore the large-scale geometry of empirical networks and to evaluate the effectiveness of dimensionality reduction techniques.


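[6] Real-Time Detection and Tracking of Foreign Object Intrusions in Power Systems via Feature-Based Edge Intelligence cs.CV | cs.SY | eess.SY

Xinan Wang, Di Shi, Fengyu Wang

TL;DR: This paper proposes a three-stage framework for real-time detection and tracking of foreign object intrusions in power systems, combining YOLOv7 segmentation, ConvNeXt feature extraction, and an IoU tracker, with optimizations for efficient deployment on edge hardware.

Details

Motivation: Foreign object intrusions in power systems can cause serious accidents, and existing methods fall short in real-time performance and robustness. The paper proposes an efficient edge-intelligence solution.

Result: The framework's accuracy and robustness are validated on real-world surveillance and drone video datasets, and hardware benchmarks on NVIDIA Jetson devices demonstrate its practicality.

Insight: Through edge optimization and incremental updates, the method shows how real-time detection and tracking can run efficiently on resource-constrained devices.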

Abstract: This paper presents a novel three-stage framework for real-time foreign object intrusion (FOI) detection and tracking in power transmission systems. The framework integrates: (1) a YOLOv7 segmentation model for fast and robust object localization, (2) a ConvNeXt-based feature extractor trained with triplet loss to generate discriminative embeddings, and (3) a feature-assisted IoU tracker that ensures resilient multi-object tracking under occlusion and motion. To enable scalable field deployment, the pipeline is optimized for deployment on low-cost edge hardware using mixed-precision inference. The system supports incremental updates by adding embeddings from previously unseen objects into a reference database without requiring model retraining. Extensive experiments on real-world surveillance and drone video datasets demonstrate the framework’s high accuracy and robustness across diverse FOI scenarios. In addition, hardware benchmarks on NVIDIA Jetson devices confirm the framework’s practicality and scalability for real-world edge applications.
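
The incremental-update idea (adding embeddings of previously unseen objects to a reference database without retraining) can be illustrated with a small nearest-neighbor store. This is a minimal sketch under stated assumptions, not the paper's implementation: the class name `ReferenceDB`, the cosine-similarity threshold of 0.8, and the example labels are all hypothetical, and real embeddings would come from the ConvNeXt feature extractor rather than hand-written vectors.

```python
import numpy as np


class ReferenceDB:
    """Toy embedding store: add labeled vectors, match by cosine similarity."""

    def __init__(self, threshold=0.8):
        self.labels = []
        self.embs = []
        self.threshold = threshold  # assumed similarity cutoff

    def add(self, label, emb):
        """Register a new object type; no model retraining is involved."""
        e = np.asarray(emb, dtype=float)
        self.embs.append(e / np.linalg.norm(e))
        self.labels.append(label)

    def match(self, emb):
        """Return the best-matching label, or None if nothing is close enough."""
        if not self.embs:
            return None
        e = np.asarray(emb, dtype=float)
        e = e / np.linalg.norm(e)
        sims = np.stack(self.embs) @ e
        best = int(np.argmax(sims))
        return self.labels[best] if sims[best] >= self.threshold else None


db = ReferenceDB()
db.add("kite", [1.0, 0.0])
print(db.match([0.9, 0.1]))   # kite
print(db.match([0.0, 1.0]))   # None: unseen object
db.add("balloon", [0.0, 1.0])
print(db.match([0.0, 1.0]))   # balloon, recognized after the update
```

The design choice being illustrated is that the feature extractor stays fixed and only the database grows, so supporting a new object type costs one insertion rather than a training run.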


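[7] EdiVal-Agent: An Object-Centric Framework for Automated, Scalable, Fine-Grained Evaluation of Multi-Turn Editing cs.CV | cs.AI | cs.LG

Tianyu Chen, Yasi Zhang, Zhi Zhang, Peiyu Yu, Shu Wang

TL;DR: The paper presents EdiVal-Agent, an object-centric framework for automated, scalable, fine-grained evaluation of multi-turn instruction-based editing. It combines vision-language models with object detectors to improve evaluation accuracy.

Details

Motivation: Current image editing evaluation methods rely on reference images or on a vision-language model alone, which leads to limited coverage and imprecise assessments. A more reliable and interpretable automated evaluation framework is needed.

Result: Experiments show that combining vision-language models with object detectors agrees more closely with human judgments in instruction-following evaluation than using vision-language models alone.

Insight: The modular design allows future tools to be integrated, improving evaluation accuracy over time. EdiVal-Agent can identify failure modes of current editing models, informing the development of next-generation models.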

Abstract: Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images – resulting in limited coverage and inheriting biases from prior generative models – or (ii) rely solely on zero-shot vision-language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise. To address this, we introduce EdiVal-Agent, an automated, scalable, and fine-grained evaluation framework for multi-turn instruction-based editing from an object-centric perspective, supported by a suite of expert tools. Given an image, EdiVal-Agent first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions. For evaluation, it integrates VLMs with open-vocabulary object detectors to assess instruction following, uses semantic-level feature extractors to evaluate content consistency, and leverages human preference models to judge visual quality. We show that combining VLMs with object detectors yields stronger agreement with human judgments in instruction-following evaluation compared to using VLMs alone and CLIP-based metrics. Furthermore, the pipeline’s modular design allows future tools to be seamlessly integrated, enhancing evaluation accuracy over time. Instantiating this pipeline, we build EdiVal-Bench, a multi-turn editing benchmark covering 9 instruction types and 11 state-of-the-art editing models spanning autoregressive (AR) (including Nano Banana, GPT-Image-1), flow-matching, and diffusion paradigms. We demonstrate that EdiVal-Agent can be used to identify existing failure modes, thereby informing the development of the next generation of editing models. Project page: https://tianyucodings.github.io/EdiVAL-page/.


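[8] MapAnything: Universal Feed-Forward Metric 3D Reconstruction cs.CV | cs.AI | cs.LG | cs.RO

Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang

TL;DR: MapAnything is a transformer-based universal feed-forward model that uses a factored representation of multi-view geometry to directly regress a scene's 3D geometry and camera parameters, and it applies to a wide range of 3D vision tasks.

Details

Motivation: Most existing 3D reconstruction models are designed for specific tasks and lack generality. MapAnything aims to provide a unified model that handles many 3D reconstruction tasks efficiently.

Result: Experiments show that MapAnything outperforms or matches specialist models on multiple tasks while offering more efficient training and greater generality.

Insight: With a unified geometric representation and training strategy, it is possible to build a universal 3D reconstruction backbone that replaces traditional specialist models.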

Abstract: We introduce MapAnything, a unified transformer-based feed-forward model that ingests one or more images along with optional geometric inputs such as camera intrinsics, poses, depth, or partial reconstructions, and then directly regresses the metric 3D scene geometry and cameras. MapAnything leverages a factored representation of multi-view scene geometry, i.e., a collection of depth maps, local ray maps, camera poses, and a metric scale factor that effectively upgrades local reconstructions into a globally consistent metric frame. Standardizing the supervision and training across diverse datasets, along with flexible input augmentation, enables MapAnything to address a broad range of 3D vision tasks in a single feed-forward pass, including uncalibrated structure-from-motion, calibrated multi-view stereo, monocular depth estimation, camera localization, depth completion, and more. We provide extensive experimental analyses and model ablations demonstrating that MapAnything outperforms or matches specialist feed-forward models while offering more efficient joint training behavior, thus paving the way toward a universal 3D reconstruction backbone.
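
One way to picture the factored representation (depth map, local ray map, camera pose, metric scale) is to compose the factors back into 3D points. The sketch below is an illustrative assumption about how such factors could be combined, not the paper's exact formulation: it treats depth as distance along a unit ray, applies a rigid camera-to-world transform, and then multiplies by a single global scale. The function name `compose_points` and all argument names are hypothetical.

```python
import numpy as np


def compose_points(depth, rays, rotation, translation, metric_scale):
    """Combine per-pixel depth and rays with a pose and a global scale.

    depth:        (H, W) distances along each ray (assumed convention)
    rays:         (H, W, 3) unit ray directions in the camera frame
    rotation:     (3, 3) camera-to-world rotation
    translation:  (3,) camera position in the up-to-scale world frame
    metric_scale: scalar that upgrades the result to metric units
    """
    depth = np.asarray(depth, dtype=float)
    rays = np.asarray(rays, dtype=float)
    local = depth[..., None] * rays  # points in the camera frame
    world = local @ np.asarray(rotation, dtype=float).T \
        + np.asarray(translation, dtype=float)
    return metric_scale * world


points = compose_points(
    depth=[[2.0]],
    rays=[[[0.0, 0.0, 1.0]]],
    rotation=np.eye(3),
    translation=[1.0, 0.0, 0.0],
    metric_scale=0.5,
)
print(points)  # one pixel mapped to a single 3D point
```

Keeping scale as a separate factor means the per-view geometry can be predicted up to scale and a single number then fixes the metric units for the whole scene.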


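[9] Semantic-Enhanced Cross-Modal Place Recognition for Robust Robot Localization cs.CV

Yujia Lin, Nicholas Evans

TL;DR: This paper proposes SCM-PR, a semantic-enhanced cross-modal place recognition framework that combines high-level semantics from RGB images with geometric information from LiDAR maps to make robot localization more robust in GPS-denied environments.

Details

Motivation: Existing RGB-based localization methods are sensitive to environmental changes such as illumination and weather, while cross-modal localization methods perform poorly in complex scenes, in fine-grained matching, and under viewpoint changes.

Result: On the KITTI and KITTI-360 datasets, SCM-PR outperforms other cross-modal place recognition methods and achieves state-of-the-art performance.

Insight: Semantic information plays a key role in making cross-modal localization robust, especially in complex environments and under viewpoint changes.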

Abstract: Ensuring accurate localization of robots in environments without GPS capability is a challenging task. Visual Place Recognition (VPR) techniques can potentially achieve this goal, but existing RGB-based methods are sensitive to changes in illumination, weather, and other seasonal changes. Existing cross-modal localization methods leverage the geometric properties of RGB images and 3D LiDAR maps to reduce the sensitivity issues highlighted above. Currently, state-of-the-art methods struggle in complex scenes, fine-grained or high-resolution matching, and situations where changes can occur in viewpoint. In this work, we introduce a framework we call Semantic-Enhanced Cross-Modal Place Recognition (SCM-PR) that combines high-level semantics utilizing RGB images for robust localization in LiDAR maps. Our proposed method introduces: a VMamba backbone for feature extraction of RGB images; a Semantic-Aware Feature Fusion (SAFF) module for using both place descriptors and segmentation masks; LiDAR descriptors that incorporate both semantics and geometry; and a cross-modal semantic attention mechanism in NetVLAD to improve matching. Incorporating the semantic information was also instrumental in designing a Multi-View Semantic-Geometric Matching and a Semantic Consistency Loss, both in a contrastive learning framework. Our experimental work on the KITTI and KITTI-360 datasets shows that SCM-PR achieves state-of-the-art performance compared to other cross-modal place recognition methods.


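[10] Improving 3D Gaussian Splatting Compression by Scene-Adaptive Lattice Vector Quantization cs.CV

Hao Xu, Xiaolin Wu, Xi Zhang

TL;DR: This paper proposes SALVQ (scene-adaptive lattice vector quantization), a new method for improving the compression of 3D Gaussian Splatting (3DGS) data. It optimizes the lattice basis vectors for each scene, improving compression efficiency while keeping complexity low.

Details

Motivation: 3DGS is popular for its high-quality rendering and real-time performance, but it produces very large amounts of data. Existing compression methods rely mainly on uniform scalar quantization (USQ), which lacks flexibility. The paper asks whether a more sophisticated quantizer, such as lattice vector quantization (LVQ), can improve compression while keeping the system simple.

Result: SALVQ substantially improves the R-D performance of 3DGS compression with very low computational overhead and minimal modifications. Its ability to adjust lattice density dynamically reduces training time and memory consumption.

Insight: Optimizing and dynamically adjusting the lattice basis vectors is key to improving 3DGS compression efficiency; scene adaptivity compensates for the shortcomings of traditional quantization and points to a new direction for future compression techniques.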

Abstract: 3D Gaussian Splatting (3DGS) is rapidly gaining popularity for its photorealistic rendering quality and real-time performance, but it generates massive amounts of data. Hence compressing 3DGS data is necessary for the cost effectiveness of 3DGS models. Recently, several anchor-based neural compression methods have been proposed, achieving good 3DGS compression performance. However, they all rely on uniform scalar quantization (USQ) due to its simplicity. A tantalizing question is whether more sophisticated quantizers can improve the current 3DGS compression methods with very little extra overhead and minimal change to the system. The answer is yes by replacing USQ with lattice vector quantization (LVQ). To better capture scene-specific characteristics, we optimize the lattice basis for each scene, improving LVQ’s adaptability and R-D efficiency. This scene-adaptive LVQ (SALVQ) strikes a balance between the R-D efficiency of vector quantization and the low complexity of USQ. SALVQ can be seamlessly integrated into existing 3DGS compression architectures, enhancing their R-D performance with minimal modifications and computational overhead. Moreover, by scaling the lattice basis vectors, SALVQ can dynamically adjust lattice density, enabling a single model to accommodate multiple bit rate targets. This flexibility eliminates the need to train separate models for different compression levels, significantly reducing training time and memory consumption.
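
The relationship between USQ, LVQ, and basis scaling can be shown with a small quantizer. The sketch below uses Babai rounding (round the coordinates in the lattice basis), which returns a nearby lattice point but not always the nearest one for non-orthogonal bases. It is an illustration under stated assumptions, not the paper's quantizer: in SALVQ the basis is learned per scene, whereas here it is a fixed matrix passed in, and the name `lattice_quantize` is hypothetical.

```python
import numpy as np


def lattice_quantize(x, basis, scale=1.0):
    """Map x to a nearby point of the lattice generated by scale * basis.

    The columns of `basis` are the lattice basis vectors. Returns the
    quantized point and its integer coordinates (the symbols to encode).
    """
    B = scale * np.asarray(basis, dtype=float)
    coords = np.rint(np.linalg.solve(B, np.asarray(x, dtype=float)))
    return B @ coords, coords.astype(int)


# With the identity basis, this reduces to uniform scalar quantization.
point_usq, idx_usq = lattice_quantize([0.4, 1.6], np.eye(2))

# Scaling the basis changes lattice density (a finer grid here).
point_fine, idx_fine = lattice_quantize([0.4, 1.6], np.eye(2), scale=0.5)

# A non-orthogonal (hexagonal) basis couples the two dimensions.
hex_basis = np.array([[1.0, 0.5],
                      [0.0, np.sqrt(3) / 2]])
point_hex, idx_hex = lattice_quantize([0.5, np.sqrt(3) / 2], hex_basis)
print(point_usq, point_fine, idx_hex)
```

The `scale` argument mirrors the abstract's point that scaling the basis vectors adjusts lattice density, so one set of basis vectors can serve several bit rate targets.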


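[11] MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes cs.CV | cs.CY

Liu Liu, Alexandra Kudaeva, Marco Cipriano, Fatimeh Al Ghannam, Freya Tan

TL;DR: MINGLE is a modular three-stage pipeline for detecting semantically complex social group regions in urban street-view images, combining off-the-shelf human detection, VLM-based reasoning, and a lightweight spatial aggregation algorithm.

Details

Motivation: Understanding group-level social interactions in public spaces is crucial for urban planning, but visual detection methods that capture abstract interpersonal relations are currently lacking.

Result: MINGLE effectively detects semantically complex social group regions, and its performance is validated on a new dataset.

Insight: Combining off-the-shelf models with VLM-based reasoning makes it possible to handle abstract semantic tasks that traditional methods struggle to capture.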

Abstract: Understanding group-level social interactions in public spaces is crucial for urban planning, informing the design of socially vibrant and inclusive environments. Detecting such interactions from images involves interpreting subtle visual cues such as relations, proximity, and co-movement - semantically complex signals that go beyond traditional object detection. To address this challenge, we introduce a social group region detection task, which requires inferring and spatially grounding visual regions defined by abstract interpersonal relations. We propose MINGLE (Modeling INterpersonal Group-Level Engagement), a modular three-stage pipeline that integrates: (1) off-the-shelf human detection and depth estimation, (2) VLM-based reasoning to classify pairwise social affiliation, and (3) a lightweight spatial aggregation algorithm to localize socially connected groups. To support this task and encourage future research, we present a new dataset of 100K urban street-view images annotated with bounding boxes and labels for both individuals and socially interacting groups. The annotations combine human-created labels and outputs from the MINGLE pipeline, ensuring semantic richness and broad coverage of real-world scenarios.


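[12] BiasMap: Leveraging Cross-Attentions to Discover and Mitigate Hidden Social Biases in Text-to-Image Generation cs.CV | cs.LG

Rajatsubhra Chakraborty, Xujun Che, Depeng Xu, Cori Faklaris, Xi Niu

TL;DR: BiasMap is a model-agnostic framework that uses cross-attention to uncover latent concept-level biases in text-to-image generation models, and it proposes energy-guided diffusion sampling to mitigate them.

Details

Motivation: Existing bias discovery work focuses mainly on output-level demographic distributions, which does not guarantee that concept representations are disentangled, so deeper concept-level bias needs to be studied.

Result: Experiments show that existing fairness interventions may narrow the output distribution gap but fail to disentangle concept-level entanglement, whereas BiasMap effectively mitigates concept-level bias and complements distribution-level bias mitigation.

Insight: Discovering and mitigating concept-level bias is key to making text-to-image generation fairer, and attention mechanisms provide an effective analysis tool.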

Abstract: Bias discovery is critical for black-box generative models, especially text-to-image (TTI) models. Existing works predominantly focus on output-level demographic distributions, which do not necessarily guarantee concept representations to be disentangled post-mitigation. We propose BiasMap, a model-agnostic framework for uncovering latent concept-level representational biases in stable diffusion models. BiasMap leverages cross-attention attribution maps to reveal structural entanglements between demographics (e.g., gender, race) and semantics (e.g., professions), going deeper into representational bias during the image generation. Using attribution maps of these concepts, we quantify the spatial demographics-semantics concept entanglement via Intersection over Union (IoU), offering a lens into bias that remains hidden in existing fairness discovery approaches. In addition, we further utilize BiasMap for bias mitigation through energy-guided diffusion sampling that directly modifies latent noise space and minimizes the expected SoftIoU during the denoising process. Our findings show that existing fairness interventions may reduce the output distributional gap but often fail to disentangle concept-level coupling, whereas our mitigation method can mitigate concept entanglement in image generation while complementing distributional bias mitigation.
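
As an illustration of measuring overlap between two attribution maps, the sketch below computes a soft IoU as the ratio of the elementwise minimum to the elementwise maximum. This min/max form is one common soft relaxation of IoU and is an assumption here; the paper's exact SoftIoU definition may differ. The maps are assumed to be nonnegative arrays of equal shape, and the name `soft_iou` is hypothetical.

```python
import numpy as np


def soft_iou(map_a, map_b, eps=1e-8):
    """Soft IoU of two nonnegative attribution maps of the same shape."""
    a = np.asarray(map_a, dtype=float)
    b = np.asarray(map_b, dtype=float)
    intersection = np.minimum(a, b).sum()
    union = np.maximum(a, b).sum()
    return float(intersection / (union + eps))


# Toy 2x2 maps standing in for a demographic and a profession concept.
demographic = [[1.0, 1.0], [0.0, 0.0]]
profession = [[1.0, 0.0], [0.0, 0.0]]
print(soft_iou(demographic, profession))  # about 0.5
```

A value near 1 means the two concepts attend to the same image regions (strong entanglement), while a value near 0 means they are spatially separated. Because the min/max form is differentiable almost everywhere, a quantity of this kind can in principle serve as a guidance objective during sampling.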


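[13] LivePyxel: Accelerating image annotations with a Python-integrated webcam live streaming cs.CV

Uriel Garcilazo-Cruz, Joseph O. Okeme, Rodrigo A. Vargas–Hernández

TL;DR: LivePyxel is a Python-based graphical user interface tool that integrates with devices such as webcams and microscopes to enable real-time image annotation, accelerating AI model development.

Details

Motivation: Existing image annotation tools usually require uploading a pre-collected dataset and do not support real-time data acquisition, which limits how efficiently AI models can be deployed, especially in laboratory environments.

Result: The tool substantially simplifies data collection and annotation and suits AI model development within experimental workflows.

Insight: Flexible real-time annotation tools are essential for accelerating AI model deployment in scientific fields, especially in experimental settings.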

Abstract: The lack of flexible annotation tools has hindered the deployment of AI models in some scientific areas. Most existing image annotation software requires users to upload a precollected dataset, which limits support for on-demand pipelines and introduces unnecessary steps to acquire images. This constraint is particularly problematic in laboratory environments, where real-time data acquisition from instruments such as microscopes is increasingly common. In this work, we introduce LivePyxel, a Python-based graphical user interface that integrates with imaging systems, such as webcams, microscopes, and others, to enable real-time image annotation. LivePyxel is designed to be easy to use through a simple interface that allows users to precisely delimit areas for annotation using tools commonly found in commercial graphics editing software. Of particular interest is the availability of Bézier splines and binary masks, and the software’s capacity to work with non-destructive layers that enable high-performance editing. LivePyxel is also compatible with a wide range of video devices, and it is optimized for object detection operations through OpenCV combined with NumPy, a high-performance library for matrix and linear algebra operations. LivePyxel facilitates seamless data collection and labeling, accelerating the development of AI models in experimental workflows. LivePyxel is freely available at https://github.com/UGarCil/LivePyxel


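[14] DEFT-VTON: Efficient Virtual Try-On with Consistent Generalised H-Transform cs.CV

Xingzi Xu, Qi Li, Shuwen Qiu, Julien Han, Karim Bouyarmane

TL;DR: DEFT-VTON is an efficient virtual try-on method that freezes the pre-trained model's parameters and trains a small h-transform network, greatly reducing the number of trainable parameters. Combined with an adaptive consistency loss that improves performance, it delivers high-quality, fast virtual try-on.

Details

Motivation: In real-world applications, virtual try-on needs efficient training and inference to fit limited budgets. Current methods depend on extensive end-to-end training and struggle to meet this need.

Result: DEFT-VTON achieves SOTA performance on virtual try-on tasks and produces high-quality results with as few as 15 denoising steps.

Insight: Efficient fine-tuning and a consistency loss are the key techniques for high-quality virtual try-on, providing a feasible solution for practical applications.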

Abstract: Diffusion models enable high-quality virtual try-on (VTO) with their established image synthesis abilities. Despite the extensive end-to-end training of large pre-trained models involved in current VTO methods, real-world applications often prioritize limited training and inference, serving, and deployment budgets for VTO. To solve this obstacle, we apply Doob’s h-transform efficient fine-tuning (DEFT) for adapting large pre-trained unconditional models for downstream image-conditioned VTO abilities. DEFT freezes the pre-trained model’s parameters and trains a small h-transform network to learn a conditional h-transform. The h-transform network allows training only 1.42 percent of the frozen parameters, compared to a baseline of 5.52 percent in traditional parameter-efficient fine-tuning (PEFT). To further improve DEFT’s performance and decrease existing models’ inference time, we additionally propose an adaptive consistency loss. Consistency training distills slow but high-performing diffusion models into a fast one while retaining performance by enforcing consistencies along the inference path. Inspired by constrained optimization, instead of distillation, we combine the consistency loss and the denoising score matching loss in a data-adaptive manner for fine-tuning existing VTO models at a low cost. Empirical results show the proposed DEFT-VTON method achieves state-of-the-art performance on VTO tasks, with as few as 15 denoising steps, while maintaining competitive results.


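[15] Adversarial Appearance Learning in Augmented Cityscapes for Pedestrian Recognition in Autonomous Driving cs.CV

Artem Savkin, Thomas Lapotre, Kevin Strauss, Uzair Akbar, Federico Tombari

TL;DR: This paper proposes a method that uses adversarial learning to generate more realistic synthetic data for improving pedestrian recognition in autonomous driving, validated on the Cityscapes dataset.

Details

Motivation: Autonomous driving requires large amounts of scenario-specific data, but there is a domain gap between synthetic and real data that hurts model generalization. The paper aims to reduce this gap through data augmentation and adversarial learning.

Result: Experiments show that adversarial learning substantially improves the realism of synthetic data, which in turn improves pedestrian recognition performance.

Insight: Adversarial learning can effectively mitigate the domain gap between synthetic and real data, providing higher-quality training data for pedestrian recognition in autonomous driving.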

Abstract: In autonomous driving, synthetic data is crucial for covering specific traffic scenarios that an autonomous vehicle must handle. Such data commonly introduces a domain gap between the synthetic and real domains. In this paper, we use data augmentation to generate custom traffic scenarios with vulnerable road users (VRUs) in order to improve pedestrian recognition. We provide a pipeline for augmenting the Cityscapes dataset with virtual pedestrians. To improve the realism of the augmentation, we present a novel generative network architecture for adversarial learning of the dataset's lighting conditions. We also evaluate our approach on the tasks of semantic and instance segmentation.


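[16] FunKAN: Functional Kolmogorov-Arnold Network for Medical Image Enhancement and Segmentation cs.CV | I.4.3; I.4.6

Maksim Penkin, Andrey Krylov

TL;DR: FunKAN is a functional Kolmogorov-Arnold network designed for medical image enhancement and segmentation. It learns inner functions via Fourier decomposition over Hermite basis functions, improving interpretability while preserving the spatial structure of images.

Details

Motivation: To address the limited interpretability of traditional deep learning methods in medical image processing while using the mathematical framework of the Kolmogorov-Arnold theorem without disrupting the spatial structure of images.

Result: On the IXI, BUSI, GlaS, and CVC-ClinicDB datasets, FunKAN outperforms other KAN-based methods in both image enhancement (PSNR, TV) and segmentation (IoU, F1).

Insight: By combining theoretical function approximation with medical image analysis, FunKAN offers a robust and interpretable solution for clinical use.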

Abstract: Medical image enhancement and segmentation are critical yet challenging tasks in modern clinical practice, constrained by artifacts and complex anatomical variations. Traditional deep learning approaches often rely on complex architectures with limited interpretability. While Kolmogorov-Arnold networks offer interpretable solutions, their reliance on flattened feature representations fundamentally disrupts the intrinsic spatial structure of imaging data. To address this issue we propose a Functional Kolmogorov-Arnold Network (FunKAN) – a novel interpretable neural framework, designed specifically for image processing, that formally generalizes the Kolmogorov-Arnold representation theorem onto functional spaces and learns inner functions using Fourier decomposition over the basis Hermite functions. We explore FunKAN on several medical image processing tasks, including Gibbs ringing suppression in magnetic resonance images, benchmarking on IXI dataset. We also propose U-FunKAN as state-of-the-art binary medical segmentation model with benchmarks on three medical datasets: BUSI (ultrasound images), GlaS (histological structures) and CVC-ClinicDB (colonoscopy videos), detecting breast cancer, glands and polyps, respectively. Experiments on those diverse datasets demonstrate that our approach outperforms other KAN-based backbones in both medical image enhancement (PSNR, TV) and segmentation (IoU, F1). Our work bridges the gap between theoretical function approximation and medical image analysis, offering a robust, interpretable solution for clinical applications.


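[17] Multimodal Hate Detection Using Dual-Stream Graph Neural Networks cs.CV

Jiangbei Yue, Shuonan Yang, Tailin Chen, Jianbo Jiao, Zeyu Fu

TL;DR: The paper proposes a multimodal hateful video detection model based on dual-stream graph neural networks. An instance graph extracts features and a weight graph assigns importance weights, substantially improving classification performance and explainability.

Details

Motivation: Existing multimodal methods fail to emphasize hateful content and do not systematically model the structured information in videos, which limits detection performance.

Result: The model achieves SOTA performance on public datasets and offers strong explainability.

Insight: Multimodal modeling that highlights hateful instances and captures structured relationships is the key direction for improvement.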

Abstract: Hateful videos present serious risks to online safety and real-world well-being, necessitating effective detection methods. Although multimodal classification approaches integrating information from several modalities outperform unimodal ones, they typically neglect that even minimal hateful content defines a video’s category. Specifically, they generally treat all content uniformly, instead of emphasizing the hateful components. Additionally, existing multimodal methods cannot systematically capture structured information in videos, limiting the effectiveness of multimodal fusion. To address these limitations, we propose a novel multimodal dual-stream graph neural network model. It constructs an instance graph by separating the given video into several instances to extract instance-level features. Then, a complementary weight graph assigns importance weights to these features, highlighting hateful instances. Importance weights and instance features are combined to generate video labels. Our model employs a graph-based framework to systematically model structured relationships within and across modalities. Extensive experiments on public datasets show that our model is state-of-the-art in hateful video classification and has strong explainability. Code is available: https://github.com/Multimodal-Intelligence-Lab-MIL/MultiHateGNN.


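[18] ColonCrafter: A Depth Estimation Model for Colonoscopy Videos Using Diffusion Priors cs.CV | cs.AI | cs.LG

Romain Hardy, Tyler Berzin, Pranav Rajpurkar

TL;DR: ColonCrafter is a depth estimation model based on diffusion priors that generates temporally consistent depth maps from monocular colonoscopy videos. It learns geometric priors from synthetic data and, combined with a style transfer technique, achieves zero-shot SOTA on the C3VD dataset.

Details

Motivation: 3D scene understanding in colonoscopy videos calls for automated depth estimation, but existing methods lack temporal consistency.

Result: On the C3VD dataset, ColonCrafter outperforms both general-purpose and endoscopy-specific methods, and it supports 3D point cloud generation and surface coverage assessment.

Insight: Diffusion models can be used for temporally consistent depth estimation in the medical domain, and style transfer helps with domain adaptation.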

Abstract: Three-dimensional (3D) scene understanding in colonoscopy presents significant challenges that necessitate automated methods for accurate depth estimation. However, existing depth estimation models for endoscopy struggle with temporal consistency across video sequences, limiting their applicability for 3D reconstruction. We present ColonCrafter, a diffusion-based depth estimation model that generates temporally consistent depth maps from monocular colonoscopy videos. Our approach learns robust geometric priors from synthetic colonoscopy sequences to generate temporally consistent depth maps. We also introduce a style transfer technique that preserves geometric structure while adapting real clinical videos to match our synthetic training domain. ColonCrafter achieves state-of-the-art zero-shot performance on the C3VD dataset, outperforming both general-purpose and endoscopy-specific approaches. Although full trajectory 3D reconstruction remains a challenge, we demonstrate clinically relevant applications of ColonCrafter, including 3D point cloud generation and surface coverage assessment.


[19] MemGS: Memory-Efficient Gaussian Splatting for Real-Time SLAM cs.CVPDF

Yinlong Bai, Hongxin Zhang, Sheng Zhong, Junkai Niu, Hai Li

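TL;DR: MemGS proposes a memory-efficient 3D Gaussian Splatting method for real-time SLAM on embedded platforms. It merges redundant Gaussian primitives in voxel space to reduce memory usage and uses Patch-Grid point sampling to improve rendering quality.

Details

Motivation: Existing 3DGS research mostly targets high-performance GPUs and overlooks the resource limits of embedded devices such as micro air vehicles. MemGS addresses real-time SLAM under limited memory and compute.

Result: Experiments on public datasets show that MemGS lowers memory usage while improving rendering quality, without affecting real-time performance.

Insight: Embedded platforms can achieve high-quality real-time SLAM through efficient memory management and optimization of Gaussian primitives, without relying on high-performance GPUs.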

Abstract: Recent advancements in 3D Gaussian Splatting (3DGS) have made a significant impact on rendering and reconstruction techniques. Current research predominantly focuses on improving rendering performance and reconstruction quality using high-performance desktop GPUs, largely overlooking applications for embedded platforms like micro air vehicles (MAVs). These devices, with their limited computational resources and memory, often face a trade-off between system performance and reconstruction quality. In this paper, we improve existing methods in terms of GPU memory usage while enhancing rendering quality. Specifically, to address redundant 3D Gaussian primitives in SLAM, we propose merging them in voxel space based on geometric similarity. This reduces GPU memory usage without impacting system runtime performance. Furthermore, rendering quality is improved by initializing 3D Gaussian primitives via Patch-Grid (PG) point sampling, enabling more accurate modeling of the entire scene. Quantitative and qualitative evaluations on publicly available datasets demonstrate the effectiveness of our improvements.
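Merging primitives in voxel space can be sketched as bucketing Gaussian centers by voxel index and replacing each bucket with one representative. The sketch below is an assumption-laden simplification: it merges purely by position and averages only the centers, whereas the abstract describes merging based on geometric similarity. The function name and sample values are hypothetical.

```python
from collections import defaultdict

def merge_by_voxel(centers, voxel_size):
    """Replace all centers that fall in the same voxel with their centroid."""
    buckets = defaultdict(list)
    for x, y, z in centers:
        # Integer voxel index along each axis (floor division handles negatives).
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        buckets[key].append((x, y, z))
    merged = []
    for pts in buckets.values():
        n = len(pts)
        merged.append(tuple(sum(p[i] for p in pts) / n for i in range(3)))
    return merged

# Two nearby centers share a voxel; the third lies far away.
centers = [(0.01, 0.02, 0.03), (0.03, 0.04, 0.01), (1.2, 0.0, 0.0)]
merged = merge_by_voxel(centers, voxel_size=0.1)
```

Here three primitives collapse to two, so the memory saving scales with how many redundant primitives share a voxel. A fuller version would also merge covariance, opacity, and color, and would gate merging on a similarity test.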


[20] Dynamic Aware: Adaptive Multi-Mode Out-of-Distribution Detection for Trajectory Prediction in Autonomous Vehicles cs.CV | cs.LG | cs.ROPDF

Tongfei Guo, Lili Su

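TL;DR: The paper proposes a dynamics-aware, adaptive multi-mode OOD detection framework for trajectory prediction in autonomous driving, substantially improving detection delay and false alarm rate.

Details

Motivation: Trajectory prediction models in autonomous driving face distribution shift in real-world scenarios. Conventional OOD detection has focused on computer vision tasks, while trajectory-level OOD detection remains underexplored.

Result: Experiments on multiple real-world datasets show that the method significantly outperforms existing UQ- and vision-based OOD methods in detection delay and false alarm rate.

Insight: Prediction errors, even on in-distribution samples, exhibit mode-dependent behavior that evolves over time. Explicitly modeling these error modes is key to better OOD detection.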

Abstract: Trajectory prediction is central to the safe and seamless operation of autonomous vehicles (AVs). In deployment, however, prediction models inevitably face distribution shifts between training data and real-world conditions, where rare or underrepresented traffic scenarios induce out-of-distribution (OOD) cases. While most prior OOD detection research in AVs has concentrated on computer vision tasks such as object detection and segmentation, trajectory-level OOD detection remains largely underexplored. A recent study formulated this problem as a quickest change detection (QCD) task, providing formal guarantees on the trade-off between detection delay and false alarms [1]. Building on this foundation, we propose a new framework that introduces adaptive mechanisms to achieve robust detection in complex driving environments. Empirical analysis across multiple real-world datasets reveals that prediction errors – even on in-distribution samples – exhibit mode-dependent distributions that evolve over time with dataset-specific dynamics. By explicitly modeling these error modes, our method achieves substantial improvements in both detection delay and false alarm rates. Comprehensive experiments on established trajectory prediction benchmarks show that our framework significantly outperforms prior UQ- and vision-based OOD approaches in both accuracy and computational efficiency, offering a practical path toward reliable, driving-aware autonomy.
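To make the quickest change detection framing concrete, the sketch below runs a one-sided CUSUM statistic over a stream of prediction errors. CUSUM is a classical QCD procedure used here as an assumption; the abstract does not state which statistic the framework uses, and the adaptive, mode-dependent modeling it describes is not reproduced. The reference level, threshold, and error values are hypothetical.

```python
def cusum_alarm(errors, reference, threshold):
    """Return the first index where the one-sided CUSUM statistic exceeds
    the threshold, or None if no alarm is raised."""
    stat = 0.0
    for t, e in enumerate(errors):
        # Accumulate evidence that errors exceed the reference level;
        # reset to zero whenever the evidence turns negative.
        stat = max(0.0, stat + (e - reference))
        if stat > threshold:
            return t
    return None

# Hypothetical error stream: small errors, then a shift starting at index 4.
errors = [0.2, 0.3, 0.25, 0.2, 1.0, 1.2, 1.1, 1.3]
alarm = cusum_alarm(errors, reference=0.5, threshold=1.5)
```

Raising the threshold lowers the false alarm rate at the cost of a longer detection delay, which is the trade-off the abstract refers to.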


[21] Annotating Satellite Images of Forests with Keywords from a Specialized Corpus in the Context of Change Detection cs.CV | cs.CL | cs.IR | cs.MM | I.2; I.4; I.7; H.3PDF

Nathalie Neptune, Josiane Mothe

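TL;DR: This paper presents a deep-learning method for change detection in satellite images of the Amazon rainforest and annotates the changed regions with keywords extracted from a corpus. The method is effective for environmental monitoring and is generic.

Details

Motivation: The Amazon rainforest is vital to the global climate and biodiversity, yet deforestation there is severe. Traditional monitoring is inefficient, so automated tools are needed.

Result: The method is validated on an Amazon rainforest dataset, where it detects deforestation and generates relevant annotations.

Insight: The method is not limited to environmental monitoring and can be extended to other domains, showing generality and practical potential.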

Abstract: The Amazon rain forest is a vital ecosystem that plays a crucial role in regulating the Earth’s climate and providing habitat for countless species. Deforestation in the Amazon is a major concern as it has a significant impact on global carbon emissions and biodiversity. In this paper, we present a method for detecting deforestation in the Amazon using image pairs from Earth observation satellites. Our method leverages deep learning techniques to compare the images of the same area at different dates and identify changes in the forest cover. We also propose a visual semantic model that automatically annotates the detected changes with relevant keywords. The candidate annotations for images are extracted from scientific documents related to the Amazon region. We evaluate our approach on a dataset of Amazon image pairs and demonstrate its effectiveness in detecting deforestation and generating relevant annotations. Our method provides a useful tool for monitoring and studying the impact of deforestation in the Amazon. While we focus on environmental applications of our work by using images of deforestation in the Amazon rain forest to demonstrate the effectiveness of our proposed approach, it is generic enough to be applied to other domains.


[22] Intelligent Healthcare Imaging Platform An VLM-Based Framework for Automated Medical Image Analysis and Clinical Report Generation cs.CV | cs.AIPDF

Samer Al-Hamadani

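TL;DR: This paper proposes an intelligent medical image analysis framework based on Vision-Language Models (VLMs), using Google Gemini 2.5 Flash for automated tumor detection and clinical report generation across multiple imaging modalities.

Details

Motivation: Rapid advances of AI in medical imaging open new opportunities for diagnosis and decision-making, creating demand for an automated, multimodal analysis tool that improves efficiency and accuracy.

Result: Experiments show strong performance in anomaly detection across modalities, with an average localization deviation of 80 pixels. A user-friendly Gradio interface eases clinical integration.

Insight: The framework demonstrates the potential of AI in medical imaging, but further clinical validation and multi-center evaluation are needed before widespread adoption.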

Abstract: The rapid advancement of artificial intelligence (AI) in healthcare imaging has revolutionized diagnostic medicine and clinical decision-making processes. This work presents an intelligent multimodal framework for medical image analysis that leverages Vision-Language Models (VLMs) in healthcare diagnostics. The framework integrates Google Gemini 2.5 Flash for automated tumor detection and clinical report generation across multiple imaging modalities including CT, MRI, X-ray, and Ultrasound. The system combines visual feature extraction with natural language processing to enable contextual image interpretation, incorporating coordinate verification mechanisms and probabilistic Gaussian modeling for anomaly distribution. Multi-layered visualization techniques generate detailed medical illustrations, overlay comparisons, and statistical representations to enhance clinical confidence, with location measurement achieving 80 pixels average deviation. Result processing utilizes precise prompt engineering and textual analysis to extract structured clinical information while maintaining interpretability. Experimental evaluations demonstrated high performance in anomaly detection across multiple modalities. The system features a user-friendly Gradio interface for clinical workflow integration and demonstrates zero-shot learning capabilities to reduce dependence on large datasets. This framework represents a significant advancement in automated diagnostic support and radiological workflow efficiency, though clinical validation and multi-center evaluation are necessary prior to widespread adoption.


[23] Federated Learning for Deforestation Detection: A Distributed Approach with Satellite Imagery cs.CV | cs.DC | 14J60 | F.2.2; I.2.7PDF

Yuvraj Dutta, Aaditya Sikder, Basabdatta Palit

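TL;DR: The paper proposes a distributed approach based on federated learning for detecting deforestation from satellite imagery, using the FLOWER and RAY frameworks to run distributed learning while protecting client data privacy and security.

Details

Motivation: Conventional centralized training requires pooling data, which can compromise client data security and privacy, so a distributed approach is needed.

Result: The framework successfully carries out the distributed learning task while protecting client data privacy and security, offering a new perspective on image segmentation tasks on satellite imagery.

Insight: Federated learning holds promise for satellite image analysis: it enables efficient collaborative training without sharing data and offers a reference for other geospatial tasks.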

Abstract: Accurate identification of deforestation from satellite images is essential in order to understand the geographical situation of an area. This paper introduces a new distributed approach to identify as well as locate deforestation across different clients using Federated Learning (FL). Federated Learning enables distributed network clients to collaboratively train a model while maintaining data privacy and security of the active users. In our framework, a client corresponds to an edge satellite center responsible for local data processing. Moreover, FL provides an advantage over centralized training methods, which require combining data and thereby compromise the data security of the clients. Our framework leverages the FLOWER framework with the RAY framework to execute the distributed learning workload. Furthermore, efficient client spawning is ensured by RAY, as it can select a definite number of users to create an emulation environment. Our FL framework uses YOLOS-small (a Vision Transformer variant), Faster R-CNN with a ResNet50 backbone, and Faster R-CNN with a MobileNetV3 backbone, trained and tested on publicly available datasets. Our approach provides a different view of image segmentation-based tasks on satellite imagery.
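The server-side step of collaborative training can be sketched with federated averaging (FedAvg), in which client parameters are averaged in proportion to local dataset size. This is an assumption for illustration: the abstract does not say which aggregation strategy is used, and the sketch deliberately avoids any FLOWER or RAY calls. The function name and the toy parameter vectors are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average flat parameter lists, weighting each client by its sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Two hypothetical edge clients holding 1 and 3 local samples respectively.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Only parameters leave each client; the raw satellite images stay local, which is the privacy property the abstract emphasizes.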


[24] Gaussian Alignment for Relative Camera Pose Estimation via Single-View Reconstruction cs.CV | I.4.8; I.4.5PDF

Yumin Li, Dylan Campbell

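TL;DR: GARPS is a training-free method that tackles relative camera pose estimation via single-view reconstruction. It refines the camera pose at metric scale by aligning 3D Gaussian Mixture Models (GMMs) and clearly outperforms existing methods.

Details

Motivation: Conventional two-view pose estimation cannot recover metric scale and performs poorly with wide baselines or textureless regions. GARPS combines single-view reconstruction with multi-view geometry to provide robust metric pose estimation.

Result: On the Real-Estate10K dataset, GARPS outperforms classical methods and state-of-the-art learning-based methods such as MASt3R, confirming its robustness and accuracy.

Insight: Combining single-view perception with multi-view geometry offers a new route to metric-scale pose estimation without relying on explicit 2D correspondences or large-scale training data.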

Abstract: Estimating metric relative camera pose from a pair of images is of great importance for 3D reconstruction and localisation. However, conventional two-view pose estimation methods are not metric, with camera translation known only up to a scale, and struggle with wide baselines and textureless or reflective surfaces. This paper introduces GARPS, a training-free framework that casts this problem as the direct alignment of two independently reconstructed 3D scenes. GARPS leverages a metric monocular depth estimator and a Gaussian scene reconstructor to obtain a metric 3D Gaussian Mixture Model (GMM) for each image. It then refines an initial pose from a feed-forward two-view pose estimator by optimising a differentiable GMM alignment objective. This objective jointly considers geometric structure, view-independent colour, anisotropic covariance, and semantic feature consistency, and is robust to occlusions and texture-poor regions without requiring explicit 2D correspondences. Extensive experiments on the Real-Estate10K dataset demonstrate that GARPS outperforms both classical and state-of-the-art learning-based methods, including MASt3R. These results highlight the potential of bridging single-view perception with multi-view geometry to achieve robust and metric relative pose estimation.


[25] Re-purposing SAM into Efficient Visual Projectors for MLLM-Based Referring Image Segmentation cs.CV | cs.AIPDF

Xiaobo Yang, Xiaojin Gong

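TL;DR: The paper proposes a novel semantic visual projector that uses semantic superpixels generated by SAM to compress visual tokens, greatly reducing the computational burden while preserving semantic clarity.

Details

Motivation: Existing Referring Image Segmentation frameworks that pair an MLLM with SAM are computationally expensive, mainly because of redundant visual tokens. Conventional methods struggle to balance reducing the token count against preserving semantic clarity.

Result: Experiments show that the method cuts visual tokens by 93% without compromising performance, notably speeds up training and inference, and outperforms existing compressive visual projectors on RIS.

Insight: Segmenting an image into semantic superpixels and compressing their representations lowers compute cost efficiently while keeping semantic information intact.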

Abstract: Recently, Referring Image Segmentation (RIS) frameworks that pair the Multimodal Large Language Model (MLLM) with the Segment Anything Model (SAM) have achieved impressive results. However, adapting MLLM to segmentation is computationally intensive, primarily due to visual token redundancy. We observe that traditional patch-wise visual projectors struggle to strike a balance between reducing the number of visual tokens and preserving semantic clarity, often retaining overly long token sequences to avoid performance drops. Inspired by text tokenizers, we propose a novel semantic visual projector that leverages semantic superpixels generated by SAM to identify “visual words” in an image. By compressing and projecting semantic superpixels as visual tokens, our approach adaptively shortens the token sequence according to scene complexity while minimizing semantic loss in compression. To mitigate loss of information, we propose a semantic superpixel positional embedding to strengthen MLLM’s awareness of superpixel geometry and position, alongside a semantic superpixel aggregator to preserve both fine-grained details inside superpixels and global context outside. Experiments show that our method cuts visual tokens by 93% without compromising performance, notably speeding up MLLM training and inference, and outperforming existing compressive visual projectors on RIS.
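Compressing patch tokens into one token per superpixel can be sketched as grouped mean pooling. This is a stand-in rather than the paper's projector: the abstract describes a dedicated aggregator and positional embedding, neither of which is modeled here. The function name, the superpixel ids, and the feature values are hypothetical.

```python
def pool_by_superpixel(patch_features, superpixel_ids):
    """Average patch feature vectors that share a superpixel id,
    yielding one token per superpixel."""
    sums, counts = {}, {}
    for feat, sp in zip(patch_features, superpixel_ids):
        if sp not in sums:
            sums[sp] = [0.0] * len(feat)
            counts[sp] = 0
        sums[sp] = [a + b for a, b in zip(sums[sp], feat)]
        counts[sp] += 1
    return {sp: [v / counts[sp] for v in sums[sp]] for sp in sums}

# Five hypothetical patch features assigned to two superpixels.
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
tokens = pool_by_superpixel(features, [0, 0, 1, 1, 1])
```

The token count now tracks the number of superpixels rather than the number of patches, so simple scenes with few regions produce shorter sequences.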


[26] FishBEV: Distortion-Resilient Bird’s Eye View Segmentation with Surround-View Fisheye Cameras cs.CVPDF

Hang Li, Dianmo Sheng, Qiankun Dong, Zichun Wang, Zhiwei Xu

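TL;DR: FishBEV is a BEV segmentation framework designed for fisheye cameras. Three novel modules address severe geometric distortion, ambiguous multi-view correspondences, and temporal instability, yielding significant performance gains.

Details

Motivation: Existing BEV segmentation methods perform poorly on fisheye cameras, mainly because of severe geometric distortion, ambiguous multi-view correspondences, and temporal instability, so a robust solution is needed.

Result: Experiments on the Synwoodscapes dataset show that FishBEV significantly outperforms existing SOTA methods on surround-view fisheye BEV segmentation.

Insight: Geometric distortion and multi-view alignment are the main challenges for BEV segmentation with fisheye cameras. Combining uncertainty estimation with temporal modeling improves performance effectively.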

Abstract: As a cornerstone technique for autonomous driving, Bird’s Eye View (BEV) segmentation has recently achieved remarkable progress with pinhole cameras. However, it is non-trivial to extend the existing methods to fisheye cameras with severe geometric distortion, ambiguous multi-view correspondences and unstable temporal dynamics, all of which significantly degrade BEV performance. To address these challenges, we propose FishBEV, a novel BEV segmentation framework specifically tailored for fisheye cameras. This framework introduces three complementary innovations: a Distortion-Resilient Multi-scale Extraction (DRME) backbone that learns robust features under distortion while preserving scale consistency, an Uncertainty-aware Spatial Cross-Attention (U-SCA) mechanism that leverages uncertainty estimation for reliable cross-view alignment, and a Distance-aware Temporal Self-Attention (D-TSA) module that adaptively balances near field details and far field context to ensure temporal coherence. Extensive experiments on the Synwoodscapes dataset demonstrate that FishBEV consistently outperforms SOTA baselines on surround-view fisheye BEV segmentation tasks.


[27] Taylor-Series Expanded Kolmogorov-Arnold Network for Medical Imaging Classification cs.CVPDF

Kaniz Fatema, Emad A. Mohammed, Sukhjit Singh Sehra

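TL;DR: The paper proposes spline-based Kolmogorov-Arnold Networks (KANs) for medical image classification. By combining B-splines with Taylor series and related techniques, it drastically reduces the parameter count and improves model performance.

Details

Motivation: Medical image classification is challenging in resource-limited clinical settings and needs efficient, interpretable models. Conventional CNNs have many parameters and depend on preprocessing, whereas KANs model nonlinear relationships directly from raw data, offering a lighter solution.

Result: SBTAYLOR-KAN performs well on multiple datasets, reaching up to 98.93% accuracy and maintaining over 86% accuracy in data-reduction experiments. Its parameter count is only about one ten-thousandth of a conventional CNN's.

Insight: 1. Spline basis functions effectively model local and global nonlinearities; 2. KANs generalize well in data-scarce scenarios; 3. The lightweight design suits resource-constrained medical environments.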

Abstract: Effective and interpretable classification of medical images is a challenge in computer-aided diagnosis, especially in resource-limited clinical settings. This study introduces spline-based Kolmogorov-Arnold Networks (KANs) for accurate medical image classification with limited, diverse datasets. The models include SBTAYLOR-KAN, integrating B-splines with Taylor series; SBRBF-KAN, combining B-splines with Radial Basis Functions; and SBWAVELET-KAN, embedding B-splines in Morlet wavelet transforms. These approaches leverage spline-based function approximation to capture both local and global nonlinearities. The models were evaluated on brain MRI, chest X-rays, tuberculosis X-rays, and skin lesion images without preprocessing, demonstrating the ability to learn directly from raw data. Extensive experiments, including cross-dataset validation and data reduction analysis, showed strong generalization and stability. SBTAYLOR-KAN achieved up to 98.93% accuracy, with a balanced F1-score, maintaining over 86% accuracy using only 30% of the training data across three datasets. Despite class imbalance in the skin cancer dataset, experiments on both imbalanced and balanced versions showed SBTAYLOR-KAN outperforming other models, achieving 68.22% accuracy. Unlike traditional CNNs, which require millions of parameters (e.g., ResNet50 with 24.18M), SBTAYLOR-KAN achieves comparable performance with just 2,872 trainable parameters, making it more suitable for constrained medical environments. Gradient-weighted Class Activation Mapping (Grad-CAM) was used for interpretability, highlighting relevant regions in medical images. This framework provides a lightweight, interpretable, and generalizable solution for medical image classification, addressing the challenges of limited datasets and data-scarce scenarios in clinical AI applications.


[28] StyleProtect: Safeguarding Artistic Identity in Fine-tuned Diffusion Models cs.CVPDF

Qiuyu Tang, Joshua Krinsky, Aparna Bharati

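TL;DR: The paper proposes StyleProtect, which protects the distinctive style of artworks against malicious mimicry by selectively updating cross-attention layers in diffusion models.

Details

Motivation: With the rapid progress of generative models, especially diffusion models, these models may be misused to cheaply replicate an artist's distinctive style, infringing on the artist's creative labor and personal vision. This creates a need to protect artistic style.

Result: Experiments show that StyleProtect performs well in protecting artistic and anime styles from malicious customization while maintaining competitive imperceptibility.

Insight: The sensitivity of cross-attention layers to artistic style is key. Updating only these layers yields efficient style protection and avoids the computational cost of updating the full model.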

Abstract: The rapid advancement of generative models, particularly diffusion-based approaches, has inadvertently facilitated their potential for misuse. Such models enable malicious exploiters to replicate artistic styles that capture an artist’s creative labor, personal vision, and years of dedication in an inexpensive manner. This has led to a rise in the need and exploration of methods for protecting artworks against style mimicry. Although generic diffusion models can easily mimic an artistic style, finetuning amplifies this capability, enabling the model to internalize and reproduce the style with higher fidelity and control. We hypothesize that certain cross-attention layers exhibit heightened sensitivity to artistic styles. Sensitivity is measured through activation strengths of attention layers in response to style and content representations, and assessing their correlations with features extracted from external models. Based on our findings, we introduce an efficient and lightweight protection strategy, StyleProtect, that achieves effective style defense against fine-tuned diffusion models by updating only selected cross-attention layers. Our experiments utilize a carefully curated artwork dataset based on WikiArt, comprising representative works from 30 artists known for their distinctive and influential styles and cartoon animations from the Anita dataset. The proposed method demonstrates promising performance in safeguarding unique styles of artworks and anime from malicious diffusion customization, while maintaining competitive imperceptibility.


[29] UM-Depth : Uncertainty Masked Self-Supervised Monocular Depth Estimation with Visual Odometry cs.CVPDF

Tae-Wook Um, Ki-Hyeon Kim, Hyun-Duck Choi, Hyo-Sung Ahn

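TL;DR: UM-Depth proposes a self-supervised monocular depth estimation framework that combines motion awareness and uncertainty awareness. A teacher-student training strategy improves depth accuracy at dynamic object boundaries and in textureless regions, without extra labels or runtime overhead.

Details

Motivation: Self-supervised monocular depth estimation performs poorly in dynamic and textureless regions, mainly because of uncertainty in the input data. Existing methods usually rely on extra labels or auxiliary networks, adding complexity and overhead.

Result: The method is validated on the KITTI and Cityscapes datasets and achieves state-of-the-art performance in self-supervised depth and pose estimation.

Insight: Uncertainty awareness effectively compensates for weak photometric signals in self-supervised training, and the teacher-student strategy is key to efficient training.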

Abstract: Monocular depth estimation has been increasingly adopted in robotics and autonomous driving for its ability to infer scene geometry from a single camera. In self-supervised monocular depth estimation frameworks, the network jointly generates and exploits depth and pose estimates during training, thereby eliminating the need for depth labels. However, these methods remain challenged by uncertainty in the input data, such as low-texture or dynamic regions, which can cause reduced depth accuracy. To address this, we introduce UM-Depth, a framework that combines motion- and uncertainty-aware refinement to enhance depth accuracy at dynamic object boundaries and in textureless regions. Specifically, we develop a teacher-student training strategy that embeds uncertainty estimation into both the training pipeline and network architecture, thereby strengthening supervision where photometric signals are weak. Unlike prior motion-aware approaches that incur inference-time overhead and rely on additional labels or auxiliary networks for real-time generation, our method uses optical flow exclusively within the teacher network during training, eliminating extra labeling demands and any runtime cost. Extensive experiments on the KITTI and Cityscapes datasets demonstrate the effectiveness of our uncertainty-aware refinement. Overall, UM-Depth achieves state-of-the-art results in both self-supervised depth and pose estimation on the KITTI datasets.


[30] Mitigating Query Selection Bias in Referring Video Object Segmentation cs.CV | cs.AIPDF

Dingwei Zhang, Dong Zhang, Jinhui Tang

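TL;DR: This paper proposes Triple Query Former (TQF) to address query selection bias in query-based Referring Video Object Segmentation (RVOS). It factorizes the query into appearance, intra-frame interaction, and inter-frame motion components and combines dynamic linguistic and visual guidance, significantly improving performance.

Details

Motivation: Existing query-based RVOS methods rely on static queries and are easily misled by distractors with similar appearance or motion, causing query selection bias. This paper addresses the problem with dynamic query design and motion-aware modules.

Result: On multiple RVOS benchmarks, TQF shows significant performance gains, validating the structured query design and motion-aware modules.

Insight: Dynamic query design and motion-aware modules effectively mitigate query selection bias and improve the robustness of cross-modal alignment.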

Abstract: Recently, query-based methods have achieved remarkable performance in Referring Video Object Segmentation (RVOS) by using textual static object queries to drive cross-modal alignment. However, these static queries are easily misled by distractors with similar appearance or motion, resulting in query selection bias. To address this issue, we propose Triple Query Former (TQF), which factorizes the referring query into three specialized components: an appearance query for static attributes, an intra-frame interaction query for spatial relations, and an inter-frame motion query for temporal association. Instead of relying solely on textual embeddings, our queries are dynamically constructed by integrating both linguistic cues and visual guidance. Furthermore, we introduce two motion-aware aggregation modules that enhance object token representations: Intra-frame Interaction Aggregation incorporates position-aware interactions among objects within a single frame, while Inter-frame Motion Aggregation leverages trajectory-guided alignment across frames to ensure temporal coherence. Extensive experiments on multiple RVOS benchmarks demonstrate the advantages of TQF and the effectiveness of our structured query design and motion-aware aggregation modules.


[31] Improving Generalized Visual Grounding with Instance-aware Joint Learning cs.CVPDF

Ming Dai, Wenxuan Cheng, Jiang-Jiang Liu, Lingfeng Yang, Zhenhua Feng

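TL;DR: InstanceVG is a multi-task generalized visual grounding framework that jointly trains GREC and GRES with instance-aware capabilities, unifying instance-level box and mask prediction and significantly outperforming existing methods.

Details

Motivation: Existing methods usually handle GREC and GRES independently, overlooking the consistency gained from joint training and the importance of instance awareness.

Result: The framework achieves SOTA performance on ten datasets across four tasks, significantly surpassing existing methods.

Insight: Joint training and instance awareness are critical to performance on generalized visual grounding tasks.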

Abstract: Generalized visual grounding tasks, including Generalized Referring Expression Comprehension (GREC) and Segmentation (GRES), extend the classical visual grounding paradigm by accommodating multi-target and non-target scenarios. Specifically, GREC focuses on accurately identifying all referential objects at the coarse bounding box level, while GRES aims to achieve fine-grained pixel-level perception. However, existing approaches typically treat these tasks independently, overlooking the benefits of jointly training GREC and GRES to ensure consistent multi-granularity predictions and streamline the overall process. Moreover, current methods often treat GRES as a semantic segmentation task, neglecting the crucial role of instance-aware capabilities and the necessity of ensuring consistent predictions between instance-level boxes and masks. To address these limitations, we propose InstanceVG, a multi-task generalized visual grounding framework equipped with instance-aware capabilities, which leverages instance queries to unify the joint and consistent predictions of instance-level boxes and masks. To the best of our knowledge, InstanceVG is the first framework to simultaneously tackle both GREC and GRES while incorporating instance-aware capabilities into generalized visual grounding. To instantiate the framework, we assign each instance query a prior reference point, which also serves as an additional basis for target matching. This design facilitates consistent predictions of points, boxes, and masks for the same instance. Extensive experiments conducted on ten datasets across four tasks demonstrate that InstanceVG achieves state-of-the-art performance, significantly surpassing the existing methods in various evaluation metrics. The code and model will be publicly available at https://github.com/Dmmm1997/InstanceVG.


[32] Cross-modal Full-mode Fine-grained Alignment for Text-to-Image Person Retrieval cs.CVPDF

Hao Yin, Xin Man, Feiyu Chen, Jie Shao, Heng Tao Shen

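TL;DR: The FMFA framework improves text-image cross-modal alignment through explicit fine-grained alignment and implicit relational reasoning, boosting retrieval performance.

Details

Motivation: In TIPR, existing methods cannot verify whether local features are correctly aligned, and they focus only on hard negative samples while neglecting incorrectly matched positive pairs. FMFA aims to address these issues.

Result: FMFA surpasses all global matching methods on three public datasets, achieving state-of-the-art performance.

Insight: Combining explicit fine-grained alignment with implicit relational reasoning ("full-mode") effectively improves cross-modal retrieval accuracy.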

Abstract: Text-to-Image Person Retrieval (TIPR) is a cross-modal matching task that aims to retrieve the most relevant person images based on a given text query. The key challenge in TIPR lies in achieving effective alignment between textual and visual modalities within a common latent space. To address this challenge, prior approaches incorporate attention mechanisms for implicit cross-modal local alignment. However, they lack the ability to verify whether all local features are correctly aligned. Moreover, existing methods primarily focus on hard negative samples during model updates, with the goal of refining distinctions between positive and negative pairs, often neglecting incorrectly matched positive pairs. To alleviate these issues, we propose FMFA, a cross-modal Full-Mode Fine-grained Alignment framework, which enhances global matching through explicit fine-grained alignment and existing implicit relational reasoning – hence the term “full-mode” – without requiring additional supervision. Specifically, we design an Adaptive Similarity Distribution Matching (A-SDM) module to rectify unmatched positive sample pairs. A-SDM adaptively pulls the unmatched positive pairs closer in the joint embedding space, thereby achieving more precise global alignment. Additionally, we introduce an Explicit Fine-grained Alignment (EFA) module, which makes up for the lack of verification capability of implicit relational reasoning. EFA strengthens explicit cross-modal fine-grained interactions by sparsifying the similarity matrix and employs a hard coding method for local alignment. Our proposed method is evaluated on three public datasets, achieving state-of-the-art performance among all global matching methods. Our code is available at https://github.com/yinhao1102/FMFA.


[33] Iterative Prompt Refinement for Safer Text-to-Image Generation cs.CVPDF

Jinwoo Jeon, JunHyeok Oh, Hayeong Lee, Byung-Jun Lee

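TL;DR: The paper proposes an iterative prompt refinement algorithm based on Vision Language Models (VLMs) that uses feedback from both the text and the generated images to improve the safety of text-to-image (T2I) models while preserving user intent.

Details

Motivation: Existing safety methods typically refine prompts with large language models (LLMs) but ignore the content of the generated images, which can lead to unsafe outputs or excessive changes to prompts that are already safe.

Result: Experiments show that the method produces safer images while maintaining strong alignment with user intent.

Insight: Visual feedback plays a crucial role in prompt refinement. Multimodal data and methods can significantly improve the safety and reliability of T2I models.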

Abstract: Text-to-Image (T2I) models have made remarkable progress in generating images from text prompts, but their output quality and safety still depend heavily on how prompts are phrased. Existing safety methods typically refine prompts using large language models (LLMs), but they overlook the images produced, which can result in unsafe outputs or unnecessary changes to already safe prompts. To address this, we propose an iterative prompt refinement algorithm that uses Vision Language Models (VLMs) to analyze both the input prompts and the generated images. By leveraging visual feedback, our method refines prompts more effectively, improving safety while maintaining user intent and reliability comparable to existing LLM-based approaches. Additionally, we introduce a new dataset labeled with both textual and visual safety signals using off-the-shelf multi-modal LLM, enabling supervised fine-tuning. Experimental results demonstrate that our approach produces safer outputs without compromising alignment with user intent, offering a practical solution for generating safer T2I content. Our code is available at https://github.com/ku-dmlab/IPR. WARNING: This paper contains examples of harmful or inappropriate images generated by models.


[34] Task-Aware Image Signal Processor for Advanced Visual Perception cs.CVPDF

Kai Chen, Jin Xiao, Leheng Zhang, Kexuan Shi, Shuhang Gu

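TL;DR: TA-ISP is a lightweight RAW-to-RGB framework that predicts multi-scale modulation operators to optimize for visual perception tasks, markedly reducing computational overhead while improving task performance.

Details

Motivation: Conventional ISP approaches to RAW data suffer from either heavy computational overhead or limited representational capacity, which limits gains on visual perception tasks.

Result: On multiple RAW-domain detection and segmentation tasks, TA-ISP improves accuracy while markedly reducing parameter count and inference time.

Insight: Factorized multi-scale modulation effectively balances computational cost and task performance and suits resource-constrained devices.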

Abstract: In recent years, there has been a growing trend in computer vision towards exploiting RAW sensor data, which preserves richer information compared to conventional low-bit RGB images. Early studies mainly focused on enhancing visual quality, while more recent efforts aim to leverage the abundant information in RAW data to improve the performance of visual perception tasks such as object detection and segmentation. However, existing approaches still face two key limitations: large-scale ISP networks impose heavy computational overhead, while methods based on tuning traditional ISP pipelines are restricted by limited representational capacity. To address these issues, we propose Task-Aware Image Signal Processing (TA-ISP), a compact RAW-to-RGB framework that produces task-oriented representations for pretrained vision models. Instead of heavy dense convolutional pipelines, TA-ISP predicts a small set of lightweight, multi-scale modulation operators that act at global, regional, and pixel scales to reshape image statistics across different spatial extents. This factorized control significantly expands the range of spatially varying transforms that can be represented while keeping memory usage, computation, and latency tightly constrained. Evaluated on several RAW-domain detection and segmentation benchmarks under both daytime and nighttime conditions, TA-ISP consistently improves downstream accuracy while markedly reducing parameter count and inference time, making it well suited for deployment on resource-constrained devices.


[35] VocSegMRI: Multimodal Learning for Precise Vocal Tract Segmentation in Real-time MRI cs.CVPDF

Daiqi Liu, Tomás Arias-Vergara, Johannes Enk, Fangxu Xing, Maureen Stone

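TL;DR: VocSegMRI proposes a multimodal framework that combines video, audio, and phonological inputs, using cross-attention fusion and contrastive learning to improve vocal tract segmentation accuracy in real-time MRI.

Details

Motivation: Existing methods rely mainly on visual information and overlook the complementary role of acoustic and phonological signals. Multimodal learning can segment vocal tract structures more precisely.

Result: The method reaches a Dice score of 0.95 and an HD_95 of 4.20 mm, outperforming unimodal and multimodal baselines.

Insight: Multimodal modeling and contrastive learning markedly improve segmentation accuracy and robustness, and performance holds up even when audio is unavailable.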

Abstract: Accurately segmenting articulatory structures in real-time magnetic resonance imaging (rtMRI) remains challenging, as most existing methods rely almost entirely on visual cues. Yet synchronized acoustic and phonological signals provide complementary context that can enrich visual information and improve precision. In this paper, we introduce VocSegMRI, a multimodal framework that integrates video, audio, and phonological inputs through cross-attention fusion for dynamic feature alignment. To further enhance cross-modal representation, we incorporate a contrastive learning objective that improves segmentation performance even when the audio modality is unavailable at inference. Evaluated on a subset of the USC-75 rtMRI dataset, our approach achieves state-of-the-art performance, with a Dice score of 0.95 and a 95th percentile Hausdorff Distance (HD_95) of 4.20 mm, outperforming both unimodal and multimodal baselines. Ablation studies confirm the contributions of cross-attention and contrastive learning to segmentation precision and robustness. These results highlight the value of integrative multimodal modeling for accurate vocal tract analysis.
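For readers unfamiliar with the reported metric, the Dice score for two binary masks A and B is 2|A∩B| / (|A| + |B|). The sketch below computes it on flat 0/1 lists; the function name, the convention of returning 1.0 for two empty masks, and the toy masks are assumptions made for illustration, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient for binary masks given as flat lists of 0/1 values."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if denom == 0 else 2.0 * intersection / denom

# Toy masks: the prediction covers the single target pixel plus one extra.
score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

A score of 1.0 means perfect overlap and 0.0 means none, so the reported 0.95 indicates that predicted and reference masks overlap almost completely.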


[36] AdaThinkDrive: Adaptive Thinking via Reinforcement Learning for Autonomous Driving cs.CVPDF

Yuechen Luo, Fang Li, Shaoqing Xu, Zhiyi Lai, Lei Yang

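TL;DR: AdaThinkDrive is a dual-mode reasoning framework that combines fast answering and slow thinking, adaptively choosing the reasoning mode to improve both decision quality and efficiency in autonomous driving.

Details

Motivation: Reasoning techniques such as CoT, as used in current autonomous driving models, often fall short in simple scenarios, where they add computational overhead without improving decisions. An adaptive mechanism is needed to tell apart scenarios that require reasoning from those that do not.

Result: On the Navsim benchmark, AdaThinkDrive reaches a PDMS of 90.3, which is 1.7 points above the best vision-only baseline, and cuts inference time by 14% relative to the always-Think baseline.

Insight: Adaptive reasoning can balance decision quality against computational cost, and it suggests a new direction for designing reasoning mechanisms in complex tasks.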

Abstract: Reasoning techniques such as Chain of Thought (CoT) have been widely adopted in Vision-Language-Action (VLA) models and demonstrate promising capabilities in end-to-end autonomous driving. However, recent efforts to integrate CoT reasoning often fall short in simple scenarios, introducing unnecessary computational overhead without improving decision quality. To address this, we propose AdaThinkDrive, a novel VLA framework with a dual-mode reasoning mechanism inspired by fast and slow thinking. First, our framework is pretrained on large-scale autonomous driving (AD) scenarios using both question answering (QA) and trajectory datasets to acquire world knowledge and driving commonsense. During supervised fine-tuning (SFT), we introduce a two-mode dataset, fast answering (w/o CoT) and slow thinking (with CoT), enabling the model to distinguish between scenarios that require reasoning. Furthermore, an Adaptive Think Reward strategy is proposed in conjunction with the Group Relative Policy Optimization (GRPO), which rewards the model for selectively applying CoT by comparing trajectory quality across different reasoning modes. Extensive experiments on the Navsim benchmark show that AdaThinkDrive achieves a PDMS of 90.3, surpassing the best vision-only baseline by 1.7 points. Moreover, ablations show that AdaThinkDrive surpasses both the never-Think and always-Think baselines, improving PDMS by 2.0 and 1.4, respectively. It also reduces inference time by 14% compared to the always-Think baseline, demonstrating its ability to balance accuracy and efficiency through adaptive reasoning.


[37] CETUS: Causal Event-Driven Temporal Modeling With Unified Variable-Rate Scheduling cs.CVPDF

Hanfang Liang, Bing Wang, Shizhen Zhang, Wen Jiang, Yizhuo Yang

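TL;DR: CETUS proposes an architecture that processes raw event streams directly, using a lightweight causal spatial encoder and linear-complexity Mamba state space models for efficient spatiotemporal modeling, and adapts its processing rate at inference to balance window latency against inference latency.

Details

Motivation: Existing methods convert event streams into intermediate representations such as frames or voxel grids, which introduces window latency, while pointwise detection methods are too computationally expensive to run in real time. CETUS aims to process raw event streams directly and avoid both limitations.

Result: CETUS avoids the window latency of intermediate representations and improves processing efficiency, making it suitable for high-speed vision tasks.

Insight: Processing raw event streams directly reduces latency, and pairing a lightweight encoder with state space models is key to efficient spatiotemporal modeling.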

Abstract: Event cameras capture asynchronous pixel-level brightness changes with microsecond temporal resolution, offering unique advantages for high-speed vision tasks. Existing methods often convert event streams into intermediate representations such as frames, voxel grids, or point clouds, which inevitably require predefined time windows and thus introduce window latency. Meanwhile, pointwise detection methods cannot reach real-time efficiency because of their high computational cost. To overcome these limitations, we propose the Variable-Rate Spatial Event Mamba, a novel architecture that directly processes raw event streams without intermediate representations. Our method introduces a lightweight causal spatial neighborhood encoder to efficiently capture local geometric relations, followed by Mamba-based state space models for scalable temporal modeling with linear complexity. During inference, a controller adaptively adjusts the processing speed according to the event rate, achieving an optimal balance between window latency and inference latency.


[38] BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching cs.CV | cs.AIPDF

Hanshuai Cui, Zhiqing Tang, Zhifei Xu, Zhi Yao, Wenyi Zeng

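TL;DR: The paper proposes BWCache, a training-free acceleration method that caches and reuses block-level features to cut redundant computation in Diffusion Transformers (DiTs) for video generation, speeding up inference.

Details

Motivation: DiTs perform well in video generation, but their sequential denoising process causes high latency, and existing acceleration methods either sacrifice visual quality or fail to reuse intermediate features effectively. The paper finds that DiT blocks are the main source of latency and that their features are highly similar across intermediate timesteps, which leaves room for optimization.

Result: Experiments show that BWCache achieves up to 2.24$\times$ speedup on several video diffusion models while maintaining comparable visual quality.

Insight: The high similarity of DiT block features at intermediate timesteps creates an opportunity for caching and reuse, and the threshold-based similarity indicator is key to balancing speed against visual quality.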

Abstract: Recent advancements in Diffusion Transformers (DiTs) have established them as the state-of-the-art method for video generation. However, their inherently sequential denoising process results in inevitable latency, limiting real-world applicability. Existing acceleration methods either compromise visual quality due to architectural modifications or fail to reuse intermediate features at proper granularity. Our analysis reveals that DiT blocks are the primary contributors to inference latency. Across diffusion timesteps, the feature variations of DiT blocks exhibit a U-shaped pattern with high similarity during intermediate timesteps, which suggests substantial computational redundancy. In this paper, we propose Block-Wise Caching (BWCache), a training-free method to accelerate DiT-based video generation. BWCache dynamically caches and reuses features from DiT blocks across diffusion timesteps. Furthermore, we introduce a similarity indicator that triggers feature reuse only when the differences between block features at adjacent timesteps fall below a threshold, thereby minimizing redundant computations while maintaining visual fidelity. Extensive experiments on several video diffusion models demonstrate that BWCache achieves up to 2.24$\times$ speedup with comparable visual quality.
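
The similarity-gated reuse described above can be illustrated with a toy sketch. It assumes blocks are plain Python callables acting on lists of floats, uses a relative L1 change as the similarity indicator, and reuses the cache for a fixed number of steps once the indicator drops below the threshold. The names `relative_l1`, `run_with_block_cache`, and `max_reuse` are hypothetical, and the paper's actual indicator and reuse schedule may differ.

```python
def relative_l1(curr, prev):
    """Absolute change between two feature vectors, normalised by the previous magnitude."""
    num = sum(abs(c - p) for c, p in zip(curr, prev))
    den = sum(abs(p) for p in prev) or 1.0
    return num / den


def run_with_block_cache(blocks, inputs, threshold, max_reuse=1):
    """Process one input per timestep, skipping block computation when features barely change.

    Returns the per-step outputs and the number of steps that were fully computed.
    """
    cache = None        # block outputs from the last fully computed step
    reuse_left = 0      # how many upcoming steps may reuse the cache
    computed_steps = 0
    outputs = []
    for x in inputs:
        if cache is not None and reuse_left > 0:
            feats = cache               # reuse cached block features, no computation
            reuse_left -= 1
        else:
            feats, h = [], x
            for block in blocks:        # full forward pass through every block
                h = block(h)
                feats.append(h)
            computed_steps += 1
            if cache is not None:
                # Similarity indicator: largest change of any block versus the cached step.
                change = max(relative_l1(c, p) for c, p in zip(feats, cache))
                if change < threshold:
                    reuse_left = max_reuse
            cache = feats
        outputs.append(feats[-1])
    return outputs, computed_steps


# Toy run: a single "block" that doubles its input, fed a constant signal.
double = lambda h: [2.0 * v for v in h]
outs, n_computed = run_with_block_cache([double], [[1.0]] * 4, threshold=0.05)
print(n_computed, "of", len(outs), "steps computed")
```

Setting `threshold` to zero disables reuse entirely, which makes it easy to compare cached and uncached runs on the same inputs.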


[39] Diving into Mitigating Hallucinations from a Vision Perspective for Large Vision-Language Models cs.CV | cs.CLPDF

Weihang Wang, Xinhao Li, Ziyue Wang, Yan Pang, Jielei Zhang

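TL;DR: The paper targets object hallucination in Large Vision-Language Models (LVLMs), introducing the VHBench-10 benchmark and VisionWeaver, a context-aware routing network, to reduce hallucinations and improve performance.

Details

Motivation: Object hallucination in LVLMs significantly hampers their real-world use. The different training paradigms of visual encoders may give them distinct inductive biases and therefore diverse hallucination behavior, which existing benchmarks fail to capture.

Result: Experiments show that VisionWeaver significantly reduces hallucinations and improves overall model performance.

Insight: The inductive biases of visual encoders strongly influence hallucination behavior, and dynamic routing is an effective feature fusion strategy.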

Abstract: Object hallucination in Large Vision-Language Models (LVLMs) significantly impedes their real-world applicability. As the primary component for accurately interpreting visual information, the choice of visual encoder is pivotal. We hypothesize that the diverse training paradigms employed by different visual encoders instill them with distinct inductive biases, which leads to their diverse hallucination performances. Existing benchmarks typically focus on coarse-grained hallucination detection and fail to capture the diverse hallucinations elaborated in our hypothesis. To systematically analyze these effects, we introduce VHBench-10, a comprehensive benchmark with approximately 10,000 samples for evaluating LVLMs across ten fine-grained hallucination categories. Our evaluations confirm encoders exhibit unique hallucination characteristics. Building on these insights and the suboptimality of simple feature fusion, we propose VisionWeaver, a novel Context-Aware Routing Network. It employs global visual features to generate routing signals, dynamically aggregating visual features from multiple specialized experts. Comprehensive experiments confirm the effectiveness of VisionWeaver in significantly reducing hallucinations and improving overall model performance.
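
The routing idea in the abstract, where global visual features produce routing signals that weight several expert encoders, can be sketched as a softmax-gated sum. This is only an illustration under stated assumptions: the linear router, the function names `softmax` and `route_and_aggregate`, and the toy numbers are hypothetical, and the paper's network is likely more elaborate.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def route_and_aggregate(global_feat, router_weights, expert_feats):
    """Weight each expert's feature vector by a gate derived from the global feature."""
    # One routing logit per expert: dot product of that expert's router row with the global feature.
    logits = [sum(w * g for w, g in zip(row, global_feat)) for row in router_weights]
    gates = softmax(logits)
    dim = len(expert_feats[0])
    # Fused feature is the gate-weighted sum of the expert features.
    fused = [sum(g * feat[i] for g, feat in zip(gates, expert_feats)) for i in range(dim)]
    return gates, fused


# Two experts with an uninformative router: both receive equal weight.
gates, fused = route_and_aggregate(
    global_feat=[1.0, 0.0],
    router_weights=[[0.0, 0.0], [0.0, 0.0]],
    expert_feats=[[1.0, 3.0], [3.0, 5.0]],
)
print(gates, fused)
```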


[40] SWA-PF: Semantic-Weighted Adaptive Particle Filter for Memory-Efficient 4-DoF UAV Localization in GNSS-Denied Environments cs.CVPDF

Jiayu Yuan, Ming Dai, Enhui Zheng, Chao Su, Nanxing Chen

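TL;DR: The paper proposes the Semantic-Weighted Adaptive Particle Filter (SWA-PF) for efficient, accurate UAV localization in GNSS-denied environments, and releases the Multi-Altitude Flight Segments (MAFS) dataset.

Details

Motivation: Existing retrieval-based UAV localization methods are limited in real-time performance, environmental sensitivity, and generalization, particularly in dynamic or temporally varying environments. The paper aims to address these problems.

Result: The method achieves a 10x computational efficiency gain over feature extraction methods, keeps global positioning errors below 10 meters, and estimates 4-DoF pose within seconds on low-resolution satellite maps.

Insight: Particle filtering that incorporates semantic information offers clear advantages for UAV localization, especially in dynamic environments, and low-resolution satellite maps can still support accurate localization.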

Abstract: Vision-based Unmanned Aerial Vehicle (UAV) localization systems have been extensively investigated for Global Navigation Satellite System (GNSS)-denied environments. However, existing retrieval-based approaches face limitations in dataset availability and persistent challenges including suboptimal real-time performance, environmental sensitivity, and limited generalization capability, particularly in dynamic or temporally varying environments. To overcome these limitations, we present a large-scale Multi-Altitude Flight Segments dataset (MAFS) for variable altitude scenarios and propose a novel Semantic-Weighted Adaptive Particle Filter (SWA-PF) method. This approach integrates robust semantic features from both UAV-captured images and satellite imagery through two key innovations: a semantic weighting mechanism and an optimized particle filtering architecture. Evaluated using our dataset, the proposed method achieves 10x computational efficiency gain over feature extraction methods, maintains global positioning errors below 10 meters, and enables rapid 4 degree of freedom (4-DoF) pose estimation within seconds using accessible low-resolution satellite maps. Code and dataset will be available at https://github.com/YuanJiayuuu/SWA-PF.
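
As background for the particle filtering mentioned above, the sketch below shows a generic measurement update in which each particle's weight is scaled by a semantic similarity score, together with the effective sample size that particle filters commonly use to decide when to resample. It is not the authors' implementation: the function names, the similarity scores, and the use of effective sample size here are assumptions for illustration.

```python
def update_weights(weights, similarities):
    """Multiply each particle weight by its semantic similarity score and renormalise."""
    raw = [w * s for w, s in zip(weights, similarities)]
    total = sum(raw)
    if total == 0:
        # No particle matches the observation: fall back to a uniform distribution.
        return [1.0 / len(weights)] * len(weights)
    return [r / total for r in raw]


def effective_sample_size(weights):
    """1 / sum(w^2): equals the particle count for uniform weights, 1 when one particle dominates."""
    return 1.0 / sum(w * w for w in weights)


# Four pose hypotheses (for example x, y, altitude, yaw); only the first two match the map.
weights = update_weights([0.25] * 4, [1.0, 1.0, 0.0, 0.0])
print(weights, effective_sample_size(weights))
```

A low effective sample size relative to the particle count signals that most particles carry negligible weight, which is the usual trigger for resampling.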


[41] Consistent View Alignment Improves Foundation Models for 3D Medical Image Segmentation cs.CV | cs.LGPDF

Puru Vaish, Felix Meister, Tobias Heimann, Christoph Brune, Jelmer M. Wolterink

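TL;DR: The paper challenges the assumption in representation learning that uncorrelated views suffice to learn effective representations, and proposes Consistent View Alignment, a method that explicitly aligns views and performs strongly on 3D medical image segmentation.

Details

Motivation: Existing representation learning methods assume that uncorrelated views are enough to learn effective representations, but the paper finds that meaningful structure in the latent space does not emerge naturally and must be induced by explicitly aligning views.

Result: In the MICCAI 2025 SSL3D challenge, the method took first place with a Primus vision transformer and second place with a ResEnc convolutional neural network.

Insight: Effective latent representations require explicit, structured alignment rather than natural emergence, which matters particularly for medical image segmentation.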

Abstract: Many recent approaches in representation learning implicitly assume that uncorrelated views of a data point are sufficient to learn meaningful representations for various downstream tasks. In this work, we challenge this assumption and demonstrate that meaningful structure in the latent space does not emerge naturally. Instead, it must be explicitly induced. We propose a method that aligns representations from different views of the data to align complementary information without inducing false positives. Our experiments show that our proposed self-supervised learning method, Consistent View Alignment, improves performance for downstream tasks, highlighting the critical role of structured view alignment in learning effective representations. Our method achieved first and second place in the MICCAI 2025 SSL3D challenge when using a Primus vision transformer and ResEnc convolutional neural network, respectively. The code and pretrained model weights are released at https://github.com/Tenbatsu24/LatentCampus.


[42] SpecDiff: Accelerating Diffusion Model Inference with Self-Speculation cs.CV | cs.LGPDF

Jiayi Pan, Jiaming Xu, Yongkang Zhou, Guohao Dai

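TL;DR: SpecDiff is a training-free multi-level feature caching strategy that uses self-speculative information to make diffusion model inference more efficient, delivering substantial speedups while preserving quality.

Details

Motivation: Existing feature caching methods rely only on historical information, which constrains both accuracy and speed. The authors address this by introducing future information through self-speculation.

Result: On Stable Diffusion 3, 3.5, and FLUX, SpecDiff achieves average speedups of 2.80$\times$, 2.74$\times$, and 3.17$\times$ with negligible quality loss.

Insight: By merging speculative and historical information, SpecDiff pushes the Pareto frontier of speed and accuracy in efficient diffusion model inference.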

Abstract: Feature caching has recently emerged as a promising method for diffusion model acceleration. It effectively alleviates the inefficiency problem caused by high computational requirements by caching similar features in the inference process of the diffusion model. In this paper, we analyze existing feature caching methods from the perspective of information utilization, and point out that relying solely on historical information will lead to constrained accuracy and speed performance. We propose a novel paradigm that introduces future information via self-speculation based on the information similarity at the same time step across different iteration times. Based on this paradigm, we present SpecDiff, a training-free multi-level feature caching strategy including a cached feature selection algorithm and a multi-level feature classification algorithm. (1) Feature selection algorithm based on self-speculative information. SpecDiff determines a dynamic importance score for each token based on self-speculative information and historical information, and performs cached feature selection through the importance score. (2) Multi-level feature classification algorithm based on feature importance scores. SpecDiff classifies tokens by leveraging the differences in feature importance scores and introduces a multi-level feature calculation strategy. Extensive experiments show that SpecDiff achieves average 2.80$\times$, 2.74$\times$, and 3.17$\times$ speedups with negligible quality loss on Stable Diffusion 3, 3.5, and FLUX, respectively, compared to RFlow on an NVIDIA A800-80GB GPU. By merging speculative and historical information, SpecDiff overcomes the speedup-accuracy trade-off bottleneck, pushing the Pareto frontier of speedup and accuracy in efficient diffusion model inference.


[43] Dense Video Understanding with Gated Residual Tokenization cs.CV | cs.AI | cs.CL | cs.LGPDF

Haichao Zhang, Wenhao Chai, Shwai He, Ang Li, Yun Fu

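TL;DR: The paper introduces Dense Video Understanding (DVU) and Gated Residual Tokenization (GRT) for efficient high-frame-rate video understanding. By reducing tokenization time and token overhead, it addresses the weakness of current video large language models on dense temporal information.

Details

Motivation: Current video large language models and benchmarks mostly rely on low-frame-rate sampling and discard dense temporal information, so they perform poorly on tasks that need precise temporal alignment, such as lecture comprehension.

Result: On DIVE, GRT outperforms larger video LLM baselines, and its performance improves as FPS increases.

Insight: Dense temporal information is essential for video understanding, and GRT offers an efficient, scalable way to process high-frame-rate video.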

Abstract: High temporal resolution is essential for capturing fine-grained details in video understanding. However, current video large language models (VLLMs) and benchmarks mostly rely on low-frame-rate sampling, such as uniform sampling or keyframe selection, discarding dense temporal information. This compromise avoids the high cost of tokenizing every frame, which otherwise leads to redundant computation and linear token growth as video length increases. While this trade-off works for slowly changing content, it fails for tasks like lecture comprehension, where information appears in nearly every frame and requires precise temporal alignment. To address this gap, we introduce Dense Video Understanding (DVU), which enables high-FPS video comprehension by reducing both tokenization time and token overhead. Existing benchmarks are also limited, as their QA pairs focus on coarse content changes. We therefore propose DIVE (Dense Information Video Evaluation), the first benchmark designed for dense temporal reasoning. To make DVU practical, we present Gated Residual Tokenization (GRT), a two-stage framework: (1) Motion-Compensated Inter-Gated Tokenization uses pixel-level motion estimation to skip static regions during tokenization, achieving sub-linear growth in token count and compute. (2) Semantic-Scene Intra-Tokenization Merging fuses tokens across static regions within a scene, further reducing redundancy while preserving dynamic semantics. Experiments on DIVE show that GRT outperforms larger VLLM baselines and scales positively with FPS. These results highlight the importance of dense temporal information and demonstrate that GRT enables efficient, scalable high-FPS video understanding.
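
The first GRT stage skips static regions during tokenization. A toy sketch of that gating idea follows. It substitutes simple frame differencing for the pixel-level motion estimation named in the abstract, represents each frame as a list of patches (each patch a list of pixel values), and uses the hypothetical name `gated_patch_indices`; none of these details come from the paper.

```python
def gated_patch_indices(frames, threshold):
    """For each frame, return the indices of the patches that would be tokenised."""
    selected = []
    prev = None
    for frame in frames:
        if prev is None:
            # First frame acts as a key frame: every patch is tokenised.
            keep = list(range(len(frame)))
        else:
            # Later frames: keep only patches whose mean absolute change exceeds the threshold.
            keep = [
                i
                for i, (patch, old) in enumerate(zip(frame, prev))
                if sum(abs(a - b) for a, b in zip(patch, old)) / len(patch) > threshold
            ]
        selected.append(keep)
        prev = frame
    return selected


# Three frames with two patches each; only the second patch changes, and only once.
frames = [
    [[0, 0], [0, 0]],
    [[0, 0], [10, 10]],
    [[0, 0], [10, 10]],
]
print(gated_patch_indices(frames, threshold=1))
```

Because static patches produce no tokens after the first frame, the token count grows with the amount of motion rather than with the number of frames, which is the sub-linear behavior the abstract describes.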


[44] EDITS: Enhancing Dataset Distillation with Implicit Textual Semantics cs.CVPDF

Qianxin Xia, Jiawei Du, Guoming Lu, Zhiyong Shu, Jielei Wang

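TL;DR: EDITS is a framework that enhances dataset distillation by exploiting the implicit textual semantics in images, combining a vision-language model and a large language model to generate the synthetic dataset.

Details

Motivation: Traditional dataset distillation methods focus mainly on low-level visual features and neglect the high-level semantic and structural information in images. EDITS improves distillation by bringing in textual semantics.

Result: Experiments confirm the effectiveness of EDITS for dataset distillation.

Insight: Textual semantics play an important role in dataset distillation, and combining multimodal models (a VLM and an LLM) captures high-level information better.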

Abstract: Dataset distillation aims to synthesize a compact dataset from the original large-scale one, enabling highly efficient learning while preserving competitive model performance. However, traditional techniques primarily capture low-level visual features, neglecting the high-level semantic and structural information inherent in images. In this paper, we propose EDITS, a novel framework that exploits the implicit textual semantics within the image data to achieve enhanced distillation. First, external texts generated by a Vision Language Model (VLM) are fused with image features through a Global Semantic Query module, forming the prior clustered buffer. Local Semantic Awareness then selects representative samples from the buffer to construct image and text prototypes, with the latter produced by guiding a Large Language Model (LLM) with a meticulously crafted prompt. Ultimately, a Dual Prototype Guidance strategy generates the final synthetic dataset through a diffusion model. Extensive experiments confirm the effectiveness of our method. Source code is available at https://github.com/einsteinxia/EDITS.


[45] LamiGauss: Pitching Radiative Gaussian for Sparse-View X-ray Laminography Reconstruction cs.CV | cs.LGPDF

Chu Chen, Ander Biguri, Jean-Michel Morel, Raymond H. Chan, Carola-Bibiane Schönlieb

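TL;DR: LamiGauss is a reconstruction algorithm for sparse-view X-ray laminography that combines Gaussian Splatting radiative rasterization with a dedicated detector-to-world transformation model, improving reconstruction quality under highly sparse-view conditions.

Details

Motivation: X-ray laminography is essential for non-destructive inspection of plate-like structures such as microchips and composite battery materials, where conventional CT struggles because of geometric constraints. High-quality reconstruction under sparse-view conditions remains challenging.

Result: Experiments on synthetic and real datasets demonstrate the method's effectiveness. Using only 3% of full views, LamiGauss outperforms an iterative method optimized on the full dataset.

Insight: Gaussian Splatting radiative rasterization shows promise for sparse-view reconstruction, and combining it with a dedicated transformation model and an artifact-filtering initialization improves reconstruction quality.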

Abstract: X-ray Computed Laminography (CL) is essential for non-destructive inspection of plate-like structures in applications such as microchips and composite battery materials, where traditional computed tomography (CT) struggles due to geometric constraints. However, reconstructing high-quality volumes from laminographic projections remains challenging, particularly under highly sparse-view acquisition conditions. In this paper, we propose a reconstruction algorithm, namely LamiGauss, that combines Gaussian Splatting radiative rasterization with a dedicated detector-to-world transformation model incorporating the laminographic tilt angle. LamiGauss leverages an initialization strategy that explicitly filters out common laminographic artifacts from the preliminary reconstruction, preventing redundant Gaussians from being allocated to false structures and thereby concentrating model capacity on representing the genuine object. Our approach effectively optimizes directly from sparse projections, enabling accurate and efficient reconstruction with limited data. Extensive experiments on both synthetic and real datasets demonstrate the effectiveness and superiority of the proposed method over existing techniques. LamiGauss uses only 3% of full views to achieve superior performance over the iterative method optimized on a full dataset.


[46] Distractor-Aware Memory-Based Visual Object Tracking cs.CVPDF

Jovana Videnovic, Matej Kristan, Alan Lukezic

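TL;DR: The paper proposes DAM4SAM, a distractor-aware memory module for visual object tracking that reduces tracking drift and improves redetection after occlusion, achieving leading results on multiple benchmarks.

Details

Motivation: Memory-based video segmentation methods such as SAM2 excel at segmentation but are not fully adapted to visual object tracking, where distractors (objects visually similar to the target) pose a key challenge.

Result: DAM4SAM outperforms SAM2.1 on thirteen benchmarks and sets new state-of-the-art results on ten. Integrating it into the real-time tracker EfficientTAM and the edge tracker EdgeTAM yields improvements of 11% and 4%, respectively.

Insight: Distractor-aware design is key to better tracking performance, especially in complex scenes and under occlusion.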

Abstract: The recent emergence of memory-based video segmentation methods such as SAM2 has led to models with excellent performance in segmentation tasks, achieving leading results on numerous benchmarks. However, these models are not fully adjusted for visual object tracking, where distractors (i.e., objects visually similar to the target) pose a key challenge. In this paper we propose a distractor-aware drop-in memory module and introspection-based management method for SAM2, leading to DAM4SAM. Our design effectively reduces the tracking drift toward distractors and improves redetection capability after object occlusion. To facilitate the analysis of tracking in the presence of distractors, we construct DiDi, a Distractor-Distilled dataset. DAM4SAM outperforms SAM2.1 on thirteen benchmarks and sets new state-of-the-art results on ten. Furthermore, integrating the proposed distractor-aware memory into a real-time tracker EfficientTAM leads to 11% improvement and matches tracking quality of the non-real-time SAM2.1-L on multiple tracking and segmentation benchmarks, while integration with edge-based tracker EdgeTAM delivers 4% performance boost, demonstrating very good generalization across architectures.


[47] EvHand-FPV: Efficient Event-Based 3D Hand Tracking from First-Person View cs.CVPDF

Zhen Xu, Guorui Lu, Chang Gao, Qinyu Chen

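TL;DR: EvHand-FPV is an efficient framework for first-person-view 3D hand tracking from a single event camera. Wrist-based ROI localization and multi-task learning improve both accuracy and efficiency.

Details

Motivation: Conventional frame-based methods struggle to meet accuracy, low-latency, and energy-efficiency requirements, especially on resource-constrained XR devices, which motivates an efficient event-camera-based approach.

Result: 2D-AUCp rises from 0.77 to 0.85, parameters drop by 89% (11.2M to 1.2M), FLOPs per inference drop by 89% (1.648G to 0.185G), and 3D-AUCp on synthetic data is 0.84.

Insight: Pairing event cameras with a lightweight design enables efficient hand tracking on resource-constrained devices, which suits XR applications.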

Abstract: Hand tracking holds great promise for intuitive interaction paradigms, but frame-based methods often struggle to meet the requirements of accuracy, low latency, and energy efficiency, especially in resource-constrained settings such as Extended Reality (XR) devices. Event cameras provide $\mu$s-level temporal resolution at mW-level power by asynchronously sensing brightness changes. In this work, we present EvHand-FPV, a lightweight framework for egocentric First-Person-View 3D hand tracking from a single event camera. We construct an event-based FPV dataset that couples synthetic training data with 3D labels and real event data with 2D labels for evaluation to address the scarcity of egocentric benchmarks. EvHand-FPV also introduces a wrist-based region of interest (ROI) that localizes the hand region via geometric cues, combined with an end-to-end mapping strategy that embeds ROI offsets into the network to reduce computation without explicit reconstruction, and a multi-task learning strategy with an auxiliary geometric feature head that improves representations without test-time overhead. On our real FPV test set, EvHand-FPV improves 2D-AUCp from 0.77 to 0.85 while reducing parameters from 11.2M to 1.2M by 89% and FLOPs per inference from 1.648G to 0.185G by 89%. It also maintains a competitive 3D-AUCp of 0.84 on synthetic data. These results demonstrate accurate and efficient egocentric event-based hand tracking suitable for on-device XR applications. The dataset and code are available at https://github.com/zen5x5/EvHand-FPV.
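
The two 89% figures above follow directly from the before and after values quoted in the abstract. A short check, where the helper name `relative_reduction` is illustrative:

```python
def relative_reduction(before, after):
    """Fraction by which a quantity shrinks when going from `before` to `after`."""
    return (before - after) / before


# Values quoted in the abstract: parameters in millions, FLOPs in GFLOPs per inference.
param_cut = relative_reduction(11.2, 1.2)
flops_cut = relative_reduction(1.648, 0.185)
print(f"parameters: {param_cut:.1%}, FLOPs: {flops_cut:.1%}")
```

Both reductions round to 89%, consistent with the reported numbers.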


[48] Towards Rationale-Answer Alignment of LVLMs via Self-Rationale Calibration cs.CVPDF

Yuanchen Wu, Ke Yan, Shouhong Ding, Ziyin Zhou, Xiaoqiang Li

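TL;DR: The paper proposes the Self-Rationale Calibration (SRC) framework, which iteratively calibrates the alignment between rationales and answers in Large Vision-Language Models (LVLMs), improving perception, reasoning, and generalization.

Details

Motivation: LVLMs perform well on visual question answering, but their generated rationales and answers are often inconsistent, leading to reasoning errors. The paper proposes SRC to address this.

Result: SRC significantly improves LVLM perception, reasoning, and generalization across multiple benchmarks, supporting the value of rationale-oriented alignment.

Insight: Aligning rationales with answers is key to stronger LVLM reasoning, and SRC's self-calibration mechanism offers a new way to approach the problem.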

Abstract: Large Vision-Language Models (LVLMs) have manifested strong visual question answering capability. However, they still struggle with aligning the rationale and the generated answer, leading to inconsistent reasoning and incorrect responses. To this end, this paper introduces the Self-Rationale Calibration (SRC) framework to iteratively calibrate the alignment between rationales and answers. SRC begins by employing a lightweight “rationale fine-tuning” approach, which modifies the model’s response format to require a rationale before deriving an answer without explicit prompts. Next, SRC searches for a diverse set of candidate responses from the fine-tuned LVLMs for each sample, followed by a proposed pairwise scoring strategy using a tailored scoring model, R-Scorer, to evaluate both rationale quality and factual consistency of candidates. Based on a confidence-weighted preference curation process, SRC decouples the alignment calibration into a preference fine-tuning manner, leading to significant improvements of LVLMs in perception, reasoning, and generalization across multiple benchmarks. Our results emphasize the rationale-oriented alignment in exploring the potential of LVLMs.


[49] Noise-Level Diffusion Guidance: Well Begun is Half Done cs.CVPDF

Harvey Mannering, Zhiwu Huang, Adam Prugel-Bennett

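TL;DR: The paper proposes Noise Level Guidance (NLG), a simple and efficient method that refines the initial noise of diffusion models to improve image quality and prompt adherence, with no extra data, networks, or backpropagation.

Details

Motivation: The initial Gaussian noise of a diffusion model affects final image quality and prompt adherence. Existing methods usually depend on extra datasets, additional networks, or backpropagation-based optimization, which limits their practicality.

Result: On five standard benchmarks, NLG improves generation quality and adherence to input conditions while remaining computationally efficient.

Insight: Optimizing the initial noise matters for diffusion model performance, and NLG is a lightweight method that integrates with existing guidance techniques.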

Abstract: Diffusion models have achieved state-of-the-art image generation. However, the random Gaussian noise used to start the diffusion process influences the final output, causing variations in image quality and prompt adherence. Existing noise-level optimization approaches generally rely on extra dataset construction, additional networks, or backpropagation-based optimization, limiting their practicality. In this paper, we propose Noise Level Guidance (NLG), a simple, efficient, and general noise-level optimization approach that refines initial noise by increasing the likelihood of its alignment with general guidance - requiring no additional training data, auxiliary networks, or backpropagation. The proposed NLG approach provides a unified framework generalizable to both conditional and unconditional diffusion models, accommodating various forms of diffusion-level guidance. Extensive experiments on five standard benchmarks demonstrate that our approach enhances output generation quality and input condition adherence. By seamlessly integrating with existing guidance methods while maintaining computational efficiency, our method establishes NLG as a practical and scalable enhancement to diffusion models. Code can be found at https://github.com/harveymannering/NoiseLevelGuidance.


[50] Can Current AI Models Count What We Mean, Not What They See? A Benchmark and Systematic Evaluation cs.CVPDF

Gia Khanh Nguyen, Yifeng Huang, Minh Hoai

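TL;DR: The paper introduces PairTally, a benchmark dataset for evaluating fine-grained visual counting, and finds that current AI models still struggle to count the objects users intend in complex scenes.

Details

Motivation: Recent models show promise in visual counting, but their ability to perform fine-grained, intent-driven counting remains unclear and calls for more rigorous evaluation.

Result: Current models count unreliably in fine-grained and visually ambiguous cases and often fail to match user intent.

Insight: Fine-grained counting requires stronger discrimination and semantic understanding, and PairTally provides a foundation for diagnosing and improving future models.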

Abstract: Visual counting is a fundamental yet challenging task, especially when users need to count objects of a specific type in complex scenes. While recent models, including class-agnostic counting models and large vision-language models (VLMs), show promise in counting tasks, their ability to perform fine-grained, intent-driven counting remains unclear. In this paper, we introduce PairTally, a benchmark dataset specifically designed to evaluate fine-grained visual counting. Each of the 681 high-resolution images in PairTally contains two object categories, requiring models to distinguish and count based on subtle differences in shape, size, color, or semantics. The dataset includes both inter-category (distinct categories) and intra-category (closely related subcategories) settings, making it suitable for rigorous evaluation of selective counting capabilities. We benchmark a variety of state-of-the-art models, including exemplar-based methods, language-prompted models, and large VLMs. Our results show that despite recent advances, current models struggle to reliably count what users intend, especially in fine-grained and visually ambiguous cases. PairTally provides a new foundation for diagnosing and improving fine-grained visual counting systems.


[51] MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment cs.CV | cs.AI | cs.LGPDF

Elena Camuffo, Francesco Barbato, Mete Ozay, Simone Milani, Umberto Michieli

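TL;DR: MOCHA is a multimodal knowledge distillation method that transfers region-level multimodal semantics from a large vision-language teacher (e.g., LLaVa) into a lightweight vision-only object detector student (e.g., YOLO), using a dual-objective loss to achieve object-level semantic alignment.

Details

Motivation: Prior approaches focus on dense or global alignment. MOCHA instead works at the object level, aiming to transfer multimodal semantics efficiently into a vision-only model without requiring textual input at inference.

Result: Across four personalized detection benchmarks, MOCHA improves the average score by 10.1 points over baselines, and despite its compact architecture it performs on par with larger multimodal models.

Insight: Object-level alignment suits multimodal knowledge transfer better than dense or global alignment, particularly for lightweight models, and improves performance without textual input.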

Abstract: We introduce MOCHA (Multi-modal Objects-aware Cross-arcHitecture Alignment), a knowledge distillation approach that transfers region-level multimodal semantics from a large vision-language teacher (e.g., LLaVa) into a lightweight vision-only object detector student (e.g., YOLO). A translation module maps student features into a joint space, where the training of the student and translator is guided by a dual-objective loss that enforces both local alignment and global relational consistency. Unlike prior approaches focused on dense or global alignment, MOCHA operates at the object level, enabling efficient transfer of semantics without modifying the teacher or requiring textual input at inference. We validate our method across four personalized detection benchmarks under few-shot regimes. Results show consistent gains over baselines, with a +10.1 average score improvement. Despite its compact architecture, MOCHA reaches performance on par with larger multimodal models, proving its suitability for real-world deployment.
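
The dual-objective loss above pairs a local alignment term with a global relational consistency term. One plausible reading is sketched below, assuming per-object embeddings as lists of floats, mean squared error for the local term, and agreement of pairwise cosine-similarity matrices for the relational term. The function names, the weighting factor `lam`, and both concrete loss forms are assumptions, not the paper's definitions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors, defined as 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def local_alignment(student, teacher):
    """Mean squared error between matching object embeddings."""
    n = len(student) * len(student[0])
    return sum((a - b) ** 2 for s, t in zip(student, teacher) for a, b in zip(s, t)) / n


def relational_consistency(student, teacher):
    """Mean squared gap between the two pairwise cosine-similarity matrices."""
    k = len(student)
    gaps = [
        (cosine(student[i], student[j]) - cosine(teacher[i], teacher[j])) ** 2
        for i in range(k)
        for j in range(k)
    ]
    return sum(gaps) / len(gaps)


def dual_objective(student, teacher, lam=1.0):
    """Local term plus `lam` times the relational term."""
    return local_alignment(student, teacher) + lam * relational_consistency(student, teacher)


# Two objects: the student separates them, the teacher maps both to the same direction.
student = [[1.0, 0.0], [0.0, 1.0]]
teacher = [[1.0, 0.0], [1.0, 0.0]]
print(dual_objective(student, teacher))
```

The local term pulls each translated student embedding toward its teacher counterpart, while the relational term penalizes differences in how objects relate to one another, even when individual embeddings are close.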


[52] SAIL-VL2 Technical Report cs.CVPDF

Weijie Yin, Yongjie Ye, Fangxun Shu, Yue Liao, Zijian Kang

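TL;DR: SAIL-VL2 is an open-source vision-language foundation model that reaches state-of-the-art multimodal understanding and reasoning at the 2B and 8B parameter scales through large-scale data curation, a progressive training framework, and a sparse Mixture-of-Experts architecture.

Details

Motivation: Existing vision-language models still have room to improve on fine-grained perception and complex reasoning. SAIL-VL2 aims to push the limits of multimodal models through innovations in data, training, and architecture.

Result: SAIL-VL2 is competitive across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. On the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under 4B parameters.

Insight: Data quality and diversity, systematically designed training paradigms, and sparse architectures are key to improving multimodal model performance.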

Abstract: We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.


[53] PROFUSEme: PROstate Cancer Biochemical Recurrence Prediction via FUSEd Multi-modal Embeddings cs.CVPDF

Suhang You, Carla Pitarch-Abaigar, Sanket Kachole, Sumedh Sonawane, Juhyung Ha

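TL;DR: PROFUSEme combines multi-modal embeddings (clinical, radiology, and pathology data) in an intermediate fusion configuration with Cox Proportional Hazard regressors for early prediction of prostate cancer biochemical recurrence (BCR), outperforming late fusion configurations.

Details

Motivation: Almost 30% of prostate cancer patients experience BCR after radical prostatectomy. Accurate early prediction of BCR can improve clinical decision-making and patient outcomes.

Result: A mean C-index of 0.861 ($\sigma=0.112$) under internal 5-fold nested cross-validation, and a C-index of 0.7103 on the hold-out data of the CHIMERA 2025 challenge validation leaderboard.

Insight: Intermediate fusion outperforms late fusion on multi-modal data and offers potential for more accurate BCR prediction.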

Abstract: Almost 30% of prostate cancer (PCa) patients undergoing radical prostatectomy (RP) experience biochemical recurrence (BCR), characterized by increased prostate specific antigen (PSA) and associated with increased mortality. Accurate early prediction of BCR, at the time of RP, would contribute to prompt adaptive clinical decision-making and improved patient outcomes. In this work, we propose prostate cancer BCR prediction via fused multi-modal embeddings (PROFUSEme), which learns cross-modal interactions of clinical, radiology, and pathology data, following an intermediate fusion configuration in combination with Cox Proportional Hazard regressors. Quantitative evaluation of our proposed approach reveals superior performance, when compared with late fusion configurations, yielding a mean C-index of 0.861 ($\sigma=0.112$) on the internal 5-fold nested cross-validation framework, and a C-index of 0.7103 on the hold out data of CHIMERA 2025 challenge validation leaderboard.
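
The C-index reported above measures how often a model ranks patients correctly by risk. A minimal sketch of Harrell's concordance index follows, assuming right-censored data given as parallel lists; the function name `concordance_index` and the toy cohort are illustrative, and the paper's exact evaluation code is not shown in the abstract.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs in which the higher-risk subject fails first.

    times  -- follow-up time per subject
    events -- 1 if the event (e.g. recurrence) was observed, 0 if censored
    risks  -- predicted risk score per subject (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event strictly before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Three patients; the last one is censored. Risks are perfectly ordered.
print(concordance_index(times=[1, 2, 3], events=[1, 1, 0], risks=[0.9, 0.5, 0.1]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported 0.861 and 0.7103 in context.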


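[54] Wan-Animate: Unified Character Animation and Replacement with Holistic Replication cs.CV

Gang Cheng, Xin Gao, Li Hu, Siqi Hu, Mingyang Huang

TL;DR: Wan-Animate is a unified framework for character animation and replacement. It generates high-fidelity character videos by precisely replicating the expressions and movements in a reference video, or it replaces the character in the original video while blending seamlessly with the environment.

Details

Motivation: To address the challenges of generating high-fidelity video and achieving seamless environmental integration in existing character animation and replacement tasks.

Result: Experiments show that Wan-Animate achieves state-of-the-art performance, producing high-quality videos with seamless environmental integration.

Insight: A unified symbolic representation supports multiple tasks, and auxiliary modules such as the Relighting LoRA are an effective way to improve environmental adaptation.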

Abstract: We introduce Wan-Animate, a unified framework for character animation and replacement. Given a character image and a reference video, Wan-Animate can animate the character by precisely replicating the expressions and movements of the character in the video to generate high-fidelity character videos. Alternatively, it can integrate the animated character into the reference video to replace the original character, replicating the scene’s lighting and color tone to achieve seamless environmental integration. Wan-Animate is built upon the Wan model. To adapt it for character animation tasks, we employ a modified input paradigm to differentiate between reference conditions and regions for generation. This design unifies multiple tasks into a common symbolic representation. We use spatially-aligned skeleton signals to replicate body motion and implicit facial features extracted from source images to reenact expressions, enabling the generation of character videos with high controllability and expressiveness. Furthermore, to enhance environmental integration during character replacement, we develop an auxiliary Relighting LoRA. This module preserves the character’s appearance consistency while applying the appropriate environmental lighting and color tone. Experimental results demonstrate that Wan-Animate achieves state-of-the-art performance. We are committed to open-sourcing the model weights and its source code.


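[55] VSE-MOT: Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Enhancement cs.CV

Jun Du, Weiwei Xing, Ming Li, Fei Richard Yu

TL;DR: The paper proposes the VSE-MOT framework, which uses visual semantic enhancement to improve multi-object tracking in low-quality video. It combines a vision-language model with an adapter design and significantly outperforms existing methods.

Details

Motivation: Existing MOT algorithms perform poorly on low-quality video, which limits practical use. The paper aims to address this with visual semantic enhancement.

Result: In low-quality video scenes, VSE-MOT's tracking metrics exceed those of existing methods by about 8% to 20%, while performance in conventional scenes remains robust.

Insight: Introducing visual semantic information and the MOT-Adapter design are key to improving MOT performance on low-quality video.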

Abstract: Current multi-object tracking (MOT) algorithms typically overlook issues inherent in low-quality videos, leading to significant degradation in tracking performance when confronted with real-world image deterioration. Therefore, advancing the application of MOT algorithms in real-world low-quality video scenarios represents a critical and meaningful endeavor. To address the challenges posed by low-quality scenarios, inspired by vision-language models, this paper proposes a Visual Semantic Enhancement-guided Multi-Object Tracking framework (VSE-MOT). Specifically, we first design a tri-branch architecture that leverages a vision-language model to extract global visual semantic information from images and fuse it with query vectors. Subsequently, to further enhance the utilization of visual semantic information, we introduce the Multi-Object Tracking Adapter (MOT-Adapter) and the Visual Semantic Fusion Module (VSFM). The MOT-Adapter adapts the extracted global visual semantic information to suit multi-object tracking tasks, while the VSFM improves the efficacy of feature fusion. Through extensive experiments, we validate the effectiveness and superiority of the proposed method in real-world low-quality video scenarios. Its tracking performance metrics outperform those of existing methods by approximately 8% to 20%, while maintaining robust performance in conventional scenarios.


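[56] AD-DINOv3: Enhancing DINOv3 for Zero-Shot Anomaly Detection with Anomaly-Aware Calibration cs.CV

Jingyi Yuan, Jianxiong Ye, Wenkang Chen, Chenqiang Gao

TL;DR: AD-DINOv3 is a multimodal framework combining DINOv3 and CLIP that improves feature alignment and anomalous-region identification for zero-shot anomaly detection, yielding significant performance gains.

Details

Motivation: Zero-shot anomaly detection (ZSAD) needs efficient, annotation-free methods to handle anomalies in unseen categories. Traditional methods rely on CLIP, while the transfer strengths of models such as DINOv3 remain underused. This paper addresses the feature misalignment and the bias toward global semantics that DINOv3 exhibits on ZSAD.

Result: On eight industrial and medical benchmarks, AD-DINOv3 matches or surpasses existing state-of-the-art methods.

Insight: Contrastive learning across visual and text modalities effectively mitigates pretrained-model bias on ZSAD, and the AACM module markedly improves discrimination of anomalous regions.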

Abstract: Zero-Shot Anomaly Detection (ZSAD) seeks to identify anomalies from arbitrary novel categories, offering a scalable and annotation-efficient solution. Traditionally, most ZSAD works have been based on the CLIP model, which performs anomaly detection by calculating the similarity between visual and text embeddings. Recently, vision foundation models such as DINOv3 have demonstrated strong transferable representation capabilities. In this work, we are the first to adapt DINOv3 for ZSAD. However, this adaptation presents two key challenges: (i) the domain bias between large-scale pretraining data and anomaly detection tasks leads to feature misalignment; and (ii) the inherent bias toward global semantics in pretrained representations often leads to subtle anomalies being misinterpreted as part of the normal foreground objects, rather than being distinguished as abnormal regions. To overcome these challenges, we introduce AD-DINOv3, a novel vision-language multimodal framework designed for ZSAD. Specifically, we formulate anomaly detection as a multimodal contrastive learning problem, where DINOv3 is employed as the visual backbone to extract patch tokens and a CLS token, and the CLIP text encoder provides embeddings for both normal and abnormal prompts. To bridge the domain gap, lightweight adapters are introduced in both modalities, enabling their representations to be recalibrated for the anomaly detection task. Beyond this baseline alignment, we further design an Anomaly-Aware Calibration Module (AACM), which explicitly guides the CLS token to attend to anomalous regions rather than generic foreground semantics, thereby enhancing discriminability. Extensive experiments on eight industrial and medical benchmarks demonstrate that AD-DINOv3 consistently matches or surpasses state-of-the-art methods, verifying its superiority as a general zero-shot anomaly detection framework.
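The abstract describes scoring anomalies by comparing visual patch tokens with text embeddings of normal and abnormal prompts. The sketch below shows one common way such a comparison is turned into a per-patch anomaly probability: cosine similarity followed by a two-way softmax. It is only an illustration under stated assumptions; the function names, the toy vectors, and the temperature of 0.07 are hypothetical and are not taken from the paper, and the adapters and AACM module are not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def patch_anomaly_scores(patch_tokens, normal_text, abnormal_text, temperature=0.07):
    """Return, for each patch, the softmax probability assigned to the abnormal prompt.

    temperature is a hypothetical value chosen for illustration only.
    """
    scores = []
    for patch in patch_tokens:
        s_normal = cosine(patch, normal_text) / temperature
        s_abnormal = cosine(patch, abnormal_text) / temperature
        # Subtract the maximum before exponentiating for numerical stability.
        m = max(s_normal, s_abnormal)
        e_normal = math.exp(s_normal - m)
        e_abnormal = math.exp(s_abnormal - m)
        scores.append(e_abnormal / (e_normal + e_abnormal))
    return scores
```

Reshaping the returned scores to the patch grid would give a coarse anomaly map; how the paper aggregates patch and CLS information is not reproduced here.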


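[57] Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing cs.CV | cs.MM

Yaru Chen, Ruohao Guo, Liting Gao, Yang Xiang, Qingyu Luo

TL;DR: The paper proposes a method for weakly-supervised audio-visual video parsing that uses an EMA-guided pseudo-supervision framework and a class-aware cross-modal agreement loss to provide segment-level supervision and modality alignment, achieving SOTA performance.

Details

Motivation: Existing methods for weakly-supervised audio-visual video parsing neglect segment-level supervision and class-aware cross-modal alignment, which limits performance. The paper aims to address these gaps.

Result: SOTA performance on the LLP and UnAV-100 datasets.

Insight: Segment-level supervision and class-aware modality alignment are crucial for weakly-supervised audio-visual video parsing; EMA-guided pseudo supervision and the CMA loss are effective solutions.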

Abstract: Weakly-supervised audio-visual video parsing (AVVP) seeks to detect audible, visible, and audio-visual events without temporal annotations. Previous work has emphasized refining global predictions through contrastive or collaborative learning, but neglected stable segment-level supervision and class-aware cross-modal alignment. To address this, we propose two strategies: (1) an exponential moving average (EMA)-guided pseudo supervision framework that generates reliable segment-level masks via adaptive thresholds or top-k selection, offering stable temporal guidance beyond video-level labels; and (2) a class-aware cross-modal agreement (CMA) loss that aligns audio and visual embeddings at reliable segment-class pairs, ensuring consistency across modalities while preserving temporal structure. Evaluations on the LLP and UnAV-100 datasets show that our method achieves state-of-the-art (SOTA) performance across multiple metrics.
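The two ingredients named in strategy (1), an EMA teacher and top-k or threshold-based segment masks, can be sketched in a few lines of plain Python. This is a generic illustration, not the authors' implementation: parameters are flattened to a list of floats, and the decay of 0.99, the function names, and the fixed threshold are assumptions.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher parameter toward the student by an exponential moving average."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def topk_segment_mask(scores, k):
    """Keep the k segments with the highest teacher scores for one class (1 = keep)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]


def threshold_segment_mask(scores, threshold):
    """Keep every segment whose teacher score reaches the threshold (1 = keep)."""
    return [1 if s >= threshold else 0 for s in scores]
```

The resulting 0/1 masks would then select which segment-class pairs receive a pseudo label; the paper's adaptive choice of threshold is not reproduced in this sketch.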


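[58] Generative AI for Misalignment-Resistant Virtual Staining to Accelerate Histopathology Workflows cs.CV

Jiabo MA, Wenqiang Li, Jinbang Li, Ziyi Liu, Linshan Wu

TL;DR: The study proposes a generative AI framework that uses cascaded registration mechanisms to resolve spatial misalignment in virtual staining, significantly improving performance, especially on heavily misaligned datasets.

Details

Motivation: Traditional histopathological diagnosis requires multiple stains, which is time-consuming, labor-intensive, and environmentally unfriendly. Virtual staining is promising, but existing methods are limited by their reliance on well-aligned paired data.

Result: Outperforms existing methods on five datasets, with average improvements of 3.2% on internal datasets and 10.1% on external datasets, and a 23.8% PSNR improvement on heavily misaligned datasets.

Insight: Cascaded registration simplifies data acquisition and offers a new direction for the development of virtual staining.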

Abstract: Accurate histopathological diagnosis often requires multiple differently stained tissue sections, a process that is time-consuming, labor-intensive, and environmentally taxing due to the use of multiple chemical stains. Recently, virtual staining has emerged as a promising alternative that is faster, tissue-conserving, and environmentally friendly. However, existing virtual staining methods face significant challenges in clinical applications, primarily due to their reliance on well-aligned paired data. Obtaining such data is inherently difficult because chemical staining processes can distort tissue structures, and a single tissue section cannot undergo multiple staining procedures without damage or loss of information. As a result, most available virtual staining datasets are either unpaired or roughly paired, making it difficult for existing methods to achieve accurate pixel-level supervision. To address this challenge, we propose a robust virtual staining framework featuring cascaded registration mechanisms to resolve spatial mismatches between generated outputs and their corresponding ground truth. Experimental results demonstrate that our method significantly outperforms state-of-the-art models across five datasets, achieving an average improvement of 3.2% on internal datasets and 10.1% on external datasets. Moreover, in datasets with substantial misalignment, our approach achieves a remarkable 23.8% improvement in peak signal-to-noise ratio compared to baseline models. The exceptional robustness of the proposed method across diverse datasets simplifies the data acquisition process for virtual staining and offers new insights for advancing its development.
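The 23.8% gain above is reported in peak signal-to-noise ratio (PSNR). As a reminder of what that metric measures, here is a minimal sketch of the standard PSNR formula over flattened pixel lists. The function name and the 8-bit default of 255 are assumptions, and the paper's exact evaluation protocol is not reproduced.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized, flattened images."""
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because PSNR compares pixels at identical coordinates, even a small spatial shift between a generated image and its ground truth lowers the score, which is why misaligned pairs make pixel-level supervision and evaluation difficult.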


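[59] Deceptive Beauty: Evaluating the Impact of Beauty Filters on Deepfake and Morphing Attack Detection cs.CV

Sara Concas, Simone Maurizio La Cava, Andrea Panzino, Ester Masala, Giulia Orrù

TL;DR: The paper studies how beauty filters affect the performance of deepfake and morphing attack detectors, and finds that filters degrade detector performance, exposing vulnerabilities in existing models.

Details

Motivation: The spread of social media beauty filters raises concerns about the reliability of facial data and the effectiveness of automated face analysis systems, especially for detecting deepfakes and morphing attacks.

Result: Applying beauty filters degrades detector performance.

Insight: Existing detection models lack robustness to image enhancement operations such as beauty filters, so more robust detection methods are needed.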

Abstract: Digital beautification through social media filters has become increasingly popular, raising concerns about the reliability of facial images and videos and the effectiveness of automated face analysis. This issue is particularly critical for digital manipulation detectors, systems aiming at distinguishing between genuine and manipulated data, especially in cases involving deepfakes and morphing attacks designed to deceive humans and automated facial recognition. This study examines whether beauty filters impact the performance of deepfake and morphing attack detectors. We perform a comprehensive analysis, evaluating multiple state-of-the-art detectors on benchmark datasets before and after applying various smoothing filters. Our findings reveal performance degradation, highlighting vulnerabilities introduced by facial enhancements and underscoring the need for robust detection models resilient to such alterations.


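[60] MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook cs.CV

Peng Xu, Shengwu Xiong, Jiajun Zhang, Yaxiong Chen, Bowen Zhou

TL;DR: This paper reviews the MARS2 2025 Challenge, which aims to advance multimodal machine learning and large language models through a large-scale benchmark, with a focus on multimodal reasoning in real-world and specialized scenarios.

Details

Motivation: Multimodal reasoning is developing rapidly but lacks unified test standards and broad application scenarios. The MARS2 challenge aims to drive progress through diverse datasets and tasks.

Result: The challenge drew 76 registered teams and released its datasets, code, and ranking lists publicly, supporting the application of multimodal reasoning in real-world scenarios.

Insight: Progress in multimodal reasoning needs diverse test scenarios and sound evaluation standards, and open-sourcing data and code helps the community advance together.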

Abstract: This paper reviews the MARS2 2025 Challenge on Multimodal Reasoning. We aim to bring together different approaches in multimodal machine learning and LLMs via a large benchmark. We hope it helps researchers follow the state of the art in this very dynamic area. Meanwhile, a growing number of testbeds have boosted the evolution of general-purpose large language models. Thus, this year’s MARS2 focuses on real-world and specialized scenarios to broaden the multimodal reasoning applications of MLLMs. Our organizing team released two tailored datasets, Lens and AdsQA, as test sets, which support general reasoning in 12 daily scenarios and domain-specific reasoning in advertisement videos, respectively. We evaluated 40+ baselines that include both generalist MLLMs and task-specific models, and opened up three competition tracks, i.e., Visual Grounding in Real-world Scenarios (VG-RS), Visual Question Answering with Spatial Awareness (VQA-SA), and Visual Reasoning in Creative Advertisement Videos (VR-Ads). Finally, 76 teams from renowned academic and industrial institutions registered, and 40+ valid submissions (out of 1200+) were included in our ranking lists. Our datasets, code sets (40+ baselines and 15+ participants’ methods), and rankings are publicly available on the MARS2 workshop website and our GitHub organization page https://github.com/mars2workshop/, where our updates and announcements of upcoming events will be continuously provided.


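[61] An Exploratory Study on Abstract Images and Visual Representations Learned from Them cs.CV

Haotian Li, Jianbo Jiao

TL;DR: This paper examines whether abstract images composed of primitive shapes can effectively convey visual semantic information. It introduces the Hierarchical Abstraction Image Dataset (HAID) and compares abstract images with conventional raster images across vision tasks.

Details

Motivation: To understand whether abstract images can convey visual semantic information effectively, and why they underperform conventional raster images in deep learning.

Result: Abstract images convey some semantic information but underperform conventional images on high-level tasks.

Insight: Abstract images may have potential for some vision tasks but need further work to close the gap with conventional images.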

Abstract: Imagine living in a world composed solely of primitive shapes: could you still recognise familiar objects? Recent studies have shown that abstract images, constructed from primitive shapes, can indeed convey visual semantic information to deep learning models. However, representations obtained from such images often fall short compared to those derived from traditional raster images. In this paper, we study the reasons behind this performance gap and investigate how much high-level semantic content can be captured at different abstraction levels. To this end, we introduce the Hierarchical Abstraction Image Dataset (HAID), a novel data collection that comprises abstract images generated from normal raster images at multiple levels of abstraction. We then train and evaluate conventional vision systems on HAID across various tasks including classification, segmentation, and object detection, providing a comprehensive comparison of rasterised and abstract image representations. We also discuss whether abstract images can be considered a potentially effective format for conveying visual semantic information and contributing to vision tasks.


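[62] BEVUDA++: Geometric-aware Unsupervised Domain Adaptation for Multi-View 3D Object Detection cs.CV

Rongyu Zhang, Jiaming Liu, Xiaoqi Li, Xiaowei Chi, Dan Wang

TL;DR: BEVUDA++ proposes a geometric-aware unsupervised domain adaptation method for bird's-eye-view (BEV) perception in multi-view 3D object detection. A Reliable Depth Teacher and a Geometric Consistent Student reduce the performance drop in cross-domain scenarios.

Details

Motivation: BEV perception matters for autonomous driving, but domain shift in cross-domain scenarios has been overlooked, causing substantial performance degradation. This paper tackles domain adaptation for multi-view 3D object detection in BEV perception.

Result: State-of-the-art performance on BEV 3D object detection, e.g., gains of 12.9% NDS and 9.5% mAP on Day-Night adaptation.

Insight: 1. The core difficulty of domain adaptation here is the accumulation of domain shift across geometric spaces; 2. Depth awareness and geometric consistency are effective ways to mitigate domain shift; 3. Uncertainty guidance can improve the stability of domain adaptation methods.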

Abstract: Vision-centric Bird’s Eye View (BEV) perception holds considerable promise for autonomous driving. Recent studies have prioritized efficiency or accuracy enhancements, yet the issue of domain shift has been overlooked, leading to substantial performance degradation upon transfer. We identify major domain gaps in real-world cross-domain scenarios and initiate the first effort to address the Domain Adaptation (DA) challenge in multi-view 3D object detection for BEV perception. Given the complexity of BEV perception approaches with their multiple components, domain shift accumulation across multi-geometric spaces (e.g., 2D, 3D Voxel, BEV) poses a significant challenge for BEV domain adaptation. In this paper, we introduce an innovative geometric-aware teacher-student framework, BEVUDA++, to diminish this issue, comprising a Reliable Depth Teacher (RDT) and a Geometric Consistent Student (GCS) model. Specifically, RDT effectively blends target LiDAR with dependable depth predictions to generate depth-aware information based on uncertainty estimation, enhancing the extraction of Voxel and BEV features that are essential for understanding the target domain. To collaboratively reduce the domain shift, GCS maps features from multiple spaces into a unified geometric embedding space, thereby narrowing the gap in data distribution between the two domains. Additionally, we introduce a novel Uncertainty-guided Exponential Moving Average (UEMA) to further reduce error accumulation due to domain shifts informed by previously obtained uncertainty guidance. To demonstrate the superiority of our proposed method, we execute comprehensive experiments in four cross-domain scenarios, securing state-of-the-art performance in BEV 3D object detection tasks, e.g., 12.9% NDS and 9.5% mAP enhancement on Day-Night adaptation.


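[63] Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions cs.CV | cs.AI

Michal Szczepanski, Martyna Poreba, Karim Haroun

TL;DR: The paper proposes STEP, a hybrid token-reduction framework that improves ViT efficiency in high-resolution semantic segmentation through dynamic patch merging and early pruning, substantially cutting computational cost with little loss of accuracy.

Details

Motivation: Vision Transformers (ViTs) perform well on high-resolution semantic segmentation, but high computational and memory costs limit their use. An efficient way to reduce the computational burden without a significant loss of accuracy is needed.

Result: STEP reduces computational cost by up to 4x and improves inference speed by 1.7x, with an accuracy drop of no more than 2%. Applied alone, dCTS reduces the token count by a factor of 2.5 and computational cost by 2.6x.

Insight: Dynamic token merging and early exits are effective ways to improve ViT efficiency in high-resolution semantic segmentation, and they open a direction for follow-up work.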

Abstract: Vision Transformers (ViTs) achieve state-of-the-art performance in semantic segmentation but are hindered by high computational and memory costs. To address this, we propose STEP (SuperToken and Early-Pruning), a hybrid token-reduction framework that combines dynamic patch merging and token pruning to enhance efficiency without significantly compromising accuracy. At the core of STEP is dCTS, a lightweight CNN-based policy network that enables flexible merging into superpatches. Encoder blocks also integrate early exits to remove high-confidence supertokens, lowering computational load. We evaluate our method on high-resolution semantic segmentation benchmarks, including images up to 1024 x 1024, and show that when dCTS is applied alone, the token count can be reduced by a factor of 2.5 compared to the standard 16 x 16 pixel patching scheme. This yields a 2.6x reduction in computational cost and a 3.4x increase in throughput when using ViT-Large as the backbone. Applying the full STEP framework further improves efficiency, reaching up to a 4x reduction in computational complexity and a 1.7x gain in inference speed, with a maximum accuracy drop of no more than 2.0%. With the proposed STEP configurations, up to 40% of tokens can be confidently predicted and halted before reaching the final encoder layer.
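To make the token-count figures concrete, the short sketch below works out how many tokens a 1024 x 1024 image produces under 16 x 16 patching and what a 2.5x reduction leaves. It is back-of-the-envelope arithmetic based only on the numbers quoted in the abstract; the function names are hypothetical, and the real dCTS merging is content-dependent rather than a fixed division.

```python
def patch_tokens(height, width, patch=16):
    """Number of non-overlapping patch tokens for an image of the given size."""
    return (height // patch) * (width // patch)


def tokens_after_reduction(token_count, factor):
    """Approximate token count after reducing by the given factor."""
    return token_count / factor


baseline = patch_tokens(1024, 1024)              # 64 x 64 grid of 16 x 16 patches
merged = tokens_after_reduction(baseline, 2.5)   # reduction factor quoted for dCTS alone
print(baseline, round(merged))
```

Since self-attention cost grows faster than linearly with token count, a 2.5x cut in tokens is consistent with the reported reduction in computation being slightly larger than the reduction in tokens.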


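[64] Cinéaste: A Fine-grained Contextual Movie Question Answering Benchmark cs.CV | I.2.10; I.2.7

Nisarg A. Shah, Amir Ziai, Chaitanya Ekanadham, Vishal M. Patel

TL;DR: This paper presents Cinéaste, a fine-grained contextual movie question answering benchmark for evaluating models' deep understanding of long-form video narratives, with diverse question types and a strict quality-control pipeline.

Details

Motivation: Existing video understanding benchmarks mostly test short-clip recognition or use template-based questions, and lack evaluation of fine-grained reasoning over long narrative content.

Result: Existing MLLMs struggle on Cinéaste; the best open-source model reaches only 63.15% accuracy, highlighting the challenge of long-range temporal reasoning.

Insight: Deep understanding of long narrative content requires stronger long-range temporal reasoning, where current models still need improvement.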

Abstract: While recent advancements in vision-language models have improved video understanding, diagnosing their capacity for deep, narrative comprehension remains a challenge. Existing benchmarks often test short-clip recognition or use template-based questions, leaving a critical gap in evaluating fine-grained reasoning over long-form narrative content. To address these gaps, we introduce $\mathsf{Cin\acute{e}aste}$, a comprehensive benchmark for long-form movie understanding. Our dataset comprises 3,119 multiple-choice question-answer pairs derived from 1,805 scenes across 200 diverse movies, spanning five novel fine-grained contextual reasoning categories. We use GPT-4o to generate diverse, context-rich questions by integrating visual descriptions, captions, scene titles, and summaries, which require deep narrative understanding. To ensure high-quality evaluation, our pipeline incorporates a two-stage filtering process: Context-Independence filtering ensures questions require video context, while Contextual Veracity filtering validates factual consistency against the movie content, mitigating hallucinations. Experiments show that existing MLLMs struggle on $\mathsf{Cin\acute{e}aste}$; our analysis reveals that long-range temporal reasoning is a primary bottleneck, with the top open-source model achieving only 63.15% accuracy. This underscores significant challenges in fine-grained contextual understanding and the need for advancements in long-form movie comprehension.


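[65] GenExam: A Multidisciplinary Text-to-Image Exam cs.CV

Zhaokai Wang, Penghao Yin, Xiangyu Zhao, Changyao Tian, Yu Qiao

TL;DR: GenExam is a multidisciplinary text-to-image exam benchmark, the first to evaluate image generation in exam form. It covers 1,000 samples across 10 subjects and shows that current state-of-the-art models struggle under strict scoring.

Details

Motivation: Existing exam-style benchmarks focus on understanding and reasoning tasks, while generation benchmarks emphasize illustrating world knowledge and visual concepts; neither evaluates rigorous drawing exams. GenExam fills this gap and aims to assess models' ability to integrate knowledge, reasoning, and generation.

Result: Even state-of-the-art models such as GPT-Image-1 and Gemini-2.5-Flash-Image score below 15% under strict scoring, and most models score almost 0%, indicating how challenging the benchmark is.

Insight: By framing image generation as an exam, GenExam offers a new way to evaluate how well models integrate knowledge, reasoning, and generation on the path to AGI, and it exposes the limitations of current generative models.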

Abstract: Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility. Experiments show that even state-of-the-art models such as GPT-Image-1 and Gemini-2.5-Flash-Image achieve less than 15% strict scores, and most models yield almost 0%, suggesting the great challenge of our benchmark. By framing image generation as an exam, GenExam offers a rigorous assessment of models’ ability to integrate knowledge, reasoning, and generation, providing insights on the path to general AGI.


cs.CL

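[66] Op-Fed: Opinion, Stance, and Monetary Policy Annotations on FOMC Transcripts Using Active Learning cs.CL

Alisa Kanganis, Katherine A. Keith

TL;DR: The paper releases Op-Fed, a dataset of 1044 human-annotated sentences and their contexts from FOMC transcripts. It addresses the technical challenges of class imbalance and inter-sentence dependence, and uses active learning to make annotation more efficient.

Details

Motivation: The FOMC's monetary policy discussions have a major public impact, but relevant datasets are scarce and hard to annotate. The paper aims to address class imbalance and inter-sentence dependence and to provide high-quality annotated data.

Result: An LLM achieves 0.80 zero-shot accuracy on opinion classification but only 0.61 on classifying stance towards monetary policy, below the human baseline of 0.89.

Insight: Active learning helps with class imbalance; models still need to improve on complex tasks such as stance classification; inter-sentence dependence is a major annotation challenge.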

Abstract: The U.S. Federal Open Market Committee (FOMC) regularly discusses and sets monetary policy, affecting the borrowing and spending decisions of millions of people. In this work, we release Op-Fed, a dataset of 1044 human-annotated sentences and their contexts from FOMC transcripts. We faced two major technical challenges in dataset creation: imbalanced classes (we estimate fewer than 8% of sentences express a non-neutral stance towards monetary policy) and inter-sentence dependence (65% of instances require context beyond the sentence level). To address these challenges, we first developed a five-stage hierarchical schema to isolate aspects of opinion, monetary policy, and stance towards monetary policy as well as the level of context needed. Second, we selected instances to annotate using active learning, roughly doubling the number of positive instances across all schema aspects. Using Op-Fed, we found a top-performing, closed-weight LLM achieves 0.80 zero-shot accuracy in opinion classification but only 0.61 zero-shot accuracy classifying stance towards monetary policy, below our human baseline of 0.89. We expect Op-Fed to be useful for future model training, confidence calibration, and as a seed dataset for future annotation efforts.


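[67] Latent Traits and Cross-Task Transfer: Deconstructing Dataset Interactions in LLM Fine-tuning cs.CL | cs.LG

Shambhavi Krishna, Atharva Naik, Chaitali Agarwal, Sudharshan Govindan, Taesung Lee

TL;DR: The paper proposes an analysis framework that builds a transfer learning matrix and applies dimensionality reduction to examine latent abilities and cross-task interactions when fine-tuning LLMs on different datasets. It finds that performance gains often track hidden statistical factors of the dataset (such as class distribution and generation length proclivities) rather than surface-level similarity.

Details

Motivation: Deployed LLMs encounter tasks not seen during training, and obtaining high-quality training data for every task is infeasible. Transfer learning is therefore necessary, but the mechanisms of cross-task interaction are not well understood.

Result: Performance gains correlate more with hidden statistical factors of the source dataset (such as class distribution and generation length proclivities) and specific linguistic features than with surface-level similarity or data quality.

Insight: Surface-level data similarity does not explain the complexity of transfer learning; hidden statistical factors matter more, which points toward more predictable LLM adaptation.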

Abstract: Large language models are increasingly deployed across diverse applications. This often includes tasks LLMs have not encountered during training. This implies that enumerating and obtaining the high-quality training data for all tasks is infeasible. Thus, we often need to rely on transfer learning using datasets with different characteristics, and anticipate out-of-distribution requests. Motivated by this practical need, we propose an analysis framework, building a transfer learning matrix and dimensionality reduction, to dissect these cross-task interactions. We train and analyze 10 models to identify latent abilities (e.g., Reasoning, Sentiment Classification, NLU, Arithmetic) and discover the side effects of the transfer learning. Our findings reveal that performance improvements often defy explanations based on surface-level dataset similarity or source data quality. Instead, hidden statistical factors of the source dataset, such as class distribution and generation length proclivities, alongside specific linguistic features, are actually more influential. This work offers insights into the complex dynamics of transfer learning, paving the way for more predictable and effective LLM adaptation.


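[68] Sparse Neurons Carry Strong Signals of Question Ambiguity in LLMs cs.CL | cs.AI

Zhuoxuan Zhang, Jinhao Duan, Edward Kim, Kaidi Xu

TL;DR: The paper finds that, within the internal representations of LLMs, a sparse set of neurons linearly encodes question ambiguity during the pre-filling stage, indicating that ambiguity signals form early in the model's processing.

Details

Motivation: Ambiguity is pervasive in real-world questions, yet LLMs often answer confidently rather than seek clarification. Studying how ambiguity is encoded and can be controlled in LLMs is therefore valuable.

Result: Probes trained on Ambiguity-Encoding Neurons (AENs) perform strongly on ambiguity detection and generalize across datasets; manipulating AENs can effectively control model behavior.

Insight: LLMs hold compact, interpretable representations of ambiguity, which opens new avenues for model controllability and interpretability.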

Abstract: Ambiguity is pervasive in real-world questions, yet large language models (LLMs) often respond with confident answers rather than seeking clarification. In this work, we show that question ambiguity is linearly encoded in the internal representations of LLMs and can be both detected and controlled at the neuron level. During the model’s pre-filling stage, we identify that a small number of neurons, as few as one, encode question ambiguity information. Probes trained on these Ambiguity-Encoding Neurons (AENs) achieve strong performance on ambiguity detection and generalize across datasets, outperforming prompting-based and representation-based baselines. Layerwise analysis reveals that AENs emerge from shallow layers, suggesting early encoding of ambiguity signals in the model’s processing pipeline. Finally, we show that through manipulating AENs, we can control LLM’s behavior from direct answering to abstention. Our findings reveal that LLMs form compact internal representations of question ambiguity, enabling interpretable and controllable behavior.


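[69] Improving Context Fidelity via Native Retrieval-Augmented Reasoning cs.CL | cs.AI

Suyuchen Wang, Jinlin Wang, Xinyu Wang, Shiqi Li, Xiangru Tang

TL;DR: The paper proposes the CARE framework, which uses native retrieval-augmented reasoning to improve the context fidelity of large language models (LLMs), substantially outperforming existing methods.

Details

Motivation: LLMs fall short on context fidelity and tend to produce answers inconsistent with the given information. Existing approaches rely on expensive supervised fine-tuning or external retrieval, without necessarily improving use of the given context.

Result: On multiple real-world and counterfactual QA benchmarks, CARE substantially outperforms supervised fine-tuning, traditional retrieval-augmented generation methods, and external retrieval solutions.

Insight: By tightly integrating the model's own retrieval capability with its reasoning process, CARE shows potential to improve LLM accuracy and reliability on knowledge-intensive tasks while reducing dependence on external data or tools.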

Abstract: Large language models (LLMs) often struggle with context fidelity, producing inconsistent answers when responding to questions based on provided information. Existing approaches either rely on expensive supervised fine-tuning to generate evidence post-answer or train models to perform web searches without necessarily improving utilization of the given context. We propose CARE, a novel native retrieval-augmented reasoning framework that teaches LLMs to explicitly integrate in-context evidence within their reasoning process with the model’s own retrieval capabilities. Our method requires limited labeled evidence data while significantly enhancing both retrieval accuracy and answer generation performance through strategically retrieved in-context tokens in the reasoning chain. Extensive experiments on multiple real-world and counterfactual QA benchmarks demonstrate that our approach substantially outperforms supervised fine-tuning, traditional retrieval-augmented generation methods, and external retrieval solutions. This work represents a fundamental advancement in making LLMs more accurate, reliable, and efficient for knowledge-intensive tasks.


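[70] Integrating Text and Time-Series into (Large) Language Models to Predict Medical Outcomes cs.CL

Iyadh Ben Cheikh Larbi, Ajay Madhavan Ravichandran, Aljoscha Burchardt, Roland Roller

TL;DR: The paper studies how to apply large language models (LLMs) to clinical classification tasks that combine text and time-series data, using DSPy-based prompt optimization to achieve strong performance and task adaptability.

Details

Motivation: LLMs excel at text generation, but their ability to handle clinical classification tasks involving structured data such as time series remains underexplored, so their potential on such tasks needs to be tested.

Result: The approach performs on par with specialized multimodal systems while being less complex and more adaptable across tasks.

Insight: LLMs are not limited to text generation; with suitable optimization and integration they can also handle tasks involving structured data, which broadens their range of application.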

Abstract: Large language models (LLMs) excel at text generation, but their ability to handle clinical classification tasks involving structured data, such as time series, remains underexplored. In this work, we adapt instruction-tuned LLMs using DSPy-based prompt optimization to process clinical notes and structured EHR inputs jointly. Our results show that this approach achieves performance on par with specialized multimodal systems while requiring less complexity and offering greater adaptability across tasks.


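[71] DSCC-HS: A Dynamic Self-Reinforcing Framework for Hallucination Suppression in Large Language Models cs.CL | cs.AI

Xiao Zheng

TL;DR: DSCC-HS is a dynamic self-reinforcing framework for suppressing hallucination in large language models. It intervenes in real time during autoregressive decoding and markedly improves factuality.

Details

Motivation: Hallucination is a major barrier to the reliable deployment of large language models (LLMs), and existing methods such as RAG are largely reactive. DSCC-HS aims to address this through proactive intervention.

Result: SOTA performance on TruthfulQA and BioGEN: a 99.2% Factual Consistency Rate (FCR) on TruthfulQA and a FActScore of 46.50 on BioGEN.

Insight: DSCC-HS shows the potential of steering decoding with dynamic proxy models, offering an efficient approach to LLM factuality that requires no modification of the target model.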

Abstract: Large Language Model (LLM) hallucination is a significant barrier to their reliable deployment. Current methods like Retrieval-Augmented Generation (RAG) are often reactive. We introduce Dynamic Self-reinforcing Calibration for Hallucination Suppression (DSCC-HS), a novel, proactive framework that intervenes during autoregressive decoding. Inspired by dual-process cognitive theory, DSCC-HS uses a compact proxy model, trained in adversarial roles as a Factual Alignment Proxy (FAP) and a Hallucination Detection Proxy (HDP). During inference, these proxies dynamically steer a large target model by injecting a real-time steering vector, which is the difference between FAP and HDP logits, at each decoding step. This plug-and-play approach requires no modification to the target model. Our experiments on TruthfulQA and BioGEN show DSCC-HS achieves state-of-the-art performance. On TruthfulQA, it reached a 99.2% Factual Consistency Rate (FCR). On the long-form BioGEN benchmark, it attained the highest FActScore of 46.50. These results validate DSCC-HS as a principled and efficient solution for enhancing LLM factuality.
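The steering step described above (a vector equal to the difference between FAP and HDP logits, injected at each decoding step) can be illustrated with a toy example. The sketch below assumes the vector is added to the target model's next-token logits, that all three models share a vocabulary, and that a scaling factor `alpha` controls its strength; none of these details is confirmed by the abstract, and the function names are hypothetical.

```python
def steer_logits(target_logits, fap_logits, hdp_logits, alpha=1.0):
    """Add alpha * (FAP - HDP) to the target model's next-token logits.

    alpha is a hypothetical strength parameter introduced for illustration.
    """
    return [
        t + alpha * (f - h)
        for t, f, h in zip(target_logits, fap_logits, hdp_logits)
    ]


def greedy_token(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy three-token vocabulary: the target prefers token 0, the factual proxy
# favours token 1, and the hallucination proxy favours token 0.
target = [2.0, 1.0, 0.5]
fap = [0.0, 2.0, 0.0]
hdp = [1.5, 0.0, 0.0]
steered = steer_logits(target, fap, hdp)
print(greedy_token(target), greedy_token(steered))
```

In this toy case the steering term lowers the score of the token the hallucination proxy prefers and raises the one the factual proxy prefers, so the greedy choice changes without touching the target model's weights.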


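[72] DSPC: Dual-Stage Progressive Compression Framework for Efficient Long-Context Reasoning cs.CL

Yaxin Gao, Yao Lu, Zongfei Zhang, Jiaqi Nie, Shanqing Yu

TL;DR: The paper proposes DSPC, a dual-stage progressive compression framework for efficient long-context reasoning. It is training-free, combining coarse-grained semantic sentence filtering with fine-grained token pruning to substantially reduce computational cost.

Details

Motivation: As large language models (LLMs) spread, prompts have grown longer to obtain more accurate output, which raises computational cost. Existing prompt compression methods usually require training an auxiliary model, adding overhead, so a training-free compression scheme is needed.

Result: Validated on LLaMA-3.1-8B-Instruct and GPT-3.5-Turbo. With 3x fewer tokens, DSPC scores 49.17 on the LongBench FewShot task, 7.76 points above the best baseline, LongLLMLingua.

Insight: A training-free progressive compression framework can substantially improve efficiency while preserving semantics, offering a low-cost option for long-context reasoning.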

Abstract: Large language models (LLMs) have achieved remarkable success in many natural language processing (NLP) tasks. To achieve more accurate output, the prompts used to drive LLMs have become increasingly long, which incurs higher computational costs. To address this prompt inflation problem, prompt compression has been proposed. However, most existing methods require training a small auxiliary model for compression, incurring a significant amount of additional computation. To avoid this, we propose a two-stage, training-free approach, called Dual-Stage Progressive Compression (DSPC). In the coarse-grained stage, semantic-related sentence filtering removes sentences with low semantic value based on TF-IDF. In the fine-grained stage, token importance is assessed using attention contribution, cross-model loss difference, and positional importance, enabling the pruning of low-utility tokens while preserving semantics. We validate DSPC on LLaMA-3.1-8B-Instruct and GPT-3.5-Turbo under a constrained token budget and observe consistent improvements. For instance, in the FewShot task of the LongBench dataset, DSPC achieves a performance of 49.17 while using 3x fewer tokens, outperforming the best state-of-the-art baseline, LongLLMLingua, by 7.76.
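The coarse-grained stage described above filters sentences by TF-IDF. A minimal, training-free sketch of that idea follows. The scoring rule (each sentence treated as a document and scored by its summed TF-IDF weight), whitespace tokenization, the `keep_ratio` parameter, and the function names are all assumptions for illustration; the paper's exact rule may differ, and the fine-grained token-pruning stage is not covered.

```python
import math
from collections import Counter


def tfidf_sentence_scores(sentences):
    """Score each sentence by the sum of TF-IDF weights of its terms."""
    docs = [s.lower().split() for s in sentences]
    n_docs = len(docs)
    # Document frequency: in how many sentences each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        if not doc:
            scores.append(0.0)
            continue
        tf = Counter(doc)
        score = sum((count / len(doc)) * math.log(n_docs / df[term])
                    for term, count in tf.items())
        scores.append(score)
    return scores


def filter_sentences(sentences, keep_ratio=0.5):
    """Keep the highest-scoring sentences, preserving their original order."""
    scores = tfidf_sentence_scores(sentences)
    k = max(1, round(len(sentences) * keep_ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])
    return [sentences[i] for i in kept]
```

Sentences made up of terms that appear everywhere in the prompt receive low scores and are dropped first, which is the intuition behind removing low-value sentences before any token-level pruning.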


[73] Combining Evidence and Reasoning for Biomedical Fact-Checking cs.CL | cs.AI | cs.IRPDF

Mariano Barone, Antonio Romano, Giuseppe Riccio, Marco Postiglione, Vincenzo Moscato

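TL;DR: CER is a new biomedical fact-checking framework that combines scientific evidence retrieval, large language model reasoning, and supervised veracity prediction, substantially improving automated fact-checking in the biomedical domain.

Details

Motivation: Biomedical misinformation, from vaccine hesitancy to unproven treatments, endangers public health and trust in medical systems, while existing automated fact-checking methods face domain-specific challenges such as complex terminology and the need for expert knowledge.

Result: Experiments on the expert-annotated datasets HealthFC, BioASQ-7b, and SciFact show that CER reaches state-of-the-art performance and generalizes well across datasets.

Insight: Combining generative models with retrieval effectively reduces the risk of hallucination and grounds biomedical fact-checking in verifiable scientific evidence.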

Abstract: Misinformation in healthcare, from vaccine hesitancy to unproven treatments, poses risks to public health and trust in medical systems. While machine learning and natural language processing have advanced automated fact-checking, validating biomedical claims remains uniquely challenging due to complex terminology, the need for domain expertise, and the critical importance of grounding in scientific evidence. We introduce CER (Combining Evidence and Reasoning), a novel framework for biomedical fact-checking that integrates scientific evidence retrieval, reasoning via large language models, and supervised veracity prediction. By integrating the text-generation capabilities of large language models with advanced retrieval techniques for high-quality biomedical scientific evidence, CER effectively mitigates the risk of hallucinations, ensuring that generated outputs are grounded in verifiable, evidence-based sources. Evaluations on expert-annotated datasets (HealthFC, BioASQ-7b, SciFact) demonstrate state-of-the-art performance and promising cross-dataset generalization. Code and data are released for transparency and reproducibility: https://github.com/PRAISELab-PicusLab/CER.


[74] Combating Biomedical Misinformation through Multi-modal Claim Detection and Evidence-based Verification cs.CL | cs.AI | cs.IRPDF

Mariano Barone, Antonio Romano, Giuseppe Riccio, Marco Postiglione, Vincenzo Moscato

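TL;DR: CER is a framework for biomedical fact-checking that combines scientific evidence retrieval, large language model reasoning, and supervised veracity prediction. It effectively lowers the risk of hallucination and shows state-of-the-art performance on several datasets.

Details

Motivation: Biomedical misinformation, from vaccine hesitancy to unproven treatments, threatens public health and trust in medical systems, yet existing techniques struggle with complex terminology and the need for domain expertise when validating biomedical claims.

Result: On the expert-annotated datasets HealthFC, BioASQ-7b, and SciFact, CER shows state-of-the-art performance and cross-dataset generalization.

Insight: By combining retrieval and reasoning, CER reduces hallucinations and provides an interpretable, reliable approach to biomedical fact-checking.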

Abstract: Misinformation in healthcare, from vaccine hesitancy to unproven treatments, poses risks to public health and trust in medical systems. While machine learning and natural language processing have advanced automated fact-checking, validating biomedical claims remains uniquely challenging due to complex terminology, the need for domain expertise, and the critical importance of grounding in scientific evidence. We introduce CER (Combining Evidence and Reasoning), a novel framework for biomedical fact-checking that integrates scientific evidence retrieval, reasoning via large language models, and supervised veracity prediction. By integrating the text-generation capabilities of large language models with advanced retrieval techniques for high-quality biomedical scientific evidence, CER effectively mitigates the risk of hallucinations, ensuring that generated outputs are grounded in verifiable, evidence-based sources. Evaluations on expert-annotated datasets (HealthFC, BioASQ-7b, SciFact) demonstrate state-of-the-art performance and promising cross-dataset generalization. Code and data are released for transparency and reproducibility: https://github.com/PRAISELab-PicusLab/CER


[75] Slim-SC: Thought Pruning for Efficient Scaling with Self-Consistency cs.CL | cs.AI | cs.LG | I.2.7PDF

Colin Hong, Xu Guo, Anand Chaanan Singh, Esha Choukse, Dmitrii Ustiugov

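TL;DR: The paper proposes Slim-SC, an efficient optimization of Self-Consistency (SC) reasoning that prunes redundant thoughts to cut computation, markedly reducing latency and resource use.

Details

Motivation: Self-Consistency (SC) improves the reasoning performance of large language models, but its high computational overhead limits practical deployment.

Result: On three STEM reasoning datasets, Slim-SC reduces inference latency by up to 45% and KVC usage by up to 26% without lowering accuracy.

Insight: Thought-level similarity between chains is the key to addressing SC's inefficiency, and a pruning strategy can markedly reduce compute use.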

Abstract: Recently, Test-Time Scaling (TTS) has gained increasing attention for improving LLM reasoning performance at test time without retraining the model. A notable TTS technique is Self-Consistency (SC), which generates multiple reasoning chains in parallel and selects the final answer via majority voting. While effective, the order-of-magnitude computational overhead limits its broad deployment. Prior attempts to accelerate SC mainly rely on model-based confidence scores or heuristics with limited empirical support. For the first time, we theoretically and empirically analyze the inefficiencies of SC and reveal actionable opportunities for improvement. Building on these insights, we propose Slim-SC, a step-wise pruning strategy that identifies and removes redundant chains using inter-chain similarity at the thought level. Experiments on three STEM reasoning datasets and two recent LLM architectures show that Slim-SC reduces inference latency and KVC usage by up to 45% and 26%, respectively, with R1-Distill, while maintaining or improving accuracy, thus offering a simple yet efficient TTS alternative for SC.
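To make the idea of pruning redundant chains before majority voting concrete, here is a small sketch. It is not the paper's algorithm: Jaccard word overlap stands in for the thought-level similarity measure, the `threshold` value is arbitrary, and the pruning is applied once to finished thoughts rather than step by step during generation.

```python
from collections import Counter


def jaccard(a, b):
    """Word-level Jaccard similarity (an assumed stand-in for thought similarity)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def prune_redundant(chains, threshold=0.8):
    """Return indices of chains kept after dropping near-duplicates of kept chains."""
    kept = []
    for i, chain in enumerate(chains):
        if all(jaccard(chain, chains[j]) < threshold for j in kept):
            kept.append(i)
    return kept


def majority_vote(answers):
    """Standard Self-Consistency aggregation: the most frequent answer wins."""
    return Counter(answers).most_common(1)[0][0]


chains = [
    "add 2 and 3 to get 5",
    "add 2 and 3 to get 5",
    "multiply 2 by 3 to get 6",
    "sum 2 plus 3 giving 5",
]
answers = ["5", "5", "6", "5"]
kept = prune_redundant(chains)
final = majority_vote([answers[i] for i in kept])
```

The second chain duplicates the first and is dropped, so only three chains would need to be carried forward, while the vote still returns "5".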


[76] Early Stopping Chain-of-thoughts in Large Language Models cs.CLPDF

Minjia Mao, Bowen Yin, Yu Zhu, Xiao Fang

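TL;DR: The paper proposes ES-CoT, which detects answer convergence during reasoning and stops chain-of-thought (CoT) generation early, reducing inference cost with minimal performance loss.

Details

Motivation: Large language models (LLMs) incur high inference costs when generating long chains of thought, so an efficient way to shorten generation without noticeably hurting performance is needed.

Result: Experiments on five reasoning datasets and three LLMs show that ES-CoT cuts inference tokens by about 41% on average with minimal performance loss, and that it integrates well with self-consistency prompting.

Insight: Step answers steadily converge to the final answer, and a sharp jump in their run length is a reliable marker of convergence, which gives efficient reasoning a theoretical footing.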

Abstract: Reasoning large language models (LLMs) have demonstrated superior capacities in solving complicated problems by generating long chain-of-thoughts (CoT), but such a lengthy CoT incurs high inference costs. In this study, we introduce ES-CoT, an inference-time method that shortens CoT generation by detecting answer convergence and stopping early with minimal performance loss. At the end of each reasoning step, we prompt the LLM to output its current final answer, denoted as a step answer. We then track the run length of consecutive identical step answers as a measure of answer convergence. Once the run length exhibits a sharp increase and exceeds a minimum threshold, the generation is terminated. We provide both empirical and theoretical support for this heuristic: step answers steadily converge to the final answer, and large run-length jumps reliably mark this convergence. Experiments on five reasoning datasets across three LLMs show that ES-CoT reduces the number of inference tokens by about 41% on average while maintaining accuracy comparable to standard CoT. Further, ES-CoT integrates seamlessly with self-consistency prompting and remains robust across hyperparameter choices, highlighting it as a practical and effective approach for efficient reasoning.
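The run-length stopping rule can be sketched as follows. The abstract does not define what counts as a "sharp increase", so this sketch assumes one possible reading: stop when the current run of identical step answers is at least `min_run` long and exceeds the longest earlier run by at least `min_jump`. Both parameter names and their defaults are hypothetical.

```python
def early_stop_step(step_answers, min_run=3, min_jump=2):
    """Return the 1-based step at which generation would stop, or None.

    Assumed rule (the abstract does not give the exact test): stop once the
    current run of identical step answers is at least `min_run` long and
    exceeds the longest earlier run by at least `min_jump`.
    """
    longest_previous = 0
    run = 0
    last = None
    for step, answer in enumerate(step_answers, start=1):
        if answer == last:
            run += 1
        else:
            # A new answer ends the previous run; remember how long it was.
            longest_previous = max(longest_previous, run)
            run = 1
            last = answer
        if run >= min_run and run - longest_previous >= min_jump:
            return step
    return None


stop_at = early_stop_step(["12", "15", "15", "18", "18", "18", "18"])
```

For this sequence the earlier runs have lengths 1 and 2, so the run of "18" triggers the stop at step 7, when it reaches length 4.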


[77] Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale cs.CL | cs.AI | cs.LGPDF

Hasan Abed Al Kader Hammoud, Mohammad Zbeeb, Bernard Ghanem

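TL;DR: Hala is a family of Arabic-centric instruction and translation models built with a translate-and-tune pipeline, achieving state-of-the-art results on Arabic NLP tasks.

Details

Motivation: Arabic NLP currently lacks high-quality instruction and translation models. Hala aims to fill this gap and improve performance on Arabic tasks.

Result: On Arabic-centric benchmarks, Hala reaches SOTA in both the "nano" (≤2B) and "small" (7-9B) categories.

Insight: Efficient compression and high-quality data generation can markedly improve small models, which suits languages with limited resources.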

Abstract: We present Hala, a family of Arabic-centric instruction and translation models built with our translate-and-tune pipeline. We first compress a strong AR$\leftrightarrow$EN teacher to FP8 (yielding $\sim$2$\times$ higher throughput with no quality loss) and use it to create high-fidelity bilingual supervision. A lightweight language model LFM2-1.2B is then fine-tuned on this data and used to translate high-quality English instruction sets into Arabic, producing a million-scale corpus tailored to instruction following. We train Hala models at 350M, 700M, 1.2B, and 9B parameters, and apply slerp merging to balance Arabic specialization with base-model strengths. On Arabic-centric benchmarks, Hala achieves state-of-the-art results within both the “nano” ($\leq$2B) and “small” (7-9B) categories, outperforming their bases. We release models, data, evaluation, and recipes to accelerate research in Arabic NLP.
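For readers unfamiliar with slerp merging, the sketch below shows spherical linear interpolation between two flattened weight vectors. How Hala applies it (per tensor or per model, and with which interpolation factor) is not stated in the abstract, so the function signature and the `t` value are illustrative assumptions; inputs are assumed to be non-zero vectors.

```python
import math


def slerp(p, q, t, eps=1e-8):
    """Spherical linear interpolation between non-zero vectors p and q.

    t=0 returns p, t=1 returns q. Falls back to linear interpolation when
    the vectors are nearly parallel and the spherical formula is unstable.
    """
    norm_p = math.sqrt(sum(x * x for x in p))
    norm_q = math.sqrt(sum(x * x for x in q))
    cos_omega = sum(a * b for a, b in zip(p, q)) / (norm_p * norm_q)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding
    omega = math.acos(cos_omega)
    if math.sin(omega) < eps:
        return [(1 - t) * a + t * b for a, b in zip(p, q)]
    w_p = math.sin((1 - t) * omega) / math.sin(omega)
    w_q = math.sin(t * omega) / math.sin(omega)
    return [w_p * a + w_q * b for a, b in zip(p, q)]


base_weights = [1.0, 0.0]       # hypothetical base-model weights
tuned_weights = [0.0, 1.0]      # hypothetical fine-tuned weights
merged = slerp(base_weights, tuned_weights, t=0.5)
```

Unlike plain averaging, which would give `[0.5, 0.5]` here, the midpoint stays on the unit arc between the two vectors, so the merged vector keeps unit norm.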


[78] Audio-Based Crowd-Sourced Evaluation of Machine Translation Quality cs.CL | cs.HCPDF

Sami Ul Haq, Sheila Castilho, Yvette Graham

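TL;DR: The paper compares audio-based and text-only evaluation of machine translation quality and finds that audio evaluation can, in some cases, separate translation systems more naturally.

Details

Motivation: Despite major progress in machine translation (MT), quality assessment still relies mainly on text and overlooks the fact that in many real applications translations are spoken. Studying the feasibility and effect of audio-based evaluation is therefore of practical value.

Result: Rankings from audio-based evaluation largely agree with text-only evaluation, but in some cases they reveal significant differences between translation systems, suggesting that audio, as a richer and more natural modality, may have advantages for evaluation.

Insight: Future MT evaluation frameworks should consider including speech-based assessment to better reflect translation quality in spoken settings.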

Abstract: Machine Translation (MT) has achieved remarkable performance, with growing interest in speech translation and multimodal approaches. However, despite these advancements, MT quality assessment remains largely text-centric, typically relying on human experts who read and compare texts. Since many real-world MT applications (e.g., Google Translate Voice Mode, iFLYTEK Translator) involve translation being spoken rather than printed or read, a more natural way to assess translation quality would be through speech as opposed to text-only evaluations. This study compares text-only and audio-based evaluations of 10 MT systems from the WMT General MT Shared Task, using crowd-sourced judgments collected via Amazon Mechanical Turk. We additionally performed statistical significance testing and self-replication experiments to test the reliability and consistency of the audio-based approach. Crowd-sourced assessments based on audio yield rankings largely consistent with text-only evaluations but, in some cases, identify significant differences between translation systems. We attribute this to speech being a richer, more natural modality and propose incorporating speech-based assessments into future MT evaluation frameworks.


[79] Enhancing Multi-Agent Debate System Performance via Confidence Expression cs.CLPDF

Zijie Lin, Bryan Hooi

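TL;DR: The paper proposes adding confidence expression to multi-agent debate systems to improve debate effectiveness and task performance.

Details

Motivation: Existing multi-agent debate systems lack a clear mechanism for expressing confidence, which leads to ineffective debates or premature convergence on suboptimal answers.

Result: Experiments show that the method is effective, and the paper analyzes how confidence influences debate dynamics.

Insight: Confidence expression plays a key role in multi-agent debate systems and can inform their design.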

Abstract: Generative Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of tasks. Recent research has introduced Multi-Agent Debate (MAD) systems, which leverage multiple LLMs to simulate human debate and thereby improve task performance. However, while some LLMs may possess superior knowledge or reasoning capabilities for specific tasks, they often struggle to clearly communicate this advantage during debates, in part due to a lack of confidence expression. Moreover, inappropriate confidence expression can cause agents in MAD systems to either stubbornly maintain incorrect beliefs or converge prematurely on suboptimal answers, ultimately reducing debate effectiveness and overall system performance. To address these challenges, we propose incorporating confidence expression into MAD systems to allow LLMs to explicitly communicate their confidence levels. To validate this approach, we develop ConfMAD, a MAD framework that integrates confidence expression throughout the debate process. Experimental results demonstrate the effectiveness of our method, and we further analyze how confidence influences debate dynamics, offering insights into the design of confidence-aware MAD systems.


[80] SSL-SSAW: Self-Supervised Learning with Sigmoid Self-Attention Weighting for Question-Based Sign Language Translation cs.CL | cs.AIPDF

Zekang Liu, Wei Feng, Fanhua Shang, Lianyu Hu, Jichao Feng

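TL;DR: SSL-SSAW is a cross-modality self-supervised learning method with Sigmoid Self-attention Weighting for question-based sign language translation, using dialogue context to improve translation quality.

Details

Motivation: Sign language translation (SLT) is key to communication between deaf and hearing people, and dialogue supplies important contextual cues. The paper proposes Question-based SLT (QB-SLT) to explore how to integrate dialogue information efficiently, simplifying annotation and improving translation performance.

Result: On the CSL-Daily-QA and PHOENIX-2014T-QA datasets, SSL-SSAW achieves the best performance, matching or even surpassing methods that rely on gloss annotations. Visualizations confirm the value of dialogue context.

Insight: Dialogue information can simplify annotation and improve translation quality, making sign language translation more practical.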

Abstract: Sign Language Translation (SLT) bridges the communication gap between deaf people and hearing people, where dialogue provides crucial contextual cues to aid in translation. Building on this foundational concept, this paper proposes Question-based Sign Language Translation (QB-SLT), a novel task that explores the efficient integration of dialogue. Unlike gloss (sign language transcription) annotations, dialogue naturally occurs in communication and is easier to annotate. The key challenge lies in aligning multimodality features while leveraging the context of the question to improve translation. To address this issue, we propose a cross-modality Self-supervised Learning with Sigmoid Self-attention Weighting (SSL-SSAW) fusion method for sign language translation. Specifically, we employ contrastive learning to align multimodality features in QB-SLT, then introduce a Sigmoid Self-attention Weighting (SSAW) module for adaptive feature extraction from question and sign language sequences. Additionally, we leverage available question text through self-supervised learning to enhance representation and translation capabilities. We evaluated our approach on newly constructed CSL-Daily-QA and PHOENIX-2014T-QA datasets, where SSL-SSAW achieved SOTA performance. Notably, easily accessible question assistance can achieve or even surpass the performance of gloss assistance. Furthermore, visualization results demonstrate the effectiveness of incorporating dialogue in improving translation quality.


[81] AssoCiAm: A Benchmark for Evaluating Association Thinking while Circumventing Ambiguity cs.CLPDF

Yifan Liu, Wenkuan Zhao, Shanshan Zhong, Jinghui Qin, Mingfu Liang

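TL;DR: The paper presents AssoCiAm, a benchmark for evaluating the associative ability of multimodal large language models (MLLMs) that uses a hybrid computational method to circumvent internal and external ambiguity in the task. Experiments show that associative ability correlates positively with cognition and that ambiguity makes model behavior more random.

Details

Motivation: Current frameworks for evaluating associative ability often overlook the ambiguity in the task, which undermines the reliability of evaluation. A benchmark that circumvents ambiguity is needed to measure MLLMs' associative ability more accurately.

Result: Associative ability correlates positively with cognition, ambiguity makes model behavior more random, and AssoCiAm is validated as effective for evaluation.

Insight: Circumventing ambiguity markedly improves evaluation reliability, association is an important indicator when assessing MLLM creativity, and ambiguity reduces the predictability of model behavior.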

Abstract: Recent advancements in multimodal large language models (MLLMs) have garnered significant attention, offering a promising pathway toward artificial general intelligence (AGI). Among the essential capabilities required for AGI, creativity has emerged as a critical trait for MLLMs, with association serving as its foundation. Association reflects a model's ability to think creatively, making it vital to evaluate and understand. While several frameworks have been proposed to assess associative ability, they often overlook the inherent ambiguity in association tasks, which arises from the divergent nature of associations and undermines the reliability of evaluations. To address this issue, we decompose ambiguity into two types, internal ambiguity and external ambiguity, and introduce AssoCiAm, a benchmark designed to evaluate associative ability while circumventing the ambiguity through a hybrid computational method. We then conduct extensive experiments on MLLMs, revealing a strong positive correlation between cognition and association. Additionally, we observe that the presence of ambiguity in the evaluation process causes MLLMs' behavior to become more random-like. Finally, we validate the effectiveness of our method in ensuring more accurate and reliable evaluations. See Project Page for the data and codes.


[82] Synthesizing Behaviorally-Grounded Reasoning Chains: A Data-Generation Framework for Personal Finance LLMs cs.CL | cs.AI | cs.LG | 68T50 | I.2.7; J.4PDF

Akhil Theerthala

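TL;DR: The paper proposes a data-generation framework that combines financial context with behavioral finance to train LLMs for personalized financial advice, and shows experimentally that its 8B model has advantages in performance and cost.

Details

Motivation: Personalized financial advice has to weigh user goals, risk tolerance, and other complex factors. Existing agentic approaches are costly to maintain and perform poorly (yielding less than 25% of their expected financial returns), so a more efficient, lower-cost framework for data generation and model training is needed.

Result: The 8B model matches larger baselines (14-32B parameters) on factual accuracy, fluency, and personalization while costing 80% less.

Insight: A carefully designed framework for generating supervision data can markedly reduce the model size required while keeping output quality high, which points to a way of applying small models to complex tasks.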

Abstract: Personalized financial advice requires consideration of user goals, constraints, risk tolerance, and jurisdiction. Prior LLM work has focused on support systems for investors and financial planners. Simultaneously, numerous recent studies examine broader personal finance tasks, including budgeting, debt management, retirement, and estate planning, through agentic pipelines that incur high maintenance costs, yielding less than 25% of their expected financial returns. In this study, we introduce a novel and reproducible framework that integrates relevant financial context with behavioral finance studies to construct supervision data for end-to-end advisors. Using this framework, we create a 19k sample reasoning dataset and conduct a comprehensive fine-tuning of the Qwen-3-8B model on the dataset. Through a held-out test split and a blind LLM-jury study, we demonstrate that through careful data curation and behavioral integration, our 8B model achieves performance comparable to significantly larger baselines (14-32B parameters) across factual accuracy, fluency, and personalization metrics while incurring 80% lower costs than the larger counterparts.


cs.RO [Back]

[83] Semantic 3D Reconstructions with SLAM for Central Airway Obstruction cs.RO | cs.CVPDF

Ayberk Acar, Fangjie Li, Hao Li, Lidia Al-Zogbi, Kanyifeechukwu Jane Oguine

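TL;DR: The paper proposes a pipeline that combines semantic segmentation with real-time monocular SLAM for endoscopic 3D reconstruction of central airway obstruction (CAO), producing accurate, clinically relevant annotated maps in real time.

Details

Motivation: Central airway obstruction is a high-risk condition, and traditional treatments carry a high risk of complications. Automated approaches that combine robotic intervention with scene understanding can lower risk and improve precision.

Result: Validated on ex vivo models, the reconstructions closely match ground-truth CT scans (0.62 mm Chamfer distance) and are produced faster than in previous work.

Insight: Integrating semantic segmentation directly into the SLAM workflow allows clinically relevant regions to be annotated in real time, a feasible direction for automated robotic intervention.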

Abstract: Central airway obstruction (CAO) is a life-threatening condition with increasing incidence, caused by tumors in and outside of the airway. Traditional treatment methods such as bronchoscopy and electrocautery can be used to remove the tumor completely; however, these methods carry a high risk of complications. Recent advances allow robotic interventions with lower risk. The combination of robot interventions with scene understanding and mapping also opens up the possibilities for automation. We present a novel pipeline that enables real-time, semantically informed 3D reconstructions of the central airway using monocular endoscopic video. Our approach combines DROID-SLAM with a segmentation model trained to identify obstructive tissues. The SLAM module reconstructs the 3D geometry of the airway in real time, while the segmentation masks guide the annotation of obstruction regions within the reconstructed point cloud. To validate our pipeline, we evaluate the reconstruction quality using ex vivo models. Qualitative and quantitative results show high similarity between ground truth CT scans and the 3D reconstructions (0.62 mm Chamfer distance). By integrating segmentation directly into the SLAM workflow, our system produces annotated 3D maps that highlight clinically relevant regions in real time. High-speed capabilities of the pipeline allow quicker reconstructions compared to previous work, reflecting the surgical scene more accurately. To the best of our knowledge, this is the first work to integrate semantic segmentation with real-time monocular SLAM for endoscopic CAO scenarios. Our framework is modular and can generalize to other anatomies or procedures with minimal changes, offering a promising step toward autonomous robotic interventions.
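The Chamfer distance used to compare the reconstruction with the CT ground truth can be computed as in the brute-force sketch below. Several conventions exist (summed or averaged, squared or unsquared), and the abstract does not say which one the authors use; this sketch assumes the average of the two directed mean nearest-neighbor distances, in the same units as the input points.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Assumed convention: mean nearest-neighbor distance from A to B and from
    B to A, averaged. Brute force, O(len(A) * len(B)).
    """
    def mean_nearest(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a))


# Two toy point clouds offset by 1 unit along z (units are whatever the inputs use).
reconstruction = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
distance = chamfer_distance(reconstruction, ground_truth)
```

For real point clouds a spatial index such as a k-d tree would replace the inner `min` loop; the brute-force version is only meant to show the definition.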


[84] Object Pose Estimation through Dexterous Touch cs.RO | cs.CVPDF

Amir-Hossein Shahidzadeh, Jiyue Zhu, Kezhou Chen, Sha Yi, Cornelia Fermüller

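TL;DR: The paper proposes estimating object pose through bimanual tactile exploration, using reinforcement learning to actively collect tactile data and iterative refinement to estimate the pose.

Details

Motivation: When visual data is limited (by lighting, occlusion, or appearance changes), the local information from tactile sensors is hard to use directly for object pose estimation. The paper addresses this through active exploration.

Result: Experiments show that the method can identify critical pose features through tactile exploration without prior knowledge of the object's geometry.

Insight: Active tactile exploration combined with reinforcement learning offers a new approach to object pose estimation, with particular promise where visual information is limited.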

Abstract: Robust object pose estimation is essential for manipulation and interaction tasks in robotics, particularly in scenarios where visual data is limited or sensitive to lighting, occlusions, and appearances. Tactile sensors often offer limited and local contact information, making it challenging to reconstruct the pose from partial data. Our approach uses sensorimotor exploration to actively control a robot hand to interact with the object. We train with Reinforcement Learning (RL) to explore and collect tactile data. The collected 3D point clouds are used to iteratively refine the object’s shape and pose. In our setup, one hand holds the object steady while the other performs active exploration. We show that our method can actively explore an object’s surface to identify critical pose features without prior knowledge of the object’s geometry. Supplementary material and more demonstrations will be provided at https://amirshahid.github.io/BimanualTactilePose .


[85] MAP: End-to-End Autonomous Driving with Map-Assisted Planning cs.RO | cs.AI | cs.CV | I.2.9; I.2.10PDF

Huilin Yin, Yiming Kan, Daniel Watzenig

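TL;DR: The paper proposes MAP (Map-Assisted Planning), a new end-to-end trajectory planning framework that explicitly integrates segmentation-based map features and the current ego status, markedly improving trajectory planning for autonomous driving. Experiments show large error reductions and performance gains without post-processing.

Details

Motivation: Existing end-to-end autonomous driving methods underuse the online mapping module, which limits its contribution to trajectory planning. The paper aims to improve planning accuracy and robustness by explicitly integrating map features and ego status.

Result: On the DAIR-V2X-seq-SPD dataset, MAP reduces L2 displacement error by 16.6% and off-road rate by 56.2%, and improves the overall score by 44.5%. In the CVPR 2025 challenge it leads the second-place model by 39.5% in overall score.

Insight: Explicitly exploiting semantic map features can markedly improve planning in end-to-end autonomous driving systems and suggests new directions for system design.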

Abstract: In recent years, end-to-end autonomous driving has attracted increasing attention for its ability to jointly model perception, prediction, and planning within a unified framework. However, most existing approaches underutilize the online mapping module, leaving its potential to enhance trajectory planning largely untapped. This paper proposes MAP (Map-Assisted Planning), a novel map-assisted end-to-end trajectory planning framework. MAP explicitly integrates segmentation-based map features and the current ego status through a Plan-enhancing Online Mapping module, an Ego-status-guided Planning module, and a Weight Adapter based on current ego status. Experiments conducted on the DAIR-V2X-seq-SPD dataset demonstrate that the proposed method achieves a 16.6% reduction in L2 displacement error, a 56.2% reduction in off-road rate, and a 44.5% improvement in overall score compared to the UniV2X baseline, even without post-processing. Furthermore, it achieves top ranking in Track 2 of the End-to-End Autonomous Driving through V2X Cooperation Challenge of MEIS Workshop @CVPR2025, outperforming the second-best model by 39.5% in terms of overall score. These results highlight the effectiveness of explicitly leveraging semantic map features in planning and suggest new directions for improving structure design in end-to-end autonomous driving systems. Our code is available at https://gitee.com/kymkym/map.git


[86] MCGS-SLAM: A Multi-Camera SLAM Framework Using Gaussian Splatting for High-Fidelity Mapping cs.RO | cs.CVPDF

Zhihao Cao, Hanyu Wu, Li Wa Tang, Zizhou Luo, Zihan Zhu

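TL;DR: MCGS-SLAM is the first purely RGB-based multi-camera SLAM system, using 3D Gaussian Splatting (3DGS) for high-fidelity map reconstruction. With multi-camera bundle adjustment and a scale consistency module, it runs in real time and outperforms monocular baselines in geometric coverage and reconstruction quality.

Details

Motivation: Current dense SLAM methods mainly target monocular cameras and neglect the robustness and geometric coverage that multiple cameras offer. Multi-camera input provides a wider field of view and covers the side-view regions that monocular systems miss, which matters especially for safety-critical applications such as autonomous driving.

Result: Experiments on synthetic and real-world datasets show that MCGS-SLAM usually outperforms monocular baselines in trajectory accuracy and reconstruction quality, with the clearest gains in side-view regions.

Insight: Combining multi-camera input with Gaussian Splatting improves SLAM robustness and coverage and offers a new approach to high-fidelity mapping for autonomous driving and related fields.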

Abstract: Recent progress in dense SLAM has primarily targeted monocular setups, often at the expense of robustness and geometric coverage. We present MCGS-SLAM, the first purely RGB-based multi-camera SLAM system built on 3D Gaussian Splatting (3DGS). Unlike prior methods relying on sparse maps or inertial data, MCGS-SLAM fuses dense RGB inputs from multiple viewpoints into a unified, continuously optimized Gaussian map. A multi-camera bundle adjustment (MCBA) jointly refines poses and depths via dense photometric and geometric residuals, while a scale consistency module enforces metric alignment across views using low-rank priors. The system supports RGB input and maintains real-time performance at large scale. Experiments on synthetic and real-world datasets show that MCGS-SLAM consistently yields accurate trajectories and photorealistic reconstructions, usually outperforming monocular baselines. Notably, the wide field of view from multi-camera input enables reconstruction of side-view regions that monocular setups miss, critical for safe autonomous operation. These results highlight the promise of multi-camera Gaussian Splatting SLAM for high-fidelity mapping in robotics and autonomous driving.


eess.AS [Back]

[87] TICL: Text-Embedding KNN For Speech In-Context Learning Unlocks Speech Recognition Abilities of Large Multimodal Models eess.AS | cs.AI | cs.CL | cs.LG | cs.MMPDF

Haolong Zheng, Yekaterina Yegorova, Mark Hasegawa-Johnson

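TL;DR: The paper proposes TICL, which selects in-context examples by semantic context to improve the speech recognition ability of large multimodal models without fine-tuning, markedly reducing WER on several challenging tasks.

Details

Motivation: Speech foundation models can perform Speech In-Context Learning (SICL), but example selection methods remain underexplored, which limits performance. The paper aims to improve SICL example selection using semantic context.

Result: Across several challenging tasks, TICL achieves up to 84.7% relative WER reduction over zero-shot performance, and ablation studies support its robustness and efficiency.

Insight: Semantically informed example selection is crucial for SICL, and a simple combination of text embeddings and KNN is enough to markedly improve model performance.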

Abstract: Speech foundation models have recently demonstrated the ability to perform Speech In-Context Learning (SICL). Selecting effective in-context examples is crucial for SICL performance, yet selection methodologies remain underexplored. In this work, we propose Text-Embedding KNN for SICL (TICL), a simple pipeline that uses semantic context to enhance off-the-shelf large multimodal models’ speech recognition ability without fine-tuning. Across challenging automatic speech recognition tasks, including accented English, multilingual speech, and children’s speech, our method enables models to surpass zero-shot performance with up to 84.7% relative WER reduction. We conduct ablation studies to show the robustness and efficiency of our method.
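The retrieval step (text-embedding KNN over candidate in-context examples) can be sketched as a cosine-similarity top-k search. The embeddings here are tiny hand-written vectors; which embedding model TICL uses and how the query text for a test utterance is obtained are not stated in the abstract, so `select_examples` and its inputs are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_examples(query_vec, pool, k):
    """Return the ids of the k pool items whose embeddings are closest to the query.

    `pool` is a list of (example_id, embedding) pairs.
    """
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [example_id for example_id, _ in ranked[:k]]


# Hypothetical 2-D text embeddings for three candidate examples.
pool = [("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.0, 1.0])]
chosen = select_examples([1.0, 0.05], pool, k=2)
```

The selected examples (audio plus transcript) would then be placed in the model's context ahead of the test utterance.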


cs.AI [Back]

[88] Explicit Reasoning Makes Better Judges: A Systematic Study on Accuracy, Efficiency, and Robustness cs.AI | cs.CLPDF

Pratik Jayarao, Himanshu Gupta, Neeraj Varshney, Chaitanya Dwivedi

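TL;DR: The paper compares "thinking" and "non-thinking" LLMs as judges and finds that explicit reasoning (thinking models) wins on accuracy, efficiency, and robustness.

Details

Motivation: As LLMs are widely used as automated judges, ensuring their reliability, efficiency, and robustness has become critical. The paper systematically studies the role of explicit reasoning in LLM-as-a-judge tasks.

Result: Thinking models are about 10 percentage points more accurate at low computational overhead (under 2x), whereas augmentation strategies yield modest gains at higher cost (>8x). Thinking models are also 6% more consistent on average under a variety of bias conditions.

Insight: Explicit reasoning is key to improving LLM-as-a-judge performance, and its benefits extend beyond English.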

Abstract: As Large Language Models (LLMs) are increasingly adopted as automated judges in benchmarking and reward modeling, ensuring their reliability, efficiency, and robustness has become critical. In this work, we present a systematic comparison of “thinking” and “non-thinking” LLMs in the LLM-as-a-judge paradigm using open-source Qwen 3 models of relatively small sizes (0.6B, 1.7B, and 4B parameters). We evaluate both accuracy and computational efficiency (FLOPs) on RewardBench tasks, and further examine augmentation strategies for non-thinking models, including in-context learning, rubric-guided judging, reference-based evaluation, and n-best aggregation. Our results show that despite these enhancements, non-thinking models generally fall short of their thinking counterparts. Thinking models achieve approximately 10 percentage points higher accuracy with little overhead (under 2x), in contrast to augmentation strategies like few-shot learning, which deliver modest gains at a higher cost (>8x). Bias and robustness analyses further demonstrate that thinking models maintain significantly greater consistency under a variety of bias conditions such as positional, bandwagon, identity, diversity, and random biases (6% higher on average). We further extend our experiments to the multilingual setting, and our results confirm that explicit reasoning extends its benefits beyond English. Overall, our work provides systematic evidence that explicit reasoning offers clear advantages in the LLM-as-a-judge paradigm not only in accuracy and efficiency but also in robustness.


[89] Teaching LLMs to Plan: Logical Chain-of-Thought Instruction Tuning for Symbolic Planning cs.AI | cs.CLPDF

Pulkit Verma, Ngoc La, Anthony Favier, Swaroop Mishra, Julie A. Shah

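TL;DR: The paper proposes PDDL-Instruct, an instruction tuning framework that strengthens the symbolic planning ability of large language models (LLMs) through logical chain-of-thought reasoning, markedly improving planning accuracy.

Details

Motivation: LLMs perform well across diverse tasks, but their ability is limited on structured symbolic planning that requires formal representations such as PDDL.

Result: In experiments on multiple planning domains, the method reaches planning accuracy of up to 94%, a 66% absolute improvement over baseline models.

Insight: Logical chain-of-thought reasoning narrows the gap between LLMs' general reasoning ability and the logical precision that automated planning requires.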

Abstract: Large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, yet their ability to perform structured symbolic planning remains limited, particularly in domains requiring formal representations like the Planning Domain Definition Language (PDDL). In this paper, we present a novel instruction tuning framework, PDDL-Instruct, designed to enhance LLMs’ symbolic planning capabilities through logical chain-of-thought reasoning. Our approach focuses on teaching models to rigorously reason about action applicability, state transitions, and plan validity using explicit logical inference steps. By developing instruction prompts that guide models through the precise logical reasoning required to determine when actions can be applied in a given state, we enable LLMs to self-correct their planning processes through structured reflection. The framework systematically builds verification skills by decomposing the planning process into explicit reasoning chains about precondition satisfaction, effect application, and invariant preservation. Experimental results on multiple planning domains show that our chain-of-thought reasoning based instruction-tuned models are significantly better at planning, achieving planning accuracy of up to 94% on standard benchmarks, representing a 66% absolute improvement over baseline models. This work bridges the gap between the general reasoning capabilities of LLMs and the logical precision required for automated planning, offering a promising direction for developing better AI planning systems.
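The kind of reasoning the abstract describes (checking precondition satisfaction, applying effects, and validating a plan) corresponds to standard STRIPS-style semantics, which the sketch below implements for ground actions. It illustrates what a plan validator checks, not the PDDL-Instruct training procedure; the `Action` class, the string-based facts, and the two-block example are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A ground STRIPS-style action; facts are plain strings."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset


def applicable(state, action):
    """An action is applicable when every precondition holds in the state."""
    return action.preconditions <= state


def apply_action(state, action):
    """Successor state: remove the delete effects, then add the add effects."""
    if not applicable(state, action):
        raise ValueError(f"{action.name} is not applicable")
    return (state - action.delete_effects) | action.add_effects


def plan_is_valid(initial_state, plan, goal):
    """Check each step's preconditions in turn, then test the goal."""
    state = initial_state
    for action in plan:
        if not applicable(state, action):
            return False
        state = apply_action(state, action)
    return goal <= state


# Hypothetical two-block example.
pick_up_a = Action(
    "pick-up a",
    frozenset({"clear a", "ontable a", "handempty"}),
    frozenset({"holding a"}),
    frozenset({"clear a", "ontable a", "handempty"}),
)
stack_a_b = Action(
    "stack a b",
    frozenset({"holding a", "clear b"}),
    frozenset({"on a b", "clear a", "handempty"}),
    frozenset({"holding a", "clear b"}),
)
initial = frozenset({"clear a", "clear b", "ontable a", "ontable b", "handempty"})
goal = frozenset({"on a b"})
```

With these definitions, `plan_is_valid(initial, [pick_up_a, stack_a_b], goal)` holds, while the one-step plan `[stack_a_b]` fails at the precondition check because the hand is not holding block a.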


[90] SteeringControl: Holistic Evaluation of Alignment Steering in LLMs cs.AI | cs.CL | cs.LGPDF

Vincent Siu, Nicholas Crispino, David Park, Nathan W. Henry, Zhun Wang

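TL;DR: The paper introduces SteeringControl, a benchmark for evaluating representation steering methods on core alignment objectives (bias, harmful generation, and hallucination) and their effects on secondary behaviors such as sycophancy and commonsense morality. It identifies tradeoffs not systematically explored in prior work and uses a modular framework to evaluate five popular steering methods across models.

Details

Motivation: Prior alignment work tends to focus only on truthfulness or reasoning ability, leaving other tradeoffs without systematic understanding. The paper aims to fill this gap with a new benchmark that evaluates the effects of steering methods holistically.

Result: Steering effectiveness depends heavily on the combination of method, model, and targeted behavior, and poor combinations can cause severe concept entanglement.

Insight: Steering method design has to weigh tradeoffs among multiple objectives, since optimizing for one can trigger unexpected side effects. The modular framework gives future alignment research flexibility.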

Abstract: We introduce SteeringControl, a benchmark for evaluating representation steering methods across core alignment objectives (bias, harmful generation, and hallucination) and their effects on secondary behaviors such as sycophancy and commonsense morality. While prior alignment work often highlights truthfulness or reasoning ability to demonstrate the side effects of representation steering, we find there are many unexplored tradeoffs not yet understood in a systematic way. We collect a dataset of safety-relevant primary and secondary behaviors to evaluate steering effectiveness and behavioral entanglement centered around five popular steering methods. To enable this, we craft a modular steering framework based on unique components that serve as the building blocks of many existing methods. Our results on Qwen-2.5-7B and Llama-3.1-8B find that strong steering performance is dependent on the specific combination of steering method, model, and targeted behavior, and that severe concept entanglement can result from poor combinations of these three as well. We release our code here: https://github.com/wang-research-lab/SteeringControl.git.


[91] See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles cs.AI | cs.CL | cs.HCPDF

Zongru Wu, Rui Mao, Zhiyuan Tian, Pengzhou Cheng, Tianjie Ju

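TL;DR: The paper proposes State-aware Reasoning (StaR), a training method that improves the accuracy of multimodal agents on toggle instructions in GUIs. By perceiving the current toggle state and analyzing the desired state in the instruction, StaR markedly improves performance.

Details

Motivation: The inability of multimodal agents to reliably execute toggle instructions in GUI interaction is a major bottleneck, especially when the current state already matches the desired state. StaR is proposed to address this.

Result: Experiments on three multimodal agents show that StaR improves toggle instruction execution accuracy by over 30%. Further evaluations on public benchmarks and in a dynamic environment support its generality and practicality.

Insight: StaR addresses the reliability of toggle instructions and suggests a way toward efficient interaction for multimodal agents in complex scenarios.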

Abstract: The advent of multimodal agents facilitates effective interaction within graphical user interface (GUI), especially in ubiquitous GUI control. However, their inability to reliably execute toggle control instructions remains a key bottleneck. To investigate this, we construct a state control benchmark with binary toggle instructions from public datasets. Evaluations of existing agents demonstrate their unreliability, particularly when the current toggle state already matches the desired state. To address the challenge, we propose State-aware Reasoning (StaR), a training method that teaches agents to perceive the current toggle state, analyze the desired state from the instruction, and act accordingly. Experiments on three multimodal agents demonstrate that StaR can improve toggle instruction execution accuracy by over 30%. Further evaluations on three public benchmarks show that StaR also enhances general task performance. Finally, evaluations on a dynamic environment highlight the potential of StaR for real-world applications. Code, benchmark, and StaR-enhanced agents are available at https://github.com/ZrW00/StaR.
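The perceive, analyze, act sequence that StaR teaches reduces to a simple decision rule once the two states are known, as the sketch below shows. In the paper the agent learns to infer both states from a screenshot and an instruction; here the keyword-based `desired_state` parser and the `"click"` / `"no_op"` action names are hypothetical stand-ins used only to make the rule concrete.

```python
def desired_state(instruction):
    """Rough keyword parse of the target toggle state (True means on).

    A trained agent would infer this from the instruction; this is a stand-in.
    """
    text = instruction.lower()
    if "turn off" in text or "disable" in text:
        return False
    if "turn on" in text or "enable" in text:
        return True
    raise ValueError("instruction does not name a toggle state")


def choose_action(current_on, instruction):
    """Click only when the perceived state differs from the desired state."""
    return "click" if current_on != desired_state(instruction) else "no_op"


action_when_off = choose_action(False, "Turn on Wi-Fi")
action_when_on = choose_action(True, "Turn on Wi-Fi")
```

The second case is the failure mode the abstract highlights: when the toggle already matches the instruction, clicking would flip it to the wrong state, so the correct action is to do nothing.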


[92] THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning cs.AI | cs.CL

Qikai Chang, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Yicheng Pan

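TL;DR: The paper proposes THOR, which combines reinforcement learning with external tools to address language models' difficulty with high-precision tasks in mathematical reasoning. It covers dataset construction, fine-grained optimization, and inference enhancement.

Details

Motivation: Although large language models have made progress in mathematical reasoning, they still perform poorly on high-precision tasks such as numerical computation and symbolic manipulation, and need external tools to close the gap.

Result: THOR achieves state-of-the-art performance among models of similar scale on multiple mathematical benchmarks, with consistent improvements on code benchmarks.

Insight: The success of an intermediate tool call is a strong predictor of the final answer's correctness.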

Abstract: Large Language Models (LLMs) have made remarkable progress in mathematical reasoning, but still struggle with high-precision tasks like numerical computation and formal symbolic manipulation. Integrating external tools has emerged as a promising approach to bridge this gap. Despite recent advances, existing methods struggle with three key challenges: constructing tool-integrated reasoning data, performing fine-grained optimization, and enhancing inference. To overcome these limitations, we propose THOR (Tool-Integrated Hierarchical Optimization via RL). First, we introduce TIRGen, a multi-agent actor-critic-based pipeline for constructing high-quality datasets of tool-integrated reasoning paths, aligning with the policy and generalizing well across diverse models. Second, to perform fine-grained hierarchical optimization, we introduce an RL strategy that jointly optimizes for both trajectory-level problem solving and step-level code generation. This is motivated by our key insight that the success of an intermediate tool call is a strong predictor of the final answer’s correctness. Finally, THOR incorporates a self-correction mechanism that leverages immediate tool feedback to dynamically revise erroneous reasoning paths during inference. Our approach demonstrates strong generalization across diverse models, performing effectively in both reasoning and non-reasoning models. It further achieves state-of-the-art performance for models of a similar scale on multiple mathematical benchmarks, while also delivering consistent improvements on code benchmarks. Our code will be publicly available at https://github.com/JingMog/THOR.


[93] Exploring Major Transitions in the Evolution of Biological Cognition With Artificial Neural Networks cs.AI | cs.CL | cs.FL | cs.LG

Konstantinos Voudouris, Andrew Barron, Marta Halina, Colin Klein, Matishalin Patel

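TL;DR: Using artificial neural networks, the paper studies major transitions in the evolution of biological cognition. It finds that changes in the structure of information flow can produce qualitative changes in cognitive performance, with recurrent networks performing better on complex inputs, while training difficulty acts as an evolutionary barrier.

Details

Motivation: To explore how changes in the structure of biological neural networks affect cognitive performance through major transitions, providing a theoretical model for understanding the evolution of cognition.

Result: Recurrent networks outperform feed-forward networks on complex inputs, laminated networks show no advantage in grammar learning, and training difficulty forms a transition barrier.

Insight: Changes in information-flow structure may be a key factor in cognitive evolution; the advantage of recurrent networks illustrates qualitative change and irreversibility in evolution.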

Abstract: Transitional accounts of evolution emphasise a few changes that shape what is evolvable, with dramatic consequences for derived lineages. More recently it has been proposed that cognition might also have evolved via a series of major transitions that manipulate the structure of biological neural networks, fundamentally changing the flow of information. We used idealised models of information flow, artificial neural networks (ANNs), to evaluate whether changes in information flow in a network can yield a transitional change in cognitive performance. We compared networks with feed-forward, recurrent and laminated topologies, and tested their performance learning artificial grammars that differed in complexity, controlling for network size and resources. We documented a qualitative expansion in the types of input that recurrent networks can process compared to feed-forward networks, and a related qualitative increase in performance for learning the most complex grammars. We also noted how the difficulty in training recurrent networks poses a form of transition barrier and contingent irreversibility – other key features of evolutionary transitions. Not all changes in network topology confer a performance advantage in this task set. Laminated networks did not outperform non-laminated networks in grammar learning. Overall, our findings show how some changes in information flow can yield transitions in cognitive performance.


[94] The Art of Saying “Maybe”: A Conformal Lens for Uncertainty Benchmarking in VLMs cs.AI | cs.CV

Asif Azad, Mohammad Sadat Hossain, MD Sadik Hossain Shanto, M Saifur Rahman, Md Rizwan Pervez

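TL;DR: The paper comprehensively evaluates uncertainty quantification in 16 state-of-the-art vision-language models (VLMs). Larger, more accurate models quantify uncertainty better, while mathematical and reasoning tasks elicit poorer uncertainty performance across all models.

Details

Motivation: VLMs perform well on multimodal tasks, but uncertainty quantification, a critical dimension, has not been studied sufficiently. This work aims to fill that gap.

Result: Larger, more accurate models show better uncertainty quantification, while mathematical and reasoning tasks yield poorer uncertainty performance across all models.

Insight: Models need good uncertainty quantification in addition to high accuracy, especially on complex tasks.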

Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in complex visual understanding across scientific and reasoning tasks. While performance benchmarking has advanced our understanding of these capabilities, the critical dimension of uncertainty quantification has received insufficient attention. Therefore, unlike prior conformal prediction studies that focused on limited settings, we conduct a comprehensive uncertainty benchmarking study, evaluating 16 state-of-the-art VLMs (open and closed-source) across 6 multimodal datasets with 3 distinct scoring functions. Our findings demonstrate that larger models consistently exhibit better uncertainty quantification; models that know more also know better what they don’t know. More certain models achieve higher accuracy, while mathematical and reasoning tasks elicit poorer uncertainty performance across all models compared to other domains. This work establishes a foundation for reliable uncertainty evaluation in multimodal systems.
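
For readers unfamiliar with conformal prediction, the following is a minimal sketch of split conformal prediction for a multiple-choice task. It assumes the common score "1 minus the probability of the true option"; the abstract does not name its three scoring functions, so this choice, the function names, and the toy inputs are illustrative assumptions rather than the paper's setup.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a score threshold on held-out data.

    cal_probs: list of per-option probability lists, one per calibration item.
    cal_labels: index of the correct option for each calibration item.
    """
    # Nonconformity score: 1 - probability assigned to the true option.
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed conformal quantile rank
    if rank > n:
        return math.inf  # too few calibration points: include every option
    return scores[rank - 1]

def prediction_set(probs, qhat):
    """Return indices of all options whose score is within the threshold."""
    return [i for i, p in enumerate(probs) if 1.0 - p <= qhat]
```

Under the usual exchangeability assumption, sets built this way contain the correct option with probability at least 1 - alpha. Smaller average set size at a fixed coverage level is one way to read "better uncertainty quantification".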


cs.LG

[95] Privacy-Aware In-Context Learning for Large Language Models cs.LG | cs.CL | cs.CR

Bishnu Bhusal, Manoj Acharya, Ramneet Kaur, Colin Samplawski, Anirban Roy

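TL;DR: The paper proposes a new privacy-preserving framework based on differential privacy (DP) for generating high-quality synthetic text. It provides theoretical bounds on information leakage without requiring model fine-tuning.

Details

Motivation: Large language models (LLMs) excel at natural language understanding and generation but carry privacy risks, in particular sensitive information that can be exposed through prompts. The paper aims to address this.

Result: Experiments show that the method outperforms existing methods on in-context learning (ICL) tasks while balancing privacy and utility.

Insight: The study shows the potential of differential privacy for protecting privacy in LLMs and offers a new technical route to privacy-safe text generation.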

Abstract: Large language models (LLMs) have significantly transformed natural language understanding and generation, but they raise privacy concerns due to potential exposure of sensitive information. Studies have highlighted the risk of information leakage, where adversaries can extract sensitive information embedded in the prompts. In this work, we introduce a novel private prediction framework for generating high-quality synthetic text with strong privacy guarantees. Our approach leverages the Differential Privacy (DP) framework to ensure worst-case theoretical bounds on information leakage without requiring any fine-tuning of the underlying models. The proposed method performs inference on private records and aggregates the resulting per-token output distributions. This enables the generation of longer and coherent synthetic text while maintaining privacy guarantees. Additionally, we propose a simple blending operation that combines private and public inference to further enhance utility. Empirical evaluations demonstrate that our approach outperforms previous state-of-the-art methods on in-context-learning (ICL) tasks, making it a promising direction for privacy-preserving text generation while maintaining high utility.
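
The abstract does not specify how per-token distributions are aggregated or blended, so the sketch below is only an illustration of the general shape of such a step: a plain average over per-record next-token distributions followed by a convex combination with a public distribution. The function names and the mixing weight `lam` are hypothetical, and this code adds no noise and provides no differential privacy guarantee on its own.

```python
def average_distributions(dists):
    """Average next-token distributions obtained from separate private records.

    dists: list of dicts mapping token -> probability over a shared vocabulary.
    """
    vocab = dists[0].keys()
    return {tok: sum(d[tok] for d in dists) / len(dists) for tok in vocab}

def blend(private, public, lam):
    """Convex combination of a private and a public distribution.

    lam in [0, 1] is an assumed mixing weight: 1.0 uses only the private
    distribution, 0.0 only the public one. This is NOT a DP mechanism.
    """
    return {tok: lam * private[tok] + (1.0 - lam) * public[tok] for tok in private}
```

Because both steps are convex combinations of probability distributions, the result still sums to one and can be sampled from directly.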


[96] LLM-I: LLMs are Naturally Interleaved Multimodal Creators cs.LG | cs.CV

Zirun Guo, Feng Zhang, Kai Jia, Tao Jin

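TL;DR: LLM-I is a flexible framework that reframes interleaved image-text generation as a tool-use problem, addressing the “one-tool” bottleneck of current unified models. It trains an LLM/MLLM with reinforcement learning to orchestrate specialized visual tools.

Details

Motivation: Existing unified models are limited to synthetic imagery and struggle with tasks that need factual grounding or programmatic precision. LLM-I aims to overcome this limitation.

Result: Trained on a diverse dataset, LLM-I outperforms existing methods by a large margin on four benchmarks.

Insight: By combining tool use with reinforcement learning, LLM-I shows the potential of dynamically orchestrating specialized tools in multimodal generation tasks.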

Abstract: We propose LLM-Interleaved (LLM-I), a flexible and dynamic framework that reframes interleaved image-text generation as a tool-use problem. LLM-I is designed to overcome the “one-tool” bottleneck of current unified models, which are limited to synthetic imagery and struggle with tasks requiring factual grounding or programmatic precision. Our framework empowers a central LLM or MLLM agent to intelligently orchestrate a diverse toolkit of specialized visual tools, including online image search, diffusion-based generation, code execution, and image editing. The agent is trained to select and apply these tools proficiently via a Reinforcement Learning (RL) framework that features a hybrid reward system combining rule-based logic with judgments from LLM and MLLM evaluators. Trained on a diverse new dataset using four different model backbones, LLM-I demonstrates state-of-the-art performance, outperforming existing methods by a large margin across four benchmarks. We also introduce a novel test-time scaling strategy that provides further performance gains. Project Page: https://github.com/ByteDance-BandAI/LLM-I.


cs.SE

[97] An Empirical Study on Failures in Automated Issue Solving cs.SE | cs.AI | cs.CL

Simiao Liu, Fang Liu, Liehao Li, Xin Tan, Yinghao Zhu

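TL;DR: The paper analyzes failure modes in automated issue solving, builds a taxonomy spanning 3 phases, 9 categories, and 25 subcategories, and designs a collaborative Expert-Executor framework to improve performance.

Details

Motivation: Current automated issue-solving tools still fail on a substantial portion of SWE-Bench-Verified tasks, and existing evaluations report only aggregate performance. This hides the root causes of failure and cannot guide targeted improvement.

Result: Experiments show that the proposed framework solves 22.2% of the issues that a leading single agent could not.

Insight: 1) The main weakness of agentic architectures is flawed reasoning and cognitive deadlocks; 2) diagnostic evaluation and collaborative design can substantially improve agent robustness.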

Abstract: Automated issue solving seeks to autonomously identify and repair defective code snippets across an entire codebase. SWE-Bench has emerged as the most widely adopted benchmark for evaluating progress in this area. While LLM-based agentic tools show great promise, they still fail on a substantial portion of tasks. Moreover, current evaluations primarily report aggregate issue-solving rates, which obscure the underlying causes of success and failure, making it challenging to diagnose model weaknesses or guide targeted improvements. To bridge this gap, we first analyze the performance and efficiency of three SOTA tools, spanning both pipeline-based and agentic architectures, in automated issue solving tasks of SWE-Bench-Verified under varying task characteristics. Furthermore, to move from high-level performance metrics to underlying cause analysis, we conducted a systematic manual analysis of 150 failed instances. From this analysis, we developed a comprehensive taxonomy of failure modes comprising 3 primary phases, 9 main categories, and 25 fine-grained subcategories. We then systematically analyze the distribution of the identified failure modes. The results reveal distinct failure fingerprints between the two architectural paradigms, with the majority of agentic failures stemming from flawed reasoning and cognitive deadlocks. Motivated by these insights, we propose a collaborative Expert-Executor framework. It introduces a supervisory Expert agent tasked with providing strategic oversight and course-correction for a primary Executor agent. This architecture is designed to correct flawed reasoning and break the cognitive deadlocks that frequently lead to failure. Experiments show that our framework solves 22.2% of previously intractable issues for a leading single agent. These findings pave the way for building more robust agents through diagnostic evaluation and collaborative design.


[98] Reasoning Efficiently Through Adaptive Chain-of-Thought Compression: A Self-Optimizing Framework cs.SE | cs.AI | cs.CL

Kerui Huang, Shuhan Liu, Xing Hu, Tongtong Xu, Lingfeng Bao

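TL;DR: The paper proposes SEER, a framework that adaptively compresses Chain-of-Thought (CoT) reasoning to reduce computational overhead while preserving accuracy.

Details

Motivation: CoT reasoning improves LLM accuracy and robustness but is computationally expensive. Overly long reasoning can cause truncation, accuracy drops, and higher latency, especially in tasks that require concise outputs.

Result: Experiments show that SEER shortens CoT by 42.1% on average, reduces truncation, eliminates most infinite loops, and improves efficiency.

Insight: Longer reasoning is not always helpful; adaptively compressing CoT can markedly improve efficiency while maintaining performance.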

Abstract: Chain-of-Thought (CoT) reasoning enhances Large Language Models (LLMs) by prompting intermediate steps, improving accuracy and robustness in arithmetic, logic, and commonsense tasks. However, this benefit comes with high computational costs: longer outputs increase latency, memory usage, and KV-cache demands. These issues are especially critical in software engineering tasks where concise and deterministic outputs are required. To investigate these trade-offs, we conduct an empirical study based on code generation benchmarks. The results reveal that longer CoT does not always help. Excessive reasoning often causes truncation, accuracy drops, and latency up to five times higher, with failed outputs consistently longer than successful ones. These findings challenge the assumption that longer reasoning is inherently better and highlight the need for adaptive CoT control. Motivated by this, we propose SEER (Self-Enhancing Efficient Reasoning), an adaptive framework that compresses CoT while preserving accuracy. SEER combines Best-of-N sampling with task-aware adaptive filtering, dynamically adjusting thresholds based on pre-inference outputs to reduce verbosity and computational overhead. We then evaluate SEER on three software engineering tasks and one math task. On average, SEER shortens CoT by 42.1%, improves accuracy by reducing truncation, and eliminates most infinite loops. These results demonstrate SEER as a practical method to make CoT-enhanced LLMs more efficient and robust, even under resource constraints.
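
The combination of Best-of-N sampling with a length filter whose threshold comes from pre-inference outputs can be sketched as follows. The abstract does not give SEER's actual filtering rule, so the scaled-median budget, the word-count length measure, and both function names are assumptions made only for illustration.

```python
from statistics import median

def adaptive_threshold(pre_inference_lengths, slack=1.0):
    """Length budget derived from pre-inference outputs.

    Hypothetical rule: the median observed CoT length, scaled by `slack`.
    """
    return slack * median(pre_inference_lengths)

def select_concise(candidates, threshold):
    """Pick the shortest correct candidate within the budget, or None.

    candidates: list of (cot_text, is_correct) pairs from N sampled outputs.
    Length is measured in whitespace-separated words for simplicity.
    """
    kept = [text for text, is_correct in candidates
            if is_correct and len(text.split()) <= threshold]
    return min(kept, key=lambda text: len(text.split())) if kept else None
```

Selected outputs of this kind could then serve as concise training targets, which matches the abstract's claim that shorter reasoning need not cost accuracy.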


eess.IV

[99] 3D Reconstruction of Coronary Vessel Trees from Biplanar X-Ray Images Using a Geometric Approach eess.IV | cs.CV

Ethan Koland, Lin Xi, Nadeev Wijesuriya, YingLiang Ma

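TL;DR: The study proposes a geometric framework for reconstructing 3D coronary vessel trees from biplanar X-ray images. It has three main steps (image segmentation, motion phase matching, and 3D reconstruction) and improves on the accuracy and workflow of traditional methods.

Details

Motivation: X-ray angiography is used to visualize coronary vessels during cardiac interventions, but traditional 3D reconstruction methods are error-prone and complex. The study aims to simplify reconstruction and improve its accuracy with a geometric approach.

Result: Segmentation accuracy reaches 0.703, and the 3D reconstruction has a reprojection error of 0.62 ± 0.38 mm, supporting the effectiveness of the method.

Insight: The geometric approach simplifies the 3D reconstruction workflow and is more accurate, which matters for assisting clinical cardiac procedures.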

Abstract: X-ray angiography is widely used in cardiac interventions to visualize coronary vessels, assess integrity, detect stenoses and guide treatment. We propose a framework for reconstructing 3D vessel trees from biplanar X-ray images which are extracted from two X-ray videos captured at different C-arm angles. The proposed framework consists of three main components: image segmentation, motion phase matching, and 3D reconstruction. An automatic video segmentation method for X-ray angiography is used to enable semantic segmentation for image segmentation and motion phase matching. The goal of the motion phase matching is to identify a pair of X-ray images that correspond to a similar respiratory and cardiac motion phase to reduce errors in 3D reconstruction. This is achieved by tracking a stationary object such as a catheter or lead within the X-ray video. The semantic segmentation approach assigns different labels to different object classes, enabling accurate differentiation between blood vessels, balloons, and catheters. Once a suitable image pair is selected, key anatomical landmarks (vessel branching points and endpoints) are matched between the two views using a heuristic method that minimizes reconstruction errors. This is followed by a novel geometric reconstruction algorithm to generate the 3D vessel tree. The algorithm computes the 3D vessel centrelines by determining the intersection of two 3D surfaces. Compared to traditional methods based on epipolar constraints, the proposed approach simplifies the reconstruction workflow and improves overall accuracy. We trained and validated our segmentation method on 62 X-ray angiography video sequences. On the test set, our method achieved a segmentation accuracy of 0.703. The 3D reconstruction framework was validated by measuring the reconstruction error of key anatomical landmarks, achieving a reprojection error of 0.62mm +/- 0.38mm.


[100] Generative AI Pipeline for Interactive Prompt-driven 2D-to-3D Vascular Reconstruction for Fontan Geometries from Contrast-Enhanced X-Ray Fluoroscopy Imaging eess.IV | cs.AI | cs.CV | cs.ET | q-bio.QM | 92C50, 68T07, 76D05, 65D18, 92C55 | I.4.6; I.4.8; J.3; I.2.10; I.4.9

Prahlad G Menon

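TL;DR: The paper presents a generative-AI pipeline for interactive, prompt-driven 2D-to-3D vascular reconstruction of Fontan geometries from contrast-enhanced X-ray fluoroscopy imaging, and demonstrates clinical feasibility.

Details

Motivation: Fontan palliation for univentricular congenital heart disease progresses to hemodynamic failure, and conventional 2D imaging characterizes the complex flow patterns poorly. A method that generates 3D geometry from routine 2D angiographic data is needed.

Result: The pipeline generated geometrically optimized 2D projections, with complete processing in under 15 minutes. Virtual flow visualization identified stagnation zones and flow patterns in branch arteries.

Insight: The method shows the clinical potential of generating CFD-suitable geometries from routine angiographic data. Although it requires iterative refinement for accuracy, it lays a foundation for advanced hemodynamic analysis using readily available imaging data.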

Abstract: Fontan palliation for univentricular congenital heart disease progresses to hemodynamic failure with complex flow patterns poorly characterized by conventional 2D imaging. Current assessment relies on fluoroscopic angiography, providing limited 3D geometric information essential for computational fluid dynamics (CFD) analysis and surgical planning. A multi-step AI pipeline was developed utilizing Google’s Gemini 2.5 Flash (2.5B parameters) for systematic, iterative processing of fluoroscopic angiograms through a transformer-based neural architecture. The pipeline encompasses medical image preprocessing, vascular segmentation, contrast enhancement, artifact removal, and virtual hemodynamic flow visualization within 2D projections. Final views were processed through Tencent’s Hunyuan3D-2mini (384M parameters) for stereolithography file generation. The pipeline successfully generated geometrically optimized 2D projections from single-view angiograms after 16 processing steps using a custom web interface. Initial iterations contained hallucinated vascular features requiring iterative refinement to achieve anatomically faithful representations. Final projections demonstrated accurate preservation of complex Fontan geometry with enhanced contrast suitable for 3D conversion. AI-generated virtual flow visualization identified stagnation zones in central connections and flow patterns in branch arteries. Complete processing required under 15 minutes with second-level API response times. This approach demonstrates the clinical feasibility of generating CFD-suitable geometries from routine angiographic data, enabling 3D generation and rapid virtual flow visualization for cursory insights prior to full CFD simulation. While requiring refinement cycles for accuracy, this establishes a foundation for democratizing advanced geometric and hemodynamic analysis using readily available imaging data.


cs.SD

[101] Noise Supervised Contrastive Learning and Feature-Perturbed for Anomalous Sound Detection cs.SD | cs.CL

Shun Huang, Zhihua Fang, Liang He

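TL;DR: The paper proposes one-stage supervised contrastive learning (OS-SCL) with feature perturbation, which substantially reduces false alarms in anomalous sound detection, and a new time-frequency feature, TFgram. Together they achieve strong performance.

Details

Motivation: Current unsupervised anomalous sound detection methods tend to produce high false-alarm rates when handling samples of the same type from different machines, and this problem remains unresolved. The paper aims to improve detection with supervised contrastive learning and feature perturbation.

Result: On DCASE 2020 Task 2, Log-Mel features reach 94.64% AUC and TFgram features further reach 95.71% AUC, well above the baseline.

Insight: Supervised contrastive learning and feature perturbation effectively reduce false alarms on same-type samples, and the TFgram feature better captures information critical for anomalous sound detection.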

Abstract: Unsupervised anomalous sound detection aims to detect unknown anomalous sounds by training a model using only normal audio data. Despite advancements in self-supervised methods, the issue of frequent false alarms when handling samples of the same type from different machines remains unresolved. This paper introduces a novel training technique called one-stage supervised contrastive learning (OS-SCL), which significantly addresses this problem by perturbing features in the embedding space and employing a one-stage noisy supervised contrastive learning approach. On the DCASE 2020 Challenge Task 2, it achieved 94.64% AUC, 88.42% pAUC, and 89.24% mAUC using only Log-Mel features. Additionally, a time-frequency feature named TFgram is proposed, which is extracted from raw audio. This feature effectively captures critical information for anomalous sound detection, ultimately achieving 95.71% AUC, 90.23% pAUC, and 91.23% mAUC. The source code is available at: www.github.com/huangswt/OS-SCL.


cs.AR

[102] A TRRIP Down Memory Lane: Temperature-Based Re-Reference Interval Prediction For Instruction Caching cs.AR | cs.CL | cs.OS | cs.PF

Henry Kao, Nikhil Sreekumar, Prabhdeep Singh Soni, Ali Sedaghati, Fang Su

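TL;DR: The paper proposes TRRIP, a software-hardware co-design in which the compiler analyzes code temperature (hot/cold) and an OS interface passes it to the hardware to optimize the instruction cache replacement policy. This reduces the eviction rate of hot code and improves mobile CPU performance.

Details

Motivation: Modern mobile CPU software has complex runtime behavior that causes high reuse distances in the instruction cache, and conventional hardware-centric cache management falls short. Code size and complexity are growing faster than on-chip memory, so a new approach is needed.

Result: On mobile code already optimized with PGO, TRRIP reduces L2 instruction MPKI by 26.5%, for a geomean speedup of 3.9%.

Insight: Software-hardware co-design can use code temperature information to optimize cache management, improving mobile system performance while remaining practical to deploy.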

Abstract: Modern mobile CPU software poses challenges for conventional instruction cache replacement policies because its complex runtime behavior causes high reuse distance between executions of the same instruction. Mobile code commonly suffers from large amounts of stalls in the CPU frontend and thus starvation of the rest of the CPU resources. Complexity of these applications and their code footprint are projected to grow at a rate faster than available on-chip memory due to power and area constraints, making conventional hardware-centric methods for managing instruction caches inadequate. We present a novel software-hardware co-design approach called TRRIP (Temperature-based Re-Reference Interval Prediction) that enables the compiler to analyze, classify, and transform code based on “temperature” (hot/cold), and to provide the hardware with a summary of code temperature information through a well-defined OS interface based on using code page attributes. TRRIP’s lightweight hardware extension employs code temperature attributes to optimize the instruction cache replacement policy, reducing the eviction rate of hot code. TRRIP is designed to be practical and adoptable in real mobile systems that have strict feature requirements on both the software and hardware components. TRRIP can reduce the L2 MPKI for instructions by 26.5%, resulting in a geomean speedup of 3.9%, on top of RRIP cache replacement running mobile code already optimized using PGO.
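
To make the idea of temperature-guided replacement concrete, here is a small simulation sketch of one cache set using RRIP-style re-reference prediction values (RRPVs). The abstract does not describe how TRRIP maps temperature to RRPVs, so the insertion rule below (hot lines inserted as "near re-reference", cold lines as "long re-reference") and the class name `RRIPSet` are assumptions for illustration, not the paper's hardware design.

```python
class RRIPSet:
    """One cache set with 2-bit RRPVs and a temperature hint on insertion."""

    MAX_RRPV = 3  # 2-bit value: 0 = re-referenced soon, 3 = distant future

    def __init__(self, ways):
        self.ways = ways
        self.lines = {}  # tag -> RRPV

    def access(self, tag, hot):
        """Access `tag`; return True on a hit, False on a miss."""
        if tag in self.lines:
            self.lines[tag] = 0  # promote on hit
            return True
        if len(self.lines) >= self.ways:
            self._evict()
        # Assumed policy: hot code is predicted to be reused soon.
        self.lines[tag] = 0 if hot else self.MAX_RRPV - 1
        return False

    def _evict(self):
        """Evict a line with RRPV == MAX_RRPV, aging all lines until one exists."""
        while True:
            victim = next(
                (t for t, v in self.lines.items() if v == self.MAX_RRPV), None
            )
            if victim is not None:
                del self.lines[victim]
                return
            for t in self.lines:
                self.lines[t] += 1
```

In this toy model a hot line survives a short stream of cold misses that would otherwise push it out, which is the effect the abstract describes as a reduced eviction rate for hot code.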


cs.IR

[103] Enhancing Time Awareness in Generative Recommendation cs.IR | cs.CL

Sunkyung Lee, Seongmin Park, Jonghyo Kim, Mincheol Yoon, Jongwuk Lee

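TL;DR: The paper proposes GRUT, which uses time-aware prompting and trend-aware inference to address the neglect of temporal dynamics in generative recommendation, substantially improving recommendation performance.

Details

Motivation: Existing generative recommendation methods focus on the sequential order of items and neglect the temporal dynamics across items, which can imply evolving user preferences. The paper aims to address this limitation.

Result: Across four benchmark datasets, GRUT improves Recall@5 and NDCG@5 by up to 15.4% and 14.3%, respectively.

Insight: Temporal dynamics are important for capturing the evolution of user preferences, and GRUT offers a new direction for generative recommendation.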

Abstract: Generative recommendation has emerged as a promising paradigm that formulates recommendation as a text-to-text generation task, harnessing the vast knowledge of large language models. However, existing studies focus on the sequential order of items and neglect the temporal dynamics across items, which can imply evolving user preferences. To address this limitation, we propose a novel model, Generative Recommender Using Time awareness (GRUT), effectively capturing hidden user preferences via various temporal signals. We first introduce Time-aware Prompting, consisting of two key contexts. The user-level temporal context models personalized temporal patterns across timestamps and time intervals, while the item-level transition context provides transition patterns across users. We also devise Trend-aware Inference, a training-free method that enhances rankings by incorporating trend information about items with generation likelihood. Extensive experiments demonstrate that GRUT outperforms state-of-the-art models, with gains of up to 15.4% and 14.3% in Recall@5 and NDCG@5 across four benchmark datasets. The source code is available at https://github.com/skleee/GRUT.
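
A training-free reranking step that combines generation likelihood with item trend information might look like the sketch below. The abstract does not give GRUT's scoring formula, so the log-linear combination, the add-one smoothing, the `weight` parameter, and the function name are all assumptions for illustration.

```python
import math

def trend_aware_rank(gen_logprob, recent_counts, weight=0.5):
    """Rerank candidate items by likelihood plus a weighted trend term.

    gen_logprob: dict item -> log-likelihood from the generative recommender.
    recent_counts: dict item -> recent interaction count (trend signal).
    weight: assumed strength of the trend term; 0 keeps the original ranking.
    """
    total = sum(recent_counts.values())
    n_items = len(gen_logprob)

    def score(item):
        # Add-one smoothing so unseen items keep a finite log trend score.
        trend = (recent_counts.get(item, 0) + 1) / (total + n_items)
        return gen_logprob[item] + weight * math.log(trend)

    return sorted(gen_logprob, key=score, reverse=True)
```

Because the adjustment happens only at ranking time, it needs no retraining, which is consistent with the abstract's description of Trend-aware Inference as training-free.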


q-bio.PE

[104] Autonomous Reporting of Normal Chest X-rays by Artificial Intelligence in the United Kingdom. Can We Take the Human Out of the Loop? q-bio.PE | cs.CV

Katrina Nash, James Vaz, Ahmed Maiter, Christopher Johns, Nicholas Woznitza

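TL;DR: The paper examines the feasibility and potential impact of autonomous AI reporting of normal chest X-rays (CXRs) in the UK, covering technical, legal, and practical challenges.

Details

Motivation: A shortage of radiologists in the UK delays CXR reporting. AI tools that can identify and report normal CXRs automatically could substantially reduce workload.

Result: The article finds that AI has potential in this area, but further validation and regulatory frameworks are needed to ensure safety and accountability.

Insight: AI can support radiology work, but removing human oversight entirely may be premature. Adoption requires technical improvement, legal support, and the involvement of multiple stakeholders.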

Abstract: Chest X-rays (CXRs) are the most commonly performed imaging investigation. In the UK, many centres experience reporting delays due to radiologist workforce shortages. Artificial intelligence (AI) tools capable of distinguishing normal from abnormal CXRs have emerged as a potential solution. If normal CXRs could be safely identified and reported without human input, a substantial portion of radiology workload could be reduced. This article examines the feasibility and implications of autonomous AI reporting of normal CXRs. Key issues include defining normal, ensuring generalisability across populations, and managing the sensitivity-specificity trade-off. It also addresses legal and regulatory challenges, such as compliance with IR(ME)R and GDPR, and the lack of accountability frameworks for errors. Further considerations include the impact on radiologists' practice, the need for robust post-market surveillance, and incorporation of patient perspectives. While the benefits are clear, adoption must be cautious.


cs.CY

[105] Accuracy Paradox in Large Language Models: Regulating Hallucination Risks in Generative AI cs.CY | cs.AI | cs.CL | cs.HC | cs.LG

Zihao Li, Weiwei Yi, Jiahong Chen

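TL;DR: The paper examines the “accuracy paradox” in large language models (LLMs), arguing that overreliance on accuracy metrics masks the complexity of hallucination, and proposes a multidimensional taxonomy and new directions for risk governance.

Details

Motivation: As LLMs are widely used in everyday decision-making, hallucinated outputs (fabricated, misleading, or untrustworthy content) pose societal and epistemic risks that demand urgent scrutiny. Existing governance frameworks rely too heavily on accuracy, so the problem is misdiagnosed.

Result: Accuracy as a single metric fails to capture harms such as misleading content, value bias, and social distortion. The paper calls for pluralistic governance approaches.

Insight: Governing LLM hallucination must go beyond accuracy and attend to context awareness, resilience to manipulation, and social pluralism in order to address broader epistemic and societal risks.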

Abstract: As Large Language Models (LLMs) permeate everyday decision-making, their epistemic and societal risks demand urgent scrutiny. Hallucinations, the generation of fabricated, misleading, oversimplified or untrustworthy outputs, have emerged as a pressing challenge. While regulatory, academic, and technical discourse position accuracy as the principal benchmark for mitigating such harms, this article contends that overreliance on accuracy misdiagnoses the problem and has a counterproductive effect: the accuracy paradox. Drawing on interdisciplinary literatures, this article develops a taxonomy of hallucination types and shows the paradox along three intertwining dimensions: outputs, individuals and society. First, accuracy functions as a superficial proxy for reliability, incentivising the optimisation of rhetorical fluency and surface-level correctness over epistemic trustworthiness. This encourages passive user trust in outputs that appear accurate but are epistemically untenable. Second, accuracy as a singular metric fails to detect harms that are not factually false but are nonetheless misleading, value-laden, or socially distorting, including consensus illusions, sycophantic alignment, and subtle manipulation. Third, regulatory overemphasis on accuracy obscures the wider societal consequences of hallucination, including social sorting, privacy violations, equity harms, and epistemic convergence that marginalises dissent, reduces pluralism, and causes social deskilling. By examining the EU AI Act, GDPR, and DSA, the article argues that current regulations are not yet structurally equipped to address these epistemic, relational, and systemic harms, which are exacerbated by the overreliance on accuracy. By exposing such conceptual and practical challenges, this article calls for a fundamental shift towards pluralistic, context-aware, and manipulation-resilient approaches to trustworthy AI governance.


[106] CogniAlign: Survivability-Grounded Multi-Agent Moral Reasoning for Safe and Transparent AI cs.CY | cs.CL

Hasin Jawad Ali, Ilhamul Azam, Ajwad Abrar, Md. Kamrul Hasan, Hasan Mahmud

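TL;DR: CogniAlign is a multi-agent deliberation framework based on naturalistic moral realism. Through structured debate among discipline-specific agents (neuroscience, psychology, and others), it grounds moral reasoning in survivability and clearly outperforms GPT-4o in transparency and depth of explanation.

Details

Motivation: Existing AI alignment approaches suffer from abstraction and opacity in moral reasoning. CogniAlign aims to improve transparency and explanation quality through deliberation among multidisciplinary agents.

Result: On more than 60 moral questions, CogniAlign improves on GPT-4o by an average of 16.2 points in analytic quality, 14.3 points in breadth, and 28.4 points in depth of explanation.

Insight: Interdisciplinary deliberation can markedly improve the transparency and safety of AI moral reasoning, offering a scalable path for alignment.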

Abstract: The challenge of aligning artificial intelligence (AI) with human values persists due to the abstract and often conflicting nature of moral principles and the opacity of existing approaches. This paper introduces CogniAlign, a multi-agent deliberation framework based on naturalistic moral realism, that grounds moral reasoning in survivability, defined across individual and collective dimensions, and operationalizes it through structured deliberations among discipline-specific scientist agents. Each agent, representing neuroscience, psychology, sociology, and evolutionary biology, provides arguments and rebuttals that are synthesized by an arbiter into transparent and empirically anchored judgments. We evaluate CogniAlign on classic and novel moral questions and compare its outputs against GPT-4o using a five-part ethical audit framework. Results show that CogniAlign consistently outperforms the baseline across more than sixty moral questions, with average performance gains of 16.2 points in analytic quality, 14.3 points in breadth, and 28.4 points in depth of explanation. In the Heinz dilemma, for example, CogniAlign achieved an overall score of 89.2 compared to GPT-4o’s 69.2, demonstrating a decisive advantage in handling moral reasoning. By reducing black-box reasoning and avoiding deceptive alignment, CogniAlign highlights the potential of interdisciplinary deliberation as a scalable pathway for safe and transparent AI alignment.