Table of Contents
- cs.CV [Total: 70]
- cs.CL [Total: 24]
- quant-ph [Total: 1]
- cs.LG [Total: 1]
- cs.AI [Total: 5]
- cs.RO [Total: 1]
- cs.CY [Total: 1]
- cs.IR [Total: 1]
- cs.CR [Total: 3]
- eess.IV [Total: 5]
cs.CV
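[1] GAITEX: Human motion dataset from impaired gait and rehabilitation exercises of inertial and optical sensor data cs.CV | cs.AI | cs.HC
Andreas Spilz, Heiko Oppel, Jochen Werner, Kathrin Stucke-Straub, Felix Capanni
TL;DR: GAITEX is a multimodal dataset of inertial and optical sensor data covering normal and impaired gait as well as rehabilitation exercises, intended to support machine learning model development.
Details
Motivation: Assessing human movement quality in clinical and everyday settings requires large, diverse datasets, which are currently costly and time-consuming to collect.
Result: The dataset contains raw and processed data and supports analysis tasks such as automatic exercise evaluation and gait analysis.
Insight: By providing synchronized IMU and MoCap data, the dataset offers a standardized resource for sensor-based movement analysis research.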
Abstract: Wearable inertial measurement units (IMUs) offer a cost-effective and scalable means to assess human movement quality in clinical and everyday settings. However, the development of robust sensor-based classification models for physiotherapeutic exercises and gait analysis requires large, diverse datasets, which are costly and time-consuming to collect. Here, we present a multimodal dataset of physiotherapeutic exercises - including correct and clinically relevant variants - and gait-related exercises - including both normal and impaired gait patterns - recorded from 19 participants using synchronized IMUs and marker-based motion capture (MoCap). The dataset includes raw data from nine IMUs and thirty-five optical markers capturing full-body kinematics. Each IMU is additionally equipped with four optical markers, enabling precise comparison between IMU-derived orientation estimates and reference values from the MoCap system. To support further analysis, we also provide processed IMU orientations aligned with common segment coordinate systems, subject-specific OpenSim models, inverse kinematics results, and tools for visualizing IMU orientations in the musculoskeletal context. Detailed annotations of movement execution quality and time-stamped segmentations support diverse analysis goals. This dataset supports the development and benchmarking of machine learning models for tasks such as automatic exercise evaluation, gait analysis, temporal activity segmentation, and biomechanical parameter estimation. To facilitate reproducibility, we provide code for postprocessing, sensor-to-segment alignment, inverse kinematics computation, and technical validation. This resource is intended to accelerate research in machine learning-driven human movement analysis.
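[2] Seeing Beyond Frames: Zero-Shot Pedestrian Intention Prediction with Raw Temporal Video and Multimodal Cues cs.CV | cs.AI | cs.LG
Pallavi Zambare, Venkata Nikhil Thanikella, Ying Liu
TL;DR: The paper proposes BF-PIP, a zero-shot method that predicts pedestrian intention from continuous video clips and multimodal data, clearly outperforming a GPT-4V approach based on discrete frames.
Details
Motivation: Pedestrian intention prediction is essential for autonomous driving in complex urban environments. Existing methods depend on supervised learning and extensive retraining, which makes it hard to adapt to new scenarios.
Result: BF-PIP reaches 73% prediction accuracy, 18% higher than the GPT-4V baseline.
Insight: Continuous spatiotemporal information and multimodal contextual cues improve intent inference under ambiguous conditions, offering a retraining-free perception module for intelligent transportation systems.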
Abstract: Pedestrian intention prediction is essential for autonomous driving in complex urban environments. Conventional approaches depend on supervised learning over frame sequences and require extensive retraining to adapt to new scenarios. Here, we introduce BF-PIP (Beyond Frames Pedestrian Intention Prediction), a zero-shot approach built upon Gemini 2.5 Pro. It infers crossing intentions directly from short, continuous video clips enriched with structured JAAD metadata. In contrast to GPT-4V based methods that operate on discrete frames, BF-PIP processes uninterrupted temporal clips. It also incorporates bounding-box annotations and ego-vehicle speed via specialized multimodal prompts. Without any additional training, BF-PIP achieves 73% prediction accuracy, outperforming a GPT-4V baseline by 18%. These findings illustrate that combining temporal video inputs with contextual cues enhances spatiotemporal perception and improves intent inference under ambiguous conditions. This approach paves the way for agile, retraining-free perception modules in intelligent transportation systems.
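[3] ChartM$^3$: Benchmarking Chart Editing with Multimodal Instructions cs.CV | cs.AI
Danglu Yang, Liang Zhang, Zihao Yue, Liangyu Chen, Yichen Xu
TL;DR: ChartM^3 proposes a new paradigm for multimodal chart editing in which user intent is expressed through natural language combined with visual indicators. It provides a 1,000-sample benchmark and a large-scale training set, ChartM^3-Train, and exposes limitations of current multimodal large language models.
Details
Motivation: Chart editing is valuable in data analysis, but existing methods rely mainly on natural language instructions, which are ambiguous for fine-grained edits. Expressing intent multimodally is needed to make editing more precise.
Result: Experiments show that current MLLMs struggle to interpret visual indicators, while models fine-tuned on ChartM^3-Train improve substantially.
Insight: Multimodal supervision is crucial for building practical chart editing systems, and existing MLLMs still need a better understanding of visual information.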
Abstract: Charts are a fundamental visualization format widely used in data analysis across research and industry. While enabling users to edit charts based on high-level intentions is of great practical value, existing methods primarily rely on natural language instructions, which are often too ambiguous to support fine-grained editing. In this work, we introduce a novel paradigm for multimodal chart editing, where user intent is expressed through a combination of natural language and visual indicators that explicitly highlight the elements to be modified. To support this paradigm, we present Chart$\text{M}^3$, a new benchmark for Multimodal chart editing with Multi-level complexity and Multi-perspective evaluation. Chart$\text{M}^3$ contains 1,000 samples spanning four levels of editing difficulty. Each sample includes triplets in the form of (chart, code, multimodal instructions). To comprehensively evaluate chart editing models, Chart$\text{M}^3$ provides metrics that assess both visual appearance and code correctness. Our benchmark reveals significant limitations in current multimodal large language models (MLLMs), including GPT-4o, particularly in their ability to interpret and act on visual indicators. To address this, we construct Chart$\text{M}^3$-Train, a large-scale training set with 24,000 multimodal chart editing samples. Fine-tuning MLLMs on this dataset leads to substantial improvements, demonstrating the importance of multimodal supervision in building practical chart editing systems. Our datasets, code, and evaluation tools are available at https://github.com/MLrollIT/ChartM3 and https://github.com/yaolinli/VCE.
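[4] PanoGAN A Deep Generative Model for Panoramic Dental Radiographs cs.CV | cs.ET | cs.LG | eess.IV
Soren Pedersen, Sanyam Jain, Mikkel Chavez, Viktor Ladehoff, Bruna Neves de Freitas
TL;DR: The paper presents PanoGAN, a GAN for generating panoramic dental radiographs, aiming to address data scarcity in dental research and education. A DCGAN trained with a WGAN-GP loss on a preprocessed dataset of 2322 radiographs produced images with moderate anatomical visibility and realism.
Details
Motivation: Panoramic radiograph data is scarce in dental research and education, which limits progress in both.
Result: The generated images show moderate anatomical visibility and realism; the preprocessing choice (denoised vs. non-denoised) trades off fine detail against overall clarity.
Insight: Denoising before training improves overall image clarity but can lose some detail (such as the mandibular canal and trabecular bone structure); non-denoised data preserves more detail.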
Abstract: This paper presents the development of a generative adversarial network (GAN) for synthesizing dental panoramic radiographs. Although exploratory in nature, the study aims to address the scarcity of data in dental research and education. We trained a deep convolutional GAN (DCGAN) using a Wasserstein loss with gradient penalty (WGAN-GP) on a dataset of 2322 radiographs of varying quality. The focus was on the dentoalveolar regions; other anatomical structures were cropped out. Extensive preprocessing and data cleaning were performed to standardize the inputs while preserving anatomical variability. We explored four candidate models by varying critic iterations, feature depth, and the use of denoising prior to training. A clinical expert evaluated the generated radiographs based on anatomical visibility and realism, using a 5-point scale (1 = very poor, 5 = excellent). Most images showed moderate anatomical depiction, although some were degraded by artifacts. A trade-off was observed: the model trained on non-denoised data yielded finer details, especially in structures like the mandibular canal and trabecular bone, while a model trained on denoised data offered superior overall image clarity and sharpness. These findings provide a foundation for future work on GAN-based methods in dental imaging.
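[5] On Explaining Visual Captioning with Hybrid Markov Logic Networks cs.CV | cs.AI
Monika Shah, Somdeb Sarkhel, Deepak Venugopal
TL;DR: The paper proposes a framework based on Hybrid Markov Logic Networks (HMLNs) for explaining the behavior of visual captioning models: it analyzes how training data may have influenced a generated caption, improving interpretability.
Details
Motivation: Deep neural networks perform well on image captioning, but how they integrate visual, linguistic, and knowledge information to generate captions remains hard to explain. Standard performance metrics (such as comparison with human-written captions) give little insight into this integration.
Result: Experiments show that the method gives intuitive explanations of the generation behavior of several state-of-the-art captioning models and allows models to be compared along the dimension of explainability.
Insight: Combining symbolic logic with real-valued functions (as in HMLNs) can improve the interpretability of complex models and provides a new tool for analyzing information integration in multimodal tasks.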
Abstract: Deep Neural Networks (DNNs) have made tremendous progress in multimodal tasks such as image captioning. However, explaining/interpreting how these models integrate visual information, language information and knowledge representation to generate meaningful captions remains a challenging problem. Standard metrics to measure performance typically rely on comparing generated captions with human-written ones, which may not give a user deep insight into this integration. In this work, we develop a novel explanation framework that is easily interpretable based on Hybrid Markov Logic Networks (HMLNs) - a language that can combine symbolic rules with real-valued functions - where we hypothesize how relevant examples from the training data could have influenced the generation of the observed caption. To do this, we learn an HMLN distribution over the training instances and infer the shift in distributions over these instances when we condition on the generated sample, which allows us to quantify which examples may have been a source of richer information to generate the observed caption. Our experiments on captions generated for several state-of-the-art captioning models using Amazon Mechanical Turk illustrate the interpretability of our explanations, and allow us to compare these models along the dimension of explainability.
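[6] Tracking Moose using Aerial Object Detection cs.CV | I.4.8
Christopher Indris, Raiyan Rahman, Goetz Bramesfeld, Guanghui Wang
TL;DR: The paper uses a patching augmentation to study model performance under different settings and compares three object detectors, finding that simple, efficient models perform well on this small-object detection task.
Details
Motivation: Aerial wildlife tracking is critical for conservation, but traditional approaches are costly and risky, and small-object detection is technically challenging.
Result: Every model reaches at least 93% mAP@IoU=0.5 on at least one patching configuration, and simpler models perform on par with more complex ones.
Insight: For UAV deployment with limited onboard compute, simple and efficient models are the better fit for small-object detection.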
Abstract: Aerial wildlife tracking is critical for conservation efforts and relies on detecting small objects on the ground below the aircraft. It presents technical challenges: crewed aircraft are expensive, risky and disruptive; autonomous drones have limited computational capacity for onboard AI systems. Since the objects of interest may appear only a few pixels wide, small object detection is an inherently challenging computer vision subfield compounded by computational efficiency needs. This paper applies a patching augmentation to datasets to study model performance under various settings. A comparative study of three common yet architecturally diverse object detectors is conducted using the data, varying the patching method’s hyperparameters against detection accuracy. Each model achieved at least 93% mAP@IoU=0.5 on at least one patching configuration. Statistical analyses provide an in-depth commentary on the effects of various factors. Analysis also shows that faster, simpler models are about as effective as models that require more computational power for this task and perform well given limited patch scales, encouraging UAV deployment. Datasets and models will be made available via https://github.com/chrisindris/Moose.
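The mAP@IoU=0.5 figure above counts a detection as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. The following is a minimal sketch of that overlap test, assuming axis-aligned boxes given as (x_min, y_min, x_max, y_max). The function names and example boxes are illustrative and are not taken from the paper.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, truth, threshold=0.5):
    """A prediction matches a ground-truth box if their IoU reaches the threshold."""
    return box_iou(pred, truth) >= threshold


# A 4x4-pixel target: a 1-pixel offset still matches, a 2-pixel offset does not.
print(is_true_positive((1, 0, 5, 4), (0, 0, 4, 4)))  # IoU = 12 / 20 = 0.6
print(is_true_positive((2, 0, 6, 4), (0, 0, 4, 4)))  # IoU = 8 / 24, about 0.33
```

The two calls show why targets only a few pixels wide are hard to score well: a localization error of two pixels is already enough to fall below the 0.5 threshold.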
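[7] HDR Environment Map Estimation with Latent Diffusion Models cs.CV
Jack Hilliard, Adrian Hilton, Jean-Yves Guillemaut
TL;DR: The paper proposes a new method that uses a latent diffusion model (LDM) to estimate high-quality HDR environment maps from a single-view image, addresses the distortions and seams common to the ERP representation, and introduces a panoramically-adapted Diffusion Transformer architecture.
Details
Motivation: Existing HDR environment map estimation methods suffer from distortions at the poles and a seam at the sides of the ERP representation. The paper aims to improve map quality and accuracy by adapting the latent diffusion model and the network architecture.
Result: On standard benchmarks, the proposed models produce high-quality environment maps that are competitive with state-of-the-art approaches in both image quality and lighting accuracy.
Insight: Combining a latent diffusion model with an ERP-adapted architecture can mitigate distortions and seams in HDR environment map estimation, but there is a trade-off between image quality and reduced distortion.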
Abstract: We advance the field of HDR environment map estimation from a single-view image by establishing a novel approach leveraging the Latent Diffusion Model (LDM) to produce high-quality environment maps that can plausibly light mirror-reflective surfaces. A common issue when using the ERP representation, the format used by the vast majority of approaches, is distortions at the poles and a seam at the sides of the environment map. We remove the border seam artefact by proposing an ERP convolutional padding in the latent autoencoder. Additionally, we investigate whether adapting the diffusion network architecture to the ERP format can improve the quality and accuracy of the estimated environment map by proposing a panoramically-adapted Diffusion Transformer architecture. Our proposed PanoDiT network reduces ERP distortions and artefacts, but at the cost of image quality and plausibility. We evaluate with standard benchmarks to demonstrate that our models estimate high-quality environment maps that perform competitively with state-of-the-art approaches in both image quality and lighting accuracy.
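An ERP (equirectangular) image wraps around horizontally, so its left and right columns are neighbors on the sphere; padding that ignores this produces the border seam described above. The sketch below illustrates the general idea of wrap-around padding on a small 2D grid. It is an assumption-laden illustration, not the paper's padding scheme: the function name `erp_pad` is hypothetical, and the edge replication used for the top and bottom rows is a placeholder, since the abstract does not say how the poles are handled.

```python
def erp_pad(feature_map, pad):
    """Pad a 2D grid (list of rows) for convolution on an ERP image.

    Horizontally the grid is wrapped (circular padding), because the left and
    right edges of an ERP image meet on the sphere. Vertically the first and
    last rows are replicated (an assumption made for this sketch).
    """
    if pad <= 0:
        return [list(row) for row in feature_map]
    width = len(feature_map[0])
    if pad > width:
        raise ValueError("pad must not exceed the map width")
    # Wrap-around: the last `pad` columns go on the left, the first on the right.
    wrapped = [row[-pad:] + row + row[:pad] for row in feature_map]
    top = [list(wrapped[0]) for _ in range(pad)]
    bottom = [list(wrapped[-1]) for _ in range(pad)]
    return top + wrapped + bottom


grid = [[1, 2, 3],
        [4, 5, 6]]
for row in erp_pad(grid, 1):
    print(row)
```

With this padding a convolution kernel centered on the first column sees the last column as its left neighbor, which is what makes the two sides of the map consistent.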
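[8] VoluMe – Authentic 3D Video Calls from Live Gaussian Splat Prediction cs.CV | cs.GR
Martin de La Gorce, Charlie Hewitt, Tibor Takacs, Robert Gerdisch, Zafiirah Hosenie
TL;DR: The paper presents VoluMe, a method that predicts 3D Gaussian reconstructions in real time from a monocular 2D camera, enabling realistic and authentic 3D video calls.
Details
Motivation: Conventional 3D meeting solutions typically depend on complex hardware or a fixed appearance, which suits videoconferencing poorly. The paper aims for realistic, real-time 3D video calls at low cost using a single monocular camera.
Result: The method surpasses existing approaches in visual quality and stability, and is demonstrated in real-time 3D meetings using only a monocular camera and a display.
Insight: With a lightweight, real-time method, 3D video calls can run on ordinary equipment, increasing the sense of presence and authenticity in remote meetings.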
Abstract: Virtual 3D meetings offer the potential to enhance copresence, increase engagement and thus improve effectiveness of remote meetings compared to standard 2D video calls. However, representing people in 3D meetings remains a challenge; existing solutions achieve high quality by using complex hardware, making use of fixed appearance via enrolment, or by inverting a pre-trained generative model. These approaches lead to constraints that are unwelcome and ill-fitting for videoconferencing applications. We present the first method to predict 3D Gaussian reconstructions in real time from a single 2D webcam feed, where the 3D representation is not only live and realistic, but also authentic to the input video. By conditioning the 3D representation on each video frame independently, our reconstruction faithfully recreates the input video from the captured viewpoint (a property we call authenticity), while generalizing realistically to novel viewpoints. Additionally, we introduce a stability loss to obtain reconstructions that are temporally stable on video sequences. We show that our method delivers state-of-the-art accuracy in visual quality and stability metrics compared to existing methods, and demonstrate our approach in live one-to-one 3D meetings using only a standard 2D camera and display. This demonstrates that our approach can allow anyone to communicate volumetrically, via a method for 3D videoconferencing that is not only highly accessible, but also realistic and authentic.
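[9] GLCP: Global-to-Local Connectivity Preservation for Tubular Structure Segmentation cs.CV
Feixiang Zhou, Zhuangzhi Gao, He Zhao, Jianyang Xie, Yanda Meng
TL;DR: GLCP is a new framework that tackles fragmentation in tubular structure segmentation by preserving both global and local connectivity, combining multi-task learning with a lightweight dual-attention refinement module to improve segmentation.
Details
Motivation: Accurate segmentation of tubular structures such as vascular networks is critical for downstream medical applications, but existing methods often overlook local discontinuity regions, leading to structural fragmentation.
Result: On both 2D and 3D datasets, GLCP outperforms existing methods, achieving higher segmentation accuracy and continuity.
Insight: Attending to both global topology and local discontinuity regions is key to better tubular structure segmentation, and a lightweight attention mechanism can further refine the results.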
Abstract: Accurate segmentation of tubular structures, such as vascular networks, plays a critical role in various medical domains. A remaining significant challenge in this task is structural fragmentation, which can adversely impact downstream applications. Existing methods primarily focus on designing various loss functions to constrain global topological structures. However, they often overlook local discontinuity regions, leading to suboptimal segmentation results. To overcome this limitation, we propose a novel Global-to-Local Connectivity Preservation (GLCP) framework that can simultaneously perceive global and local structural characteristics of tubular networks. Specifically, we propose an Interactive Multi-head Segmentation (IMS) module to jointly learn global segmentation, skeleton maps, and local discontinuity maps, respectively. This enables our model to explicitly target local discontinuity regions while maintaining global topological integrity. In addition, we design a lightweight Dual-Attention-based Refinement (DAR) module to further improve segmentation quality by refining the resulting segmentation maps. Extensive experiments on both 2D and 3D datasets demonstrate that our GLCP achieves superior accuracy and continuity in tubular structure segmentation compared to several state-of-the-art approaches. The source codes will be available at https://github.com/FeixiangZhou/GLCP.
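[10] Analyzing the Sensitivity of Vision Language Models in Visual Question Answering cs.CV
Monika Shah, Sudarshan Balaji, Somdeb Sarkhel, Sanorita Dey, Deepak Venugopal
TL;DR: The paper studies how sensitive vision language models (VLMs) are to violations of Grice's conversational maxims in visual question answering, and finds that adding modifiers to questions lowers model performance.
Details
Motivation: Humans can still understand a conversation when Grice's maxims are flouted. The paper asks whether VLMs can handle such violations in a similar way, in order to expose their limitations.
Result: Initial experiments indicate that adding modifiers consistently degrades VLM performance, suggesting that VLMs do not cope with violations of the conversational maxims the way humans do.
Insight: VLMs have limitations in handling complex or counterintuitive conversation; future work could further explore the underlying mechanisms.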
Abstract: We can think of Visual Question Answering as a (multimodal) conversation between a human and an AI system. Here, we explore the sensitivity of Vision Language Models (VLMs) through the lens of cooperative principles of conversation proposed by Grice. Specifically, even when Grice’s maxims of conversation are flouted, humans typically do not have much difficulty in understanding the conversation even though it requires more cognitive effort. Here, we study if VLMs are capable of handling violations to Grice’s maxims in a manner that is similar to humans. Specifically, we add modifiers to human-crafted questions and analyze the response of VLMs to these modifiers. We use three state-of-the-art VLMs in our study, namely, GPT-4o, Claude-3.5-Sonnet and Gemini-1.5-Flash on questions from the VQA v2.0 dataset. Our initial results seem to indicate that the performance of VLMs consistently diminishes with the addition of modifiers, which suggests that our approach is a promising direction for understanding the limitations of VLMs.
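[11] Enhancing and Accelerating Brain MRI through Deep Learning Reconstruction Using Prior Subject-Specific Imaging cs.CV | physics.med-ph
Amirmohammad Shamaei, Alexander Stebner, Salome Bosshart, Johanna Ospel
TL;DR: The paper proposes a deep-learning-based MRI reconstruction framework that integrates prior subject-specific images, improving reconstruction quality and speed while avoiding the time cost of traditional registration algorithms.
Details
Motivation: MRI scans are long and costly and reduce patient comfort, and existing methods for integrating prior scans are time-consuming, so a more efficient reconstruction method is needed.
Result: At four acceleration factors (R5 to R20), quantitative metrics are significantly better than existing methods, accuracy and volumetric agreement on the downstream brain segmentation task improve, and reconstruction time drops substantially.
Insight: Deep learning methods that integrate prior information can markedly improve MRI reconstruction efficiency and quality while cutting the time spent on traditional registration, making real-time clinical use more feasible.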
Abstract: Magnetic resonance imaging (MRI) is a crucial medical imaging modality. However, long acquisition times remain a significant challenge, leading to increased costs, and reduced patient comfort. Recent studies have shown the potential of using deep learning models that incorporate information from prior subject-specific MRI scans to improve reconstruction quality of present scans. Integrating this prior information requires registration of the previous scan to the current image reconstruction, which can be time-consuming. We propose a novel deep-learning-based MRI reconstruction framework which consists of an initial reconstruction network, a deep registration model, and a transformer-based enhancement network. We validated our method on a longitudinal dataset of T1-weighted MRI scans with 2,808 images from 18 subjects at four acceleration factors (R5, R10, R15, R20). Quantitative metrics confirmed our approach’s superiority over existing methods (p < 0.05, Wilcoxon signed-rank test). Furthermore, we analyzed the impact of our MRI reconstruction method on the downstream task of brain segmentation and observed improved accuracy and volumetric agreement with reference segmentations. Our approach also achieved a substantial reduction in total reconstruction time compared to methods that use traditional registration algorithms, making it more suitable for real-time clinical applications. The code associated with this work is publicly available at https://github.com/amirshamaei/longitudinal-mri-deep-recon.
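[12] Group Relative Augmentation for Data Efficient Action Detection cs.CV | cs.LG
Deep Anil Patel, Iain Melvin, Zachary Izzo, Martin Renqiang Min
TL;DR: The paper combines parameter-efficient tuning (LoRA) with learnable internal feature augmentation to efficiently adapt large Video-Language Models (VLMs) for action detection, addressing overfitting and granularity mismatch.
Details
Motivation: Adapting large video-language models to action detection faces overfitting and a mismatch between the scene-level understanding learned in pre-training and the person-centric understanding the task requires.
Result: The method achieves strong mAP on multi-label, multi-person action detection datasets such as AVA and MOMA, demonstrating data-efficient adaptation of VLMs from limited examples.
Insight: A dynamically weighted loss together with internal feature augmentation allows VLMs to be adapted efficiently from few examples and addresses the granularity mismatch in action detection.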
Abstract: Adapting large Video-Language Models (VLMs) for action detection using only a few examples poses challenges like overfitting and the granularity mismatch between scene-level pre-training and required person-centric understanding. We propose an efficient adaptation strategy combining parameter-efficient tuning (LoRA) with a novel learnable internal feature augmentation. Applied within the frozen VLM backbone using FiLM, these augmentations generate diverse feature variations directly relevant to the task. Additionally, we introduce a group-weighted loss function that dynamically modulates the training contribution of each augmented sample based on its prediction divergence relative to the group average. This promotes robust learning by prioritizing informative yet reasonable augmentations. We demonstrate our method’s effectiveness on complex multi-label, multi-person action detection datasets (AVA, MOMA), achieving strong mAP performance and showcasing significant data efficiency for adapting VLMs from limited examples.
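[13] Collaborative Perceiver: Elevating Vision-based 3D Object Detection via Local Density-Aware Spatial Occupancy cs.CV
Jicheng Yuan, Manh Nguyen Duc, Qian Liu, Manfred Hauswirth, Danh Le Phuoc
TL;DR: The paper proposes CoP, a multi-task learning framework that uses spatial occupancy as auxiliary information to improve vision-based BEV 3D object detection, incorporating local density information to reconstruct environmental detail, and outperforms existing methods.
Details
Motivation: Existing BEV 3D object detection methods neglect environmental context such as roads and pavements, leaving spatial perception incomplete. The paper compensates through multi-task learning with an occupancy prediction task.
Result: CoP reaches 49.5% mAP and 59.2% NDS on the nuScenes test set, outperforming existing vision-based BEV methods.
Insight: Adding an occupancy prediction task effectively supplies the environmental information missing from BEV features, and the collaboration between tasks improves detection performance.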
Abstract: Vision-based bird’s-eye-view (BEV) 3D object detection has advanced significantly in autonomous driving by offering cost-effectiveness and rich contextual information. However, existing methods often construct BEV representations by collapsing extracted object features, neglecting intrinsic environmental contexts, such as roads and pavements. This hinders detectors from comprehensively perceiving the characteristics of the physical world. To alleviate this, we introduce a multi-task learning framework, Collaborative Perceiver (CoP), that leverages spatial occupancy as auxiliary information to mine consistent structural and conceptual similarities shared between 3D object detection and occupancy prediction tasks, bridging gaps in spatial representations and feature refinement. To this end, we first propose a pipeline to generate dense occupancy ground truths incorporating local density information (LDO) for reconstructing detailed environmental information. Next, we employ a voxel-height-guided sampling (VHS) strategy to distill fine-grained local features according to distinct object properties. Furthermore, we develop a global-local collaborative feature fusion (CFF) module that seamlessly integrates complementary knowledge between both tasks, thus composing more robust BEV representations. Extensive experiments on the nuScenes benchmark demonstrate that CoP outperforms existing vision-based frameworks, achieving 49.5% mAP and 59.2% NDS on the test set. Code and supplementary materials are available at this link https://github.com/jichengyuan/Collaborative-Perceiver.
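[14] Evaluating Deep Learning Models for African Wildlife Image Classification: From DenseNet to Vision Transformers cs.CV | cs.AI
Lukman Jibril Aliyu, Umar Sani Muhammad, Bilqisu Ismail, Nasiru Muhammad, Almustapha A Wakili
TL;DR: The paper compares several deep learning models on African wildlife image classification, highlighting the trade-offs between accuracy, resource consumption, and deployability.
Details
Motivation: African wildlife populations are declining sharply, and efficient, deployable deep learning tools are needed to support biodiversity monitoring and conservation.
Result: ViT-H/14 achieves the highest accuracy (99%) but at high computational cost; DenseNet-201 is the best CNN (67%) and is suited to lightweight deployment.
Insight: In practical conservation settings, model performance must be balanced against compute: ViT is strong but less practical, and lightweight CNNs are better suited to real-time deployment.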
Abstract: Wildlife populations in Africa face severe threats, with vertebrate numbers declining by over 65% in the past five decades. In response, image classification using deep learning has emerged as a promising tool for biodiversity monitoring and conservation. This paper presents a comparative study of deep learning models for automatically classifying African wildlife images, focusing on transfer learning with frozen feature extractors. Using a public dataset of four species: buffalo, elephant, rhinoceros, and zebra; we evaluate the performance of DenseNet-201, ResNet-152, EfficientNet-B4, and Vision Transformer ViT-H/14. DenseNet-201 achieved the best performance among convolutional networks (67% accuracy), while ViT-H/14 achieved the highest overall accuracy (99%), but with significantly higher computational cost, raising deployment concerns. Our experiments highlight the trade-offs between accuracy, resource requirements, and deployability. The best-performing CNN (DenseNet-201) was integrated into a Hugging Face Gradio Space for real-time field use, demonstrating the feasibility of deploying lightweight models in conservation settings. This work contributes to African-grounded AI research by offering practical insights into model selection, dataset preparation, and responsible deployment of deep learning tools for wildlife conservation.
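[15] Exploring Probabilistic Modeling Beyond Domain Generalization for Semantic Segmentation cs.CV
I-Hsiang Chen, Hua-En Chang, Wei-Ting Chen, Jenq-Neng Hwang, Sy-Yen Kuo
TL;DR: The paper proposes PDAF (Probabilistic Diffusion Alignment Framework), which introduces a Latent Domain Prior (LDP) and diffusion modeling to improve the domain generalization of semantic segmentation models, estimating the prior without requiring paired samples.
Details
Motivation: Domain Generalized Semantic Segmentation (DGSS) suffers from domain shift, and existing methods neglect latent domain priors, which limits generalization.
Result: Experiments on diverse urban scenes validate the effectiveness of PDAF.
Insight: Diffusion models can capture complex domain priors, offering a new angle on domain generalization.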
Abstract: Domain Generalized Semantic Segmentation (DGSS) is a critical yet challenging task, as domain shifts in unseen environments can severely compromise model performance. While recent studies enhance feature alignment by projecting features into the source domain, they often neglect intrinsic latent domain priors, leading to suboptimal results. In this paper, we introduce PDAF, a Probabilistic Diffusion Alignment Framework that enhances the generalization of existing segmentation networks through probabilistic diffusion modeling. PDAF introduces a Latent Domain Prior (LDP) to capture domain shifts and uses this prior as a conditioning factor to align both source and unseen target domains. To achieve this, PDAF integrates into a pre-trained segmentation model and utilizes paired source and pseudo-target images to simulate latent domain shifts, enabling LDP modeling. The framework comprises three modules: the Latent Prior Extractor (LPE) predicts the LDP by supervising domain shifts; the Domain Compensation Module (DCM) adjusts feature representations to mitigate domain shifts; and the Diffusion Prior Estimator (DPE) leverages a diffusion process to estimate the LDP without requiring paired samples. This design enables PDAF to iteratively model domain shifts, progressively refining feature representations to enhance generalization under complex target conditions. Extensive experiments validate the effectiveness of PDAF across diverse and challenging urban scenes.
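[16] Multimodal LLMs as Customized Reward Models for Text-to-Image Generation cs.CV | cs.AI | cs.CL
Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu, Branislav Kveton, Yufan Zhou
TL;DR: The paper proposes LLaVA-Reward, an efficient reward model built on multimodal large language models (MLLMs) for automatically evaluating text-to-image (T2I) generation from multiple perspectives. With a Skip-connection Cross Attention module and flexible support for different types of preference data, it outperforms conventional methods in automatic evaluation and inference-time scaling.
Details
Motivation: Existing MLLM-based approaches require instruction-following data for supervised fine-tuning and assess generation quality by analyzing text responses, which is slow and hard to train. The authors instead use the hidden states of MLLMs directly for efficient evaluation.
Result: Experiments show that LLaVA-Reward outperforms conventional and MLLM-based methods in automatic evaluation and inference-time scaling, producing scores aligned with human ratings.
Insight: Using MLLM hidden states directly, together with a design that strengthens bidirectional visual-textual interaction, improves evaluation efficiently, and flexible data support makes the model more adaptable in practice.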
Abstract: We introduce LLaVA-Reward, an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives, leveraging pretrained multimodal large language models (MLLMs). Existing MLLM-based approaches require instruction-following data for supervised fine-tuning and evaluate generation quality by analyzing text responses, which is time-consuming and difficult to train. To address this problem, we propose LLaVA-Reward, which directly utilizes the hidden states of MLLMs given text-image pairs. To enhance the bidirectional interaction between visual and textual representations in decoder-only MLLMs, we further propose adding a Skip-connection Cross Attention (SkipCA) module. This design enhances text-image correlation reasoning by connecting early-layer visual features with later-layer hidden representations. In addition, LLaVA-Reward supports different types of preference data for efficient fine-tuning, including paired preference data and unpaired data. We train LLaVA-Reward on four evaluation perspectives: text-image alignment, fidelity/artifact, safety, and overall ranking. Empirical results demonstrate that LLaVA-Reward outperforms conventional and MLLM-based methods in generating human-aligned scores for automatic evaluations and inference-time scaling in text-to-image generations.
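The abstract mentions training on both paired and unpaired preference data but does not state the loss functions. As a rough illustration of how a scalar reward score can be fitted to each kind of data, the sketch below uses two common choices: a Bradley-Terry style pairwise loss and a binary cross-entropy loss on single labeled samples. Both are assumptions made for this example and may differ from what LLaVA-Reward actually uses; all function names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_preference_loss(score_preferred, score_rejected):
    """Bradley-Terry style loss for paired data: -log sigmoid(s_preferred - s_rejected)."""
    return softplus(-(score_preferred - score_rejected))


def pointwise_label_loss(score, label):
    """Binary cross-entropy on sigmoid(score) for unpaired data (label 1 = good, 0 = bad)."""
    return softplus(-score) if label == 1 else softplus(score)


# The pairwise loss shrinks as the preferred image is scored further above the rejected one.
print(pairwise_preference_loss(2.0, 0.0) < pairwise_preference_loss(0.0, 2.0))
# The pointwise loss is small when a high score agrees with a positive label.
print(pointwise_label_loss(3.0, 1) < pointwise_label_loss(3.0, 0))
```

Both losses act on the same scalar score, which is what lets one reward head be trained from either data type.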
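[17] ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs cs.CV | cs.CL
Chaoyu Li, Yogesh Kulkarni, Pooyan Fazli
TL;DR: ReGATE is an adaptive token pruning method that accelerates MLLM training through a teacher-student framework and dynamic scoring, reducing compute while matching or exceeding baseline performance.
Details
Motivation: The computational cost of training MLLMs grows with the number of tokens, and existing efficiency methods mainly target inference, offering limited benefit during training.
Result: On VideoLLaMA2, ReGATE matches standard training up to 2x faster; with additional training it surpasses the baseline while reducing the total token count by over 41%.
Insight: Combining dynamic scoring with a teacher-student framework balances training efficiency and model performance, suggesting a new route to efficient MLLM training.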
Abstract: The computational cost of training multimodal large language models (MLLMs) rapidly increases with the number of tokens involved. Existing efficiency methods primarily target inference and rely on token reduction or merging, offering limited benefit during training. In this paper, we propose ReGATE (Reference-Guided Adaptive Token Elision), an adaptive token pruning method for accelerating MLLM training. Specifically, ReGATE adopts a teacher-student framework in which the MLLM being trained serves as the student, and a frozen reference large language model (LLM) acts as the teacher. The teacher computes per-token reference losses, which are combined with an exponential moving average (EMA) of the student’s own difficulty scores. This adaptive difficulty-based scoring enables the selective processing of crucial tokens while bypassing less informative ones in the forward pass, significantly reducing computational overhead. Experiments demonstrate that ReGATE, when applied to VideoLLaMA2, matches the peak accuracy of standard training on MVBench up to 2$\times$ faster, using only 35% of the tokens. With additional training, it even surpasses the baseline on several multimodal benchmarks, all while reducing the total token count by over 41%. Code and models will be released soon.
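The scoring step described above can be pictured with a small sketch: each token's score mixes the teacher's reference loss with an EMA of the student's own loss, and only the highest-scoring fraction of tokens is kept. The abstract does not give the combination rule, so the weighted sum, the `alpha` and `decay` parameters, and the function names below are assumptions for illustration only. The sketch also assumes a student loss is available for every token, whereas a real training loop would only have it for tokens that were actually processed.

```python
def update_ema(previous, value, decay=0.9):
    """Exponential moving average update."""
    return decay * previous + (1.0 - decay) * value


def select_tokens(student_losses, reference_losses, ema_scores,
                  keep_ratio=0.35, decay=0.9, alpha=0.5):
    """Return (indices of kept tokens in order, updated EMA scores).

    score_i = alpha * reference_loss_i + (1 - alpha) * EMA(student_loss_i)
    The weighted sum is a hypothetical combination rule for this sketch.
    """
    new_ema = [update_ema(e, s, decay) for e, s in zip(ema_scores, student_losses)]
    scores = [alpha * r + (1.0 - alpha) * e for r, e in zip(reference_losses, new_ema)]
    n_keep = max(1, round(keep_ratio * len(scores)))
    # Rank tokens from hardest to easiest and keep the top fraction.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep]), new_ema


kept, ema = select_tokens(
    student_losses=[2.0, 0.1, 1.0, 0.2],
    reference_losses=[1.5, 0.2, 0.1, 0.3],
    ema_scores=[0.0, 0.0, 0.0, 0.0],
    keep_ratio=0.5, decay=0.5, alpha=0.5,
)
print(kept)  # token positions that would go through the forward pass
```

The default `keep_ratio` of 0.35 mirrors the 35% token budget quoted in the abstract; the other defaults are arbitrary.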
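[18] MapDiffusion: Generative Diffusion for Vectorized Online HD Map Construction and Uncertainty Estimation in Autonomous Driving cs.CV | cs.AI | cs.LG | cs.RO
Thomas Monninger, Zihan Zhang, Zhipeng Mo, Md Zafar Anwar, Steffen Staab
TL;DR: MapDiffusion is a diffusion-based generative method for online HD map construction and uncertainty estimation. It iteratively refines randomly initialized queries conditioned on a BEV latent grid to generate multiple plausible map samples, improving prediction accuracy and yielding uncertainty estimates that correlate with scene ambiguity.
Details
Motivation: Traditional map construction models provide deterministic point estimates and cannot capture the ambiguities of real-world environments (such as occlusions and missing lane markings) or the associated uncertainty. A method that models the full distribution of maps is needed.
Result: MapDiffusion achieves state-of-the-art performance on nuScenes, improving single-sample performance by 5% over the baseline, with further gains from aggregating multiple samples. Uncertainty estimates are significantly higher in occluded areas.
Insight: Modeling the full map distribution can make autonomous driving more robust and reliable in complex environments and supports uncertainty-aware decision-making.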
Abstract: Autonomous driving requires an understanding of the static environment from sensor data. Learned Bird’s-Eye View (BEV) encoders are commonly used to fuse multiple inputs, and a vector decoder predicts a vectorized map representation from the latent BEV grid. However, traditional map construction models provide deterministic point estimates, failing to capture uncertainty and the inherent ambiguities of real-world environments, such as occlusions and missing lane markings. We propose MapDiffusion, a novel generative approach that leverages the diffusion paradigm to learn the full distribution of possible vectorized maps. Instead of predicting a single deterministic output from learned queries, MapDiffusion iteratively refines randomly initialized queries, conditioned on a BEV latent grid, to generate multiple plausible map samples. This allows aggregating samples to improve prediction accuracy and deriving uncertainty estimates that directly correlate with scene ambiguity. Extensive experiments on the nuScenes dataset demonstrate that MapDiffusion achieves state-of-the-art performance in online map construction, surpassing the baseline by 5% in single-sample performance. We further show that aggregating multiple samples consistently improves performance along the ROC curve, validating the benefit of distribution modeling. Additionally, our uncertainty estimates are significantly higher in occluded areas, reinforcing their value in identifying regions with ambiguous sensor input. By modeling the full map distribution, MapDiffusion enhances the robustness and reliability of online vectorized HD map construction, enabling uncertainty-aware decision-making for autonomous vehicles in complex environments.
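To make the idea of aggregating samples and deriving uncertainty concrete, the sketch below averages several sampled polylines point by point and uses the spread across samples as a per-point uncertainty. This is only one simple way to do it and is an assumption for illustration: the abstract does not specify the aggregation rule, and the sketch assumes the samples are already matched so that point k in every sample refers to the same map element.

```python
from statistics import mean, pstdev


def aggregate_polyline_samples(samples):
    """Combine sampled polylines into a mean polyline and per-point uncertainty.

    samples: list of polylines, each a list of (x, y) points of equal length.
    Returns (mean_polyline, uncertainty), where uncertainty[k] is the combined
    standard deviation of point k across samples.
    """
    n_points = len(samples[0])
    if any(len(s) != n_points for s in samples):
        raise ValueError("all samples must have the same number of points")
    mean_line, uncertainty = [], []
    for k in range(n_points):
        xs = [s[k][0] for s in samples]
        ys = [s[k][1] for s in samples]
        mean_line.append((mean(xs), mean(ys)))
        uncertainty.append((pstdev(xs) ** 2 + pstdev(ys) ** 2) ** 0.5)
    return mean_line, uncertainty


# Three sampled lane boundaries that agree at the start but diverge at the end,
# as might happen where the far end of the lane is occluded.
samples = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 0.0), (1.0, 2.0)],
    [(0.0, 0.0), (1.0, 4.0)],
]
line, sigma = aggregate_polyline_samples(samples)
print(line)
print(sigma)
```

Points where the samples disagree get a larger value in `sigma`, which is the kind of signal the abstract reports as being higher in occluded areas.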
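[19] Dual Cross-image Semantic Consistency with Self-aware Pseudo Labeling for Semi-supervised Medical Image Segmentation cs.CV
Han Wu, Chong Wang, Zhiming Cui
TL;DR: Proposes DuCiSC, a semi-supervised medical image segmentation framework that uses dual cross-image semantic consistency and a self-aware pseudo-labeling strategy to address two weaknesses of current methods: neglect of region-level semantic consistency and feature discrepancy.
Details
Motivation: Semi-supervised learning has attracted attention in medical image segmentation because labeled data are limited. Existing methods focus only on pixel-level consistency, overlook consistency at more comprehensive semantic levels, and suffer from feature discrepancy between labeled and unlabeled data.
Result: Validated on four datasets (left atrium, pancreas, Automatic Cardiac Diagnosis Challenge, and inferior alveolar nerve), where it outperforms existing methods.
Insight: Region-level semantic consistency and self-aware pseudo-labeling are key to improving semi-supervised medical image segmentation.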
Abstract: Semi-supervised learning has proven highly effective in tackling the challenge of limited labeled training data in medical image segmentation. In general, current approaches, which rely on intra-image pixel-wise consistency training via pseudo-labeling, overlook the consistency at more comprehensive semantic levels (e.g., object region) and suffer from severe discrepancy of extracted features resulting from an imbalanced number of labeled and unlabeled data. To overcome these limitations, we present a new Dual Cross-image Semantic Consistency (DuCiSC) learning framework for semi-supervised medical image segmentation. Concretely, beyond enforcing pixel-wise semantic consistency, DuCiSC proposes dual paradigms to encourage region-level semantic consistency across: 1) labeled and unlabeled images; and 2) labeled and fused images, by explicitly aligning their prototypes. Relying on the dual paradigms, DuCiSC can effectively establish consistent cross-image semantics via prototype representations, thereby addressing the feature discrepancy issue. Moreover, we devise a novel self-aware confidence estimation strategy to accurately select reliable pseudo labels, allowing for exploiting the training dynamics of unlabeled data. Our DuCiSC method is extensively validated on four datasets, including two popular binary benchmarks in segmenting the left atrium and pancreas, a multi-class Automatic Cardiac Diagnosis Challenge dataset, and a challenging scenario of segmenting the inferior alveolar nerve that features complicated anatomical structures, showing superior segmentation results over previous state-of-the-art approaches. Our code is publicly available at https://github.com/ShanghaiTech-IMPACT/DuCiSC.
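Prototype alignment across images, as mentioned in the abstract above, is commonly built from masked average pooling followed by a similarity loss. The sketch below assumes flattened per-pixel feature vectors, binary masks, and a cosine-distance loss. These choices and all function names are illustrative assumptions and are not taken from the DuCiSC code.

```python
import math

def class_prototype(features, mask):
    """Masked average pooling: mean feature vector over pixels where mask is 1."""
    dim = len(features[0])
    total = [0.0] * dim
    count = 0
    for feat, m in zip(features, mask):
        if m:
            count += 1
            for d in range(dim):
                total[d] += feat[d]
    if count == 0:
        raise ValueError("mask selects no pixels")
    return [t / count for t in total]

def cosine_alignment_loss(p, q):
    """1 - cosine similarity between two non-zero prototypes (0 when aligned)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm

# Hypothetical 2-D features for three labeled pixels and two unlabeled pixels.
labeled_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
labeled_mask = [1, 1, 0]            # ground-truth foreground
unlabeled_feats = [[2.0, 0.0], [0.0, 3.0]]
pseudo_mask = [1, 0]                # pseudo-label foreground

proto_l = class_prototype(labeled_feats, labeled_mask)
proto_u = class_prototype(unlabeled_feats, pseudo_mask)
loss_aligned = cosine_alignment_loss(proto_l, proto_u)
loss_orthogonal = cosine_alignment_loss([1.0, 0.0], [0.0, 1.0])
```

Because the loss compares one vector per region rather than one per pixel, it acts at the region level, which is the property the abstract emphasizes.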
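[20] Recursive Visual Imagination and Adaptive Linguistic Grounding for Vision Language Navigation cs.CV | cs.RO
Bolei Chen, Jiaxu Kang, Yifei Wang, Ping Zhong, Qi Wu
TL;DR: Proposes a navigation policy that recursively summarizes visual perceptions and adaptively aligns them with linguistic commands, addressing scene representation and vision-language alignment problems in vision-language navigation.
Details
Motivation: In Vision Language Navigation (VLN), existing methods suffer from overly detailed scene representations and ambiguous vision-language alignment. As a result, agents struggle to grasp high-level scene priors and tend to violate linguistic instructions.
Result: The proposed policy outperforms state-of-the-art methods on the VLN-CE and ObjectNav tasks.
Insight: Recursively summarizing visual information and adaptively grounding language can markedly improve an agent's grasp of high-level scene regularities and avoid decision errors caused by misleading details.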
Abstract: Vision Language Navigation (VLN) typically requires agents to navigate to specified objects or remote regions in unknown scenes by obeying linguistic commands. Such tasks require organizing historical visual observations for linguistic grounding, which is critical for long-sequence navigational decisions. However, current agents suffer from overly detailed scene representation and ambiguous vision-language alignment, which weaken their comprehension of navigation-friendly high-level scene priors and easily lead to behaviors that violate linguistic commands. To tackle these issues, we propose a navigation policy by recursively summarizing along-the-way visual perceptions, which are adaptively aligned with commands to enhance linguistic grounding. In particular, by structurally modeling historical trajectories as compact neural grids, several Recursive Visual Imagination (RVI) techniques are proposed to motivate agents to focus on the regularity of visual transitions and semantic scene layouts, instead of dealing with misleading geometric details. Then, an Adaptive Linguistic Grounding (ALG) technique is proposed to align the learned situational memories with different linguistic components purposefully. Such fine-grained semantic matching facilitates the accurate anticipation of navigation actions and progress. Our navigation policy outperforms the state-of-the-art methods on the challenging VLN-CE and ObjectNav tasks, showing the superiority of our RVI and ALG techniques for VLN.
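[21] Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval cs.CV
Zhichuan Wang, Yang Zhou, Zhe Liu, Rui Yu, Song Bai
TL;DR: Proposes DAC, a framework that combines a pre-trained CLIP model with a multimodal large language model (MLLM) to improve open-set 3D object retrieval. It learns generalized representations through description and adaptation, and introduces AB-LoRA to strengthen generalization.
Details
Motivation: Open-set 3D object retrieval (3DOR) is an emerging task, but existing methods struggle to produce generalized representations because 3D training data are insufficient. CLIP, whose contrastive pre-training transfers to a wide range of downstream tasks, serves as the foundation of the solution.
Result: On four open-set 3DOR datasets, DAC improves mAP by an average of 10.01%. Its generalization is also validated in image-based and cross-dataset experiments.
Insight: 1. CLIP's pre-trained representations transfer to the 3D domain; 2. Multimodal information from a language model effectively improves performance on open-set tasks.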
Abstract: Open-set 3D object retrieval (3DOR) is an emerging task aiming to retrieve 3D objects of unseen categories beyond the training set. Existing methods typically utilize all modalities (i.e., voxels, point clouds, multi-view images) and train specific backbones before fusion. However, they still struggle to produce generalized representations due to insufficient 3D training data. Being contrastively pre-trained on web-scale image-text pairs, CLIP inherently produces generalized representations for a wide range of downstream tasks. Building upon it, we present a simple yet effective framework named Describe, Adapt and Combine (DAC) by taking only multi-view images for open-set 3DOR. DAC innovatively synergizes a CLIP model with a multi-modal large language model (MLLM) to learn generalized 3D representations, where the MLLM is used for dual purposes. First, it describes the seen category information to align with CLIP’s training objective for adaptation during training. Second, it provides external hints about unknown objects complementary to visual cues during inference. To improve the synergy, we introduce an Additive-Bias Low-Rank adaptation (AB-LoRA), which alleviates overfitting and further enhances the generalization to unseen categories. With only multi-view images, DAC significantly surpasses prior arts by an average of +10.01% mAP on four open-set 3DOR datasets. Moreover, its generalization is also validated on image-based and cross-dataset setups. Code is available at https://github.com/wangzhichuan123/DAC.
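[22] VAGU & GtS: LLM-Based Benchmark and Framework for Joint Video Anomaly Grounding and Understanding cs.CV | cs.MM
Shibo Gao, Peipei Yang, Yangyang Liu, Yi Chen, Han Zhu
TL;DR: Proposes the VAGU benchmark and the GtS framework, the first to combine anomaly grounding and semantic understanding in video anomaly detection. Guided by textual prompts, GtS performs training-free coarse localization followed by fine-grained interpretation. The paper also introduces the JeAUG evaluation metric.
Details
Motivation: Current video anomaly detection methods focus on either temporal localization or semantic understanding, and no model or dataset supports both. The authors therefore propose a benchmark and a framework that integrate the two tasks.
Result: Experiments verify the effectiveness of the VAGU benchmark, the GtS framework, and the JeAUG metric, and show their advantage on the joint task.
Insight: Anomaly grounding and semantic understanding complement each other, and joint evaluation better reflects practical needs. Textual prompting offers a new route to lightweight, training-free frameworks.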
Abstract: Video Anomaly Detection (VAD) aims to identify anomalous events in videos and accurately determine their time intervals. Current VAD methods mainly fall into two categories: traditional DNN-based approaches that focus on temporal localization, and LLM-based approaches that emphasize semantic understanding. Both anomaly understanding and grounding are essential for comprehensive video anomaly detection and can complement each other. However, no existing model or dataset supports both tasks simultaneously. To address this, we introduce VAGU (Video Anomaly Grounding and Understanding), the first benchmark to integrate both tasks. Each VAGU instance includes annotations for anomaly category, semantic explanation, precise temporal grounding and Video QA. We also provide multiple-choice Video QA for objective evaluation. Based on this dataset, we propose Glance then Scrutinize (GtS), a training-free framework guided by textual prompts. The framework first enables coarse localization of high-probability anomalous regions, followed by detailed anomaly interpretation and temporal boundary refinement. Additionally, we propose the JeAUG metric, which jointly evaluates semantic interpretability and temporal precision, overcoming the limitations of traditional metrics. Extensive experiments verify the effectiveness of our benchmark, framework, and evaluation metric.
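[23] Optimizing Active Learning in Vision-Language Models via Parameter-Efficient Uncertainty Calibration cs.CV
Athmanarayanan Lakshmi Narayanan, Amrutha Machireddy, Ranganath Krishnan
TL;DR: Proposes a parameter-efficient learning method that uses an uncertainty calibration loss to optimize active learning in vision-language models, significantly reducing labeling cost and improving sampling efficiency.
Details
Motivation: For large-scale vision-language models, the main challenges of active learning are uncertainty estimation and efficient sampling under a very large parameter count. The paper aims to address both with a more efficient active learning method.
Result: Experiments across several datasets and vision backbones show that the method matches or exceeds complex feature-based sampling techniques while remaining computationally efficient.
Insight: The paper shows the importance of uncertainty calibration in active learning and offers a new approach to efficient sampling. Its comparison of prompt learning and LoRA also informs the choice between these methods.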
Abstract: Active Learning (AL) has emerged as a powerful approach for minimizing labeling costs by selectively sampling the most informative data for neural network model development. Effective AL for large-scale vision-language models necessitates addressing challenges in uncertainty estimation and efficient sampling given the vast number of parameters involved. In this work, we introduce a novel parameter-efficient learning methodology that incorporates uncertainty calibration loss within the AL framework. We propose a differentiable loss function that promotes uncertainty calibration for effectively selecting fewer and most informative data samples for fine-tuning. Through extensive experiments across several datasets and vision backbones, we demonstrate that our solution can match and exceed the performance of complex feature-based sampling techniques while being computationally very efficient. Additionally, we investigate the efficacy of Prompt learning versus Low-rank adaptation (LoRA) in sample selection, providing a detailed comparative analysis of these methods in the context of efficient AL.
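To make the link between uncertainty and sample selection concrete, the sketch below ranks unlabeled samples by predictive entropy and picks the top few for labeling. Entropy-based selection is a common active learning baseline used here purely for illustration; it is not the paper's calibration loss, and the function names and probabilities are hypothetical. The ranking depends directly on the predicted probabilities, which is why their calibration matters.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_most_uncertain(prob_rows, budget):
    """Return indices of the `budget` samples with the highest entropy."""
    ranked = sorted(
        range(len(prob_rows)),
        key=lambda i: predictive_entropy(prob_rows[i]),
        reverse=True,
    )
    return ranked[:budget]

# Hypothetical softmax outputs for three unlabeled images.
pool = [
    [0.98, 0.01, 0.01],  # confident
    [0.40, 0.30, 0.30],  # very uncertain
    [0.70, 0.20, 0.10],  # moderately uncertain
]
to_label = select_most_uncertain(pool, budget=2)
```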
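[24] Chain-of-Cooking: Cooking Process Visualization via Bidirectional Chain-of-Thought Guidance cs.CV
Mengling Xu, Ming Tao, Bing-Kun Bao
TL;DR: Proposes Chain-of-Cooking, a model that uses a Dynamic Patch Selection Module and Bidirectional Chain-of-Thought Guidance to visualize cooking processes coherently and with semantic consistency.
Details
Motivation: Existing cooking visualization methods mainly generate images of finished dishes and do not handle changes in ingredient appearance or contextual coherence between steps.
Result: Experiments show the method outperforms existing methods at generating coherent and semantically consistent cooking process images.
Insight: Bidirectional chain-of-thought guidance effectively improves coherence in sequential tasks such as cooking processes, and the Dynamic Patch Selection Module offers a new idea for multimodal tasks.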
Abstract: Cooking process visualization is a promising task at the intersection of image generation and food analysis, which aims to generate an image for each cooking step of a recipe. However, most existing works focus on generating images of finished foods from the given recipes, and face two challenges when visualizing the cooking process. First, the appearance of ingredients changes considerably across cooking steps, so it is difficult to generate food appearances that match the textual description, leading to semantic inconsistency. Second, the current step may depend on the operations of the previous step, so it is crucial to maintain the contextual coherence of images in sequential order. In this work, we present a cooking process visualization model called Chain-of-Cooking. Specifically, to generate correct appearances of ingredients, we present a Dynamic Patch Selection Module that retrieves the previously generated image patches most related to the current textual content as references. Furthermore, to enhance coherence and keep the generated images in a rational order, we propose a Semantic Evolution Module and a Bidirectional Chain-of-Thought (CoT) Guidance. To better utilize the semantics of previous texts, the Semantic Evolution Module establishes the semantic association between latent prompts and the current cooking step, and merges it with the latent features. Then the CoT Guidance updates the merged features to guide the current cooking step to remain coherent with the previous step. Moreover, we construct a dataset named CookViz, consisting of intermediate image-text pairs for the cooking process. Quantitative and qualitative experiments show that our method outperforms existing methods in generating coherent and semantically consistent cooking processes.
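[25] Suppressing Gradient Conflict for Generalizable Deepfake Detection cs.CV
Ming-Hui Liu, Harry Cheng, Xin Luo, Xin-Shun Xu
TL;DR: Proposes CS-DFD, a conflict-suppressed deepfake detection framework that mitigates gradient conflicts through two synergistic modules (UVS and CGR) and improves generalization.
Details
Motivation: Existing deepfake detectors can lose performance when trained jointly on original forgeries and online synthesized forgeries. The paper traces this degradation to gradient conflicts, which call for explicit mitigation.
Result: Experiments show CS-DFD achieves state-of-the-art detection accuracy and cross-domain generalization on multiple deepfake benchmarks.
Insight: Gradient conflict is a key factor limiting the generalization of deepfake detectors. Jointly optimizing the gradient direction and the feature space effectively improves performance.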
Abstract: Robust deepfake detection models must be capable of generalizing to ever-evolving manipulation techniques beyond training data. A promising strategy is to augment the training data with online synthesized fake images containing broadly generalizable artifacts. However, in the context of deepfake detection, it is surprising that jointly training on both original and online synthesized forgeries may result in degraded performance. This contradicts the common belief that incorporating more source-domain data should enhance detection accuracy. Through empirical analysis, we trace this degradation to gradient conflicts during backpropagation, which force a trade-off between source domain accuracy and target domain generalization. To overcome this issue, we propose a Conflict-Suppressed Deepfake Detection (CS-DFD) framework that explicitly mitigates the gradient conflict via two synergistic modules. First, an Update Vector Search (UVS) module searches for an alternative update vector near the initial gradient vector to reconcile the disparities of the original and online synthesized forgeries. By further transforming the search process into an extremum optimization problem, UVS yields the unique update vector, which maximizes the simultaneous loss reductions for each data type. Second, a Conflict Gradient Reduction (CGR) module enforces a low-conflict feature embedding space through a novel Conflict Descent Loss. This loss penalizes misaligned gradient directions and guides the learning of representations with aligned, non-conflicting gradients. The synergy of UVS and CGR alleviates gradient interference in both parameter optimization and representation learning. Experiments on multiple deepfake benchmarks demonstrate that CS-DFD achieves state-of-the-art performance in both in-domain detection accuracy and cross-domain generalization.
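A gradient conflict of the kind described in the abstract above is usually diagnosed by the sign of the cosine similarity between two loss gradients. The sketch below shows only that diagnostic, under the assumption that the gradients are available as flat lists of floats. It does not implement the UVS search or the Conflict Descent Loss, and the names and values are hypothetical.

```python
import math

def gradient_cosine(g_a, g_b):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero gradient")
    return dot / (norm_a * norm_b)

def is_conflicting(g_a, g_b):
    """Gradients conflict when they point into opposite half-spaces."""
    return gradient_cosine(g_a, g_b) < 0.0

# Hypothetical gradients of the loss on original and on synthesized forgeries.
g_original = [1.0, 0.0]
g_synthesized = [-1.0, 1.0]

cos = gradient_cosine(g_original, g_synthesized)
conflict = is_conflicting(g_original, g_synthesized)
```

When the cosine is negative, a small step that lowers one loss raises the other to first order, which is the trade-off the abstract refers to.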
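[26] Sun sensor calibration algorithms: A systematic mapping and survey cs.CV | astro-ph.IM
Michael Herman, Olivia J. Pinon Fischer, Dimitri N. Mavris
TL;DR: Systematically reviews sun sensor modeling and calibration algorithms from the past two decades, filling the gap left by the absence of a systematic review in this area, and offers recommendations for future research.
Details
Motivation: Sun sensors are critical to spacecraft attitude determination, but their calibration is difficult because of the complex uncertainties involved. No systematic review of this research exists, and the paper aims to fill that gap.
Result: Summarizes the state of research on sun sensor calibration algorithms and identifies research gaps and room for improvement.
Insight: Improving calibration algorithms requires jointly considering multiple uncertainty sources, such as manufacturing, electrical, and environmental factors. Future work should focus on dynamic calibration and more adaptive algorithms.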
Abstract: Attitude sensors determine the spacecraft attitude through the sensing of an astronomical object, field or other phenomena. The Sun and fixed stars are the two primary astronomical sensing objects. Attitude sensors are critical components for the survival and knowledge improvement of spacecraft. Of these, sun sensors are the most common and important sensor for spacecraft attitude determination. The sun sensor measures the Sun vector in spacecraft coordinates. The sun sensor calibration process is particularly difficult due to the complex nature of the uncertainties involved. The uncertainties are small, difficult to observe, and vary spatio-temporally over the lifecycle of the sensor. In addition, the sensors are affected by numerous sources of uncertainties, including manufacturing, electrical, environmental, and interference sources. This motivates the development of advanced calibration algorithms to minimize uncertainty over the sensor lifecycle and improve accuracy. Although modeling and calibration techniques for sun sensors have been explored extensively in the literature over the past two decades, there is currently no resource that consolidates and systematically reviews this body of work. The present review proposes a systematic mapping of sun sensor modeling and calibration algorithms across a breadth of sensor configurations. It specifically provides a comprehensive survey of each methodology, along with an analysis of research gaps and recommendations for future directions in sun sensor modeling and calibration techniques.
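[27] Multi-View Reconstruction with Global Context for 3D Anomaly Detection cs.CV
Yihan Sun, Yuqi Cheng, Yunkang Cao, Yuxin Zhang, Weiming Shen
TL;DR: Proposes Multi-View Reconstruction (MVR), which losslessly converts high-resolution point clouds into multi-view images and applies a reconstruction-based anomaly detection framework to strengthen global information learning in 3D anomaly detection.
Details
Motivation: Existing 3D anomaly detection methods perform poorly on high-precision tasks because they lack global information, a problem that matters particularly in industrial quality inspection. The paper aims to address this.
Result: On the Real3D-AD benchmark, MVR achieves 89.6% object-wise AU-ROC and 95.7% point-wise AU-ROC.
Insight: Multi-view representation and global information learning effectively address insufficient information in 3D anomaly detection and offer a new approach for industrial quality inspection.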
Abstract: 3D anomaly detection is critical in industrial quality inspection. While existing methods achieve notable progress, their performance degrades in high-precision 3D anomaly detection due to insufficient global information. To address this, we propose Multi-View Reconstruction (MVR), a method that losslessly converts high-resolution point clouds into multi-view images and employs a reconstruction-based anomaly detection framework to enhance global information learning. Extensive experiments demonstrate the effectiveness of MVR, achieving 89.6% object-wise AU-ROC and 95.7% point-wise AU-ROC on the Real3D-AD benchmark.
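[28] RelMap: Enhancing Online Map Construction with Class-Aware Spatial Relation and Semantic Priors cs.CV
Tianhui Cai, Yun Zhang, Zewei Zhou, Zhiyu Huang, Jiaqi Ma
TL;DR: RelMap is an end-to-end framework that enhances online HD map construction by combining spatial relations and semantic priors. It uses a class-aware spatial relation prior and a Mixture-of-Experts-based semantic prior, and significantly improves performance.
Details
Motivation: Current Transformer-based HD map construction methods neglect the spatial and semantic relationships among map elements, which limits accuracy and generalization. RelMap aims to improve performance by modeling these relationships explicitly.
Result: Achieves SOTA performance on the nuScenes and Argoverse 2 datasets.
Insight: Explicitly modeling spatial and semantic relationships can significantly improve the accuracy and generalization of HD map construction, especially in complex scenes.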
Abstract: Online high-definition (HD) map construction plays an increasingly important role in scaling autonomous driving systems. Transformer-based methods have become prevalent in online HD map construction; however, existing approaches often neglect the inherent spatial and semantic relationships among map elements, which limits their accuracy and generalization. To address this, we propose RelMap, an end-to-end framework that enhances online map construction by incorporating spatial relations and semantic priors. We introduce a Class-aware Spatial Relation Prior, which explicitly encodes relative positional dependencies between map elements using a learnable class-aware relation encoder. Additionally, we propose a Mixture-of-Experts (MoE)-based Semantic Prior, which routes features to class-specific experts based on predicted class probabilities, refining instance feature decoding. Our method is compatible with both single-frame and temporal perception backbones, achieving state-of-the-art performance on both the nuScenes and Argoverse 2 datasets.
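[29] LinDeps: A Fine-tuning Free Post-Pruning Method to Remove Layer-Wise Linear Dependencies with Guaranteed Performance Preservation cs.CV
Maxim Henry, Adrien Deliège, Anthony Cioppa, Marc Van Droogenbroeck
TL;DR: LinDeps is a post-pruning method that identifies and removes redundant filters through linear dependency analysis. It preserves performance without fine-tuning and significantly improves the compression rates of existing pruning techniques.
Details
Motivation: Convolutional neural networks (CNNs) are widely used in computer vision, but their size and complexity make deployment on resource-constrained platforms difficult. Current pruning techniques overlook structural dependencies among feature maps within a layer, leading to suboptimal pruning decisions.
Result: Experiments on CIFAR-10 and ImageNet show that LinDeps significantly improves compression rates while preserving performance, and it also performs well in low-resource settings.
Insight: Analyzing linear dependencies within a layer allows redundant parameters to be removed more efficiently without performance loss, offering a new direction for future pruning techniques.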
Abstract: Convolutional Neural Networks (CNN) are widely used in many computer vision tasks. Yet, their increasing size and complexity pose significant challenges for efficient deployment on resource-constrained platforms. Hence, network pruning has emerged as an effective way of reducing the size and computational requirements of neural networks by removing redundant or unimportant parameters. However, a fundamental challenge with pruning consists in optimally removing redundancies without degrading performance. Most existing pruning techniques overlook structural dependencies across feature maps within a layer, resulting in suboptimal pruning decisions. In this work, we introduce LinDeps, a novel post-pruning method, i.e., a pruning method that can be applied on top of any pruning technique, which systematically identifies and removes redundant filters via linear dependency analysis. Particularly, LinDeps applies pivoted QR decomposition to feature maps to detect and prune linearly dependent filters. Then, a novel signal recovery mechanism adjusts the next layer’s kernels to preserve compatibility and performance without requiring any fine-tuning. Our experiments on CIFAR-10 and ImageNet with VGG and ResNet backbones demonstrate that LinDeps improves compression rates of existing pruning techniques while preserving performances, leading to a new state of the art in CNN pruning. We also benchmark LinDeps in low-resource setups where no retraining can be performed, which shows significant pruning improvements and inference speedups over a state-of-the-art method. LinDeps therefore constitutes an essential add-on for any current or future pruning technique.
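The dependency detection step in the abstract above relies on pivoted QR decomposition. As a rough illustration of the idea, the sketch below implements the column selection rule of pivoted QR with greedy Gram-Schmidt in plain Python: it repeatedly picks the column with the largest remaining residual and stops once the residuals vanish. The function name, the tolerance, and the toy feature maps are assumptions, and the signal recovery step that adjusts the next layer's kernels is not shown.

```python
import math

def pivoted_independent_columns(columns, tol=1e-8):
    """Split columns into linearly independent ones and dependent ones.

    columns: list of equal-length vectors, one per filter (flattened feature map).
    Returns (kept, dependent) as lists of column indices.
    """
    residuals = [list(c) for c in columns]
    remaining = list(range(len(columns)))
    scale = max(math.sqrt(sum(v * v for v in c)) for c in columns) or 1.0
    kept = []
    while remaining:
        norms = {j: math.sqrt(sum(v * v for v in residuals[j])) for j in remaining}
        pivot = max(remaining, key=lambda j: norms[j])
        if norms[pivot] <= tol * scale:
            break  # every remaining column lies in the span of the kept ones
        kept.append(pivot)
        remaining.remove(pivot)
        q = [v / norms[pivot] for v in residuals[pivot]]
        for j in remaining:
            proj = sum(a * b for a, b in zip(q, residuals[j]))
            residuals[j] = [a - proj * b for a, b in zip(residuals[j], q)]
    return kept, remaining

# Three hypothetical flattened feature maps; the third is 2 * f0 - f1.
f0 = [1.0, 0.0, 0.0, 1.0]
f1 = [0.0, 1.0, 1.0, 0.0]
f2 = [2.0, -1.0, -1.0, 2.0]
kept, dependent = pivoted_independent_columns([f0, f1, f2])
```

Which column of a dependent set gets flagged depends on the pivoting order, so the flagged filter here is not necessarily the one that was constructed as a combination.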
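[30] TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs cs.CV
Kejia Zhang, Keda Tao, Zhiming Luo, Chang Liu, Jiasheng Tang
TL;DR: Proposes TARS, a strategy that reformulates direct preference optimization (DPO) as a min-max optimization problem to reduce hallucinations in multimodal large language models (MLLMs). TARS maximizes token-level distributional shifts under semantic constraints to simulate alignment uncertainty while minimizing the preference loss, which reduces hallucinations while preserving causal grounding.
Details
Motivation: Existing DPO methods typically treat hallucination-related preferences as fixed targets, so models tend to overfit to superficial linguistic cues in the preference data, which impairs grounding in visual information. This distributional rigidity and the resulting spurious correlations limit reliability in multimodal reasoning.
Result: Experiments show TARS reduces the hallucination rate from 26.4% to 13.2% and the cognition value from 2.5 to 0.4. It outperforms standard DPO and matches GPT-4o on several key metrics.
Insight: With a dynamically adjusted token-level preference strategy, TARS shows potential for reducing hallucinations while maintaining semantic alignment, offering a new approach to improving the reliability of multimodal models.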
Abstract: Multimodal large language models (MLLMs) enable vision-language reasoning, yet often generate plausible outputs that are factually incorrect or visually ungrounded, thereby compromising their reliability. Direct preference optimization (DPO) is a common strategy for correcting hallucinations by aligning model outputs with human preferences. Existing DPO strategies typically treat hallucination-related preferences as fixed targets, relying on static supervision signals during training. This approach tends to overfit to superficial linguistic cues in preference data, leading to distributional rigidity and spurious correlations that impair grounding in causally relevant visual information. To overcome this limitation, we propose TARS, a token-adaptive preference strategy that reformulates DPO as a min-max optimization problem. TARS maximizes token-level distributional shifts under semantic constraints to simulate alignment uncertainty, and simultaneously minimizes the expected preference loss under these controlled perturbations. This joint objective preserves causal grounding while mitigating overfitting to preference patterns, thereby reducing hallucinations in multimodal reasoning. We evaluate TARS on multiple hallucination benchmarks and find consistently strong performance. Using only 4.8k preference samples and no expert feedback, TARS reduces hallucination rates from 26.4% to 13.2% and decreases cognition value from 2.5 to 0.4. It outperforms standard DPO and matches GPT-4o on several key metrics.
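For context, the standard DPO loss that TARS takes as its starting point can be written for a single preference pair as below. This is a minimal sketch of plain DPO with sequence-level log-probabilities passed in as floats. The token-level perturbation and the min-max formulation from the abstract are not implemented, and the numeric values are hypothetical.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin)).

    The margin is beta times the difference between the policy-vs-reference
    log-ratio of the chosen response and that of the rejected response.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0.0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-1.0, -2.0, -1.0, -2.0)
# Policy that raises the chosen response and lowers the rejected one.
improved = dpo_loss(-0.5, -3.0, -1.0, -2.0)
```

The targets in this loss are fixed by the preference data, which is the rigidity that the abstract argues leads to overfitting on superficial cues.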
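[31] Emerging Trends in Pseudo-Label Refinement for Weakly Supervised Semantic Segmentation with Image-Level Supervision cs.CV
Zheyuan Zhang, Wang Zhang
TL;DR: A survey of recent progress in weakly supervised semantic segmentation (WSSS) with image-level annotations that summarizes mainstream research directions and discusses challenges and future trends.
Details
Motivation: Fully supervised semantic segmentation requires dense annotation, which is costly. WSSS with image-level annotations is more practical but also more challenging. Existing surveys fail to capture the latest trends, so an updated survey is needed.
Result: The survey highlights the strengths and limitations of current WSSS techniques and points to room for progress in domain adaptation and method robustness.
Insight: 1. The practicality of image-level WSSS makes it a research focus, but its performance still lags behind fully supervised methods; 2. Domain adaptation and cross-dataset generalization are underexplored challenges; 3. Future breakthroughs may come from pseudo-label refinement and multimodal learning.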
Abstract: Unlike fully supervised semantic segmentation, weakly supervised semantic segmentation (WSSS) relies on weaker forms of supervision to perform dense prediction tasks. Among the various types of weak supervision, WSSS with image-level annotations is considered both the most challenging and the most practical, attracting significant research attention. Therefore, in this review, we focus on WSSS with image-level annotations. Additionally, this review concentrates on mainstream research directions, deliberately omitting less influential branches. Given the rapid development of new methods and the limitations of existing surveys in capturing recent trends, there is a pressing need for an updated and comprehensive review. Our goal is to fill this gap by synthesizing the latest advancements and state-of-the-art techniques in WSSS with image-level labels. Specifically, we provide a comprehensive review of recent advancements in WSSS with image-level labels, categorizing existing methods based on the types and levels of additional supervision involved. We also examine the challenges of applying advanced methods to domain-specific datasets in WSSS, a topic that remains underexplored. Finally, we discuss the current challenges, evaluate the limitations of existing approaches, and outline several promising directions for future research. This review is intended for researchers who are already familiar with the fundamental concepts of WSSS and are seeking to deepen their understanding of current advances and methodological innovations.
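[32] Decoupled Spatio-Temporal Consistency Learning for Self-Supervised Tracking cs.CV
Yaozong Zheng, Bineng Zhong, Qihua Liang, Ning Li, Shuxiang Song
TL;DR: Proposes a self-supervised tracking framework that learns tracking representations without manually annotated bounding boxes, using decoupled spatio-temporal consistency learning and an instance contrastive loss. It significantly outperforms existing self-supervised methods.
Details
Motivation: Existing tracking datasets rely on manually annotated bounding boxes, which are costly and limit dataset scale and diversity. The authors aim to reduce reliance on annotation through self-supervised learning while still learning effective tracking representations.
Result: On the GOT10K, LaSOT, and TrackingNet datasets, the AUC (AO) score improves by more than 25.3%, 20.4%, and 14.8%, respectively.
Insight: Self-supervised learning effectively reduces reliance on annotated data, and decoupled spatio-temporal learning combined with instance contrast improves tracking performance.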
Abstract: The success of visual tracking has been largely driven by datasets with manual box annotations. However, these box annotations require tremendous human effort, limiting the scale and diversity of existing tracking datasets. In this work, we present a novel Self-Supervised Tracking framework, designed to eliminate the need for box annotations. Specifically, a decoupled spatio-temporal consistency training framework is proposed to learn rich target information across timestamps through global spatial localization and local temporal association. This allows for the simulation of appearance and motion variations of instances in real-world scenarios. Furthermore, an instance contrastive loss is designed to learn instance-level correspondences from a multi-view perspective, offering robust instance supervision without additional labels. This new design paradigm enables the proposed tracker to effectively learn generic tracking representations in a self-supervised manner, while reducing reliance on extensive box annotations. Extensive experiments on nine benchmark datasets demonstrate that the proposed tracker surpasses SOTA self-supervised tracking methods, achieving an improvement of more than 25.3%, 20.4%, and 14.8% in AUC (AO) score on the GOT10K, LaSOT, and TrackingNet datasets, respectively. Code: https://github.com/GXNU-ZhongLab/SSTrack.
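[33] EMIT: Enhancing MLLMs for Industrial Anomaly Detection via Difficulty-Aware GRPO cs.CV
Wei Guan, Jun Lan, Jian Cao, Hao Tan, Huijia Zhu
TL;DR: Proposes EMIT, a framework that enhances multimodal large language models (MLLMs) for industrial anomaly detection (IAD) through difficulty-aware Group Relative Policy Optimization (GRPO).
Details
Motivation: Industrial anomaly detection is critical to the safety and reliability of manufacturing systems, but existing MLLMs underperform in this domain and need domain-specific adaptation.
Result: On the MMAD benchmark, EMIT significantly improves MLLM performance, with an average gain of 7.77% over the base model (InternVL3-8B).
Insight: 1) A difficulty-aware optimization mechanism effectively improves model performance; 2) Combining multimodal data augmentation with contrastive learning is key to improving IAD tasks.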
Abstract: Industrial anomaly detection (IAD) plays a crucial role in maintaining the safety and reliability of manufacturing systems. While multimodal large language models (MLLMs) show strong vision-language reasoning abilities, their effectiveness in IAD remains limited without domain-specific adaptation. In this work, we propose EMIT, a unified framework that enhances MLLMs for IAD via difficulty-aware group relative policy optimization (GRPO). EMIT constructs a multi-task IAD dataset and utilizes GPT-generated object text descriptions to compensate for missing defective images. For few-shot anomaly detection, it integrates a soft prompt and heatmap-guided contrastive embeddings derived from patch-level comparisons. To better handle difficult data samples, i.e., cases where the MLLM struggles to generate correct answers, we propose a difficulty-aware GRPO that extends the original GRPO by incorporating a response resampling strategy to ensure the inclusion of correct answers in the sampled responses, as well as an advantage reweighting mechanism to strengthen learning from such difficult data samples. Extensive experiments on the MMAD benchmark demonstrate that EMIT significantly enhances the IAD performance of MLLMs, achieving an average improvement of 7.77% over the base model (InternVL3-8B) across seven tasks.
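The motivation for resampling on difficult samples can be seen from the group-relative advantage that GRPO computes: each response's reward is normalized by the mean and standard deviation of its sampled group. The sketch below shows only that normalization. The paper's resampling strategy and advantage reweighting are not implemented, and the binary rewards are hypothetical.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Hard sample: none of the four sampled responses is correct.
adv_hard = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
# Same sample after one correct response enters the group.
adv_mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

When every response in the group is wrong, all advantages are zero and the sample contributes no learning signal, which is consistent with the abstract's goal of ensuring that a correct answer is included among the sampled responses.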
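[34] GuidPaint: Class-Guided Image Inpainting with Diffusion Models cs.CV | I.4.4
Qimin Wang, Xinda Liu, Guohua Geng
TL;DR: GuidPaint is a training-free, class-guided image inpainting framework that combines classifier guidance with diffusion models and outperforms existing methods in semantic consistency and visual realism.
Details
Motivation: Existing diffusion-based inpainting methods lack fine-grained control over masked regions, which can produce semantically inconsistent or visually implausible content.
Result: Experiments show GuidPaint outperforms existing context-aware inpainting methods in both qualitative and quantitative evaluations.
Insight: Classifier guidance improves the controllability of diffusion models for image inpainting while avoiding additional training cost.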
Abstract: In recent years, diffusion models have been widely adopted for image inpainting tasks due to their powerful generative capabilities, achieving impressive results. Existing multimodal inpainting methods based on diffusion models often require architectural modifications and retraining, resulting in high computational cost. In contrast, context-aware diffusion inpainting methods leverage the model’s inherent priors to adjust intermediate denoising steps, enabling high-quality inpainting without additional training and significantly reducing computation. However, these methods lack fine-grained control over the masked regions, often leading to semantically inconsistent or visually implausible content. To address this issue, we propose GuidPaint, a training-free, class-guided image inpainting framework. By incorporating classifier guidance into the denoising process, GuidPaint enables precise control over intermediate generations within the masked areas, ensuring both semantic consistency and visual realism. Furthermore, it integrates stochastic and deterministic sampling, allowing users to select preferred intermediate results and deterministically refine them. Experimental results demonstrate that GuidPaint achieves clear improvements over existing context-aware inpainting methods in both qualitative and quantitative evaluations.
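[35] The Evolution of Video Anomaly Detection: A Unified Framework from DNN to MLLM cs.CV
Shibo Gao, Peipei Yang, Haiyang Guo, Yangyang Liu, Yi Chen
TL;DR: Systematically traces the evolution of video anomaly detection (VAD) from deep neural networks (DNNs) to multimodal large language models (MLLMs), proposes a unified framework, and summarizes the challenges and future directions under the new paradigm.
Details
Motivation: VAD is a core technology for intelligent surveillance and public safety. Rapid progress in deep learning and MLLMs brings new opportunities and challenges to VAD, creating an urgent need for a systematic survey and a unifying framework.
Result: The paper systematically summarizes the technical evolution of VAD and clarifies the new paradigms and challenges introduced by MLLMs/LLMs.
Insight: MLLMs/LLMs have significantly improved VAD performance and widened its application boundaries through multimodal data annotation, task flexibility, and related advances, but data dependence and computational cost remain bottlenecks.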
Abstract: Video anomaly detection (VAD) aims to identify and ground anomalous behaviors or events in videos, serving as a core technology in the fields of intelligent surveillance and public safety. With the advancement of deep learning, the continuous evolution of deep model architectures has driven innovation in VAD methodologies, significantly enhancing feature representation and scene adaptability, thereby improving algorithm generalization and expanding application boundaries. More importantly, the rapid development of multi-modal large language models (MLLMs) and large language models (LLMs) has introduced new opportunities and challenges to the VAD field. Under the support of MLLMs and LLMs, VAD has undergone significant transformations in terms of data annotation, input modalities, model architectures, and task objectives. The surge in publications and the evolution of tasks have created an urgent need for systematic reviews of recent advancements. This paper presents the first comprehensive survey analyzing VAD methods based on MLLMs and LLMs, providing an in-depth discussion of the changes occurring in the VAD field in the era of large models and their underlying causes. Additionally, this paper proposes a unified framework that encompasses both deep neural network (DNN)-based and LLM-based VAD methods, offering a thorough analysis of the new VAD paradigms empowered by LLMs, constructing a classification system, and comparing their strengths and weaknesses. Building on this foundation, this paper focuses on current VAD methods based on MLLMs/LLMs. Finally, based on the trajectory of technological advancements and existing bottlenecks, this paper distills key challenges and outlines future research directions, offering guidance for the VAD community.
[36] Automated Detection of Antarctic Benthic Organisms in High-Resolution In Situ Imagery to Aid Biodiversity Monitoring cs.CVPDF
Cameron Trotter, Huw Griffiths, Tasnuva Ming Khan, Rowan Whittle
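TL;DR: This paper presents an automated detection framework for high-resolution imagery of Antarctic benthic organisms, addressing the limited annotations, variable object sizes, and complex seafloor structure of marine ecological imagery, and releases the first public dataset for benthic biodiversity monitoring in the Weddell Sea.
Details
Motivation: Monitoring Antarctic benthic biodiversity is vital for understanding ecological change driven by climate pressures, but existing approaches rely on manual annotation, which is laborious and requires specialist expertise, impeding large-scale analysis.
Result: The framework performs strongly on medium and large organisms, while detecting small and rare taxa remains a challenge.
Insight: Current detection architectures are limited on small objects and rare taxa and need further improvement. The framework provides a scalable foundation for machine-assisted in situ benthic monitoring research.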
Abstract: Monitoring benthic biodiversity in Antarctica is vital for understanding ecological change in response to climate-driven pressures. This work is typically performed using high-resolution imagery captured in situ, though manual annotation of such data remains laborious and specialised, impeding large-scale analysis. We present a tailored object detection framework for identifying and classifying Antarctic benthic organisms in high-resolution towed camera imagery, alongside the first public computer vision dataset for benthic biodiversity monitoring in the Weddell Sea. Our approach addresses key challenges associated with marine ecological imagery, including limited annotated data, variable object sizes, and complex seafloor structure. The proposed framework combines resolution-preserving patching, spatial data augmentation, fine-tuning, and postprocessing via Slicing Aided Hyper Inference. We benchmark multiple object detection architectures and demonstrate strong performance in detecting medium and large organisms across 25 fine-grained morphotypes, significantly more than other works in this area. Detection of small and rare taxa remains a challenge, reflecting limitations in current detection architectures. Our framework provides a scalable foundation for future machine-assisted in situ benthic biodiversity monitoring research.
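The resolution-preserving patching mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline or the SAHI library API; the function name `tile_boxes`, the 1024-pixel patch size, and the 20% overlap are assumptions chosen only for illustration. The idea is to cut a large frame into overlapping windows at native resolution instead of downscaling it, so small organisms keep their pixel footprint.

```python
def tile_boxes(width, height, patch=1024, overlap=0.2):
    """Return (left, top, right, bottom) windows that cover the whole image.

    Hypothetical helper: patch size and overlap ratio are illustrative
    assumptions, not values taken from the paper.
    """
    step = max(1, int(patch * (1 - overlap)))

    def starts(size):
        # An image smaller than one patch needs a single window.
        if size <= patch:
            return [0]
        positions = list(range(0, size - patch, step))
        positions.append(size - patch)  # last window sits flush with the border
        return positions

    return [
        (x, y, min(x + patch, width), min(y + patch, height))
        for y in starts(height)
        for x in starts(width)
    ]
```

Detections from each window would then be shifted back by the window's `(left, top)` offset and merged, for example with non-maximum suppression, which is the role a sliced-inference postprocessing step plays.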
[37] SAMITE: Position Prompted SAM2 with Calibrated Memory for Visual Object Tracking cs.CVPDF
Qianxiong Xu, Lanyun Zhu, Chenxi Liu, Guosheng Lin, Cheng Long
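TL;DR: SAMITE is a SAM2-based visual object tracking method that uses a prototypical memory bank and a positional prompt generator to handle occlusions and distractors, substantially improving tracking performance.
Details
Motivation: Existing visual object tracking methods either neglect temporal dependencies or become biased toward training categories, and they struggle with occlusions and distractors. SAMITE addresses these issues by extending SAM2.
Result: SAMITE performs strongly on six benchmarks, and the code is open source.
Insight: Quantifying the quality of tracking results and supplying explicit positional prompts improves robustness to distractors and generalization.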
Abstract: Visual Object Tracking (VOT) is widely used in applications like autonomous driving to continuously track targets in videos. Existing methods can be roughly categorized into template matching and autoregressive methods, where the former usually neglects the temporal dependencies across frames and the latter tends to get biased towards the object categories during training, showing weak generalizability to unseen classes. To address these issues, some methods propose to adapt the video foundation model SAM2 for VOT, where the tracking results of each frame would be encoded as memory for conditioning the rest of frames in an autoregressive manner. Nevertheless, existing methods fail to overcome the challenges of object occlusions and distractions, and do not have any measures to intercept the propagation of tracking errors. To tackle them, we present a SAMITE model, built upon SAM2 with additional modules, including: (1) Prototypical Memory Bank: We propose to quantify the feature-wise and position-wise correctness of each frame’s tracking results, and select the best frames to condition subsequent frames. As the features of occluded and distracting objects are feature-wise and position-wise inaccurate, their scores would naturally be lower and thus can be filtered to intercept error propagation; (2) Positional Prompt Generator: To further reduce the impacts of distractors, we propose to generate positional mask prompts to provide explicit positional clues for the target, leading to more accurate tracking. Extensive experiments have been conducted on six benchmarks, showing the superiority of SAMITE. The code is available at https://github.com/Sam1224/SAMITE.
[38] MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces cs.CV | cs.MMPDF
Shaojun E, Yuchen Yang, Jiaheng Wu, Yan Zhang, Tiejun Zhao
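TL;DR: This paper proposes MAGE, a multimodal learning framework that bridges the visual and textual semantic spaces through a new alignment mechanism, addressing the semantic gaps and information loss of existing methods and achieving significant gains on several evaluation benchmarks.
Details
Motivation: In multimodal learning, the spatial and semantic losses of visual data after encoding are a major challenge. Existing methods suffer from vector gaps and semantic disparities that cause information loss during propagation.
Result: On benchmarks including MME, MMBench, and SEED, MAGE significantly outperforms comparable work.
Insight: Strengthening the alignment between visual and semantic spaces improves multimodal model performance, particularly semantic consistency across heterogeneous data.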
Abstract: In the latest advancements in multimodal learning, effectively addressing the spatial and semantic losses of visual data after encoding remains a critical challenge. This is because the performance of large multimodal models is positively correlated with the coupling between visual encoders and large language models. Existing approaches often face issues such as vector gaps or semantic disparities, resulting in information loss during the propagation process. To address these issues, we propose MAGE (Multimodal Alignment and Generation Enhancement), a novel framework that bridges the semantic spaces of vision and text through an innovative alignment mechanism. By introducing the Intelligent Alignment Network (IAN), MAGE achieves dimensional and semantic alignment. To reduce the gap between synonymous heterogeneous data, we employ a training strategy that combines cross-entropy and mean squared error, significantly enhancing the alignment effect. Moreover, to enhance MAGE’s “Any-to-Any” capability, we developed a fine-tuning dataset for multimodal tool-calling instructions to expand the model’s output capability boundaries. Finally, our proposed multimodal large model architecture, MAGE, achieved significantly better performance compared to similar works across various evaluation benchmarks, including MME, MMBench, and SEED. Complete code and appendix are available at: https://github.com/GTCOM-NLP/MAGE.
[39] Adversarial Reconstruction Feedback for Robust Fine-grained Generalization cs.CVPDF
Shijie Wang, Jian Shi, Haojie Li
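TL;DR: AdvRF is an adversarial reconstruction feedback framework that combines category-aware discrepancy localization from a retrieval model with category-agnostic feature learning from a reconstruction model to improve generalization to unseen categories.
Details
Motivation: Existing fine-grained image retrieval methods rely on supervision from predefined categories, which introduces category-specific semantics into the retrieval representation and hinders generalization to unseen categories.
Result: AdvRF achieves strong performance on both fine-grained and coarse-grained datasets.
Insight: An adversarial feedback mechanism can separate out category-specific semantics and yield a more general retrieval representation, improving generalization.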
Abstract: Existing fine-grained image retrieval (FGIR) methods predominantly rely on supervision from predefined categories to learn discriminative representations for retrieving fine-grained objects. However, they inadvertently introduce category-specific semantics into the retrieval representation, creating semantic dependencies on predefined classes that critically hinder generalization to unseen categories. To tackle this, we propose AdvRF, a novel adversarial reconstruction feedback framework aimed at learning category-agnostic discrepancy representations. Specifically, AdvRF reformulates FGIR as a visual discrepancy reconstruction task via synergizing category-aware discrepancy localization from retrieval models with category-agnostic feature learning from reconstruction models. The reconstruction model exposes residual discrepancies overlooked by the retrieval model, forcing it to improve localization accuracy, while the refined signals from the retrieval model guide the reconstruction model to improve its reconstruction ability. Consequently, the retrieval model localizes visual differences, while the reconstruction model encodes these differences into category-agnostic representations. This representation is then transferred to the retrieval model through knowledge distillation for efficient deployment. Quantitative and qualitative evaluations demonstrate that our AdvRF achieves impressive performance on both widely-used fine-grained and coarse-grained datasets.
[40] Few-Shot Vision-Language Reasoning for Satellite Imagery via Verifiable Rewards cs.CVPDF
Aybora Koksal, A. Aydin Alatan
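TL;DR: This paper proposes a few-shot reinforcement learning with verifiable rewards (RLVR) framework for vision-language reasoning on satellite imagery. Using verifiable rewards instead of caption supervision, it substantially improves model performance, and a small number of examples matches or exceeds models trained on large annotated datasets.
Details
Motivation: In specialized domains such as satellite imagery, annotated data is scarce and expensive, which makes large language and vision-language models impractical to apply directly. The paper addresses this with a data-efficient training method.
Result: Experiments show that a single training example already yields substantial improvements over the base model, and 128 examples match or exceed models trained on thousands of annotated samples. The one-shot setting can cause mild, task-specific overfitting, but overall generalization remains robust.
Insight: Prompt design and loss weighting strongly influence the stability and final accuracy of few-shot training. The method offers a practical recipe for building vision-language models in data-scarce domains.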
Abstract: Recent advances in large language and vision-language models have enabled strong reasoning capabilities, yet they remain impractical for specialized domains like remote sensing, where annotated data is scarce and expensive. We present the first few-shot reinforcement learning with verifiable reward (RLVR) framework for satellite imagery that eliminates the need for caption supervision, relying solely on lightweight, rule-based binary or IoU-based rewards. Adapting the “1-shot RLVR” paradigm from language models to vision-language models, we employ policy-gradient optimization with as few as one curated example to align model outputs for satellite reasoning tasks. Comprehensive experiments across multiple remote sensing benchmarks (including classification, visual question answering, and grounding) show that even a single example yields substantial improvements over the base model. Scaling to 128 examples matches or exceeds models trained on thousands of annotated samples. While the extreme one-shot setting can induce mild, task-specific overfitting, our approach consistently demonstrates robust generalization and efficiency across diverse tasks. Further, we find that prompt design and loss weighting significantly influence training stability and final accuracy. Our method enables cost-effective and data-efficient development of domain-specialist vision-language reasoning models, offering a pragmatic recipe for data-scarce fields: start from a compact VLM, curate a handful of reward-checkable cases, and train via RLVR.
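To make the "rule-based binary or IoU-based rewards" concrete, here is a minimal sketch of what such verifiable rewards could look like. The function names, the case-insensitive string match, and the optional threshold are assumptions for illustration; the abstract does not specify the exact reward rules the authors use.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def binary_reward(predicted, reference):
    """1.0 for an exact (case-insensitive) answer match, else 0.0.

    Assumed rule for classification or VQA style answers.
    """
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def grounding_reward(pred_box, ref_box, threshold=None):
    """IoU as a graded reward, or a 0/1 reward when a threshold is given."""
    score = iou(pred_box, ref_box)
    if threshold is None:
        return score
    return 1.0 if score >= threshold else 0.0
```

Because both rewards can be checked mechanically against a label or a reference box, no caption or free-text supervision is needed to score a model output.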
[41] LiteFat: Lightweight Spatio-Temporal Graph Learning for Real-Time Driver Fatigue Detection cs.CV | cs.AIPDF
Jing Ren, Suyu Ma, Hong Jia, Xiwei Xu, Ivan Lee
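TL;DR: The paper proposes LiteFat, a lightweight spatio-temporal graph learning model for real-time driver fatigue detection that lowers computational complexity while maintaining high accuracy.
Details
Motivation: Driver fatigue is a leading cause of traffic accidents, and existing deep learning models are computationally heavy and slow, making them unsuitable for embedded devices.
Result: LiteFat performs competitively on benchmark datasets with significantly lower computational complexity and latency than existing methods.
Insight: The lightweight design suits resource-constrained embedded devices and makes real-time fatigue detection systems feasible.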
Abstract: Detecting driver fatigue is critical for road safety, as drowsy driving remains a leading cause of traffic accidents. Many existing solutions rely on computationally demanding deep learning models, which result in high latency and are unsuitable for embedded robotic devices with limited resources (such as intelligent vehicles/cars) where rapid detection is necessary to prevent accidents. This paper introduces LiteFat, a lightweight spatio-temporal graph learning model designed to detect driver fatigue efficiently while maintaining high accuracy and low computational demands. LiteFat involves converting streaming video data into spatio-temporal graphs (STG) using facial landmark detection, which focuses on key motion patterns and reduces unnecessary data processing. LiteFat uses MobileNet to extract facial features and create a feature matrix for the STG. A lightweight spatio-temporal graph neural network is then employed to identify signs of fatigue with minimal processing and low latency. Experimental results on benchmark datasets show that LiteFat performs competitively while significantly decreasing computational complexity and latency as compared to current state-of-the-art methods. This work enables the development of real-time, resource-efficient human fatigue detection systems that can be implemented upon embedded robotic devices.
[42] MOR-VIT: Efficient Vision Transformer with Mixture-of-Recursions cs.CVPDF
YiZhou Li
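TL;DR: MoR-ViT is a vision transformer framework that uses a dynamic recursion mechanism (Mixture-of-Recursions, MoR) to let each token adaptively determine its processing depth, significantly improving model efficiency.
Details
Motivation: Standard ViTs suffer from parameter redundancy and high computational cost, which limits practical deployment. Existing efficient ViT methods focus on static compression or token-level sparsification but remain constrained by a fixed computational depth.
Result: MoR-ViT reaches state-of-the-art accuracy on ImageNet-1K with up to 70% parameter reduction and 2.5x inference acceleration.
Insight: Dynamic recursion is an effective strategy for efficient vision transformers and opens new directions for scalable, deployable models in real-world scenarios.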
Abstract: Vision Transformers (ViTs) have achieved remarkable success in image recognition, yet standard ViT architectures are hampered by substantial parameter redundancy and high computational cost, limiting their practical deployment. While recent efforts on efficient ViTs primarily focus on static model compression or token-level sparsification, they remain constrained by fixed computational depth for all tokens. In this work, we present MoR-ViT, a novel vision transformer framework that, for the first time, incorporates a token-level dynamic recursion mechanism inspired by the Mixture-of-Recursions (MoR) paradigm. This approach enables each token to adaptively determine its processing depth, yielding a flexible and input-dependent allocation of computational resources. Extensive experiments on ImageNet-1K and transfer benchmarks demonstrate that MoR-ViT not only achieves state-of-the-art accuracy with up to 70% parameter reduction and 2.5x inference acceleration, but also outperforms leading efficient ViT baselines such as DynamicViT and TinyViT under comparable conditions. These results establish dynamic recursion as an effective strategy for efficient vision transformers and open new avenues for scalable and deployable deep learning models in real-world scenarios.
[43] AU-LLM: Micro-Expression Action Unit Detection via Enhanced LLM-Based Feature Fusion cs.CVPDF
Zhishu Liu, Kaishen Yuan, Bo Zhao, Yong Xu, Zitong Yu
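TL;DR: This paper proposes AU-LLM, the first framework to use a large language model (LLM) for micro-expression action unit (AU) detection. Its Enhanced Fusion Projector (EFP) bridges the vision-language semantic gap, and the framework achieves state-of-the-art performance on benchmark datasets.
Details
Motivation: Micro-expression AU detection is an important task in affective computing, but existing methods perform poorly on scarce data and low-intensity micro-expressions. The paper explores the potential of LLMs in this area.
Result: Under LOSO and cross-domain protocols on the CASME II and SAMM datasets, AU-LLM sets a new state of the art.
Insight: LLMs are not limited to language tasks. With enhanced feature fusion they can transfer to the low-intensity, data-scarce micro-expression domain, suggesting a new direction for multimodal analysis.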
Abstract: The detection of micro-expression Action Units (AUs) is a formidable challenge in affective computing, pivotal for decoding subtle, involuntary human emotions. While Large Language Models (LLMs) demonstrate profound reasoning abilities, their application to the fine-grained, low-intensity domain of micro-expression AU detection remains unexplored. This paper pioneers this direction by introducing AU-LLM, a novel framework that for the first time uses an LLM to detect AUs in micro-expression datasets with subtle intensities and the scarcity of data. We specifically address the critical vision-language semantic gap with the Enhanced Fusion Projector (EFP). The EFP employs a Multi-Layer Perceptron (MLP) to intelligently fuse mid-level (local texture) and high-level (global semantics) visual features from a specialized 3D-CNN backbone into a single, information-dense token. This compact representation effectively empowers the LLM to perform nuanced reasoning over subtle facial muscle movements. Through extensive evaluations on the benchmark CASME II and SAMM datasets, including stringent Leave-One-Subject-Out (LOSO) and cross-domain protocols, AU-LLM establishes a new state-of-the-art, validating the significant potential and robustness of LLM-based reasoning for micro-expression analysis. The codes are available at https://github.com/ZS-liu-JLU/AU-LLMs.
[44] MSGCoOp: Multiple Semantic-Guided Context Optimization for Few-Shot Learning cs.CVPDF
Zhaolong Wang, Tongfeng Sun, Mingzheng Du, Yachao Huang
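TL;DR: MSGCoOp is a multiple semantic-guided context optimization framework that improves few-shot generalization. It uses parallel learnable context vectors and a semantic alignment mechanism based on LLM-generated descriptions, and it significantly outperforms existing baselines.
Details
Motivation: Existing prompt learning methods generalize poorly to novel classes because they overfit seen classes or forget general knowledge, while more complex methods add computational overhead. MSGCoOp aims to improve generalization efficiently.
Result: On 11 benchmark datasets, MSGCoOp significantly improves base-to-novel generalization, with an average harmonic mean 1.10% higher than KgCoOp, and it is robust on cross-domain tasks.
Insight: Multiple semantic guidance and diversity regularization are effective strategies for few-shot generalization, and alignment with LLM-generated semantics gives prompt learning additional information.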
Abstract: Vision-language pre-trained models (VLMs) such as CLIP have demonstrated remarkable zero-shot generalization, and prompt learning has emerged as an efficient alternative to full fine-tuning. However, existing methods often struggle with generalization to novel classes, a phenomenon attributed to overfitting on seen classes and forgetting general knowledge. Furthermore, recent approaches that improve generalization often introduce complex architectures or heavy computational overhead. In this paper, we propose a Multiple Semantic-Guided Context Optimization (MSGCoOp) framework to enhance few-shot generalization while maintaining computational efficiency. Our approach leverages an ensemble of parallel learnable context vectors to capture diverse semantic aspects. To enrich these prompts, we introduce a semantic guidance mechanism that aligns them with comprehensive class descriptions automatically generated by a Large Language Model (LLM). Furthermore, a diversity regularization loss encourages the prompts to learn complementary and orthogonal features, preventing them from collapsing into redundant representations. Extensive experiments on 11 benchmark datasets show that MSGCoOp significantly improves performance on base-to-novel generalization, achieving an average harmonic mean improvement of 1.10% over the strong KgCoOp baseline. Our method also demonstrates enhanced robustness in cross-domain generalization tasks. Our code is available at: https://github.com/Rain-Bus/MSGCoOp.
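Two quantities in this abstract lend themselves to a small worked sketch: the harmonic mean used to report base-to-novel generalization, and a diversity penalty that discourages prompt vectors from collapsing onto each other. The harmonic mean formula is standard. The penalty below (mean squared pairwise cosine similarity) is only one plausible form and is an assumption; the abstract does not give the authors' exact loss, and the function names are hypothetical.

```python
import math


def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy."""
    total = base_acc + novel_acc
    return 2 * base_acc * novel_acc / total if total > 0 else 0.0


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def diversity_penalty(vectors):
    """Mean squared pairwise cosine similarity (assumed form of the loss).

    0.0 means all vectors are mutually orthogonal; 1.0 means they are
    all parallel, i.e. fully redundant.
    """
    pairs = [
        (i, j)
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]
    if not pairs:
        return 0.0
    return sum(cosine(vectors[i], vectors[j]) ** 2 for i, j in pairs) / len(pairs)
```

The harmonic mean rewards methods that do well on both splits: a model with 80% base accuracy and 60% novel accuracy scores below the arithmetic mean of 70%, because the weaker split dominates.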
[45] Distribution-Based Masked Medical Vision-Language Model Using Structured Reports cs.CVPDF
Shreyank N Gowda, Ruichi Zhang, Xiao Gu, Ying Weng, Lu Yang
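TL;DR: The paper proposes a distribution-based masked medical vision-language model that uses structured reports to strengthen the alignment between medical images and clinical text. By modeling inter- and intra-modal ambiguity, it significantly improves downstream task performance.
Details
Motivation: Existing medical vision-language pre-training models struggle with the variability and ambiguity of medical data, which limits their ability to capture nuanced clinical information. The paper addresses this by introducing structured reports and uncertainty modeling.
Result: Experiments show that the method significantly outperforms existing methods on multiple downstream tasks, with stronger generalization and clinical relevance.
Insight: Structured reports and uncertainty modeling are key to improving medical vision-language pre-training, because they better capture clinical semantics and ambiguity and suggest new directions for practical applications.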
Abstract: Medical image-language pre-training aims to align medical images with clinically relevant text to improve model performance on various downstream tasks. However, existing models often struggle with the variability and ambiguity inherent in medical data, limiting their ability to capture nuanced clinical information and uncertainty. This work introduces an uncertainty-aware medical image-text pre-training model that enhances generalization capabilities in medical image analysis. Building on previous methods and focusing on Chest X-Rays, our approach utilizes structured text reports generated by a large language model (LLM) to augment image data with clinically relevant context. These reports begin with a definition of the disease, followed by the 'appearance' section to highlight critical regions of interest, and finally 'observations' and 'verdicts' that ground model predictions in clinical semantics. By modeling both inter- and intra-modal uncertainty, our framework captures the inherent ambiguity in medical images and text, yielding improved representations and performance on downstream tasks. Our model demonstrates significant advances in medical image-text pre-training, obtaining state-of-the-art performance on multiple downstream tasks.
[46] HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels cs.CVPDF
HunyuanWorld Team, Zhenwei Wang, Yuhao Liu, Junta Wu, Zixiao Gu
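TL;DR: HunyuanWorld 1.0 is a framework for generating immersive, explorable, and interactive 3D worlds from text or images. It combines panoramic world proxies with a layered mesh representation to address the weaknesses of existing methods in 3D consistency and rendering efficiency.
Details
Motivation: Existing 3D world generation methods are limited in diversity, 3D consistency, and interactivity. HunyuanWorld 1.0 aims to offer panoramic immersion, efficient rendering, and interactivity together.
Result: Experiments show strong performance in generating coherent, explorable, and interactive 3D worlds, with applications in VR, physical simulation, and game development.
Insight: By combining panoramic proxies with layered meshes, HunyuanWorld 1.0 shows that 3D world generation can balance diversity and consistency.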
Abstract: Creating immersive and playable 3D worlds from texts or images remains a fundamental challenge in computer vision and graphics. Existing world generation approaches typically fall into two categories: video-based methods that offer rich diversity but lack 3D consistency and rendering efficiency, and 3D-based methods that provide geometric consistency but struggle with limited training data and memory-inefficient representations. To address these limitations, we present HunyuanWorld 1.0, a novel framework that combines the best of both worlds for generating immersive, explorable, and interactive 3D scenes from text and image conditions. Our approach features three key advantages: 1) 360° immersive experiences via panoramic world proxies; 2) mesh export capabilities for seamless compatibility with existing computer graphics pipelines; 3) disentangled object representations for augmented interactivity. The core of our framework is a semantically layered 3D mesh representation that leverages panoramic images as 360° world proxies for semantic-aware world decomposition and reconstruction, enabling the generation of diverse 3D worlds. Extensive experiments demonstrate that our method achieves state-of-the-art performance in generating coherent, explorable, and interactive 3D worlds while enabling versatile applications in virtual reality, physical simulation, game development, and interactive content creation.
[47] Anyone Can Jailbreak: Prompt-Based Attacks on LLMs and T2Is cs.CVPDF
Ahmed B Mustafa, Zihan Ye, Yang Lu, Michael P Pound, Shreyank N Gowda
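TL;DR: Despite significant progress in content safety and alignment for large language models (LLMs) and text-to-image (T2I) systems, users can still bypass safety mechanisms with simple prompt-based attacks known as jailbreaks. This paper systematically studies how non-expert users carry out such attacks with low-effort methods such as multi-turn narrative escalation, lexical camouflage, and implication chaining, proposes a unified attack taxonomy, and stresses the need for context-aware defenses.
Details
Motivation: Although the safety of LLMs and T2I systems has improved, users can still easily bypass restrictions with prompt-based attacks, which shows that existing defenses are insufficient. The paper aims to reveal how widespread and easy these attacks are and to motivate research on more effective defenses.
Result: The study finds that every stage of existing moderation pipelines can be bypassed with low-effort methods, and the attacks are easy to reproduce.
Insight: Simple prompt-based attacks pose a serious threat to existing safety mechanisms, and smarter, context-aware defenses are needed.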
Abstract: Despite significant advancements in alignment and content moderation, large language models (LLMs) and text-to-image (T2I) systems remain vulnerable to prompt-based attacks known as jailbreaks. Unlike traditional adversarial examples requiring expert knowledge, many of today’s jailbreaks are low-effort and high-impact, crafted by everyday users with nothing more than cleverly worded prompts. This paper presents a systems-style investigation into how non-experts reliably circumvent safety mechanisms through techniques such as multi-turn narrative escalation, lexical camouflage, implication chaining, fictional impersonation, and subtle semantic edits. We propose a unified taxonomy of prompt-level jailbreak strategies spanning both text-output and T2I models, grounded in empirical case studies across popular APIs. Our analysis reveals that every stage of the moderation pipeline, from input filtering to output validation, can be bypassed with accessible strategies. We conclude by highlighting the urgent need for context-aware defenses that reflect the ease with which these jailbreaks can be reproduced in real-world settings.
[48] Unleashing the Power of Motion and Depth: A Selective Fusion Strategy for RGB-D Video Salient Object Detection cs.CVPDF
Jiahao He, Daerji Suolang, Keren Fu, Qijun Zhao
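TL;DR: This paper proposes SMFNet, a selective cross-modal fusion framework for RGB-D video salient object detection (RGB-D VSOD). A pixel-level selective fusion strategy (PSF) and a multi-dimensional selective attention module (MSAM) combine motion and depth information effectively and improve detection performance.
Details
Motivation: Existing RGB-D VSOD methods do not account for the unequal contributions of motion and depth, which limits their potential. The authors aim to exploit this information better with selective fusion and multi-dimensional attention.
Result: SMFNet outperforms 19 existing methods on the RDVS and DVisal datasets, in what the authors describe as the most comprehensive RGB-D VSOD benchmark to date. Its effectiveness is also validated on five video benchmark datasets with synthetic depth.
Insight: The contributions of motion and depth in RGB-D VSOD vary across scenarios, and selective fusion with multi-dimensional attention can significantly improve performance.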
Abstract: Applying salient object detection (SOD) to RGB-D videos is an emerging task called RGB-D VSOD that has recently gained increasing interest, owing to the considerable performance gains from incorporating motion and depth and to the ease with which RGB-D videos can now be captured in daily life. Existing RGB-D VSOD models take different approaches to deriving motion cues, among which extracting motion information explicitly from optical flow appears to be a more effective and promising alternative. Despite this, a key issue remains: how to effectively utilize optical flow and depth to assist the RGB modality in SOD. Previous methods always treat optical flow and depth equally with respect to model designs, without explicitly considering their unequal contributions in individual scenarios, limiting the potential of motion and depth. To address this issue and unleash the power of motion and depth, we propose a novel selective cross-modal fusion framework (SMFNet) for RGB-D VSOD, incorporating a pixel-level selective fusion strategy (PSF) that achieves optimal fusion of optical flow and depth based on their actual contributions. Besides, we propose a multi-dimensional selective attention module (MSAM) to integrate the fused features derived from PSF with the remaining RGB modality at multiple dimensions, effectively enhancing feature representation to generate refined features. We conduct comprehensive evaluation of SMFNet against 19 state-of-the-art models on both RDVS and DVisal datasets, making the evaluation the most comprehensive RGB-D VSOD benchmark to date, and it also demonstrates the superiority of SMFNet over other models. Meanwhile, evaluation on five video benchmark datasets incorporating synthetic depth validates the efficacy of SMFNet as well. Our code and benchmark results are made publicly available at https://github.com/Jia-hao999/SMFNet.
[49] Low-Cost Test-Time Adaptation for Robust Video Editing cs.CVPDF
Jianhui Wang, Yinda Chen, Yangfan He, Xinyuan Song, Yi Xin
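TL;DR: The paper proposes Vid-TTA, a lightweight test-time adaptation framework for video editing that optimizes for each test video through self-supervised auxiliary tasks, significantly improving temporal consistency and reducing prompt overfitting while keeping computational overhead low.
Details
Motivation: The paper targets temporal inconsistency caused by complex motion patterns and the overfitting of UNet architectures to simple prompts in video editing, while reducing the need for computational resources and high-quality annotated data.
Result: Experiments show that Vid-TTA significantly improves temporal consistency and reduces prompt overfitting with low computational overhead.
Insight: Lightweight test-time adaptation can significantly improve the robustness and generalization of video editing models with little added computational cost.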
Abstract: Video editing is a critical component of content creation that transforms raw footage into coherent works aligned with specific visual and narrative objectives. Existing approaches face two major challenges: temporal inconsistencies due to failure in capturing complex motion patterns, and overfitting to simple prompts arising from limitations in UNet backbone architectures. While learning-based methods can enhance editing quality, they typically demand substantial computational resources and are constrained by the scarcity of high-quality annotated data. In this paper, we present Vid-TTA, a lightweight test-time adaptation framework that personalizes optimization for each test video during inference through self-supervised auxiliary tasks. Our approach incorporates a motion-aware frame reconstruction mechanism that identifies and preserves crucial movement regions, alongside a prompt perturbation and reconstruction strategy that strengthens model robustness to diverse textual descriptions. These innovations are orchestrated by a meta-learning driven dynamic loss balancing mechanism that adaptively adjusts the optimization process based on video characteristics. Extensive experiments demonstrate that Vid-TTA significantly improves video temporal consistency and mitigates prompt overfitting while maintaining low computational overhead, offering a plug-and-play performance boost for existing video editing models.
[50] CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding cs.CVPDF
Fevziye Irem Eyiokur, Dogucan Yaman, Hazım Kemal Ekenel, Alexander Waibel
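TL;DR: This paper proposes CAPE, a dual-model framework that combines head-to-fingertip and wrist-to-fingertip pointing cues with a CLIP-aware ensemble module, significantly improving performance on Embodied Reference Understanding.
Details
Motivation: Existing methods struggle to exploit visual pointing cues in Embodied Reference Understanding, and relying on a single pointing assumption (such as head-to-fingertip) can degrade performance.
Result: On the YouRefIt dataset, mAP at the 0.25 IoU threshold improves by approximately 4 points.
Insight: 1. Effective fusion of multimodal cues (language and visual pointing) is key; 2. A dual-model framework captures pointing cues more fully; 3. CLIP features guide the hybrid ensemble that combines the two models.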
Abstract: We address the problem of Embodied Reference Understanding, which involves predicting the object that a person in the scene is referring to through both pointing gesture and language. Accurately identifying the referent requires multimodal understanding: integrating textual instructions, visual pointing, and scene context. However, existing methods often struggle to effectively leverage visual clues for disambiguation. We also observe that, while the referent is often aligned with the head-to-fingertip line, it occasionally aligns more closely with the wrist-to-fingertip line. Therefore, relying on a single line assumption can be overly simplistic and may lead to suboptimal performance. To address this, we propose a dual-model framework, where one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction. We further introduce a Gaussian ray heatmap representation of these lines and use them as input to provide a strong supervisory signal that encourages the model to better attend to pointing cues. To combine the strengths of both models, we present the CLIP-Aware Pointing Ensemble module, which performs a hybrid ensemble based on CLIP features. Additionally, we propose an object center prediction head as an auxiliary task to further enhance referent localization. We validate our approach through extensive experiments and analysis on the benchmark YouRefIt dataset, achieving an improvement of approximately 4 mAP at the 0.25 IoU threshold.
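The Gaussian ray heatmap described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name `ray_heatmap`, the choice of `sigma`, and the decision to start the ray at the first keypoint and extend it past the fingertip are all assumptions. Each pixel's value decays with its perpendicular distance to the pointing ray.

```python
import math


def ray_heatmap(width, height, origin, fingertip, sigma=5.0):
    """Build a height x width heatmap peaking along the pointing ray.

    origin is the head or wrist keypoint (x, y); fingertip is (x, y).
    Hypothetical helper: sigma and the ray definition are assumptions.
    """
    ox, oy = origin
    dx, dy = fingertip[0] - ox, fingertip[1] - oy
    norm = math.hypot(dx, dy)
    if norm == 0:
        raise ValueError("origin and fingertip must differ")
    dx, dy = dx / norm, dy / norm  # unit direction of the ray

    heat = []
    for y in range(height):
        row = []
        for x in range(width):
            px, py = x - ox, y - oy
            # Projection onto the ray, clamped so points behind the
            # origin are measured from the origin itself.
            t = max(0.0, px * dx + py * dy)
            dist = math.hypot(px - t * dx, py - t * dy)
            row.append(math.exp(-(dist ** 2) / (2 * sigma ** 2)))
        heat.append(row)
    return heat
```

Calling this once with the head keypoint and once with the wrist keypoint yields the two complementary cues, one per model in a dual-model setup.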
[51] Aether Weaver: Multimodal Affective Narrative Co-Generation with Dynamic Scene Graphs cs.CVPDF
Saeed Ghorbani
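TL;DR: Aether Weaver is a multimodal narrative co-generation framework that integrates text, dynamic scene graphs, visual scenes, and affective soundscapes, significantly enhancing narrative depth, visual fidelity, and emotional resonance.
Details
Motivation: The work aims to overcome the limitations of sequential text-to-visual pipelines by co-generating multimodal content, making narratives more immersive and emotionally consistent.
Result: In qualitative evaluations across diverse narrative prompts, the framework outperforms cascaded baselines in narrative depth, visual fidelity, and emotional resonance.
Insight: Through tightly integrated co-generation, Aether Weaver offers a new approach to creative prototyping and immersive storytelling.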
Abstract: We introduce Aether Weaver, a novel, integrated framework for multimodal narrative co-generation that overcomes limitations of sequential text-to-visual pipelines. Our system concurrently synthesizes textual narratives, dynamic scene graph representations, visual scenes, and affective soundscapes, driven by a tightly integrated, co-generation mechanism. At its core, the Narrator, a large language model, generates narrative text and multimodal prompts, while the Director acts as a dynamic scene graph manager, and analyzes the text to build and maintain a structured representation of the story’s world, ensuring spatio-temporal and relational consistency for visual rendering and subsequent narrative generation. Additionally, a Narrative Arc Controller guides the high-level story structure, influencing multimodal affective consistency, further complemented by an Affective Tone Mapper that ensures congruent emotional expression across all modalities. Through qualitative evaluations on a diverse set of narrative prompts encompassing various genres, we demonstrate that Aether Weaver significantly enhances narrative depth, visual fidelity, and emotional resonance compared to cascaded baseline approaches. This integrated framework provides a robust platform for rapid creative prototyping and immersive storytelling experiences.
[52] Evaluating Deepfake Detectors in the Wild cs.CV | cs.AI | cs.LGPDF
Viacheslav Pirogov, Maksim Artemev
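TL;DR: The paper introduces a framework for evaluating deepfake detectors under real-world conditions, tests existing detectors on a large-scale dataset, and finds that their effectiveness remains limited.
Details
Motivation: Deepfakes pose a serious threat to the authenticity of digital media, yet most detectors have not been adequately validated in real-world scenarios.
Result: Fewer than half of the detectors tested achieve an AUC above 60%, and basic image manipulations significantly reduce performance.
Insight: Deepfake detection remains challenging, and existing detectors are not robust to image manipulations and need further improvement.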
Abstract: Deepfakes powered by advanced machine learning models present a significant and evolving threat to identity verification and the authenticity of digital media. Although numerous detectors have been developed to address this problem, their effectiveness has yet to be tested when applied to real-world data. In this work we evaluate modern deepfake detectors, introducing a novel testing procedure designed to mimic real-world scenarios for deepfake detection. Using state-of-the-art deepfake generation methods, we create a comprehensive dataset containing more than 500,000 high-quality deepfake images. Our analysis shows that detecting deepfakes still remains a challenging task. The evaluation shows that fewer than half of the deepfake detectors tested achieved an AUC score greater than 60%, with the lowest being 50%. We demonstrate that basic image manipulations, such as JPEG compression or image enhancement, can significantly reduce model performance. All code and data are publicly available at https://github.com/messlav/Deepfake-Detectors-in-the-Wild.
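For readers interpreting the AUC figures, a minimal sketch of the metric may help. The function below computes ROC AUC as the probability that a randomly chosen fake scores higher than a randomly chosen real image, counting ties as half. It is a plain pairwise implementation written for clarity, not the authors' evaluation code, and the label convention (1 = fake, 0 = real) is an assumption.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    labels: 1 for fake, 0 for real (assumed convention).
    scores: detector outputs, higher meaning "more likely fake".
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))
```

Under this definition an AUC of 50% corresponds to ranking fakes and reals no better than chance, which is why the lowest score reported in the abstract indicates a detector with no discriminative power on that data.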
[53] ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval cs.CVPDF
Nicola Fanelli, Gennaro Vessio, Giovanna Castellano
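TL;DR: ArtSeek is a multimodal framework that combines multimodal large language models with retrieval-augmented generation for deep artwork understanding. Its core components are a multimodal retrieval module based on late interaction retrieval, a contrastive multitask classification network, and an agentic reasoning strategy (implemented with in-context examples and Qwen2.5-VL). It needs only image input, so it applies to artworks without links to external knowledge bases, and it achieves state-of-the-art results on multiple benchmarks.
Details
Motivation: Analyzing digitized artworks requires visual interpretation together with artistic, contextual, and historical knowledge, which makes it a complex task. Existing methods usually depend on links to external knowledge bases such as Wikidata, limiting their applicability. ArtSeek aims to achieve deep artwork understanding and knowledge retrieval from image input alone.
Result: ArtSeek improves F1 on style classification by 8.4% over GraphCLIP and BLEU@1 on ArtPedia captioning by 7.1. Qualitative analyses show that the model can interpret visual motifs, infer historical context, and retrieve relevant knowledge for obscure artworks.
Insight: 1. A multimodal framework that relies only on image input is broadly applicable; 2. Retrieval-augmented generation over an external knowledge source (such as WikiFragments) significantly improves reasoning; 3. The agentic reasoning strategy lets the model handle complex visual question answering.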
Abstract: Analyzing digitized artworks presents unique challenges, requiring not only visual interpretation but also a deep understanding of rich artistic, contextual, and historical knowledge. We introduce ArtSeek, a multimodal framework for art analysis that combines multimodal large language models with retrieval-augmented generation. Unlike prior work, our pipeline relies only on image input, enabling applicability to artworks without links to Wikidata or Wikipedia, a situation common in most digitized collections. ArtSeek integrates three key components: an intelligent multimodal retrieval module based on late interaction retrieval, a contrastive multitask classification network for predicting artist, genre, style, media, and tags, and an agentic reasoning strategy enabled through in-context examples for complex visual question answering and artwork explanation via Qwen2.5-VL. Central to this approach is WikiFragments, a Wikipedia-scale dataset of image-text fragments curated to support knowledge-grounded multimodal reasoning. Our framework achieves state-of-the-art results on multiple benchmarks, including a +8.4% F1 improvement in style classification over GraphCLIP and a +7.1 BLEU@1 gain in captioning on ArtPedia. Qualitative analyses show that ArtSeek can interpret visual motifs, infer historical context, and retrieve relevant knowledge, even for obscure works. Though focused on visual arts, our approach generalizes to other domains requiring external knowledge, supporting scalable multimodal AI research. Both the dataset and the source code will be made publicly available at https://github.com/cilabuniba/artseek.
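The abstract does not spell out how late interaction retrieval scores a candidate. A common formulation is ColBERT-style MaxSim, where each query token embedding is matched to its best document token and the maxima are summed. The sketch below assumes that formulation; the names `maxsim_score` and `rank_fragments` and the toy embeddings are hypothetical and not taken from the ArtSeek code.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: for every query token take the best-matching
    document token, then sum those maxima over the query."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rank_fragments(query_tokens, fragments):
    """Return fragment names sorted by descending MaxSim score."""
    scored = [(maxsim_score(query_tokens, toks), name)
              for name, toks in fragments.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Toy unit-length token embeddings (assumed, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
fragments = {
    "fragment_a": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens
    "fragment_b": [[1.0, 0.0]],              # matches only the first
}
print(rank_fragments(query, fragments))
```

Keeping per-token embeddings until scoring time is what distinguishes late interaction from single-vector retrieval, at the cost of storing more vectors per fragment.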
[54] MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning cs.CVPDF
Tianhong Gao, Yannian Fu, Weiqun Wu, Haixiao Yue, Shanshan Liu
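TL;DR: The paper introduces MMAT-1M, the first million-scale multimodal agent tuning dataset. Built with a four-stage data engine, it supports chain-of-thought, reflection, and dynamic tool usage, and significantly improves the performance of multimodal models.
Details
Motivation: The multimodal domain lacks a large-scale, high-quality agent tuning dataset, which limits the potential of multimodal large language models.
Result: The InternVL2.5-8B-RR model improves by 2.7% on average across eight public benchmarks and by 8.8% on the RAG benchmark Dyn-VQA.
Insight: High-quality multimodal data generation and reflection-based refinement are key to improving a model's reasoning and tool-use capabilities.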
Abstract: Large Language Models (LLMs), enhanced through agent tuning, have demonstrated remarkable capabilities in Chain-of-Thought (CoT) and tool utilization, significantly surpassing the performance of standalone models. However, the multimodal domain still lacks a large-scale, high-quality agent tuning dataset to unlock the full potential of multimodal large language models. To bridge this gap, we introduce MMAT-1M, the first million-scale multimodal agent tuning dataset designed to support CoT, reflection, and dynamic tool usage. Our dataset is constructed through a novel four-stage data engine: 1) We first curate publicly available multimodal datasets containing question-answer pairs; 2) Then, leveraging GPT-4o, we generate rationales for the original question-answer pairs and dynamically integrate API calls and Retrieval Augmented Generation (RAG) information through a multi-turn paradigm; 3) Furthermore, we refine the rationales through reflection to ensure logical consistency and accuracy, creating a multi-turn dialogue dataset with both Rationale and Reflection (RR); 4) Finally, to enhance efficiency, we optionally compress multi-turn dialogues into a One-turn Rationale and Reflection (ORR) format. By fine-tuning open-source multimodal models on the MMAT-1M, we observe significant performance gains. For instance, the InternVL2.5-8B-RR model achieves an average improvement of 2.7% across eight public benchmarks and 8.8% on the RAG benchmark Dyn-VQA, demonstrating the dataset’s effectiveness in enhancing multimodal reasoning and tool-based capabilities. The dataset is publicly available at https://github.com/VIS-MPU-Agent/MMAT-1M.
[55] Attention-Driven Multimodal Alignment for Long-term Action Quality Assessment cs.CVPDF
Xin Wang, Peng-Jie Li, Yuan-Yuan Shen
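TL;DR: The paper proposes LMAC-Net, an attention-driven multimodal alignment method for long-term action quality assessment (AQA). A multimodal attention consistency mechanism integrates visual and audio information and improves feature representations.
Details
Motivation: Existing methods are mostly unimodal or use simple multimodal fusion, so they struggle to capture complex cross-modal interactions and long-term dynamics. Assessment of artistic sports in particular requires the modalities to work together.
Result: LMAC-Net significantly outperforms existing methods on the RG and Fis-V datasets, validating its effectiveness.
Insight: Multimodal attention alignment and temporal semantic modeling are essential for long-term action quality assessment; explicit alignment yields more stable cross-modal collaboration.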
Abstract: Long-term action quality assessment (AQA) focuses on evaluating the quality of human activities in videos lasting up to several minutes. This task plays an important role in the automated evaluation of artistic sports such as rhythmic gymnastics and figure skating, where both accurate motion execution and temporal synchronization with background music are essential for performance assessment. However, existing methods predominantly fall into two categories: unimodal approaches that rely solely on visual features, which are inadequate for modeling multimodal cues like music; and multimodal approaches that typically employ simple feature-level contrastive fusion, overlooking deep cross-modal collaboration and temporal dynamics. As a result, they struggle to capture complex interactions between modalities and fail to accurately track critical performance changes throughout extended sequences. To address these challenges, we propose the Long-term Multimodal Attention Consistency Network (LMAC-Net). LMAC-Net introduces a multimodal attention consistency mechanism to explicitly align multimodal features, enabling stable integration of visual and audio information and enhancing feature representations. Specifically, we introduce a multimodal local query encoder module to capture temporal semantics and cross-modal relations, and use a two-level score evaluation for interpretable results. In addition, attention-based and regression-based losses are applied to jointly optimize multimodal alignment and score fusion. Experiments conducted on the RG and Fis-V datasets demonstrate that LMAC-Net significantly outperforms existing methods, validating the effectiveness of our proposed approach.
[56] Enhancing Generalization in Data-free Quantization via Mixup-class Prompting cs.CV | cs.AIPDF
Jiwoong Park, Chaeun Lee, Yongseok Choi, Sein Park, Deokki Hong
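TL;DR: The paper proposes the mixup-class prompt, which fuses multiple class labels to generate diverse synthetic data. This improves generalization in data-free quantization and performs especially well in extremely low-bit settings.
Details
Motivation: Existing data-free quantization methods generate synthetic images from single-class prompts, which are prone to polysemy and degrade the generalization of the quantized model. The relationship between synthetic images and the generalizability of the quantized model needs to be explored.
Result: Experiments on CNNs and ViTs show that the method outperforms existing DFQ methods such as GenQ and sets a new state of the art in the extreme 2-bit weight, 4-bit activation setting.
Insight: Fusing multiple class labels alleviates the polysemy of single-class prompts and increases the diversity of synthetic data, which improves the generalization of the quantized model.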
Abstract: Post-training quantization (PTQ) improves efficiency but struggles with limited calibration data, especially under privacy constraints. Data-free quantization (DFQ) mitigates this by generating synthetic images using generative models such as generative adversarial networks (GANs) and text-conditioned latent diffusion models (LDMs), while applying existing PTQ algorithms. However, the relationship between generated synthetic images and the generalizability of the quantized model during PTQ remains underexplored. Without investigating this relationship, synthetic images generated by previous prompt engineering methods based on single-class prompts suffer from issues such as polysemy, leading to performance degradation. We propose mixup-class prompt, a mixup-based text prompting strategy that fuses multiple class labels at the text prompt level to generate diverse, robust synthetic data. This approach enhances generalization and improves optimization stability in PTQ. We provide quantitative insights through gradient norm and generalization error analysis. Experiments on convolutional neural networks (CNNs) and vision transformers (ViTs) show that our method consistently outperforms state-of-the-art DFQ methods like GenQ. Furthermore, it pushes the performance boundary in extremely low-bit scenarios, achieving new state-of-the-art accuracy in challenging 2-bit weight, 4-bit activation (W2A4) quantization.
[57] MetaCLIP 2: A Worldwide Scaling Recipe cs.CV | cs.CLPDF
Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu
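TL;DR: MetaCLIP 2 presents a recipe for training CLIP models on worldwide web data. It addresses the handling of non-English data and sets a new state of the art on multilingual benchmarks.
Details
Motivation: Existing CLIP models are trained mainly on English data. Scaling to worldwide multilingual data runs into two challenges: data curation and degraded performance in the multilingual setting.
Result: MetaCLIP 2 surpasses its English-only counterpart by 0.8% in zero-shot ImageNet classification and reaches state-of-the-art results on multilingual benchmarks such as CVQA, Babel-ImageNet, and XM3600.
Insight: With sound worldwide data curation and training choices, a multilingual CLIP can outperform its monolingual counterpart without complex system-level changes.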
Abstract: Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting tasks that range from zero-shot classification and retrieval to serving as encoders for multimodal large language models (MLLMs). Although CLIP is successfully trained on billion-scale image-text pairs from the English world, scaling CLIP’s training further to learn from worldwide web data is still challenging: (1) no curation method is available to handle data points from the non-English world; (2) the English performance of existing multilingual CLIP is worse than that of its English-only counterpart, i.e., the “curse of multilinguality” that is common in LLMs. Here, we present MetaCLIP 2, the first recipe for training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with the minimal changes necessary to address the above challenges and present a recipe enabling mutual benefits from English and non-English world data. In zero-shot ImageNet classification, MetaCLIP 2 ViT-H/14 surpasses its English-only counterpart by 0.8% and mSigLIP by 0.7%, and surprisingly sets a new state of the art without system-level confounding factors (e.g., translation, bespoke architecture changes) on multilingual benchmarks, such as CVQA with 57.4%, Babel-ImageNet with 50.2% and XM3600 with 64.3% on image-to-text retrieval.
[58] Mitigating Spurious Correlations in Weakly Supervised Semantic Segmentation via Cross-architecture Consistency Regularization cs.CVPDF
Zheyuan Zhang, Yen-chia Hsu
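TL;DR: The paper proposes a new weakly supervised semantic segmentation framework that reduces spurious correlations through cross-architecture consistency regularization, using a teacher-student setup that combines CNNs and ViTs to improve segmentation quality.
Details
Motivation: Pixel-level labels are scarce, and annotation is especially difficult in specialized domains such as industrial smoke. Conventional weakly supervised methods suffer from incomplete foreground coverage, inaccurate boundaries, and spurious correlations.
Result: The new framework reduces spurious correlations and improves segmentation quality, particularly in complex scenes such as industrial smoke.
Insight: Cross-architecture consistency regularization is an effective way to address model bias in weak supervision without relying on additional priors.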
Abstract: Scarcity of pixel-level labels is a significant challenge in practical scenarios. In specific domains like industrial smoke, acquiring such detailed annotations is particularly difficult and often requires expert knowledge. To alleviate this, weakly supervised semantic segmentation (WSSS) has emerged as a promising approach. However, due to the supervision gap and inherent bias in models trained with only image level labels, existing WSSS methods suffer from limitations such as incomplete foreground coverage, inaccurate object boundaries, and spurious correlations, especially in our domain, where emissions are always spatially coupled with chimneys. Previous solutions typically rely on additional priors or external knowledge to mitigate these issues, but they often lack scalability and fail to address the model’s inherent bias toward co-occurring context. To address this, we propose a novel WSSS framework that directly targets the co-occurrence problem without relying on external supervision. Unlike prior methods that adopt a single network, we employ a teacher-student framework that combines CNNs and ViTs. We introduce a knowledge transfer loss that enforces cross-architecture consistency by aligning internal representations. Additionally, we incorporate post-processing techniques to address partial coverage and further improve pseudo mask quality.
[59] A Deep Learning Pipeline Using Synthetic Data to Improve Interpretation of Paper ECG Images cs.CVPDF
Xiaoyu Wang, Ramesh Nadarajah, Zhiqiang Zhang, David Wong
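TL;DR: The paper proposes a deep learning framework that classifies electrocardiogram (ECG) images into five main diagnostic categories. It tackles image noise and fine waveform pattern recognition and achieved top results in a competition.
Details
Motivation: Cardiovascular disease is the leading global cause of death, and early detection is essential. Conventional ECG image interpretation relies on manual work that is time-consuming and requires expertise. Current research focuses mostly on digital signals, but clinical ECG data are largely stored as images, so an automated method for image classification is needed.
Result: In the 2024 British Heart Foundation Open Data Science Challenge, the method achieved AUROC scores of 0.9688 on the public validation set and 0.9677 on the private test set.
Insight: 1. Synthetic data can improve model performance in a specific domain. 2. A two-stage fine-tuning strategy effectively combines domain-specific features with task specificity. 3. Image pre-processing is essential for performance on noise-sensitive tasks.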
Abstract: Cardiovascular diseases (CVDs) are the leading global cause of death, and early detection is essential to improve patient outcomes. Electrocardiograms (ECGs), especially 12-lead ECGs, play a key role in the identification of CVDs. These are routinely interpreted by human experts, a process that is time-consuming and requires expert knowledge. Historical research in this area has focused on automatic ECG interpretation from digital signals, with recent deep learning approaches achieving strong results. In practice, however, most ECG data in clinical practice are stored or shared in image form. To bridge this gap, we propose a deep learning framework designed specifically to classify paper-like ECG images into five main diagnostic categories. Our method was the winning entry to the 2024 British Heart Foundation Open Data Science Challenge. It addresses two main challenges of paper ECG classification: visual noise (e.g., shadows or creases) and the need to detect fine-detailed waveform patterns. We propose a pre-processing pipeline that reduces visual noise and a two-stage fine-tuning strategy: the model is first fine-tuned on synthetic and external ECG image datasets to learn domain-specific features, and then further fine-tuned on the target dataset to enhance disease-specific recognition. We adopt the ConvNeXt architecture as the backbone of our model. Our method achieved AUROC scores of 0.9688 on the public validation set and 0.9677 on the private test set of the British Heart Foundation Open Data Science Challenge, highlighting its potential as a practical tool for automated ECG interpretation in clinical workflows.
[60] EIFNet: Leveraging Event-Image Fusion for Robust Semantic Segmentation cs.CVPDF
Zhijiang Li, Haoran He
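TL;DR: The paper proposes EIFNet, a multimodal fusion network that combines the strengths of event and frame-based images for robust semantic segmentation.
Details
Motivation: Event cameras offer high dynamic range and fine temporal resolution, but their sparse, noisy output and the differences in structure and representation from image data make semantic segmentation challenging.
Result: EIFNet achieves state-of-the-art performance on the DDD17-Semantic and DSEC-Semantic datasets, validating the method.
Insight: Attention mechanisms and gated fusion strategies can handle the heterogeneity of event and image data and improve the robustness of semantic segmentation.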
Abstract: Event-based semantic segmentation explores the potential of event cameras, which offer high dynamic range and fine temporal resolution, to achieve robust scene understanding in challenging environments. Despite these advantages, the task remains difficult due to two main challenges: extracting reliable features from sparse and noisy event streams, and effectively fusing them with dense, semantically rich image data that differ in structure and representation. To address these issues, we propose EIFNet, a multi-modal fusion network that combines the strengths of both event and frame-based inputs. The network includes an Adaptive Event Feature Refinement Module (AEFRM), which improves event representations through multi-scale activity modeling and spatial attention. In addition, we introduce a Modality-Adaptive Recalibration Module (MARM) and a Multi-Head Attention Gated Fusion Module (MGFM), which align and integrate features across modalities using attention mechanisms and gated fusion strategies. Experiments on DDD17-Semantic and DSEC-Semantic datasets show that EIFNet achieves state-of-the-art performance, demonstrating its effectiveness in event-based semantic segmentation.
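The gated fusion idea mentioned in the abstract can be illustrated with a minimal element-wise sketch. This is not EIFNet's MGFM, which also involves multi-head attention; the function `gated_fusion` and the choice of a sigmoid gate over supplied logits are assumptions made purely for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(event_feat, image_feat, gate_logits):
    """Blend two feature vectors element-wise.

    A gate near 1 trusts the event feature, a gate near 0 trusts the
    image feature. In a real network the logits would be predicted
    from the features themselves; here they are passed in directly.
    """
    fused = []
    for e, i, z in zip(event_feat, image_feat, gate_logits):
        g = sigmoid(z)
        fused.append(g * e + (1.0 - g) * i)
    return fused


# Zero logits give a gate of 0.5, i.e. a plain average of both inputs.
print(gated_fusion([1.0, 2.0], [3.0, 4.0], [0.0, 0.0]))
```

A learned gate lets the fusion lean on event features where frames are unreliable (for example, under extreme lighting) and on image features where events are sparse.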
[61] ZIUM: Zero-Shot Intent-Aware Adversarial Attack on Unlearned Models cs.CV | cs.CRPDF
Hyun Jun Yook, Ga San Jhun, Jae Hyun Cho, Min Jeon, Donghyun Kim
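TL;DR: ZIUM is a zero-shot, intent-aware adversarial attack on unlearned models. It allows flexible customization of target attack images to reflect the attacker's intent and supports zero-shot attacks, significantly reducing computational cost.
Details
Motivation: Existing adversarial attack methods struggle to generate content that matches the attacker's intent and are computationally expensive. ZIUM aims to address both issues by improving attack efficiency and supporting intent customization.
Result: Across a range of machine unlearning scenarios, ZIUM achieves a higher attack success rate, and its zero-shot attack reduces attack time.
Insight: ZIUM offers a new angle on adversarial attacks, exposes latent security risks in unlearned models, and underlines the importance of intent awareness and efficiency.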
Abstract: Machine unlearning (MU) removes specific data points or concepts from deep learning models to enhance privacy and prevent sensitive content generation. Adversarial prompts can exploit unlearned models to generate content containing removed concepts, posing a significant security risk. However, existing adversarial attack methods still face challenges in generating content that aligns with an attacker’s intent while incurring high computational costs to identify successful prompts. To address these challenges, we propose ZIUM, a Zero-shot Intent-aware adversarial attack on Unlearned Models, which enables the flexible customization of target attack images to reflect an attacker’s intent. Additionally, ZIUM supports zero-shot adversarial attacks without requiring further optimization for previously attacked unlearned concepts. The evaluation across various MU scenarios demonstrated ZIUM’s effectiveness in successfully customizing content based on user-intent prompts while achieving a superior attack success rate compared to existing methods. Moreover, its zero-shot adversarial attack significantly reduces the attack time for previously attacked unlearned concepts.
[62] Staining and locking computer vision models without retraining cs.CV | cs.AI | cs.LG | 68T07, 68T45, 68W40 | I.2.10; F.2.0; K.5.1; K.6.5PDF
Oliver J. Sutton, Qinghua Zhou, George Leete, Alexander N. Gorban, Ivan Y. Tyukin
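TL;DR: The paper proposes a retraining-free method that stains (watermarks) and locks computer vision models by directly modifying a small number of weights, protecting the model owner's intellectual property. A locked model can be used only when a specific trigger is inserted into the input image.
Details
Motivation: The goal is to protect the intellectual property of computer vision models and prevent unauthorized use or copying. Existing methods usually require retraining or fine-tuning the model, whereas this method does not.
Result: Experiments show that the method is effective across a variety of computer vision models, with negligible impact on the performance of the unlocked model.
Insight: Models can be protected and locked without retraining, which offers a more efficient and practical route to protecting model intellectual property.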
Abstract: We introduce new methods of staining and locking computer vision models, to protect their owners’ intellectual property. Staining, also known as watermarking, embeds secret behaviour into a model which can later be used to identify it, while locking aims to make a model unusable unless a secret trigger is inserted into input images. Unlike existing methods, our algorithms can be used to stain and lock pre-trained models without requiring fine-tuning or retraining, and come with provable, computable guarantees bounding their worst-case false positive rates. The stain and lock are implemented by directly modifying a small number of the model’s weights and have minimal impact on the (unlocked) model’s performance. Locked models are unlocked by inserting a small 'trigger patch' into the corner of the input image. We present experimental results showing the efficacy of our methods and demonstrating their practical performance on a variety of computer vision models.
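The unlocking step described in the abstract, inserting a small trigger patch into a corner of the input image, can be sketched on a plain nested-list image. The top-left placement, the function name `insert_trigger`, and the toy pixel values are assumptions; the abstract does not say which corner is used or how the patch is constructed.

```python
def insert_trigger(image, patch):
    """Return a copy of `image` with `patch` pasted into the top-left corner.

    `image` and `patch` are lists of rows of pixel values. The original
    image is left unmodified.
    """
    if len(patch) > len(image) or len(patch[0]) > len(image[0]):
        raise ValueError("patch must fit inside the image")
    out = [row[:] for row in image]
    for r, patch_row in enumerate(patch):
        for c, value in enumerate(patch_row):
            out[r][c] = value
    return out


# Hypothetical 2x3 image and 1x1 trigger patch.
image = [[0, 0, 0], [0, 0, 0]]
print(insert_trigger(image, [[9]]))
```

How the modified weights respond to such a patch is the substance of the paper and is not reproduced here.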
[63] Bridging Synthetic and Real-World Domains: A Human-in-the-Loop Weakly-Supervised Framework for Industrial Toxic Emission Segmentation cs.CV | cs.AIPDF
Yida Tao, Yen-Chia Hsu
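TL;DR: The paper proposes CEDANet, a human-in-the-loop, weakly supervised domain adaptation framework that refines pseudo-labels with citizen votes and uses adversarial feature alignment to address the scarcity of annotated data in industrial smoke segmentation.
Details
Motivation: Industrial smoke segmentation is critical for environmental monitoring, but pixel-level annotations in real-world settings are costly and scarce, so an efficient, low-cost solution is needed.
Result: On the SMOKE5K and IJmond datasets, CEDANet reaches an F1-score of 0.414 and a smoke-class IoU of 0.261, a five-fold and six-fold improvement over the baseline model, respectively.
Insight: Combining citizen science with weakly supervised domain adaptation can substantially reduce annotation cost while approaching the accuracy of a fully supervised model, making it suitable for complex, data-scarce environmental monitoring tasks.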
Abstract: Industrial smoke segmentation is critical for air-quality monitoring and environmental protection but is often hampered by the high cost and scarcity of pixel-level annotations in real-world settings. We introduce CEDANet, a human-in-the-loop, class-aware domain adaptation framework that uniquely integrates weak, citizen-provided video-level labels with adversarial feature alignment. Specifically, we refine pseudo-labels generated by a source-trained segmentation model using citizen votes, and employ class-specific domain discriminators to transfer rich source-domain representations to the industrial domain. Comprehensive experiments on SMOKE5K and custom IJmond datasets demonstrate that CEDANet achieves an F1-score of 0.414 and a smoke-class IoU of 0.261 with citizen feedback, vastly outperforming the baseline model, which scored 0.083 and 0.043 respectively. This represents a five-fold increase in F1-score and a six-fold increase in smoke-class IoU. Notably, CEDANet with citizen-constrained pseudo-labels achieves performance comparable to the same architecture trained on limited 100 fully annotated images with F1-score of 0.418 and IoU of 0.264, demonstrating its ability to reach small-sampled fully supervised-level accuracy without target-domain annotations. Our research validates the scalability and cost-efficiency of combining citizen science with weakly supervised domain adaptation, offering a practical solution for complex, data-scarce environmental monitoring applications.
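The fold increases quoted in the abstract follow directly from the reported scores. The sketch below recomputes them and, for reference, shows how F1 and IoU are derived from pixel counts. The helper names are illustrative, and the only data used are the four scores quoted in the abstract.

```python
def f1_score(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def iou(tp, fp, fn):
    """Intersection over union for a single class."""
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def fold_increase(new, baseline):
    """How many times larger `new` is than `baseline`."""
    return new / baseline


# Scores quoted in the abstract: CEDANet vs. the baseline model.
print(round(fold_increase(0.414, 0.083), 2))  # F1-score ratio
print(round(fold_increase(0.261, 0.043), 2))  # smoke-class IoU ratio
```

The ratios come out at roughly 4.99 and 6.07, consistent with the "five-fold" and "six-fold" wording.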
[64] See Different, Think Better: Visual Variations Mitigating Hallucinations in LVLMs cs.CVPDF
Ziyun Dai, Xiaoqiang Li, Shaohua Zhang, Yuanchen Wu, Jide Li
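TL;DR: The paper proposes ViHallu, a vision-centric hallucination mitigation framework. Through visual variation image generation and visual instruction construction, it improves visual-semantic alignment and significantly reduces hallucinations in large vision-language models (LVLMs).
Details
Motivation: Current LVLMs perform well at visual understanding and multimodal reasoning but often produce text that is inconsistent with the visual content. Existing mitigation methods are text-centric and have limited effect in fine-grained visual scenarios. The paper therefore proposes ViHallu, which focuses on visual-semantic alignment.
Result: Across multiple benchmarks, ViHallu significantly reduces hallucinations while improving the model's fine-grained visual understanding.
Insight: Visual variations combined with targeted instruction construction improve the visual-semantic alignment of LVLMs more effectively and reduce hallucinations. The approach offers a new direction for fine-grained understanding in multimodal models.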
Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in visual understanding and multimodal reasoning. However, LVLMs frequently exhibit hallucinations, in which the generated textual responses are inconsistent with the provided visual content. Existing hallucination mitigation methods are predominantly text-centric, and the challenges of visual-semantic alignment significantly limit their effectiveness, especially in fine-grained visual understanding scenarios. To this end, this paper presents ViHallu, a Vision-Centric Hallucination mitigation framework that enhances visual-semantic alignment through Visual Variation Image Generation and Visual Instruction Construction. ViHallu introduces visual variation images with controllable visual alterations while maintaining the overall image structure. These images, combined with carefully constructed visual instructions, enable LVLMs to better understand fine-grained visual content through fine-tuning, allowing models to more precisely capture the correspondence between visual content and text, thereby enhancing visual-semantic alignment. Extensive experiments on multiple benchmarks show that ViHallu effectively enhances models’ fine-grained visual understanding while significantly reducing hallucination tendencies. Furthermore, we release ViHallu-Instruction, a visual instruction dataset specifically designed for hallucination mitigation and visual-semantic alignment. Code is available at https://github.com/oliviadzy/ViHallu.
[65] VeS: Teaching Pixels to Listen Without Supervision cs.CV | I.2.10PDF
Sajay Raj
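TL;DR: The paper studies dense audio-visual (AV) models in low-resource, multilingual settings. Comparing three contrastive objectives, it finds that the dense (DenseAV-style) objective performs best on audio-visual retrieval and localization.
Details
Motivation: Existing work on dense AV models is based mostly on English, caption-rich web video, and their performance in low-resource, multilingual settings is unclear. The paper tests whether these models remain effective in a multilingual Indian setting.
Result: The dense objective delivers a 59% relative R@1 improvement over global pooling on audio-visual retrieval and produces sharp zero-shot localization heatmaps.
Insight: Dense token routing is effective not only in high-resource settings such as English; it is even more decisive in low-resource, multilingual settings where annotations and audio quality are limited.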
Abstract: Recent dense audio-visual (AV) models achieve impressive retrieval and emergent localization, but almost all evidence comes from English-centric, caption-rich web video. It is unclear whether these objectives survive in low-resource, code-switched, and noisy multilingual settings that typify developing regions. We show they do, and that the choice of aggregation function becomes even more critical. Using a multilingual subset of Project Vaani spanning dozens of Indian languages and dialectal variants, we compare three contrastive objectives: (i) a global mean-pooled loss (CLIP-style), (ii) a dense max-mean token matcher (DenseAV-style), and (iii) a simple hybrid (motivated by frozen-vision alignment strategies). The dense objective delivers a +59% relative R@1 (Audio Visual) improvement over global pooling and substantially lower mean/median ranks, while consistently producing sharp zero-shot localization heatmaps of spoken objects, despite keeping the vision backbone entirely frozen (no LoRA / partial fine-tuning). Our results demonstrate that dense token routing is not a luxury of high-resource English corpora; it is more decisive when annotations and acoustic cleanliness are scarce. We release the codebase and trained models.
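The difference between the first two aggregation functions can be made concrete with a small sketch. It assumes that "max-mean" means taking, for each audio token, the maximum similarity over visual tokens and then averaging over audio tokens; the exact direction and normalization used in the paper are not stated in the abstract, and the function names and toy embeddings are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(tokens):
    """Average a list of token vectors into a single vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def global_mean_similarity(audio_tokens, visual_tokens):
    """CLIP-style: pool each modality first, then compare once."""
    return dot(mean_pool(audio_tokens), mean_pool(visual_tokens))


def dense_max_mean_similarity(audio_tokens, visual_tokens):
    """Dense matcher: best visual match per audio token, then average."""
    best = [max(dot(a, v) for v in visual_tokens) for a in audio_tokens]
    return sum(best) / len(best)


# Two audio tokens that each align with a different visual patch.
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[1.0, 0.0], [0.0, 1.0]]
print(global_mean_similarity(audio, visual))     # pooled comparison
print(dense_max_mean_similarity(audio, visual))  # token-level comparison
```

In this toy case pooling blurs the two distinct alignments into a similarity of 0.5, while the dense matcher preserves them and scores 1.0; the per-token maxima are also what can be visualized as localization heatmaps.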
[66] From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning cs.CV | cs.ROPDF
Honglin He, Yukai Ma, Wayne Wu, Bolei Zhou
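TL;DR: The paper proposes the Seeing-to-Experiencing (S2E) framework, which combines offline pre-training with reinforcement learning (RL) to strengthen the interactive capabilities of navigation foundation models and overcome the limits of training on offline data alone.
Details
Motivation: Existing navigation foundation models are trained only on offline data and struggle with real-world tasks that demand interaction and safety, such as avoiding obstacles and pedestrians. A method is needed that preserves generalization while improving interactivity.
Result: Experiments show that S2E mitigates the performance bottleneck of relying on offline data alone, and that RL clearly outperforms supervised fine-tuning at improving interactivity.
Insight: Online interactive experience gained through RL is crucial for improving foundation models on real robotic tasks; combining simulation with the real world is a direction for future work.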
Abstract: Navigation foundation models trained on massive web-scale data enable agents to generalize across diverse environments and embodiments. However, these models, trained solely on offline data, often lack the capacity to reason about the consequences of their actions or adapt through counterfactual understanding. They thus face significant limitations in real-world urban navigation, where interactive and safe behaviors, such as avoiding obstacles and moving pedestrians, are critical. To tackle these challenges, we introduce the Seeing-to-Experiencing (S2E) framework to scale the capability of navigation foundation models with reinforcement learning. S2E combines the strengths of pre-training on videos and post-training through RL. It maintains the generalizability acquired from large-scale real-world videos while enhancing its interactivity through RL in simulation environments. Specifically, we introduce two innovations: an Anchor-Guided Distribution Matching strategy, which stabilizes learning and models diverse motion patterns through anchor-based supervision; and a Residual-Attention Module, which obtains reactive behaviors from simulation environments without erasing the model’s pretrained knowledge. Moreover, we establish a comprehensive end-to-end evaluation benchmark, NavBench-GS, built on photorealistic 3DGS reconstructions of real-world scenes that incorporate physical interactions. It can systematically assess the generalizability and safety of navigation foundation models. Extensive experiments show that S2E mitigates the diminishing returns often seen when scaling with offline data alone. We perform a thorough analysis of the benefits of Reinforcement Learning compared to Supervised Fine-Tuning in the context of post-training for robot learning. Our findings emphasize the crucial role of integrating interactive online experiences to effectively scale foundation models in Robotics.
[67] Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos cs.CVPDF
Ziren Gong, Xiaohan Li, Fabio Tosi, Jiawei Han, Stefano Mattoccia
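TL;DR: Ov3R is a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams. It combines CLIP semantics with 3D reconstruction to advance Spatial AI.
Details
Motivation: Existing 3D reconstruction and semantic segmentation methods usually lack open-vocabulary capability, meaning they cannot handle unseen semantic categories. Ov3R combines CLIP semantics with 3D reconstruction to achieve globally consistent geometry and fine-grained semantic alignment.
Result: Experiments show that Ov3R performs strongly on dense 3D reconstruction and open-vocabulary 3D segmentation, a step toward real-time, semantics-aware Spatial AI.
Insight: Combining open-vocabulary models such as CLIP with 3D reconstruction substantially improves semantic alignment and opens new possibilities for Spatial AI in open-world scenes.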
Abstract: We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips while embedding object-level semantics; and 2D-3D OVS, a 2D-3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation, marking a step forward toward real-time, semantics-aware Spatial AI.
[68] X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again cs.CVPDF
Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao
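TL;DR: The paper proposes X-Omni, a reinforcement-learning-based discrete autoregressive image generation method. By combining a semantic image tokenizer, a unified autoregressive model, and an offline diffusion decoder, it significantly improves image generation quality and integrates image and language generation seamlessly.
Details
Motivation: Existing autoregressive approaches to image generation suffer from low visual fidelity, distorted outputs, and difficulty following complex instructions, which has pushed researchers toward jointly training diffusion and autoregressive models. The paper uses reinforcement learning to address these shortcomings and reunify image and language generation.
Result: X-Omni achieves state-of-the-art image generation performance with a 7B language model, producing images of high aesthetic quality while following instructions and rendering long text effectively.
Insight: Reinforcement learning can mitigate the cumulative errors of autoregressive models in image generation, making unified modeling of image and language generation feasible.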
Abstract: Numerous efforts have been made to extend the "next token prediction" paradigm to visual contents, aiming to create a unified approach for both image generation and understanding. Nevertheless, attempts to generate images through autoregressive modeling with discrete tokens have been plagued by issues such as low visual fidelity, distorted outputs, and failure to adhere to complex instructions when rendering intricate details. These shortcomings are likely attributed to cumulative errors during autoregressive inference or information loss incurred during the discretization process. Probably due to this challenge, recent research has increasingly shifted toward jointly training image generation with diffusion objectives and language generation with autoregressive objectives, moving away from unified modeling approaches. In this work, we demonstrate that reinforcement learning can effectively mitigate artifacts and largely enhance the generation quality of a discrete autoregressive modeling method, thereby enabling seamless integration of image and language generation. Our framework comprises a semantic image tokenizer, a unified autoregressive model for both language and images, and an offline diffusion decoder for image generation, termed X-Omni. X-Omni achieves state-of-the-art performance in image generation tasks using a 7B language model, producing images with high aesthetic quality while exhibiting strong capabilities in following instructions and rendering long texts.
[69] StepAL: Step-aware Active Learning for Cataract Surgical Videos cs.CVPDF
Nisarg A. Shah, Bardia Safaei, Shameema Sikder, S. Swaroop Vedula, Vishal M. Patel
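TL;DR: StepAL is an active learning framework for cataract surgical videos. It combines a step-aware feature representation with an entropy-weighted clustering strategy to select whole videos for annotation, significantly reducing annotation cost and improving surgical step recognition accuracy.
Details
Motivation: Conventional active learning methods are designed for images or short clips and do not suit long, untrimmed surgical videos, where annotation depends on the context of the entire video. StepAL is proposed to address this.
Result: On two cataract surgery datasets (Cataract-1k and Cataract-101), StepAL outperforms existing methods, reaching higher step recognition accuracy with fewer labeled videos.
Insight: By accounting for a video's global context and step diversity, StepAL offers a new approach to efficient annotation and model training for surgical video.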
Abstract: Active learning (AL) can reduce annotation costs in surgical video analysis while maintaining model performance. However, traditional AL methods, developed for images or short video clips, are suboptimal for surgical step recognition due to inter-step dependencies within long, untrimmed surgical videos. These methods typically select individual frames or clips for labeling, which is ineffective for surgical videos where annotators require the context of the entire video for annotation. To address this, we propose StepAL, an active learning framework designed for full video selection in surgical step recognition. StepAL integrates a step-aware feature representation, which leverages pseudo-labels to capture the distribution of predicted steps within each video, with an entropy-weighted clustering strategy. This combination prioritizes videos that are both uncertain and exhibit diverse step compositions for annotation. Experiments on two cataract surgery datasets (Cataract-1k and Cataract-101) demonstrate that StepAL consistently outperforms existing active learning approaches, achieving higher accuracy in step recognition with fewer labeled videos. StepAL offers an effective approach for efficient surgical video analysis, reducing the annotation burden in developing computer-assisted surgical systems.
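The two per-video quantities that the abstract describes, a distribution of predicted steps and an uncertainty signal, can be sketched as follows. This is only an assumed reading of "step-aware feature" and "entropy weight"; the clustering itself is omitted, and the function names and toy predictions are hypothetical.

```python
import math


def step_distribution(pseudo_labels, num_steps):
    """Fraction of frames assigned to each surgical step by pseudo-labels."""
    counts = [0] * num_steps
    for label in pseudo_labels:
        counts[label] += 1
    total = len(pseudo_labels)
    return [c / total for c in counts]


def mean_frame_entropy(frame_probs):
    """Average Shannon entropy (in nats) of per-frame step predictions."""
    total = 0.0
    for probs in frame_probs:
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(frame_probs)


# Hypothetical video: four frames, three possible steps.
feature = step_distribution([0, 0, 1, 2], num_steps=3)
weight = mean_frame_entropy([[0.5, 0.5], [1.0, 0.0]])
print(feature, weight)
```

A selection strategy in this spirit would cluster videos by their step distributions and favor high-entropy videos within each cluster, so that chosen videos are both uncertain and diverse in step composition.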
[70] MOVE: Motion-Guided Few-Shot Video Object Segmentation cs.CVPDF
Kaining Ying, Hengrui Hu, Henghui Ding
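TL;DR: The paper addresses motion-guided few-shot video object segmentation (FSVOS) and introduces MOVE, a large-scale dataset for the task. After evaluating existing methods and analyzing the challenges they face, the authors propose a new baseline, DMA, and show that it performs better on few-shot motion understanding.
Details
Motivation: Existing FSVOS methods focus on static object categories and ignore the temporal dynamics in videos, which limits their use in scenarios that require motion understanding. To fill this gap, the authors define the motion-guided FSVOS task and build the MOVE dataset for it.
Result: Experiments show that DMA performs strongly on MOVE and clearly outperforms existing methods, providing a new baseline for motion-guided FSVOS.
Insight: Motion is a key dynamic property in video object segmentation. Decoupling motion and appearance features addresses few-shot video object segmentation more effectively and points to a new direction for future research.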
Abstract: This work addresses motion-guided few-shot video object segmentation (FSVOS), which aims to segment dynamic objects in videos based on a few annotated examples with the same motion patterns. Existing FSVOS datasets and methods typically focus on object categories, which are static attributes that ignore the rich temporal dynamics in videos, limiting their application in scenarios requiring motion understanding. To fill this gap, we introduce MOVE, a large-scale dataset specifically designed for motion-guided FSVOS. Based on MOVE, we comprehensively evaluate 6 state-of-the-art methods from 3 different related tasks across 2 experimental settings. Our results reveal that current methods struggle to address motion-guided FSVOS, prompting us to analyze the associated challenges and propose a baseline method, Decoupled Motion Appearance Network (DMA). Experiments demonstrate that our approach achieves superior performance in few-shot motion understanding, establishing a solid foundation for future research in this direction.
cs.CL [Back]
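[71] Dialogic Social Learning for Artificial Agents: Enhancing LLM Ontology Acquisition through Mixed-Initiative Educational Interactions cs.CL | cs.HC | cs.LG | cs.RO | I.2.7, I.2.9, J.4 [PDF]
Sabrina Patania, Luca Annese, Cansu Koyuturk, Azzurra Ruggeri, Dimitri Ognibene
TL;DR: The paper studies whether socially mediated learning, such as mixed-initiative educational dialogue, improves knowledge acquisition in large language models (LLMs). Experiments show that it outperforms traditional one-way instruction in both acquiring and applying knowledge.
Details
Motivation: Traditional AI training paradigms such as supervised and reinforcement learning rely on large datasets and sparse feedback signals, which limits how efficiently models learn from interaction. Inspired by Vygotsky’s sociocultural theory, the study uses socially mediated learning, such as pedagogical dialogue, to overcome these limits.
Result: Mixed-initiative dialogue outperforms both one-way instruction and direct access to structured knowledge in knowledge acquisition and application, indicating that socially mediated learning can significantly improve LLM learning.
Insight: Social learning methods that draw on pedagogy and psychology open a new path for AI training and can offset the limitations of existing methods such as prompt engineering.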
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in processing extensive offline datasets. However, they often face challenges in acquiring and integrating complex knowledge online. Traditional AI training paradigms, predominantly based on supervised learning or reinforcement learning, mirror a ‘Piagetian’ model of independent exploration. These approaches typically rely on large datasets and sparse feedback signals, limiting the models’ ability to learn efficiently from interactions. Drawing inspiration from Vygotsky’s sociocultural theory, this study explores the potential of socially mediated learning paradigms to address these limitations. We introduce a dynamic environment, termed the ‘AI Social Gym’, where an AI learner agent engages in dyadic pedagogical dialogues with knowledgeable AI teacher agents. These interactions emphasize external, structured dialogue as a core mechanism for knowledge acquisition, contrasting with methods that depend solely on internal inference or pattern recognition. Our investigation focuses on how different pedagogical strategies impact the AI learning process in the context of ontology acquisition. Empirical results indicate that such dialogic approaches, particularly those involving mixed-direction interactions combining top-down explanations with learner-initiated questioning, significantly enhance the LLM’s ability to acquire and apply new knowledge, outperforming both unidirectional instructional methods and direct access to structured knowledge, formats typically present in training datasets. These findings suggest that integrating pedagogical and psychological insights into AI and robot training can substantially improve post-training knowledge acquisition and response quality. This approach offers a complementary pathway to existing strategies like prompt engineering.
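[72] Which symbol grounding problem should we try to solve? cs.CL | cs.AI [PDF]
Vincent C. Müller
TL;DR: The paper examines how the symbol grounding problem is defined and solved. It argues that the zero semantic commitment condition proposed by Floridi and Taddeo cannot be met, and that the nature of the problem and the role that goals play in a system should be rethought.
Details
Motivation: The symbol grounding problem is central to artificial intelligence, but existing solutions are contested. The paper aims to redefine the problem and propose a more sensible direction for solving it.
Result: The zero semantic commitment condition is shown to be infeasible. The paper argues that the only sensible grounding problem is how to explain and reproduce the behavioral ability and function of meaning in artificial computational agents.
Insight: The symbol grounding problem should be refocused on behavioral function and the actual abilities of computational agents rather than on abstract semantic commitments.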
Abstract: Floridi and Taddeo propose a condition of “zero semantic commitment” for solutions to the grounding problem, and a solution to it. I argue briefly that their condition cannot be fulfilled, not even by their own solution. After a look at Luc Steels’ very different competing suggestion, I suggest that we need to re-think what the problem is and what role the ‘goals’ in a system play in formulating the problem. On the basis of a proper understanding of computing, I come to the conclusion that the only sensible grounding problem is how we can explain and reproduce the behavioral ability and function of meaning in artificial computational agents.
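[73] Rewrite-to-Rank: Optimizing Ad Visibility via Retrieval-Aware Text Rewriting cs.CL [PDF]
Chloe Ho, Ishneet Sukhvinder Singh, Diya Sharma, Tanvi Reddy Anumandla, Michael Lu
TL;DR: The paper uses an LLM to rewrite ad text so that it ranks higher and is more visible in retrieval systems, with a custom loss that balances semantic relevance and content fidelity. Experiments show that PPO-trained models outperform prompt engineering and supervised fine-tuning in most cases.
Details
Motivation: Prior work focuses on the ability of LLMs to return relevant information, while the effect of ad phrasing on visibility and ranking in retrieval systems remains underexplored. The paper aims to fill this gap.
Result: PPO-trained models perform well under both instruction-based and few-shot prompting, reaching up to 2.79 DeltaDIR@5 and 0.0073 DeltaMRR@5 in instruction-based prompting.
Insight: How an ad is phrased materially affects its retrieval ranking and its visibility in LLM-generated responses, and reinforcement learning has a clear advantage for optimizing ad rewriting.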
Abstract: Search algorithms and user query relevance have given LLMs the ability to return relevant information, but the effect of content phrasing on ad visibility remains underexplored. We investigate how LLM-based rewriting of advertisements can improve their ranking in retrieval systems and inclusion in generated LLM responses, without modifying the retrieval model itself. We introduce a supervised fine-tuning framework with a custom loss balancing semantic relevance and content fidelity. To evaluate effectiveness, we propose two metrics: DeltaMRR@K (ranking improvement) and DeltaDIR@K (inclusion frequency improvement). Our approach presents a scalable method to optimize ad phrasing, enhancing visibility in retrieval-based LLM workflows. Experiments across both instruction-based and few-shot prompting demonstrate that PPO-trained models outperform both prompt engineering and supervised fine-tuning in most cases, achieving up to a 2.79 DeltaDIR@5 and 0.0073 DeltaMRR@5 in instruction-based prompting. These results highlight the importance of how the ad is written before retrieval, of prompt format, and of reinforcement learning in effective ad rewriting for LLM-integrated retrieval systems.
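The abstract names DeltaMRR@K as a ranking-improvement metric without giving a formula. A minimal sketch follows, assuming it is the difference in mean reciprocal rank at cutoff K between the rewritten and the original ad, with ranks beyond K (or a missing ad) contributing zero. The function names and this exact definition are assumptions, not taken from the paper.

```python
def mrr_at_k(ranks, k):
    """Mean reciprocal rank at cutoff k.

    `ranks` holds the 1-based rank of the target ad for each query,
    or None when the ad was not retrieved at all.
    """
    total = 0.0
    for rank in ranks:
        if rank is not None and rank <= k:
            total += 1.0 / rank
    return total / len(ranks)


def delta_mrr_at_k(ranks_before, ranks_after, k):
    """Assumed definition: MRR@k after rewriting minus MRR@k before."""
    return mrr_at_k(ranks_after, k) - mrr_at_k(ranks_before, k)
```

A positive value means the rewritten ads rank higher on average within the top K; DeltaDIR@K could be sketched the same way by replacing the reciprocal rank with a 0/1 indicator of whether the ad appears in the generated response.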
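[74] iLSU-T: an Open Dataset for Uruguayan Sign Language Translation cs.CL | cs.AI [PDF]
Ariel E. Stassi, Yanina Boria, J. Matías Di Martino, Gregory Randall
TL;DR: The paper introduces iLSU-T, an open dataset for Uruguayan Sign Language translation that contains RGB video, audio, and text transcriptions, and validates it through experiments.
Details
Motivation: Because each sign language has its own particularities, machine translation needs local data to develop new techniques or adapt existing ones. Multimodal datasets dedicated to Uruguayan Sign Language are currently lacking.
Result: Experiments demonstrate the usefulness of iLSU-T and highlight the importance of localized datasets for sign language processing.
Insight: Localized datasets are key to progress in sign language technology, and more such resources are needed to advance inclusive and accessible technology.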
Abstract: Automatic sign language translation has gained particular interest in the computer vision and computational linguistics communities in recent years. Given the particularities of each country’s sign language, machine translation requires local data to develop new techniques and adapt existing ones. This work presents iLSU-T, an open dataset of interpreted Uruguayan Sign Language RGB videos with audio and text transcriptions. This type of multimodal and curated data is paramount for developing novel approaches to understand or generate tools for sign language processing. iLSU-T comprises more than 185 hours of interpreted sign language videos from public TV broadcasting. It covers diverse topics and includes the participation of 18 professional interpreters of sign language. A series of experiments using three state-of-the-art translation algorithms is presented. The aim is to establish a baseline for this dataset and evaluate its usefulness and the proposed pipeline for data processing. The experiments highlight the need for more localized datasets for sign language translation and understanding, which are critical for developing novel tools to improve accessibility and inclusion of all individuals. Our data and code can be accessed.
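[75] Curved Inference: Concern-Sensitive Geometry in Large Language Model Residual Streams cs.CL | cs.AI [PDF]
Rob Manson
TL;DR: The paper proposes Curved Inference, a geometric interpretability framework for analyzing how residual stream trajectories in large language models (Gemma3-1b and LLaMA3.2-3b) bend as semantic concern shifts. Activation trajectories change markedly with the focus of concern; LLaMA in particular shows clear changes in curvature and salience as concern intensity increases.
Details
Motivation: The goal is to understand how large language models adjust internal representations during inference as semantic concern changes, revealing the dynamic relationship between the geometric structure of a model and semantic abstraction. This helps diagnose alignment and inference behavior.
Result: Shifts in semantic concern reliably alter internal activation trajectories. LLaMA shows significant changes in curvature and salience as concern intensity increases, while Gemma differentiates less clearly.
Insight: LLM geometry has two layers: a latent conceptual structure in the embedding space and a contextual trajectory driven by the specific prompt. This offers a new perspective on semantic abstraction and alignment.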
Abstract: We propose Curved Inference - a geometric interpretability framework that tracks how the residual stream trajectory of a large language model bends in response to shifts in semantic concern. Across 20 matched prompts spanning emotional, moral, perspective, logical, identity, environmental, and nonsense domains, we analyse Gemma3-1b and LLaMA3.2-3b using five native-space metrics, with a primary focus on curvature ($\kappa_i$) and salience ($S(t)$). These metrics are computed under a pullback semantic metric derived from the unembedding matrix, ensuring that all measurements reflect token-aligned geometry rather than raw coordinate structure. We find that concern-shifted prompts reliably alter internal activation trajectories in both models - with LLaMA exhibiting consistent, statistically significant scaling in both curvature and salience as concern intensity increases. Gemma also responds to concern but shows weaker differentiation between moderate and strong variants. Our results support a two-layer view of LLM geometry - a latent conceptual structure encoded in the embedding space, and a contextual trajectory shaped by prompt-specific inference. Curved Inference reveals how models navigate, reorient, or reinforce semantic meaning over depth, offering a principled method for diagnosing alignment, abstraction, and emergent inference dynamics. These findings offer fresh insight into semantic abstraction and model alignment through the lens of Curved Inference.
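[76] A Survey of Classification Tasks and Approaches for Legal Contracts cs.CL | cs.AI [PDF]
Amrita Singh, Aditya Joshi, Jiaojiao Jiang, Hye-young Paik
TL;DR: The paper is the first comprehensive survey of automatic legal contract classification (LCC). It covers classification tasks, datasets, and methodologies, and proposes directions for future research.
Details
Motivation: Legal contracts are large and complex, so manual review is inefficient and error-prone. Automated classification is needed to improve speed, accuracy, and accessibility.
Result: The survey reviews 14 datasets of English-language contracts and compares the performance of different methods on LCC tasks.
Insight: LCC still has room to improve. Future research should target efficiency, accuracy, and scalability to support legal NLP.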
Abstract: Given the large size and volumes of contracts and their underlying inherent complexity, manual reviews become inefficient and prone to errors, creating a clear need for automation. Automatic Legal Contract Classification (LCC) revolutionizes the way legal contracts are analyzed, offering substantial improvements in speed, accuracy, and accessibility. This survey delves into the challenges of automatic LCC and a detailed examination of key tasks, datasets, and methodologies. We identify seven classification tasks within LCC, and review fourteen datasets related to English-language contracts, including public, proprietary, and non-public sources. We also introduce a methodology taxonomy for LCC, categorized into Traditional Machine Learning, Deep Learning, and Transformer-based approaches. Additionally, the survey discusses evaluation techniques and highlights the best-performing results from the reviewed studies. By providing a thorough overview of current methods and their limitations, this survey suggests future research directions to improve the efficiency, accuracy, and scalability of LCC. As the first comprehensive survey on LCC, it aims to support legal NLP researchers and practitioners in improving legal processes, making legal information more accessible, and promoting a more informed and equitable society.
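[77] Bangla BERT for Hyperpartisan News Detection: A Semi-Supervised and Explainable AI Approach cs.CL [PDF]
Mohammad Mehadi Hasan, Fatema Binte Hassan, Md Al Jubair, Zobayer Ahmed, Sazzatul Yeakin
TL;DR: The paper uses Bangla BERT to detect hyperpartisan news in Bangla, combining semi-supervised learning with explainable AI techniques, and reports high accuracy.
Details
Motivation: Bangla is a low-resource language that lacks effective natural language processing tools for detecting hyperpartisan news, which lets misinformation spread unchecked.
Result: Bangla BERT reaches 95.65% accuracy on the trial data, outperforming traditional methods.
Insight: The study shows the potential of transformer models in low-resource language settings and leaves room for further improvement.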
Abstract: In the current digital landscape, misinformation circulates rapidly, shaping public perception and causing societal divisions. It is difficult to identify hyperpartisan news in Bangla since there aren’t many sophisticated natural language processing methods available for this low-resource language. Without effective detection methods, biased content can spread unchecked, posing serious risks to informed discourse. To address this gap, our research fine-tunes Bangla BERT, a state-of-the-art transformer-based model, to enhance classification accuracy for hyperpartisan news. We evaluate its performance against traditional machine learning models and implement semi-supervised learning to enhance predictions further. We also use LIME to provide transparent explanations of the model’s decision-making process, which helps to build trust in its outcomes. With an accuracy score of 95.65%, Bangla BERT outperforms conventional approaches, according to our trial data. The findings of this study demonstrate the usefulness of transformer models even in environments with limited resources, which opens the door to further improvements in this area.
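[78] Can human clinical rationales improve the performance and explainability of clinical text classification models? cs.CL [PDF]
Christoph Metzner, Shang Gao, Drahomira Herrmannova, Heidi A. Hanson
TL;DR: The paper investigates whether human clinical rationales improve the performance and explainability of clinical text classification models. Rationales help in some cases, but less than simply adding more reports. If accuracy is the goal, label more reports; if explainability matters, supplementing with rationale data may help.
Details
Motivation: To explore whether human clinical rationales can serve as additional supervision that improves the performance and explainability of transformer-based clinical text classification models.
Result: 1. Rationales improve model performance in high-resource settings, but results are inconsistent in low-resource settings;
2. Models trained on rationales perform worse than models trained on additional reports instead;
3. Rationales improve explainability only slightly (measured by token-level rationale coverage).
Insight: 1. If the goal is to optimize accuracy, labeling more reports is more effective than creating rationales;
2. If explainability is the priority, supplementing training data with rationales may help;
3. Automatic methods for assessing rationale quality need further study.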
Abstract: AI-driven clinical text classification is vital for explainable automated retrieval of population-level health information. This work investigates whether human-based clinical rationales can serve as additional supervision to improve both performance and explainability of transformer-based models that automatically encode clinical documents. We analyzed 99,125 human-based clinical rationales that provide plausible explanations for primary cancer site diagnoses, using them as additional training samples alongside 128,649 electronic pathology reports to evaluate transformer-based models for extracting primary cancer sites. We also investigated sufficiency as a way to measure rationale quality for pre-selecting rationales. Our results showed that clinical rationales as additional training data can improve model performance in high-resource scenarios but produce inconsistent behavior when resources are limited. Using sufficiency as an automatic metric to preselect rationales also leads to inconsistent results. Importantly, models trained on rationales were consistently outperformed by models trained on additional reports instead. This suggests that clinical rationales don’t consistently improve model performance and are outperformed by simply using more reports. Therefore, if the goal is optimizing accuracy, annotation efforts should focus on labeling more reports rather than creating rationales. However, if explainability is the priority, training models on rationale-supplemented data may help them better identify rationale-like features. We conclude that using clinical rationales as additional training data results in smaller performance improvements and only slightly better explainability (measured as average token-level rationale coverage) compared to training on additional reports.
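[79] MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Calling in LLM Agent Multi-Turn Conversations cs.CL [PDF]
Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, James A. Burke
TL;DR: The paper proposes MemTool, a framework for managing the short-term memory of LLM agents in multi-turn conversations. Its three modes (Autonomous Agent, Workflow, and Hybrid) dynamically manage tool or MCP server context, improving task completion and tool-removal efficiency.
Details
Motivation: Fixed context windows limit LLM agents in multi-turn interactions, especially when tools must be used repeatedly and independently. MemTool addresses this by managing short-term memory dynamically.
Result: In Autonomous Agent Mode, reasoning LLMs reach 90-94% tool-removal efficiency, while medium-sized models are much lower (0-60%). Workflow and Hybrid modes remove tools consistently well, and Autonomous and Hybrid modes do better on task completion.
Insight: Differences in LLM capability strongly affect tool management efficiency, and the choice of mode depends on the task: it trades off task accuracy, autonomy, and model capability.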
Abstract: Large Language Model (LLM) agents have shown significant autonomous capabilities in dynamically searching and incorporating relevant tools or Model Context Protocol (MCP) servers for individual queries. However, fixed context windows limit effectiveness in multi-turn interactions requiring repeated, independent tool usage. We introduce MemTool, a short-term memory framework enabling LLM agents to dynamically manage tools or MCP server contexts across multi-turn conversations. MemTool offers three agentic architectures: 1) Autonomous Agent Mode, granting full tool management autonomy, 2) Workflow Mode, providing deterministic control without autonomy, and 3) Hybrid Mode, combining autonomous and deterministic control. Evaluating each MemTool mode across 13+ LLMs on the ScaleMCP benchmark, we conducted experiments over 100 consecutive user interactions, measuring tool removal ratios (short-term memory efficiency) and task completion accuracy. In Autonomous Agent Mode, reasoning LLMs achieve high tool-removal efficiency (90-94% over a 3-window average), while medium-sized models exhibit significantly lower efficiency (0-60%). Workflow and Hybrid modes consistently manage tool removal effectively, whereas Autonomous and Hybrid modes excel at task completion. We present trade-offs and recommendations for each MemTool mode based on task accuracy, agency, and model capabilities.
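[80] Towards Locally Deployable Fine-Tuned Causal Large Language Models for Mode Choice Behaviour cs.CL | cs.AI [PDF]
Tareq Alsaleh, Bilal Farooq
TL;DR: The paper presents LiTransMC, a locally deployable causal large language model for travel mode choice prediction that outperforms untuned models and large proprietary systems in predictive accuracy and interpretability.
Details
Motivation: The study explores how to turn general-purpose causal LLMs into specialized, explainable tools for travel behaviour research while preserving privacy, reducing cost, and broadening access through local deployment.
Result: LiTransMC achieves a weighted F1 score of 0.6845 and a Jensen-Shannon Divergence of 0.000245, surpassing untuned models, GPT-4o, and classical methods such as discrete choice models and machine learning classifiers.
Insight: By combining structured behavioural prediction with natural language reasoning, the study shows that specialist, locally deployable LLMs are feasible, providing a new tool for transport policy-making and behavioural research.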
Abstract: This study investigates the adoption of open-access, locally deployable causal large language models (LLMs) for travel mode choice prediction and introduces LiTransMC, the first fine-tuned causal LLM developed for this task. We systematically benchmark eleven LLMs (1-12B parameters) across three stated and revealed preference datasets, testing 396 configurations and generating over 79,000 synthetic commuter predictions. Beyond predictive accuracy, we evaluate model-generated reasoning using BERTopic for topic modelling and a novel Explanation Strength Index, providing the first structured analysis of how LLMs articulate decision factors in alignment with behavioural theory. LiTransMC, fine-tuned using parameter-efficient and loss-masking strategies, achieved a weighted F1 score of 0.6845 and a Jensen-Shannon Divergence of 0.000245, surpassing both untuned local models and larger proprietary systems, including GPT-4o with advanced persona inference and embedding-based loading, while also outperforming classical mode choice methods such as discrete choice models and machine learning classifiers for the same dataset. This dual improvement, i.e., high instance-level accuracy and near-perfect distributional calibration, demonstrates the feasibility of creating specialist, locally deployable LLMs that integrate prediction and interpretability. Through combining structured behavioural prediction with natural language reasoning, this work unlocks the potential for conversational, multi-task transport models capable of supporting agent-based simulations, policy testing, and behavioural insight generation. These findings establish a pathway for transforming general purpose LLMs into specialized, explainable tools for transportation research and policy formulation, while maintaining privacy, reducing cost, and broadening access through local deployment.
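The abstract uses Jensen-Shannon Divergence to measure how closely predicted mode shares match observed ones. A minimal sketch of that metric for two discrete distributions follows, assuming natural logarithms (the abstract does not state the base) and inputs that already sum to one; the function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats.

    Terms with p_i == 0 contribute zero by convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by ln 2 in nats."""
    # The mixture m is positive wherever p or q is, so KL never divides by 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Identical distributions give 0 and fully disjoint ones give ln 2, so a value near zero indicates that aggregate predicted shares closely track the observed shares. Note that some libraries return the Jensen-Shannon distance, the square root of this quantity, so reported values are comparable only once the convention is known.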
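[81] Which LLMs Get the Joke? Probing Non-STEM Reasoning Abilities with HumorBench cs.CL | cs.AI [PDF]
Reuben Narad, Siddharth Suresh, Jiayi Chen, Pine S. L. Dysart-Bricken, Bob Mankoff
TL;DR: HumorBench is a new benchmark for evaluating the reasoning of large language models (LLMs) outside STEM, specifically humor comprehension. Through tasks that require identifying and explaining the joke elements in cartoon captions, it reveals how LLMs perform at humor reasoning and where they fall short.
Details
Motivation: As LLM reasoning saturates benchmarks in mathematics and science, new benchmarks are needed for non-STEM reasoning such as humor comprehension, which requires complex cultural references and wordplay.
Result: LLM reasoning ability in STEM transfers effectively to humor comprehension, but the benefit of more test-time compute varies across models.
Insight: Humor reasoning requires rich cultural knowledge and linguistic skill. That LLMs perform well suggests their reasoning ability transfers across domains.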
Abstract: We present HumorBench, a benchmark designed to evaluate large language models’ (LLMs) ability to reason about and explain sophisticated humor in cartoon captions. As reasoning models increasingly saturate existing benchmarks in mathematics and science, novel and challenging evaluations of model intelligence beyond STEM domains are essential. Reasoning is fundamentally involved in text-based humor comprehension, requiring the identification of connections between concepts in cartoons/captions and external cultural references, wordplays, and other mechanisms. HumorBench includes approximately 300 unique cartoon-caption pairs from the New Yorker Caption Contest and Cartoonstock.com, with expert-annotated evaluation rubrics identifying essential joke elements. LLMs are evaluated based on their explanations towards the humor and abilities in identifying the joke elements. To perform well on this task, models must form and test hypotheses about associations between concepts, potentially backtracking from initial interpretations to arrive at the most plausible explanation. Our extensive benchmarking of current SOTA models reveals three key insights: (1) LLM progress on STEM reasoning transfers effectively to humor comprehension; (2) models trained exclusively on STEM reasoning data still perform well on HumorBench, demonstrating strong transferability of reasoning abilities; and (3) test-time scaling by increasing thinking token budgets yields mixed results across different models in humor reasoning.
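[82] MAGIC: A Multi-Hop and Graph-Based Benchmark for Inter-Context Conflicts in Retrieval-Augmented Generation cs.CL [PDF]
Jungyeon Lee, Kangmin Lee, Taeuk Kim
TL;DR: The paper proposes MAGIC, a knowledge-graph-based benchmark for studying knowledge conflicts in retrieval-augmented generation (RAG) systems, addressing the limited task diversity, conflict types, and interpretability of existing benchmarks.
Details
Motivation: Existing benchmarks for knowledge conflict in RAG systems are limited: they focus narrowly on question answering, rely on entity substitution, and cover few conflict types. A more comprehensive and interpretable benchmark is needed.
Result: Both open-source and proprietary LLMs struggle with conflict detection, especially when multi-hop reasoning is required, and with pinpointing the source of a contradiction.
Insight: LLMs have difficulty integrating information from multiple sources, particularly when conflicts are complex. Future improvements should target conflict detection and source tracing.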
Abstract: Knowledge conflict often arises in retrieval-augmented generation (RAG) systems, where retrieved documents may be inconsistent with one another or contradict the model’s parametric knowledge. Existing benchmarks for investigating the phenomenon have notable limitations, including a narrow focus on the question answering setup, heavy reliance on entity substitution techniques, and a restricted range of conflict types. To address these issues, we propose a knowledge graph (KG)-based framework that generates varied and subtle conflicts between two similar yet distinct contexts, while ensuring interpretability through the explicit relational structure of KGs. Experimental results on our benchmark, MAGIC, provide intriguing insights into the inner workings of LLMs regarding knowledge conflict: both open-source and proprietary models struggle with conflict detection – especially when multi-hop reasoning is required – and often fail to pinpoint the exact source of contradictions. Finally, we present in-depth analyses that serve as a foundation for improving LLMs in integrating diverse, sometimes even conflicting, information.
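[83] Multilingual JobBERT for Cross-Lingual Job Title Matching cs.CL [PDF]
Jens-Joris Decorte, Matthias De Lange, Jeroen Van Hautte
TL;DR: JobBERT-V3 is a contrastive-learning-based multilingual job title matching model that supports English, German, Spanish, and Chinese and outperforms baseline models.
Details
Motivation: Existing job title matching models mostly support a single language, yet multilingual support matters in a globalized labor market. JobBERT-V3 aims to fill this gap.
Result: On the TalentCLEF 2025 benchmark, JobBERT-V3 outperforms baseline models and performs consistently in both monolingual and cross-lingual settings.
Insight: The results point to the potential of synthetic data and multilingual contrastive learning for cross-lingual tasks. The model can also be applied to tasks such as skill ranking.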
Abstract: We introduce JobBERT-V3, a contrastive learning-based model for cross-lingual job title matching. Building on the state-of-the-art monolingual JobBERT-V2, our approach extends support to English, German, Spanish, and Chinese by leveraging synthetic translations and a balanced multilingual dataset of over 21 million job titles. The model retains the efficiency-focused architecture of its predecessor while enabling robust alignment across languages without requiring task-specific supervision. Extensive evaluations on the TalentCLEF 2025 benchmark demonstrate that JobBERT-V3 outperforms strong multilingual baselines and achieves consistent performance across both monolingual and cross-lingual settings. While not the primary focus, we also show that the model can be effectively used to rank relevant skills for a given job title, demonstrating its broader applicability in multilingual labor market intelligence. The model is publicly available: https://huggingface.co/TechWolf/JobBERT-v3.
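[84] Libra: Assessing and Improving Reward Model by Learning to Think cs.CL [PDF]
Meng Zhou, Bei Li, Jiahao Liu, Xiaowen Shi, Yang Bai
TL;DR: The paper proposes Libra, a framework for evaluating and improving reward models in complex reasoning scenarios, and develops the Libra-RM series of generative reward models based on learning to think, which achieve state-of-the-art results.
Details
Motivation: Reward models in current reinforcement learning underperform in complex reasoning scenarios, and training depends on finely annotated reference answers and constrained output formats. This limits RL data scaling and sustained improvement of reasoning performance.
Result: The Libra-RM series achieves state-of-the-art results on multiple benchmarks, and experiments show it can use unlabeled data to further improve reasoning models.
Insight: Learning-to-think methods can substantially improve generative reward models on complex reasoning tasks, and the potential of unlabeled data deserves further exploration.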
Abstract: Reinforcement learning (RL) has significantly improved the reasoning ability of large language models. However, current reward models underperform in challenging reasoning scenarios and predominant RL training paradigms rely on rule-based or reference-based rewards, which impose two critical limitations: 1) the dependence on finely annotated reference answer to attain rewards; and 2) the requirement for constrained output format. These limitations fundamentally hinder further RL data scaling and sustained enhancement of model reasoning performance. To address these limitations, we propose a comprehensive framework for evaluating and improving the performance of reward models in complex reasoning scenarios. We first present a reasoning-oriented benchmark (Libra Bench), systematically constructed from a diverse collection of challenging mathematical problems and advanced reasoning models, to address the limitations of existing reward model benchmarks in reasoning scenarios. We further introduce a novel approach for improving the generative reward model via learning-to-think methodologies. Based on the proposed approach, we develop Libra-RM series, a collection of generative reward models with reasoning capabilities that achieve state-of-the-art results on various benchmarks. Comprehensive downstream experiments are conducted and the experimental results demonstrate the correlation between our Libra Bench and downstream application, and the potential of Libra-RM to further improve reasoning models with unlabeled data.
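[85] UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases cs.CL [PDF]
Raj Vardhan Tomar, Preslav Nakov, Yuxia Wang
TL;DR: UnsafeChain is a dataset for improving the safety of reasoning models. It targets harmful outputs elicited by hard prompts: by exposing unsafe behavior and guiding its correction, it improves safety while preserving reasoning ability. Experiments show that UnsafeChain outperforms existing datasets.
Details
Motivation: As large reasoning models grow more capable, chain-of-thought (CoT) reasoning introduces new safety challenges. Existing work focuses mostly on prompts with safe responses and overlooks hard prompts, leaving harmful outputs unaddressed.
Result: Across six out-of-distribution and five in-distribution benchmarks, UnsafeChain outperforms SafeChain and STAR-1, and even a 1K subset matches or surpasses baseline performance.
Insight: Correction-based supervision is an effective and generalizable approach to safety alignment, and a small amount of high-quality data can substantially improve model safety.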
Abstract: As large reasoning models (LRMs) grow more capable, chain-of-thought (CoT) reasoning introduces new safety challenges. Existing SFT-based safety alignment studies dominantly focused on filtering prompts with safe, high-quality responses, while overlooking hard prompts that always elicit harmful outputs. To fill this gap, we introduce UnsafeChain, a safety alignment dataset constructed from hard prompts with diverse sources, where unsafe completions are identified and explicitly corrected into safe responses. By exposing models to unsafe behaviors and guiding their correction, UnsafeChain enhances safety while preserving general reasoning ability. We fine-tune three LRMs on UnsafeChain and compare them against recent SafeChain and STAR-1 across six out-of-distribution and five in-distribution benchmarks. UnsafeChain consistently outperforms prior datasets, with even a 1K subset matching or surpassing baseline performance, demonstrating the effectiveness and generalizability of correction-based supervision. We release our dataset and code at https://github.com/mbzuai-nlp/UnsafeChain
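[86] Adversarial Defence without Adversarial Defence: Enhancing Language Model Robustness via Instance-level Principal Component Removal cs.CL [PDF]
Yang Wang, Chenghao Xiao, Yizhi Li, Stuart E. Middleton, Noura Al Moubayed
TL;DR: The paper proposes a simple but effective method that improves the adversarial robustness of pre-trained language models (PLMs) by removing instance-level principal components, without conventional adversarial defences or data perturbation.
Details
Motivation: Pre-trained language models are vulnerable to adversarial attacks, and conventional adversarial defences are computationally expensive. The authors therefore propose a low-cost alternative that needs no adversarial training.
Result: Experiments on eight benchmark datasets show that the method improves adversarial robustness while keeping accuracy comparable to baselines.
Insight: A simple transformation of the embedding space that aligns embedding distributions can improve robustness while avoiding the high cost of conventional adversarial training.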
Abstract: Pre-trained language models (PLMs) have driven substantial progress in natural language processing but remain vulnerable to adversarial attacks, raising concerns about their robustness in real-world applications. Previous studies have sought to mitigate the impact of adversarial attacks by introducing adversarial perturbations into the training process, either implicitly or explicitly. While both strategies enhance robustness, they often incur high computational costs. In this work, we propose a simple yet effective add-on module that enhances the adversarial robustness of PLMs by removing instance-level principal components, without relying on conventional adversarial defences or perturbing the original training data. Our approach transforms the embedding space to approximate Gaussian properties, thereby reducing its susceptibility to adversarial perturbations while preserving semantic relationships. This transformation aligns embedding distributions in a way that minimises the impact of adversarial noise on decision boundaries, enhancing robustness without requiring adversarial examples or costly training-time augmentation. Evaluations on eight benchmark datasets show that our approach improves adversarial robustness while maintaining comparable before-attack accuracy to baselines, achieving a balanced trade-off between robustness and generalisation.
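One way to picture instance-level principal component removal is sketched below: for a single input, centre its token embeddings, estimate the top principal direction by power iteration, and project that direction out of every token. This is an illustrative assumption about the operation, not the authors' module; the centring step, the choice to remove exactly one component, and all names are hypothetical.

```python
import math


def top_principal_direction(rows, iters=200):
    """Estimate the leading right singular vector of centred rows X
    by power iteration on X^T X."""
    dim = len(rows[0])
    # Non-uniform start vector; it must not be orthogonal to the answer.
    v = [float(d + 1) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        proj = [sum(r[d] * v[d] for d in range(dim)) for r in rows]
        w = [sum(p * r[d] for p, r in zip(proj, rows)) for d in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate input: no variance along v
            break
        v = [x / norm for x in w]
    return v


def remove_top_component(token_embeddings):
    """Centre one instance's token embeddings and project out the
    top principal direction. Returns (cleaned_rows, direction)."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    mean = [sum(t[d] for t in token_embeddings) / n for d in range(dim)]
    centred = [[t[d] - mean[d] for d in range(dim)] for t in token_embeddings]
    v = top_principal_direction(centred)
    cleaned = []
    for row in centred:
        coeff = sum(row[d] * v[d] for d in range(dim))
        cleaned.append([row[d] - coeff * v[d] for d in range(dim)])
    return cleaned, v
```

Because the direction is computed per input rather than from corpus-wide statistics, no adversarial examples or extra training data are involved, which is consistent with the add-on framing in the abstract. A practical version would use a vectorized SVD instead of pure-Python loops.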
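[87] AgriEval: A Comprehensive Chinese Agricultural Benchmark for Large Language Models cs.CL [PDF]
Lian Yan, Haotian Wang, Chen Tang, Haifeng Liu, Tianyang Sun
TL;DR: AgriEval is a comprehensive Chinese agricultural benchmark that fills the lack of evaluation data for large language models in agriculture. It covers multiple cognitive scenarios at large scale; experiments show that existing models fall short, and the paper proposes strategies for improvement.
Details
Motivation: Agriculture lacks dedicated LLM training data and evaluation benchmarks, which holds back development and deployment. A natural, high-quality benchmark is needed to advance agricultural LLMs.
Result: Most existing LLMs score below 60% accuracy on AgriEval, showing the shortfall of current models in agriculture and the room for improvement.
Insight: AgriEval fills the evaluation gap for agricultural LLMs and exposes weaknesses in knowledge application and expert-like decision-making, pointing to directions and strategies for future research.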
Abstract: In the agricultural domain, the deployment of large language models (LLMs) is hindered by the lack of training data and evaluation benchmarks. To mitigate this issue, we propose AgriEval, the first comprehensive Chinese agricultural benchmark with three main characteristics: (1) Comprehensive Capability Evaluation. AgriEval covers six major agriculture categories and 29 subcategories within agriculture, addressing four core cognitive scenarios: memorization, understanding, inference, and generation. (2) High-Quality Data. The dataset is curated from university-level examinations and assignments, providing a natural and robust benchmark for assessing the capacity of LLMs to apply knowledge and make expert-like decisions. (3) Diverse Formats and Extensive Scale. AgriEval comprises 14,697 multiple-choice questions and 2,167 open-ended question-and-answer questions, establishing it as the most extensive agricultural benchmark available to date. We also present comprehensive experimental results over 51 open-source and commercial LLMs. The experimental results reveal that most existing LLMs struggle to achieve 60% accuracy, underscoring the developmental potential in agricultural LLMs. Additionally, we conduct extensive experiments to investigate factors influencing model performance and propose strategies for enhancement. AgriEval is available at https://github.com/YanPioneer/AgriEval/.
[88] HRIPBench: Benchmarking LLMs in Harm Reduction Information Provision to Support People Who Use Drugs cs.CL | cs.CYPDF
Kaixuan Wang, Chenxin Diao, Jason T. Jacques, Zhongliang Guo, Shuai Zhao
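TL;DR: HRIPBench is a benchmark for evaluating the accuracy and safety risks of large language models (LLMs) when providing harm reduction information. It covers three tasks, and the results show that current LLMs still have serious problems.
Details
Motivation: Substance use threatens the health of millions of people, and harm reduction as a public health strategy needs technical support. LLMs show promising medical knowledge, but their actual performance on relevant tasks is understudied and calls for systematic evaluation.
Result: State-of-the-art LLMs are not accurate enough when providing harm reduction information and can pose severe safety risks to people who use drugs, so their use should be cautiously constrained.
Insight: Applying LLMs in public health requires rigorous validation to avoid negative health outcomes caused by incorrect information. The benchmark design could be extended to evaluation in other sensitive domains.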
Abstract: The well-being of millions of individuals is challenged by the harms of substance use. Harm reduction as a public health strategy is designed to improve their health outcomes and reduce safety risks. Some large language models (LLMs) have demonstrated a decent level of medical knowledge, promising to address the information needs of people who use drugs (PWUD). However, their performance in relevant tasks remains largely unexplored. We introduce HRIPBench, a benchmark designed to evaluate LLMs’ accuracy and safety risks in harm reduction information provision. The benchmark dataset HRIP-Basic has 2,160 question-answer-evidence pairs. The scope covers three tasks: checking safety boundaries, providing quantitative values, and inferring polysubstance use risks. We build the Instruction and RAG schemes to evaluate model behaviours based on their inherent knowledge and the integration of domain knowledge. Our results indicate that state-of-the-art LLMs still struggle to provide accurate harm reduction information and sometimes pose severe safety risks to PWUD. The use of LLMs in harm reduction contexts should be cautiously constrained to avoid inducing negative health outcomes. WARNING: This paper contains illicit content that potentially induces harms.
[89] Modelling Adjectival Modification Effects on Semantic Plausibility cs.CLPDF
Anna Golub, Beate Zywietz, Annerose Eichel
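TL;DR: The paper studies how adjectival modification affects semantic plausibility and proposes a new approach based on sentence transformers. It finds that this approach underperforms models such as RoBERTa and stresses the importance of balanced evaluation.
Details
Motivation: The work aims to understand how event modification changes semantic plausibility, which matters for tasks such as dialogue generation, commonsense reasoning, and hallucination detection.
Result: Sentence transformers perform poorly on the task, even worse than RoBERTa, and imbalanced evaluation distorts both model performance and metrics.
Insight: The task needs a more balanced evaluation method, and modelling the plausibility effects of modification is important for natural language processing tasks.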
Abstract: While the task of assessing the plausibility of events such as “news is relevant” has been addressed by a growing body of work, less attention has been paid to capturing changes in plausibility as triggered by event modification. Understanding changes in plausibility is relevant for tasks such as dialogue generation, commonsense reasoning, and hallucination detection, as it allows one to correctly model, for example, “gentle sarcasm” as a sign of closeness rather than unkindness among friends [9]. In this work, we tackle the ADEPT challenge benchmark [6] consisting of 16K English sentence pairs differing by exactly one adjectival modifier. Our modeling experiments provide a conceptually novel method by using sentence transformers, and reveal that both they and transformer-based models struggle with the task at hand, and that sentence transformers, despite their conceptual alignment with the task, even under-perform in comparison to models like RoBERTa. Furthermore, an in-depth comparison with prior work highlights the importance of a more realistic, balanced evaluation method: imbalances distort model performance and evaluation metrics, and weaken result trustworthiness.
[90] Introducing HALC: A general pipeline for finding optimal prompting strategies for automated coding with LLMs in the computational social sciences cs.CL | cs.AI | cs.LGPDF
Andreas Reich, Claudia Thoms, Tobias Schrimpf
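TL;DR: The paper proposes HALC, a general pipeline for finding optimal prompting strategies for LLM-based automated coding in the computational social sciences, and validates it experimentally.
Details
Motivation: LLMs are widely used for task automation, including automated coding in the social sciences, but the effectiveness of prompting strategies varies across models and tasks and usually depends on trial and error. The paper aims to provide a systematic and reliable way to construct optimal prompts.
Result: With the Mistral NeMo model, the selected prompts code reliably for single variables (α = 0.76 to 0.78) and across two variables (α = 0.71 to 0.74).
Insight: The effectiveness of a prompting strategy depends on the task and the model. HALC identifies reliable prompts for a given task and model without optimizing the codebook for LLM friendliness.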
Abstract: LLMs are seeing widespread use for task automation, including automated coding in the social sciences. However, even though researchers have proposed different prompting strategies, their effectiveness varies across LLMs and tasks. Often trial and error practices are still widespread. We propose HALC, a general pipeline that allows for the systematic and reliable construction of optimal prompts for any given coding task and model, permitting the integration of any prompting strategy deemed relevant. To investigate LLM coding and validate our pipeline, we sent a total of 1,512 individual prompts to our local LLMs in over two million requests. We test prompting strategies and LLM task performance based on few expert codings (ground truth). When compared to these expert codings, we find prompts that code reliably for single variables ($\alpha_{\text{climate}} = .76$; $\alpha_{\text{movement}} = .78$) and across two variables ($\alpha_{\text{climate}} = .71$; $\alpha_{\text{movement}} = .74$) using the LLM Mistral NeMo. Our prompting strategies are set up in a way that aligns the LLM to our codebook; we are not optimizing our codebook for LLM friendliness. Our paper provides insights into the effectiveness of different prompting strategies, crucial influencing factors, and the identification of reliable prompts for each coding task and model.
[91] AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement Learning cs.CLPDF
Yifan Wei, Xiaoyan Yu, Yixuan Weng, Tengfei Pan, Angsheng Li
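TL;DR: AutoTIR is a reinforcement learning framework that lets large language models (LLMs) decide autonomously, during reasoning, whether to invoke an external tool and which one. It significantly improves the performance and generalization of tool-integrated reasoning (TIR).
Details
Motivation: Existing TIR methods usually rely on predefined tool-use patterns, which can degrade a language model's core competence. Inspired by the way humans adaptively select tools, AutoTIR uses reinforcement learning to optimize the tool-use strategy dynamically.
Result: Across knowledge-intensive, mathematical, and general language modeling tasks, AutoTIR significantly outperforms the baselines and shows better tool-use behavior and generalization.
Insight: Reinforcement learning can effectively improve the autonomy and generalization of LLMs in tool-integrated reasoning, offering a route towards more general TIR capabilities.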
Abstract: Large Language Models (LLMs), when enhanced through reasoning-oriented post-training, evolve into powerful Large Reasoning Models (LRMs). Tool-Integrated Reasoning (TIR) further extends their capabilities by incorporating external tools, but existing methods often rely on rigid, predefined tool-use patterns that risk degrading core language competence. Inspired by the human ability to adaptively select tools, we introduce AutoTIR, a reinforcement learning framework that enables LLMs to autonomously decide whether and which tool to invoke during the reasoning process, rather than following static tool-use strategies. AutoTIR leverages a hybrid reward mechanism that jointly optimizes for task-specific answer correctness, structured output adherence, and penalization of incorrect tool usage, thereby encouraging both precise reasoning and efficient tool integration. Extensive evaluations across diverse knowledge-intensive, mathematical, and general language modeling tasks demonstrate that AutoTIR achieves superior overall performance, significantly outperforming baselines and exhibiting superior generalization in tool-use behavior. These results highlight the promise of reinforcement learning in building truly generalizable and scalable TIR capabilities in LLMs. The code and data are available at https://github.com/weiyifan1023/AutoTIR.
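The abstract names three reward terms (answer correctness, structured-output adherence, and a penalty for incorrect tool usage) but gives no formula. The sketch below shows one plausible way to combine them. The `Rollout` fields, the weights, and the rule that a tool call is "incorrect" when it disagrees with a `tool_needed` label are all hypothetical choices for illustration, not the paper's definition.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Outcome of one sampled reasoning trajectory (hypothetical fields)."""
    answer_correct: bool   # final answer matches the reference
    well_formatted: bool   # output follows the required structure
    tool_needed: bool      # assumed label: does this task benefit from a tool?
    tool_called: bool      # did the model actually invoke a tool?


def hybrid_reward(r: Rollout,
                  w_answer: float = 1.0,
                  w_format: float = 0.2,
                  w_tool_penalty: float = 0.5) -> float:
    """Combine the three signals into one scalar; weights are illustrative."""
    reward = 0.0
    if r.answer_correct:
        reward += w_answer
    if r.well_formatted:
        reward += w_format
    # Penalise both unnecessary calls and missing calls.
    if r.tool_called != r.tool_needed:
        reward -= w_tool_penalty
    return reward
```

Under these assumed weights, a correct, well-formatted answer that makes an unnecessary tool call scores lower than the same answer without the call, which is the kind of pressure towards selective tool use the abstract describes.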
[92] Graph-R1: Towards Agentic GraphRAG Framework via End-to-end Reinforcement Learning cs.CLPDF
Haoran Luo, Haihong E, Guanting Chen, Qika Lin, Yikai Guo
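TL;DR: Graph-R1 is an agentic GraphRAG framework trained with end-to-end reinforcement learning. It combines lightweight knowledge hypergraph construction with multi-turn interaction to improve retrieval, and it outperforms traditional methods in reasoning accuracy, retrieval efficiency, and generation quality.
Details
Motivation: GraphRAG improves RAG by modeling knowledge as entity-relation graphs, but it still suffers from high construction cost, fixed one-time retrieval, and reliance on long-context reasoning and prompt design. Graph-R1 aims to address these problems with reinforcement learning.
Result: On standard RAG datasets, Graph-R1 outperforms traditional GraphRAG and RL-enhanced RAG methods.
Insight: Combining reinforcement learning with structured knowledge can substantially improve retrieval and generation while reducing the dependence on fixed retrieval and prompt design.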
Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucination in LLMs by incorporating external knowledge, but relies on chunk-based retrieval that lacks structural semantics. GraphRAG methods improve RAG by modeling knowledge as entity-relation graphs, but still face challenges in high construction cost, fixed one-time retrieval, and reliance on long-context reasoning and prompt design. To address these challenges, we propose Graph-R1, an agentic GraphRAG framework via end-to-end reinforcement learning (RL). It introduces lightweight knowledge hypergraph construction, models retrieval as a multi-turn agent-environment interaction, and optimizes the agent process via an end-to-end reward mechanism. Experiments on standard RAG datasets show that Graph-R1 outperforms traditional GraphRAG and RL-enhanced RAG methods in reasoning accuracy, retrieval efficiency, and generation quality.
[93] Post-Training Large Language Models via Reinforcement Learning from Self-Feedback cs.CL | cs.AIPDF
Carel van Niekerk, Renato Vukovic, Benjamin Matthias Ruppik, Hsien-chin Lin, Milica Gašić
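TL;DR: The paper proposes RLSF, a post-training method for large language models based on reinforcement learning from self-feedback. It uses the model's own confidence as an intrinsic reward and improves calibration and reasoning without external feedback.
Details
Motivation: Large language models often produce poorly calibrated answers on reasoning-intensive tasks, which limits their reliability. The paper proposes using the model's own confidence as an intrinsic reward, mimicking how humans learn in the absence of external feedback.
Result: RLSF improves performance on arithmetic reasoning and multiple-choice question answering while also improving calibration.
Insight: Turning a model's own uncertainty into useful self-feedback is a principled, data-efficient post-training approach, and it motivates further research on intrinsic rewards for LLM post-training.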
Abstract: Large Language Models (LLMs) often produce plausible but poorly-calibrated answers, limiting their reliability on reasoning-intensive tasks. We present Reinforcement Learning from Self-Feedback (RLSF), a post-training stage that uses the model’s own confidence as an intrinsic reward, mimicking how humans learn in the absence of external feedback. After a frozen LLM generates several chain-of-thought solutions, we define and compute the confidence of each final answer span and rank the traces accordingly. These synthetic preferences are then used to fine-tune the policy with standard preference optimization, similar to RLHF yet requiring no human labels, gold answers, or externally curated rewards. RLSF simultaneously (i) refines the model’s probability estimates – restoring well-behaved calibration – and (ii) strengthens step-by-step reasoning, yielding improved performance on arithmetic reasoning and multiple-choice question answering. By turning a model’s own uncertainty into useful self-feedback, RLSF affirms reinforcement learning on intrinsic model behaviour as a principled and data-efficient component of the LLM post-training pipeline and warrants further research in intrinsic rewards for LLM post-training.
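The ranking step described above can be sketched as follows. The abstract does not say how the answer-span confidence is defined, so this example assumes it is the geometric-mean token probability of the answer span. The names `span_confidence` and `build_preference_pairs` are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def span_confidence(answer_token_logprobs: Sequence[float]) -> float:
    """Geometric-mean probability of the answer tokens (assumed measure)."""
    return math.exp(sum(answer_token_logprobs) / len(answer_token_logprobs))


def build_preference_pairs(
    traces: List[Tuple[str, Sequence[float]]]
) -> List[Tuple[str, str]]:
    """Turn sampled traces into (chosen, rejected) pairs ranked by confidence.

    Each trace is (trace_text, log-probs of its final answer tokens).
    """
    scored = sorted(
        ((span_confidence(logprobs), text) for text, logprobs in traces),
        reverse=True,
    )
    # combinations() preserves the sorted order, so the first element of each
    # pair never has lower confidence than the second; ties are skipped.
    return [
        (hi_text, lo_text)
        for (hi_conf, hi_text), (lo_conf, lo_text) in combinations(scored, 2)
        if hi_conf > lo_conf
    ]
```

The resulting pairs could then be passed to any standard preference-optimization trainer; that step is outside this sketch.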
[94] DeepSieve: Information Sieving via LLM-as-a-Knowledge-Router cs.CLPDF
Minghao Guo, Qingcheng Zeng, Xujiang Zhao, Yanchi Liu, Wenchao Yu
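TL;DR: DeepSieve is an LLM-based agentic framework that improves RAG through information sieving and multi-stage distillation, addressing the noisy retrieval and shallow reasoning of existing methods.
Details
Motivation: Existing RAG methods lack fine-grained control over both the query and the knowledge sources, which leads to noisy retrieval and shallow reasoning. DeepSieve aims to improve performance by using an LLM as a knowledge router.
Result: On multi-hop QA tasks over heterogeneous knowledge sources, DeepSieve improves on conventional RAG methods in reasoning depth, retrieval precision, and interpretability.
Insight: An agentic design that uses the LLM as a knowledge router can markedly improve retrieval and reasoning, and its modular, transparent design supports broad applicability.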
Abstract: Large Language Models (LLMs) excel at many reasoning tasks but struggle with knowledge-intensive queries due to their inability to dynamically access up-to-date or domain-specific information. Retrieval-Augmented Generation (RAG) has emerged as a promising solution, enabling LLMs to ground their responses in external sources. However, existing RAG methods lack fine-grained control over both the query and source sides, often resulting in noisy retrieval and shallow reasoning. In this work, we introduce DeepSieve, an agentic RAG framework that incorporates information sieving via LLM-as-a-knowledge-router. DeepSieve decomposes complex queries into structured sub-questions and recursively routes each to the most suitable knowledge source, filtering irrelevant information through a multi-stage distillation process. Our design emphasizes modularity, transparency, and adaptability, leveraging recent advances in agentic system design. Experiments on multi-hop QA tasks across heterogeneous sources demonstrate improved reasoning depth, retrieval precision, and interpretability over conventional RAG approaches.
quant-ph [Back]
[95] Supervised Quantum Image Processing quant-ph | cs.AI | cs.CV | cs.LG | 81P68, 81P70, 81P40, 68Q12, 68T01 | I.2; I.4; J.2PDF
Marco Parigi, Mehran Khosrojerdi, Filippo Caruso, Leonardo Banchi
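TL;DR: The paper compares the compression performance of four quantum image representations (QImRs) and finds that FRQI outperforms TNR, NEQR, and QPIE. It also shows that quantum kernels match the classical linear kernel on binary classification while requiring fewer storage resources.
Details
Motivation: In the era of big data and artificial intelligence, growing data volumes and increasingly complex computations are driving the development of quantum image processing (QIP), which seeks efficiency gains from quantum computing.
Result: FRQI achieves the best compression. Quantum kernels give classification accuracy comparable to the classical linear kernel while requiring exponentially fewer resources for image storage.
Insight: Quantum image processing shows promise for compression and classification, suggesting that quantum computing may offer new, efficient approaches to image processing.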
Abstract: In the era of big data and artificial intelligence, the increasing volume of data and the demand to solve more and more complex computational challenges are two driving forces for improving the efficiency of data storage, processing and analysis. Quantum image processing (QIP) is an interdisciplinary field between quantum information science and image processing, which has the potential to alleviate some of these challenges by leveraging the power of quantum computing. In this work, we compare and examine the compression properties of four different Quantum Image Representations (QImRs): namely, Tensor Network Representation (TNR), Flexible Representation of Quantum Image (FRQI), Novel Enhanced Quantum Representation (NEQR), and Quantum Probability Image Encoding (QPIE). Our simulations show that FRQI performs a higher compression of image information than TNR, NEQR, and QPIE. Furthermore, we investigate the trade-off between accuracy and memory in binary classification problems, evaluating the performance of quantum kernels based on QImRs compared to the classical linear kernel. Our results indicate that quantum kernels provide comparable average classification accuracy but require exponentially fewer resources for image storage.
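To make the storage claim concrete, the sketch below builds the classical amplitude vector of an FRQI state using the standard FRQI definition: pixel i with intensity in [0, 1] is mapped to an angle θ_i in [0, π/2], and the state is (1/√N) Σ_i (cos θ_i |0⟩ + sin θ_i |1⟩) ⊗ |i⟩. The function names and the colour-qubit-first ordering are choices made for this example and are not taken from the paper.

```python
import math
from typing import List, Sequence


def frqi_amplitudes(pixels: Sequence[float]) -> List[float]:
    """Amplitude vector of the FRQI state for a flattened grayscale image.

    pixels: intensities in [0, 1]; the length N must be a power of two.
    Index layout (assumed): colour qubit first, so amplitude[c * N + i]
    belongs to colour bit c and pixel position i.
    """
    n = len(pixels)
    if n == 0 or n & (n - 1):
        raise ValueError("number of pixels must be a power of two")
    norm = 1.0 / math.sqrt(n)
    angles = [p * math.pi / 2 for p in pixels]
    cos_part = [norm * math.cos(t) for t in angles]  # colour qubit |0>
    sin_part = [norm * math.sin(t) for t in angles]  # colour qubit |1>
    return cos_part + sin_part


def frqi_qubits(num_pixels: int) -> int:
    """Qubits needed: log2(N) position qubits plus one colour qubit."""
    return (num_pixels - 1).bit_length() + 1
```

For a four-pixel image this gives eight amplitudes held in three qubits, which illustrates the logarithmic growth in qubit count with image size.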
cs.LG [Back]
[96] R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning cs.LG | cs.AI | cs.CLPDF
Zhuokun Chen, Zeren Chen, Jiahao He, Mingkui Tan, Jianfei Cai
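TL;DR: R-Stitch is a confidence-based hybrid decoding framework that switches dynamically between a small language model (SLM) and a large language model (LLM) during inference, substantially speeding up chain-of-thought (CoT) reasoning.
Details
Motivation: CoT reasoning is effective but computationally expensive because it relies on autoregressive decoding over long sequences. Existing acceleration methods such as speculative decoding give limited speedup when the small and large models rarely agree, and they do not exploit the small model's ability to produce concise intermediate reasoning.
Result: On math reasoning benchmarks, R-Stitch reduces inference latency by up to 85% with a negligible drop in accuracy.
Insight: By switching models dynamically, R-Stitch greatly improves efficiency while preserving answer quality, making it a practical and general way to accelerate CoT.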
Abstract: Chain-of-thought (CoT) reasoning enhances the problem-solving capabilities of large language models by encouraging step-by-step intermediate reasoning during inference. While effective, CoT introduces substantial computational overhead due to its reliance on autoregressive decoding over long token sequences. Existing acceleration strategies either reduce sequence length through early stopping or compressive reward designs, or improve decoding speed via speculative decoding with smaller models. However, speculative decoding suffers from limited speedup when the agreement between small and large models is low, and fails to exploit the potential advantages of small models in producing concise intermediate reasoning. In this paper, we present R-Stitch, a token-level, confidence-based hybrid decoding framework that accelerates CoT inference by switching between a small language model (SLM) and a large language model (LLM) along the reasoning trajectory. R-Stitch uses the SLM to generate tokens by default and delegates to the LLM only when the SLM’s confidence falls below a threshold. This design avoids full-sequence rollback and selectively invokes the LLM on uncertain steps, preserving both efficiency and answer quality. R-Stitch is model-agnostic, training-free, and compatible with standard decoding pipelines. Experiments on math reasoning benchmarks demonstrate that R-Stitch achieves up to 85% reduction in inference latency with negligible accuracy drop, highlighting its practical effectiveness in accelerating CoT reasoning.
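The default-to-SLM, delegate-on-low-confidence rule can be sketched as a simple decoding loop. This is a simplified illustration rather than the authors' code: the step functions are hypothetical callables returning `(token, confidence)`, the threshold of 0.8 is arbitrary, and the loop re-checks the SLM at every step.

```python
from typing import Callable, List, Sequence, Tuple

# A step function maps the tokens generated so far to (next_token, confidence).
StepFn = Callable[[List[str]], Tuple[str, float]]


def stitch_decode(slm_step: StepFn,
                  llm_step: StepFn,
                  prompt: Sequence[str],
                  threshold: float = 0.8,
                  max_tokens: int = 32,
                  eos: str = "<eos>") -> Tuple[List[str], List[str]]:
    """Generate tokens with the SLM, falling back to the LLM when unsure.

    Returns the generated tokens and, for each one, which model produced it.
    """
    tokens = list(prompt)
    sources: List[str] = []
    for _ in range(max_tokens):
        token, confidence = slm_step(tokens)
        source = "slm"
        if confidence < threshold:
            # Low SLM confidence: discard its proposal and ask the LLM for
            # this single step. Earlier tokens are kept, so no rollback.
            token, _ = llm_step(tokens)
            source = "llm"
        tokens.append(token)
        sources.append(source)
        if token == eos:
            break
    return tokens[len(prompt):], sources
```

The `sources` list makes it easy to measure what fraction of tokens came from each model, which is what determines the latency saving in such a scheme.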
cs.AI [Back]
[97] Teaching Language Models To Gather Information Proactively cs.AI | cs.CLPDF
Tenghao Huang, Sihao Chen, Muhao Chen, Jonathan May, Longqi Yang
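TL;DR: The paper proposes a new task paradigm that teaches language models (LLMs) to proactively gather missing information in order to solve problems collaboratively, and it markedly improves model performance with a reinforcement finetuning method.
Details
Motivation: Existing LLMs tend to respond passively to incomplete or ambiguous prompts and fail to gather key information proactively, which limits solution quality. The paper aims to improve proactive information gathering in complex dialogue.
Result: (1) On automatic evaluation metrics, the trained model outperforms o3-mini by 18%. (2) In human evaluation, annotators favor the model's clarification questions by 42% and its final outlines by 28%.
Insight: Reinforcement learning can effectively train LLMs to ask questions proactively, improving their performance on real collaborative tasks and moving them from passive text generators towards genuine collaborative partners.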
Abstract: Large language models (LLMs) are increasingly expected to function as collaborative partners, engaging in back-and-forth dialogue to solve complex, ambiguous problems. However, current LLMs often falter in real-world settings, defaulting to passive responses or narrow clarifications when faced with incomplete or under-specified prompts, falling short of proactively gathering the missing information that is crucial for high-quality solutions. In this work, we introduce a new task paradigm: proactive information gathering, where LLMs must identify gaps in the provided context and strategically elicit implicit user knowledge through targeted questions. To systematically study and train this capability, we design a scalable framework that generates partially specified, real-world tasks, masking key information and simulating authentic ambiguity. Within this setup, our core innovation is a reinforcement finetuning strategy that rewards questions that elicit genuinely new, implicit user information – such as hidden domain expertise or fine-grained requirements – that would otherwise remain unspoken. Experiments demonstrate that our trained Qwen-2.5-7B model significantly outperforms o3-mini by 18% on automatic evaluation metrics. More importantly, human evaluation reveals that clarification questions and final outlines generated by our model are favored by human annotators by 42% and 28% respectively. Together, these results highlight the value of proactive clarification in elevating LLMs from passive text generators to genuinely collaborative thought partners.
[98] What Does it Mean for a Neural Network to Learn a “World Model”? cs.AI | cs.CLPDF
Kenneth Li, Fernanda Viégas, Martin Wattenberg
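TL;DR: The paper proposes a set of precise criteria for deciding whether a neural network learns and uses a “world model”, with the goal of giving experimental research a common language.
Details
Motivation: The term “world model” is widely but informally used for neural networks and lacks a clear definition and standard, so a set of operational criteria is needed.
Result: The paper reports no experimental results; its theoretical framework lays the groundwork for future experiments.
Insight: Only rigorous criteria can distinguish a network that has genuinely learned a “world model” from one whose apparent model is merely a by-product of its data or task.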
Abstract: We propose a set of precise criteria for saying a neural net learns and uses a “world model.” The goal is to give an operational meaning to terms that are often used informally, in order to provide a common language for experimental investigation. We focus specifically on the idea of representing a latent “state space” of the world, leaving modeling the effect of actions to future work. Our definition is based on ideas from the linear probing literature, and formalizes the notion of a computation that factors through a representation of the data generation process. An essential addition to the definition is a set of conditions to check that such a “world model” is not a trivial consequence of the neural net’s data or task.
[99] UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding cs.AI | cs.CL | cs.CVPDF
Shuquan Lian, Yuhang Wu, Jia Ma, Zihan Song, Bingqi Chen
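TL;DR: UI-AGILE improves GUI agents through better reward functions in the reinforcement learning training stage and a decomposed grounding technique at inference time, achieving state-of-the-art results on benchmarks.
Details
Motivation: Current training and inference methods for GUI agents fall short in reasoning design, reward mechanisms, and handling of visual noise. UI-AGILE aims to address these problems.
Result: UI-AGILE achieves state-of-the-art performance on ScreenSpot-Pro and ScreenSpot-v2; on ScreenSpot-Pro, combining the training and inference enhancements improves grounding accuracy by 23% over the best baseline.
Insight: Better reward mechanisms and image processing strategies can substantially improve GUI agents, especially on complex and high-resolution tasks.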
Abstract: The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities. Nevertheless, existing GUI agent training and inference techniques still suffer from a dilemma for reasoning designs, ineffective reward, and visual noise. To address these issues, we introduce UI-AGILE, a comprehensive framework enhancing GUI agents at both the training and inference stages. For training, we propose a suite of improvements to the Supervised Fine-Tuning (SFT) process: 1) a Continuous Reward function to incentivize high-precision grounding; 2) a “Simple Thinking” reward to balance planning with speed and grounding accuracy; and 3) a Cropping-based Resampling strategy to mitigate the sparse reward problem and improve learning on complex tasks. For inference, we present Decomposed Grounding with Selection, a novel method that dramatically improves grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Experiments show that UI-AGILE achieves state-of-the-art performance on two benchmarks, ScreenSpot-Pro and ScreenSpot-v2. For instance, using both our proposed training and inference enhancement methods brings a 23% grounding accuracy improvement over the best baseline on ScreenSpot-Pro.
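The inference-time idea of breaking a high-resolution screenshot into parts can be sketched as tiling plus a coordinate mapping back to the full image. The abstract does not describe the selection step, so this example assumes each tile yields a candidate point with a score and the highest score wins. `ground_tile` is a hypothetical stand-in for a per-tile grounding model call.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]                   # (left, top, right, bottom)
TileResult = Optional[Tuple[float, float, float]]  # (x, y, score), tile-local


def tile_boxes(width: int, height: int, tile_w: int, tile_h: int) -> List[Box]:
    """Cover the screenshot with non-overlapping tiles, clipped at the edges."""
    return [
        (left, top, min(left + tile_w, width), min(top + tile_h, height))
        for top in range(0, height, tile_h)
        for left in range(0, width, tile_w)
    ]


def decomposed_grounding(width: int, height: int, tile_w: int, tile_h: int,
                         ground_tile: Callable[[Box], TileResult]
                         ) -> Optional[Tuple[float, float]]:
    """Ground on every tile, then return the best point in full-image coordinates."""
    best: Optional[Tuple[float, float, float]] = None  # (score, x, y)
    for box in tile_boxes(width, height, tile_w, tile_h):
        result = ground_tile(box)
        if result is None:
            continue  # this tile produced no candidate
        x, y, score = result
        # Shift tile-local coordinates by the tile's top-left corner.
        candidate = (score, box[0] + x, box[1] + y)
        if best is None or candidate[0] > best[0]:
            best = candidate
    return None if best is None else (best[1], best[2])
```

A production variant would likely use overlapping tiles so that targets straddling a tile boundary are not cut in half; that refinement is omitted here.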
[100] UserBench: An Interactive Gym Environment for User-Centric Agents cs.AI | cs.CL | cs.LGPDF
Cheng Qian, Zuxin Liu, Akshara Prabhakar, Zhiwei Liu, Jianguo Zhang
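TL;DR: The paper introduces UserBench, a benchmark for evaluating how well language-model agents collaborate with users over multi-turn interactions, and it reveals that current models fall short in aligning with user intent.
Details
Motivation: LLM-based agents have made progress on complex task solving, but their ability to collaborate proactively with users, especially when goals are vague or evolving, remains underexplored.
Result: On average, models give answers that fully align with all user intents only 20% of the time, and even the most advanced models uncover fewer than 30% of user preferences through active interaction.
Insight: Building truly collaborative agents requires going beyond task execution to understanding evolving user intent and aligning with user preferences.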
Abstract: Large Language Model (LLM)-based agents have made impressive progress in reasoning and tool use, enabling them to solve complex tasks. However, their ability to proactively collaborate with users, especially when goals are vague, evolving, or indirectly expressed, remains underexplored. To address this gap, we introduce UserBench, a user-centric benchmark designed to evaluate agents in multi-turn, preference-driven interactions. UserBench features simulated users who start with underspecified goals and reveal preferences incrementally, requiring agents to proactively clarify intent and make grounded decisions with tools. Our evaluation of leading open- and closed-source LLMs reveals a significant disconnect between task completion and user alignment. For instance, models provide answers that fully align with all user intents only 20% of the time on average, and even the most advanced models uncover fewer than 30% of all user preferences through active interaction. These results highlight the challenges of building agents that are not just capable task executors, but true collaborative partners. UserBench offers an interactive environment to measure and advance this critical capability.
[101] MoHoBench: Assessing Honesty of Multimodal Large Language Models via Unanswerable Visual Questions cs.AI | cs.CVPDF
Yanxu Zhu, Shitong Duan, Xiangxu Zhang, Jitao Sang, Peng Zhang
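TL;DR: The paper presents MoHoBench, a benchmark for assessing the honesty of multimodal large language models (MLLMs). It analyzes honesty behavior through unanswerable visual questions and proposes initial alignment methods.
Details
Motivation: MLLMs have advanced on vision-language tasks but can produce untrustworthy content. Their honesty has received little study, especially when they face unanswerable visual questions.
Result: Most models fail to refuse appropriately when a question is unanswerable, and honesty is influenced by visual information, which calls for modality-specific alignment methods.
Insight: Honesty in MLLMs is not only a language modeling issue; it also depends on visual information. Dedicated multimodal alignment methods are needed to improve trustworthiness.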
Abstract: Recently, Multimodal Large Language Models (MLLMs) have achieved considerable advancements in vision-language tasks, yet produce potentially harmful or untrustworthy content. Despite substantial work investigating the trustworthiness of language models, MLLMs’ capability to act honestly, especially when faced with visually unanswerable questions, remains largely underexplored. This work presents the first systematic assessment of honesty behaviors across various MLLMs. We ground honesty in models’ response behaviors to unanswerable visual questions, define four representative types of such questions, and construct MoHoBench, a large-scale MLLM honesty benchmark, consisting of 12k+ visual question samples, whose quality is guaranteed by multi-stage filtering and human verification. Using MoHoBench, we benchmarked the honesty of 28 popular MLLMs and conducted a comprehensive analysis. Our findings show that: (1) most models fail to appropriately refuse to answer when necessary, and (2) MLLMs’ honesty is not solely a language modeling issue, but is deeply influenced by visual information, necessitating the development of dedicated methods for multimodal honesty alignment. Therefore, we implemented initial alignment methods using supervised and preference learning to improve honesty behavior, providing a foundation for future work on trustworthy MLLMs. Our data and code can be found at https://github.com/DSTTSD/MoHoBench.
cs.RO [Back]
[102] Research Challenges and Progress in the End-to-End V2X Cooperative Autonomous Driving Competition cs.RO | cs.CV | I.4.9PDF
Ruiyang Hao, Haibao Yu, Jiaru Zhong, Chuanye Wang, Jiahao Wang
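TL;DR: The paper describes the design and outcomes of the End-to-End V2X Cooperative Autonomous Driving competition, focusing on key research problems such as bandwidth-aware fusion, multi-agent planning, and heterogeneous sensor integration, and it analyzes technical trends among the top-performing solutions.
Details
Motivation: V2X communication is a key technology for extending perception range and improving driving safety, but real deployments face technical challenges including multi-source sensor data integration, limited communication bandwidth, and dynamic environments. The competition was organized to advance research on these problems.
Result: The competition surfaced technical trends in bandwidth-aware fusion and multi-agent planning among top-performing solutions, providing a practical basis for scalable and reliable V2X-cooperative autonomous driving systems.
Insight: Making V2X cooperative driving practical requires addressing real constraints in communication and data fusion. Future work is likely to focus on efficient multi-source fusion algorithms and robust cooperative planning methods.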
Abstract: With the rapid advancement of autonomous driving technology, vehicle-to-everything (V2X) communication has emerged as a key enabler for extending perception range and enhancing driving safety by providing visibility beyond the line of sight. However, integrating multi-source sensor data from both ego-vehicles and infrastructure under real-world constraints, such as limited communication bandwidth and dynamic environments, presents significant technical challenges. To facilitate research in this area, we organized the End-to-End Autonomous Driving through V2X Cooperation Challenge, which features two tracks: cooperative temporal perception and cooperative end-to-end planning. Built on the UniV2X framework and the V2X-Seq-SPD dataset, the challenge attracted participation from over 30 teams worldwide and established a unified benchmark for evaluating cooperative driving systems. This paper describes the design and outcomes of the challenge, highlights key research problems including bandwidth-aware fusion, robust multi-agent planning, and heterogeneous sensor integration, and analyzes emerging technical trends among top-performing solutions. By addressing practical constraints in communication and data fusion, the challenge contributes to the development of scalable and reliable V2X-cooperative autonomous driving systems.
cs.CY [Back]
[103] A Tactical Behaviour Recognition Framework Based on Causal Multimodal Reasoning: A Study on Covert Audio-Video Analysis Combining GAN Structure Enhancement and Phonetic Accent Modelling cs.CY | cs.AI | cs.CV | 05C82, 68T07, 68T05, 62H30 | I.2.10; I.4.8; H.5.1; H.2.8PDF
Wei Meng
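TL;DR: The paper proposes TACTIC-GRAPHS, a system that combines spectral graph theory with multimodal graph neural reasoning for semantic understanding and threat detection in tactical video under high noise and weak structure.
Details
Motivation: In noisy, weakly structured environments, traditional methods struggle to identify threat chains in tactical behavior, so causal modeling over multimodal information is needed.
Result: On the TACTIC-AVS and TACTIC-Voice datasets, temporal alignment accuracy reaches 89.3%, recognition of complete threat chains exceeds 85%, and node latency stays within ±150 milliseconds.
Insight: The approach improves structural interpretability, applies to surveillance, defense, and intelligent security systems, and shows the benefit of multimodal fusion in complex environments.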
Abstract: This paper introduces TACTIC-GRAPHS, a system that combines spectral graph theory and multimodal graph neural reasoning for semantic understanding and threat detection in tactical video under high noise and weak structure. The framework incorporates spectral embedding, temporal causal edge modeling, and discriminative path inference across heterogeneous modalities. A semantic-aware keyframe extraction method fuses visual, acoustic, and action cues to construct temporal graphs. Using graph attention and Laplacian spectral mapping, the model performs cross-modal weighting and causal signal analysis. Experiments on TACTIC-AVS and TACTIC-Voice datasets show 89.3 percent accuracy in temporal alignment and over 85 percent recognition of complete threat chains, with node latency within plus-minus 150 milliseconds. The approach enhances structural interpretability and supports applications in surveillance, defense, and intelligent security systems.
cs.IR [Back]
[104] AgentMaster: A Multi-Agent Conversational Framework Using A2A and MCP Protocols for Multimodal Information Retrieval and Analysis cs.IR | cs.AI | cs.CLPDF
Callie C. Liao, Duoduo Liao, Sai Surya Gadiraju
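TL;DR: The paper introduces AgentMaster, a multi-agent system (MAS) framework that combines the A2A and MCP protocols for multimodal information retrieval and analysis, using natural language interaction to achieve dynamic coordination and flexible communication.
Details
Motivation: Current multi-agent systems still face challenges in inter-agent communication, coordination, and interaction with heterogeneous tools and resources, and frameworks that combine the A2A and MCP protocols have received little study.
Result: BERTScore F1 and G-Eval average 96.3% and 87.1% respectively, indicating robust inter-agent coordination, query decomposition, and domain-relevant responses.
Insight: AgentMaster shows the potential of MAS for domain-specific, cooperative, and scalable conversational AI and points to directions for future multi-agent applications.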
Abstract: The rise of Multi-Agent Systems (MAS) in Artificial Intelligence (AI), especially integrated with Large Language Models (LLMs), has greatly facilitated the resolution of complex tasks. However, current systems are still facing challenges of inter-agent communication, coordination, and interaction with heterogeneous tools and resources. Most recently, the Model Context Protocol (MCP) by Anthropic and Agent-to-Agent (A2A) communication protocol by Google have been introduced, and to the best of our knowledge, very few applications exist where both protocols are employed within a single MAS framework. We present a pilot study of AgentMaster, a novel modular multi-protocol MAS framework with self-implemented A2A and MCP, enabling dynamic coordination and flexible communication. Through a unified conversational interface, the system supports natural language interaction without prior technical expertise and responds to multimodal queries for tasks including information retrieval, question answering, and image analysis. Evaluation through the BERTScore F1 and LLM-as-a-Judge metric G-Eval averaged 96.3% and 87.1%, revealing robust inter-agent coordination, query decomposition, dynamic routing, and domain-specific, relevant responses. Overall, our proposed framework contributes to the potential capabilities of domain-specific, cooperative, and scalable conversational AI powered by MAS.
cs.CR [Back]
[105] OneShield – the Next Generation of LLM Guardrails cs.CR | cs.AI | cs.CLPDF
Chad DeLuca, Anna Lisa Gentile, Shubhi Asthana, Bing Zhang, Pawan Chowdhary
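TL;DR: OneShield is a stand-alone, model-agnostic, and customizable solution for protecting users of large language models (LLMs) from potential risks, with safety policies designed around each specific customer's needs.
Details
Motivation: As large language models are widely adopted, safety, privacy, and ethics concerns have become more pressing, and a flexible, general solution is needed to address them.
Result: The paper describes OneShield's implementation and scalability considerations and reports usage statistics since its first deployment.
Insight: The fast pace of LLM development means that guardrail solutions must be highly flexible and customizable; traditional one-size-fits-all solutions cannot meet the need.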
Abstract: The rise of Large Language Models has created a general excitement about the great potential for a myriad of applications. While LLMs offer many possibilities, questions about safety, privacy, and ethics have emerged, and all the key actors are working to address these issues with protective measures for their own models and standalone solutions. The constantly evolving nature of LLMs makes the task of universally shielding users against their potential risks extremely challenging, and one-size-fits-all solutions unfeasible. In this work, we propose OneShield, our stand-alone, model-agnostic and customizable solution to safeguard LLMs. OneShield aims to provide facilities for defining risk factors, expressing and declaring contextual safety and compliance policies, and mitigating LLM risks, with a focus on each specific customer. We describe the implementation of the framework, the scalability considerations and provide usage statistics of OneShield since its first deployment.
[106] Unmasking Synthetic Realities in Generative AI: A Comprehensive Review of Adversarially Robust Deepfake Detection Systems cs.CR | cs.CV | F.2.2; I.2.7PDF
Naseem Khan, Tuan Nguyen, Amine Bermak, Issa Khalil
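TL;DR: The paper systematically reviews state-of-the-art deepfake detection methods, highlights their insufficient adversarial robustness, and provides open-source resources to support research.
Details
Motivation: With the rapid progress of generative AI, deepfakes threaten digital security, information authenticity, and identity protection, so effective detection methods are urgently needed.
Result: Existing methods perform well in controlled settings but are not robust to adversarial perturbations, that is, subtle alterations designed to evade detection.
Insight: Future research should prioritize adversarial robustness and develop modality-agnostic architectures that can withstand sophisticated manipulations.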
Abstract: The rapid advancement of Generative Artificial Intelligence has fueled the proliferation of deepfakes (synthetic media encompassing fully generated content and subtly edited authentic material), posing challenges to digital security, misinformation mitigation, and identity preservation. This systematic review evaluates state-of-the-art deepfake detection methodologies, emphasizing reproducible implementations for transparency and validation. We delineate two core paradigms: (1) detection of fully synthetic media leveraging statistical anomalies and hierarchical feature extraction, and (2) localization of manipulated regions within authentic content employing multi-modal cues such as visual artifacts and temporal inconsistencies. These approaches, spanning uni-modal and multi-modal frameworks, demonstrate notable precision and adaptability in controlled settings, effectively identifying manipulations through advanced learning techniques and cross-modal fusion. However, comprehensive assessment reveals insufficient evaluation of adversarial robustness across both paradigms. Current methods exhibit vulnerability to adversarial perturbations (subtle alterations designed to evade detection), undermining reliability in real-world adversarial contexts. This gap highlights a critical disconnect between methodological development and evolving threat landscapes. To address this, we contribute a curated GitHub repository aggregating open-source implementations, enabling replication and testing. Our findings emphasize an urgent need for future work prioritizing adversarial resilience, advocating scalable, modality-agnostic architectures capable of withstanding sophisticated manipulations. This review synthesizes strengths and shortcomings of contemporary deepfake detection while charting paths toward robust, trustworthy systems.
[107] PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking cs.CR | cs.CV
Quanchen Zou, Zonghao Ying, Moyang Chen, Wenzhuo Xu, Yisong Xiao
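TL;DR: The paper proposes PRISM, a new attack framework targeting the safety alignment of large vision-language models (LVLMs). It decomposes a harmful instruction into multiple visual gadgets and exploits the model's compositional reasoning to achieve an effective jailbreak.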
Details
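Motivation: Existing jailbreak methods typically rely on direct, semantically explicit prompts and overlook compositional vulnerabilities in how LVLMs reason. The paper aims to expose and exploit this underexplored weakness.
Result: Validated on SafeBench and MM-SafetyBench; the attack success rate (ASR) exceeds 0.90 on SafeBench, with ASR improvements of up to 0.39 over baselines.
Insight: The work exposes a vulnerability in the compositional reasoning of LVLMs and underscores the importance of securing the entire reasoning process.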
Abstract: The increasing sophistication of large vision-language models (LVLMs) has been accompanied by advances in safety alignment mechanisms designed to prevent harmful content generation. However, these defenses remain vulnerable to sophisticated adversarial attacks. Existing jailbreak methods typically rely on direct and semantically explicit prompts, overlooking subtle vulnerabilities in how LVLMs compose information over multiple reasoning steps. In this paper, we propose a novel and effective jailbreak framework inspired by Return-Oriented Programming (ROP) techniques from software security. Our approach decomposes a harmful instruction into a sequence of individually benign visual gadgets. A carefully engineered textual prompt directs the sequence of inputs, prompting the model to integrate the benign visual gadgets through its reasoning process to produce a coherent and harmful output. This makes the malicious intent emergent and difficult to detect from any single component. We validate our method through extensive experiments on established benchmarks including SafeBench and MM-SafetyBench, targeting popular LVLMs. Results show that our approach consistently and substantially outperforms existing baselines on state-of-the-art models, achieving near-perfect attack success rates (over 0.90 on SafeBench) and improving ASR by up to 0.39. Our findings reveal a critical and underexplored vulnerability that exploits the compositional reasoning abilities of LVLMs, highlighting the urgent need for defenses that secure the entire reasoning process.
eess.IV
[108] Comparative Analysis of Vision Transformers and Convolutional Neural Networks for Medical Image Classification eess.IV | cs.CV | cs.LG | I.2.10; I.4.8
Kunal Kawadkar
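TL;DR: The paper compares Vision Transformers (ViTs) and convolutional neural networks (CNNs) on medical image classification, finds that the best-performing model differs by task, and stresses the importance of task-specific architecture selection.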
Details
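Motivation: How ViTs compare with CNNs in medical image classification has not been sufficiently explored, so an empirical analysis is needed to guide clinical applications.
Result: ResNet-50 performed best on chest X-ray classification (98.37% accuracy), DeiT-Small led on brain tumor detection (92.16%), and EfficientNet-B0 was best on skin cancer classification (81.84%).
Insight: Model architecture for medical image classification should be chosen per task; ViTs and CNNs each have advantages on different tasks.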
Abstract: The emergence of Vision Transformers (ViTs) has revolutionized computer vision, yet their effectiveness compared to traditional Convolutional Neural Networks (CNNs) in medical imaging remains under-explored. This study presents a comprehensive comparative analysis of CNN and ViT architectures across three critical medical imaging tasks: chest X-ray pneumonia detection, brain tumor classification, and skin cancer melanoma detection. We evaluated four state-of-the-art models - ResNet-50, EfficientNet-B0, ViT-Base, and DeiT-Small - across datasets totaling 8,469 medical images. Our results demonstrate task-specific model advantages: ResNet-50 achieved 98.37% accuracy on chest X-ray classification, DeiT-Small excelled at brain tumor detection with 92.16% accuracy, and EfficientNet-B0 led skin cancer classification at 81.84% accuracy. These findings provide crucial insights for practitioners selecting architectures for medical AI applications, highlighting the importance of task-specific architecture selection in clinical decision support systems.
[109] Querying GI Endoscopy Images: A VQA Approach eess.IV | cs.CV
Gaurav Parajuli
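TL;DR: The paper explores how to apply the Florence2 model to medical visual question answering, specifically on gastrointestinal endoscopy images, and evaluates its performance.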
Details
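Motivation: Although existing multimodal large models perform well on general-domain visual question answering, they perform poorly in specialized domains such as medical imaging, so the author seeks to adapt Florence2 to improve its performance on such tasks.
Result: The paper shows the potential of Florence2 for medical visual question answering but does not report specific performance figures.
Insight: The study highlights that general-domain models fall short in specialized domains such as medical imaging and points to directions for improvement.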
Abstract: VQA (Visual Question Answering) combines Natural Language Processing (NLP) with image understanding to answer questions about a given image. It has enormous potential for the development of medical diagnostic AI systems. Such a system can help clinicians diagnose gastrointestinal (GI) diseases accurately and efficiently. Although many of the multimodal LLMs available today have excellent VQA capabilities in the general domain, they perform very poorly for VQA tasks in specialized domains such as medical imaging. This study is a submission for ImageCLEFmed-MEDVQA-GI 2025 subtask 1 that explores the adaptation of the Florence2 model to answer medical visual questions on GI endoscopy images. We also evaluate the model performance using standard metrics like ROUGE, BLEU and METEOR.
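For readers unfamiliar with the text-overlap metrics named in the abstract, here is a minimal sketch of two of their building blocks: a ROUGE-L F1 score based on the longest common subsequence, and the clipped unigram precision that underlies BLEU-1. It is a simplified illustration with whitespace tokenization, not the official scoring used by the challenge; the example sentences are made up, and METEOR is omitted because it needs stemming and synonym resources.

```python
from collections import Counter

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, candidate):
    """ROUGE-L as the harmonic mean of LCS precision and LCS recall."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def unigram_precision(reference, candidate):
    """Clipped unigram precision (BLEU-1 without the brevity penalty)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[tok]) for tok, count in cand.items())
    return clipped / total

# Made-up reference answer and model answer
reference = "a small polyp is visible in the colon"
candidate = "a polyp is visible in colon"

rouge = rouge_l_f1(reference, candidate)       # LCS = 6 tokens -> F1 = 6/7
bleu1 = unigram_precision(reference, candidate)  # every candidate token appears in the reference
```

The two scores disagree here on purpose: the candidate omits two reference words, which lowers recall-sensitive ROUGE-L but leaves unigram precision at 1.0, which is why full BLEU adds a brevity penalty.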
[110] ST-DAI: Single-shot 2.5D Spatial Transcriptomics with Intra-Sample Domain Adaptive Imputation for Cost-efficient 3D Reconstruction eess.IV | cs.CV
Jiahe Qian, Yaoyu Fang, Xinkun Wang, Lee A. Cooper, Bo Zhou
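TL;DR: ST-DAI is a low-cost, single-sample framework for 3D spatial transcriptomics reconstruction that substantially reduces experimental cost through sparse 2.5D sampling and domain-adaptive imputation.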
Details
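Motivation: Current 3D spatial transcriptomics (ST) techniques are expensive, and existing methods that predict expression from histology images depend on large datasets and generalize poorly. ST-DAI aims to address these problems with low-cost sampling and domain adaptation.
Result: Experiments show that ST-DAI matches fully sampled approaches in gene expression prediction performance while greatly reducing measurement cost.
Insight: Single-sample domain adaptation can effectively handle cross-section domain discrepancies, offering a new route to low-cost 3D spatial transcriptomics.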
Abstract: For 3D spatial transcriptomics (ST), the high per-section acquisition cost of fully sampling every tissue section remains a significant challenge. Although recent approaches predict gene expression from histology images, these methods require large external datasets, which is costly, and they suffer from substantial domain discrepancies that lead to poor generalization on new samples. In this work, we introduce ST-DAI, a single-shot framework for 3D ST that couples a cost-efficient 2.5D sampling scheme with an intra-sample domain-adaptive imputation framework. First, in the cost-efficient 2.5D sampling stage, one reference section (central section) is fully sampled while other sections (adjacent sections) are sparsely sampled, thereby capturing volumetric context at significantly reduced experimental cost. Second, we propose a single-shot 3D imputation learning method that allows us to generate fully sampled 3D ST from this cost-efficient 2.5D ST scheme, using only sample-specific training. We observe position misalignment and domain discrepancy between sections. To address those issues, we adopt a pipeline that first aligns the central section to the adjacent section, thereafter generates dense pseudo-supervision on the central section, and then performs Fast Multi-Domain Refinement (FMDR), which adapts the network to the domain of the adjacent section while fine-tuning only a few parameters through the use of Parameter-Efficient Domain-Alignment Layers (PDLs). During this refinement, a Confidence Score Generator (CSG) reweights the pseudo-labels according to their estimated reliability, thereby directing imputation toward trustworthy regions. Our experimental results demonstrate that ST-DAI achieves gene expression prediction performance comparable to fully sampled approaches while substantially reducing the measurement burden.
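The idea of reweighting pseudo-labels by estimated reliability can be sketched as a confidence-weighted regression loss. The abstract does not specify how the CSG computes its scores or what loss ST-DAI uses, so the function below is only an assumed form: a mean squared error over gene-expression vectors in which each spot's error is scaled by a confidence in [0, 1]. All names and numbers are hypothetical.

```python
def confidence_weighted_mse(pred, pseudo, conf):
    """Weighted mean of per-spot squared errors.

    pred, pseudo : lists of per-spot expression vectors (same shape)
    conf         : one confidence weight in [0, 1] per spot
    Spots with low confidence contribute little to the loss, so training
    is driven mainly by pseudo-labels that are considered trustworthy.
    """
    total_weight = sum(conf)
    if total_weight == 0:
        return 0.0
    weighted_error = 0.0
    for p_row, y_row, c in zip(pred, pseudo, conf):
        spot_error = sum((p - y) ** 2 for p, y in zip(p_row, y_row)) / len(p_row)
        weighted_error += c * spot_error
    return weighted_error / total_weight

# Two spots, two genes each (made-up values). The second pseudo-label is far
# from the prediction, which could mean the pseudo-label itself is unreliable.
pred = [[1.0, 2.0], [0.0, 0.0]]
pseudo = [[1.0, 2.0], [4.0, 0.0]]

loss_uniform = confidence_weighted_mse(pred, pseudo, [1.0, 1.0])
loss_downweighted = confidence_weighted_mse(pred, pseudo, [0.75, 0.25])
loss_ignored = confidence_weighted_mse(pred, pseudo, [1.0, 0.0])
```

Normalizing by the sum of weights keeps the loss on the same scale regardless of how many spots are down-weighted, which is one common design choice; an unnormalized sum would be another.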
[111] VidFuncta: Towards Generalizable Neural Representations for Ultrasound Videos eess.IV | cs.CV
Julia Wolleb, Florentin Bieder, Paul Friedrich, Hemant D. Tagare, Xenophon Papademetris
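TL;DR: VidFuncta proposes a new framework based on implicit neural representations (INRs) to handle temporal dynamics and redundancy in ultrasound video analysis, and it outperforms conventional methods on several downstream tasks.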
Details
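Motivation: Ultrasound video analysis is widely used clinically, but conventional deep learning methods struggle with it because of non-standardized acquisition and operator bias. INRs offer a new perspective on this problem.
Result: The method outperforms 2D and 3D baselines on video reconstruction and is evaluated on downstream tasks such as ejection fraction prediction, B-line detection, and breast lesion classification.
Insight: Capturing the spatiotemporal characteristics of videos with INRs can improve the generalization and efficiency of medical video analysis and provide a more compact representation for clinical decision-making.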
Abstract: Ultrasound is widely used in clinical care, yet standard deep learning methods often struggle with full video analysis due to non-standardized acquisition and operator bias. We offer a new perspective on ultrasound video analysis through implicit neural representations (INRs). We build on Functa, an INR framework in which each image is represented by a modulation vector that conditions a shared neural network. However, its extension to the temporal domain of medical videos remains unexplored. To address this gap, we propose VidFuncta, a novel framework that leverages Functa to encode variable-length ultrasound videos into compact, time-resolved representations. VidFuncta disentangles each video into a static video-specific vector and a sequence of time-dependent modulation vectors, capturing both temporal dynamics and dataset-level redundancies. Our method outperforms 2D and 3D baselines on video reconstruction and enables downstream tasks to directly operate on the learned 1D modulation vectors. We validate VidFuncta on three public ultrasound video datasets – cardiac, lung, and breast – and evaluate its downstream performance on ejection fraction prediction, B-line detection, and breast lesion classification. These results highlight the potential of VidFuncta as a generalizable and efficient representation framework for ultrasound videos. Our code is publicly available under https://github.com/JuliaWolleb/VidFuncta_public.
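The split into a static video-specific vector and per-frame modulation vectors can be pictured with a tiny modulated sine network. The sketch below is an assumption-laden toy, not the VidFuncta architecture: it uses a single hidden layer, treats both vectors as additive shifts on the hidden pre-activations, and uses hand-picked weights. How the paper actually combines the two vectors is not stated in the abstract.

```python
import math

def render_pixel(coord, layer, video_vec, frame_vec, w0=1.0):
    """Evaluate a one-hidden-layer sine network at a pixel coordinate.

    layer     : (weights, bias, out_weights), shared by all videos and frames
    video_vec : static shift per hidden unit, one vector per video
    frame_vec : time-dependent shift per hidden unit, one vector per frame
    """
    weights, bias, out_weights = layer
    value = 0.0
    for row, b, v, m, o in zip(weights, bias, video_vec, frame_vec, out_weights):
        pre = sum(w * c for w, c in zip(row, coord)) + b
        value += o * math.sin(w0 * (pre + v + m))
    return value

def render_frame(size, layer, video_vec, frame_vec):
    """Render a size x size frame on a [-1, 1] grid (size must be >= 2)."""
    grid = [-1.0 + 2.0 * i / (size - 1) for i in range(size)]
    return [[render_pixel((x, y), layer, video_vec, frame_vec) for x in grid]
            for y in grid]

# Hypothetical shared weights: 2 hidden units, 2-D pixel coordinates
layer = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [1.0, 1.0])
video_vec = [0.0, 0.0]                        # one static vector for the video
frame_vecs = [[0.0, 0.0], [math.pi / 2, 0.0]]  # one modulation vector per frame

video = [render_frame(3, layer, video_vec, f) for f in frame_vecs]
```

Only `frame_vecs` grows with video length; the shared weights and the static vector stay fixed, which is what makes the sequence of modulation vectors a compact input for downstream models.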
[112] Cardiac-CLIP: A Vision-Language Foundation Model for 3D Cardiac CT Images eess.IV | cs.CV
Yutao Hu, Ying Zheng, Shumei Miao, Xiaolei Zhang, Jiahao Xia
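TL;DR: The paper presents Cardiac-CLIP, a multi-modal foundation model for 3D cardiac CT images, built with a two-stage pre-training strategy and achieving state-of-the-art performance on a range of downstream tasks.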
Details
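Motivation: Current applications of foundation models in medicine focus mainly on simple tasks, with little work on complex cardiovascular diagnosis. Cardiac-CLIP aims to fill this gap and improve the accuracy and efficiency of cardiovascular diagnosis.
Result: Cardiac-CLIP outperforms existing methods on cardiovascular abnormality classification, information retrieval, and clinical analysis, and is particularly strong on complex tasks such as the prospective prediction of acute coronary syndrome.
Insight: Cardiac-CLIP demonstrates the potential of multi-modal foundation models for complex medical tasks; in particular, standardized text and a soft-label matrix make better use of unstructured data.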
Abstract: Foundation models have demonstrated remarkable potential in the medical domain. However, their application to complex cardiovascular diagnostics remains underexplored. In this paper, we present Cardiac-CLIP, a multi-modal foundation model designed for 3D cardiac CT images. Cardiac-CLIP is developed through a two-stage pre-training strategy. The first stage employs a 3D masked autoencoder (MAE) to perform self-supervised representation learning from large-scale unlabeled volumetric data, enabling the visual encoder to capture rich anatomical and contextual features. In the second stage, contrastive learning is introduced to align visual and textual representations, facilitating cross-modal understanding. To support the pre-training, we collect 16,641 real clinical CT scans, supplemented by 114k publicly available data. Meanwhile, we standardize free-text radiology reports into unified templates and construct the pathology vectors according to diagnostic attributes, based on which the soft-label matrix is generated to supervise the contrastive learning process. On the other hand, to comprehensively evaluate the effectiveness of Cardiac-CLIP, we collect 6,722 real-clinical data from 12 independent institutions, along with the open-source data to construct the evaluation dataset. Specifically, Cardiac-CLIP is comprehensively evaluated across multiple tasks, including cardiovascular abnormality classification, information retrieval and clinical analysis. Experimental results demonstrate that Cardiac-CLIP achieves state-of-the-art performance across various downstream tasks in both internal and external data. Particularly, Cardiac-CLIP exhibits great effectiveness in supporting complex clinical tasks such as the prospective prediction of acute coronary syndrome, which is notoriously difficult in real-world scenarios.
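One way to read the soft-label supervision described above is that pairs of scans with similar pathology vectors should count as partial positives rather than negatives. The sketch below assumes a specific construction that the abstract does not confirm: targets are row-normalized cosine similarities between pathology vectors, and the loss is a cross-entropy between those targets and a softmax over image-to-text logits. Only the image-to-text direction is shown, and all vectors are made up.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def soft_label_matrix(pathology_vectors):
    """Row-normalized pairwise similarity of pathology vectors (assumed form)."""
    rows = []
    for i, u in enumerate(pathology_vectors):
        sims = [cosine(u, v) for v in pathology_vectors]
        total = sum(sims)
        if total == 0:
            # No positive findings to compare: fall back to a one-hot target
            rows.append([1.0 if j == i else 0.0 for j in range(len(sims))])
        else:
            rows.append([s / total for s in sims])
    return rows

def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def soft_contrastive_loss(logits, targets):
    """Cross-entropy between soft targets and softmax over each logit row."""
    loss = 0.0
    for logit_row, target_row in zip(logits, targets):
        log_probs = log_softmax(logit_row)
        loss -= sum(t * lp for t, lp in zip(target_row, log_probs))
    return loss / len(logits)

# Three scans; the first two share the same (hypothetical) finding
pathology = [[1, 0], [1, 0], [0, 1]]
targets = soft_label_matrix(pathology)

# Untrained model: all image-text similarities are equal
logits = [[0.0, 0.0, 0.0] for _ in range(3)]
loss = soft_contrastive_loss(logits, targets)
```

With one-hot targets this reduces to the standard CLIP-style objective in one direction; the soft targets differ only in that scans 0 and 1 are no longer pushed apart.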