cs.CV [Total: 55]
cs.CL [Total: 22]
q-bio.QM [Total: 1]
cs.RO [Total: 1]
eess.IV [Total: 3]
cs.AI [Total: 2]
eess.AS [Total: 1]
cs.IR [Total: 1]
cs.LG [Total: 2]

cs.CV

[1] Australian Supermarket Object Set (ASOS): A Benchmark Dataset of Physical Objects and 3D Models for Robotics and Computer Vision cs.CV | cs.RO | eess.IV

Akansel Cosgun, Lachlan Chumbley, Benjamin J. Meyer

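TL;DR: The paper introduces the Australian Supermarket Object Set (ASOS), a dataset of high-quality 3D textured meshes for 50 common supermarket items, intended as a benchmark for robotics and computer vision tasks.

Details

Motivation: Existing datasets mostly rely on synthetic models or on specialized objects that are hard to obtain, which limits their practical use. ASOS fills this gap with a low-cost dataset of real, readily available items.

Result: ASOS provides a practical, accessible benchmark for object detection, pose estimation, and robotics applications.

Insight: By building on low-cost, real objects, ASOS addresses the limitations of existing datasets and may help advance algorithm development and evaluation in real-world settings.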

Abstract: This paper introduces the Australian Supermarket Object Set (ASOS), a comprehensive dataset comprising 50 readily available supermarket items with high-quality 3D textured meshes designed for benchmarking in robotics and computer vision applications. Unlike existing datasets that rely on synthetic models or specialized objects with limited accessibility, ASOS provides a cost-effective collection of common household items that can be sourced from a major Australian supermarket chain. The dataset spans 10 distinct categories with diverse shapes, sizes, and weights. The 3D meshes are acquired using structure-from-motion techniques with high-resolution imaging to generate watertight meshes. The dataset’s emphasis on accessibility and real-world applicability makes it valuable for benchmarking object detection, pose estimation, and robotics applications.

[2] A Multimodal RAG Framework for Housing Damage Assessment: Collaborative Optimization of Image Encoding and Policy Vector Retrieval cs.CV | cs.AI | cs.LG

Jiayi Miao, Dingxin Lu, Zhuqi Wang

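TL;DR: The paper proposes a multimodal retrieval-augmented generation (MM-RAG) framework for assessing housing damage after natural disasters. A two-branch encoder and a cross-modal interaction module improve semantic alignment between images and text, which raises retrieval accuracy and classification performance.

Details

Motivation: Accurate assessment of housing damage after natural disasters is critical for insurance claims and resource planning, but existing methods fall short in multimodal data fusion and in jointly optimizing retrieval and generation.

Result: Top-1 retrieval accuracy improves by 9.6%, with strong results on damage-severity classification metrics.

Insight: Jointly optimizing image encoding and policy retrieval lets the multimodal framework fuse visual and textual information more efficiently, improving the accuracy and practicality of the assessment task.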

Abstract: After natural disasters, accurate assessment of housing damage is important for insurance claims response and resource planning. In this work, we introduce a novel multimodal retrieval-augmented generation (MM-RAG) framework. Building on the classical RAG architecture, we design a two-branch multimodal encoder: the image branch uses a visual encoder composed of ResNet and a Transformer to extract features of post-disaster building damage, and the text branch uses a BERT retriever to vectorize posts and insurance policies and to build a retrievable restoration index. To enforce cross-modal semantic alignment, the model integrates a cross-modal interaction module that bridges the image and text representations via multi-head attention. In the generation module, a modal attention gating mechanism dynamically controls the contributions of visual evidence and textual prior information during generation. The entire framework is trained end to end and combines the comparison loss, the retrieval loss, and the generation loss into a multi-task optimization objective, so that image understanding and policy matching are learned jointly. The results demonstrate superior retrieval accuracy and damage-severity classification, with Top-1 retrieval accuracy improved by 9.6%.

[3] Improving MLLM Historical Record Extraction with Test-Time Image cs.CV | cs.CL | cs.LG

Taylor Archibald, Tony Martinez

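TL;DR: The paper proposes a new ensemble framework that generates multiple image variants and combines them with a custom alignment method, improving the accuracy of text extraction from noisy historical documents. The method is simple and scalable.

Details

Motivation: Historical documents often contain noise and complex layouts that conventional single-pass transcription handles poorly. To improve accuracy and stability, the paper proposes an approach based on multiple image variants and ensembling.

Result: Experiments show that the method improves transcription accuracy by 4 percentage points. Padding and blurring are the most effective for improving accuracy, while grid warping helps separate high- and low-confidence cases.

Insight: (1) Combining multiple image variants improves transcription accuracy. (2) A simple alignment method makes results markedly more stable. (3) The method extends readily to other document collections and transcription models.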

Abstract: We present a novel ensemble framework that stabilizes LLM-based text extraction from noisy historical documents. We transcribe multiple augmented variants of each image with Gemini 2.0 Flash and fuse these outputs with a custom Needleman-Wunsch-style aligner that yields both a consensus transcription and a confidence score. We present a new dataset of 622 Pennsylvania death records and demonstrate that our method improves transcription accuracy by 4 percentage points relative to a single-shot baseline. We find that padding and blurring are the most useful for improving accuracy, while grid warp perturbations are best for separating high- and low-confidence cases. The approach is simple, scalable, and immediately deployable to other document collections and transcription models.
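
The alignment idea behind the fusion step can be illustrated with a textbook Needleman-Wunsch global alignment of two transcriptions. This is only a minimal sketch: the paper's custom aligner, its scoring, and its confidence score are not described here, so the unit scores, the function names `needleman_wunsch` and `agreement`, and the use of character agreement as a stand-in for confidence are all assumptions made for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return both strings padded with '-' gaps.

    The unit scores are illustrative defaults, not values from the paper.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def agreement(aligned_a, aligned_b):
    """Fraction of aligned columns where both transcriptions agree (hypothetical confidence proxy)."""
    same = sum(x == y for x, y in zip(aligned_a, aligned_b))
    return same / len(aligned_a)


if __name__ == "__main__":
    # Two hypothetical transcriptions of the same name from different image variants.
    left, right = needleman_wunsch("John Smith", "Jon Smith")
    print(left)
    print(right)
    print(agreement(left, right))
```

With more than two variants, one plausible extension is to align each transcription against a reference and take a per-column majority vote, with low agreement flagging records for manual review.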

[4] MITS: A Large-Scale Multimodal Benchmark Dataset for Intelligent Traffic Surveillance cs.CV | cs.AI

Kaikai Zhao, Zhaoxiang Liu, Peng Wang, Xin Wang, Zhicheng Ma

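TL;DR: MITS is a large-scale multimodal benchmark dataset designed for intelligent traffic surveillance (ITS). It contains roughly 170,000 real-world images, high-quality image annotations, and 5 million question-answer pairs, and it significantly improves the performance of mainstream large multimodal models (LMMs) on ITS tasks.

Details

Motivation: LMMs perform well in general domains, but their performance in ITS is limited by the lack of dedicated multimodal datasets. MITS fills this gap.

Result: Fine-tuned LMMs improve substantially on ITS tasks. For example, LLaVA-1.5 improves by 83.2% and Qwen2-VL by 58.6%.

Insight: MITS advances ITS research and also points to a new direction for applying multimodal models in specialized domains.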

Abstract: General-domain large multimodal models (LMMs) have achieved significant advances in various image-text tasks. However, their performance in the Intelligent Traffic Surveillance (ITS) domain remains limited due to the absence of dedicated multimodal datasets. To address this gap, we introduce MITS (Multimodal Intelligent Traffic Surveillance), the first large-scale multimodal benchmark dataset specifically designed for ITS. MITS includes 170,400 independently collected real-world ITS images sourced from traffic surveillance cameras, annotated with eight main categories and 24 subcategories of ITS-specific objects and events under diverse environmental conditions. Additionally, through a systematic data generation pipeline, we generate high-quality image captions and 5 million instruction-following visual question-answer pairs, addressing five critical ITS tasks: object and event recognition, object counting, object localization, background analysis, and event reasoning. To demonstrate MITS’s effectiveness, we fine-tune mainstream LMMs on this dataset, enabling the development of ITS-specific applications. Experimental results show that MITS significantly improves LMM performance in ITS applications, increasing LLaVA-1.5’s performance from 0.494 to 0.905 (+83.2%), LLaVA-1.6’s from 0.678 to 0.921 (+35.8%), Qwen2-VL’s from 0.584 to 0.926 (+58.6%), and Qwen2.5-VL’s from 0.732 to 0.930 (+27.0%). We release the dataset, code, and models as open-source, providing high-value resources to advance both ITS and LMM research.
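
The percentages quoted in the abstract are relative gains over each model's baseline score, not percentage-point differences. The short check below recomputes them from the before/after scores given in the abstract; the function name `relative_gain` is just a label chosen for this sketch.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# (baseline score, fine-tuned score) pairs as reported in the abstract.
reported = {
    "LLaVA-1.5": (0.494, 0.905),
    "LLaVA-1.6": (0.678, 0.921),
    "Qwen2-VL": (0.584, 0.926),
    "Qwen2.5-VL": (0.732, 0.930),
}

if __name__ == "__main__":
    for name, (before, after) in reported.items():
        print(f"{name}: {before:.3f} -> {after:.3f} (+{relative_gain(before, after):.1f}%)")
```

Note that the absolute gaps (for instance 0.411 for LLaVA-1.5) are much smaller than the relative figures suggest, because the baselines start low.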

[5] Decomposing Visual Classification: Assessing Tree-Based Reasoning in VLMs cs.CV

Sary Elmansoury, Islam Mesabah, Gerrit Großmann, Peter Neigel, Raj Bhalwankar

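TL;DR: This paper studies vision language models (VLMs) on zero-shot visual classification, focusing on whether tree-structured reasoning improves performance. Although the model understands the tree knowledge well (98.2% accuracy), tree-based reasoning still underperforms standard zero-shot prompting on classification.

Details

Motivation: The study explores whether structured, tree-based reasoning can improve VLM performance on fine-grained and coarse-grained visual classification, and seeks insights for designing more interpretable systems.

Result: The model reaches 98.2% accuracy in understanding the tree knowledge, yet tree-based reasoning underperforms standard zero-shot prompting on classification. Adding image descriptions improves both approaches.

Insight: Structured reasoning has limitations in visual classification, while prompts enriched with image descriptions may help improve model performance and interpretability.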

Abstract: Vision language models (VLMs) excel at zero-shot visual classification, but their performance on fine-grained tasks and large hierarchical label spaces is understudied. This paper investigates whether structured, tree-based reasoning can enhance VLM performance. We introduce a framework that decomposes classification into interpretable decisions using decision trees and evaluates it on fine-grained (GTSRB) and coarse-grained (CIFAR-10) datasets. Although the model achieves 98.2% accuracy in understanding the tree knowledge, tree-based reasoning consistently underperforms standard zero-shot prompting. We also explore enhancing the tree prompts with LLM-generated classes and image descriptions to improve alignment. The added description enhances the performance of the tree-based and zero-shot methods. Our findings highlight limitations of structured reasoning in visual classification and offer insights for designing more interpretable VLM systems.

[6] World Modeling with Probabilistic Structure Integration cs.CV | cs.AI | cs.LG

Klemen Kotar, Wanhee Lee, Rahul Venkatesh, Honglin Chen, Daniel Bear

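TL;DR: This paper presents PSI, a system that learns controllable, flexibly promptable world models through a three-step cycle (probabilistic prediction, structure extraction, and integration), and validates it on large-scale video data.

Details

Motivation: To build a world model that learns rich controllability and flexible prompting from data, in support of video prediction and understanding tasks.

Result: Trained on 1.4 trillion tokens of video data, the model supports video prediction and yields state-of-the-art optical flow, self-supervised depth, and object segmentation.

Insight: Iterating the cycle of structure extraction and integration progressively strengthens the model and creates new control handles, akin to an LLM-like universal prompting language.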

Abstract: We present Probabilistic Structure Integration (PSI), a system for learning richly controllable and flexibly promptable world models from data. PSI consists of a three-step cycle. The first step, Probabilistic prediction, involves building a probabilistic graphical model Psi of the data, in the form of a random-access autoregressive sequence model. Psi supports a complete set of learned conditional distributions describing the dependence of any variables in the data on any other set of variables. In step 2, Structure extraction, we show how to extract underlying low-dimensional properties in the data, corresponding to a diverse set of meaningful “intermediate structures”, in a zero-shot fashion via causal inference on Psi. Step 3, Integration, completes the cycle by converting these structures into new token types that are then continually mixed back into the training diet as conditioning signals and prediction targets. Each such cycle augments the capabilities of Psi, both allowing it to model the underlying data better, and creating new control handles – akin to an LLM-like universal prompting language. We train an instance of Psi on 1.4 trillion tokens of internet video data; we use it to perform a variety of useful video prediction and understanding inferences; we extract state-of-the-art optical flow, self-supervised depth and object segmentation; and we use these structures to support a full cycle of predictive improvements.

[7] Images in Motion?: A First Look into Video Leakage in Collaborative Deep Learning cs.CV

Md Fazle Rasul, Alanood Alqobaisi, Bruhadeshwar Bezawada, Indrakshi Ray

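TL;DR: The paper presents the first analysis of video data leakage through gradient inversion attacks in federated learning (FL). Feature extractors offer greater resistance, but leakage remains possible when the classifier lacks sufficient complexity.

Details

Motivation: FL protects privacy by exchanging gradients rather than raw data, yet gradient inversion attacks can reconstruct the original data. Such attacks have been studied for image, text, and other data, but leakage of video data has not been explored.

Result: Feature extractors are more resistant to attack, but leakage is still possible when the classifier is not complex enough. Super-resolution techniques can noticeably improve the quality of frames reconstructed by the attack.

Insight: Video data in FL remains at risk of leakage, and defenses and classifier design need further study.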

Abstract: Federated learning (FL) allows multiple entities to train a shared model collaboratively. Its core, privacy-preserving principle is that participants only exchange model updates, such as gradients, and never their raw, sensitive data. This approach is fundamental for applications in domains where privacy and confidentiality are important. However, the security of this very mechanism is threatened by gradient inversion attacks, which can reverse-engineer private training data directly from the shared gradients, defeating the purpose of FL. While the impact of these attacks is known for image, text, and tabular data, their effect on video data remains an unexamined area of research. This paper presents the first analysis of video data leakage in FL using gradient inversion attacks. We evaluate two common video classification approaches: one employing pre-trained feature extractors and another that processes raw video frames with simple transformations. Our initial results indicate that the use of feature extractors offers greater resilience against gradient inversion attacks. We also demonstrate that image super-resolution techniques can enhance the frames extracted through gradient inversion attacks, enabling attackers to reconstruct higher-quality videos. Our experiments validate this across scenarios where the attacker has access to zero, one, or more reference frames from the target environment. We find that although feature extractors make attacks more challenging, leakage is still possible if the classifier lacks sufficient complexity. We, therefore, conclude that video data leakage in FL is a viable threat, and the conditions under which it occurs warrant further investigation.

[8] Fine-Grained Cross-View Localization via Local Feature Matching and Monocular Depth Priors cs.CV

Zimin Xia, Chenghao Xu, Alexandre Alahi

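TL;DR: The paper proposes a fine-grained cross-view localization method that combines local feature matching with monocular depth priors to establish correspondences directly between a ground-level image and a reference aerial image, improving localization accuracy and interpretability.

Details

Motivation: Existing cross-view localization methods usually convert the ground image into a bird's-eye-view (BEV) representation and then align it with the aerial image, but perspective distortion and the compression of height information cause information loss that degrades alignment. This paper avoids the problem by matching local features between ground and aerial images directly.

Result: Experiments show that even under weak supervision the method learns accurate local feature correspondences and achieves superior localization in challenging settings such as cross-area generalization and unknown orientation. It is also compatible with various relative depth models without per-model fine-tuning.

Insight: (1) Matching local features directly across views avoids the information loss of BEV conversion. (2) Flexible use of depth priors (metric or relative) improves generalization. (3) Scale-aware alignment makes pose estimation possible with relative depth.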

Abstract: We propose an accurate and highly interpretable fine-grained cross-view localization method that estimates the 3 Degrees of Freedom pose of a ground-level image by matching its local features with a reference aerial image. Previous methods typically transform the ground image into a bird’s-eye view (BEV) representation and then align it with the aerial image for localization. However, this transformation often leads to information loss due to perspective distortion or compression of height information, thereby degrading alignment quality with the aerial view. In contrast, our method directly establishes correspondences between ground and aerial images and lifts only the matched keypoints to BEV space using monocular depth prior. Notably, modern depth predictors can provide reliable metric depth when the test samples are similar to the training data. When the depth distribution differs, they still produce consistent relative depth, i.e., depth accurate up to an unknown scale. Our method supports both metric and relative depth. It employs a scale-aware Procrustes alignment to estimate the camera pose from the correspondences and optionally recover the scale when using relative depth. Experimental results demonstrate that, with only weak supervision on camera pose, our method learns accurate local feature correspondences and achieves superior localization performance under challenging conditions, such as cross-area generalization and unknown orientation. Moreover, our method is compatible with various relative depth models without requiring per-model finetuning. This flexibility, combined with strong localization performance, makes it well-suited for real-world deployment.
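
To make the scale-aware Procrustes step more concrete, the sketch below fits a 2D similarity transform (yaw, optional scale, translation) to matched point pairs in closed form, treating each point as a complex number. It is an illustration under simplifying assumptions, not the paper's implementation: the correspondences are assumed to be already lifted to the BEV plane and equally weighted, and the name `align_2d` and its `estimate_scale` flag are hypothetical.

```python
import cmath


def align_2d(src, dst, estimate_scale=False):
    """Least-squares fit of dst ~ scale * R(yaw) * src + t for paired 2D points.

    Returns (yaw in radians, scale, (tx, ty)). With estimate_scale=False the
    scale is fixed to 1, which corresponds to trusting metric depth.
    """
    n = len(src)
    a = [complex(x, y) for x, y in src]
    b = [complex(x, y) for x, y in dst]
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    a_c = [p - mean_a for p in a]
    b_c = [q - mean_b for q in b]
    # Complex cross-correlation: its phase is the optimal rotation angle.
    cross = sum(q * p.conjugate() for p, q in zip(a_c, b_c))
    yaw = cmath.phase(cross)
    if estimate_scale:
        # Optimal scale for the similarity fit (relative-depth case).
        scale = abs(cross) / sum(abs(p) ** 2 for p in a_c)
    else:
        scale = 1.0
    rot = cmath.rect(scale, yaw)  # scale * exp(i * yaw)
    t = mean_b - rot * mean_a
    return yaw, scale, (t.real, t.imag)


if __name__ == "__main__":
    # Hypothetical matched keypoints: ground points in BEV and their aerial counterparts.
    ground = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 3.0)]
    aerial = [(3.0, -1.0), (3.0, 1.0), (1.0, -1.0), (-3.0, 3.0)]
    print(align_2d(ground, aerial, estimate_scale=True))
```

The three pose parameters (x, y, yaw) match the 3 Degrees of Freedom named in the abstract, and the scale is the extra unknown that appears only when the depth is relative.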

[9] Early Detection of Visual Impairments at Home Using a Smartphone Red-Eye Reflex Test cs.CV | cs.LG

Judith Massmann, Alexander Lichtenstein, Francisco M. López

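TL;DR: This paper presents a smartphone-based red-eye reflex test for children, aimed at early detection of visual impairments. Its deep learning model reaches 90% accuracy.

Details

Motivation: Conventional vision screening requires specialist equipment and an ophthalmologist, whereas advances in smartphones and artificial intelligence make home screening possible, which is especially useful for catching children's vision problems early.

Result: The model reaches 90% accuracy on unseen test data, making vision screening more accessible and efficient.

Insight: Combining AI with smartphones enables low-cost home vision screening with high accuracy, offering a new route to early intervention for children's vision health worldwide.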

Abstract: Numerous visual impairments can be detected in red-eye reflex images from young children. The so-called Bruckner test is traditionally performed by ophthalmologists in clinical settings. Thanks to the recent technological advances in smartphones and artificial intelligence, it is now possible to recreate the Bruckner test using a mobile device. In this paper, we present a first study conducted during the development of KidsVisionCheck, a free application that can perform vision screening with a mobile device using red-eye reflex images. The underlying model relies on deep neural networks trained on children’s pupil images collected and labeled by an ophthalmologist. With an accuracy of 90% on unseen test data, our model provides highly reliable performance without the necessity of specialist equipment. Furthermore, we can identify the optimal conditions for data collection, which can in turn be used to provide immediate feedback to the users. In summary, this work marks a first step toward accessible pediatric vision screenings and early intervention for vision abnormalities worldwide.

[10] DGFusion: Depth-Guided Sensor Fusion for Robust Semantic Perception cs.CV | cs.LG | cs.RO

Tim Broedermann, Christos Sakaridis, Luigi Piccinelli, Wim Abbeloos, Luc Van Gool

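TL;DR: DGFusion is a depth-guided multimodal fusion method that uses lidar data as depth information to adapt the sensor fusion strategy dynamically, improving semantic segmentation in autonomous driving scenes.

Details

Motivation: Existing sensor fusion methods treat all sensor data uniformly across space, which hurts them in challenging conditions. The authors use depth information to adapt fusion to each sensor's reliability in different spatial regions.

Result: State-of-the-art semantic and panoptic segmentation performance on the MUSES and DELIVER datasets.

Insight: Depth information can effectively guide dynamic sensor fusion and makes semantic perception more robust in challenging conditions.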

Abstract: Robust semantic perception for autonomous vehicles relies on effectively combining multiple sensors with complementary strengths and weaknesses. State-of-the-art sensor fusion approaches to semantic perception often treat sensor data uniformly across the spatial extent of the input, which hinders performance when faced with challenging conditions. By contrast, we propose a novel depth-guided multimodal fusion method that upgrades condition-aware fusion by integrating depth information. Our network, DGFusion, poses multimodal segmentation as a multi-task problem, utilizing the lidar measurements, which are typically available in outdoor sensor suites, both as one of the model’s inputs and as ground truth for learning depth. Our corresponding auxiliary depth head helps to learn depth-aware features, which are encoded into spatially varying local depth tokens that condition our attentive cross-modal fusion. Together with a global condition token, these local depth tokens dynamically adapt sensor fusion to the spatially varying reliability of each sensor across the scene, which largely depends on depth. In addition, we propose a robust loss for our depth, which is essential for learning from lidar inputs that are typically sparse and noisy in adverse conditions. Our method achieves state-of-the-art panoptic and semantic segmentation performance on the challenging MUSES and DELIVER datasets. Code and models will be available at https://github.com/timbroed/DGFusion

[11] Patch-based Automatic Rosacea Detection Using the ResNet Deep Learning Framework cs.CV

Chengyu Yang, Rishik Reddy Yesgari, Chengjun Liu

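TL;DR: This paper proposes a patch-based automatic rosacea detection method built on the ResNet-18 deep learning framework. Extracting image patches of different sizes, shapes, and locations improves detection performance and privacy protection.

Details

Motivation: Rosacea is a chronic inflammatory skin condition for which precise, early detection is important to treatment. Conventional full-image methods may overlook local features and raise privacy concerns.

Result: Patch-based methods match or exceed full-image methods in accuracy and sensitivity while protecting patient privacy.

Insight: Localized patch strategies improve model performance and also strengthen interpretability and privacy protection, offering a practical approach to automated dermatological diagnosis.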

Abstract: Rosacea, which is a chronic inflammatory skin condition that manifests with facial redness, papules, and visible blood vessels, often requires precise and early detection to significantly improve treatment effectiveness. This paper presents new patch-based automatic rosacea detection strategies using the ResNet-18 deep learning framework. The contributions of the proposed strategies come from the following aspects. First, various image patches of different sizes, shapes, and locations are extracted from facial images. Second, a number of investigation studies are carried out to evaluate how the localized visual information influences the deep learning model performance. Third, thorough experiments reveal that several patch-based automatic rosacea detection strategies achieve accuracy and sensitivity competitive with or superior to the full-image based methods. Finally, the proposed patch-based strategies, which use only localized patches, inherently preserve patient privacy by excluding any identifiable facial features from the data. The experimental results indicate that the proposed patch-based strategies guide the deep learning model to focus on clinically relevant regions, enhance robustness and interpretability, and protect patient privacy. As a result, the proposed strategies offer practical insights for improving automated dermatological diagnostics.

[12] Privacy-Preserving Automated Rosacea Detection Based on Medically Inspired Region of Interest Selection cs.CV

Chengyu Yang, Rishik Reddy Yesgari, Chengjun Liu

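TL;DR: This paper proposes a medically inspired, privacy-preserving method for automated rosacea detection that relies on synthetic data and a fixed redness mask.

Details

Motivation: Rosacea is common but underdiagnosed, and existing detection methods face diffuse symptoms, scarce data, and privacy concerns. The authors address these challenges with clinical priors and synthetic data.

Result: Experiments show that the method outperforms full-face baselines in accuracy, recall, and F1 score, which supports the value of synthetic data and clinical priors.

Insight: Combining synthetic data with clinical priors can yield dermatological AI systems that are both accurate and ethical, suited to privacy-sensitive settings such as telemedicine and large-scale screening.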

Abstract: Rosacea is a common but underdiagnosed inflammatory skin condition that primarily affects the central face and presents with subtle redness, pustules, and visible blood vessels. Automated detection remains challenging due to the diffuse nature of symptoms, the scarcity of labeled datasets, and privacy concerns associated with using identifiable facial images. A novel privacy-preserving automated rosacea detection method inspired by clinical priors and trained entirely on synthetic data is presented in this paper. Specifically, the proposed method, which leverages the observation that rosacea manifests predominantly through central facial erythema, first constructs a fixed redness-informed mask by selecting regions with consistently high red channel intensity across facial images. The mask thus is able to focus on diagnostically relevant areas such as the cheeks, nose, and forehead and exclude identity-revealing features. Second, the ResNet-18 deep learning method, which is trained on the masked synthetic images, achieves superior performance over the full-face baselines with notable gains in terms of accuracy, recall and F1 score when evaluated using the real-world test data. The experimental results demonstrate that the synthetic data and clinical priors can jointly enable accurate and ethical dermatological AI systems, especially for privacy sensitive applications in telemedicine and large-scale screening.
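
One way to picture the fixed redness-informed mask is to average the red channel over a set of aligned face images and keep the pixels whose mean exceeds a threshold. The sketch below does exactly that on nested lists. The averaging rule, the threshold, and the function names `redness_mask` and `apply_mask` are assumptions made for illustration, since the abstract only says that regions with consistently high red-channel intensity are selected.

```python
def redness_mask(red_channels, threshold):
    """Build one fixed boolean mask from several aligned red-channel images.

    red_channels: list of equally sized 2D lists of red intensities.
    A pixel is kept when its mean red intensity across images reaches the
    threshold (an assumed selection rule, not taken from the paper).
    """
    n = len(red_channels)
    rows, cols = len(red_channels[0]), len(red_channels[0][0])
    mean_red = [[sum(img[r][c] for img in red_channels) / n for c in range(cols)]
                for r in range(rows)]
    return [[mean_red[r][c] >= threshold for c in range(cols)] for r in range(rows)]


def apply_mask(image, mask, fill=0):
    """Keep masked-in pixels and replace everything else with `fill`."""
    return [[pix if keep else fill for pix, keep in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


if __name__ == "__main__":
    # Two tiny hypothetical 2x3 red-channel images.
    faces = [
        [[200, 90, 210], [60, 220, 50]],
        [[180, 110, 230], [40, 200, 70]],
    ]
    mask = redness_mask(faces, threshold=150)
    print(mask)
    print(apply_mask(faces[0], mask))
```

Because the mask is computed once and then reused for every image, it does not adapt to individual faces, which is what lets it exclude identity-revealing regions consistently.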

[13] Investigating the Impact of Various Loss Functions and Learnable Wiener Filter for Laparoscopic Image Desmoking cs.CV

Chengyu Yang, Chengjun Liu

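TL;DR: Through a systematic ablation study, this paper evaluates how each component of the ULW framework, including the learnable Wiener filter and the compound loss function, contributes to laparoscopic image desmoking.

Details

Motivation: To verify that each component of the ULW framework is necessary and effective, the authors run a comprehensive ablation study that isolates each part's effect on overall performance.

Result: Both the compound loss function and the learnable Wiener filter contribute materially to desmoking quality, and different components contribute in different respects.

Insight: The ablations reveal that some components are redundant on certain metrics, which points toward more efficient desmoking model designs.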

Abstract: To rigorously assess the effectiveness and necessity of individual components within the recently proposed ULW framework for laparoscopic image desmoking, this paper presents a comprehensive ablation study. The ULW approach combines a U-Net based backbone with a compound loss function that comprises mean squared error (MSE), structural similarity index (SSIM) loss, and perceptual loss. The framework also incorporates a differentiable, learnable Wiener filter module. In this study, each component is systematically ablated to evaluate its specific contribution to the overall performance of the whole framework. The analysis includes: (1) removal of the learnable Wiener filter, (2) selective use of individual loss terms from the composite loss function. All variants are benchmarked on a publicly available paired laparoscopic images dataset using quantitative metrics (SSIM, PSNR, MSE and CIEDE-2000) alongside qualitative visual comparisons.
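
Two of the quantitative metrics named above are directly related: PSNR is computed from MSE as 10 * log10(MAX^2 / MSE), where MAX is the largest possible pixel value. The sketch below shows that relationship on small nested-list images. It assumes 8-bit intensities (MAX = 255) and single-channel input, neither of which is stated in the abstract.

```python
import math


def mse(img_a, img_b):
    """Mean squared error between two equally sized 2D images (nested lists)."""
    flat_a = [v for row in img_a for v in row]
    flat_b = [v for row in img_b for v in row]
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)


def psnr(img_a, img_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    err = mse(img_a, img_b)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


if __name__ == "__main__":
    # Hypothetical desmoked output versus its smoke-free reference.
    reference = [[10, 20], [30, 40]]
    desmoked = [[12, 18], [30, 44]]
    print(mse(reference, desmoked), psnr(reference, desmoked))
```

Lower MSE always means higher PSNR, so the two metrics rank variants identically. SSIM and CIEDE-2000 capture structural and color differences that this pair does not.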

Razvan Stefanescu, Ethan Oh, Ruben Vazquez, Chris Mesterharm, Constantin Serban

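TL;DR: WAVE-DETR is a multimodal drone detector that combines visible (RGB) and acoustic signals. By fusing the Deformable DETR and Wav2Vec2 architectures, it detects drones robustly in challenging environments.

Details

Motivation: Existing drone detection methods rely mainly on visual information, which limits performance in challenging environments. Fusing visual and acoustic modalities can make detection more robust.

Result: Gated fusion improves mAP for small drones by 11.1% to 15.3% (IoU thresholds 0.5 to 0.9), and gains across all drone sizes range from 3.27% to 5.84%.

Insight: Acoustic information substantially complements visual detection, and the gating mechanism performs best among the fusion strategies tested.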

Abstract: We introduce a multi-modal WAVE-DETR drone detector combining visible RGB and acoustic signals for robust real-life UAV object detection. Our approach fuses visual and acoustic features in a unified object detector model relying on the Deformable DETR and Wav2Vec2 architectures, achieving strong performance under challenging environmental conditions. Our work leverages the existing Drone-vs-Bird dataset and the newly generated ARDrone dataset containing more than 7,500 synchronized images and audio segments. We show how the acoustic information is used to improve the performance of the Deformable DETR object detector on the real ARDrone dataset. We developed, trained and tested four different fusion configurations based on a gated mechanism, linear layer, MLP and cross attention. The Wav2Vec2 acoustic embeddings are fused with the multi-resolution feature mappings of the Deformable DETR and enhance the object detection performance across all drone sizes. The best performer is the gated fusion approach, which improves the mAP of the Deformable DETR object detector on our in-distribution and out-of-distribution ARDrone datasets by 11.1% to 15.3% for small drones across all IoU thresholds between 0.5 and 0.9. The mAP scores for medium and large drones are also enhanced, with overall gains across all drone sizes ranging from 3.27% to 5.84%.

[15] Surrogate Supervision for Robust and Generalizable Deformable Image Registration cs.CV | cs.AI

Yihao Liu, Junyu Chen, Lianrui Zuo, Shuwen Wei, Brian D. Boyd

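TL;DR: The paper proposes a training paradigm called "surrogate supervision", which applies spatial transformations to surrogate images to make deep learning-based deformable image registration more robust and generalizable.

Details

Motivation: Existing deep learning methods are unstable when input image characteristics vary (for example artifacts, field-of-view mismatch, or modality differences), so a general training approach is needed to improve the robustness and generalizability of registration networks.

Result: Experiments show that surrogate supervision is highly robust to input variations such as field inhomogeneity, inconsistent field of view, and modality differences, while maintaining high performance on standard data.

Insight: Surrogate supervision offers a simple, effective approach to challenging medical image registration that applies across diverse biomedical scenarios.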

Abstract: Objective: Deep learning-based deformable image registration has achieved strong accuracy, but remains sensitive to variations in input image characteristics such as artifacts, field-of-view mismatch, or modality difference. We aim to develop a general training paradigm that improves the robustness and generalizability of registration networks. Methods: We introduce surrogate supervision, which decouples the input domain from the supervision domain by applying estimated spatial transformations to surrogate images. This allows training on heterogeneous inputs while ensuring supervision is computed in domains where similarity is well defined. We evaluate the framework through three representative applications: artifact-robust brain MR registration, mask-agnostic lung CT registration, and multi-modal MR registration. Results: Across tasks, surrogate supervision demonstrated strong resilience to input variations including inhomogeneity field, inconsistent field-of-view, and modality differences, while maintaining high performance on well-curated data. Conclusions: Surrogate supervision provides a principled framework for training robust and generalizable deep learning-based registration models without increasing complexity. Significance: Surrogate supervision offers a practical pathway to more robust and generalizable medical image registration, enabling broader applicability in diverse biomedical imaging scenarios.

[16] An Autoencoder and Vision Transformer-based Interpretability Analysis of the Differences in Automated Staging of Second and Third Molars cs.CV | cs.AI | 68T07 (Primary)

Barkin Buyukcakir, Jannick De Tobel, Patrick Thevissen, Dirk Vandermeulen, Peter Claes

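TL;DR: The paper proposes a framework that combines a convolutional autoencoder (AE) with a Vision Transformer (ViT) to improve the accuracy and interpretability of dental stage classification, with clear gains for second and third molars.

Details

Motivation: The "black box" nature of deep learning models limits their use in high-stakes applications such as forensic dental age estimation. The authors aim to advance practical adoption by improving both performance and transparency.

Result: Classification accuracy rises from 0.712 to 0.815 for the second molar (tooth 37) and from 0.462 to 0.543 for the third molar (tooth 38).

Insight: A single interpretability method such as attention maps can mask data problems. Several analyses need to be combined to understand what limits model performance.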

Abstract: The practical adoption of deep learning in high-stakes forensic applications, such as dental age estimation, is often limited by the ‘black box’ nature of the models. This study introduces a framework designed to enhance both performance and transparency in this context. We use a notable performance disparity in the automated staging of mandibular second (tooth 37) and third (tooth 38) molars as a case study. The proposed framework, which combines a convolutional autoencoder (AE) with a Vision Transformer (ViT), improves classification accuracy for both teeth over a baseline ViT, increasing from 0.712 to 0.815 for tooth 37 and from 0.462 to 0.543 for tooth 38. Beyond improving performance, the framework provides multi-faceted diagnostic insights. Analysis of the AE’s latent space metrics and image reconstructions indicates that the remaining performance gap is data-centric, suggesting high intra-class morphological variability in the tooth 38 dataset is a primary limiting factor. This work highlights the insufficiency of relying on a single mode of interpretability, such as attention maps, which can appear anatomically plausible yet fail to identify underlying data issues. By offering a methodology that both enhances accuracy and provides evidence for why a model may be uncertain, this framework serves as a more robust tool to support expert decision-making in forensic age estimation.

[17] SCoDA: Self-supervised Continual Domain Adaptation cs.CV

Chirayu Agrawal, Snehasis Mukherjee

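TL;DR: SCoDA is a self-supervised continual domain adaptation method that addresses problems in source-free domain adaptation (SFDA) with a self-supervised pre-trained teacher model and geometric manifold alignment, and it significantly outperforms existing methods.

Details

Motivation: Existing SFDA methods rely on fully supervised pre-training and align features with cosine similarity, which ignores geometric information about the source model's latent manifold and limits performance. SCoDA addresses this with self-supervised pre-training and geometric alignment.

Result: SCoDA significantly outperforms existing SFDA methods on multiple benchmark datasets.

Insight: Self-supervised pre-training and geometric alignment are key to avoiding the loss of latent manifold information in domain adaptation.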

Abstract: Source-Free Domain Adaptation (SFDA) addresses the challenge of adapting a model to a target domain without access to the data of the source domain. Prevailing methods typically start with a source model pre-trained with full supervision and distill the knowledge by aligning instance-level features. However, these approaches, relying on cosine similarity over L2-normalized feature vectors, inadvertently discard crucial geometric information about the latent manifold of the source model. We introduce Self-supervised Continual Domain Adaptation (SCoDA) to address these limitations. We make two key departures from standard practice: first, we avoid the reliance on supervised pre-training by initializing the proposed framework with a teacher model pre-trained entirely via self-supervision (SSL). Second, we adapt the principle of geometric manifold alignment to the SFDA setting. The student is trained with a composite objective combining instance-level feature matching with a Space Similarity Loss. To combat catastrophic forgetting, the teacher’s parameters are updated via an Exponential Moving Average (EMA) of the student’s parameters. Extensive experiments on benchmark datasets demonstrate that SCoDA significantly outperforms state-of-the-art SFDA methods.
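
The Exponential Moving Average update mentioned above has a standard form: each teacher parameter becomes m times its old value plus (1 - m) times the current student value. The sketch below applies it to plain dictionaries of floats. The momentum value and the parameter names are assumptions made for illustration, since the abstract does not give them.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return new teacher parameters: momentum * teacher + (1 - momentum) * student.

    `teacher` and `student` map parameter names to floats. The default
    momentum is an illustrative value, not one reported in the abstract.
    """
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}


if __name__ == "__main__":
    # Hypothetical single-weight models: the teacher drifts slowly toward the student.
    teacher = {"w": 1.0}
    student = {"w": 0.0}
    for step in range(3):
        teacher = ema_update(teacher, student, momentum=0.9)
        print(step, teacher["w"])
```

A momentum close to 1 makes the teacher change slowly, which is the property that helps against catastrophic forgetting: abrupt shifts in the student are smoothed out before they reach the teacher.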

[18] Segment Anything for Cell Tracking cs.CV

Zhu Chen, Mert Edgü, Er Jin, Johannes Stegmaier

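TL;DR: This paper proposes a zero-shot cell tracking framework based on Segment Anything 2 (SAM2) that tracks cells accurately across diverse microscopy datasets without manually annotated data.

Details

Motivation: Current deep learning-based cell tracking methods depend on manually annotated data, which is costly and generalizes poorly. The paper proposes a fully unsupervised approach that needs no training data.

Result: The method reaches competitive accuracy on 2D and large-scale 3D time-lapse microscopy videos without dataset-specific adaptation.

Insight: Large foundation models such as SAM2 can be applied directly to specialized domains such as cell tracking, reducing dependence on annotated data and improving generality and scalability.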

Abstract: Tracking cells and detecting mitotic events in time-lapse microscopy image sequences is a crucial task in biomedical research. However, it remains highly challenging due to dividing objects, low signal-to-noise ratios, indistinct boundaries, dense clusters, and the visually similar appearance of individual cells. Existing deep learning-based methods rely on manually labeled datasets for training, which is both costly and time-consuming. Moreover, their generalizability to unseen datasets remains limited due to the vast diversity of microscopy data. To overcome these limitations, we propose a zero-shot cell tracking framework by integrating Segment Anything 2 (SAM2), a large foundation model designed for general image and video segmentation, into the tracking pipeline. As a fully-unsupervised approach, our method does not depend on or inherit biases from any specific training dataset, allowing it to generalize across diverse microscopy datasets without finetuning. Our approach achieves competitive accuracy in both 2D and large-scale 3D time-lapse microscopy videos while eliminating the need for dataset-specific adaptation.

[19] Online 3D Multi-Camera Perception through Robust 2D Tracking and Depth-based Late Aggregation cs.CV

Vu-Minh Le, Thao-Anh Tran, Duc Huy Do, Xuan Canh Do, Huong Ninh

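TL;DR: The paper proposes a method for extending existing online 2D multi-camera tracking systems into 3D space, using depth information to reconstruct targets in point-cloud space and recovering their 3D bounding boxes through clustering and yaw refinement.

Details

Motivation: Multi-Target Multi-Camera Tracking (MTMC) is essential for automating large-scale surveillance, but existing 2D tracking systems are hard to extend directly to 3D space. The work aims to achieve 3D perception without rebuilding the system from scratch.

Result: The method ranked 3rd on the 3D MTMC dataset of the 2025 AI City Challenge.

Insight: Depth information can effectively extend a 2D system to 3D without fully rebuilding it, which offers flexibility for practical deployment.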

Abstract: Multi-Target Multi-Camera Tracking (MTMC) is an essential computer vision task for automating large-scale surveillance. With camera calibration and depth information, the targets in the scene can be projected into 3D space, offering unparalleled levels of automatic perception of a 3D environment. However, tracking in the 3D space requires replacing all 2D tracking components from the ground up, which may be infeasible for existing MTMC systems. In this paper, we present an approach for extending any online 2D multi-camera tracking system into 3D space by utilizing depth information to reconstruct a target in point-cloud space, and recovering its 3D box through clustering and yaw refinement following tracking. We also introduce an enhanced online data association mechanism that leverages the target’s local ID consistency to assign global IDs across frames. The proposed framework is evaluated on the 2025 AI City Challenge’s 3D MTMC dataset, achieving 3rd place on the leaderboard.
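
Projecting a detected pixel into 3D from calibration and depth can be illustrated with the standard pinhole camera model. The abstract does not state which camera model or conventions the authors use, so the sketch below is an assumption: the function name, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the convention that depth is measured along the optical axis are all hypothetical.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) with metric depth to a camera-frame point (X, Y, Z).

    Assumes an ideal pinhole camera with focal lengths (fx, fy) and
    principal point (cx, cy), all in pixels, and no lens distortion.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# A pixel 100 px to the right of the principal point, 5 m from the camera.
point = backproject(u=420.0, v=240.0, depth=5.0,
                    fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Applying this to every pixel inside a 2D track's box would give a per-target point cloud. The clustering and yaw refinement steps named in the abstract are not shown here.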

[20] Zero-Shot Referring Expression Comprehension via Visual-Language True/False Verification cs.CV | cs.AIPDF

Jeffrey Liu, Rongbin Hu

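TL;DR: The paper proposes a zero-shot Referring Expression Comprehension (REC) method based on visual-language True/False verification that achieves competitive or superior performance without task-specific training.

Details

Motivation: Conventional REC methods usually require task-trained models. This paper explores a zero-shot workflow without any REC-specific training, aiming to simplify the pipeline and improve performance.

Result: Experiments on RefCOCO, RefCOCO+, and RefCOCOg show that the method surpasses a zero-shot GroundingDINO baseline and also outperforms some trained models.

Insight: Workflow design, rather than task-specific pretraining, drives strong zero-shot REC performance. The approach also supports multiple matches and abstention.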

Abstract: Referring Expression Comprehension (REC) is usually addressed with task-trained grounding models. We show that a zero-shot workflow, without any REC-specific training, can achieve competitive or superior performance. Our approach reformulates REC as box-wise visual-language verification: given proposals from a COCO-clean generic detector (YOLO-World), a general-purpose VLM independently answers True/False queries for each region. This simple procedure reduces cross-box interference, supports abstention and multiple matches, and requires no fine-tuning. On RefCOCO, RefCOCO+, and RefCOCOg, our method not only surpasses a zero-shot GroundingDINO baseline but also exceeds reported results for GroundingDINO trained on REC and GroundingDINO+CRG. Controlled studies with identical proposals confirm that verification significantly outperforms selection-based prompting, and results hold with open VLMs. Overall, we show that workflow design, rather than task-specific pretraining, drives strong zero-shot REC performance.
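
The box-wise verification idea can be sketched as a loop that asks an independent True/False question for each proposal. This is only a schematic under stated assumptions: `verify_boxes` and `toy_verifier` are hypothetical names, and the toy verifier is a geometric stand-in for a real VLM call, which the abstract does not specify in detail.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def verify_boxes(
    boxes: Sequence[Box],
    expression: str,
    ask_true_false: Callable[[Box, str], bool],
) -> List[Box]:
    """Keep every box for which the verifier answers True.

    Each box is queried independently, so the result may contain several
    boxes (multiple matches) or none at all (abstention).
    """
    return [box for box in boxes if ask_true_false(box, expression)]


def toy_verifier(box: Box, expression: str) -> bool:
    """Stand-in for a VLM: True only for 'object on the left' when the box
    centre lies in the left half of a 100-pixel-wide image."""
    centre_x = (box[0] + box[2]) / 2.0
    return expression == "object on the left" and centre_x < 50.0


proposals = [(0.0, 0.0, 20.0, 20.0), (10.0, 30.0, 40.0, 60.0), (70.0, 0.0, 90.0, 20.0)]
matches = verify_boxes(proposals, "object on the left", toy_verifier)
no_match = verify_boxes(proposals, "object on the right", toy_verifier)
```

Because no box is compared against another, an unrelated region cannot influence the answer for a given proposal, which is the cross-box interference the abstract says verification reduces.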

[21] An HMM-based framework for identity-aware long-term multi-object tracking from sparse and uncertain identification: use case on long-term tracking in livestock cs.CVPDF

Anne Marthe Sophie Ngo Bibinbe, Chiron Bang, Patrick Gagnon, Jamie Ahloy-Dallaire, Eric R. Paquet

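TL;DR: This paper proposes an HMM-based framework that combines uncertain identity information with tracking to address identity switches in long-term multi-object tracking (MOT) and improve tracking performance.

Details

Motivation: Identity switches degrade the performance of existing MOT methods over long durations, while real-world applications such as livestock farming can provide sporadic identifications for some targets. A method is therefore needed that combines uncertain identity information with tracking.

Result: On a 10-minute pig tracking dataset, the HMM framework improves the F1 score of ByteTrack. The improvement is also validated on the MOT17 and MOT20 benchmarks with both ByteTrack and FairMOT.

Insight: Sparse but reliable identity information can effectively improve long-term MOT performance, and an HMM is well suited to handling this kind of uncertain data.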

Abstract: The need for long-term multi-object tracking (MOT) is growing due to the demand for analyzing individual behaviors in videos that span several minutes. Unfortunately, due to identity switches between objects, the tracking performance of existing MOT approaches decreases over time, making them difficult to apply for long-term tracking. However, in many real-world applications, such as in the livestock sector, it is possible to obtain sporadic identifications for some of the animals from sources like feeders. To address the challenges of long-term MOT, we propose a new framework that combines both uncertain identities and tracking using a Hidden Markov Model (HMM) formulation. In addition to providing real-world identities to animals, our HMM framework improves the F1 score of ByteTrack, a leading MOT approach even with re-identification, on a 10 minute pig tracking dataset with 21 identifications at the pen’s feeding station. We also show that our approach is robust to the uncertainty of identifications, with performance increasing as identities are provided more frequently. The improved performance of our HMM framework was also validated on the MOT17 and MOT20 benchmark datasets using both ByteTrack and FairMOT. The code for this new HMM framework and the new 10-minute pig tracking video dataset are available at: https://github.com/ngobibibnbe/uncertain-identity-aware-tracking

[22] Event Camera Guided Visual Media Restoration & 3D Reconstruction: A Survey cs.CVPDF

Aupendu Kar, Vishnu Raj, Guan-Ming Su

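TL;DR: This survey examines how fusing event cameras with conventional frame-based cameras benefits video restoration and 3D reconstruction, systematically reviews deep learning contributions to temporal and spatial enhancement, and outlines future research directions.

Details

Motivation: Event cameras are emerging in computer vision thanks to their low latency, low power consumption, and high capture rates, but the potential of fusing them with frame-based cameras remains underexplored, especially for visual media restoration and 3D reconstruction.

Result: The survey finds that fusing event cameras with conventional cameras significantly improves restoration quality, particularly in challenging conditions such as low light and high-speed motion.

Insight: Future research should focus on further improving the combination of event cameras and deep learning, especially for complex dynamic scenes.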

Abstract: Event camera sensors are bio-inspired sensors which asynchronously capture per-pixel brightness changes and output a stream of events encoding the polarity, location and time of these changes. These systems are witnessing rapid advancements as an emerging field, driven by their low latency, reduced power consumption, and ultra-high capture rates. This survey explores the evolution of fusing event streams with traditional frame-based capture, highlighting how this synergy significantly benefits various video restoration and 3D reconstruction tasks. The paper systematically reviews major deep learning contributions to image/video enhancement and restoration, focusing on two dimensions: temporal enhancement (such as frame interpolation and motion deblurring) and spatial enhancement (including super-resolution, low-light and HDR enhancement, and artifact reduction). This paper also explores how the 3D reconstruction domain evolves with the advancement of event-driven fusion. Diverse topics are covered, with in-depth discussions on recent works for improving visual quality under challenging conditions. Additionally, the survey compiles a comprehensive list of openly available datasets, enabling reproducible research and benchmarking. By consolidating recent progress and insights, this survey aims to inspire further research into leveraging event camera systems, especially in combination with deep learning, for advanced visual media restoration and enhancement.

[23] ISTASTrack: Bridging ANN and SNN via ISTA Adapter for RGB-Event Tracking cs.CVPDF

Siying Liu, Zikai Wang, Hanle Zheng, Yifan Hu, Xilin Wang

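TL;DR: ISTASTrack is a transformer-based hybrid ANN-SNN tracker that bridges RGB and event data through ISTA adapters, improving both performance and energy efficiency in RGB-Event tracking.

Details

Motivation: In RGB-Event tracking, artificial neural networks (ANNs) struggle to fully exploit the sparse and asynchronous nature of event streams, while hybrid ANN-SNN architectures face challenges in feature fusion. This paper aims to address that problem.

Result: The method achieves state-of-the-art performance on multiple RGB-Event tracking benchmarks while maintaining high energy efficiency.

Insight: Hybrid ANN-SNN designs show promise for practical use, and sparse representation theory and attention mechanisms play key roles in cross-modal fusion.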

Abstract: RGB-Event tracking has become a promising trend in visual object tracking to leverage the complementary strengths of both RGB images and dynamic spike events for improved performance. However, existing artificial neural networks (ANNs) struggle to fully exploit the sparse and asynchronous nature of event streams. Recent efforts toward hybrid architectures combining ANNs and spiking neural networks (SNNs) have emerged as a promising solution in RGB-Event perception, yet effectively fusing features across heterogeneous paradigms remains a challenge. In this work, we propose ISTASTrack, the first transformer-based ANN-SNN hybrid Tracker equipped with ISTA adapters for RGB-Event tracking. The two-branch model employs a vision transformer to extract spatial context from RGB inputs and a spiking transformer to capture spatio-temporal dynamics from event streams. To bridge the modality and paradigm gap between ANN and SNN features, we systematically design a model-based ISTA adapter for bidirectional feature interaction between the two branches, derived from sparse representation theory by unfolding the iterative shrinkage thresholding algorithm. Additionally, we incorporate a temporal downsampling attention module within the adapter to align multi-step SNN features with single-step ANN features in the latent space, improving temporal fusion. Experimental results on RGB-Event tracking benchmarks, such as FE240hz, VisEvent, COESOT, and FELT, have demonstrated that ISTASTrack achieves state-of-the-art performance while maintaining high energy efficiency, highlighting the effectiveness and practicality of hybrid ANN-SNN designs for robust visual tracking. The code is publicly available at https://github.com/lsying009/ISTASTrack.git.

[24] FLARE-SSM: Deep State Space Models with Influence-Balanced Loss for 72-Hour Solar Flare Prediction cs.CV | astro-ph.SRPDF

Yusuke Takagi, Shunya Nagashima, Komei Sugiura

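TL;DR: The paper proposes FLARE-SSM, which uses deep state space models and the FLARE loss to address class imbalance in solar flare prediction and improve predictive performance.

Details

Motivation: Current solar flare prediction methods handle class imbalance poorly and fall short of the accuracy needed to protect critical infrastructure.

Result: On a dataset covering a full 11-year solar activity cycle, the method outperforms baseline methods on the Gandin-Murphy-Gerrity score and the true skill statistic.

Insight: A loss function designed specifically for class imbalance can effectively improve a model's predictive performance and reliability.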

Abstract: Accurate and reliable solar flare predictions are essential to mitigate potential impacts on critical infrastructure. However, the current performance of solar flare forecasting is insufficient. In this study, we address the task of predicting the class of the largest solar flare expected to occur within the next 72 hours. Existing methods often fail to adequately address the severe class imbalance across flare classes. To address this issue, we propose a solar flare prediction model based on multiple deep state space models. In addition, we introduce the frequency & local-boundary-aware reliability loss (FLARE loss) to improve predictive performance and reliability under class imbalance. Experiments were conducted on a multi-wavelength solar image dataset covering a full 11-year solar activity cycle. As a result, our method outperformed baseline approaches in terms of both the Gandin-Murphy-Gerrity score and the true skill statistic, which are standard metrics of performance and reliability.
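
For readers unfamiliar with the true skill statistic (TSS), its usual binary definition is the hit rate minus the false alarm rate. The sketch below uses that binary form with made-up confusion-matrix counts. How the paper applies the metric across multiple flare classes is not stated in the abstract, so this is only an illustration of the metric itself.

```python
def true_skill_statistic(tp, fn, fp, tn):
    """Binary TSS = TP / (TP + FN) - FP / (FP + TN).

    Ranges from -1 to 1; 0 corresponds to no skill.
    """
    positives = tp + fn
    negatives = fp + tn
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present to compute TSS")
    return tp / positives - fp / negatives


# Illustrative counts only: 8 of 10 events detected, 1 false alarm in 10 quiet cases.
tss = true_skill_statistic(tp=8, fn=2, fp=1, tn=9)

# A classifier that always predicts "no event" scores 0, however rare events are.
trivial = true_skill_statistic(tp=0, fn=10, fp=0, tn=90)
```

The second call shows why TSS is popular under class imbalance: a trivial majority-class predictor gets no credit, unlike with plain accuracy.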

Xiaodong Guo, Tong Liu, Yike Li, Zi’ang Lin, Zhihong Deng

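TL;DR: TUNI is a real-time RGB-T semantic segmentation framework that uses unified multi-modal feature extraction and cross-modal feature fusion to overcome the limitations of existing models in thermal feature extraction and cross-modal fusion while improving real-time efficiency.

Details

Motivation: Existing RGB-T semantic segmentation models typically use pre-trained RGB encoders to process both modalities, which leads to insufficient thermal feature extraction and suboptimal cross-modal fusion, and redundant encoder structures hurt real-time performance.

Result: On FMB, PST900, and CART, TUNI is competitive with state-of-the-art models with fewer parameters and lower computational cost, and it reaches a real-time inference speed of 27 FPS on a Jetson Orin NX.

Insight: A unified feature extraction and fusion framework can significantly improve cross-modal task performance, and adaptive fusion of local features is particularly important for multi-modal semantic segmentation.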

Abstract: RGB-thermal (RGB-T) semantic segmentation improves the environmental perception of autonomous platforms in challenging conditions. Prevailing models employ encoders pre-trained on RGB images to extract features from both RGB and infrared inputs, and design additional modules to achieve cross-modal feature fusion. This results in limited thermal feature extraction and suboptimal cross-modal fusion, while the redundant encoders further compromise the model’s real-time efficiency. To address the above issues, we propose TUNI, with an RGB-T encoder consisting of multiple stacked blocks that simultaneously perform multi-modal feature extraction and cross-modal fusion. By leveraging large-scale pre-training with RGB and pseudo-thermal data, the RGB-T encoder learns to integrate feature extraction and fusion in a unified manner. By slimming down the thermal branch, the encoder achieves a more compact architecture. Moreover, we introduce an RGB-T local module to strengthen the encoder’s capacity for cross-modal local feature fusion. The RGB-T local module employs adaptive cosine similarity to selectively emphasize salient consistent and distinct local features across RGB-T modalities. Experimental results show that TUNI achieves competitive performance with state-of-the-art models on FMB, PST900 and CART, with fewer parameters and lower computational cost. Meanwhile, it achieves an inference speed of 27 FPS on a Jetson Orin NX, demonstrating its real-time capability in deployment. Codes are available at https://github.com/xiaodonguo/TUNI.

[26] Efficient and Accurate Downfacing Visual Inertial Odometry cs.CV | cs.RO | eess.IVPDF

Jonas Kühne, Christian Vogt, Michele Magno, Luca Benini

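TL;DR: The paper proposes an efficient and accurate visual inertial odometry (VIO) pipeline for micro- and nano-UAVs that combines state-of-the-art feature detection and tracking methods, optimized and quantized for low-power RISC-V chips.

Details

Motivation: High-accuracy VIO traditionally runs on computationally powerful systems, whereas micro- and nano-UAVs need lightweight implementations. This work aims to bridge that gap.

Result: On the GAP9 SoC, the optimized pipeline reduces average RMSE by up to a factor of 3.65x over the baseline pipeline when using the ORB feature tracker. PX4FLOW matches ORB's accuracy at a lower runtime for movement speeds below 24 pixels/frame.

Insight: With optimization and quantization, lightweight SoCs can approach the VIO performance of conventional high-power systems, offering a feasible solution for real-time localization on devices such as micro-UAVs.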

Abstract: Visual Inertial Odometry (VIO) is a widely used computer vision method that determines an agent’s movement through a camera and an IMU sensor. This paper presents an efficient and accurate VIO pipeline optimized for applications on micro- and nano-UAVs. The proposed design incorporates state-of-the-art feature detection and tracking methods (SuperPoint, PX4FLOW, ORB), all optimized and quantized for emerging RISC-V-based ultra-low-power parallel systems on chips (SoCs). Furthermore, by employing a rigid body motion model, the pipeline reduces estimation errors and achieves improved accuracy in planar motion scenarios. The pipeline’s suitability for real-time VIO is assessed on an ultra-low-power SoC in terms of compute requirements and tracking accuracy after quantization. The pipeline, including the three feature tracking methods, was implemented on the SoC for real-world validation. This design bridges the gap between high-accuracy VIO pipelines that are traditionally run on computationally powerful systems and lightweight implementations suitable for microcontrollers. The optimized pipeline on the GAP9 low-power SoC demonstrates an average reduction in RMSE of up to a factor of 3.65x over the baseline pipeline when using the ORB feature tracker. The analysis of the computational complexity of the feature trackers further shows that PX4FLOW achieves on-par tracking accuracy with ORB at a lower runtime for movement speeds below 24 pixels/frame.

[27] Hierarchical MLANet: Multi-level Attention for 3D Face Reconstruction From Single Images cs.CVPDF

Danling Cao

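TL;DR: This paper proposes Hierarchical MLANet, a multi-level attention network for reconstructing 3D face models from single images, combining multi-level attention mechanisms with a semi-supervised training strategy and validating the approach on public datasets.

Details

Motivation: 3D face reconstruction has wide applications in computer vision, but the lack of labeled data and the complexity of real-world environments remain major challenges.

Result: Quantitative and qualitative evaluations on the AFLW2000-3D and MICC Florence datasets validate the method's effectiveness.

Insight: Multi-level attention helps capture finer facial features, and the semi-supervised training strategy addresses the shortage of labeled data.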

Abstract: Recovering 3D face models from 2D in-the-wild images has gained considerable attention in the computer vision community due to its wide range of potential applications. However, the lack of ground-truth labeled datasets and the complexity of real-world environments remain significant challenges. In this chapter, we propose a convolutional neural network-based approach, the Hierarchical Multi-Level Attention Network (MLANet), for reconstructing 3D face models from single in-the-wild images. Our model predicts detailed facial geometry, texture, pose, and illumination parameters from a single image. Specifically, we employ a pre-trained hierarchical backbone network and introduce multi-level attention mechanisms at different stages of 2D face image feature extraction. A semi-supervised training strategy is employed, incorporating 3D Morphable Model (3DMM) parameters from publicly available datasets along with a differentiable renderer, enabling an end-to-end training process. Extensive experiments, including both comparative and ablation studies, were conducted on two benchmark datasets, AFLW2000-3D and MICC Florence, focusing on 3D face reconstruction and 3D face alignment tasks. The effectiveness of the proposed method was evaluated both quantitatively and qualitatively.

[28] LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA cs.CVPDF

Jing Huang, Zhiya Tan, Shutao Gong, Fanwei Zeng, Jianshu Li

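TL;DR: LaV-CoT introduces a language-aware visual CoT framework that uses a multi-stage reasoning pipeline and multi-aspect reward optimization to improve the performance and interpretability of multilingual VQA.

Details

Motivation: Existing methods rely mainly on textual CoT and offer limited support for multilingual multimodal reasoning, which constrains their deployment in real-world applications.

Result: The method significantly outperforms open-source baselines and some proprietary models on several public datasets, and an online A/B test supports its potential for industrial deployment.

Insight: Combining language awareness with visual chain-of-thought reasoning can effectively improve the performance and interpretability of multimodal tasks, and the automated data generation method can be extended to other tasks.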

Abstract: As large vision language models (VLMs) advance, their capabilities in multilingual visual question answering (mVQA) have significantly improved. Chain-of-thought (CoT) reasoning has been proven to enhance interpretability and complex reasoning. However, most existing approaches rely primarily on textual CoT and provide limited support for multilingual multimodal reasoning, constraining their deployment in real-world applications. To address this gap, we introduce LaV-CoT, the first Language-aware Visual CoT framework with Multi-Aspect Reward Optimization. LaV-CoT incorporates an interpretable multi-stage reasoning pipeline consisting of Text Summary with Bounding Box (BBox), Language Identification, Spatial Object-level Captioning, and Step-by-step Logical Reasoning. Following this reasoning pipeline, we design an automated data curation method that generates multilingual CoT annotations through iterative generation, correction, and refinement, enabling scalable and high-quality training data. To improve reasoning and generalization, LaV-CoT adopts a two-stage training paradigm combining Supervised Fine-Tuning (SFT) with Language-aware Group Relative Policy Optimization (GRPO), guided by verifiable multi-aspect rewards including language consistency, structural accuracy, and semantic alignment. Extensive evaluations on public datasets including MMMB, Multilingual MMBench, and MTVQA show that LaV-CoT achieves up to ~9.5% accuracy improvements over open-source baselines of similar size and even surpasses models with 2× larger scales by ~2.6%. Moreover, LaV-CoT outperforms advanced proprietary models such as GPT-4o-0513 and Gemini-2.5-flash. We further conducted an online A/B test to validate our method on real-world data, highlighting its effectiveness for industrial deployment. Our code is available at https://github.com/HJNVR/LaV-CoT

[29] Color Me Correctly: Bridging Perceptual Color Spaces and Text Embeddings for Improved Diffusion Generation cs.CVPDF

Sung-Lin Tsai, Bo-Lun Huang, Yu Ting Shen, Cheng Yu Yeo, Chiang Tseng

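TL;DR: This paper proposes a training-free framework that uses a large language model (LLM) to resolve ambiguous color descriptions in text prompts and refines text embeddings in the CIELAB color space, improving the color accuracy of diffusion models.

Details

Motivation: Diffusion models currently struggle with complex color descriptions such as "Tiffany blue" or "lime green", and existing methods cannot systematically resolve ambiguous color descriptions. The paper aims to improve color alignment by refining text embeddings and operating in color space.

Result: Experiments show that the method improves color alignment accuracy without additional training or reference images while preserving image quality.

Insight: Combining the parsing ability of language models with the spatial relationships of a color space can effectively resolve color ambiguity in diffusion models and offers a new approach for generation tasks.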

Abstract: Accurate color alignment in text-to-image (T2I) generation is critical for applications such as fashion, product visualization, and interior design, yet current diffusion models struggle with nuanced and compound color terms (e.g., Tiffany blue, lime green, hot pink), often producing images that are misaligned with human intent. Existing approaches rely on cross-attention manipulation, reference images, or fine-tuning but fail to systematically resolve ambiguous color descriptions. To precisely render colors under prompt ambiguity, we propose a training-free framework that enhances color fidelity by leveraging a large language model (LLM) to disambiguate color-related prompts and guiding color blending operations directly in the text embedding space. Our method first employs a large language model (LLM) to resolve ambiguous color terms in the text prompt, and then refines the text embeddings based on the spatial relationships of the resulting color terms in the CIELAB color space. Unlike prior methods, our approach improves color accuracy without requiring additional training or external reference images. Experimental results demonstrate that our framework improves color alignment without compromising image quality, bridging the gap between text semantics and visual generation.
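
To make "spatial relationships in the CIELAB color space" concrete, the sketch below converts sRGB colors to CIELAB and measures their Euclidean distance (the CIE76 color difference). It assumes 8-bit sRGB input and a D65 white point. It does not reproduce the paper's embedding refinement, and the function names and example colors are illustrative choices.

```python
import math


def _srgb_to_linear(channel):
    """Undo the sRGB transfer curve for one 8-bit channel value."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _lab_f(t):
    """Piecewise cube-root function used in the CIELAB definition."""
    delta = 6.0 / 29.0
    return t ** (1.0 / 3.0) if t > delta ** 3 else t / (3.0 * delta ** 2) + 4.0 / 29.0


def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIELAB (L*, a*, b*), D65 white point."""
    r, g, b = (_srgb_to_linear(v) for v in rgb)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b
    fx, fy, fz = _lab_f(x / 0.95047), _lab_f(y / 1.0), _lab_f(z / 1.08883)
    return (116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz))


def delta_e76(lab_a, lab_b):
    """CIE76 color difference: Euclidean distance in CIELAB."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(lab_a, lab_b)))


white = srgb_to_lab((255, 255, 255))
red = srgb_to_lab((255, 0, 0))
orange = srgb_to_lab((255, 165, 0))
green = srgb_to_lab((0, 255, 0))
red_to_orange = delta_e76(red, orange)
red_to_green = delta_e76(red, green)
```

CIELAB is designed so that distances roughly track perceived color difference, which is why red ends up closer to orange than to green here.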

[30] Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration cs.CV | cs.AIPDF

Yue Zhou, Litong Feng, Mengcheng Lan, Xue Yang, Qingyun Li

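TL;DR: This paper introduces the AVI-Math benchmark for evaluating the multimodal mathematical reasoning of vision-language models on UAV imagery, finds that current models perform poorly on this task, and explores ways to improve them.

Details

Motivation: UAV remote sensing tasks require complex mathematical reasoning, but vision-language models (VLMs) have not been adequately tested in this domain.

Result: Current VLMs perform poorly on AVI-Math, but Chain-of-Thought prompting and fine-tuning show promise.

Insight: Current VLMs have significant limitations in mathematical reasoning. Future work needs to incorporate domain knowledge and improve reasoning to support UAV applications.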

Abstract: Mathematical reasoning is critical for tasks such as precise distance and area computations, trajectory estimations, and spatial analysis in unmanned aerial vehicle (UAV) based remote sensing, yet current vision-language models (VLMs) have not been adequately tested in this domain. To address this gap, we introduce AVI-Math, the first benchmark to rigorously evaluate multimodal mathematical reasoning in aerial vehicle imagery, moving beyond simple counting tasks to include domain-specific knowledge in areas such as geometry, logic, and algebra. The dataset comprises 3,773 high-quality vehicle-related questions captured from UAV views, covering 6 mathematical subjects and 20 topics. The data, collected at varying altitudes and from multiple UAV angles, reflects real-world UAV scenarios, ensuring the diversity and complexity of the constructed mathematical problems. In this paper, we benchmark 14 prominent VLMs through a comprehensive evaluation and demonstrate that, despite their success on previous multimodal benchmarks, these models struggle with the reasoning tasks in AVI-Math. Our detailed analysis highlights significant limitations in the mathematical reasoning capabilities of current VLMs and suggests avenues for future research. Furthermore, we explore the use of Chain-of-Thought prompting and fine-tuning techniques, which show promise in addressing the reasoning challenges in AVI-Math. Our findings not only expose the limitations of VLMs in mathematical reasoning but also offer valuable insights for advancing UAV-based trustworthy VLMs in real-world applications. The code and datasets will be released at https://github.com/VisionXLab/avi-math

[31] BEVTraj: Map-Free End-to-End Trajectory Prediction in Bird’s-Eye View with Deformable Attention and Sparse Goal Proposals cs.CV | I.2.9; I.4.8PDF

Minsang Kong, Myeongjun Kim, Sang Gu Kang, Sang Hun Lee

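TL;DR: BEVTraj is an end-to-end trajectory prediction framework that does not depend on pre-built HD maps and instead predicts directly in bird’s-eye view (BEV) space from real-time sensor data. Deformable attention and a Sparse Goal Candidate Proposal module give it strong and flexible prediction performance.

Details

Motivation: In autonomous driving, trajectory prediction relies on pre-built HD maps or real-time local map construction, but these approaches cannot adapt to transient changes or may introduce errors. BEVTraj aims to avoid these limitations by using real-time sensor data directly.

Result: Experiments show that BEVTraj performs comparably to state-of-the-art HD map-based models while being more flexible.

Insight: BEVTraj's results suggest that using sensor data directly, together with efficient context extraction, can bypass the limitations of pre-built maps and provide a more flexible solution for trajectory prediction.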

Abstract: In autonomous driving, trajectory prediction is essential for ensuring safe and efficient navigation. To improve prediction accuracy, recent approaches often rely on pre-built high-definition (HD) maps or real-time local map construction modules to incorporate static environmental information. However, pre-built HD maps are limited to specific regions and cannot adapt to transient changes. In addition, local map construction modules, which recognize only predefined elements, may fail to capture critical scene details or introduce errors that degrade prediction performance. To overcome these limitations, we propose Bird’s-Eye View Trajectory Prediction (BEVTraj), a novel trajectory prediction framework that operates directly in the bird’s-eye view (BEV) space utilizing real-time sensor data without relying on any pre-built maps. The BEVTraj leverages deformable attention to efficiently extract relevant context from dense BEV features. Furthermore, we introduce a Sparse Goal Candidate Proposal (SGCP) module, which enables full end-to-end prediction without requiring any post-processing steps. Extensive experiments demonstrate that the BEVTraj achieves performance comparable to state-of-the-art HD map-based models while offering greater flexibility by eliminating the dependency on pre-built maps. The source code is available at https://github.com/Kongminsang/bevtraj.

[32] Leveraging Multi-View Weak Supervision for Occlusion-Aware Multi-Human Parsing cs.CVPDF

Laura Bragagnolo, Matteo Terreran, Leonardo Barcellona, Stefano Ghidoni

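TL;DR: The paper proposes a new training framework that uses multi-view weak supervision to improve multi-human parsing under occlusion, combining weak supervision with a multi-view consistency loss to improve parsing in occluded scenes.

Details

Motivation: State-of-the-art methods perform well on public datasets, but their segmentation quality drops considerably when bodies overlap. Observing that occluded people may appear separated from a different viewpoint, the authors use multi-view information to improve multi-human parsing under occlusion.

Result: Experiments show up to a 4.20% relative improvement in human parsing over the baseline model in occlusion scenarios.

Insight: Multi-view information can effectively mitigate occlusion, and combining weak supervision with semi-automatic annotation is an effective strategy when labeled data is scarce.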

Abstract: Multi-human parsing is the task of segmenting human body parts while associating each part to the person it belongs to, combining instance-level and part-level information for fine-grained human understanding. In this work, we demonstrate that, while state-of-the-art approaches achieved notable results on public datasets, they struggle considerably in segmenting people with overlapping bodies. From the intuition that overlapping people may appear separated from a different point of view, we propose a novel training framework exploiting multi-view information to improve multi-human parsing models under occlusions. Our method integrates such knowledge during the training process, introducing a novel approach based on weak supervision on human instances and a multi-view consistency loss. Given the lack of suitable datasets in the literature, we propose a semi-automatic annotation strategy to generate human instance segmentation masks from multi-view RGB+D data and 3D human skeletons. The experiments demonstrate that the approach can achieve up to a 4.20% relative improvement on human parsing over the baseline model in occlusion scenarios.

[33] VARCO-VISION-2.0 Technical Report cs.CV | cs.CLPDF

Young-rok Cha, Jeongho Ju, SunYoung Park, Jong-Hyeon Lee, Younghyun Yu

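TL;DR: VARCO-VISION-2.0 is an improved bilingual (Korean and English) vision-language model with better multi-image understanding and layout-aware OCR, trained with a four-stage curriculum and memory-efficient techniques to enhance multimodal alignment and safety.

Details

Motivation: The goal is a stronger bilingual vision-language model that supports complex inputs such as documents, charts, and tables while improving performance and safety.

Result: The model performs well on the OpenCompass VLM leaderboard (the 14B model ranks 8th among models of comparable scale), shows strong spatial grounding, and supports bilingual tasks.

Insight: Curriculum learning and preference optimization can effectively improve the performance and safety of multimodal models, and the release demonstrates the practicality of open-weight models.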

Abstract: We introduce VARCO-VISION-2.0, an open-weight bilingual vision-language model (VLM) for Korean and English with improved capabilities compared to the previous model VARCO-VISION-14B. The model supports multi-image understanding for complex inputs such as documents, charts, and tables, and delivers layout-aware OCR by predicting both textual content and its spatial location. Trained with a four-stage curriculum with memory-efficient techniques, the model achieves enhanced multimodal alignment, while preserving core language abilities and improving safety via preference optimization. Extensive benchmark evaluations demonstrate strong spatial grounding and competitive results for both languages, with the 14B model achieving 8th place on the OpenCompass VLM leaderboard among models of comparable scale. Alongside the 14B-scale model, we release a 1.7B version optimized for on-device deployment. We believe these models advance the development of bilingual VLMs and their practical applications. Two variants of VARCO-VISION-2.0 are available at Hugging Face: a full-scale 14B model and a lightweight 1.7B model.

[34] A Lightweight Ensemble-Based Face Image Quality Assessment Method with Correlation-Aware Loss cs.CVPDF

MohammadAli Hamidi, Hadi Amirpour, Luigi Atzori, Christian Timmerer

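TL;DR: The paper proposes a lightweight ensemble-based face image quality assessment method that combines MobileNetV3-Small and ShuffleNetV2 and optimizes performance with a correlation-aware loss (MSECorrLoss). The method balances accuracy and computational efficiency, making it suitable for practical use.

Details

Motivation: General-purpose no-reference image quality assessment methods struggle to capture face-specific degradations, while state-of-the-art FIQA models are computationally intensive, which limits practical deployment.

Result: On the VQualA FIQA benchmark, the method reaches an SRCC of 0.9829 and a PLCC of 0.9894 while meeting the efficiency constraints.

Insight: Ensembling lightweight networks with a correlation-aware loss can markedly improve FIQA performance while remaining efficient, providing a feasible solution for deployment.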

Abstract: Face image quality assessment (FIQA) plays a critical role in face recognition and verification systems, especially in uncontrolled, real-world environments. Although several methods have been proposed, general-purpose no-reference image quality assessment techniques often fail to capture face-specific degradations. Meanwhile, state-of-the-art FIQA models tend to be computationally intensive, limiting their practical applicability. We propose a lightweight and efficient method for FIQA, designed for the perceptual evaluation of face images in the wild. Our approach integrates an ensemble of two compact convolutional neural networks, MobileNetV3-Small and ShuffleNetV2, with prediction-level fusion via simple averaging. To enhance alignment with human perceptual judgments, we employ a correlation-aware loss (MSECorrLoss), combining mean squared error (MSE) with a Pearson correlation regularizer. Our method achieves a strong balance between accuracy and computational cost, making it suitable for real-world deployment. Experiments on the VQualA FIQA benchmark demonstrate that our model achieves a Spearman rank correlation coefficient (SRCC) of 0.9829 and a Pearson linear correlation coefficient (PLCC) of 0.9894, remaining within competition efficiency constraints.
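
The two ingredients named in the abstract, prediction-level fusion by simple averaging and an MSE loss with a Pearson correlation regularizer, can be sketched in plain Python. The exact form and weighting of MSECorrLoss are not given in the abstract. The `mse + corr_weight * (1 - r)` form, the `corr_weight` parameter, and the `eps` stabilizer below are assumptions made for illustration.

```python
import math


def ensemble_average(scores_a, scores_b):
    """Fuse two models' quality predictions by element-wise averaging."""
    return [(a + b) / 2.0 for a, b in zip(scores_a, scores_b)]


def pearson(pred, target, eps=1e-12):
    """Pearson linear correlation; eps guards against zero variance."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / (math.sqrt(var_p * var_t) + eps)


def mse_corr_loss(pred, target, corr_weight=1.0):
    """Hypothetical correlation-aware loss: MSE + corr_weight * (1 - Pearson r)."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return mse + corr_weight * (1.0 - pearson(pred, target))


fused = ensemble_average([0.2, 0.6, 0.9], [0.4, 0.6, 0.7])
perfect = mse_corr_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
reversed_loss = mse_corr_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

The correlation term penalizes predictions that rank or scale images inconsistently with the labels even when their pointwise error is small, which matches the stated goal of aligning with perceptual judgments.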

[35] LayerLock: Non-collapsing Representation Learning with Progressive Freezing cs.CVPDF

Goker Erdogan, Nikhil Parthasarathy, Catalin Ionescu, Drew Hudson, Alexander Lerchner

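TL;DR: The paper proposes LayerLock, which transitions from pixel prediction to latent prediction through progressive layer freezing, avoids representation collapse in self-supervised visual representation learning, and surpasses non-latent masked prediction on large models.

Details

Motivation: The authors observe that when training video masked-autoencoding (MAE) models, ViT layers converge in order of depth: shallower layers converge early and deeper layers late. This motivates progressive freezing to accelerate training and avoid representation collapse.

Result: On the 4DS perception suite, LayerLock outperforms non-latent masked prediction.

Insight: Progressive layer freezing not only accelerates training but also provides a simple, scalable approach to latent prediction that avoids representation collapse.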

Abstract: We introduce LayerLock, a simple yet effective approach for self-supervised visual representation learning, that gradually transitions from pixel to latent prediction through progressive layer freezing. First, we make the observation that during training of video masked-autoencoding (MAE) models, ViT layers converge in the order of their depth: shallower layers converge early, deeper layers converge late. We then show that this observation can be exploited to accelerate standard MAE by progressively freezing the model according to an explicit schedule, throughout training. Furthermore, this same schedule can be used in a simple and scalable approach to latent prediction that does not suffer from “representation collapse”. We apply our proposed approach, LayerLock, to large models of up to 4B parameters with results surpassing those of non-latent masked prediction on the 4DS perception suite.
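
An explicit freezing schedule of the kind described can be sketched as a lookup from training step to the number of frozen shallow layers. The schedule values, function names, and list-of-pairs format below are hypothetical. The abstract does not give the actual schedule, and the accompanying switch from pixel targets to latent targets is not shown.

```python
def num_frozen_layers(step, schedule):
    """Return how many of the shallowest layers are frozen at `step`.

    `schedule` is a list of (start_step, layers_frozen) pairs; the entry
    with the largest start_step not exceeding `step` applies.
    """
    frozen = 0
    for start_step, layers in sorted(schedule):
        if step >= start_step:
            frozen = layers
    return frozen


def trainable_mask(step, schedule, num_layers):
    """Per-layer flags, shallowest first: True means the layer still trains."""
    frozen = min(num_frozen_layers(step, schedule), num_layers)
    return [index >= frozen for index in range(num_layers)]


# Illustrative schedule: freeze 2 layers from step 1000 and 4 from step 2000.
schedule = [(1000, 2), (2000, 4)]
early = trainable_mask(500, schedule, num_layers=6)
late = trainable_mask(2500, schedule, num_layers=6)
```

Freezing from the shallow end follows the abstract's observation that shallower layers converge first, so they can stop receiving gradient updates earlier.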

[36] On the Geometric Accuracy of Implicit and Primitive-based Representations Derived from View Rendering Constraints cs.CVPDF

Elias De Smijter, Renaud Detry, Christophe De Vleeschouwer

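TL;DR: This paper presents the first systematic comparison of implicit and explicit novel view synthesis methods for space-based 3D object reconstruction, focusing on the role of appearance embeddings. Embeddings improve photometric fidelity but yield little gain in geometric accuracy, and explicit methods such as Convex Splatting are more efficient and compact.

Details

Motivation: Geometric accuracy is a key requirement for 3D reconstruction in space robotics, but existing work focuses mostly on photometric fidelity and rarely examines geometric accuracy. This paper aims to fill that gap and clarify the limits of appearance embeddings for geometric tasks.

Result: Experiments show that embeddings mainly reduce the number of primitives that explicit methods need rather than improving geometric accuracy. Convex Splatting is more compact than Gaussian Splatting and suits safety-critical tasks.

Insight: For geometry-centric tasks, appearance embeddings are of limited use, and explicit methods have advantages in efficiency and safety, offering a new view of the trade-offs in reconstructing space scenes.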

Abstract: We present the first systematic comparison of implicit and explicit Novel View Synthesis methods for space-based 3D object reconstruction, evaluating the role of appearance embeddings. While embeddings improve photometric fidelity by modeling lighting variation, we show they do not translate into meaningful gains in geometric accuracy - a critical requirement for space robotics applications. Using the SPEED+ dataset, we compare K-Planes, Gaussian Splatting, and Convex Splatting, and demonstrate that embeddings primarily reduce the number of primitives needed for explicit methods rather than enhancing geometric fidelity. Moreover, convex splatting achieves more compact and clutter-free representations than Gaussian splatting, offering advantages for safety-critical applications such as interaction and collision avoidance. Our findings clarify the limits of appearance embeddings for geometry-centric tasks and highlight trade-offs between reconstruction quality and representation efficiency in space scenarios.

[37] GAMMA: Generalizable Alignment via Multi-task and Manipulation-Augmented Training for AI-Generated Image Detection cs.CVPDF

Haozhen Yan, Yan Hong, Suning Lang, Jiahui Zhan, Yikun Ji

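TL;DR: GAMMA proposes a new training framework that improves the generalization of AI-generated image detection through multi-task and manipulation-augmented training.

Details

Motivation: As generative models grow more diverse and sophisticated, existing AI-generated image detectors generalize poorly to unseen generative models and need improvement.

Result: Achieves state-of-the-art generalization on the GenImage benchmark and remains robust on new models such as GPT-4o.

Insight: Manipulation augmentation and multi-task learning reduce a model’s reliance on generation-specific artifacts and improve generalization.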

Abstract: With generative models becoming increasingly sophisticated and diverse, detecting AI-generated images has become increasingly challenging. While existing AI-generated image detectors achieve promising performance on in-distribution generated images, their generalization to unseen generative models remains limited. This limitation is largely attributed to their reliance on generation-specific artifacts, such as stylistic priors and compression patterns. To address these limitations, we propose GAMMA, a novel training framework designed to reduce domain bias and enhance semantic alignment. GAMMA introduces diverse manipulation strategies, such as inpainting-based manipulation and semantics-preserving perturbations, to ensure consistency between manipulated and authentic content. We employ multi-task supervision with dual segmentation heads and a classification head, enabling pixel-level source attribution across diverse generative domains. In addition, a reverse cross-attention mechanism is introduced to allow the segmentation heads to guide and correct biased representations in the classification branch. Our method not only achieves state-of-the-art generalization performance on the GenImage benchmark, improving accuracy by 5.8%, but also maintains strong robustness on newly released generative models such as GPT-4o.

[38] Robustness and Diagnostic Performance of Super-Resolution Fetal Brain MRI cs.CVPDF

Ema Masterl, Tina Vipotnik Vesnaver, Žiga Špiclin

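TL;DR: The paper studies three super-resolution reconstruction (SRR) methods (NiftyMIC, SVRTK, and NeSVoR) for fetal brain MRI, evaluating visual quality, reconstruction success rate, volumetric measurement agreement, and diagnostic classification performance. NeSVoR has the best reconstruction success rate and consistency, and diagnostic performance is unaffected by the choice of SRR method.

Details

Motivation: Fetal brain MRI relies on rapid multi-view 2D slice acquisition, which yields low resolution and susceptibility to motion artifacts. The performance of existing SRR methods on pathological cases and their effect on downstream tasks remain underexplored.

Result: NeSVoR achieves the highest reconstruction success rate (>90%) on both healthy controls and pathological cases. Volumetric estimates differ significantly between SRR methods, but pathology classification performance is unaffected.

Insight: 1. NeSVoR is the preferred choice for its high reconstruction success rate and consistency; 2. Differences in volumetric estimates do not affect diagnostic performance, so the SRR method can be chosen on other criteria (such as speed or compute resources).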

Abstract: Fetal brain MRI relies on rapid multi-view 2D slice acquisitions to reduce motion artifacts caused by fetal movement. However, these stacks are typically low resolution, may suffer from motion corruption, and do not adequately capture 3D anatomy. Super-resolution reconstruction (SRR) methods aim to address these limitations by combining slice-to-volume registration and super-resolution techniques to generate high-resolution (HR) 3D volumes. While several SRR methods have been proposed, their comparative performance - particularly in pathological cases - and their influence on downstream volumetric analysis and diagnostic tasks remain underexplored. In this study, we applied three state-of-the-art SRR methods - NiftyMIC, SVRTK, and NeSVoR - to 140 fetal brain MRI scans, including both healthy controls (HC) and pathological cases (PC) with ventriculomegaly (VM). Each HR reconstruction was segmented using the BoUNTi algorithm to extract volumes of nine principal brain structures. We evaluated visual quality, SRR success rates, volumetric measurement agreement, and diagnostic classification performance. NeSVoR demonstrated the highest and most consistent reconstruction success rate (>90%) across both HC and PC groups. Although significant differences in volumetric estimates were observed between SRR methods, classification performance for VM was not affected by the choice of SRR method. These findings highlight NeSVoR’s robustness and the resilience of diagnostic performance despite SRR-induced volumetric variability.

[39] Mask Consistency Regularization in Object Removal cs.CVPDF

Hua Yuan, Jin Yuan, Yicheng Jiang, Yao Zhang, Xin Geng

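TL;DR: The paper proposes Mask Consistency Regularization (MCR), a new training strategy that addresses two main problems in the object removal task of image inpainting: mask hallucination and mask-shape bias.

Details

Motivation: Current diffusion models for object removal suffer from mask hallucination (generating irrelevant content) and mask-shape bias (filled content follows the mask’s shape rather than the context), which degrade inpainting quality.

Result: Experiments show that MCR significantly reduces mask hallucination and shape bias and improves inpainting results for object removal.

Insight: Consistency constraints under mask perturbation help the model make better use of context and produce more natural inpainting results.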

Abstract: Object removal, a challenging task within image inpainting, involves seamlessly filling the removed region with content that matches the surrounding context. Despite advancements in diffusion models, current methods still face two critical challenges. The first is mask hallucination, where the model generates irrelevant or spurious content inside the masked region, and the second is mask-shape bias, where the model fills the masked area with an object that mimics the mask’s shape rather than surrounding content. To address these issues, we propose Mask Consistency Regularization (MCR), a novel training strategy designed specifically for object removal tasks. During training, our approach introduces two mask perturbations: dilation and reshape, enforcing consistency between the outputs of these perturbed branches and the original mask. The dilated masks help align the model’s output with the surrounding content, while reshaped masks encourage the model to break the mask-shape bias. This combination of strategies enables MCR to produce more robust and contextually coherent inpainting results. Our experiments demonstrate that MCR significantly reduces hallucinations and mask-shape bias, leading to improved performance in object removal.
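Sketch (not from the paper): the dilation perturbation and the consistency idea can be illustrated on plain 2D lists. The square structuring element, the `radius` parameter, and the mean-squared-difference penalty are assumptions chosen for brevity; the abstract does not specify the perturbation details or the loss.

```python
def dilate_mask(mask, radius=1):
    """Binary dilation of a 2D 0/1 mask with a square (2*radius+1) window."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                # Mark every in-bounds neighbour inside the window.
                for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                        out[ny][nx] = 1
    return out


def consistency_penalty(output_a, output_b):
    """Mean squared difference between two same-sized 2D outputs (assumed loss)."""
    diffs = [(a - b) ** 2
             for row_a, row_b in zip(output_a, output_b)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)
```

During training, a model would be run once with the original mask and once with the dilated mask, and a penalty of this kind would pull the two outputs together.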

[40] MagicMirror: A Large-Scale Dataset and Benchmark for Fine-Grained Artifacts Assessment in Text-to-Image Generation cs.CVPDF

Jia Wang, Jie Hu, Xiaoqi Ma, Hanghang Ma, Yanbing Zeng

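TL;DR: MagicMirror is a framework for systematically assessing fine-grained physical artifacts in text-to-image (T2I) generation. It comprises MagicData340K, the first large-scale human-annotated dataset; the assessment model MagicAssessor; and the automated benchmark MagicBench.

Details

Motivation: T2I generation has advanced in instruction following and aesthetics, but generated images often contain physical artifacts such as anatomical and structural flaws that severely degrade perceptual quality, and existing benchmarks lack fine-grained evaluation of these problems.

Result: The evaluation shows that even top-tier T2I models such as GPT-image-1 exhibit significant artifacts, making artifact reduction a key direction for future T2I development.

Insight: Physical artifacts are a major challenge in T2I generation that calls for systematic evaluation and improvement; MagicMirror provides data and tools to support that research.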

Abstract: Text-to-image (T2I) generation has achieved remarkable progress in instruction following and aesthetics. However, a persistent challenge is the prevalence of physical artifacts, such as anatomical and structural flaws, which severely degrade perceptual quality and limit application. Given the diversity and complexity of these artifacts, a systematic and fine-grained evaluation framework is required, which is lacking in current benchmarks. To fill this gap, we introduce MagicMirror, a comprehensive framework for artifacts assessment. We first establish a detailed taxonomy of generated image artifacts. Guided by this taxonomy, we manually annotate MagicData340K, the first human-annotated large-scale dataset of 340K generated images with fine-grained artifact labels. Building on this dataset, we train MagicAssessor, a Vision-Language Model (VLM) that provides detailed assessments and corresponding labels. To overcome challenges like class imbalance and reward hacking, we design a novel data sampling strategy and a multi-level reward system for Group Relative Policy Optimization (GRPO). Finally, we leverage MagicAssessor to construct MagicBench, an automated benchmark for evaluating the image artifacts of current T2I models. Our evaluation with MagicBench reveals that despite their widespread adoption, even top-tier models like GPT-image-1 are consistently plagued by significant artifacts, highlighting artifact reduction as a critical frontier for future T2I development. Project page: https://wj-inf.github.io/MagicMirror-page/.

[41] SignClip: Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion cs.CV | cs.AIPDF

Wenfang Wu, Tingting Yuan, Yupeng Li, Daling Wang, Xiaoming Fu

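TL;DR: SignClip is a new sign language translation framework that fuses hand-gesture and lip-movement features and adds hierarchical contrastive learning to improve translation accuracy.

Details

Motivation: Existing sign language translation methods focus mainly on manual signals and overlook important non-manual cues such as mouthing, which are crucial for distinguishing visually similar signs.

Result: On PHOENIX14T in the gloss-free setting, SignClip raises BLEU-4 from 24.32 to 24.71 and ROUGE from 46.57 to 48.38, outperforming SpaMo.

Insight: Non-manual cues such as mouthing carry important semantic information in sign language translation, and multimodal fusion with contrastive learning can improve translation performance.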

Abstract: Sign language translation (SLT) aims to translate sign language videos into natural language, serving as a vital bridge for inclusive communication. While recent advances leverage powerful visual backbones and large language models, most approaches mainly focus on manual signals (hand gestures) and tend to overlook non-manual cues like mouthing. In fact, mouthing conveys essential linguistic information in sign languages and plays a crucial role in disambiguating visually similar signs. In this paper, we propose SignClip, a novel framework to improve the accuracy of sign language translation. It fuses manual and non-manual cues, specifically spatial gesture and lip movement features. In addition, SignClip introduces a hierarchical contrastive learning framework with multi-level alignment objectives, ensuring semantic consistency across sign-lip and visual-text modalities. Extensive experiments on two benchmark datasets, PHOENIX14T and How2Sign, demonstrate the superiority of our approach. For example, on PHOENIX14T, in the Gloss-free setting, SignClip surpasses the previous state-of-the-art model SpaMo, improving BLEU-4 from 24.32 to 24.71, and ROUGE from 46.57 to 48.38.

[42] Detecting Text Manipulation in Images using Vision Language Models cs.CVPDF

Vidit Vidit, Pavel Korshunov, Amir Mohammadi, Christophe Ecabert, Ketan Kotwal

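TL;DR: The paper analyzes open- and closed-source vision language models (VLMs) on text manipulation detection, addressing a gap in prior work. It finds that open-source models still trail closed-source ones and that manipulation-detection-specific models generalize poorly.

Details

Motivation: Current research focuses on image manipulation detection, while text manipulation detection is understudied. The paper aims to fill this gap and compare open- and closed-source models on the task.

Result: Open-source models are improving on text manipulation detection but still lag behind closed-source models. Specialized models also show generalization problems.

Insight: Text manipulation detection is challenging. Current open-source models need further improvement to compete with closed-source ones, and the generalization of specialized models deserves attention.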

Abstract: Recent works have shown the effectiveness of Large Vision Language Models (VLMs or LVLMs) in image manipulation detection. However, text manipulation detection is largely missing in these studies. We bridge this knowledge gap by analyzing closed- and open-source VLMs on different text manipulation datasets. Our results suggest that open-source models are getting closer, but are still behind closed-source ones like GPT-4o. Additionally, we benchmark image manipulation detection-specific VLMs for text manipulation detection and show that they suffer from the generalization problem. We benchmark VLMs for manipulations done on in-the-wild scene texts and on fantasy ID cards, where the latter mimic a challenging real-world misuse.

[43] MCL-AD: Multimodal Collaboration Learning for Zero-Shot 3D Anomaly Detection cs.CV | cs.LGPDF

Gang Li, Tianjiao Chen, Mingle Zhou, Min Li, Delong Han

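TL;DR: MCL-AD is a novel framework for zero-shot 3D (ZS-3D) anomaly detection that uses multimodal collaboration learning across point clouds, RGB images, and text semantics to outperform existing methods.

Details

Motivation: Most existing methods focus only on point cloud data and ignore the rich semantic cues in other modalities such as RGB images and text, which limits performance in zero-shot settings.

Result: Experiments show that MCL-AD achieves state-of-the-art performance on ZS-3D anomaly detection.

Insight: Multimodal collaboration improves 3D anomaly detection in zero-shot settings, and introducing text semantics in particular strengthens generalization.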

Abstract: Zero-shot 3D (ZS-3D) anomaly detection aims to identify defects in 3D objects without relying on labeled training data, making it especially valuable in scenarios constrained by data scarcity, privacy, or high annotation cost. However, most existing methods focus exclusively on point clouds, neglecting the rich semantic cues available from complementary modalities such as RGB images and text priors. This paper introduces MCL-AD, a novel framework that leverages multimodal collaboration learning across point clouds, RGB images, and text semantics to achieve superior zero-shot 3D anomaly detection. Specifically, we propose a Multimodal Prompt Learning Mechanism (MPLM) that enhances the intra-modal representation capability and inter-modal collaborative learning by introducing an object-agnostic decoupled text prompt and a multimodal contrastive loss. In addition, a collaborative modulation mechanism (CMM) is proposed to fully leverage the complementary representations of point clouds and RGB images by jointly modulating the RGB image-guided and point cloud-guided branches. Extensive experiments demonstrate that the proposed MCL-AD framework achieves state-of-the-art performance in ZS-3D anomaly detection.

[44] Adversarial robustness through Lipschitz-Guided Stochastic Depth in Neural Networks cs.CVPDF

Laith Nayal, Mahmoud Mousatat, Bader Rasheed

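TL;DR: The paper proposes a Lipschitz-guided stochastic depth (DropPath) method that adjusts per-layer drop probabilities to control the Lipschitz constant, improving adversarial robustness while preserving clean accuracy and reducing computation.

Details

Motivation: Deep neural networks and Vision Transformers perform well in computer vision but are highly vulnerable to adversarial perturbations. Conventional defenses often come with high computational cost or lack formal guarantees.

Result: On CIFAR-10 with ViT-Tiny, the method keeps clean accuracy close to baseline, improves robustness under FGSM, PGD-20, and AutoAttack, and reduces computation (FLOPs).

Insight: Tuning drop probabilities by depth can effectively control a network’s Lipschitz constant, suggesting a new way to design efficient and robust networks.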

Abstract: Deep neural networks and Vision Transformers achieve state-of-the-art performance in computer vision but are highly vulnerable to adversarial perturbations. Standard defenses often incur high computational cost or lack formal guarantees. We propose a Lipschitz-guided stochastic depth (DropPath) method, where drop probabilities increase with depth to control the effective Lipschitz constant of the network. This approach regularizes deeper layers, improving robustness while preserving clean accuracy and reducing computation. Experiments on CIFAR-10 with ViT-Tiny show that our custom depth-dependent schedule maintains near-baseline clean accuracy, enhances robustness under FGSM, PGD-20, and AutoAttack, and significantly reduces FLOPs compared to baseline and linear DropPath schedules.
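Sketch (not from the paper): a depth-dependent drop schedule and its effect on a simple Lipschitz upper bound. The power-law form, the names `drop_probs` and `lipschitz_upper_bound`, and the per-branch Lipschitz constants are assumptions; the paper's custom schedule is not given in the abstract. The bound assumes residual blocks of the form x + (1 - p_i) * f_i(x), as in the expected (inference-time) behaviour of stochastic depth.

```python
def drop_probs(num_blocks, p_max, power=1.0):
    """Drop probability per block, increasing with depth (shallowest first).

    power=1.0 gives the linear schedule; other values are hypothetical variants.
    """
    if num_blocks == 1:
        return [p_max]
    return [p_max * (i / (num_blocks - 1)) ** power for i in range(num_blocks)]


def lipschitz_upper_bound(branch_lipschitz, probs):
    """Product bound prod(1 + (1 - p_i) * L_i) over the residual blocks."""
    bound = 1.0
    for l_i, p_i in zip(branch_lipschitz, probs):
        bound *= 1.0 + (1.0 - p_i) * l_i
    return bound
```

Under these assumptions, raising the drop probability of a block shrinks its factor in the product, so a schedule that drops deeper blocks more often lowers the overall bound.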

[45] A Stochastic Birth-and-Death Approach for Street Furniture Geolocation in Urban Environments cs.CVPDF

Evan Murphy, Marco Viola, Vladimir A. Krylov

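TL;DR: The paper proposes a method for precise geolocation of street furniture based on a stochastic birth-and-death algorithm. Energy maps integrate geospatial information to improve the accuracy and scalability of asset localization in urban environments.

Details

Motivation: Precise geolocation of street furniture is essential for monitoring and maintaining public infrastructure, but existing methods struggle with complex urban environments and external geospatial information.

Result: Experiments on a realistic simulation, informed by a geolocated dataset of street lighting in Dublin city centre, indicate that the method is accurate and scalable in complex urban environments.

Insight: Integrating external geospatial information can improve the accuracy of urban asset localization and offers a new technical tool for city management.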

Abstract: In this paper we address the problem of precise geolocation of street furniture in complex urban environments, which is a critical task for effective monitoring and maintenance of public infrastructure by local authorities and private stakeholders. To this end, we propose a probabilistic framework based on energy maps that encode the spatial likelihood of object locations. Representing the energy in a map-based geopositioned format allows the optimisation process to seamlessly integrate external geospatial information, such as GIS layers, road maps, or placement constraints, which improves contextual awareness and localisation accuracy. A stochastic birth-and-death optimisation algorithm is introduced to infer the most probable configuration of assets. We evaluate our approach using a realistic simulation informed by a geolocated dataset of street lighting infrastructure in Dublin city centre, demonstrating its potential for scalable and accurate urban asset mapping. The implementation of the algorithm will be made available in the GitHub repository https://github.com/EMurphy0108/SBD_Street_Furniture.

[46] Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching cs.CVPDF

Zhixin Zheng, Xinyu Wang, Chang Zou, Shaobo Wang, Linfeng Zhang

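TL;DR: The paper proposes ClusCa, which uses spatial clustering to reduce the number of tokens computed in diffusion transformers, substantially cutting computational cost while maintaining generation quality.

Details

Motivation: Diffusion transformers excel at high-quality image and video generation, but their iterative denoising process is computationally very expensive. Existing feature caching methods exploit only temporal similarity and ignore spatial similarity.

Result: Experiments on DiT, FLUX, and HunyuanVideo show that ClusCa significantly accelerates generation, for example a 4.96x speedup on FLUX with an ImageReward 0.51% above the original model.

Insight: Feature similarity along the spatial dimension complements the temporal dimension and can markedly improve the efficiency of diffusion transformers without additional training.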

Abstract: Diffusion transformers have gained significant attention in recent years for their ability to generate high-quality images and videos, yet still suffer from a huge computational cost due to their iterative denoising process. Recently, feature caching has been introduced to accelerate diffusion transformers by caching the feature computation in previous timesteps and reusing it in the following timesteps, which leverages the temporal similarity of diffusion models while ignoring the similarity in the spatial dimension. In this paper, we introduce Cluster-Driven Feature Caching (ClusCa) as an orthogonal and complementary perspective for previous feature caching. Specifically, ClusCa performs spatial clustering on tokens in each timestep, computes only one token in each cluster and propagates their information to all the other tokens, which is able to reduce the number of tokens by over 90%. Extensive experiments on DiT, FLUX and HunyuanVideo demonstrate its effectiveness in both text-to-image and text-to-video generation. Besides, it can be directly applied to any diffusion transformer without requirements for training. For instance, ClusCa achieves 4.96x acceleration on FLUX with an ImageReward of 99.49%, surpassing the original model by 0.51%. The code is available at https://github.com/Shenyi-Z/Cache4Diffusion.
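Sketch (not from the paper or its repository): the "compute one token per cluster, propagate to the rest" step, reduced to its simplest form. The function name, the precomputed `cluster_ids` input, the choice of the first token seen as representative, and the plain copy used for propagation are all assumptions; the actual method may select representatives and mix in cached features differently.

```python
def compute_with_clusters(tokens, cluster_ids, expensive_fn):
    """Run `expensive_fn` on one token per cluster and copy the result to the rest.

    Returns (outputs, calls): outputs are aligned with `tokens`, and `calls`
    is the number of times `expensive_fn` was actually invoked.
    """
    cache = {}
    outputs = []
    for token, cid in zip(tokens, cluster_ids):
        if cid not in cache:
            # The first token seen in a cluster acts as its representative.
            cache[cid] = expensive_fn(token)
        outputs.append(cache[cid])
    return outputs, len(cache)
```

With this layout the number of expensive calls equals the number of clusters rather than the number of tokens, which is where the saving described in the abstract would come from.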

[47] I-Segmenter: Integer-Only Vision Transformer for Efficient Semantic Segmentation cs.CV | cs.AI | cs.LGPDF

Jordan Sassoon, Michal Szczepanski, Martyna Poreba

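TL;DR: The paper proposes I-Segmenter, a fully integer-only ViT semantic segmentation framework. With the $\lambda$-ShiftGELU activation function and integer-only operator optimizations, it substantially reduces model size and computational cost while keeping accuracy close to the FP32 baseline.

Details

Motivation: Vision Transformers (ViTs) perform well in semantic segmentation, but their high memory footprint and computational cost limit deployment on resource-constrained devices. Quantization is an effective way to improve efficiency, but ViTs are fragile at low precision, and quantization errors accumulate through deep encoder-decoder structures. An efficient integer-only framework is therefore needed.

Result: I-Segmenter stays within 5.1% of its FP32 baseline on average, reduces model size by up to 3.8x, and speeds up inference by up to 1.2x. It remains competitive under one-shot PTQ with a single calibration image.

Insight: 1. Integer-only ViTs are feasible and efficient for semantic segmentation; 2. Activation function design is critical to the stability of quantized models; 3. Simple changes (such as removing the L2 normalization layer) matter substantially for integer-only execution.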

Abstract: Vision Transformers (ViTs) have recently achieved strong results in semantic segmentation, yet their deployment on resource-constrained devices remains limited due to their high memory footprint and computational cost. Quantization offers an effective strategy to improve efficiency, but ViT-based segmentation models are notoriously fragile under low precision, as quantization errors accumulate across deep encoder-decoder pipelines. We introduce I-Segmenter, the first fully integer-only ViT segmentation framework. Building on the Segmenter architecture, I-Segmenter systematically replaces floating-point operations with integer-only counterparts. To further stabilize both training and inference, we propose $\lambda$-ShiftGELU, a novel activation function that mitigates the limitations of uniform quantization in handling long-tailed activation distributions. In addition, we remove the L2 normalization layer and replace bilinear interpolation in the decoder with nearest neighbor upsampling, ensuring integer-only execution throughout the computational graph. Extensive experiments show that I-Segmenter achieves accuracy within a reasonable margin of its FP32 baseline (5.1% on average), while reducing model size by up to 3.8x and enabling up to 1.2x faster inference with optimized runtimes. Notably, even in one-shot PTQ with a single calibration image, I-Segmenter delivers competitive accuracy, underscoring its practicality for real-world deployment.
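Sketch (not from the paper): why swapping bilinear interpolation for nearest neighbor upsampling keeps the decoder in integer arithmetic. Nearest neighbor only copies existing values using integer index division, whereas bilinear interpolation blends neighbours with fractional weights. The function name and the integer `factor` parameter are assumptions for illustration.

```python
def nearest_upsample(grid, factor):
    """Upsample a 2D integer grid by an integer factor using only index arithmetic."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    height, width = len(grid), len(grid[0])
    # Each output pixel copies the source pixel at (y // factor, x // factor).
    return [[grid[y // factor][x // factor] for x in range(width * factor)]
            for y in range(height * factor)]
```

Every output value is one of the input values, so no new quantization levels or rescaling steps are introduced at this stage.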

[48] GARD: Gamma-based Anatomical Restoration and Denoising for Retinal OCT cs.CVPDF

Botond Fazekas, Thomas Pinetz, Guilherme Aresta, Taha Emre, Hrvoje Bogunovic

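TL;DR: GARD is a gamma-distribution-based diffusion probabilistic model for denoising and structure restoration in retinal OCT images that significantly outperforms traditional methods and existing deep learning models.

Details

Motivation: Speckle noise in OCT images reduces diagnostic accuracy, and existing methods struggle to balance noise reduction against preserving anatomical structure. GARD targets this problem.

Result: Outperforms traditional and deep learning methods on PSNR, SSIM, and MSE; qualitative results also show sharper edges and better detail preservation.

Insight: The gamma distribution better matches the statistics of OCT speckle noise, and combining it with a fidelity term avoids reintroducing high-frequency noise, offering a new approach to medical image denoising.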

Abstract: Optical Coherence Tomography (OCT) is a vital imaging modality for diagnosing and monitoring retinal diseases. However, OCT images are inherently degraded by speckle noise, which obscures fine details and hinders accurate interpretation. While numerous denoising methods exist, many struggle to balance noise reduction with the preservation of crucial anatomical structures. This paper introduces GARD (Gamma-based Anatomical Restoration and Denoising), a novel deep learning approach for OCT image despeckling that leverages the strengths of diffusion probabilistic models. Unlike conventional diffusion models that assume Gaussian noise, GARD employs a Denoising Diffusion Gamma Model to more accurately reflect the statistical properties of speckle. Furthermore, we introduce a Noise-Reduced Fidelity Term that utilizes a pre-processed, less-noisy image to guide the denoising process. This crucial addition prevents the reintroduction of high-frequency noise. We accelerate the inference process by adapting the Denoising Diffusion Implicit Model framework to our Gamma-based model. Experiments on a dataset with paired noisy and less-noisy OCT B-scans demonstrate that GARD significantly outperforms traditional denoising methods and state-of-the-art deep learning models in terms of PSNR, SSIM, and MSE. Qualitative results confirm that GARD produces sharper edges and better preserves fine anatomical details.
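Sketch (not from the paper): one common way to picture why a gamma model suits speckle better than additive Gaussian noise is to simulate speckle as multiplicative, unit-mean gamma noise. The multiplicative form, the `looks` shape parameter, and the function name are assumptions for illustration; the abstract does not state the exact noise model GARD uses.

```python
import random


def add_speckle(pixels, looks, seed=0):
    """Multiply each intensity by unit-mean gamma noise (shape=looks, scale=1/looks).

    Larger `looks` means weaker speckle; the noise is always positive,
    unlike additive Gaussian noise.
    """
    rng = random.Random(seed)
    return [p * rng.gammavariate(looks, 1.0 / looks) for p in pixels]
```

Because the noise scales with the signal and never goes negative, bright regions are perturbed more than dark ones, a pattern that a fixed-variance Gaussian assumption does not capture.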

[49] GLAM: Geometry-Guided Local Alignment for Multi-View VLP in Mammography cs.CV | cs.AI | cs.LGPDF

Yuexi Du, Lihui Chen, Nicha C. Dvornek

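TL;DR: The paper proposes GLAM, which improves vision-language pretraining for mammography through geometry-guided multi-view alignment, addresses the neglect of multi-view relationships in existing methods, and performs well on multiple datasets.

Details

Motivation: Mammography screening is essential for early breast cancer detection, but existing vision language models (VLMs) perform poorly because of limited data and the domain gap between natural and medical images, and in particular they model multi-view relationships inadequately.

Result: Pretrained on the large mammography dataset EMBED, GLAM outperforms baseline methods.

Insight: Geometric priors and multi-view relationship modeling are critical to the performance of medical vision language models.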

Abstract: Mammography screening is an essential tool for early detection of breast cancer. The speed and accuracy of mammography interpretation have the potential to be improved with deep learning methods. However, the development of a foundation visual language model (VLM) is hindered by limited data and domain differences between natural and medical images. Existing mammography VLMs, adapted from natural images, often ignore domain-specific characteristics, such as multi-view relationships in mammography. Unlike radiologists who analyze both views together to process ipsilateral correspondence, current methods treat them as independent images or do not properly model the multi-view correspondence learning, losing critical geometric context and resulting in suboptimal prediction. We propose GLAM: Global and Local Alignment for Multi-view mammography for VLM pretraining using geometry guidance. By leveraging the prior knowledge about the multi-view imaging process of mammograms, our model learns local cross-view alignments and fine-grained local features through joint global and local, visual-visual, and visual-language contrastive learning. Pretrained on EMBED [14], one of the largest open mammography datasets, our model outperforms baselines across multiple datasets under different settings.

[50] Towards Understanding Visual Grounding in Visual Language Models cs.CV | cs.AIPDF

Georgios Pantazopoulos, Eda B. Özyiğit

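TL;DR: This survey examines visual grounding in vision language models (VLMs), covering its importance, core components, applications, and challenges, and proposes directions for future research.

Details

Motivation: Visual grounding is essential in many applications, such as referring expression comprehension and fine-grained visual question answering, but it has not been studied systematically and needs to be summarized and discussed.

Result: The survey shows the importance of visual grounding in multimodal tasks and identifies potential ways to improve models’ grounding ability.

Insight: Visual grounding is a core capability for multimodal understanding and is closely tied to reasoning and chain-of-thought. Future work should focus on improving fine-grained grounding and interpretability.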

Abstract: Visual grounding refers to the ability of a model to identify a region within some visual input that matches a textual description. Consequently, a model equipped with visual grounding capabilities can target a wide range of applications in various domains, including referring expression comprehension, answering questions pertinent to fine-grained details in images or videos, captioning visual context by explicitly referring to entities, as well as low and high-level control in simulated and real environments. In this survey paper, we review representative works across the key areas of research on modern general-purpose vision language models (VLMs). We first outline the importance of grounding in VLMs, then delineate the core components of the contemporary paradigm for developing grounded models, and examine their practical applications, including benchmarks and evaluation metrics for grounded multimodal generation. We also discuss the multifaceted interrelations among visual grounding, multimodal chain-of-thought, and reasoning in VLMs. Finally, we analyse the challenges inherent to visual grounding and suggest promising directions for future research.

[51] Ordinality of Visible-Thermal Image Intensities for Intrinsic Image Decomposition cs.CVPDF

Zeqing Leo Yuan, Mani Ramanagopal, Aswin C. Sankaranarayanan, Srinivasa G. Narasimhan

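TL;DR: The paper proposes a training-free, self-supervised method for intrinsic image decomposition from visible-thermal image pairs, recovering shading and reflectance from the ordinal relationships between visible and thermal image intensities.

Details

Motivation: Intrinsic image decomposition is a long-standing challenge because extensive ground-truth data for real-world scenes is lacking, and data for outdoor scenes is especially scarce. The paper addresses this with a new method based on the ordinal relationships in visible-thermal image pairs.

Result: Experiments show the method accurately recovers reflectance and shading under both natural and artificial lighting and outperforms learning-based models across diverse outdoor scenes.

Insight: Using the light absorption that thermal images capture provides a scalable, self-supervised route to intrinsic decomposition in real scenes, which manual labeling could not previously provide.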

Abstract: Decomposing an image into its intrinsic photometric factors (shading and reflectance) is a long-standing challenge due to the lack of extensive ground-truth data for real-world scenes. Recent methods rely on synthetic data or sparse annotations for limited indoor and even fewer outdoor scenes. We introduce a novel training-free approach for intrinsic image decomposition using only a pair of visible and thermal images. We leverage the principle that light not reflected from an opaque surface is absorbed and detected as heat by a thermal camera. This allows us to relate the ordinalities between visible and thermal image intensities to the ordinalities of shading and reflectance, which can densely self-supervise an optimizing neural network to recover shading and reflectance. We perform quantitative evaluations with known reflectance and shading under natural and artificial lighting, and qualitative experiments across diverse outdoor scenes. The results demonstrate superior performance over recent learning-based models and point toward a scalable path to curating real-world ordinal supervision, previously infeasible via manual labeling.

[52] Compressed Video Quality Enhancement: Classifying and Benchmarking over Standards cs.CVPDF

Xiem HoangVan, Dang BuiDinh, Sang NguyenQuang, Wen-Hsiao Peng

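TL;DR: The paper systematically classifies and evaluates compressed video quality enhancement (CVQE) methods, addressing gaps in existing surveys: incomplete taxonomies, insufficient comparative analysis, and non-standardized benchmarking.

Details

Motivation: CVQE is crucial for user experience, but existing work lacks systematic classification, method comparison, and standardized benchmarks, which hampers consistent assessment and model selection.

Result: The paper provides a comprehensive evaluation of existing methods, reveals the trade-off between performance and complexity, and identifies directions for future research.

Insight: Future CVQE research should put more weight on standardized evaluation and on efficiency in real deployment.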

Abstract: Compressed video quality enhancement (CVQE) is crucial for improving user experience with lossy video codecs like H.264/AVC, H.265/HEVC, and H.266/VVC. While deep learning-based CVQE has driven significant progress, existing surveys still suffer from limitations: lack of systematic classification linking methods to specific standards and artifacts, insufficient comparative analysis of architectural paradigms across coding types, and underdeveloped benchmarking practices. To address these gaps, this paper presents three key contributions. First, it introduces a novel taxonomy classifying CVQE methods across architectural paradigms, coding standards, and compressed-domain feature utilization. Second, it proposes a unified benchmarking framework integrating modern compression protocols and standard test sequences for fair multi-criteria evaluation. Third, it provides a systematic analysis of the critical trade-offs between reconstruction performance and computational complexity observed in state-of-the-art methods and highlights promising directions for future research. This comprehensive review aims to establish a foundation for consistent assessment and informed model selection in CVQE research and deployment.

[53] Multimodal SAM-adapter for Semantic Segmentation cs.CV | cs.AIPDF

Iacopo Curti, Pierluigi Zama Ramirez, Alioscia Petrelli, Luigi Di Stefano

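TL;DR: The paper proposes MM SAM-adapter, a multimodal semantic segmentation framework that uses an adapter network to fuse multimodal features into the RGB features of the Segment Anything Model (SAM), improving robustness in challenging conditions such as poor lighting, occlusions, and adverse weather.

Details

Motivation: Existing semantic segmentation methods perform poorly in challenging conditions (poor lighting, occlusions, adverse weather), whereas multimodal methods improve robustness by integrating auxiliary sensor data (such as LiDAR and infrared). The paper aims to combine SAM’s strong generalization with multimodal information to improve semantic segmentation.

Result: MM SAM-adapter achieves state-of-the-art performance on the three benchmarks DeLiVER, FMB, and MUSES. When DeLiVER and FMB are split into RGB-easy and RGB-hard subsets, it outperforms other methods in both conditions, supporting the effectiveness of multimodal adaptation.

Insight: 1. Dynamic fusion of multimodal information can significantly improve semantic segmentation in challenging environments. 2. The adapter network makes efficient use of auxiliary modalities and avoids redundant computation. 3. Combining SAM’s generalization with multimodal features offers a new approach to robust scene understanding.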

Abstract: Semantic segmentation, a key task in computer vision with broad applications in autonomous driving, medical imaging, and robotics, has advanced substantially with deep learning. Nevertheless, current approaches remain vulnerable to challenging conditions such as poor lighting, occlusions, and adverse weather. To address these limitations, multimodal methods that integrate auxiliary sensor data (e.g., LiDAR, infrared) have recently emerged, providing complementary information that enhances robustness. In this work, we present MM SAM-adapter, a novel framework that extends the capabilities of the Segment Anything Model (SAM) for multimodal semantic segmentation. The proposed method employs an adapter network that injects fused multimodal features into SAM’s rich RGB features. This design enables the model to retain the strong generalization ability of RGB features while selectively incorporating auxiliary modalities only when they contribute additional cues. As a result, MM SAM-adapter achieves a balanced and efficient use of multimodal information. We evaluate our approach on three challenging benchmarks, DeLiVER, FMB, and MUSES, where MM SAM-adapter delivers state-of-the-art performance. To further analyze modality contributions, we partition DeLiVER and FMB into RGB-easy and RGB-hard subsets. Results consistently demonstrate that our framework outperforms competing methods in both favorable and adverse conditions, highlighting the effectiveness of multimodal adaptation for robust scene understanding. The code is available at the following link: https://github.com/iacopo97/Multimodal-SAM-Adapter.

[54] InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis cs.CVPDF

Tao Han, Wanghan Xu, Junchao Gong, Xiaoyu Yue, Song Guo

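TL;DR: InfGen proposes a resolution-agnostic image generation paradigm that uses a fixed latent representation and a one-step generator to synthesize images efficiently at any resolution, substantially reducing computational complexity and generation time.

Details

Motivation: The computational cost of current diffusion models grows quadratically with resolution, causing high latency (for example, over 100 seconds for a 4K image). InfGen aims to solve this with efficient arbitrary-resolution generation.

Result: Experiments show InfGen cuts 4K image generation time from over 100 seconds to under 10 seconds while supporting arbitrary resolutions.

Insight: The key to high-resolution generation is reducing computational complexity. A fixed latent representation combined with a lightweight decoder is an effective solution that gives practical applications flexibility.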

Abstract: Arbitrary resolution image generation provides a consistent visual experience across devices, having extensive applications for producers and consumers. Current diffusion models increase computational demand quadratically with resolution, causing 4K image generation delays over 100 seconds. To solve this, we explore the second generation upon the latent diffusion models, where the fixed latent generated by diffusion models is regarded as the content representation and we propose to decode arbitrary resolution images with a compact generated latent using a one-step generator. Thus, we present InfGen, which replaces the VAE decoder with the new generator to generate images at any resolution from a fixed-size latent without retraining the diffusion models. This simplifies the process, reduces computational complexity, and can be applied to any model using the same latent space. Experiments show InfGen is capable of improving many models into the arbitrary high-resolution era while cutting 4K image generation time to under 10 seconds.

[55] SSL-AD: Spatiotemporal Self-Supervised Learning for Generalizability and Adaptability Across Alzheimer’s Prediction Tasks and Datasets cs.CV | cs.LGPDF

Emily Kaczmarek, Justin Szeto, Brennan Nichyporuk, Tal Arbel

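TL;DR: The paper proposes a spatiotemporal self-supervised learning framework (SSL-AD) that improves model generalizability and adaptability across Alzheimer’s prediction tasks, addressing the scarcity of labeled data and poor cross-dataset performance.

Details

Motivation: Existing deep learning models for Alzheimer’s prediction are limited by scarce labeled data, poor generalization across datasets, and inflexibility to varying numbers of input scans and time intervals between scans.

Result: The model outperforms supervised learning on six of seven downstream tasks and adapts across tasks and numbers of input images.

Insight: Self-supervised learning shows promise in medical image analysis: it reduces reliance on labeled data and stays robust across tasks and datasets.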

Abstract: Alzheimer’s disease is a progressive, neurodegenerative disorder that causes memory loss and cognitive decline. While there has been extensive research in applying deep learning models to Alzheimer’s prediction tasks, these models remain limited by lack of available labeled data, poor generalization across datasets, and inflexibility to varying numbers of input scans and time intervals between scans. In this study, we adapt three state-of-the-art temporal self-supervised learning (SSL) approaches for 3D brain MRI analysis, and add novel extensions designed to handle variable-length inputs and learn robust spatial features. We aggregate four publicly available datasets comprising 3,161 patients for pre-training, and show the performance of our model across multiple Alzheimer’s prediction tasks including diagnosis classification, conversion detection, and future conversion prediction. Importantly, our SSL model implemented with temporal order prediction and contrastive learning outperforms supervised learning on six out of seven downstream tasks. It demonstrates adaptability and generalizability across tasks and number of input images with varying time intervals, highlighting its capacity for robust performance across clinical applications. We release our code and model publicly at https://github.com/emilykaczmarek/SSL-AD.

cs.CL

[56] Cross-Layer Attention Probing for Fine-Grained Hallucination Detection cs.CL | cs.AI

Malavika Suresh, Rahaf Aljundi, Ikechukwu Nkisi-Orji, Nirmalie Wiratunga

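TL;DR: CLAP is a novel activation probing technique that jointly processes activations across the LLM residual stream, improving hallucination detection accuracy and enabling fine-grained detection.

Details

Motivation: As large language models (LLMs) are widely adopted, their tendency to generate inaccurate text (hallucinations) is a growing concern, which calls for efficient detection methods to improve reliability.

Result: Across five LLMs and three tasks, CLAP outperforms the baselines and remains reliable on responses sampled at higher temperatures.

Insight: CLAP supports a “detect-then-mitigate” strategy, offering a new route to improving LLM reliability, and it generalizes well.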

Abstract: With the large-scale adoption of Large Language Models (LLMs) in various applications, there is a growing reliability concern due to their tendency to generate inaccurate text, i.e. hallucinations. In this work, we propose Cross-Layer Attention Probing (CLAP), a novel activation probing technique for hallucination detection, which processes the LLM activations across the entire residual stream as a joint sequence. Our empirical evaluations using five LLMs and three tasks show that CLAP improves hallucination detection compared to baselines on both greedy decoded responses as well as responses sampled at higher temperatures, thus enabling fine-grained detection, i.e. the ability to disambiguate hallucinations and non-hallucinations among different sampled responses to a given prompt. This allows us to propose a detect-then-mitigate strategy using CLAP to reduce hallucinations and improve LLM reliability compared to direct mitigation approaches. Finally, we show that CLAP maintains high reliability even when applied out-of-distribution.

[57] Creativity Benchmark: A benchmark for marketing creativity for LLM models cs.CL | cs.AI | cs.HC

Ninad Bhat, Kieran Browne, Pip Bingemann

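TL;DR: Creativity Benchmark is a new framework for evaluating large language models on marketing creativity. Results show that models perform similarly and that automated evaluation diverges from human evaluation.

Details

Motivation: The performance of large language models (LLMs) in marketing creativity lacks systematic evaluation, particularly regarding output diversity in brand-constrained tasks and the reliability of automated evaluation.

Result: Model performance is tightly clustered (the highest win rate is only 61%), automated judges correlate weakly with human judges, and conventional creativity tests have limited value on brand-constrained tasks.

Insight: The work underscores the need for expert human evaluation and notes that model evaluation should account for diversity and sensitivity to prompt wording.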

Abstract: We introduce Creativity Benchmark, an evaluation framework for large language models (LLMs) in marketing creativity. The benchmark covers 100 brands (12 categories) and three prompt types (Insights, Ideas, Wild Ideas). Human pairwise preferences from 678 practising creatives over 11,012 anonymised comparisons, analysed with Bradley-Terry models, show tightly clustered performance with no model dominating across brands or prompt types: the top-bottom spread is $\Delta\theta \approx 0.45$, which implies a head-to-head win probability of $0.61$; the highest-rated model beats the lowest only about 61% of the time. We also analyse model diversity using cosine distances to capture intra- and inter-model variation and sensitivity to prompt reframing. Comparing three LLM-as-judge setups with human rankings reveals weak, inconsistent correlations and judge-specific biases, underscoring that automated judges cannot substitute for human evaluation. Conventional creativity tests also transfer only partially to brand-constrained tasks. Overall, the results highlight the need for expert human evaluation and diversity-aware workflows.
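
The step from a Bradley-Terry spread of about 0.45 to a win probability of about 0.61 can be checked with a few lines of Python. This is a minimal sketch that assumes the ability parameters are on the usual natural-log-odds scale, so the win probability is the logistic function of the difference; the function name `bt_win_probability` is illustrative and not taken from the paper.

```python
import math


def bt_win_probability(delta_theta: float) -> float:
    """P(A beats B) under Bradley-Terry, where delta_theta = theta_A - theta_B
    is measured on the log-odds scale (an assumption of this sketch)."""
    return 1.0 / (1.0 + math.exp(-delta_theta))


# Spread between the highest- and lowest-rated model reported in the abstract.
p_top_beats_bottom = bt_win_probability(0.45)
print(round(p_top_beats_bottom, 2))  # 0.61
```

A spread of zero gives exactly 0.5, so 0.61 is only a modest edge over a coin flip, which is the sense in which the abstract calls performance tightly clustered.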

[58] Temporal Preferences in Language Models for Long-Horizon Assistance cs.CL | cs.AI | cs.CY

Ali Mazyaki, Mohammad Naghizadeh, Samaneh Ranjkhah Zonouzaghi, Hossein Setareh

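TL;DR: This paper studies whether language models exhibit future- or present-oriented preferences in intertemporal choice and whether those preferences can be systematically manipulated. The authors introduce an operational metric, MTO, that measures how manipulable a model’s time preference is.

Details

Motivation: Using language models for long-horizon assistance requires alignment with human time preferences, yet it is unclear whether models show human-like intertemporal preferences. The authors investigate this question experimentally.

Result: Reasoning-focused models (e.g., DeepSeek-Reasoner and grok-3-mini) tend to choose later options under future-oriented prompts but only partially personalize decisions across identities or geographies. Models that reason correctly about time orientation also show a future orientation for themselves as AI decision makers.

Insight: The study highlights the importance of aligning AI assistants with human time preferences for long-horizon goals and proposes research directions on personalized contextual calibration and socially aware deployment.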

Abstract: We study whether language models (LMs) exhibit future- versus present-oriented preferences in intertemporal choice and whether those preferences can be systematically manipulated. Using adapted human experimental protocols, we evaluate multiple LMs on time-tradeoff tasks and benchmark them against a sample of human decision makers. We introduce an operational metric, the Manipulability of Time Orientation (MTO), defined as the change in an LM’s revealed time preference between future- and present-oriented prompts. In our tests, reasoning-focused models (e.g., DeepSeek-Reasoner and grok-3-mini) choose later options under future-oriented prompts but only partially personalize decisions across identities or geographies. Moreover, models that correctly reason about time orientation internalize a future orientation for themselves as AI decision makers. We discuss design implications for AI assistants that should align with heterogeneous, long-horizon goals and outline a research agenda on personalized contextual calibration and socially aware deployment.
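
As a rough illustration of the MTO definition (the change in revealed time preference between future- and present-oriented prompts), the sketch below measures revealed preference as the share of trials in which the model picks the delayed option. That operationalization, the choice labels, and the function names are assumptions made for illustration; the abstract does not specify how the authors compute the preference.

```python
def later_choice_rate(choices):
    """Fraction of trials in which the delayed ("later") option was chosen."""
    if not choices:
        raise ValueError("need at least one choice")
    return sum(1 for c in choices if c == "later") / len(choices)


def mto(future_prompt_choices, present_prompt_choices):
    """Hypothetical MTO: later-choice rate under future-oriented prompts
    minus the rate under present-oriented prompts."""
    return later_choice_rate(future_prompt_choices) - later_choice_rate(
        present_prompt_choices
    )


# Made-up choices for one model on four time-tradeoff items per condition.
future = ["later", "later", "later", "sooner"]
present = ["later", "sooner", "sooner", "sooner"]
print(mto(future, present))  # 0.5
```

Under this reading, a value near zero means the prompt framing barely moves the model, while a large positive value means the model is easily steered toward delayed rewards.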

[59] Psychiatry-Bench: A Multi-Task Benchmark for LLMs in Psychiatry cs.CL | cs.AI

Aya E. Fouda, Abdelrahamn A. Hassan, Radwa J. Hanafy, Mohammed E. Fouda

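TL;DR: PsychiatryBench is a multi-task benchmark built on authoritative psychiatric textbooks and casebooks for evaluating large language models (LLMs) in psychiatry. It reveals shortcomings of current models in clinical consistency and safety.

Details

Motivation: Existing evaluation resources rely mostly on small clinical interview corpora or social media data, which have limited clinical validity and do not cover the complexity of psychiatric reasoning, so a more specialized benchmark is needed.

Result: Models show substantial gaps in clinical consistency and safety on multi-turn follow-up and management tasks, indicating a need for specialized tuning.

Insight: LLMs for psychiatry need further tuning and more robust evaluation methods; PsychiatryBench provides a modular, extensible platform for this work.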

Abstract: Large language models (LLMs) hold great promise in enhancing psychiatric practice, from improving diagnostic accuracy to streamlining clinical documentation and therapeutic support. However, existing evaluation resources heavily rely on small clinical interview corpora, social media posts, or synthetic dialogues, which limits their clinical validity and fails to capture the full complexity of psychiatric reasoning. In this work, we introduce PsychiatryBench, a rigorously curated benchmark grounded exclusively in authoritative, expert-validated psychiatric textbooks and casebooks. PsychiatryBench comprises eleven distinct question-answering tasks ranging from diagnostic reasoning and treatment planning to longitudinal follow-up, management planning, clinical approach, sequential case analysis, and multiple-choice/extended matching formats, totaling over 5,300 expert-annotated items. We evaluate a diverse set of frontier LLMs (including Google Gemini, DeepSeek, LLaMA 3, and QWQ-32) alongside leading open-source medical models (e.g., OpenBioLLM, MedGemma) using both conventional metrics and an “LLM-as-judge” similarity scoring framework. Our results reveal substantial gaps in clinical consistency and safety, particularly in multi-turn follow-up and management tasks, underscoring the need for specialized model tuning and more robust evaluation paradigms. PsychiatryBench offers a modular, extensible platform for benchmarking and improving LLM performance in high-stakes mental health applications.

[60] The Thinking Therapist: Training Large Language Models to Deliver Acceptance and Commitment Therapy using Supervised Fine-Tuning and Odds Ratio Policy Optimization cs.CL | cs.AI

Talha Tahir

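TL;DR: This paper studies how to train a small open-weight large language model (LLM) to deliver Acceptance and Commitment Therapy (ACT) using two methods, supervised fine-tuning (SFT) and odds ratio policy optimization (ORPO). ORPO-trained models significantly outperform SFT and base models on therapeutic fidelity and empathy, while explicit chain-of-thought (COT) reasoning helps only the SFT models.

Details

Motivation: ACT has shown efficacy in several psychiatric conditions, but how to deliver it effectively with small LLMs remains an open problem. The study examines how training method and an explicit reasoning step affect model performance.

Result: ORPO-trained models significantly outperform SFT and base models on ACT fidelity and empathy (p < 0.001). COT significantly improves only SFT models (+2.68 points) and does not help ORPO models.

Insight: ORPO’s advantage lies in learning the therapeutic “process” rather than imitating “content,” while COT provides necessary scaffolding only for models trained by imitation. Policy optimization can effectively improve ACT competence in small LLMs, and the utility of reasoning depends heavily on the training paradigm.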

Abstract: Acceptance and Commitment Therapy (ACT) is a third-wave cognitive behavioral therapy with emerging evidence of efficacy in several psychiatric conditions. This study investigates the impact of post-training methodology and explicit reasoning on the ability of a small open-weight large language model (LLM) to deliver ACT. Using 50 sets of synthetic ACT transcripts generated by Mistral-Large, we trained Llama-3.2-3b-Instruct with two distinct approaches, supervised fine-tuning (SFT) and odds ratio policy optimization (ORPO), each with and without an explicit chain-of-thought (COT) reasoning step. Performance was evaluated by comparing these four post-trained variants against the base Instruct model. These models were benchmarked in simulated therapy sessions, with performance quantitatively assessed on the ACT Fidelity Measure (ACT-FM) and the Therapist Empathy Scale (TES) by an LLM judge that had been fine-tuned on human evaluations. Our findings demonstrate that the ORPO-trained models significantly outperformed both their SFT and Instruct counterparts on ACT fidelity ($\chi^2(5) = 185.15, p < .001$) and therapeutic empathy ($\chi^2(5) = 140.37, p < .001$). The effect of COT was conditional as it provided a significant benefit to SFT models, improving ACT-FM scores by an average of 2.68 points ($p < .001$), while offering no discernible advantage to the superior ORPO or instruct-tuned variants. We posit that the superiority of ORPO stems from its ability to learn the therapeutic ‘process’ over imitating ‘content,’ a key aspect of ACT, while COT acts as a necessary scaffold for models trained only via imitation. This study establishes that preference-aligned policy optimization can effectively instill ACT competencies in small LLMs, and that the utility of explicit reasoning is highly dependent on the underlying training paradigm.

[61] HANRAG: Heuristic Accurate Noise-resistant Retrieval-Augmented Generation for Multi-hop Question Answering cs.CL | cs.AI

Duolin Sun, Dan Yang, Yue Shen, Yihan Jiao, Zhehao Tan

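TL;DR: HANRAG is an efficient heuristic-based framework that routes queries, decomposes them into sub-queries, and filters noisy documents, improving the performance and noise robustness of retrieval-augmented generation (RAG) on multi-hop question answering.

Details

Motivation: Existing RAG methods suffer from inefficient iterative retrieval and noise accumulation on multi-hop queries; HANRAG aims to address these challenges.

Result: Across multiple benchmarks, HANRAG performs strongly on both single-hop and multi-hop question answering.

Insight: Heuristic query decomposition and noise filtering are key techniques for improving RAG on multi-hop tasks.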

Abstract: The Retrieval-Augmented Generation (RAG) approach enhances question-answering systems and dialogue generation tasks by integrating information retrieval (IR) technologies with large language models (LLMs). This strategy, which retrieves information from external knowledge bases to bolster the response capabilities of generative models, has achieved certain successes. However, current RAG methods still face numerous challenges when dealing with multi-hop queries. For instance, some approaches overly rely on iterative retrieval, wasting too many retrieval steps on compound queries. Additionally, using the original complex query for retrieval may fail to capture content relevant to specific sub-queries, resulting in noisy retrieved content. If the noise is not managed, it can lead to the problem of noise accumulation. To address these issues, we introduce HANRAG, a novel heuristic-based framework designed to efficiently tackle problems of varying complexity. Driven by a powerful revelator, HANRAG routes queries, decomposes them into sub-queries, and filters noise from retrieved documents. This enhances the system’s adaptability and noise resistance, making it highly capable of handling diverse queries. We compare the proposed framework against other leading industry methods across various benchmarks. The results demonstrate that our framework obtains superior performance in both single-hop and multi-hop question-answering tasks.

[62] A Role-Aware Multi-Agent Framework for Financial Education Question Answering with LLMs cs.CL | cs.CE

Andy Zhu, Yingjun Du

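TL;DR: This paper proposes a role-based multi-agent framework for financial education question answering, in which a Base Generator, an Evidence Retriever, and an Expert Reviewer collaborate to improve answer accuracy.

Details

Motivation: Financial education QA requires complex specialized reasoning and domain knowledge, which existing large language model approaches often fail to provide, resulting in inadequate performance.

Result: The method improves answer accuracy by 6.6-8.3% over zero-shot Chain-of-Thought baselines; Gemini-2.0-Flash performs best, and GPT-4o-mini approaches the finance-tuned FinGPT model.

Insight: The work demonstrates the potential of multi-agent frameworks in financial education and offers a cost-effective way to improve financial QA.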

Abstract: Question answering (QA) plays a central role in financial education, yet existing large language model (LLM) approaches often fail to capture the nuanced and specialized reasoning required for financial problem-solving. The financial domain demands multistep quantitative reasoning, familiarity with domain-specific terminology, and comprehension of real-world scenarios. We present a multi-agent framework that leverages role-based prompting to enhance performance on domain-specific QA. Our framework comprises a Base Generator, an Evidence Retriever, and an Expert Reviewer agent that work in a single-pass iteration to produce a refined answer. We evaluated our framework on a set of 3,532 expert-designed finance education questions from Study.com, an online learning platform. We leverage retrieval-augmented generation (RAG) for contextual evidence from 6 finance textbooks and prompting strategies for a domain-expert reviewer. Our experiments indicate that critique-based refinement improves answer accuracy by 6.6-8.3% over zero-shot Chain-of-Thought baselines, with the highest performance from Gemini-2.0-Flash. Furthermore, our method enables GPT-4o-mini to achieve performance comparable to the finance-tuned FinGPT-mt_Llama3-8B_LoRA. Our results show a cost-effective approach to enhancing financial QA and offer insights for further research in multi-agent financial LLM systems.

[63] MultimodalHugs: Enabling Sign Language Processing in Hugging Face cs.CL | cs.AI | cs.MM

Gerard Sant, Zifan Jiang, Carlos Escolano, Amit Moryossef, Mathias Müller

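TL;DR: MultimodalHugs is a framework built on Hugging Face that addresses complex code and low reproducibility in sign language processing research, supports multimodal data, and extends the range of tasks Hugging Face can handle.

Details

Motivation: Sign language processing (SLP) research is hindered by complex code and low reproducibility, and existing tools such as Hugging Face cannot seamlessly integrate sign language experiments.

Result: Experiments show that MultimodalHugs can handle sign language pose data as well as pixel data for text characters.

Insight: Through its abstraction layer, MultimodalHugs applies beyond sign language to other non-standard tasks, providing a new tool for multimodal research.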

Abstract: In recent years, sign language processing (SLP) has gained importance in the general field of Natural Language Processing. However, compared to research on spoken languages, SLP research is hindered by complex ad-hoc code, inadvertently leading to low reproducibility and unfair comparisons. Existing tools that are built for fast and reproducible experimentation, such as Hugging Face, are not flexible enough to seamlessly integrate sign language experiments. This view is confirmed by a survey we conducted among SLP researchers. To address these challenges, we introduce MultimodalHugs, a framework built on top of Hugging Face that enables more diverse data modalities and tasks, while inheriting the well-known advantages of the Hugging Face ecosystem. Even though sign languages are our primary focus, MultimodalHugs adds a layer of abstraction that makes it more widely applicable to other use cases that do not fit one of the standard templates of Hugging Face. We provide quantitative experiments to illustrate how MultimodalHugs can accommodate diverse modalities such as pose estimation data for sign languages, or pixel data for text characters.

[64] Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning cs.CL

Haiyang Yu, Yuchuan Wu, Fan Shi, Lei Liao, Jinghui Lu

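TL;DR: This paper presents AncientDoc, the first vision-language model (VLM) benchmark for Chinese ancient documents, filling a gap left by existing benchmarks in complex visual and linguistic processing.

Details

Motivation: Chinese ancient documents are important carriers of Chinese culture, but digitizing and understanding them is challenging: traditional methods only scan images, and current VLMs struggle with their visual and linguistic complexity. The lack of a suitable evaluation benchmark has limited progress.

Result: Evaluation on AncientDoc reveals the performance and limitations of mainstream VLMs on ancient documents, especially on OCR and knowledge reasoning tasks.

Insight: The complexity and diversity of ancient documents place higher demands on VLMs; future work should incorporate domain knowledge and improve performance in low-resource settings.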

Abstract: Chinese ancient documents, invaluable carriers of millennia of Chinese history and culture, hold rich knowledge across diverse fields but face challenges in digitization and understanding, i.e., traditional methods only scan images, while current Vision-Language Models (VLMs) struggle with their visual and linguistic complexity. Existing document benchmarks focus on English printed texts or simplified Chinese, leaving a gap for evaluating VLMs on ancient Chinese documents. To address this, we present AncientDoc, the first benchmark for Chinese ancient documents, designed to assess VLMs from OCR to knowledge reasoning. AncientDoc includes five tasks (page-level OCR, vernacular translation, reasoning-based QA, knowledge-based QA, linguistic variant QA) and covers 14 document types, over 100 books, and about 3,000 pages. Based on AncientDoc, we evaluate mainstream VLMs using multiple metrics, supplemented by a human-aligned large language model for scoring.

[65] HEFT: A Coarse-to-Fine Hierarchy for Enhancing the Efficiency and Accuracy of Language Model Reasoning cs.CL | cs.AI | cs.LG | 68T07, 68T50, 68T05 | I.2.7; I.2.6; C.4

Brennen Hill

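TL;DR: HEFT is a hierarchical efficient fine-tuning strategy that combines LoRA and ReFT to improve the reasoning efficiency and accuracy of language models.

Details

Motivation: Adapting large language models to specialized reasoning tasks is constrained by computational resources; the paper investigates whether combining different parameter-efficient fine-tuning (PEFT) methods yields better performance.

Result: After only three epochs of fine-tuning, HEFT reaches 85.17% accuracy on the BoolQ benchmark, exceeding LoRA-only (85.05%) and ReFT-only (83.36%) models trained for 20 epochs.

Insight: Composing different PEFT methods can improve both performance and efficiency, offering a new route to adapting large-scale models for complex cognitive tasks.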

Abstract: The adaptation of large language models (LLMs) to specialized reasoning tasks is fundamentally constrained by computational resources. Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a powerful solution, yet the landscape of these techniques is diverse, with distinct methods operating in either the model’s weight space or its representation space. This paper investigates the hypothesis that a synergistic combination of these paradigms can unlock superior performance and efficiency. We introduce HEFT (Hierarchical Efficient Fine-Tuning), a novel hierarchical adaptation strategy that composes two distinct PEFT methods in a coarse-to-fine manner: first, a broad, foundational adaptation in the weight space using Low-Rank Adaptation (LoRA), followed by a precise, surgical refinement of internal activations using Representation Fine-Tuning (ReFT). We evaluate this approach by fine-tuning a Llama-2-7B model on the BoolQ benchmark, a challenging dataset for inferential reasoning. Our results reveal a profound synergistic effect. A model fine-tuned for only three epochs with our HEFT strategy achieves an accuracy of 85.17%, exceeding the performance of models trained for 20 epochs with either LoRA-only (85.05%) or ReFT-only (83.36%) methodologies. This work demonstrates that the thoughtful composition of PEFT methods is a potent algorithmic innovation, offering a more efficient and effective path toward advancing the reasoning capabilities of language models. By achieving superior results with a fraction of the computational budget, our findings present a principled approach to overcoming the obstacles inherent in adapting large-scale models for complex cognitive tasks.

[66] Pragmatic Frames Evoked by Gestures: A FrameNet Brasil Approach to Multimodality in Turn Organization cs.CL

Helen de Andrade Abreu, Tiago Timponi Torrent, Ely Edison da Silva Matos

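TL;DR: This paper proposes a FrameNet Brasil-based framework that models turn organization in multimodal conversation through correlations between language and interactive gestures, and it enriches a multimodal dataset through an annotation methodology.

Details

Motivation: Although conversational turn organization has been studied across many fields, gesture strategies had not been encoded in a dataset usable for machine learning; the authors aim to fill this gap.

Result: The study confirms that gestures serve as a tool for passing, taking, and keeping conversational turns, reveals previously undocumented gesture variations, and indicates that annotating pragmatic frames aids understanding of human cognition and language.

Insight: Gesture use is closely tied to the conceptualization of pragmatic frames (involving mental spaces, blending, and conceptual metaphors), and annotating multimodal data offers a new perspective on human communication.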

Abstract: This paper proposes a framework for modeling multimodal conversational turn organization via the proposition of correlations between language and interactive gestures, based on analysis as to how pragmatic frames are conceptualized and evoked by communicators. As a means to provide evidence for the analysis, we developed an annotation methodology to enrich a multimodal dataset (annotated for semantic frames) with pragmatic frames modeling conversational turn organization. Although conversational turn organization has been studied by researchers from diverse fields, the specific strategies, especially gestures used by communicators, had not yet been encoded in a dataset that can be used for machine learning. To fill this gap, we enriched the Frame2 dataset with annotations of gestures used for turn organization. The Frame2 dataset features 10 episodes from the Brazilian TV series Pedro Pelo Mundo annotated for semantic frames evoked in both video and text. This dataset allowed us to closely observe how communicators use interactive gestures outside a laboratory, in settings, to our knowledge, not previously recorded in related literature. Our results have confirmed that communicators involved in face-to-face conversation make use of gestures as a tool for passing, taking and keeping conversational turns, and also revealed variations of some gestures that had not been documented before. We propose that the use of these gestures arises from the conceptualization of pragmatic frames, involving mental spaces, blending and conceptual metaphors. In addition, our data demonstrate that the annotation of pragmatic frames contributes to a deeper understanding of human cognition and language.

[67] Topic-Guided Reinforcement Learning with LLMs for Enhancing Multi-Document Summarization cs.CL

Chuyuan Li, Austin Xu, Shafiq Joty, Giuseppe Carenini

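TL;DR: This paper proposes a topic-guided reinforcement learning approach that uses large language models (LLMs) to improve multi-document summarization, targeting the challenges of information integration and topical relevance.

Details

Motivation: In multi-document summarization (MDS), information integration and topical consistency are key challenges. LLM-based methods perform well on single-document summarization but leave room for improvement on MDS.

Result: Experiments show that the method clearly outperforms baseline models on the Multi-News and Multi-XScience datasets.

Insight: Explicitly incorporating topic information into the reinforcement learning objective can improve the quality of multi-document summaries.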

Abstract: A key challenge in Multi-Document Summarization (MDS) is effectively integrating information from multiple sources while maintaining coherence and topical relevance. While Large Language Models have shown impressive results in single-document summarization, their performance on MDS still leaves room for improvement. In this paper, we propose a topic-guided reinforcement learning approach to improve content selection in MDS. We first show that explicitly prompting models with topic labels enhances the informativeness of the generated summaries. Building on this insight, we propose a novel topic reward within the Group Relative Policy Optimization (GRPO) framework to measure topic alignment between the generated summary and source documents. Experimental results on the Multi-News and Multi-XScience datasets demonstrate that our method consistently outperforms strong baselines, highlighting the effectiveness of leveraging topical cues in MDS.
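
To make the idea of a topic reward inside GRPO more concrete, here is a minimal sketch. The group-relative step (subtract the group mean reward, divide by the group standard deviation) is the standard GRPO advantage computation. The reward itself is a stand-in: a Jaccard overlap between topic-label sets, chosen only for illustration, since the abstract does not say how the authors measure topic alignment. All names and the example topics are hypothetical.

```python
from statistics import mean, pstdev


def topic_reward(summary_topics, source_topics):
    """Hypothetical topic-alignment reward: Jaccard overlap of topic labels.
    This is an illustrative stand-in, not the paper's reward."""
    s, d = set(summary_topics), set(source_topics)
    if not s and not d:
        return 0.0
    return len(s & d) / len(s | d)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards a zero-variance group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Topics of the source documents and of three sampled summaries (made up).
source = ["election", "economy", "polls"]
sampled = [["election", "economy", "polls"], ["election"], ["sports"]]

rewards = [topic_reward(s, source) for s in sampled]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Summaries that cover more of the source topics than their group peers receive positive advantages, which is the signal the policy update would then reinforce.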

[68] Emulating Public Opinion: A Proof-of-Concept of AI-Generated Synthetic Survey Responses for the Chilean Case cs.CL | cs.AI | 68T50 (Primary) 91F10 (Secondary)

Bastián González-Bustamante, Nando Verelst, Carla Cisternas

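TL;DR: This paper examines the feasibility of using large language models (LLMs) to generate synthetic survey responses that emulate human opinion, and it evaluates model performance and potential biases against real Chilean survey data.

Details

Motivation: Traditional surveys suffer from measurement and representation errors. LLM-generated synthetic responses may offer a complement or alternative, but their reliability and their potential to introduce bias need to be verified.

Result: 1. Excellent performance on trust items (F1-score and accuracy > 0.90); 2. GPT-4o, GPT-4o-mini, and Llama 4 Maverick perform comparably; 3. Synthetic-human alignment is highest among respondents aged 45-59.

Insight: LLMs can approximate public opinion, but item-level heterogeneity is substantial, and further calibration and testing are needed to improve algorithmic fidelity and reduce errors.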

Abstract: Large Language Models (LLMs) offer promising avenues for methodological and applied innovations in survey research by using synthetic respondents to emulate human answers and behaviour, potentially mitigating measurement and representation errors. However, the extent to which LLMs recover aggregate item distributions remains uncertain and downstream applications risk reproducing social stereotypes and biases inherited from training data. We evaluate the reliability of LLM-generated synthetic survey responses against ground-truth human responses from a Chilean public opinion probabilistic survey. Specifically, we benchmark 128 prompt-model-question triplets, generating 189,696 synthetic profiles, and pool performance metrics (i.e., accuracy, precision, recall, and F1-score) in a meta-analysis across 128 question-subsample pairs to test for biases along key sociodemographic dimensions. The evaluation spans OpenAI’s GPT family and o-series reasoning models, as well as Llama and Qwen checkpoints. Three results stand out. First, synthetic responses achieve excellent performance on trust items (F1-score and accuracy > 0.90). Second, GPT-4o, GPT-4o-mini and Llama 4 Maverick perform comparably on this task. Third, synthetic-human alignment is highest among respondents aged 45-59. Overall, LLM-based synthetic samples approximate responses from a probabilistic sample, though with substantial item-level heterogeneity. Capturing the full nuance of public opinion remains challenging and requires careful calibration and additional distributional tests to ensure algorithmic fidelity and reduce errors.

[69] Unsupervised Hallucination Detection by Inspecting Reasoning Processes cs.CL | cs.AI

Ponhvoan Srey, Xiaobao Wu, Anh Tuan Luu

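TL;DR: This paper proposes IRIS, an unsupervised hallucination detection framework that identifies hallucinated content by inspecting the LLM’s reasoning process, without labeled data.

Details

Motivation: Existing unsupervised methods rely on proxy signals unrelated to factual correctness, which biases detection toward superficial or non-truth-related aspects and limits generalizability. A method that directly uses internal representations intrinsic to factual correctness is needed.

Result: Experiments show that IRIS outperforms existing unsupervised methods on multiple datasets, has low computational cost, and works well with little training data.

Insight: Directly using an LLM’s internal representations captures factual correctness more effectively, without human annotation. Response uncertainty is a useful indicator of truthfulness.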

Abstract: Unsupervised hallucination detection aims to identify hallucinated content generated by large language models (LLMs) without relying on labeled data. While unsupervised methods have gained popularity by eliminating labor-intensive human annotations, they frequently rely on proxy signals unrelated to factual correctness. This misalignment biases detection probes toward superficial or non-truth-related aspects, limiting generalizability across datasets and scenarios. To overcome these limitations, we propose IRIS, an unsupervised hallucination detection framework, leveraging internal representations intrinsic to factual correctness. IRIS prompts the LLM to carefully verify the truthfulness of a given statement, and obtain its contextualized embedding as informative features for training. Meanwhile, the uncertainty of each response is considered a soft pseudolabel for truthfulness. Experimental results demonstrate that IRIS consistently outperforms existing unsupervised methods. Our approach is fully unsupervised, computationally low cost, and works well even with few training data, making it suitable for real-time detection.

[70] Multi-Intent Recognition in Dialogue Understanding: A Comparison Between Smaller Open-Source LLMs cs.CL | cs.HC

Adnan Ahmad, Philine Kowol, Stefan Hillmann, Sebastian Möller

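TL;DR: This paper compares three small open-source large language models (LLama2-7B, Mistral-7B-v0.1, and Yi-6B) on multi-label intent classification and finds that Mistral-7B-v0.1 performs best in the few-shot setting, although a BERT-based supervised classifier performs better still.

Details

Motivation: The study evaluates how practical small open-source LLMs are for complex multi-intent dialogue understanding, particularly on resource-constrained hardware.

Result: Mistral-7B-v0.1 performs best on 11 of 14 intent classes (weighted F1 of 0.50), but overall it still falls short of the BERT-based supervised classifier.

Insight: Although generative LLMs perform respectably in the few-shot setting, supervised learning retains the advantage on multi-intent classification. The study offers a practical reference for applying small open-source LLMs.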

Abstract: In this paper, we provide an extensive analysis of multi-label intent classification using Large Language Models (LLMs) that are open-source, publicly available, and can be run on consumer hardware. We use the MultiWOZ 2.1 dataset, a benchmark in the dialogue system domain, to investigate the efficacy of three popular open-source pre-trained LLMs, namely LLama2-7B-hf, Mistral-7B-v0.1, and Yi-6B. We perform the classification task in a few-shot setup, giving 20 examples in the prompt with some instructions. Our approach focuses on the differences in performance of these models across several performance metrics by methodically assessing these models on multi-label intent classification tasks. Additionally, we compare the performance of the instruction-based fine-tuning approach with supervised learning using the smaller transformer model BertForSequenceClassification as a baseline. To evaluate the performance of the models, we use evaluation metrics like accuracy, precision, and recall as well as micro, macro, and weighted F1 score. We also report the inference time, VRAM requirements, etc. The Mistral-7B-v0.1 outperforms two other generative models on 11 intent classes out of 14 in terms of F-Score, with a weighted average of 0.50. It also has relatively lower Hamming Loss and higher Jaccard Similarity, making it the winning model in the few-shot setting. We find that the BERT-based supervised classifier has superior performance compared to the best-performing few-shot generative LLM. The study provides a framework for small open-source LLMs in detecting complex multi-intent dialogues, enhancing the Natural Language Understanding aspect of task-oriented chatbots.
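
Two of the multi-label metrics named in this abstract, Hamming loss and Jaccard similarity, are easy to compute by hand. The sketch below uses their standard definitions in plain Python; the intent names and the two example dialogue turns are made up for illustration and are not drawn from MultiWOZ 2.1 or from the paper's results.

```python
def hamming_loss(y_true, y_pred, labels):
    """Fraction of (sample, label) slots that are predicted incorrectly."""
    wrong = sum(len(set(t) ^ set(p)) for t, p in zip(y_true, y_pred))
    return wrong / (len(y_true) * len(labels))


def jaccard_similarity(y_true, y_pred):
    """Sample-averaged |true AND pred| / |true OR pred|."""
    scores = []
    for t, p in zip(y_true, y_pred):
        t, p = set(t), set(p)
        union = t | p
        # Two empty label sets are treated as a perfect match here.
        scores.append(len(t & p) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Hypothetical intent inventory and predictions for two dialogue turns.
labels = ["find_hotel", "book_hotel", "find_train", "book_train"]
y_true = [{"find_hotel", "book_hotel"}, {"find_train"}]
y_pred = [{"find_hotel"}, {"find_train", "book_train"}]

print(hamming_loss(y_true, y_pred, labels))  # 0.25
print(jaccard_similarity(y_true, y_pred))    # 0.5
```

Lower Hamming loss and higher Jaccard similarity are both better, which matches the direction in which the abstract reports Mistral-7B-v0.1 leading the other generative models.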

[71] Towards Reliable and Interpretable Document Question Answering via VLMs cs.CL | cs.IR

Alessio Chen, Simone Giovannini, Andrea Gemelli, Fabio Coppini, Simone Marinai

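TL;DR: The paper proposes DocExplainerV0, a module that decouples answer generation from spatial localization to improve the reliability and interpretability of vision-language models (VLMs) in document question answering.

Details

Motivation: In document QA, VLMs extract textual information well, but precisely localizing answers remains a challenge that limits practical use and interpretability.

Result: Experiments show that even when the answer text is correct, its spatial localization is often unreliable; the framework provides a standardized benchmark for future research.

Insight: The decoupled design increases flexibility and exposes the gap between textual accuracy and spatial grounding, pointing toward more reliable document understanding models.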

Abstract: Vision-Language Models (VLMs) have shown strong capabilities in document understanding, particularly in identifying and extracting textual information from complex documents. Despite this, accurately localizing answers within documents remains a major challenge, limiting both interpretability and real-world applicability. To address this, we introduce DocExplainerV0, a plug-and-play bounding-box prediction module that decouples answer generation from spatial localization. This design makes it applicable to existing VLMs, including proprietary systems where fine-tuning is not feasible. Through systematic evaluation, we provide quantitative insights into the gap between textual accuracy and spatial grounding, showing that correct answers often lack reliable localization. Our standardized framework highlights these shortcomings and establishes a benchmark for future research toward more interpretable and robust document information extraction VLMs.

[72] Benchmark of stylistic variation in LLM-generated texts cs.CL | cs.AI

Jiří Milička, Anna Marklová, Václav Cvrček

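TL;DR: Using Biber’s multidimensional analysis, this study compares stylistic differences between human-written and LLM-generated texts, builds the AI-Brown and AI-Koditex corpora, analyzes 16 frontier LLMs under various settings, and proposes an interpretable benchmark.

Details

Motivation: To explore how texts generated by large language models (LLMs) differ from human texts along stylistic dimensions, particularly across languages (English and Czech), and to fill a gap in multilingual stylistic research on LLMs.

Result: LLMs differ significantly and systematically from human texts on several stylistic dimensions, and the differences are more marked in non-English languages such as Czech.

Insight: LLM stylistic diversity is strongly affected by training data and instruction tuning; more diverse training data for non-English languages is needed.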

Abstract: This study investigates the register variation in texts written by humans and comparable texts produced by large language models (LLMs). Biber’s multidimensional analysis (MDA) is applied to a sample of human-written texts and AI-created texts generated to be their counterparts to find the dimensions of variation in which LLMs differ most significantly and most systematically from humans. As textual material, a new LLM-generated corpus AI-Brown is used, which is comparable to BE-21 (a Brown family corpus representing contemporary British English). Since all languages except English are underrepresented in the training data of frontier LLMs, similar analysis is replicated on Czech using AI-Koditex corpus and Czech multidimensional model. Examined were 16 frontier models in various settings and prompts, with emphasis placed on the difference between base models and instruction-tuned models. Based on this, a benchmark is created through which models can be compared with each other and ranked in interpretable dimensions.

[73] Incongruent Positivity: When Miscalibrated Positivity Undermines Online Supportive Conversations cs.CL

Leen Almajed, Abeer ALdayel

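TL;DR: This paper examines “incongruent positivity” in emotionally supportive conversations: well-intended positive responses can come across as inappropriate or dismissive, especially in high-stakes conversations. By analyzing human and LLM-generated responses, the authors propose ways to improve models’ context awareness.

Details

Motivation: In emotional support conversations, excessive or ill-fitting positivity can backfire, especially on emotionally intense topics such as grief and anxiety. The authors analyze this phenomenon in order to bring LLM outputs closer to what emotional support actually requires.

Result: LLMs are more prone to unrealistic positivity (e.g., minimizing or ignoring emotional depth) in high-stakes conversations. The fine-tuned models and the classifier detect incongruent positivity better, pointing to ways to improve supportive dialogue.

Insight: 1. Supportive conversations need to balance positivity with emotional acknowledgment to avoid backfiring. 2. LLM context awareness needs to improve, especially in emotionally intense conversations. 3. Classifiers and fine-tuning are effective ways to improve LLM emotional support.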

Abstract: In emotionally supportive conversations, well-intended positivity can sometimes misfire, leading to responses that feel dismissive, minimizing, or unrealistically optimistic. We examine this phenomenon of incongruent positivity as miscalibrated expressions of positive support in both human and LLM generated responses. To this end, we collected real user-assistant dialogues from Reddit across a range of emotional intensities and generated additional responses using large language models for the same context. We categorize these conversations by intensity into two levels: Mild, which covers relationship tension and general advice, and Severe, which covers grief and anxiety conversations. This level of categorization enables a comparative analysis of how supportive responses vary across lower and higher stakes contexts. Our analysis reveals that LLMs are more prone to unrealistic positivity through dismissive and minimizing tone, particularly in high-stakes contexts. To further study the underlying dimensions of this phenomenon, we finetune LLMs on datasets with strong and weak emotional reactions. Moreover, we developed a weakly supervised multilabel classifier ensemble (DeBERTa and MentalBERT) that shows improved detection of incongruent positivity types across two sorts of concerns (Mild and Severe). Our findings shed light on the need to move beyond merely generating generic positive responses and instead study the congruent support measures to balance positive affect with emotional acknowledgment. This approach offers insights into aligning large language models with affective expectations in the online supportive dialogue, paving the way toward context-aware and trust preserving online conversation systems.

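[74] Beyond Token Limits: Assessing Language Model Performance on Long Text Classification cs.CL | I.7; I.2; J.4

Miklós Sebők, Viktor Kovács, Martin Bánóczy, Daniel Møller Eriksen, Nathalie Neptune

TL;DR: The paper evaluates several language models (XLM-RoBERTa, Longformer, GPT-3.5, and GPT-4) on long-text classification. Longformer, although designed for long inputs, shows no particular advantage, and open models outperform the GPT variants in some cases. The study also shows that class support and content overlap between categories affect performance.

Details

Motivation: Mainstream language models such as BERT and RoBERTa limit input length, which makes it hard to classify very long texts such as draft laws. The study explores which models are better suited to such tasks.

Result: Longformer showed no particular advantage; open models outperformed the GPT variants on some tasks; class support and content overlap between categories had a notable effect on performance.

Insight: Long-text classification needs to attend to how models handle content overlap between categories; a model designed specifically for long text does not necessarily beat a general-purpose one.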

Abstract: The most widely used large language models in the social sciences (such as BERT, and its derivatives, e.g. RoBERTa) have a limitation on the input text length that they can process to produce predictions. This is a particularly pressing issue for some classification tasks, where the aim is to handle long input texts. One such area deals with laws and draft laws (bills), which can have a length of multiple hundred pages and, therefore, are not particularly amenable for processing with models that can only handle e.g. 512 tokens. In this paper, we show results from experiments covering 5 languages with XLM-RoBERTa, Longformer, GPT-3.5, GPT-4 models for the multiclass classification task of the Comparative Agendas Project, which has a codebook of 21 policy topic labels from education to health care. Results show no particular advantage for the Longformer model, pre-trained specifically for the purposes of handling long inputs. The comparison between the GPT variants and the best-performing open model yielded an edge for the latter. An analysis of class-level factors points to the importance of support and substance overlaps between specific categories when it comes to performance on long text inputs.

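[75] Dropping Experts, Recombining Neurons: Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs cs.CL

Yixiao Zhou, Ziyu Zhao, Dongzhou Cheng, Zhiliang Wu, Jie Gui

TL;DR: The paper proposes DERN, a retraining-free pruning method for sparse Mixture-of-Experts (SMoE) LLMs that prunes experts and recombines neurons, substantially reducing the number of experts and the memory footprint while improving performance.

Details

Motivation: SMoE architectures are widely used in LLMs, but all expert parameters must still be loaded, which leads to high memory usage and makes deployment difficult. Existing methods focus on expert-level operations and leave neuron-level structure underexplored.

Result: On Mixtral, Qwen, and DeepSeek SMoE models, DERN improves performance by more than 5% on commonsense reasoning and MMLU benchmarks at 50% expert sparsity, while greatly reducing the number of experts and memory usage.

Insight: Neuron-level semantic conflicts are the main obstacle to merging experts directly; DERN addresses this by recombining neuron-level segments, offering a new approach to optimizing SMoE models.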

Abstract: Sparse Mixture-of-Experts (SMoE) architectures are widely used in large language models (LLMs) due to their computational efficiency. However, though only a few experts are activated for each token, SMoE still requires loading all expert parameters, leading to high memory usage and challenges in deployment. Previous work has tried to reduce the overhead by pruning and merging experts, but primarily focused on expert-level operations, leaving neuron-level structure underexplored. We propose DERN (Dropping Experts, Recombining Neurons), a task-agnostic and retraining-free framework for expert pruning and reconstruction. We observe that experts are often misaligned and contain semantic conflicts at the neuron level, which poses challenges for direct merging. To solve this, DERN works in three steps: it first prunes redundant experts using router statistics; then it decomposes them into neuron-level expert segments, assigning each segment to its most compatible retained expert; and finally, it merges segments within each retained expert to build a compact representation. Experiments on Mixtral, Qwen, and DeepSeek SMoE models show that DERN improves performance by more than 5% on commonsense reasoning and MMLU benchmarks under 50% expert sparsity, without extra training. It also greatly reduces the number of experts and memory usage, making SMoE LLMs easier to deploy in practice.
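The three steps named in the abstract (prune by router statistics, assign neuron-level segments to a compatible retained expert, merge) can be pictured with a toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: each neuron is a flat list of floats, "router statistics" are plain selection counts, compatibility is cosine similarity to a retained neuron, and merging is simple averaging. The names `prune_and_recombine`, `router_counts`, and `keep` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def prune_and_recombine(experts, router_counts, keep):
    """Toy three-step pass: drop experts, reassign their neurons, merge.

    experts:       list of experts, each a list of neuron vectors
    router_counts: how often the router selected each expert (assumed statistic)
    keep:          number of experts to retain
    """
    # Step 1: retain the `keep` experts the router selected most often.
    order = sorted(range(len(experts)), key=lambda i: router_counts[i], reverse=True)
    kept_ids = sorted(order[:keep])
    dropped_ids = order[keep:]

    # groups[e][j] collects every neuron that will be merged into
    # neuron j of retained expert e (starting with that neuron itself).
    groups = {e: [[n] for n in experts[e]] for e in kept_ids}

    # Step 2: send each neuron of a dropped expert to the retained neuron
    # it is most similar to (assumed compatibility measure).
    slots = [(e, j) for e in kept_ids for j in range(len(experts[e]))]
    for d in dropped_ids:
        for neuron in experts[d]:
            e, j = max(slots, key=lambda s: cosine(neuron, experts[s[0]][s[1]]))
            groups[e][j].append(neuron)

    # Step 3: merge each group by element-wise averaging (assumed merge rule).
    return {
        e: [[sum(col) / len(g) for col in zip(*g)] for g in groups[e]]
        for e in kept_ids
    }


experts = [
    [[1.0, 0.0], [0.0, 1.0]],   # expert 0, frequently routed
    [[3.0, 0.0], [0.0, 3.0]],   # expert 1, rarely routed, will be dropped
    [[1.0, 1.0], [1.0, -1.0]],  # expert 2
]
merged = prune_and_recombine(experts, router_counts=[5, 1, 4], keep=2)
print(merged)
```

In this toy run the dropped expert's neurons point in the same directions as expert 0's, so they are folded into expert 0 while expert 2 is left unchanged. No retraining is involved, which is the property the abstract emphasizes.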

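[76] RefactorCoderQA: Benchmarking LLMs for Multi-Domain Coding Question Solutions in Cloud and Edge Deployment cs.CL

Shadikur Rahman, Aroosa Hameed, Gautam Srivastava, Syed Muhammad Danish

TL;DR: The paper proposes a cloud-edge collaborative architecture built on a multi-agent prompting framework of GuideLLM, SolverLLM, and JudgeLLM to improve LLM performance on multi-domain coding tasks, and introduces the RefactorCoderQA benchmark to evaluate it.

Details

Motivation: Limitations of existing benchmarks on multi-domain coding tasks led the authors to build a more comprehensive benchmark, RefactorCoderQA, to evaluate and enhance LLM problem solving.

Result: RefactorCoder-MoE significantly outperforms baseline models across domains, with an overall accuracy of 76.84%. Human evaluation confirms the interpretability and practical relevance of the solutions.

Insight: The cloud-edge collaborative framework and the use of system-level metrics such as throughput and latency offer a new angle on optimizing LLM performance in real deployments.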

Abstract: To optimize the reasoning and problem-solving capabilities of Large Language Models (LLMs), we propose a novel cloud-edge collaborative architecture that enables a structured, multi-agent prompting framework. This framework comprises three specialized components: GuideLLM, a lightweight model deployed at the edge to provide methodological guidance; SolverLLM, a more powerful model hosted in the cloud responsible for generating code solutions; and JudgeLLM, an automated evaluator for assessing solution correctness and quality. To evaluate and demonstrate the effectiveness of this architecture in realistic settings, we introduce RefactorCoderQA, a comprehensive benchmark designed to evaluate and enhance the performance of Large Language Models (LLMs) across multi-domain coding tasks. Motivated by the limitations of existing benchmarks, RefactorCoderQA systematically covers various technical domains, including Software Engineering, Data Science, Machine Learning, and Natural Language Processing, using authentic coding challenges from Stack Overflow. Extensive experiments reveal that our fine-tuned model, RefactorCoder-MoE, achieves state-of-the-art performance, significantly outperforming leading open-source and commercial baselines with an overall accuracy of 76.84%. Human evaluations further validate the interpretability, accuracy, and practical relevance of the generated solutions. In addition, we evaluate system-level metrics, such as throughput and latency, to gain deeper insights into the performance characteristics and trade-offs of the proposed architecture.
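The division of labour among the three components can be sketched as a plain function pipeline. This is only an illustration of the control flow described in the abstract: the function `run_pipeline` and the three stub callables are hypothetical stand-ins, and no real model, prompt, or API from the paper is reproduced.

```python
from typing import Callable, Dict


def run_pipeline(
    question: str,
    guide: Callable[[str], str],
    solver: Callable[[str, str], str],
    judge: Callable[[str, str], float],
) -> Dict[str, object]:
    """Guide -> Solver -> Judge flow with injected (hypothetical) components."""
    guidance = guide(question)             # edge: lightweight methodological hints
    solution = solver(question, guidance)  # cloud: generate the code solution
    score = judge(question, solution)      # automated correctness/quality check
    return {"guidance": guidance, "solution": solution, "score": score}


# Stub components standing in for GuideLLM, SolverLLM, and JudgeLLM.
def toy_guide(question: str) -> str:
    return "Use a dictionary to count items."


def toy_solver(question: str, guidance: str) -> str:
    return "from collections import Counter\ncounts = Counter(items)"


def toy_judge(question: str, solution: str) -> float:
    return 1.0 if "Counter" in solution else 0.0


result = run_pipeline("How do I count list items?", toy_guide, toy_solver, toy_judge)
print(result["score"])
```

Passing the components in as callables keeps the edge and cloud parts swappable, which is one simple way to measure throughput and latency of each stage separately.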

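[77] DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL cs.CL

Rui Lu, Zhenyu Hou, Zihan Wang, Hanchen Zhang, Xiao Liu

TL;DR: DeepDive improves the deep-search ability of open LLMs with knowledge graphs and reinforcement learning: it automatically synthesizes complex questions and trains with multi-turn RL, outperforming existing open-source methods.

Details

Motivation: Open LLMs perform poorly on deep-search tasks, mainly because of limited long-horizon reasoning and a lack of sufficiently difficult supervised data.

Result: DeepDive-32B outperforms WebSailor, DeepSeek-R1-Browse, and Search-o1 on BrowseComp, and multi-turn RL contributes significantly to the gains.

Insight: Multi-turn RL training improves deep-search ability, and DeepDive enables test-time scaling of tool calls and parallel sampling.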

Abstract: Augmenting large language models (LLMs) with browsing tools substantially improves their potential as deep search agents to solve complex, real-world tasks. Yet, open LLMs still perform poorly in such settings due to limited long-horizon reasoning capacity with browsing tools and the lack of sufficiently difficult supervised data. To address these challenges, we present DeepDive to advance deep search agents. First, we propose a strategy to automatically synthesize complex, difficult, and hard-to-find questions from open knowledge graphs. Second, we apply end-to-end multi-turn reinforcement learning (RL) to enhance LLMs’ long-horizon reasoning with deep search. Experiments show that DeepDive-32B achieves a new open-source competitive result on BrowseComp, outperforming WebSailor, DeepSeek-R1-Browse, and Search-o1. We demonstrate that multi-turn RL training improves deep search ability and significantly contributes to the performance improvements across multiple benchmarks. We observe that DeepDive enables test-time scaling of tool calls and parallel sampling. All datasets, models, and code are publicly available at https://github.com/THUDM/DeepDive.

q-bio.QM

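[78] HypoGeneAgent: A Hypothesis Language Agent for Gene-Set Cluster Resolution Selection Using Perturb-seq Datasets q-bio.QM | cs.AI | cs.CL | cs.LG

Ying Yuan, Xing-Yue Monica Ge, Aaron Archer Waterman, Tommaso Biancalani, David Richmond

TL;DR: HYPOGENEAGENT is an LLM-based framework for selecting clustering resolution and functionally annotating gene sets, making both tasks automated and quantitative through hypothesis generation and a scoring mechanism.

Details

Motivation: In single-cell and Perturb-seq studies, conventional resolution selection and functional annotation rely on heuristics and expert curation and are therefore subjective. HYPOGENEAGENT aims to make them quantitative and automated with an LLM.

Result: On a K562 CRISPRi Perturb-seq dataset, the resolutions selected by HYPOGENEAGENT align better with known pathways than those chosen by classical metrics such as silhouette score and modularity.

Insight: LLMs can serve as objective adjudicators of cluster resolution and functional annotation, moving single-cell multi-omics studies toward fully automated analysis.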

Abstract: Large-scale single-cell and Perturb-seq investigations routinely involve clustering cells and subsequently annotating each cluster with Gene-Ontology (GO) terms to elucidate the underlying biological programs. However, both stages, resolution selection and functional annotation, are inherently subjective, relying on heuristics and expert curation. We present HYPOGENEAGENT, a large language model (LLM)-driven framework, transforming cluster annotation into a quantitatively optimizable task. Initially, an LLM functioning as a gene-set analyst analyzes the content of each gene program or perturbation module and generates a ranked list of GO-based hypotheses, accompanied by calibrated confidence scores. Subsequently, we embed every predicted description with a sentence-embedding model, compute pair-wise cosine similarities, and let the agent referee panel score (i) the internal consistency of the predictions, high average similarity within the same cluster, termed intra-cluster agreement, and (ii) their external distinctiveness, low similarity between clusters, termed inter-cluster separation. These two quantities are combined to produce an agent-derived resolution score, which is maximized when clusters exhibit simultaneous coherence and mutual exclusivity. When applied to a public K562 CRISPRi Perturb-seq dataset as a preliminary test, our Resolution Score selects clustering granularities that exhibit better alignment with known pathways than classical metrics such as silhouette score and modularity score for gene functional enrichment summary. These findings establish LLM agents as objective adjudicators of cluster resolution and functional annotation, thereby paving the way for fully automated, context-aware interpretation pipelines in single-cell multi-omics studies.
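The scoring idea in the abstract (high similarity within a cluster, low similarity between clusters) can be sketched with plain cosine similarities. The abstract does not say how the two quantities are combined, so the subtraction below is an assumption, as are the function name `resolution_score` and the toy two-dimensional "embeddings"; a real pipeline would use sentence-embedding vectors of the LLM's predicted descriptions.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean(values):
    return sum(values) / len(values) if values else 0.0


def resolution_score(clusters):
    """clusters: list of clusters, each a list of description embeddings.

    Returns intra-cluster agreement minus inter-cluster similarity
    (the subtraction is an assumed way to combine the two terms).
    """
    # Intra-cluster agreement: average similarity of pairs inside one cluster.
    intra = [cosine(a, b) for c in clusters for a, b in combinations(c, 2)]
    # Inter-cluster similarity: average similarity of pairs across clusters;
    # low values mean the clusters are well separated.
    inter = [
        cosine(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    ]
    return mean(intra) - mean(inter)


# Two candidate clusterings of the same four toy embeddings.
coherent = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
mixed = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
print(resolution_score(coherent), resolution_score(mixed))
```

Evaluating this score at each candidate clustering resolution and keeping the maximum mirrors the selection step described above: the coherent clustering scores higher than the mixed one.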

cs.RO

Hang Yin, Haoyu Wei, Xiuwei Xu, Wenxuan Guo, Jie Zhou

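TL;DR: The paper proposes GC-VLN, a training-free method that casts navigation instructions as a graph constraint optimization problem to perform vision-and-language navigation in continuous environments.

Details

Motivation: Existing zero-shot vision-and-language navigation (VLN) methods are mostly designed for discrete environments or rely on unsupervised training in continuous simulators, which makes them hard to transfer to real-world settings. A training-free framework that works in continuous environments is therefore needed.

Result: On standard benchmarks, success rate and navigation efficiency improve significantly over existing zero-shot VLN methods. Real-world experiments confirm that the framework generalizes to new environments and instruction sets.

Insight: 1. Decomposing instructions into spatial constraints is an effective way to interpret navigation semantics; 2. graph constraint optimization offers a new approach for training-free methods; 3. the navigation tree and backtracking mechanism make the system more robust.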

Abstract: In this paper, we propose a training-free framework for vision-and-language navigation (VLN). Existing zero-shot VLN methods are mainly designed for discrete environments or involve unsupervised training in continuous simulator environments, which makes it challenging to generalize and deploy them in real-world scenarios. To achieve a training-free framework in continuous environments, our framework formulates navigation guidance as graph constraint optimization by decomposing instructions into explicit spatial constraints. The constraint-driven paradigm decodes spatial semantics through constraint solving, enabling zero-shot adaptation to unseen environments. Specifically, we construct a spatial constraint library covering all types of spatial relationship mentioned in VLN instructions. The human instruction is decomposed into a directed acyclic graph, with waypoint nodes, object nodes and edges, which are used as queries to retrieve the library to build the graph constraints. The graph constraint optimization is solved by the constraint solver to determine the positions of waypoints, obtaining the robot’s navigation path and final goal. To handle cases of no solution or multiple solutions, we construct a navigation tree and the backtracking mechanism. Extensive experiments on standard benchmarks demonstrate significant improvements in success rate and navigation efficiency compared to state-of-the-art zero-shot VLN methods. We further conduct real-world experiments to show that our framework can effectively generalize to new environments and instruction sets, paving the way for a more robust and autonomous navigation framework.

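eess.IV

[80] Automated Tuning for Diffusion Inverse Problem Solvers without Generative Prior Retraining eess.IV | cs.AI | cs.CV | cs.LG | physics.med-ph

Yaşar Utku Alçalar, Junno Yun, Mehmet Akçakaya

TL;DR: The paper proposes Zero-shot Adaptive Diffusion Sampling (ZADS), which adaptively tunes data fidelity weights for inverse problems without retraining the diffusion model, and performs especially well under fast sampling and sparse measurements.

Details

Motivation: Existing diffusion/score-based solvers for inverse problems rely on fixed data fidelity weights that do not adapt to different measurement conditions and sampling schedules, so performance is uneven.

Result: On the fastMRI knee dataset, ZADS outperforms traditional compressed sensing and existing diffusion-based methods, delivering high-fidelity reconstructions.

Insight: ZADS shows the importance of adaptive tuning under sparse measurement conditions and offers a more flexible tool for applying diffusion models.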

Abstract: Diffusion/score-based models have recently emerged as powerful generative priors for solving inverse problems, including accelerated MRI reconstruction. While their flexibility allows decoupling the measurement model from the learned prior, their performance heavily depends on carefully tuned data fidelity weights, especially under fast sampling schedules with few denoising steps. Existing approaches often rely on heuristics or fixed weights, which fail to generalize across varying measurement conditions and irregular timestep schedules. In this work, we propose Zero-shot Adaptive Diffusion Sampling (ZADS), a test-time optimization method that adaptively tunes fidelity weights across arbitrary noise schedules without requiring retraining of the diffusion prior. ZADS treats the denoising process as a fixed unrolled sampler and optimizes fidelity weights in a self-supervised manner using only undersampled measurements. Experiments on the fastMRI knee dataset demonstrate that ZADS consistently outperforms both traditional compressed sensing and recent diffusion-based methods, showcasing its ability to deliver high-fidelity reconstructions across varying noise schedules and acquisition settings.

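[81] Drone-Based Multispectral Imaging and Deep Learning for Timely Detection of Branched Broomrape in Tomato Farms eess.IV | cs.AI | cs.CV | cs.LG

Mohammadreza Narimani, Alireza Pourreza, Ali Moghimi, Mohsen Mesgaran, Parastoo Farajpoor

TL;DR: The study combines drone-based multispectral imaging with LSTM deep learning and SMOTE to detect branched broomrape early in tomato fields, markedly improving detection accuracy and recall.

Details

Motivation: Branched broomrape (Phelipanche ramosa) is a serious threat to the tomato industry, and conventional chemical controls are costly and of limited effectiveness. The study aims to develop an efficient, environmentally friendly method for early detection.

Result: At a key growth stage (897 GDD), the model reached 79.09% overall accuracy and 70.36% recall; integrating all growth stages with SMOTE augmentation raised accuracy to 88.37% and recall to 95.37%.

Insight: Combining temporal multispectral analysis with LSTM networks shows strong potential for early broomrape detection and offers a new approach for precision agriculture and sustainability.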

Abstract: This study addresses the escalating threat of branched broomrape (Phelipanche ramosa) to California’s tomato industry, which supplies over 90 percent of U.S. processing tomatoes. The parasite’s largely underground life cycle makes early detection difficult, while conventional chemical controls are costly, environmentally harmful, and often ineffective. To address this, we combined drone-based multispectral imagery with Long Short-Term Memory (LSTM) deep learning networks, using the Synthetic Minority Over-sampling Technique (SMOTE) to handle class imbalance. Research was conducted on a known broomrape-infested tomato farm in Woodland, Yolo County, CA, across five key growth stages determined by growing degree days (GDD). Multispectral images were processed to isolate tomato canopy reflectance. At 897 GDD, broomrape could be detected with 79.09 percent overall accuracy and 70.36 percent recall without integrating later stages. Incorporating sequential growth stages with LSTM improved detection substantially. The best-performing scenario, which integrated all growth stages with SMOTE augmentation, achieved 88.37 percent overall accuracy and 95.37 percent recall. These results demonstrate the strong potential of temporal multispectral analysis and LSTM networks for early broomrape detection. While further real-world data collection is needed for practical deployment, this study shows that UAV-based multispectral sensing coupled with deep learning could provide a powerful precision agriculture tool to reduce losses and improve sustainability in tomato production.
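For readers unfamiliar with SMOTE, the core idea is to create synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows that generic interpolation step only; it is not the study's pipeline, and the function name `smote`, the parameter defaults, and the toy two-feature samples are assumptions for illustration.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate `n_new` synthetic samples from a list of minority vectors.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        # Interpolate with a random factor u in [0, 1).
        u = rng.random()
        synthetic.append([a + u * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Toy minority class: four samples with two reflectance-like features each.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=5)
print(new_points)
```

Because every synthetic point is a convex combination of two real minority samples, the oversampled class stays inside the region the real samples already occupy, which is why SMOTE is a common remedy for class imbalance.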

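[82] Polarization Denoising and Demosaicking: Dataset and Baseline Method eess.IV | cs.CV

Muhamad Daniel Ariff Bin Abdul Rahman, Yusuke Monno, Masayuki Tanaka, Masatoshi Okutomi

TL;DR: The paper presents a dataset and a baseline method for joint polarization denoising and demosaicking, filling a gap in data and baselines for this task.

Details

Motivation: The joint task of polarization denoising and demosaicking has been little studied for lack of a suitable dataset and baseline method. The authors aim to fill this gap by providing both.

Result: Experiments show that the method achieves higher image reconstruction performance than alternative methods, providing a solid baseline.

Insight: The work underlines the importance of datasets and baseline methods in polarization image processing and lays a foundation for future research.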

Abstract: A division-of-focal-plane (DoFP) polarimeter enables us to acquire images with multiple polarization orientations in one shot and thus it is valuable for many applications using polarimetric information. The image processing pipeline for a DoFP polarimeter entails two crucial tasks: denoising and demosaicking. While polarization demosaicking for a noise-free case has increasingly been studied, the research for the joint task of polarization denoising and demosaicking is scarce due to the lack of a suitable evaluation dataset and a solid baseline method. In this paper, we propose a novel dataset and method for polarization denoising and demosaicking. Our dataset contains 40 real-world scenes and three noise-level conditions, consisting of pairs of noisy mosaic inputs and noise-free full images. Our method takes a denoising-then-demosaicking approach based on well-accepted signal processing components to offer a reproducible method. Experimental results demonstrate that our method exhibits higher image reconstruction performance than other alternative methods, offering a solid baseline.

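cs.AI

[83] Executable Ontologies: Synthesizing Event Semantics with Dataflow Architecture cs.AI | cs.CL | cs.FL | cs.SE

Aleksandr Boldachev

TL;DR: The paper presents boldsea, an architecture that combines event semantics with a dataflow architecture to address the limitations of traditional BPM systems and object-oriented semantic technologies, allowing event models to be modified at runtime and merging data and logic within a unified semantic framework.

Details

Motivation: Traditional BPM systems and object-oriented semantic technologies are limited when modeling complex dynamic systems, lacking runtime flexibility and a unified semantic framework. The paper addresses these issues by integrating event semantics with a dataflow architecture.

Result: The boldsea architecture allows event models to be modified at runtime and ensures temporal transparency, which the authors present as an advantage over traditional BPM systems and object-oriented approaches.

Insight: 1. Combining event semantics with a dataflow architecture opens new possibilities for modeling complex systems; 2. executing semantic models directly, without a compilation step, increases system flexibility; 3. a unified semantic framework simplifies the integration of data and logic.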

Abstract: This paper presents boldsea, Boldachev’s semantic-event approach – an architecture for modeling complex dynamic systems using executable ontologies – semantic models that act as dynamic structures, directly controlling process execution. We demonstrate that integrating event semantics with a dataflow architecture addresses the limitations of traditional Business Process Management (BPM) systems and object-oriented semantic technologies. The paper presents the formal BSL (boldsea Semantic Language), including its BNF grammar, and outlines the boldsea-engine’s architecture, which directly interprets semantic models as executable algorithms without compilation. It enables the modification of event models at runtime, ensures temporal transparency, and seamlessly merges data and business logic within a unified semantic framework.

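[84] Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems cs.AI | cs.CL

Alva West, Yixuan Weng, Minjun Zhu, Zhen Lin, Yue Zhang

TL;DR: The paper proposes the A2P framework, which uses structured causal inference for failure attribution in multi-agent systems and substantially improves accuracy.

Details

Motivation: Current methods treat failure attribution as a pattern recognition task, yielding low step-level accuracy (<17%) and no effective counterfactual reasoning, so they offer little practical debugging support for complex systems.

Result: On the Who&When benchmark, A2P reaches 47.46% step-level accuracy on the Algorithm-Generated dataset (a 2.85× improvement) and 29.31% on the Hand-Crafted dataset (a 2.43× improvement).

Insight: Framing the problem as causal inference rather than pattern recognition can substantially improve the accuracy and verifiability of failure attribution in multi-agent systems.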

Abstract: Failure attribution in multi-agent systems – pinpointing the exact step where a decisive error occurs – is a critical yet unsolved challenge. Current methods treat this as a pattern recognition task over long conversation logs, leading to critically low step-level accuracy (below 17%), which renders them impractical for debugging complex systems. Their core weakness is a fundamental inability to perform robust counterfactual reasoning: to determine if correcting a single action would have actually averted the task failure. To bridge this counterfactual inference gap, we introduce Abduct-Act-Predict (A2P) Scaffolding, a novel agent framework that transforms failure attribution from pattern recognition into a structured causal inference task. A2P explicitly guides a large language model through a formal three-step reasoning process within a single inference pass: (1) Abduction, to infer the hidden root causes behind an agent’s actions; (2) Action, to define a minimal corrective intervention; and (3) Prediction, to simulate the subsequent trajectory and verify if the intervention resolves the failure. This structured approach leverages the holistic context of the entire conversation while imposing a rigorous causal logic on the model’s analysis. Our extensive experiments on the Who&When benchmark demonstrate its efficacy. On the Algorithm-Generated dataset, A2P achieves 47.46% step-level accuracy, a 2.85$\times$ improvement over the 16.67% of the baseline. On the more complex Hand-Crafted dataset, it achieves 29.31% step accuracy, a 2.43$\times$ improvement over the baseline’s 12.07%. By reframing the problem through a causal lens, A2P Scaffolding provides a robust, verifiable, and significantly more accurate solution for automated failure attribution.
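The improvement factors quoted in the abstract follow directly from the reported accuracies, as a quick check shows. The helper name `improvement_factor` is hypothetical; the four accuracy figures are the ones stated above.

```python
def improvement_factor(method_acc: float, baseline_acc: float) -> float:
    """Ratio of the method's accuracy to the baseline's accuracy."""
    return method_acc / baseline_acc


# (A2P step-level accuracy, baseline step-level accuracy), in percent.
reported = {
    "Algorithm-Generated": (47.46, 16.67),
    "Hand-Crafted": (29.31, 12.07),
}

for dataset, (a2p, baseline) in reported.items():
    print(f"{dataset}: {improvement_factor(a2p, baseline):.2f}x")
# Algorithm-Generated: 2.85x
# Hand-Crafted: 2.43x
```

Note that these are relative gains: in absolute terms the step-level accuracy rises by about 30.8 and 17.2 percentage points, respectively.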

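eess.AS

[85] Error Analysis in a Modular Meeting Transcription System eess.AS | cs.CL | cs.LG | cs.SD

Peter Vieting, Simon Berger, Thilo von Neumann, Christoph Boeddeker, Ralf Schlüter

TL;DR: The paper extends a previously proposed framework to analyze leakage in speech separation, shows that cross-channel leakage has limited impact because voice activity detection (VAD) largely ignores it, and compares different segmentation methods.

Details

Motivation: Meeting transcription is a highly relevant field that has made remarkable progress, yet performance limits remain. The paper examines leakage in speech separation and its effect on transcription performance.

Result: Leakage has limited effect on final performance because the VAD ignores most of the leaked parts. Advanced diarization approaches reduce the gap to oracle segmentation by a third compared with a simple energy-based VAD. The system achieves state-of-the-art performance on LibriCSS among systems whose recognition module is trained on LibriSpeech data only.

Insight: Leakage in speech separation exists but has little effect on final transcription performance; advanced segmentation noticeably improves performance.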

Abstract: Meeting transcription is a field of high relevance and remarkable progress in recent years. Still, challenges remain that limit its performance. In this work, we extend a previously proposed framework for analyzing leakage in speech separation with proper sensitivity to temporal locality. We show that there is significant leakage to the cross channel in areas where only the primary speaker is active. At the same time, the results demonstrate that this does not affect the final performance much as these leaked parts are largely ignored by the voice activity detection (VAD). Furthermore, different segmentations are compared showing that advanced diarization approaches are able to reduce the gap to oracle segmentation by a third compared to a simple energy-based VAD. We additionally reveal what factors contribute to the remaining difference. The results represent state-of-the-art performance on LibriCSS among systems that train the recognition module on LibriSpeech data only.

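cs.IR

[86] DB3 Team’s Solution For Meta KDD Cup’ 25 cs.IR | cs.AI | cs.CL | cs.LG

Yikuan Xia, Jiazun Chen, Yirui Zhan, Suifeng Zhao, Weipeng Jiang

TL;DR: The DB3 team’s solution won the CRAG-MM challenge at Meta KDD Cup’25, combining tailored retrieval pipelines with a unified LLM-tuning approach to perform well on multi-modal, multi-turn question answering.

Details

Motivation: To meet the CRAG-MM challenge’s particular demands of multi-modal, multi-turn question answering and to handle first-person (ego-centric) queries better.

Result: The system placed 2nd in Task 1, 2nd in Task 2, and 1st in Task 3, and won the grand prize.

Insight: Combining domain-specific retrieval with unified LLM tuning can effectively improve performance on multi-modal question answering.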

Abstract: This paper presents the db3 team’s winning solution for the Meta CRAG-MM Challenge 2025 at KDD Cup’25. Addressing the challenge’s unique multi-modal, multi-turn question answering benchmark (CRAG-MM), we developed a comprehensive framework that integrates tailored retrieval pipelines for different tasks with a unified LLM-tuning approach for hallucination control. Our solution features (1) domain-specific retrieval pipelines handling image-indexed knowledge graphs, web sources, and multi-turn conversations; and (2) advanced refusal training using SFT, DPO, and RL. The system achieved 2nd place in Task 1, 2nd place in Task 2, and 1st place in Task 3, securing the grand prize for excellence in ego-centric queries through superior handling of first-person perspective challenges.

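cs.LG

[87] Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL cs.LG | cs.AI | cs.CL

Hanyi Mao, Quanjia Xiao, Lei Pang, Haixiao Liu

TL;DR: The paper proposes FSPO, a sequence-level reinforcement learning method that enforces length fairness by clipping importance-sampling weights in a length-fair way, resolving the mismatch that arises when PPO/GRPO-style clipping is applied at the sequence level.

Details

Motivation: When PPO/GRPO-style clipping is applied to sequence-level RL, a fixed clip range is unfair across lengths: it systematically reweights short versus long responses and distorts the effective objective.

Result: Experiments show that FSPO flattens clip rates across length bins, stabilizes training, and outperforms all baselines across multiple evaluation datasets.

Insight: The paper shows the importance of length fairness in sequence-level RL and proposes a simple, effective clipping strategy that may inform similar problems.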

Abstract: We propose FSPO (Fair Sequence Policy Optimization), a sequence-level reinforcement learning method for LLMs that enforces length-fair clipping directly in the importance-sampling (IS) weight space. We revisit sequence-level RL methods and identify a mismatch when PPO/GRPO-style clipping is transplanted to sequences: a fixed clip range systematically reweights short vs. long responses, distorting the effective objective. Theoretically, we formalize length fairness via a Length Reweighting Error (LRE) and prove that small LRE yields a directional cosine guarantee between the clipped and true updates. FSPO introduces a simple, Gaussian-motivated remedy: we clip the sequence log-IS ratio with a band that applies a KL-corrected drift term and scales as $\sqrt{L}$. Empirically, FSPO flattens clip rates across length bins, stabilizes training, and outperforms all baselines across multiple evaluation datasets.
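The clipping rule described in the abstract can be pictured as a band on the sequence log-IS ratio whose centre drifts with length and whose width grows like $\sqrt{L}$. The sketch below is one plausible reading under stated assumptions, not the paper's exact formula: the drift is taken as $-L \cdot \text{KL}$ per token (the expected log-ratio under the sampling policy is the negative KL divergence), the half-width is a constant `c` times $\sqrt{L}$, and the names `clip_sequence_log_ratio`, `kl_per_token`, and `c` are hypothetical.

```python
import math


def clip_sequence_log_ratio(token_log_ratios, kl_per_token, c):
    """Clip the summed log importance ratio of one sequence.

    token_log_ratios: per-token log(pi_new / pi_old) values
    kl_per_token:     assumed per-token KL estimate used as the drift term
    c:                assumed band-width constant
    """
    length = len(token_log_ratios)
    log_ratio = sum(token_log_ratios)
    # If token log-ratios behave like i.i.d. noise, their sum has a mean
    # proportional to L and a standard deviation proportional to sqrt(L).
    centre = -kl_per_token * length
    half_width = c * math.sqrt(length)
    return max(centre - half_width, min(centre + half_width, log_ratio))


# A short sequence with a large ratio is clipped to the upper band edge.
short = clip_sequence_log_ratio([0.5] * 4, kl_per_token=0.0, c=0.5)
# A longer sequence with small per-token ratios stays inside its wider band.
long = clip_sequence_log_ratio([0.05] * 16, kl_per_token=0.0, c=0.5)
print(short, long)
```

With a fixed clip range, long sequences would hit the boundary far more often simply because their log-ratios accumulate more variance; scaling the band with $\sqrt{L}$ is what lets clip rates even out across length bins.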

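[88] Latency and Token-Aware Test-Time Compute cs.LG | cs.AI | cs.CL

Jenny Y. Huang, Mehul Damani, Yousef El-Kurdi, Ramon Astudillo, Wei Sun

TL;DR: The paper proposes a dynamic compute-allocation framework for LLM inference-time scaling that accounts for both token cost and latency and outperforms static strategies.

Details

Motivation: Prior work focuses on parallel generation methods such as best-of-N, overlooks incremental decoding methods such as beam search, and largely ignores the effect of latency on user experience. The paper aims to fill this gap.

Result: Experiments on reasoning benchmarks show that the framework outperforms static strategies, achieving favorable accuracy-cost trade-offs while remaining practical to deploy.

Insight: Latency is a key factor in user experience, especially in agentic workflows that issue multiple queries; dynamic allocation and method selection can substantially improve efficiency.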

Abstract: Inference-time scaling has emerged as a powerful way to improve large language model (LLM) performance by generating multiple candidate responses and selecting among them. However, existing work on dynamic allocation for test-time compute typically considers only parallel generation methods such as best-of-N, overlooking incremental decoding methods like beam search, and has largely ignored latency, focusing only on token usage. We formulate inference-time scaling as a problem of dynamic compute allocation and method selection, where the system must decide which strategy to apply and how much compute to allocate on a per-query basis. Our framework explicitly incorporates both token cost and wall-clock latency, the latter being critical for user experience and particularly for agentic workflows where models must issue multiple queries efficiently. Experiments on reasoning benchmarks show that our approach consistently outperforms static strategies, achieving favorable accuracy-cost trade-offs while remaining practical for deployment.
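Per-query method selection with both costs in view can be sketched as picking the option with the best utility. The linear utility below (estimated accuracy minus weighted token and latency costs) is an assumption for illustration, as are the `Option` fields, the weights, and all the numbers; the paper's actual predictor and objective may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Option:
    method: str          # e.g. "best-of-N" or "beam search"
    budget: int          # N for best-of-N, beam width for beam search
    est_accuracy: float  # predicted probability of a correct answer
    tokens: float        # expected tokens generated
    latency_s: float     # expected wall-clock latency in seconds


def choose(options: List[Option], token_weight: float, latency_weight: float) -> Option:
    """Pick the option maximizing accuracy minus weighted token and latency cost."""
    def utility(o: Option) -> float:
        return o.est_accuracy - token_weight * o.tokens - latency_weight * o.latency_s
    return max(options, key=utility)


# Hypothetical estimates for one query.
options = [
    Option("best-of-N", 1, est_accuracy=0.60, tokens=500, latency_s=2.0),
    Option("best-of-N", 8, est_accuracy=0.75, tokens=4000, latency_s=2.5),
    Option("beam search", 4, est_accuracy=0.78, tokens=2000, latency_s=8.0),
]

token_only = choose(options, token_weight=1e-5, latency_weight=0.0)
latency_aware = choose(options, token_weight=1e-5, latency_weight=0.05)
print(token_only.method, latency_aware.method)
```

In this toy setup, counting only tokens favours beam search, while adding a latency penalty shifts the choice to parallel best-of-N sampling, which generates more tokens but finishes sooner. That shift is the kind of trade-off a token-only objective cannot see.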

Table of Contents

cs.CV

[1] Australian Supermarket Object Set (ASOS): A Benchmark Dataset of Physical Objects and 3D Models for Robotics and Computer Vision cs.CV | cs.RO | eess.IV

[2] A Multimodal RAG Framework for Housing Damage Assessment: Collaborative Optimization of Image Encoding and Policy Vector Retrieval cs.CV | cs.AI | cs.LG

[3] Improving MLLM Historical Record Extraction with Test-Time Image cs.CV | cs.CL | cs.LG

[4] MITS: A Large-Scale Multimodal Benchmark Dataset for Intelligent Traffic Surveillance cs.CV | cs.AI

[5] Decomposing Visual Classification: Assessing Tree-Based Reasoning in VLMs cs.CV

[6] World Modeling with Probabilistic Structure Integration cs.CV | cs.AI | cs.LG

[7] Images in Motion?: A First Look into Video Leakage in Collaborative Deep Learning cs.CV

[8] Fine-Grained Cross-View Localization via Local Feature Matching and Monocular Depth Priors cs.CV

[9] Early Detection of Visual Impairments at Home Using a Smartphone Red-Eye Reflex Test cs.CV | cs.LG

[10] DGFusion: Depth-Guided Sensor Fusion for Robust Semantic Perception cs.CV | cs.LG | cs.RO

[11] Patch-based Automatic Rosacea Detection Using the ResNet Deep Learning Framework cs.CV

[12] Privacy-Preserving Automated Rosacea Detection Based on Medically Inspired Region of Interest Selection cs.CV

[13] Investigating the Impact of Various Loss Functions and Learnable Wiener Filter for Laparoscopic Image Desmoking cs.CV

[14] WAVE-DETR Multi-Modal Visible and Acoustic Real-Life Drone Detector cs.CV | cs.LG | 68W99

[15] Surrogate Supervision for Robust and Generalizable Deformable Image Registration cs.CV | cs.AI

[16] An Autoencoder and Vision Transformer-based Interpretability Analysis of the Differences in Automated Staging of Second and Third Molars cs.CV | cs.AI | 68T07 (Primary)

[17] SCoDA: Self-supervised Continual Domain Adaptation cs.CV

[18] Segment Anything for Cell Tracking cs.CV

[19] Online 3D Multi-Camera Perception through Robust 2D Tracking and Depth-based Late Aggregation cs.CV

[20] Zero-Shot Referring Expression Comprehension via Visual-Language True/False Verification cs.CV | cs.AI

[21] An HMM-based framework for identity-aware long-term multi-object tracking from sparse and uncertain identification: use case on long-term tracking in livestock cs.CV

[22] Event Camera Guided Visual Media Restoration & 3D Reconstruction: A Survey cs.CV

[23] ISTASTrack: Bridging ANN and SNN via ISTA Adapter for RGB-Event Tracking cs.CV

[24] FLARE-SSM: Deep State Space Models with Influence-Balanced Loss for 72-Hour Solar Flare Prediction cs.CV | astro-ph.SR

[25] TUNI: Real-time RGB-T Semantic Segmentation with Unified Multi-Modal Feature Extraction and Cross-Modal Feature Fusion cs.CV

[26] Efficient and Accurate Downfacing Visual Inertial Odometry cs.CV | cs.RO | eess.IV

[27] Hierarchical MLANet: Multi-level Attention for 3D Face Reconstruction From Single Images cs.CV

[28] LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA cs.CV

[29] Color Me Correctly: Bridging Perceptual Color Spaces and Text Embeddings for Improved Diffusion Generation cs.CV

[30] Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration cs.CV | cs.AI

[31] BEVTraj: Map-Free End-to-End Trajectory Prediction in Bird’s-Eye View with Deformable Attention and Sparse Goal Proposals cs.CV | I.2.9; I.4.8

[32] Leveraging Multi-View Weak Supervision for Occlusion-Aware Multi-Human Parsing cs.CV

[33] VARCO-VISION-2.0 Technical Report cs.CV | cs.CL

[34] A Lightweight Ensemble-Based Face Image Quality Assessment Method with Correlation-Aware Loss cs.CV

[35] LayerLock: Non-collapsing Representation Learning with Progressive Freezing cs.CV

[36] On the Geometric Accuracy of Implicit and Primitive-based Representations Derived from View Rendering Constraints cs.CV

[37] GAMMA: Generalizable Alignment via Multi-task and Manipulation-Augmented Training for AI-Generated Image Detection cs.CV

[38] Robustness and Diagnostic Performance of Super-Resolution Fetal Brain MRI cs.CV

[39] Mask Consistency Regularization in Object Removal cs.CV

[40] MagicMirror: A Large-Scale Dataset and Benchmark for Fine-Grained Artifacts Assessment in Text-to-Image Generation cs.CV

[41] SignClip: Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion cs.CV | cs.AI

[42] Detecting Text Manipulation in Images using Vision Language Models cs.CV

[43] MCL-AD: Multimodal Collaboration Learning for Zero-Shot 3D Anomaly Detection cs.CV | cs.LG

[44] Adversarial robustness through Lipschitz-Guided Stochastic Depth in Neural Networks cs.CV

[45] A Stochastic Birth-and-Death Approach for Street Furniture Geolocation in Urban Environments cs.CV

[46] Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching cs.CV

[47] I-Segmenter: Integer-Only Vision Transformer for Efficient Semantic Segmentation cs.CV | cs.AI | cs.LG

[48] GARD: Gamma-based Anatomical Restoration and Denoising for Retinal OCT cs.CV

[49] GLAM: Geometry-Guided Local Alignment for Multi-View VLP in Mammography cs.CV | cs.AI | cs.LG

[50] Towards Understanding Visual Grounding in Visual Language Models cs.CV | cs.AI

[51] Ordinality of Visible-Thermal Image Intensities for Intrinsic Image Decomposition cs.CV

[52] Compressed Video Quality Enhancement: Classifying and Benchmarking over Standards cs.CV

[53] Multimodal SAM-adapter for Semantic Segmentation cs.CV | cs.AI

[54] InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis cs.CV

[55] SSL-AD: Spatiotemporal Self-Supervised Learning for Generalizability and Adaptability Across Alzheimer’s Prediction Tasks and Datasets cs.CV | cs.LG

cs.CL

[56] Cross-Layer Attention Probing for Fine-Grained Hallucination Detection cs.CL | cs.AI

[57] Creativity Benchmark: A benchmark for marketing creativity for LLM models cs.CL | cs.AI | cs.HC

[58] Temporal Preferences in Language Models for Long-Horizon Assistance cs.CL | cs.AI | cs.CY

[59] Psychiatry-Bench: A Multi-Task Benchmark for LLMs in Psychiatry cs.CL | cs.AI

[60] The Thinking Therapist: Training Large Language Models to Deliver Acceptance and Commitment Therapy using Supervised Fine-Tuning and Odds Ratio Policy Optimization cs.CL | cs.AI

[61] HANRAG: Heuristic Accurate Noise-resistant Retrieval-Augmented Generation for Multi-hop Question Answering cs.CL | cs.AI

[62] A Role-Aware Multi-Agent Framework for Financial Education Question Answering with LLMs cs.CL | cs.CE

[63] MultimodalHugs: Enabling Sign Language Processing in Hugging Face cs.CL | cs.AI | cs.MM

[64] Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning cs.CL

[65] HEFT: A Coarse-to-Fine Hierarchy for Enhancing the Efficiency and Accuracy of Language Model Reasoning cs.CL | cs.AI | cs.LG | 68T07, 68T50, 68T05 | I.2.7; I.2.6; C.4

[66] Pragmatic Frames Evoked by Gestures: A FrameNet Brasil Approach to Multimodality in Turn Organization cs.CL

[67] Topic-Guided Reinforcement Learning with LLMs for Enhancing Multi-Document Summarization cs.CL

[68] Emulating Public Opinion: A Proof-of-Concept of AI-Generated Synthetic Survey Responses for the Chilean Case cs.CL | cs.AI | 68T50 (Primary) 91F10 (Secondary)

[69] Unsupervised Hallucination Detection by Inspecting Reasoning Processes cs.CL | cs.AI

[70] Multi-Intent Recognition in Dialogue Understanding: A Comparison Between Smaller Open-Source LLMs cs.CL | cs.HC

[71] Towards Reliable and Interpretable Document Question Answering via VLMs cs.CL | cs.IR

[72] Benchmark of stylistic variation in LLM-generated texts cs.CL | cs.AI

[73] Incongruent Positivity: When Miscalibrated Positivity Undermines Online Supportive Conversations cs.CL

[74] Beyond Token Limits: Assessing Language Model Performance on Long Text Classification cs.CL | I.7; I.2; J.4

[75] Dropping Experts, Recombining Neurons: Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs cs.CL

[76] RefactorCoderQA: Benchmarking LLMs for Multi-Domain Coding Question Solutions in Cloud and Edge Deployment cs.CL

[77] DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL cs.CL