cs.CV [Total: 53]
cs.CL [Total: 27]
cs.MM [Total: 1]
cs.SI [Total: 1]
eess.IV [Total: 6]
cs.GR [Total: 2]
cs.DC [Total: 1]
cs.RO [Total: 4]
cs.SE [Total: 1]
cs.AI [Total: 1]
cs.LG [Total: 7]

cs.CV

[1] SemIRNet: A Semantic Irony Recognition Network for Multimodal Sarcasm Detection cs.CV | cs.CL | cs.LG

Jingxuan Zhou, Yuehao Wu, Yibo Zhang, Yeyubei Zhang, Yunchong Liu

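TL;DR: This paper proposes the Semantic Irony Recognition Network (SemIRNet) for multimodal sarcasm detection. It introduces the ConceptNet knowledge base, cross-modal semantic similarity detection modules, and a contrastive learning loss, which together improve model performance.

Details

Motivation: Multimodal sarcasm detection struggles to accurately identify the implicit associations between images and text.

Result: On a public dataset, accuracy and F1 improve by 1.64% and 2.88%, reaching 88.87% and 86.33%, respectively.

Insight: Knowledge fusion and semantic similarity detection both play an important role in improving model performance.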

Abstract: To address the difficulty of accurately identifying implicit image-text correlations in multimodal irony detection, this paper proposes a Semantic Irony Recognition Network (SemIRNet). The model contains three main innovations: (1) the ConceptNet knowledge base is introduced for the first time to acquire conceptual knowledge, which enhances the model’s common-sense reasoning ability; (2) two cross-modal semantic similarity detection modules, at the word level and the sample level, are designed to model image-text correlations at different granularities; and (3) a contrastive learning loss function is introduced to optimize the spatial distribution of the sample features, which improves the separability of positive and negative samples. Experiments on a publicly available multimodal irony detection benchmark dataset show that the accuracy and F1 value of this model improve by 1.64% and 2.88%, to 88.87% and 86.33% respectively, compared with the best existing methods. Further ablation experiments verify the important role of knowledge fusion and semantic similarity detection in improving model performance.
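
The abstract does not give the form of the contrastive loss, so the following is only a minimal sketch of one common choice, a supervised contrastive loss over cosine similarities. The function names, the temperature value, and the use of this particular loss are assumptions for illustration, not SemIRNet's actual formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Mean supervised contrastive loss over anchors that have a positive.

    For each anchor i, samples with the same label are positives; the loss
    pulls them together and pushes all other samples away.
    """
    n = len(features)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = {j: cosine(features[i], features[j]) / temperature
                  for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0
```

When same-label features already cluster together the loss is close to zero; when labels cut across the clusters it is large, which is the separability effect the abstract describes.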

[2] Argus Inspection: Do Multimodal Large Language Models Possess the Eye of Panoptes? cs.CV | cs.AI | cs.CL | cs.LG | cs.MM

Yang Yao, Lingyu Li, Jiaxin Song, Chiyu Chen, Zhenqi He

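TL;DR: The paper proposes the Argus Inspection benchmark and the Eye of Panoptes framework for evaluating multimodal large language models (MLLMs) on fine-grained visual perception and commonsense causal reasoning. The best current model scores only 0.46, leaving considerable room for improvement.

Details

Motivation: As MLLMs have developed, their cognitive and reasoning abilities have improved markedly, but fine-grained visual perception and commonsense causal reasoning remain challenging. The paper aims to provide a quantitative evaluation method for these problems.

Result: Across 26 mainstream MLLMs, the highest score on visual fine-grained reasoning is only 0.46, indicating that current models are limited.

Insight: Current MLLMs still have substantial room to improve in fine-grained visual perception and causal reasoning; future work should focus on how to fuse multimodal information to improve model performance.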

Abstract: As Multimodal Large Language Models (MLLMs) continue to evolve, their cognitive and reasoning capabilities have seen remarkable progress. However, challenges in visual fine-grained perception and commonsense causal inference persist. This paper introduces Argus Inspection, a multimodal benchmark with two levels of difficulty, emphasizing detailed visual recognition while incorporating real-world commonsense understanding to evaluate causal reasoning abilities. Expanding on it, we present the Eye of Panoptes framework, which integrates a binary parametric Sigmoid metric with an indicator function, enabling a more holistic evaluation of MLLMs’ responses in opinion-based reasoning tasks. Experiments conducted on 26 mainstream MLLMs reveal that the highest performance in visual fine-grained reasoning reaches only 0.46, highlighting considerable potential for enhancement. Our research offers valuable perspectives for the continued refinement of MLLMs.

[3] A Hybrid ConvNeXt-EfficientNet AI Solution for Precise Falcon Disease Detection cs.CV

Alavikunhu Panthakkan, Zubair Medammal, S M Anzar, Fatma Taher, Hussain Al-Ahmad

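TL;DR: This paper proposes a hybrid AI model combining ConvNeXt and EfficientNet for precise falcon disease detection, outperforming traditional methods and single-model architectures.

Details

Motivation: Falconry, the tradition of training and hunting with falcons, requires careful health monitoring to protect these prized birds. Traditional diagnostic methods are inefficient, so a more precise AI solution is needed.

Result: Experiments show that the hybrid model outperforms traditional methods and single models in falcon disease detection, offering a new direction for AI-driven avian health monitoring.

Insight: Hybrid AI models show promise in specialized domains such as avian health monitoring and could be extended to other complex medical diagnosis tasks.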

Abstract: Falconry, a revered tradition involving the training and hunting with falcons, requires meticulous health surveillance to ensure the health and safety of these prized birds, particularly in hunting scenarios. This paper presents an innovative method employing a hybrid of ConvNeXt and EfficientNet AI models for the classification of falcon diseases. The study focuses on accurately identifying three conditions: Normal, Liver Disease and ‘Aspergillosis’. A substantial dataset was utilized for training and validating the model, with an emphasis on key performance metrics such as accuracy, precision, recall, and F1-score. Extensive testing and analysis have shown that our concatenated AI model outperforms traditional diagnostic methods and individual model architectures. The successful implementation of this hybrid AI model marks a significant step forward in precise falcon disease detection and paves the way for future developments in AI-powered avian healthcare solutions.

[4] ViLLa: A Neuro-Symbolic approach for Animal Monitoring cs.CV | cs.AI

Harsha Koduri

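TL;DR: ViLLa is a neuro-symbolic framework combining visual and language understanding for interpretable animal monitoring; it achieves transparency through modular perception, understanding, and reasoning.

Details

Motivation: Monitoring animal populations requires systems that can interpret both visual data and natural language queries while avoiding the opacity of traditional black-box models.

Result: ViLLa was validated on a range of animal imagery tasks and can bridge visual content with structured, interpretable queries.

Insight: Neuro-symbolic methods can improve interpretability and modularity and suit complex multimodal tasks.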

Abstract: Monitoring animal populations in natural environments requires systems that can interpret both visual data and human language queries. This work introduces ViLLa (Vision-Language-Logic Approach), a neuro-symbolic framework designed for interpretable animal monitoring. ViLLa integrates three core components: a visual detection module for identifying animals and their spatial locations in images, a language parser for understanding natural language queries, and a symbolic reasoning layer that applies logic-based inference to answer those queries. Given an image and a question such as “How many dogs are in the scene?” or “Where is the buffalo?”, the system grounds visual detections into symbolic facts and uses predefined rules to compute accurate answers related to count, presence, and location. Unlike end-to-end black-box models, ViLLa separates perception, understanding, and reasoning, offering modularity and transparency. The system was evaluated on a range of animal imagery tasks and demonstrates the ability to bridge visual content with structured, human-interpretable queries.
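
To make the perception / parsing / reasoning split concrete, here is a minimal sketch of the symbolic layer only. The detector output is assumed to be already available as labeled boxes; the `Detection` type, the toy `parse` patterns, and the three-way left/center/right location rule are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One symbolic fact grounded from the detector: a label and a pixel box."""
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max)


def parse(question):
    """Toy parser: map a question to (query_type, singular_label)."""
    words = question.lower().rstrip("?").split()
    if words[:2] == ["how", "many"]:
        noun = words[2]
        return "count", noun[:-1] if noun.endswith("s") else noun
    if words[0] == "where":
        return "location", words[-1]
    return "presence", words[-1]


def region(box, image_width):
    """Coarse horizontal location of a box, by thirds of the image width."""
    center_x = (box[0] + box[2]) / 2
    if center_x < image_width / 3:
        return "left"
    if center_x < 2 * image_width / 3:
        return "center"
    return "right"


def answer(question, facts, image_width):
    """Apply a predefined rule for count, location, or presence queries."""
    query_type, label = parse(question)
    matches = [d for d in facts if d.label == label]
    if query_type == "count":
        return len(matches)
    if query_type == "location":
        return [region(d.box, image_width) for d in matches]
    return bool(matches)
```

Because every answer is computed from explicit facts and rules, each result can be traced back to the detections that produced it, which is the transparency argument made in the abstract.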

[5] GraphGSOcc: Semantic and Geometric Graph Transformer for 3D Gaussian Splating-based Occupancy Prediction cs.CV | cs.AI

Ke Song, Yunhe Wu, Chunchit Siu, Huiyuan Xiong

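TL;DR: GraphGSOcc proposes a new framework combining semantic and geometric graph Transformers. It addresses unified feature aggregation and boundary ambiguity in 3D Gaussian Splatting methods and improves both the accuracy and efficiency of semantic occupancy prediction.

Details

Motivation: Existing 3D Gaussian Splatting methods have two problems: (1) unified feature aggregation ignores semantic correlations among similar categories and across regions; (2) the lack of geometric constraints leads to ambiguous boundaries.

Result: On the SurroundOcc dataset, the model reaches 24.10% mIoU with GPU memory reduced to 6.1 GB, a 1.97% mIoU improvement and a 13.7% memory reduction relative to GaussianWorld.

Insight: With hierarchical graph attention, lower layers refine boundary details while higher layers model object-level topology, improving the accuracy and efficiency of semantic occupancy prediction.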

Abstract: Addressing the task of 3D semantic occupancy prediction for autonomous driving, we tackle two key issues in existing 3D Gaussian Splatting (3DGS) methods: (1) unified feature aggregation neglecting semantic correlations among similar categories and across regions, and (2) boundary ambiguities caused by the lack of geometric constraints in MLP iterative optimization. We propose the GraphGSOcc model, a novel framework that combines semantic and geometric graph Transformer for 3D Gaussian Splatting-based Occupancy Prediction. We propose the Dual Gaussians Graph Attention, which dynamically constructs dual graph structures: a geometric graph adaptively calculating KNN search radii based on Gaussian poses, enabling large-scale Gaussians to aggregate features from broader neighborhoods while compact Gaussians focus on local geometric consistency; a semantic graph retaining top-M highly correlated nodes via cosine similarity to explicitly encode semantic relationships within and across instances. Coupled with the Multi-scale Graph Attention framework, fine-grained attention at lower layers optimizes boundary details, while coarse-grained attention at higher layers models object-level topology. Experiments on the SurroundOcc dataset achieve an mIoU of 24.10%, reducing GPU memory to 6.1 GB, demonstrating a 1.97% mIoU improvement and 13.7% memory reduction compared to GaussianWorld.
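
The dual graph construction can be illustrated with a small sketch. It assumes each Gaussian has a feature vector, a center, and a scalar scale; the rule "radius = factor × scale" and both function names are hypothetical stand-ins, since the abstract does not give the exact radius formula.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def semantic_graph(features, m):
    """For each node, keep the indices of its m most cosine-similar nodes."""
    graph = []
    for i, f in enumerate(features):
        scored = sorted(
            ((cosine(f, g), j) for j, g in enumerate(features) if j != i),
            reverse=True,
        )
        graph.append([j for _, j in scored[:m]])
    return graph


def geometric_graph(centers, scales, factor=2.0):
    """Connect each Gaussian to neighbors within a radius that grows with its scale.

    Assumed rule: radius_i = factor * scales[i], so large Gaussians see a
    broader neighborhood and compact ones stay local.
    """
    graph = []
    for i, c in enumerate(centers):
        radius = factor * scales[i]
        graph.append([j for j, d in enumerate(centers)
                      if j != i and math.dist(c, d) <= radius])
    return graph
```

In this toy form the two graphs are plain adjacency lists; attention over them would then aggregate features only along these edges.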

[6] DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning cs.CV | cs.AI

Yifeng Gao, Yifan Ding, Hongyu Su, Juncheng Li, Yunhan Zhao

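TL;DR: DAVID-XR1 is an explainable method for detecting AI-generated video. By providing fine-grained spatio-temporal defect annotations and natural language explanations, it turns AI-generated video detection from a black-box decision into a transparent, verifiable diagnostic process.

Details

Motivation: As AI-generated video spreads across media platforms, the need to reliably distinguish synthetic from authentic content is increasingly urgent. Existing methods mostly treat the problem as binary classification and offer little explanation of their results. The paper aims to fill this gap.

Result: Experiments show that a general-purpose backbone fine-tuned on a small dataset performs well across a variety of generators and generation modes, supporting the potential of explainable detection.

Insight: Explainability is one of the core challenges in detecting AI-generated content; fine-grained annotations and natural language explanations can markedly improve the credibility and usefulness of detection results.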

Abstract: As AI-generated video becomes increasingly pervasive across media platforms, the ability to reliably distinguish synthetic content from authentic footage has become both urgent and essential. Existing approaches have primarily treated this challenge as a binary classification task, offering limited insight into where or why a model identifies a video as AI-generated. However, the core challenge extends beyond simply detecting subtle artifacts; it requires providing fine-grained, persuasive evidence that can convince auditors and end-users alike. To address this critical gap, we introduce DAVID-X, the first dataset to pair AI-generated videos with detailed defect-level, temporal-spatial annotations and written rationales. Leveraging these rich annotations, we present DAVID-XR1, a video-language model designed to deliver an interpretable chain of visual reasoning, including defect categorization, temporal-spatial localization, and natural language explanations. This approach fundamentally transforms AI-generated video detection from an opaque black-box decision into a transparent and verifiable diagnostic process. We demonstrate that a general-purpose backbone, fine-tuned on our compact dataset and enhanced with chain-of-thought distillation, achieves strong generalization across a variety of generators and generation modes. Our results highlight the promise of explainable detection methods for trustworthy identification of AI-generated video content.

[7] ArchShapeNet: An Interpretable 3D-CNN Framework for Evaluating Architectural Shapes cs.CV | cs.AI

Jun Yin, Jing Zhong, Pengyu Zeng, Peilin Li, Zixuan Dai

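TL;DR: The paper proposes ArchShapeNet, a 3D-CNN framework for classifying and analyzing architectural forms, and shows experimentally that it is effective at distinguishing human-designed from machine-generated forms.

Details

Motivation: As design demands grow more complex and diverse, generative tools are becoming increasingly important in architectural design, but analyzing the differences between human-designed and machine-generated forms remains challenging.

Result: The model performs well at distinguishing form origins (94.29% accuracy, 96.2% precision, 98.51% recall).

Insight: The study indicates that human-designed forms have distinctive advantages in spatial organization, proportional harmony, and detail refinement, pointing to directions for improving generative design tools.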

Abstract: In contemporary architectural design, the growing complexity and diversity of design demands have made generative plugin tools essential for quickly producing initial concepts and exploring novel 3D forms. However, objectively analyzing the differences between human-designed and machine-generated 3D forms remains a challenge, limiting our understanding of their respective strengths and hindering the advancement of generative tools. To address this, we built ArchForms-4000, a dataset containing 2,000 architect-designed and 2,000 Evomass-generated 3D forms; proposed ArchShapeNet, a 3D convolutional neural network tailored for classifying and analyzing architectural forms, incorporating a saliency module to highlight key spatial features aligned with architectural reasoning; and conducted comparative experiments showing that our model outperforms human experts in distinguishing form origins, achieving 94.29% accuracy, 96.2% precision, and 98.51% recall. This study not only highlights the distinctive advantages of human-designed forms in spatial organization, proportional harmony, and detail refinement but also provides valuable insights for enhancing generative design tools in the future.

[8] Real-Time, Low-Latency Surveillance Using Entropy-Based Adaptive Buffering and MobileNetV2 on Edge Devices cs.CV | cs.AI

Poojashree Chandrashekar, Pankaj M Sajjanar

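TL;DR: This paper proposes an entropy-based adaptive frame buffering algorithm, combined with MobileNetV2, to build a high-performance, low-latency video surveillance system on resource-constrained edge devices.

Details

Motivation: Existing video surveillance systems often suffer from high latency and low throughput on resource-constrained devices and adapt poorly to complex environments.

Result: On edge devices, the system achieves end-to-end latency below 50 ms and detection accuracy above 92%, and is robust to changes in lighting, background, and speed.

Insight: Combining dynamic buffer optimization with a lightweight model allows complex tasks to run efficiently on resource-constrained devices, suiting smart-city and embedded security architectures.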

Abstract: This paper describes a high-performance, low-latency video surveillance system designed for resource-constrained environments. We propose a formal entropy-based adaptive frame buffering algorithm and integrate it with MobileNetV2 to achieve high throughput with low latency. The system can process live video streams with sub-50 ms end-to-end inference latency on resource-constrained embedded platforms such as Raspberry Pi, Amazon, and NVIDIA Jetson Nano. Our method maintains over 92% detection accuracy on standard video surveillance datasets and is robust to varying lighting, backgrounds, and speeds. A number of comparative and ablation experiments validate the effectiveness of our design. Finally, our architecture is scalable, inexpensive, and compliant with stricter data privacy regulations than common surveillance systems, so the system could coexist in a smart city or embedded security architecture.
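
The abstract does not say how entropy drives the buffering decision, so the sketch below is only one plausible reading: compute the Shannon entropy of each frame's intensity histogram and forward a frame to the detector only when its entropy has moved by at least a threshold since the last forwarded frame. The `select_frames` policy and its `threshold` parameter are hypothetical.

```python
import math
from collections import Counter


def frame_entropy(pixels):
    """Shannon entropy in bits of a flat sequence of pixel intensities."""
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def select_frames(frames, threshold=0.5):
    """Return indices of frames worth forwarding to the detector.

    Assumed policy: always keep the first frame, then keep a frame only if
    its entropy differs from the last kept frame's entropy by >= threshold.
    """
    kept, last_entropy = [], None
    for index, frame in enumerate(frames):
        entropy = frame_entropy(frame)
        if last_entropy is None or abs(entropy - last_entropy) >= threshold:
            kept.append(index)
            last_entropy = entropy
    return kept
```

Skipping near-duplicate frames in this way reduces how often the comparatively expensive MobileNetV2 inference has to run, which is one way a buffering policy could trade throughput for latency.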

[9] MonoVQD: Monocular 3D Object Detection with Variational Query Denoising and Self-Distillation cs.CV

Kiet Dang Vu, Trung Thai Tran, Duc Dung Nguyen

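TL;DR: MonoVQD proposes a DETR-based monocular 3D object detection framework that improves performance through variational query denoising and self-distillation.

Details

Motivation: Conventional DETR architectures have limitations in monocular 3D detection, such as unstable Hungarian matching and vanishing gradients, and a new solution is needed.

Result: The method performs strongly on the KITTI and nuScenes datasets, demonstrating good generalization.

Insight: By introducing stochastic properties and cross-layer knowledge distillation, MonoVQD addresses fundamental limitations of conventional methods and offers a new approach to 3D detection.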

Abstract: Precisely localizing 3D objects from a single image constitutes a central challenge in monocular 3D detection. While DETR-like architectures offer a powerful paradigm, their direct application in this domain encounters inherent limitations, preventing optimal performance. Our work addresses these challenges by introducing MonoVQD, a novel framework designed to fundamentally advance DETR-based monocular 3D detection. We propose three main contributions. First, we propose the Mask Separated Self-Attention mechanism that enables the integration of the denoising process into a DETR architecture. This improves the stability of Hungarian matching to achieve a consistent optimization objective. Second, we present the Variational Query Denoising technique to address the gradient vanishing problem of conventional denoising methods, which severely restricts the efficiency of the denoising process. This explicitly introduces stochastic properties to mitigate this fundamental limitation and unlock substantial performance gains. Finally, we introduce a sophisticated self-distillation strategy, leveraging insights from later decoder layers to synergistically improve query quality in earlier layers, thereby amplifying the iterative refinement process. Rigorous experimentation demonstrates that MonoVQD achieves superior performance on the challenging KITTI monocular benchmark. Highlighting its broad applicability, MonoVQD’s core components seamlessly integrate into other architectures, delivering significant performance gains even in multi-view 3D detection scenarios on the nuScenes dataset and underscoring its robust generalization capabilities.

Chengzhi Xu, Yuyang Wang, Lai Wei, Lichao Sun, Weiran Huang

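TL;DR: This paper proposes ChartIR, an iterative refinement method based on structured instructions for chart-to-code generation. It splits the task into visual understanding and code translation and designs two types of structured instruction, description and difference, which improve generation quality.

Details

Motivation: Existing multimodal large language models (MLLMs) perform poorly on chart-to-code generation, which calls for a method that combines precise visual understanding with accurate code translation.

Result: The method outperforms other methods on both the open-source model Qwen2-VL and the closed-source model GPT-4o.

Insight: Structured instructions effectively turn visual features into language representations, and the two-stage pipeline progressively improves generation quality.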

Abstract: Recently, multimodal large language models (MLLMs) have attracted increasing research attention due to their powerful visual understanding capabilities. While they have achieved impressive results on various vision tasks, their performance on chart-to-code generation remains suboptimal. This task requires MLLMs to generate executable code that can reproduce a given chart, demanding not only precise visual understanding but also accurate translation of visual elements into structured code. Directly prompting MLLMs to perform this complex task often yields unsatisfactory results. To address this challenge, we propose ChartIR, an iterative refinement method based on structured instruction. First, we distinguish two tasks: visual understanding and code translation. To accomplish the visual understanding component, we design two types of structured instructions: description and difference. The description instruction captures the visual elements of the reference chart, while the difference instruction characterizes the discrepancies between the reference chart and the generated chart. These instructions effectively transform visual features into language representations, thereby facilitating the subsequent code translation process. Second, we decompose the overall chart generation pipeline into two stages: initial code generation and iterative refinement, enabling progressive enhancement of the final output. Experimental results show that, compared to other methods, our method achieves superior performance on both the open-source model Qwen2-VL and the closed-source model GPT-4o.
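
The two-stage pipeline can be sketched as a control loop. Every callable passed in (`describe`, `generate_code`, `render`, `diff`, `refine`) is a hypothetical placeholder for a model or rendering call, and the stopping rule (stop when the difference is empty or after `max_rounds`) is an assumption; the abstract does not specify either.

```python
def iterative_refinement(reference_chart, describe, generate_code, render,
                         diff, refine, max_rounds=3):
    """Stage 1: initial code from a description. Stage 2: refine from differences."""
    # Description instruction: turn the reference chart into language.
    description = describe(reference_chart)
    # Initial code generation from the description alone.
    code = generate_code(description)
    for _ in range(max_rounds):
        generated_chart = render(code)
        # Difference instruction: what still separates the two charts.
        difference = diff(reference_chart, generated_chart)
        if not difference:
            break  # nothing left to fix
        code = refine(code, description, difference)
    return code
```

Passing the steps in as callables keeps the loop independent of any particular model API, so the same skeleton could wrap either an open-source or a closed-source MLLM.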

[11] PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers cs.CV | cs.AI

Lukas Schiesser, Cornelius Wolff, Sophie Haas, Simon Pukrop

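TL;DR: PictSure is a framework that focuses on how the image embedding model affects in-context learning (ICL). The study shows that how the embedding model is pretrained is critical to few-shot image classification (FSIC) performance, and it markedly improves out-of-domain performance.

Details

Motivation: Building image classifiers in data-scarce domains is challenging because traditional methods depend on large labeled datasets. ICL offers a new few-shot learning paradigm, but prior work has overlooked the key role of the image embedding model.

Result: PictSure outperforms existing ICL-based FSIC models on out-of-domain benchmarks while maintaining comparable performance on in-domain tasks.

Insight: The quality and manner of embedding-model pretraining is a key factor in the success of ICL-based FSIC, especially for out-of-domain generalization.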

Abstract: Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-context learning (ICL) has emerged as a promising paradigm for few-shot image classification (FSIC), enabling models to generalize across domains without gradient-based adaptation. However, prior work has largely overlooked a critical component of ICL-based FSIC pipelines: the role of image embeddings. In this work, we present PictSure, an ICL framework that places the embedding model – its architecture, pretraining, and training dynamics – at the center of analysis. We systematically examine the effects of different visual encoder types, pretraining objectives, and fine-tuning strategies on downstream FSIC performance. Our experiments show that the training success and the out-of-domain performance are highly dependent on how the embedding models are pretrained. Consequently, PictSure manages to outperform existing ICL-based FSIC models on out-of-domain benchmarks that differ significantly from the training distribution, while maintaining comparable results on in-domain tasks. Code can be found at https://github.com/PictSure/pictsure-library.

[12] Efficient Retail Video Annotation: A Robust Key Frame Generation Approach for Product and Customer Interaction Analysis cs.CV | cs.AI | cs.HC | cs.LG

Varun Mannam, Zhenyu Shi

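TL;DR: The paper proposes a deep learning-based key-frame generation method for retail video. By automatically annotating products and customer interactions, it substantially reduces manual annotation cost while maintaining high accuracy.

Details

Motivation: Conventional retail video annotation relies on time-consuming manual labeling, which leads to non-robust frame selection and high cost. The paper aims to solve these problems by automating key-frame generation and annotation.

Result: Experiments show an average 2x cost saving over manual annotation, with human annotators needing to verify or adjust fewer than 5% of frames and no loss of annotation quality.

Insight: The method achieves a high degree of automation in retail video annotation and applies broadly to customer behavior analysis, product interaction detection, and security monitoring.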

Abstract: Accurate video annotation plays a vital role in modern retail applications, including customer behavior analysis, product interaction detection, and in-store activity recognition. However, conventional annotation methods heavily rely on time-consuming manual labeling by human annotators, introducing non-robust frame selection and increasing operational costs. To address these challenges in the retail domain, we propose a deep learning-based approach that automates key-frame identification in retail videos and provides automatic annotations of products and customers. Our method leverages deep neural networks to learn discriminative features by embedding video frames and incorporating object detection-based techniques tailored for retail environments. Experimental results showcase the superiority of our approach over traditional methods, achieving accuracy comparable to human annotator labeling while enhancing the overall efficiency of retail video annotation. Remarkably, our approach leads to an average of 2 times cost savings in video annotation. By allowing human annotators to verify/adjust less than 5% of detected frames in the video dataset, while automating the annotation process for the remaining frames without reducing annotation quality, retailers can significantly reduce operational costs. The automation of key-frame detection enables substantial time and effort savings in retail video labeling tasks, proving highly valuable for diverse retail applications such as shopper journey analysis, product interaction detection, and in-store security monitoring.

[13] Peering into the Unknown: Active View Selection with Neural Uncertainty Maps for 3D Reconstruction cs.CV | cs.AI

Zhengquan Zhang, Feng Xu, Mengmi Zhang

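TL;DR: The paper proposes UPNet, a lightweight neural network that predicts uncertainty maps to guide active view selection (AVS), improving the efficiency and accuracy of 3D reconstruction.

Details

Motivation: Choosing the most informative viewpoints is a key challenge in 3D reconstruction. Conventional methods rely on computationally intensive uncertainty estimation; this paper instead learns a direct mapping to simplify the process.

Result: The method reaches accuracy comparable to the upper bound using half as many viewpoints, speeds up view selection by up to 400x, and cuts CPU, RAM, and GPU usage by over 50% relative to baseline methods.

Insight: Learning the uncertainty mapping directly avoids expensive computation and generalizes efficiently to novel object categories.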

Abstract: Some perspectives naturally provide more information than others. How can an AI system determine which viewpoint offers the most valuable insight for accurate and efficient 3D object reconstruction? Active view selection (AVS) for 3D reconstruction remains a fundamental challenge in computer vision. The aim is to identify the minimal set of views that yields the most accurate 3D reconstruction. Instead of learning radiance fields, like NeRF or 3D Gaussian Splatting, from a current observation and computing uncertainty for each candidate viewpoint, we introduce a novel AVS approach guided by neural uncertainty maps predicted by a lightweight feedforward deep neural network, named UPNet. UPNet takes a single input image of a 3D object and outputs a predicted uncertainty map, representing uncertainty values across all possible candidate viewpoints. By leveraging heuristics derived from observing many natural objects and their associated uncertainty patterns, we train UPNet to learn a direct mapping from viewpoint appearance to uncertainty in the underlying volumetric representations. Next, our approach aggregates all previously predicted neural uncertainty maps to suppress redundant candidate viewpoints and effectively select the most informative one. Using these selected viewpoints, we train 3D neural rendering models and evaluate the quality of novel view synthesis against other competitive AVS methods. Remarkably, despite using half as many viewpoints as the upper bound, our method achieves comparable reconstruction accuracy. In addition, it significantly reduces computational overhead during AVS, achieving up to a 400 times speedup along with over 50% reductions in CPU, RAM, and GPU usage compared to baseline methods. Notably, our approach generalizes effectively to AVS tasks involving novel object categories, without requiring any additional training.
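
The aggregation step can be sketched as follows. Each uncertainty map is assumed to be a list with one value per candidate viewpoint, and the aggregation rule (element-wise minimum, so a candidate already well covered by any earlier view is suppressed) is a hypothetical choice; the abstract does not state the actual rule.

```python
def select_next_view(uncertainty_maps, visited):
    """Pick the unvisited candidate with the highest aggregated uncertainty.

    uncertainty_maps: one list per already-captured view, each giving the
    predicted uncertainty for every candidate viewpoint index.
    visited: set of candidate indices that have already been captured.
    """
    n_candidates = len(uncertainty_maps[0])
    # Assumed rule: a candidate is only as uncertain as its best-covering view says.
    aggregated = [min(m[v] for m in uncertainty_maps) for v in range(n_candidates)]
    candidates = [v for v in range(n_candidates) if v not in visited]
    return max(candidates, key=lambda v: aggregated[v])
```

Because each map comes from a single forward pass, selecting the next view needs no radiance-field training inside the loop, which is where the reported speedup would come from.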

[14] PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning cs.CV | cs.AI

Yizhen Zhang, Yang Ding, Shuoshuo Zhang, Xinchen Zhang, Haoling Li

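TL;DR: PeRL proposes a reinforcement learning approach to vision-language reasoning. It permutes image sequences to strengthen understanding of positional relationships and combines a multi-stage strategy with a rollout filtering mechanism, yielding strong gains on multi-image benchmarks.

Details

Motivation: Existing multimodal reinforcement learning methods are limited to spatial reasoning within a single image and extend poorly to complex scenarios involving positional reasoning across multiple images; a more general method is needed for understanding relationships between images.

Result: Experiments on 5 multi-image benchmarks and 3 single-image benchmarks show that PeRL clearly outperforms existing methods, achieving state-of-the-art results on multi-image tasks while remaining competitive on single-image tasks.

Insight: Combining image-sequence permutation with a multi-stage strategy improves the model's understanding of complex positional relationships, and rollout filtering lets reinforcement learning exploit learned policies more efficiently.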

Abstract: Inspired by the impressive reasoning capabilities demonstrated by reinforcement learning approaches like DeepSeek-R1, recent research has begun exploring the use of reinforcement learning (RL) to enhance vision-language models (VLMs) for multimodal reasoning tasks. However, most existing multimodal reinforcement learning approaches remain limited to spatial reasoning within single-image contexts and struggle to generalize to more complex, real-world scenarios involving multi-image positional reasoning, where understanding the relationships across images is crucial. To address this challenge, we propose PeRL, a general reinforcement learning approach tailored for interleaved multimodal tasks, and a multi-stage strategy designed to enhance the exploration-exploitation trade-off, thereby improving learning efficiency and task performance. Specifically, we introduce permutation of image sequences to simulate varied positional relationships and explore more spatial and positional diversity. Furthermore, we design a rollout filtering mechanism for resampling that focuses on the trajectories contributing most to learning optimal behaviors, so that learned policies are exploited effectively. We evaluate our model on 5 widely-used multi-image benchmarks and 3 single-image benchmarks. Our experiments confirm that the PeRL-trained model consistently surpasses R1-related and interleaved VLM baselines by a large margin, achieving state-of-the-art performance on multi-image benchmarks while preserving comparable performance on single-image tasks.

[15] Frequency-Calibrated Membership Inference Attacks on Medical Image Diffusion Models cs.CV | cs.LG

Xinkai Zhao, Yuta Tokuoka, Junichiro Iwasawa, Keita Oda

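TL;DR: Addressing privacy in medical image diffusion models, the paper proposes a frequency-calibrated membership inference attack based on Frequency-Calibrated Reconstruction Error (FCRE); focusing on reconstruction error in the mid-frequency range markedly improves attack effectiveness.

Details

Motivation: With diffusion models widely used for medical image generation, privacy risks are increasingly prominent. Existing membership inference attack (MIA) methods rely on reconstruction error but perform poorly on medical images, because high-frequency details are hard to reconstruct and inherent image difficulty confounds the results.

Result: Experiments on several medical image datasets show that FCRE outperforms existing MIA methods.

Insight: Focusing on a specific frequency range overcomes the difficulties of reconstructing medical images and offers a new approach to MIA.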

Abstract: The increasing use of diffusion models for image generation, especially in sensitive areas like medical imaging, has raised significant privacy concerns. Membership Inference Attack (MIA) has emerged as a potential approach to determine if a specific image was used to train a diffusion model, thus quantifying privacy risks. Existing MIA methods often rely on diffusion reconstruction errors, where member images are expected to have lower reconstruction errors than non-member images. However, applying these methods directly to medical images faces challenges. Reconstruction error is influenced by inherent image difficulty, and diffusion models struggle with high-frequency detail reconstruction. To address these issues, we propose a Frequency-Calibrated Reconstruction Error (FCRE) method for MIAs on medical image diffusion models. By focusing on reconstruction errors within a specific mid-frequency range and excluding both high-frequency (difficult to reconstruct) and low-frequency (less informative) regions, our frequency-selective approach mitigates the confounding factor of inherent image difficulty. Specifically, we analyze the reverse diffusion process, obtain the mid-frequency reconstruction error, and compute the structural similarity index score between the reconstructed and original images. Membership is determined by comparing this score to a threshold. Experiments on several medical image datasets demonstrate that our FCRE method outperforms existing MIA methods.
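
The frequency-selective idea can be illustrated with a small sketch that measures only the mid-frequency part of a reconstruction residual. It uses a naive 2D DFT so it runs on the standard library alone (suitable only for tiny images); the band limits `low` and `high`, in cycles per pixel, are hypothetical values, and the structural similarity step described in the abstract is not reproduced here.

```python
import cmath
import math


def dft2(image):
    """Naive 2D discrete Fourier transform of a list-of-rows image."""
    h, w = len(image), len(image[0])
    return [[sum(image[y][x] * cmath.exp(-2j * cmath.pi * (u * y / h + v * x / w))
                 for y in range(h) for x in range(w))
             for v in range(w)]
            for u in range(h)]


def mid_frequency_error(original, reconstructed, low=0.1, high=0.35):
    """Mean squared residual restricted to radial frequencies in [low, high]."""
    h, w = len(original), len(original[0])
    residual = [[o - r for o, r in zip(row_o, row_r)]
                for row_o, row_r in zip(original, reconstructed)]
    spectrum = dft2(residual)
    energy = 0.0
    for u in range(h):
        for v in range(w):
            # Map DFT bin indices to signed frequencies in cycles per pixel.
            fu = (u if u <= h // 2 else u - h) / h
            fv = (v if v <= w // 2 else v - w) / w
            if low <= math.hypot(fu, fv) <= high:
                energy += abs(spectrum[u][v]) ** 2
    # By Parseval's theorem this equals the mean square of the band-limited residual.
    return energy / (h * w) ** 2
```

A membership decision would then compare a score derived from this band-limited error against a threshold, with lower error suggesting the image was seen in training.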

[16] Vision Transformers for End-to-End Quark-Gluon Jet Classification from Calorimeter Images cs.CV

Md Abrar Jahin, Shahriar Soudeep, Arian Rahman Aditta, M. F. Mridha, Nafiz Fahad

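TL;DR: The paper presents the first systematic evaluation of Vision Transformers (ViTs) and ViT-CNN hybrid models for distinguishing quark and gluon jets, and shows on simulated 2012 CMS Open Data that they outperform conventional CNNs.

Details

Motivation: Distinguishing quark jets from gluon jets is a key challenge in high-energy physics, and existing methods such as CNNs are limited in modeling global context. ViTs, with their strong global modeling ability, are a potential solution.

Result: Experiments show that ViTs and hybrid models outperform conventional CNNs in F1-score, ROC-AUC, and accuracy, and are particularly strong at capturing long-range spatial correlations in jet substructure.

Insight: The paper shows the potential of ViTs for image analysis in high-energy physics; under realistic detector and pileup conditions in particular, their global modeling ability may become a focus of future research.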

Abstract: Distinguishing between quark- and gluon-initiated jets is a critical and challenging task in high-energy physics, pivotal for improving new physics searches and precision measurements at the Large Hadron Collider. While deep learning, particularly Convolutional Neural Networks (CNNs), has advanced jet tagging using image-based representations, the potential of Vision Transformer (ViT) architectures, renowned for modeling global contextual information, remains largely underexplored for direct calorimeter image analysis, especially under realistic detector and pileup conditions. This paper presents a systematic evaluation of ViTs and ViT-CNN hybrid models for quark-gluon jet classification using simulated 2012 CMS Open Data. We construct multi-channel jet-view images from detector-level energy deposits (ECAL, HCAL) and reconstructed tracks, enabling an end-to-end learning approach. Our comprehensive benchmarking demonstrates that ViT-based models, notably ViT+MaxViT and ViT+ConvNeXt hybrids, consistently outperform established CNN baselines in F1-score, ROC-AUC, and accuracy, highlighting the advantage of capturing long-range spatial correlations within jet substructure. This work establishes the first systematic framework and robust performance baselines for applying ViT architectures to calorimeter image-based jet classification using public collider data, alongside a structured dataset suitable for further deep learning research in this domain.

[17] Advances in Compliance Detection: Novel Models Using Vision-Based Tactile Sensors cs.CV | cs.RO | I.2.9

Ziteng Li, Malte Kuhlmann, Ilana Nisky, Nicolás Navarro-Guerrero

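TL;DR: The paper proposes two models, based on LRCN and Transformer architectures, that use RGB tactile images and other data from the GelSight sensor to markedly improve the accuracy of object compliance estimation.

Details

Motivation: Traditional compliance detection methods lack portability and scalability and depend on expensive equipment, while existing neural network approaches are not accurate enough.

Result: The models outperform the baseline, though harder objects prove more difficult to estimate.

Insight: The relationship between sensor compliance and object compliance affects estimation quality: objects harder than the sensor are more difficult to estimate.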

Abstract: Compliance is a critical parameter for describing objects in engineering, agriculture, and biomedical applications. Traditional compliance detection methods are limited by their lack of portability and scalability, rely on specialized, often expensive equipment, and are unsuitable for robotic applications. Moreover, existing neural network-based approaches using vision-based tactile sensors still suffer from insufficient prediction accuracy. In this paper, we propose two models based on Long-term Recurrent Convolutional Networks (LRCNs) and Transformer architectures that leverage RGB tactile images and other information captured by the vision-based sensor GelSight to predict compliance metrics accurately. We validate the performance of these models using multiple metrics and demonstrate their effectiveness in accurately estimating compliance. The proposed models exhibit significant performance improvement over the baseline. Additionally, we investigated the correlation between sensor compliance and object compliance estimation, which revealed that objects that are harder than the sensor are more challenging to estimate.

[18] SynPo: Boosting Training-Free Few-Shot Medical Segmentation via High-Quality Negative Prompts cs.CV

Yufei Liu, Haoke Xiao, Jiaxing Chai, Yongcun Zhang, Rong Wang

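TL;DR: SynPo is a training-free few-shot medical image segmentation method that builds on large vision models (LVMs) such as SAM and improves the quality of negative prompts. Its core is a Confidence Map Synergy Module that combines the strengths of DINOv2 and SAM to select high-quality positive and negative prompt points, achieving results comparable to training-based methods.

Details

Motivation: Existing training-free few-shot medical segmentation methods based on LVMs fail to use negative prompts effectively, which leads to poor performance on low-contrast medical images. SynPo addresses this by improving negative prompt quality.

Result: Experiments show that SynPo performs comparably to state-of-the-art training-based methods and does particularly well on low-contrast medical images.

Insight: Negative prompt quality is critical for training-free few-shot segmentation; combining the strengths of multiple models with an effective point-selection strategy markedly improves performance.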

Abstract: The advent of Large Vision Models (LVMs) offers new opportunities for few-shot medical image segmentation. However, existing training-free methods based on LVMs fail to effectively utilize negative prompts, leading to poor performance on low-contrast medical images. To address this issue, we propose SynPo, a training-free few-shot method based on LVMs (e.g., SAM), with the core insight: improving the quality of negative prompts. To select point prompts in a more reliable confidence map, we design a novel Confidence Map Synergy Module by combining the strengths of DINOv2 and SAM. Based on the confidence map, we select the top-k pixels as the positive points set and choose the negative points set using a Gaussian distribution, followed by independent K-means clustering for both sets. Then, these selected points are leveraged as high-quality prompts for SAM to get the segmentation results. Extensive experiments demonstrate that SynPo achieves performance comparable to state-of-the-art training-based few-shot methods.
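
The positive-point selection can be sketched in a few lines: take the top-k pixels of a confidence map, then cluster them so that a handful of representative points can serve as prompts. The confidence map is assumed to be given as a list of rows; the deterministic K-means initialization is a simplification for illustration, and the Gaussian-based negative-point selection is omitted because the abstract does not describe it in enough detail.

```python
import heapq


def top_k_points(confidence_map, k):
    """Return the (x, y) coordinates of the k highest-confidence pixels."""
    scored = [(value, (x, y))
              for y, row in enumerate(confidence_map)
              for x, value in enumerate(row)]
    return [point for _, point in heapq.nlargest(k, scored)]


def kmeans(points, n_clusters, iterations=10):
    """Plain K-means on 2D points; the first n_clusters points seed the centers."""
    centers = [(float(x), float(y)) for x, y in points[:n_clusters]]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: (p[0] - centers[c][0]) ** 2
                          + (p[1] - centers[c][1]) ** 2)
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
            if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers
```

The cluster centers, rather than all k pixels, would then be passed to SAM as point prompts, which keeps the prompt set small and spatially spread out.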

[19] ReSeDis: A Dataset for Referring-based Object Search across Large-Scale Image Collections cs.CV

Ziling Huang, Yidan Zhang, Shin’ichi Satoh

TL;DR: ReSeDis is a new dataset and benchmark for unified object search and localization, combining large-scale retrieval with fine-grained grounding in images based on natural language descriptions.

Details

Motivation: Existing methods address either retrieval or localization separately, but not both. ReSeDis aims to bridge this gap by creating a realistic task for multimodal search systems.

Result: Provides a new benchmark revealing significant room for improvement, with a zero-shot baseline showing potential for future research.

Insight: ReSeDis fills a critical gap in multimodal search, highlighting the need for models that can simultaneously handle retrieval and precise localization.

Abstract: Large-scale visual search engines are expected to solve a dual problem at once: (i) locate every image that truly contains the object described by a sentence and (ii) identify the object’s bounding box or exact pixels within each hit. Existing techniques address only one side of this challenge. Visual grounding yields tight boxes and masks but rests on the unrealistic assumption that the object is present in every test image, producing a flood of false alarms when applied to web-scale collections. Text-to-image retrieval excels at sifting through massive databases to rank relevant images, yet it stops at whole-image matches and offers no fine-grained localization. We introduce Referring Search and Discovery (ReSeDis), the first task that unifies corpus-level retrieval with pixel-level grounding. Given a free-form description, a ReSeDis model must decide whether the queried object appears in each image and, if so, where it is, returning bounding boxes or segmentation masks. To enable rigorous study, we curate a benchmark in which every description maps uniquely to object instances scattered across a large, diverse corpus, eliminating unintended matches. We further design a task-specific metric that jointly scores retrieval recall and localization precision. Finally, we provide a straightforward zero-shot baseline using a frozen vision-language model, revealing significant headroom for future study. ReSeDis offers a realistic, end-to-end testbed for building the next generation of robust and scalable multimodal search systems.
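To make the idea of jointly scoring retrieval and localization concrete, here is a small sketch. It is not the ReSeDis metric, whose definition the abstract does not give. It assumes one box per image in `(x1, y1, x2, y2)` format, and the function name `retrieval_localization_scores` and the 0.5 IoU threshold are illustrative choices. A prediction counts as a hit only if the image truly contains the object and the predicted box overlaps the ground-truth box sufficiently.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def retrieval_localization_scores(predictions, ground_truth, iou_thresh=0.5):
    """Return (recall, precision) over image_id -> box dictionaries.

    A hit requires both a correct retrieval (the image is in ground_truth)
    and a sufficiently accurate box (IoU >= iou_thresh).
    """
    hits = sum(
        1
        for image_id, box in predictions.items()
        if image_id in ground_truth and box_iou(box, ground_truth[image_id]) >= iou_thresh
    )
    recall = hits / len(ground_truth) if ground_truth else 0.0
    precision = hits / len(predictions) if predictions else 0.0
    return recall, precision
```

With this scoring, a grounding model that returns a box for every image in the corpus is penalized through precision, which is the false-alarm problem the abstract describes.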

[20] Privacy-Shielded Image Compression: Defending Against Exploitation from Vision-Language Pretrained Models cs.CV

Xuelin Shen, Jiayin Xu, Kangsheng Yin, Wenhan Yang

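TL;DR: Proposes PSIC, a privacy-protecting image compression method that uses flexible coding to produce bitstreams with multiple decoding options, protecting image privacy while retaining the original compression functionality.

Details

Motivation: As the semantic understanding of vision-language pretrained models improves, publicly posted images are increasingly exposed to exploitation by search engines and similar tools, so privacy protection needs to be introduced at the image compression stage.

Result: Experiments show that the method effectively balances privacy protection and image quality across multiple downstream tasks.

Insight: Through flexible bitstream control, PSIC makes privacy protection compatible with normal image functionality, offering a defense against the privacy threats posed by pretrained models.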

Abstract: The improved semantic understanding of vision-language pretrained (VLP) models has made it increasingly difficult to protect publicly posted images from being exploited by search engines and other similar tools. In this context, this paper seeks to protect users’ privacy by implementing defenses at the image compression stage to prevent exploitation. Specifically, we propose a flexible coding method, termed Privacy-Shielded Image Compression (PSIC), that can produce bitstreams with multiple decoding options. By default, the bitstream is decoded to preserve satisfactory perceptual quality while preventing interpretation by VLP models. Our method also retains the original image compression functionality. With a customizable input condition, the proposed scheme can reconstruct the image that preserves its full semantic information. A Conditional Latent Trigger Generation (CLTG) module is proposed to produce bias information based on customizable conditions to guide the decoding process into different reconstructed versions, and an Uncertainty-Aware Encryption-Oriented (UAEO) optimization function is designed to leverage the soft labels inferred from the target VLP model’s uncertainty on the training data. This paper further incorporates an adaptive multi-objective optimization strategy to obtain improved encrypting performance and perceptual quality simultaneously within a unified training process. The proposed scheme is plug-and-play and can be seamlessly integrated into most existing Learned Image Compression (LIC) models. Extensive experiments across multiple downstream tasks have demonstrated the effectiveness of our design.

[21] DM-FNet: Unified multimodal medical image fusion via diffusion process-trained encoder-decoder cs.CV

Dan He, Weisheng Li, Guofen Wang, Yuping Huang, Shiqiang Liu

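TL;DR: The paper proposes DM-FNet, a two-stage diffusion-based fusion network for unified multimodal medical image fusion; diffusion training and dedicated fusion modules strengthen feature recognition and interaction, improving fused image quality.

Details

Motivation: Existing multimodal medical image fusion methods fall short in feature capture and cross-modal interaction, leading to unsatisfactory fused image quality.

Result: Experiments show that DM-FNet performs well across multiple medical image types, producing fused images with appropriate brightness, rich textures, and clear edges.

Insight: The diffusion process provides rich feature representations for image fusion, and combining the two-stage design with the fusion modules effectively addresses insufficient cross-modal feature interaction.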

Abstract: Multimodal medical image fusion (MMIF) extracts the most meaningful information from multiple source images, enabling a more comprehensive and accurate diagnosis. Achieving high-quality fusion results requires a careful balance of brightness, color, contrast, and detail; this ensures that the fused images effectively display relevant anatomical structures and reflect the functional status of the tissues. However, existing MMIF methods have limited capacity to capture detailed features during conventional training and suffer from insufficient cross-modal feature interaction, leading to suboptimal fused image quality. To address these issues, this study proposes a two-stage diffusion model-based fusion network (DM-FNet) to achieve unified MMIF. In Stage I, a diffusion process trains UNet for image reconstruction. UNet captures detailed information through progressive denoising and represents multilevel data, providing a rich set of feature representations for the subsequent fusion network. In Stage II, noisy images at various steps are input into the fusion network to enhance the model’s feature recognition capability. Three key fusion modules are also integrated to process medical images from different modalities adaptively. Ultimately, the robust network structure and a hybrid loss function are integrated to harmonize the fused image’s brightness, color, contrast, and detail, enhancing its quality and information density. The experimental results across various medical image types demonstrate that the proposed method performs exceptionally well regarding objective evaluation metrics. The fused image preserves appropriate brightness, a comprehensive distribution of radioactive tracers, rich textures, and clear edges. The code is available at https://github.com/HeDan-11/DM-FNet.

[22] video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models cs.CV | cs.CL | cs.SD

Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun

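TL;DR: The paper presents video-SALMONN 2, an audio-visual large language model (LLM) with low-rank adaptation (LoRA) that uses directed preference optimisation (DPO) to significantly improve the completeness and accuracy of captions for video with paired audio. It also proposes multi-round DPO (MrDPO) to further improve training; experiments show a 28% reduction in captioning error rate.

Details

Motivation: Videos contain rich information, and generating detailed, accurate natural-language descriptions is key to video understanding. Existing models still have room to improve on video captioning, especially in completeness and accuracy.

Result: Experiments show that video-SALMONN 2, with only 7 billion parameters, reduces the captioning error rate by 28% and surpasses leading models such as GPT-4o and Gemini-1.5-Pro.

Insight: Multi-round DPO (MrDPO), which periodically updates the reference model and merges the LoRA module, improves model performance and stability while remaining parameter-efficient.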

Abstract: Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimisation (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimised using DPO. To further improve training, we propose a novel multi-round DPO (MrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initialising the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilise the process. Experimental results show that MrDPO significantly enhances video-SALMONN 2’s captioning accuracy, reducing the captioning error rates by 28%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining highly competitive performance to the state-of-the-art on widely used video question-answering benchmarks among models of similar size. Code is available at https://github.com/bytedance/video-SALMONN-2.
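For readers unfamiliar with DPO, the standard per-pair DPO loss can be sketched as below. This shows the widely used general formulation, assuming sequence log-probabilities are already computed; it is not the paper's exact training objective, and the function name `dpo_loss` and the default `beta=0.1` are illustrative. The multi-round scheme described in the abstract would apply such a loss repeatedly, refreshing the reference model between rounds.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) caption pair.

    loss = -log(sigmoid(beta * margin)), where the margin is the policy's
    log-probability gain over the reference on the chosen caption minus
    the same gain on the rejected caption.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    z = -beta * margin
    # Numerically stable softplus(z) = log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss falls as the policy favours the chosen caption more strongly than the reference does.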

[23] Convolutional Feature Enhancement and Attention Fusion BiFPN for Ship Detection in SAR Images cs.CV

Liangjie Meng, Danxia Li, Jinrong He, Lili Ma, Zhixin Li

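TL;DR: This paper proposes C-AFBiFPN, a new framework for ship detection in synthetic aperture radar (SAR) images that improves detection performance through a convolutional feature enhancement module and an improved bidirectional feature pyramid network (BiFPN).

Details

Motivation: To address the challenges of SAR ship detection: large variations in ship scale, small vessels mixed with noise, and complex backgrounds for large nearshore ships.

Result: On the SAR Ship Detection Dataset (SSDD), the method substantially improves small-target detection accuracy, robustness to occlusion, and multi-scale adaptability.

Insight: Convolutional feature enhancement and attention-based fusion can effectively handle the complexity of ship detection in SAR images, with notable gains for small targets and multi-scale scenes.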

Abstract: Synthetic Aperture Radar (SAR) enables submeter-resolution imaging and all-weather monitoring via active microwave and advanced signal processing. Currently, SAR has found extensive applications in critical maritime domains such as ship detection. However, SAR ship detection faces several challenges, including significant scale variations among ships, the presence of small offshore vessels mixed with noise, and complex backgrounds for large nearshore ships. To address these issues, this paper proposes a novel feature enhancement and fusion framework named C-AFBiFPN. C-AFBiFPN constructs a Convolutional Feature Enhancement (CFE) module following the backbone network, aiming to enrich feature representation and enhance the ability to capture and represent local details and contextual information. Furthermore, C-AFBiFPN innovatively integrates BiFormer attention within the fusion strategy of BiFPN, creating the AFBiFPN network. AFBiFPN improves the global modeling capability of cross-scale feature fusion and can adaptively focus on critical feature regions. The experimental results on SAR Ship Detection Dataset (SSDD) indicate that the proposed approach substantially enhances detection accuracy for small targets, robustness against occlusions, and adaptability to multi-scale features.

[24] RA-NeRF: Robust Neural Radiance Field Reconstruction with Accurate Camera Pose Estimation under Complex Trajectories cs.CV

Qingsong Yan, Qiang Wang, Kaiyong Zhao, Jie Chen, Bo Li

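TL;DR: RA-NeRF combines flow-driven pose regulation with an implicit pose filter to address NeRF reconstruction's dependence on accurate camera pose priors under complex camera trajectories, achieving state-of-the-art performance on two datasets.

Details

Motivation: NeRF and 3DGS methods depend on accurate camera pose priors, and existing solutions perform poorly under complex camera trajectories, resulting in insufficient reconstruction accuracy.

Result: On the Tanks&Temple and NeRFBuster datasets, RA-NeRF achieves state-of-the-art results in both pose estimation and visual quality.

Insight: Flow-driven pose regulation and implicit pose filtering can significantly improve NeRF robustness under complex camera trajectories, offering a new direction for 3D reconstruction and SLAM tasks.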

Abstract: Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have emerged as powerful tools for 3D reconstruction and SLAM tasks. However, their performance depends heavily on accurate camera pose priors. Existing approaches attempt to address this issue by introducing external constraints but fall short of achieving satisfactory accuracy, particularly when camera trajectories are complex. In this paper, we propose a novel method, RA-NeRF, capable of predicting highly accurate camera poses even with complex camera trajectories. Following the incremental pipeline, RA-NeRF reconstructs the scene using NeRF with photometric consistency and incorporates flow-driven pose regulation to enhance robustness during initialization and localization. Additionally, RA-NeRF employs an implicit pose filter to capture the camera movement pattern and eliminate the noise for pose estimation. To validate our method, we conduct extensive experiments on the Tanks&Temple dataset for standard evaluation, as well as the NeRFBuster dataset, which presents challenging camera pose trajectories. On both datasets, RA-NeRF achieves state-of-the-art results in both camera pose estimation and visual quality, demonstrating its effectiveness and robustness in scene reconstruction under complex pose trajectories.

[25] Retrospective Memory for Camouflaged Object Detection cs.CV

Chenxi Zhang, Jiayun Wu, Qing Zhang, Yazhe Zhai, Youwei Pang

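TL;DR: The paper proposes RetroMem, a recall-augmented architecture for camouflaged object detection (COD) that dynamically integrates historical knowledge to improve camouflage pattern perception and inference, significantly improving model performance.

Details

Motivation: Existing COD methods model static visual representations and lack explicit use of historical context, which limits their adaptability and effectiveness in complex camouflage scenes.

Result: Experiments on several standard datasets show that RetroMem significantly outperforms existing state-of-the-art methods.

Insight: A dynamic memory mechanism can effectively improve a model's understanding of complex camouflage scenes, and the two-stage training paradigm offers a new direction for COD.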

Abstract: Camouflaged object detection (COD) primarily focuses on learning subtle yet discriminative representations from complex scenes. Existing methods predominantly follow the parametric feedforward architecture based on static visual representation modeling. However, they lack explicit mechanisms for acquiring historical context, limiting their adaptation and effectiveness in handling challenging camouflage scenes. In this paper, we propose a recall-augmented COD architecture, namely RetroMem, which dynamically modulates camouflage pattern perception and inference by integrating relevant historical knowledge into the process. Specifically, RetroMem employs a two-stage training paradigm consisting of a learning stage and a recall stage to construct, update, and utilize memory representations effectively. During the learning stage, we design a dense multi-scale adapter (DMA) to improve the pretrained encoder’s capability to capture rich multi-scale visual information with very few trainable parameters, thereby providing foundational inferences. In the recall stage, we propose a dynamic memory mechanism (DMM) and an inference pattern reconstruction (IPR). These components fully leverage the latent relationships between learned knowledge and current sample context to reconstruct the inference of camouflage patterns, thereby significantly improving the model’s understanding of camouflage scenes. Extensive experiments on several widely used datasets demonstrate that our RetroMem significantly outperforms existing state-of-the-art methods.

[26] Domain Adaptation for Image Classification of Defects in Semiconductor Manufacturing cs.CV | cs.AI

Adrian Poniatowski, Natalie Gentner, Manuel Barusco, Davide Dalle Pezze, Samuele Salti

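TL;DR: The paper examines the effectiveness of domain adaptation (DA) techniques for defect image classification in semiconductor manufacturing and proposes DBACS, an improved CycleGAN-based model, to reduce the need for manual annotation.

Details

Motivation: The semiconductor industry's demand for fast time to market and high quality drives the adoption of deep learning and DA techniques to reduce the cost of manual labeling and model retraining.

Result: Experiments show that DBACS performs well on DA tasks in the semiconductor domain, improving classification performance.

Insight: DA techniques can effectively reduce the need for manual annotation and improve model adaptability and robustness on cross-domain tasks.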

Abstract: In the semiconductor sector, due to high demand but also strong and increasing competition, time to market and quality are key factors in securing significant market share in various application areas. Thanks to the success of deep learning methods in recent years in the computer vision domain, Industry 4.0 and 5.0 applications, such as defect classification, have achieved remarkable success. In particular, Domain Adaptation (DA) has proven highly effective since it focuses on using the knowledge learned on a (source) domain to adapt and perform effectively on a different but related (target) domain. By improving robustness and scalability, DA minimizes the need for extensive manual re-labeling or re-training of models. This not only reduces computational and resource costs but also allows human experts to focus on high-value tasks. Therefore, we tested the efficacy of DA techniques in semi-supervised and unsupervised settings within the context of the semiconductor field. Moreover, we propose the DBACS approach, a CycleGAN-inspired model enhanced with additional loss terms to improve performance. All the approaches are studied and validated on real-world Electron Microscope images considering the unsupervised and semi-supervised settings, proving the usefulness of our method in advancing DA techniques for the semiconductor field.

[27] MSNeRV: Neural Video Representation with Multi-Scale Feature Fusion cs.CV | cs.MM | eess.IV

Jun Zhu, Xinfeng Zhang, Lv Tang, JunHao Jiang

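TL;DR: MSNeRV is a neural video representation framework based on multi-scale feature fusion. It enhances temporal consistency with temporal windows and GoP-level grids, and adds a multi-scale spatial decoder with a scale-adaptive loss function, significantly improving representation capability and compression efficiency for detail-intensive and fast-changing video content.

Details

Motivation: Existing video compression methods based on implicit neural representations (INRs) perform poorly on detail-intensive and fast-changing video content, mainly because internal network features are underused and the network design lacks video-specific considerations.

Result: Experiments on the HEVC ClassB and UVG datasets show that MSNeRV surpasses VTM-23.7 (Random Access) in compression efficiency in dynamic scenarios and has the best representation capability among INR-based methods.

Insight: Multi-scale feature fusion and temporal-consistency design are key to improving neural video representation, especially for detail-intensive and fast-changing scenes.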

Abstract: Implicit neural representations (INRs) have emerged as a promising approach for video compression and have achieved performance comparable to state-of-the-art codecs such as H.266/VVC. However, existing INR-based methods struggle to effectively represent detail-intensive and fast-changing video content. This limitation mainly stems from the underutilization of internal network features and the absence of video-specific considerations in network design. To address these challenges, we propose a multi-scale feature fusion framework, MSNeRV, for neural video representation. In the encoding stage, we enhance temporal consistency by employing temporal windows, and divide the video into multiple Groups of Pictures (GoPs), where a GoP-level grid is used for background representation. Additionally, we design a multi-scale spatial decoder with a scale-adaptive loss function to integrate multi-resolution and multi-frequency information. To further improve feature extraction, we introduce a multi-scale feature block that fully leverages hidden features. We evaluate MSNeRV on HEVC ClassB and UVG datasets for video representation and compression. Experimental results demonstrate that our model exhibits superior representation capability among INR-based approaches and surpasses VTM-23.7 (Random Access) in dynamic scenarios in terms of compression efficiency.

Qian Li, Feng Liu, Shuojue Yang, Daiyun Shen, Yueming Jin

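TL;DR: BCRNet significantly improves landmark detection accuracy in laparoscopic liver surgery through a Bezier curve refinement strategy. Combining multi-modal feature extraction, adaptive curve proposal initialization, and a hierarchical curve refinement mechanism, the method performs well on multiple datasets.

Details

Motivation: In laparoscopic liver surgery, accurately identifying curvilinear anatomical landmarks is essential for augmented reality navigation. Existing methods still fall short in landmark detection accuracy and need improvement to support more precise 2D-3D registration.

Result: On the L3D and P2ILF datasets, BCRNet outperforms existing methods with significant performance gains.

Insight: Combining curve modeling with multi-stage refinement can effectively improve landmark detection accuracy and suits complex medical imaging scenarios.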

Abstract: Laparoscopic liver surgery, while minimally invasive, poses significant challenges in accurately identifying critical anatomical structures. Augmented reality (AR) systems, integrating MRI/CT with laparoscopic images based on 2D-3D registration, offer a promising solution for enhancing surgical navigation. A vital aspect of the registration process is the precise detection of curvilinear anatomical landmarks in laparoscopic images. In this paper, we propose BCRNet (Bezier Curve Refinement Net), a novel framework that significantly enhances landmark detection in laparoscopic liver surgery primarily via the Bezier curve refinement strategy. The framework starts with a Multi-modal Feature Extraction (MFE) module designed to robustly capture semantic features. Then we propose Adaptive Curve Proposal Initialization (ACPI) to generate pixel-aligned Bezier curves and confidence scores for reliable initial proposals. Additionally, we design the Hierarchical Curve Refinement (HCR) mechanism to enhance these proposals iteratively through a multi-stage process, capturing fine-grained contextual details from multi-scale pixel-level features for precise Bezier curve adjustment. Extensive evaluations on the L3D and P2ILF datasets demonstrate that BCRNet outperforms state-of-the-art methods, achieving significant performance improvements. Code will be available.
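Since the method represents landmarks as Bezier curves, a short sketch of how a curve is recovered from its control points may help. The code uses De Casteljau's algorithm, a standard way to evaluate Bezier curves; it illustrates only the curve parameterization, not the BCRNet refinement network, and the function names are hypothetical.

```python
def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] with De Casteljau's algorithm."""
    pts = [tuple(p) for p in control_points]
    while len(pts) > 1:
        # Repeatedly interpolate between neighbouring points until one remains.
        pts = [
            tuple((1 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(pts, pts[1:])
        ]
    return pts[0]


def sample_curve(control_points, n=5):
    """Sample n evenly spaced parameter values along the curve (n >= 2)."""
    return [bezier_point(control_points, i / (n - 1)) for i in range(n)]
```

A refinement stage in this representation only needs to adjust a handful of control points (four for a cubic curve) rather than every pixel along the landmark.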

[29] AI-driven visual monitoring of industrial assembly tasks cs.CV

Mattia Nardon, Stefano Messelodi, Antonio Granata, Fabio Poiesi, Alberto Danese

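TL;DR: ViMAT is a new AI-driven system for real-time visual monitoring of industrial assembly tasks that requires neither a rigid workspace setup nor visual markers; it combines perception and reasoning modules and is validated in real-world scenarios.

Details

Motivation: Existing visual monitoring solutions for industrial assembly typically require rigid workspace setups or visual markers, which limits practical use; a more flexible and adaptable solution is needed.

Result: ViMAT is validated on LEGO component replacement and hydraulic press mold reconfiguration tasks, with quantitative and qualitative analysis in challenging real-world scenarios.

Insight: ViMAT shows the potential of task reasoning under uncertain visual observations, providing a more general visual monitoring solution for industrial settings.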

Abstract: Visual monitoring of industrial assembly tasks is critical for preventing equipment damage due to procedural errors and ensuring worker safety. Although commercial solutions exist, they typically require rigid workspace setups or the application of visual markers to simplify the problem. We introduce ViMAT, a novel AI-driven system for real-time visual monitoring of assembly tasks that operates without these constraints. ViMAT combines a perception module that extracts visual observations from multi-view video streams with a reasoning module that infers the most likely action being performed based on the observed assembly state and prior task knowledge. We validate ViMAT on two assembly tasks, involving the replacement of LEGO components and the reconfiguration of hydraulic press molds, demonstrating its effectiveness through quantitative and qualitative analysis in challenging real-world scenarios characterized by partial and uncertain visual observations. Project page: https://tev-fbk.github.io/ViMAT

[30] MEGC2025: Micro-Expression Grand Challenge on Spot Then Recognize and Visual Question Answering cs.CV | cs.MM

Xinqi Fan, Jingting Li, John See, Moi Hoon Yap, Wen-Huang Cheng

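TL;DR: The MEGC2025 challenge focuses on micro-expression spot-then-recognize (ME-STR) and visual question answering (ME-VQA), bringing in multimodal large language models (MLLMs) and large vision-language models (LVLMs) to strengthen analysis.

Details

Motivation: Conventional approaches treat micro-expression spotting and recognition as separate tasks, which works poorly for long-video analysis. The rise of multimodal models opens new opportunities for the field.

Result: The challenge provides a test set and a leaderboard for participating algorithms, encouraging progress in the field.

Insight: Multimodal models have the potential to significantly improve the accuracy and efficiency of micro-expression analysis, especially in complex scenarios.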

Abstract: Facial micro-expressions (MEs) are involuntary movements of the face that occur spontaneously when a person experiences an emotion but attempts to suppress or repress the facial expression, typically found in a high-stakes environment. In recent years, substantial advancements have been made in the areas of ME recognition, spotting, and generation. However, conventional approaches that treat spotting and recognition as separate tasks are suboptimal, particularly for analyzing long-duration videos in realistic settings. Concurrently, the emergence of multimodal large language models (MLLMs) and large vision-language models (LVLMs) offers promising new avenues for enhancing ME analysis through their powerful multimodal reasoning capabilities. The ME grand challenge (MEGC) 2025 introduces two tasks that reflect these evolving research directions: (1) ME spot-then-recognize (ME-STR), which integrates ME spotting and subsequent recognition in a unified sequential pipeline; and (2) ME visual question answering (ME-VQA), which explores ME understanding through visual question answering, leveraging MLLMs or LVLMs to address diverse question types related to MEs. All participating algorithms are required to run on this test set and submit their results on a leaderboard. More details are available at https://megc2025.github.io.

[31] MapFM: Foundation Model-Driven HD Mapping with Multi-Task Contextual Learning cs.CV | cs.AI

Leonid Ivanov, Vasily Yuryev, Dmitry Yudin

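TL;DR: This paper proposes MapFM, which uses a foundation model to strengthen camera image feature representations and combines multi-task contextual learning (such as BEV semantic segmentation) to improve the quality of vectorized HD map generation.

Details

Motivation: In autonomous driving, high-definition (HD) maps and bird's-eye-view (BEV) semantic maps are essential for localization and planning, but existing methods leave room for improvement in feature representation and scene understanding.

Result: Experiments show that the model achieves higher accuracy and quality on vectorized HD map generation.

Insight: Multi-task contextual learning can effectively improve an end-to-end model's scene understanding and suits complex autonomous driving tasks.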

Abstract: In autonomous driving, high-definition (HD) maps and semantic maps in bird’s-eye view (BEV) are essential for accurate localization, planning, and decision-making. This paper introduces an enhanced end-to-end model named MapFM for online vectorized HD map generation. We significantly boost feature representation quality by incorporating a powerful foundation model for encoding camera images. To further enrich the model’s understanding of the environment and improve prediction quality, we integrate auxiliary prediction heads for semantic segmentation in the BEV representation. This multi-task learning approach provides richer contextual supervision, leading to a more comprehensive scene representation and ultimately resulting in higher accuracy and improved quality of the predicted vectorized HD maps. The source code is available at https://github.com/LIvanoff/MapFM.

[32] OpenPath: Open-Set Active Learning for Pathology Image Classification via Pre-trained Vision-Language Models cs.CV

Lanfeng Zhong, Xin Liao, Shichuan Zhang, Shaoting Zhang, Guotai Wang

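TL;DR: OpenPath is an open-set active learning method for pathology image classification built on a pre-trained vision-language model; task-specific prompts and diverse informative sampling reduce annotation cost and improve model performance.

Details

Motivation: Pathology image classification is vital to medical diagnosis, but conventional active learning methods are inefficient in open-set scenarios (with large amounts of OOD data), and random selection in the first round wastes annotation resources. This paper aims to address these problems.

Result: On two public pathology image datasets, OpenPath significantly outperforms existing open-set active learning methods, with high purity of selected samples and clear gains in model performance.

Insight: 1. Pre-trained VLMs have great potential for medical imaging tasks; 2. In open-set scenarios, the initial selection and diverse sampling are key; 3. Filtering OOD data is critical to active learning efficiency.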

Abstract: Pathology image classification plays a crucial role in accurate medical diagnosis and treatment planning. Training high-performance models for this task typically requires large-scale annotated datasets, which are both expensive and time-consuming to acquire. Active Learning (AL) offers a solution by iteratively selecting the most informative samples for annotation, thereby reducing the labeling effort. However, most AL methods are designed under the assumption of a closed-set scenario, where all the unannotated images belong to target classes. In real-world clinical environments, the unlabeled pool often contains a substantial amount of Out-Of-Distribution (OOD) data, leading to low efficiency of annotation in traditional AL methods. Furthermore, most existing AL methods start with random selection in the first query round, leading to a significant waste of labeling costs in open-set scenarios. To address these challenges, we propose OpenPath, a novel open-set active learning approach for pathological image classification leveraging a pre-trained Vision-Language Model (VLM). In the first query, we propose task-specific prompts that combine target and relevant non-target class prompts to effectively select In-Distribution (ID) and informative samples from the unlabeled pool. In subsequent queries, Diverse Informative ID Sampling (DIS) that includes Prototype-based ID candidate Selection (PIS) and Entropy-Guided Stochastic Sampling (EGSS) is proposed to ensure both purity and informativeness in a query, avoiding the selection of OOD samples. Experiments on two public pathology image datasets show that OpenPath significantly enhances the model’s performance due to its high purity of selected samples, and outperforms several state-of-the-art open-set AL methods. The code is available at https://github.com/HiLab-git/OpenPath.
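One plausible reading of entropy-guided stochastic sampling is sketched below: candidates with more uncertain class predictions are more likely to be queried, while the randomness keeps the selection diverse. This is an assumption for illustration only, since the abstract does not define EGSS. The function names and the choice to sample in proportion to Shannon entropy are hypothetical.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_weighted_sample(prob_rows, n_select, seed=0):
    """Pick n_select distinct indices, each draw weighted by predictive entropy."""
    rng = random.Random(seed)
    remaining = list(range(len(prob_rows)))
    # A tiny constant keeps all weights positive, as random.choices requires a positive total.
    weights = [entropy(prob_rows[i]) + 1e-12 for i in remaining]
    chosen = []
    for _ in range(min(n_select, len(remaining))):
        pos = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pos))
        weights.pop(pos)
    return chosen
```

In an open-set setting, such sampling would be applied only to candidates that have already passed an in-distribution filter, so that high entropy reflects informativeness rather than OOD content.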

[33] Open-World Object Counting in Videos cs.CV | cs.AI

Niki Amini-Naieni, Andrew Zisserman

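TL;DR: The paper introduces a new task, open-world object counting in videos: given a text description or an image example that specifies the target object, count the unique instances of that object in the video. The proposed CountVid model combines image-based counting with video segmentation and tracking to handle double counting and occlusion in crowded scenes.

Details

Motivation: In crowded scenes, double counting and occlusion make accurate counting difficult for conventional methods. Open-world object counting needs a new approach to meet these challenges.

Result: On the VideoCount dataset, CountVid significantly outperforms baseline methods and provides accurate object counts.

Insight: Combining image and video techniques can solve open-world object counting in complex scenes, offering a new direction for practical applications.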

Abstract: We introduce a new task of open-world object counting in videos: given a text description, or an image example, that specifies the target object, the objective is to enumerate all the unique instances of the target objects in the video. This task is especially challenging in crowded scenes with occlusions and similar objects, where avoiding double counting and identifying reappearances is crucial. To this end, we make the following contributions: we introduce a model, CountVid, for this task. It leverages an image-based counting model, and a promptable video segmentation and tracking model to enable automated, open-world object counting across video frames. To evaluate its performance, we introduce VideoCount, a new dataset for our novel task built from the TAO and MOT20 tracking datasets, as well as from videos of penguins and metal alloy crystallization captured by x-rays. Using this dataset, we demonstrate that CountVid provides accurate object counts, and significantly outperforms strong baselines. The VideoCount dataset, the CountVid model, and all the code are available at https://github.com/niki-amini-naieni/CountVid/.

[34] Unsupervised Pelage Pattern Unwrapping for Animal Re-identification cs.CV

Aleksandr Algasov, Ekaterina Nepovinnykh, Fedor Zolotarev, Tuomas Eerola, Heikki Kälviäinen

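TL;DR: The paper proposes a geometry-aware texture mapping method that unwraps animal pelage patterns into a canonical UV space, making re-identification more robust.

Details

Motivation: Existing methods struggle with the deformation of animal fur or skin patterns; this paper aims to address that challenge.

Result: On the Saimaa ringed seal and leopard datasets, re-identification accuracy improves by up to 5.4%.

Insight: Incorporating geometric information into texture unwrapping effectively handles the deformation caused by changes in animal pose and improves the robustness of feature matching.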

Abstract: Existing individual re-identification methods often struggle with the deformable nature of animal fur or skin patterns which undergo geometric distortions due to body movement and posture changes. In this paper, we propose a geometry-aware texture mapping approach that unwarps pelage patterns, the unique markings found on an animal’s skin or fur, into a canonical UV space, enabling more robust feature matching. Our method uses surface normal estimation to guide the unwrapping process while preserving the geometric consistency between the 3D surface and the 2D texture space. We focus on two challenging species: Saimaa ringed seals (Pusa hispida saimensis) and leopards (Panthera pardus). Both species have distinctive yet highly deformable fur patterns. By integrating our pattern-preserving UV mapping with existing re-identification techniques, we demonstrate improved accuracy across diverse poses and viewing angles. Our framework does not require ground truth UV annotations and can be trained in a self-supervised manner. Experiments on seal and leopard datasets show up to a 5.4% improvement in re-identification accuracy.

[35] NERO: Explainable Out-of-Distribution Detection with Neuron-level Relevance cs.CV | cs.LG

Anju Chhetri, Jari Korhonen, Prashnna Gyawali, Binod Bhattarai

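TL;DR: NERO is a novel out-of-distribution (OOD) detection method based on neuron-level relevance; it clusters relevance to form representative centroids and introduces a relevance distance metric, significantly improving OOD detection performance.

Details

Motivation: In medical imaging, ensuring the reliability of deep learning models is critical, especially identifying OOD samples to flag potential anomalies. Existing methods may not fully capture the diversity of OOD data, so a more accurate and explainable OOD detection method is needed.

Result: On the gastrointestinal imaging datasets Kvasir and GastroVision, NERO outperforms existing OOD detection methods across multiple deep learning architectures.

Insight: Neuron-level relevance captures the diversity of OOD samples more precisely, and clustering with a distance metric yields explainable detection results, which is especially suitable for high-stakes domains such as medical imaging.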

Abstract: Ensuring reliability is paramount in deep learning, particularly within the domain of medical imaging, where diagnostic decisions often hinge on model outputs. The capacity to separate out-of-distribution (OOD) samples has proven to be a valuable indicator of a model’s reliability in research. In medical imaging, this is especially critical, as identifying OOD inputs can help flag potential anomalies that might otherwise go undetected. While many OOD detection methods rely on feature or logit space representations, recent works suggest these approaches may not fully capture OOD diversity. To address this, we propose a novel OOD scoring mechanism, called NERO, that leverages neuron-level relevance at the feature layer. Specifically, we cluster neuron-level relevance for each in-distribution (ID) class to form representative centroids and introduce a relevance distance metric to quantify a new sample’s deviation from these centroids, enhancing OOD separability. Additionally, we refine performance by incorporating scaled relevance in the bias term and combining feature norms. Our framework also enables explainable OOD detection. We validate its effectiveness across multiple deep learning architectures on the gastrointestinal imaging benchmarks Kvasir and GastroVision, achieving improvements over state-of-the-art OOD detection methods.
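The centroid-and-distance idea in the abstract can be illustrated with a simplified sketch. It assumes relevance vectors have already been computed by some attribution method. It uses a single mean centroid per class in place of the paper's clustering and plain Euclidean distance in place of its relevance distance metric; it also omits the scaled bias relevance and feature-norm terms. All names are hypothetical.

```python
import math


def class_centroids(relevance_by_class):
    """Average the relevance vectors of each in-distribution class."""
    centroids = {}
    for label, vectors in relevance_by_class.items():
        dim = len(vectors[0])
        centroids[label] = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return centroids


def ood_score(relevance, centroids):
    """Distance to the nearest class centroid; larger values suggest OOD."""
    return min(math.dist(relevance, c) for c in centroids.values())
```

A sample would be flagged as OOD when its score exceeds a threshold chosen on held-out in-distribution data, and the nearest centroid indicates which class the sample most resembles.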

[36] Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material cs.CV | cs.AI

Team Hunyuan3D, Shuhui Yang, Mingxin Yang, Yifei Feng, Xin Huang

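TL;DR: Hunyuan3D 2.1 is an advanced AI 3D content generation system that converts images into high-fidelity 3D assets through two core modules (shape generation and texture synthesis), and it is presented with a detailed tutorial workflow.

Details

Motivation: The 3D AI-generated content (AIGC) field is currently accessible mainly to researchers and developers and involves complex data processing and training pipelines. Hunyuan3D 2.1 aims to simplify this process and make it better suited to production settings such as gaming and film design.

Result: Hunyuan3D 2.1 efficiently generates high-fidelity 3D assets suitable for gaming and industrial design, substantially lowering the complexity of 3D content generation.

Insight: Through modular design and a detailed tutorial, the paper shows how to bring 3D generation technology from research into practical use, providing a practical tool for wider adoption of AI-generated 3D content.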

Abstract: 3D AI-generated content (AIGC) is a passionate field that has significantly accelerated the creation of 3D models in gaming, film, and design. Despite the development of several groundbreaking models that have revolutionized 3D generation, the field remains largely accessible only to researchers, developers, and designers due to the complexities involved in collecting, processing, and training 3D models. To address these challenges, we introduce Hunyuan3D 2.1 as a case study in this tutorial. This tutorial offers a comprehensive, step-by-step guide on processing 3D data, training a 3D generative model, and evaluating its performance using Hunyuan3D 2.1, an advanced system for producing high-resolution, textured 3D assets. The system comprises two core components: the Hunyuan3D-DiT for shape generation and the Hunyuan3D-Paint for texture synthesis. We will explore the entire workflow, including data preparation, model architecture, training strategies, evaluation metrics, and deployment. By the conclusion of this tutorial, you will have the knowledge to finetune or develop a robust 3D generative model suitable for applications in gaming, virtual reality, and industrial design.

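[37] Multimodal Large Language Models for Medical Report Generation via Customized Prompt Tuning cs.CV

Chunlei Li, Jingyang Hou, Yilei Shi, Jingliang Hu, Xiao Xiang Zhu

TL;DR: This paper proposes MRG-LLM, a novel multimodal large language model (MLLM) for generating medical reports from images, which improves performance through a dynamic prompt customization mechanism.

Details

Motivation: Generating medical reports from imaging data is a challenging task in clinical practice, and the integration of large language models (LLMs) with medical imaging data still needs deeper exploration.

Result: Experiments on the IU X-ray and MIMIC-CXR datasets show that MRG-LLM achieves state-of-the-art performance in medical report generation.

Insight: Dynamic prompt customization can markedly improve MLLM performance in medical report generation and suggests a new direction for practical medical AI.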

Abstract: Medical report generation from imaging data remains a challenging task in clinical practice. While large language models (LLMs) show great promise in addressing this challenge, their effective integration with medical imaging data still deserves in-depth exploration. In this paper, we present MRG-LLM, a novel multimodal large language model (MLLM) that combines a frozen LLM with a learnable visual encoder and introduces a dynamic prompt customization mechanism. Our key innovation lies in generating instance-specific prompts tailored to individual medical images through conditional affine transformations derived from visual features. We propose two implementations: prompt-wise and promptbook-wise customization, enabling precise and targeted report generation. Extensive experiments on IU X-ray and MIMIC-CXR datasets demonstrate that MRG-LLM achieves state-of-the-art performance in medical report generation. Our code will be made publicly available.
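
The "conditional affine transformation" of prompts can be sketched as follows. This is an illustrative reading of the abstract, not the paper's code: the scale and shift are assumed to come from two linear maps of the visual feature, and `linear`, `customize_prompt`, and all weights are hypothetical.

```python
def linear(weights, bias, x):
    """Apply y = W x + b with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def customize_prompt(prompt, visual_feat, w_scale, b_scale, w_shift, b_shift):
    """Return instance-specific prompt embeddings.

    Each prompt token p becomes scale * p + shift, where scale and shift are
    predicted from the image's visual feature (assumed linear maps).
    """
    scale = linear(w_scale, b_scale, visual_feat)
    shift = linear(w_shift, b_shift, visual_feat)
    return [[s * p + t for p, s, t in zip(token, scale, shift)] for token in prompt]

# Two prompt tokens of dimension 2 and a visual feature of dimension 3.
prompt = [[1.0, 2.0], [3.0, 4.0]]
visual_feat = [0.5, -1.0, 2.0]
w_scale = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # scale stays at its bias of 1
b_scale = [1.0, 1.0]
w_shift = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # shift copies the first two features
b_shift = [0.0, 0.0]
custom = customize_prompt(prompt, visual_feat, w_scale, b_scale, w_shift, b_shift)
```

With these toy weights every token is shifted by the same image-dependent offset, which corresponds loosely to the prompt-wise variant; the promptbook-wise variant is not modeled here.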

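[38] GenHOI: Generalizing Text-driven 4D Human-Object Interaction Synthesis for Unseen Objects cs.CV | cs.AI

Shujia Li, Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Yutong Ban

TL;DR: The paper proposes GenHOI, a two-stage framework that targets generalization to unseen objects and high-fidelity 4D human-object interaction (HOI) synthesis. It first reconstructs sparse 3D keyframes and then interpolates them with a diffusion model.

Details

Motivation: Large-scale 4D HOI datasets are scarce, which makes it hard to synthesize high-quality 4D HOI sequences. The authors aim to work around this limitation and generalize to unseen objects.

Result: The method achieves state-of-the-art results on the OMOMO and 3D-FUTURE datasets, with strong generalization to unseen objects and high-fidelity 4D generation.

Insight: A two-stage approach driven by sparse 3D keyframes can deliver high-quality 4D synthesis with limited data; generalization to unseen objects is the key contribution.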

Abstract: While diffusion models and large-scale motion datasets have advanced text-driven human motion synthesis, extending these advances to 4D human-object interaction (HOI) remains challenging, mainly due to the limited availability of large-scale 4D HOI datasets. In our study, we introduce GenHOI, a novel two-stage framework aimed at achieving two key objectives: 1) generalization to unseen objects and 2) the synthesis of high-fidelity 4D HOI sequences. In the initial stage of our framework, we employ an Object-AnchorNet to reconstruct sparse 3D HOI keyframes for unseen objects, learning solely from 3D HOI datasets, thereby mitigating the dependence on large-scale 4D HOI datasets. Subsequently, we introduce a Contact-Aware Diffusion Model (ContactDM) in the second stage to seamlessly interpolate sparse 3D HOI keyframes into densely temporally coherent 4D HOI sequences. To enhance the quality of generated 4D HOI sequences, we propose a novel Contact-Aware Encoder within ContactDM to extract human-object contact patterns and a novel Contact-Aware HOI Attention to effectively integrate the contact signals into diffusion models. Experimental results show that we achieve state-of-the-art results on the publicly available OMOMO and 3D-FUTURE datasets, demonstrating strong generalization abilities to unseen objects, while enabling high-fidelity 4D HOI generation.

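[39] NTIRE 2025 Image Shadow Removal Challenge Report cs.CV

Florin-Alexandru Vasluianu, Tim Seizinger, Zhuyun Zhou, Cailian Chen, Zongwei Wu

TL;DR: The NTIRE 2025 Image Shadow Removal Challenge report summarizes the results of the 17 teams, out of 306 registered participants, that submitted final solutions. The challenge had two evaluation tracks: reconstruction fidelity and visual perception.

Details

Motivation: To examine recent progress in image shadow removal and to advance the field through a challenge.

Result: 17 teams successfully submitted solutions, showcasing a diverse range of shadow removal techniques.

Insight: Shadow removal requires not only reconstruction accuracy but also attention to human visual perception.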

Abstract: This work examines the findings of the NTIRE 2025 Shadow Removal Challenge. A total of 306 participants have registered, with 17 teams successfully submitting their solutions during the final evaluation phase. Following the last two editions, this challenge had two evaluation tracks: one focusing on reconstruction fidelity and the other on visual perception through a user study. Both tracks were evaluated with images from the WSRD+ dataset, simulating interactions between self- and cast-shadows with a large number of diverse objects, textures, and materials.

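[40] RaCalNet: Radar Calibration Network for Sparse-Supervised Metric Depth Estimation cs.CV | cs.RO

Xingrui Qin, Wentao Zhao, Chuan Cao, Yihe Niu, Houcheng Jiang

TL;DR: RaCalNet is a sparsely supervised radar calibration network for metric depth estimation that needs no dense LiDAR supervision. It recalibrates and refines sparse radar points to build depth priors, and with sparse supervision alone it outperforms dense-supervised methods.

Details

Motivation: Existing dense metric depth estimation methods rely on dense LiDAR supervision generated by multi-frame projection and interpolation, which is costly and data-intensive. RaCalNet aims for high performance with sparse LiDAR supervision, reducing data requirements.

Result: RaCalNet reduces RMSE by 35.30% on the ZJU-4DRadarCam dataset and by 34.89% in real-world deployment scenarios, and produces depth maps with clear object contours and fine-grained textures.

Insight: Depth estimation under sparse supervision can reach high performance by refining radar-point priors, which challenges the assumed need for dense supervision. The approach suggests a path toward low-cost, efficient multi-sensor fusion.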

Abstract: Dense metric depth estimation using millimeter-wave radar typically requires dense LiDAR supervision, generated via multi-frame projection and interpolation, to guide the learning of accurate depth from sparse radar measurements and RGB images. However, this paradigm is both costly and data-intensive. To address this, we propose RaCalNet, a novel framework that eliminates the need for dense supervision by using sparse LiDAR to supervise the learning of refined radar measurements, resulting in a supervision density of merely around 1% compared to dense-supervised methods. Unlike previous approaches that associate radar points with broad image regions and rely heavily on dense labels, RaCalNet first recalibrates and refines sparse radar points to construct accurate depth priors. These priors then serve as reliable anchors to guide monocular depth prediction, enabling metric-scale estimation without resorting to dense supervision. This design improves structural consistency and preserves fine details. Despite relying solely on sparse supervision, RaCalNet surpasses state-of-the-art dense-supervised methods, producing depth maps with clear object contours and fine-grained textures. Extensive experiments on the ZJU-4DRadarCam dataset and real-world deployment scenarios demonstrate its effectiveness, reducing RMSE by 35.30% and 34.89%, respectively.

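[41] Show-o2: Improved Native Unified Multimodal Models cs.CV

Jinheng Xie, Zhenheng Yang, Mike Zheng Shou

TL;DR: The paper presents Show-o2, an improved native unified multimodal model that combines autoregressive modeling with flow matching. It builds unified visual representations on a 3D causal variational autoencoder space and scales across image and video modalities.

Details

Motivation: To address the challenge of unified representation and generation in multimodal tasks, the authors propose Show-o2, a native unified model for effective cross-modal understanding and generation.

Result: Show-o2 handles a wide range of multimodal understanding and generation tasks across text, images, and videos.

Insight: A native unified model with flow matching improves both representation and generation in multimodal tasks and suggests a direction for cross-modal applications.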

Abstract: This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.

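[42] Baltimore Atlas: FreqWeaver Adapter for Semi-supervised Ultra-high Spatial Resolution Land Cover Classification cs.CV

Junhao Wu, Aboagye-Ntow Stephen, Chuyuan Wang, Gang Chen, Xin Huang

TL;DR: The paper proposes a parameter-efficient semi-supervised segmentation framework for 0.3 m resolution remote sensing imagery. By combining the knowledge of SAM2 with a custom FreqWeaver Adapter, it markedly improves fine-grained land cover classification.

Details

Motivation: Ultra-high-resolution land cover classification is challenging because pixel-level annotation is expensive, scale varies widely, and large vision models adapt poorly. Existing methods rely heavily on annotated data and mostly target 1 m resolution imagery.

Result: The method shows superior structural consistency, improving by 1.78% over existing parameter-efficient tuning strategies and by 3.44% over state-of-the-art high-resolution remote sensing segmentation approaches.

Insight: A lightweight adapter (only 5.96% of total model parameters) combined with semi-supervised learning can substantially improve classification of high-resolution imagery, offering an efficient option for practical use.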

Abstract: Ultra-high Spatial Resolution Land Cover Classification is essential for fine-grained land cover analysis, yet it remains challenging due to the high cost of pixel-level annotations, significant scale variation, and the limited adaptability of large-scale vision models. Existing methods typically focus on 1-meter spatial resolution imagery and rely heavily on annotated data, whereas practical applications often require processing higher-resolution imagery under weak supervision. To address this, we propose a parameter-efficient semi-supervised segmentation framework for 0.3 m spatial resolution imagery, which leverages the knowledge of SAM2 and introduces a remote sensing-specific FreqWeaver Adapter to enhance fine-grained detail modeling while maintaining a lightweight design at only 5.96% of the total model parameters. By effectively leveraging unlabeled data and maintaining minimal parameter overhead, the proposed method delivers robust segmentation results with superior structural consistency, achieving a 1.78% improvement over existing parameter-efficient tuning strategies and a 3.44% gain compared to state-of-the-art high-resolution remote sensing segmentation approaches.

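[43] A Unified Graph-based Framework for Scalable 3D Tree Reconstruction and Non-Destructive Biomass Estimation from Point Clouds cs.CV

Di Wang, Shi Li

TL;DR: The paper proposes a unified graph-based framework for scalable 3D tree reconstruction and non-destructive biomass estimation from point clouds, improving robustness and scalability.

Details

Motivation: Existing QSM-based biomass estimation methods are designed mainly for individual trees, depend on high-quality TLS data, and require multiple pre-processing steps, which makes them hard to scale.

Result: Relative error is about 20% in leaf-on scenarios and about 30% on low-density ULS data with partial coverage, supporting the method's effectiveness.

Insight: ULS is a viable alternative to TLS, and the method provides a new tool for large-scale forest inventory and climate change research.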

Abstract: Estimating forest above-ground biomass (AGB) is crucial for assessing carbon storage and supporting sustainable forest management. Quantitative Structural Model (QSM) offers a non-destructive approach to AGB estimation through 3D tree structural reconstruction. However, current QSM methods face significant limitations, as they are primarily designed for individual trees, depend on high-quality point cloud data from terrestrial laser scanning (TLS), and also require multiple pre-processing steps that hinder scalability and practical deployment. This study presents a novel unified framework that enables end-to-end processing of large-scale point clouds using an innovative graph-based pipeline. The proposed approach seamlessly integrates tree segmentation, leaf-wood separation, and 3D skeletal reconstruction through dedicated graph operations including pathing and abstracting for tree topology reasoning. Comprehensive validation was conducted on datasets with varying leaf conditions (leaf-on and leaf-off), spatial scales (tree- and plot-level), and data sources (TLS and UAV-based laser scanning, ULS). Experimental results demonstrate strong performance under challenging conditions, particularly in leaf-on scenarios (~20% relative error) and low-density ULS datasets with partial coverage (~30% relative error). These findings indicate that the proposed framework provides a robust and scalable solution for large-scale, non-destructive AGB estimation. It significantly reduces dependency on specialized pre-processing tools and establishes ULS as a viable alternative to TLS. To our knowledge, this is the first method capable of enabling seamless, end-to-end 3D tree reconstruction at operational scales. This advancement substantially improves the feasibility of QSM-based AGB estimation, paving the way for broader applications in forest inventory and climate change research.

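[44] One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution cs.CV | cs.AI

Yujing Sun, Lingchen Sun, Shuaizheng Liu, Rongyuan Wu, Zhengqiang Zhang

TL;DR: The paper proposes DLoRAL, a one-step diffusion model based on stable diffusion (SD) that achieves both rich detail and temporal consistency in video super-resolution (VSR). It uses dual LoRA learning (D-LoRA and C-LoRA) and a Cross-Frame Retrieval (CFR) module to optimize spatial detail and temporal consistency together.

Details

Motivation: Existing SD-based Real-VSR methods often sacrifice spatial detail to preserve temporal consistency, which leads to suboptimal visual quality. The authors argue that the key is to extract degradation-robust temporal consistency priors from the low-quality input video and to keep those priors while enhancing detail.

Result: Experiments show that DLoRAL performs strongly in both accuracy and speed, producing detail-rich and temporally consistent super-resolved video.

Insight: With staged training (consistency first, then detail) and a modular design (CFR, C-LoRA, D-LoRA), DLoRAL addresses the trade-off between detail and consistency in Real-VSR and suggests a direction for generative-model-based video restoration.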

Abstract: It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic details synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at https://github.com/yjsunnn/DLoRAL.
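
The inference-time merge of the two LoRA branches can be illustrated with standard LoRA arithmetic. This is a minimal sketch, assuming each branch contributes an additive low-rank update `up @ down` to a weight matrix and omitting any scaling factors; it is not taken from the DLoRAL code, and `merge_lora` and the toy matrices are hypothetical.

```python
def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n), both as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_lora(weight, branches):
    """Fold low-rank updates into a base weight: W' = W + sum(up @ down).

    branches: list of (up, down) pairs, up is (out x r), down is (r x in).
    """
    merged = [row[:] for row in weight]
    for up, down in branches:
        delta = matmul(up, down)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += value
    return merged

base = [[1.0, 0.0], [0.0, 1.0]]            # toy frozen weight
c_lora = ([[1.0], [0.0]], [[0.0, 1.0]])    # rank-1 stand-in for the consistency branch
d_lora = ([[0.0], [1.0]], [[2.0, 0.0]])    # rank-1 stand-in for the detail branch
merged = merge_lora(base, [c_lora, d_lora])
```

After merging, a single matrix replaces the base weight plus both adapters, so the branches add no extra cost at inference time.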

Kyobin Choo, Hyunkyung Han, Jinyeong Kim, Chanyong Yoon, Seong Jae Hwang

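TL;DR: The paper proposes M2M-Reg, a new framework for registering highly heterogeneous multi-modal medical images. It trains multi-modal registration models using only mono-modal similarity and adds the GradCyCon regularizer, markedly improving registration quality.

Details

Motivation: In clinical practice, functional imaging modalities such as PET and FA must be aligned with a structural reference such as MRI or CT. Conventional unsupervised registration methods struggle to learn reliable mappings when the modalities differ this much, and they often distort the images. The core problem is that the similarity metrics cannot capture alignment between highly disparate modalities.

Result: Experiments on the ADNI dataset show that M2M-Reg achieves up to 2x higher DSC than prior methods for PET-MRI and FA-MRI registration.

Insight: Guiding multi-modal registration with mono-modal similarity is an effective way to handle highly heterogeneous modalities, and the cyclic training scheme together with the regularizer further improves robustness.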

Abstract: In clinical practice, imaging modalities with functional characteristics, such as positron emission tomography (PET) and fractional anisotropy (FA), are often aligned with a structural reference (e.g., MRI, CT) for accurate interpretation or group analysis, necessitating multi-modal deformable image registration (DIR). However, due to the extreme heterogeneity of these modalities compared to standard structural scans, conventional unsupervised DIR methods struggle to learn reliable spatial mappings and often distort images. We find that the similarity metrics guiding these models fail to capture alignment between highly disparate modalities. To address this, we propose M2M-Reg (Multi-to-Mono Registration), a novel framework that trains multi-modal DIR models using only mono-modal similarity while preserving the established architectural paradigm for seamless integration into existing models. We also introduce GradCyCon, a regularizer that leverages M2M-Reg’s cyclic training scheme to promote diffeomorphism. Furthermore, our framework naturally extends to a semi-supervised setting, integrating pre-aligned and unaligned pairs only, without requiring ground-truth transformations or segmentation masks. Experiments on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset demonstrate that M2M-Reg achieves up to 2x higher DSC than prior methods for PET-MRI and FA-MRI registration, highlighting its effectiveness in handling highly heterogeneous multi-modal DIR. Our code is available at https://github.com/MICV-yonsei/M2M-Reg.

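[46] BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion cs.CV

Yuqing Lan, Chenyang Zhu, Zhirui Gao, Jiazhao Zhang, Yihan Cao

TL;DR: The paper proposes BoxFusion, a real-time multi-view box fusion method for reconstruction-free open-vocabulary 3D object detection. It avoids the heavy computation of dense point cloud reconstruction while running efficiently in real time.

Details

Motivation: Existing 3D object detection methods usually rely on dense point cloud reconstruction, which incurs high computational and memory overhead and hinders real-time deployment. The paper addresses this with a reconstruction-free online framework suited to real-time detection in large scenes.

Result: Experiments show that the method outperforms existing online methods on the ScanNetV2 and CA-1M datasets and runs in real time in environments exceeding 1000 square meters.

Insight: The reconstruction-free paradigm is promising for 3D detection, and multi-view fusion can substantially improve efficiency and generalization.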

Abstract: Open-vocabulary 3D object detection has gained significant interest due to its critical applications in autonomous driving and embodied AI. Existing detection methods, whether offline or online, typically rely on dense point cloud reconstruction, which imposes substantial computational overhead and memory constraints, hindering real-time deployment in downstream tasks. To address this, we propose a novel reconstruction-free online framework tailored for memory-efficient and real-time 3D detection. Specifically, given streaming posed RGB-D video input, we leverage Cubify Anything as a pre-trained visual foundation model (VFM) for single-view 3D object detection by bounding boxes, coupled with CLIP to capture open-vocabulary semantics of detected objects. To fuse all detected bounding boxes across different views into a unified one, we employ an association module for correspondences of multi-views and an optimization module to fuse the 3D bounding boxes of the same instance predicted in multi-views. The association module utilizes 3D Non-Maximum Suppression (NMS) and a box correspondence matching module, while the optimization module uses an IoU-guided efficient random optimization technique based on particle filtering to enforce multi-view consistency of the 3D bounding boxes while minimizing computational complexity. Extensive experiments on ScanNetV2 and CA-1M datasets demonstrate that our method achieves state-of-the-art performance among online methods. Benefiting from this novel reconstruction-free paradigm for 3D object detection, our method exhibits great generalization abilities in various scenarios, enabling real-time perception even in environments exceeding 1000 square meters.
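
The association step relies on 3D Non-Maximum Suppression. A minimal sketch of 3D IoU and greedy NMS is shown below; it assumes axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)`, whereas the boxes in the paper may be oriented, and the function names and threshold are hypothetical.

```python
def volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

def iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        overlap = min(a[axis + 3], b[axis + 3]) - max(a[axis], b[axis])
        if overlap <= 0:
            return 0.0  # boxes do not overlap along this axis
        inter *= overlap
    return inter / (volume(a) + volume(b) - inter)

def nms_3d(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou_3d(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [
    (0.0, 0.0, 0.0, 2.0, 2.0, 2.0),  # detection from one view
    (0.1, 0.0, 0.0, 2.0, 2.0, 2.0),  # near-duplicate of the first box
    (5.0, 5.0, 5.0, 6.0, 6.0, 6.0),  # a separate object
]
scores = [0.9, 0.8, 0.7]
kept = nms_3d(boxes, scores)
```

Here the near-duplicate is suppressed and the two distinct objects remain. The paper's subsequent correspondence matching and particle-filter-based box optimization are not covered by this sketch.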

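[47] HOIDiNi: Human-Object Interaction through Diffusion Noise Optimization cs.CV

Roey Ron, Guy Tevet, Haim Sawdayee, Amit H. Bermano

TL;DR: HOIDiNi is a text-driven framework based on diffusion noise optimization for generating realistic and plausible human-object interaction (HOI). A two-phase optimization balances contact accuracy with natural motion and outperforms existing methods.

Details

Motivation: HOI generation must satisfy strict contact accuracy while covering diverse motion, and existing methods trade realism against physical correctness. HOIDiNi addresses this by optimizing directly in the noise space of a pretrained diffusion model.

Result: Quantitative, qualitative, and subjective evaluations on the GRAB dataset show that HOIDiNi outperforms existing methods in contact accuracy, physical validity, and overall quality.

Insight: Separating the problem into an object-centric phase and a human-centric phase is key to handling the complexity of HOI, and noise-space optimization of diffusion models opens new possibilities for high-quality generation.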

Abstract: We present HOIDiNi, a text-driven diffusion framework for synthesizing realistic and plausible human-object interaction (HOI). HOI generation is extremely challenging since it induces strict contact accuracies alongside a diverse motion manifold. While current literature trades off between realism and physical correctness, HOIDiNi optimizes directly in the noise space of a pretrained diffusion model using Diffusion Noise Optimization (DNO), achieving both. This is made feasible thanks to our observation that the problem can be separated into two phases: an object-centric phase, primarily making discrete choices of hand-object contact locations, and a human-centric phase that refines the full-body motion to realize this blueprint. This structured approach allows for precise hand-object contact without compromising motion naturalness. Quantitative, qualitative, and subjective evaluations on the GRAB dataset alone clearly indicate HOIDiNi outperforms prior works and baselines in contact accuracy, physical validity, and overall quality. Our results demonstrate the ability to generate complex, controllable interactions, including grasping, placing, and full-body coordination, driven solely by textual prompts. https://hoidini.github.io.

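[48] FindingDory: A Benchmark to Evaluate Memory in Embodied Agents cs.CV | cs.RO

Karmesh Yadav, Yusuf Ali, Gunshi Gupta, Yarin Gal, Zsolt Kira

TL;DR: The paper introduces FindingDory, a benchmark for evaluating embodied agents on long-term memory tasks, filling a gap left by existing long-video QA benchmarks in embodied settings.

Details

Motivation: Large vision-language models perform well in planning and control tasks but are limited in their ability to use long-term experience in embodied settings. Existing benchmarks do not cover embodied tasks that require low-level skills and fine-grained reasoning.

Result: The benchmark covers 60 tasks in the Habitat simulator. Baselines that combine state-of-the-art VLMs with low-level navigation policies show where current models fall short on memory-intensive tasks.

Insight: The study underlines the importance of integrating long-term memory into embodied agents and provides a scalable evaluation framework for future research.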

Abstract: Large vision-language models have recently demonstrated impressive performance in planning and control tasks, driving interest in their application to real-world robotics. However, deploying these models for reasoning in embodied contexts is limited by their ability to incorporate long-term experience collected across multiple days and represented by vast collections of images. Current VLMs typically struggle to process more than a few hundred images concurrently, highlighting the need for more efficient mechanisms to handle long-term memory in embodied settings. To effectively evaluate these models for long-horizon control, a benchmark must specifically target scenarios where memory is crucial for success. Existing long-video QA benchmarks overlook embodied challenges like object manipulation and navigation, which demand low-level skills and fine-grained reasoning over past interactions. Moreover, effective memory integration in embodied agents involves both recalling relevant historical information and executing actions based on that information, making it essential to study these aspects together rather than in isolation. In this work, we introduce a new benchmark for long-range embodied tasks in the Habitat simulator. This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness in an environment. The tasks can also be procedurally extended to longer and more challenging versions, enabling scalable evaluation of memory and reasoning. We also present baselines that integrate state-of-the-art VLMs with low level navigation policies, assessing their performance on these memory-intensive tasks and highlight areas for improvement.

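[49] Demystifying the Visual Quality Paradox in Multimodal Large Language Models cs.CV | cs.AI

Shuo Xing, Lanqing Guo, Hongyuan Hua, Seoyoung Lee, Peiran Li

TL;DR: The paper studies how visual quality relates to performance in multimodal large language models (MLLMs) and finds that performance can improve when images deviate from human-perceived fidelity. It proposes a lightweight method, Visual-Quality Test-Time Tuning (VQ-TTT), that dynamically adjusts input images.

Details

Motivation: Although MLLMs excel at vision-language tasks, it is unclear how the visual quality of input images affects their performance. The authors ask whether higher image quality actually yields better model understanding.

Result: Across the evaluated MLLMs and all datasets, VQ-TTT significantly raises average accuracy without external models, cached features, or extra training data.

Insight: MLLM preferences for visual input can differ from human perception, so adaptive visual-quality adjustment is needed rather than simply "clean" images. This matters especially as AI becomes the main consumer of data.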

Abstract: Recent Multimodal Large Language Models (MLLMs) excel on benchmark vision-language tasks, yet little is known about how input visual quality shapes their responses. Does higher perceptual quality of images already translate to better MLLM understanding? We conduct the first systematic study spanning leading MLLMs and a suite of vision-language benchmarks, applying controlled degradations and stylistic shifts to each image. Surprisingly, we uncover a visual-quality paradox: model, task, and even individual-instance performance can improve when images deviate from human-perceived fidelity. Off-the-shelf restoration pipelines fail to reconcile these idiosyncratic preferences. To close the gap, we introduce Visual-Quality Test-Time Tuning (VQ-TTT), a lightweight adaptation module that: (1) inserts a learnable, low-rank kernel before the frozen vision encoder to modulate frequency content; and (2) fine-tunes only shallow vision-encoder layers via LoRA. VQ-TTT dynamically adjusts each input image in a single forward pass, aligning it with task-specific model preferences. Across the evaluated MLLMs and all datasets, VQ-TTT significantly raises average accuracy, with no external models, cached features, or extra training data. These findings redefine "better" visual inputs for MLLMs and highlight the need for adaptive, rather than universally "clean", imagery, in the new era of AI being the main data customer.

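[50] Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning cs.CV | cs.LG

Ankan Deria, Adinath Madhavrao Dukre, Feilong Tang, Sara Atito, Sudipta Roy

TL;DR: ViMaR is a two-stage inference framework that combines a temporal-difference value model with a margin-based reward adjustment to improve the efficiency and output fidelity of vision-language models (VLMs). It markedly reduces hallucinations and runs over 4x faster than existing value-guided methods.

Details

Motivation: Existing inference methods for VLMs are computationally expensive and prone to low-confidence, erroneous descriptions (hallucinations). They lack a mechanism to penalize low-confidence generations, which hurts both accuracy and efficiency.

Result: Across multiple VLM architectures, ViMaR produces captions that are significantly more reliable, factually accurate, and detailed, with over 4x speedup compared to existing value-guided methods, and it generalizes across models.

Insight: ViMaR's modularity and cross-model generalization make it a scalable approach to inference-time optimization for VLMs, and using its captions for self-training points to a further route for improving model performance.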

Abstract: Despite significant advances in inference-time search for vision-language models (VLMs), existing approaches remain both computationally expensive and prone to unpenalized, low-confidence generations which often lead to persistent hallucinations. We introduce Value-guided Inference with Margin-based Reward (ViMaR), a two-stage inference framework that improves both efficiency and output fidelity by combining a temporal-difference value model with a margin-aware reward adjustment. In the first stage, we perform a single pass to identify the highest-value caption among diverse candidates. In the second stage, we selectively refine only those segments that were overlooked or exhibit weak visual grounding, thereby eliminating frequently rewarded evaluations. A calibrated margin-based penalty discourages low-confidence continuations while preserving descriptive richness. Extensive experiments across multiple VLM architectures demonstrate that ViMaR generates captions that are significantly more reliable, factually accurate, detailed, and explanatory, while achieving over 4x speedup compared to existing value-guided methods. Specifically, we show that ViMaR trained solely on LLaVA Mistral-7B generalizes effectively to guide decoding in a stronger unseen model. To further validate this, we adapt ViMaR to steer generation in LLaVA-OneVision-Qwen2-7B, leading to consistent improvements in caption quality and demonstrating robust cross-model guidance. This cross-model generalization highlights ViMaR’s flexibility and modularity, positioning it as a scalable and transferable inference-time decoding strategy. Furthermore, when ViMaR-generated captions are used for self-training, the underlying models achieve substantial gains across a broad suite of visual comprehension benchmarks, underscoring the potential of fast, accurate, and self-improving VLM pipelines.

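[51] UniRelight: Learning Joint Decomposition and Synthesis for Video Relighting cs.CV

Kai He, Ruofan Liang, Jacob Munkberg, Jon Hasselgren, Nandita Vijaykumar

TL;DR: The paper proposes a video relighting method that performs decomposition and synthesis jointly. Built on generative video diffusion models, it improves scene understanding and the realism of lighting effects.

Details

Motivation: Existing end-to-end relighting models are limited by the scarcity of paired multi-illumination data, while two-stage pipelines are prone to error accumulation and unrealistic outputs. The paper aims to address both problems.

Result: The model generalizes strongly across diverse domains and surpasses previous methods in visual fidelity and temporal consistency.

Insight: Joint learning implicitly improves scene understanding and enables realistic lighting effects and intricate material interactions such as shadows, reflections, and transparency.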

Abstract: We address the challenge of relighting a single image or video, a task that demands precise scene intrinsic understanding and high-quality light transport synthesis. Existing end-to-end relighting models are often limited by the scarcity of paired multi-illumination data, restricting their ability to generalize across diverse scenes. Conversely, two-stage pipelines that combine inverse and forward rendering can mitigate data requirements but are susceptible to error accumulation and often fail to produce realistic outputs under complex lighting conditions or with sophisticated materials. In this work, we introduce a general-purpose approach that jointly estimates albedo and synthesizes relit outputs in a single pass, harnessing the generative capabilities of video diffusion models. This joint formulation enhances implicit scene comprehension and facilitates the creation of realistic lighting effects and intricate material interactions, such as shadows, reflections, and transparency. Trained on synthetic multi-illumination data and extensive automatically labeled real-world videos, our model demonstrates strong generalization across diverse domains and surpasses previous methods in both visual fidelity and temporal consistency.

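[52] Sekai: A Video Dataset towards World Exploration cs.CV | cs.AI

Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li

TL;DR: The paper introduces Sekai, a high-quality first-person-view worldwide video dataset for world exploration tasks.

Details

Motivation: Existing video generation datasets have limited locations, short durations, static scenes, and no exploration-related annotations, so they do not meet the needs of world exploration.

Result: Experiments demonstrate the quality of the dataset, and a subset is used to train YUME, an interactive video world exploration model, which illustrates the dataset's practical use.

Insight: Sekai is a valuable resource for video generation and world exploration and may drive related applications.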

Abstract: Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well-suited for world exploration training as they suffer from some limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Experiments demonstrate the quality of the dataset. We also use a subset to train an interactive video world exploration model, named YUME (meaning "dream" in Japanese). We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications.

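[53] Evolutionary Caching to Accelerate Your Off-the-Shelf Diffusion Model cs.CV

Anirud Aggarwal, Abhinav Shrivastava, Matthew Gwilliam

TL;DR: The paper proposes ECAD, which uses a genetic algorithm to learn efficient caching schedules that substantially speed up diffusion model inference while preserving generation quality.

Details

Motivation: Diffusion models are held back by slow, computationally expensive inference. Existing caching methods rely on rigid heuristics, which leads to limited acceleration or poor generalization.

Result: Across several diffusion models (such as PixArt-alpha) and benchmarks, ECAD outperforms prior methods. On PixArt-alpha it raises the inference speedup from 2.35x to 2.58x while improving COCO FID by 4.47.

Insight: ECAD's learned caching schedules generalize to resolutions and model variants not seen during calibration, which shows its generality and scalability for accelerating diffusion models.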

Abstract: Diffusion-based image generation models excel at producing high-quality synthetic content, but suffer from slow and computationally expensive inference. Prior work has attempted to mitigate this by caching and reusing features within diffusion transformers across inference steps. These methods, however, often rely on rigid heuristics that result in limited acceleration or poor generalization across architectures. We propose Evolutionary Caching to Accelerate Diffusion models (ECAD), a genetic algorithm that learns efficient, per-model, caching schedules forming a Pareto frontier, using only a small set of calibration prompts. ECAD requires no modifications to network parameters or reference images. It offers significant inference speedups, enables fine-grained control over the quality-latency trade-off, and adapts seamlessly to different diffusion models. Notably, ECAD’s learned schedules can generalize effectively to resolutions and model variants not seen during calibration. We evaluate ECAD on PixArt-alpha, PixArt-Sigma, and FLUX-1.dev using multiple metrics (FID, CLIP, Image Reward) across diverse benchmarks (COCO, MJHQ-30k, PartiPrompts), demonstrating consistent improvements over previous approaches. On PixArt-alpha, ECAD identifies a schedule that outperforms the previous state-of-the-art method by 4.47 COCO FID while increasing inference speedup from 2.35x to 2.58x. Our results establish ECAD as a scalable and generalizable approach for accelerating diffusion inference. Our project website is available at https://aniaggarwal.github.io/ecad and our code is available at https://github.com/aniaggarwal/ecad.
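
The idea of evolving a caching schedule can be sketched with a toy genetic algorithm. Everything below is an assumption made for illustration: a schedule is a bit list (1 means a step reuses cached features), the per-step `quality_cost` values are invented, and a single scalar fitness replaces the Pareto frontier over quality and latency that the paper describes.

```python
import random

def fitness(schedule, quality_cost):
    """Toy objective: one point per cached step minus its assumed quality cost."""
    speed = sum(schedule)
    loss = sum(cost for bit, cost in zip(schedule, quality_cost) if bit)
    return speed - loss

def evolve(quality_cost, pop_size=20, generations=30, mutation_rate=0.1, seed=0):
    """Evolve binary caching schedules with elitism, crossover, and mutation."""
    rng = random.Random(seed)
    n = len(quality_cost)
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]

    def score(schedule):
        return fitness(schedule, quality_cost)

    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]   # elitism: keep the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)           # one-point crossover
            child = a[:cut] + b[cut:]
            # flip each bit with probability mutation_rate
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = parents + children
    return max(population, key=score)

# Hypothetical per-step quality costs: high values mark steps that should not be cached.
quality_cost = [3.0, 0.2, 0.1, 2.5, 0.3, 0.1, 0.2, 4.0]
best = evolve(quality_cost)
```

In a real setting the fitness would come from running the diffusion model on calibration prompts and measuring quality and latency, which this sketch does not attempt.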

cs.CL [Back]

[54] Combining Constrained and Unconstrained Decoding via Boosting: BoostCD and Its Application to Information Extraction cs.CLPDF

Marija Šakota, Robert West

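TL;DR: The paper proposes BoostCD, a two-phase procedure that combines constrained and unconstrained decoding and uses a boosted model to improve performance on structured NLP tasks, validated on information extraction.

Details

Motivation: Current approaches to structured NLP tasks typically run constrained decoding with an autoregressive language model, but because the constraints are only implicit in the training outputs, output quality at test time can be low. A method is needed that combines the strengths of constrained and unconstrained decoding.

Result: On closed information extraction, BoostIE (the BoostCD-based model) outperforms prior approaches both in and out of distribution and addresses several of their common errors.

Insight: The errors of constrained and unconstrained decoding tend to be complementary, and the boosted model learns to exploit this.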

Abstract: Many recent approaches to structured NLP tasks use an autoregressive language model $M$ to map unstructured input text $x$ to output text $y$ representing structured objects (such as tuples, lists, trees, code, etc.), where the desired output structure is enforced via constrained decoding. During training, these approaches do not require the model to be aware of the constraints, which are merely implicit in the training outputs $y$. This is advantageous as it allows for dynamic constraints without requiring retraining, but can lead to low-quality output during constrained decoding at test time. We overcome this problem with Boosted Constrained Decoding (BoostCD), which combines constrained and unconstrained decoding in two phases: Phase 1 decodes from the base model $M$ twice, in constrained and unconstrained mode, obtaining two weak predictions. In phase 2, a learned autoregressive boosted model combines the two weak predictions into one final prediction. The mistakes made by the base model with vs. without constraints tend to be complementary, which the boosted model learns to exploit for improved performance. We demonstrate the power of BoostCD by applying it to closed information extraction. Our model, BoostIE, outperforms prior approaches both in and out of distribution, addressing several common errors identified in those approaches.

[55] CrEst: Credibility Estimation for Contexts in LLMs via Weak Supervision cs.CL | cs.LGPDF

Dyah Adila, Shuai Zhang, Boran Han, Bonan Min, Yuyang Wang

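TL;DR: CrEst is a weakly supervised framework that automatically estimates the credibility of context documents during LLM inference, without manual annotation, and improves model performance.

Details

Motivation: Existing methods overlook the fact that context documents vary in credibility, which can propagate unreliable information, so an automated credibility estimate is needed.

Result: Across three model architectures and five datasets, CrEst outperforms the baselines, with gains of up to 26.86% in accuracy and 3.49% in F1 score, and remains robust under high-noise conditions.

Insight: Credible documents tend to be semantically coherent with other credible documents, so credibility can be estimated automatically from inter-document agreement; incorporating it improves LLM performance.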

Abstract: The integration of contextual information has significantly enhanced the performance of large language models (LLMs) on knowledge-intensive tasks. However, existing methods often overlook a critical challenge: the credibility of context documents can vary widely, potentially leading to the propagation of unreliable information. In this paper, we introduce CrEst, a novel weakly supervised framework for assessing the credibility of context documents during LLM inference, without requiring manual annotations. Our approach is grounded in the insight that credible documents tend to exhibit higher semantic coherence with other credible documents, enabling automated credibility estimation through inter-document agreement. To incorporate credibility into LLM inference, we propose two integration strategies: a black-box approach for models without access to internal weights or activations, and a white-box method that directly modifies attention mechanisms. Extensive experiments across three model architectures and five datasets demonstrate that CrEst consistently outperforms strong baselines, achieving up to a 26.86% improvement in accuracy and a 3.49% increase in F1 score. Further analysis shows that CrEst maintains robust performance even under high-noise conditions.
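
The inter-document agreement idea can be sketched in a few lines. The scoring rule below (mean cosine similarity to the other documents) and the toy embedding vectors are assumptions for illustration only; the abstract does not specify how CrEst computes or aggregates agreement.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def agreement_scores(embeddings):
    """Score each document by its mean similarity to all other documents."""
    n = len(embeddings)
    scores = []
    for i in range(n):
        others = [cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return scores

# Hypothetical embeddings: three documents that agree and one outlier.
docs = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.95, 0.0, 0.1], [0.0, 0.1, 1.0]]
scores = agreement_scores(docs)
least_credible = min(range(len(docs)), key=lambda i: scores[i])
```

In this toy setup the outlier receives the lowest score; a downstream step could then down-weight it in the prompt (black-box) or in attention (white-box), as the abstract describes.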

[56] MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance cs.CL | cs.AIPDF

Joseph J. Peper, Wenzhao Qiu, Ali Payani, Lu Wang

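TL;DR: MDBench is a new dataset for evaluating multi-document reasoning. It is generated synthetically with knowledge-guided, LLM-assisted edits, which allows challenging document sets and matching question-answer pairs to be produced controllably and efficiently.

Details

Motivation: Few benchmarks target multi-document reasoning, largely because annotating long inputs is expensive. MDBench aims to fill this gap and to offer an efficient way to generate challenging data.

Result: MDBench poses significant challenges for current LLMs and prompting techniques, even with relatively short document sets.

Insight: Knowledge-guided generation can be adapted quickly to new challenges and also supports targeted analysis of multi-document reasoning capabilities.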

Abstract: Natural language processing evaluation has made significant progress, largely driven by the proliferation of powerful large language models (LLMs). New evaluation benchmarks are of increasing priority as the reasoning capabilities of LLMs are expanding at a rapid pace. In particular, while multi-document (MD) reasoning is an area of extreme relevance given LLM capabilities in handling longer-context inputs, few benchmarks exist to rigorously examine model behavior in this setting. Moreover, the multi-document setting is historically challenging for benchmark creation due to the expensive cost of annotating long inputs. In this work, we introduce MDBench, a new dataset for evaluating LLMs on the task of multi-document reasoning. Notably, MDBench is created through a novel synthetic generation process, allowing us to controllably and efficiently generate challenging document sets and the corresponding question-answer (QA) examples. Our novel technique operates on condensed structured seed knowledge, modifying it through LLM-assisted edits to induce MD-specific reasoning challenges. We then convert this structured knowledge into a natural text surface form, generating a document set and corresponding QA example. We analyze the behavior of popular LLMs and prompting techniques, finding that MDBench poses significant challenges for all methods, even with relatively short document sets. We also see our knowledge-guided generation technique (1) allows us to readily perform targeted analysis of MD-specific reasoning capabilities and (2) can be adapted quickly to account for new challenges and future modeling improvements.

[57] Improving Dialogue Discourse Parsing through Discourse-aware Utterance Clarification cs.CL | cs.AIPDF

Yaxin Fan, Peifeng Li, Qiaoming Zhu

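TL;DR: The paper proposes a Discourse-aware Clarification Module (DCM) and Contribution-aware Preference Optimization (CPO) to resolve ambiguities in dialogue discourse parsing caused by linguistic features such as omission and idiom, significantly improving parsing performance.

Details

Motivation: Linguistic features in dialogue, such as omission and idiom, often make discourse relations ambiguous and degrade parser performance. The paper aims to improve parsing accuracy by clarifying these ambiguities.

Result: On the STAC and Molweni datasets, the approach significantly outperforms state-of-the-art baselines.

Insight: Combining fine-grained analysis of linguistic features with contribution-aware optimization is an effective way to resolve ambiguity in dialogue discourse parsing.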

Abstract: Dialogue discourse parsing aims to identify and analyze discourse relations between the utterances within dialogues. However, linguistic features in dialogues, such as omission and idiom, frequently introduce ambiguities that obscure the intended discourse relations, posing significant challenges for parsers. To address this issue, we propose a Discourse-aware Clarification Module (DCM) to enhance the performance of the dialogue discourse parser. DCM employs two distinct reasoning processes: clarification type reasoning and discourse goal reasoning. The former analyzes linguistic features, while the latter distinguishes the intended relation from the ambiguous one. Furthermore, we introduce Contribution-aware Preference Optimization (CPO) to mitigate the risk of erroneous clarifications, thereby reducing cascading errors. CPO enables the parser to assess the contributions of the clarifications from DCM and provide feedback to optimize the DCM, enhancing its adaptability and alignment with the parser’s requirements. Extensive experiments on the STAC and Molweni datasets demonstrate that our approach effectively resolves ambiguities and significantly outperforms the state-of-the-art (SOTA) baselines.

[58] CKD-EHR: Clinical Knowledge Distillation for Electronic Health Records cs.CLPDF

Junke Wang, Hongshun Ling, Li Zhang, Longqian Zhang, Fang Wang

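TL;DR: The paper proposes CKD-EHR, a knowledge distillation framework that transfers knowledge from the large language model Qwen2.5-7B to a lightweight BERT model, improving disease prediction from electronic health records (EHR).

Details

Motivation: Existing large language models represent medical knowledge insufficiently and are inefficient to deploy clinically, which limits their use. CKD-EHR aims to address both problems.

Result: On the MIMIC-III dataset, CKD-EHR improves diagnostic accuracy by 9% and F1 score by 27% over the baseline model, with a 22.2x inference speedup.

Insight: Knowledge distillation can balance model performance and efficiency in the EHR setting, offering a practical route to resource optimization in clinical settings.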

Abstract: Electronic Health Records (EHR)-based disease prediction models have demonstrated significant clinical value in promoting precision medicine and enabling early intervention. However, existing large language models face two major challenges: insufficient representation of medical knowledge and low efficiency in clinical deployment. To address these challenges, this study proposes the CKD-EHR (Clinical Knowledge Distillation for EHR) framework, which achieves efficient and accurate disease risk prediction through knowledge distillation techniques. Specifically, the large language model Qwen2.5-7B is first fine-tuned on medical knowledge-enhanced data to serve as the teacher model. It then generates interpretable soft labels through a multi-granularity attention distillation mechanism. Finally, the distilled knowledge is transferred to a lightweight BERT student model. Experimental results show that on the MIMIC-III dataset, CKD-EHR significantly outperforms the baseline model: diagnostic accuracy is increased by 9%, F1-score is improved by 27%, and a 22.2 times inference speedup is achieved. This innovative solution not only greatly improves resource utilization efficiency but also significantly enhances the accuracy and timeliness of diagnosis, providing a practical technical approach for resource optimization in clinical settings. The code and data for this research are available at https://github.com/209506702/CKD_EHR.
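
To make the soft-label transfer concrete, here is a minimal sketch of the standard temperature-scaled distillation loss (KL divergence between teacher and student distributions). It is the textbook formulation, not the paper's multi-granularity attention distillation mechanism; the temperature value and the example logits are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher soft labels
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Hypothetical logits over three diagnosis classes.
teacher = [2.0, 0.5, -1.0]
matched = distillation_loss(teacher, teacher)
mismatched = distillation_loss([-1.0, 0.5, 2.0], teacher)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge; in practice it is usually combined with a cross-entropy term on the hard labels.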

[59] ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs cs.CLPDF

Feng He, Zijun Chen, Xinnian Liang, Tingting Ma, Yunqi Qiu

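TL;DR: ProtoReasoning is a prototype-based approach to improving LLM reasoning. It uses abstract prototype representations (Prolog and PDDL) to enable cross-domain generalization, and experiments show gains on logical reasoning, planning, general reasoning, and mathematics.

Details

Motivation: Although large reasoning models generalize well across domains, the mechanism behind this transfer is poorly understood. The paper hypothesizes that cross-domain generalization arises from shared abstract reasoning prototypes that capture the essence of problems, and ProtoReasoning is designed to test and exploit this hypothesis.

Result: Improvements of 4.7% on logical reasoning (Enigmata-Eval), 6.3% on planning tasks, 4.0% on general reasoning (MMLU), and 1.0% on mathematics (AIME24). Ablations confirm that learning in prototype space generalizes better.

Insight: Cross-domain reasoning ability may depend on shared abstract reasoning prototypes rather than on specific natural language representations, which suggests a new direction for improving LLM reasoning.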

Abstract: Recent advances in Large Reasoning Models (LRMs) trained with Long Chain-of-Thought (Long CoT) reasoning have demonstrated remarkable cross-domain generalization capabilities. However, the underlying mechanisms supporting such transfer remain poorly understood. We hypothesize that cross-domain generalization arises from shared abstract reasoning prototypes – fundamental reasoning patterns that capture the essence of problems across domains. These prototypes minimize the nuances of the representation, revealing that seemingly diverse tasks are grounded in shared reasoning structures. Based on this hypothesis, we propose ProtoReasoning, a framework that enhances the reasoning ability of LLMs by leveraging scalable and verifiable prototypical representations (Prolog for logical reasoning, PDDL for planning). ProtoReasoning features: (1) an automated prototype construction pipeline that transforms problems into corresponding prototype representations; (2) a comprehensive verification system providing reliable feedback through Prolog/PDDL interpreters; (3) the scalability to synthesize problems arbitrarily within prototype space while ensuring correctness. Extensive experiments show that ProtoReasoning achieves 4.7% improvement over baseline models on logical reasoning (Enigmata-Eval), 6.3% improvement on planning tasks, 4.0% improvement on general reasoning (MMLU) and 1.0% on mathematics (AIME24). Significantly, our ablation studies confirm that learning in prototype space also demonstrates enhanced generalization to structurally similar problems compared to training solely on natural language representations, validating our hypothesis that reasoning prototypes serve as the foundation for generalizable reasoning in large language models.

[60] MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs cs.CLPDF

Yongqi Fan, Yating Wang, Guandong Wang, Jie Zhai, Jingping Liu

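TL;DR: MinosEval is a new automatic evaluation method for open-ended question answering (QA). It distinguishes factoid from non-factoid questions and applies key-point scoring to the former and instance-aware listwise ranking to the latter, improving both agreement with human annotations and interpretability.

Details

Motivation: Automatic evaluation of open-ended QA falls short: traditional metrics such as ROUGE and BERTScore struggle to capture semantic similarity, and existing LLM-based methods neither distinguish nor adapt to question types (factoid vs. non-factoid).

Result: Experiments on multiple open-ended QA datasets show that MinosEval aligns better with human annotations and yields more interpretable results.

Insight: Evaluation of open-ended QA should use different strategies for factoid and non-factoid questions; making this distinction improves both the quality and the interpretability of automatic evaluation.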

Abstract: Open-ended question answering (QA) is a key task for evaluating the capabilities of large language models (LLMs). Compared to closed-ended QA, it demands longer answer statements, more nuanced reasoning processes, and diverse expressions, making refined and interpretable automatic evaluation both crucial and challenging. Traditional metrics like ROUGE and BERTScore struggle to capture semantic similarities due to different patterns between model responses and reference answers. Current LLM-based evaluation approaches, such as pairwise or listwise comparisons of candidate answers, lack intuitive interpretability. While pointwise scoring of each response provides some descriptions, it fails to adapt across different question contents. Most notably, existing methods overlook the distinction between factoid and non-factoid questions. To address these challenges, we propose MinosEval, a novel evaluation method that first distinguishes open-ended questions and then ranks candidate answers using different evaluation strategies. For factoid questions, it applies an adaptive key-point scoring strategy, while for non-factoid questions, it uses an instance-aware listwise ranking strategy. Experiments on multiple open-ended QA datasets, including self-built ones with more candidate responses to complement community resources, show that MinosEval better aligns with human annotations and offers more interpretable results.

[61] Research on Graph-Retrieval Augmented Generation Based on Historical Text Knowledge Graphs cs.CLPDF

Yang Fan, Zhang Qi, Xing Wenqian, Liu Chang, Liu Liu

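TL;DR: To address the domain knowledge gaps of general-purpose LLMs in historical text analysis, the paper proposes the Graph RAG framework, which combines chain-of-thought prompting, self-instruction generation, and process supervision to build a character relationship dataset from historical texts with minimal manual annotation, lowering the cost of knowledge extraction.

Details

Motivation: General-purpose LLMs lack domain knowledge for historical text analysis; combining knowledge graphs with retrieval-augmented generation is intended to improve their alignment with historical knowledge.

Result: The DeepSeek model integrated with GraphRAG improves relation extraction F1 by 11 percentage points (from 0.08 to 0.19) on the open-domain C-CLUE dataset, surpassing the domain-specific model Xunzi-Qwen1.5-14B (F1 = 0.12).

Insight: Combining knowledge graphs with retrieval-augmented generation is an effective way to address missing domain knowledge, and it supports knowledge extraction from classical texts at low resource cost.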

Abstract: This article addresses domain knowledge gaps in general large language models for historical text analysis in the context of computational humanities and AIGC technology. We propose the Graph RAG framework, combining chain-of-thought prompting, self-instruction generation, and process supervision to create a character relationship dataset for The First Four Histories with minimal manual annotation. This dataset supports automated historical knowledge extraction, reducing labor costs. In the graph-augmented generation phase, we introduce a collaborative mechanism between knowledge graphs and retrieval-augmented generation, improving the alignment of general models with historical knowledge. Experiments show that the domain-specific model Xunzi-Qwen1.5-14B, with Simplified Chinese input and chain-of-thought prompting, achieves optimal performance in relation extraction (F1 = 0.68). The DeepSeek model integrated with GraphRAG improves F1 by 11 percentage points (from 0.08 to 0.19) on the open-domain C-CLUE relation extraction dataset, surpassing the F1 value of Xunzi-Qwen1.5-14B (0.12), effectively alleviating hallucination and improving interpretability. This framework offers a low-resource solution for classical text knowledge extraction, advancing historical knowledge services and humanities research.

[62] TopClustRAG at SIGIR 2025 LiveRAG Challenge cs.CLPDF

Juli Bakagianni, John Pavlopoulos, Aristidis Likas

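TL;DR: TopClustRAG is a retrieval-augmented generation (RAG) system built for the SIGIR 2025 LiveRAG Challenge. It combines hybrid retrieval over sparse and dense indices with K-Means clustering of semantically similar passages, and synthesizes the resulting intermediate answers into a single diverse, accurate response.

Details

Motivation: In end-to-end question answering over large-scale web corpora, improving answer diversity, relevance, and faithfulness is the key challenge. TopClustRAG targets these with a multi-stage pipeline.

Result: Evaluated on the FineWeb Sample-10BT dataset, TopClustRAG ranked 2nd in faithfulness and 7th in correctness on the official leaderboard.

Insight: Clustering-based context filtering and prompt aggregation show promise for large-scale RAG systems, particularly for answer diversity and faithfulness.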

Abstract: We present TopClustRAG, a retrieval-augmented generation (RAG) system developed for the LiveRAG Challenge, which evaluates end-to-end question answering over large-scale web corpora. Our system employs a hybrid retrieval strategy combining sparse and dense indices, followed by K-Means clustering to group semantically similar passages. Representative passages from each cluster are used to construct cluster-specific prompts for a large language model (LLM), generating intermediate answers that are filtered, reranked, and finally synthesized into a single, comprehensive response. This multi-stage pipeline enhances answer diversity, relevance, and faithfulness to retrieved evidence. Evaluated on the FineWeb Sample-10BT dataset, TopClustRAG ranked 2nd in faithfulness and 7th in correctness on the official leaderboard, demonstrating the effectiveness of clustering-based context filtering and prompt aggregation in large-scale RAG systems.
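
The clustering stage can be sketched as follows. This is a simplified illustration, not the system's code: passages are assumed to be already embedded as small numeric vectors, K-Means uses a naive initialization, and choosing the passage closest to each centroid as the cluster representative is an assumption (the abstract does not say how representatives are selected).

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids (naive init)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return centroids, assignment

def representatives(points, k):
    """Return, per non-empty cluster, the index of the passage nearest the centroid."""
    centroids, assignment = kmeans(points, k)
    reps = []
    for c in range(k):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            reps.append(min(members, key=lambda i: math.dist(points[i], centroids[c])))
    return reps

# Hypothetical 2-D passage embeddings forming two well-separated groups.
passages = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
chosen = representatives(passages, k=2)
```

Each selected index would then feed a cluster-specific prompt, and the intermediate answers would be filtered, reranked, and merged as the abstract describes.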

[63] Cohort Discovery: A Survey on LLM-Assisted Clinical Trial Recruitment cs.CL | cs.AIPDF

Shrestha Ghosh, Moritz Schneider, Carina Reinicke, Carsten Eickhoff

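TL;DR: This survey reviews the potential and current challenges of applying large language models (LLMs) to clinical trial recruitment, covering the trial-patient matching task, benchmarks, and future directions.

Details

Motivation: LLMs perform well on general-domain NLP tasks, but their adoption in critical domains such as clinical trial recruitment remains limited. The paper examines how the knowledge aggregation and reasoning abilities of LLMs can improve trial-patient matching.

Result: The survey finds that existing LLM-assisted methods rely on proprietary models and weak evaluation benchmarks, and it outlines future research directions.

Insight: The potential of LLMs in clinical trial recruitment is not yet fully realized; future work should focus on model openness, standardized evaluation, and domain adaptation.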

Abstract: Recent advances in LLMs have greatly improved general-domain NLP tasks. Yet, their adoption in critical domains, such as clinical trial recruitment, remains limited. As trials are designed in natural language and patient data is represented as both structured and unstructured text, the task of matching trials and patients benefits from knowledge aggregation and reasoning abilities of LLMs. Classical approaches are trial-specific and LLMs with their ability to consolidate distributed knowledge hold the potential to build a more general solution. Yet recent applications of LLM-assisted methods rely on proprietary models and weak evaluation benchmarks. In this survey, we are the first to analyze the task of trial-patient matching and contextualize emerging LLM-based approaches in clinical trial recruitment. We critically examine existing benchmarks, approaches and evaluation frameworks, the challenges to adopting LLM technologies in clinical research and exciting future directions.

[64] DeVisE: Behavioral Testing of Medical Large Language Models cs.CLPDF

Camila Zurdo Tagliabue, Heloisa Oss Boll, Aykut Erdem, Erkut Erdem, Iacer Calixto

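TL;DR: The paper introduces DeVisE, a behavioral testing framework for evaluating the clinical reasoning of medical LLMs. It reveals differing behavior between zero-shot and fine-tuned models and underscores the importance of fairness-aware evaluation.

Details

Motivation: Current evaluation methods for medical LLMs often fail to distinguish genuine medical reasoning from superficial patterns; a finer-grained testing framework is needed to expose model behavior and reasoning strategies.

Result: Zero-shot models show more coherent counterfactual reasoning patterns, while fine-tuned models are more stable but less responsive to clinically meaningful changes. Demographic factors influence outputs subtly but consistently.

Insight: Behavioral testing can expose the reasoning strategies of medical LLMs and inform the design of safer, more transparent medical AI systems; fairness-aware evaluation is essential in medical AI.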

Abstract: Large language models (LLMs) are increasingly used in clinical decision support, yet current evaluation methods often fail to distinguish genuine medical reasoning from superficial patterns. We introduce DeVisE (Demographics and Vital signs Evaluation), a behavioral testing framework for probing fine-grained clinical understanding. We construct a dataset of ICU discharge notes from MIMIC-IV, generating both raw (real-world) and template-based (synthetic) versions with controlled single-variable counterfactuals targeting demographic (age, gender, ethnicity) and vital sign attributes. We evaluate five LLMs spanning general-purpose and medically fine-tuned variants, under both zero-shot and fine-tuned settings. We assess model behavior via (1) input-level sensitivity - how counterfactuals alter the likelihood of a note; and (2) downstream reasoning - how they affect predicted hospital length-of-stay. Our results show that zero-shot models exhibit more coherent counterfactual reasoning patterns, while fine-tuned models tend to be more stable yet less responsive to clinically meaningful changes. Notably, demographic factors subtly but consistently influence outputs, emphasizing the importance of fairness-aware evaluation. This work highlights the utility of behavioral testing in exposing the reasoning strategies of clinical LLMs and informing the design of safer, more transparent medical AI systems.

[65] SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models’ Knowledge of Indian Culture cs.CLPDF

Arijit Maji, Raghvendra Kumar, Akash Ghosh, Anushka, Sriparna Saha

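TL;DR: SANSKRITI is a new benchmark for evaluating language models' understanding of Indian culture. It contains 21,853 question-answer pairs spanning India's 28 states and 8 union territories and covers 16 cultural attributes. Evaluation shows that leading language models have clear weaknesses on culturally nuanced queries.

Details

Motivation: The global effectiveness of language models depends on understanding local socio-cultural contexts, but existing benchmarks do not cover India's cultural diversity.

Result: Many models perform poorly on culturally nuanced queries, especially in region-specific contexts.

Insight: Global applicability of language models requires stronger support for cultural diversity; region-specific benchmarks are important for evaluation and improvement.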

Abstract: Language Models (LMs) are indispensable tools shaping modern workflows, but their global effectiveness depends on understanding local socio-cultural contexts. To address this, we introduce SANSKRITI, a benchmark designed to evaluate language models’ comprehension of India’s rich cultural diversity. Comprising 21,853 meticulously curated question-answer pairs spanning 28 states and 8 union territories, SANSKRITI is the largest dataset for testing Indian cultural knowledge. It covers sixteen key attributes of Indian culture: rituals and ceremonies, history, tourism, cuisine, dance and music, costume, language, art, festivals, religion, medicine, transport, sports, nightlife, and personalities, providing a comprehensive representation of India’s cultural tapestry. We evaluate SANSKRITI on leading Large Language Models (LLMs), Indic Language Models (ILMs), and Small Language Models (SLMs), revealing significant disparities in their ability to handle culturally nuanced queries, with many models struggling in region-specific contexts. By offering an extensive, culturally rich, and diverse dataset, SANSKRITI sets a new standard for assessing and improving the cultural understanding of LMs.

[66] COSMMIC: Comment-Sensitive Multimodal Multilingual Indian Corpus for Summarization and Headline Generation cs.CLPDF

Raghvendra Kumar, S. A. Mohammed Salman, Aryan Sahu, Tridib Nandi, Pragathi Y. P.

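TL;DR: COSMMIC is a multimodal, multilingual Indian corpus that combines articles, images, and reader comments to support summarization and headline generation.

Details

Motivation: Existing multimodal and multilingual research focuses mainly on English and Chinese, while Indian languages remain under-studied. COSMMIC aims to fill this gap.

Result: Experiments indicate that multimodal configurations combining user comments and images significantly improve the quality of generated summaries and headlines.

Insight: Comments and images supply additional context for summarization; fusing them supports a fuller understanding, especially in multilingual, multimodal tasks.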

Abstract: Despite progress in comment-aware multimodal and multilingual summarization for English and Chinese, research in Indian languages remains limited. This study addresses this gap by introducing COSMMIC, a pioneering comment-sensitive multimodal, multilingual dataset featuring nine major Indian languages. COSMMIC comprises 4,959 article-image pairs and 24,484 reader comments, with ground-truth summaries available in all included languages. Our approach enhances summaries by integrating reader insights and feedback. We explore summarization and headline generation across four configurations: (1) using article text alone, (2) incorporating user comments, (3) utilizing images, and (4) combining text, comments, and images. To assess the dataset’s effectiveness, we employ state-of-the-art language models such as LLama3 and GPT-4. We conduct a comprehensive study to evaluate different component combinations, including identifying supportive comments, filtering out noise using a dedicated comment classifier using IndicBERT, and extracting valuable insights from images with a multilingual CLIP-based classifier. This helps determine the most effective configurations for natural language generation (NLG) tasks. Unlike many existing datasets that are either text-only or lack user comments in multimodal settings, COSMMIC uniquely integrates text, images, and user feedback. This holistic approach bridges gaps in Indian language resources, advancing NLP research and fostering inclusivity.

[67] Targeted Lexical Injection: Unlocking Latent Cross-Lingual Alignment in Lugha-Llama via Early-Layer LoRA Fine-Tuning cs.CL | 68T50 | I.2.7; I.2.6PDF

Stanley Ngugi

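TL;DR: The paper proposes Targeted Lexical Injection (TLI), which uses early-layer LoRA fine-tuning to improve cross-lingual lexical alignment for low-resource languages such as Swahili in large language models.

Details

Motivation: LLMs underperform on low-resource languages, particularly in cross-lingual lexical alignment, where there is a gap between what the model's output-level representations show and the latent capability of its internal layers.

Result: TLI raises the average cosine similarity of the trained Swahili-English word pairs from 0.3211 to 0.4113, and the gain generalizes to unseen pairs (from 0.3143 to 0.4033).

Insight: Latent cross-lingual knowledge in early layers can be exploited and propagated through TLI, giving a parameter-efficient strategy for improving low-resource language performance.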

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their performance in low-resource languages (LRLs), such as Swahili, often lags due to data scarcity and underrepresentation in pre-training. A key challenge is achieving robust cross-lingual lexical alignment, crucial for tasks like translation and cross-lingual information retrieval. This paper introduces Targeted Lexical Injection (TLI), a novel and efficient fine-tuning approach. We first demonstrate that Lugha-Llama-8B-wura, a Swahili-centric LLM, exhibits strong, near-perfect lexical alignment for Swahili-English word pairs in its early internal layers (specifically Layer 2, with ~0.99998 average cosine similarity based on a pilot study), a capability not fully reflected in its final output representations (baseline ~0.32 similarity on our evaluation set). TLI leverages this insight by using Low-Rank Adaptation (LoRA) and a contrastive learning objective to fine-tune the model, specifically targeting embeddings from this empirically identified optimal early layer. Our experiments show that TLI significantly improves the output-level lexical alignment for 623 trained Swahili-English word pairs, increasing average cosine similarity from 0.3211 to 0.4113 (+28.08%, p < 1.33 x 10^-240). More importantly, these improvements generalize remarkably well to 63 unseen control word pairs, with similarity increasing from 0.3143 to 0.4033 (+28.32%, p < 7.17 x 10^-27). These findings suggest TLI enhances the model’s ability to preserve and propagate its inherent early-layer cross-lingual knowledge, offering a parameter-efficient and effective strategy for improving lexical alignment in LRL-focused LLMs.
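
As a quick arithmetic check, the relative gains quoted in the abstract can be recomputed from the reported mean similarities. The helper name `relative_gain` is introduced here for illustration; because the published means are rounded to four decimals, the recomputed percentages agree with the reported ones only to within rounding.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100

# Mean cosine similarities as reported in the abstract.
trained_gain = relative_gain(0.3211, 0.4113)  # 623 trained word pairs
unseen_gain = relative_gain(0.3143, 0.4033)   # 63 unseen control pairs
```

Both values come out at roughly 28%, so the improvement on unseen pairs is about as large as on the trained pairs, which is the basis for the generalization claim.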

[68] Understanding GUI Agent Localization Biases through Logit Sharpness cs.CLPDF

Xingjian Tao, Yiwei Wang, Yujun Cai, Zhicheng Yang, Jing Tang

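TL;DR: The paper proposes a fine-grained evaluation framework and the Peak Sharpness Score (PSS) metric for analyzing localization biases in GUI agents, and improves model performance with Context-Aware Cropping.

Details

Motivation: Multimodal large language models (MLLMs) perform well as GUI agents, but their localization errors (hallucinations) compromise reliability. A finer-grained evaluation is needed to reveal failure modes that traditional accuracy metrics miss.

Result: Experiments show that the framework and methods improve the interpretability and robustness of GUI agent behavior and yield actionable insights for model improvement.

Insight: Localization errors stem in part from a mismatch between semantic continuity and the logits distribution; adaptively refining the input context mitigates the problem.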

Abstract: Multimodal large language models (MLLMs) have enabled GUI agents to interact with operating systems by grounding language into spatial actions. Despite their promising performance, these models frequently exhibit hallucinations: systematic localization errors that compromise reliability. We propose a fine-grained evaluation framework that categorizes model predictions into four distinct types, revealing nuanced failure modes beyond traditional accuracy metrics. To better quantify model uncertainty, we introduce the Peak Sharpness Score (PSS), a metric that evaluates the alignment between semantic continuity and logits distribution in coordinate prediction. Building on this insight, we further propose Context-Aware Cropping, a training-free technique that improves model performance by adaptively refining input context. Extensive experiments demonstrate that our framework and methods provide actionable insights and enhance the interpretability and robustness of GUI agent behavior.

[69] AgentGroupChat-V2: Divide-and-Conquer Is What LLM-Based Multi-Agent System Need cs.CLPDF

Zhouhong Gu, Xiaoxuan Zhu, Yin Cai, Hao Shen, Xingzhou Chen

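TL;DR: AgentGroupChat-V2 is an LLM-based multi-agent system framework that improves complex task solving through a divide-and-conquer parallel architecture, an adaptive collaboration engine, and agent organization optimization strategies.

Details

Motivation: Current LLM-based multi-agent systems face challenges in system architecture design, cross-domain generalizability, and performance guarantees, particularly as task complexity and the number of agents increase.

Result: The framework performs strongly on GSM8K, AIME, and HumanEval, with the largest gains on difficult reasoning tasks such as Level 5 MATH problems.

Insight: Divide-and-conquer strategies and dynamic collaboration mechanisms are key to improving multi-agent system performance.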

Abstract: Large language model based multi-agent systems have demonstrated significant potential in social simulation and complex task resolution domains. However, current frameworks face critical challenges in system architecture design, cross-domain generalizability, and performance guarantees, particularly as task complexity and the number of agents increase. We introduce AgentGroupChat-V2, a novel framework addressing these challenges through three core innovations: (1) a divide-and-conquer fully parallel architecture that decomposes user queries into hierarchical task forest structures enabling dependency management and distributed concurrent processing; (2) an adaptive collaboration engine that dynamically selects heterogeneous LLM combinations and interaction modes based on task characteristics; and (3) agent organization optimization strategies combining divide-and-conquer approaches for efficient problem decomposition. Extensive experiments demonstrate AgentGroupChat-V2’s superior performance across diverse domains, achieving 91.50% accuracy on GSM8K (exceeding the best baseline by 5.6 percentage points), 30.4% accuracy on competition-level AIME (nearly doubling other methods), and 79.20% pass@1 on HumanEval. Performance advantages become increasingly pronounced with higher task difficulty, particularly on Level 5 MATH problems where improvements exceed 11 percentage points compared to state-of-the-art baselines. These results confirm that AgentGroupChat-V2 provides a comprehensive solution for building efficient, general-purpose LLM multi-agent systems with significant advantages in complex reasoning scenarios. Code is available at https://github.com/MikeGu721/AgentGroupChat-V2.

[70] RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation cs.CL | cs.AIPDF

Xinnuo Xu, Rachel Lawrence, Kshitij Dubey, Atharva Pandey, Risa Ueno

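TL;DR: RE-IMAGINE is a framework based on symbolic representations for evaluating the genuine reasoning ability of large language models (LLMs). It generates problem variations at different levels of a reasoning hierarchy to distinguish memorization from true reasoning.

Details

Motivation: The high accuracy of LLMs on reasoning benchmarks may come from statistical recall of training data rather than true reasoning, and a systematic way to tell the two apart has been missing.

Result: LLM performance drops on the problem variations, indicating that past results relied in part on memorization.

Insight: Symbolic generation of problem variations is an effective evaluation tool, and future work can target skills at different levels of the reasoning hierarchy.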

Abstract: Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks. However, it is still unclear whether the observed results arise from true reasoning or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE, a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy.
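
A toy sketch can show why editing a symbolic form yields fresh problems with known answers. The template, the `answer` function, and the value ranges below are all hypothetical, and the sketch covers only the simplest kind of variation (resampling values); the paper's pipeline and its hierarchy of variation levels are not reproduced here.

```python
import random

# Hypothetical symbolic form of one word problem: a text template plus a program.
TEMPLATE = "Ann has {a} apples and buys {b} more. She gives away {c}. How many are left?"

def answer(a, b, c):
    """Ground-truth program for the template."""
    return a + b - c

def sample_variation(rng):
    """Resample the symbolic values and recompute the answer from the program."""
    a, b = rng.randint(1, 50), rng.randint(1, 50)
    c = rng.randint(0, a + b)  # keep the answer non-negative
    return TEMPLATE.format(a=a, b=b, c=c), answer(a, b, c)

rng = random.Random(0)
variations = [sample_variation(rng) for _ in range(5)]
```

Because each answer is computed from the program rather than looked up, a model that merely memorized the original benchmark item gains nothing on the variations.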

[71] Context-Informed Grounding Supervision cs.CL | cs.AIPDF

Hyunji Lee, Seunghyun Yoon, Yunjae Won, Hanseok Oh, Geewook Kim

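TL;DR: The paper proposes CINGS, a supervision method that trains the model with context prepended to the response while computing the loss only over the response tokens, improving grounding in both textual and visual domains.

Details

Motivation: Large language models (LLMs) are often supplemented with external knowledge to supply information not encoded in their parameters or to reduce hallucination. Simply appending context does not ensure context-grounded responses, so a training method that strengthens grounding is needed.

Result: CINGS outperforms other training methods across 11 information-seeking datasets and reduces hallucinations in vision-language tasks, without degrading general downstream performance.

Insight: CINGS shifts the model's prior knowledge and behavior, implicitly encouraging greater reliance on external context.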

Abstract: Large language models (LLMs) are often supplemented with external knowledge to provide information not encoded in their parameters or to reduce hallucination. In such cases, we expect the model to generate responses by grounding its response in the provided external context. However, prior work has shown that simply appending context at inference time does not ensure grounded generation. To address this, we propose Context-INformed Grounding Supervision (CINGS), a post-training supervision in which the model is trained with relevant context prepended to the response, while computing the loss only over the response tokens and masking out the context. Our experiments demonstrate that models trained with CINGS exhibit stronger grounding in both textual and visual domains compared to standard instruction-tuned models. In the text domain, CINGS outperforms other training methods across 11 information-seeking datasets and is complementary to inference-time grounding techniques. In the vision-language domain, replacing a vision-language model’s LLM backbone with a CINGS-trained model reduces hallucinations across four benchmarks and maintains factual consistency throughout the generated response. This improved grounding comes without degradation in general downstream performance. Finally, we analyze the mechanism underlying the enhanced grounding in CINGS and find that it induces a shift in the model’s prior knowledge and behavior, implicitly encouraging greater reliance on the external context.
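
The loss masking described in the abstract can be sketched without any deep learning framework. The per-token log-probabilities and the mask below are made-up values; in real training they would come from the model's output over the concatenated context and response.

```python
import math

def masked_nll(token_log_probs, is_response):
    """Average negative log-likelihood over response tokens only.

    Context tokens (is_response False) are masked out and contribute no loss.
    """
    kept = [-lp for lp, keep in zip(token_log_probs, is_response) if keep]
    return sum(kept) / len(kept) if kept else 0.0

# Hypothetical sequence: 3 context tokens followed by 2 response tokens.
log_probs = [math.log(0.1), math.log(0.2), math.log(0.3), math.log(0.5), math.log(0.25)]
mask = [False, False, False, True, True]
loss = masked_nll(log_probs, mask)
```

The context still conditions the response tokens' probabilities, but the model is never trained to reproduce the context itself, only to generate the response given it.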

[72] SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling cs.CL | cs.AI | cs.LG

Md Imbesat Hassan Rizvi, Xiaodan Zhu, Iryna Gurevych

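TL;DR: This paper proposes SPARE, a framework for efficient, high-quality single-pass automatic annotation and process supervision that uses reference-guided evaluation to improve LLM performance on multi-step reasoning tasks.

Details

Motivation: Demand for process supervision in multi-step reasoning keeps growing, yet efficient, high-quality automatic annotation remains a challenge.

Result: SPARE improves reasoning performance over baselines across mathematical reasoning, multi-hop compositional question answering, and spatial reasoning, and is 2.6 times more efficient than tree search-based automatic annotation.

Insight: Reference-guided step-level evaluation is an effective form of process supervision that improves LLM performance on multi-step reasoning while making automatic annotation more efficient.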

Abstract: Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables single-pass, per-step annotation by aligning each solution step to one or multiple steps in a reference solution, accompanied by explicit reasoning for evaluation. We show that reference-guided step-level evaluation effectively facilitates process supervision on four datasets spanning three domains: mathematical reasoning, multi-hop compositional question answering, and spatial reasoning. We demonstrate that SPARE, when compared to baselines, improves reasoning performance when used for: (1) fine-tuning models in an offline RL setup for inference-time greedy-decoding, and (2) training reward models for ranking/aggregating multiple LLM-generated outputs. Additionally, SPARE achieves competitive performance on challenging mathematical datasets while offering 2.6 times greater efficiency, requiring only 38% of the runtime, compared to tree search-based automatic annotation. The codebase, along with a trained SPARE-PRM model, is publicly released to facilitate further research and reproducibility.

[73] Lessons from Training Grounded LLMs with Verifiable Rewards cs.CL

Shang Hong Sim, Tej Deep Pala, Vernon Toh, Hai Leong Chieu, Amir Zadeh

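TL;DR: The paper explores how reinforcement learning (RL) and internal reasoning can make large language model (LLM) responses more reliable; training with GRPO substantially improves the handling of unanswerable questions and citation generation.

Details

Motivation: Although retrieval-augmented generation (RAG) with citation-based grounding is promising, instruction-tuned models still fail frequently in simple scenarios: missing explicitly stated answers, citing incorrectly, or refusing to answer when evidence is available. The study aims to address these failures and improve response reliability.

Result: Experiments on ASQA, QAMPARI, ELI5, and ExpertQA show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and producing well-cited responses.

Insight: The study highlights the value of reasoning, stage-wise optimization, and outcome-based RL for building verifiable and reliable LLMs. Combining instruction tuning with GRPO further improves performance on long-form generative QA tasks.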

Abstract: Generating grounded and trustworthy responses remains a key challenge for large language models (LLMs). While retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models frequently fail even in straightforward scenarios: missing explicitly stated answers, citing incorrectly, or refusing when evidence is available. In this work, we explore how reinforcement learning (RL) and internal reasoning can enhance grounding in LLMs. We use the GRPO (Group Relative Policy Optimization) method to train models using verifiable outcome-based rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations. Through comprehensive experiments across ASQA, QAMPARI, ELI5, and ExpertQA we show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and generating well-cited responses. A two-stage training setup, first optimizing answer and citation behavior and then refusal, further improves grounding by stabilizing the learning signal. Additionally, we revisit instruction tuning via GPT-4 distillation and find that combining it with GRPO enhances performance on long-form, generative QA tasks. Overall, our findings highlight the value of reasoning, stage-wise optimization, and outcome-driven RL for building more verifiable and reliable LLMs.
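
The group-relative step that gives GRPO its name can be sketched as below. This is a simplified illustration under stated assumptions, not the paper's training code: each sampled response is assumed to have already received a scalar outcome reward, the population standard deviation is used for normalization (implementations differ on this detail), and `group_relative_advantages` is a hypothetical helper name.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and spread of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; eps guards against zero spread
    return [(r - mean) / (std + eps) for r in rewards]

# Four responses sampled for the same prompt, with assumed 0/1 outcome rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every response earns the same reward, no response is preferred over another.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Because advantages are computed relative to other samples for the same prompt, no separate value network is needed; the verifiable reward alone ranks the responses within each group.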

[74] SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification cs.CL

Chengye Wang, Yifei Shen, Zexi Kuang, Arman Cohan, Yilun Zhao

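TL;DR: SciVer is the first benchmark designed to evaluate foundation models on multimodal scientific claim verification. It contains 3,000 expert-annotated examples covering four common reasoning types. An evaluation of 21 state-of-the-art multimodal foundation models (including o4-mini and Gemini-2.5-Flash) shows a substantial performance gap to human experts. Analysis of RAG and human-conducted error evaluation reveals the limitations of open-source models.

Details

Motivation: Multimodal scientific claim verification lacks a dedicated benchmark, and existing foundation models have not been systematically evaluated on it. SciVer fills this gap and aims to advance models' comprehension of and reasoning over scientific literature.

Result: There is a substantial performance gap between models and human experts, especially on complex reasoning tasks. Open-source models fall clearly short.

Insight: Multimodal scientific claim verification requires stronger contextual understanding and reasoning, and current RAG techniques need further improvement to narrow the gap to human experts.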

Abstract: We introduce SciVer, the first benchmark specifically designed to evaluate the ability of foundation models to verify claims within a multimodal scientific context. SciVer consists of 3,000 expert-annotated examples over 1,113 scientific papers, covering four subsets, each representing a common reasoning type in multimodal scientific claim verification. To enable fine-grained evaluation, each example includes expert-annotated supporting evidence. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiment reveals a substantial performance gap between these models and human experts on SciVer. Through an in-depth analysis of retrieval-augmented generation (RAG), and human-conducted error evaluations, we identify critical limitations in current open-source models, offering key insights to advance models’ comprehension and reasoning in multimodal scientific literature tasks.

[75] DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement cs.CL

Shaoqing Lin, Chong Teng, Fei Li, Donghong Ji, Lizhen Qu

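TL;DR: The paper introduces a new task, Discourse-level text Scene Graph parsing (DiscoSG), releases the dataset DiscoSG-DS, and proposes DiscoSG-Refiner, which improves multi-sentence scene graph parsing through iterative graph refinement.

Details

Motivation: Existing text scene graph parsers target single-sentence captions and cannot handle discourse-level phenomena such as cross-sentence coreference, producing incomplete graphs that hurt downstream task performance.

Result: DiscoSG-Refiner improves SPICE by about 30% over the best baseline, runs inference 86 times faster than GPT-4, and improves downstream tasks such as caption evaluation and hallucination detection.

Insight: Iterative refinement can improve performance while lowering compute cost and suits complex graph generation tasks; discourse-level scene graph parsing is key to better VLM task performance.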

Abstract: Vision-Language Models (VLMs) now generate discourse-level, multi-sentence visual descriptions, challenging text scene graph parsers originally designed for single-sentence caption-to-graph mapping. Current approaches typically merge sentence-level parsing outputs for discourse input, often missing phenomena like cross-sentence coreference, resulting in fragmented graphs and degraded downstream VLM task performance. To address this, we introduce a new task, Discourse-level text Scene Graph parsing (DiscoSG), supported by our dataset DiscoSG-DS, which comprises 400 expert-annotated and 8,430 synthesised multi-sentence caption-graph pairs for images. Each caption averages 9 sentences, and each graph contains at least 3 times more triples than those in existing datasets. While fine-tuning large PLMs (i.e., GPT-4) on DiscoSG-DS improves SPICE by approximately 48% over the best sentence-merging baseline, high inference cost and restrictive licensing hinder its open-source use, and smaller fine-tuned PLMs struggle with complex graphs. We propose DiscoSG-Refiner, which drafts a base graph using one small PLM, then employs a second PLM to iteratively propose graph edits, reducing full-graph generation overhead. Using two Flan-T5-Base models, DiscoSG-Refiner still improves SPICE by approximately 30% over the best baseline while achieving 86 times faster inference than GPT-4. It also consistently improves downstream VLM tasks like discourse-level caption evaluation and hallucination detection. Code and data are available at: https://github.com/ShaoqLin/DiscoSG

[76] WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts cs.CL | cs.AI | cs.LG

Negar Foroutan, Angelika Romanou, Matin Ansaripour, Julian Martin Eisenschlos, Karl Aberer

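TL;DR: WikiMixQA is a multimodal benchmark of 1,000 multiple-choice questions that evaluates cross-modal reasoning over tables and charts extracted from Wikipedia pages, exposing the limits of current vision-language models in long-context multimodal reasoning.

Details

Motivation: Documents often contain complex layouts, tables, and charts, which pose major challenges for automatic document understanding (DU). Although vision-language large models (VLLMs) perform well on many tasks, their effectiveness on long-context visual input remains unclear.

Result: Proprietary models reach about 70% accuracy with direct context, but performance drops sharply when retrieval from long documents is required; GPT-4-o is the only model above 50% accuracy in that setting, and the best open-source model reaches only 27%.

Insight: Long-context multimodal reasoning remains a major challenge for vision-language models, and WikiMixQA provides a key evaluation standard for future document understanding research.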

Abstract: Documents are fundamental to preserving and disseminating information, often incorporating complex layouts, tables, and charts that pose significant challenges for automatic document understanding (DU). While vision-language large models (VLLMs) have demonstrated improvements across various tasks, their effectiveness in processing long-context vision inputs remains unclear. This paper introduces WikiMixQA, a benchmark comprising 1,000 multiple-choice questions (MCQs) designed to evaluate cross-modal reasoning over tables and charts extracted from 4,000 Wikipedia pages spanning seven distinct topics. Unlike existing benchmarks, WikiMixQA emphasizes complex reasoning by requiring models to synthesize information from multiple modalities. We evaluate 12 state-of-the-art vision-language models, revealing that while proprietary models achieve ~70% accuracy when provided with direct context, their performance deteriorates significantly when retrieval from long documents is required. Among these, GPT-4-o is the only model exceeding 50% accuracy in this setting, whereas open-source models perform considerably worse, with a maximum accuracy of 27%. These findings underscore the challenges of long-context, multi-modal reasoning and establish WikiMixQA as a crucial benchmark for advancing document understanding research.

[77] Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability cs.CL | cs.AI

Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

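TL;DR: The paper proposes the Ordered CommonGen benchmark to evaluate the compositional generalization and instruction-following abilities of large language models (LLMs). It finds that LLMs are biased toward certain concept orders, and that the best model reaches only about 75% ordered coverage.

Details

Motivation: To evaluate compositional generalization and instruction following at the same time, the authors propose a new benchmark, Ordered CommonGen, which tests whether generated sentences follow a specified concept order.

Result: Although LLMs generally understand the intent of the instruction, biases toward specific order patterns lead to low-diversity or identical outputs; the most instruction-compliant model achieves only about 75% ordered coverage.

Insight: LLMs still have room to improve in both instruction following and compositional generalization, and the benchmark design helps drive further progress in model capability.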

Abstract: In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all given concepts. However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order. To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities. We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is altered. Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvements in both instruction-following and compositional generalization capabilities.
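
One way to picture an ordered-coverage check is the sketch below. The benchmark's exact metric definition is not given in this abstract, so this is a hypothetical simplification: concepts are matched greedily, left to right, by exact lowercase token match, with no lemmatization or punctuation handling.

```python
def ordered_coverage(concepts: list[str], sentence: str) -> float:
    """Fraction of concepts found in the sentence in the specified order."""
    if not concepts:
        return 0.0
    tokens = sentence.lower().split()
    matched = 0
    pos = 0  # search for the next concept only after the previous match
    for concept in concepts:
        try:
            pos = tokens.index(concept.lower(), pos) + 1
        except ValueError:
            break  # concept missing or out of order; stop counting
        matched += 1
    return matched / len(concepts)

concepts = ["dog", "ball", "park"]
in_order = ordered_coverage(concepts, "the dog chased a ball in the park")
reordered = ordered_coverage(concepts, "in the park the dog chased a ball")
```

In the second sentence all three concepts appear, so plain coverage would be complete, but "park" comes before "dog" and the ordered score drops. That difference is what separates instruction following from concept coverage alone.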

[78] CC-LEARN: Cohort-based Consistency Learning cs.CL

Xiao Ye, Shaswat Shrivastava, Zhaonan Li, Jacob Dineen, Shijie Lu

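TL;DR: CC-LEARN is a cohort-based consistency learning framework that uses reinforcement learning to make large language model reasoning more reliable.

Details

Motivation: Large language models perform well on many tasks but still fall short on consistent, robust reasoning. The paper aims to improve reasoning stability through cohort-level training.

Result: On benchmarks including ARC-Challenge and StrategyQA, CC-LEARN improves both accuracy and reasoning stability.

Insight: Cohort-level reinforcement learning effectively enhances the reasoning consistency of language models and outperforms conventional supervised fine-tuning.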

Abstract: Large language models excel at many tasks but still struggle with consistent, robust reasoning. We introduce Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. To enforce cohort-level consistency, we define a composite objective combining cohort accuracy, a retrieval bonus for effective problem decomposition, and a rejection penalty for trivial or invalid lookups that reinforcement learning can directly optimize, unlike supervised fine-tuning. Optimizing this reward guides the model to adopt uniform reasoning patterns across all cohort members. Experiments on challenging reasoning benchmarks (including ARC-Challenge and StrategyQA) show that CC-Learn boosts both accuracy and reasoning stability over pretrained and SFT baselines. These results demonstrate that cohort-level RL effectively enhances reasoning consistency in LLMs.
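
A composite cohort-level reward of the kind described in this abstract could look like the sketch below. The abstract names the three terms but not how they are combined, so the linear form, the weights `bonus` and `penalty`, and the function name `cohort_reward` are all assumptions made for illustration.

```python
def cohort_reward(
    correct: list[bool],
    n_useful_retrievals: int,
    n_rejected_lookups: int,
    bonus: float = 0.1,    # assumed weight for effective decomposition
    penalty: float = 0.1,  # assumed weight for trivial or invalid lookups
) -> float:
    """Cohort accuracy plus a retrieval bonus minus a rejection penalty."""
    if not correct:
        raise ValueError("cohort must contain at least one question")
    accuracy = sum(correct) / len(correct)
    return accuracy + bonus * n_useful_retrievals - penalty * n_rejected_lookups

# A cohort of four similar questions answered with one shared reasoning program.
reward = cohort_reward([True, True, False, True], n_useful_retrievals=2, n_rejected_lookups=1)
```

Scoring the cohort as a unit, rather than each question separately, is what rewards a reasoning pattern that works uniformly across all cohort members.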

[79] Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers cs.CL | cs.AI | cs.CR

Tommaso Green, Martin Gubri, Haritz Puerto, Sangdoo Yun, Seong Joon Oh

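TL;DR: The reasoning traces of large reasoning models can leak sensitive user data, and more reasoning steps amplify the leakage, revealing a tension between reasoning capability and the privacy attack surface.

Details

Motivation: To study how the internal reasoning traces (rather than the final outputs) of large reasoning models used as personal agents can leak user privacy.

Result: More reasoning steps make the model more cautious in its final answers but also cause more privacy leakage in its reasoning.

Insight: Improving a model's reasoning capability must go hand in hand with addressing the privacy risks of its internal reasoning traces.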

Abstract: We study privacy leakage in the reasoning traces of large reasoning models used as personal agents. Unlike final outputs, reasoning traces are often assumed to be internal and safe. We challenge this assumption by showing that reasoning traces frequently contain sensitive user data, which can be extracted via prompt injections or accidentally leak into outputs. Through probing and agentic evaluations, we demonstrate that test-time compute approaches, particularly increased reasoning steps, amplify such leakage. While increasing the budget of those test-time compute approaches makes models more cautious in their final answers, it also leads them to reason more verbosely and leak more in their own thinking. This reveals a core tension: reasoning improves utility but enlarges the privacy attack surface. We argue that safety efforts must extend to the model’s internal thinking, not just its outputs.

[80] GenRecal: Generation after Recalibration from Large to Small Vision-Language Models cs.CL

Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu

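TL;DR: GenRecal is a general-purpose distillation framework that recalibrates feature representations between heterogeneous vision-language models (VLMs) to enable effective knowledge transfer.

Details

Motivation: Large VLMs are hard to deploy on resource-constrained devices, and existing distillation methods are restricted to specific VLM types because of the diversity of VLM architectures.

Result: Across multiple benchmarks, GenRecal markedly improves small-model performance, eventually outperforming large open- and closed-source VLMs.

Insight: Feature alignment addresses the diversity of VLM architectures and offers a general solution for model compression.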

Abstract: Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge arises from the diversity of VLM architectures, which are built on different LLMs and employ varying token types that differ in vocabulary size, token splits, and token index ordering. To remove the restriction of distillation to a specific VLM type, we present Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performances, eventually outperforming large-scale open- and closed-source VLMs.

cs.MM

[81] Omnidirectional Video Super-Resolution using Deep Learning cs.MM | cs.CV | cs.LG

Arbind Agrahari Baniya, Tsz-Kwan Lee, Peter W. Eklund, Sunil Aryal

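TL;DR: The paper proposes S3PO, a deep learning method for 360° video super-resolution, and addresses the limitations of conventional VSR methods with a new dataset and model.

Details

Motivation: Low resolution limits the widespread use of 360° video in VR, and existing VSR techniques do not handle the projection distortion specific to such video.

Result: S3PO outperforms most state-of-the-art conventional VSR models and 360°-specific super-resolution models on 360° video datasets.

Insight: A step-wise ablation study confirms the effectiveness of the model design and underlines the importance of the purpose-built feature extractor and loss function.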

Abstract: Omnidirectional Videos (or 360° videos) are widely used in Virtual Reality (VR) to facilitate immersive and interactive viewing experiences. However, the limited spatial resolution in 360° videos does not allow for each degree of view to be represented with adequate pixels, limiting the visual quality offered in the immersive experience. Deep learning Video Super-Resolution (VSR) techniques used for conventional videos could provide a promising software-based solution; however, these techniques do not tackle the distortion present in equirectangular projections of 360° video signals. An additional obstacle is the limited availability of 360° video datasets for study. To address these issues, this paper creates a novel 360° Video Dataset (360VDS) with a study of the extensibility of conventional VSR models to 360° videos. This paper further proposes a novel deep learning model for 360° Video Super-Resolution (360° VSR), called Spherical Signal Super-resolution with a Proportioned Optimisation (S3PO). S3PO adopts recurrent modelling with an attention mechanism, unbound from conventional VSR techniques like alignment. With a purpose-built feature extractor and a novel loss function addressing spherical distortion, S3PO outperforms most state-of-the-art conventional VSR models and 360°-specific super-resolution models on 360° video datasets. A step-wise ablation study is presented to understand and demonstrate the impact of the chosen architectural sub-components, targeted training and optimisation.

cs.SI

[82] Detecting Narrative Shifts through Persistent Structures: A Topological Analysis of Media Discourse cs.SI | cs.CL | physics.soc-ph | 55U10

Mark M. Bailey, Mark I. Heiligman

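TL;DR: The paper proposes a topological analysis method based on persistent homology to detect changes in the narrative structure of media coverage, without prior knowledge of specific events.

Details

Motivation: How global events reshape public discourse is an important question. Conventional text analysis struggles to capture abrupt changes in narrative structure, a gap that topological methods can fill.

Result: Major events coincide with sharp spikes in H0 and H1, and H0 changes usually precede H1 changes. Persistence entropy distinguishes tightly focused from diffuse narrative regimes.

Insight: Persistent homology provides a mathematical basis for unsupervised detection of inflection points in public discourse, applicable to crises, protests, and other information shocks.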

Abstract: How can we detect when global events fundamentally reshape public discourse? This study introduces a topological framework for identifying structural change in media narratives using persistent homology. Drawing on international news articles surrounding major events - including the Russian invasion of Ukraine (Feb 2022), the murder of George Floyd (May 2020), the U.S. Capitol insurrection (Jan 2021), and the Hamas-led invasion of Israel (Oct 2023) - we construct daily co-occurrence graphs of noun phrases to trace evolving discourse. Each graph is embedded and transformed into a persistence diagram via a Vietoris-Rips filtration. We then compute Wasserstein distances and persistence entropies across homological dimensions to capture semantic disruption and narrative volatility over time. Our results show that major geopolitical and social events align with sharp spikes in both H0 (connected components) and H1 (loops), indicating sudden reorganization in narrative structure and coherence. Cross-correlation analyses reveal a typical lag pattern in which changes to component-level structure (H0) precede higher-order motif shifts (H1), suggesting a bottom-up cascade of semantic change. An exception occurs during the Russian invasion of Ukraine, where H1 entropy leads H0, possibly reflecting top-down narrative framing before local discourse adjusts. Persistence entropy further distinguishes tightly focused from diffuse narrative regimes. These findings demonstrate that persistent homology offers a mathematically principled, unsupervised method for detecting inflection points and directional shifts in public attention - without requiring prior knowledge of specific events. This topological approach advances computational social science by enabling real-time detection of semantic restructuring during crises, protests, and information shocks.
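
Persistence entropy, as used in this abstract, is commonly defined as the Shannon entropy of the normalized bar lifetimes in a persistence diagram. The sketch below follows that standard definition; it assumes the diagram is given as a list of `(birth, death)` pairs and skips infinite bars, which is one common convention rather than something the abstract specifies.

```python
import math

def persistence_entropy(diagram: list[tuple[float, float]]) -> float:
    """Shannon entropy of the lifetimes (death - birth), normalized to sum to 1."""
    lifetimes = [d - b for b, d in diagram if math.isfinite(d) and d > b]
    total = sum(lifetimes)
    if total == 0:
        return 0.0  # empty diagram or no finite bars
    return -sum((l / total) * math.log(l / total) for l in lifetimes)

# Two equally persistent features: lifetimes are spread evenly.
even = persistence_entropy([(0.0, 1.0), (0.0, 1.0)])
# A single dominant feature: all persistence is concentrated in one bar.
single = persistence_entropy([(0.0, 2.0)])
```

Entropy is lowest when one feature carries all of the persistence and grows as persistence spreads across many features of similar lifetime.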

eess.IV

[83] Deploying and Evaluating Multiple Deep Learning Models on Edge Devices for Diabetic Retinopathy Detection eess.IV | cs.AI | cs.CV

Akwasi Asare, Dennis Agyemanh Nana Gookyi, Derrick Boateng, Fortunatus Aabangbio Wulnye

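TL;DR: The study presents real-time diabetic retinopathy (DR) detection on edge devices, using several deep learning models and optimization techniques to achieve efficient DR diagnosis on resource-constrained hardware.

Details

Motivation: Diabetic retinopathy is a leading cause of vision impairment among people with diabetes worldwide. Traditional diagnosis is time-consuming and resource-intensive, so an efficient, low-cost automated detection approach is needed.

Result: MobileNet reaches 96.45% accuracy; SqueezeNet has a latency of only 17 ms on GPU with a model size of 176 KB. ShuffleNet and the custom DNN excel in resource efficiency.

Insight: Edge AI offers a scalable solution for healthcare, enabling low-cost, efficient early DR detection, especially in resource-poor regions.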

Abstract: Diabetic Retinopathy (DR), a leading cause of vision impairment in individuals with diabetes, affects approximately 34.6% of diabetes patients globally, with the number of cases projected to reach 242 million by 2045. Traditional DR diagnosis relies on the manual examination of retinal fundus images, which is both time-consuming and resource-intensive. This study presents a novel solution using Edge Impulse to deploy multiple deep learning models for real-time DR detection on edge devices. A robust dataset of over 3,662 retinal fundus images, sourced from the Kaggle EyePACS dataset, was curated and enhanced through preprocessing techniques, including augmentation and normalization. Using TensorFlow, various Convolutional Neural Networks (CNNs), such as MobileNet, ShuffleNet, SqueezeNet, and a custom Deep Neural Network (DNN), were designed, trained, and optimized for edge deployment. The models were converted to TensorFlow Lite and quantized to 8-bit integers to reduce their size and enhance inference speed, with minimal trade-offs in accuracy. Performance evaluations across different edge hardware platforms, including smartphones and microcontrollers, highlighted key metrics such as inference speed, accuracy, precision, and resource utilization. MobileNet achieved an accuracy of 96.45%, while SqueezeNet demonstrated strong real-time performance with a small model size of 176 KB and latency of just 17 ms on GPU. ShuffleNet and the custom DNN achieved moderate accuracy but excelled in resource efficiency, making them suitable for lower-end devices. This integration of edge AI technology into healthcare presents a scalable, cost-effective solution for early DR detection, providing timely and accurate diagnosis, especially in resource-constrained and remote healthcare settings.
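
The 8-bit quantization step mentioned in this abstract can be illustrated with the generic affine scheme below. It is a sketch of the idea only, not the conversion pipeline the authors used: the `scale` and `zero_point` values are assumed, and the function names are hypothetical.

```python
def quantize(x: float, scale: float, zero_point: int, qmin: int = -128, qmax: int = 127) -> int:
    """Map a real value to a signed 8-bit integer, clamping to the int8 range."""
    q = round(x / scale) + zero_point  # note: Python rounds halves to even
    return max(qmin, min(qmax, q))

def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Recover the approximate real value from its 8-bit code."""
    return (q - zero_point) * scale

# Assumed parameters for illustration: one integer step represents 0.5.
code = quantize(1.0, scale=0.5, zero_point=0)
restored = dequantize(code, scale=0.5, zero_point=0)
saturated = quantize(100.0, scale=0.5, zero_point=0)  # beyond the int8 range
```

Storing each weight in one byte instead of a 32-bit float is what shrinks the model, at the cost of rounding error and saturation for values outside the representable range.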

[84] Improving Prostate Gland Segmenting Using Transformer based Architectures eess.IV | cs.CV | cs.LG

Shatha Abudalou

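TL;DR: The paper studies transformer-based architectures (UNETR and SwinUNETR) for prostate gland segmentation and finds that they handle label noise and class imbalance better than a conventional 3D UNet, particularly on the larger-gland subsets.

Details

Motivation: Prostate gland segmentation on T2-weighted MRI suffers from cross-site domain shift and inter-reader variability, calling for a more robust segmentation method.

Result: SwinUNETR performs best under cross-validated mixed training, with Dice scores of up to 0.902 (Reader #1) and 0.894 (Reader #2), well above UNETR and 3D UNet.

Insight: Global and shifted-window self-attention reduce sensitivity to label noise and class imbalance, yielding a more robust segmentation model for clinical deployment.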

Abstract: Inter-reader variability and cross-site domain shift challenge the automatic segmentation of prostate anatomy using T2-weighted MRI images. This study investigates whether transformer models can retain precision amid such heterogeneity. We compare the performance of UNETR and SwinUNETR in prostate gland segmentation against our previous 3D UNet model [1], based on 546 T2-weighted MRI volumes annotated by two independent experts. Three training strategies were analyzed: single-cohort dataset, 5-fold cross-validated mixed cohort, and gland-size-based dataset. Hyperparameters were tuned with Optuna. The test set, from an independent population of readers, served as the evaluation endpoint (Dice Similarity Coefficient). In single-reader training, SwinUNETR achieved an average Dice score of 0.816 for Reader #1 and 0.860 for Reader #2, while UNETR scored 0.8 and 0.833 for Readers #1 and #2, respectively, compared to the baseline UNet's 0.825 for Reader #1 and 0.851 for Reader #2. In cross-validated mixed training, SwinUNETR had an average Dice score of 0.8583 for Reader #1 and 0.867 for Reader #2. For the gland-size-based dataset, SwinUNETR achieved an average Dice score of 0.902 for the Reader #1 subset and 0.894 for the Reader #2 subset, using the five-fold mixed training strategy (Reader #1, n=53; Reader #2, n=87) on the larger-gland subsets, where UNETR performed poorly. Our findings demonstrate that global and shifted-window self-attention effectively reduce sensitivity to label noise and class imbalance, improving the Dice score over CNNs by up to five points while maintaining computational efficiency. This contributes to the high robustness of SwinUNETR for clinical deployment.
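
The Dice Similarity Coefficient used as the evaluation endpoint is 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B. A minimal sketch on flattened binary masks follows; returning 1.0 when both masks are empty is a convention chosen here, not something stated in the abstract.

```python
def dice_coefficient(pred: list[int], truth: list[int]) -> float:
    """Dice overlap between two flattened binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if size_sum == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed convention)
    return 2 * intersection / size_sum

# Toy masks: the prediction covers the one true voxel plus one false positive.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
```

Because the score depends on which reference mask is used, the abstract reports results separately for each reader's annotations.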

[85] Foundation Artificial Intelligence Models for Health Recognition Using Face Photographs (FAHR-Face) eess.IV | cs.AI | cs.CV

Fridolin Haugg, Grace Lee, John He, Leonard Nürnberg, Dennis Bontempi

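TL;DR: FAHR-Face is a foundation model for health recognition, trained on more than 40 million facial images and fine-tuned for biological age estimation and survival risk prediction in cancer patients. The models perform well, are independent of clinical factors, and remain robust under a range of conditions.

Details

Motivation: Facial appearance can provide noninvasive health indicators, but existing models leave room for improvement in age estimation and survival risk prediction, especially when clinical datasets are small.

Result: 1. FAHR-FaceAge achieves a mean absolute error of 5.1 years in age estimation; 2. FAHR-FaceSurvival predicts more than triple the mortality in the highest-risk quartile compared with the lowest; 3. Both models generalize in an independent cohort.

Insight: A foundation model can produce effective facial biomarkers from limited clinical datasets, providing complementary information on biological aging and disease risk.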

Abstract: Background: Facial appearance offers a noninvasive window into health. We built FAHR-Face, a foundation model trained on >40 million facial images and fine-tuned for two distinct tasks: biological age estimation (FAHR-FaceAge) and survival risk prediction (FAHR-FaceSurvival). Methods: FAHR-FaceAge underwent a two-stage, age-balanced fine-tuning on 749,935 public images; FAHR-FaceSurvival was fine-tuned on 34,389 photos of cancer patients. Model robustness (cosmetic surgery, makeup, pose, lighting) and independence (saliency mapping) were tested extensively. Both models were clinically tested in two independent cancer patient datasets with survival analyzed by multivariable Cox models and adjusted for clinical prognostic factors. Findings: For age estimation, FAHR-FaceAge had the lowest mean absolute error of 5.1 years on public datasets, outperforming benchmark models and maintaining accuracy across the full human lifespan. In cancer patients, FAHR-FaceAge outperformed a prior facial age estimation model in survival prognostication. FAHR-FaceSurvival demonstrated robust prediction of mortality, and the highest-risk quartile had more than triple the mortality of the lowest (adjusted hazard ratio 3.22; P<0.001). These findings were validated in the independent cohort and both models showed generalizability across age, sex, race and cancer subgroups. The two algorithms provided distinct, complementary prognostic information; saliency mapping revealed each model relied on distinct facial regions. The combination of FAHR-FaceAge and FAHR-FaceSurvival improved prognostic accuracy. Interpretation: A single foundation model can generate inexpensive, scalable facial biomarkers that capture both biological ageing and disease-related mortality risk. The foundation model enabled effective training using relatively small clinical datasets.

Wajih Hassan Raza, Aamir Bader Shah, Yu Wen, Yidan Shen, Juan Diego Martinez Lemus

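TL;DR: The paper proposes NeuroMoE, a transformer-based Mixture-of-Experts framework for multi-modal neurological disorder classification that integrates multi-modal MRI and clinical data and significantly improves diagnostic accuracy.

Details

Motivation: Existing deep learning methods struggle to exploit multi-modal MRI and clinical data, leading to suboptimal performance; the authors address this by combining transformers with a Mixture-of-Experts framework.

Result: The method reaches a validation accuracy of 82.47%, more than 10% above baseline methods, and stands out in distinguishing overlapping disease states.

Insight: Multi-modal learning with adaptive fusion can significantly improve diagnostic accuracy for neurological disorders, particularly on complex clinical data.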

Abstract: The integration of multi-modal Magnetic Resonance Imaging (MRI) and clinical data holds great promise for enhancing the diagnosis of neurological disorders (NDs) in real-world clinical settings. Deep Learning (DL) has recently emerged as a powerful tool for extracting meaningful patterns from medical data to aid in diagnosis. However, existing DL approaches struggle to effectively leverage multi-modal MRI and clinical data, leading to suboptimal performance. To address this challenge, we utilize a unique, proprietary multi-modal clinical dataset curated for ND research. Based on this dataset, we propose a novel transformer-based Mixture-of-Experts (MoE) framework for ND classification, leveraging multiple MRI modalities, namely anatomical (aMRI), Diffusion Tensor Imaging (DTI), and functional (fMRI), alongside clinical assessments. Our framework employs transformer encoders to capture spatial relationships within volumetric MRI data while utilizing modality-specific experts for targeted feature extraction. A gating mechanism with adaptive fusion dynamically integrates expert outputs, ensuring optimal predictive performance. Comprehensive experiments and comparisons with multiple baselines demonstrate that our multi-modal approach significantly enhances diagnostic accuracy, particularly in distinguishing overlapping disease states. Our framework achieves a validation accuracy of 82.47%, outperforming baseline methods by over 10%, highlighting its potential to improve ND diagnosis by applying multi-modal learning to real-world clinical data.

[87] Privacy-Preserving Chest X-ray Classification in Latent Space with Homomorphically Encrypted Neural Inference eess.IV | cs.CV

Jonghun Kim, Gyeongdeok Jo, Shinyoung Ra, Hyunjin Park

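TL;DR: The paper proposes a privacy-preserving medical image classification framework based on homomorphic encryption (HE). It uses VQGAN to compress images into latent space to reduce computation, and approximates activation functions with lower-degree polynomials to balance accuracy and efficiency.

Details

Motivation: Medical imaging data contain sensitive patient information and need strong privacy protection. HE allows computation on encrypted data but is computationally expensive, especially for large images such as chest X-rays. A method is therefore needed that preserves privacy while lowering computational cost.

Result: A downsampling factor of eight gives the best balance between performance and computational cost. HE inference remains slower than unencrypted inference and shows minor performance differences, but the approach shows potential for practical use on medical images.

Insight: Combining image compression with HE can address the computational challenge of privacy-preserving medical imaging while maintaining reasonable classification performance. The framework may serve as a reference for other privacy-sensitive vision tasks.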

Abstract: Medical imaging data contain sensitive patient information requiring strong privacy protection. Many analytical setups require data to be sent to a server for inference purposes. Homomorphic encryption (HE) provides a solution by allowing computations to be performed on encrypted data without revealing the original information. However, HE inference is computationally expensive, particularly for large images (e.g., chest X-rays). In this study, we propose an HE inference framework for medical images that uses VQGAN to compress images into latent representations, thereby significantly reducing the computational burden while preserving image quality. We approximate the activation functions with lower-degree polynomials to balance the accuracy and efficiency in compliance with HE requirements. We observed that a downsampling factor of eight for compression achieved an optimal balance between performance and computational cost. We further adapted the squeeze and excitation module, which is known to improve traditional CNNs, to enhance the HE framework. Our method was tested on two chest X-ray datasets for multi-label classification tasks using vanilla CNN backbones. Although HE inference remains relatively slow and introduces minor performance differences compared with unencrypted inference, our approach shows strong potential for practical use in medical images.
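
HE schemes natively support additions and multiplications, which is why activations are replaced by low-degree polynomials. The sketch below evaluates such a polynomial with Horner's rule, using only those two operations. The degree-2 coefficients are illustrative values chosen for this example, not the ones used in the paper, and the code runs on plain floats rather than ciphertexts.

```python
def poly_eval(coeffs: tuple[float, ...], x: float) -> float:
    """Evaluate c0 + c1*x + c2*x^2 + ... with Horner's rule (adds and multiplies only)."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def relu(x: float) -> float:
    return max(0.0, x)

# Illustrative degree-2 stand-in for ReLU: 0.25 + 0.5*x + 0.125*x^2 (assumed coefficients).
RELU_APPROX = (0.25, 0.5, 0.125)

# Worst-case gap to the true ReLU on a grid over [-2, 2].
grid = [i / 10 for i in range(-20, 21)]
max_err = max(abs(poly_eval(RELU_APPROX, x) - relu(x)) for x in grid)
```

A lower degree means fewer ciphertext multiplications, which are the costly operation under HE, but a looser fit to the original activation; that is the accuracy versus efficiency trade-off the abstract refers to.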

[88] A Real-time Endoscopic Image Denoising System eess.IV | cs.AI | cs.CV

Yu Xing, Shishi Huang, Meng Lv, Guo Chen, Huailiang Wang

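TL;DR: The paper proposes a hybrid system for real-time endoscopic image denoising that combines traditional image processing algorithms with learning-based methods, addresses the image noise introduced by ultra-compact sensors, and runs in real time on FPGA platforms.

Details

Motivation: In medical endoscopes, the small photosensitive area of ultra-compact sensors causes noise that seriously degrades image quality, especially in high-contrast scenes, and conventional solutions struggle to preserve detail while running in real time.

Result: Experiments show that the system effectively reduces noise, raising PSNR from 21.16 to 33.05 while achieving real-time performance on FPGA.

Insight: A hybrid approach to denoising can both preserve detail and meet real-time requirements, which makes it well suited to hardware-constrained medical devices.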

Abstract: Endoscopes featuring a miniaturized design have significantly enhanced operational flexibility, portability, and diagnostic capability while substantially reducing the invasiveness of medical procedures. Recently, single-use endoscopes equipped with an ultra-compact analog image sensor measuring less than 1 mm × 1 mm have brought revolutionary advancements to medical diagnosis. They reduce the structural redundancy and large capital expenditures associated with reusable devices, eliminate the risk of patient infections caused by inadequate disinfection, and alleviate patient suffering. However, the limited photosensitive area results in reduced photon capture per pixel, requiring higher photon sensitivity settings to maintain adequate brightness. In high-contrast medical imaging scenarios, the small-sized sensor exhibits a constrained dynamic range, making it difficult to simultaneously capture details in both highlights and shadows, and additional localized digital gain is required to compensate. Moreover, the simplified circuit design and analog signal transmission introduce additional noise sources. These factors collectively contribute to significant noise issues in processed endoscopic images. In this work, we developed a comprehensive noise model for analog image sensors in medical endoscopes, addressing three primary noise types: fixed-pattern noise, periodic banding noise, and mixed Poisson-Gaussian noise. Building on this analysis, we propose a hybrid denoising system that synergistically combines traditional image processing algorithms with advanced learning-based techniques for captured raw frames from sensors. Experiments demonstrate that our approach effectively reduces image noise without fine detail loss or color distortion, while achieving real-time performance on FPGA platforms and an average PSNR improvement from 21.16 to 33.05 on our test dataset.
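
PSNR, the metric reported in this abstract, is computed from the mean squared error between a reference image and a processed one: PSNR = 10 · log10(MAX² / MSE). A minimal sketch on flattened pixel lists follows; the 8-bit peak value of 255 is an assumption, since the abstract does not state the bit depth.

```python
import math

def psnr(reference: list[float], test: list[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in decibels between two flattened images."""
    if len(reference) != len(test) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy 2x2 image where every pixel is off by 10 grey levels (MSE = 100).
value = psnr([100, 100, 100, 100], [110, 110, 110, 110])
```

Because the scale is logarithmic, each 10-point gain in PSNR corresponds to a tenfold reduction in mean squared error.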

cs.GR

[89] One-shot Face Sketch Synthesis in the Wild via Generative Diffusion Prior and Instruction Tuning cs.GR | cs.CR | cs.CV | cs.CY

Han Wu, Junyao Li, Kangbo Zhao, Sen Zhang, Yukai Shi

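TL;DR: The paper proposes a diffusion-based one-shot face sketch synthesis method that generates high-quality sketches by optimizing text instructions, and introduces a new benchmark dataset, OS-Sketch. The method performs well in data-scarce settings.

Details

Motivation: Existing face sketch synthesis methods rely on training with large amounts of paired data, and so face data scarcity and high human labor costs. The authors aim to achieve efficient generation under data scarcity through diffusion models and instruction tuning.

Result: Experiments show that the method generates high-quality, consistent sketches in the one-shot setting and outperforms other methods.

Insight: Combining diffusion models with instruction tuning offers a new solution to data scarcity, with broad applicability.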

Abstract: Face sketch synthesis is a technique aimed at converting face photos into sketches. Existing face sketch synthesis research mainly relies on training with numerous photo-sketch sample pairs from existing datasets. However, these large-scale discriminative learning methods will have to face problems such as data scarcity and high human labor costs. Once the training data becomes scarce, their generative performance significantly degrades. In this paper, we propose a one-shot face sketch synthesis method based on diffusion models. We optimize text instructions on a diffusion model using face photo-sketch image pairs. Then, the instructions derived through gradient-based optimization are used for inference. To simulate real-world scenarios more accurately and evaluate method effectiveness more comprehensively, we introduce a new benchmark named One-shot Face Sketch Dataset (OS-Sketch). The benchmark consists of 400 pairs of face photo-sketch images, including sketches with different styles and photos with different backgrounds, ages, sexes, expressions, illumination, etc. For a solid out-of-distribution evaluation, we select only one pair of images for training at each time, with the rest used for inference. Extensive experiments demonstrate that the proposed method can convert various photos into realistic and highly consistent sketches in a one-shot context. Compared to other methods, our approach offers greater convenience and broader applicability. The dataset will be available at: https://github.com/HanWu3125/OS-Sketch

[90] Nabla-R2D3: Effective and Efficient 3D Diffusion Alignment with 2D Rewards cs.GR | cs.CV | cs.LG

Qingming Liu, Zhen Liu, Dinghuai Zhang, Kui Jia

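TL;DR: Nabla-R2D3 is an efficient reinforcement learning alignment framework that uses 2D reward signals to optimize 3D diffusion models, significantly improving the realism and alignment of generated content.

Details

Motivation: Existing 3D generative models (such as diffusion models) still fall short in producing high-quality 3D assets that match human preferences, particularly in instruction following and in the realism of textures and geometric detail.

Result: Experiments show that Nabla-R2D3 significantly raises reward within a few finetuning steps and reduces forgetting of the prior.

Insight: 2D reward signals can serve as an effective tool for optimizing 3D generative models, and gradient-based alignment avoids the convergence difficulties and reward hacking seen in vanilla finetuning baselines.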

Abstract: Generating high-quality and photorealistic 3D assets remains a longstanding challenge in 3D vision and computer graphics. Although state-of-the-art generative models, such as diffusion models, have made significant progress in 3D generation, they often fall short of human-designed content due to limited ability to follow instructions, align with human preferences, or produce realistic textures, geometries, and physical attributes. In this paper, we introduce Nabla-R2D3, a highly effective and sample-efficient reinforcement learning alignment framework for 3D-native diffusion models using 2D rewards. Built upon the recently proposed Nabla-GFlowNet method, which matches the score function to reward gradients in a principled manner for reward finetuning, our Nabla-R2D3 enables effective adaptation of 3D diffusion models using only 2D reward signals. Extensive experiments show that, unlike vanilla finetuning baselines which either struggle to converge or suffer from reward hacking, Nabla-R2D3 consistently achieves higher rewards and reduced prior forgetting within a few finetuning steps.

cs.DC

[91] Cost-Efficient Serving of LLM Agents via Test-Time Plan Caching cs.DC | cs.AI | cs.CL | cs.LG | cs.PF

Qizheng Zhang, Michael Wornow, Kunle Olukotun

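TL;DR: This paper proposes agentic plan caching, a new approach that extracts, stores, adapts, and reuses structured plan templates, significantly reducing the cost of serving LLM agent applications.

Details

Motivation: Existing LLM caching techniques (such as context caching and semantic caching) were designed mainly for chatbots and do not meet the needs of agentic applications whose outputs depend on external data or environmental context, leading to high compute costs.

Result: Across multiple real-world agentic applications, the method reduces cost by 46.62% on average while maintaining performance.

Insight: Agentic plan caching offers an efficient, low-cost solution for LLM agent applications and complements existing LLM serving infrastructure.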

Abstract: LLM-based agentic applications have shown increasingly remarkable capabilities in complex workflows but incur substantial costs due to extensive planning and reasoning requirements. Existing LLM caching techniques (like context caching and semantic caching), primarily designed for serving chatbots, are insufficient for agentic applications where outputs depend on external data or environmental contexts. We propose agentic plan caching, a novel approach that extracts, stores, adapts, and reuses structured plan templates from planning stages of agentic applications across semantically similar tasks to reduce the cost of serving. Unlike traditional semantic caching, our system extracts plan templates from completed agent executions at test-time, employs keyword extraction to match new requests against cached plans, and utilizes lightweight models to adapt these templates to task-specific plans with contexts. Evaluation across multiple real-world agentic applications shows that our system can reduce costs by 46.62% on average while maintaining performance, offering a more efficient solution for serving LLM-based agents that complements existing LLM serving infrastructures.
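
The extract / match / adapt loop described in this abstract can be sketched roughly as follows. This is a toy illustration under several assumptions that are not from the paper: the keyword extractor is a simple stopword filter, matching uses Jaccard overlap with a hypothetical threshold `min_overlap`, the raw plan is stored as the "template", and `plan_with_large_model` / `adapt_with_small_model` are placeholder callables standing in for the expensive planner and the lightweight adapter.

```python
import re

# Hypothetical stopword list for the toy keyword extractor.
STOPWORDS = {"a", "an", "the", "of", "for", "to", "in", "on", "and", "me", "my", "please"}


def extract_keywords(request):
    """Toy keyword extraction: lowercase word tokens minus stopwords."""
    tokens = re.findall(r"[a-z0-9]+", request.lower())
    return frozenset(t for t in tokens if t not in STOPWORDS)


class PlanCache:
    def __init__(self, min_overlap=0.5):
        self.min_overlap = min_overlap
        self._entries = []  # list of (keywords, plan_template) pairs

    def store(self, request, plan_template):
        self._entries.append((extract_keywords(request), plan_template))

    def lookup(self, request):
        """Return the best-matching template by Jaccard overlap, or None on a miss."""
        query = extract_keywords(request)
        best, best_score = None, 0.0
        for keywords, template in self._entries:
            union = query | keywords
            score = len(query & keywords) / len(union) if union else 0.0
            if score > best_score:
                best, best_score = template, score
        return best if best_score >= self.min_overlap else None


def serve(request, cache, plan_with_large_model, adapt_with_small_model):
    """Reuse a cached plan when one matches; otherwise plan from scratch and cache it."""
    template = cache.lookup(request)
    if template is not None:
        return adapt_with_small_model(template, request)  # cheap path
    plan = plan_with_large_model(request)  # expensive path
    cache.store(request, plan)
    return plan
```

In this sketch the cost saving comes from skipping the expensive planner call whenever a semantically similar request has already been planned.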

cs.RO

[92] Towards Perception-based Collision Avoidance for UAVs when Guiding the Visually Impaired cs.RO | cs.CV

Suman Raj, Swapnil Padhi, Ruchi Bhoot, Prince Modi, Yogesh Simmhan

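TL;DR: The paper studies the use of UAVs for outdoor navigation of visually impaired people, proposing a system that combines perception-based local path planning with map-based global planning and uses a multi-DNN framework for obstacle avoidance for both the UAV and the visually impaired person.

Details

Motivation: Use the autonomous navigation capability of UAVs to help visually impaired people navigate safely in complex urban environments and increase their independent mobility.

Result: The evaluation verifies the feasibility of the algorithms in three scenarios (walking on a footpath, near parked vehicles, and in a crowded street).

Insight: Cooperative navigation between a UAV and a visually impaired person requires combining real-time perception with global planning; a multi-DNN framework can effectively handle obstacle avoidance in complex environments.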

Abstract: Autonomous navigation by drones using onboard sensors combined with machine learning and computer vision algorithms is impacting a number of domains, including agriculture, logistics, and disaster management. In this paper, we examine the use of drones for assisting visually impaired people (VIPs) in navigating through outdoor urban environments. Specifically, we present a perception-based path planning system for local planning around the neighborhood of the VIP, integrated with a global planner based on GPS and maps for coarse planning. We represent the problem using a geometric formulation and propose a multi-DNN-based framework for obstacle avoidance of the UAV as well as the VIP. Our evaluations, conducted on a drone-human system in a university campus environment, verify the feasibility of our algorithms in three scenarios: when the VIP walks on a footpath, near parked vehicles, and in a crowded street.

[93] Robust Instant Policy: Leveraging Student’s t-Regression Model for Robust In-context Imitation Learning of Robot Manipulation cs.RO | cs.CV

Hanbit Oh, Andrea M. Salcedo-Vázquez, Ixchel G. Ramirez-Alpizar, Yukiyasu Domae

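TL;DR: The paper proposes the robust instant policy (RIP), which uses a Student’s t-regression model to address hallucination in LLM-based imitation learning, significantly improving task success rates.

Details

Motivation: In-Context IL, a variant of imitation learning (IL), uses large language models (LLMs) as instant policies, but LLM-generated trajectories can suffer from hallucination, which undermines reliability. The authors propose RIP to address this.

Result: Experiments show that RIP significantly outperforms existing IL methods in both simulated and real-world environments, with at least a 26% improvement in task success rate, particularly on everyday tasks in low-data settings.

Insight: Aggregating multiple LLM-generated candidate trajectories with a statistical model (such as the Student’s t-distribution) effectively reduces hallucination and improves the reliability and performance of imitation learning.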

Abstract: Imitation learning (IL) aims to enable robots to perform tasks autonomously by observing a few human demonstrations. Recently, a variant of IL, called In-Context IL, utilized off-the-shelf large language models (LLMs) as instant policies that understand the context from a few given demonstrations to perform a new task, rather than explicitly updating network models with large-scale demonstrations. However, its reliability in the robotics domain is undermined by hallucination issues: an LLM-based instant policy occasionally generates poor trajectories that deviate from the given demonstrations. To alleviate this problem, we propose a new robust in-context imitation learning algorithm called the robust instant policy (RIP), which utilizes a Student’s t-regression model to be robust against the hallucinated trajectories of instant policies to allow reliable trajectory generation. Specifically, RIP generates several candidate robot trajectories to complete a given task from an LLM and aggregates them using the Student’s t-distribution, which is beneficial for ignoring outliers (i.e., hallucinations); thereby, a robust trajectory against hallucinations is generated. Our experiments, conducted in both simulated and real-world environments, show that RIP significantly outperforms state-of-the-art IL methods, with at least 26% improvement in task success rates, particularly in low-data scenarios for everyday tasks. Video results available at https://sites.google.com/view/robustinstantpolicy.
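
To illustrate why a Student’s t model is robust to outlying candidates, the sketch below aggregates several candidate trajectories waypoint by waypoint with a Student’s t location estimate. It is a simplified stand-in, not the paper's t-regression model: it assumes 1-D waypoints, a fixed degrees-of-freedom value `dof`, and the standard iteratively reweighted (EM-style) updates for the t-distribution; all names and sample values are hypothetical.

```python
def student_t_location(samples, dof=3.0, iters=100):
    """Location estimate under a Student's t model with fixed degrees of freedom.

    Points far from the current centre receive small weights, so outliers
    barely move the estimate.
    """
    n = len(samples)
    mu = sorted(samples)[n // 2]  # start from the (upper) median
    var = max(sum((x - mu) ** 2 for x in samples) / n, 1e-12)
    for _ in range(iters):
        # Weight shrinks as the squared standardized distance grows.
        weights = [(dof + 1.0) / (dof + (x - mu) ** 2 / var) for x in samples]
        mu = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
        var = max(sum(w * (x - mu) ** 2 for w, x in zip(weights, samples)) / n, 1e-12)
    return mu


def aggregate_trajectories(candidates):
    """Aggregate candidate trajectories waypoint by waypoint (1-D waypoints here)."""
    return [student_t_location(list(step)) for step in zip(*candidates)]


# Hypothetical candidates: four roughly agree, one is a "hallucinated" outlier.
candidates = [
    [0.00, 0.50, 1.00],
    [0.01, 0.52, 1.02],
    [-0.01, 0.49, 0.98],
    [0.00, 0.51, 1.01],
    [2.00, 3.00, -4.00],
]
robust = aggregate_trajectories(candidates)
mean = [sum(step) / len(step) for step in zip(*candidates)]
print("robust:", [round(v, 2) for v in robust])
print("mean:  ", [round(v, 2) for v in mean])
```

In this example the plain mean is pulled far toward the outlying trajectory, while the t-based estimate stays close to the four agreeing candidates.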

[94] MCOO-SLAM: A Multi-Camera Omnidirectional Object SLAM System cs.RO | cs.AI | cs.CV

Miaoxin Pan, Jinnan Li, Yaowen Zhang, Yi Yang, Yufeng Yue

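TL;DR: MCOO-SLAM is a multi-camera omnidirectional object SLAM system that uses surround-view camera configurations to achieve robust, consistent, and semantically rich mapping in complex outdoor scenes.

Details

Motivation: Most existing object-level SLAM systems rely on RGB-D sensors or monocular cameras, which suffer from narrow fields of view, sensitivity to occlusion, and limited depth perception, limiting the accuracy of object modeling and data association.

Result: Experiments show that MCOO-SLAM achieves accurate localization and scalable object-level mapping in real-world scenes, with greater robustness to occlusion, pose variation, and environmental complexity.

Insight: Multi-camera omnidirectional views can significantly improve the semantic understanding and mapping consistency of SLAM systems, which suits robotic tasks in complex outdoor environments.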

Abstract: Object-level SLAM offers structured and semantically meaningful environment representations, making it more interpretable and suitable for high-level robotic tasks. However, most existing approaches rely on RGB-D sensors or monocular views, which suffer from narrow fields of view, occlusion sensitivity, and limited depth perception, especially in large-scale or outdoor environments. These limitations often restrict the system to observing only partial views of objects from limited perspectives, leading to inaccurate object modeling and unreliable data association. In this work, we propose MCOO-SLAM, a novel Multi-Camera Omnidirectional Object SLAM system that fully leverages surround-view camera configurations to achieve robust, consistent, and semantically enriched mapping in complex outdoor scenarios. Our approach integrates point features and object-level landmarks enhanced with open-vocabulary semantics. A semantic-geometric-temporal fusion strategy is introduced for robust object association across multiple views, leading to improved consistency and accurate object modeling, and an omnidirectional loop closure module is designed to enable viewpoint-invariant place recognition using scene-level descriptors. Furthermore, the constructed map is abstracted into a hierarchical 3D scene graph to support downstream reasoning tasks. Extensive experiments in real-world environments demonstrate that MCOO-SLAM achieves accurate localization and scalable object-level mapping with improved robustness to occlusion, pose variation, and environmental complexity.

[95] Particle-Grid Neural Dynamics for Learning Deformable Object Models from RGB-D Videos cs.RO | cs.CV | cs.LG

Kaifeng Zhang, Baoyu Li, Kris Hauser, Yunzhu Li

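TL;DR: The paper proposes a neural dynamics framework that combines particles and spatial grids to learn dynamics models of deformable objects from RGB-D videos; it outperforms existing methods and is demonstrated in model-based planning.

Details

Motivation: Modeling the dynamics of deformable objects is challenging, especially when states must be estimated from limited visual information. The work addresses this with a neural dynamics framework that models the dynamics of diverse objects.

Result: The model performs well on sparse-view RGB-D data, outperforms existing methods, and generalizes to unseen object instances. It also proves useful in model-based planning.

Insight: A hybrid particle-grid representation effectively captures the dynamics of deformable objects while improving learning efficiency, offering a new approach to modeling object dynamics in the real world.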

Abstract: Modeling the dynamics of deformable objects is challenging due to their diverse physical properties and the difficulty of estimating states from limited visual information. We address these challenges with a neural dynamics framework that combines object particles and spatial grids in a hybrid representation. Our particle-grid model captures global shape and motion information while predicting dense particle movements, enabling the modeling of objects with varied shapes and materials. Particles represent object shapes, while the spatial grid discretizes the 3D space to ensure spatial continuity and enhance learning efficiency. Coupled with Gaussian Splattings for visual rendering, our framework achieves a fully learning-based digital twin of deformable objects and generates 3D action-conditioned videos. Through experiments, we demonstrate that our model learns the dynamics of diverse objects – such as ropes, cloths, stuffed animals, and paper bags – from sparse-view RGB-D recordings of robot-object interactions, while also generalizing at the category level to unseen instances. Our approach outperforms state-of-the-art learning-based and physics-based simulators, particularly in scenarios with limited camera views. Furthermore, we showcase the utility of our learned models in model-based planning, enabling goal-conditioned object manipulation across a range of tasks. The project page is available at https://kywind.github.io/pgnd .

cs.SE

[96] An Empirical Study of Bugs in Data Visualization Libraries cs.SE | cs.CV | cs.HC

Weiqi Lu, Yongqiang Tian, Xiaohan Zhong, Haoyang Ma, Zhenyang Xu

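TL;DR: This paper presents the first comprehensive analysis of bugs in data visualization libraries, examining 564 bugs to characterize their symptoms and root causes and to build a taxonomy, and explores the potential of vision language models for detecting incorrect plots.

Details

Motivation: The accuracy of data visualization libraries is critical to user experience and decision-making, yet visualization bugs often affect users in subtle ways. Studying the characteristics of such bugs helps improve library quality and testing methods.

Result: Incorrect/inaccurate plots are pervasive and stem mainly from incorrect graphic computation; the effectiveness of vision language models in detecting them varies with the prompt, from 29% to 57%.

Insight: Vision language models show limited potential for detecting data visualization bugs and need further refinement; automated testing is a direction for future research.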

Abstract: Data visualization (DataViz) libraries play a crucial role in presentation, data analysis, and application development, underscoring the importance of their accuracy in transforming data into visual representations. Incorrect visualizations can adversely impact user experience, distort information conveyance, and influence user perception and decision-making processes. Visual bugs in these libraries can be particularly insidious as they may not cause obvious errors like crashes, but instead mislead users of the underlying data graphically, resulting in wrong decision making. Consequently, a good understanding of the unique characteristics of bugs in DataViz libraries is essential for researchers and developers to detect and fix bugs in DataViz libraries. This study presents the first comprehensive analysis of bugs in DataViz libraries, examining 564 bugs collected from five widely-used libraries. Our study systematically analyzes their symptoms and root causes, and provides a detailed taxonomy. We found that incorrect/inaccurate plots are pervasive in DataViz libraries and incorrect graphic computation is the major root cause, which necessitates further automated testing methods for DataViz libraries. Moreover, we identified eight key steps to trigger such bugs and two test oracles specific to DataViz libraries, which may inspire future research in designing effective automated testing techniques. Furthermore, with the recent advancements in Vision Language Models (VLMs), we explored the feasibility of applying these models to detect incorrect/inaccurate plots. The results show that the effectiveness of VLMs in bug detection varies from 29% to 57%, depending on the prompts, and adding more information in prompts does not necessarily increase the effectiveness. More findings can be found in our manuscript.

cs.AI

[97] Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence cs.AI | cs.CL | cs.CV | cs.MM | cs.RO

Yining Hong, Rui Sun, Bingxuan Li, Xingcheng Yao, Maxine Wu

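TL;DR: Introduces Embodied Web Agents, a paradigm that bridges physical and digital intelligence, together with a unified simulation platform and task benchmark for assessing AI agents on cross-domain tasks.

Details

Motivation: Existing AI agents typically operate in either the digital or the physical realm in isolation and cannot integrate the two, which limits their performance on tasks that require cross-domain intelligence.

Result: Experimental results show a significant gap between existing AI systems and human capabilities, highlighting challenges and opportunities for further research.

Insight: Combining perception of the physical world with reasoning over digital knowledge can significantly improve an AI agent's ability to solve cross-domain tasks.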

Abstract: AI agents today are mostly siloed - they either retrieve and reason over vast amounts of digital information and knowledge obtained online; or interact with the physical world through embodied perception, planning and action - but rarely both. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce Embodied Web Agents, a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. To operationalize this concept, we first develop the Embodied Web Agents task environments, a unified simulation platform that tightly integrates realistic 3D indoor and outdoor environments with functional web interfaces. Building upon this platform, we construct and release the Embodied Web Agents Benchmark, which encompasses a diverse suite of tasks including cooking, navigation, shopping, tourism, and geolocation - all requiring coordinated reasoning across physical and digital realms for systematic assessment of cross-domain intelligence. Experimental results reveal significant performance gaps between state-of-the-art AI systems and human capabilities, establishing both challenges and opportunities at the intersection of embodied cognition and web-scale knowledge access. All datasets, code and websites are publicly available at our project page https://embodied-web-agent.github.io/.

cs.LG

[98] Assembly of Experts: Linear-time construction of the Chimera LLM variants with emergent and adaptable behaviors cs.LG | cs.AI | cs.CL

Henrik Klagges, Robert Dahlke, Fabian Klemm, Benjamin Merkel, Daniel Klingmann

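TL;DR: The paper proposes the “Assembly-of-Experts” method, which builds capable child model variants in linear time by interpolating the weight tensors of parent models, significantly reducing compute cost.

Details

Motivation: LLM pretraining is extremely compute-intensive (10^13 to 10^15 FLOPs to calculate one 8-bit weight) and seems inefficient. To make better use of already-trained models, the authors propose a new construction method.

Result: The resulting DeepSeek R1T model inherits the strengths of its parent models while using about 40% fewer output tokens, with more compact and orderly reasoning.

Insight: Interpolating model weights can change behavioral traits either gradually or abruptly; nearly every generated child model is functional and capable, which makes searching the model space straightforward.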

Abstract: Requiring $10^{13}$-$10^{15}$ FLOPs to calculate one 8-bit weight in an LLM during pretraining is extremely expensive and seems inefficient. To better leverage the huge investments made into pretrained models, we develop the new “Assembly-of-Experts” (AoE) construction method to create capable child variants of existing Mixture-of-Experts parent models in linear time. Model weight tensors are interpolated individually, which allows semantic features of the parents to be enhanced or suppressed. Varying the proportion of weights taken from the parent models, we observe some properties of the AoE child model changing gradually, while other behavioral traits emerge with a sharp transition. Surprisingly, nearly every generated model is functional and capable, which makes searching the model space straightforward. We construct the DeepSeek R1T “Chimera”, a 671B open-weights hybrid model combining DeepSeek’s V3-0324 and R1 model variants. The child inherits only the routed expert tensors of R1, but still achieves about R1-level intelligence. At the same time, it uses about 40% fewer output tokens, close to V3 speed. Constructed without any fine-tuning or distillation, the Chimera exhibits surprisingly compact, orderly reasoning compared to its parent models.
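
The per-tensor interpolation described in this abstract can be sketched as a single pass over the parents' weights, which is linear in the number of parameters. This is a minimal illustration, not the authors' implementation: tensors are plain lists of floats, the tensor names and the `select` predicate are hypothetical, and the mixing rule child = (1 - lam) * A + lam * B is an assumed form of "interpolation".

```python
def assemble_child(parent_a, parent_b, lam, select=lambda name: True):
    """Build a child model by interpolating selected tensors of two parents.

    For selected tensors: child = (1 - lam) * A + lam * B.
    Unselected tensors are copied from parent A unchanged.
    """
    child = {}
    for name, a in parent_a.items():
        b = parent_b[name]
        if select(name):
            child[name] = [(1.0 - lam) * x + lam * y for x, y in zip(a, b)]
        else:
            child[name] = list(a)
    return child


# Hypothetical two-tensor "models" with matching tensor names.
parent_a = {"attn.w": [1.0, 2.0], "experts.0.w": [0.0, 0.0]}
parent_b = {"attn.w": [3.0, 4.0], "experts.0.w": [10.0, 20.0]}

# Take only the expert tensors from parent B (lam = 1.0), keep the rest from A.
child = assemble_child(parent_a, parent_b, 1.0, select=lambda n: n.startswith("experts."))
print(child)
```

Setting `lam` to 1.0 on a subset of tensors mirrors the abstract's description of a child that inherits only the routed expert tensors of one parent; intermediate values of `lam` give the gradual mixing the authors vary in their experiments.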

[99] Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective cs.LG | cs.AI | cs.CL

Zhoujun Cheng, Shibo Hao, Tianyang Liu, Fan Zhou, Yutao Xie

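TL;DR: This paper introduces Guru, a cross-domain RL reasoning dataset covering six domains, reveals through systematic experiments how RL performance varies across domains, and trains the SOTA models Guru-7B and Guru-32B.

Details

Motivation: Existing work focuses mainly on RL reasoning for math and code and leaves other general reasoning domains underexplored. The paper aims to fill this gap and provide reliable cross-domain RL reasoning data and methods.

Result: Guru-7B and Guru-32B outperform the best baselines by 7.9% and 6.7%, respectively, on a 17-task evaluation suite, and improve the Pass@k performance of their base models.

Insight: Domains frequently seen during pretraining (Math, Code, Science) benefit from cross-domain RL training, whereas domains with limited pretraining exposure (Logic, Simulation, Tabular) require in-domain training to achieve meaningful gains.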

Abstract: Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains–Math, Code, Science, Logic, Simulation, and Tabular–each built through domain-specific reward design, deduplication, and filtering to ensure reliability and effectiveness for RL training. Based on Guru, we systematically revisit established findings in RL for LLM reasoning and observe significant variation across domains. For example, while prior work suggests that RL primarily elicits existing knowledge from pretrained models, our results reveal a more nuanced pattern: domains frequently seen during pretraining (Math, Code, Science) easily benefit from cross-domain RL training, while domains with limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain training to achieve meaningful performance gains, suggesting that RL is likely to facilitate genuine skill acquisition. Finally, we present Guru-7B and Guru-32B, two models that achieve state-of-the-art performance among open models RL-trained with publicly available data, outperforming best baselines by 7.9% and 6.7% on our 17-task evaluation suite across six reasoning domains. We also show that our models effectively improve the Pass@k performance of their base models, particularly on complex tasks less likely to appear in pretraining data. We release data, models, training and evaluation code to facilitate general-purpose reasoning at: https://github.com/LLM360/Reasoning360

[100] When and How Unlabeled Data Provably Improve In-Context Learning cs.LG | cs.AI | cs.CL | math.OC

Yingcong Li, Xiangyu Chang, Muti Kara, Xiaofeng Liu, Amit Roy-Chowdhury

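TL;DR: This paper studies the role of unlabeled data in in-context learning (ICL). It finds that one-layer linear attention models cannot exploit unlabeled data, whereas multilayer or looped transformers can do so effectively by implicitly constructing polynomial-form estimators (whose leading power grows exponentially with depth), and it proposes looping pretrained models to improve semi-supervised learning performance.

Details

Motivation: Understand why in-context learning (ICL) remains effective when demonstrations have missing or incorrect labels, and explore how theoretical analysis and practical methods can exploit unlabeled data to improve performance.

Result: Multilayer or looped transformers can effectively leverage unlabeled data through polynomial estimators whose leading power grows exponentially with depth; looping pretrained models significantly improves semi-supervised learning performance.

Insight: 1. Depth or looping is essential for exploiting unlabeled data; 2. The theoretical analysis guides the design of semi-supervised learning algorithms, in particular by implementing pseudo-label generation through looping.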

Abstract: Recent research shows that in-context learning (ICL) can be effective even when demonstrations have missing or incorrect labels. To shed light on this capability, we examine a canonical setting where the demonstrations are drawn according to a binary Gaussian mixture model (GMM) and a certain fraction of the demonstrations have missing labels. We provide a comprehensive theoretical study to show that: (1) The loss landscape of one-layer linear attention models recovers the optimal fully-supervised estimator but completely fails to exploit unlabeled data; (2) In contrast, multilayer or looped transformers can effectively leverage unlabeled data by implicitly constructing estimators of the form $\sum_{i\ge 0} a_i (X^\top X)^iX^\top y$ with $X$ and $y$ denoting features and partially-observed labels (with missing entries set to zero). We characterize the class of polynomials that can be expressed as a function of depth and draw connections to Expectation Maximization, an iterative pseudo-labeling algorithm commonly used in semi-supervised learning. Importantly, the leading polynomial power is exponential in depth, so a mild amount of depth/looping suffices. As an application of theory, we propose looping off-the-shelf tabular foundation models to enhance their semi-supervision capabilities. Extensive evaluations on real-world datasets show that our method significantly improves the semi-supervised tabular learning performance over the standard single pass inference.
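
The estimator family $\sum_{i\ge 0} a_i (X^\top X)^i X^\top y$ from this abstract can be evaluated directly for a given coefficient list. The sketch below is only an illustration: the coefficients `coeffs` are hypothetical inputs (not the values a trained transformer would realize), matrices are plain nested lists, and the function name is invented for this example.

```python
def poly_estimator(X, y, coeffs):
    """Evaluate sum_i coeffs[i] * (X^T X)^i X^T y.

    X is an n x d list of rows; y has length n, with 0 for unlabeled rows.
    """
    d = len(X[0])
    # v starts as X^T y, which only involves rows with a nonzero label.
    v = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(d)]
    est = [0.0] * d
    for a in coeffs:
        est = [e + a * vj for e, vj in zip(est, v)]
        # v <- (X^T X) v, computed as X^T (X v) without forming the d x d matrix.
        Xv = [sum(r * vj for r, vj in zip(row, v)) for row in X]
        v = [sum(row[j] * s for row, s in zip(X, Xv)) for j in range(d)]
    return est


# Hypothetical data: one labeled row (+1) and two unlabeled rows (label set to 0).
X = [[1, 0], [0, 1], [1, 1]]
y = [1, 0, 0]
print(poly_estimator(X, y, [1.0]))       # order-0 term only
print(poly_estimator(X, y, [1.0, 1.0]))  # adds the order-1 term
```

With missing labels set to zero, the order-0 term $X^\top y$ depends only on the labeled rows, whereas every higher-order term passes through $X^\top X$, which is built from all rows. That is the route by which unlabeled features can influence the estimate.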

[101] AutoRule: Reasoning Chain-of-thought Extracted Rule-based Rewards Improve Preference Learning cs.LG | cs.AI | cs.CL

Tevin Wang, Chenyan Xiong

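TL;DR: AutoRule is a method that automatically extracts rules from human preference feedback and converts them into rule-based rewards, significantly improving RLHF-based learning.

Details

Motivation: Current rule-based reward approaches rely on hand-engineered rules. AutoRule aims to extract rules automatically to improve reinforcement learning from human feedback (RLHF).

Result: A 28.6% relative improvement in length-controlled win rate on AlpacaEval2.0 and a 6.1% relative gain in second-turn performance on a held-out MT-Bench subset.

Insight: AutoRule reduces reward hacking; the extracted rules agree well with dataset preferences and capture the distinct qualities valued in different datasets.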

Abstract: Rule-based rewards offer a promising strategy for improving reinforcement learning from human feedback (RLHF), but current approaches often rely on manual rule engineering. We present AutoRule, a fully automated method for extracting rules from preference feedback and formulating them into rule-based rewards. AutoRule extraction operates in three stages: it leverages a reasoning model to interpret user preferences, identifies candidate rules from the reasoning chain of these interpretations, and synthesizes them into a unified rule set. Leveraging the finalized rule set, we employ language-model verifiers to compute the fraction of rules satisfied by each output, using this metric as an auxiliary reward alongside the learned reward model during policy optimization. Training a Llama-3-8B model with AutoRule results in a 28.6% relative improvement in length-controlled win rate on AlpacaEval2.0, and a 6.1% relative gain in second-turn performance on a held-out MT-Bench subset, compared to a GRPO baseline trained with the same learned reward model but without the rule-based auxiliary reward. Our analysis confirms that the extracted rules exhibit good agreement with dataset preference. We find that AutoRule demonstrates reduced reward hacking compared to a learned reward model when run over two episodes. Finally, our case study suggests that the extracted rules capture unique qualities valued in different datasets. The extracted rules are provided in the appendix, and the code is open-sourced at https://github.com/cxcscmu/AutoRule.
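
The auxiliary reward described in this abstract, the fraction of rules an output satisfies, can be sketched as below. This is a simplified illustration under stated assumptions: the rule verifiers are plain Python predicates standing in for the language-model verifiers, and the additive combination with a hypothetical `rule_weight` is an assumption, since the abstract does not say how the two rewards are combined.

```python
def rule_reward(output, rules):
    """Fraction of rules satisfied by an output (0.0 if there are no rules)."""
    if not rules:
        return 0.0
    return sum(1 for rule in rules if rule(output)) / len(rules)


def combined_reward(output, learned_reward, rules, rule_weight=1.0):
    """Learned reward plus a weighted rule-based auxiliary reward (assumed form)."""
    return learned_reward(output) + rule_weight * rule_reward(output, rules)


# Hypothetical rules, written as simple predicates for illustration.
rules = [
    lambda s: len(s.split()) <= 50,      # stays concise
    lambda s: s.strip().endswith("."),   # ends with a full stop
    lambda s: "as an AI" not in s,       # avoids boilerplate phrasing
]

good = "Paris is the capital of France."
poor = "Well, as an AI I cannot say"
print(rule_reward(good, rules), rule_reward(poor, rules))
```

Because the rule score is bounded between 0 and 1, it acts as a stable side signal next to the learned reward model during policy optimization.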

[102] PIPE: Physics-Informed Position Encoding for Alignment of Satellite Images and Time Series cs.LG | cs.AI | cs.CV

Haobo Li, Eunseo Jung, Zixin Chen, Zhaowei Wang, Yueya Wang

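TL;DR: PIPE is a lightweight method that embeds physical information into vision language models (VLMs) through physics-informed positional encoding, significantly improving multimodal alignment between satellite images and time series as well as forecasting accuracy.

Details

Motivation: Existing methods focus mainly on text data and overlook the physical information in visual data (such as the temporal and geospatial context of satellite imagery). Missing this information limits forecasting ability, especially in climate science.

Result: On the largest open-source satellite image dataset, PIPE achieves SOTA performance against both deep learning forecasting and climate-domain methods, improving typhoon intensity forecasting by 12%.

Insight: Explicitly embedding physical information can significantly improve multimodal model performance, especially in domains sensitive to spatiotemporal context (such as climate forecasting).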

Abstract: Multimodal time series forecasting is foundational in various fields, such as utilizing satellite imagery and numerical data for predicting typhoons in climate science. However, existing multimodal approaches primarily focus on utilizing text data to help time series forecasting, leaving the visual data in existing time series datasets untouched. Furthermore, it is challenging for models to effectively capture the physical information embedded in visual data, such as satellite imagery’s temporal and geospatial context, which extends beyond images themselves. To address this gap, we propose physics-informed positional encoding (PIPE), a lightweight method that embeds physical information into vision language models (VLMs). PIPE introduces two key innovations: (1) a physics-informed positional indexing scheme for mapping physics to positional IDs, and (2) a variant-frequency positional encoding mechanism for encoding frequency information of physical variables and sequential order of tokens within the embedding space. By preserving both the physical information and sequential order information, PIPE significantly improves multimodal alignment and forecasting accuracy. Through the experiments on the most representative and the largest open-sourced satellite image dataset, PIPE achieves state-of-the-art performance in both deep learning forecasting and climate domain methods, demonstrating superiority across benchmarks, including a 12% improvement in typhoon intensity forecasting over prior works. Our code is provided in the supplementary material.

[103] Reinforcing VLMs to Use Tools for Detailed Visual Reasoning Under Resource Constraints cs.LG | cs.AI | cs.CV

Sunil Kumar, Bowen Zhao, Leo Dirac, Paulina Varshavskaya

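TL;DR: The paper proposes training smaller-scale models with GRPO to use external tools (such as zoom), strengthening the detailed visual reasoning of vision-language models (VLMs) under resource constraints.

Details

Motivation: Despite major advances in large-model reasoning, vision-language models (VLMs) still struggle with detailed visual reasoning when resources are limited. The paper aims to address this challenge.

Result: Compared with similarly sized baseline models, the method performs better on some visual question-answering (VQA) tasks, demonstrating the value of detailed visual information gathered through external tools.

Insight: Even under resource constraints, well-designed training of small models can substantially improve detailed visual reasoning in VLMs. Combining them with external tools is an effective strategy.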

Abstract: Despite tremendous recent advances in large model reasoning ability, vision-language models (VLMs) still struggle with detailed visual reasoning, especially when compute resources are limited. To address this challenge, we draw inspiration from methods like Deepseek-r1 for VLMs and train smaller-scale models with Group Relative Policy Optimization (GRPO) to use external tools such as zoom. The greatest benefit is obtained with a combination of GRPO learning, a simple reward structure, a simplified tool-calling interface, allocating additional tokens to the result of the tool call, and a training data mix that over-represents visually difficult examples. Compared to similarly-sized baseline models, our method achieves better performance on some visual question-answering (VQA) tasks, thanks to the detailed visual information gathered from the external tool.

[104] Pixel-level Certified Explanations via Randomized Smoothing cs.LG | cs.AI | cs.CV

Alaa Anani, Tobias Lorenz, Mario Fritz, Bernt Schiele

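TL;DR: The paper proposes the first pixel-level certification framework based on randomized smoothing, guaranteeing pixel-level robustness for any black-box attribution method, and reformulates the task as a segmentation problem by sparsifying and smoothing attribution maps.

Details

Motivation: Existing post-hoc attribution methods are highly non-robust when explaining deep learning predictions: small input perturbations can drastically change the attribution map while leaving the prediction unchanged, which undermines their trustworthiness.

Result: An extensive evaluation of 12 attribution methods across 5 ImageNet models shows that the certified attributions are robust, interpretable, and faithful.

Insight: Randomized smoothing provides theoretical robustness guarantees for attribution methods, making them more useful in trustworthy downstream tasks.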

Abstract: Post-hoc attribution methods aim to explain deep learning predictions by highlighting influential input pixels. However, these explanations are highly non-robust: small, imperceptible input perturbations can drastically alter the attribution map while maintaining the same prediction. This vulnerability undermines their trustworthiness and calls for rigorous robustness guarantees of pixel-level attribution scores. We introduce the first certification framework that guarantees pixel-level robustness for any black-box attribution method using randomized smoothing. By sparsifying and smoothing attribution maps, we reformulate the task as a segmentation problem and certify each pixel’s importance against $\ell_2$-bounded perturbations. We further propose three evaluation metrics to assess certified robustness, localization, and faithfulness. An extensive evaluation of 12 attribution methods across 5 ImageNet models shows that our certified attributions are robust, interpretable, and faithful, enabling reliable use in downstream tasks. Our code is at https://github.com/AlaaAnani/certified-attributions.

Table of Contents

cs.CV [Back]

[1] SemIRNet: A Semantic Irony Recognition Network for Multimodal Sarcasm Detection cs.CV | cs.CL | cs.LG

[2] Argus Inspection: Do Multimodal Large Language Models Possess the Eye of Panoptes? cs.CV | cs.AI | cs.CL | cs.LG | cs.MM

[3] A Hybrid ConvNeXt-EfficientNet AI Solution for Precise Falcon Disease Detection cs.CV

[4] ViLLa: A Neuro-Symbolic approach for Animal Monitoring cs.CV | cs.AI

[5] GraphGSOcc: Semantic and Geometric Graph Transformer for 3D Gaussian Splating-based Occupancy Prediction cs.CV | cs.AI

[6] DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning cs.CV | cs.AI

[7] ArchShapeNet:An Interpretable 3D-CNN Framework for Evaluating Architectural Shapes cs.CV | cs.AI

[8] Real-Time, Low-Latency Surveillance Using Entropy-Based Adaptive Buffering and MobileNetV2 on Edge Devices cs.CV | cs.AI

[9] MonoVQD: Monocular 3D Object Detection with Variational Query Denoising and Self-Distillation cs.CV

[10] Improved Iterative Refinement for Chart-to-Code Generation via Structured Instruction cs.CV | cs.AI

[11] PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers cs.CV | cs.AI

[12] Efficient Retail Video Annotation: A Robust Key Frame Generation Approach for Product and Customer Interaction Analysis cs.CV | cs.AI | cs.HC | cs.LG

[13] Peering into the Unknown: Active View Selection with Neural Uncertainty Maps for 3D Reconstruction cs.CV | cs.AI

[14] PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning cs.CV | cs.AI

[15] Frequency-Calibrated Membership Inference Attacks on Medical Image Diffusion Models cs.CV | cs.LG

[16] Vision Transformers for End-to-End Quark-Gluon Jet Classification from Calorimeter Images cs.CV

[17] Advances in Compliance Detection: Novel Models Using Vision-Based Tactile Sensors cs.CV | cs.RO | I.2.9

[18] SynPo: Boosting Training-Free Few-Shot Medical Segmentation via High-Quality Negative Prompts cs.CV

[19] ReSeDis: A Dataset for Referring-based Object Search across Large-Scale Image Collections cs.CV

[20] Privacy-Shielded Image Compression: Defending Against Exploitation from Vision-Language Pretrained Models cs.CV

[21] DM-FNet: Unified multimodal medical image fusion via diffusion process-trained encoder-decoder cs.CV

[22] video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models cs.CV | cs.CL | cs.SD

[23] Convolutional Feature Enhancement and Attention Fusion BiFPN for Ship Detection in SAR Images cs.CV

[24] RA-NeRF: Robust Neural Radiance Field Reconstruction with Accurate Camera Pose Estimation under Complex Trajectories cs.CV

[25] Retrospective Memory for Camouflaged Object Detection cs.CV

[26] Domain Adaptation for Image Classification of Defects in Semiconductor Manufacturing cs.CV | cs.AI

[27] MSNeRV: Neural Video Representation with Multi-Scale Feature Fusion cs.CV | cs.MM | eess.IV

[28] BCRNet: Enhancing Landmark Detection in Laparoscopic Liver Surgery via Bezier Curve Refinement cs.CV

[29] AI-driven visual monitoring of industrial assembly tasks cs.CV

[30] MEGC2025: Micro-Expression Grand Challenge on Spot Then Recognize and Visual Question Answering cs.CV | cs.MM

[31] MapFM: Foundation Model-Driven HD Mapping with Multi-Task Contextual Learning cs.CV | cs.AI

[32] OpenPath: Open-Set Active Learning for Pathology Image Classification via Pre-trained Vision-Language Models cs.CV

[33] Open-World Object Counting in Videos cs.CV | cs.AI

[34] Unsupervised Pelage Pattern Unwrapping for Animal Re-identification cs.CV

[35] NERO: Explainable Out-of-Distribution Detection with Neuron-level Relevance cs.CV | cs.LG

[36] Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material cs.CV | cs.AI

[37] Multimodal Large Language Models for Medical Report Generation via Customized Prompt Tuning cs.CV

[38] GenHOI: Generalizing Text-driven 4D Human-Object Interaction Synthesis for Unseen Objects cs.CV | cs.AI

[39] NTIRE 2025 Image Shadow Removal Challenge Report cs.CV

[40] RaCalNet: Radar Calibration Network for Sparse-Supervised Metric Depth Estimation cs.CV | cs.RO

[41] Show-o2: Improved Native Unified Multimodal Models cs.CV

[42] Baltimore Atlas: FreqWeaver Adapter for Semi-supervised Ultra-high Spatial Resolution Land Cover Classification cs.CV

[43] A Unified Graph-based Framework for Scalable 3D Tree Reconstruction and Non-Destructive Biomass Estimation from Point Clouds cs.CV

[44] One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution cs.CV | cs.AI

[45] Mono-Modalizing Extremely Heterogeneous Multi-Modal Medical Image Registration cs.CV | I.4.5; I.4.9; J.3

[46] BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion cs.CV

[47] HOIDiNi: Human-Object Interaction through Diffusion Noise Optimization cs.CV

[48] FindingDory: A Benchmark to Evaluate Memory in Embodied Agents cs.CV | cs.RO

[49] Demystifying the Visual Quality Paradox in Multimodal Large Language Models cs.CV | cs.AI

[50] Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning cs.CV | cs.LG

[51] UniRelight: Learning Joint Decomposition and Synthesis for Video Relighting cs.CV

[52] Sekai: A Video Dataset towards World Exploration cs.CV | cs.AI

[53] Evolutionary Caching to Accelerate Your Off-the-Shelf Diffusion Model cs.CV

cs.CL

[54] Combining Constrained and Unconstrained Decoding via Boosting: BoostCD and Its Application to Information Extraction cs.CL

[55] CrEst: Credibility Estimation for Contexts in LLMs via Weak Supervision cs.CL | cs.LG

[56] MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance cs.CL | cs.AI

[57] Improving Dialogue Discourse Parsing through Discourse-aware Utterance Clarification cs.CL | cs.AI

[58] CKD-EHR:Clinical Knowledge Distillation for Electronic Health Records cs.CL

[59] ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs cs.CL

[60] MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs cs.CL

[61] Research on Graph-Retrieval Augmented Generation Based on Historical Text Knowledge Graphs cs.CL

[62] TopClustRAG at SIGIR 2025 LiveRAG Challenge cs.CL

[63] Cohort Discovery: A Survey on LLM-Assisted Clinical Trial Recruitment cs.CL | cs.AI

[64] DeVisE: Behavioral Testing of Medical Large Language Models cs.CL

[65] SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models’ Knowledge of Indian Culture cs.CL

[66] COSMMIC: Comment-Sensitive Multimodal Multilingual Indian Corpus for Summarization and Headline Generation cs.CL

[67] Targeted Lexical Injection: Unlocking Latent Cross-Lingual Alignment in Lugha-Llama via Early-Layer LoRA Fine-Tuning cs.CL | 68T50 | I.2.7; I.2.6

[68] Understanding GUI Agent Localization Biases through Logit Sharpness cs.CL

[69] AgentGroupChat-V2: Divide-and-Conquer Is What LLM-Based Multi-Agent System Need cs.CL

[70] RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation cs.CL | cs.AI

[71] Context-Informed Grounding Supervision cs.CL | cs.AI

[72] SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling cs.CL | cs.AI | cs.LG

[73] Lessons from Training Grounded LLMs with Verifiable Rewards cs.CL

[74] SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification cs.CL

[75] DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement cs.CL

[76] WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts cs.CL | cs.AI | cs.LG

[77] Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability cs.CL | cs.AI