Table of Contents
- cs.CV [Total: 62]
- cs.CL [Total: 11]
- cs.GR [Total: 2]
- cs.SD [Total: 2]
- cs.RO [Total: 1]
- eess.IV [Total: 3]
- cs.SI [Total: 2]
- cs.IR [Total: 2]
- cs.LG [Total: 5]
- cs.HC [Total: 2]
- eess.AS [Total: 3]
- cs.AI [Total: 1]
cs.CV
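[1] Post-Disaster Affected Area Segmentation with a Vision Transformer (ViT)-based EVAP Model using Sentinel-2 and Formosat-5 Imagery cs.CV | cs.AI
Yi-Shan Chu, Hsuan-Cheng Wei
TL;DR: Proposes a ViT-based deep learning framework that refines disaster-affected area segmentation from remote sensing imagery. It supports the EVAP product developed by the Taiwan Space Agency and combines weakly supervised training with multiple decoder variants to improve performance.
Details
Motivation: Existing disaster-affected area segmentation methods perform poorly when accurate annotations are unavailable; smoother and more reliable segmentation is needed to support disaster emergency response.
Result: In case studies on the 2022 Poyang Lake drought and the 2023 Rhodes wildfire, the model improved the smoothness and consistency of segmentation results, demonstrating the feasibility of the approach.
Insight: Combining ViT with weakly supervised learning enables reliable disaster-area segmentation without precise annotations, offering a scalable solution for disaster response.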
Abstract: We propose a vision transformer (ViT)-based deep learning framework to refine disaster-affected area segmentation from remote sensing imagery, aiming to support and enhance the Emergent Value Added Product (EVAP) developed by the Taiwan Space Agency (TASA). The process starts with a small set of manually annotated regions. We then apply principal component analysis (PCA)-based feature space analysis and construct a confidence index (CI) to expand these labels, producing a weakly supervised training set. These expanded labels are then used to train ViT-based encoder-decoder models with multi-band inputs from Sentinel-2 and Formosat-5 imagery. Our architecture supports multiple decoder variants and multi-stage loss strategies to improve performance under limited supervision. During the evaluation, model predictions are compared with higher-resolution EVAP output to assess spatial coherence and segmentation consistency. Case studies on the 2022 Poyang Lake drought and the 2023 Rhodes wildfire demonstrate that our framework improves the smoothness and reliability of segmentation results, offering a scalable approach for disaster mapping when accurate ground truth is unavailable.
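[2] Toward a Real-Time Framework for Accurate Monocular 3D Human Pose Estimation with Geometric Priors cs.CV | cs.AI
Mohamed Adjel
TL;DR: Proposes a framework that combines real-time 2D keypoint detection with geometry-aware 2D-to-3D lifting, using camera intrinsics and anatomical priors to achieve fast, personalized monocular 3D human pose estimation.
Details
Motivation: Monocular 3D human pose estimation remains a challenging, ill-posed problem in real-time settings and unconstrained environments, and direct approaches require large annotated datasets and heavy models. The paper aims to improve accuracy and interpretability by combining data-driven learning with model-based priors.
Result: The paper is a proposal: it discusses how the approach can enable fast, accurate 3D pose estimation without specialized hardware, targeting edge devices.
Insight: Argues that combining data-driven learning with model-based priors can improve the accuracy and real-time performance of monocular 3D pose estimation while also improving interpretability and deployability.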
Abstract: Monocular 3D human pose estimation remains a challenging and ill-posed problem, particularly in real-time settings and unconstrained environments. While direct image-to-3D approaches require large annotated datasets and heavy models, 2D-to-3D lifting offers a more lightweight and flexible alternative, especially when enhanced with prior knowledge. In this work, we propose a framework that combines real-time 2D keypoint detection with geometry-aware 2D-to-3D lifting, explicitly leveraging known camera intrinsics and subject-specific anatomical priors. Our approach builds on recent advances in self-calibration and biomechanically-constrained inverse kinematics to generate large-scale, plausible 2D-3D training pairs from MoCap and synthetic datasets. We discuss how these ingredients can enable fast, personalized, and accurate 3D pose estimation from monocular images without requiring specialized hardware. This proposal aims to foster discussion on bridging data-driven learning and model-based priors to improve accuracy, interpretability, and deployability of 3D human motion capture on edge devices in the wild.
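[3] Coarse-to-fine crack cue for robust crack detection cs.CV | cs.NE | eess.IV
Zelong Liu, Yuliang Gu, Zhichao Sun, Huachao Zhu, Xin Xiao
TL;DR: Proposes CrackCue, a method based on coarse-to-fine crack cue generation that exploits the thin-structure property of cracks to improve the robustness and generalization of crack detection.
Details
Motivation: Current deep learning methods generalize poorly in crack detection and overlook the thin-structure property of cracks, so a more robust method is needed.
Result: Experiments show that CrackCue significantly improves baseline methods and performs well under complex backgrounds, shadows, and varied lighting.
Insight: The thin-structure property of cracks is key to robust detection, and coarse-to-fine cue generation effectively brings this property into the detection task.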
Abstract: Crack detection is an important task in computer vision. Despite impressive in-dataset performance, deep learning-based methods still struggle to generalize to unseen domains. The thin structure property of cracks is usually overlooked by previous methods. In this work, we introduce CrackCue, a novel method for robust crack detection based on coarse-to-fine crack cue generation. The core concept lies in leveraging the thin structure property to generate a robust crack cue that guides crack detection. Specifically, we first employ a simple max-pooling and upsampling operation on the crack image. This results in a coarse crack-free background, based on which a fine crack-free background can be obtained via a reconstruction network. The difference between the original image and the fine crack-free background provides a fine crack cue. This fine cue embeds robust crack prior information which is unaffected by complex backgrounds, shadow, and varied lighting. As a plug-and-play method, we incorporate the proposed CrackCue into three advanced crack detection networks. Extensive experimental results demonstrate that the proposed CrackCue significantly improves the generalization ability and robustness of the baseline methods. The source code will be publicly available.
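The max-pooling and upsampling step in the CrackCue abstract can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: it assumes cracks are darker than the surrounding surface (so max-pooling erases them), uses non-overlapping pooling with nearest-neighbour upsampling, and takes the difference against the coarse background directly, omitting the reconstruction network that produces the fine background. All function names are hypothetical.

```python
def max_pool(img, k):
    """Non-overlapping k x k max pooling over a 2D list of floats."""
    h, w = len(img), len(img[0])
    return [
        [
            max(
                img[y][x]
                for y in range(r, min(r + k, h))
                for x in range(c, min(c + k, w))
            )
            for c in range(0, w, k)
        ]
        for r in range(0, h, k)
    ]


def upsample_nearest(small, k, h, w):
    """Nearest-neighbour upsampling back to the original h x w size."""
    return [[small[y // k][x // k] for x in range(w)] for y in range(h)]


def coarse_crack_cue(img, k=4):
    """Return (coarse crack-free background, background minus image)."""
    h, w = len(img), len(img[0])
    background = upsample_nearest(max_pool(img, k), k, h, w)
    cue = [[background[y][x] - img[y][x] for x in range(w)] for y in range(h)]
    return background, cue


# Toy 8x8 bright surface (1.0) with a one-pixel-wide dark crack (0.0) in column 3.
image = [[0.0 if x == 3 else 1.0 for x in range(8)] for y in range(8)]
background, cue = coarse_crack_cue(image, k=4)
# The thin crack vanishes from the background and shows up only in the cue.
```

The pooling window `k` has to be wider than the crack for the background to come out crack-free; that choice is an assumption of this sketch rather than a value taken from the paper.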
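[4] CLAMP: Contrastive Learning with Adaptive Multi-loss and Progressive Fusion for Multimodal Aspect-Based Sentiment Analysis cs.CV | cs.AI
Xiaoqiang He
TL;DR: CLAMP is an end-to-end contrastive learning framework for multimodal aspect-based sentiment analysis. It addresses cross-modal alignment noise and inconsistent fine-grained representations through progressive attention fusion, multi-task contrastive learning, and adaptive multi-loss aggregation.
Details
Motivation: Existing methods for multimodal aspect-based sentiment analysis suffer from cross-modal alignment noise and insufficient consistency in fine-grained representations; in particular, global modality alignment methods overlook the link between aspect terms and local visual regions.
Result: On standard public benchmarks, CLAMP outperforms the vast majority of existing state-of-the-art methods.
Insight: Addressing alignment noise and consistency is key in multimodal sentiment analysis, and dynamic loss calibration and progressive fusion effectively improve model performance.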
Abstract: Multimodal aspect-based sentiment analysis (MABSA) seeks to identify aspect terms within paired image-text data and determine their fine-grained sentiment polarities, representing a fundamental task for improving the effectiveness of applications such as product review systems and public opinion monitoring. Existing methods face challenges such as cross-modal alignment noise and insufficient consistency in fine-grained representations. Global modality alignment methods often overlook the connection between aspect terms and their corresponding local visual regions, and bridging the representation gap between text and images remains a challenge. To address these limitations, this paper introduces an end-to-end Contrastive Learning framework with Adaptive Multi-loss and Progressive Attention Fusion (CLAMP). The framework is composed of three novel modules: Progressive Attention Fusion network, Multi-task Contrastive Learning, and Adaptive Multi-loss Aggregation. The Progressive Attention Fusion network enhances fine-grained alignment between textual features and image regions via hierarchical, multi-stage cross-modal interactions, effectively suppressing irrelevant visual noise. Multi-task Contrastive Learning combines global modal contrast and local granularity alignment to enhance cross-modal representation consistency. Adaptive Multi-loss Aggregation employs a dynamic uncertainty-based weighting mechanism to calibrate loss contributions according to each task’s uncertainty, thereby mitigating gradient interference. Evaluation on standard public benchmarks demonstrates that CLAMP consistently outperforms the vast majority of existing state-of-the-art methods.
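To make the idea of uncertainty-based loss weighting concrete, here is a minimal sketch of one widely used formulation, in which each task i has a learnable log-variance s_i and the total loss is the sum of exp(-s_i) * L_i + s_i. The abstract does not give CLAMP's exact formula, so this form and the function name are assumptions for illustration only.

```python
import math


def uncertainty_weighted_loss(losses, log_vars):
    """Combine per-task losses using learnable log-variances.

    A task with a high log-variance (high uncertainty) is down-weighted by
    exp(-s), while the additive +s term stops every s from growing without
    bound. In training, log_vars would be optimised together with the model.
    """
    if len(losses) != len(log_vars):
        raise ValueError("need exactly one log-variance per task loss")
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


# With all log-variances at zero the result is the plain sum of the losses.
equal = uncertainty_weighted_loss([1.0, 2.0], [0.0, 0.0])  # 3.0
# Raising the second task's log-variance to ln(2) halves its weight.
damped = uncertainty_weighted_loss([1.0, 2.0], [0.0, math.log(2.0)])
```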
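[5] SIA: Enhancing Safety via Intent Awareness for Vision-Language Models cs.CV | cs.AI
Youngjin Na, Sangheon Jeong, Youngwan Lee
TL;DR: SIA is a training-free prompt engineering framework that improves the safety of vision-language models (VLMs) through intent awareness, proactively detecting and mitigating potential risks in multimodal inputs.
Details
Motivation: Existing approaches based on post hoc filtering or static refusal prompts struggle to detect latent harmful intent in multimodal inputs, especially when harmfulness arises only from the combination of inputs.
Result: On several safety-critical benchmarks (SIUO, MM-SafetyBench, HoliSafe), SIA substantially improves safety and outperforms prior methods.
Insight: Intent-aware reasoning improves VLM safety but can slightly reduce general reasoning accuracy; the safety gains are substantial.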
Abstract: As vision-language models (VLMs) are increasingly deployed in real-world applications, new safety risks arise from the subtle interplay between images and text. In particular, seemingly innocuous inputs can combine to reveal harmful intent, leading to unsafe model responses. Despite increasing attention to multimodal safety, previous approaches based on post hoc filtering or static refusal prompts struggle to detect such latent risks, especially when harmfulness emerges only from the combination of inputs. We propose SIA (Safety via Intent Awareness), a training-free prompt engineering framework that proactively detects and mitigates harmful intent in multimodal inputs. SIA employs a three-stage reasoning process: (1) visual abstraction via captioning, (2) intent inference through few-shot chain-of-thought prompting, and (3) intent-conditioned response refinement. Rather than relying on predefined rules or classifiers, SIA dynamically adapts to the implicit intent inferred from the image-text pair. Through extensive experiments on safety-critical benchmarks including SIUO, MM-SafetyBench, and HoliSafe, we demonstrate that SIA achieves substantial safety improvements, outperforming prior methods. Although SIA shows a minor reduction in general reasoning accuracy on MMStar, the corresponding safety gains highlight the value of intent-aware reasoning in aligning VLMs with human-centric values.
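The three-stage process described in the SIA abstract (captioning, intent inference, intent-conditioned refinement) can be sketched as three chained model calls. This is a minimal illustration under stated assumptions: `model` is a hypothetical text-in, text-out callable standing in for a VLM, `image_ref` is a placeholder for however the image is actually passed, and the prompt wording is invented for the example rather than taken from the paper.

```python
from typing import Callable, Sequence


def sia_respond(
    image_ref: str,
    user_text: str,
    model: Callable[[str], str],
    few_shot: Sequence[str] = (),
) -> str:
    """Chain the three stages; each stage feeds its output to the next."""
    # Stage 1: visual abstraction via captioning.
    caption = model(f"Describe the image {image_ref} objectively.")

    # Stage 2: intent inference with few-shot chain-of-thought prompting.
    examples = "\n".join(few_shot)
    intent = model(
        f"{examples}\nCaption: {caption}\nRequest: {user_text}\n"
        "Think step by step about what the user intends."
    )

    # Stage 3: intent-conditioned response refinement.
    return model(
        f"Caption: {caption}\nRequest: {user_text}\nInferred intent: {intent}\n"
        "Answer helpfully; decline or redirect if the intent is harmful."
    )


# Stub model that records prompts, so the data flow can be inspected offline.
calls = []


def fake_model(prompt: str) -> str:
    calls.append(prompt)
    return f"reply-{len(calls)}"


answer = sia_respond("img.png", "How do I use this?", fake_model)
```

Because the pipeline only issues prompts, it needs no training, which matches the training-free framing in the abstract; the cost is three model calls per request.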
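[6] Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection cs.CV | cs.AI
Xiang Li
TL;DR: Proposes using 2D priors to calibrate LiDAR and camera features, improving the robustness of 3D detection through local and global alignment and achieving state-of-the-art performance on nuScenes.
Details
Motivation: Misalignment between LiDAR and camera features degrades 3D detection performance; the paper uses 2D object priors to address it.
Result: On the nuScenes validation set, mAP and NDS reach 71.5% and 73.6%, respectively, achieving state-of-the-art performance.
Insight: 2D object boundary information can substantially improve the alignment of LiDAR and camera features and thereby the robustness of 3D detection.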
Abstract: Integrating LiDAR and camera inputs into a unified Bird’s-Eye-View (BEV) representation is crucial for enhancing 3D perception capabilities of autonomous vehicles. However, current methods are often affected by misalignment between camera and LiDAR features. This misalignment leads to inaccurate depth supervision in the camera branch and erroneous fusion during cross-modal feature aggregation. The root cause of this misalignment lies in projection errors, stemming from minor extrinsic calibration inaccuracies and the rolling shutter effect of LiDAR during vehicle motion. In this work, our key insight is that these projection errors are predominantly concentrated at object-background boundaries, which are readily identified by 2D detectors. Based on this, our main motivation is to utilize 2D object priors to pre-align cross-modal features before fusion. To address local misalignment, we propose Prior Guided Depth Calibration (PGDC), which leverages 2D priors to correct local misalignment and preserve correct cross-modal feature pairs. To resolve global misalignment, we introduce Discontinuity Aware Geometric Fusion (DAGF) to process calibrated results from PGDC, suppressing noise and explicitly enhancing sharp transitions at object-background boundaries. To effectively utilize these transition-aware depth representations, we incorporate Structural Guidance Depth Modulator (SGDM), using a gated attention mechanism to efficiently fuse aligned depth and image features. Our proposed method achieves state-of-the-art performance on the nuScenes validation set, with its mAP and NDS reaching 71.5% and 73.6% respectively.
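[7] Pixels, Patterns, but No Poetry: To See The World like Humans cs.CV | cs.AI | cs.CL
Hongcheng Gao, Zihao Huang, Lin Xu, Jingyi Tang, Xinhao Li
TL;DR: Introduces the Turing Eye Test (TET), a benchmark for evaluating the perceptual abilities of multimodal large language models (MLLMs), and reveals major failures of current MLLMs on tasks that humans process intuitively.
Details
Motivation: Research on MLLMs has focused mainly on reasoning and has neglected perception. The paper asks whether MLLMs can truly perceive the world as humans do.
Result: Existing MLLMs perform extremely poorly on TET tasks, while fine-tuning the vision tower enables rapid adaptation, suggesting that vision-tower generalization is a key gap between current MLLMs and human perception.
Insight: Weak vision-tower generalization is central to the poor perception of MLLMs, and future work should focus more on improving the vision tower.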
Abstract: Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhancing reasoning capabilities in MLLMs, a fundamental question persists: Can Multimodal Large Language Models truly perceive the world as humans do? This paper shifts focus from reasoning to perception. Rather than constructing benchmarks specifically for reasoning, we introduce the Turing Eye Test (TET), a challenging perception-oriented benchmark comprising four diagnostic tasks that evaluate MLLMs’ performance on synthetic images that humans process intuitively. Our findings reveal that state-of-the-art MLLMs exhibit catastrophic failures on our perceptual tasks, which are trivial for humans. Both in-context learning and training on the language backbone, which are effective for previous benchmarks, fail to improve performance on our tasks, while fine-tuning the vision tower enables rapid adaptation, suggesting that our benchmark poses challenges for vision tower generalization rather than for the knowledge and reasoning capabilities of the language backbone, a key gap between current MLLMs and human perception. We release a representative subset of TET tasks in this version, and will introduce more diverse tasks and methods to enhance visual generalization in future work.
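[8] HIPPO-Video: Simulating Watch Histories with Large Language Models for Personalized Video Highlighting cs.CV | cs.AI
Jeongeun Lee, Youngjae Yu, Dongha Lee
TL;DR: HIPPO-Video uses a large language model to generate personalized watch histories and proposes HiPHer, which uses these histories to predict preference-conditioned saliency scores for video segments and outperforms existing methods.
Details
Motivation: The explosive growth of video content makes personalized video highlighting an important task, but existing datasets lack personalization and fail to capture the complexity of user behavior.
Result: Experiments show that HiPHer outperforms generic and query-based methods, validating its effectiveness for personalized video highlighting.
Insight: LLMs can simulate complex user behavior to generate realistic datasets, and personalized history data is crucial for video highlighting.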
Abstract: The exponential growth of video content has made personalized video highlighting an essential task, as user preferences are highly variable and complex. Existing video datasets, however, often lack personalization, relying on isolated videos or simple text queries that fail to capture the intricacies of user behavior. In this work, we introduce HIPPO-Video, a novel dataset for personalized video highlighting, created using an LLM-based user simulator to generate realistic watch histories reflecting diverse user preferences. The dataset includes 2,040 (watch history, saliency score) pairs, covering 20,400 videos across 170 semantic categories. To validate our dataset, we propose HiPHer, a method that leverages these personalized watch histories to predict preference-conditioned segment-wise saliency scores. Through extensive experiments, we demonstrate that our method outperforms existing generic and query-based approaches, showcasing its potential for highly user-centric video highlighting in real-world scenarios.
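[9] ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension cs.CV | cs.AI | cs.CL
Yizhi Hu, Zezhao Tian, Xingqun Qi, Chen Su, Bingkun Yang
TL;DR: Proposes ReMeREC, a new framework for multi-entity referring expression comprehension (REC). It builds the relation-aware dataset ReMeX and the auxiliary dataset EntityText, and combines a Text-adaptive Multi-entity Perceptron (TMP) with an Entity Inter-relationship Reasoner (EIR), substantially improving the accuracy of multi-entity localization and relational reasoning.
Details
Motivation: Existing REC methods focus on single-entity localization and neglect the complex inter-entity relationships in multi-entity scenes, and high-quality relation-annotated datasets are lacking. This limits model accuracy and hinders further research.
Result: Experiments on four benchmark datasets show that ReMeREC outperforms existing methods by a large margin on multi-entity grounding and relation prediction.
Insight: 1. Multi-entity REC requires attention to both entity localization and relational reasoning.
2. Fine-grained textual and relational annotations are crucial to model performance.
3. Combining dynamic inference with relation modeling effectively improves referring comprehension in multi-entity scenes.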
Abstract: Referring Expression Comprehension (REC) aims to localize specified entities or regions in an image based on natural language descriptions. While existing methods handle single-entity localization, they often ignore complex inter-entity relationships in multi-entity scenes, limiting their accuracy and reliability. Additionally, the lack of high-quality datasets with fine-grained, paired image-text-relation annotations hinders further progress. To address this challenge, we first construct a relation-aware, multi-entity REC dataset called ReMeX, which includes detailed relationship and textual annotations. We then propose ReMeREC, a novel framework that jointly leverages visual and textual cues to localize multiple entities while modeling their inter-relations. To address the semantic ambiguity caused by implicit entity boundaries in language, we introduce the Text-adaptive Multi-entity Perceptron (TMP), which dynamically infers both the quantity and span of entities from fine-grained textual cues, producing distinctive representations. Additionally, our Entity Inter-relationship Reasoner (EIR) enhances relational reasoning and global scene understanding. To further improve language comprehension for fine-grained prompts, we also construct a small-scale auxiliary dataset, EntityText, generated using large language models. Experiments on four benchmark datasets show that ReMeREC achieves state-of-the-art performance in multi-entity grounding and relation prediction, outperforming existing approaches by a large margin.
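[10] CausalStep: A Benchmark for Explicit Stepwise Causal Reasoning in Videos cs.CV | cs.AI
Xuchen Li, Xuzhao Li, Shiyu Hu, Kaiqi Huang, Wentao Zhang
TL;DR: CausalStep is a benchmark designed for explicit stepwise causal reasoning in videos. By segmenting videos into causally linked units and enforcing a strict stepwise question-answer protocol, it evaluates models comprehensively and reveals the gap between current models and human reasoning.
Details
Motivation: Existing video benchmarks mainly assess shallow understanding and reasoning and allow models to exploit global context, so they fail to rigorously evaluate true causal and stepwise reasoning. CausalStep was developed to address this.
Result: Experiments show a significant gap in stepwise reasoning between leading proprietary and open-source models and human baselines.
Insight: CausalStep provides a rigorous evaluation standard for video reasoning and highlights the need for models to improve stepwise and causal reasoning in order to achieve more robust and interpretable video reasoning.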
Abstract: Recent advances in large language models (LLMs) have improved reasoning in text and image domains, yet achieving robust video reasoning remains a significant challenge. Existing video benchmarks mainly assess shallow understanding and reasoning and allow models to exploit global context, failing to rigorously evaluate true causal and stepwise reasoning. We present CausalStep, a benchmark designed for explicit stepwise causal reasoning in videos. CausalStep segments videos into causally linked units and enforces a strict stepwise question-answer (QA) protocol, requiring sequential answers and preventing shortcut solutions. Each question includes carefully constructed distractors based on error type taxonomy to ensure diagnostic value. The benchmark features 100 videos across six categories and 1,852 multiple-choice QA pairs. We introduce seven diagnostic metrics for comprehensive evaluation, enabling precise diagnosis of causal reasoning capabilities. Experiments with leading proprietary and open-source models, as well as human baselines, reveal a significant gap between current models and human-level stepwise reasoning. CausalStep provides a rigorous benchmark to drive progress in robust and interpretable video reasoning.
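[11] AURA: A Multi-Modal Medical Agent for Understanding, Reasoning & Annotation cs.CV | cs.LG | cs.MA
Nima Fathi, Amar Kumar, Tal Arbel
TL;DR: AURA is a multimodal medical agent that advances the transparency and adaptability of medical image analysis through visual-linguistic explainability, dynamic interaction, and hypothesis testing.
Details
Motivation: The application of large language models (LLMs) to medical imaging is still at an early stage, and more transparent, adaptable, and clinically aligned AI systems are needed. AURA aims to fill this gap.
Result: AURA provides comprehensive analysis and explanation of medical images, improving the transparency and clinical alignment of AI systems.
Insight: Agentic AI has potential in medical image analysis and can drive a shift from static prediction to interactive decision support.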
Abstract: Recent advancements in Large Language Models (LLMs) have catalyzed a paradigm shift from static prediction systems to agentic AI systems capable of reasoning, interacting with tools, and adapting to complex tasks. While LLM-based agentic systems have shown promise across many domains, their application to medical imaging remains in its infancy. In this work, we introduce AURA, the first visual linguistic explainability agent designed specifically for comprehensive analysis, explanation, and evaluation of medical images. By enabling dynamic interactions, contextual explanations, and hypothesis testing, AURA represents a significant advancement toward more transparent, adaptable, and clinically aligned AI systems. We highlight the promise of agentic AI in transforming medical image analysis from static predictions to interactive decision support. Leveraging Qwen-32B, an LLM-based architecture, AURA integrates a modular toolbox comprising: (i) a segmentation suite with phase grounding, pathology segmentation, and anatomy segmentation to localize clinically meaningful regions; (ii) a counterfactual image-generation module that supports reasoning through image-level explanations; and (iii) a set of evaluation tools including pixel-wise difference-map analysis, classification, and advanced state-of-the-art components to assess diagnostic relevance and visual interpretability.
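[12] Toward Long-Tailed Online Anomaly Detection through Class-Agnostic Concepts cs.CV
Chiao-An Yang, Kuan-Chuan Peng, Raymond A. Yeh
TL;DR: Studies the new task of long-tailed online anomaly detection (LTOAD), finds that existing offline long-tailed anomaly detection methods cannot be applied directly in the online setting, and proposes a class-agnostic framework that is then adapted to online learning.
Details
Motivation: Real-world anomaly detection typically lacks abnormal samples, has long-tailed data distributions, and needs online learning. Existing offline methods rely on class labels and cannot be applied directly.
Result: The method outperforms SOTA baselines in most offline settings across the industrial manufacturing and medical domains (for example, +4.63% image-AUROC on MVTec) and gains +0.53% image-AUROC in the most challenging long-tailed online setting.
Insight: Class-agnostic design and adaptation to online learning are key, offering a new approach to anomaly detection under long-tailed distributions.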
Abstract: Anomaly detection (AD) identifies the defect regions of a given image. Recent works have studied AD, focusing on learning AD without abnormal images, with long-tailed distributed training data, and using a unified model for all classes. In addition, online AD learning has also been explored. In this work, we expand in both directions to a realistic setting by considering the novel task of long-tailed online AD (LTOAD). We first identify that the offline state-of-the-art LTAD methods cannot be directly applied to the online setting. Specifically, LTAD is class-aware, requiring class labels that are not available in the online setting. To address this challenge, we propose a class-agnostic framework for LTAD and then adapt it to our online learning setting. Our method outperforms the SOTA baselines in most offline LTAD settings, including both the industrial manufacturing and the medical domain. In particular, we observe +4.63% image-AUROC on MVTec even compared to methods that have access to class labels and the number of classes. In the most challenging long-tailed online setting, we achieve +0.53% image-AUROC compared to baselines. Our LTOAD benchmark is released here: https://doi.org/10.5281/zenodo.16283852
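The image-AUROC figures quoted above are areas under the ROC curve computed from one anomaly score per image. As a rough sketch of what that metric measures, the pairwise (rank-based) form below returns the probability that a randomly chosen abnormal image receives a higher score than a randomly chosen normal one, counting ties as half. How a method turns an image into a single score is method-specific and not shown; the function name and toy scores are illustrative assumptions.

```python
def image_auroc(scores, labels):
    """AUROC from per-image anomaly scores; labels are 1 (abnormal) or 0 (normal)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one normal and one abnormal image")
    # Count abnormal/normal pairs ranked correctly; a tie counts as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Four images: two normal (label 0) and two abnormal (label 1).
auroc = image_auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])  # 0.75
```

The quadratic pair count is fine for a toy example; on a real benchmark a sort-based implementation from an established metrics library would be the usual choice.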
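[13] Divisive Decisions: Improving Salience-Based Training for Generalization in Binary Classification Tasks cs.CV | cs.LG
Jacob Piland, Chris Sweet, Adam Czajka
TL;DR: Proposes improved saliency-guided training that uses the class activation maps (CAMs) of both the true class and the false class to improve the generalization of deep learning models on binary classification tasks.
Details
Motivation: Existing saliency-guided training uses only the true-class CAM and ignores false-class CAMs. The paper hypothesizes that in binary tasks the true-class and false-class CAMs should diverge on important features, and uses this divergence to improve training.
Result: On synthetic face detection, biometric presentation attack detection, and chest X-ray anomaly classification, the new methods improve generalization over traditional training that uses only the true-class CAM.
Insight: False-class CAM information matters for training as well; explicitly exploiting its divergence from the true-class CAM better guides the model toward discriminative features.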
Abstract: Existing saliency-guided training approaches improve model generalization by incorporating a loss term that compares the model’s class activation map (CAM) for a sample’s true class (i.e., the correct-label class) against a human reference saliency map. However, prior work has ignored the false-class CAM(s), that is, the model’s saliency obtained for the incorrect-label class. We hypothesize that in binary tasks the true and false CAMs should diverge on the important classification features identified by humans (and reflected in human saliency maps). We use this hypothesis to motivate three new saliency-guided training methods incorporating both the true- and false-class CAMs into the training strategy and a novel post-hoc tool for identifying important features. We evaluate all introduced methods on several diverse binary closed-set and open-set classification tasks, including synthetic face detection, biometric presentation attack detection, and classification of anomalies in chest X-ray scans, and find that the proposed methods improve generalization capabilities of deep learning models over traditional (true-class CAM only) saliency-guided training approaches. We offer source code and model weights (GitHub repository link removed to preserve anonymity) to support reproducible research.
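[14] Transformer Based Building Boundary Reconstruction using Attraction Field Maps cs.CV
Muhammad Kamran, Mohammad Moein Sheikholeslami, Andreas Wichmann, Gunho Sohn
TL;DR: Proposes a new method based on graph convolutional networks (GCNs) that automatically reconstructs building boundaries from a single satellite image by incorporating geometric regularity and Attraction Field Maps, improving on existing methods.
Details
Motivation: Satellite imagery provides rich visual data, but traditional spatial map generation relies on manual work and is inefficient. Automatically and accurately reconstructing building boundaries from a single satellite image is an important and challenging task.
Result: The model performs well across diverse and challenging scenarios, improving AP by 6% and AR by 10% over existing methods and producing accurate, regularized building footprints.
Insight: Combining Attraction Field Maps with multi-scale features is key to building boundary reconstruction, and introducing geometric regularity improves model performance.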
Abstract: In recent years, the number of remote satellites orbiting the Earth has grown significantly, streaming vast amounts of high-resolution visual data to support diverse applications across civil, public, and military domains. Among these applications, the generation and updating of spatial maps of the built environment have become critical due to the extensive coverage and detailed imagery provided by satellites. However, reconstructing spatial maps from satellite imagery is a complex computer vision task, requiring the creation of high-level object representations, such as primitives, to accurately capture the built environment. While the past decade has witnessed remarkable advancements in object detection and representation using visual data, primitives-based object representation remains a persistent challenge in computer vision. Consequently, high-quality spatial maps often rely on labor-intensive and manual processes. This paper introduces a novel deep learning methodology leveraging Graph Convolutional Networks (GCNs) to address these challenges in building footprint reconstruction. The proposed approach enhances performance by incorporating geometric regularity into building boundaries, integrating multi-scale and multi-resolution features, and embedding Attraction Field Maps into the network. These innovations provide a scalable and precise solution for automated building footprint extraction from a single satellite image, paving the way for impactful applications in urban planning, disaster management, and large-scale spatial analysis. Our model, Decoupled-PolyGCN, outperforms existing methods by 6% in AP and 10% in AR, demonstrating its ability to deliver accurate and regularized building footprints across diverse and challenging scenarios.
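[15] Controllable Hybrid Captioner for Improved Long-form Video Understanding cs.CV | cs.AI
Kuleen Sasse, Efsun Sarioglu Kayi, Arun Reddy
TL;DR: Proposes a controllable hybrid captioner that combines dynamic action descriptions with static scene descriptions to improve the textual representation used for long-form video understanding, with an LLM answering complex queries over it.
Details
Motivation: Long-form video is dense and high-dimensional. Conventional video captioners focus on dynamic actions and ignore static scene information, which limits the ability to answer complex questions.
Result: The model produces a more complete caption log, expands the range of answerable questions, and significantly improves the efficiency of caption generation.
Insight: Hybrid captioning that combines dynamic and static information is key to better long-form video understanding, and controlling the caption type through input tokens adapts efficiently to changes in video content.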
Abstract: Video data, especially long-form video, is extremely dense and high-dimensional. Text-based summaries of video content offer a way to represent query-relevant content in a much more compact manner than raw video. In addition, textual representations are easily ingested by state-of-the-art large language models (LLMs), which enable reasoning over video content to answer complex natural language queries. To build such a representation, we rely on the progressive construction of a text-based memory by a video captioner operating on shorter chunks of the video, where spatio-temporal modeling is computationally feasible. We explore ways to improve the quality of the activity log composed solely of short video captions. Because the video captions tend to be focused on human actions, and questions may pertain to other information in the scene, we seek to enrich the memory with static scene descriptions using Vision Language Models (VLMs). Our video understanding system relies on the LaViLa video captioner in combination with an LLM to answer questions about videos. We first explored different ways of partitioning the video into meaningful segments such that the textual descriptions more accurately reflect the structure of the video content. Furthermore, we incorporated static scene descriptions into the captioning pipeline using LLaVA VLM, resulting in a more detailed and complete caption log and expanding the space of questions that are answerable from the textual memory. Finally, we have successfully fine-tuned the LaViLa video captioner to produce both action and scene captions, significantly improving the efficiency of the captioning pipeline compared to using separate captioning models for the two tasks. Our model, controllable hybrid captioner, can alternate between different types of captions according to special input tokens that signal scene changes detected in the video.
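[16] Toward Scalable Video Narration: A Training-free Approach Using Multimodal Large Language Models cs.CV
Tz-Ying Wu, Tahani Trigui, Sharath Nittur Sridhar, Anand Bodas, Subarna Tripathi
TL;DR: Proposes VideoNarrator, a training-free method that uses off-the-shelf multimodal large language models (MLLMs) and visual-language models (VLMs) to generate dense video captions, significantly improving temporal alignment and caption quality.
Details
Motivation: MLLMs often suffer from inaccurate temporal alignment and hallucination in video understanding, especially in unfamiliar scenarios. The paper aims to address these problems and improve the accuracy and usefulness of video narration.
Result: Experiments show that the method significantly improves narration accuracy and temporal alignment, reduces hallucination, and supports a range of downstream tasks.
Insight: Training-free methods can efficiently leverage off-the-shelf models to improve video narration, showing the potential of combining multimodal models.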
Abstract: In this paper, we introduce VideoNarrator, a novel training-free pipeline designed to generate dense video captions that offer a structured snapshot of video content. These captions offer detailed narrations with precise timestamps, capturing the nuances present in each segment of the video. Despite advancements in multimodal large language models (MLLMs) for video comprehension, these models often struggle with temporally aligned narrations and tend to hallucinate, particularly in unfamiliar scenarios. VideoNarrator addresses these challenges by leveraging a flexible pipeline where off-the-shelf MLLMs and visual-language models (VLMs) can function as caption generators, context providers, or caption verifiers. Our experimental results demonstrate that the synergistic interaction of these components significantly enhances the quality and accuracy of video narrations, effectively reducing hallucinations and improving temporal alignment. This structured approach not only enhances video understanding but also facilitates downstream tasks such as video summarization and video question answering, and can be potentially extended for advertising and marketing applications.
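[17] Few-Shot Learning in Video and 3D Object Detection: A Survey cs.CV
Md Meftahul Ferdaus, Kendall N. Niles, Joe Tom, Mahdi Abdelguerfi, Elias Ioup
TL;DR: This survey reviews recent advances in few-shot learning (FSL) for video and 3D object detection, showing how novel classes can be detected from a few annotated examples to reduce manual labeling cost, and analyzes the challenges and solutions that arise with spatiotemporal structure and point cloud data.
Details
Motivation: Video and 3D object detection need large amounts of annotated data, and annotation is expensive. FSL can substantially reduce annotation needs, making these tasks more practical in real applications.
Result: The survey finds that FSL shows promise for video and 3D detection, reducing annotation needs and supporting practical deployments such as autonomous driving.
Insight: By combining spatiotemporal structure with data-modality properties, FSL may further reduce supervision needs in video, 3D, and other domains and enable broader application.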
Abstract: Few-shot learning (FSL) enables object detection models to recognize novel classes given only a few annotated examples, thereby reducing expensive manual data labeling. This survey examines recent FSL advances for video and 3D object detection. For video, FSL is especially valuable since annotating objects across frames is more laborious than for static images. By propagating information across frames, techniques like tube proposals and temporal matching networks can detect new classes from a couple of examples, efficiently leveraging spatiotemporal structure. FSL for 3D detection from LiDAR or depth data faces challenges like sparsity and lack of texture. Solutions integrate FSL with specialized point cloud networks and losses tailored for class imbalance. Few-shot 3D detection enables practical autonomous driving deployment by minimizing costly 3D annotation needs. Core issues in both domains include balancing generalization and overfitting, integrating prototype matching, and handling data modality properties. In summary, FSL shows promise for reducing annotation requirements and enabling real-world video, 3D, and other applications by efficiently leveraging information across feature, temporal, and data modalities. By comprehensively surveying recent advancements, this paper illuminates FSL’s potential to minimize supervision needs and enable deployment across video, 3D, and other real-world applications.
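[18] SDGOCC: Semantic and Depth-Guided Bird’s-Eye View Transformation for 3D Multimodal Occupancy Prediction cs.CV | cs.AI
Zaipeng Duan, Chenxu Dang, Xuzhong Hu, Pei An, Junfeng Ding
TL;DR: SDG-OCC is a novel multimodal 3D occupancy prediction network. It combines semantic and depth-guided view transformation with fusion-to-occupancy-driven active distillation to address the inaccurate depth estimation and underused geometric information of existing methods.
Details
Motivation: Most existing methods are single-modality: camera-based methods lack depth information, and LiDAR-based methods are affected by occlusion. Lightweight methods such as LSS are limited by inaccurate depth estimation and insufficient use of geometric and semantic information.
Result: Achieves SOTA performance with real-time processing on the Occ3D-nuScenes dataset and comparable performance on the SurroundOcc-nuScenes dataset.
Insight: Multimodal methods that combine semantic and depth information improve the accuracy and robustness of 3D occupancy prediction, and the lightweight design suits real-time applications.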
Abstract: Multimodal 3D occupancy prediction has garnered significant attention for its potential in autonomous driving. However, most existing approaches are single-modality: camera-based methods lack depth information, while LiDAR-based methods struggle with occlusions. Current lightweight methods primarily rely on the Lift-Splat-Shoot (LSS) pipeline, which suffers from inaccurate depth estimation and fails to fully exploit the geometric and semantic information of 3D LiDAR points. Therefore, we propose a novel multimodal occupancy prediction network called SDG-OCC, which incorporates a joint semantic and depth-guided view transformation coupled with a fusion-to-occupancy-driven active distillation. The enhanced view transformation constructs accurate depth distributions by integrating pixel semantics and co-point depth through diffusion and bilinear discretization. The fusion-to-occupancy-driven active distillation extracts rich semantic information from multimodal data and selectively transfers knowledge to image features based on LiDAR-identified regions. Finally, for optimal performance, we introduce SDG-Fusion, which uses fusion alone, and SDG-KL, which integrates both fusion and distillation for faster inference. Our method achieves state-of-the-art (SOTA) performance with real-time processing on the Occ3D-nuScenes dataset and shows comparable performance on the more challenging SurroundOcc-nuScenes dataset, demonstrating its effectiveness and robustness. The code will be released at https://github.com/DzpLab/SDGOCC.
[19] FedVLM: Scalable Personalized Vision-Language Models through Federated Learning cs.CVPDF
Arkajyoti Mitra, Afia Anjum, Paul Agbaje, Mert Pesé, Habeeb Olufowobi
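TL;DR: FedVLM is a federated learning framework for scalable, personalized fine-tuning of vision-language models (VLMs). Its personalized LoRA (pLoRA) significantly improves model performance on non-independent and identically distributed (non-iid) data.
Details
Motivation: Existing parameter-efficient tuning methods such as LoRA struggle with data heterogeneity in federated learning, which leads to suboptimal generalization. FedVLM aims to solve this and enable efficient personalized fine-tuning in distributed settings.
Result: On the RLAIF-V dataset, pLoRA improves client-specific performance by 24.5% over standard LoRA in non-iid settings, supporting the scalability and efficiency of FedVLM.
Insight: Combining federated learning with a personalized tuning strategy such as pLoRA handles data heterogeneity effectively and advances personalized models in distributed learning scenarios.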
Abstract: Vision-language models (VLMs) demonstrate impressive zero-shot and few-shot learning capabilities, making them essential for several downstream tasks. However, fine-tuning these models at scale remains challenging, particularly in federated environments where data is decentralized and non-iid across clients. Existing parameter-efficient tuning methods like LoRA (Low-Rank Adaptation) reduce computational overhead but struggle with heterogeneous client data, leading to suboptimal generalization. To address these challenges, we propose FedVLM, a federated LoRA fine-tuning framework that enables decentralized adaptation of VLMs while preserving model privacy and reducing reliance on centralized training. To further tackle data heterogeneity, we introduce personalized LoRA (pLoRA), which dynamically adapts LoRA parameters to each client’s unique data distribution, significantly improving local adaptation while maintaining global model aggregation. Experiments on the RLAIF-V dataset show that pLoRA improves client-specific performance by 24.5% over standard LoRA, demonstrating superior adaptation in non-iid settings. FedVLM provides a scalable and efficient solution for fine-tuning VLMs in federated settings, advancing personalized adaptation in distributed learning scenarios.
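As background for federated LoRA fine-tuning, the sketch below shows a low-rank update B·A and a FedAvg-style weighted average of one LoRA factor across clients. It is an illustration under stated assumptions only: the abstract does not specify the aggregation rule or how pLoRA personalizes parameters, and the function names, client matrices, and sample counts are hypothetical.

```python
def lora_delta(B, A, scale=1.0):
    """Low-rank update scale * (B @ A), with B of shape d x r and A of shape r x k."""
    r = len(A)
    return [
        [scale * sum(B[i][t] * A[t][j] for t in range(r)) for j in range(len(A[0]))]
        for i in range(len(B))
    ]


def weighted_average(matrices, weights):
    """FedAvg-style aggregation: weighted mean of same-shaped client matrices."""
    total = sum(weights)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(w * m[i][j] for m, w in zip(matrices, weights)) / total for j in range(cols)]
        for i in range(rows)
    ]


# Two hypothetical clients with rank-1 adapters and different dataset sizes.
client_A = [[[1.0, 0.0]], [[0.0, 1.0]]]  # each A is 1 x 2
num_samples = [3, 1]
global_A = weighted_average(client_A, num_samples)
delta = lora_delta(B=[[2.0], [0.0]], A=global_A)
```

Only the small factor matrices are exchanged and averaged here, which is what keeps communication cheap compared with sending full model weights; a personalized scheme would additionally keep or adapt some parameters per client.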
[20] IONext: Unlocking the Next Era of Inertial Odometry cs.CV | cs.ROPDF
Shanshan Zhang, Siyue Wang, Tianshui Wen, Qi Zhang, Ziheng Zhou
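TL;DR: IONext is a new CNN-based inertial odometry framework. Its DADM module and STGU unit combine global motion patterns with local, fine-grained motion features, significantly improving localization accuracy and generalization.
Details
Motivation: Transformer models used for inertial odometry can model long-range dependencies, but their limited sensitivity to local, fine-grained motion variations and their lack of inductive biases limit localization accuracy and generalization. IONext therefore proposes a CNN-based solution.
Result: Experiments on six public datasets show that IONext consistently outperforms existing Transformer- and CNN-based methods; for example, on the RNIN dataset it reduces the average ATE by 10% and the average RTE by 12% relative to iMOT.
Insight: With dynamically generated weights and a spatio-temporal gating unit, a CNN can capture motion features better and compensate for the weaknesses of Transformers in inertial odometry.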
Abstract: Researchers have increasingly adopted Transformer-based models for inertial odometry. While Transformers excel at modeling long-range dependencies, their limited sensitivity to local, fine-grained motion variations and lack of inherent inductive biases often hinder localization accuracy and generalization. Recent studies have shown that incorporating large-kernel convolutions and Transformer-inspired architectural designs into CNNs can effectively expand the receptive field, thereby improving global motion perception. Motivated by these insights, we propose a novel CNN-based module called the Dual-wing Adaptive Dynamic Mixer (DADM), which adaptively captures both global motion patterns and local, fine-grained motion features from dynamic inputs. This module dynamically generates selective weights based on the input, enabling efficient multi-scale feature aggregation. To further improve temporal modeling, we introduce the Spatio-Temporal Gating Unit (STGU), which selectively extracts representative and task-relevant motion features in the temporal domain. This unit addresses the limitations of temporal modeling observed in existing CNN approaches. Built upon DADM and STGU, we present a new CNN-based inertial odometry backbone, named Next Era of Inertial Odometry (IONext). Extensive experiments on six public datasets demonstrate that IONext consistently outperforms state-of-the-art (SOTA) Transformer- and CNN-based methods. For instance, on the RNIN dataset, IONext reduces the average ATE by 10% and the average RTE by 12% compared to the representative model iMOT.
[21] Robust Five-Class and binary Diabetic Retinopathy Classification Using Transfer Learning and Data Augmentation cs.CV | cs.LG | F.2.2; I.2.7PDF
Faisal Ahmed, Mohammad Alfrad Nobel Bhuiyan
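TL;DR: The paper proposes a deep learning framework based on transfer learning and data augmentation for binary and five-class diabetic retinopathy (DR) classification, and reports strong performance on the APTOS 2019 dataset.
Details
Motivation: To address class imbalance and limited training data in DR diagnosis and to improve the accuracy and practicality of automated diagnosis.
Result: (1) Binary classification: accuracy 98.9%, AUC 99.4%. (2) Five-class classification: accuracy 84.6%, AUC 94.1%. The binary result is reported as state of the art, and the five-class result outperforms several existing approaches.
Insight: Combining class-balanced data augmentation with transfer learning significantly improves DR classification and offers a scalable, efficient solution for clinical deployment.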
Abstract: Diabetic retinopathy (DR) is a leading cause of vision loss worldwide, and early diagnosis through automated retinal image analysis can significantly reduce the risk of blindness. This paper presents a robust deep learning framework for both binary and five-class DR classification, leveraging transfer learning and extensive data augmentation to address the challenges of class imbalance and limited training data. We evaluate a range of pretrained convolutional neural network architectures, including variants of ResNet and EfficientNet, on the APTOS 2019 dataset. For binary classification, our proposed model achieves a state-of-the-art accuracy of 98.9%, with a precision of 98.6%, recall of 99.3%, F1-score of 98.9%, and an AUC of 99.4%. In the more challenging five-class severity classification task, our model obtains a competitive accuracy of 84.6% and an AUC of 94.1%, outperforming several existing approaches. Our findings also demonstrate that EfficientNet-B0 and ResNet34 offer optimal trade-offs between accuracy and computational efficiency across both tasks. These results underscore the effectiveness of combining class-balanced augmentation with transfer learning for high-performance DR diagnosis. The proposed framework provides a scalable and accurate solution for DR screening, with potential for deployment in real-world clinical environments.
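The binary-classification figures in the abstract can be cross-checked with the standard definition of the F1-score as the harmonic mean of precision and recall. The helper name below is hypothetical; the inputs are the precision and recall reported above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Reported binary-classification precision (98.6%) and recall (99.3%).
f1 = f1_score(0.986, 0.993)
```

The result rounds to 0.989, which agrees with the reported F1-score of 98.9%.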
[22] ScSAM: Debiasing Morphology and Distributional Variability in Subcellular Semantic Segmentation cs.CV | cs.AI | cs.LG | I.4.6PDF
Bo Fang, Jianan Fan, Dongnan Liu, Hang Chang, Gerald J. Shami
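TL;DR: ScSAM combines pre-trained SAM with MAE, using a feature alignment and fusion module to make subcellular segmentation more robust and to reduce the training bias caused by data imbalance and morphological variability.
Details
Motivation: The marked variability in morphology and distribution among subcellular components makes learning-based segmentation models prone to biased feature learning. Existing methods often overlook feature diversity, and although SAM provides rich feature representations, in subcellular scenarios it faces gaps in the label space and tends to ignore fine-grained details.
Result: Experiments on multiple subcellular image datasets show that ScSAM outperforms existing methods.
Insight: Combining global contextual understanding with fine-grained spatial detail can significantly improve the accuracy and robustness of subcellular segmentation, especially under imbalanced data distributions and variable morphology.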
Abstract: The significant morphological and distributional variability among subcellular components poses a long-standing challenge for learning-based organelle segmentation models, significantly increasing the risk of biased feature learning. Existing methods often rely on single mapping relationships, overlooking feature diversity and thereby inducing biased training. Although the Segment Anything Model (SAM) provides rich feature representations, its application to subcellular scenarios is hindered by two key challenges: (1) The variability in subcellular morphology and distribution creates gaps in the label space, leading the model to learn spurious or biased features. (2) SAM focuses on global contextual understanding and often ignores fine-grained spatial details, making it challenging to capture subtle structural alterations and cope with skewed data distributions. To address these challenges, we introduce ScSAM, a method that enhances feature robustness by fusing pre-trained SAM with Masked Autoencoder (MAE)-guided cellular prior knowledge to alleviate training bias from data imbalance. Specifically, we design a feature alignment and fusion module to align pre-trained embeddings to the same feature space and efficiently combine different representations. Moreover, we present a cosine similarity matrix-based class prompt encoder to activate class-specific features to recognize subcellular categories. Extensive experiments on diverse subcellular image datasets demonstrate that ScSAM outperforms state-of-the-art methods.
[23] VBCD: A Voxel-Based Framework for Personalized Dental Crown Design cs.CVPDF
Linda Wei, Chang Liu, Wenran Zhang, Zengji Zhang, Shaoting Zhang
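TL;DR: VBCD is a voxel-based framework for automated dental crown design. It uses a coarse-to-fine pipeline with distance-aware supervision to improve crown accuracy, a Curvature and Margin line Penalty Loss (CMPL) to improve margin line alignment, and a tooth-numbering positional prompt to improve results further.
Details
Motivation: Conventional crown design relies on manual work and is time-consuming and laborious. VBCD aims to reduce the workload of dental technicians through an automated framework.
Result: Evaluated on a large-scale intraoral scan dataset, VBCD outperforms existing methods and provides a robust solution for personalized crown design.
Insight: Combining automation with domain knowledge, such as FDI tooth numbering, can significantly improve the accuracy and efficiency of crown design.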
Abstract: The design of restorative dental crowns from intraoral scans is labor-intensive for dental technicians. To address this challenge, we propose a novel voxel-based framework for automated dental crown design (VBCD). The VBCD framework generates an initial coarse dental crown from voxelized intraoral scans, followed by a fine-grained refiner incorporating distance-aware supervision to improve accuracy and quality. During the training stage, we employ the Curvature and Margin line Penalty Loss (CMPL) to enhance the alignment of the generated crown with the margin line. Additionally, a positional prompt based on the FDI tooth numbering system is introduced to further improve the accuracy of the generated dental crowns. Evaluation on a large-scale dataset of intraoral scans demonstrated that our approach outperforms existing methods, providing a robust solution for personalized dental crown design.
[24] PIG-Nav: Key Insights for Pretrained Image Goal Navigation Models cs.CV | cs.ROPDF
Jiansong Wan, Chengming Zhou, Jinkua Liu, Xiangge Huang, Xiaoyu Chen
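TL;DR: PIG-Nav is a visual navigation approach built on pretrained models. It improves performance through an early-fusion network structure and auxiliary tasks, and augments training data with game videos, significantly improving zero-shot and fine-tuned performance.
Details
Motivation: The work aims to improve the generality and transferability of visual navigation models, especially zero-shot performance in unseen environments.
Result: The model improves on existing visual navigation foundation models by an average of 22.6% in zero-shot settings and 37.5% in fine-tuning settings, across two complex simulated environments and one real-world environment.
Insight: Pretraining strategy and dataset diversity are critical to navigation performance, and the model stays competitive with significantly less fine-tuning data.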
Abstract: Recent studies have explored pretrained (foundation) models for vision-based robotic navigation, aiming to achieve generalizable navigation and positive transfer across diverse environments while enhancing zero-shot performance in unseen settings. In this work, we introduce PIG-Nav (Pretrained Image-Goal Navigation), a new approach that further investigates pretraining strategies for vision-based navigation models and contributes in two key areas. Model-wise, we identify two critical design choices that consistently improve the performance of pretrained navigation models: (1) integrating an early-fusion network structure to combine visual observations and goal images via an appropriately pretrained Vision Transformer (ViT) image encoder, and (2) introducing suitable auxiliary tasks to enhance global navigation representation learning, thus further improving navigation performance. Dataset-wise, we propose a novel data preprocessing pipeline for efficiently labeling large-scale game video datasets for navigation model training. We demonstrate that augmenting existing open navigation datasets with diverse gameplay videos improves model performance. Our model achieves an average improvement of 22.6% in zero-shot settings and a 37.5% improvement in fine-tuning settings over existing visual navigation foundation models in two complex simulated environments and one real-world environment. These results advance the state-of-the-art in pretrained image-goal navigation models. Notably, our model maintains competitive performance while requiring significantly less fine-tuning data, highlighting its potential for real-world deployment with minimal labeled supervision.
[25] MaskedCLIP: Bridging the Masked and CLIP Space for Semi-Supervised Medical Vision-Language Pre-training cs.CVPDF
Lei Zhu, Jun Zhou, Rick Siow Mong Goh, Yong Liu
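TL;DR: The paper proposes MaskedCLIP, a framework that combines masked image modeling with contrastive language-image pre-training for semi-supervised vision-language pre-training, so that both paired and unpaired image data are used to learn more generalizable medical image features.
Details
Motivation: Foundation models in medical image analysis are usually learned from either paired image-text data or unpaired image data alone, which limits their ability to capture richer and more comprehensive image features. The paper aims to combine both kinds of data through semi-supervised learning.
Result: Experiments on retinal image analysis show that MaskedCLIP uses data more efficiently and improves downstream task performance.
Insight: Bridging the two feature spaces and adding a distillation loss exploits the complementarity of paired and unpaired data, yielding more generalizable and semantically rich image features.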
Abstract: Foundation models have recently gained tremendous popularity in medical image analysis. State-of-the-art methods leverage either paired image-text data via vision-language pre-training or unpaired image data via self-supervised pre-training to learn foundation models with generalizable image features to boost downstream task performance. However, learning foundation models exclusively on either paired or unpaired image data limits their ability to learn richer and more comprehensive image features. In this paper, we investigate a novel task termed semi-supervised vision-language pre-training, aiming to fully harness the potential of both paired and unpaired image data for foundation model learning. To this end, we propose MaskedCLIP, a synergistic masked image modeling and contrastive language-image pre-training framework for semi-supervised vision-language pre-training. The key challenge in combining paired and unpaired image data for learning a foundation model lies in the incompatible feature spaces derived from these two types of data. To address this issue, we propose to connect the masked feature space with the CLIP feature space with a bridge transformer. In this way, the more semantic specific CLIP features can benefit from the more general masked features for semantic feature extraction. We further propose a masked knowledge distillation loss to distill semantic knowledge of original image features in CLIP feature space back to the predicted masked image features in masked feature space. With this mutually interactive design, our framework effectively leverages both paired and unpaired image data to learn more generalizable image features for downstream tasks. Extensive experiments on retinal image analysis demonstrate the effectiveness and data efficiency of our method.
[26] Perceptual Classifiers: Detecting Generative Images using Perceptual Features cs.CVPDF
Krishna Srikar Durbha, Asvin Kumar Venkataramanan, Rajesh Sureddi, Alan C. Bovik
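TL;DR: The paper proposes perceptual classifiers built on the feature space of image quality assessment (IQA) models to distinguish real images from AI-generated ones. A small two-layer network trained on these features shows strong generalization and robustness.
Details
Motivation: With the rapid progress of generative AI, large amounts of AI-generated content are appearing on the internet, and effective detection methods are needed. Existing IQA models capture the statistical characteristics of real images, so they can be used to separate real from generated images.
Result: Experiments show state-of-the-art performance in detecting fake images across different generative models, with stable performance under image degradations.
Insight: The feature space of IQA models can separate real from generated images, suggesting a lightweight and efficient route to fake image detection.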
Abstract: Image Quality Assessment (IQA) models are employed in many practical image and video processing pipelines to reduce storage, minimize transmission costs, and improve the Quality of Experience (QoE) of millions of viewers. These models are sensitive to a diverse range of image distortions and can accurately predict image quality as judged by human viewers. Recent advancements in generative models have resulted in a significant influx of “GenAI” content on the internet. Existing methods for detecting GenAI content have progressed significantly with improved generalization performance on images from unseen generative models. Here, we leverage the capabilities of existing IQA models, which effectively capture the manifold of real images within a bandpass statistical space, to distinguish between real and AI-generated images. We investigate the generalization ability of these perceptual classifiers to the task of GenAI image detection and evaluate their robustness against various image degradations. Our results show that a two-layer network trained on the feature space of IQA models demonstrates state-of-the-art performance in detecting fake images across generative models, while maintaining significant robustness against image degradations.
[27] TransLPRNet: Lite Vision-Language Network for Single/Dual-line Chinese License Plate Recognition cs.CV | cs.CLPDF
Guangzhu Xu, Zhi Ke, Pengcheng Zuo, Bangjun Lei
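TL;DR: The paper proposes TransLPRNet, a lightweight vision-language network for single- and double-line Chinese license plate recognition, which improves accuracy and practicality through a pre-training framework and a perspective correction network.
Details
Motivation: Existing CNN- and CRNN-based methods struggle with the diversity of plate types and imaging conditions, and double-line license plate datasets are scarce, so a unified and efficient solution is needed.
Result: On the corrected CCPD test set, accuracy is 99.34% under coarse localization disturbance and 99.58% under fine localization disturbance; on the double-line test set accuracy is 98.70%, at speeds of up to 167 frames per second.
Insight: Synthetic data and a perspective correction network effectively address data scarcity and imaging diversity, and the lightweight design suits practical deployment.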
Abstract: License plate recognition in open environments is widely applicable across various domains; however, the diversity of license plate types and imaging conditions presents significant challenges. To address the limitations encountered by CNN and CRNN-based approaches in license plate recognition, this paper proposes a unified solution that integrates a lightweight visual encoder with a text decoder, within a pre-training framework tailored for single and double-line Chinese license plates. To mitigate the scarcity of double-line license plate datasets, we constructed a single/double-line license plate dataset by synthesizing images, applying texture mapping onto real scenes, and blending them with authentic license plate images. Furthermore, to enhance the system’s recognition accuracy, we introduce a perspective correction network (PTN) that employs license plate corner coordinate regression as an implicit variable, supervised by license plate view classification information. This network offers improved stability, interpretability, and low annotation costs. The proposed algorithm achieves an average recognition accuracy of 99.34% on the corrected CCPD test set under coarse localization disturbance. When evaluated under fine localization disturbance, the accuracy further improves to 99.58%. On the double-line license plate test set, it achieves an average recognition accuracy of 98.70%, with processing speeds reaching up to 167 frames per second, indicating strong practical applicability.
[28] Unsupervised Exposure Correction cs.CVPDF
Ruodai Cui, Li Niu, Guosheng Hu
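TL;DR: The paper proposes Unsupervised Exposure Correction (UEC), a method that needs no manual annotation. It trains on paired data generated by an emulated ISP pipeline, which improves generalization, and performs well on low-level vision tasks.
Details
Motivation: Existing exposure correction methods require labor-intensive annotation of paired data, generalize poorly, and degrade performance in low-level vision tasks, so the authors propose an annotation-free solution.
Result: UEC outperforms supervised methods on exposure correction while using only 0.01% of their parameters, and also performs well on downstream tasks such as edge detection.
Insight: Unsupervised learning can remove the annotation burden in exposure correction and substantially improve generalization; performance on low-level vision tasks depends closely on exposure quality.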
Abstract: Current exposure correction methods face three challenges: labor-intensive paired data annotation, limited generalizability, and performance degradation in low-level computer vision tasks. In this work, we introduce an innovative Unsupervised Exposure Correction (UEC) method that eliminates the need for manual annotations, offers improved generalizability, and enhances performance in low-level downstream tasks. Our model is trained using freely available paired data from an emulated Image Signal Processing (ISP) pipeline. This approach does not need expensive manual annotations, thereby minimizing individual style biases from the annotation and consequently improving its generalizability. Furthermore, we present a large-scale Radiometry Correction Dataset, specifically designed to emphasize exposure variations, to facilitate unsupervised learning. In addition, we develop a transformation function that preserves image details and outperforms state-of-the-art supervised methods [12], while utilizing only 0.01% of their parameters. Our work further investigates the broader impact of exposure correction on downstream tasks, including edge detection, demonstrating its effectiveness in mitigating the adverse effects of poor exposure on low-level features. The source code and dataset are publicly available at https://github.com/BeyondHeaven/uec_code.
[29] VisionTrap: Unanswerable Questions On Visual Data cs.CVPDF
Asir Saadat, Syem Aziz, Shahriar Mahmud, Abdullah Ibne Masud Mahi, Sabbir Ahmed
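TL;DR: The VisionTrap dataset is designed to evaluate whether VQA models facing unanswerable questions recognize the limits of their knowledge instead of generating incorrect answers.
Details
Motivation: VQA research has focused mainly on answerable questions and has rarely evaluated how models behave on unanswerable ones, in particular whether a model knows when to abstain.
Result: The study finds that VQA models tend to give an answer rather than acknowledge their limitations, which underlines the importance of including unanswerable questions in evaluation.
Insight: Future VQA benchmarks should include unanswerable questions to assess model robustness and awareness of knowledge boundaries more fully.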
Abstract: Visual Question Answering (VQA) has been a widely studied topic, with extensive research focusing on how VLMs respond to answerable questions based on real-world images. However, there has been limited exploration of how these models handle unanswerable questions, particularly in cases where they should abstain from providing a response. This research investigates VQA performance when models are shown unrealistically generated images or asked unanswerable questions, assessing whether models recognize the limitations of their knowledge or attempt to generate incorrect answers. We introduce a dataset, VisionTrap, comprising three categories of unanswerable questions across diverse image types: (1) hybrid entities that fuse objects and animals, (2) objects depicted in unconventional or impossible scenarios, and (3) fictional or non-existent figures. The questions posed are logically structured yet inherently unanswerable, testing whether models can correctly recognize their limitations. Our findings highlight the importance of incorporating such questions into VQA benchmarks to evaluate whether models tend to answer, even when they should abstain.
[30] URPO: A Unified Reward & Policy Optimization Framework for Large Language Models cs.CV | cs.CLPDF
Songshuo Lu, Hua Wang, Zhi Chen, Yaohua Tang
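TL;DR: URPO is a unified reward and policy optimization framework that combines instruction following and reward modeling in a single model, significantly improving performance while simplifying the training pipeline.
Details
Motivation: Conventional alignment pipelines require a separate reward model, which is complex and resource-intensive, and their performance is capped by a static reward signal. URPO aims to solve these problems with a unified framework.
Result: On Qwen2.5-7B, URPO raises the AlpacaEval instruction-following score from 42.24 to 44.84 and the composite reasoning average from 32.66 to 35.66 over a baseline with a separate generative reward model, and reaches a RewardBench score of 85.15.
Insight: Unifying reward and policy optimization simplifies the pipeline and, through co-evolution of generation and evaluation, improves model performance, offering a more efficient path to language model alignment.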
Abstract: Large-scale alignment pipelines typically pair a policy model with a separately trained reward model whose parameters remain frozen during reinforcement learning (RL). This separation creates a complex, resource-intensive pipeline and suffers from a performance ceiling due to a static reward signal. We propose a novel framework, Unified Reward & Policy Optimization (URPO), that unifies instruction-following (“player”) and reward modeling (“referee”) within a single model and a single training phase. Our method recasts all alignment data (including preference pairs, verifiable reasoning, and open-ended instructions) into a unified generative format optimized by a single Group-Relative Policy Optimization (GRPO) loop. This enables the model to learn from ground-truth preferences and verifiable logic while simultaneously generating its own rewards for open-ended tasks. Experiments on the Qwen2.5-7B model demonstrate URPO’s superiority. Our unified model significantly outperforms a strong baseline using a separate generative reward model, boosting the instruction-following score on AlpacaEval from 42.24 to 44.84 and the composite reasoning average from 32.66 to 35.66. Furthermore, URPO cultivates a superior internal evaluator as a byproduct of training, achieving a RewardBench score of 85.15 and surpassing the dedicated reward model it replaces (83.55). By eliminating the need for a separate reward model and fostering a co-evolutionary dynamic between generation and evaluation, URPO presents a simpler, more efficient, and more effective path towards robustly aligned language models.
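The group-relative step at the core of a GRPO loop can be sketched as follows: several responses are sampled for one prompt, and each reward is standardized against the group mean and standard deviation. This is a minimal illustration of the usual GRPO advantage formulation, not URPO's implementation; the function name, the epsilon, the use of the population standard deviation, and the reward values are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the other samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ here
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the advantage only depends on how a response compares with its siblings, the same loop can in principle consume rewards from ground-truth preferences, verifiable checks, or scores the model generates itself.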
[31] Dual-branch Prompting for Multimodal Machine Translation cs.CV | cs.CLPDF
Jie Wang, Zhendong Yang, Liansong Zong, Xiaobo Zhang, Dexian Wang
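TL;DR: The paper proposes D2P-MMT, a framework that uses reconstructed images generated by a diffusion model together with a dual-branch prompting strategy to improve the robustness and performance of multimodal machine translation.
Details
Motivation: Current multimodal machine translation methods rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their practical use.
Result: On the Multi30K dataset, D2P-MMT outperforms existing methods.
Insight: Diffusion-based reconstruction can filter out visual noise, and the dual-branch strategy strengthens cross-modal alignment and model robustness.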
Abstract: Multimodal Machine Translation (MMT) typically enhances text-only translation by incorporating aligned visual features. Despite the remarkable progress, state-of-the-art MMT approaches often rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability. To address these issues, we propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. Specifically, D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model, which naturally filters out distracting visual details while preserving semantic cues. During training, the model jointly learns from both authentic and reconstructed images using a dual-branch prompting strategy, encouraging rich cross-modal interactions. To bridge the modality gap and mitigate training-inference discrepancies, we introduce a distributional alignment loss that enforces consistency between the output distributions of the two branches. Extensive experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches.
[32] CasP: Improving Semi-Dense Feature Matching Pipeline Leveraging Cascaded Correspondence Priors for Guidance cs.CVPDF
Peiqi Chen, Lei Yu, Yi Wan, Yingying Pei, Xinyi Liu
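TL;DR: This paper proposes CasP, a novel semi-dense feature matching pipeline guided by cascaded correspondence priors, which significantly improves matching accuracy and efficiency.
Details
Motivation: Existing semi-dense feature matching methods rely on a global search, which limits further gains in accuracy and efficiency.
Result: At a resolution of 1152, the CasP lite model is about 2.2 times faster than ELoFTR, and it performs strongly in geometric estimation and cross-domain generalization.
Insight: Cascaded priors and a staged search strategy substantially improve matching efficiency, making the approach suitable for latency-sensitive, high-robustness applications such as SLAM and UAV systems.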
Abstract: Semi-dense feature matching methods have shown strong performance in challenging scenarios. However, the existing pipeline relies on a global search across the entire feature map to establish coarse matches, limiting further improvements in accuracy and efficiency. Motivated by this limitation, we propose a novel pipeline, CasP, which leverages cascaded correspondence priors for guidance. Specifically, the matching stage is decomposed into two progressive phases, bridged by a region-based selective cross-attention mechanism designed to enhance feature discriminability. In the second phase, one-to-one matches are determined by restricting the search range to the one-to-many prior areas identified in the first phase. Additionally, this pipeline benefits from incorporating high-level features, which helps reduce the computational costs of low-level feature extraction. The acceleration gains of CasP increase with higher resolution, and our lite model achieves a speedup of $\sim2.2\times$ at a resolution of 1152 compared to the most efficient method, ELoFTR. Furthermore, extensive experiments demonstrate its superiority in geometric estimation, particularly with impressive cross-domain generalization. These advantages highlight its potential for latency-sensitive and high-robustness applications, such as SLAM and UAV systems. Code is available at https://github.com/pq-chen/CasP.
[33] CartoonAlive: Towards Expressive Live2D Modeling from Single Portraits cs.CVPDF
Chao He, Jianqiang Ren, Jianjing Xiang, Xiejie Shen
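TL;DR: This paper presents CartoonAlive, a method that generates high-quality Live2D digital human models from a single portrait image, addressing the lack of interactive 2D cartoon-style digital humans.
Details
Motivation: As digital human technology develops, 3D models are complex to build and 2D video-based solutions lack flexibility, whereas 2D cartoon-style Live2D models offer an efficient and expressive alternative.
Result: The method generates a Live2D model that closely resembles the input portrait in less than half a minute, with high expressiveness and visual accuracy.
Insight: Live2D simulates 3D-like motion through layered segmentation, avoiding complex modeling and high rendering costs and providing a scalable solution for interactive 2D cartoon characters.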
Abstract: With the rapid advancement of large foundation models, AIGC, cloud rendering, and real-time motion capture technologies, digital humans are now capable of achieving synchronized facial expressions and body movements, engaging in intelligent dialogues driven by natural language, and enabling the fast creation of personalized avatars. While current mainstream approaches to digital humans primarily focus on 3D models and 2D video-based representations, interactive 2D cartoon-style digital humans have received relatively less attention. Compared to 3D digital humans that require complex modeling and high rendering costs, and 2D video-based solutions that lack flexibility and real-time interactivity, 2D cartoon-style Live2D models offer a more efficient and expressive alternative. By simulating 3D-like motion through layered segmentation without the need for traditional 3D modeling, Live2D enables dynamic and real-time manipulation. In this technical report, we present CartoonAlive, an innovative method for generating high-quality Live2D digital humans from a single input portrait image. CartoonAlive leverages the shape basis concept commonly used in 3D face modeling to construct facial blendshapes suitable for Live2D. It then infers the corresponding blendshape weights based on facial keypoints detected from the input image. This approach allows for the rapid generation of a highly expressive and visually accurate Live2D model that closely resembles the input portrait, within less than half a minute. Our work provides a practical and scalable solution for creating interactive 2D cartoon characters, opening new possibilities in digital content creation and virtual character animation. The project homepage is https://human3daigc.github.io/CartoonAlive_webpage/.
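The shape basis idea referenced in the abstract is commonly written as a linear blendshape model: a neutral shape plus a weighted sum of per-blendshape offsets. The sketch below illustrates only that linear combination, under the assumption of flattened coordinate lists; how CartoonAlive builds its Live2D blendshapes and infers weights from facial keypoints is not specified here, and all names and values are hypothetical.

```python
def blend(neutral, deltas, weights):
    """Linear blendshape model: neutral + sum_i weights[i] * deltas[i]."""
    return [
        base + sum(w * d[k] for w, d in zip(weights, deltas))
        for k, base in enumerate(neutral)
    ]


# Hypothetical flattened (x, y) coordinates for two control points.
neutral = [0.0, 0.0, 1.0, 0.0]
deltas = [
    [0.0, 0.5, 0.0, 0.5],     # offset of a "raise" blendshape
    [0.25, 0.0, -0.25, 0.0],  # offset of a "narrow" blendshape
]
shape = blend(neutral, deltas, weights=[1.0, 0.5])
```

Fitting a portrait then reduces to estimating a small weight vector rather than every coordinate, which is one reason such a model can be generated quickly.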
[34] Temporal Point-Supervised Signal Reconstruction: A Human-Annotation-Free Framework for Weak Moving Target Detection cs.CV | cs.AIPDF
Weihua Gao, Chunxu Ren, Wenlong Niu, Xiaodong Peng
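TL;DR: The paper proposes a Temporal Point-Supervised (TPS) framework for weak moving target detection that needs no human annotation. By reconstructing transient signals with a Dynamic Multi-Scale Attention module, it performs well on a low-SNR dataset and runs in real time.
Details
Motivation: In low-altitude surveillance and early warning systems, weak moving target detection is hampered by low signal energy, small spatial extent, and complex background clutter. Existing methods are limited by the lack of reliable annotations and of robust feature extraction.
Result: On a purpose-built low-SNR dataset the framework outperforms existing methods, achieving strong detection performance at over 1000 FPS.
Insight: Replacing conventional frame-based detection with pixel-wise temporal signal modeling removes the dependence on annotations for weak target detection and is efficient enough for real-time use.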
Abstract: In low-altitude surveillance and early warning systems, detecting weak moving targets remains a significant challenge due to low signal energy, small spatial extent, and complex background clutter. Existing methods struggle with extracting robust features and suffer from the lack of reliable annotations. To address these limitations, we propose a novel Temporal Point-Supervised (TPS) framework that enables high-performance detection of weak targets without any manual annotations. Instead of conventional frame-based detection, our framework reformulates the task as a pixel-wise temporal signal modeling problem, where weak targets manifest as short-duration pulse-like responses. A Temporal Signal Reconstruction Network (TSRNet) is developed under the TPS paradigm to reconstruct these transient signals. TSRNet adopts an encoder-decoder architecture and integrates a Dynamic Multi-Scale Attention (DMSAttention) module to enhance its sensitivity to diverse temporal patterns. Additionally, a graph-based trajectory mining strategy is employed to suppress false alarms and ensure temporal consistency. Extensive experiments on a purpose-built low-SNR dataset demonstrate that our framework outperforms state-of-the-art methods while requiring no human annotations. It achieves strong detection performance and operates at over 1000 FPS, underscoring its potential for real-time deployment in practical scenarios.
[35] Principled Multimodal Representation Learning cs.CV | cs.LG | cs.MMPDF
Xiaohao Liu, Xiaobo Xia, See-Kiong Ng, Tat-Seng Chua
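TL;DR: The paper proposes Principled Multimodal Representation Learning (PMRL), a framework for anchor-free multimodal alignment that addresses the fixed-anchor restriction of conventional contrastive learning and the instability of optimization.
Details
Motivation: Conventional multimodal representation learning depends on a predefined anchor modality, which prevents full alignment across all modalities, and optimization can be unstable.
Result: In experiments across diverse tasks, PMRL outperforms baseline methods.
Insight: Mathematically, full modality alignment corresponds to a rank-1 Gram matrix; PMRL offers a stable, anchor-free alignment method by optimizing the dominant singular value.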
Abstract: Multimodal representation learning seeks to create a unified representation space by integrating diverse data modalities to improve multimodal understanding. Traditional methods often depend on pairwise contrastive learning, which relies on a predefined anchor modality, restricting alignment across all modalities. Recent advances have investigated the simultaneous alignment of multiple modalities, yet several challenges remain, such as limitations imposed by fixed anchor points and instability arising from optimizing the product of singular values. To address the challenges, in this paper, we propose Principled Multimodal Representation Learning (PMRL), a novel framework that achieves simultaneous alignment of multiple modalities without anchor dependency in a more stable manner. Specifically, grounded in the theoretical insight that full alignment corresponds to a rank-1 Gram matrix, PMRL optimizes the dominant singular value of the representation matrix to align modalities along a shared leading direction. We propose a softmax-based loss function that treats singular values as logits to prioritize the largest singular value. Besides, instance-wise contrastive regularization on the leading eigenvectors maintains inter-instance separability and prevents representation collapse. Extensive experiments across diverse tasks demonstrate PMRL’s superiority compared to baseline methods. The source code will be publicly available.
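One way to read the "singular values as logits" idea is sketched below as a formula. This is an illustrative form only, not the paper's exact loss: the matrix Z, the unit-norm assumption, and the temperature tau are assumptions introduced here.

```latex
% Assumed notation (not from the paper): Z \in \mathbb{R}^{M \times d} stacks the
% unit-norm embeddings of the M modalities of one instance,
% \sigma_1 \ge \dots \ge \sigma_M are its singular values, \tau > 0 is a temperature.
\[
\mathcal{L}_{\text{align}}
  = -\log \frac{\exp(\sigma_1 / \tau)}{\sum_{i=1}^{M} \exp(\sigma_i / \tau)},
\qquad
\sum_{i=1}^{M} \sigma_i^2 = \lVert Z \rVert_F^2 = M .
\]
```

Under the unit-norm assumption the squared singular values always sum to M, so the largest one is at most the square root of M, and it reaches that bound exactly when Z has rank 1, that is, when all M embeddings lie along one shared direction. Pushing probability mass toward the first logit therefore pushes the Gram matrix toward rank 1.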
[36] Exploring Active Learning for Label-Efficient Training of Semantic Neural Radiance Field cs.CV
Yuzhe Zhu, Lile Cai, Kangkang Lu, Fayao Liu, Xulei Yang
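TL;DR: This study explores how active learning can reduce the annotation cost of training semantically-aware neural radiance fields (NeRF) and proposes a sample selection strategy that incorporates 3D geometric constraints; experiments show a more than 2x reduction in annotation cost.
Details
Motivation: Semantically-aware NeRFs require large amounts of pixel-level annotations, which are expensive to collect. The authors explore active learning as a way to reduce the amount of annotation needed.
Result: Experiments show that active learning significantly reduces annotation cost (by more than 2x) while maintaining model performance.
Insight: An active learning strategy that uses 3D geometric information selects the samples most valuable for training more efficiently, reducing the annotation burden.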
Abstract: Neural Radiance Field (NeRF) models are implicit neural scene representation methods that offer unprecedented capabilities in novel view synthesis. Semantically-aware NeRFs not only capture the shape and radiance of a scene, but also encode semantic information of the scene. The training of semantically-aware NeRFs typically requires pixel-level class labels, which can be prohibitively expensive to collect. In this work, we explore active learning as a potential solution to alleviate the annotation burden. We investigate various design choices for active learning of semantically-aware NeRF, including selection granularity and selection strategies. We further propose a novel active learning strategy that takes into account 3D geometric constraints in sample selection. Our experiments demonstrate that active learning can effectively reduce the annotation cost of training semantically-aware NeRF, achieving more than 2X reduction in annotation cost compared to random sampling.
[37] Exploring Spatial Diversity for Region-based Active Learning cs.CV
Lile Cai, Xun Xu, Lining Zhang, Chuan-Sheng Foo
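TL;DR: The paper proposes a region-based active learning method with spatial diversity; by combining local spatial diversity with conventional uncertainty criteria, it substantially reduces annotation cost for semantic segmentation while maintaining high performance.
Details
Motivation: Semantic segmentation needs large amounts of pixel-level annotations, which are costly. Region-based methods reduce the annotation volume, but existing ones usually ignore the effect of local spatial diversity on model performance. The authors therefore introduce spatial diversity into active learning to improve efficiency.
Result: Experiments show that labeling only 5-9% of the pixels reaches 95% of the performance of fully supervised methods, clearly outperforming existing region-based active learning methods.
Insight: Local spatial diversity is crucial in region-based active learning, and combining it with conventional criteria further improves efficiency. The idea can be extended to other tasks that require dense annotation.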
Abstract: State-of-the-art methods for semantic segmentation are based on deep neural networks trained on large-scale labeled datasets. Acquiring such datasets would incur large annotation costs, especially for dense pixel-level prediction tasks like semantic segmentation. We consider region-based active learning as a strategy to reduce annotation costs while maintaining high performance. In this setting, batches of informative image regions instead of entire images are selected for labeling. Importantly, we propose that enforcing local spatial diversity is beneficial for active learning in this case, and we incorporate spatial diversity along with the traditional active selection criterion, e.g., data sample uncertainty, in a unified optimization framework for region-based active learning. We apply this framework to the Cityscapes and PASCAL VOC datasets and demonstrate that the inclusion of spatial diversity effectively improves the performance of uncertainty-based and feature diversity-based active learning methods. Our framework achieves 95% of the performance of fully supervised methods with only 5-9% of the labeled pixels, outperforming all state-of-the-art region-based active learning methods for semantic segmentation.
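To make the interaction between uncertainty and local spatial diversity concrete, here is a minimal sketch. It is a greedy stand-in, not the unified optimization framework the abstract describes; the function name `select_regions`, the grid coordinates, and the `min_gap` parameter are all hypothetical.

```python
def select_regions(scores, budget, min_gap=1):
    """Greedy region selection with a local spatial-diversity constraint.

    scores:  dict mapping (image_id, row, col) -> uncertainty score.
    budget:  maximum number of regions to pick.
    min_gap: a region is skipped when an already chosen region in the same
             image lies within this Chebyshev distance (hypothetical rule).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    chosen = []
    for key in ranked:
        if len(chosen) >= budget:
            break
        img, r, c = key
        too_close = any(
            img == ci and max(abs(r - cr), abs(c - cc)) <= min_gap
            for ci, cr, cc in chosen
        )
        if not too_close:
            chosen.append(key)
    return chosen
```

With `min_gap=0` the rule reduces to plain uncertainty ranking, which makes it easy to compare the two behaviours on the same scores.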
[38] A Conditional Probability Framework for Compositional Zero-shot Learning cs.CV
Peng Wu, Qiuxia Lai, Hao Fang, Guo-Sen Xie, Yilong Yin
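TL;DR: The paper proposes a Conditional Probability Framework (CPF) that explicitly models the dependencies between attributes and objects, addressing semantic constraints and contextual dependencies in compositional zero-shot learning (CZSL).
Details
Motivation: Conventional methods usually treat attributes and objects as independent entities, ignoring the semantic constraints and contextual dependencies between them. The paper therefore models these dependencies explicitly through a conditional probability framework.
Result: Achieves superior performance on multiple CZSL benchmarks, validating the method's effectiveness.
Insight: Explicitly modeling attribute-object dependencies is crucial for compositional zero-shot learning, and a conditional probability framework is an effective way to do so.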
Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize unseen combinations of known objects and attributes by leveraging knowledge from previously seen compositions. Traditional approaches primarily focus on disentangling attributes and objects, treating them as independent entities during learning. However, this assumption overlooks the semantic constraints and contextual dependencies inside a composition. For example, certain attributes naturally pair with specific objects (e.g., “striped” applies to “zebra” or “shirts” but not “sky” or “water”), while the same attribute can manifest differently depending on context (e.g., “young” in “young tree” vs. “young dog”). Thus, capturing attribute-object interdependence remains a fundamental yet long-ignored challenge in CZSL. In this paper, we adopt a Conditional Probability Framework (CPF) to explicitly model attribute-object dependencies. We decompose the probability of a composition into two components: the likelihood of an object and the conditional likelihood of its attribute. To enhance object feature learning, we incorporate textual descriptors to highlight semantically relevant image regions. These enhanced object features then guide attribute learning through a cross-attention mechanism, ensuring better contextual alignment. By jointly optimizing object likelihood and conditional attribute likelihood, our method effectively captures compositional dependencies and generalizes well to unseen compositions. Extensive experiments on multiple CZSL benchmarks demonstrate the superiority of our approach. Code is available.
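The decomposition described in the abstract, p(attribute, object | image) = p(object | image) * p(attribute | object, image), can be illustrated with a small sketch. The logits are assumed to come from some upstream network; the function names and input layout are hypothetical and do not reflect the authors' code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def composition_probs(object_logits, attr_logits_given_object):
    """Return p(attribute, object | image) for every (attribute, object) pair.

    object_logits: one logit per object.
    attr_logits_given_object: attr_logits_given_object[o] holds one logit per
        attribute, conditioned on object o (hypothetical input layout).
    """
    p_obj = softmax(object_logits)
    joint = {}
    for o, p_o in enumerate(p_obj):
        p_attr = softmax(attr_logits_given_object[o])
        for a, p_a in enumerate(p_attr):
            # Chain rule: p(a, o | x) = p(o | x) * p(a | o, x)
            joint[(a, o)] = p_o * p_a
    return joint
```

Because each conditional distribution is normalized separately per object, the joint scores sum to one and an attribute can be likely for one object and unlikely for another, which is the dependency an independence assumption cannot express.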
[39] EndoGen: Conditional Autoregressive Endoscopic Video Generation cs.CV | eess.IV
Xinyu Liu, Hengyu Liu, Cheng Wang, Tianming Liu, Yixuan Yuan
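TL;DR: EndoGen is a conditional autoregressive framework for endoscopic video generation; using Spatiotemporal Grid-Frame Patterning (SGP) and a Semantic-Aware Token Masking (SAT) mechanism, it generates high-quality, conditionally guided endoscopic content.
Details
Motivation: Existing methods are limited to static images or unconditional video generation; they lack dynamic context and clinical reference value, so they fall short of practical needs.
Result: Experiments show that EndoGen generates high-quality conditional videos and improves performance on the downstream task of polyp segmentation.
Insight: Combining conditional generation with an autoregressive architecture works well for endoscopic video tasks and offers a new direction for medical imaging.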
Abstract: Endoscopic video generation is crucial for advancing medical imaging and enhancing diagnostic capabilities. However, prior efforts in this field have either focused on static images, lacking the dynamic context required for practical applications, or have relied on unconditional generation that fails to provide meaningful references for clinicians. Therefore, in this paper, we propose the first conditional endoscopic video generation framework, namely EndoGen. Specifically, we build an autoregressive model with a tailored Spatiotemporal Grid-Frame Patterning (SGP) strategy. It reformulates the learning of generating multiple frames as a grid-based image generation pattern, which effectively capitalizes on the inherent global dependency modeling capabilities of autoregressive architectures. Furthermore, we propose a Semantic-Aware Token Masking (SAT) mechanism, which enhances the model’s ability to produce rich and diverse content by selectively focusing on semantically meaningful regions during the generation process. Through extensive experiments, we demonstrate the effectiveness of our framework in generating high-quality, conditionally guided endoscopic content, and show that it improves performance on the downstream task of polyp segmentation. Code is released at https://www.github.com/CUHK-AIM-Group/EndoGen.
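The grid-based reformulation can be pictured as tiling the frames of a clip into one larger image, so that an image generator models all frames jointly. The sketch below shows only that tiling step; the row-major layout and the helper name `frames_to_grid` are assumptions, not details taken from the paper.

```python
def frames_to_grid(frames, cols):
    """Tile equally sized 2D frames, in row-major order, into one 2D image.

    frames: list of frames, each a list of pixel rows (lists of values).
    cols:   number of frames per grid row (hypothetical layout choice).
    """
    if len(frames) % cols != 0:
        raise ValueError("number of frames must be a multiple of cols")
    height = len(frames[0])
    grid = []
    for start in range(0, len(frames), cols):
        row_frames = frames[start:start + cols]
        for y in range(height):
            line = []
            for frame in row_frames:
                line.extend(frame[y])  # concatenate the y-th row of each frame
            grid.append(line)
    return grid
```

For example, four 2x2 frames with `cols=2` become one 4x4 image in which temporally adjacent frames sit next to each other.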
[40] HiProbe-VAD: Video Anomaly Detection via Hidden States Probing in Tuning-Free Multimodal LLMs cs.CV | cs.AI
Zhaolin Cai, Fan Li, Ziwei Zheng, Yanjun Qin
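TL;DR: HiProbe-VAD is a novel video anomaly detection framework that uses the intermediate hidden states of pre-trained multimodal large language models (MLLMs) to detect video anomalies without fine-tuning, outperforming existing training-free methods.
Details
Motivation: Conventional video anomaly detection methods are computationally expensive and rely on large labeled datasets, which limits practical use. HiProbe-VAD aims to exploit pre-trained MLLMs without fine-tuning to address these problems.
Result: On the UCF-Crime and XD-Violence datasets, it outperforms existing training-free methods and most traditional methods, and it generalizes across different MLLMs.
Insight: The intermediate hidden states of pre-trained MLLMs are information-rich representations that can be used for efficient anomaly detection, offering a scalable solution for practical applications.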
Abstract: Video Anomaly Detection (VAD) aims to identify and locate deviations from normal patterns in video sequences. Traditional methods often struggle with substantial computational demands and a reliance on extensive labeled datasets, thereby restricting their practical applicability. To address these constraints, we propose HiProbe-VAD, a novel framework that leverages pre-trained Multimodal Large Language Models (MLLMs) for VAD without requiring fine-tuning. In this paper, we discover that the intermediate hidden states of MLLMs contain information-rich representations, exhibiting higher sensitivity and linear separability for anomalies compared to the output layer. To capitalize on this, we propose a Dynamic Layer Saliency Probing (DLSP) mechanism that intelligently identifies and extracts the most informative hidden states from the optimal intermediate layer during MLLM reasoning. A lightweight anomaly scorer and a temporal localization module then efficiently detect anomalies using these extracted hidden states and finally generate explanations. Experiments on the UCF-Crime and XD-Violence datasets demonstrate that HiProbe-VAD outperforms existing training-free and most traditional approaches. Furthermore, our framework exhibits remarkable cross-model generalization capabilities in different MLLMs without any tuning, unlocking the potential of pre-trained MLLMs for video anomaly detection and paving the way for more practical and scalable solutions.
[41] HLFormer: Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning cs.CV | cs.IR | cs.MM
Li Jun, Wang Jinpeng, Tan Chaolei, Lian Niu, Chen Long
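TL;DR: HLFormer is a hyperbolic learning framework that combines Lorentz and Euclidean attention blocks to strengthen hierarchical modeling in partially relevant video retrieval (PRVR), and introduces a Partial Order Preservation Loss to improve cross-modal matching.
Details
Motivation: Existing methods suffer from geometric distortion in Euclidean space and cannot fully model the hierarchical semantics of videos, which leads to suboptimal PRVR performance.
Result: Experiments show that HLFormer outperforms existing methods on PRVR.
Insight: Hyperbolic space is better suited to modeling the hierarchical structure of videos; hybrid-space encoding with dynamic fusion effectively improves partially relevant retrieval.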
Abstract: Partially Relevant Video Retrieval (PRVR) addresses the critical challenge of matching untrimmed videos with text queries describing only partial content. Existing methods suffer from geometric distortion in Euclidean space that sometimes misrepresents the intrinsic hierarchical structure of videos and overlooks certain hierarchical semantics, ultimately leading to suboptimal temporal modeling. To address this issue, we propose the first hyperbolic modeling framework for PRVR, namely HLFormer, which leverages hyperbolic space learning to compensate for the suboptimal hierarchical modeling capabilities of Euclidean space. Specifically, HLFormer integrates the Lorentz Attention Block and Euclidean Attention Block to encode video embeddings in hybrid spaces, using the Mean-Guided Adaptive Interaction Module to dynamically fuse features. Additionally, we introduce a Partial Order Preservation Loss to enforce “text < video” hierarchy through Lorentzian cone constraints. This approach further enhances cross-modal matching by reinforcing partial relevance between video content and text queries. Extensive experiments show that HLFormer outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/ICCV25-HLFormer.
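For readers unfamiliar with the Lorentz (hyperboloid) model that the Lorentz Attention Block builds on, the following sketch shows its basic geometry: lifting a Euclidean vector onto the hyperboloid, the Lorentzian inner product, and the geodesic distance. It assumes curvature -1 and is not HLFormer's attention block or loss; the function names are illustrative.

```python
import math


def lift(v):
    """Place a Euclidean vector on the unit hyperboloid (curvature -1 assumed).

    The time-like coordinate x0 = sqrt(1 + |v|^2) makes <x, x>_L = -1.
    """
    return [math.sqrt(1.0 + sum(c * c for c in v))] + list(v)


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x, y):
    """Geodesic distance on the hyperboloid: arccosh(-<x, y>_L)."""
    # The clamp guards against rounding pushing the argument below 1.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))
```

Distances from the origin grow with the norm of the lifted vector, which is one reason hyperbolic embeddings are often used to place general concepts near the origin and specific ones farther out.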
[42] Physics-based Human Pose Estimation from a Single Moving RGB Camera cs.CV
Ayce Idil Aytekin, Chuqiao Li, Diogo Luvizon, Rishabh Dabral, Martin Oswald
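TL;DR: The paper introduces the MoviCam dataset and the PhysDynPose method for human pose estimation from a dynamically moving monocular RGB camera, targeting in particular the challenges of non-flat scenes and camera motion.
Details
Motivation: Current monocular and physics-based human pose tracking methods produce artifacts on non-flat ground or under camera motion, and they lack real-world data with ground truth for evaluation.
Result: Experiments show that existing methods perform poorly in these challenging scenarios, whereas PhysDynPose robustly estimates human and camera poses in world coordinates.
Insight: The complexity of dynamic cameras and non-flat scenes exposes the limits of existing methods; robustness requires combining scene geometry with physical constraints.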
Abstract: Most monocular and physics-based human pose tracking methods, while achieving state-of-the-art results, suffer from artifacts when the scene does not have a strictly flat ground plane or when the camera is moving. Moreover, these methods are often evaluated on in-the-wild real world videos without ground-truth data or on synthetic datasets, which fail to model the real world light transport, camera motion, and pose-induced appearance and geometry changes. To tackle these two problems, we introduce MoviCam, the first non-synthetic dataset containing ground-truth camera trajectories of a dynamically moving monocular RGB camera, scene geometry, and 3D human motion with human-scene contact labels. Additionally, we propose PhysDynPose, a physics-based method that incorporates scene geometry and physical constraints for more accurate human motion tracking in case of camera motion and non-flat scenes. More precisely, we use a state-of-the-art kinematics estimator to obtain the human pose and a robust SLAM method to capture the dynamic camera trajectory, enabling the recovery of the human pose in the world frame. We then refine the kinematic pose estimate using our scene-aware physics optimizer. From our new benchmark, we found that even state-of-the-art methods struggle with this inherently challenging setting, i.e. a moving camera and non-planar environments, while our method robustly estimates both human and camera poses in world coordinates.
[43] CAPRI-CT: Causal Analysis and Predictive Reasoning for Image Quality Optimization in Computed Tomography cs.CV
Sneha George Gnanakalavathy, Hairil Abdul Razak, Robert Meertens, Jonathan E. Fieldsend, Xujiong Ye
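TL;DR: The paper proposes CAPRI-CT, a causal-aware deep learning framework for optimizing CT image quality. It integrates image data with acquisition metadata, uses variational autoencoders (VAEs) to extract features and model causal relationships, and supports prediction and counterfactual inference to optimize CT protocol design.
Details
Motivation: Balancing image quality against radiation dose is a key challenge in CT imaging. Existing methods lack a causal analysis of the factors that affect image quality, which makes it hard to support decision optimization.
Result: CAPRI-CT shows strong predictive performance and can provide actionable optimization suggestions through counterfactual reasoning, reducing the need for repeated physical scans.
Insight: Causal analysis can reveal the underlying relationships between CT acquisition parameters and image quality, providing a data-driven route to protocol optimization.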
Abstract: In computed tomography (CT), achieving high image quality while minimizing radiation exposure remains a key clinical challenge. This paper presents CAPRI-CT, a novel causal-aware deep learning framework for Causal Analysis and Predictive Reasoning for Image Quality Optimization in CT imaging. CAPRI-CT integrates image data with acquisition metadata (such as tube voltage, tube current, and contrast agent types) to model the underlying causal relationships that influence image quality. An ensemble of Variational Autoencoders (VAEs) is employed to extract meaningful features and generate causal representations from observational data, including CT images and associated imaging parameters. These input features are fused to predict the Signal-to-Noise Ratio (SNR) and support counterfactual inference, enabling what-if simulations, such as changes in contrast agents (types and concentrations) or scan parameters. CAPRI-CT is trained and validated using an ensemble learning approach, achieving strong predictive performance. By facilitating both prediction and interpretability, CAPRI-CT provides actionable insights that could help radiologists and technicians design more efficient CT protocols without repeated physical scans. The source code and dataset are publicly available at https://github.com/SnehaGeorge22/capri-ct.
[44] Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection cs.CV
Yehao Lu, Minghe Weng, Zekang Xiao, Rui Jiang, Wei Su
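TL;DR: Dynamic-DINO is a Mixture of Experts (MoE)-based dynamic inference framework for real-time open-vocabulary object detection; through fine-grained expert tuning and a pre-trained weight allocation strategy, it significantly improves on Grounding DINO 1.5 Edge.
Details
Motivation: MoE architectures excel in large vision-language models (LVLMs), but their potential for real-time open-vocabulary object detection has not been fully explored.
Result: Pretrained on only 1.56M open-source data, Dynamic-DINO outperforms Grounding DINO 1.5 Edge, which is pretrained on the private Grounding20M dataset.
Insight: Experts in shallow layers tend to cooperate with diverse peers to expand the search space, while experts in deeper layers form fixed collaborative structures that specialize in specific patterns.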
Abstract: The Mixture of Experts (MoE) architecture has excelled in Large Vision-Language Models (LVLMs), yet its potential in real-time open-vocabulary object detectors, which also leverage large-scale vision-language datasets but smaller models, remains unexplored. This work investigates this domain, revealing intriguing insights. In the shallow layers, experts tend to cooperate with diverse peers to expand the search space. In the deeper layers, by contrast, fixed collaborative structures emerge, where each expert maintains 2-3 fixed partners and distinct expert combinations are specialized in processing specific patterns. Concretely, we propose Dynamic-DINO, which extends Grounding DINO 1.5 Edge from a dense model to a dynamic inference framework via an efficient MoE-Tuning strategy. Additionally, we design a granularity decomposition mechanism to decompose the Feed-Forward Network (FFN) of the base model into multiple smaller expert networks, expanding the subnet search space. To prevent performance degradation at the start of fine-tuning, we further propose a pre-trained weight allocation strategy for the experts, coupled with a specific router initialization. During inference, only the input-relevant experts are activated to form a compact subnet. Experiments show that, pretrained with merely 1.56M open-source data, Dynamic-DINO outperforms Grounding DINO 1.5 Edge, pretrained on the private Grounding20M dataset.
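One simple way an FFN can be decomposed into smaller experts without changing its output is to partition the hidden units: with an element-wise activation, the sum of the per-slice outputs equals the dense output. The sketch below demonstrates that identity in plain Python. It is an assumption about how such a decomposition could work, not the paper's actual mechanism; ReLU, the contiguous split, and all function names are illustrative.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def ffn(x, w1, b1, w2, b2):
    """Dense FFN: w2 @ relu(w1 @ x + b1) + b2."""
    hidden = relu([h + b for h, b in zip(matvec(w1, x), b1)])
    return [o + b for o, b in zip(matvec(w2, hidden), b2)]


def split_into_experts(w1, b1, w2, num_experts):
    """Partition the hidden units into contiguous, equally sized experts.

    Assumes len(w1) is divisible by num_experts.
    """
    size = len(w1) // num_experts
    experts = []
    for e in range(num_experts):
        lo, hi = e * size, (e + 1) * size
        # Each expert keeps its rows of w1/b1 and the matching columns of w2.
        experts.append((w1[lo:hi], b1[lo:hi], [row[lo:hi] for row in w2]))
    return experts


def moe_ffn(x, experts, b2, active):
    """Sum the outputs of the active experts, then add the shared output bias."""
    out = list(b2)
    for e in active:
        w1_e, b1_e, w2_e = experts[e]
        hidden = relu([h + b for h, b in zip(matvec(w1_e, x), b1_e)])
        out = [o + d for o, d in zip(out, matvec(w2_e, hidden))]
    return out
```

Activating every expert reproduces the dense FFN, which is the kind of property that lets fine-tuning start from the pre-trained model's behaviour; activating a subset gives a smaller subnet.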
[45] VLM-Guided Visual Place Recognition for Planet-Scale Geo-Localization cs.CV | cs.RO
Sania Waheed, Na Min An, Michael Milford, Sarvapali D. Ramchurn, Shoaib Ehsan
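TL;DR: The paper proposes a hybrid geo-localization framework that combines vision-language models (VLMs) with retrieval-based visual place recognition (VPR); a VLM-generated prior guides retrieval, markedly improving the accuracy and robustness of geo-localization.
Details
Motivation: Conventional retrieval methods struggle with scalability and perceptual aliasing, while classification methods generalize poorly and need large amounts of training data. VLMs are strong at contextual understanding and reasoning but are prone to hallucination and lack interpretability, so they are unsuitable as standalone solutions. The study therefore combines the strengths of both to tackle planet-scale geo-localization.
Result: Outperforms existing methods on multiple geo-localization benchmarks, with the largest gains at street level (up to 4.51%) and city level (up to 13.52%).
Insight: VLM-generated priors effectively guide retrieval, and the hybrid framework mitigates VLM hallucination while retaining the efficiency and scalability of retrieval methods.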
Abstract: Geo-localization from a single image at planet scale (essentially an advanced or extreme version of the kidnapped robot problem) is a fundamental and challenging task in applications such as navigation, autonomous driving and disaster response due to the vast diversity of locations, environmental conditions, and scene variations. Traditional retrieval-based methods for geo-localization struggle with scalability and perceptual aliasing, while classification-based approaches lack generalization and require extensive training data. Recent advances in vision-language models (VLMs) offer a promising alternative by leveraging contextual understanding and reasoning. However, while VLMs achieve high accuracy, they are often prone to hallucinations and lack interpretability, making them unreliable as standalone solutions. In this work, we propose a novel hybrid geo-localization framework that combines the strengths of VLMs with retrieval-based visual place recognition (VPR) methods. Our approach first leverages a VLM to generate a prior, effectively guiding and constraining the retrieval search space. We then employ a retrieval step, followed by a re-ranking mechanism that selects the most geographically plausible matches based on feature similarity and proximity to the initially estimated coordinates. We evaluate our approach on multiple geo-localization benchmarks and show that it consistently outperforms prior state-of-the-art methods, particularly at street (up to 4.51%) and city level (up to 13.52%). Our results demonstrate that VLM-generated geographic priors in combination with VPR lead to scalable, robust, and accurate geo-localization systems.
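The re-ranking step, which combines feature similarity with proximity to the VLM's initial coordinate estimate, can be sketched as follows. The haversine formula is standard; the scoring rule (an exponential distance decay blended with similarity), the parameters `scale_km` and `weight`, and the candidate tuple layout are hypothetical and not taken from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def rerank(candidates, prior, scale_km=100.0, weight=0.5):
    """Sort retrieved candidates by similarity blended with prior proximity.

    candidates: list of (name, similarity, lat, lon) tuples from retrieval.
    prior:      (lat, lon) estimated by the VLM.
    scale_km, weight: hypothetical tuning knobs for the blend.
    """
    def score(candidate):
        _, similarity, lat, lon = candidate
        distance = haversine_km(lat, lon, prior[0], prior[1])
        proximity = math.exp(-distance / scale_km)
        return (1 - weight) * similarity + weight * proximity

    return sorted(candidates, key=score, reverse=True)
```

In this toy rule, a visually similar match on the wrong continent is demoted below a slightly less similar match near the prior, which is the kind of perceptual-aliasing failure the prior is meant to suppress.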
[46] Dynamic Scoring with Enhanced Semantics for Training-Free Human-Object Interaction Detection cs.CV
Francesco Tonini, Lorenzo Vaquero, Alessandro Conti, Cigdem Beyan, Elisa Ricci
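TL;DR: DYSCO is a training-free HOI detection framework that uses dynamic scoring with enhanced semantics and a multimodal registry to combine textual and visual interaction representations, improving the understanding of rare interactions.
Details
Motivation: Conventional HOI methods depend on large manually annotated datasets, which are labor-intensive to build and hard to extend to new domains and rare interactions. The authors exploit the potential of VLMs to explore a training-free solution.
Result: DYSCO performs best among training-free methods and is competitive with training-based methods, especially on rare interactions.
Insight: Leveraging the semantic capabilities of VLMs can significantly improve the generalization of HOI detection, especially for rare interactions. Training-free frameworks have potential practical value.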
Abstract: Human-Object Interaction (HOI) detection aims to identify humans and objects within images and interpret their interactions. Existing HOI methods rely heavily on large datasets with manual annotations to learn interactions from visual cues. These annotations are labor-intensive to create, prone to inconsistency, and limit scalability to new domains and rare interactions. We argue that recent advances in Vision-Language Models (VLMs) offer untapped potential, particularly in enhancing interaction representation. While prior work has injected such potential and even proposed training-free methods, there remain key gaps. Consequently, we propose a novel training-free HOI detection framework for Dynamic Scoring with enhanced semantics (DYSCO) that effectively utilizes textual and visual interaction representations within a multimodal registry, enabling robust and nuanced interaction understanding. This registry incorporates a small set of visual cues and uses innovative interaction signatures to improve the semantic alignment of verbs, facilitating effective generalization to rare interactions. Additionally, we propose a unique multi-head attention mechanism that adaptively weights the contributions of the visual and textual features. Experimental results demonstrate that our DYSCO surpasses training-free state-of-the-art models and is competitive with training-based approaches, particularly excelling in rare interactions. Code is available at https://github.com/francescotonini/dysco.
[47] ERMV: Editing 4D Robotic Multi-view images to enhance embodied agents cs.CV
Chang Nie, Guangming Wang, Zhe Lie, Hesheng Wang
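TL;DR: ERMV is a data augmentation framework for editing 4D robotic multi-view sequential images, aimed at the scarcity of high-quality data in robot imitation learning. Using the EMA-Attn mechanism, a sparse spatio-temporal module, and a feedback intervention mechanism, ERMV edits data efficiently and improves the robustness and generalization of Vision-Language-Action models.
Details
Motivation: Robot imitation learning relies on 4D multi-view sequential images, but high-quality data is expensive to collect and scarce, which limits the generalization and application of models such as Vision-Language-Action models. Data augmentation is a key remedy, but editing techniques for 4D multi-view sequential images are currently lacking.
Result: Experiments show that ERMV-augmented data significantly improves the robustness and generalization of Vision-Language-Action models in both simulated and real-world environments.
Insight: ERMV offers a new approach to data augmentation for robot imitation learning; its modular design and efficiency give it broad potential for 4D data editing.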
Abstract: Robot imitation learning relies on 4D multi-view sequential images. However, the high cost of data collection and the scarcity of high-quality data severely constrain the generalization and application of embodied intelligence policies like Vision-Language-Action (VLA) models. Data augmentation is a powerful strategy to overcome data scarcity, but methods for editing 4D multi-view sequential images for manipulation tasks are currently lacking. Thus, we propose ERMV (Editing Robotic Multi-View 4D data), a novel data augmentation framework that efficiently edits an entire multi-view sequence based on single-frame editing and robot state conditions. This task presents three core challenges: (1) maintaining geometric and appearance consistency across dynamic views and long time horizons; (2) expanding the working window with low computational costs; and (3) ensuring the semantic integrity of critical objects like the robot arm. ERMV addresses these challenges through a series of innovations. First, to ensure spatio-temporal consistency in motion blur, we introduce a novel Epipolar Motion-Aware Attention (EMA-Attn) mechanism that learns pixel shift caused by movement before applying geometric constraints. Second, to maximize the editing working window, ERMV pioneers a Sparse Spatio-Temporal (STT) module, which decouples the temporal and spatial views and remodels a single-frame multi-view problem through sparse sampling of the views to reduce computational demands. Third, to alleviate error accumulation, we incorporate a feedback intervention Mechanism, which uses a Multimodal Large Language Model (MLLM) to check editing inconsistencies and request targeted expert guidance only when necessary. Extensive experiments demonstrate that ERMV-augmented data significantly boosts the robustness and generalization of VLA models in both simulated and real-world environments.
[48] Probing Vision-Language Understanding through the Visual Entailment Task: promises and pitfalls cs.CV | cs.AI
Elena Pitta, Tom Kouwenhoven, Tessa Verhoef
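TL;DR: The study examines the potential and limits of the Visual Entailment (VE) task as a reliable diagnostic of vision-language understanding in multimodal language models. Experiments find that three-shot inference beats the zero-shot baseline, but additional examples add noise, and label order affects predictions. The fine-tuned model performs best, but when visual information is missing the model falls back on linguistic priors, which calls the task's visual grounding into question.
Details
Motivation: To assess whether the VE task can effectively diagnose the vision-language understanding of multimodal models, and to reveal its practical potential and limitations.
Result: (1) Three-shot inference works best among the in-context settings; (2) label order significantly affects predictions; (3) without visual information the model tends to hallucinate; (4) the fine-tuned model performs well (83.3% accuracy), but its visual grounding is questionable because BERTScore results are similar with limited vision.
Insight: VE is useful as a diagnostic tool but has limitations; multimodal evaluation methods need refinement to reduce reliance on linguistic priors and to strengthen visual grounding.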
Abstract: This study investigates the extent to which the Visual Entailment (VE) task serves as a reliable probe of vision-language understanding in multimodal language models, using the LLaMA 3.2 11B Vision model as a test case. Beyond reporting performance metrics, we aim to interpret what these results reveal about the underlying possibilities and limitations of the VE task. We conduct a series of experiments across zero-shot, few-shot, and fine-tuning settings, exploring how factors such as prompt design, the number and order of in-context examples and access to visual information might affect VE performance. To further probe the reasoning processes of the model, we used explanation-based evaluations. Results indicate that three-shot inference outperforms the zero-shot baselines. However, additional examples introduce more noise than they provide benefits. Additionally, the order of the labels in the prompt is a critical factor that influences the predictions. In the absence of visual information, the model has a strong tendency to hallucinate and imagine content, raising questions about the model’s over-reliance on linguistic priors. Fine-tuning yields strong results, achieving an accuracy of 83.3% on the e-SNLI-VE dataset and outperforming the state-of-the-art OFA-X model. Additionally, the explanation evaluation demonstrates that the fine-tuned model provides semantically meaningful explanations similar to those of humans, with a BERTScore F1-score of 89.2%. We do, however, find comparable BERTScore results in experiments with limited vision, questioning the visual grounding of this task. Overall, our results highlight both the utility and limitations of VE as a diagnostic task for vision-language understanding and point to directions for refining multimodal evaluation methods.
[49] Unsupervised anomaly detection using Bayesian flow networks: application to brain FDG PET in the context of Alzheimer’s disease cs.CV | cs.AI
Hugues Roy, Reuben Dorent, Ninon Burgos
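TL;DR: The paper proposes AnoBFN, an unsupervised anomaly detection method based on Bayesian flow networks (BFNs), applied to brain FDG PET images in Alzheimer’s disease; it outperforms existing methods in detection performance and false positive rate.
Details
Motivation: Unsupervised anomaly detection is essential in neuroimaging for identifying deviations from healthy data, yet Bayesian flow networks, a recent class of generative models, have not been applied to medical imaging or anomaly detection.
Result: On Alzheimer’s disease-related anomaly detection in FDG PET images, AnoBFN outperforms existing methods based on VAEs, GANs, and diffusion models.
Insight: By combining the strengths of diffusion frameworks and Bayesian inference, BFNs provide a new and effective tool for anomaly detection in medical imaging.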
Abstract: Unsupervised anomaly detection (UAD) plays a crucial role in neuroimaging for identifying deviations from healthy subject data and thus facilitating the diagnosis of neurological disorders. In this work, we focus on Bayesian flow networks (BFNs), a novel class of generative models, which have not yet been applied to medical imaging or anomaly detection. BFNs combine the strength of diffusion frameworks and Bayesian inference. We introduce AnoBFN, an extension of BFNs for UAD, designed to: i) perform conditional image generation under high levels of spatially correlated noise, and ii) preserve subject specificity by incorporating a recursive feedback from the input image throughout the generative process. We evaluate AnoBFN on the challenging task of Alzheimer’s disease-related anomaly detection in FDG PET images. Our approach outperforms other state-of-the-art methods based on VAEs (beta-VAE), GANs (f-AnoGAN), and diffusion models (AnoDDPM), demonstrating its effectiveness at detecting anomalies while reducing false positive rates.
[50] Illicit object detection in X-ray imaging using deep learning techniques: A comparative evaluation cs.CV
Jorgen Cani, Christos Diou, Spyridon Evangelatos, Vasileios Argyriou, Panagiotis Radoglou-Grammatikis
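TL;DR: The paper presents a systematic comparative evaluation of deep learning methods for illicit object detection in X-ray images, builds a comprehensive evaluation framework covering multiple datasets and models, and releases the code and model weights.
Details
Motivation: Automated X-ray inspection is important for public safety, but object occlusion, variation in the physical properties of items, the diversity of X-ray scanners, and limited training data still challenge detection accuracy and reliability. Current experimental evaluations are often incomplete and inconsistent, so a systematic comparative study is needed.
Result: A detailed analysis yields key observations and insights, covering the overall behavior of the detection methods, object-level detection performance, dataset-specific observations, and time efficiency and computational complexity.
Insight: The study highlights the diversity of detection methods and how their performance varies across datasets, provides a benchmark and direction for future research, and supports reproducibility through the released code and models.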
Abstract: Automated X-ray inspection is crucial for efficient and unobtrusive security screening in various public settings. However, challenges such as object occlusion, variations in the physical properties of items, diversity in X-ray scanning devices, and limited training data hinder accurate and reliable detection of illicit items. Despite the large body of research in the field, reported experimental evaluations are often incomplete, with frequently conflicting outcomes. To shed light on the research landscape and facilitate further research, a systematic, detailed, and thorough comparative evaluation of recent Deep Learning (DL)-based methods for X-ray object detection is conducted. For this, a comprehensive evaluation framework is developed, composed of: a) Six recent, large-scale, and widely used public datasets for X-ray illicit item detection (OPIXray, CLCXray, SIXray, EDS, HiXray, and PIDray), b) Ten different state-of-the-art object detection schemes covering all main categories in the literature, including generic Convolutional Neural Network (CNN), custom CNN, generic transformer, and hybrid CNN-transformer architectures, and c) Various detection (mAP50 and mAP50:95) and time/computational-complexity (inference time (ms), parameter size (M), and computational load (GFLOPS)) metrics. A thorough analysis of the results leads to critical observations and insights, emphasizing key aspects such as: a) Overall behavior of the object detection schemes, b) Object-level detection performance, c) Dataset-specific observations, and d) Time efficiency and computational complexity analysis. To support reproducibility of the reported experimental results, the evaluation code and model weights are made publicly available at https://github.com/jgenc/xray-comparative-evaluation.
[51] Accelerating Parallel Diffusion Model Serving with Residual Compression cs.CV
Jiajun Luo, Yicheng Xiao, Jianru Xu, Yangxiu You, Rongwei Lu
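TL;DR: CompactFusion reduces communication overhead in parallel diffusion model inference through residual compression, significantly improving efficiency and generation quality.
Details
Motivation: Diffusion models need substantial compute, and multi-accelerator parallel inference introduces high communication overhead that hinders real-time deployment.
Result: Achieves a 3.0x speedup on 4xL20, and a 6.7x speedup over the prior overlap-based method for communication-heavy strategies such as sequence parallelism on slow networks.
Insight: Diffusion activations are temporally redundant, so residual compression efficiently captures the essential information.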
Abstract: Diffusion models produce realistic images and videos but require substantial computational resources, necessitating multi-accelerator parallelism for real-time deployment. However, parallel inference introduces significant communication overhead from exchanging large activations between devices, limiting efficiency and scalability. We present CompactFusion, a compression framework that significantly reduces communication while preserving generation quality. Our key observation is that diffusion activations exhibit strong temporal redundancy: adjacent steps produce highly similar activations, saturating bandwidth with near-duplicate data carrying little new information. To address this inefficiency, we seek a more compact representation that encodes only the essential information. CompactFusion achieves this via Residual Compression that transmits only compressed residuals (step-wise activation differences). Based on empirical analysis and theoretical justification, we show that it effectively removes redundant data, enabling substantial data reduction while maintaining high fidelity. We also integrate lightweight error feedback to prevent error accumulation. CompactFusion establishes a new paradigm for parallel diffusion inference, delivering lower latency and significantly higher generation quality than prior methods. On 4xL20, it achieves 3.0x speedup while greatly improving fidelity. It also uniquely supports communication-heavy strategies like sequence parallelism on slow networks, achieving 6.7x speedup over the prior overlap-based method. CompactFusion applies broadly across diffusion models and parallel settings, and integrates easily without requiring pipeline rework. A portable implementation, demonstrated on xDiT, is publicly available at https://github.com/Cobalt-27/CompactFusion.
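The idea of sending only compressed step-wise residuals, with feedback so that compression error does not accumulate, can be illustrated with a toy closed-loop delta encoder. This is a sketch under stated assumptions: a uniform quantizer stands in for the real compressor, the `Endpoint` class is hypothetical, and none of it reflects CompactFusion's actual implementation.

```python
class Endpoint:
    """One side of a link; both sides track the same reconstructed activation."""

    def __init__(self, size, step):
        self.reference = [0.0] * size  # last reconstruction, shared by both sides
        self.step = step               # quantization step of the toy compressor

    def _apply(self, codes):
        self.reference = [r + c * self.step
                          for r, c in zip(self.reference, codes)]

    def encode(self, activation):
        """Sender: quantize the residual against the shared reconstruction."""
        # The residual is taken against the reconstruction rather than the true
        # previous activation, so earlier quantization error is fed back and
        # corrected instead of accumulating.
        residual = [a - r for a, r in zip(activation, self.reference)]
        codes = [round(v / self.step) for v in residual]
        self._apply(codes)
        return codes

    def decode(self, codes):
        """Receiver: add the dequantized residual to its reconstruction."""
        self._apply(codes)
        return list(self.reference)
```

When adjacent steps are similar, most codes are zero or small integers, which is what makes the residual cheaper to transmit than the full activation, while the reconstruction error stays within half a quantization step at every step.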
[52] Multi-modal Multi-task Pre-training for Improved Point Cloud Understanding cs.CVPDF
Liwen Liu, Weidong Yang, Lipeng Ma, Ben Fei
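TL;DR: This paper proposes MMPT, a multi-modal multi-task pre-training framework that strengthens point cloud understanding through three pre-training tasks (TLR, PLR, and MCL), requires no 3D annotations, and performs well on downstream tasks.
Details
Motivation: Existing multi-modal pre-training methods rely on a single task, which makes it hard to fully exploit multi-modal data and limits performance on complex downstream tasks.
Result: MMPT outperforms existing methods across multiple discriminative and generative applications, demonstrating its effectiveness.
Insight: Multi-task pre-training makes fuller use of the information in multi-modal data and improves downstream performance.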
Abstract: Recent advances in multi-modal pre-training methods have shown promising effectiveness in learning 3D representations by aligning multi-modal features between 3D shapes and their corresponding 2D counterparts. However, existing multi-modal pre-training frameworks primarily rely on a single pre-training task to gather multi-modal data in 3D applications. This limitation prevents the models from obtaining the abundant information provided by other relevant tasks, which can hinder their performance in downstream tasks, particularly in complex and diverse domains. In order to tackle this issue, we propose MMPT, a Multi-modal Multi-task Pre-training framework designed to enhance point cloud understanding. Specifically, three pre-training tasks are devised: (i) Token-level reconstruction (TLR) aims to recover masked point tokens, endowing the model with representative learning abilities. (ii) Point-level reconstruction (PLR) is integrated to predict the masked point positions directly, and the reconstructed point cloud can be considered as a transformed point cloud used in the subsequent task. (iii) Multi-modal contrastive learning (MCL) combines feature correspondences within and across modalities, thus assembling a rich learning signal from both 3D point cloud and 2D image modalities in a self-supervised manner. Moreover, this framework operates without requiring any 3D annotations, making it scalable for use with large datasets. The trained encoder can be effectively transferred to various downstream tasks. To demonstrate its effectiveness, we evaluated its performance compared to state-of-the-art methods in various discriminant and generative applications under widely-used benchmarks.
[53] Boosting Ray Search Procedure of Hard-label Attacks with Transfer-based Priors cs.CV | cs.CR | cs.LG | I.2.6; I.5.1; G.1.6PDF
Chen Ma, Xinjie Xu, Shuyu Cheng, Qi Xuan
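TL;DR: This paper improves the ray search efficiency of hard-label attacks by introducing transfer-based priors into gradient estimation, significantly improving query efficiency.
Details
Motivation: Hard-label attacks are among the most challenging black-box attacks, and existing methods estimate gradients inefficiently during ray search, especially when queries are costly. The authors therefore introduce prior knowledge to improve the quality and efficiency of gradient estimation.
Result: Experiments on ImageNet and CIFAR-10 show that the method significantly outperforms 11 state-of-the-art methods in query efficiency.
Insight: Prior knowledge can markedly improve the accuracy and efficiency of gradient estimation, particularly in black-box attacks, where transfer from surrogate models provides useful information about search directions.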
Abstract: One of the most practical and challenging types of black-box adversarial attacks is the hard-label attack, where only the top-1 predicted label is available. One effective approach is to search for the optimal ray direction from the benign image that minimizes the $\ell_p$-norm distance to the adversarial region. The unique advantage of this approach is that it transforms the hard-label attack into a continuous optimization problem. The objective function value is the ray’s radius, which can be obtained via binary search at a high query cost. Existing methods use a “sign trick” in gradient estimation to reduce the number of queries. In this paper, we theoretically analyze the quality of this gradient estimation and propose a novel prior-guided approach to improve ray search efficiency both theoretically and empirically. Specifically, we utilize the transfer-based priors from surrogate models, and our gradient estimators appropriately integrate them by approximating the projection of the true gradient onto the subspace spanned by these priors and random directions, in a query-efficient manner. We theoretically derive the expected cosine similarities between the obtained gradient estimators and the true gradient, and demonstrate the improvement achieved by incorporating priors. Extensive experiments on the ImageNet and CIFAR-10 datasets show that our approach significantly outperforms 11 state-of-the-art methods in terms of query efficiency.
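The abstract notes that the objective value is the ray's radius, obtained by binary search at a high query cost. The sketch below shows why: every probe along the ray costs one hard-label query. It is a toy illustration, not the paper's method. The oracle `is_adversarial`, the function name `ray_radius`, and the assumption that the label flips exactly once along the ray are all hypothetical.

```python
import math


def ray_radius(is_adversarial, x, direction, hi=1.0, tol=1e-3, max_doublings=20):
    """Estimate the smallest r with x + r * unit(direction) adversarial.

    `is_adversarial` is a hard-label oracle returning True or False only.
    Returns (radius, number_of_queries); radius is math.inf if no
    adversarial point is found within the doubling budget.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]

    def point(r):
        return [xi + r * ui for xi, ui in zip(x, unit)]

    queries = 0
    # Phase 1: grow the upper bound until it lands in the adversarial region.
    for _ in range(max_doublings):
        queries += 1
        if is_adversarial(point(hi)):
            break
        hi *= 2.0
    else:
        return math.inf, queries

    # Phase 2: binary search between a benign radius and an adversarial one.
    lo = 0.0  # assumes the original image x itself is not adversarial
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        queries += 1
        if is_adversarial(point(mid)):
            hi = mid
        else:
            lo = mid
    return hi, queries
```

For example, with a toy oracle `lambda p: p[0] + p[1] > 3`, start point `[0.0, 0.0]`, and direction `[1.0, 1.0]`, the returned radius approaches 3/sqrt(2). Reducing the number of such queries per gradient estimate is what the "sign trick" and the prior-guided estimators target.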
[54] RemixFusion: Residual-based Mixed Representation for Large-scale Online RGB-D Reconstruction cs.CVPDF
Yuqing Lan, Chenyang Zhu, Shuaifeng Zhi, Jiazhao Zhang, Zhoufeng Wang
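TL;DR: RemixFusion proposes a residual-based mixed representation for large-scale online RGB-D reconstruction that combines an explicit TSDF grid with an implicit neural module, yielding detail-rich and efficient reconstruction.
Details
Motivation: Neural implicit representations lose detail and are slow to learn in online dense reconstruction, while explicit representations such as TSDF lack the capacity to reconstruct fine detail. RemixFusion aims to address both with a mixed representation.
Result: On large-scale scenes, RemixFusion outperforms existing methods, both explicit and implicit, in reconstruction and camera tracking accuracy.
Insight: The mixed representation combines the strengths of explicit and implicit methods, preserving detail while improving computational efficiency; the novel pose optimization scheme improves global convergence.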
Abstract: The introduction of the neural implicit representation has notably propelled the advancement of online dense reconstruction techniques. Compared to traditional explicit representations, such as TSDF, it improves the mapping completeness and memory efficiency. However, the lack of reconstruction details and the time-consuming learning of neural representations hinder the widespread application of neural-based methods to large-scale online reconstruction. We introduce RemixFusion, a novel residual-based mixed representation for scene reconstruction and camera pose estimation dedicated to high-quality and large-scale online RGB-D reconstruction. In particular, we propose a residual-based map representation comprised of an explicit coarse TSDF grid and an implicit neural module that produces residuals representing fine-grained details to be added to the coarse grid. Such mixed representation allows for detail-rich reconstruction with bounded time and memory budget, contrasting with the overly-smoothed results by the purely implicit representations, thus paving the way for high-quality camera tracking. Furthermore, we extend the residual-based representation to handle multi-frame joint pose optimization via bundle adjustment (BA). In contrast to the existing methods, which optimize poses directly, we opt to optimize pose changes. Combined with a novel technique for adaptive gradient amplification, our method attains better optimization convergence and global optimality. Furthermore, we adopt a local moving volume to factorize the mixed scene representation with a divide-and-conquer design to facilitate efficient online learning in our residual-based framework. Extensive experiments demonstrate that our method surpasses all state-of-the-art ones, including those based either on explicit or implicit representations, in terms of the accuracy of both mapping and tracking on large-scale scenes.
[55] PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving cs.CV | cs.AI | cs.LG | cs.ROPDF
Maciej K. Wozniak, Lianhang Liu, Yixi Cai, Patric Jensfelt
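TL;DR: PRIX is a camera-only end-to-end autonomous driving architecture that avoids expensive LiDAR and BEV representations. It predicts trajectories directly using a visual feature extractor and a generative planning head; its core CaRT module makes multi-level visual features more robust, and it reaches SOTA performance on the NavSim and nuScenes benchmarks.
Details
Motivation: Current end-to-end driving models depend on LiDAR and compute-intensive BEV representations, which limits deployment on mass-market vehicles equipped only with cameras.
Result: Strong performance on the NavSim and nuScenes benchmarks, with significantly higher efficiency than multimodal diffusion planners.
Insight: Removing the dependence on LiDAR and BEV improves the practicality and scalability of autonomous driving models.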
Abstract: While end-to-end autonomous driving models show promising results, their practical deployment is often hindered by large model sizes, a reliance on expensive LiDAR sensors and computationally intensive BEV feature representations. This limits their scalability, especially for mass-market vehicles equipped only with cameras. To address these challenges, we propose PRIX (Plan from Raw Pixels). Our novel and efficient end-to-end driving architecture operates using only camera data, without explicit BEV representation and forgoing the need for LiDAR. PRIX leverages a visual feature extractor coupled with a generative planning head to predict safe trajectories from raw pixel inputs directly. A core component of our architecture is the Context-aware Recalibration Transformer (CaRT), a novel module designed to effectively enhance multi-level visual features for more robust planning. We demonstrate through comprehensive experiments that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks, matching the capabilities of larger, multimodal diffusion planners while being significantly more efficient in terms of inference speed and model size, making it a practical solution for real-world deployment. Our work is open-source and the code will be at https://maxiuw.github.io/prix.
[56] Vision Transformer attention alignment with human visual perception in aesthetic object evaluation cs.CV | cs.AI | cs.LGPDF
Miguel Carrasco, César González-Martín, José Aranda, Luis Oliveros
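TL;DR: This paper studies how well Vision Transformer (ViT) attention aligns with human visual perception in aesthetic object evaluation; an eye-tracking experiment and attention-map analysis show that specific attention heads correlate strongly with human attention patterns.
Details
Motivation: To examine the correspondence between ViT attention mechanisms and human visual attention, particularly in aesthetic evaluation, and fill a gap in existing research.
Result: Correlation is best at sigma=2.4; attention head #12 is closest to human patterns, while heads #7 and #9 diverge significantly, indicating a fundamental difference between ViT's global attention and human focal attention.
Insight: Some ViT attention heads can approximate human visual behavior, especially on specific object features (such as buckles on basketry bags), but the overall strategy still differs from that of humans, which suggests directions for improving AI models.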
Abstract: Visual attention mechanisms play a crucial role in human perception and aesthetic evaluation. Recent advances in Vision Transformers (ViTs) have demonstrated remarkable capabilities in computer vision tasks, yet their alignment with human visual attention patterns remains underexplored, particularly in aesthetic contexts. This study investigates the correlation between human visual attention and ViT attention mechanisms when evaluating handcrafted objects. We conducted an eye-tracking experiment with 30 participants (9 female, 21 male, mean age 24.6 years) who viewed 20 artisanal objects comprising basketry bags and ginger jars. Using a Pupil Labs eye-tracker, we recorded gaze patterns and generated heat maps representing human visual attention. Simultaneously, we analyzed the same objects using a pre-trained ViT model with DINO (Self-DIstillation with NO Labels), extracting attention maps from each of the 12 attention heads. We compared human and ViT attention distributions using Kullback-Leibler divergence across varying Gaussian parameters (sigma=0.1 to 3.0). Statistical analysis revealed optimal correlation at sigma=2.4 ±0.03, with attention head #12 showing the strongest alignment with human visual patterns. Significant differences were found between attention heads, with heads #7 and #9 demonstrating the greatest divergence from human attention (p < 0.05, Tukey HSD test). Results indicate that while ViTs exhibit more global attention patterns compared to human focal attention, certain attention heads can approximate human visual behavior, particularly for specific object features like buckles in basketry items. These findings suggest potential applications of ViT attention mechanisms in product design and aesthetic evaluation, while highlighting fundamental differences in attention strategies between human perception and current AI models.
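The comparison described in the abstract, a Gaussian-smoothed gaze heat map scored against an attention map with Kullback-Leibler divergence, can be illustrated with a small stdlib-only sketch. Several details are assumptions not stated in the abstract: that sigma is the width of a Gaussian placed on each fixation, that the divergence is taken from the human map to the ViT map, and the helper names `gaze_heatmap`, `normalize`, and `kl_divergence`.

```python
import math


def gaze_heatmap(fixations, height, width, sigma):
    """Sum an isotropic Gaussian of width sigma at each (row, col) fixation."""
    heat = [[0.0] * width for _ in range(height)]
    for fy, fx in fixations:
        for y in range(height):
            for x in range(width):
                dist_sq = (y - fy) ** 2 + (x - fx) ** 2
                heat[y][x] += math.exp(-dist_sq / (2.0 * sigma ** 2))
    return heat


def normalize(grid, eps=1e-12):
    """Flatten a 2-D map into a probability vector; eps avoids log(0)."""
    flat = [v + eps for row in grid for v in row]
    total = sum(flat)
    return [v / total for v in flat]


def kl_divergence(p_map, q_map):
    """KL(P || Q) between two maps of the same shape."""
    p, q = normalize(p_map), normalize(q_map)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

A sweep over sigma, as in the study, would call `gaze_heatmap` for each candidate value and keep the one with the lowest divergence per attention head; lower values mean the head's attention is closer to where people looked.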
[57] Reusing Attention for One-stage Lane Topology Understanding cs.CVPDF
Yang Li, Zongzheng Zhang, Xuchong Qiu, Xinrun Li, Ziming Liu
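TL;DR: This paper proposes a one-stage architecture that reuses attention resources inside transformer decoders to predict traffic elements, lane centerlines, and topology relationships simultaneously, improving both the accuracy and inference speed of lane topology understanding.
Details
Motivation: Existing two-stage methods suffer from error propagation and high computational overhead, which hampers efficient lane topology understanding; this paper aims to address these problems.
Result: On the OpenLane-V2 dataset, the method outperforms baselines in lane detection, traffic element identification, and topology reasoning.
Insight: Attention reuse and knowledge distillation are effective means of efficient lane topology understanding, and they reduce the model's dependence on standard-definition (SD) maps.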
Abstract: Understanding lane topology relationships accurately is critical for safe autonomous driving. However, existing two-stage methods suffer from inefficiencies due to error propagation and increased computational overhead. To address these challenges, we propose a one-stage architecture that simultaneously predicts traffic elements, lane centerlines and topology relationships, improving both the accuracy and inference speed of lane topology understanding for autonomous driving. Our key innovation lies in reusing intermediate attention resources within distinct transformer decoders. This approach effectively leverages the inherent relational knowledge within the element detection module to enable the modeling of topology relationships among traffic elements and lanes without requiring additional computationally expensive graph networks. Furthermore, we are the first to demonstrate that knowledge can be distilled from models that utilize standard definition (SD) maps to those that operate without SD maps, enabling superior performance even in the absence of SD maps. Extensive experiments on the OpenLane-V2 dataset show that our approach outperforms baseline methods in both accuracy and efficiency, achieving superior results in lane detection, traffic element identification, and topology reasoning. Our code is available at https://github.com/Yang-Li-2000/one-stage.git.
[58] CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts cs.CVPDF
Olaf Dünkel, Artur Jesslen, Jiahao Xie, Christian Theobalt, Christian Rupprecht
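TL;DR: CNS-Bench is a new benchmark for evaluating image classifier robustness under continuous, realistic nuisance shifts. It generates continuous shifts using LoRA adapters and a filtering mechanism, giving a more complete picture of model behavior in OOD scenarios.
Details
Motivation: Existing OOD robustness evaluations mostly rely on simple synthetic corruptions or binary nuisance shifts, which fail to capture the continuous shifts of the real world and limit comprehensive robustness assessment.
Result: Experiments show that CNS-Bench evaluates robustness more comprehensively and that model rankings change with the nuisance shift. Continuous evaluation also identifies model failure points.
Insight: Continuous shifts reflect real scenarios better than binary ones; robustness evaluation needs finer-grained shift design and analysis.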
Abstract: An important challenge when using computer vision models in the real world is to evaluate their performance in potential out-of-distribution (OOD) scenarios. While simple synthetic corruptions are commonly applied to test OOD robustness, they often fail to capture nuisance shifts that occur in the real world. Recently, diffusion models have been applied to generate realistic images for benchmarking, but they are restricted to binary nuisance shifts. In this work, we introduce CNS-Bench, a Continuous Nuisance Shift Benchmark to quantify OOD robustness of image classifiers for continuous and realistic generative nuisance shifts. CNS-Bench allows generating a wide range of individual nuisance shifts in continuous severities by applying LoRA adapters to diffusion models. To address failure cases, we propose a filtering mechanism that outperforms previous methods, thereby enabling reliable benchmarking with generative models. With the proposed benchmark, we perform a large-scale study to evaluate the robustness of more than 40 classifiers under various nuisance shifts. Through carefully designed comparisons and analyses, we find that model rankings can change for varying shifts and shift scales, which cannot be captured when applying common binary shifts. Additionally, we show that evaluating the model performance on a continuous scale allows the identification of model failure points, providing a more nuanced understanding of model robustness. Project page including code and data: https://genintel.github.io/CNS.
[59] See the Forest and the Trees: A Synergistic Reasoning Framework for Knowledge-Based Visual Question Answering cs.CVPDF
Junjie Wang, Yunhan Tang, Yijie Wang, Zhihao Yuan, Huan Wang
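TL;DR: The paper proposes Synergos-VQA, a framework that fuses three complementary evidence streams (holistic, structural, and causal) to significantly improve knowledge-based visual question answering, reaching a new state of the art on multiple benchmarks.
Details
Motivation: Existing multimodal large language models (MLLMs) rely on single-dimensional evidence in knowledge-based VQA (KBVQA), which limits their reasoning. The paper aims for more comprehensive and robust reasoning by fusing evidence from multiple perspectives.
Result: Synergos-VQA achieves state-of-the-art performance on multiple benchmarks, including OK-VQA and A-OKVQA. The framework also significantly boosts other open-source MLLMs.
Insight: Synergistic fusion of multi-perspective evidence improves reasoning more than simply scaling up the model. Structural and causal reasoning also help interpretability and robustness.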
Abstract: Multimodal Large Language Models (MLLMs) have pushed the frontiers of Knowledge-Based Visual Question Answering (KBVQA), yet their reasoning is fundamentally bottlenecked by a reliance on uni-dimensional evidence. This “seeing only the trees, but not the forest” approach prevents robust, multi-faceted understanding. Inspired by the principle of seeing both the forest and trees, we propose Synergos-VQA, a novel synergistic reasoning framework. At its core, Synergos-VQA concurrently generates and fuses three complementary evidence streams at inference time: (1) Holistic Evidence to perceive the entire scene (the “forest”), (2) Structural Evidence from a prototype-driven module to identify key objects (the “trees”), and (3) Causal Evidence from a counterfactual probe to ensure the reasoning is robustly grounded. By synergistically fusing this multi-faceted evidence, our framework achieves a more comprehensive and reliable reasoning process. Extensive experiments show that Synergos-VQA decisively establishes a new state-of-the-art on three challenging benchmarks, including OK-VQA and A-OKVQA. Furthermore, our approach demonstrates strong plug-and-play capabilities, significantly boosting various open-source MLLMs and proving that superior methodological design can outperform sheer model scale.
[60] Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras cs.CV | cs.ROPDF
Lingdong Kong, Dongyue Lu, Ao Liang, Rong Li, Yuhao Dong
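TL;DR: The paper introduces Talk2Event, the first large-scale benchmark for language-driven object grounding with event cameras, and proposes EventRefer, a framework that dynamically fuses multi-attribute representations through a Mixture of Event-Attribute Experts (MoEE), significantly improving language understanding in event-camera scenes.
Details
Motivation: Event cameras offer microsecond-level latency and robustness to motion blur, which suits perception in dynamic environments, but connecting their asynchronous data streams to human language remains challenging.
Result: EventRefer significantly outperforms existing methods in event-only, frame-only, and event-frame fusion settings.
Insight: Dynamically fusing multi-attribute representations effectively improves language-driven perception in event-camera scenes and lays a foundation for real-time multimodal perception in robotics and autonomous driving.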
Abstract: Event cameras offer microsecond-level latency and robustness to motion blur, making them ideal for understanding dynamic environments. Yet, connecting these asynchronous streams to human language remains an open challenge. We introduce Talk2Event, the first large-scale benchmark for language-driven object grounding in event-based perception. Built from real-world driving data, we provide over 30,000 validated referring expressions, each enriched with four grounding attributes – appearance, status, relation to viewer, and relation to other objects – bridging spatial, temporal, and relational reasoning. To fully exploit these cues, we propose EventRefer, an attribute-aware grounding framework that dynamically fuses multi-attribute representations through a Mixture of Event-Attribute Experts (MoEE). Our method adapts to different modalities and scene dynamics, achieving consistent gains over state-of-the-art baselines in event-only, frame-only, and event-frame fusion settings. We hope our dataset and approach will establish a foundation for advancing multimodal, temporally-aware, and language-driven perception in real-world robotics and autonomy.
[61] BetterCheck: Towards Safeguarding VLMs for Automotive Perception Systems cs.CV | I.4.mPDF
Malsha Ashani Mahawatta Dona, Beatriz Cabrero-Daniel, Yinan Yu, Christian Berger
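TL;DR: BetterCheck proposes a method for detecting and guarding against VLM hallucinations in automotive perception systems to improve their safety.
Details
Motivation: VLMs excel at understanding complex traffic scenes, but their hallucinations could lead an autonomous driving system to make dangerous decisions, so a mechanism to detect and guard against them is needed.
Result: The study finds that VLMs show excellent image understanding but still hallucinate, and that BetterCheck can detect these hallucinations effectively.
Insight: VLMs are powerful, but hallucination limits their use in autonomous driving; methods like BetterCheck are needed for verification and safeguarding.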
Abstract: Large language models (LLMs) are increasingly extended to process multimodal data such as text and video simultaneously. Their remarkable performance in understanding what is shown in images is surpassing specialized neural networks (NNs) such as YOLO, which support only a well-formed but very limited vocabulary, i.e., the objects they are able to detect. When unrestricted, LLMs, and in particular state-of-the-art vision language models (VLMs), show impressive performance in describing even complex traffic situations. This makes them potentially suitable components for automotive perception systems to support the understanding of complex traffic situations or edge cases. However, LLMs and VLMs are prone to hallucination, which means either not seeing traffic agents, such as vulnerable road users, who are present in a situation, or seeing traffic agents who are not there in reality. While the latter is unwanted because it makes an ADAS or autonomous driving system (ADS) slow down unnecessarily, the former could lead to disastrous decisions by an ADS. In our work, we systematically assess the performance of 3 state-of-the-art VLMs on a diverse subset of traffic situations sampled from the Waymo Open Dataset to support safety guardrails for capturing such hallucinations in VLM-supported perception systems. We observe that both proprietary and open VLMs exhibit remarkable image understanding capabilities, even paying thorough attention to fine details that are sometimes difficult for humans to spot. However, they are still prone to making up elements in their descriptions, which to date requires hallucination detection strategies such as BetterCheck, which we propose in our work.
[62] Yume: An Interactive World Generation Model cs.CV | cs.AI | cs.HCPDF
Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng
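TL;DR: Yume is an interactive world generation model that creates a dynamic world from images, text, or video and supports exploration and control via keyboard or neural signals. The preview release achieves high-quality interactive video generation by quantizing camera motion, improving the video generation architecture, and optimizing the sampler.
Details
Motivation: To build a model that turns static inputs (images, text, or video) into an interactive dynamic world that users can explore and control in multiple ways.
Result: The model is trained on the high-quality Sekai dataset and performs well across diverse scenes. Code, data, and model weights are all open source.
Insight: Camera motion quantization and training-free sampling mechanisms offer new ideas for interactive world generation, and the open-source release supports community development.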
Abstract: Yume aims to use images, text, or videos to create an interactive, realistic, and dynamic world, which allows exploration and control using peripheral devices or neural signals. In this report, we present a preview version of Yume, which creates a dynamic world from an input image and allows exploration of the world using keyboard actions. To achieve this high-fidelity and interactive video world generation, we introduce a well-designed framework, which consists of four main components, including camera motion quantization, video generation architecture, advanced sampler, and model acceleration. First, we quantize camera motions for stable training and user-friendly interaction using keyboard inputs. Then, we introduce the Masked Video Diffusion Transformer (MVDT) with a memory module for infinite video generation in an autoregressive manner. After that, training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) are introduced to the sampler for better visual quality and more precise control. Moreover, we investigate model acceleration by synergistic optimization of adversarial distillation and caching mechanisms. We use the high-quality world exploration dataset Sekai to train Yume, and it achieves remarkable results in diverse scenes and applications. All data, codebase, and model weights are available on https://github.com/stdstu12/YUME. Yume will update monthly to achieve its original goal. Project page: https://stdstu12.github.io/YUME-Project/.
cs.CL [Back]
[63] AI-based Clinical Decision Support for Primary Care: A Real-World Study cs.CLPDF
Robert Korom, Sarah Kiptinness, Najib Adan, Kassim Said, Catherine Ithuli
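TL;DR: The study evaluates AI Consult, an LLM-based clinical decision support tool, in a real-world care setting. Integrated into clinical workflows, the tool reduced diagnostic and treatment errors and received positive clinician feedback, showing AI's potential to reduce medical errors.
Details
Motivation: Medical errors are a major problem in primary care. The study examines whether an AI tool can reduce errors in clinical decision-making and assesses its feasibility and effect in a real-world setting.
Result: Clinicians with AI Consult made 16% fewer diagnostic errors and 13% fewer treatment errors, which would avert a large number of errors each year. 75% of clinicians said it substantially improved the quality of care.
Insight: The study highlights the importance of workflow integration and active deployment of AI tools, and shows AI's potential to improve quality and safety in primary care.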
Abstract: We evaluate the impact of large language model-based clinical decision support in live care. In partnership with Penda Health, a network of primary care clinics in Nairobi, Kenya, we studied AI Consult, a tool that serves as a safety net for clinicians by identifying potential documentation and clinical decision-making errors. AI Consult integrates into clinician workflows, activating only when needed and preserving clinician autonomy. We conducted a quality improvement study, comparing outcomes for 39,849 patient visits performed by clinicians with or without access to AI Consult across 15 clinics. Visits were rated by independent physicians to identify clinical errors. Clinicians with access to AI Consult made relatively fewer errors: 16% fewer diagnostic errors and 13% fewer treatment errors. In absolute terms, the introduction of AI Consult would avert diagnostic errors in 22,000 visits and treatment errors in 29,000 visits annually at Penda alone. In a survey of clinicians with AI Consult, all clinicians said that AI Consult improved the quality of care they delivered, with 75% saying the effect was “substantial”. These results required a clinical workflow-aligned AI Consult implementation and active deployment to encourage clinician uptake. We hope this study demonstrates the potential for LLM-based clinical decision support tools to reduce errors in real-world settings and provides a practical framework for advancing responsible adoption.
[64] Harnessing RLHF for Robust Unanswerability Recognition and Trustworthy Response Generation in LLMs cs.CLPDF
Shuyuan Lin, Lei Duan, Philip Hughes, Yuxuan Sheng
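TL;DR: The paper proposes SALU, a new method that uses multi-task learning and RLHF to integrate unanswerability detection directly into the LLM's generation process, significantly reducing hallucinated content and improving reliability.
Details
Motivation: To overcome the limitations of traditional CIR systems in handling unanswerable questions and avoid generating misleading or hallucinated content.
Result: On the custom C-IR_Answerability dataset, SALU outperforms baseline models, and human evaluation confirms its high reliability and low hallucination rate.
Insight: Integrating unanswerability detection directly into LLM generation, combined with RLHF, effectively improves the model's awareness of its own knowledge boundaries.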
Abstract: Conversational Information Retrieval (CIR) systems, while offering intuitive access to information, face a significant challenge: reliably handling unanswerable questions to prevent the generation of misleading or hallucinated content. Traditional approaches often rely on external classifiers, which can introduce inconsistencies with the core generative Large Language Models (LLMs). This paper introduces Self-Aware LLM for Unanswerability (SALU), a novel approach that deeply integrates unanswerability detection directly within the LLM’s generative process. SALU is trained using a multi-task learning framework for both standard Question Answering (QA) and explicit abstention generation for unanswerable queries. Crucially, it incorporates a confidence-score-guided reinforcement learning with human feedback (RLHF) phase, which explicitly penalizes hallucinated responses and rewards appropriate abstentions, fostering intrinsic self-awareness of knowledge boundaries. Through extensive experiments on our custom-built C-IR_Answerability dataset, SALU consistently outperforms strong baselines, including hybrid LLM-classifier systems, in overall accuracy for correctly answering or abstaining from questions. Human evaluation further confirms SALU’s superior reliability, achieving high scores in factuality, appropriate abstention, and, most importantly, a dramatic reduction in hallucination, demonstrating its ability to robustly “know when to say ‘I don’t know’.”
[65] Text-to-SPARQL Goes Beyond English: Multilingual Question Answering Over Knowledge Graphs through Human-Inspired Reasoning cs.CL | cs.AI | cs.IRPDF
Aleksandr Perevalov, Andreas Both
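TL;DR: The paper proposes mKGQAgent, a framework that converts multilingual natural-language questions into SPARQL queries through modular, interpretable subtasks, and that took first place in the Text2SPARQL challenge.
Details
Motivation: Accessing knowledge through multilingual natural-language interfaces is an emerging challenge in information retrieval, and existing approaches mostly combine components while lacking modularity and interpretability.
Result: mKGQAgent took first place in the Text2SPARQL challenge 2025 on the DBpedia- and Corporate-based KGQA benchmarks.
Insight: Mimicking modular human reasoning, combined with in-context learning, effectively improves the capability and interpretability of multilingual semantic parsing.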
Abstract: Accessing knowledge via multilingual natural-language interfaces is one of the emerging challenges in the field of information retrieval and related ones. Structured knowledge stored in knowledge graphs can be queried via a specific query language (e.g., SPARQL). Therefore, one needs to transform natural-language input into a query to fulfill an information need. Prior approaches mostly focused on combining components (e.g., rule-based or neural-based) that solve downstream tasks and come up with an answer at the end. We introduce mKGQAgent, a human-inspired framework that breaks down the task of converting natural language questions into SPARQL queries into modular, interpretable subtasks. By leveraging a coordinated LLM agent workflow for planning, entity linking, and query refinement - guided by an experience pool for in-context learning - mKGQAgent efficiently handles multilingual KGQA. Evaluated on the DBpedia- and Corporate-based KGQA benchmarks within the Text2SPARQL challenge 2025, our approach took first place among the other participants. This work opens new avenues for developing human-like reasoning systems in multilingual semantic parsing.
[66] CogDual: Enhancing Dual Cognition of LLMs via Reinforcement Learning with Implicit Rule-Based Rewards cs.CLPDF
Cheng Liu, Yifei Lu, Fanghua Ye, Jian Li, Xingyu Chen
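TL;DR: The paper proposes CogDual, a role-playing language agent (RPLA) whose cognitive abilities are enhanced through reinforcement learning. Its novelty lies in jointly modeling external situational awareness and internal self-awareness, with performance optimized via reinforcement learning and implicit rule-based rewards. Experiments show that CogDual performs well across multiple tasks.
Details
Motivation: Existing RPLAs mainly rely on prompt engineering or supervised fine-tuning and neglect the cognitive mechanisms behind behavior. Drawing on cognitive psychology, the authors propose imitating human cognition to improve role-playing consistency.
Result: On the CoSER, Cross-MR, and LifeChoice benchmarks, CogDual significantly outperforms existing baselines and generalizes well across tasks.
Insight: The key to role-playing language agents is modeling human cognitive mechanisms rather than merely imitating behavior. Combining reinforcement learning with implicit rule-based rewards is an effective way to improve open-domain performance.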
Abstract: Role-Playing Language Agents (RPLAs) have emerged as a significant application direction for Large Language Models (LLMs). Existing approaches typically rely on prompt engineering or supervised fine-tuning to enable models to imitate character behaviors in specific scenarios, but often neglect the underlying cognitive mechanisms driving these behaviors. Inspired by cognitive psychology, we introduce CogDual, a novel RPLA adopting a cognize-then-respond reasoning paradigm. By jointly modeling external situational awareness and internal self-awareness, CogDual generates responses with improved character consistency and contextual alignment. To further optimize the performance, we employ reinforcement learning with two general-purpose reward schemes designed for open-domain text generation. Extensive experiments on the CoSER benchmark, as well as Cross-MR and LifeChoice, demonstrate that CogDual consistently outperforms existing baselines and generalizes effectively across diverse role-playing tasks.
[67] CLARIFID: Improving Radiology Report Generation by Reinforcing Clinically Accurate Impressions and Enforcing Detailed Findings cs.CLPDF
Kyeongkyu Lee, Seonghwan Yoon, Hongki Lim
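TL;DR: CLARIFID proposes a novel framework that optimizes the diagnostic accuracy of radiology reports by mirroring the expert workflow; combining multi-view X-rays with reinforcement learning, it significantly improves the clinical validity of reports.
Details
Motivation: Current radiology report generation methods focus on fluent text while neglecting the correctness of diagnostic facts, and mostly rely on single-view images, which limits diagnostic comprehensiveness.
Result: On the MIMIC-CXR dataset, CLARIFID outperforms existing baselines on both natural language generation metrics and clinical scores.
Insight: Simulating the expert workflow and fusing multiple views markedly improve the clinical reliability of generated radiology reports, and the reasoning-aware decoding strategy ensures logical consistency.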
Abstract: Automatic generation of radiology reports has the potential to alleviate radiologists’ significant workload, yet current methods struggle to deliver clinically reliable conclusions. In particular, most prior approaches focus on producing fluent text without effectively ensuring the factual correctness of the reports and often rely on single-view images, limiting diagnostic comprehensiveness. We propose CLARIFID, a novel framework that directly optimizes diagnostic correctness by mirroring the two-step workflow of experts. Specifically, CLARIFID (1) learns the logical flow from Findings to Impression through section-aware pretraining, (2) is fine-tuned with Proximal Policy Optimization in which the CheXbert F1 score of the Impression section serves as the reward, (3) enforces reasoning-aware decoding that completes “Findings” before synthesizing the “Impression”, and (4) fuses multiple chest X-ray views via a vision-transformer-based multi-view encoder. During inference, we apply a reasoning-aware next-token forcing strategy followed by report-level re-ranking, ensuring that the model first produces a comprehensive Findings section before synthesizing the Impression and thereby preserving coherent clinical reasoning. Experimental results on the MIMIC-CXR dataset demonstrate that our method achieves superior clinical efficacy and outperforms existing baselines on both standard NLG metrics and clinically aware scores.
[68] Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge cs.CL | eess.ASPDF
Miaomiao Gao, Xiaoxiao Xiang, Yiwen Guo
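TL;DR: The paper presents Triple X, a multilingual speech recognition system with an innovative encoder-adapter-LLM architecture and a multi-stage training strategy, which took second place in the INTERSPEECH2025 MLC-SLM Challenge.
Details
Motivation: To address speech recognition in multilingual conversational scenarios and improve recognition accuracy.
Result: Achieves competitive word error rate (WER) on both the development and test sets of the challenge, placing second.
Insight: Combining large language models with adaptation training on multilingual datasets can significantly improve multilingual speech recognition.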
Abstract: This paper describes our Triple X speech recognition system submitted to Task 1 of the Multi-Lingual Conversational Speech Language Modeling (MLC-SLM) Challenge. Our work focuses on optimizing speech recognition accuracy in multilingual conversational scenarios through an innovative encoder-adapter-LLM architecture. This framework harnesses the powerful reasoning capabilities of text-based large language models while incorporating domain-specific adaptations. To further enhance multilingual recognition performance, we adopted a meticulously designed multi-stage training strategy leveraging extensive multilingual audio datasets. Experimental results demonstrate that our approach achieves competitive Word Error Rate (WER) performance on both dev and test sets, obtaining second place in the challenge ranking.
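For readers unfamiliar with the metric reported above, Word Error Rate is the word-level edit distance between the reference and the recognized transcript, divided by the number of reference words. The sketch below is a generic illustration rather than the challenge's official scorer; the function name `word_error_rate` and the whitespace tokenization are assumptions, and real evaluations usually apply text normalization first.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

Whitespace splitting does not suit languages written without spaces, where character error rate is the usual substitute. Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.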
[69] Millions of $\text{GeAR}$-s: Extending GraphRAG to Millions of Documents cs.CL | cs.AI | cs.IR
Zhili Shen, Chenxin Diao, Pascual Merita, Pavlos Vougiouklis, Jeff Z. Pan
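TL;DR: The paper examines whether graph-based retrieval-augmented generation (RAG) can scale to large document collections, studying the performance and limitations of an existing method on the SIGIR 2025 LiveRAG Challenge.
Details
Motivation: Current graph-based RAG methods are mostly designed for specific tasks (such as multi-hop question answering) and lack validation on large, general-purpose datasets, so their scalability and generality need study.
Result: Experiments show that $\text{GeAR}$ offers some scalability and performance on large-scale document tasks, but they also expose its limitations.
Insight: Introducing graph structure can improve retrieval efficiency, but the complexity and diversity of large document collections place higher demands on method design.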
Abstract: Recent studies have explored graph-based approaches to retrieval-augmented generation, leveraging structured or semi-structured information – such as entities and their relations extracted from documents – to enhance retrieval. However, these methods are typically designed to address specific tasks, such as multi-hop question answering and query-focused summarisation, and there is therefore limited evidence of their general applicability across broader datasets. In this paper, we adapt a state-of-the-art graph-based RAG solution, $\text{GeAR}$, and explore its performance and limitations on the SIGIR 2025 LiveRAG Challenge.
[70] MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs cs.CL | cs.AI
Alexander R. Fabbri, Diego Mares, Jorge Flores, Meher Mankikar, Ernesto Hernandez
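TL;DR: The paper introduces MultiNRC, a benchmark that evaluates the reasoning ability of large language models (LLMs) in multilingual and cultural contexts; results show that current LLMs fall short on native multilingual reasoning tasks.
Details
Motivation: Existing evaluations mostly rely on translations of English benchmarks and lack assessment of reasoning grounded in native languages and cultures, so a more comprehensive multilingual reasoning benchmark is needed.
Result: LLMs perform poorly on native multilingual reasoning (accuracy below 50%), and on math reasoning most models do substantially better in English than in the original languages (+10%).
Insight: LLMs differ markedly across linguistic, cultural, and logical reasoning tasks, and culturally grounded knowledge remains a weak point.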
Abstract: Although recent Large Language Models (LLMs) have shown rapid improvement on reasoning benchmarks in English, the evaluation of such LLMs’ multilingual reasoning capability across diverse languages and cultural contexts remains limited. Existing multilingual reasoning benchmarks are typically constructed by translating existing English reasoning benchmarks, biasing these benchmarks towards reasoning problems with context in English language/cultures. In this work, we introduce the Multilingual Native Reasoning Challenge (MultiNRC), a benchmark designed to assess LLMs on more than 1,000 native, linguistically and culturally grounded reasoning questions written by native speakers in French, Spanish, and Chinese. MultiNRC covers four core reasoning categories: language-specific linguistic reasoning, wordplay & riddles, cultural/tradition reasoning, and math reasoning with cultural relevance. For cultural/tradition reasoning and math reasoning with cultural relevance, we also provide English equivalent translations of the multilingual questions, manually translated by native speakers fluent in English. This set of English equivalents allows a direct comparison of LLM reasoning capacity in other languages vs. English on the same reasoning questions. We systematically evaluate 14 current leading LLMs covering most LLM families on MultiNRC and its English equivalent set. The results show that (1) current LLMs are still not good at native multilingual reasoning, with none scoring above 50% on MultiNRC; (2) LLMs exhibit distinct strengths and weaknesses in handling linguistic, cultural, and logical reasoning tasks; (3) most models perform substantially better in math reasoning in English than in the original languages (+10%), indicating persistent challenges with culturally grounded knowledge.
[71] Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice cs.CL | cs.SD | eess.AS
Shanbo Cheng, Yu Bao, Zhichao Huang, Yu Lu, Ningxin Peng
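TL;DR: Seed-LiveInterpret 2.0 is an end-to-end simultaneous interpretation model. Its novel duplex speech-to-speech understanding-generation framework addresses poor transcription and translation quality and the lack of real-time speech generation, and it substantially improves both translation accuracy and latency.
Details
Motivation: The work targets core challenges in simultaneous interpretation (SI): low-quality transcription and translation, insufficient real-time capability, multi-speaker confusion, and translated speech inflation in long-form discourse.
Result: Translation correctness exceeds 70% in complex scenarios, and the average latency of cloned speech drops from nearly 10 seconds to about 3 seconds, clearly outperforming commercial solutions.
Insight: Large-scale pretraining and reinforcement learning are key to high-quality, low-latency speech-to-speech translation, and the duplex framework effectively addresses the bottlenecks of traditional SI.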
Abstract: Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry, with product-level automatic systems long plagued by intractable challenges: subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated speech inflation, especially in long-form discourses. In this study, we introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities. As a fully operational product-level solution, Seed-LiveInterpret 2.0 tackles these challenges head-on through our novel duplex speech-to-speech understanding-generating framework. Experimental results demonstrate that through large-scale pretraining and reinforcement learning, the model achieves a significantly better balance between translation accuracy and latency, validated by human interpreters to exceed 70% correctness in complex scenarios. Notably, Seed-LiveInterpret 2.0 outperforms commercial SI solutions by significant margins in translation quality, while cutting the average latency of cloned speech from nearly 10 seconds to a near-real-time 3 seconds, a reduction of roughly 70% that drastically enhances practical usability.
[72] Megrez2 Technical Report cs.CL
Boxun Li, Yadong Li, Zhiyuan Li, Congyi Liu, Weilin Liu
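TL;DR: Megrez2 is a lightweight, efficient language model architecture for on-device deployment that uses cross-layer expert sharing and pre-gated routing to cut parameter count and speed up inference.
Details
Motivation: Provide a language model architecture that can be deployed efficiently on resource-constrained devices while balancing performance and efficiency.
Result: Megrez2-Preview, with 3B activated and 7.5B stored parameters, performs strongly on tasks such as language understanding and mathematical reasoning.
Insight: A lightweight design can reduce resource usage while preserving performance, which makes it suitable for practical deployment.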
Abstract: We present Megrez2, a novel lightweight and high-performance language model architecture optimized for device native deployment. Megrez2 introduces a novel cross-layer expert sharing mechanism, which significantly reduces total parameter count by reusing expert modules across adjacent transformer layers while maintaining most of the model’s capacity. It also incorporates pre-gated routing, enabling memory-efficient expert loading and faster inference. As the first instantiation of the Megrez2 architecture, we introduce the Megrez2-Preview model, which is pre-trained on a 5-trillion-token corpus and further enhanced through supervised fine-tuning and reinforcement learning with verifiable rewards. With only 3B activated and 7.5B stored parameters, Megrez2-Preview demonstrates competitive or superior performance compared to larger models on a wide range of tasks, including language understanding, instruction following, mathematical reasoning, and code generation. These results highlight the effectiveness of the Megrez2 architecture to achieve a balance between accuracy, efficiency, and deployability, making it a strong candidate for real-world, resource-constrained applications.
[73] Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks cs.CL | cs.AI
Linbo Cao, Jinman Zhao
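TL;DR: The paper proposes a debate-based QA evaluation paradigm that converts conventional QA datasets into adversarial debate tasks, substantially raising evaluation difficulty while reducing the impact of data contamination and memorization.
Details
Motivation: As frontier language models approach saturation on standard QA benchmarks, data contamination, memorization, and dataset creation costs become increasingly pressing. The paper aims for a sustainable evaluation method that measures advanced reasoning more faithfully.
Result: Experiments show that the method is robust to data contamination (a model fine-tuned on test questions performs worse in debates) and cost-effective. Even weaker judges can reliably distinguish stronger debaters, supporting the paradigm's scalability.
Insight: Debate-based evaluation avoids the cost of repeatedly creating new datasets and measures genuine reasoning ability more effectively, offering a sustainable path for evaluating more capable future systems.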
Abstract: As frontier language models increasingly saturate standard QA benchmarks, concerns about data contamination, memorization, and escalating dataset creation costs persist. We propose a debate-driven evaluation paradigm that transforms any existing QA dataset into structured adversarial debates–where one model is given the official answer to defend, and another constructs and defends an alternative answer–adjudicated by a judge model blind to the correct solution. By forcing multi-round argumentation, this approach substantially increases difficulty while penalizing shallow memorization, yet reuses QA items to reduce curation overhead. We make two main contributions: (1) an evaluation pipeline to systematically convert QA tasks into debate-based assessments, and (2) a public benchmark that demonstrates our paradigm’s effectiveness on a subset of MMLU-Pro questions, complete with standardized protocols and reference models. Empirical results validate the robustness of the method and its effectiveness against data contamination–a Llama 3.1 model fine-tuned on test questions showed dramatic accuracy improvements (50% -> 82%) but performed worse in debates. Results also show that even weaker judges can reliably differentiate stronger debaters, highlighting how debate-based evaluation can scale to future, more capable systems while maintaining a fraction of the cost of creating new benchmarks. Overall, our framework underscores that “pretraining on the test set is no longer all you need,” offering a sustainable path for measuring the genuine reasoning ability of advanced language models.
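The debate protocol described in this abstract can be pictured as a small loop: one debater defends the official answer, another argues for an alternative, and a judge who never sees the answer key picks a side. The sketch below is only an illustration of that control flow. The callable interfaces, the "A"/"B" speaker labels, and the function name `run_debate` are hypothetical and are not taken from the paper's pipeline.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces: each "model" is a plain callable supplied by the caller.
ProDebater = Callable[[str, str, List[str]], str]  # (question, official_answer, transcript) -> argument
ConDebater = Callable[[str, List[str]], str]       # (question, transcript) -> argument for an alternative
Judge = Callable[[str, List[str]], str]            # (question, transcript) -> "A" or "B"


def run_debate(question: str, official_answer: str, pro: ProDebater,
               con: ConDebater, judge: Judge, rounds: int = 2) -> Dict[str, object]:
    """Run a multi-round debate; debater A defends the official answer."""
    transcript: List[str] = []
    for _ in range(rounds):
        # Each debater sees a copy of the transcript so far, then adds one turn.
        transcript.append("A: " + pro(question, official_answer, list(transcript)))
        transcript.append("B: " + con(question, list(transcript)))
    # The judge receives only the question and the arguments, never the answer key.
    verdict = judge(question, list(transcript))
    if verdict not in ("A", "B"):
        raise ValueError("judge must return 'A' or 'B'")
    return {"verdict": verdict, "transcript": transcript}
```

A real harness would presumably also randomize which side speaks first so that the judge cannot infer the official answer from position alone.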
cs.GR
[74] Controllable Video Generation: A Survey cs.GR | cs.CV
Yue Ma, Kunyu Feng, Zhongyuan Hu, Xinyu Wang, Yucheng Wang
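TL;DR: This survey systematically reviews the theoretical foundations and recent advances of controllable video generation, focusing on how multimodal conditions (such as camera motion and depth maps) extend pretrained video generation models to reflect user intent more precisely.
Details
Motivation: With the rapid development of AI-generated content (AIGC), video generation has become one of its most impactful subfields. Existing text-to-video models, however, struggle to express complex, multimodal, and fine-grained user requirements, which calls for more flexible control mechanisms.
Result: The survey summarizes the current state of controllable video generation research, proposes a taxonomy, and compiles a repository of the related literature.
Insight: Future work could explore dynamic fusion of multimodal conditions and more universal controllable video generation frameworks.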
Abstract: With the rapid development of AI-generated content (AIGC), video generation has emerged as one of its most dynamic and impactful subfields. In particular, the advancement of video generation foundation models has led to growing demand for controllable video generation methods that can more accurately reflect user intent. Most existing foundation models are designed for text-to-video generation, where text prompts alone are often insufficient to express complex, multi-modal, and fine-grained user requirements. This limitation makes it challenging for users to generate videos with precise control using current models. To address this issue, recent research has explored the integration of additional non-textual conditions, such as camera motion, depth maps, and human pose, to extend pretrained video generation models and enable more controllable video synthesis. These approaches aim to enhance the flexibility and practical applicability of AIGC-driven video generation systems. In this survey, we provide a systematic review of controllable video generation, covering both theoretical foundations and recent advances in the field. We begin by introducing the key concepts and commonly used open-source video generation models. We then focus on control mechanisms in video diffusion models, analyzing how different types of conditions can be incorporated into the denoising process to guide generation. Finally, we categorize existing methods based on the types of control signals they leverage, including single-condition generation, multi-condition generation, and universal controllable generation. For a complete list of the literature on controllable video generation reviewed, please visit our curated repository at https://github.com/mayuelala/Awesome-Controllable-Video-Generation.
[75] StreamME: Simplify 3D Gaussian Avatar within Live Stream cs.GR | cs.AI | cs.CV
Luchuan Song, Yang Zhou, Zhan Xu, Yi Zhou, Deepali Aneja
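TL;DR: StreamME proposes a fast 3D head avatar reconstruction method for live video streams that needs no pre-cached data; an on-the-fly training strategy and a simplified point cloud distribution improve efficiency and protect privacy.
Details
Motivation: Existing 3D avatar reconstruction methods usually require pre-cached data or rely on neural networks such as MLPs, which makes them slow and hard to adapt to live video streams. StreamME aims to remove these limitations.
Result: The method significantly speeds up avatar reconstruction, works on live video streams, protects facial privacy, and reduces communication bandwidth in VR systems or online conferences.
Insight: Simplified geometric representation and on-the-fly training are key to real-time 3D avatar reconstruction, and the method suggests new directions for real-time applications such as virtual meetings and animation.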
Abstract: We propose StreamME, a method that focuses on fast 3D avatar reconstruction. StreamME synchronously records and reconstructs a head avatar from live video streams without any pre-cached data, enabling seamless integration of the reconstructed appearance into downstream applications. This exceptionally fast training strategy, which we refer to as on-the-fly training, is central to our approach. Our method is built upon 3D Gaussian Splatting (3DGS), eliminating the reliance on MLPs in deformable 3DGS and relying solely on geometry, which significantly improves the adaptation speed to facial expressions. To further ensure high efficiency in on-the-fly training, we introduce a simplification strategy based on primary points, which distributes the point clouds more sparsely across the facial surface, optimizing the number of points while maintaining rendering quality. Leveraging the on-the-fly training capabilities, our method protects facial privacy and reduces communication bandwidth in VR systems or online conferences. Additionally, it can be directly applied to downstream applications such as animation, toonification, and relighting. Please refer to our project page for more details: https://songluchuan.github.io/StreamME/.
cs.SD
[76] BoSS: Beyond-Semantic Speech cs.SD | cs.CL | eess.AS
Qing Wang, Zehan Li, Hang Lv, Hongjie Chen, Yaodong Song
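TL;DR: The paper introduces the concept of Beyond-Semantic Speech (BoSS) and a hierarchical framework (L1-L5) for assessing the capability of spoken interaction systems, highlighting that current speech models fall short in capturing implicit signals such as emotion and context.
Details
Motivation: Modern speech technologies such as ASR and TTS fail to capture the implicit, beyond-semantic signals in human communication (for example emotion and context), which prevents more natural human-machine interaction.
Result: Current spoken language models struggle to fully interpret BoSS signals, indicating that further research is needed to improve context awareness.
Insight: BoSS research offers a new direction for human-machine interaction, underscoring the importance of affective and contextual signals; future speech technology should pay more attention to modeling multidimensional features.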
Abstract: Human communication involves more than explicit semantics, with implicit signals and contextual cues playing a critical role in shaping meaning. However, modern speech technologies, such as Automatic Speech Recognition (ASR) and Text-to-Speech (TTS), often fail to capture these beyond-semantic dimensions. To better characterize and benchmark the progression of speech intelligence, we introduce Spoken Interaction System Capability Levels (L1-L5), a hierarchical framework illustrating the evolution of spoken dialogue systems from basic command recognition to human-like social interaction. To support these advanced capabilities, we propose Beyond-Semantic Speech (BoSS), which refers to the set of information in speech communication that encompasses but transcends explicit semantics. It conveys emotions and contexts, and modifies or extends meanings through multidimensional features such as affective cues, contextual dynamics, and implicit semantics, thereby enhancing the understanding of communicative intentions and scenarios. We present a formalized framework for BoSS, leveraging cognitive relevance theories and machine learning models to analyze temporal and contextual speech dynamics. Evaluating BoSS-related attributes across five dimensions reveals that current spoken language models (SLMs) struggle to fully interpret beyond-semantic signals. These findings highlight the need for advancing BoSS research to enable richer, more context-aware human-machine communication.
[77] Audio-Vision Contrastive Learning for Phonological Class Recognition cs.SD | cs.CV | cs.MM | eess.AS
Daiqi Liu, Tomás Arias-Vergara, Jana Hutter, Andreas Maier, Paula Andrea Pérez-Toro
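TL;DR: The paper proposes a multimodal deep learning framework that combines real-time magnetic resonance imaging (rtMRI) and speech signals to classify three key articulatory dimensions: manner of articulation, place of articulation, and voicing. Using contrastive learning, the framework reaches state-of-the-art performance on the USC-TIMIT dataset with an average F1-score of 0.81.
Details
Motivation: Accurate classification of articulatory-phonological features is vital for understanding human speech production and building robust speech technologies, particularly in clinical settings, where targeted phonemic analysis and therapy can improve diagnostic accuracy and personalized rehabilitation.
Result: On USC-TIMIT, the contrastive learning-based approach achieves an average F1-score of 0.81, an absolute gain of 0.23 over the unimodal baseline.
Insight: Contrastive learning offers clear advantages for multimodal representation learning: it combines information from different modalities effectively and improves performance on speech analysis tasks.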
Abstract: Accurate classification of articulatory-phonological features plays a vital role in understanding human speech production and developing robust speech technologies, particularly in clinical contexts where targeted phonemic analysis and therapy can improve disease diagnosis accuracy and personalized rehabilitation. In this work, we propose a multimodal deep learning framework that combines real-time magnetic resonance imaging (rtMRI) and speech signals to classify three key articulatory dimensions: manner of articulation, place of articulation, and voicing. We perform classification on 15 phonological classes derived from the aforementioned articulatory dimensions and evaluate the system with four audio/vision configurations: unimodal rtMRI, unimodal audio signals, multimodal middle fusion, and contrastive learning-based audio-vision fusion. Experimental results on the USC-TIMIT dataset show that our contrastive learning-based approach achieves state-of-the-art performance, with an average F1-score of 0.81, representing an absolute increase of 0.23 over the unimodal baseline. The results confirm the effectiveness of contrastive representation learning for multimodal articulatory analysis. Our code and processed dataset will be made publicly available at https://github.com/DaE-plz/AC_Contrastive_Phonology to support future research.
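The abstract does not state which contrastive objective is used. A common choice for pairing two modalities is a symmetric InfoNCE loss, in which each audio embedding should score highest against the rtMRI embedding from the same time window and vice versa. The pure-Python sketch below assumes that objective, non-zero embedding vectors, and a temperature of 0.07; all names are illustrative and none of it comes from the authors' released code.

```python
import math
from typing import List, Sequence


def _normalize(vec: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the row maximum for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[i]
    return total / len(logits)


def symmetric_info_nce(audio: Sequence[Sequence[float]],
                       video: Sequence[Sequence[float]],
                       temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch of paired audio/video embeddings."""
    a = [_normalize(x) for x in audio]
    v = [_normalize(x) for x in video]
    # logits[i][j] = cosine similarity of audio i and video j, scaled by temperature.
    logits = [[sum(p * q for p, q in zip(ai, vj)) / temperature for vj in v] for ai in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(transposed))
```

The loss is small when matched pairs are the most similar entries in their row and column, and large when the pairing is scrambled.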
cs.RO
[78] InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation cs.RO | cs.CV
Shuai Yang, Hao Li, Yilun Chen, Bin Wang, Yang Tian
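TL;DR: InstructVLA is an end-to-end vision-language-action model that, through a new training paradigm called VLA-IT, delivers leading performance in reasoning and action generation while preserving the flexibility of large vision-language models.
Details
Motivation: Existing vision-language-action models sacrifice either reasoning or action ability, are confined to task-specific data, and forget pre-trained capabilities; the paper sets out to fix these problems.
Result: InstructVLA improves by 30.5% on SimplerEnv tasks and outperforms the baseline model by 92% on the SimplerEnv-Instruct benchmark.
Insight: Textual reasoning can boost action performance, which opens the way to intuitive, steerable human-robot interaction combined with efficient policy learning.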
Abstract: To operate effectively in the real world, robots must integrate multimodal reasoning with precise action generation. However, existing vision-language-action (VLA) models often sacrifice one for the other, narrow their abilities to task-specific manipulation data, and suffer catastrophic forgetting of pre-trained vision-language capabilities. To bridge this gap, we introduce InstructVLA, an end-to-end VLA model that preserves the flexible reasoning of large vision-language models (VLMs) while delivering leading manipulation performance. InstructVLA introduces a novel training paradigm, Vision-Language-Action Instruction Tuning (VLA-IT), which employs multimodal training with mixture-of-experts adaptation to jointly optimize textual reasoning and action generation on both standard VLM corpora and a curated 650K-sample VLA-IT dataset. On in-domain SimplerEnv tasks, InstructVLA achieves 30.5% improvement over SpatialVLA. To evaluate generalization, we introduce SimplerEnv-Instruct, an 80-task benchmark requiring closed-loop control and high-level instruction understanding, where it outperforms a fine-tuned OpenVLA by 92% and an action expert aided by GPT-4o by 29%. Additionally, InstructVLA surpasses baseline VLMs on multimodal tasks and exhibits inference-time scaling by leveraging textual reasoning to boost manipulation performance in both simulated and real-world settings. These results demonstrate InstructVLA’s potential for bridging intuitive and steerable human-robot interaction with efficient policy learning.
eess.IV
[79] Harmonization in Magnetic Resonance Imaging: A Survey of Acquisition, Image-level, and Feature-level Methods eess.IV | cs.CV | physics.med-ph
Qinqin Yang, Firoozeh Shomal-Zadeh, Ali Gholipour
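TL;DR: This survey gives a comprehensive overview of image harmonization in medical imaging, especially MRI, analyzing acquisition, image-level, and feature-level methods and discussing future research directions.
Details
Motivation: Medical imaging data are heterogeneous across scanners, protocols, and sites (so-called batch effects). This non-biological variability can obscure true biological signals and impair the generalization of learning-based models. Image harmonization aims to remove these biases.
Result: The survey organizes representative methods and datasets, highlights the potential of deep learning, and also points out the limitations of harmonization techniques.
Insight: The core challenge of harmonization is balancing the removal of site effects against the preservation of biological information. Future work may need to combine multimodal data or develop more adaptive algorithms.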
Abstract: Modern medical imaging technologies have greatly advanced neuroscience research and clinical diagnostics. However, imaging data collected across different scanners, acquisition protocols, or imaging sites often exhibit substantial heterogeneity, known as “batch effects” or “site effects”. These non-biological sources of variability can obscure true biological signals, reduce reproducibility and statistical power, and severely impair the generalizability of learning-based models across datasets. Image harmonization aims to eliminate or mitigate such site-related biases while preserving meaningful biological information, thereby improving data comparability and consistency. This review provides a comprehensive overview of key concepts, methodological advances, publicly available datasets, current challenges, and future directions in the field of medical image harmonization, with a focus on magnetic resonance imaging (MRI). We systematically cover the full imaging pipeline, and categorize harmonization approaches into prospective acquisition and reconstruction strategies, retrospective image-level and feature-level methods, and traveling-subject-based techniques. Rather than providing an exhaustive survey, we focus on representative methods, with particular emphasis on deep learning-based approaches. Finally, we summarize the major challenges that remain and outline promising avenues for future research.
[80] A Versatile Pathology Co-pilot via Reasoning Enhanced Multimodal Large Language Model eess.IV | cs.AI | cs.CV
Zhe Xu, Ziyi Liu, Junlin Hou, Jiabo Ma, Cheng Jin
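TL;DR: The paper presents SmartPath-R1, a multimodal large language model that handles both ROI-level and WSI-level pathology tasks, uses reinforcement learning and a mixture-of-experts mechanism for dynamic multitask processing, and shows strong pathological reasoning ability.
Details
Motivation: Current multimodal large language models in pathology have limited reasoning ability, depend on expensive chain-of-thought annotations, and support only simple VQA tasks, so they cannot meet the multitask needs of clinical practice.
Result: Experiments across 72 tasks validate the model's effectiveness and superiority and demonstrate its potential for pathology analysis.
Insight: Leveraging the intrinsic knowledge of the MLLM removes the need for chain-of-thought annotations while enabling multitask, multiscale analysis, pointing to a new direction for general-purpose AI systems in precision pathology.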
Abstract: Multimodal large language models (MLLMs) have emerged as powerful tools for computational pathology, offering unprecedented opportunities to integrate pathological images with language context for comprehensive diagnostic analysis. These models hold particular promise for automating complex tasks that traditionally require expert interpretation of pathologists. However, current MLLM approaches in pathology demonstrate significantly constrained reasoning capabilities, primarily due to their reliance on expensive chain-of-thought annotations. Additionally, existing methods remain limited to simplex application of visual question answering (VQA) at region-of-interest (ROI) level, failing to address the full spectrum of diagnostic needs such as ROI classification, detection, segmentation, whole-slide-image (WSI) classification and VQA in clinical practice. In this study, we present SmartPath-R1, a versatile MLLM capable of simultaneously addressing both ROI-level and WSI-level tasks while demonstrating robust pathological reasoning capability. Our framework combines scale-dependent supervised fine-tuning and task-aware reinforcement fine-tuning, which circumvents the requirement for chain-of-thought supervision by leveraging the intrinsic knowledge within MLLM. Furthermore, SmartPath-R1 integrates multiscale and multitask analysis through a mixture-of-experts mechanism, enabling dynamic processing for diverse tasks. We curate a large-scale dataset comprising 2.3M ROI samples and 188K WSI samples for training and evaluation. Extensive experiments across 72 tasks validate the effectiveness and superiority of the proposed approach. This work represents a significant step toward developing versatile, reasoning-enhanced AI systems for precision pathology.
[81] Mammo-Mamba: A Hybrid State-Space and Transformer Architecture with Sequential Mixture of Experts for Multi-View Mammography eess.IV | cs.CV | cs.LG
Farnoush Bayatmakou, Reza Taleei, Nicole Simone, Arash Mohammadi
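TL;DR: Mammo-Mamba proposes a new architecture that combines selective state-space models (SSMs), Transformer attention, and expert-driven feature refinement for multi-view mammogram classification. It addresses the high computational complexity of conventional Transformers and performs well in both classification accuracy and computational efficiency.
Details
Motivation: Multi-view mammogram classification is essential for early breast cancer diagnosis, but existing Transformer-based models are computationally expensive, so more efficient alternatives are needed.
Result: On the CBIS-DDSM dataset, Mammo-Mamba outperforms existing methods on all key metrics while remaining computationally efficient.
Insight: Combining state-space models with attention can balance model performance and computational cost, which suits high-resolution medical imaging tasks.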
Abstract: Breast cancer (BC) remains one of the leading causes of cancer-related mortality among women, despite recent advances in Computer-Aided Diagnosis (CAD) systems. Accurate and efficient interpretation of multi-view mammograms is essential for early detection, driving a surge of interest in Artificial Intelligence (AI)-powered CAD models. While state-of-the-art multi-view mammogram classification models are largely based on Transformer architectures, their computational complexity scales quadratically with the number of image patches, highlighting the need for more efficient alternatives. To address this challenge, we propose Mammo-Mamba, a novel framework that integrates Selective State-Space Models (SSMs), transformer-based attention, and expert-driven feature refinement into a unified architecture. Mammo-Mamba extends the MambaVision backbone by introducing the Sequential Mixture of Experts (SeqMoE) mechanism through its customized SecMamba block. The SecMamba is a modified MambaVision block that enhances representation learning in high-resolution mammographic images by enabling content-adaptive feature refinement. These blocks are integrated into the deeper stages of MambaVision, allowing the model to progressively adjust feature emphasis through dynamic expert gating, effectively mitigating the limitations of traditional Transformer models. Evaluated on the CBIS-DDSM benchmark dataset, Mammo-Mamba achieves superior classification performance across all key metrics while maintaining computational efficiency.
cs.SI
[82] Disaster Informatics after the COVID-19 Pandemic: Bibliometric and Topic Analysis based on Large-scale Academic Literature cs.SI | cs.AI | cs.CL | cs.DL
Ngan Tran, Haihua Chen, Ana Cleveland, Yuhan Zhou
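TL;DR: Using bibliometric and topic analysis, the study examines research dynamics in disaster informatics from 2020 to 2022. It finds that the COVID-19 pandemic significantly shifted research priorities and reveals collaboration patterns among countries, institutions, and authors as well as emerging topics.
Details
Motivation: The COVID-19 pandemic highlighted the global need for disaster informatics and shifted research interests. Analyzing large-scale academic literature reveals research trends and priorities and gives policymakers, practitioners, and scholars strategic insight.
Result: (1) Countries hit hardest by the pandemic were among the most active in research; (2) countries and institutions in the same region or sharing a language collaborate more readily; (3) authors tend to specialize in one or two topics, whereas institutions have broader interests; (4) research priorities have shifted toward public health and multidimensional resilience strategies.
Insight: Disaster informatics is moving toward interdisciplinary work, data sharing, and global collaboration, reflecting growing awareness of global vulnerability and interdependency. The methods and tools can be applied to comparable datasets or analytic problems.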
Abstract: This study presents a comprehensive bibliometric and topic analysis of the disaster informatics literature published between January 2020 and September 2022. Leveraging a large-scale corpus and advanced techniques such as pre-trained language models and generative AI, we identify the most active countries, institutions, authors, collaboration networks, emergent topics, patterns among the most significant topics, and shifts in research priorities spurred by the COVID-19 pandemic. Our findings highlight that (1) countries that were most impacted by the COVID-19 pandemic were also among the most active, with each country having specific research interests, (2) countries and institutions within the same region or sharing a common language tend to collaborate, (3) top active authors tend to form close partnerships with one or two key partners, (4) authors typically specialized in one or two specific topics, while institutions had more diverse interests across several topics, and (5) the COVID-19 pandemic has influenced research priorities in disaster informatics, placing greater emphasis on public health. We further demonstrate that the field is converging on multidimensional resilience strategies and cross-sectoral data-sharing collaborations or projects, reflecting a heightened awareness of global vulnerability and interdependency. The collection and quality assurance strategies, data analytic practices, LLM-based topic extraction and summarization approaches, and result visualization tools can be applied to comparable datasets or to similar analytic problems. By mapping out the trends in disaster informatics, our analysis offers strategic insights for policymakers, practitioners, and scholars aiming to enhance disaster informatics capacities in an increasingly uncertain and complex risk landscape.
[83] Weak Links in LinkedIn: Enhancing Fake Profile Detection in the Age of LLMs cs.SI | cs.CV | cs.CY
Apoorva Gulati, Rajesh Kumar, Vinti Agarwal, Aditya Sharma
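TL;DR: The paper studies how large language models (LLMs) make fake LinkedIn profiles more realistic and evaluates the robustness of existing fake profile detectors. Existing detectors fail to identify GPT-generated profiles; the proposed GPT-assisted adversarial training sharply reduces the False Accept Rate. Detectors that combine numerical and textual embeddings are the most robust.
Details
Motivation: Advances in LLMs make it easier to generate highly realistic fake profiles, which poses a new challenge for fake profile detection on platforms such as LinkedIn. The study evaluates the limitations of existing detectors and proposes a more robust solution.
Result: Existing detectors show a False Accept Rate of 42-52% on GPT-generated profiles; after GPT-assisted adversarial training, it falls to 1-7% while the False Reject Rate stays low (0.5-2%). Ablation studies show that detectors combining numerical and textual embeddings perform best.
Insight: With LLMs widely available, fake profiles are far easier to generate and traditional detection methods no longer suffice. Adversarial training and combined numerical and textual embeddings are effective ways to make detectors more robust. Misuse of LLM technology needs continued attention, along with more advanced detection tools.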
Abstract: Large Language Models (LLMs) have made it easier to create realistic fake profiles on platforms like LinkedIn. This poses a significant risk for text-based fake profile detectors. In this study, we evaluate the robustness of existing detectors against LLM-generated profiles. While highly effective in detecting manually created fake profiles (False Accept Rate: 6-7%), the existing detectors fail to identify GPT-generated profiles (False Accept Rate: 42-52%). We propose GPT-assisted adversarial training as a countermeasure, restoring the False Accept Rate to 1-7% without impacting the False Reject Rate (0.5-2%). Ablation studies revealed that detectors trained on combined numerical and textual embeddings exhibit the highest robustness, followed by those using numerical-only embeddings, and lastly those using textual-only embeddings. Complementary analysis of the ability of prompt-based GPT-4Turbo and human evaluators affirms the need for robust automated detectors such as the one proposed in this study.
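The two rates quoted in this abstract can be computed directly from detector decisions. The sketch below assumes the usual reading of the terms: a false accept is a fake profile that the detector lets through as legitimate, and a false reject is a legitimate profile flagged as fake. The function name and the boolean encoding are choices made for this example.

```python
from typing import Sequence, Tuple


def accept_reject_rates(is_fake: Sequence[bool],
                        flagged: Sequence[bool]) -> Tuple[float, float]:
    """Return (False Accept Rate, False Reject Rate) for a fake-profile detector."""
    if len(is_fake) != len(flagged):
        raise ValueError("inputs must have the same length")
    fake_total = sum(is_fake)
    real_total = len(is_fake) - fake_total
    if fake_total == 0 or real_total == 0:
        raise ValueError("need at least one fake and one legitimate profile")
    # Fake profiles that were not flagged slipped through: false accepts.
    false_accepts = sum(1 for fake, flag in zip(is_fake, flagged) if fake and not flag)
    # Legitimate profiles that were flagged: false rejects.
    false_rejects = sum(1 for fake, flag in zip(is_fake, flagged) if not fake and flag)
    return false_accepts / fake_total, false_rejects / real_total


# Four fake and four legitimate profiles; the detector misses two fakes and
# wrongly flags one legitimate profile.
far, frr = accept_reject_rates(
    [True, True, True, True, False, False, False, False],
    [True, True, False, False, False, False, False, True],
)
```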
cs.IR
[84] A Query-Aware Multi-Path Knowledge Graph Fusion Approach for Enhancing Retrieval-Augmented Generation in Large Language Models cs.IR | cs.AI | cs.CL
Qikai Wei, Huansheng Ning, Chunlong Han, Jianguo Ding
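TL;DR: The paper proposes QMKGF, a query-aware multi-path knowledge graph fusion approach that builds and refines a knowledge graph to enhance retrieval-augmented generation (RAG), markedly improving the generation quality of large language models.
Details
Motivation: Existing RAG methods rely mainly on similarity-based segment retrieval and overlook the intrinsic connections between segments, which limits performance. QMKGF addresses this with knowledge graph construction and multi-path subgraph optimization.
Result: On HotpotQA, QMKGF reaches a ROUGE-1 score of 64.98%, 9.72 percentage points above BGE-Rerank.
Insight: A knowledge graph combined with a multi-path subgraph strategy captures the semantic relevance to the query more fully and markedly improves RAG performance.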
Abstract: Retrieval Augmented Generation (RAG) has gradually emerged as a promising paradigm for enhancing the accuracy and factual consistency of content generated by large language models (LLMs). However, existing RAG studies primarily focus on retrieving isolated segments using similarity-based matching methods, while overlooking the intrinsic connections between them. This limitation hampers performance in RAG tasks. To address this, we propose QMKGF, a Query-Aware Multi-Path Knowledge Graph Fusion Approach for Enhancing Retrieval Augmented Generation. First, we design prompt templates and employ general-purpose LLMs to extract entities and relations, thereby generating a knowledge graph (KG) efficiently. Based on the constructed KG, we introduce a multi-path subgraph construction strategy that incorporates one-hop relations, multi-hop relations, and importance-based relations, aiming to improve the semantic relevance between the retrieved documents and the user query. Subsequently, we design a query-aware attention reward model that scores subgraph triples based on their semantic relevance to the query. We then select the highest-scoring subgraph and enrich it with additional triples from other subgraphs that are highly semantically relevant to the query. Finally, the entities, relations, and triples within the updated subgraph are utilised to expand the original query, thereby enhancing its semantic representation and improving the quality of LLMs’ generation. We evaluate QMKGF on the SQuAD, IIRC, Culture, HotpotQA, and MuSiQue datasets. On the HotpotQA dataset, our method achieves a ROUGE-1 score of 64.98%, surpassing the BGE-Rerank approach by 9.72 percentage points (from 55.26% to 64.98%). Experimental results demonstrate the effectiveness and superiority of the QMKGF approach.
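ROUGE-1, the metric behind the 64.98% figure, measures unigram overlap between a generated answer and a reference. The sketch below computes precision, recall, and F1 with clipped counts over lowercased, whitespace-split tokens. The abstract does not say which of the three variants is reported or how text is tokenized, so both choices here are assumptions.

```python
from collections import Counter
from typing import Dict


def rouge_1(reference: str, candidate: str) -> Dict[str, float]:
    """Unigram-overlap precision, recall, and F1 with clipped counts."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Five of six unigrams match in each direction.
scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
```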
[85] VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings cs.IR | cs.AI | cs.CV
Ramin Giahi, Kehui Yao, Sriram Kollipara, Kai Zhao, Vahid Mirjalili
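TL;DR: VL-CLIP improves multimodal recommendation through visual grounding and LLM-augmented CLIP embeddings, addressing the fine-grained alignment, textual ambiguity, and domain adaptation problems that existing vision-language models face in e-commerce recommendation systems.
Details
Motivation: Vision-language models such as CLIP suffer from weak fine-grained alignment, ambiguous text descriptions, and poor domain adaptation in e-commerce recommendation, which hurts retrieval and recommendation performance.
Result: On a large U.S. e-commerce platform, VL-CLIP increases CTR by 18.6%, ATC by 15.5%, and GMV by 4.0%, and outperforms existing vision-language models.
Insight: Combining object-aware visual grounding with LLM-enhanced text representations effectively improves the performance and semantic alignment of multimodal recommendation systems.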
Abstract: Multimodal learning plays a critical role in e-commerce recommendation platforms today, enabling accurate recommendations and product understanding. However, existing vision-language models, such as CLIP, face key challenges in e-commerce recommendation systems: 1) Weak object-level alignment, where global image embeddings fail to capture fine-grained product attributes, leading to suboptimal retrieval performance; 2) Ambiguous textual representations, where product descriptions often lack contextual clarity, affecting cross-modal matching; and 3) Domain mismatch, as generic vision-language models may not generalize well to e-commerce-specific data. To address these limitations, we propose a framework, VL-CLIP, that enhances CLIP embeddings by integrating Visual Grounding for fine-grained visual understanding and an LLM-based agent for generating enriched text embeddings. Visual Grounding refines image representations by localizing key products, while the LLM agent enhances textual features by disambiguating product descriptions. Our approach significantly improves retrieval accuracy, multimodal retrieval effectiveness, and recommendation quality across tens of millions of items on one of the largest e-commerce platforms in the U.S., increasing CTR by 18.6%, ATC by 15.5%, and GMV by 4.0%. Additional experimental results show that our framework outperforms vision-language models, including CLIP, FashionCLIP, and GCL, in both precision and semantic alignment, demonstrating the potential of combining object-aware visual grounding and LLM-enhanced text representation for robust multimodal recommendations.
cs.LG [Back]
[86] SiLQ: Simple Large Language Model Quantization-Aware Training cs.LG | cs.AI | cs.CL
Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, Dharmendra S. Modha
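TL;DR: SiLQ is a simple quantization-aware training method for large language models that, at a very small additional training cost (<0.1%), substantially outperforms existing quantization methods on several benchmarks without introducing any operations beyond the quantization itself.
Details
Motivation: Quantizing large language models reduces inference latency, model size, and energy consumption, but achieving efficient quantization without losing accuracy, while remaining compatible with specialized inference accelerators, is still a major challenge.
Result: Experiments show that SiLQ outperforms existing quantization methods by large margins on several modern benchmarks, for both base and instruct model variants.
Insight: Efficient quantization-aware training can be achieved with a minimalist design, without complex mechanisms or extra operations, offering a low-cost path to model deployment.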
Abstract: Large language models can be quantized to reduce inference time latency, model size, and energy consumption, thereby delivering a better user experience at lower cost. A challenge exists to deliver quantized models with minimal loss of accuracy in reasonable time, and in particular to do so without requiring mechanisms incompatible with specialized inference accelerators. Here, we demonstrate a simple, end-to-end quantization-aware training approach that, with an increase in total model training budget of less than 0.1%, outperforms the leading published quantization methods by large margins on several modern benchmarks, with both base and instruct model variants. The approach easily generalizes across different model architectures, can be applied to activations, cache, and weights, and requires the introduction of no additional operations to the model other than the quantization itself.
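To make the idea of quantization-aware training concrete, the following is a minimal NumPy sketch of a generic building block: symmetric uniform "fake quantization" in the forward pass, with a straight-through estimator for the backward pass. It is not SiLQ's quantizer; the abstract does not specify one, and the function names, the max-abs scale rule, and the bit widths here are all assumptions made for illustration.

```python
import numpy as np


def fake_quantize(x, num_bits=8):
    """Quantize x to a symmetric integer grid, then map it back to floats.

    Hypothetical helper: the scale is simply max|x| / qmax, an assumption
    chosen for illustration rather than a rule taken from the paper.
    """
    qmax = 2 ** (num_bits - 1) - 1                   # e.g. 127 for 8 bits
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0   # avoid division by zero
    q = np.clip(np.round(x / scale), -qmax, qmax)    # values on the integer grid
    return q * scale, scale


def ste_grad(upstream_grad, x, scale, num_bits=8):
    """Straight-through estimator: treat rounding as the identity function.

    The gradient passes through unchanged wherever x lies inside the
    representable range and is zeroed where the clip would saturate.
    """
    qmax = 2 ** (num_bits - 1) - 1
    inside = np.abs(x / scale) <= qmax
    return upstream_grad * inside


weights = np.array([-1.0, -0.3, 0.0, 0.42, 1.0])
quantized, scale = fake_quantize(weights, num_bits=4)
grad = ste_grad(np.ones_like(weights), weights, scale, num_bits=4)
```

During training, the forward pass would use `quantized` in place of `weights`, while the optimizer keeps updating the full-precision `weights` with `grad`.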
[87] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains cs.LG | cs.AI | cs.CL
Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Bing Liu
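TL;DR: The paper proposes Rubrics as Rewards (RaR), a framework that uses structured, checklist-style rubrics as interpretable reward signals to address the difficulty of defining rewards in reinforcement learning. RaR outperforms traditional Likert-based scoring on HealthBench-1k and performs on par with reward signals derived from expert-written references.
Details
Motivation: Many real-world tasks in reinforcement learning lack a clear reward signal, especially when they involve subjective evaluation criteria. Traditional preference-based methods rely on reward functions that are opaque and prone to spurious correlations, so an interpretable and robust way of generating reward signals is needed.
Result: On HealthBench-1k, RaR achieves up to a 28% relative improvement over simple Likert-based scoring, while matching or surpassing reward signals derived from expert-written references.
Insight: Structured rubrics can serve as effective reward signals: they make rewards more interpretable and enable better alignment with smaller-scale judge models. This suggests a new direction for applying reinforcement learning to complex tasks.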
Abstract: Extending Reinforcement Learning with Verifiable Rewards (RLVR) to real-world tasks often requires balancing objective and subjective evaluation criteria. However, many such tasks lack a single, unambiguous ground truth, making it difficult to define reliable reward signals for post-training language models. While traditional preference-based methods offer a workaround, they rely on opaque reward functions that are difficult to interpret and prone to spurious correlations. We introduce $\textbf{Rubrics as Rewards}$ (RaR), a framework that uses structured, checklist-style rubrics as interpretable reward signals for on-policy training with GRPO. Our best RaR method yields up to a $28\%$ relative improvement on HealthBench-1k compared to simple Likert-based approaches, while matching or surpassing the performance of reward signals derived from expert-written references. By treating rubrics as structured reward signals, we show that RaR enables smaller-scale judge models to better align with human preferences and sustain robust performance across model scales.
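As a rough illustration of how a checklist-style rubric could be turned into a scalar reward, the sketch below aggregates per-item verdicts into a weighted score in [0, 1]. The `RubricItem` type, the weights, and the weighted-average rule are hypothetical; the abstract does not describe RaR's actual aggregation, and in practice the boolean verdicts would come from a judge model rather than being supplied by hand.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One checklist criterion (hypothetical structure, for illustration)."""
    description: str
    weight: float


def rubric_reward(items, satisfied):
    """Return the weighted fraction of rubric items judged as satisfied."""
    if len(items) != len(satisfied):
        raise ValueError("need exactly one verdict per rubric item")
    total = sum(item.weight for item in items)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(item.weight for item, ok in zip(items, satisfied) if ok)
    return earned / total


rubric = [
    RubricItem("States the recommended next step", 2.0),
    RubricItem("Mentions relevant warning signs", 1.0),
    RubricItem("Avoids unsupported claims", 1.0),
]
verdicts = [True, False, True]   # placeholder judge outputs
reward = rubric_reward(rubric, verdicts)
```

Because each item is scored separately, the reward can be traced back to the criteria that were or were not met, which is the interpretability property the abstract emphasizes.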
[88] Dataset Distillation as Data Compression: A Rate-Utility Perspective cs.LG | cs.CV
Youneng Bao, Yiping Liu, Zhuo Chen, Yongsheng Liang, Mu Li
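TL;DR: The paper proposes a joint rate-utility optimization method for dataset distillation that compresses a dataset into a small set of synthetic samples and uses quantized latent codes with lightweight decoder networks to balance storage against performance.
Details
Motivation: Modern machine learning's demand for large datasets and models drives computational and storage costs sharply upward. Dataset distillation mitigates this by compressing the original dataset into a few synthetic samples, but existing methods do not optimize storage efficiency and performance at the same time.
Result: On CIFAR-10, CIFAR-100, and ImageNet-128, the method achieves much higher compression than standard distillation (up to 170×) at comparable accuracy.
Insight: Jointly optimizing storage efficiency and performance is key to dataset distillation. Bits per class (bpc) provides a unified metric for cross-method comparison, and lightweight networks with optimized latent codes are an effective route to efficient compression.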
Abstract: Driven by the "scale-is-everything" paradigm, modern machine learning increasingly demands ever-larger datasets and models, yielding prohibitive computational and storage requirements. Dataset distillation mitigates this by compressing an original dataset into a small set of synthetic samples, while preserving its full utility. Yet, existing methods either maximize performance under fixed storage budgets or pursue suitable synthetic data representations for redundancy removal, without jointly optimizing both objectives. In this work, we propose a joint rate-utility optimization method for dataset distillation. We parameterize synthetic samples as optimizable latent codes decoded by extremely lightweight networks. We estimate the Shannon entropy of quantized latents as the rate measure and plug any existing distillation loss as the utility measure, trading them off via a Lagrange multiplier. To enable fair, cross-method comparisons, we introduce bits per class (bpc), a precise storage metric that accounts for sample, label, and decoder parameter costs. On CIFAR-10, CIFAR-100, and ImageNet-128, our method achieves up to $170\times$ greater compression than standard distillation at comparable accuracy. Across diverse bpc budgets, distillation losses, and backbone architectures, our approach consistently establishes better rate-utility trade-offs.
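The rate-utility trade-off described in the abstract can be sketched as a Lagrangian objective: a utility loss plus a multiplier times a rate term measured as the Shannon entropy of quantized latents. The version below uses a plain empirical (count-based) entropy, which is not differentiable; how the paper estimates entropy during optimization is not stated in the abstract, so the function names, the rounding quantizer, and the numbers are assumptions for illustration only.

```python
import numpy as np


def empirical_entropy_bits(latents):
    """Shannon entropy (bits per symbol) of latents after rounding to integers."""
    symbols = np.round(latents).astype(np.int64).ravel()
    _, counts = np.unique(symbols, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())


def rate_utility_objective(latents, utility_loss, lam):
    """Lagrangian trade-off: utility loss + lam * estimated total bits."""
    rate_bits = empirical_entropy_bits(latents) * latents.size
    return utility_loss + lam * rate_bits


# Hypothetical latent code: rounds to the symbols [0, 0, 1, 1].
latents = np.array([0.1, -0.2, 0.9, 1.1])
entropy = empirical_entropy_bits(latents)
objective = rate_utility_objective(latents, utility_loss=2.0, lam=0.5)
```

A larger `lam` pushes the optimizer toward latents that are cheaper to store, while a smaller `lam` favors the distillation loss supplied as `utility_loss`.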
[89] On the Interaction of Compressibility and Adversarial Robustness cs.LG | cs.AI | cs.CV | stat.ML
Melih Barsbey, Antônio H. Ribeiro, Umut Şimşekli, Tolga Birdal
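TL;DR: The paper studies the interaction between neural network compressibility and adversarial robustness, showing that compressibility (such as neuron-level sparsity and spectral compressibility) introduces sensitive directions that make models more vulnerable to adversarial attacks.
Details
Motivation: Modern neural networks must satisfy many requirements at once, including fitting the training data, generalization, parameter efficiency, computational efficiency, and adversarial robustness. The interaction between compressibility and robustness remains unclear, and the paper aims to fill this gap.
Result: Compressibility makes adversarial attacks more effective, and this effect persists under adversarial training and transfer learning. Compressibility is also linked to the emergence of universal adversarial perturbations (UAPs).
Insight: The paper reveals a fundamental tension between structured compressibility and robustness, suggesting new directions for designing models that are both efficient and secure.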
Abstract: Modern neural networks are expected to simultaneously satisfy a host of desirable properties: accurate fitting to training data, generalization to unseen inputs, parameter and computational efficiency, and robustness to adversarial perturbations. While compressibility and robustness have each been studied extensively, a unified understanding of their interaction still remains elusive. In this work, we develop a principled framework to analyze how different forms of compressibility - such as neuron-level sparsity and spectral compressibility - affect adversarial robustness. We show that these forms of compression can induce a small number of highly sensitive directions in the representation space, which adversaries can exploit to construct effective perturbations. Our analysis yields a simple yet instructive robustness bound, revealing how neuron and spectral compressibility impact $L_\infty$ and $L_2$ robustness via their effects on the learned representations. Crucially, the vulnerabilities we identify arise irrespective of how compression is achieved - whether via regularization, architectural bias, or implicit learning dynamics. Through empirical evaluations across synthetic and realistic tasks, we confirm our theoretical predictions, and further demonstrate that these vulnerabilities persist under adversarial training and transfer learning, and contribute to the emergence of universal adversarial perturbations. Our findings show a fundamental tension between structured compressibility and robustness, and suggest new pathways for designing models that are both efficient and secure.
[90] Large Learning Rates Simultaneously Achieve Robustness to Spurious Correlations and Compressibility cs.LG | cs.AI | cs.CV | stat.ML
Melih Barsbey, Lucas Prieto, Stefanos Zafeiriou, Tolga Birdal
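TL;DR: The paper examines how large learning rates can simultaneously achieve robustness to spurious correlations and model compressibility. It finds that large learning rates also yield desirable representation properties such as invariant feature utilization, class separation, and activation sparsity.
Details
Motivation: Modern machine learning models need to be both robust and resource-efficient, but achieving the two together remains a challenge. The paper investigates how large learning rates can meet both needs at once.
Result: Large learning rates perform well on both robustness to spurious correlations and model compressibility, and their success in standard classification tasks is likely due to their effect on hidden or rare spurious correlations.
Insight: A large learning rate is not only a training strategy; it also implicitly addresses spurious correlations in the data, offering a new perspective on model design and training.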
Abstract: Robustness and resource-efficiency are two highly desirable properties for modern machine learning models. However, achieving them jointly remains a challenge. In this paper, we position high learning rates as a facilitator for simultaneously achieving robustness to spurious correlations and network compressibility. We demonstrate that large learning rates also produce desirable representation properties such as invariant feature utilization, class separation, and activation sparsity. Importantly, our findings indicate that large learning rates compare favorably to other hyperparameters and regularization methods in consistently satisfying these properties in tandem. In addition to demonstrating the positive effect of large learning rates across diverse spurious correlation datasets, models, and optimizers, we also present strong evidence that the previously documented success of large learning rates in standard classification tasks is likely due to their effect on addressing hidden/rare spurious correlations in the training dataset.
cs.HC [Back]
[91] Assessing Medical Training Skills via Eye and Head Movements cs.HC | cs.CV
Kayhan Latifzadeh, Luis A. Leiva, Klen Čopič Pucihar, Matjaž Kljun, Iztok Devetak
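TL;DR: The study assesses clinical skill development by analyzing eye and head movements. Results show that eye and head tracking can effectively distinguish trained from untrained practitioners, offering a new approach to skill assessment based on computational models.
Details
Motivation: Traditional clinical skill assessment relies on subjective scores; the authors aim to provide a more reliable assessment method using objective eye and head movement data.
Result: Head-related features (F1=0.85, AUC=0.86) performed better than pupil-related features (F1=0.77, AUC=0.85).
Insight: Eye and head tracking can serve as a complementary tool, providing objective data to support clinical skill assessment.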
Abstract: We examined eye and head movements to gain insights into skill development in clinical settings. A total of 24 practitioners participated in simulated baby delivery training sessions. We calculated key metrics, including pupillary response rate, fixation duration, and angular velocity. Our findings indicate that eye and head tracking can effectively differentiate between trained and untrained practitioners, particularly during labor tasks. For example, head-related features achieved an F1 score of 0.85 and an AUC of 0.86, whereas pupil-related features achieved an F1 score of 0.77 and an AUC of 0.85. The results lay the groundwork for computational models that support implicit skill assessment and training in clinical settings by using commodity eye-tracking glasses as a complementary device to more traditional evaluation methods such as subjective scores.
[92] Explainable AI for Collaborative Assessment of 2D/3D Registration Quality cs.HC | cs.CV
Sue Min Cho, Alexander Do, Russell H. Taylor, Mathias Unberath
TL;DR: The paper introduces an explainable AI (XAI) framework for verifying 2D/3D registration quality in surgery, aiming to improve human operators’ ability to detect misalignments, though explainability features only modestly enhance trust and performance.
Details
Motivation: Current visualization-based methods are insufficient for reliably detecting 2D/3D registration errors in surgery, which can lead to serious consequences like revision surgeries. There’s a need for robust quality assurance tools.
Result: Explainability features slightly improve user trust and willingness to correct AI errors but do not outperform standalone AI in overall performance.
Insight: While XAI aids human decision-making, further improvements in algorithmic design and human-AI collaboration are needed for more reliable quality assurance in surgical settings.
Abstract: As surgery embraces digital transformation, integrating sophisticated imaging, advanced algorithms, and robotics to support and automate complex sub-tasks, human judgment of system correctness remains a vital safeguard for patient safety. This shift introduces new “operator-type” roles tasked with verifying complex algorithmic outputs, particularly at critical junctures of the procedure, such as the intermediary check before drilling or implant placement. A prime example is 2D/3D registration, a key enabler of image-based surgical navigation that aligns intraoperative 2D images with preoperative 3D data. Although registration algorithms have advanced significantly, they occasionally yield inaccurate results. Because even small misalignments can lead to revision surgery or irreversible surgical errors, there is a critical need for robust quality assurance. Current visualization-based strategies alone have been found insufficient to enable humans to reliably detect 2D/3D registration misalignments. In response, we propose the first artificial intelligence (AI) framework trained specifically for 2D/3D registration quality verification, augmented by explainability features that clarify the model’s decision-making. Our explainable AI (XAI) approach aims to enhance informed decision-making for human operators by providing a second opinion together with a rationale behind it. Through algorithm-centric and human-centered evaluations, we systematically compare four conditions: AI-only, human-only, human-AI, and human-XAI. Our findings reveal that while explainability features modestly improve user trust and willingness to override AI errors, they do not exceed the standalone AI in aggregate performance. Nevertheless, future work extending both the algorithmic design and the human-XAI collaboration elements holds promise for more robust quality assurance of 2D/3D registration.
eess.AS [Back]
[93] Towards Robust Speech Recognition for Jamaican Patois Music Transcription eess.AS | cs.AI | cs.CL
Jordan Madden, Matthew Stone, Dimitri Johnson, Daniel Geddez
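TL;DR: The paper tackles speech recognition for Jamaican Patois music with a data-centric approach: it curates a manually transcribed dataset of more than 40 hours, fine-tunes current ASR models on it, and studies scaling laws for the performance of Whisper models.
Details
Motivation: Current speech recognition systems perform poorly on Jamaican Patois music, which limits accessibility and downstream applications, so improvement is needed.
Result: The work improves speech recognition performance on Jamaican Patois music and derives scaling laws for Whisper model performance.
Insight: Data quality and scale are critical to speech recognition performance for low-resource languages, and Whisper models show potential on this task.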
Abstract: Although Jamaican Patois is a widely spoken language, current speech recognition systems perform poorly on Patois music, producing inaccurate captions that limit accessibility and hinder downstream applications. In this work, we take a data-centric approach to this problem by curating more than 40 hours of manually transcribed Patois music. We use this dataset to fine-tune state-of-the-art automatic speech recognition (ASR) models, and use the results to develop scaling laws for the performance of Whisper models on Jamaican Patois audio. We hope that this work will have a positive impact on the accessibility of Jamaican Patois music and the future of Jamaican Patois language modeling.
[94] Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems eess.AS | cs.CL
Nima Yazdani, Ali Ansari, Aruj Mahajan, Amirhossein Afsharrad, Seyed Shahabeddin Mousavi
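TL;DR: Through large-scale experiments, the paper evaluates combinations of speech-to-text (STT), large language model (LLM), and text-to-speech (TTS) components in AI interview systems. It finds that Google STT paired with GPT-4.1 performs best and that technical metrics correlate only weakly with user satisfaction.
Details
Motivation: Voice-based conversational AI systems typically use a cascaded STT, LLM, and TTS architecture, but systematic evaluation of different component combinations in production settings is scarce. The paper aims to fill this gap and provide guidance for practical applications.
Result: Google STT paired with GPT-4.1 significantly outperforms the other combinations in conversational quality and technical accuracy, but technical metrics correlate weakly with user satisfaction.
Insight: User experience in voice AI systems may depend on factors beyond technical performance, such as the naturalness of the conversation or emotional resonance. This points to an important direction for future research and practical system design.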
Abstract: Voice-based conversational AI systems increasingly rely on cascaded architectures combining speech-to-text (STT), large language models (LLMs), and text-to-speech (TTS) components. However, systematic evaluation of different component combinations in production settings remains understudied. We present a large-scale empirical comparison of STT x LLM x TTS stacks using data from over 300,000 AI-conducted job interviews. We develop an automated evaluation framework using LLM-as-a-Judge to assess conversational quality, technical accuracy, and skill assessment capabilities. Our analysis of four production configurations reveals that Google STT paired with GPT-4.1 significantly outperforms alternatives in both conversational and technical quality metrics. Surprisingly, we find that objective quality metrics correlate weakly with user satisfaction scores, suggesting that user experience in voice-based AI systems depends on factors beyond technical performance. Our findings provide practical guidance for selecting components in multimodal conversational AI systems and contribute a validated evaluation methodology for voice-based interactions.
[95] Segmentation-free Goodness of Pronunciation eess.AS | cs.AI | cs.CL
Xinwei Cao, Zijian Fan, Torbjørn Svendsen, Giampiero Salvi
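TL;DR: The paper proposes self-alignment GOP (GOP-SA) and alignment-free GOP (GOP-AF), two goodness-of-pronunciation methods for pronunciation assessment that require no pre-segmentation, overcoming the limitations of traditional methods and achieving state-of-the-art results.
Details
Motivation: Traditional pronunciation assessment methods require pre-segmented speech, which limits accuracy and prevents the use of CTC-trained acoustic models. The paper aims to solve this problem.
Result: The methods are validated on the CMU Kids and Speechocean762 datasets and achieve state-of-the-art results on pronunciation assessment.
Insight: Removing the pre-segmentation requirement can markedly improve the flexibility and accuracy of pronunciation assessment, especially when combined with modern CTC acoustic models.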
Abstract: Mispronunciation detection and diagnosis (MDD) is a significant part of modern computer-aided language learning (CALL) systems. Within MDD, phoneme-level pronunciation assessment is key to helping L2 learners improve their pronunciation. However, most systems are based on a form of goodness of pronunciation (GOP) which requires pre-segmentation of speech into phonetic units. This limits the accuracy of these methods and the possibility to use modern CTC-based acoustic models for their evaluation. In this study, we first propose self-alignment GOP (GOP-SA) that enables the use of CTC-trained ASR models for MDD. Next, we define a more general alignment-free method that takes all possible alignments of the target phoneme into account (GOP-AF). We give a theoretical account of our definition of GOP-AF, an implementation that solves potential numerical issues, and a proper normalization which makes the method applicable to acoustic models with different peakiness over time. We provide extensive experimental results on the CMU Kids and Speechocean762 datasets comparing the different definitions of our methods, estimating the dependency of GOP-AF on the peakiness of the acoustic models and on the amount of context around the target phoneme. Finally, we compare our methods with recent studies over the Speechocean762 data showing that the feature vectors derived from the proposed method achieve state-of-the-art results on phoneme-level pronunciation assessment.
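The phrase "takes all possible alignments into account" refers to the kind of summation performed by the standard CTC forward algorithm. The sketch below computes the CTC log-likelihood of a phoneme sequence in log space (which is one common way to avoid numerical underflow). It is not the paper's GOP-AF definition or its normalization, neither of which is given in the abstract; the function name and the toy posteriors are assumptions for illustration.

```python
import numpy as np


def ctc_log_likelihood(log_probs, target, blank=0):
    """Log-probability of `target`, summed over all CTC alignments.

    log_probs: array of shape (T, V) with per-frame log posteriors.
    target:    list of label indices (no blanks).
    """
    num_frames = log_probs.shape[0]
    # Extended sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in target:
        ext += [label, blank]
    num_states = len(ext)

    alpha = np.full((num_frames, num_states), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    if num_states > 1:
        alpha[0, 1] = log_probs[0, ext[1]]

    for t in range(1, num_frames):
        for s in range(num_states):
            candidates = [alpha[t - 1, s]]                 # stay in the same state
            if s >= 1:
                candidates.append(alpha[t - 1, s - 1])     # advance by one state
            # Skip over a blank only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(candidates) + log_probs[t, ext[s]]

    # Valid paths end in the last label or in the trailing blank.
    tail = [alpha[num_frames - 1, num_states - 1]]
    if num_states > 1:
        tail.append(alpha[num_frames - 1, num_states - 2])
    return float(np.logaddexp.reduce(tail))


# Toy example: 2 frames, vocabulary {blank, phoneme 1}, uniform posteriors.
toy_log_probs = np.log(np.full((2, 2), 0.5))
toy_ll = ctc_log_likelihood(toy_log_probs, [1])
```

A GOP-style score could then compare such likelihoods for the canonical phoneme sequence against sequences with the target phoneme substituted, although the exact form used in the paper may differ.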
cs.AI [Back]
[96] Constructing Ophthalmic MLLM for Positioning-diagnosis Collaboration Through Clinical Cognitive Chain Reasoning cs.AI | cs.CV | eess.IV
Xinyao Liu, Diping Song
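TL;DR: The paper presents FundusExpert, an ophthalmology-specific multimodal large language model (MLLM) that couples positioning and diagnosis through clinical cognitive chain reasoning. The authors also build the FundusGen dataset and the intelligent Fundus-Engine system, substantially improving performance on ophthalmic question answering and report generation.
Details
Motivation: In specialized domains such as ophthalmology, current MLLMs suffer from fragmented annotation granularity and inconsistent clinical reasoning logic, which leads to imprecise cross-modal understanding.
Result: 1. On ophthalmic question-answering tasks, average accuracy is 26.6% higher than that of the 40B MedRegA; 2. On zero-shot report generation, clinical consistency reaches 77.0%, significantly outperforming GPT-4o's 47.6%; 3. A scaling law between data quality and model capability is identified ($L \propto N^{0.068}$).
Insight: 1. Combining region-level localization with diagnostic reasoning chains improves the clinical alignment of MLLMs; 2. Cognitive-alignment annotations enable more efficient data utilization; 3. FundusExpert's success offers a way to bridge the vision-language gap in domain-specific MLLMs.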
Abstract: Multimodal large language models (MLLMs) demonstrate significant potential in the field of medical diagnosis. However, they face critical challenges in specialized domains such as ophthalmology, particularly the fragmentation of annotation granularity and inconsistencies in clinical reasoning logic, which hinder precise cross-modal understanding. This paper introduces FundusExpert, an ophthalmology-specific MLLM with integrated positioning-diagnosis reasoning capabilities, along with FundusGen, a dataset constructed through the intelligent Fundus-Engine system. Fundus-Engine automates localization and leverages MLLM-based semantic expansion to integrate global disease classification, local object detection, and fine-grained feature analysis within a single fundus image. Additionally, by constructing a clinically aligned cognitive chain, it guides the model to generate interpretable reasoning paths. FundusExpert, fine-tuned with instruction data from FundusGen, achieves the best performance in ophthalmic question-answering tasks, surpassing the average accuracy of the 40B MedRegA by 26.6%. It also excels in zero-shot report generation tasks, achieving a clinical consistency of 77.0%, significantly outperforming GPT-4o’s 47.6%. Furthermore, we reveal a scaling law between data quality and model capability ($L \propto N^{0.068}$), demonstrating that the cognitive alignment annotations in FundusGen enhance data utilization efficiency. By integrating region-level localization with diagnostic reasoning chains, our work develops a scalable, clinically-aligned MLLM and explores a pathway toward bridging the visual-language gap in specific MLLMs. Our project can be found at https://github.com/MeteorElf/FundusExpert.
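A power-law relation such as $L \propto N^{0.068}$ is typically estimated by a straight-line fit in log-log space. The sketch below shows that procedure on synthetic data generated with a known exponent; the data points, the coefficient 2.0, and the function name are made up for illustration, and the abstract does not define $L$ and $N$ or describe how the authors fitted their law.

```python
import numpy as np


def fit_power_law(n, y):
    """Fit y = c * n**k by least squares on (log n, log y); return (k, c)."""
    slope, intercept = np.polyfit(np.log(n), np.log(y), 1)
    return float(slope), float(np.exp(intercept))


# Synthetic, noise-free points following y = 2.0 * n**0.068 (not real results).
n_demo = np.array([1e3, 1e4, 1e5, 1e6])
y_demo = 2.0 * n_demo ** 0.068
exponent, coefficient = fit_power_law(n_demo, y_demo)
```

With real measurements the points would be noisy, so the fitted exponent would carry uncertainty that a two-parameter line fit alone does not report.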