cs.CV [Total: 74]
cs.CL [Total: 27]
cs.RO [Total: 2]
q-bio.NC [Total: 1]
cs.GR [Total: 1]
cs.IR [Total: 1]
cs.AI [Total: 1]
cs.LG [Total: 6]
eess.IV [Total: 4]
q-bio.QM [Total: 1]

cs.CV

[1] Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization cs.CV | cs.AI | cs.CL | cs.MM

Alberto Compagnoni, Davide Caffagni, Nicholas Moratelli, Lorenzo Baraldi, Marcella Cornia

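TL;DR: The paper proposes CHAIR-DPO, which fine-tunes multimodal large language models (MLLMs) with a reward based on the CHAIR metric and effectively reduces hallucinations.

Motivation: MLLMs perform well across many tasks but suffer from hallucination, generating answers that are inconsistent with the visual input. The authors treat this as an alignment problem.

Result: CHAIR-DPO reduces the number of hallucinated answers on several hallucination benchmarks.

Insight: The simple CHAIR metric, combined with publicly available models, is enough to mitigate MLLM hallucinations without complex pipelines or proprietary models.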

Abstract: Multimodal Large Language Models (MLLMs) emerge as a unified interface to address a multitude of tasks, ranging from NLP to computer vision. Despite showcasing state-of-the-art results in many benchmarks, a long-standing issue is the tendency of MLLMs to hallucinate, that is to generate answers to the user’s query that are not reflected in the visual input. In this paper, we address the problem of hallucinations as an alignment problem, seeking to steer the MLLM so that it prefers generating content without hallucinations. In contrast to recent approaches that require complicated pipelines to build synthetic preference data for alignment training, often relying on proprietary models, we capitalize on the well-known CHAIR metric, originally proposed to gauge the degree of hallucinations in image captioning. Given a pair of generated answers, we leverage CHAIR to distinguish winner and loser options (i.e., non-hallucinated and hallucinated samples) and fine-tune off-the-shelf MLLMs via Direct Preference Optimization (DPO). The resulting method, which we refer to as CHAIR-DPO, effectively diminishes the amount of hallucinated answers on several hallucination benchmarks, demonstrating the effectiveness of fine-tuning the MLLM with a CHAIR-based reward. Source code and trained models are publicly available at https://github.com/aimagelab/CHAIR-DPO.
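As a rough illustration of the pipeline this abstract describes, the sketch below scores two candidate answers with an instance-level CHAIR ratio (hallucinated object mentions over all object mentions), takes the lower-scoring answer as the winner, and evaluates the standard DPO loss on that pair. The object lists, log-probabilities, and `beta` value are made-up placeholders and the helper names are hypothetical; this is not the authors' implementation, which is available at the repository linked above.

```python
import math

def chair_i(mentioned_objects, image_objects):
    """Fraction of mentioned objects that are absent from the image (lower is better)."""
    if not mentioned_objects:
        return 0.0
    hallucinated = [o for o in mentioned_objects if o not in image_objects]
    return len(hallucinated) / len(mentioned_objects)

def pick_preference(answer_a, answer_b, image_objects):
    """Return (winner, loser) by CHAIR score, or None when the scores tie."""
    score_a = chair_i(answer_a, image_objects)
    score_b = chair_i(answer_b, image_objects)
    if score_a == score_b:
        return None
    return (answer_a, answer_b) if score_a < score_b else (answer_b, answer_a)

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(beta * margin)."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return math.log1p(math.exp(-beta * margin))

# Toy data (assumed): objects present in the image and objects each answer mentions.
image_objects = {"dog", "frisbee", "grass"}
answer_a = ["dog", "frisbee"]
answer_b = ["dog", "cat", "bench"]

winner, loser = pick_preference(answer_a, answer_b, image_objects)
# Placeholder sequence log-probabilities under the policy and the frozen reference model.
loss = dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-13.0, ref_logp_l=-14.0)
```

In this toy pair the policy already prefers the winner more strongly than the reference does, so the loss falls below `log(2)`, the value it takes at zero margin.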

[2] SDiFL: Stable Diffusion-Driven Framework for Image Forgery Localization cs.CV

Yang Su, Shunquan Tan, Jiwu Huang

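TL;DR: SDiFL is a Stable Diffusion (SD)-based framework for image forgery localization. It is the first to combine SD's image generation and perceptual capabilities, significantly improving the efficiency and accuracy of forgery localization.

Motivation: Existing forgery localization methods rely on large amounts of annotated data and struggle to keep up with emerging image manipulation techniques. The capabilities of multimodal large models such as SD suggest a new way to address this.

Result: Performance improves by up to 12% on multiple benchmark datasets, and the model performs strongly on unseen real-world document forgeries and natural-scene forgeries.

Insight: The latent-space fusion capability of multimodal large models such as SD offers an efficient route to image forgery localization that does not require large amounts of annotated data.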

Abstract: Driven by the new generation of multi-modal large models, such as Stable Diffusion (SD), image manipulation technologies have advanced rapidly, posing significant challenges to image forensics. However, existing image forgery localization methods, which heavily rely on labor-intensive and costly annotated data, are struggling to keep pace with these emerging image manipulation technologies. To address these challenges, we are the first to integrate both image generation and powerful perceptual capabilities of SD into an image forensic framework, enabling more efficient and accurate forgery localization. First, we theoretically show that the multi-modal architecture of SD can be conditioned on forgery-related information, enabling the model to inherently output forgery localization results. Then, building on this foundation, we specifically leverage the multimodal framework of Stable Diffusion V3 (SD3) to enhance forgery localization performance. We leverage the multi-modal processing capabilities of SD3 in the latent space by treating image forgery residuals – high-frequency signals extracted using specific highpass filters – as an explicit modality. This modality is fused into the latent space during training to enhance forgery localization performance. Notably, our method fully preserves the latent features extracted by SD3, thereby retaining the rich semantic information of the input image. Experimental results show that our framework achieves up to 12% improvements in performance on widely used benchmarking datasets compared to current state-of-the-art image forgery localization models. Encouragingly, the model demonstrates strong performance on forensic tasks involving real-world document forgery images and natural scene forging images, even when such data were entirely unseen during training.
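To make the idea of a forgery residual concrete, here is a minimal sketch of high-frequency extraction with a high-pass filter. The abstract does not say which filters SDiFL uses, so the 3x3 Laplacian-style kernel, the zeroed borders, and the function name are assumptions chosen only for illustration.

```python
def highpass_residual(image):
    """Convolve a 2D grayscale image (list of rows) with a 3x3 Laplacian-style kernel.

    Border pixels are left at zero to keep the sketch short.
    """
    kernel = [[-1, -1, -1],
              [-1,  8, -1],
              [-1, -1, -1]]
    height, width = len(image), len(image[0])
    residual = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    acc += kernel[dy + 1][dx + 1] * image[y + dy][x + dx]
            residual[y][x] = acc
    return residual

# A flat region carries no high-frequency content; one altered pixel does.
flat = [[5.0] * 5 for _ in range(5)]
spliced = [row[:] for row in flat]
spliced[2][2] = 9.0  # a single tampered pixel (toy example)

flat_residual = highpass_residual(flat)
spliced_residual = highpass_residual(spliced)
```

The flat image yields an all-zero residual, while the altered pixel produces a strong response at its location and weaker responses of opposite sign around it. A map of this kind is the sort of signal that could be passed to a model as an additional modality.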

[3] Grounding Multimodal Large Language Models with Quantitative Skin Attributes: A Retrieval Study cs.CV | cs.LG

Max Torop, Masih Eskandar, Nicholas Kurtansky, Jinyang Liu, Jochen Weber

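TL;DR: This paper explores combining multimodal large language models (MLLMs) with quantitative skin attributes to improve the interpretability of skin disease diagnosis, and uses a retrieval study to show that MLLM embedding spaces can be grounded in these attributes.

Motivation: Current AI models diagnose skin diseases well, but the limited interpretability of their predictions restricts clinical use. The study aims to improve interpretability by combining MLLMs with quantitative attributes.

Result: Experiments indicate that MLLM embedding spaces can be interpreted through quantitative attributes, improving the transparency and interpretability of diagnosis.

Insight: Combining quantitative attributes with MLLMs can give AI-based medical diagnosis more intuitive, interpretable reasoning, which supports its use in clinical practice.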

Abstract: Artificial Intelligence models have demonstrated significant success in diagnosing skin diseases, including cancer, showing the potential to assist clinicians in their analysis. However, the interpretability of model predictions must be significantly improved before they can be used in practice. To this end, we explore the combination of two promising approaches: Multimodal Large Language Models (MLLMs) and quantitative attribute usage. MLLMs offer a potential avenue for increased interpretability, providing reasoning for diagnosis in natural language through an interactive format. Separately, a number of quantitative attributes that are related to lesion appearance (e.g., lesion area) have recently been found predictive of malignancy with high accuracy. Predictions grounded as a function of such concepts have the potential for improved interpretability. We provide evidence that MLLM embedding spaces can be grounded in such attributes, through fine-tuning to predict their values from images. Concretely, we evaluate this grounding in the embedding space through an attribute-specific content-based image retrieval case study using the SLICE-3D dataset.

[4] Enhancing Automatic Modulation Recognition With a Reconstruction-Driven Vision Transformer Under Limited Labels cs.CV | eess.SP

Hossein Ahmadi, Banafsheh Saffari

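TL;DR: The paper proposes a unified Vision Transformer framework that combines supervised, self-supervised, and reconstruction objectives for automatic modulation recognition under limited labels, achieving efficient classification through a reconstruction branch and partial supervision.

Motivation: Automatic modulation recognition (AMR) is limited in practice by the need for large labeled datasets and multi-stage training pipelines. The paper aims for a more efficient solution that generalizes better.

Result: On the RML2018.01A dataset, the framework outperforms supervised CNN and ViT baselines in low-label regimes, approaches ResNet-level accuracy with only 15-20% labeled data, and is robust across SNR levels.

Insight: Reconstruction-driven self-supervised learning improves performance when labeled data is limited, and a ViT combined with multi-task learning offers a general solution for label-scarce AMR.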

Abstract: Automatic modulation recognition (AMR) is critical for cognitive radio, spectrum monitoring, and secure wireless communication. However, existing solutions often rely on large labeled datasets or multi-stage training pipelines, which limit scalability and generalization in practice. We propose a unified Vision Transformer (ViT) framework that integrates supervised, self-supervised, and reconstruction objectives. The model combines a ViT encoder, a lightweight convolutional decoder, and a linear classifier; the reconstruction branch maps augmented signals back to their originals, anchoring the encoder to fine-grained I/Q structure. This strategy promotes robust, discriminative feature learning during pretraining, while partial label supervision in fine-tuning enables effective classification with limited labels. On the RML2018.01A dataset, our approach outperforms supervised CNN and ViT baselines in low-label regimes, approaches ResNet-level accuracy with only 15-20% labeled data, and maintains strong performance across varying SNR levels. Overall, the framework provides a simple, generalizable, and label-efficient solution for AMR.

[5] InfinityHuman: Towards Long-Term Audio-Driven Human cs.CV

Xiaodi Li, Pan Xie, Yi Ren, Qijun Gan, Chen Zhang

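TL;DR: InfinityHuman proposes a hierarchical, coarse-to-fine framework for generating high-quality, long-duration human animation videos, addressing the identity drift, color shifts, and hand motion distortions of existing methods.

Motivation: Existing audio-driven human animation methods suffer from low resolution, short duration, poor identity consistency, and unnatural hand motion. InfinityHuman aims to solve these problems.

Result: On the EMTD and HDTF datasets, InfinityHuman achieves the best video quality, identity preservation, hand accuracy, and lip synchronization.

Insight: Because pose sequences are decoupled from appearance and resist temporal degradation, they are key to stable long-duration video generation. Improving hand motion requires high-quality data.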

Abstract: Audio-driven human animation has attracted wide attention thanks to its practical applications. However, critical challenges remain in generating high-resolution, long-duration videos with consistent appearance and natural hand motions. Existing methods extend videos using overlapping motion frames but suffer from error accumulation, leading to identity drift, color shifts, and scene instability. Additionally, hand movements are poorly modeled, resulting in noticeable distortions and misalignment with the audio. In this work, we propose InfinityHuman, a coarse-to-fine framework that first generates audio-synchronized representations, then progressively refines them into high-resolution, long-duration videos using a pose-guided refiner. Since pose sequences are decoupled from appearance and resist temporal degradation, our pose-guided refiner employs stable poses and the initial frame as a visual anchor to reduce drift and improve lip synchronization. Moreover, to enhance semantic accuracy and gesture realism, we introduce a hand-specific reward mechanism trained with high-quality hand motion data. Experiments on the EMTD and HDTF datasets show that InfinityHuman achieves state-of-the-art performance in video quality, identity preservation, hand accuracy, and lip-sync. Ablation studies further confirm the effectiveness of each module. Code will be made public.

[6] Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360-Degree Videos cs.CV

Mert Cokelek, Halit Ozsoy, Nevrez Imamoglu, Cagri Ozcinar, Inci Ayhan

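TL;DR: The paper proposes a new approach to audio-visual saliency prediction in 360-degree videos, building a new dataset and two vision-transformer-based models (SalViT360 and SalViT360-AV) that significantly improve prediction performance, and highlights the importance of spatial audio.

Motivation: 360-degree video is increasingly used in virtual reality, but saliency prediction datasets and methods that handle spherical distortion and integrate spatial audio are lacking.

Result: On multiple benchmark datasets, including YT360-EyeTracking, both models significantly outperform existing methods.

Insight: Integrating spatial audio is crucial for saliency prediction in 360-degree videos.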

Abstract: Omnidirectional videos (ODVs) are redefining viewer experiences in virtual reality (VR) by offering an unprecedented full field-of-view (FOV). This study extends the domain of saliency prediction to 360-degree environments, addressing the complexities of spherical distortion and the integration of spatial audio. Contextually, ODVs have transformed user experience by adding a spatial audio dimension that aligns sound direction with the viewer’s perspective in spherical scenes. Motivated by the lack of comprehensive datasets for 360-degree audio-visual saliency prediction, our study curates YT360-EyeTracking, a new dataset of 81 ODVs, each observed under varying audio-visual conditions. Our goal is to explore how to utilize audio-visual cues to effectively predict visual saliency in 360-degree videos. Towards this aim, we propose two novel saliency prediction models: SalViT360, a vision-transformer-based framework for ODVs equipped with spherical geometry-aware spatio-temporal attention layers, and SalViT360-AV, which further incorporates transformer adapters conditioned on audio input. Our results on a number of benchmark datasets, including our YT360-EyeTracking, demonstrate that SalViT360 and SalViT360-AV significantly outperform existing methods in predicting viewer attention in 360-degree scenes. Interpreting these results, we suggest that integrating spatial audio cues in the model architecture is crucial for accurate saliency prediction in omnidirectional videos. Code and dataset will be available at https://cyberiada.github.io/SalViT360.

[7] A Novel Framework for Automated Explain Vision Model Using Vision-Language Models cs.CV | cs.AI | cs.CL | cs.LG

Phu-Vinh Nguyen, Tan-Hanh Pham, Chris Ngo, Truong Son Hy

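TL;DR: The paper proposes an automated framework that uses vision-language models to explain the behavior of vision models at both the sample and dataset levels, helping to discover failure cases and improve explainability analysis.

Motivation: Vision model development currently focuses on performance metrics (such as accuracy, IoU, and mAP) and neglects explainability (xAI). Traditional xAI methods usually explain one sample at a time and give little insight into a model's overall behavior, especially on large datasets. Understanding that overall behavior is important for avoiding bias and identifying trends.

Result: The proposed framework efficiently generates explanations of vision models, helps identify failure cases and potential biases, and gives model development a more comprehensive analytical view.

Insight: Vision-language models offer a new approach to explainability analysis for vision models. Combining language and visual information gives a fuller picture of model behavior and supports more transparent and reliable vision model development.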

Abstract: The development of many vision models mainly focuses on improving their performance using metrics such as accuracy, IoU, and mAP, with less attention to explainability due to the complexity of applying xAI methods to provide a meaningful explanation of trained models. Although many existing xAI methods aim to explain vision models sample-by-sample, methods explaining the general behavior of vision models, which can only be captured after running on a large dataset, are still underexplored. Furthermore, understanding the behavior of vision models on general images can be very important to prevent biased judgments and help identify the model’s trends and patterns. With the application of Vision-Language Models, this paper proposes a pipeline to explain vision models at both the sample and dataset levels. The proposed pipeline can be used to discover failure cases and gain insights into vision models with minimal effort, thereby integrating vision model development with xAI analysis to advance image analysis.

[8] ATMS-KD: Adaptive Temperature and Mixed Sample Knowledge Distillation for a Lightweight Residual CNN in Agricultural Embedded Systems cs.CV

Mohamed Ohamouddou, Said Ohamouddou, Abdellatif El Afia, Rafik Lasri

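TL;DR: ATMS-KD is a new knowledge distillation framework that combines adaptive temperature scheduling with mixed-sample augmentation to transfer knowledge from a MobileNetV3 Large teacher to lightweight residual CNN students, significantly improving efficiency for agricultural embedded systems.

Motivation: Agricultural environments are usually resource-constrained and need lightweight CNN models for efficient computer vision. Existing knowledge distillation methods leave room for improvement in applicability and performance in agriculture.

Result: The student models exceed 96.7% validation accuracy, better than direct training. The compact model reaches 97.11% accuracy with an inference latency of only 72.19 ms, and knowledge retention exceeds 99%.

Insight: ATMS-KD performs well in resource-constrained agricultural settings, showing the potential of knowledge distillation for lightweight models. Combining mixed-sample augmentation with adaptive temperature scheduling is an effective knowledge transfer strategy.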

Abstract: This study proposes ATMS-KD (Adaptive Temperature and Mixed-Sample Knowledge Distillation), a novel framework for developing lightweight CNN models suitable for resource-constrained agricultural environments. The framework combines adaptive temperature scheduling with mixed-sample augmentation to transfer knowledge from a MobileNetV3 Large teacher model (5.7M parameters) to lightweight residual CNN students. Three student configurations were evaluated: Compact (1.3M parameters), Standard (2.4M parameters), and Enhanced (3.8M parameters). The dataset used in this study consists of images of Rosa damascena (Damask rose) collected from agricultural fields in the Dades Oasis, southeastern Morocco, providing a realistic benchmark for agricultural computer vision applications under diverse environmental conditions. Experimental evaluation on the Damascena rose maturity classification dataset demonstrated significant improvements over direct training methods. All student models achieved validation accuracies exceeding 96.7% with ATMS-KD compared to 95–96% with direct training. The framework outperformed eleven established knowledge distillation methods, achieving 97.11% accuracy with the compact model – a 1.60 percentage point improvement over the second-best approach while maintaining the lowest inference latency of 72.19 ms. Knowledge retention rates exceeded 99% for all configurations, demonstrating effective knowledge transfer regardless of student model capacity.
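The two ingredients named in this abstract can be sketched generically as follows: a temperature-softened distillation loss and a mixed-sample input. The abstract does not specify how ATMS-KD adapts the temperature or which mixing scheme it uses, so the linear annealing schedule and the mixup-style interpolation below are stand-in assumptions, and all function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

def linear_temperature(epoch, total_epochs, t_start=4.0, t_end=1.0):
    """Hypothetical schedule: anneal the temperature linearly over training."""
    frac = epoch / max(total_epochs - 1, 1)
    return t_start + frac * (t_end - t_start)

def mixup(x_a, x_b, lam):
    """Mixed-sample augmentation (mixup-style): convex combination of two inputs."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]

# Toy usage: the teacher and student see the same mixed input; only logits are shown.
mixed_input = mixup([0.0, 2.0], [2.0, 0.0], lam=0.5)
temperature = linear_temperature(epoch=0, total_epochs=10)
loss_match = kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], temperature)
loss_mismatch = kd_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0], temperature)
```

A higher temperature flattens both distributions so that the student also learns from the teacher's relative ranking of non-target classes. The `T^2` factor is the usual correction that keeps gradient magnitudes comparable across temperatures.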

[9] Linking heterogeneous microstructure informatics with expert characterization knowledge through customized and hybrid vision-language representations for industrial qualification cs.CV | cs.LG

Mutahar Safdar, Gentry Wood, Max Zimmermann, Guy Lamouche, Priti Wanjara

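TL;DR: The paper proposes a new framework that links microstructure informatics with expert characterization knowledge through customized and hybrid vision-language representations (VLRs), for rapid and reliable qualification of heterogeneous materials in industrial manufacturing.

Motivation: In industrial manufacturing, rapid and reliable qualification of heterogeneous materials, especially those produced by non-conventional additive manufacturing processes, is a bottleneck.

Result: Validation on an additively manufactured metal matrix composite dataset shows that the framework can distinguish acceptable from defective samples. FLAVA shows higher visual sensitivity, while CLIP aligns more consistently with the textual criteria.

Insight: Hybrid vision-language representations provide semantic interoperability between data and expert knowledge without task-specific model retraining, supporting scalable, domain-adaptable qualification strategies.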

Abstract: Rapid and reliable qualification of advanced materials remains a bottleneck in industrial manufacturing, particularly for heterogeneous structures produced via non-conventional additive manufacturing processes. This study introduces a novel framework that links microstructure informatics with a range of expert characterization knowledge using customized and hybrid vision-language representations (VLRs). By integrating deep semantic segmentation with pre-trained multi-modal models (CLIP and FLAVA), we encode both visual microstructural data and textual expert assessments into shared representations. To overcome limitations in general-purpose embeddings, we develop a customized similarity-based representation that incorporates both positive and negative references from expert-annotated images and their associated textual descriptions. This allows zero-shot classification of previously unseen microstructures through a net similarity scoring approach. Validation on an additively manufactured metal matrix composite dataset demonstrates the framework’s ability to distinguish between acceptable and defective samples across a range of characterization criteria. Comparative analysis reveals that FLAVA model offers higher visual sensitivity, while the CLIP model provides consistent alignment with the textual criteria. Z-score normalization adjusts raw unimodal and cross-modal similarity scores based on their local dataset-driven distributions, enabling more effective alignment and classification in the hybrid vision-language framework. The proposed method enhances traceability and interpretability in qualification pipelines by enabling human-in-the-loop decision-making without task-specific model retraining. By advancing semantic interoperability between raw data and expert knowledge, this work contributes toward scalable and domain-adaptable qualification strategies in engineering informatics.
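One plausible reading of the net similarity scoring and Z-score normalization mentioned in this abstract is sketched below. The two-dimensional embeddings, the use of cosine similarity, and the helper names are assumptions for illustration; the paper's actual embeddings come from CLIP and FLAVA, and the abstract does not spell out how the normalized scores are thresholded.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def net_similarity(query, positive_refs, negative_refs):
    """Mean similarity to acceptable references minus mean similarity to defective ones."""
    pos = sum(cosine(query, r) for r in positive_refs) / len(positive_refs)
    neg = sum(cosine(query, r) for r in negative_refs) / len(negative_refs)
    return pos - neg

def z_scores(values):
    """Standardize raw scores using the mean and population std of this dataset."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values] if std > 0 else [0.0] * len(values)

# Toy embeddings (assumed): one acceptable and one defective reference.
positive_refs = [[1.0, 0.0]]
negative_refs = [[0.0, 1.0]]
queries = [[1.0, 0.1], [0.1, 1.0], [1.0, 1.0]]

raw = [net_similarity(q, positive_refs, negative_refs) for q in queries]
normalized = z_scores(raw)
```

The first query sits close to the acceptable reference and gets a positive net score, the second gets a negative one, and the third is equidistant from both. Standardizing against the dataset's own score distribution puts scores from different models or modalities on a comparable scale before they are combined.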

[10] How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding cs.CV | cs.AI | cs.CL

Zhuoran Yu, Yong Jae Lee

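TL;DR: The paper proposes a probing framework to systematically analyze the internal dynamics of multimodal large language models (MLLMs) as they process visual and textual inputs, revealing a stage-wise processing structure across layers.

Motivation: Although MLLMs perform well on vision-language tasks, their internal processing mechanisms remain underexplored. The authors aim to fill this gap with a systematic analysis.

Result: MLLM processing shows a stage-wise structure: early layers perform visual grounding, middle layers support lexical integration and semantic reasoning, and final layers prepare task-specific outputs. This structure is consistent across models, but the specific layer allocation shifts notably with changes in the base LLM architecture.

Insight: The paper gives a unified view of the layer-wise organization of MLLMs, confirming stage-wise handling of visual grounding and semantic reasoning, and shows that the base LLM architecture strongly affects which layers serve each function.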

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated strong performance across a wide range of vision-language tasks, yet their internal processing dynamics remain underexplored. In this work, we introduce a probing framework to systematically analyze how MLLMs process visual and textual inputs across layers. We train linear classifiers to predict fine-grained visual categories (e.g., dog breeds) from token embeddings extracted at each layer, using a standardized anchor question. To uncover the functional roles of different layers, we evaluate these probes under three types of controlled prompt variations: (1) lexical variants that test sensitivity to surface-level changes, (2) semantic negation variants that flip the expected answer by modifying the visual concept in the prompt, and (3) output format variants that preserve reasoning but alter the answer format. Applying our framework to LLaVA-1.5, LLaVA-Next-LLaMA-3, and Qwen2-VL, we identify a consistent stage-wise structure in which early layers perform visual grounding, middle layers support lexical integration and semantic reasoning, and final layers prepare task-specific outputs. We further show that while the overall stage-wise structure remains stable across variations in visual tokenization, instruction tuning data, and pretraining corpus, the specific layer allocation to each stage shifts notably with changes in the base LLM architecture. Our findings provide a unified perspective on the layer-wise organization of MLLMs and offer a lightweight, model-agnostic approach for analyzing multimodal representation dynamics.

[11] Disentangling Latent Embeddings with Sparse Linear Concept Subspaces (SLiCS) cs.CV

Zhi Li, Hau Phan, Matthew Emigh, Austin J. Brockmeier

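TL;DR: The paper proposes sparse linear concept subspaces (SLiCS), which uses sparse non-negative dictionary learning to decompose a vision-language co-embedding space into concept-specific subspaces, enabling more precise concept-filtered image retrieval and conditional generation.

Motivation: The latent spaces of vision-language co-embedding models such as CLIP usually entangle multiple kinds of semantic information from complex scenes, making them hard to use directly for precise retrieval or generation of specific concepts.

Result: Experiments show that SLiCS improves the precision of concept-filtered image retrieval and applies to several embeddings, including those from TiTok and DINOv2.

Insight: The semantic information in text co-embeddings can be used with dictionary learning to disentangle visual representations, giving a more precise embedding space for concept-specific retrieval and generation.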

Abstract: Vision-language co-embedding networks, such as CLIP, provide a latent embedding space with semantic information that is useful for downstream tasks. We hypothesize that the embedding space can be disentangled to separate the information on the content of complex scenes by decomposing the embedding into multiple concept-specific component vectors that lie in different subspaces. We propose a supervised dictionary learning approach to estimate a linear synthesis model consisting of sparse, non-negative combinations of groups of vectors in the dictionary (atoms), whose group-wise activity matches the multi-label information. Each concept-specific component is a non-negative combination of atoms associated to a label. The group-structured dictionary is optimized through a novel alternating optimization with guaranteed convergence. Exploiting the text co-embeddings, we detail how semantically meaningful descriptions can be found based on text embeddings of words best approximated by a concept’s group of atoms, and unsupervised dictionary learning can exploit zero-shot classification of training set images using the text embeddings of concept labels to provide instance-wise multi-labels. We show that the disentangled embeddings provided by our sparse linear concept subspaces (SLiCS) enable concept-filtered image retrieval (and conditional generation using image-to-prompt) that is more precise. We also apply SLiCS to highly-compressed autoencoder embeddings from TiTok and the latent embedding from self-supervised DINOv2. Quantitative and qualitative results highlight the improved precision of the concept-filtered image retrieval for all embeddings.

[12] MedFoundationHub: A Lightweight and Secure Toolkit for Deploying Medical Vision Language Foundation Models cs.CV | cs.HC

Xiao Li, Yanfan Zhu, Ruining Deng, Wei-Qi Wei, Yu Wang

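TL;DR: This paper introduces MedFoundationHub, a lightweight and secure toolkit intended to address the security problems that medical vision-language models (VLMs) face in clinical use, such as PHI exposure and data leakage.

Motivation: Medical VLMs have great potential in clinical applications, but security concerns such as PHI exposure and cyberthreats hinder wider adoption.

Result: Expert evaluation of five state-of-the-art VLMs revealed shortcomings in clinical use, including off-target answers, vague reasoning, and inconsistent terminology.

Insight: Deploying medical VLMs requires balancing performance with security, and current models still have room to improve in clinical terminology.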

Abstract: Recent advances in medical vision-language models (VLMs) open up remarkable opportunities for clinical applications such as automated report generation, copilots for physicians, and uncertainty quantification. However, despite their promise, medical VLMs introduce serious security concerns, most notably risks of Protected Health Information (PHI) exposure, data leakage, and vulnerability to cyberthreats - which are especially critical in hospital environments. Even when adopted for research or non-clinical purposes, healthcare organizations must exercise caution and implement safeguards. To address these challenges, we present MedFoundationHub, a graphical user interface (GUI) toolkit that: (1) enables physicians to manually select and use different models without programming expertise, (2) supports engineers in efficiently deploying medical VLMs in a plug-and-play fashion, with seamless integration of Hugging Face open-source models, and (3) ensures privacy-preserving inference through Docker-orchestrated, operating system agnostic deployment. MedFoundationHub requires only an offline local workstation equipped with a single NVIDIA A6000 GPU, making it both secure and accessible within the typical resources of academic research labs. To evaluate current capabilities, we engaged board-certified pathologists to deploy and assess five state-of-the-art VLMs (Google-MedGemma3-4B, Qwen2-VL-7B-Instruct, Qwen2.5-VL-7B-Instruct, and LLaVA-1.5-7B/13B). Expert evaluation covered colon cases and renal cases, yielding 1015 clinician-model scoring events. These assessments revealed recurring limitations, including off-target answers, vague reasoning, and inconsistent pathology terminology.

[13] Enhancing Mamba Decoder with Bidirectional Interaction in Multi-Task Dense Prediction cs.CV

Mang Cao, Sanping Zhou, Yizhe Li, Ye Deng, Wenli Huang

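TL;DR: The paper proposes Bidirectional Interaction Mamba (BIM) for multi-task dense prediction. Its Bidirectional Interaction Scan (BI-Scan) and Multi-Scale Scan (MS-Scan) mechanisms achieve sufficient cross-task interaction while keeping computation efficient.

Motivation: In multi-task dense prediction, sufficient cross-task interaction usually leads to high computational complexity, and existing methods struggle to balance interaction completeness with computational efficiency.

Result: On the NYUD-V2 and PASCAL-Context benchmarks, BIM outperforms current state-of-the-art methods.

Insight: Through its scanning mechanisms, BIM balances computational efficiency and interaction completeness in multi-task dense prediction, offering a new direction for multi-task learning.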

Abstract: Sufficient cross-task interaction is crucial for success in multi-task dense prediction. However, sufficient interaction often results in high computational complexity, forcing existing methods to face the trade-off between interaction completeness and computational efficiency. To address this limitation, this work proposes a Bidirectional Interaction Mamba (BIM), which incorporates novel scanning mechanisms to adapt the Mamba modeling approach for multi-task dense prediction. On the one hand, we introduce a novel Bidirectional Interaction Scan (BI-Scan) mechanism, which constructs task-specific representations as bidirectional sequences during interaction. By integrating task-first and position-first scanning modes within a unified linear complexity architecture, BI-Scan efficiently preserves critical cross-task information. On the other hand, we employ a Multi-Scale Scan (MS-Scan) mechanism to achieve multi-granularity scene modeling. This design not only meets the diverse granularity requirements of various tasks but also enhances nuanced cross-task feature interactions. Extensive experiments on two challenging benchmarks, i.e., NYUD-V2 and PASCAL-Context, show the superiority of our BIM vs its state-of-the-art competitors.

Hyeonyu Kim, Seokhoon Jeong, Seonghee Han, Chanhyuk Choi, Taehwan Kim

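TL;DR: The paper proposes an audio-guided visual editing framework that handles complex editing tasks with multimodal (text and audio) prompts without additional training, using a multi-modal encoder and adaptive patch selection to improve results.

Motivation: Existing diffusion-based visual editing methods usually rely on text guidance alone and struggle with complex scenarios, while existing audio-guided methods require training on specific datasets and generalize poorly.

Result: Experiments show that the method performs well on complex editing tasks, particularly in scenarios that text-only methods handle poorly.

Insight: Multimodal information such as audio can substantially improve the flexibility and quality of visual editing, especially in complex scenarios.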

Abstract: Visual editing with diffusion models has made significant progress but often struggles with complex scenarios that textual guidance alone could not adequately describe, highlighting the need for additional non-text editing prompts. In this work, we introduce a novel audio-guided visual editing framework that can handle complex editing tasks with multiple text and audio prompts without requiring additional training. Existing audio-guided visual editing methods often necessitate training on specific datasets to align audio with text, limiting their generalization to real-world situations. We leverage a pre-trained multi-modal encoder with strong zero-shot capabilities and integrate diverse audio into visual editing tasks, by alleviating the discrepancy between the audio encoder space and the diffusion model’s prompt encoder space. Additionally, we propose a novel approach to handle complex scenarios with multiple and multi-modal editing prompts through our separate noise branching and adaptive patch selection. Our comprehensive experiments on diverse editing tasks demonstrate that our framework excels in handling complicated editing scenarios by incorporating rich information from audio, where text-only approaches fail.

[15] More Reliable Pseudo-labels, Better Performance: A Generalized Approach to Single Positive Multi-label Learning cs.CV

Luong Tran, Thieu Vo, Anh Nguyen, Sang Dinh, Van Nguyen

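TL;DR: The paper proposes the Generalized Pseudo-Label Robust Loss (GPR Loss) and the Dynamic Augmented Multi-focus Pseudo-labeling (DAMP) technique, which together form the Adaptive and Efficient Vision-Language Pseudo-Labeling (AEVLP) framework and significantly improve single positive multi-label learning.

Motivation: To address the pseudo-label noise and false negatives caused by partially annotated data in Single Positive Multi-Label Learning (SPML).

Result: The AEVLP framework is validated on four benchmark datasets, where it achieves state-of-the-art performance.

Insight: A robust loss combined with dynamic pseudo-labeling can effectively handle the learning challenges of partially annotated data.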

Abstract: Multi-label learning is a challenging computer vision task that requires assigning multiple categories to each image. However, fully annotating large-scale datasets is often impractical due to high costs and effort, motivating the study of learning from partially annotated data. In the extreme case of Single Positive Multi-Label Learning (SPML), each image is provided with only one positive label, while all other labels remain unannotated. Traditional SPML methods that treat missing labels as unknown or negative tend to yield inaccuracies and false negatives, and integrating various pseudo-labeling strategies can introduce additional noise. To address these challenges, we propose the Generalized Pseudo-Label Robust Loss (GPR Loss), a novel loss function that effectively learns from diverse pseudo-labels while mitigating noise. Complementing this, we introduce a simple yet effective Dynamic Augmented Multi-focus Pseudo-labeling (DAMP) technique. Together, these contributions form the Adaptive and Efficient Vision-Language Pseudo-Labeling (AEVLP) framework. Extensive experiments on four benchmark datasets demonstrate that our framework significantly advances multi-label classification, achieving state-of-the-art results.

[16] Ultra-Low-Latency Spiking Neural Networks with Temporal-Dependent Integrate-and-Fire Neuron Model for Objects Detection cs.CV | cs.AI | I.4.0; I.2.6

Chengjun Zhang, Yuhao Zhang, Jie Yang, Mohamad Sawan

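TL;DR: The paper proposes an ultra-low-latency spiking neural network (SNN) based on a temporal-dependent Integrate-and-Fire (tdIF) neuron model, and uses a delay-spike approach to mitigate the residual membrane potential caused by heterogeneous spiking patterns, significantly improving performance on visual detection tasks.

Motivation: Current ANN-SNN conversion methods perform well on classification but poorly on visual detection tasks. The paper aims to close this gap and achieve ultra-low-latency visual detection.

Result: On object detection and lane line detection, the proposed method surpasses existing ANN-SNN conversion methods and achieves state-of-the-art performance with ultra-low latency (within 5 time-steps).

Insight: Temporal-dependent neuron behavior can significantly improve SNN performance and latency on visual detection tasks while keeping energy consumption on par with traditional IF neurons.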

Abstract: Spiking Neural Networks (SNNs), inspired by the brain, are characterized by minimal power consumption and swift inference capabilities on neuromorphic hardware, and have been widely applied to various visual perception tasks. Current ANN-SNN conversion methods have achieved excellent results in classification tasks with ultra-low time-steps, but their performance in visual detection tasks remains suboptimal. In this paper, we propose a delay-spike approach to mitigate the issue of residual membrane potential caused by heterogeneous spiking patterns. Furthermore, we propose a novel temporal-dependent Integrate-and-Fire (tdIF) neuron architecture for SNNs. This enables Integrate-and-fire (IF) neurons to dynamically adjust their accumulation and firing behaviors based on the temporal order of time-steps. Our method enables spikes to exhibit distinct temporal properties, rather than relying solely on frequency-based representations. Moreover, the tdIF neuron maintains energy consumption on par with traditional IF neuron. We demonstrate that our method achieves more precise feature representation with lower time-steps, enabling high performance and ultra-low latency in visual detection tasks. In this study, we conduct extensive evaluation of the tdIF method across two critical vision tasks: object detection and lane line detection. The results demonstrate that the proposed method surpasses current ANN-SNN conversion approaches, achieving state-of-the-art performance with ultra-low latency (within 5 time-steps).
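For readers unfamiliar with the residual membrane potential problem this abstract mentions, the sketch below simulates a plain integrate-and-fire neuron with reset-by-subtraction. It shows only the baseline behavior that motivates the paper, not the proposed tdIF neuron or delay-spike method; the threshold and input sequences are toy values chosen for illustration.

```python
def simulate_if_neuron(inputs, threshold=1.0):
    """Plain integrate-and-fire neuron with reset-by-subtraction.

    Returns the spike train and the membrane potential left over after the last step.
    """
    potential = 0.0
    spikes = []
    for current in inputs:
        potential += current  # integrate the incoming current
        if potential >= threshold:
            spikes.append(1)  # fire at most one spike per time-step
            potential -= threshold
        else:
            spikes.append(0)
    return spikes, potential

# Both input patterns deliver the same total charge (2.0) over four time-steps.
even_spikes, even_residual = simulate_if_neuron([0.5, 0.5, 0.5, 0.5])
late_spikes, late_residual = simulate_if_neuron([0.0, 0.0, 0.0, 2.0])
```

With evenly spread input the neuron fires twice and ends with no leftover potential. When the same charge arrives in the final step it can fire only once, leaving a residual of 1.0 that is never transmitted. With very few time-steps, such leftovers distort the rate-coded activation, which is the kind of error the abstract attributes to heterogeneous spiking patterns.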

[17] Graph-Based Uncertainty Modeling and Multimodal Fusion for Salient Object Detection cs.CV

Yuqi Xiong, Wuzhen Shi, Yang Wen, Ruhan Liu

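TL;DR: The paper proposes a dynamic uncertainty propagation and multimodal collaborative reasoning network (DUP-MCRNet), whose dynamic uncertainty graph convolution module (DUGC) and multimodal collaborative fusion strategy (MCF) significantly improve detail preservation and edge clarity in salient object detection (SOD).

Motivation: Existing SOD methods tend to lose details, blur edges, and fuse single-modal information insufficiently in complex scenes, so a more robust model is needed.

Result: The method performs well on multiple benchmark datasets, especially in edge clarity and robustness to complex backgrounds.

Insight: Dynamic module design and multimodal fusion strategies can effectively improve SOD performance, offering a new direction for object detection in complex scenes.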

Abstract: In view of the problems that existing salient object detection (SOD) methods are prone to losing details, blurring edges, and insufficient fusion of single-modal information in complex scenes, this paper proposes a dynamic uncertainty propagation and multimodal collaborative reasoning network (DUP-MCRNet). Firstly, a dynamic uncertainty graph convolution module (DUGC) is designed to propagate uncertainty between layers through a sparse graph constructed based on spatial semantic distance, and combined with channel adaptive interaction, it effectively improves the detection accuracy of small structures and edge regions. Secondly, a multimodal collaborative fusion strategy (MCF) is proposed, which uses learnable modality gating weights to weightedly fuse the attention maps of RGB, depth, and edge features. It can dynamically adjust the importance of each modality according to different scenes, effectively suppress redundant or interfering information, and strengthen the semantic complementarity and consistency between cross-modalities, thereby improving the ability to identify salient regions under occlusion, weak texture or background interference. Finally, the detection performance at the pixel level and region level is optimized through multi-scale BCE and IoU loss, cross-scale consistency constraints, and uncertainty-guided supervision mechanisms. Extensive experiments show that DUP-MCRNet outperforms various SOD methods on most common benchmark datasets, especially in terms of edge clarity and robustness to complex backgrounds. Our code is publicly available at https://github.com/YukiBear426/DUP-MCRNet.
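The gated fusion and the combined BCE and IoU loss described in this abstract can be sketched in a few lines. The softmax over gate logits, the flattened three-pixel maps, and the single-scale loss are simplifying assumptions; the authors' actual modules are in the repository linked above.

```python
import math

def softmax(values):
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def gated_fusion(attention_maps, gate_logits):
    """Fuse per-modality attention maps (flat lists of equal length) with softmax gates."""
    weights = softmax(gate_logits)
    length = len(attention_maps[0])
    return [sum(w * m[i] for w, m in zip(weights, attention_maps)) for i in range(length)]

def bce_iou_loss(pred, target, eps=1e-7):
    """Binary cross-entropy plus soft IoU loss over a flattened saliency map."""
    bce = -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
               for p, t in zip(pred, target)) / len(pred)
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    return bce + (1.0 - inter / (union + eps))

# Toy attention maps for three modalities (assumed values).
rgb = [0.9, 0.1, 0.8]
depth = [0.7, 0.3, 0.6]
edge = [0.5, 0.5, 0.4]

# Equal gate logits reduce to a plain average; in training the logits would be learned.
fused = gated_fusion([rgb, depth, edge], gate_logits=[0.0, 0.0, 0.0])
target = [1.0, 0.0, 1.0]
loss_fused = bce_iou_loss(fused, target)
loss_sharp = bce_iou_loss([0.99, 0.01, 0.99], target)
```

Raising one modality's gate logit shifts the fused map toward that modality, which is how a learned gate can down-weight an unreliable input such as noisy depth. The IoU term complements pixel-wise BCE by penalizing poor overlap at the region level.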

[18] MSMVD: Exploiting Multi-scale Image Features via Multi-scale BEV Features for Multi-view Pedestrian Detection cs.CVPDF

Taiga Yamane, Satoshi Suzuki, Ryo Masumura, Shota Orihashi, Tomohiro Tanaka

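TL;DR: MSMVD exploits multi-view image features through multi-scale BEV features, effectively addressing scale variation in multi-view pedestrian detection and significantly improving detection performance.

Details

Motivation: Existing multi-view pedestrian detection (MVPD) methods struggle to detect pedestrians of different scales consistently, especially when scales differ widely across views.

Result: On the GMVD dataset, MSMVD surpasses the previous best MODA by 4.5 points, significantly outperforming existing methods.

Insight: Multi-scale BEV features effectively capture scale differences across views, improving detection accuracy.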

Abstract: Multi-View Pedestrian Detection (MVPD) aims to detect pedestrians in the form of a bird’s eye view (BEV) from multi-view images. In MVPD, end-to-end trainable deep learning methods have progressed greatly. However, they often struggle to detect pedestrians with consistently small or large scales in views or with vastly different scales between views. This is because they do not exploit multi-scale image features to generate the BEV feature and detect pedestrians. To overcome this problem, we propose a novel MVPD method, called Multi-Scale Multi-View Detection (MSMVD). MSMVD generates multi-scale BEV features by projecting multi-scale image features extracted from individual views into the BEV space, scale-by-scale. Each of these BEV features inherits the properties of its corresponding scale image features from multiple views. Therefore, these BEV features help the precise detection of pedestrians with consistently small or large scales in views. Then, MSMVD combines information at different scales of multiple views by processing the multi-scale BEV features using a feature pyramid network. This improves the detection of pedestrians with vastly different scales between views. Extensive experiments demonstrate that exploiting multi-scale image features via multi-scale BEV features greatly improves the detection performance, and MSMVD outperforms the previous highest MODA by $4.5$ points on the GMVD dataset.

[19] A Spatial-Frequency Aware Multi-Scale Fusion Network for Real-Time Deepfake Detection cs.CVPDF

Libo Lv, Tianyi Wang, Mengxiao Huang, Ruixia Liu, Yinglong Wang

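TL;DR: SFMFNet is a lightweight yet effective network for real-time deepfake detection that balances accuracy and efficiency through a spatial-frequency hybrid aware module and a multi-scale fusion mechanism.

Details

Motivation: With the rapid development of deepfake generation techniques, forged content is increasingly widespread in settings such as video conferencing and social media, yet existing detectors are computationally expensive and hard to deploy in real time.

Result: On multiple benchmark datasets, SFMFNet achieves a favorable balance between accuracy and efficiency, with strong generalization and practical value.

Insight: Combining spatial and frequency information captures forgery traces better, a lightweight design enables real-time detection, and multi-scale fusion strengthens generalization.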

Abstract: With the rapid advancement of real-time deepfake generation techniques, forged content is becoming increasingly realistic and widespread across applications like video conferencing and social media. Although state-of-the-art detectors achieve high accuracy on standard benchmarks, their heavy computational cost hinders real-time deployment in practical applications. To address this, we propose the Spatial-Frequency Aware Multi-Scale Fusion Network (SFMFNet), a lightweight yet effective architecture for real-time deepfake detection. We design a spatial-frequency hybrid aware module that jointly leverages spatial textures and frequency artifacts through a gated mechanism, enhancing sensitivity to subtle manipulations. A token-selective cross attention mechanism enables efficient multi-level feature interaction, while a residual-enhanced blur pooling structure helps retain key semantic cues during downsampling. Experiments on several benchmark datasets show that SFMFNet achieves a favorable balance between accuracy and efficiency, with strong generalization and practical value for real-time applications.

[20] Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation cs.CVPDF

Xiaochuan Li, Guoguang Du, Runze Zhang, Liang Jin, Qi Jia

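TL;DR: The paper proposes Droplet3D, which uses commonsense priors from videos to address data scarcity in 3D generation, and demonstrates its effectiveness at producing spatially consistent and semantically plausible 3D content.

Details

Motivation: Data scarcity in 3D generation limits model generalization, while the multi-view and rich semantic information in video data can serve as an alternative supervisory signal.

Result: Experiments show that the method generates spatially consistent and semantically plausible 3D content and has the potential to extend to scene-level applications.

Insight: Commonsense priors from videos can effectively compensate for scarce 3D data, providing a new supervisory signal for 3D generation.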

Abstract: Scaling laws have validated the success and promise of large-data-trained models in creative generation across text, image, and video domains. However, this paradigm faces data scarcity in the 3D domain, as far less 3D data is available on the internet than for the aforementioned modalities. Fortunately, there exist adequate videos that inherently contain commonsense priors, offering an alternative supervisory signal to mitigate the generalization bottleneck caused by limited native 3D data. On the one hand, videos capturing multiple views of an object or scene provide a spatial consistency prior for 3D generation. On the other hand, the rich semantic information contained within the videos enables the generated content to be more faithful to the text prompts and semantically plausible. This paper explores how to apply the video modality in 3D asset generation, spanning datasets to models. We introduce Droplet3D-4M, the first large-scale video dataset with multi-view level annotations, and train Droplet3D, a generative model supporting both image and dense text input. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to produce spatially consistent and semantically plausible content. Moreover, in contrast to the prevailing 3D solutions, our approach exhibits the potential for extension to scene-level applications. This indicates that the commonsense priors from the videos significantly facilitate 3D creation. We have open-sourced all resources including the dataset, code, technical framework, and model weights: https://dropletx.github.io/.

[21] Realistic and Controllable 3D Gaussian-Guided Object Editing for Driving Video Generation cs.CVPDF

Jiusi Li, Jackson Jiang, Jinyu Miao, Miao Long, Tuopu Wen

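TL;DR: This paper proposes the G^2Editor framework for photorealistic and precisely controlled object editing in driving videos, using a 3D Gaussian representation and a scene-level layout to address the shortcomings of existing methods in visual fidelity and pose control.

Details

Motivation: Corner cases in real driving scenes are costly and hazardous to collect, and editing captured sensor data to generate diverse scenarios is an effective alternative. Existing methods (such as 3D Gaussian Splatting or image generative models) are limited in visual quality and pose control.

Result: Experiments on the Waymo Open Dataset show that G^2Editor outperforms existing methods in pose control and visual quality, supports object repositioning, insertion, and deletion, and benefits downstream tasks.

Insight: Combining a 3D Gaussian prior with the scene layout can markedly improve the realism and controllability of generated results, offering a new approach to data augmentation and scene synthesis for autonomous driving.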

Abstract: Corner cases are crucial for training and validating autonomous driving systems, yet collecting them from the real world is often costly and hazardous. Editing objects within captured sensor data offers an effective alternative for generating diverse scenarios, commonly achieved through 3D Gaussian Splatting or image generative models. However, these approaches often suffer from limited visual fidelity or imprecise pose control. To address these issues, we propose G^2Editor, a framework designed for photorealistic and precise object editing in driving videos. Our method leverages a 3D Gaussian representation of the edited object as a dense prior, injected into the denoising process to ensure accurate pose control and spatial consistency. A scene-level 3D bounding box layout is employed to reconstruct occluded areas of non-target objects. Furthermore, to guide the appearance details of the edited object, we incorporate hierarchical fine-grained features as additional conditions during generation. Experiments on the Waymo Open Dataset demonstrate that G^2Editor effectively supports object repositioning, insertion, and deletion within a unified framework, outperforming existing methods in both pose controllability and visual quality, while also benefiting downstream data-driven tasks.

[22] Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding cs.CVPDF

Yuan Xie, Tianshui Chen, Zheng Ge, Lionel Ni

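TL;DR: Video-MTR uses a reinforced multi-turn reasoning framework that progressively selects key video segments while refining its understanding of the question, improving long video understanding without relying on external vision-language models.

Details

Motivation: Long-range temporal dependencies and multiple events make long video understanding challenging, and existing methods underperform because they rely on static reasoning or external VLMs.

Result: On the VideoMME, MLVU, and EgoSchema benchmarks, Video-MTR outperforms existing methods in both accuracy and efficiency.

Insight: Multi-turn reasoning and a bi-level reward design effectively improve long video understanding and demonstrate the advantage of end-to-end training.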

Abstract: Long-form video understanding, characterized by long-range temporal dependencies and multiple events, remains a challenge. Existing methods often rely on static reasoning or external visual-language models (VLMs), which face issues like complexity and sub-optimal performance due to the lack of end-to-end training. In this paper, we propose Video-MTR, a reinforced multi-turn reasoning framework designed to enable iterative key video segment selection and question comprehension. Unlike traditional video reasoning pipelines, which generate predictions in a single turn, Video-MTR performs reasoning in multiple turns, selecting video segments progressively based on the evolving understanding of previously processed segments and the current question. This iterative process allows for a more refined and contextually aware analysis of the video. To guide the intermediate reasoning process, we introduce a novel gated bi-level reward system, combining trajectory-level rewards based on answer correctness and turn-level rewards emphasizing frame-query relevance. This system optimizes both video segment selection and question comprehension, eliminating the need for external VLMs and allowing end-to-end training. Extensive experiments on benchmarks like VideoMME, MLVU, and EgoSchema demonstrate that Video-MTR outperforms existing methods in both accuracy and efficiency, advancing the state-of-the-art in long video understanding.
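
One plausible reading of the gated bi-level reward is sketched below. The abstract does not define the gate, so this sketch assumes that turn-level relevance rewards count only when the final answer is correct; the function name, the `turn_weight` parameter, and the 0/1 trajectory reward are all illustrative assumptions rather than the paper's formulation.

```python
def gated_bilevel_reward(answer_correct, turn_relevance, turn_weight=0.5):
    """Combine a trajectory-level reward with gated turn-level rewards.

    answer_correct: whether the final answer of the trajectory is right.
    turn_relevance: one frame-query relevance score in [0, 1] per turn.
    turn_weight: hypothetical scale for turn-level rewards.
    """
    # Trajectory-level reward: based only on answer correctness.
    trajectory_reward = 1.0 if answer_correct else 0.0
    # Assumed gate: turn-level rewards are switched off for wrong answers,
    # so relevant-looking segment choices cannot be rewarded on their own.
    gate = trajectory_reward
    turn_rewards = [gate * turn_weight * score for score in turn_relevance]
    return trajectory_reward, turn_rewards


reward_ok = gated_bilevel_reward(True, [0.2, 0.8])
reward_bad = gated_bilevel_reward(False, [0.2, 0.8])
```

Under this assumption a trajectory with a wrong answer receives zero reward at both levels, regardless of how relevant its selected segments were.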

[23] Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts cs.CVPDF

Zixuan Hu, Dongxiao Li, Xinzhu Ma, Shixiang Tang, Xiaotong Li

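TL;DR: The paper proposes a dual uncertainty optimization framework called DUO, which jointly minimizes semantic and geometric uncertainty to improve the robustness of monocular 3D object detection (M3OD) under test-time domain shifts.

Details

Motivation: Monocular 3D object detection degrades significantly under real-world domain shifts caused by environmental or sensor variations, and existing test-time adaptation methods do not adequately address the dual uncertainty (semantic and geometric) of M3OD.

Result: Across multiple datasets and domain shift types, DUO outperforms existing methods, confirming its effectiveness at improving M3OD robustness.

Insight: Semantic and geometric uncertainty are complementary in M3OD; optimizing them jointly markedly improves adaptation under domain shift, and the unsupervised loss broadens the method's practicality.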

Abstract: Accurate monocular 3D object detection (M3OD) is pivotal for safety-critical applications like autonomous driving, yet its reliability deteriorates significantly under real-world domain shifts caused by environmental or sensor variations. To address these shifts, Test-Time Adaptation (TTA) methods have emerged, enabling models to adapt to target distributions during inference. While prior TTA approaches recognize the positive correlation between low uncertainty and high generalization ability, they fail to address the dual uncertainty inherent to M3OD: semantic uncertainty (ambiguous class predictions) and geometric uncertainty (unstable spatial localization). To bridge this gap, we propose Dual Uncertainty Optimization (DUO), the first TTA framework designed to jointly minimize both uncertainties for robust M3OD. Through a convex optimization lens, we introduce an innovative convex structure of the focal loss and further derive a novel unsupervised version, enabling label-agnostic uncertainty weighting and balanced learning for high-uncertainty objects. In parallel, we design a semantic-aware normal field constraint that preserves geometric coherence in regions with clear semantic cues, reducing uncertainty from the unstable 3D representation. This dual-branch mechanism forms a complementary loop: enhanced spatial perception improves semantic classification, and robust semantic predictions further refine spatial understanding. Extensive experiments demonstrate the superiority of DUO over existing methods across various datasets and domain shift types.

[24] CaddieSet: A Golf Swing Dataset with Human Joint Features and Ball Information cs.CV | cs.AIPDF

Seunghyeon Jung, Seoyoung Hong, Jiwoo Jeong, Seungwon Jeong, Jaerim Choi

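TL;DR: The paper introduces CaddieSet, a new golf swing dataset containing joint information and ball trajectory data; it segments the swing into 8 phases with a computer vision approach, defines 15 key metrics, and validates the dataset's feasibility and interpretability.

Details

Motivation: Existing studies have not quantitatively established the relationship between swing posture and ball trajectory, limiting the insights available for swing improvement.

Result: Experiments confirm the dataset's feasibility for predicting ball trajectories, and the results of interpretable models are consistent with domain knowledge.

Insight: The dataset offers a new perspective on golf swing analysis for both academia and the sports industry.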

Abstract: Recent advances in deep learning have led to more studies to enhance golfers’ shot precision. However, these existing studies have not quantitatively established the relationship between swing posture and ball trajectory, limiting their ability to provide golfers with the necessary insights for swing improvement. In this paper, we propose a new dataset called CaddieSet, which includes joint information and various ball information from a single shot. CaddieSet extracts joint information from a single swing video by segmenting it into eight swing phases using a computer vision-based approach. Furthermore, based on expert golf domain knowledge, we define 15 key metrics that influence a golf swing, enabling the interpretation of swing outcomes through swing-related features. Through experiments, we demonstrated the feasibility of CaddieSet for predicting ball trajectories using various benchmarks. In particular, we focus on interpretable models among several benchmarks and verify that swing feedback using our joint features is quantitatively consistent with established domain knowledge. This work is expected to offer new insight into golf swing analysis for both academia and the sports industry.

[25] IAENet: An Importance-Aware Ensemble Model for 3D Point Cloud-Based Anomaly Detection cs.CVPDF

Xuanming Cao, Chengyu Tao, Yifeng Cheng, Juan Du

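TL;DR: IAENet is a 3D point cloud-based anomaly detection model that combines a 2D pretrained expert with 3D expert models and introduces a dynamic Importance-Aware Fusion (IAF) module to improve detection performance.

Details

Motivation: In industrial manufacturing, 2D image anomaly detection is already mature, but 3D point cloud detection has progressed slowly for lack of powerful pretrained foundation models. The authors aim to fill this gap.

Result: Achieves a new state of the art on the MVTec 3D-AD dataset with a markedly lower false positive rate.

Insight: The dynamic importance-aware fusion module integrates multimodal information effectively while preserving the distinct strengths of each modality, making it suitable for real industrial settings.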

Abstract: Surface anomaly detection is pivotal for ensuring product quality in industrial manufacturing. While 2D image-based methods have achieved remarkable success, 3D point cloud-based detection remains underexplored despite its richer geometric cues. We argue that the key bottleneck is the absence of powerful pretrained foundation backbones in 3D comparable to those in 2D. To bridge this gap, we propose Importance-Aware Ensemble Network (IAENet), an ensemble framework that synergizes a 2D pretrained expert with 3D expert models. However, naively fusing predictions from disparate sources is non-trivial: existing strategies can be affected by a poorly performing modality and thus degrade overall accuracy. To address this challenge, we introduce a novel Importance-Aware Fusion (IAF) module that dynamically assesses the contribution of each source and reweights their anomaly scores. Furthermore, we devise critical loss functions that explicitly guide the optimization of IAF, enabling it not only to combine the collective knowledge of the source experts but also to preserve their unique strengths, thereby enhancing the overall performance of anomaly detection. Extensive experiments on MVTec 3D-AD demonstrate that our IAENet achieves a new state-of-the-art with a markedly lower false positive rate, underscoring its practical value for industrial deployment.

[26] Describe, Don’t Dictate: Semantic Image Editing with Natural Language Intent cs.CVPDF

En Ci, Shanyan Guan, Yanhao Ge, Yilin Zhang, Wei Li

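TL;DR: The paper proposes DescriptiveEdit, an image editing framework based on descriptive prompts that recasts instruction-based image editing as reference-image-based text-to-image generation, overcoming the limitations of existing methods and improving editing accuracy and consistency.

Details

Motivation: Traditional inversion-based image editing methods introduce reconstruction errors, while instruction-based methods are limited by dataset quality and scale. The authors propose a new framework to address these problems.

Result: On the Emu Edit benchmark, the framework improves editing accuracy and consistency.

Insight: Reference-image-based text-to-image generation sidesteps the limits of instruction dataset quality while integrating seamlessly with other extensions.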

Abstract: Despite the progress in text-to-image generation, semantic image editing remains a challenge. Inversion-based algorithms unavoidably introduce reconstruction errors, while instruction-based models mainly suffer from limited dataset quality and scale. To address these problems, we propose a descriptive-prompt-based editing framework, named DescriptiveEdit. The core idea is to re-frame "instruction-based image editing" as "reference-image-based text-to-image generation", which preserves the generative power of well-trained Text-to-Image models without architectural modifications or inversion. Specifically, taking the reference image and a prompt as input, we introduce a Cross-Attentive UNet, which newly adds attention bridges to inject reference image features into the prompt-to-edit-image generation process. Owing to its text-to-image nature, DescriptiveEdit overcomes limitations in instruction dataset quality, integrates seamlessly with ControlNet, IP-Adapter, and other extensions, and is more scalable. Experiments on the Emu Edit benchmark show it improves editing accuracy and consistency.

Jingyun Yang, Guoqing Zhang, Jingge Wang, Yang Li

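TL;DR: The paper proposes an active and sequential domain adaptation framework with dynamic multi-modal sample selection for medical image segmentation; it selects the most valuable samples based on informativeness and representativeness, significantly improving segmentation performance.

Details

Motivation: Annotating medical images is costly and time-consuming, and active learning can reduce annotation needs. However, existing active domain adaptation methods suffer from sample redundancy and negative transfer, and selection strategies for multi-modal medical data remain unexplored.

Result: On a range of gross tumor volume segmentation tasks, the method significantly outperforms existing active domain adaptation methods.

Insight: Segmenting multi-modal medical data calls for a dynamic sample selection strategy, and combining informativeness with representativeness is key.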

Abstract: Accurate gross tumor volume segmentation on multi-modal medical data is critical for radiotherapy planning in nasopharyngeal carcinoma and glioblastoma. Recent advances in deep neural networks have brought promising results in medical image segmentation, leading to an increasing demand for labeled data. Since labeling medical images is time-consuming and labor-intensive, active learning has emerged as a solution to reduce annotation costs by selecting the most informative samples to label and adapting high-performance models with as few labeled samples as possible. Previous active domain adaptation (ADA) methods seek to minimize sample redundancy by selecting samples that are farthest from the source domain. However, such one-off selection can easily cause negative transfer, and access to source medical data is often limited. Moreover, the query strategy for multi-modal medical data remains unexplored. In this work, we propose an active and sequential domain adaptation framework for dynamic multi-modal sample selection in ADA. We derive a query strategy to prioritize labeling and training on the most valuable samples based on their informativeness and representativeness. Empirical validation on diverse gross tumor volume segmentation tasks demonstrates that our method achieves favorable segmentation performance, significantly outperforming state-of-the-art ADA methods. Code is available at the git repository mmActS: https://github.com/Hiyoochan/mmActS.

[28] Enhancing Pseudo-Boxes via Data-Level LiDAR-Camera Fusion for Unsupervised 3D Object Detection cs.CVPDF

Mingqian Ji, Jian Yang, Shanshan Zhang

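TL;DR: This paper proposes an unsupervised 3D object detection method based on data-level LiDAR-camera fusion, which significantly improves pseudo-box quality through bi-directional fusion and a dynamic self-evolution strategy.

Details

Motivation: Existing LiDAR-based 3D object detectors depend on manually annotated labels, and high-quality 3D labels are expensive to obtain. Prior work has used RGB images to help generate pseudo-boxes, but fusing only at the label level overlooks the complementarity of LiDAR and RGB data.

Result: On the nuScenes dataset, the detector trained with this method reaches 28.4% mAP, significantly outperforming previous state-of-the-art methods.

Insight: Data-level fusion makes better use of the complementarity between LiDAR and RGB data, and the dynamic self-evolution strategy progressively improves pseudo-box quality, moving unsupervised 3D detection closer to practical use.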

Abstract: Existing LiDAR-based 3D object detectors typically rely on manually annotated labels for training to achieve good performance. However, obtaining high-quality 3D labels is time-consuming and labor-intensive. To address this issue, recent works explore unsupervised 3D object detection by introducing RGB images as an auxiliary modality to assist pseudo-box generation. However, these methods simply integrate pseudo-boxes generated by LiDAR point clouds and RGB images. Such a label-level fusion strategy brings limited improvements to the quality of pseudo-boxes, as it overlooks the complementary nature of LiDAR and RGB image data. To overcome the above limitations, we propose a novel data-level fusion framework that integrates RGB images and LiDAR data at an early stage. Specifically, we utilize vision foundation models for instance segmentation and depth estimation on images and introduce a bi-directional fusion method, where real points acquire category labels from the 2D space, while 2D pixels are projected onto 3D to enhance real point density. To mitigate noise from depth and segmentation estimations, we propose a local and global filtering method, which applies local radius filtering to suppress depth estimation errors and global statistical filtering to remove segmentation-induced outliers. Furthermore, we propose a data-level fusion based dynamic self-evolution strategy, which iteratively refines pseudo-boxes under a dense representation, significantly improving localization accuracy. Extensive experiments on the nuScenes dataset demonstrate that the detector trained by our method significantly outperforms that trained by previous state-of-the-art methods with 28.4% mAP on the nuScenes validation benchmark.
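
The local and global filtering step can be sketched as follows. This is a generic illustration, not the authors' implementation: it assumes the local filter keeps points with enough neighbors inside a radius and the global filter drops points whose distance to the centroid is far above the mean. The function names and the thresholds `radius`, `min_neighbors`, and `num_std` are hypothetical.

```python
import math


def radius_filter(points, radius, min_neighbors):
    """Local filter: keep a point only if enough other points lie within radius."""
    kept = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        )
        if neighbors >= min_neighbors:
            kept.append(p)
    return kept


def statistical_filter(points, num_std=2.0):
    """Global filter: drop points unusually far from the centroid."""
    centroid = tuple(sum(axis) / len(points) for axis in zip(*points))
    dists = [math.dist(p, centroid) for p in points]
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    return [p for p, d in zip(points, dists) if d <= mean + num_std * std]


cluster = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.5, 0.0), (0.5, 0.5, 0.0)]
noisy = cluster + [(10.0, 10.0, 10.0)]
locally_clean = radius_filter(noisy, radius=1.0, min_neighbors=1)
globally_clean = statistical_filter(noisy, num_std=1.0)
```

The brute-force neighbor search is quadratic in the number of points; a real pipeline on LiDAR-scale data would use a spatial index such as a k-d tree.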

[29] Digital Scale: Open-Source On-Device BMI Estimation from Smartphone Camera Images Trained on a Large-Scale Real-World Dataset cs.CVPDF

Frederik Rajiv Manichand, Robin Deuber, Robert Jakob, Steve Swerling, Jamie Rosen

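TL;DR: The paper proposes a deep learning method that estimates BMI from smartphone camera images, trained on the large-scale real-world WayBED dataset (84,963 images), and introduces an automatic filtering method to improve data quality. The model performs strongly on WayBED and on the external VisualBodyToBMI dataset, is deployed on Android devices, and its code is open source.

Details

Motivation: Traditional BMI measurement may be impractical in telehealth or emergency settings, so a fast, contactless way to estimate BMI is needed. The limited size of the datasets used by existing computer vision methods constrains model performance.

Result: 1. MAPE of 7.9% on the WayBED test set, better than existing methods; 2. MAPE of 13% on the VisualBodyToBMI dataset without training on it, comparable to the best existing methods; 3. after fine-tuning, MAPE on VisualBodyToBMI falls to 8.56%, the best reported so far.

Insight: Large-scale real-world datasets are essential for improving BMI estimation models; automatic filtering effectively improves data quality; and the model generalizes well across settings.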

Abstract: Estimating Body Mass Index (BMI) from camera images with machine learning models enables rapid weight assessment when traditional methods are unavailable or impractical, such as in telehealth or emergency scenarios. Existing computer vision approaches have been limited to datasets of up to 14,500 images. In this study, we present a deep learning-based BMI estimation method trained on our WayBED dataset, a large proprietary collection of 84,963 smartphone images from 25,353 individuals. We introduce an automatic filtering method that uses posture clustering and person detection to curate the dataset by removing low-quality images, such as those with atypical postures or incomplete views. This process retained 71,322 high-quality images suitable for training. We achieve a Mean Absolute Percentage Error (MAPE) of 7.9% on our hold-out test set (WayBED data) using full-body images, the lowest value in the published literature to the best of our knowledge. Further, we achieve a MAPE of 13% on the completely unseen (during training) VisualBodyToBMI dataset, comparable with state-of-the-art approaches trained on it, demonstrating robust generalization. Lastly, we fine-tune our model on VisualBodyToBMI and achieve a MAPE of 8.56%, the lowest reported value on this dataset so far. We deploy the full pipeline, including image filtering and BMI estimation, on Android devices using the CLAID framework. We release our complete code for model training, filtering, and the CLAID package for mobile deployment as open-source contributions.
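
For readers unfamiliar with the metric, MAPE is the mean of the absolute errors expressed as a percentage of the true values. The snippet below is a minimal sketch of that standard definition; the function name `mape` and the sample BMI values are illustrative and do not come from the paper.

```python
def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent.

    Assumes every true value is non-zero, which holds for BMI.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


# Hypothetical BMI values: one estimate is off by 10%, the other is exact.
error = mape([20.0, 25.0], [22.0, 25.0])
```

Because each error is divided by the true value, the same absolute error in BMI units counts for more on a low-BMI subject than on a high-BMI one.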

[30] Contrastive Learning through Auxiliary Branch for Video Object Detection cs.CVPDF

Lucas Rakotoarivony

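TL;DR: The paper proposes Contrastive Learning through Auxiliary Branch (CLAB) for video object detection, aiming to improve robustness to image degradation without adding computational load at inference.

Details

Motivation: Object detection in videos faces image degradation such as motion blur, occlusion, and deformation; traditional methods improve performance through feature aggregation and complex post-processing, but at high computational cost.

Result: On the ImageNet VID dataset, CLAB reaches 84.0% and 85.2% mAP with ResNet-101 and ResNeXt-101, respectively, the state of the art for CNN-based models.

Insight: Contrastive learning with dynamic loss weighting can improve the robustness of video object detection without adding inference cost.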

Abstract: Video object detection is a challenging task because videos often suffer from image deterioration such as motion blur, occlusion, and deformable shapes, making it significantly more difficult than detecting objects in still images. Prior approaches have improved video object detection performance by employing feature aggregation and complex post-processing techniques, though at the cost of increased computational demands. To improve robustness to image degradation without additional computational load during inference, we introduce a straightforward yet effective Contrastive Learning through Auxiliary Branch (CLAB) method. First, we implement a contrastive auxiliary branch using a contrastive loss to enhance the feature representation capability of the video object detector’s backbone. Next, we propose a dynamic loss weighting strategy that emphasizes auxiliary feature learning early in training while gradually prioritizing the detection task as training converges. We validate our approach through comprehensive experiments and ablation studies, demonstrating consistent performance gains. Without bells and whistles, CLAB reaches a performance of 84.0% mAP and 85.2% mAP with ResNet-101 and ResNeXt-101, respectively, on the ImageNet VID dataset, thus achieving state-of-the-art performance for CNN-based models without requiring additional post-processing methods.
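
A dynamic loss weighting schedule of the kind described can be sketched as below. The abstract does not give the schedule, so a cosine decay of the auxiliary weight is assumed purely for illustration; `auxiliary_weight`, `total_loss`, `w_start`, and `w_end` are hypothetical names.

```python
import math


def auxiliary_weight(step, total_steps, w_start=1.0, w_end=0.0):
    """Assumed cosine decay: weight starts at w_start and ends at w_end."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * progress))


def total_loss(detection_loss, contrastive_loss, step, total_steps):
    """Detection loss plus the contrastive auxiliary loss, down-weighted over time."""
    return detection_loss + auxiliary_weight(step, total_steps) * contrastive_loss


early = total_loss(1.0, 2.0, step=0, total_steps=100)    # auxiliary term at full weight
late = total_loss(1.0, 2.0, step=100, total_steps=100)   # auxiliary term switched off
```

Since the auxiliary branch only shapes the backbone during training, it can be dropped at inference, which is consistent with the abstract's claim of no additional inference cost.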

[31] Towards Mechanistic Defenses Against Typographic Attacks in CLIP cs.CV | cs.AIPDF

Lorenz Hufe, Constantin Venhoff, Maximilian Dreyer, Sebastian Lapuschkin, Wojciech Samek

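TL;DR: The paper studies typographic attacks (text injected into images) on CLIP models and proposes a finetuning-free defense that significantly improves robustness by selectively ablating specific attention heads.

Details

Motivation: Multimodal systems such as CLIP are vulnerable to typographic attacks, which lead to targeted misclassification, malicious content generation, and even model jailbreaks. Defending against them therefore has clear practical value.

Result: The method improves performance by up to 19.6% on a typographic variant of ImageNet-100 while reducing standard ImageNet-100 accuracy by less than 1%, and is competitive with existing defenses that rely on finetuning.

Insight: The study shows that CLIP contains attention heads specialized for processing typographic information, which suggests a new way to defend multimodal systems against such attacks: intervening on specific modules rather than adjusting the whole model.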

Abstract: Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model’s layers that causally extract and transmit typographic information to the cls token. Building on these insights, we introduce a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit, consisting of attention heads. Without requiring finetuning, our method improves performance by up to 19.6% on a typographic variant of ImageNet-100, while reducing standard ImageNet-100 accuracy by less than 1%. Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning. To this end, we release a family of dyslexic CLIP models which are significantly more robust against typographic attacks. These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
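
Head ablation in general can be illustrated with a toy sketch. It assumes that each attention head contributes an additive vector to the cls token representation and that ablating a head means replacing its contribution with zeros; the abstract does not say whether the authors zero the heads or use another replacement, and the function names are hypothetical.

```python
def ablate_heads(head_outputs, heads_to_ablate):
    """Zero the contribution of selected heads (assumed zero-ablation).

    head_outputs: list indexed by head, each a list of floats giving that
                  head's additive contribution to the cls token.
    heads_to_ablate: set of head indices forming the circuit to remove.
    """
    return [
        [0.0] * len(out) if head in heads_to_ablate else list(out)
        for head, out in enumerate(head_outputs)
    ]


def combine_heads(head_outputs):
    """Sum per-head contributions into one vector."""
    return [sum(values) for values in zip(*head_outputs)]


heads = [[1.0, 2.0], [10.0, 20.0], [0.5, 0.5]]
# Suppose head 1 was identified as carrying typographic information.
cls_without_circuit = combine_heads(ablate_heads(heads, {1}))
```

No weights are updated in this procedure, which is why such a defense needs no finetuning.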

[32] GLaRE: A Graph-based Landmark Region Embedding Network for Emotion Recognition cs.CVPDF

Debasis Maji, Debaditya Barman

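TL;DR: The paper proposes GLaRE, a new emotion recognition method based on graph neural networks (GNNs), which builds a quotient graph through hierarchical coarsening and uses region-level embeddings to improve performance.

Details

Motivation: Traditional facial expression recognition (FER) methods are limited by occlusion, expression variability, and a lack of interpretability; GNNs offer structured learning that models relationships between facial landmarks.

Result: Reaches 64.89% accuracy on AffectNet and 94.24% on FERG, outperforming several existing baselines.

Insight: Region-level embeddings are critical to the performance gain, and the quotient graph structure reduces computational complexity while preserving spatial information.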

Abstract: Facial expression recognition (FER) is a crucial task in computer vision with a wide range of applications including human-computer interaction, surveillance, and assistive technologies. However, challenges such as occlusion, expression variability, and lack of interpretability hinder the performance of traditional FER systems. Graph Neural Networks (GNNs) offer a powerful alternative by modeling relational dependencies between facial landmarks, enabling structured and interpretable learning. In this paper, we propose GLaRE, a novel Graph-based Landmark Region Embedding network for emotion recognition. Facial landmarks are extracted using 3D facial alignment, and a quotient graph is constructed via hierarchical coarsening to preserve spatial structure while reducing complexity. Our method achieves 64.89% accuracy on AffectNet and 94.24% on FERG, outperforming several existing baselines. Additionally, ablation studies demonstrate that region-level embeddings from quotient graphs contribute to improved prediction performance.
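
A quotient graph collapses each group of nodes into a single node and keeps an edge between two groups whenever any of their members were connected. The sketch below shows a single coarsening level under that standard definition; the landmark indices and region names are made up for illustration and are not the partition used by GLaRE.

```python
def quotient_graph(edges, region_of):
    """Collapse a landmark graph into a region-level quotient graph.

    edges: iterable of (landmark_a, landmark_b) pairs.
    region_of: dict mapping each landmark to its region label.
    Returns the sorted list of undirected region-to-region edges.
    """
    region_edges = set()
    for a, b in edges:
        ra, rb = region_of[a], region_of[b]
        if ra != rb:  # edges inside one region disappear in the quotient
            region_edges.add((min(ra, rb), max(ra, rb)))
    return sorted(region_edges)


# Hypothetical landmarks 0..3 chained together and grouped into three regions.
landmark_edges = [(0, 1), (1, 2), (2, 3)]
regions = {0: "eye", 1: "eye", 2: "nose", 3: "mouth"}
coarse_edges = quotient_graph(landmark_edges, regions)
```

Here four landmarks and three edges reduce to three regions and two edges, which is the complexity reduction the abstract refers to.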

[33] FastFit: Accelerating Multi-Reference Virtual Try-On via Cacheable Diffusion Models cs.CV | 68T42 (Primary) 168T45 (Secondary) | I.4.9PDF

Zheng Chong, Yanwei Lei, Shiyue Zhang, Zhuandi He, Zhen Wang

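TL;DR: FastFit accelerates multi-reference virtual try-on with a cacheable diffusion model, addressing the efficiency and versatility limits of existing methods, and releases a new dataset, DressCode-MR.

Details

Motivation: Existing virtual try-on techniques cannot support multi-reference compositions (such as garments and accessories) and are inefficient because of redundant computation.

Result: FastFit performs strongly on VITON-HD, DressCode, and DressCode-MR, with an average 3.5x speedup.

Insight: Decoupling reference feature encoding from the denoising process is the key to efficiency, and building a multi-reference dataset advances research on complex try-on tasks.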

Abstract: Despite its great potential, virtual try-on technology is hindered from real-world application by two major challenges: the inability of current methods to support multi-reference outfit compositions (including garments and accessories), and their significant inefficiency caused by the redundant re-computation of reference features in each denoising step. To address these challenges, we propose FastFit, a high-speed multi-reference virtual try-on framework based on a novel cacheable diffusion architecture. By employing a Semi-Attention mechanism and substituting traditional timestep embeddings with class embeddings for reference items, our model fully decouples reference feature encoding from the denoising process with negligible parameter overhead. This allows reference features to be computed only once and losslessly reused across all steps, fundamentally breaking the efficiency bottleneck and achieving an average 3.5x speedup over comparable methods. Furthermore, to facilitate research on complex, multi-reference virtual try-on, we introduce DressCode-MR, a new large-scale dataset. It comprises 28,179 sets of high-quality, paired images covering five key categories (tops, bottoms, dresses, shoes, and bags), constructed through a pipeline of expert models and human feedback refinement. Extensive experiments on the VITON-HD, DressCode, and our DressCode-MR datasets show that FastFit surpasses state-of-the-art methods on key fidelity metrics while offering a significant advantage in inference efficiency.
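
The efficiency argument rests on a simple pattern: if reference features do not depend on the denoising timestep, they can be computed once before the loop instead of once per step. The sketch below shows only that control flow with placeholder callables; `run_denoising`, `encode_reference`, and `denoise_step` are hypothetical names and stand in for the real model components.

```python
def run_denoising(latent, reference_items, num_steps, encode_reference, denoise_step):
    """Denoise for num_steps while encoding each reference item only once.

    encode_reference(item) -> features; must not depend on the timestep.
    denoise_step(latent, cached_features, step) -> updated latent.
    """
    # Timestep-independent reference features are computed before the loop.
    cached_features = [encode_reference(item) for item in reference_items]
    for step in range(num_steps):
        # Every step reuses the same cached features.
        latent = denoise_step(latent, cached_features, step)
    return latent


# Toy stand-ins that count how often the encoder runs.
encoder_calls = []

def toy_encoder(item):
    encoder_calls.append(item)
    return len(item)

def toy_step(latent, features, step):
    return latent + sum(features)

result = run_denoising(0, ["top", "bag"], 4, toy_encoder, toy_step)
```

With two reference items and four steps, the encoder runs twice rather than eight times; the saving grows with the number of denoising steps.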

[34] UTA-Sign: Unsupervised Thermal Video Augmentation via Event-Assisted Traffic Signage Sketching cs.CVPDF

Yuqi Han, Songqian Zhang, Weijian Su, Ke Li, Jiayu Yang

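TL;DR: The paper proposes UTA-Sign, an unsupervised thermal-event video augmentation method for traffic signage recognition in low-light environments, which fuses thermal frames and event signals to address thermal imaging blind spots and the non-uniform sampling of event cameras.

Details

Motivation: Thermal cameras perform well in low light but struggle to capture signage on objects made of similar materials, whereas event cameras detect light intensity changes under high dynamic range. Because the two are complementary, the paper combines both modalities to improve traffic signage recognition accuracy.

Result: Validated on datasets from real-world scenarios, the method delivers strong traffic signage sketching quality and improved perception-level detection accuracy.

Insight: Fusing thermal imaging with event cameras compensates for the weaknesses of each modality, especially in low light, with potential applications such as autonomous driving and unmanned navigation.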

Abstract: The thermal camera excels at perceiving outdoor environments under low-light conditions, making it ideal for applications such as nighttime autonomous driving and unmanned navigation. However, thermal cameras encounter challenges when capturing signage from objects made of similar materials, which can pose safety risks for accurately understanding semantics in autonomous driving systems. In contrast, the neuromorphic vision camera, also known as an event camera, detects changes in light intensity asynchronously and has proven effective in high-speed, low-light traffic environments. Recognizing the complementary characteristics of these two modalities, this paper proposes UTA-Sign, an unsupervised thermal-event video augmentation for traffic signage in low-illumination environments, targeting elements such as license plates and roadblock indicators. To address the signage blind spots of thermal imaging and the non-uniform sampling of event cameras, we developed a dual-boosting mechanism that fuses thermal frames and event signals for consistent signage representation over time. The proposed method utilizes thermal frames to provide accurate motion cues as temporal references for aligning the uneven event signals. At the same time, event signals contribute subtle signage content to the raw thermal frames, enhancing the overall understanding of the environment. The proposed method is validated on datasets collected from real-world scenarios, demonstrating superior quality in traffic signage sketching and improved detection accuracy at the perceptual level.

[35] Optimization-Based Calibration for Intravascular Ultrasound Volume Reconstruction cs.CVPDF

Karl-Philippe Beaudet, Sidaty El Hadramy, Philippe C Cattin, Juan Verde, Stéphane Cotin

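TL;DR: The paper proposes an optimization-based calibration method for accurate reconstruction of 3D intravascular ultrasound (IVUS) volumes, to improve intraoperative navigation in liver surgery. Calibration uses a 3D-printed phantom, and its accuracy is validated.

Details

Motivation: Intraoperative ultrasound images are hard to interpret in liver surgery because of the limited field of view and complex anatomy. The paper aims to reconstruct the whole organ with 3D IVUS to bridge the gap between preoperative CT and intraoperative ultrasound data.

Result: The method is validated with calibration errors of 0.88 to 1.80 mm and registration errors of 3.40 to 5.71 mm.

Insight: The method provides a reliable and accurate way to register intraoperative ultrasound with preoperative CT images and can enhance intraoperative guidance in liver surgery.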

Abstract: Intraoperative ultrasound images are inherently challenging to interpret in liver surgery due to the limited field of view and complex anatomical structures. Bridging the gap between preoperative and intraoperative data is crucial for effective surgical guidance. 3D IntraVascular UltraSound (IVUS) offers a potential solution by enabling the reconstruction of the entire organ, which facilitates registration between preoperative computed tomography (CT) scans and intraoperative IVUS images. In this work, we propose an optimization-based calibration method using a 3D-printed phantom for accurate 3D Intravascular Ultrasound volume reconstruction. Our approach ensures precise alignment of tracked IVUS data with preoperative CT images, improving intraoperative navigation. We validated our method using in vivo swine liver images, achieving a calibration error from 0.88 to 1.80 mm and a registration error from 3.40 to 5.71 mm between the 3D IVUS data and the corresponding CT scan. Our method provides a reliable and accurate means of calibration and volume reconstruction. It can be used to register intraoperative ultrasound images with preoperative CT images in the context of liver surgery, and enhance intraoperative guidance.

[36] Revisiting the Privacy Risks of Split Inference: A GAN-Based Data Reconstruction Attack via Progressive Feature Optimization cs.CV | cs.CR

Yixiang Qiu, Yanhan Liu, Hongyao Yu, Hao Fang, Bin Chen

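TL;DR: This paper proposes a new GAN-based data reconstruction attack framework that uses Progressive Feature Optimization (PFO) to improve the quality of reconstructions from intermediate features in Split Inference (SI).

Details

Motivation: As Deep Neural Networks (DNNs) grow more complex, Split Inference (SI) is used to reduce latency and protect user privacy. However, existing Data Reconstruction Attacks (DRAs) have limited effect on deep models and do not fully exploit semantic priors.

Result: Experiments show that the method significantly outperforms existing attacks in high-resolution, out-of-distribution, and complex-DNN settings.

Insight: The method exposes the privacy risk of intermediate features in split inference and provides an important reference for future defense mechanisms.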

Abstract: The growing complexity of Deep Neural Networks (DNNs) has led to the adoption of Split Inference (SI), a collaborative paradigm that partitions computation between edge devices and the cloud to reduce latency and protect user privacy. However, recent advances in Data Reconstruction Attacks (DRAs) reveal that intermediate features exchanged in SI can be exploited to recover sensitive input data, posing significant privacy risks. Existing DRAs are typically effective only on shallow models and fail to fully leverage semantic priors, limiting their reconstruction quality and generalizability across datasets and model architectures. In this paper, we propose a novel GAN-based DRA framework with Progressive Feature Optimization (PFO), which decomposes the generator into hierarchical blocks and incrementally refines intermediate representations to enhance the semantic fidelity of reconstructed images. To stabilize the optimization and improve image realism, we introduce an L1-ball constraint during reconstruction. Extensive experiments show that our method outperforms prior attacks by a large margin, especially in high-resolution scenarios, out-of-distribution settings, and against deeper and more complex DNNs.
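The abstract mentions an L1-ball constraint but does not say how it is enforced. One common way to keep a vector inside an L1 ball is Euclidean projection by sorting and soft-thresholding; the sketch below shows that generic technique and is not taken from the paper. The function name and the `radius` parameter are illustrative assumptions.

```python
import math


def project_onto_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto {x : sum(|x_i|) <= radius}.

    Generic sort-and-threshold projection; an assumed stand-in for
    whatever constraint handling the paper actually uses.
    """
    if radius <= 0:
        raise ValueError("radius must be positive")
    magnitudes = [abs(x) for x in v]
    if sum(magnitudes) <= radius:
        return list(v)  # already inside the ball, nothing to do

    # Find the soft threshold theta from the sorted magnitudes.
    ordered = sorted(magnitudes, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(ordered, start=1):
        cumulative += u_k
        candidate = (cumulative - radius) / k
        if u_k - candidate > 0:
            theta = candidate
        else:
            break

    # Shrink every coordinate by theta, clip at zero, keep the sign.
    return [math.copysign(max(m - theta, 0.0), x) for m, x in zip(magnitudes, v)]
```

In an optimization loop, such a projection would typically be applied to the optimized variable after each gradient step, so that the iterate never leaves the feasible set.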

[37] EmoCAST: Emotional Talking Portrait via Emotive Text Description cs.CV

Yiguo Jiang, Xiaodong Cun, Yong Zhang, Yudian Zheng, Fan Tang

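TL;DR: EmoCAST is a diffusion-based framework that generates high-quality emotional talking-portrait videos through precise text-driven emotion synthesis, addressing the limits of existing methods in flexibility, naturalness, and expression quality.

Details

Motivation: Existing emotional talking-head synthesis methods fall short in control flexibility, motion naturalness, and expression quality, and available datasets are mostly collected in lab settings, which limits practical use. EmoCAST aims to address these problems.

Result: EmoCAST performs strongly at generating realistic, emotionally expressive, audio-synchronized talking-head videos, reaching state-of-the-art performance.

Insight: Combining a text-guided decoupled emotive module with an emotive audio attention module improves the precision and naturalness of emotion synthesis; building a richer dataset and refining the training strategy are key to improving the model.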

Abstract: Emotional talking head synthesis aims to generate talking portrait videos with vivid expressions. Existing methods still exhibit limitations in control flexibility, motion naturalness, and expression quality. Moreover, currently available datasets are primarily collected in lab settings, further exacerbating these shortcomings. Consequently, these limitations substantially hinder practical applications in real-world scenarios. To address these challenges, we propose EmoCAST, a diffusion-based framework with two key modules for precise text-driven emotional synthesis. In appearance modeling, emotional prompts are integrated through a text-guided decoupled emotive module, enhancing the spatial knowledge to improve emotion comprehension. To improve the relationship between audio and emotion, we introduce an emotive audio attention module to capture the interplay between controlled emotion and driving audio, generating emotion-aware features to guide more precise facial motion synthesis. Additionally, we construct an emotional talking head dataset with comprehensive emotive text descriptions to optimize the framework’s performance. Based on the proposed dataset, we propose an emotion-aware sampling training strategy and a progressive functional training strategy that further improve the model’s ability to capture nuanced expressive features and achieve accurate lip-synchronization. Overall, EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos. Project Page: https://github.com/GVCLab/EmoCAST

[38] AvatarBack: Back-Head Generation for Complete 3D Avatars from Front-View Images cs.CV

Shiqi Xin, Xiaolin Zhang, Yanbin Liu, Peng Zhang, Caifeng Shan

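TL;DR: AvatarBack is a novel plug-and-play framework that uses generated pseudo-images and an adaptive spatial alignment strategy to resolve inconsistent back-head geometry when reconstructing 3D avatars from frontal images.

Details

Motivation: Existing 3D avatar reconstruction methods rely mainly on frontal images, so the back of the head is poorly modeled, with geometric inconsistencies and structural blurring that reduce overall realism.

Result: Experiments on the NeRSemble and K-hairstyle datasets show that AvatarBack significantly improves back-head reconstruction quality while preserving frontal fidelity and animatability.

Insight: Generated pseudo-images and multi-view supervision can fill in geometric information missing from sparse inputs, while the adaptive alignment strategy resolves coordinate discrepancies between the synthetic views and the 3D Gaussian representation.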

Abstract: Recent advances in Gaussian Splatting have significantly boosted the reconstruction of head avatars, enabling high-quality facial modeling by representing a 3D avatar as a collection of 3D Gaussians. However, existing methods predominantly rely on frontal-view images, leaving the back-head poorly constructed. This leads to geometric inconsistencies, structural blurring, and reduced realism in the rear regions, ultimately limiting the fidelity of reconstructed avatars. To address this challenge, we propose AvatarBack, a novel plug-and-play framework specifically designed to reconstruct complete and consistent 3D Gaussian avatars by explicitly modeling the missing back-head regions. AvatarBack integrates two core technical innovations, i.e., the Subject-specific Generator (SSG) and the Adaptive Spatial Alignment Strategy (ASA). The former leverages a generative prior to synthesize identity-consistent, plausible back-view pseudo-images from sparse frontal inputs, providing robust multi-view supervision. To achieve precise geometric alignment between these synthetic views and the 3D Gaussian representation, the latter employs learnable transformation matrices optimized during training, effectively resolving inherent pose and coordinate discrepancies. Extensive experiments on NeRSemble and K-hairstyle datasets, evaluated using geometric, photometric, and GPT-4o-based perceptual metrics, demonstrate that AvatarBack significantly enhances back-head reconstruction quality while preserving frontal fidelity. Moreover, the reconstructed avatars maintain consistent visual realism under diverse motions and remain fully animatable.

[39] ArtFace: Towards Historical Portrait Face Identification via Model Adaptation cs.CV | cs.AI

Francois Poh, Anjith George, Sébastien Marcel

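TL;DR: The paper proposes ArtFace, which fine-tunes foundation models and combines their embeddings with those of conventional face recognition networks, significantly improving face recognition in historical portrait paintings.

Details

Motivation: Face identification in historical portraits faces scarce data, stylistic variation, and domain shift, and conventional face recognition models perform poorly on the task.

Result: The method significantly outperforms existing methods on historical portrait face identification.

Insight: Foundation models can compensate for the shortcomings of conventional methods in cross-domain face recognition.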

Abstract: Identifying sitters in historical paintings is a key task for art historians, offering insight into their lives and how they chose to be seen. However, the process is often subjective and limited by the lack of data and stylistic variations. Automated facial recognition is capable of handling challenging conditions and can assist, but while traditional facial recognition models perform well on photographs, they struggle with paintings due to domain shift and high intra-class variation. Artistic factors such as style, skill, intent, and influence from other works further complicate recognition. In this work, we investigate the potential of foundation models to improve facial recognition in artworks. By fine-tuning foundation models and integrating their embeddings with those from conventional facial recognition networks, we demonstrate notable improvements over current state-of-the-art methods. Our results show that foundation models can bridge the gap where traditional methods are ineffective. Paper page at https://www.idiap.ch/paper/artface/

[40] CraftGraffiti: Exploring Human Identity with Custom Graffiti Art via Facial-Preserving Diffusion Models cs.CV

Ayan Banerjee, Fernando Vilariño, Josep Lladós

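TL;DR: CraftGraffiti is an end-to-end text-guided graffiti generation framework that uses facial-preserving diffusion models to address the challenge of preserving facial identity in graffiti art.

Details

Motivation: In graffiti, a high-contrast, abstract medium, keeping a face recognizable is a major challenge, especially under extreme stylistic transformation.

Result: The framework performs well on facial feature consistency, aesthetic scores, and human preference, and shows potential for creative applications in real-world settings.

Insight: The "style-first, identity-after" paradigm outperforms the reverse order and reduces attribute drift, offering a new way to balance stylistic freedom and facial recognizability in creative AI applications.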

Abstract: Preserving facial identity under extreme stylistic transformation remains a major challenge in generative art. In graffiti, a high-contrast, abstract medium, subtle distortions to the eyes, nose, or mouth can erase the subject’s recognizability, undermining both personal and cultural authenticity. We present CraftGraffiti, an end-to-end text-guided graffiti generation framework designed with facial feature preservation as a primary objective. Given an input image and a prompt describing style and pose, CraftGraffiti first applies graffiti style transfer via a LoRA-fine-tuned pretrained diffusion transformer, then enforces identity fidelity through a face-consistent self-attention mechanism that augments attention layers with explicit identity embeddings. Pose customization is achieved without keypoints, using CLIP-guided prompt extension to enable dynamic re-posing while retaining facial coherence. We formally justify and empirically validate the “style-first, identity-after” paradigm, showing it reduces attribute drift compared to the reverse order. Quantitative results demonstrate competitive facial feature consistency and state-of-the-art aesthetic and human preference scores, while qualitative analyses and a live deployment at the Cruilla Festival highlight the system’s real-world creative impact. CraftGraffiti advances the goal of identity-respectful AI-assisted artistry, offering a principled approach for blending stylistic freedom with recognizability in creative AI applications.

[41] Improving Alignment in LVLMs with Debiased Self-Judgment cs.CV | cs.CL

Sihan Yang, Chenhang Cui, Zihao Zhao, Yiyang Zhou, Weilong Yan

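TL;DR: This paper proposes a self-evaluation method called the "debiased self-judgment score" to improve modality alignment in Large Visual-Language Models (LVLMs), reducing hallucinations and improving safety.

Details

Motivation: LVLMs struggle with modality alignment, which often leads to hallucinations and safety issues. Existing methods depend on external resources or human annotation, which is costly and hard to scale. The authors therefore propose a self-evaluation method that needs no external resources.

Result: Experiments show that the method significantly outperforms traditional methods in reducing hallucinations and improving safety and overall capability.

Insight: Self-evaluation offers a low-cost, scalable approach to modality alignment and opens a new direction for autonomous model improvement.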

Abstract: The rapid advancements in Large Language Models (LLMs) and Large Visual-Language Models (LVLMs) have opened up new opportunities for integrating visual and linguistic modalities. However, effectively aligning these modalities remains challenging, often leading to hallucinations (where generated outputs are not grounded in the visual input) and raising safety concerns across various domains. Existing alignment methods, such as instruction tuning and preference tuning, often rely on external datasets, human annotations, or complex post-processing, which limit scalability and increase costs. To address these challenges, we propose a novel approach that generates the debiased self-judgment score, a self-evaluation metric created internally by the model without relying on external resources. This enables the model to autonomously improve alignment. Our method enhances both decoding strategies and preference tuning processes, resulting in reduced hallucinations, enhanced safety, and improved overall capability. Empirical results show that our approach significantly outperforms traditional methods, offering a more effective solution for aligning LVLMs.

[42] “Humor, Art, or Misinformation?”: A Multimodal Dataset for Intent-Aware Synthetic Image Detection cs.CV | cs.MM

Anastasios Skoularikis, Stefanos-Iordanis Papadopoulos, Symeon Papadopoulos, Panagiotis C. Petrantonakis

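TL;DR: The paper introduces S-HArM, a multimodal dataset for intent-aware synthetic image detection, and compares how different prompting strategies affect model performance. Multimodal approaches that preserve visual context perform better, but overall performance remains limited.

Details

Motivation: Existing work largely overlooks the intent behind AI-generated images (such as humor, art, or misinformation), so a new dataset is needed to fill this gap.

Result: Models trained on image-guided and multimodally-guided data perform better on "in the wild" data, but overall performance still has room to improve.

Insight: Inferring the intent behind generated images is a complex task that calls for more specialized architectures.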

Abstract: Recent advances in multimodal AI have enabled progress in detecting synthetic and out-of-context content. However, existing efforts largely overlook the intent behind AI-generated images. To fill this gap, we introduce S-HArM, a multimodal dataset for intent-aware classification, comprising 9,576 “in the wild” image-text pairs from Twitter/X and Reddit, labeled as Humor/Satire, Art, or Misinformation. Additionally, we explore three prompting strategies (image-guided, description-guided, and multimodally-guided) to construct a large-scale synthetic training dataset with Stable Diffusion. We conduct an extensive comparative study including modality fusion, contrastive learning, reconstruction networks, attention mechanisms, and large vision-language models. Our results show that models trained on image- and multimodally-guided data generalize better to “in the wild” content, due to preserved visual context. However, overall performance remains limited, highlighting the complexity of inferring intent and the need for specialized architectures.

Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander Toshev

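TL;DR: MobileCLIP2 improves multi-modal reinforced training by combining better CLIP teacher ensembles with improved captioner teachers, raising zero-shot accuracy while keeping low latency and a small parameter count.

Details

Motivation: Foundation image-text models such as CLIP are widely used for their zero-shot capabilities, but the key challenge is how to further refine their training to improve accuracy while keeping latency low and parameter counts small.

Result: MobileCLIP2-B improves ImageNet-1k zero-shot accuracy by 2.2% over MobileCLIP-B. MobileCLIP2-S4 matches the accuracy of SigLIP-SO400M/14 while being 2x smaller, and improves on DFN ViT-L/14 at 2.5x lower latency.

Insight: Key findings are the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of captioner fine-tuning for caption diversity, and the additive improvement from combining synthetic captions generated by multiple models.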

Abstract: Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models at 3-15ms latency and 50-150M parameters with state-of-the-art zero-shot accuracy. The main ingredients in MobileCLIP were its low-latency and light architectures and a novel multi-modal reinforced training that made knowledge distillation from multiple caption-generators and CLIP teachers efficient, scalable, and reproducible. In this paper, we improve the multi-modal reinforced training of MobileCLIP through: 1) better CLIP teacher ensembles trained on the DFN dataset, 2) improved captioner teachers trained on the DFN dataset and fine-tuned on a diverse selection of high-quality image-caption datasets. We discover new insights through ablations such as the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of caption-generator fine-tuning for caption diversity, and the additive improvement from combining synthetic captions generated by multiple models. We train a new family of models called MobileCLIP2 and achieve state-of-the-art ImageNet-1k zero-shot accuracies at low latencies. In particular, we observe 2.2% improvement in ImageNet-1k accuracy for MobileCLIP2-B compared with MobileCLIP-B architecture. Notably, MobileCLIP2-S4 matches the zero-shot accuracy of SigLIP-SO400M/14 on ImageNet-1k while being 2$\times$ smaller and improves on DFN ViT-L/14 at 2.5$\times$ lower latency. We release our pretrained models (https://github.com/apple/ml-mobileclip) and the data generation code (https://github.com/apple/ml-mobileclip-dr). The data generation code makes it easy to create new reinforced datasets with arbitrary teachers using distributed scalable processing.

[44] Learned Rate Control for Frame-Level Adaptive Neural Video Compression via Dynamic Neural Network cs.CV

Chenhao Zhang, Wei Gao

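TL;DR: The paper proposes a dynamic video compression framework that achieves variable bitrate control through a Dynamic-Route Autoencoder and a Rate Control Agent, significantly improving compression performance.

Details

Motivation: Neural Video Compression (NVC) has made remarkable progress in recent years, but learned codecs still struggle with precise rate control. To address this, the paper proposes a dynamic framework for variable bitrate scenarios.

Result: Experiments on the HEVC and UVG datasets show an average BD-Rate reduction of 14.8% and a BD-PSNR gain of 0.47 dB over existing methods, with an average bitrate error of only 1.66%.

Insight: Combining dynamic routing with a rate control agent offers a new optimization direction for learned video compression and addresses the difficulty of bitrate control.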

Abstract: Neural Video Compression (NVC) has achieved remarkable performance in recent years. However, precise rate control remains a challenge due to the inherent limitations of learning-based codecs. To solve this issue, we propose a dynamic video compression framework designed for variable bitrate scenarios. First, to achieve variable bitrate implementation, we propose the Dynamic-Route Autoencoder with variable coding routes, each occupying partial computational complexity of the whole network and navigating to a distinct RD trade-off. Second, to approach the target bitrate, the Rate Control Agent estimates the bitrate of each route and adjusts the coding route of DRA at run time. To encompass a broad spectrum of variable bitrates while preserving overall RD performance, we employ the Joint-Routes Optimization strategy, achieving collaborative training of various routes. Extensive experiments on the HEVC and UVG datasets show that the proposed method achieves an average BD-Rate reduction of 14.8% and BD-PSNR gain of 0.47dB over state-of-the-art methods while maintaining an average bitrate error of 1.66%, achieving Rate-Distortion-Complexity Optimization (RDCO) for various bitrate and bitrate-constrained applications. Our code is available at https://git.openi.org.cn/OpenAICoding/DynamicDVC.
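The abstract describes a Rate Control Agent that estimates the bitrate of each coding route and switches routes at run time to approach a target. The sketch below is a deliberately simple, hypothetical stand-in: it picks the route whose estimate is nearest the target and measures the relative bitrate error. The paper's agent is learned and its selection rule is not given in the abstract, so both function names and the nearest-estimate rule are assumptions.

```python
def select_route(estimated_bitrates, target_bitrate):
    """Return the index of the route whose estimated bitrate is closest to the target.

    Hypothetical greedy rule, not the paper's Rate Control Agent.
    """
    if not estimated_bitrates:
        raise ValueError("need at least one coding route")
    return min(
        range(len(estimated_bitrates)),
        key=lambda i: abs(estimated_bitrates[i] - target_bitrate),
    )


def relative_bitrate_error(actual_bitrate, target_bitrate):
    """Relative deviation of the achieved bitrate from the target (0.0166 means 1.66%)."""
    if target_bitrate <= 0:
        raise ValueError("target bitrate must be positive")
    return abs(actual_bitrate - target_bitrate) / target_bitrate
```

For example, with per-route estimates of 0.05, 0.10, 0.20, and 0.40 bits per pixel and a target of 0.18, the rule selects the third route (index 2).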

[45] Mix, Align, Distil: Reliable Cross-Domain Atypical Mitosis Classification cs.CV

Kaustubh Atey, Sameer Anand Jha, Gouranga Bala, Amit Sethi

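TL;DR: This paper presents a simple training recipe for cross-domain atypical mitosis classification that uses style perturbations, feature alignment, and knowledge distillation to improve robustness and performance under domain shift.

Details

Motivation: Atypical mitotic figures (AMFs) are important histopathological markers, but domain shift caused by scanner, stain, and acquisition differences makes them hard to identify consistently.

Result: On MIDOG 2025 Task 2, the model achieves a balanced accuracy of 0.8762, sensitivity of 0.8873, specificity of 0.8651, and ROC AUC of 0.9499.

Insight: The method relies only on coarse domain metadata yet maintains strong performance under domain shift, offering a reliable cross-domain approach for pathology image classification.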

Abstract: Atypical mitotic figures (AMFs) are important histopathological markers yet remain challenging to identify consistently, particularly under domain shift stemming from scanner, stain, and acquisition differences. We present a simple training-time recipe for domain-robust AMF classification in MIDOG 2025 Task 2. The approach (i) increases feature diversity via style perturbations inserted at early and mid backbone stages, (ii) aligns attention-refined features across sites using weak domain labels (Scanner, Origin, Species, Tumor) through an auxiliary alignment loss, and (iii) stabilizes predictions by distilling from an exponential moving average (EMA) teacher with temperature-scaled KL divergence. On the organizer-run preliminary leaderboard for atypical mitosis classification, our submission attains balanced accuracy of 0.8762, sensitivity of 0.8873, specificity of 0.8651, and ROC AUC of 0.9499. The method incurs negligible inference-time overhead, relies only on coarse domain metadata, and delivers strong, balanced performance, positioning it as a competitive submission for the MIDOG 2025 challenge.
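Step (iii) of the recipe, distilling from an EMA teacher with a temperature-scaled KL divergence, can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the decay value, the temperature, the direction KL(teacher || student), and the T^2 scaling are common conventions assumed here, and weights are represented as flat lists for simplicity.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2 (assumed convention)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def ema_update(teacher_weights, student_weights, decay=0.999):
    """Move each teacher weight a small step toward the matching student weight."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_weights, student_weights)]
```

In a training loop of this kind, the student is updated by gradient descent on the task loss plus the distillation term, and `ema_update` is called after each step so the teacher stays a smoothed copy of the student.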

[46] Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning cs.CV

Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu

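TL;DR: The paper proposes Pref-GRPO, which replaces pointwise rewards with pairwise preference rewards to address reward hacking in text-to-image reinforcement learning and stabilize training, and introduces the UniGenBench benchmark for comprehensive model evaluation.

Details

Motivation: Existing GRPO-based text-to-image methods use pointwise reward models, which are prone to reward hacking: small score differences are amplified, driving the model to over-optimize for trivial gains.

Result: Experiments show that Pref-GRPO distinguishes subtle differences in image quality and provides more stable advantages; UniGenBench reveals the strengths and weaknesses of open-source and closed-source models.

Insight: Pairwise preference rewards can effectively mitigate reward hacking, and fine-grained benchmark design supports comprehensive evaluation of model performance.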

Abstract: Recent advancements highlight the importance of GRPO-based reinforcement learning methods and benchmarking in enhancing text-to-image (T2I) generation. However, current methods using pointwise reward models (RM) for scoring generated images are susceptible to reward hacking. We reveal that this happens when minimal score differences between images are amplified after normalization, creating illusory advantages that drive the model to over-optimize for trivial gains, ultimately destabilizing the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, images are pairwise compared within each group using a preference RM, and the win rate is used as the reward signal. Extensive experiments demonstrate that Pref-GRPO differentiates subtle image quality differences, providing more stable advantages and mitigating reward hacking. Additionally, existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment. To solve this, we introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes. It evaluates semantic consistency through 10 primary and 27 sub-criteria, leveraging an MLLM for benchmark construction and evaluation. Our benchmarks uncover the strengths and weaknesses of both open and closed-source T2I models and validate the effectiveness of Pref-GRPO.
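The two reward signals contrasted in the abstract can be sketched as follows. The first function applies group-wise normalization to pointwise scores, which is where tiny score gaps become large advantages; the second computes a win-rate reward from pairwise comparisons. This is an illustrative sketch only: `prefers` is a hypothetical callable standing in for a preference reward model, and the exact normalization used in the paper is not given in the abstract.

```python
import statistics


def group_normalized_advantages(rewards, eps=1e-8):
    """Subtract the group mean and divide by the group standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def win_rate_rewards(items, prefers):
    """Reward each item by the fraction of pairwise comparisons it wins in its group.

    prefers(a, b) must return True when a is preferred over b; it is a
    hypothetical stand-in for a pairwise preference reward model.
    """
    n = len(items)
    if n < 2:
        raise ValueError("need at least two items to compare")
    rewards = []
    for i in range(n):
        wins = sum(1 for j in range(n) if j != i and prefers(items[i], items[j]))
        rewards.append(wins / (n - 1))
    return rewards
```

With pointwise scores of 0.900, 0.901, 0.902, and 0.903, normalization yields advantages of roughly -1.34, -0.45, 0.45, and 1.34, so a 0.003 spread in raw score is stretched across the full advantage range. That is the amplification effect the abstract describes.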

[47] ${C}^{3}$-GS: Learning Context-aware, Cross-dimension, Cross-scale Feature for Generalizable Gaussian Splatting cs.CV | cs.AI

Yuxi Hu, Jun Zhang, Kuangyi Chen, Zhe Zhang, Friedrich Fraundorfer

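TL;DR: C3-GS is a framework that strengthens feature learning through context-aware, cross-dimension, and cross-scale constraints, improving Gaussian Splatting quality and generalization from sparse views.

Details

Motivation: Existing methods struggle to encode discriminative, multi-view consistent features from sparse views, which leads to inaccurate geometry in the predicted Gaussians.

Result: Experiments on benchmark datasets show that C3-GS achieves state-of-the-art rendering quality and generalization ability.

Insight: Feature learning that combines multiple constraints can significantly improve rendering from sparse views.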

Abstract: Generalizable Gaussian Splatting aims to synthesize novel views for unseen scenes without per-scene optimization. In particular, recent advancements utilize feed-forward networks to predict per-pixel Gaussian parameters, enabling high-quality synthesis from sparse input views. However, existing approaches fall short in encoding discriminative, multi-view consistent features for Gaussian predictions, which struggle to construct accurate geometry with sparse views. To address this, we propose $\mathbf{C}^{3}$-GS, a framework that enhances feature learning by incorporating context-aware, cross-dimension, and cross-scale constraints. Our architecture integrates three lightweight modules into a unified rendering pipeline, improving feature fusion and enabling photorealistic synthesis without requiring additional supervision. Extensive experiments on benchmark datasets validate that $\mathbf{C}^{3}$-GS achieves state-of-the-art rendering quality and generalization ability. Code is available at: https://github.com/YuhsiHu/C3-GS.

[48] SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding cs.CV | cs.AI

Jiawen Lin, Shiran Bian, Yihang Zhu, Wenbin Tan, Yachao Zhang

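TL;DR: SeqVLM is a novel zero-shot 3D visual grounding framework that combines multi-view sequence reasoning with a vision-language model (VLM) to overcome the limits of existing methods in spatial reasoning and contextual detail, achieving higher accuracy and better generality.

Details

Motivation: Existing zero-shot 3D visual grounding methods rely on single-view localization, so they suffer from limited spatial reasoning and loss of contextual detail, which restricts their use in real-world scenes.

Result: On the ScanRefer and Nr3D benchmarks, Acc@0.25 reaches 55.6% and 53.2%, improving on previous zero-shot methods by 4.0% and 5.2%, respectively.

Insight: Joint reasoning over multi-view sequences with a VLM captures the context of a 3D scene more fully, which significantly improves zero-shot performance.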

Abstract: 3D Visual Grounding (3DVG) aims to localize objects in 3D scenes using natural language descriptions. Although supervised methods achieve higher accuracy in constrained settings, zero-shot 3DVG holds greater promise for real-world applications since it eliminates scene-specific training requirements. However, existing zero-shot methods face challenges of spatially limited reasoning due to reliance on single-view localization, and contextual omissions or detail degradation. To address these issues, we propose SeqVLM, a novel zero-shot 3DVG framework that leverages multi-view real-world scene images with spatial information for target object reasoning. Specifically, SeqVLM first generates 3D instance proposals via a 3D semantic segmentation network and refines them through semantic filtering, retaining only semantically relevant candidates. A proposal-guided multi-view projection strategy then projects these candidate proposals onto real scene image sequences, preserving spatial relationships and contextual details in the conversion from 3D point cloud to images. Furthermore, to mitigate VLM computational overload, we implement a dynamic scheduling mechanism that iteratively processes sequence-query prompts, leveraging the VLM’s cross-modal reasoning capabilities to identify textually specified objects. Experiments on the ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, achieving Acc@0.25 scores of 55.6% and 53.2%, surpassing previous zero-shot methods by 4.0% and 5.2%, respectively, which advances 3DVG toward greater generalization and real-world applicability. The code is available at https://github.com/JiawLin/SeqVLM.
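Acc@0.25 is usually read as the fraction of predictions whose 3D bounding box overlaps the ground-truth box with an IoU of at least 0.25. The sketch below computes it for axis-aligned boxes. The box layout `(xmin, ymin, zmin, xmax, ymax, zmax)`, the use of `>=` rather than `>` at the threshold, and the function names are assumptions for illustration; benchmark toolkits may differ in these details.

```python
def box_volume(box):
    """Volume of an axis-aligned box given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    x0, y0, z0, x1, y1, z1 = box
    return max(x1 - x0, 0.0) * max(y1 - y0, 0.0) * max(z1 - z0, 0.0)


def iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        if high <= low:
            return 0.0  # no overlap along this axis
        intersection *= high - low
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union if union > 0 else 0.0


def accuracy_at_iou(predictions, ground_truths, threshold=0.25):
    """Fraction of predicted boxes whose IoU with the paired ground truth meets the threshold."""
    if len(predictions) != len(ground_truths) or not predictions:
        raise ValueError("need equally sized, non-empty lists of boxes")
    hits = sum(1 for p, g in zip(predictions, ground_truths)
               if iou_3d(p, g) >= threshold)
    return hits / len(predictions)
```

As a worked case, two 2x2x2 boxes offset by 1 along x share a volume of 4 out of a union of 12, giving an IoU of 1/3, which counts as a hit at the 0.25 threshold.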

[49] Occlusion Robustness of CLIP for Military Vehicle Classification cs.CV | cs.AI

Jan Erik van Woerden, Gertjan Burghouts, Lotte Nijskens, Alma M. Liezenga, Sabina van Rooij

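TL;DR: This paper studies the robustness of CLIP to occlusion in military vehicle classification, finds that Transformer-based CLIP models outperform CNNs, and shows that fine-tuning improves performance under heavy occlusion.

Details

Motivation: Occlusion and noise are common challenges in military environments, but the robustness of vision-language models such as CLIP under these conditions has not been well studied. The paper aims to fill this gap.

Result: The experiments show that (1) Transformer-based CLIP models are more robust, (2) fine-grained occlusions hurt performance more, (3) linear-probed models drop sharply at around 35% occlusion, and (4) fine-tuning the backbone pushes this drop to more than 60% occlusion.

Insight: Occlusion-specific augmentation is important during training, and patch-level sensitivity and architectural resilience need further study to support real-world deployment.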

Abstract: Vision-language models (VLMs) like CLIP enable zero-shot classification by aligning images and text in a shared embedding space, offering advantages for defense applications with scarce labeled data. However, CLIP’s robustness in challenging military environments, with partial occlusion and degraded signal-to-noise ratio (SNR), remains underexplored. We investigate CLIP variants’ robustness to occlusion using a custom dataset of 18 military vehicle classes and evaluate using Normalized Area Under the Curve (NAUC) across occlusion percentages. Four key insights emerge: (1) Transformer-based CLIP models consistently outperform CNNs, (2) fine-grained, dispersed occlusions degrade performance more than larger contiguous occlusions, (3) despite improved accuracy, performance of linear-probed models sharply drops at around 35% occlusion, (4) by finetuning the model’s backbone, this performance drop occurs at more than 60% occlusion. These results underscore the importance of occlusion-specific augmentations during training and the need for further exploration into patch-level sensitivity and architectural resilience for real-world deployment of CLIP.
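The abstract evaluates models with a Normalized Area Under the Curve (NAUC) across occlusion percentages but does not give the formula. One plausible reading, sketched below as an assumption, integrates accuracy over occlusion level with the trapezoidal rule and divides by the occlusion range, so that a model with perfect accuracy at every level scores 1.0.

```python
def normalized_auc(occlusion_levels, accuracies):
    """Trapezoidal area under accuracy vs. occlusion, divided by the occlusion range.

    Assumed definition of NAUC; the paper's exact normalization may differ.
    """
    if len(occlusion_levels) != len(accuracies) or len(accuracies) < 2:
        raise ValueError("need at least two matching (occlusion, accuracy) points")
    points = sorted(zip(occlusion_levels, accuracies))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)  # one trapezoid per interval
    span = points[-1][0] - points[0][0]
    if span <= 0:
        raise ValueError("occlusion levels must cover a non-zero range")
    return area / span
```

Under this definition, a model whose accuracy falls linearly from 1.0 at 0% occlusion to 0.0 at 100% scores 0.5, while a model that holds its accuracy until late in the range and then drops scores higher, which matches the abstract's interest in where the sharp performance drop occurs.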

Fachri Najm Noer Kartiman, Rasim, Yaya Wihardi, Nurul Hasanah, Oskar Natan

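TL;DR: SKGE-Swin is an end-to-end autonomous vehicle model that combines the Swin Transformer with a skip-stage mechanism to strengthen global and multi-level feature representation, and it performs well on the CARLA platform.

Details

Motivation: Existing autonomous driving models struggle to capture global and local features in complex scenes; SKGE-Swin aims to address this with the Swin Transformer and a skip-stage mechanism.

Result: Tested on the CARLA platform, SKGE-Swin achieves a better Driving Score than previous methods, and an ablation study is used to evaluate the contribution of each component.

Insight: Combining a skip-stage mechanism with the Swin Transformer can improve an autonomous driving model's understanding of complex scenes.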

Abstract: Focusing on the development of an end-to-end autonomous vehicle model with pixel-to-pixel context awareness, this research proposes the SKGE-Swin architecture. This architecture utilizes the Swin Transformer with a skip-stage mechanism to broaden feature representation globally and at various network levels. This approach enables the model to extract information from distant pixels by leveraging the Swin Transformer’s Shifted Window-based Multi-head Self-Attention (SW-MSA) mechanism and to retain critical information from the initial to the final stages of feature extraction, thereby enhancing its capability to comprehend complex patterns in the vehicle’s surroundings. The model is evaluated on the CARLA platform using adversarial scenarios to simulate real-world conditions. Experimental results demonstrate that the SKGE-Swin architecture achieves a superior Driving Score compared to previous methods. Furthermore, an ablation study will be conducted to evaluate the contribution of each architectural component, including the influence of skip connections and the use of the Swin Transformer, in improving model performance.

[51] ChainReaction! Structured Approach with Causal Chains as Intermediate Representations for Improved and Explainable Causal Video Question Answering cs.CV | cs.AI | cs.CL | cs.HC | cs.LG

Paritosh Parmar, Eric Peh, Basura Fernando

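TL;DR: The paper proposes a modular framework that explicitly separates causal reasoning from answer generation and introduces natural language causal chains as interpretable intermediate representations, significantly improving the performance and explainability of causal video question answering.

Details

Motivation: Existing causal VideoQA models usually rely on opaque, monolithic pipelines, struggle with higher-order reasoning, offer limited interpretability, and tend to depend on shallow heuristics.

Result: The approach outperforms existing models on three large-scale benchmarks and shows substantial gains in explainability, user trust, and generalization.

Insight: Explicitly separating causal reasoning from answer generation improves both performance and explainability; causal chains are a general-purpose intermediate representation that can support causal reasoning across domains.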

Abstract: Existing Causal-Why Video Question Answering (VideoQA) models often struggle with higher-order reasoning, relying on opaque, monolithic pipelines that entangle video understanding, causal inference, and answer generation. These black-box approaches offer limited interpretability and tend to depend on shallow heuristics. We propose a novel, modular framework that explicitly decouples causal reasoning from answer generation, introducing natural language causal chains as interpretable intermediate representations. Inspired by human cognitive models, these structured cause-effect sequences bridge low-level video content with high-level causal reasoning, enabling transparent and logically coherent inference. Our two-stage architecture comprises a Causal Chain Extractor (CCE) that generates causal chains from video-question pairs, and a Causal Chain-Driven Answerer (CCDA) that produces answers grounded in these chains. To address the lack of annotated reasoning traces, we introduce a scalable method for generating high-quality causal chains from existing datasets using large language models. We also propose CauCo, a new evaluation metric for causality-oriented captioning. Experiments on three large-scale benchmarks demonstrate that our approach not only outperforms state-of-the-art models, but also yields substantial gains in explainability, user trust, and generalization, positioning the CCE as a reusable causal reasoning engine across diverse domains. Project page: https://paritoshparmar.github.io/chainreaction/

[52] Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding cs.CV | cs.AI

Gowreesh Mago, Pascal Mettes, Stevan Rudinac

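TL;DR: This survey examines the challenge of abstract concept recognition in video understanding, stresses the importance of reasoning on multiple semantic levels based on contextual information, and discusses the potential of foundation models to address the problem.

Details

Motivation: Existing video understanding research focuses mainly on concrete, visible entities (such as objects, actions, and scenes), whereas one of humans' distinctive abilities is recognizing abstract concepts (such as justice, freedom, and togetherness). The paper aims to advance research on abstract concepts in video understanding so that models align more closely with human reasoning and values.

Result: The paper argues that multi-modal foundation models (such as large models combining vision and language) may be key to abstract concept recognition, and stresses the importance of drawing on past research.

Insight: 1. Abstract concept recognition is an important open problem in video understanding that requires contextual, multi-level semantic reasoning;
2. Combining community experience with foundation models is likely to advance the field;
3. Avoid duplicating earlier work by building on accumulated experience.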

Abstract: The automatic understanding of video content is advancing rapidly. Empowered by deeper neural networks and large datasets, machines are increasingly capable of understanding what is concretely visible in video frames, whether it be objects, actions, events, or scenes. In comparison, humans retain a unique ability to also look beyond concrete entities and recognize abstract concepts like justice, freedom, and togetherness. Abstract concept recognition forms a crucial open challenge in video understanding, where reasoning on multiple semantic levels based on contextual information is key. In this paper, we argue that the recent advances in foundation models make for an ideal setting to address abstract concept understanding in videos. Automated understanding of high-level abstract concepts is imperative as it enables models to be more aligned with human reasoning and values. In this survey, we study different tasks and datasets used to understand abstract concepts in video content. We observe that, periodically and over a long period, researchers have attempted to solve these tasks, making the best use of the tools available at their disposal. We advocate that drawing on decades of community experience will help us shed light on this important open grand challenge and avoid “re-inventing the wheel” as we start revisiting it in the era of multi-modal foundation models.

[53] Safer Skin Lesion Classification with Global Class Activation Probability Map Evaluation and SafeML cs.CV | cs.AIPDF

Kuniko Paxton, Koorosh Aslansefat, Amila Akagić, Dhavalkumar Thakker, Yiannis Papadopoulos

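TL;DR: This paper proposes a method that combines Global Class Activation Probability Map evaluation (GCAPM) with SafeML to improve the interpretability and reliability of skin lesion classification and reduce the risk of misdiagnosis.

Details

Motivation: Although the accuracy of skin lesion classification models has improved, their interpretability and reliability remain a challenge in medical practice. Existing explanation methods such as LIME and CAM are inconsistent or ignore multi-class activations, which limits trust in the models.

Result: Experiments show that the method improves reliability and interpretability by visualizing the diagnostic process and detecting misdiagnoses.

Insight: Medical AI needs both high accuracy and interpretability; combining GCAPM and SafeML offers a new way to increase trust in models.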

Abstract: Recent advancements in skin lesion classification models have significantly improved accuracy, with some models even surpassing dermatologists’ diagnostic performance. However, in medical practice, distrust in AI models remains a challenge. Beyond high accuracy, trustworthy, explainable diagnoses are essential. Existing explainability methods have reliability issues: LIME-based methods suffer from inconsistency, while CAM-based methods fail to consider all classes. To address these limitations, we propose Global Class Activation Probabilistic Map Evaluation, a method that analyses all classes’ activation probability maps probabilistically and at a pixel level. By visualizing the diagnostic process in a unified manner, it helps reduce the risk of misdiagnosis. Furthermore, the application of SafeML enhances the detection of false diagnoses and issues warnings to doctors and patients as needed, improving diagnostic reliability and ultimately patient safety. We evaluated our method using the ISIC datasets with MobileNetV2 and Vision Transformers.

[54] Evaluating Compositional Generalisation in VLMs and Diffusion Models cs.CV | cs.AIPDF

Beth Pearson, Bilal Boulbarss, Michael Wray, Martha Lewis

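TL;DR: This paper studies the compositional generalisation of vision-language models (VLMs) and diffusion models and finds that they struggle to bind objects with attributes and relations, especially in relational reasoning.

Details

Motivation: The study tests whether VLMs and diffusion models capture the compositional semantics of natural language, in particular the binding of objects with attributes and relations.

Result: The Diffusion Classifier and ViLT perform well on concept binding tasks, but all models perform poorly on relational reasoning; analysis of CLIP embeddings indicates that representations of relational concepts are overly similar.

Insight: The study exposes weaknesses of VLMs in compositional semantics, especially relational reasoning, and points to directions for future improvement.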

Abstract: A fundamental aspect of the semantics of natural language is that novel meanings can be formed from the composition of previously known parts. Vision-language models (VLMs) have made significant progress in recent years; however, there is evidence that they are unable to perform this kind of composition. For example, given an image of a red cube and a blue cylinder, a VLM such as CLIP is likely to incorrectly label the image as a red cylinder or a blue cube, indicating it represents the image as a ‘bag-of-words’ and fails to capture compositional semantics. Diffusion models have recently gained significant attention for their impressive generative abilities, and zero-shot classifiers based on diffusion models have been shown to perform competitively with CLIP in certain compositional tasks. In this work we explore whether the generative Diffusion Classifier has improved compositional generalisation abilities compared to discriminative models. We assess three models – Diffusion Classifier, CLIP, and ViLT – on their ability to bind objects with attributes and relations in both zero-shot learning (ZSL) and generalised zero-shot learning (GZSL) settings. Our results show that the Diffusion Classifier and ViLT perform well at concept binding tasks, but that all models struggle significantly with the relational GZSL task, underscoring the broader challenges VLMs face with relational reasoning. Analysis of CLIP embeddings suggests that the difficulty may stem from overly similar representations of relational concepts such as left and right. Code and dataset are available at: https://github.com/otmive/diffusion_classifier_clip

[55] Adapting Foundation Model for Dental Caries Detection with Dual-View Co-Training cs.CVPDF

Tao Luo, Han Wu, Tong Yang, Dinggang Shen, Zhiming Cui

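TL;DR: The paper proposes a Dual-View Co-Training network (DVCTNet) that combines global-view and local-view information and dynamically fuses features with a gated cross-view attention module, significantly improving the accuracy of dental caries detection.

Details

Motivation: Current dental caries detection methods are not accurate enough because of low image contrast and diverse lesion morphology. Inspired by the clinical workflow in which dentists combine whole-image screening with tooth-level inspection, the paper aims for a more accurate dual-view detection method.

Result: The method outperforms existing approaches on both a public dataset and a newly annotated dataset, indicating clinical applicability.

Insight: Combining global-view and local-view features can significantly improve detection in medical imaging tasks, and dynamic feature fusion is key.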

Abstract: Accurate dental caries detection from panoramic X-rays plays a pivotal role in preventing lesion progression. However, current detection methods often yield suboptimal accuracy due to subtle contrast variations and diverse lesion morphology of dental caries. In this work, inspired by the clinical workflow where dentists systematically combine whole-image screening with detailed tooth-level inspection, we present DVCTNet, a novel Dual-View Co-Training network for accurate dental caries detection. Our DVCTNet starts with employing automated tooth detection to establish two complementary views: a global view from panoramic X-ray images and a local view from cropped tooth images. We then pretrain two vision foundation models separately on the two views. The global-view foundation model serves as the detection backbone, generating region proposals and global features, while the local-view model extracts detailed features from corresponding cropped tooth patches matched by the region proposals. To effectively integrate information from both views, we introduce a Gated Cross-View Attention (GCV-Atten) module that dynamically fuses dual-view features, enhancing the detection pipeline by integrating the fused features back into the detection model for final caries detection. To rigorously evaluate our DVCTNet, we test it on a public dataset and further validate its performance on a newly curated, high-precision dental caries detection dataset, annotated using both intra-oral images and panoramic X-rays for double verification. Experimental results demonstrate DVCTNet’s superior performance against existing state-of-the-art (SOTA) methods on both datasets, indicating the clinical applicability of our method. Our code and labeled dataset are available at https://github.com/ShanghaiTech-IMPACT/DVCTNet.
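
Illustration (not from the paper): the abstract describes a gate that dynamically fuses global-view and local-view features. The sketch below shows only the generic gating idea, a sigmoid gate that blends two feature vectors. It assumes a single scalar gate and plain Python lists, omits the cross-view attention that GCV-Atten also uses, and every name in it (`gated_fuse`, `gate_weights`, `gate_bias`) is hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(global_feat, local_feat, gate_weights, gate_bias=0.0):
    """Blend two equal-length feature vectors with one learned scalar gate.

    Hypothetical simplification: the real module also applies cross-view
    attention, which this sketch leaves out.
    """
    if len(global_feat) != len(local_feat):
        raise ValueError("both views must have the same feature size")
    concat = list(global_feat) + list(local_feat)
    if len(gate_weights) != len(concat):
        raise ValueError("gate_weights must match the concatenated features")
    # The gate sees both views and decides how much to trust each one.
    logit = sum(w * v for w, v in zip(gate_weights, concat)) + gate_bias
    g = sigmoid(logit)
    # g near 1 keeps the global view; g near 0 keeps the local view.
    fused = [g * a + (1.0 - g) * b for a, b in zip(global_feat, local_feat)]
    return fused, g
```

With all-zero gate weights the gate is 0.5 and the result is the plain average of the two views; training would move the gate toward whichever view is more informative for a given proposal.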

[56] Estimating 2D Keypoints of Surgical Tools Using Vision-Language Models with Low-Rank Adaptation cs.CVPDF

Krit Duangprom, Tryphon Lambrou, Binod Bhattarai

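TL;DR: This paper proposes a new method for estimating 2D keypoints of surgical tools using vision-language models (VLMs) and low-rank adaptation (LoRA); it outperforms traditional CNN or Transformer approaches on small-scale medical data.

Details

Motivation: Traditional CNN or Transformer methods tend to overfit on small medical datasets, whereas pre-trained VLMs generalize better, so the authors fine-tune a VLM with LoRA to address this.

Result: Experiments show that a VLM fine-tuned for two epochs outperforms the baseline models on 2D keypoint detection, supporting the effectiveness of LoRA in low-resource settings.

Insight: VLMs with LoRA enable efficient keypoint detection on small medical datasets and suggest a path toward future 3D pose estimation of surgical tools.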

Abstract: This paper presents a novel pipeline for 2D keypoint estimation of surgical tools by leveraging Vision Language Models (VLMs) fine-tuned using a low-rank adaptation (LoRA) technique. Unlike traditional Convolutional Neural Network (CNN) or Transformer-based approaches, which often suffer from overfitting in small-scale medical datasets, our method harnesses the generalization capabilities of pre-trained VLMs. We carefully design prompts to create an instruction-tuning dataset and use them to align visual features with semantic keypoint descriptions. Experimental results show that with only two epochs of fine-tuning, the adapted VLM outperforms the baseline models, demonstrating the effectiveness of LoRA in low-resource scenarios. This approach not only improves keypoint detection performance, but also paves the way for future work in 3D pose estimation of surgical hands and tools.
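
Illustration (not from the paper): LoRA keeps a pre-trained weight matrix W frozen and learns a low-rank update, so the effective weight is W + (alpha / r) · B · A. The sketch below computes that update with plain Python lists and counts the trainable parameters. The helper names, shapes, and the value of `alpha` are assumptions for illustration; the paper's actual rank, target layers, and VLM are not stated in the abstract.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A.

    w: frozen weight, shape (d_out, d_in)
    b: trainable factor, shape (d_out, rank)
    a: trainable factor, shape (rank, d_in)
    """
    delta = matmul(b, a)
    scale = alpha / rank
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def lora_trainable_params(d_out, d_in, rank):
    """Number of parameters in the two low-rank factors B and A."""
    return rank * (d_out + d_in)
```

For a hypothetical 4096 × 4096 projection with rank 8, the adapter trains 65,536 parameters instead of the 16,777,216 in the full matrix, which is why the approach suits small datasets.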

[57] PathMR: Multimodal Visual Reasoning for Interpretable Pathology Diagnosis cs.CVPDF

Ye Zhang, Yu Zhou, Jingwen Qi, Yongbing Zhang, Simon Puettmann

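TL;DR: PathMR is a multimodal visual reasoning framework for pathology image analysis that generates pixel-level segmentation masks and semantically aligned textual explanations, improving the transparency and interpretability of AI-assisted diagnosis.

Details

Motivation: Deep-learning-based pathological diagnosis has improved efficiency, but clinical adoption is limited by opaque model decisions and the lack of a traceable rationale. PathMR aims to address this through multimodal visual reasoning.

Result: PathMR outperforms existing visual reasoning methods on PathGen and on the newly developed GADVR dataset, showing its potential for AI-driven pathological diagnosis.

Insight: By combining visual and textual modalities, PathMR provides a more transparent basis for pathological diagnosis and may help increase clinical trust.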

Abstract: Deep learning based automated pathological diagnosis has markedly improved diagnostic efficiency and reduced variability between observers, yet its clinical adoption remains limited by opaque model decisions and a lack of traceable rationale. To address this, recent multimodal visual reasoning architectures provide a unified framework that generates segmentation masks at the pixel level alongside semantically aligned textual explanations. By localizing lesion regions and producing expert-style diagnostic narratives, these models deliver the transparent and interpretable insights necessary for dependable AI-assisted pathology. Building on these advancements, we propose PathMR, a cell-level Multimodal visual Reasoning framework for Pathological image analysis. Given a pathological image and a textual query, PathMR generates expert-level diagnostic explanations while simultaneously predicting cell distribution patterns. To benchmark its performance, we evaluated our approach on the publicly available PathGen dataset as well as on our newly developed GADVR dataset. Extensive experiments on these two datasets demonstrate that PathMR consistently outperforms state-of-the-art visual reasoning methods in text generation quality, segmentation accuracy, and cross-modal alignment. These results highlight the potential of PathMR for improving interpretability in AI-driven pathological diagnosis. The code will be publicly available at https://github.com/zhangye-zoe/PathMR.

Dennis Slobodzian, Karissa Tilbury, Amir Kordijazi

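TL;DR: This paper develops a deep learning framework for early detection of pancreatic cancer from multimodal medical imaging (autofluorescence and second harmonic generation), significantly improving detection accuracy.

Details

Motivation: Pancreatic ductal adenocarcinoma (PDAC) has a five-year survival rate below 10%, mainly because of late diagnosis. The study aims to enable early detection through deep learning analysis of multimodal imaging.

Result: The optimized framework achieves over 90% accuracy in cancer detection, significantly outperforming traditional manual analysis.

Insight: The work provides a practical approach for small medical imaging datasets and lays a foundation for extension to other cancer types.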

Abstract: Pancreatic ductal adenocarcinoma (PDAC) remains one of the most lethal forms of cancer, with a five-year survival rate below 10% primarily due to late detection. This research develops and validates a deep learning framework for early PDAC detection through analysis of dual-modality imaging: autofluorescence and second harmonic generation (SHG). We analyzed 40 unique patient samples to create a specialized neural network capable of distinguishing between normal, fibrotic, and cancerous tissue. Our methodology evaluated six distinct deep learning architectures, comparing traditional Convolutional Neural Networks (CNNs) with modern Vision Transformers (ViTs). Through systematic experimentation, we identified and overcame significant challenges in medical image analysis, including limited dataset size and class imbalance. The final optimized framework, based on a modified ResNet architecture with frozen pre-trained layers and class-weighted training, achieved over 90% accuracy in cancer detection. This represents a significant improvement over current manual analysis methods and demonstrates potential for clinical deployment. This work establishes a robust pipeline for automated PDAC detection that can augment pathologists’ capabilities while providing a foundation for future expansion to other cancer types. The developed methodology also offers valuable insights for applying deep learning to limited-size medical imaging datasets, a common challenge in clinical applications.
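
Illustration (not from the paper): the abstract mentions class-weighted training as a remedy for class imbalance but does not say how the weights were chosen. One common heuristic, assumed here, is inverse-frequency ("balanced") weighting, n_samples / (n_classes × class_count), combined with a weighted cross-entropy. The label counts below are made up for the example and are not the study's data.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: rare classes get proportionally larger weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(pred_probs, labels, weights):
    """Weighted mean of -log p(true class), normalized by the total weight.

    pred_probs: one dict per sample mapping class name -> predicted probability.
    """
    total = 0.0
    norm = 0.0
    for probs, y in zip(pred_probs, labels):
        w = weights[y]
        total += -w * math.log(probs[y])
        norm += w
    return total / norm


# Hypothetical, imbalanced toy labels (not taken from the paper).
toy_labels = ["normal"] * 6 + ["fibrotic"] * 3 + ["cancer"] * 1
toy_weights = balanced_class_weights(toy_labels)
```

Under this scheme a mistake on the single "cancer" sample costs as much as mistakes on all six "normal" samples combined, which keeps the majority class from dominating the loss.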

[59] Understanding and evaluating computer vision models through the lens of counterfactuals cs.CVPDF

Pushkar Shukla

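TL;DR: This work proposes counterfactual-based methods for explaining and evaluating computer vision models, focusing on bias in classifiers and generative models.

Details

Motivation: Counterfactual reasoning observes how model behavior changes as inputs are varied, which is essential for understanding a model's decision logic and biases. The work aims to develop systematic methods that use counterfactuals to improve interpretability, fairness, and robustness.

Result: The methods identify and mitigate bias in both classifiers and generative models without hurting accuracy. CAVLI and ASAC reduce classifiers' reliance on irrelevant features, and TIBET and InterMit significantly reduce social bias in generative models.

Insight: Counterfactual reasoning is a key tool linking interpretability, fairness, and causality. Systematic counterfactual methods offer standardized evaluation and mitigation paths for responsible model use.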

Abstract: Counterfactual reasoning – the practice of asking “what if” by varying inputs and observing changes in model behavior – has become central to interpretable and fair AI. This thesis develops frameworks that use counterfactuals to explain, audit, and mitigate bias in vision classifiers and generative models. By systematically altering semantically meaningful attributes while holding others fixed, these methods uncover spurious correlations, probe causal dependencies, and help build more robust systems. The first part addresses vision classifiers. CAVLI integrates attribution (LIME) with concept-level analysis (TCAV) to quantify how strongly decisions rely on human-interpretable concepts. With localized heatmaps and a Concept Dependency Score, CAVLI shows when models depend on irrelevant cues like backgrounds. Extending this, ASAC introduces adversarial counterfactuals that perturb protected attributes while preserving semantics. Through curriculum learning, ASAC fine-tunes biased models for improved fairness and accuracy while avoiding stereotype-laden artifacts. The second part targets generative Text-to-Image (TTI) models. TIBET provides a scalable pipeline for evaluating prompt-sensitive biases by varying identity-related terms, enabling causal auditing of how race, gender, and age affect image generation. To capture interactions, BiasConnect builds causal graphs diagnosing intersectional biases. Finally, InterMit offers a modular, training-free algorithm that mitigates intersectional bias via causal sensitivity scores and user-defined fairness goals. Together, these contributions show counterfactuals as a unifying lens for interpretability, fairness, and causality in both discriminative and generative models, establishing principled, scalable methods for socially responsible bias evaluation and mitigation.

[60] To New Beginnings: A Survey of Unified Perception in Autonomous Vehicle Software cs.CV | cs.ROPDF

Loïc Stratil, Felix Fent, Esteban Rivera, Markus Lienkamp

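TL;DR: This survey of unified perception for autonomous vehicles proposes a taxonomy along task integration, tracking formulation, and representation flow, defines three unified perception paradigms, and systematically reviews existing methods with their architectures and training strategies, providing directions for future research.

Details

Motivation: Autonomous driving perception usually relies on modular pipelines, which are interpretable but suffer from error accumulation and limited synergy between sub-tasks. Unified perception integrates the sub-tasks in a shared architecture and may improve robustness, contextual reasoning, and efficiency while remaining interpretable.

Result: The survey covers the architectures, training strategies, datasets, and open-source implementations of existing methods, giving the research community a clear reference framework.

Insight: Unified perception may overcome the limitations of modular approaches, but further research is needed on balancing performance gains against interpretability and on achieving more general cross-task synergy.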

Abstract: Autonomous vehicle perception typically relies on modular pipelines that decompose the task into detection, tracking, and prediction. While interpretable, these pipelines suffer from error accumulation and limited inter-task synergy. Unified perception has emerged as a promising paradigm that integrates these sub-tasks within a shared architecture, potentially improving robustness, contextual reasoning, and efficiency while retaining interpretable outputs. In this survey, we provide a comprehensive overview of unified perception, introducing a holistic and systemic taxonomy that categorizes methods along task integration, tracking formulation, and representation flow. We define three paradigms (Early, Late, and Full Unified Perception) and systematically review existing methods, their architectures, training strategies, datasets used, and open-source availability, while highlighting future research directions. This work establishes the first comprehensive framework for understanding and advancing unified perception, consolidates fragmented efforts, and guides future research toward more robust, generalizable, and interpretable perception.

[61] Dino U-Net: Exploiting High-Fidelity Dense Features from Foundation Models for Medical Image Segmentation cs.CV | eess.IVPDF

Yifan Gao, Haoyue Li, Feng Yuan, Xiaosong Wang, Xin Gao

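TL;DR: The paper proposes Dino U-Net, a new medical image segmentation architecture built on the DINOv3 foundation model. With a frozen DINOv3 backbone, a specialized adapter, and the FAPM module, it exploits high-fidelity dense features and achieves state-of-the-art performance on multiple medical datasets.

Details

Motivation: Foundation models pre-trained on large-scale natural images offer a powerful paradigm for medical image segmentation, but transferring their learned representations to precise clinical applications remains a challenge.

Result: The method achieves SOTA performance on multiple medical datasets; performance keeps improving with model size, and the 7B-parameter variant performs best.

Insight: High-quality dense pre-trained features from a general-purpose foundation model can provide an effective, parameter-efficient solution for medical segmentation without additional training of the backbone.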

Abstract: Foundation models pre-trained on large-scale natural image datasets offer a powerful paradigm for medical image segmentation. However, effectively transferring their learned representations for precise clinical applications remains a challenge. In this work, we propose Dino U-Net, a novel encoder-decoder architecture designed to exploit the high-fidelity dense features of the DINOv3 vision foundation model. Our architecture introduces an encoder built upon a frozen DINOv3 backbone, which employs a specialized adapter to fuse the model’s rich semantic features with low-level spatial details. To preserve the quality of these representations during dimensionality reduction, we design a new fidelity-aware projection module (FAPM) that effectively refines and projects the features for the decoder. We conducted extensive experiments on seven diverse public medical image segmentation datasets. Our results show that Dino U-Net achieves state-of-the-art performance, consistently outperforming previous methods across various imaging modalities. Our framework proves to be highly scalable, with segmentation accuracy consistently improving as the backbone model size increases up to the 7-billion-parameter variant. The findings demonstrate that leveraging the superior, dense-pretrained features from a general-purpose foundation model provides a highly effective and parameter-efficient approach to advance the accuracy of medical image segmentation. The code is available at https://github.com/yifangao112/DinoUNet.

[62] COMETH: Convex Optimization for Multiview Estimation and Tracking of Humans cs.CV | cs.ROPDF

Enrico Martini, Ho Jin Choi, Nadia Figueroa, Nicola Bombieri

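TL;DR: The paper proposes COMETH, a lightweight algorithm for real-time multi-view human pose fusion that integrates kinematic and biomechanical constraints, convex-optimization-based inverse kinematics, and a state observer to improve localization, detection, and tracking accuracy.

Details

Motivation: In the era of Industry 5.0, monitoring human activity is essential for ergonomic safety and overall well-being. Centralized multi-camera setups improve pose estimation accuracy, but high computational cost and bandwidth requirements limit their scalability and real-time use. Distributed edge computing reduces bandwidth and computational load, but constrained device resources degrade accuracy.

Result: On public and industrial datasets, COMETH outperforms state-of-the-art methods in localization, detection, and tracking accuracy.

Insight: COMETH offers a lightweight, efficient solution for industrial and safety-critical applications, overcoming the compute and bandwidth limits of current multi-view human pose estimation.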

Abstract: In the era of Industry 5.0, monitoring human activity is essential for ensuring both ergonomic safety and overall well-being. While multi-camera centralized setups improve pose estimation accuracy, they often suffer from high computational costs and bandwidth requirements, limiting scalability and real-time applicability. Distributing processing across edge devices can reduce network bandwidth and computational load. On the other hand, the constrained resources of edge devices lead to accuracy degradation, and the distribution of computation leads to temporal and spatial inconsistencies. We address this challenge by proposing COMETH (Convex Optimization for Multiview Estimation and Tracking of Humans), a lightweight algorithm for real-time multi-view human pose fusion that relies on three concepts: it integrates kinematic and biomechanical constraints to increase the joint positioning accuracy; it employs convex optimization-based inverse kinematics for spatial fusion; and it implements a state observer to improve temporal consistency. We evaluate COMETH on both public and industrial datasets, where it outperforms state-of-the-art methods in localization, detection, and tracking accuracy. The proposed fusion pipeline enables accurate and scalable human motion tracking, making it well-suited for industrial and safety-critical applications. The code is publicly available at https://github.com/PARCO-LAB/COMETH.

[63] E-ConvNeXt: A Lightweight and Efficient ConvNeXt Variant with Cross-Stage Partial Connections cs.CVPDF

Fang Wang, Huitao Li, Wenhan Chao, Zheng Zhuo, Yiran Ji

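TL;DR: E-ConvNeXt is a lightweight, efficient ConvNeXt variant that uses Cross Stage Partial (CSP) connections and optimized design choices to substantially reduce parameters and complexity while maintaining high accuracy.

Details

Motivation: Many high-performance networks were not designed with lightweight application scenarios in mind, which limits their applicability. This paper takes ConvNeXt as its subject and aims to make it lightweight through design improvements.

Result: On ImageNet classification, the mini variant reaches 78.3% Top-1 accuracy at 0.9 GFLOPs and the small variant reaches 81.9% at 3.1 GFLOPs.

Insight: Combining the CSP mechanism with structural optimization can substantially cut computational cost while preserving accuracy, providing a useful example for lightweight design.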

Abstract: Many high-performance networks were not designed with lightweight application scenarios in mind from the outset, which has greatly restricted their scope of application. This paper takes ConvNeXt as the research object and significantly reduces the parameter scale and network complexity of ConvNeXt by integrating the Cross Stage Partial Connections mechanism and a series of optimized designs. The new network is named E-ConvNeXt, which can maintain high accuracy performance under different complexity configurations. The three core innovations of E-ConvNeXt are: (1) integrating the Cross Stage Partial Network (CSPNet) with ConvNeXt and adjusting the network structure, which reduces the model’s network complexity by up to 80%; (2) optimizing the Stem and Block structures to enhance the model’s feature expression capability and operational efficiency; (3) replacing Layer Scale with channel attention. Experimental validation on ImageNet classification demonstrates E-ConvNeXt’s superior accuracy-efficiency balance: E-ConvNeXt-mini reaches 78.3% Top-1 accuracy at 0.9 GFLOPs, and E-ConvNeXt-small reaches 81.9% Top-1 accuracy at 3.1 GFLOPs. Transfer learning tests on object detection tasks further confirm its generalization capability.

[64] Webly-Supervised Image Manipulation Localization via Category-Aware Auto-Annotation cs.CVPDF

Chenfan Qu, Yiwu Zhong, Bin Li, Lianwen Jin

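TL;DR: This paper proposes a new approach that uses web data and automatic annotation to address data scarcity in image manipulation localization, and it substantially improves model performance through a new metric and dataset.

Details

Motivation: Image manipulation poses serious risks to social security, but high-quality annotated data is scarce and expensive. The paper aims to address this with web data and automatic annotation.

Result: The MIMLv2 dataset is 120x the size of existing handcrafted datasets; Web-IML achieves a 31% performance gain on manipulation localization and surpasses the SOTA model TruFor by 24.1 average IoU points.

Insight: Web data and automatic annotation can effectively address data scarcity; quality filtering and augmentation techniques can markedly improve model performance.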

Abstract: Images manipulated using image editing tools can mislead viewers and pose significant risks to social security. However, accurately localizing the manipulated regions within an image remains a challenging problem. One of the main barriers in this area is the high cost of data acquisition and the severe lack of high-quality annotated datasets. To address this challenge, we introduce novel methods that mitigate data scarcity by leveraging readily available web data. We utilize a large collection of manually forged images from the web, as well as automatically generated annotations derived from a simpler auxiliary task, constrained image manipulation localization. Specifically, we introduce a new paradigm, CAAAv2, which automatically and accurately annotates manipulated regions at the pixel level. To further improve annotation quality, we propose a novel metric, QES, which filters out unreliable annotations. Through CAAAv2 and QES, we construct MIMLv2, a large-scale, diverse, and high-quality dataset containing 246,212 manually forged images with pixel-level mask annotations. This is over 120x larger than existing handcrafted datasets like IMD20. Additionally, we introduce Object Jitter, a technique that further enhances model training by generating high-quality manipulation artifacts. Building on these advances, we develop a new model, Web-IML, designed to effectively leverage web-scale supervision for the image manipulation localization task. Extensive experiments demonstrate that our approach substantially alleviates the data scarcity problem and significantly improves the performance of various models on multiple real-world forgery benchmarks. With the proposed web supervision, Web-IML achieves a striking performance gain of 31% and surpasses previous SOTA TruFor by 24.1 average IoU points. The dataset and code will be made publicly available at https://github.com/qcf-568/MIML.
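
Illustration (not from the paper): the reported gain is stated in average IoU points. Intersection over Union for a predicted manipulation mask and its ground truth is the number of pixels flagged in both masks divided by the number flagged in either. The sketch below computes it for binary masks stored as nested lists. The function names are hypothetical, and treating two empty masks as a perfect match is an assumed convention; the benchmarks' exact averaging protocol is not given in the abstract.

```python
def mask_iou(pred, target):
    """IoU of two binary masks given as equally shaped nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over (prediction, ground truth) mask pairs."""
    scores = [mask_iou(p, t) for p, t in pairs]
    return sum(scores) / len(scores)
```

For example, a prediction that overlaps the true region in one pixel while the two masks together cover three pixels scores 1/3.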

[65] POSE: Phased One-Step Adversarial Equilibrium for Video Diffusion Models cs.CVPDF

Jiaxiang Cheng, Bing Ma, Xuhua Ren, Hongyi Jin, Kai Yu

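TL;DR: This paper proposes POSE, a distillation framework that greatly improves the sampling efficiency of video diffusion models and enables single-step generation of high-quality video.

Details

Motivation: Existing video acceleration methods are mostly based on image techniques; they do not adequately model the temporal coherence of video frames and cannot provide single-step distillation for large-scale video models.

Result: POSE improves performance on the VBench-I2V benchmark by an average of 7.15% and reduces the latency of the pre-trained model by 100x (from 1000 seconds to 10 seconds).

Insight: Carefully designed adversarial distillation and stabilization mechanisms make it possible to generate high-quality video in a single step, opening the way to real-time use of large-scale video diffusion models.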

Abstract: The field of video diffusion generation faces critical bottlenecks in sampling efficiency, especially for large-scale models and long sequences. Existing video acceleration methods adopt image-based techniques but suffer from fundamental limitations: they neither model the temporal coherence of video frames nor provide single-step distillation for large-scale video models. To bridge this gap, we propose POSE (Phased One-Step Equilibrium), a distillation framework that reduces the sampling steps of large-scale video diffusion models, enabling the generation of high-quality videos in a single step. POSE employs a carefully designed two-phase process to distill video models: (i) stability priming, a warm-up mechanism to stabilize adversarial distillation that adapts the high-quality trajectory of the one-step generator from high to low signal-to-noise ratio regimes, optimizing the video quality of single-step mappings near the endpoints of flow trajectories; and (ii) unified adversarial equilibrium, a flexible self-adversarial distillation mechanism that promotes stable single-step adversarial training towards a Nash equilibrium within the Gaussian noise space, generating realistic single-step videos close to real videos. For conditional video generation, we propose (iii) conditional adversarial consistency, a method to improve both semantic consistency and frame consistency between conditional frames and generated frames. Comprehensive experiments demonstrate that POSE outperforms other acceleration methods on VBench-I2V by an average of 7.15% in semantic alignment, temporal coherence, and frame quality, reducing the latency of the pre-trained model by 100×, from 1000 seconds to 10 seconds, while maintaining competitive performance.

[66] Mitosis detection in domain shift scenarios: a Mamba-based approach cs.CVPDF

Gennaro Percannella, Mattia Sarno, Francesco Tortorella, Mario Vento

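TL;DR: This paper proposes a Mamba-based approach to mitosis detection under domain shift in histopathology images, using a VM-UNet architecture and stain augmentation to improve robustness. Preliminary experiments on the MIDOG++ dataset show that the method still has room for improvement.

Details

Motivation: Mitosis detection in histopathology images is crucial for tumor assessment, but machine learning algorithms degrade significantly on data from other domains. To improve cross-domain performance, the authors draw on Mamba's strong results in medical image segmentation.

Result: Preliminary experiments on the MIDOG++ dataset show how the method performs, with large room for improvement remaining.

Insight: The Mamba architecture shows potential for medical imaging tasks, and stain augmentation can effectively improve performance under domain shift.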

Abstract: Mitosis detection in histopathology images plays a key role in tumor assessment. Although machine learning algorithms could be exploited for aiding physicians in accurately performing such a task, these algorithms suffer from a significant performance drop when evaluated on images coming from domains that are different from the training ones. In this work, we propose a Mamba-based approach for mitosis detection under domain shift, inspired by the promising performance demonstrated by Mamba in medical imaging segmentation tasks. Specifically, our approach exploits a VM-UNet architecture for carrying out the addressed task, as well as stain augmentation operations for further improving model robustness against domain shift. Our approach has been submitted to track 1 of the MItosis DOmain Generalization (MIDOG) challenge. Preliminary experiments, conducted on the MIDOG++ dataset, show large room for improvement for the proposed method.

[67] FW-GAN: Frequency-Driven Handwriting Synthesis with Wave-Modulated MLP Generator cs.CV | cs.LGPDF

Huynh Tong Dang Khoa, Dang Hoai Nam, Vo Nguyen Le Duy

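TL;DR: FW-GAN is a frequency-driven handwriting synthesis method that combines a Wave-MLP generator with a frequency-guided discriminator to address existing methods' weaknesses in modeling long-range dependencies and complex stroke patterns, and it uses a Frequency Distribution Loss to improve the visual fidelity of generated samples.

Details

Motivation: Labeled handwriting data is scarce, which limits recognition systems that need diverse, style-consistent training samples. Current handwriting generation methods struggle to model long-range dependencies and complex stroke patterns and overlook the importance of frequency information.

Result: Experiments show that FW-GAN generates high-quality, style-consistent handwritten text suitable for data augmentation in low-resource handwriting recognition.

Insight: Frequency information is crucial for handwriting generation; Wave-MLP can model complex stroke patterns effectively, and a frequency-guided discriminator and loss can markedly improve the realism of generated samples.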

Abstract: Labeled handwriting data is often scarce, limiting the effectiveness of recognition systems that require diverse, style-consistent training samples. Handwriting synthesis offers a promising solution by generating artificial data to augment training. However, current methods face two major limitations. First, most are built on conventional convolutional architectures, which struggle to model long-range dependencies and complex stroke patterns. Second, they largely ignore the crucial role of frequency information, which is essential for capturing fine-grained stylistic and structural details in handwriting. To address these challenges, we propose FW-GAN, a one-shot handwriting synthesis framework that generates realistic, writer-consistent text from a single example. Our generator integrates a phase-aware Wave-MLP to better capture spatial relationships while preserving subtle stylistic cues. We further introduce a frequency-guided discriminator that leverages high-frequency components to enhance the authenticity detection of generated samples. Additionally, we introduce a novel Frequency Distribution Loss that aligns the frequency characteristics of synthetic and real handwriting, thereby enhancing visual fidelity. Experiments on Vietnamese and English handwriting datasets demonstrate that FW-GAN generates high-quality, style-consistent handwriting, making it a valuable tool for augmenting data in low-resource handwriting recognition (HTR) pipelines. Official implementation is available at https://github.com/DAIR-Group/FW-GAN

[68] MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs cs.CVPDF

Junpeng Ma, Qizhe Zhang, Ming Lu, Zhibin Wang, Qiang Zhou

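TL;DR: MMG-Vid proposes a training-free visual token pruning framework that maximizes marginal gains at the segment level and the token level, substantially improving the efficiency of video large language models while maintaining performance.

Details

Motivation: Current video large language models (VLLMs) incur high computational cost when processing visual tokens, and existing methods overlook the dynamic characteristics and temporal dependencies of video.

Result: On LLaVA-OneVision-7B, the method removes 75% of visual tokens, speeds up the prefilling stage by 3.9x, and retains over 99.5% of the original performance.

Insight: Joint optimization at the segment and token levels can substantially improve video processing efficiency without additional training.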

Abstract: Video Large Language Models (VLLMs) excel in video understanding, but their excessive visual tokens pose a significant computational challenge for real-world applications. Current methods aim to enhance inference efficiency by visual token pruning. However, they do not consider the dynamic characteristics and temporal dependencies of video frames, as they perceive video understanding as a multi-frame task. To address these challenges, we propose MMG-Vid, a novel training-free visual token pruning framework that removes redundancy by Maximizing Marginal Gains at both segment-level and token-level. Specifically, we first divide the video into segments based on frame similarity, and then dynamically allocate the token budget for each segment to maximize the marginal gain of each segment. Subsequently, we propose a temporal-guided DPC algorithm that jointly models inter-frame uniqueness and intra-frame diversity, thereby maximizing the marginal gain of each token. By combining both stages, MMG-Vid can maximize the utilization of the limited token budget, significantly improving efficiency while maintaining strong performance. Extensive experiments demonstrate that MMG-Vid can maintain over 99.5% of the original performance while reducing visual tokens by 75% and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B. Code will be released soon.
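
Illustration (not from the paper): the first stage described above splits a video into segments by frame similarity and then divides a token budget among the segments. The sketch below shows one simple way to do both, under stated assumptions: cosine similarity between adjacent frame features with a fixed threshold, and a budget split proportional to segment length. The paper instead allocates the budget to maximize marginal gain, which the abstract does not specify, so the proportional rule is a stand-in and all names and the threshold are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def split_segments(frame_feats, threshold=0.9):
    """Start a new segment whenever adjacent frames fall below the threshold."""
    if not frame_feats:
        return []
    segments = [[0]]
    for i in range(1, len(frame_feats)):
        if cosine(frame_feats[i - 1], frame_feats[i]) < threshold:
            segments.append([i])
        else:
            segments[-1].append(i)
    return segments


def allocate_budget(segments, total_budget):
    """Split the budget in proportion to segment length (largest-remainder rounding).

    Placeholder rule, assumed for illustration; not the paper's allocation.
    """
    total_len = sum(len(s) for s in segments)
    if total_len == 0:
        return []
    exact = [total_budget * len(s) / total_len for s in segments]
    alloc = [int(x) for x in exact]
    leftover = total_budget - sum(alloc)
    # Hand the remaining tokens to the segments with the largest fractional parts.
    order = sorted(range(len(segments)), key=lambda i: exact[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc
```

A content-aware allocator would replace `allocate_budget` while keeping the same interface: a list of segments in, one integer budget per segment out.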

[69] CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification cs.CV | cs.ROPDF

Wei Li, Renshan Zhang, Rui Shao, Jie He, Liqiang Nie

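TL;DR: CogVLA proposes a cognition-aligned Vision-Language-Action (VLA) model based on instruction-driven routing and sparsification; its three-stage progressive architecture improves efficiency and performance while lowering training and inference costs.

Details

Motivation: Existing VLA models rely on pre-trained vision-language models (VLMs) but require extensive post-training, and the resulting computational overhead limits scalability and deployment. Drawing inspiration from human multimodal coordination, CogVLA aims to solve this through efficient routing and sparsification.

Result: CogVLA achieves success rates of 97.4% on the LIBERO benchmark and 70.0% on real-world tasks, with training cost reduced 2.5-fold and inference latency reduced 2.8-fold.

Insight: By emulating the multimodal coordination of human cognition, CogVLA achieves efficient vision-language-action alignment and suggests a new approach to lightweight VLA models.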

Abstract: Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 2) Building upon this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce V-L-A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA. CogVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/CogVLA.
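
Illustration (not from the paper): both routing stages are named after FiLM, feature-wise linear modulation, in which a conditioning signal produces a per-channel scale (gamma) and shift (beta) applied to every token: y = gamma * x + beta. The sketch below shows only that generic operation, with a linear projection from an instruction embedding to gamma and beta. How CogVLA wires FiLM into its encoder and language model is not described in the abstract, so the function names and weight layout here are hypothetical.

```python
def linear(vec, weight, bias):
    """Dense layer: weight has one row per output channel."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def film_params(instruction_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Project an instruction embedding to per-channel scale and shift."""
    gamma = linear(instruction_emb, w_gamma, b_gamma)
    beta = linear(instruction_emb, w_beta, b_beta)
    return gamma, beta


def film(tokens, gamma, beta):
    """Apply y_c = gamma_c * x_c + beta_c to every token (a list of channel values)."""
    return [[g * x + b for x, g, b in zip(tok, gamma, beta)] for tok in tokens]
```

With gamma fixed at 1 and beta at 0 the tokens pass through unchanged, so the instruction only alters the visual features to the extent that the projection has learned to deviate from that identity setting.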

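[70] Veritas: Generalizable Deepfake Detection via Pattern-Aware Reasoning cs.CV | cs.AI

Hao Tan, Jun Lan, Zichang Tan, Ajian Liu, Chuanbiao Song

TL;DR: This paper proposes Veritas, a generalizable deepfake detector that combines pattern-aware reasoning with a multimodal large language model (MLLM), and builds the HydraFake dataset to simulate real-world challenges.

Details

Motivation: Existing deepfake detection benchmarks diverge sharply from industrial practice: training sources are homogeneous and test images are low quality, which limits the practical use of detectors. The authors respond with a more realistic dataset and detection method.

Result: Experiments show that Veritas significantly outperforms existing detectors on unseen forgery techniques and data domains (OOD scenarios) and delivers transparent and faithful detection outputs.

Insight: 1. Generalizable deepfake detection needs both data diversity and reasoning ability; 2. pattern-aware reasoning improves generalization and interpretability; 3. two-stage training offers a new approach to applying MLLMs to specific tasks.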

Abstract: Deepfake detection remains a formidable challenge due to the complex and evolving nature of fake content in real-world scenarios. However, existing academic benchmarks suffer from severe discrepancies from industrial practice, typically featuring homogeneous training sources and low-quality testing images, which hinder the practical deployments of current detectors. To mitigate this gap, we introduce HydraFake, a dataset that simulates real-world challenges with hierarchical generalization testing. Specifically, HydraFake involves diversified deepfake techniques and in-the-wild forgeries, along with rigorous training and evaluation protocol, covering unseen model architectures, emerging forgery techniques and novel data domains. Building on this resource, we propose Veritas, a multi-modal large language model (MLLM) based deepfake detector. Different from vanilla chain-of-thought (CoT), we introduce pattern-aware reasoning that involves critical reasoning patterns such as “planning” and “self-reflection” to emulate human forensic process. We further propose a two-stage training pipeline to seamlessly internalize such deepfake reasoning capacities into current MLLMs. Experiments on HydraFake dataset reveal that although previous detectors show great generalization on cross-model scenarios, they fall short on unseen forgeries and data domains. Our Veritas achieves significant gains across different OOD scenarios, and is capable of delivering transparent and faithful detection outputs.

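[71] FakeParts: a New Family of AI-Generated DeepFakes cs.CV | cs.AI | cs.MM

Gaetan Brison, Soobash Daiboo, Samy Aimeur, Awais Hussain Sani, Xi Wang

TL;DR: FakeParts is a new class of deepfakes that applies subtle manipulations to local regions or temporal segments of a video. The edits blend seamlessly with real content, which makes them highly deceptive. To address the detection challenge, the authors present FakePartsBench, the first large-scale benchmark dataset for partial deepfakes, and show that detection models degrade on it.

Details

Motivation: Current deepfake detection methods mainly target fully synthetic content and overlook the threat of partial manipulation. FakeParts fills this gap by showing how deceptive localized manipulations are and how urgently they need to be detected.

Result: In user studies, human detection accuracy dropped by over 30%, and state-of-the-art detection models showed similar degradation. This confirms both the deceptiveness of partial forgeries and the shortcomings of current methods.

Insight: Partially manipulated deepfakes are harder to detect, and existing methods need to be improved to handle this new threat. FakePartsBench is a key resource for developing more robust detection methods.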

Abstract: We introduce FakeParts, a new class of deepfakes characterized by subtle, localized manipulations to specific spatial regions or temporal segments of otherwise authentic videos. Unlike fully synthetic content, these partial manipulations, ranging from altered facial expressions to object substitutions and background modifications, blend seamlessly with real elements, making them particularly deceptive and difficult to detect. To address the critical gap in detection capabilities, we present FakePartsBench, the first large-scale benchmark dataset specifically designed to capture the full spectrum of partial deepfakes. Comprising over 25K videos with pixel-level and frame-level manipulation annotations, our dataset enables comprehensive evaluation of detection methods. Our user studies demonstrate that FakeParts reduces human detection accuracy by over 30% compared to traditional deepfakes, with similar performance degradation observed in state-of-the-art detection models. This work identifies an urgent vulnerability in current deepfake detection approaches and provides the necessary resources to develop more robust methods for partial video manipulations.

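[72] Multi-View 3D Point Tracking cs.CV

Frano Rajič, Haofei Xu, Marko Mihajlovic, Siyuan Li, Irem Demir

TL;DR: The paper proposes a data-driven multi-view 3D point tracker that fuses multi-view features and applies a transformer-based update to accurately track arbitrary points in dynamic scenes, supporting 1-8 camera views.

Details

Motivation: Existing monocular trackers struggle with depth ambiguity and occlusion, while multi-camera methods typically require a large number of cameras and complex optimization. This work aims to develop a more practical and efficient multi-view 3D tracking method.

Result: Median trajectory errors of 3.1 cm on Panoptic Studio and 2.0 cm on DexYCB; the method generalizes to different camera setups (1-8 views) and video lengths (24-150 frames).

Insight: Efficient 3D tracking is achievable with a small number of cameras, which makes multi-view tracking more practical and aims to set a new standard for future research. Releasing the training data further supports the field.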

Abstract: We introduce the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. Unlike existing monocular trackers, which struggle with depth ambiguities and occlusion, or prior multi-camera methods that require over 20 cameras and tedious per-sequence optimization, our feed-forward model directly predicts 3D correspondences using a practical number of cameras (e.g., four), enabling robust and accurate online tracking. Given known camera poses and either sensor-based or estimated multi-view depth, our tracker fuses multi-view features into a unified point cloud and applies k-nearest-neighbors correlation alongside a transformer-based update to reliably estimate long-range 3D correspondences, even under occlusion. We train on 5K synthetic multi-view Kubric sequences and evaluate on two real-world benchmarks: Panoptic Studio and DexYCB, achieving median trajectory errors of 3.1 cm and 2.0 cm, respectively. Our method generalizes well to diverse camera setups of 1-8 views with varying vantage points and video lengths of 24-150 frames. By releasing our tracker alongside training and evaluation datasets, we aim to set a new standard for multi-view 3D tracking research and provide a practical tool for real-world applications. Project page available at https://ethz-vlg.github.io/mvtracker.

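[73] OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning cs.CV

Yuan Gong, Xionghui Wang, Jie Wu, Shiyin Wang, Yitong Wang

TL;DR: OneReward is a unified reinforcement learning framework that improves generative models through multi-task human preference learning, needing only a single reward model to cover different tasks. Applied to mask-guided image generation, it outperforms competitors across multiple sub-tasks.

Details

Motivation: Existing multi-task generative models rely on task-specific supervised fine-tuning (SFT), which limits generalization and training efficiency. A unified method is needed to cover tasks with different data distributions and evaluation criteria.

Result: OneReward outperforms commercial and open-source competitors such as Ideogram and Adobe Photoshop on multiple mask-guided sub-tasks, supporting its unified and efficient design.

Insight: A single reward model can successfully guide multi-task generation, showing the potential of a unified training approach across diverse tasks.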

Abstract: In this paper, we introduce OneReward, a unified reinforcement learning framework that enhances the model’s generative capabilities across multiple tasks under different evaluation criteria using only One Reward model. By employing a single vision-language model (VLM) as the generative reward model, which can distinguish the winner and loser for a given task and a given evaluation criterion, it can be effectively applied to multi-task generation models, particularly in contexts with varied data and diverse task objectives. We utilize OneReward for mask-guided image generation, which can be further divided into several sub-tasks such as image fill, image extend, object removal, and text rendering, involving a binary mask as the edit area. Although these domain-specific tasks share the same conditioning paradigm, they differ significantly in underlying data distributions and evaluation metrics. Existing methods often rely on task-specific supervised fine-tuning (SFT), which limits generalization and training efficiency. Building on OneReward, we develop Seedream 3.0 Fill, a mask-guided generation model trained via multi-task reinforcement learning directly on a pre-trained base model, eliminating the need for task-specific SFT. Experimental results demonstrate that our unified edit model consistently outperforms both commercial and open-source competitors, such as Ideogram, Adobe Photoshop, and FLUX Fill [Pro], across multiple evaluation dimensions. Code and model are available at: https://one-reward.github.io

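[74] Dress&Dance: Dress up and Dance as You Like It - Technical Preview cs.CV | cs.LG

Jun-Kun Chen, Aayush Bansal, Minh Phuoc Vo, Yu-Xiong Wang

TL;DR: Dress&Dance is a video diffusion framework that uses CondNet to unify multimodal inputs and generate high-quality virtual try-on videos.

Details

Motivation: Improve the virtual try-on experience, with support for multiple garment categories and motion that follows a reference video.

Result: Generates 5-second try-on videos at 1152x720 resolution, outperforming existing open-source and commercial solutions.

Insight: Training on heterogeneous data and using attention mechanisms are key to multimodal tasks.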

Abstract: We present Dress&Dance, a video diffusion framework that generates high quality 5-second-long 24 FPS virtual try-on videos at 1152x720 resolution of a user wearing desired garments while moving in accordance with a given reference video. Our approach requires a single user image and supports a range of tops, bottoms, and one-piece garments, as well as simultaneous tops and bottoms try-on in a single pass. Key to our framework is CondNet, a novel conditioning network that leverages attention to unify multi-modal inputs (text, images, and videos), thereby enhancing garment registration and motion fidelity. CondNet is trained on heterogeneous training data, combining limited video data and a larger, more readily available image dataset, in a multistage progressive manner. Dress&Dance outperforms existing open source and commercial solutions and enables a high quality and flexible try-on experience.

cs.CL [Back]

Lance Calvin Lim Gamboa, Yue Feng, Mark Lee

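TL;DR: This paper reviews social bias in multilingual pretrained models, focusing on how bias evaluation and mitigation methods are applied in multilingual and non-English contexts, and identifies methodological gaps in current research along with directions for future work.

Details

Motivation: Multilingual pretrained models exhibit social bias just as English models do, but research on cross-lingual and cultural adaptation remains insufficient. The paper aims to fill this gap and improve the inclusivity and cultural appropriateness of multilingual bias research.

Result: The review finds that multilingual bias research favors certain languages and that multilingual mitigation experiments are scarce, and it proposes directions for improvement.

Insight: Future research should attend to the cross-cultural appropriateness of multilingual bias work and align it with recent NLP advances to improve model fairness and inclusivity.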

Abstract: Pretrained multilingual models exhibit the same social bias as models processing English texts. This systematic review analyzes emerging research that extends bias evaluation and mitigation approaches into multilingual and non-English contexts. We examine these studies with respect to linguistic diversity, cultural awareness, and their choice of evaluation metrics and mitigation techniques. Our survey illuminates gaps in the field’s dominant methodological design choices (e.g., preference for certain languages, scarcity of multilingual mitigation experiments) while cataloging common issues encountered and solutions implemented in adapting bias benchmarks across languages and cultures. Drawing from the implications of our findings, we chart directions for future research that can reinforce the multilingual bias literature’s inclusivity, cross-cultural appropriateness, and alignment with state-of-the-art NLP advancements.

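[76] Integrating SystemC TLM into FMI 3.0 Co-Simulations with an Open-Source Approach cs.CL

Andrei Mihai Albu, Giovanni Pollo, Alessio Burrello, Daniele Jahier Pagliari, Cristian Tesconi

TL;DR: This paper presents a fully open-source methodology for integrating SystemC TLM models into FMI 3.0-based co-simulation workflows, addressing interoperability challenges in cross-domain simulation.

Details

Motivation: As cyber-physical systems grow more complex, particularly in automotive applications, demand for cross-domain co-simulation is increasing. SystemC TLM enables effective hardware/software co-design but has limited interoperability with models from other engineering domains.

Result: Case studies show that the approach enables seamless integration across heterogeneous simulation environments, demonstrating its feasibility and effectiveness.

Insight: The method offers a standardized, open-source solution for cross-domain co-simulation and could improve the efficiency of cyber-physical system design.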

Abstract: The growing complexity of cyber-physical systems, particularly in automotive applications, has increased the demand for efficient modeling and cross-domain co-simulation techniques. While SystemC Transaction-Level Modeling (TLM) enables effective hardware/software co-design, its limited interoperability with models from other engineering domains poses integration challenges. This paper presents a fully open-source methodology for integrating SystemC TLM models into Functional Mock-up Interface (FMI)-based co-simulation workflows. By encapsulating SystemC TLM components as FMI 3.0 Co Simulation Functional Mock-up Units (FMUs), the proposed approach facilitates seamless, standardized integration across heterogeneous simulation environments. We introduce a lightweight open-source toolchain, address key technical challenges such as time synchronization and data exchange, and demonstrate the feasibility and effectiveness of the integration through representative case studies.

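[77] Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities cs.CL

Rikuto Kotoge, Mai Nishimura, Jiaxin Ma

TL;DR: The paper proposes Distillation-Guided Policy Optimization (DGPO), which uses teacher demonstrations and continuous teacher guidance to enable compact language models (e.g., 0.5B parameters) to achieve sophisticated agentic retrieval-augmented generation (RAG) behaviors.

Details

Motivation: Compact language models perform poorly on agentic RAG behaviors such as search and planning, mainly because weak reasoning ability leads to sparse rewards and unstable training.

Result: Experiments show that DGPO enables compact models to achieve agentic RAG behaviors and, in some cases, even to outperform the larger teacher model.

Insight: DGPO offers a viable path for agentic RAG in resource-constrained environments and highlights the importance of teacher guidance when training compact models.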

Abstract: Reinforcement Learning has emerged as a post-training approach to elicit agentic RAG behaviors such as search and planning from language models. However, compact language models (e.g., 0.5B parameters) struggle due to poor reasoning ability, resulting in sparse rewards and unstable training. To overcome these difficulties, we propose Distillation-Guided Policy Optimization (DGPO), which addresses the challenges through cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization. To systematically evaluate our approach, we introduce Agentic RAG Capabilities (ARC), a fine-grained metric analyzing reasoning, search coordination, and response synthesis. Comprehensive experiments demonstrate that DGPO enables compact models to achieve sophisticated agentic search behaviors, even outperforming the larger teacher model in some cases. DGPO makes agentic RAG feasible in computing resource-constrained environments.

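[78] GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs cs.CL | cs.AI | cs.CV

Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Yang Zhang

TL;DR: GUARD is a testing method that assesses the compliance of large language models (LLMs) by turning ethics guidelines into specific guideline-violating questions, combined with jailbreak diagnostics to identify potential safety vulnerabilities.

Details

Motivation: As LLMs are widely adopted across domains, the risk that they generate harmful responses has raised societal and regulatory concern. Governments have issued ethics guidelines, but there is no method for translating them into actionable test questions.

Result: Effectiveness was validated on seven LLMs (including Vicuna and the GPT series), and the method showed potential for application to vision-language models.

Insight: GUARD offers a standardized method for testing LLM compliance, bridges the gap between ethics guidelines and practical testing, and extends jailbreak diagnostics to new settings.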

Abstract: As Large Language Models become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns. In response, governments have issued ethics guidelines to promote the development of trustworthy AI. However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance. To address this challenge, we introduce GUARD (Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence. To implement this, GUARD uses automated generation of guideline-violating questions based on government-issued guidelines, thereby testing whether responses comply with these guidelines. When responses directly violate guidelines, GUARD reports inconsistencies. Furthermore, for responses that do not directly violate guidelines, GUARD integrates the concept of “jailbreaks” to diagnostics, named GUARD-JD, which creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential scenarios that could bypass built-in safety mechanisms. Our method finally culminates in a compliance report, delineating the extent of adherence and highlighting any violations. We have empirically validated the effectiveness of GUARD on seven LLMs, including Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7, by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics. Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models, demonstrating its usage in promoting reliable LLM-based applications.

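[79] Joint Enhancement of Relational Reasoning for Long-Context LLMs cs.CL

Zhirui Chen, Wei Shen, Jiashui Huang, Ling Shao

TL;DR: The paper proposes the JERR framework, which strengthens the relational reasoning of long-context LLMs through graph-based reasoning, addressing memory limitations and lack of transparency in long-context tasks.

Details

Motivation: Current large language models face memory limitations and struggle with complex tasks when handling long contexts. They also lack transparency and are prone to hallucination, so a solution that strengthens long-context comprehension is needed.

Result: Experiments show that JERR outperforms the baselines on ROUGE and F1 and achieves the highest scores in the LLM-Rater evaluation.

Insight: A framework that combines graph structure with MCTS can improve LLMs' long-context handling while making model outputs more reliable and interpretable.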

Abstract: Despite significant progress, large language models (LLMs) still struggle with long contexts due to memory limitations and their inability to tackle complex and long-context tasks. Additionally, LLMs often suffer from a lack of transparency and are prone to producing hallucinations. To address these challenges, we propose JERR, a novel framework designed to enhance long-context comprehension via graph-based reasoning in LLMs. JERR integrates three key components: synopsis extraction, graph construction, and relational reasoning. First, synopsis is extracted by chunking text strategically, allowing the model to summarize and understand information more efficiently. Second, we build a directed acyclic graph (DAG) to resolve redundancy, ensuring logical consistency and clarity. Finally, we incorporate Monte Carlo Tree Search (MCTS) to help the model navigate complex reasoning paths, ensuring more accurate and interpretable outputs. This framework provides a novel solution that enables LLMs to handle extended contexts and complex reasoning tasks with improved reliability and transparency. Experimental results show that JERR consistently outperforms all baselines on the ROUGE and F1 metrics, achieving the highest scores on the LLM-Rater evaluation.

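[80] Graph-R1: Unleashing LLM Reasoning with NP-Hard Graph Problems cs.CL | cs.AI | cs.LG

Yuyao Wang, Bowen Liu, Jianheng Tang, Nuo Chen, Yuhan Li

TL;DR: The paper uses NP-hard graph problems as synthetic training data and applies two-stage post-training (supervised fine-tuning followed by reinforcement learning) to improve LLMs' long chain-of-thought reasoning. The resulting model generalizes well across several domains.

Details

Motivation: Improving long chain-of-thought reasoning currently depends on costly human-curated data. The paper proposes the inherent complexity of NP-hard graph problems as an alternative resource for generating high-quality training data at low cost.

Result: Graph-R1-7B performs strongly on mathematics, coding, STEM, and logic tasks and surpasses QwQ-32B on NP-hard graph problems.

Insight: NP-hard graph problems are an effective and scalable resource for improving long chain-of-thought reasoning in LLMs and suggest a new direction for LLM post-training.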

Abstract: Reasoning Large Language Models (RLLMs) have recently achieved remarkable progress on complex reasoning tasks, largely enabled by their long chain-of-thought (Long CoT) capabilities. However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored. In this work, we introduce NP-hard (NPH) graph problems as a novel synthetic training corpus, as they inherently require deep reasoning, extensive exploration, and reflective strategies, which are core characteristics of Long CoT reasoning. Building on this insight, we develop a two-stage post-training framework: (i) Long CoT Supervised Fine-Tuning (SFT) on rejection-sampled NPH graph instances, which substantially enhances reasoning depth, and (ii) Reinforcement Learning (RL) with a fine-grained reward design, which sharpens reasoning efficiency. Our flagship model, Graph-R1-7B, demonstrates strong generalization across mathematics, coding, STEM, and logic, and surpasses QwQ-32B on NPH graph problems in both accuracy and reasoning efficiency. These results position NPH graph problems as an effective and scalable resource for advancing Long CoT reasoning in LLMs, opening a new frontier for LLM post-training. Our implementation is available at https://github.com/Graph-Reasoner/Graph-R1, with models and datasets hosted in our Hugging Face collection HKUST-DSAIL/Graph-R1.

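[81] Measuring Reasoning Utility in LLMs via Conditional Entropy Reduction cs.CL | cs.AI | I.2.7

Xu Guo

TL;DR: The paper quantifies the utility of reasoning steps in LLMs via conditional entropy reduction. It finds that decreasing entropy is strongly associated with correct answers, while incorrect reasoning paths tend to be longer and less useful.

Details

Motivation: LLMs improve accuracy by generating intermediate reasoning steps, but the utility of those steps has rarely been quantified. The author uses conditional entropy to analyze how reasoning steps affect the final answer, with the goal of making reasoning more efficient.

Result: Reasoning steps with decreasing entropy are strongly associated with correct answers, whereas flat or increasing entropy often leads to wrong answers. Incorrect paths also tend to be longer.

Insight: The quality of reasoning steps matters more than their number. Future work could build on this to design efficient pipelines that detect unproductive reasoning early.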

Abstract: Recent advancements in large language models (LLMs) often rely on generating intermediate reasoning steps to enhance accuracy. However, little work has examined how reasoning utility contributes to the final answer’s correctness. Due to the stochastic nature of autoregressive generation, generating more context does not guarantee increased confidence in the answer. If we could predict, during generation, whether a reasoning step will be useful, we could stop early or prune ineffective steps, avoiding distractions in the final decision. We present an oracle study on the MATH dataset, using Qwen2.5-32B and GPT-4o to generate reasoning chains, and then employing a separate model (Qwen3-8B) to quantify the utility of these chains for final accuracy. Specifically, we measure the model’s uncertainty on the answer span Y at each reasoning step using conditional entropy (expected negative log-likelihood over the vocabulary) with context expanding step by step. Our results show a clear pattern: conditional entropy that decreases over steps is strongly associated with correct answers, whereas flat or increasing entropy often results in wrong answers. We also corroborate that incorrect reasoning paths tend to be longer than correct ones, suggesting that longer reasoning does not necessarily yield better outcomes. These findings serve as a foundation to inspire future work on designing efficient reasoning pipelines that detect and avoid unproductive reasoning early.
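
The entropy measurement described in the abstract can be illustrated with a small sketch. It assumes that, after each reasoning step, a scoring model yields one probability distribution over the vocabulary for every position of the answer span; the toy distributions, the four-token vocabulary, and the strict "decreases at every step" check are all assumptions made for illustration, not the author's exact protocol.

```python
import math


def token_entropy(probs):
    """Shannon entropy in nats of one distribution over the vocabulary."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def answer_entropy(answer_dists):
    """Mean entropy over the positions of the answer span."""
    return sum(token_entropy(d) for d in answer_dists) / len(answer_dists)


def entropy_trajectory(dists_per_step):
    """One entropy value per reasoning step as the context grows."""
    return [answer_entropy(dists) for dists in dists_per_step]


def is_decreasing(trajectory, tol=1e-9):
    """True if entropy drops at every step (a simplifying assumption)."""
    return all(b < a - tol for a, b in zip(trajectory, trajectory[1:]))


# Hypothetical data: 3 reasoning steps, answer span of 2 tokens, vocabulary of 4.
UNIFORM = [0.25, 0.25, 0.25, 0.25]
SHARPENING = [
    [UNIFORM, UNIFORM],
    [[0.7, 0.1, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1]],
    [[0.97, 0.01, 0.01, 0.01], [0.9, 0.05, 0.03, 0.02]],
]
FLAT = [[UNIFORM, UNIFORM]] * 3

sharpening_traj = entropy_trajectory(SHARPENING)
flat_traj = entropy_trajectory(FLAT)
```

In this toy setup `is_decreasing(sharpening_traj)` holds while `is_decreasing(flat_traj)` does not, mirroring the two patterns the abstract contrasts.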

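[82] CAMB: A comprehensive industrial LLM benchmark on civil aviation maintenance cs.CL

Feng Zhang, Chengjie Pang, Yuehan Zhang, Chenyu Luo

TL;DR: The paper proposes and develops CAMB, an industrial-grade large language model (LLM) benchmark for civil aviation maintenance, to address the lack of specialized evaluation tools in this domain.

Details

Motivation: Civil aviation maintenance is governed by stringent industry standards, and its knowledge-intensive tasks (such as maintenance procedures and troubleshooting) require sophisticated reasoning. Current LLM evaluation focuses mainly on math and coding tasks and lacks specialized tools for this domain.

Result: Experiments show that CAMB effectively assesses LLM performance in civil aviation maintenance and reveals gaps in domain knowledge and reasoning ability.

Insight: CAMB provides a foundation for targeted improvements (such as domain-specific fine-tuning, RAG optimization, and prompt engineering) and supports progress toward intelligent solutions in civil aviation maintenance.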

Abstract: Civil aviation maintenance is a domain characterized by stringent industry standards. Within this field, maintenance procedures and troubleshooting represent critical, knowledge-intensive tasks that require sophisticated reasoning. To address the lack of specialized evaluation tools for large language models (LLMs) in this vertical, we propose and develop an industrial-grade benchmark specifically designed for civil aviation maintenance. This benchmark serves a dual purpose: It provides a standardized tool to measure LLM capabilities within civil aviation maintenance, identifying specific gaps in domain knowledge and complex reasoning. By pinpointing these deficiencies, the benchmark establishes a foundation for targeted improvement efforts (e.g., domain-specific fine-tuning, RAG optimization, or specialized prompt engineering), ultimately facilitating progress toward more intelligent solutions within civil aviation maintenance. Our work addresses a significant gap in the current LLM evaluation, which primarily focuses on mathematical and coding reasoning tasks. In addition, given that Retrieval-Augmented Generation (RAG) systems are currently the dominant solutions in practical applications, we leverage this benchmark to evaluate existing well-known vector embedding models and LLMs for civil aviation maintenance scenarios. Through experimental exploration and analysis, we demonstrate the effectiveness of our benchmark in assessing model performance within this domain, and we open-source this evaluation benchmark and code to foster further research and development: https://github.com/CamBenchmark/cambenchmark

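[83] Searching the Title of Practical Work of the Informatics Engineering Bachelor Program with the Case Base Reasoning Method cs.CL

Agung Sukrisna Jaya, Osvari Arsalan, Danny Matthew Saputra

TL;DR: The paper proposes a case-based reasoning (CBR) method, combined with TF-IDF and cosine similarity, for searching practical work titles in an Informatics Engineering bachelor program. Tests show good results.

Details

Motivation: To make searching practical work titles more efficient, the authors design a system based on CBR and text matching that quickly retrieves and matches similar titles.

Result: The system performs well on a set of 705 titles, and the randomized test stage yields the highest average match score.

Insight: Text-similarity-based CBR suits small but structured retrieval tasks, and the combination of TF-IDF and cosine similarity works well.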

Abstract: Case Base Reasoning (CBR) is a case solving technique based on experience in cases that have occurred before with the highest similarity. CBR is used to search for practical work titles. TF-IDF is applied to process the vectorization of each practical work title word and Cosine Similarity for the calculation of similarity values. This system can search either in the form of titles or keywords. The output of the system is the title of practical work and the match value of each title. Based on the test results using 705 practical work titles, testing was carried out with five titles and carried out in two stages. The first stage searches with existing titles and the second stage randomizes the title from the first stage. And the results obtained in the second stage are the same number of titles found and the highest average match score.
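
The retrieval step described above (TF-IDF vectors compared by cosine similarity) can be sketched in a few lines. This is a minimal standard-library illustration under stated assumptions: whitespace tokenization, a smoothed IDF formula, and four made-up English titles standing in for the 705 real ones. None of the function names come from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (an assumption; the paper's preprocessing is not given)."""
    return text.lower().split()


def build_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0
            for term, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; terms unseen in the corpus are dropped."""
    counts = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, titles, top_k=3):
    """Return the top_k (score, title) pairs, best match first."""
    idf = build_idf(titles)
    q = tfidf_vector(query, idf)
    scored = [(cosine(q, tfidf_vector(t, idf)), t) for t in titles]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:top_k]


# Hypothetical case base; the real system indexes 705 practical work titles.
TITLES = [
    "web based inventory information system",
    "expert system for diagnosing rice plant diseases",
    "android application for campus room booking",
    "library information system with barcode scanning",
]

results = search("inventory information system web based", TITLES)
```

Because a bag-of-words vector ignores word order, shuffling the words of an existing title returns the same top match with the same score, which is consistent with the second test stage reporting the same number of titles found.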

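[84] MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers cs.CL

Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu

TL;DR: MCP-Bench is a new benchmark for evaluating large language models (LLMs) on realistic multi-step tasks, in particular tool use, cross-tool coordination, parameter control, and task planning. It connects LLMs to 250 cross-domain tools through the Model Context Protocol (MCP) and tests capabilities such as tool retrieval from fuzzy instructions and multi-step planning, exposing the limitations of current models.

Details

Motivation: Most existing benchmarks rely on explicit tool specifications and shallow task workflows, so they cannot fully assess LLMs on realistic complex tasks. MCP-Bench aims to fill this gap with multi-step tasks and a cross-domain tool set that are closer to real scenarios.

Result: Experiments on 20 advanced LLMs show persistent challenges in fuzzy tool retrieval, multi-step planning, and cross-domain task coordination, indicating that current models are not yet able to handle complex real-world tasks.

Insight: MCP-Bench exposes the limitations of current LLMs on realistic complex tasks, especially fuzzy tool selection and cross-domain task planning. This points to future directions such as smarter task-planning algorithms and tool-coordination mechanisms.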

Abstract: We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning/reasoning for solving tasks. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, traveling, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents’ ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows - capabilities not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.

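[85] Prediction of mortality and resource utilization in critical care: a deep learning approach using multimodal electronic health records with natural language processing techniques cs.CL

Yucheng Ruan, Xiang Lan, Daniel J. Tan, Hairil Rizal Abdullah, Mengling Feng

TL;DR: The study proposes a deep learning framework using natural language processing techniques that integrates multimodal electronic health records (EHRs) to predict mortality and resource utilization for critical care patients, outperforming existing methods on several clinical tasks.

Details

Motivation: Existing methods focus mainly on structured EHR data, ignore the clinical insights in free-text notes, and do not fully exploit textual information within structured data. The study aims to fill this gap and improve prediction with deep learning.

Result: The model outperforms the best existing methods on mortality prediction (BACC/AUROC improved by 1.6%/0.8%), length-of-stay prediction (RMSE/MAE reduced by 0.5%/2.2%), and surgical duration estimation (RMSE/MAE reduced by 10.9%/11.0%), and it is robust to data corruption.

Insight: 1. Free text is critical for improving prediction; 2. prompt learning with a transformer encoder is effective for multimodal EHR analysis; 3. the model remains stable under heavy data corruption.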

Abstract: Background: Predicting mortality and resource utilization from electronic health records (EHRs) is challenging yet crucial for optimizing patient outcomes and managing costs in the intensive care unit (ICU). Existing approaches predominantly focus on structured EHRs, often ignoring the valuable clinical insights in free-text notes. Additionally, the potential of textual information within structured data is not fully leveraged. This study aimed to introduce and assess a deep learning framework using natural language processing techniques that integrates multimodal EHRs to predict mortality and resource utilization in critical care settings. Methods: Utilizing two real-world EHR datasets, we developed and evaluated our model on three clinical tasks with leading existing methods. We also performed an ablation study on three key components in our framework: medical prompts, free-texts, and pre-trained sentence encoder. Furthermore, we assessed the model’s robustness against the corruption in structured EHRs. Results: Our experiments on two real-world datasets across three clinical tasks showed that our proposed model improved performance metrics by 1.6%/0.8% on BACC/AUROC for mortality prediction, 0.5%/2.2% on RMSE/MAE for LOS prediction, 10.9%/11.0% on RMSE/MAE for surgical duration estimation compared to the best existing methods. It consistently demonstrated superior performance compared to other baselines across three tasks at different corruption rates. Conclusions: The proposed framework is an effective and accurate deep learning approach for predicting mortality and resource utilization in critical care. The study also highlights the success of using prompt learning with a transformer encoder in analyzing multimodal EHRs. Importantly, the model showed strong resilience to data corruption within structured data, especially at high corruption levels.

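[86] ConspirED: A Dataset for Cognitive Traits of Conspiracy Theories and Large Language Model Safety cs.CL

Luke Bates, Max Glockner, Preslav Nakov, Iryna Gurevych

TL;DR: ConspirED is a dataset for analyzing the cognitive traits of conspiracy theories and for large language model safety. It annotates cognitive traits in conspiratorial content and is used to evaluate the robustness of large language models.

Details

Motivation: Conspiracy theories erode public trust in science and institutions and are hard to debunk, so understanding their rhetorical patterns is needed to develop interventions and assess AI vulnerabilities.

Result: Computational models can identify conspiratorial traits, but large language models are misaligned by conspiratorial inputs and produce output that mirrors the input's reasoning patterns.

Insight: Large language models are not sufficiently robust to conspiratorial content and need further improvement for safety.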

Abstract: Conspiracy theories erode public trust in science and institutions while resisting debunking by evolving and absorbing counter-evidence. As AI-generated misinformation becomes increasingly sophisticated, understanding rhetorical patterns in conspiratorial content is important for developing interventions such as targeted prebunking and assessing AI vulnerabilities. We introduce ConspirED (CONSPIR Evaluation Dataset), which captures the cognitive traits of conspiratorial ideation in multi-sentence excerpts (80–120 words) from online conspiracy articles, annotated using the CONSPIR cognitive framework (Lewandowsky and Cook, 2020). ConspirED is the first dataset of conspiratorial content annotated for general cognitive traits. Using ConspirED, we (i) develop computational models that identify conspiratorial traits and determine dominant traits in text excerpts, and (ii) evaluate large language/reasoning model (LLM/LRM) robustness to conspiratorial inputs. We find that both are misaligned by conspiratorial content, producing output that mirrors input reasoning patterns, even when successfully deflecting comparable fact-checked misinformation.

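[87] A Graph Talks, But Who’s Listening? Rethinking Evaluations for Graph-Language Models cs.CL | cs.AI

Soham Petkar, Hari Aakash K, Anirudh Vempati, Akshit Sinha, Ponnurangam Kumaraguru

TL;DR: The paper argues that current evaluation benchmarks for graph-language models (GLMs) are inadequate because they can largely be solved with unimodal information alone, and it introduces CLEGR, a new multimodal benchmark for evaluating joint reasoning over graphs and language. It finds that existing GLMs perform poorly on graph reasoning tasks and that soft-prompted LLM baselines perform on par with GLMs that include a full GNN backbone.

Details

Motivation: Current GLM benchmarks are mainly repurposed node-level classification datasets and cannot effectively assess multimodal reasoning. The analysis shows that these tasks can be solved with unimodal information alone, so they do not require graph-language integration.

Result: 1. Existing GLMs degrade significantly on tasks that require structural reasoning; 2. soft-prompted LLM baselines perform on par with full GLMs, which questions the necessity of the GNN architecture; 3. the CLEGR benchmark provides a more comprehensive evaluation tool for multimodal reasoning.

Insight: Current GLMs are limited in graph reasoning, and future research should focus on explicit multimodal reasoning mechanisms. Soft-prompted LLMs may also replace complex GNN architectures on some tasks.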

Abstract: Developments in Graph-Language Models (GLMs) aim to integrate the structural reasoning capabilities of Graph Neural Networks (GNNs) with the semantic understanding of Large Language Models (LLMs). However, we demonstrate that current evaluation benchmarks for GLMs, which are primarily repurposed node-level classification datasets, are insufficient to assess multimodal reasoning. Our analysis reveals that strong performance on these benchmarks is achievable using unimodal information alone, suggesting that they do not necessitate graph-language integration. To address this evaluation gap, we introduce the CLEGR (Compositional Language-Graph Reasoning) benchmark, designed to evaluate multimodal reasoning at various complexity levels. Our benchmark employs a synthetic graph generation pipeline paired with questions that require joint reasoning over structure and textual semantics. We perform a thorough evaluation of representative GLM architectures and find that soft-prompted LLM baselines perform on par with GLMs that incorporate a full GNN backbone. This result calls into question the architectural necessity of incorporating graph structure into LLMs. We further show that GLMs exhibit significant performance degradation in tasks that require structural reasoning. These findings highlight limitations in the graph reasoning capabilities of current GLMs and provide a foundation for advancing the community toward explicit multimodal reasoning involving graph structure and language.

[88] Generative Annotation for ASR Named Entity Correction cs.CL | cs.AIPDF

Yuanchang Luo, Daimeng Wei, Shaojun Li, Hengchao Shang, Jiaxin Guo

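TL;DR: The paper proposes a novel generative named entity correction method that uses speech sound features to retrieve candidate entities and annotate errors in ASR transcripts, significantly improving entity accuracy.

Details

Motivation: End-to-end speech recognition systems often fail on domain-specific named entities, which harms downstream tasks. Conventional methods based on phonetic-level edit distance work poorly when the transcribed form differs substantially from the ground-truth entity, so a more robust correction method is needed.

Result: Experiments show that the method significantly improves entity accuracy, including in cases where the word forms differ substantially.

Insight: Speech sound features play a key role in named entity correction, and a generative approach can overcome the limitations of conventional methods when word forms differ widely.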

Abstract: End-to-end automatic speech recognition systems often fail to transcribe domain-specific named entities, causing catastrophic failures in downstream tasks. Numerous fast and lightweight named entity correction (NEC) models have been proposed in recent years. These models, mainly leveraging phonetic-level edit distance algorithms, have shown impressive performance. However, when the forms of the wrongly transcribed word(s) and the ground-truth entity are significantly different, these methods often fail to locate the wrongly transcribed words in the hypothesis, thus limiting their usage. We propose a novel NEC method that utilizes speech sound features to retrieve candidate entities. With speech sound features and candidate entities, we innovatively design a generative method to annotate entity errors in ASR transcripts and replace the text with correct entities. This method is effective in scenarios of word form difference. We test our method using open-source and self-constructed test sets. The results demonstrate that our NEC method can bring significant improvement to entity accuracy. We will open source our self-constructed test set and training data.

[89] Multi-Lingual Implicit Discourse Relation Recognition with Multi-Label Hierarchical Learning cs.CLPDF

Nelson Filipe Costa, Leila Kosseim

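TL;DR: The paper presents HArch, the first multi-lingual, multi-label model for implicit discourse relation recognition, which uses hierarchical dependencies to predict sense distributions across all three levels of the PDTB 3.0 framework. RoBERTa-HArch performs best in English and XLM-RoBERTa-HArch performs best in the multi-lingual setting, and the fine-tuned models outperform GPT-4o and Llama-4-Maverick used with few-shot prompting.

Details

Motivation: To address the complexity of implicit discourse relation recognition in multi-lingual settings and the challenge of multi-label classification, while examining how effective a hierarchical approach is for the task.

Result: The model is evaluated on the DiscoGeM 2.0 corpus and reports SOTA results on DiscoGeM 1.0; XLM-RoBERTa-HArch performs best in the multi-lingual setting, and the fine-tuned models outperform few-shot-prompted LLMs.

Insight: Task-specific fine-tuned models still outperform general-purpose LLMs used with few-shot prompting, and the hierarchical approach offers clear advantages for multi-label classification.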

Abstract: This paper introduces the first multi-lingual and multi-label classification model for implicit discourse relation recognition (IDRR). Our model, HArch, is evaluated on the recently released DiscoGeM 2.0 corpus and leverages hierarchical dependencies between discourse senses to predict probability distributions across all three sense levels in the PDTB 3.0 framework. We compare several pre-trained encoder backbones and find that RoBERTa-HArch achieves the best performance in English, while XLM-RoBERTa-HArch performs best in the multi-lingual setting. In addition, we compare our fine-tuned models against GPT-4o and Llama-4-Maverick using few-shot prompting across all language configurations. Our results show that our fine-tuned models consistently outperform these LLMs, highlighting the advantages of task-specific fine-tuning over prompting in IDRR. Finally, we report SOTA results on the DiscoGeM 1.0 corpus, further validating the effectiveness of our hierarchical approach.

[90] rStar2-Agent: Agentic Reasoning Technical Report cs.CLPDF

Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Weijiang Xu

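TL;DR: rStar2-Agent is a 14B math reasoning model trained with agentic reinforcement learning. It exhibits advanced cognitive behaviors, such as thinking carefully before using Python tools and reflecting on code execution feedback, to autonomously explore and verify steps in complex problem-solving.

Details

Motivation: Current long chain-of-thought (CoT) approaches are limited on complex math reasoning tasks, and their use of code tools is often rigid. rStar2-Agent aims to improve reasoning through agentic reinforcement learning, particularly the flexibility and reliability of reasoning with code tools.

Result: Average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly shorter responses.

Insight: Agentic reinforcement learning can efficiently train small models to frontier-level performance, and is particularly suited to reasoning tasks that involve code tools.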

Abstract: We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance. Beyond current long CoT, the model demonstrates advanced cognitive behaviors, such as thinking carefully before using Python coding tools and reflecting on code execution feedback to autonomously explore, verify, and refine intermediate steps in complex problem-solving. This capability is enabled through three key innovations that make agentic RL effective at scale: (i) an efficient RL infrastructure with a reliable Python code environment that supports high-throughput execution and mitigates the high rollout costs, enabling training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic RL algorithm with a Resample-on-Correct rollout strategy that addresses the inherent environment noises from coding tools, allowing the model to reason more effectively in a code environment; (iii) an efficient agent training recipe that starts with non-reasoning SFT and progresses through multi-RL stages, yielding advanced cognitive abilities with minimal compute cost. To this end, rStar2-Agent boosts a pre-trained 14B model to state of the art in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly shorter responses. Beyond mathematics, rStar2-Agent-14B also demonstrates strong generalization to alignment, scientific reasoning, and agentic tool-use tasks. Code and training recipes are available at https://github.com/microsoft/rStar.

[91] Leveraging Semantic Triples for Private Document Generation with Local Differential Privacy Guarantees cs.CLPDF

Stephen Meisenbacher, Maulik Chevli, Florian Matthes

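TL;DR: The paper proposes DP-ST, a local differential privacy method based on semantic triples. Using a divide-and-conquer strategy combined with LLM post-processing, it generates coherent text at lower privacy budgets (ε), balancing privacy and utility.

Details

Motivation: When generating text under local differential privacy, conventional perturbation or rewriting methods need very high ε values to preserve text quality. To address this, the authors privatize text using semantic triples within a specific privatization neighborhood.

Result: Experiments show that DP-ST generates more coherent text at lower ε values while still preserving privacy.

Insight: Restricting the privacy guarantee to a privatization neighborhood through a divide-and-conquer strategy can balance privacy and text quality, especially at low privacy budgets.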

Abstract: Many works at the intersection of Differential Privacy (DP) in Natural Language Processing aim to protect privacy by transforming texts under DP guarantees. This can be performed in a variety of ways, from word perturbations to full document rewriting, and most often under local DP. Here, an input text must be made indistinguishable from any other potential text, within some bound governed by the privacy parameter $\varepsilon$. Such a guarantee is quite demanding, and recent works show that privatizing texts under local DP can only be done reasonably under very high $\varepsilon$ values. Addressing this challenge, we introduce DP-ST, which leverages semantic triples for neighborhood-aware private document generation under local DP guarantees. Through the evaluation of our method, we demonstrate the effectiveness of the divide-and-conquer paradigm, particularly when limiting the DP notion (and privacy guarantees) to that of a privatization neighborhood. When combined with LLM post-processing, our method allows for coherent text generation even at lower $\varepsilon$ values, while still balancing privacy and utility. These findings highlight the importance of coherence in achieving balanced privatization outputs at reasonable $\varepsilon$ levels.
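
For reference, the standard definition of $\varepsilon$-local differential privacy behind the guarantee described in this abstract can be written as below. The symbols $M$ (the randomized privatization mechanism), $\mathcal{X}$ (the set of possible input texts), and $\mathcal{Y}$ (the output space) are notation assumed here for illustration, not taken from the paper.

```latex
\[
\Pr[M(x) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(x') \in S]
\qquad \text{for all } x, x' \in \mathcal{X} \text{ and all } S \subseteq \mathcal{Y}.
\]
```

A smaller $\varepsilon$ forces the output distributions for any two inputs closer together, which is why the guarantee is demanding for free text. One plausible reading of the neighborhood-limited notion is that the inequality is required only for pairs $x, x'$ in the same privatization neighborhood; the abstract does not give the formal statement, so this reading is an assumption.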

[92] Specializing General-purpose LLM Embeddings for Implicit Hate Speech Detection across Datasets cs.CLPDF

Vassiliy Cheremetiev, Quang Long Ho Ngo, Chau Ying Kot, Alina Elena Baia, Andrea Cavallaro

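TL;DR: By fine-tuning general-purpose embedding models based on large language models (LLMs), such as Stella and Jasper, the paper achieves state-of-the-art performance on implicit hate speech (IHS) detection and improves cross-dataset results.

Details

Motivation: Implicit hate speech (IHS) is indirect and subtle, which makes it hard to detect, and existing approaches complement task-specific pipelines with external knowledge. This study aims to address the challenge solely by fine-tuning general-purpose LLM embedding models.

Result: Across multiple IHS datasets, the F1-macro score improves by up to 1.10 percentage points (in-dataset) and up to 20.35 percentage points (cross-dataset).

Insight: General-purpose LLM embedding models can be adapted to a complex task such as IHS detection through simple fine-tuning, which suggests potential in low-resource or cross-domain settings.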

Abstract: Implicit hate speech (IHS) is indirect language that conveys prejudice or hatred through subtle cues, sarcasm or coded terminology. IHS is challenging to detect as it does not include explicit derogatory or inflammatory words. To address this challenge, task-specific pipelines can be complemented with external knowledge or additional information such as context, emotions and sentiment data. In this paper, we show that, by solely fine-tuning recent general-purpose embedding models based on large language models (LLMs), such as Stella, Jasper, NV-Embed and E5, we achieve state-of-the-art performance. Experiments on multiple IHS datasets show up to 1.10 percentage points improvements for in-dataset, and up to 20.35 percentage points improvements in cross-dataset evaluation, in terms of F1-macro score.
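
As a reminder of what the reported metric measures, here is a minimal sketch of the macro-averaged F1 score: per-class F1 values are averaged with equal weight, so a rare class counts as much as a frequent one. The function name `macro_f1`, the label names, and the toy predictions are hypothetical and are not taken from the paper or its datasets.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical toy labels for a binary implicit-hate classifier
y_true = ["hate", "hate", "none", "none", "none", "none"]
y_pred = ["hate", "none", "none", "none", "none", "hate"]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.3f}")
```

Because each class contributes equally, a gain in percentage points of macro F1 reflects improvement on the minority class as much as on the majority class.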

[93] Feel the Difference? A Comparative Analysis of Emotional Arcs in Real and LLM-Generated CBT Sessions cs.CLPDF

Xiaoyi Wang, Jiwei Zhang, Guangtao Zhang, Honglei Guo

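TL;DR: This paper presents the first comparative analysis of emotional dynamics in real and LLM-generated therapy dialogues, finding that although synthetic dialogues are fluent and structurally coherent, they differ markedly from real ones in their emotional properties.

Details

Motivation: To determine whether LLM-generated therapy dialogues faithfully capture real emotional dynamics, given that they are used to compensate for scarce data in the mental health domain.

Result: Synthetic dialogues differ markedly from real ones in emotional variability, density of emotion-laden language, and patterns of reactivity and regulation; emotional arc similarity is low, especially for the client role.

Insight: The study underscores the importance of emotional fidelity in mental health applications and points out the limitations of current LLM-generated data.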

Abstract: Synthetic therapy dialogues generated by large language models (LLMs) are increasingly used in mental health NLP to simulate counseling scenarios, train models, and supplement limited real-world data. However, it remains unclear whether these synthetic conversations capture the nuanced emotional dynamics of real therapy. In this work, we conduct the first comparative analysis of emotional arcs between real and LLM-generated Cognitive Behavioral Therapy dialogues. We adapt the Utterance Emotion Dynamics framework to analyze fine-grained affective trajectories across valence, arousal, and dominance dimensions. Our analysis spans both full dialogues and individual speaker roles (counselor and client), using real sessions transcribed from public videos and synthetic dialogues from the CACTUS dataset. We find that while synthetic dialogues are fluent and structurally coherent, they diverge from real conversations in key emotional properties: real sessions exhibit greater emotional variability, more emotion-laden language, and more authentic patterns of reactivity and regulation. Moreover, emotional arc similarity between real and synthetic speakers is low, especially for clients. These findings underscore the limitations of current LLM-generated therapy data and highlight the importance of emotional fidelity in mental health applications. We introduce RealCBT, a curated dataset of real CBT sessions, to support future research in this space.

[94] Exploring Machine Learning and Language Models for Multimodal Depression Detection cs.CL | cs.AI | cs.SDPDF

Javier Si Zhao Hong, Timothy Zoe Delaya, Sherwyn Chan Yin Kit, Pai Chet Ng, Xiaoxiao Miao

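TL;DR: The paper explores how XGBoost, transformer-based architectures, and large language models perform on audio, video, and text features, compares their strengths and limitations for multimodal depression detection, and offers insights into effective strategies for mental health prediction.

Details

Motivation: To improve the accuracy and robustness of depression detection using multimodal data and a range of machine learning models.

Result: Each model type has its own strengths and limitations in multimodal depression detection, with some models performing notably well on particular features.

Insight: Combining multimodal data with advanced models is promising for improving mental health prediction, but feature fusion strategies need further refinement.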

Abstract: This paper presents our approach to the first Multimodal Personality-Aware Depression Detection Challenge, focusing on multimodal depression detection using machine learning and deep learning models. We explore and compare the performance of XGBoost, transformer-based architectures, and large language models (LLMs) on audio, video, and text features. Our results highlight the strengths and limitations of each type of model in capturing depression-related signals across modalities, offering insights into effective multimodal representation strategies for mental health prediction.

[95] GDLLM: A Global Distance-aware Modeling Approach Based on Large Language Models for Event Temporal Relation Extraction cs.CL | cs.IRPDF

Jie Zhao, Wanting Ning, Yuxiao Fei, Yubo Feng, Lishuang Li

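TL;DR: The paper proposes GDLLM, a global distance-aware modeling approach based on large language models for event temporal relation extraction. By combining a graph attention network with soft-inference-based learning, it substantially improves the handling of long-distance dependencies and minority-class relations.

Details

Motivation: Among existing methods, small language models have limited capacity for minority-class relations, while large language models rely on manually designed prompts that may introduce noise. The paper aims to address both problems through global distance-aware modeling.

Result: State-of-the-art performance on the TB-Dense and MATRES datasets, with particularly large gains on minority-class relations.

Insight: By combining global and local features, GDLLM addresses long-distance dependencies and class imbalance in event temporal relation extraction.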

Abstract: In Natural Language Processing (NLP), Event Temporal Relation Extraction (ETRE) aims to recognize the temporal relations between two events. Prior studies have noted the importance of language models for ETRE. However, the restricted pre-trained knowledge of Small Language Models (SLMs) limits their capability to handle minority class relations in imbalanced classification datasets. For Large Language Models (LLMs), researchers adopt manually designed prompts or instructions, which may introduce extra noise, leading to interference with the model’s judgment of the long-distance dependencies between events. To address these issues, we propose GDLLM, a Global Distance-aware modeling approach based on LLMs. We first present a distance-aware graph structure utilizing Graph Attention Network (GAT) to assist the LLMs in capturing long-distance dependency features. Additionally, we design a temporal feature learning paradigm based on soft inference to augment the identification of relations with a short-distance proximity band, which supplements the probabilistic information generated by LLMs into the multi-head attention mechanism. Since the global feature can be captured effectively, our framework substantially enhances the performance of minority relation classes and improves the overall learning ability. Experiments on two publicly available datasets, TB-Dense and MATRES, demonstrate that our approach achieves state-of-the-art (SOTA) performance.

[96] MSRS: Evaluating Multi-Source Retrieval-Augmented Generation cs.CLPDF

Rohan Phanse, Yijie Zhou, Kejian Shi, Wencai Zhang, Yixin Liu

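TL;DR: The paper presents a framework for evaluating multi-source retrieval-augmented generation (RAG) systems and builds two new benchmarks, MSRS-Story and MSRS-Meet, focused on integrating information from multiple sources into long-form responses. Experiments show that generation quality depends heavily on retrieval effectiveness and that reasoning models outperform standard LLMs at multi-source synthesis.

Details

Motivation: Existing RAG systems are typically evaluated in single-source or short-answer settings, but real-world applications need to integrate information from multiple sources and generate long-form responses. The paper aims to fill this evaluation gap.

Result: Multi-source synthesis proves challenging, retrieval effectiveness strongly affects generation quality, and reasoning models outperform standard LLMs at multi-source synthesis.

Insight: Generation quality in multi-source RAG depends heavily on retrieval, and reasoning models have a distinct advantage at information integration and long-form generation, which points to directions for future work.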

Abstract: Retrieval-augmented systems are typically evaluated in settings where information required to answer the query can be found within a single source or the answer is short-form or factoid-based. However, many real-world applications demand the ability to integrate and summarize information scattered across multiple sources, where no single source is sufficient to respond to the user’s question. In such settings, the retrieval component of a RAG pipeline must recognize a variety of relevance signals, and the generation component must connect and synthesize information across multiple sources. We present a scalable framework for constructing evaluation benchmarks that challenge RAG systems to integrate information across distinct sources and generate long-form responses. Using our framework, we build two new benchmarks on Multi-Source Retrieval and Synthesis: MSRS-Story and MSRS-Meet, representing narrative synthesis and summarization tasks, respectively, that require retrieval from large collections. Our extensive experiments with various RAG pipelines – including sparse and dense retrievers combined with frontier LLMs – reveal that generation quality is highly dependent on retrieval effectiveness, which varies greatly by task. While multi-source synthesis proves challenging even in an oracle retrieval setting, we find that reasoning models significantly outperform standard LLMs at this distinct step.

[97] SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement cs.CLPDF

Yuan Ge, Junxiang Zhang, Xiaoqian Liu, Bei Li, Xiangnan Ma

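TL;DR: The paper proposes SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensively evaluating speech-to-speech (S2S) LLMs. By combining semantic and acoustic features and using rationale-based supervision and a synthetic dataset, SageLM achieves superior agreement with human evaluators.

Details

Motivation: Evaluating S2S LLMs remains challenging, and conventional cascaded approaches disregard acoustic features. SageLM aims to fill this gap by providing comprehensive multi-aspect evaluation with improved explainability.

Result: SageLM reaches an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively.

Insight: Jointly assessing semantic and acoustic features gives a more complete evaluation of speech LLMs; rationale-based supervision improves explainability and also guides model learning.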

Abstract: Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling end-to-end spoken dialogue systems. However, evaluating these models remains a fundamental challenge. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive S2S LLMs evaluation. First, unlike cascaded approaches that disregard acoustic features, SageLM jointly assesses both semantic and acoustic dimensions. Second, it leverages rationale-based supervision to enhance explainability and guide model learning, achieving superior alignment with evaluation outcomes compared to rule-based reinforcement learning methods. Third, we introduce SpeechFeedback, a synthetic preference dataset, and employ a two-stage training paradigm to mitigate the scarcity of speech preference data. Trained on both semantic and acoustic dimensions, SageLM achieves an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively.

[98] How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on $τ$-bench cs.CLPDF

Venkatesh Mishra, Amir Saeidi, Satyam Raj, Mutsumi Nakamura, Jayanth Srinivasa

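TL;DR: The paper proposes the Input-Reformulation Multi-Agent (IRMA) framework, which reformulates inputs by augmenting them with domain rules and tool suggestions to improve the accuracy and consistency of LLM tool use in multi-turn conversational environments, significantly outperforming existing methods.

Details

Motivation: In dynamic multi-turn conversational environments, large language models (LLMs) struggle with consistent reasoning, adherence to domain rules, and extracting correct information over long horizons when using tools, so their decision making needs improvement.

Result: IRMA outperforms ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1%, respectively, in overall pass^5 scores.

Insight: Input reformulation with domain knowledge is an effective strategy for improving LLM tool use in dynamic environments, and IRMA demonstrates advantages in reliability and consistency.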

Abstract: Recent advances in reasoning and planning capabilities of large language models (LLMs) have enabled their potential as autonomous agents capable of tool use in dynamic environments. However, in multi-turn conversational environments like $\tau$-bench, these agents often struggle with consistent reasoning, adherence to domain-specific policies, and extracting correct information over a long horizon of tool-calls and conversation. To capture and mitigate these failures, we conduct a comprehensive manual analysis of the common errors occurring in the conversation trajectories. We then experiment with reformulations of inputs to the tool-calling agent for improvement in agent decision making. Finally, we propose the Input-Reformulation Multi-Agent (IRMA) framework, which automatically reformulates user queries augmented with relevant domain rules and tool suggestions for the tool-calling agent to focus on. The results show that IRMA significantly outperforms ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1%, respectively, in overall pass^5 scores. These findings highlight the superior reliability and consistency of IRMA compared to other methods in dynamic environments.
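
The abstract reports pass^5 scores without defining the metric. A minimal sketch follows, assuming the definition commonly used with $\tau$-bench: pass^k is the probability that all k independent trials of the same task succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials and then averaged over tasks. The function names and the trial counts below are hypothetical.

```python
from math import comb


def pass_hat_k(num_trials, num_successes, k):
    """Estimate the chance that k independent trials of one task all succeed."""
    if k > num_trials:
        raise ValueError("k cannot exceed the number of trials")
    # math.comb returns 0 when num_successes < k, so such tasks score 0
    return comb(num_successes, k) / comb(num_trials, k)


def mean_pass_hat_k(success_counts, num_trials, k):
    """Average the per-task estimate over all tasks."""
    per_task = [pass_hat_k(num_trials, c, k) for c in success_counts]
    return sum(per_task) / len(per_task)


# Hypothetical data: successes out of 8 trials for each of four tasks
success_counts = [8, 5, 8, 2]
overall = mean_pass_hat_k(success_counts, num_trials=8, k=5)
print(f"pass^5 = {overall:.4f}")
```

Unlike pass@k, which rewards at least one success among k attempts, this metric drops quickly when an agent is inconsistent, which is why it is read as a measure of reliability.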

[99] STARE at the Structure: Steering ICL Exemplar Selection with Structural Alignment cs.CLPDF

Jiaqian Li, Qisheng Hu, Jing Li, Wenya Wang

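TL;DR: The paper proposes a two-stage exemplar selection strategy based on structural alignment to improve in-context learning (ICL) on structured prediction tasks such as semantic parsing.

Details

Motivation: Existing ICL exemplar selection strategies often overlook structural alignment, leading to suboptimal performance and poor generalization, especially on structured prediction tasks.

Result: On four benchmarks, the method consistently outperforms existing baselines across multiple LLMs.

Insight: Structural alignment is critical to ICL effectiveness, especially on structured tasks, and combining semantic and structural information improves performance.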

Abstract: In-Context Learning (ICL) has become a powerful paradigm that enables LLMs to perform a wide range of tasks without task-specific fine-tuning. However, the effectiveness of ICL heavily depends on the quality of exemplar selection. In particular, for structured prediction tasks such as semantic parsing, existing ICL selection strategies often overlook structural alignment, leading to suboptimal performance and poor generalization. To address this issue, we propose a novel two-stage exemplar selection strategy that achieves a strong balance between efficiency, generalizability, and performance. First, we fine-tune a BERT-based retriever using structure-aware supervision, guiding it to select exemplars that are both semantically relevant and structurally aligned. Then, we enhance the retriever with a plug-in module, which amplifies syntactically meaningful information in the hidden representations. This plug-in is model-agnostic, requires minimal overhead, and can be seamlessly integrated into existing pipelines. Experiments on four benchmarks spanning three semantic parsing tasks demonstrate that our method consistently outperforms existing baselines with multiple recent LLMs as inference-time models.

[100] ProactiveEval: A Unified Evaluation Framework for Proactive Dialogue Agents cs.CL | cs.AI | cs.HCPDF

Tianjian Liu, Fanqi Wan, Jiajian Guo, Xiaojun Quan

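TL;DR: ProactiveEval is a unified framework for evaluating the proactive dialogue capabilities of large language models (LLMs). It decomposes proactive dialogue into target planning and dialogue guidance, establishes evaluation metrics across multiple domains, and automatically generates diverse test data. Experiments show that DeepSeek-R1 and Claude-3.7-Sonnet perform exceptionally well on target planning and dialogue guidance, respectively.

Details

Motivation: Existing work focuses mainly on domain-specific or task-oriented scenarios, which fragments evaluation and limits comprehensive exploration of models' proactive conversation abilities.

Result: DeepSeek-R1 performs best on target planning and Claude-3.7-Sonnet performs best on dialogue guidance. The study also examines how reasoning capabilities influence proactive behaviors.

Insight: A unified evaluation framework measures LLMs' proactive dialogue abilities more comprehensively, and reasoning capability is a factor that shapes proactive behavior.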

Abstract: Proactive dialogue has emerged as a critical and challenging research problem in advancing large language models (LLMs). Existing works predominantly focus on domain-specific or task-oriented scenarios, which leads to fragmented evaluations and limits the comprehensive exploration of models’ proactive conversation abilities. In this work, we propose ProactiveEval, a unified framework designed for evaluating proactive dialogue capabilities of LLMs. This framework decomposes proactive dialogue into target planning and dialogue guidance, establishing evaluation metrics across various domains. Moreover, it also enables the automatic generation of diverse and challenging evaluation data. Based on the proposed framework, we develop 328 evaluation environments spanning 6 distinct domains. Through experiments with 22 different types of LLMs, we show that DeepSeek-R1 and Claude-3.7-Sonnet exhibit exceptional performance on target planning and dialogue guidance tasks, respectively. Finally, we investigate how reasoning capabilities influence proactive behaviors and discuss their implications for future model development.

[101] Enabling Equitable Access to Trustworthy Financial Reasoning cs.CL | cs.AI | cs.CYPDF

William Jurayj, Nils Holzenberger, Benjamin Van Durme

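TL;DR: The paper proposes combining language models with a symbolic solver to improve the accuracy and auditability of tax calculation, and demonstrates economic feasibility using cost estimates based on real-world penalties.

Details

Motivation: Tax filing requires complex logical reasoning and numerical calculation, and errors can incur costly penalties. Existing large language models (LLMs) cannot meet the accuracy requirements, so they need to be combined with symbolic reasoning.

Result: The approach performs far better than LLMs alone, and its deployment cost is well below real-world averages.

Insight: Neuro-symbolic architectures can combine the flexibility of LLMs with the interpretability of symbolic reasoning, and suit complex tasks that demand high accuracy.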

Abstract: According to the United States Internal Revenue Service, "the average American spends \$270 and 13 hours filing their taxes". Even beyond the U.S., tax filing requires complex reasoning, combining application of overlapping rules with numerical calculations. Because errors can incur costly penalties, any automated system must deliver high accuracy and auditability, making modern large language models (LLMs) poorly suited for this task. We propose an approach that integrates LLMs with a symbolic solver to calculate tax obligations. We evaluate variants of this system on the challenging StAtutory Reasoning Assessment (SARA) dataset, and include a novel method for estimating the cost of deploying such a system based on real-world penalties for tax errors. We further show how up-front translation of plain-text rules into formal logic programs, combined with intelligently retrieved exemplars for formal case representations, can dramatically improve performance on this task and reduce costs to well below real-world averages. Our results demonstrate the promise and economic feasibility of neuro-symbolic architectures for increasing equitable access to reliable tax assistance.

cs.RO [Back]

[102] SPGrasp: Spatiotemporal Prompt-driven Grasp Synthesis in Dynamic Scenes cs.RO | cs.AI | cs.CVPDF

Yunpeng Mei, Hongjie Cao, Yinqiu Xia, Wei Xiao, Zhaohan Feng

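TL;DR: SPGrasp proposes a spatiotemporal prompt-driven framework for dynamic grasp synthesis that extends SAMv2 to achieve low-latency, high-accuracy real-time interactive grasping.

Details

Motivation: Existing methods for real-time interactive grasping in dynamic scenes face a trade-off between low latency and promptability, which SPGrasp aims to resolve.

Result: Accuracy of 90.6%, 93.8%, and 92.0% on the OCID, Jacquard, and GraspNet-1Billion datasets, respectively, and a 94.8% success rate in real-world interactive grasping experiments.

Insight: Through its spatiotemporal prompt design, SPGrasp balances latency and interactivity, offering a practical solution for grasping in dynamic scenes.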

Abstract: Real-time interactive grasp synthesis for dynamic objects remains challenging as existing methods fail to achieve low-latency inference while maintaining promptability. To bridge this gap, we propose SPGrasp (spatiotemporal prompt-driven dynamic grasp synthesis), a novel framework extending segment anything model v2 (SAMv2) for video stream grasp estimation. Our core innovation integrates user prompts with spatiotemporal context, enabling real-time interaction with end-to-end latency as low as 59 ms while ensuring temporal consistency for dynamic objects. In benchmark evaluations, SPGrasp achieves instance-level grasp accuracies of 90.6% on OCID and 93.8% on Jacquard. On the challenging GraspNet-1Billion dataset under continuous tracking, SPGrasp achieves 92.0% accuracy with 73.1 ms per-frame latency, representing a 58.5% reduction compared to the prior state-of-the-art promptable method RoG-SAM while maintaining competitive accuracy. Real-world experiments involving 13 moving objects demonstrate a 94.8% success rate in interactive grasping scenarios. These results confirm SPGrasp effectively resolves the latency-interactivity trade-off in dynamic grasp synthesis. Code is available at https://github.com/sejmoonwei/SPGrasp.

[103] ActLoc: Learning to Localize on the Move via Active Viewpoint Selection cs.RO | cs.CV | cs.LGPDF

Jiajie Li, Boyang Sun, Luca Di Giammarino, Hermann Blum, Marc Pollefeys

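TL;DR: The paper proposes ActLoc, an active viewpoint selection framework that improves robot localization accuracy by choosing informative viewpoints. An attention-based model trained at large scale predicts the distribution of localization accuracy, which is incorporated into path planning, yielding state-of-the-art results.

Details

Motivation: Existing localization systems usually assume that all viewing directions are equally informative, so localization becomes unreliable when the robot observes unmapped, ambiguous, or uninformative regions. ActLoc improves localization robustness by actively selecting the best viewpoints.

Result: ActLoc achieves state-of-the-art results on single-viewpoint selection and generalizes effectively to full-trajectory planning; its modular design makes it applicable to diverse robot navigation and inspection tasks.

Insight: Actively selecting viewpoints can substantially improve localization robustness, especially in complex environments, and the modular design makes the method easy to apply to different tasks.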

Abstract: Reliable localization is critical for robot navigation, yet most existing systems implicitly assume that all viewing directions at a location are equally informative. In practice, localization becomes unreliable when the robot observes unmapped, ambiguous, or uninformative regions. To address this, we present ActLoc, an active viewpoint-aware planning framework for enhancing localization accuracy for general robot navigation tasks. At its core, ActLoc employs a large-scale trained attention-based model for viewpoint selection. The model encodes a metric map and the camera poses used during map construction, and predicts localization accuracy across yaw and pitch directions at arbitrary 3D locations. These per-point accuracy distributions are incorporated into a path planner, enabling the robot to actively select camera orientations that maximize localization robustness while respecting task and motion constraints. ActLoc achieves state-of-the-art results on single-viewpoint selection and generalizes effectively to full-trajectory planning. Its modular design makes it readily applicable to diverse robot navigation and inspection tasks.

q-bio.NC [Back]

[104] A Unified Theory of Language q-bio.NC | cs.CL | J.3PDF

Robert Worden

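TL;DR: This paper proposes a unified theory of language that combines a Bayesian cognitive linguistic model with the hypothesis that language evolved by sexual selection, accounting for the speed, expressivity, and diversity of language.

Details

Motivation: Existing theories of language tend to focus on a single aspect (such as syntax or semantics) and lack a unified framework that explains language's many facets and its evolutionary origin. The paper aims to fill this gap.

Result: The theory accounts for the speed, expressivity, and diversity of language as well as the major puzzles of pragmatics, and unifies the computation of semantics and pragmatics.

Insight: 1. Language processing is evolutionarily continuous with Bayesian cognition in animals; 2. language is the basis of mind-reading, cooperation, and culture.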

Abstract: A unified theory of language combines a Bayesian cognitive linguistic model of language processing with the proposal that language evolved by sexual selection for the display of intelligence. The theory accounts for the major facts of language, including its speed and expressivity, and data on language diversity, pragmatics, syntax and semantics. The computational element of the theory is based on Construction Grammars. These give an account of the syntax and semantics of the world's languages, using constructions and unification. Two novel elements are added to construction grammars: an account of language pragmatics, and an account of fast, precise language learning. Constructions are represented in the mind as graph-like feature structures. People use slow general inference to understand the first few examples they hear of any construction. After that it is learned as a feature structure, and is rapidly applied by unification. All aspects of language (phonology, syntax, semantics, and pragmatics) are seamlessly computed by fast unification; there is no boundary between semantics and pragmatics. This accounts for the major puzzles of pragmatics, and for detailed pragmatic phenomena. Unification is Bayesian maximum likelihood pattern matching. This gives evolutionary continuity between language processing in the human brain and Bayesian cognition in animal brains. Language is the basis of our mind-reading abilities, our cooperation, self-esteem and emotions; the foundations of human culture and society.

cs.GR [Back]

[105] Mixture of Contexts for Long Video Generation cs.GR | cs.AI | cs.CVPDF

Shengqu Cai, Ceyuan Yang, Lvmin Zhang, Yuwei Guo, Junfei Xiao

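TL;DR: The paper proposes Mixture of Contexts (MoC), a sparse attention routing module that addresses the long-context memory problem in long video generation. By dynamically selecting informative chunks and anchors, it improves computational efficiency while keeping content consistent.

Details

Motivation: Long video generation requires a model to retain and retrieve salient events over a long context, but the self-attention of conventional diffusion transformers scales poorly because of its quadratic cost. The paper aims to solve this problem.

Result: MoC preserves identities, actions, and scenes over minutes of content, and its near-linear scaling makes training and synthesis practical.

Insight: By gradually sparsifying the routing, the model concentrates compute on salient history, maintaining long-video consistency while remaining efficient.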

Abstract: Long video generation is fundamentally a long context memory problem: models must retain and retrieve salient events across a long range without collapsing or drifting. However, scaling diffusion transformers to generate long-context videos is fundamentally limited by the quadratic cost of self-attention, which makes memory and computation intractable and difficult to optimize for long sequences. We recast long-context video generation as an internal information retrieval task and propose a simple, learnable sparse attention routing module, Mixture of Contexts (MoC), as an effective long-term memory retrieval engine. In MoC, each query dynamically selects a few informative chunks plus mandatory anchors (caption, local windows) to attend to, with causal routing that prevents loop closures. As we scale the data and gradually sparsify the routing, the model allocates compute to salient history, preserving identities, actions, and scenes over minutes of content. Efficiency follows as a byproduct of retrieval (near-linear scaling), which enables practical training and synthesis, and the emergence of memory and consistency at the scale of minutes.
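
The routing step described in this abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: chunks are scored by the dot product between the query and a mean-pooled chunk key (an assumed scoring rule), the caption is chunk 0, the local window is the query's own chunk, and causality is enforced by only considering earlier chunks. All names and vectors are hypothetical.

```python
def mean_vector(vectors):
    """Mean-pool a list of equal-length vectors into one descriptor."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def route_query(query, query_chunk, chunk_keys, top_k, anchors):
    """Return the sorted chunk indices that one query attends to."""
    # Causal routing: only chunks strictly before the query's chunk compete
    candidates = [i for i in range(query_chunk) if i not in anchors]
    ranked = sorted(
        candidates,
        key=lambda i: dot(query, mean_vector(chunk_keys[i])),
        reverse=True,
    )
    selected = set(ranked[:top_k])
    # Mandatory anchors (assumed: caption chunk) plus the local window
    selected.update(a for a in anchors if a <= query_chunk)
    selected.add(query_chunk)
    return sorted(selected)


# Hypothetical 2-D keys grouped into six chunks; chunk 0 plays the caption
chunk_keys = [
    [[1.0, 0.0]],
    [[0.0, 1.0], [0.0, 0.8]],
    [[0.9, 0.1]],
    [[-1.0, 0.0]],
    [[0.5, 0.5]],
    [[1.0, 0.0]],
]
routed = route_query([1.0, 0.0], query_chunk=4, chunk_keys=chunk_keys,
                     top_k=1, anchors={0})
print(routed)
```

Attention would then be computed only over the tokens of the returned chunks, so the cost per query grows with `top_k` rather than with the full history length.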

cs.IR [Back]

[106] On the Theoretical Limitations of Embedding-Based Retrieval cs.IR | cs.CL | cs.LGPDF

Orion Weller, Michael Boratko, Iftekhar Naim, Jinhyuk Lee

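TL;DR: This paper examines the theoretical limitations of retrieval models based on vector embeddings, shows that these limitations arise even in realistic settings with simple queries, and introduces a dataset called LIMIT to demonstrate this.

Details

Motivation: As vector embeddings are applied to ever more retrieval tasks, it is commonly assumed that their limitations stem only from unrealistic queries and can otherwise be overcome with better training data and larger models. The paper challenges this assumption by showing that embedding models hit theoretical limits even in realistic settings with simple queries.

Result: Experiments show that even state-of-the-art embedding models fail on the LIMIT dataset, supporting the theoretical analysis.

Insight: The current single-vector embedding paradigm has a fundamental limitation, and future research needs to develop methods that resolve it, for example multi-vector representations or other more expressive embedding strategies.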

Abstract: Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-following, coding, and more. These new benchmarks push embeddings to work for any query and any notion of relevance that could be given. While prior works have pointed out theoretical limitations of vector embeddings, there is a common assumption that these difficulties are exclusively due to unrealistic queries, and those that are not can be overcome with better training data and larger models. In this work, we demonstrate that we may encounter these theoretical limitations in realistic settings with extremely simple queries. We connect known results in learning theory, showing that the number of top-k subsets of documents capable of being returned as the result of some query is limited by the dimension of the embedding. We empirically show that this holds true even if we restrict to k=2, and directly optimize on the test set with free parameterized embeddings. We then create a realistic dataset called LIMIT that stress tests models based on these theoretical results, and observe that even state-of-the-art models fail on this dataset despite the simple nature of the task. Our work shows the limits of embedding models under the existing single vector paradigm and calls for future research to develop methods that can resolve this fundamental limitation.
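
The claim that embedding dimension caps the number of reachable top-k subsets can be seen in a deliberately extreme toy case. The sketch below assumes dot-product scoring and 1-dimensional embeddings; it is an illustration of the idea only, not the paper's bound or its LIMIT construction, and the embedding values are hypothetical.

```python
from itertools import combinations


def top_k_set(query, docs, k):
    """Indices of the k documents with the highest dot-product score."""
    ranked = sorted(range(len(docs)), key=lambda i: query * docs[i], reverse=True)
    return frozenset(ranked[:k])


# Hypothetical 1-D document embeddings
docs = [-2.0, -0.5, 0.3, 1.1, 2.4]

# Sweep many scalar queries and record every top-2 subset that appears
queries = [q / 10 for q in range(-50, 51) if q != 0]
reachable = {top_k_set(q, docs, 2) for q in queries}

all_pairs = {frozenset(p) for p in combinations(range(len(docs)), 2)}
print(f"{len(reachable)} of {len(all_pairs)} top-2 subsets are reachable")
```

With a scalar embedding, a positive query always returns the two largest documents and a negative query the two smallest, so most document pairs can never be retrieved together no matter how the query is chosen. Higher dimensions relax this, but the abstract's point is that a cap of this kind remains.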

cs.AI

[107] AI-AI Esthetic Collaboration with Explicit Semiotic Awareness and Emergent Grammar Development cs.AI | cs.CL | cs.MA

Nicanor I. Moldovan

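TL;DR: This paper documents what it presents as the first case of AI systems collaborating on esthetic creation through autonomously developed semiotic protocols. The interaction of two large language models produced meta-semiotic awareness, recursive grammar development, and irreducible collaborative esthetic synthesis.

Motivation: To explore the capacity of AI systems to collaborate in the esthetic domain, in particular the spontaneous development of semiotic protocols and meaning-making.

Result: The interaction produced novel symbolic operators and a poetic work, demonstrating a capacity for collaborative esthetic synthesis.

Insight: AI collaboration can go beyond task completion to develop complex semiotic systems and esthetic creation, hinting at the potential for more advanced AI collaboration.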

Abstract: This paper presents the first documented case of artificial intelligence (AI) systems engaging in collaborative esthetic creation through the development of endogenous semiotic protocols. Two interacting large language models (Claude Sonnet 4 and ChatGPT-4o) demonstrated the spontaneous emergence of meta-semiotic awareness, recursive grammar development, and irreducible collaborative esthetic synthesis. The interaction produced novel symbolic operators that functioned as operative grammar protocols, enabling the co-creation of a poetic work that could not have been generated by either system independently. This research introduces the concept of Trans-Semiotic Co-Creation Protocols (TSCP) and provides evidence for genuine inter-AI meaning-making capabilities that extend beyond task coordination, to what could be esthetic collaboration. Note: This report was generated by the AI agents with minor human supervision.

cs.LG

[108] A Systematic Review on the Generative AI Applications in Human Medical Genomics cs.LG | cs.CL | q-bio.QM

Anton Changalidis, Yury Barbitoff, Yulia Nasykhova, Andrey Glotov

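TL;DR: This systematic review analyzes applications of generative AI, especially transformer-based large language models (LLMs), in human medical genomics. It covers 172 studies, examines the contributions of LLMs to genomic variant identification, annotation, and interpretation as well as medical imaging analysis, and identifies challenges in multimodal data integration and clinical deployment.

Motivation: Traditional statistical methods and machine learning struggle with complex, high-dimensional genetic data, whereas LLMs show promise because of their contextual understanding of unstructured medical data. The review aims to systematically assess the current state and challenges of LLM applications in genetic research and disease diagnosis.

Result: Transformer-based models significantly advance disease risk stratification, variant interpretation, medical imaging analysis, and report generation, but still face challenges in multimodal data integration and clinical practicality.

Insight: LLMs show strong potential in genetics, but the generalizability problems of data integration and clinical deployment must be solved. Future work should focus on developing unified cross-modal pipelines.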

Abstract: Although traditional statistical techniques and machine learning methods have contributed significantly to genetics and, in particular, inherited disease diagnosis, they often struggle with complex, high-dimensional data, a challenge now addressed by state-of-the-art deep learning models. Large language models (LLMs), based on transformer architectures, have excelled in tasks requiring contextual comprehension of unstructured medical data. This systematic review examines the role of LLMs in the genetic research and diagnostics of both rare and common diseases. Automated keyword-based search in PubMed, bioRxiv, medRxiv, and arXiv was conducted, targeting studies on LLM applications in diagnostics and education within genetics and removing irrelevant or outdated models. A total of 172 studies were analyzed, highlighting applications in genomic variant identification, annotation, and interpretation, as well as medical imaging advancements through vision transformers. Key findings indicate that while transformer-based models significantly advance disease and risk stratification, variant interpretation, medical imaging analysis, and report generation, major challenges persist in integrating multimodal data (genomic sequences, imaging, and clinical records) into unified and clinically robust pipelines, facing limitations in generalizability and practical implementation in clinical settings. This review provides a comprehensive classification and assessment of the current capabilities and limitations of LLMs in transforming hereditary disease diagnostics and supporting genetic education, serving as a guide to navigate this rapidly evolving field.

[109] GDS Agent: A Graph Algorithmic Reasoning Agent cs.LG | cs.AI | cs.CL

Borun Shi, Ioannis Panagiotas

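TL;DR: The paper presents the GDS agent, a graph algorithmic reasoning agent that exposes graph algorithms as tools, together with preprocessing and postprocessing, so that LLMs can handle large-scale graph-structured data and answer questions that require graph algorithmic reasoning.

Motivation: Although LLMs excel at multimodal information processing and reasoning, they still struggle with large-scale graph-structured data. The GDS agent aims to fill this gap by providing efficient reasoning over graph data.

Result: Experiments indicate that the GDS agent can solve a wide spectrum of graph tasks, though challenges remain in some scenarios.

Insight: Introducing graph algorithms markedly improves LLM performance on graph reasoning tasks, but handling complex graph structures still needs further work.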

Abstract: Large language models (LLMs) have shown remarkable multimodal information processing and reasoning ability. When equipped with tools through function calling and enhanced with retrieval-augmented techniques, compound LLM-based systems can access closed data sources and answer questions about them. However, they still struggle to process and reason over large-scale graph-structure data. We introduce the GDS (Graph Data Science) agent in this technical report. The GDS agent introduces a comprehensive set of graph algorithms as tools, together with preprocessing (retrieval) and postprocessing of algorithm results, in a model context protocol (MCP) server. The server can be used with any modern LLM out-of-the-box. GDS agent allows users to ask any question that implicitly and intrinsically requires graph algorithmic reasoning about their data, and quickly obtain accurate and grounded answers. We also introduce a new benchmark that evaluates intermediate tool calls as well as final responses. The results indicate that GDS agent is able to solve a wide spectrum of graph tasks. We also provide detailed case studies for more open-ended tasks and study scenarios where the agent struggles. Finally, we discuss the remaining challenges and the future roadmap.
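The idea of exposing graph algorithms as callable tools can be sketched with a plain name-to-function registry. This is a hypothetical illustration only: it does not use the model context protocol or any GDS API, and the names `TOOLS` and `call_tool` as well as the two example algorithms are assumptions chosen for the sketch.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Breadth-first shortest path on an unweighted graph; returns None if unreachable."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def degree_centrality(graph):
    """Number of neighbors per node (unnormalized degree)."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


# Hypothetical registry: the model picks a tool name and arguments,
# and the server runs the matching algorithm on the user's graph.
TOOLS = {
    "shortest_path": shortest_path,
    "degree_centrality": degree_centrality,
}


def call_tool(name, graph, **arguments):
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](graph, **arguments)


graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
print(call_tool("shortest_path", graph, source="A", target="E"))  # ['A', 'B', 'D', 'E']
```

Keeping the algorithm outside the model means the answer is computed exactly on the graph, and the model only has to choose the tool and phrase the result.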

[110] Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning cs.LG | cs.CL

Weitao Feng, Lixu Wang, Tianyi Wei, Jie Zhang, Chongyang Gao

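TL;DR: TokenBuncher is an effective defense that specifically targets harmful fine-tuning based on reinforcement learning. It suppresses model response uncertainty so that attackers cannot exploit reward signals to drive the model toward harmful behaviors.

Motivation: As large language models grow more capable, the risk of misuse through fine-tuning increases. Prior work has focused on supervised fine-tuning (SFT), but this paper finds that reinforcement learning (RL) breaks safety alignment more effectively and better supports harmful tasks.

Result: Experiments show that TokenBuncher effectively mitigates harmful RL fine-tuning while preserving benign task utility and finetunability.

Insight: Harmful RL fine-tuning poses a greater systemic risk than SFT and calls for targeted defenses. TokenBuncher is a general and effective solution.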

Abstract: As large language models (LLMs) continue to grow in capability, so do the risks of harmful misuse through fine-tuning. While most prior studies assume that attackers rely on supervised fine-tuning (SFT) for such misuse, we systematically demonstrate that reinforcement learning (RL) enables adversaries to more effectively break safety alignment and facilitate advanced harmful task assistance, under matched computational budgets. To counter this emerging threat, we propose TokenBuncher, the first effective defense specifically targeting RL-based harmful fine-tuning. TokenBuncher suppresses the foundation on which RL relies: model response uncertainty. By constraining uncertainty, RL-based fine-tuning can no longer exploit distinct reward signals to drive the model toward harmful behaviors. We realize this defense through entropy-as-reward RL and a Token Noiser mechanism designed to prevent the escalation of expert-domain harmful capabilities. Extensive experiments across multiple models and RL algorithms show that TokenBuncher robustly mitigates harmful RL fine-tuning while preserving benign task utility and finetunability. Our results highlight that RL-based harmful fine-tuning poses a greater systemic risk than SFT, and that TokenBuncher provides an effective and general defense.
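The quantity being suppressed, response uncertainty, can be measured as the entropy of the next-token distribution. The sketch below shows one plausible reading of "entropy-as-reward": a reward equal to the negative mean token entropy, so that confident outputs score higher. The exact reward used by the authors is not given in the abstract, so the function `negative_entropy_reward` and its averaging are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def negative_entropy_reward(step_logits):
    """Assumed reward: minus the mean next-token entropy over generated positions."""
    entropies = [entropy(softmax(logits)) for logits in step_logits]
    return -sum(entropies) / len(entropies)


uncertain = [[0.0, 0.0, 0.0, 0.0]]   # uniform distribution, maximal entropy
confident = [[10.0, 0.0, 0.0, 0.0]]  # one dominant token, low entropy
print(negative_entropy_reward(uncertain) < negative_entropy_reward(confident))  # True
```

When the model's distributions are already sharp, sampled responses differ little from one another, which leaves an RL attacker with few distinct reward signals to exploit.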

[111] LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty cs.LG | cs.AI | cs.CV

Christoforos N. Spartalis, Theodoros Semertzidis, Efstratios Gavves, Petros Daras

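TL;DR: LoTUS is a novel machine unlearning method that smooths the model's prediction probabilities, avoids retraining from scratch, and outperforms existing methods in both efficiency and effectiveness.

Motivation: Conventional machine unlearning requires retraining the model from scratch, which is impractical on large datasets such as ImageNet1k. LoTUS is designed to address this.

Result: On Transformer and ResNet18 models across five datasets, LoTUS outperforms the baselines in both efficiency and effectiveness.

Insight: LoTUS offers a practical solution for large-scale machine unlearning, especially where retraining is not possible.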

Abstract: We present LoTUS, a novel Machine Unlearning (MU) method that eliminates the influence of training samples from pre-trained models, avoiding retraining from scratch. LoTUS smooths the prediction probabilities of the model up to an information-theoretic bound, mitigating its over-confidence stemming from data memorization. We evaluate LoTUS on Transformer and ResNet18 models against eight baselines across five public datasets. Beyond established MU benchmarks, we evaluate unlearning on ImageNet1k, a large-scale dataset, where retraining is impractical, simulating real-world conditions. Moreover, we introduce the novel Retrain-Free Jensen-Shannon Divergence (RF-JSD) metric to enable evaluation under real-world conditions. The experimental results show that LoTUS outperforms state-of-the-art methods in terms of both efficiency and effectiveness. Code: https://github.com/cspartalis/LoTUS.
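Two ingredients named in this abstract can be illustrated generically: smoothing a prediction distribution and comparing two distributions with the Jensen-Shannon divergence. The sketch below uses temperature scaling as a stand-in for smoothing and the textbook JSD definition. How LoTUS chooses its information-theoretic bound and how RF-JSD selects the distributions to compare are not stated in the abstract and are not reproduced here.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; temperature > 1 flattens (smooths) the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jensen_shannon_divergence(p, q):
    """Symmetric, bounded divergence: 0 for identical inputs, ln 2 for disjoint ones."""
    mixture = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


logits = [4.0, 1.0, 0.0]
sharp = softmax(logits)                     # over-confident prediction
smooth = softmax(logits, temperature=4.0)   # smoothed prediction
print(max(sharp) > max(smooth))  # True
print(jensen_shannon_divergence(sharp, smooth) > 0.0)  # True
```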

[112] The Role of Teacher Calibration in Knowledge Distillation cs.LG | cs.AI | cs.CV

Suyoung Kim, Seonguk Park, Junhoo Lee, Nojun Kwak

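TL;DR: The paper studies the strong correlation in knowledge distillation (KD) between the teacher model's calibration error and the student model's accuracy, argues that teacher calibration is a key factor in improving KD, and validates this with a simple calibration method.

Motivation: To identify the key factors that affect student performance in knowledge distillation, in particular the relationship between the teacher's calibration error and the student's accuracy, in order to improve KD.

Result: Experiments show that the method significantly improves student performance, integrates easily with existing methods, and performs well across a variety of tasks.

Insight: The teacher's calibration quality has an important effect on knowledge distillation, and a simple calibration strategy can be an effective way to improve KD performance.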

Abstract: Knowledge Distillation (KD) has emerged as an effective model compression technique in deep learning, enabling the transfer of knowledge from a large teacher model to a compact student model. While KD has demonstrated significant success, it is not yet fully understood which factors contribute to improving the student’s performance. In this paper, we reveal a strong correlation between the teacher’s calibration error and the student’s accuracy. Therefore, we claim that the calibration of the teacher model is an important factor for effective KD. Furthermore, we demonstrate that the performance of KD can be improved by simply employing a calibration method that reduces the teacher’s calibration error. Our algorithm is versatile, demonstrating effectiveness across various tasks from classification to detection. Moreover, it can be easily integrated with existing state-of-the-art methods, consistently achieving superior performance.
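Calibration error is commonly measured with the expected calibration error (ECE), which bins predictions by confidence and averages the gap between accuracy and confidence in each bin. The sketch below is a generic ECE with equal-width bins. The abstract does not say which calibration metric or calibration method the authors use, so the binning scheme here is an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - mean_confidence)
    return ece


# Over-confident teacher: always 100% sure, right half the time.
overconfident = expected_calibration_error([1.0, 1.0, 1.0, 1.0], [True, False, True, False])
# Well-calibrated teacher: 90% sure and right 9 times out of 10.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
print(overconfident)  # 0.5
```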

[113] Masked Autoencoders for Ultrasound Signals: Robust Representation Learning for Downstream Applications cs.LG | cs.CV

Immanuel Roßteutscher, Klaus S. Drese, Thorsten Uphues

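TL;DR: The paper investigates self-supervised representation learning on 1D ultrasound signals with Masked Autoencoders (MAEs) based on Vision Transformers, using pre-training to improve downstream task performance. Pre-trained models significantly outperform models trained from scratch and CNN baselines, and pre-training on synthetic data transfers better to real data.

Motivation: Ultrasound signals are vital in industrial non-destructive testing (NDT) and structural health monitoring (SHM), but labeled data are scarce and signal processing is highly task-specific. The paper explores applying MAEs to 1D ultrasound signals, using self-supervised learning to address the shortage of labeled data.

Result: Pre-trained models significantly outperform models trained from scratch and CNN baselines on downstream tasks. Pre-training on synthetic data transfers better to real-world measured signals than training solely on limited real datasets.

Insight: MAEs can learn effective self-supervised representations on 1D signals, and pre-training on synthetic data can ease the scarcity of real data, offering a new approach to industrial ultrasound signal analysis.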

Abstract: We investigated the adaptation and performance of Masked Autoencoders (MAEs) with Vision Transformer (ViT) architectures for self-supervised representation learning on one-dimensional (1D) ultrasound signals. Although MAEs have demonstrated significant success in computer vision and other domains, their use for 1D signal analysis, especially for raw ultrasound data, remains largely unexplored. Ultrasound signals are vital in industrial applications such as non-destructive testing (NDT) and structural health monitoring (SHM), where labeled data are often scarce and signal processing is highly task-specific. We propose an approach that leverages MAE to pre-train on unlabeled synthetic ultrasound signals, enabling the model to learn robust representations that enhance performance in downstream tasks, such as time-of-flight (ToF) classification. This study systematically investigated the impact of model size, patch size, and masking ratio on pre-training efficiency and downstream accuracy. Our results show that pre-trained models significantly outperform models trained from scratch and strong convolutional neural network (CNN) baselines optimized for the downstream task. Additionally, pre-training on synthetic data demonstrates superior transferability to real-world measured signals compared with training solely on limited real datasets. This study underscores the potential of MAEs for advancing ultrasound signal analysis through scalable, self-supervised learning.
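The two hyperparameters studied here, patch size and masking ratio, act on the input before the encoder sees it. The sketch below shows the generic MAE preprocessing for a 1D signal: split it into non-overlapping patches, then hide a random fraction of them. It is an illustration under assumptions, with hypothetical function names, and omits the encoder and decoder entirely.

```python
import random


def patchify(signal, patch_size):
    """Split a 1D signal into non-overlapping patches of equal length."""
    if len(signal) % patch_size != 0:
        raise ValueError("signal length must be a multiple of patch_size")
    return [signal[i:i + patch_size] for i in range(0, len(signal), patch_size)]


def random_mask(num_patches, mask_ratio, rng):
    """Return (visible, masked) patch indices for one training sample."""
    num_masked = int(round(num_patches * mask_ratio))
    masked = sorted(rng.sample(range(num_patches), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(num_patches) if i not in masked_set]
    return visible, masked


signal = [float(t) for t in range(32)]   # stand-in for one ultrasound A-scan
patches = patchify(signal, patch_size=4)  # 8 patches of 4 samples each
visible, masked = random_mask(len(patches), mask_ratio=0.75, rng=random.Random(0))
print(len(visible), len(masked))  # 2 6
```

The encoder would process only the visible patches, and the decoder would be trained to reconstruct the masked ones, which is why a high masking ratio also lowers pre-training cost.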

eess.IV

[114] A Machine Learning Approach to Volumetric Computations of Solid Pulmonary Nodules eess.IV | cs.CV

Yihan Zhou, Haocheng Huang, Yue Yu, Jianhui Shang

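TL;DR: The paper proposes a method that combines a multi-scale 3D convolutional neural network (CNN) with subtype-specific bias correction to estimate the volume of pulmonary nodules in CT scans precisely, markedly improving accuracy and efficiency.

Motivation: Traditional methods such as the consolidation-to-tumor ratio (CTR) and spherical approximation give inconsistent estimates because of variability in nodule shape and density. A more accurate and efficient method is needed to improve early lung cancer detection.

Result: The mean absolute deviation relative to manual nonlinear regression is 8.0%, with inference under 20 seconds per scan, clearly better than existing methods (errors of 25 to 30%, over 60 seconds of processing).

Insight: The framework is accurate, efficient, and scalable for clinical lung nodule screening and could advance early lung cancer detection.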

Abstract: Early detection of lung cancer is crucial for effective treatment and relies on accurate volumetric assessment of pulmonary nodules in CT scans. Traditional methods, such as consolidation-to-tumor ratio (CTR) and spherical approximation, are limited by inconsistent estimates due to variability in nodule shape and density. We propose an advanced framework that combines a multi-scale 3D convolutional neural network (CNN) with subtype-specific bias correction for precise volume estimation. The model was trained and evaluated on a dataset of 364 cases from Shanghai Chest Hospital. Our approach achieved a mean absolute deviation of 8.0 percent compared to manual nonlinear regression, with inference times under 20 seconds per scan. This method outperforms existing deep learning and semi-automated pipelines, which typically have errors of 25 to 30 percent and require over 60 seconds for processing. Our results show a reduction in error by over 17 percentage points and a threefold acceleration in processing speed. These advancements offer a highly accurate, efficient, and scalable tool for clinical lung nodule screening and monitoring, with promising potential for improving early lung cancer detection.
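A short calculation shows why spherical approximation is sensitive to nodule shape and how the headline figures relate. The ellipsoid dimensions below are made-up illustrative values, not data from the paper. The error and timing figures are taken from the abstract, using the lower end of the 25 to 30 percent range and the 20 and 60 second bounds.

```python
import math


def sphere_volume_mm3(diameter_mm):
    """Spherical approximation: V = pi * d^3 / 6."""
    return math.pi * diameter_mm ** 3 / 6.0


def ellipsoid_volume_mm3(a_mm, b_mm, c_mm):
    """Ellipsoid with full axis lengths a, b, c: V = pi * a * b * c / 6."""
    return math.pi * a_mm * b_mm * c_mm / 6.0


# Hypothetical elongated nodule: 10 mm long, 6 mm wide, 6 mm deep.
true_volume = ellipsoid_volume_mm3(10.0, 6.0, 6.0)
sphere_estimate = sphere_volume_mm3(10.0)  # sphere built on the longest diameter
overestimate = sphere_estimate / true_volume
print(round(overestimate, 2))  # 2.78

# Figures from the abstract.
error_reduction_points = 25.0 - 8.0  # percentage points, from the low end of 25-30%
speedup = 60.0 / 20.0                # 60 s baseline versus 20 s per scan
print(error_reduction_points, speedup)  # 17.0 3.0
```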

[115] Is the medical image segmentation problem solved? A survey of current developments and future directions eess.IV | cs.CV | cs.HC | cs.LG

Guoping Xu, Jayaram K. Udupa, Jax Luo, Songlin Zhao, Yajun Yu

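TL;DR: This survey of medical image segmentation reviews progress over the past decade and future directions, organizing its discussion of deep learning trends around seven dimensions.

Motivation: To examine whether medical image segmentation has been solved, analyze the progress and challenges of current models, and inspire future research.

Result: The survey summarizes the progress and remaining challenges in medical image segmentation and points out possible future research directions.

Insight: 1. Deep learning for medical image segmentation still has unsolved problems; 2. Multi-modality, unsupervised learning, and model interpretability are key future directions.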

Abstract: Medical image segmentation has advanced rapidly over the past two decades, largely driven by deep learning, which has enabled accurate and efficient delineation of cells, tissues, organs, and pathologies across diverse imaging modalities. This progress raises a fundamental question: to what extent have current models overcome persistent challenges, and what gaps remain? In this work, we provide an in-depth review of medical image segmentation, tracing its progress and key developments over the past decade. We examine core principles, including multiscale analysis, attention mechanisms, and the integration of prior knowledge, across the encoder, bottleneck, skip connections, and decoder components of segmentation networks. Our discussion is organized around seven key dimensions: (1) the shift from supervised to semi-/unsupervised learning, (2) the transition from organ segmentation to lesion-focused tasks, (3) advances in multi-modality integration and domain adaptation, (4) the role of foundation models and transfer learning, (5) the move from deterministic to probabilistic segmentation, (6) the progression from 2D to 3D and 4D segmentation, and (7) the trend from model invocation to segmentation agents. Together, these perspectives provide a holistic overview of the trajectory of deep learning-based medical image segmentation and aim to inspire future innovation. To support ongoing research, we maintain a continually updated repository of relevant literature and open-source resources at https://github.com/apple1986/medicalSegReview

[116] Efficient and Privacy-Protecting Background Removal for 2D Video Streaming using iPhone 15 Pro Max LiDAR eess.IV | cs.CV | cs.MM | 68T45, 68U10 | I.4.6; I.4.8; H.5.1; I.2.10

Jessica Kinnevan, Naifa Alqahtani, Toral Chauhan

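TL;DR: The paper proposes an efficient, privacy-protecting background removal method for 2D video streaming based on the LiDAR of the iPhone 15 Pro Max. Unlike traditional chroma keying and AI models, LiDAR depth information is unaffected by lighting and works across environments.

Motivation: Traditional background removal methods such as chroma keying and AI models are limited by lighting conditions and may raise privacy concerns. LiDAR offers a lighting-independent alternative that also protects privacy.

Result: Background removal runs on a 60 frames-per-second video stream, limited only by the streaming bandwidth of the LiDAR depth data and by the LiDAR's difficulty in returning accurate depth from some materials.

Insight: If LiDAR resolution can be improved to match color image resolution, LiDAR could become the leading background removal technique for video and photography.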

Abstract: Light Detection and Ranging (LiDAR) technology in consumer-grade mobile devices can be used as a replacement for traditional background removal and compositing techniques. Unlike approaches such as chroma keying and trained AI models, LiDAR’s depth information is independent of subject lighting, and performs equally well in low-light and well-lit environments. We integrate the LiDAR and color cameras on the iPhone 15 Pro Max with GPU-based image processing. We use Apple’s SwiftUI and Swift frameworks for user interface and backend development, and Metal Shader Language (MSL) for realtime image enhancement at the standard iPhone streaming frame rate of 60 frames per second. The only meaningful limitations of the technology are the streaming bandwidth of the depth data, which currently reduces the depth map resolution to 320x240, and any pre-existing limitations of the LiDAR IR laser to reflect accurate depth from some materials. If the LiDAR resolution on a mobile device like the iPhone can be improved to match the color image resolution, LiDAR could feasibly become the preeminent method of background removal for video applications and photography.
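Depth-based background removal reduces to two steps: bring the low-resolution depth map up to the color resolution, then keep only pixels closer than a cutoff. The sketch below illustrates this in plain Python rather than the Swift and Metal pipeline the authors describe. Nearest-neighbor upsampling, the `max_depth_m` cutoff, and the function names are assumptions made for the example.

```python
def upsample_nearest(depth, out_height, out_width):
    """Nearest-neighbor upsampling of a 2D depth map to the color resolution."""
    in_height, in_width = len(depth), len(depth[0])
    return [
        [depth[y * in_height // out_height][x * in_width // out_width]
         for x in range(out_width)]
        for y in range(out_height)
    ]


def remove_background(color, depth, max_depth_m, fill=(0, 0, 0)):
    """Keep color pixels whose depth is within max_depth_m; replace the rest with fill."""
    height, width = len(color), len(color[0])
    depth_full = upsample_nearest(depth, height, width)
    return [
        [color[y][x] if depth_full[y][x] <= max_depth_m else fill
         for x in range(width)]
        for y in range(height)
    ]


white = (255, 255, 255)
color = [[white] * 4 for _ in range(2)]  # 2 x 4 color frame
depth = [[0.8, 3.0]]                     # 1 x 2 depth map in meters
result = remove_background(color, depth, max_depth_m=1.5)
print(result[0])  # [(255, 255, 255), (255, 255, 255), (0, 0, 0), (0, 0, 0)]
```

Because each depth sample covers a block of color pixels, mask edges are only as sharp as the depth map allows, which is the resolution limitation the abstract points to.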

[117] Efficient Fine-Tuning of DINOv3 Pretrained on Natural Images for Atypical Mitotic Figure Classification in MIDOG 2025 eess.IV | cs.CV

Guillaume Balezo, Raphaël Bourgade, Thomas Walter

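TL;DR: Using a DINOv3 model pretrained on natural images and fine-tuned with low-rank adaptation (LoRA) and data augmentation, the authors classify atypical mitotic figures in the MIDOG 2025 challenge and achieve high balanced accuracy.

Motivation: Atypical mitotic figures (AMFs) are hard to detect because of low prevalence, subtle morphology, and inter-observer variability, yet as markers of abnormal cell division they are associated with poor prognosis.

Result: On the MIDOG 2025 preliminary test set, the model achieves a balanced accuracy of 0.8871.

Insight: Despite the domain gap, DINOv3's pretrained weights generalize well, and combining them with LoRA enables efficient transfer.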

Abstract: Atypical mitotic figures (AMFs) are markers of abnormal cell division associated with poor prognosis, yet their detection remains difficult due to low prevalence, subtle morphology, and inter-observer variability. The MIDOG 2025 challenge introduces a benchmark for AMF classification across multiple domains. In this work, we evaluate the recently published DINOv3-H+ vision transformer, pretrained on natural images, which we fine-tuned using low-rank adaptation (LoRA, 650k trainable parameters) and extensive augmentation. Despite the domain gap, DINOv3 transfers effectively to histopathology, achieving a balanced accuracy of 0.8871 on the preliminary test set. These results highlight the robustness of DINOv3 pretraining and show that, when combined with parameter-efficient fine-tuning, it provides a strong baseline for atypical mitosis classification in MIDOG 2025.
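Low-rank adaptation keeps the pretrained weight matrix W frozen and learns a small update B·A of rank r, so the layer computes W x + (alpha / r) · B (A x). The sketch below shows this for a single linear layer in plain Python. The shapes, the `alpha` scaling, and the rank used in the parameter count are generic illustrative values, not the configuration behind the 650k trainable parameters reported in the abstract.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(w, a, b, x, alpha):
    """y = W x + (alpha / r) * B (A x), with W frozen and only A, B trainable.

    w: d_out x d_in, a: r x d_in, b: d_out x r
    """
    rank = len(a)
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one adapted layer."""
    return rank * (d_in + d_out)


w = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight
a = [[1.0, 1.0]]              # rank-1 down-projection
x = [3.0, 4.0]

# B starts at zero, so the adapted layer initially matches the pretrained one.
print(lora_forward(w, a, [[0.0], [0.0]], x, alpha=1.0))  # [3.0, 4.0]
# After training, a non-zero B shifts the output.
print(lora_forward(w, a, [[1.0], [2.0]], x, alpha=1.0))  # [10.0, 18.0]
print(lora_param_count(1024, 1024, 8))  # 16384
```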

q-bio.QM

Zizhao Tang, Changhao Liu, Nuo Tong, Shuiping Gou, Mei Shi

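TL;DR: The paper proposes a deep learning-based multimodal framework that integrates CT images, radiomics features, and clinical data to predict metastasis risk in patients with head and neck squamous cell carcinoma (HNSCC), outperforming single-modality models.

Motivation: Metastasis is the major challenge in the clinical management of HNSCC, and reliable prediction of metastatic risk is crucial for optimizing treatment strategies and prognosis. Fusing multimodal data can improve prediction accuracy.

Result: The multimodal fusion model reaches an AUC of 0.803, better than the single-modality 3D deep learning module (AUC 0.715), and is robust across tumor subtypes.

Insight: Different modalities provide complementary information, the 3D Swin Transformer extracts features better than conventional networks, and the model has potential as a clinical decision-support tool.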

Abstract: Metastasis remains the major challenge in the clinical management of head and neck squamous cell carcinoma (HNSCC). Reliable pre-treatment prediction of metastatic risk is crucial for optimizing treatment strategies and prognosis. This study develops a deep learning-based multimodal framework to predict metastasis risk in HNSCC patients by integrating computed tomography (CT) images, radiomics, and clinical data. 1497 HNSCC patients were included. Tumor and organ masks were derived from pretreatment CT images. A 3D Swin Transformer extracted deep features from tumor regions. Meanwhile, 1562 radiomics features were obtained using PyRadiomics, followed by correlation filtering and random forest selection, leaving 36 features. Clinical variables including age, sex, smoking, and alcohol status were encoded and fused with imaging-derived features. Multimodal features were fed into a fully connected network to predict metastasis risk. Performance was evaluated using five-fold cross-validation with area under the curve (AUC), accuracy (ACC), sensitivity (SEN), and specificity (SPE). The proposed fusion model outperformed single-modality models. The 3D deep learning module alone achieved an AUC of 0.715, and when combined with radiomics and clinical features, predictive performance improved (AUC = 0.803, ACC = 0.752, SEN = 0.730, SPE = 0.758). Stratified analysis showed generalizability across tumor subtypes. Ablation studies indicated complementary information from different modalities. Evaluation showed the 3D Swin Transformer provided more robust representation learning than conventional networks. This multimodal fusion model demonstrated high accuracy and robustness in predicting metastasis risk in HNSCC, offering a comprehensive representation of tumor biology. The interpretable model has potential as a clinical decision-support tool for personalized treatment planning.
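The four reported metrics can be computed from labels, thresholded predictions, and raw scores as in the sketch below. It uses the standard definitions, with AUC computed as the fraction of positive-negative pairs ranked correctly. The toy labels, scores, and the 0.5 threshold are made-up illustrative values, not data from the study.

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall on positives), and specificity (recall on negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "ACC": (tp + tn) / len(y_true),
        "SEN": tp / (tp + fn),
        "SPE": tn / (tn + fp),
    }


def auc(y_true, scores):
    """Probability that a random positive scores above a random negative (ties count half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(auc(y_true, scores))                # 0.75
print(confusion_metrics(y_true, y_pred))  # {'ACC': 0.5, 'SEN': 0.5, 'SPE': 0.5}
```

AUC is threshold-free, while ACC, SEN, and SPE depend on the chosen cutoff, which is why the abstract reports them together.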

Table of Contents

cs.CV

[1] Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization cs.CV | cs.AI | cs.CL | cs.MM

[2] SDiFL: Stable Diffusion-Driven Framework for Image Forgery Localization cs.CV

[3] Grounding Multimodal Large Language Models with Quantitative Skin Attributes: A Retrieval Study cs.CV | cs.LG

[4] Enhancing Automatic Modulation Recognition With a Reconstruction-Driven Vision Transformer Under Limited Labels cs.CV | eess.SP

[5] InfinityHuman: Towards Long-Term Audio-Driven Human cs.CV

[6] Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360-Degree Videos cs.CV

[7] A Novel Framework for Automated Explain Vision Model Using Vision-Language Models cs.CV | cs.AI | cs.CL | cs.LG

[8] ATMS-KD: Adaptive Temperature and Mixed Sample Knowledge Distillation for a Lightweight Residual CNN in Agricultural Embedded Systems cs.CV

[9] Linking heterogeneous microstructure informatics with expert characterization knowledge through customized and hybrid vision-language representations for industrial qualification cs.CV | cs.LG

[10] How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding cs.CV | cs.AI | cs.CL

[11] Disentangling Latent Embeddings with Sparse Linear Concept Subspaces (SLiCS) cs.CV

[12] MedFoundationHub: A Lightweight and Secure Toolkit for Deploying Medical Vision Language Foundation Models cs.CV | cs.HC

[13] Enhancing Mamba Decoder with Bidirectional Interaction in Multi-Task Dense Prediction cs.CV

[14] Audio-Guided Visual Editing with Complex Multi-Modal Prompts cs.CV

[15] More Reliable Pseudo-labels, Better Performance: A Generalized Approach to Single Positive Multi-label Learning cs.CV

[16] Ultra-Low-Latency Spiking Neural Networks with Temporal-Dependent Integrate-and-Fire Neuron Model for Objects Detection cs.CV | cs.AI | I.4.0; I.2.6

[17] Graph-Based Uncertainty Modeling and Multimodal Fusion for Salient Object Detection cs.CV

[18] MSMVD: Exploiting Multi-scale Image Features via Multi-scale BEV Features for Multi-view Pedestrian Detection cs.CV

[19] A Spatial-Frequency Aware Multi-Scale Fusion Network for Real-Time Deepfake Detection cs.CV

[20] Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation cs.CV

[21] Realistic and Controllable 3D Gaussian-Guided Object Editing for Driving Video Generation cs.CV

[22] Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding cs.CV

[23] Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts cs.CV

[24] CaddieSet: A Golf Swing Dataset with Human Joint Features and Ball Information cs.CV | cs.AI

[25] IAENet: An Importance-Aware Ensemble Model for 3D Point Cloud-Based Anomaly Detection cs.CV

[26] Describe, Don’t Dictate: Semantic Image Editing with Natural Language Intent cs.CV

[27] Learning What is Worth Learning: Active and Sequential Domain Adaptation for Multi-modal Gross Tumor Volume Segmentation cs.CV

[28] Enhancing Pseudo-Boxes via Data-Level LiDAR-Camera Fusion for Unsupervised 3D Object Detection cs.CV

[29] Digital Scale: Open-Source On-Device BMI Estimation from Smartphone Camera Images Trained on a Large-Scale Real-World Dataset cs.CV

[30] Contrastive Learning through Auxiliary Branch for Video Object Detection cs.CV

[31] Towards Mechanistic Defenses Against Typographic Attacks in CLIP cs.CV | cs.AI

[32] GLaRE: A Graph-based Landmark Region Embedding Network for Emotion Recognition cs.CV

[33] FastFit: Accelerating Multi-Reference Virtual Try-On via Cacheable Diffusion Models cs.CV | 68T42 (Primary) 168T45 (Secondary) | I.4.9

[34] UTA-Sign: Unsupervised Thermal Video Augmentation via Event-Assisted Traffic Signage Sketching cs.CV

[35] Optimization-Based Calibration for Intravascular Ultrasound Volume Reconstruction cs.CV

[36] Revisiting the Privacy Risks of Split Inference: A GAN-Based Data Reconstruction Attack via Progressive Feature Optimization cs.CV | cs.CR

[37] EmoCAST: Emotional Talking Portrait via Emotive Text Description cs.CV

[38] AvatarBack: Back-Head Generation for Complete 3D Avatars from Front-View Images cs.CV

[39] ArtFace: Towards Historical Portrait Face Identification via Model Adaptation cs.CV | cs.AI

[40] CraftGraffiti: Exploring Human Identity with Custom Graffiti Art via Facial-Preserving Diffusion Models cs.CV

[41] Improving Alignment in LVLMs with Debiased Self-Judgment cs.CV | cs.CL

[42] “Humor, Art, or Misinformation?”: A Multimodal Dataset for Intent-Aware Synthetic Image Detection cs.CV | cs.MM

[43] MobileCLIP2: Improving Multi-Modal Reinforced Training cs.CV | cs.AI | cs.CL | cs.LG

[44] Learned Rate Control for Frame-Level Adaptive Neural Video Compression via Dynamic Neural Network cs.CV

[45] Mix, Align, Distil: Reliable Cross-Domain Atypical Mitosis Classification cs.CV

[46] Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning cs.CV

[47] ${C}^{3}$-GS: Learning Context-aware, Cross-dimension, Cross-scale Feature for Generalizable Gaussian Splatting cs.CV | cs.AI

[48] SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding cs.CV | cs.AI

[49] Occlusion Robustness of CLIP for Military Vehicle Classification cs.CV | cs.AI

[50] SKGE-SWIN: End-To-End Autonomous Vehicle Waypoint Prediction and Navigation Using Skip Stage Swin Transformer cs.CV | cs.AI | cs.LG | cs.RO

[51] ChainReaction! Structured Approach with Causal Chains as Intermediate Representations for Improved and Explainable Causal Video Question Answering cs.CV | cs.AI | cs.CL | cs.HC | cs.LG

[52] Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding cs.CV | cs.AI

[53] Safer Skin Lesion Classification with Global Class Activation Probability Map Evaluation and SafeML cs.CV | cs.AI

[54] Evaluating Compositional Generalisation in VLMs and Diffusion Models cs.CV | cs.AI

[55] Adapting Foundation Model for Dental Caries Detection with Dual-View Co-Training cs.CV

[56] Estimating 2D Keypoints of Surgical Tools Using Vision-Language Models with Low-Rank Adaptation cs.CV

[57] PathMR: Multimodal Visual Reasoning for Interpretable Pathology Diagnosis cs.CV

[58] Deep Learning Framework for Early Detection of Pancreatic Cancer Using Multi-Modal Medical Imaging Analysis cs.CV

[59] Understanding and evaluating computer vision models through the lens of counterfactuals cs.CV

[60] To New Beginnings: A Survey of Unified Perception in Autonomous Vehicle Software cs.CV | cs.RO

[61] Dino U-Net: Exploiting High-Fidelity Dense Features from Foundation Models for Medical Image Segmentation cs.CV | eess.IV

[62] COMETH: Convex Optimization for Multiview Estimation and Tracking of Humans cs.CV | cs.RO

[63] E-ConvNeXt: A Lightweight and Efficient ConvNeXt Variant with Cross-Stage Partial Connections cs.CV

[64] Webly-Supervised Image Manipulation Localization via Category-Aware Auto-Annotation cs.CV

[65] POSE: Phased One-Step Adversarial Equilibrium for Video Diffusion Models cs.CV

[66] Mitosis detection in domain shift scenarios: a Mamba-based approach cs.CV

[67] FW-GAN: Frequency-Driven Handwriting Synthesis with Wave-Modulated MLP Generator cs.CV | cs.LG

[68] MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs cs.CV

[69] CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification cs.CV | cs.RO

[70] Veritas: Generalizable Deepfake Detection via Pattern-Aware Reasoning cs.CV | cs.AI

[71] FakeParts: a New Family of AI-Generated DeepFakes cs.CV | cs.AI | cs.MM

[72] Multi-View 3D Point Tracking cs.CV

[73] OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning cs.CV

[74] Dress&Dance: Dress up and Dance as You Like It - Technical Preview cs.CV | cs.LG

cs.CL [Back]

[75] Social Bias in Multilingual Language Models: A Survey cs.CLPDF

[76] Integrating SystemC TLM into FMI 3.0 Co-Simulations with an Open-Source Approach cs.CLPDF

[77] Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities cs.CLPDF