cs.CV [Total: 166]
cs.CL [Total: 36]
eess.AS [Total: 2]
q-fin.ST [Total: 1]
cs.LG [Total: 8]
cs.RO [Total: 4]
cs.IT [Total: 1]
cs.MM [Total: 1]
stat.ML [Total: 1]
eess.IV [Total: 14]
cs.SD [Total: 2]
cs.AI [Total: 13]
cs.IR [Total: 4]
q-bio.NC [Total: 1]

cs.CV

[1] Med-GRIM: Enhanced Zero-Shot Medical VQA using prompt-embedded Multimodal Graph RAG [cs.CV | cs.MA]

Rakesh Raj Madavan, Akshat Kaimal, Hashim Faisal, Chandrakala S

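TL;DR: Med-GRIM combines dense encoding with graph-based retrieval augmentation into an efficient, low-compute medical visual question answering system that avoids large-scale model fine-tuning and performs strongly on zero-shot tasks.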

Details

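Motivation: Existing multimodal models struggle to give precisely detailed answers in complex domains such as medical VQA, so performance must be improved with efficient, low-compute methods.

Result: Med-GRIM performs strongly on medical VQA tasks, reaching large-model performance at a lower compute cost; the DermaGraph dataset additionally supports further research.

Insight: A modular, efficient design can deliver high performance on zero-shot tasks while avoiding the compute burden of large-scale models.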

Abstract: An ensemble of trained multimodal encoders and vision-language models (VLMs) has become a standard approach for visual question answering (VQA) tasks. However, such models often fail to produce responses with the detailed precision necessary for complex, domain-specific applications such as medical VQA. Our representation model, BIND: BLIVA Integrated with Dense Encoding, extends prior multimodal work by refining the joint embedding space through dense, query-token-based encodings inspired by contrastive pretraining techniques. This refined encoder powers Med-GRIM, a model designed for medical VQA tasks that leverages graph-based retrieval and prompt engineering to integrate domain-specific knowledge. Rather than relying on compute-heavy fine-tuning of vision and language models on specific datasets, Med-GRIM applies a low-compute, modular workflow with small language models (SLMs) for efficiency. Med-GRIM employs prompt-based retrieval to dynamically inject relevant knowledge, ensuring both accuracy and robustness in its responses. By assigning distinct roles to each agent within the VQA system, Med-GRIM achieves large language model performance at a fraction of the computational cost. Additionally, to support scalable research in zero-shot multimodal medical applications, we introduce DermaGraph, a novel Graph-RAG dataset comprising diverse dermatological conditions. This dataset facilitates both multimodal and unimodal querying. The code and dataset are available at: https://github.com/Rakesh-123-cryp/Med-GRIM.git

[2] DiTalker: A Unified DiT-based Framework for High-Quality and Speaking Styles Controllable Portrait Animation [cs.CV]

He Feng, Yongjia Ma, Donglin Di, Lei Fan, Tonghua Su

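TL;DR: DiTalker is a unified DiT-based framework for high-quality portrait animation with controllable speaking styles. It addresses the weakness of existing methods on dynamic styles such as head movements and improves performance with a two-branch design.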

Details

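Motivation: Existing diffusion-based portrait animation methods usually focus only on lip synchronization or static emotion transformation and overlook dynamic styles such as head movements; their dual U-Net architecture preserves identity consistency but adds computational overhead.

Result: Experiments show that DiTalker outperforms existing methods in lip synchronization and speaking-style control.

Insight: Decoupling audio from dynamic style gives DiTalker more flexible control over speaking style while preserving high-quality animation.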

Abstract: Portrait animation aims to synthesize talking videos from a static reference face, conditioned on audio and style frame cues (e.g., emotion and head poses), while ensuring precise lip synchronization and faithful reproduction of speaking styles. Existing diffusion-based portrait animation methods primarily focus on lip synchronization or static emotion transformation, often overlooking dynamic styles such as head movements. Moreover, most of these methods rely on a dual U-Net architecture, which preserves identity consistency but incurs additional computational overhead. To this end, we propose DiTalker, a unified DiT-based framework for speaking style-controllable portrait animation. We design a Style-Emotion Encoding Module that employs two separate branches: a style branch extracting identity-specific style information (e.g., head poses and movements), and an emotion branch extracting identity-agnostic emotion features. We further introduce an Audio-Style Fusion Module that decouples audio and speaking styles via two parallel cross-attention layers, using these features to guide the animation process. To enhance the quality of results, we adopt and modify two optimization constraints: one to improve lip synchronization and the other to preserve fine-grained identity and background details. Extensive experiments demonstrate the superiority of DiTalker in terms of lip synchronization and speaking style controllability. Project Page: https://thenameishope.github.io/DiTalker/

[3] BigTokDetect: A Clinically-Informed Vision-Language Model Framework for Detecting Pro-Bigorexia Videos on TikTok [cs.CV]

Minh Duc Chu, Kshitij Pawar, Zihao He, Roxanna Sharifi, Ross Sonnenblick

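TL;DR: BigTokDetect is a clinically informed vision-language model framework for detecting harmful TikTok content that promotes muscle dysmorphia; multimodal fusion markedly improves classification performance.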

Details

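Motivation: Harmful pro-bigorexia content on social media masquerades as fitness content and evades traditional text-based detection systems, so multimodal methods are urgently needed.

Result: On multimodal classification, accuracy reaches 0.829 for primary categories and 0.690 for subcategories, and multimodal fusion improves performance by 5-10% over text-only approaches.

Insight: Video features provide the most discriminative signal in multimodal fusion, underlining the importance of visual information for identifying disguised content.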

Abstract: Social media platforms increasingly struggle to detect harmful content that promotes muscle dysmorphic behaviors, particularly pro-bigorexia content that disproportionately affects adolescent males. Unlike traditional eating disorder detection focused on the “thin ideal,” pro-bigorexia material masquerades as legitimate fitness content through complex multimodal combinations of visual displays, coded language, and motivational messaging that evade text-based detection systems. We address this challenge by developing BigTokDetect, a clinically-informed detection framework for identifying pro-bigorexia content on TikTok. We introduce BigTok, the first expert-annotated multimodal dataset of over 2,200 TikTok videos labeled by clinical psychologists and psychiatrists across five primary categories spanning body image, nutrition, exercise, supplements, and masculinity. Through a comprehensive evaluation of state-of-the-art vision language models, we achieve an accuracy of 0.829 on primary category classification and 0.690 on subcategory detection via domain-specific finetuning. Our ablation studies demonstrate that multimodal fusion improves performance by 5-10% over text-only approaches, with video features providing the most discriminative signals. These findings establish new benchmarks for multimodal harmful content detection and provide both the computational tools and methodological framework needed for scalable content moderation in specialized mental health domains.

[4] Frequency Prior Guided Matching: A Data Augmentation Approach for Generalizable Semi-Supervised Polyp Segmentation [cs.CV]

Haoran Xi, Chen Liu, Xiaolin Li

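TL;DR: This paper proposes Frequency Prior Guided Matching (FPGM), a data augmentation method that captures the frequency characteristics of polyp edges and markedly improves the generalization of semi-supervised polyp segmentation models.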

Details

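Motivation: Existing semi-supervised polyp segmentation methods rely on generic augmentation strategies that ignore polyp-specific structural properties, so models generalize poorly to new devices and imaging centers.

Result: Validated on six public datasets, FPGM markedly improves zero-shot generalization, with an absolute Dice gain of over 10% in data-scarce scenarios.

Insight: The frequency signature of polyp edges is key to cross-domain generalization; FPGM's frequency-domain perturbations push the model to learn generalizable anatomical structure.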

Abstract: Automated polyp segmentation is essential for early diagnosis of colorectal cancer, yet developing robust models remains challenging due to limited annotated data and significant performance degradation under domain shift. Although semi-supervised learning (SSL) reduces annotation requirements, existing methods rely on generic augmentations that ignore polyp-specific structural properties, resulting in poor generalization to new imaging centers and devices. To address this, we introduce Frequency Prior Guided Matching (FPGM), a novel augmentation framework built on a key discovery: polyp edges exhibit a remarkably consistent frequency signature across diverse datasets. FPGM leverages this intrinsic regularity in a two-stage process. It first learns a domain-invariant frequency prior from the edge regions of labeled polyps. Then, it performs principled spectral perturbations on unlabeled images, aligning their amplitude spectra with this learned prior while preserving phase information to maintain structural integrity. This targeted alignment normalizes domain-specific textural variations, thereby compelling the model to learn the underlying, generalizable anatomical structure. Validated on six public datasets, FPGM establishes a new state-of-the-art against ten competing methods. It demonstrates exceptional zero-shot generalization capabilities, achieving over 10% absolute gain in Dice score in data-scarce scenarios. By significantly enhancing cross-domain robustness, FPGM presents a powerful solution for clinically deployable polyp segmentation under limited supervision.
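The spectral perturbation described in this abstract (align the amplitude spectrum with a prior, keep the phase) can be sketched with a 2D FFT. This is only an illustration under stated assumptions: the function name `align_amplitude`, the `strength` blending parameter, and the way the prior is built here are hypothetical, and the paper's actual prior is learned from the edge regions of labeled polyps.

```python
import numpy as np

def align_amplitude(image, prior_amplitude, strength=1.0):
    """Move an image's amplitude spectrum toward a prior while keeping its phase.

    `strength` is a hypothetical blending knob: 0 keeps the image unchanged,
    1 replaces the amplitude spectrum with the prior entirely.
    """
    spectrum = np.fft.fft2(image)
    amplitude = np.abs(spectrum)   # texture / appearance statistics
    phase = np.angle(spectrum)     # structural layout, left untouched
    mixed = (1.0 - strength) * amplitude + strength * prior_amplitude
    aligned = mixed * np.exp(1j * phase)
    return np.real(np.fft.ifft2(aligned))

rng = np.random.default_rng(0)
unlabeled = rng.random((8, 8))
# Assumption for illustration: the "prior" is the amplitude spectrum of another
# random image, not a prior learned from polyp edges.
prior = np.abs(np.fft.fft2(rng.random((8, 8))))
augmented = align_amplitude(unlabeled, prior, strength=0.5)
```

Keeping the phase is what preserves object layout; only the amplitude, which mostly carries texture and contrast statistics, is pulled toward the prior.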

[5] Large Language Models Facilitate Vision Reflection in Image Classification [cs.CV]

Guoyuan An, JaeYoon Kim, SungEui Yoon

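TL;DR: By studying the vision reflection mechanism in large multimodal models (LMMs), this paper finds that it notably improves both accuracy and explainability in image classification.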

Details

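Motivation: The study explores how LMMs perform on vision tasks, in particular how vision reflection can improve model performance, and analyzes the underlying internal mechanism.

Result: Experiments show that vision reflection improves recognition accuracy on datasets such as ImageNet and makes the model more explainable.

Insight: LMMs may reason over compact textual representations rather than raw vision features; vision reflection offers a new route to robust and interpretable visual recognition.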

Abstract: This paper presents several novel findings on the explainability of vision reflection in large multimodal models (LMMs). First, we show that prompting an LMM to verify the prediction of a specialized vision model can improve recognition accuracy, even on benchmarks like ImageNet, despite prior evidence that LMMs typically underperform dedicated vision encoders. Second, we analyze the internal behavior of vision reflection and find that the vision-language connector maps visual features into explicit textual concepts, allowing the language model to reason about prediction plausibility using commonsense knowledge. We further observe that replacing a large number of vision tokens with only a few text tokens still enables LLaVA to generate similar answers, suggesting that LMMs may rely primarily on a compact set of distilled textual representations rather than raw vision features. Third, we show that a training-free connector can enhance LMM performance in fine-grained recognition tasks, without extensive feature-alignment training. Together, these findings offer new insights into the explainability of vision-language models and suggest that vision reflection is a promising strategy for achieving robust and interpretable visual recognition.

[6] A Framework Combining 3D CNN and Transformer for Video-Based Behavior Recognition [cs.CV | cs.AI]

Xiuliang Zhang, Tadiwa Elisha Nyamasvisva, Chuntao Liu

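TL;DR: The paper proposes a hybrid framework combining a 3D CNN and a Transformer for video-based behavior recognition, addressing the limits of traditional methods in modeling local and global features.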

Details

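Motivation: Video-based behavior recognition is essential in areas such as public safety and intelligent surveillance, but traditional 3D CNNs struggle to capture long-range dependencies and Transformers are computationally expensive.

Result: On benchmark datasets the model outperforms traditional 3D CNNs and standalone Transformers, achieving higher recognition accuracy with manageable computational complexity.

Insight: By combining the strengths of local and global feature modeling, the hybrid framework offers an efficient and scalable solution for video-based behavior recognition.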

Abstract: Video-based behavior recognition is essential in fields such as public safety, intelligent surveillance, and human-computer interaction. Traditional 3D Convolutional Neural Networks (3D CNNs) effectively capture local spatiotemporal features but struggle with modeling long-range dependencies. Conversely, Transformers excel at learning global contextual information but face challenges with high computational costs. To address these limitations, we propose a hybrid framework combining 3D CNN and Transformer architectures. The 3D CNN module extracts low-level spatiotemporal features, while the Transformer module captures long-range temporal dependencies, with a fusion mechanism integrating both representations. Evaluated on benchmark datasets, the proposed model outperforms traditional 3D CNNs and standalone Transformers, achieving higher recognition accuracy with manageable complexity. Ablation studies further validate the complementary strengths of the two modules. This hybrid framework offers an effective and scalable solution for video-based behavior recognition.

[7] RMT-PPAD: Real-time Multi-task Learning for Panoptic Perception in Autonomous Driving [cs.CV | cs.LG]

Jiayuan Wang, Q. M. Jonathan Wu, Katsuya Suto, Ning Zhang

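TL;DR: RMT-PPAD is a real-time, Transformer-based multi-task learning model for panoptic perception in autonomous driving. It jointly performs object detection, drivable area segmentation, and lane line segmentation, and uses a lightweight module and adaptive components to mitigate negative transfer between tasks.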

Details

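Motivation: Autonomous driving systems need panoptic perception that is both precise and real-time, but existing multi-task methods suffer from negative transfer between tasks and rely on manually designed task-specific structures; RMT-PPAD aims to solve both problems.

Result: On BDD100K: object detection mAP50 of 84.9% and recall of 95.4%; drivable area segmentation mIoU of 92.6%; lane line segmentation IoU of 56.8% and accuracy of 84.7%; inference speed of 32.6 FPS.

Insight: With its adaptive modules and improved label consistency, RMT-PPAD balances performance and efficiency in multi-task learning and suits real-world autonomous driving scenarios.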

Abstract: Autonomous driving systems rely on panoptic driving perception that requires both precision and real-time performance. In this work, we propose RMT-PPAD, a real-time, transformer-based multi-task model that jointly performs object detection, drivable area segmentation, and lane line segmentation. We introduce a lightweight module, a gate control with an adapter to adaptively fuse shared and task-specific features, effectively alleviating negative transfer between tasks. Additionally, we design an adaptive segmentation decoder to learn the weights over multi-scale features automatically during the training stage. This avoids the manual design of task-specific structures for different segmentation tasks. We also identify and resolve the inconsistency between training and testing labels in lane line segmentation. This allows fairer evaluation. Experiments on the BDD100K dataset demonstrate that RMT-PPAD achieves state-of-the-art results with mAP50 of 84.9% and Recall of 95.4% for object detection, mIoU of 92.6% for drivable area segmentation, and IoU of 56.8% and accuracy of 84.7% for lane line segmentation. The inference speed reaches 32.6 FPS. Moreover, we introduce real-world scenarios to evaluate RMT-PPAD performance in practice. The results show that RMT-PPAD consistently delivers stable performance. The source codes and pre-trained models are released at https://github.com/JiayuanWang-JW/RMT-PPAD.

[8] What Makes “Good” Distractors for Object Hallucination Evaluation in Large Vision-Language Models? [cs.CV | cs.LG]

Ming-Kun Xie, Jia-Hao Xiao, Gang Niu, Lei Feng, Zhiqiang Kou

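TL;DR: This paper introduces the HOPE benchmark, which generates more misleading distractors to evaluate hallucination in large vision-language models (LVLMs) more rigorously.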

Details

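Motivation: The existing POPE benchmark evaluates LVLM hallucination with a simplistic sampling strategy that overlooks image-specific information and restricts distractors to negative object categories, so its effectiveness is diminishing.

Result: Experiments show that HOPE causes a precision drop of at least 9% and up to 23% across various state-of-the-art LVLMs, exposing hallucination vulnerabilities far more effectively than POPE.

Insight: Through finer-grained distractor design, HOPE reveals how vulnerable LVLMs are to hallucination and provides a stricter evaluation standard for future model improvement.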

Abstract: Large Vision-Language Models (LVLMs), empowered by the success of Large Language Models (LLMs), have achieved impressive performance across domains. Despite the great advances in LVLMs, they still suffer from the unavoidable object hallucination issue, in which they tend to generate objects inconsistent with the image content. The most commonly used Polling-based Object Probing Evaluation (POPE) benchmark evaluates this issue by sampling negative categories according to category-level statistics, e.g., category frequencies and co-occurrence. However, with the continuous advancement of LVLMs, the POPE benchmark has shown diminishing effectiveness in assessing object hallucination, as it employs a simplistic sampling strategy that overlooks image-specific information and restricts distractors to negative object categories only. In this paper, we introduce the Hallucination searching-based Object Probing Evaluation (HOPE) benchmark, aiming to generate the most misleading distractors (i.e., non-existent objects or incorrect image descriptions) that can trigger hallucination in LVLMs, which serves as a means to more rigorously assess their immunity to hallucination. To explore the image-specific information, the content-aware hallucination searching leverages Contrastive Language-Image Pre-Training (CLIP) to approximate the predictive behavior of LVLMs by selecting negative objects with the highest predicted likelihood as distractors. To expand the scope of hallucination assessment, the description-based hallucination searching constructs highly misleading distractors by pairing true objects with false descriptions. Experimental results show that HOPE leads to a precision drop of at least 9% and up to 23% across various state-of-the-art LVLMs, significantly outperforming POPE in exposing hallucination vulnerabilities. The code is available at https://github.com/xiemk/HOPE.
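The content-aware search in this abstract selects, among objects absent from the image, those a CLIP-like model scores as most likely. A minimal sketch of that selection step follows. It assumes image and label embeddings have already been computed, and the embeddings, the function name `pick_distractors`, and the use of plain cosine similarity are all hypothetical stand-ins rather than the benchmark's actual code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pick_distractors(image_emb, label_embs, present, k=2):
    """Return the k absent labels whose embeddings are most similar to the image."""
    scores = {
        name: cosine(image_emb, emb)
        for name, emb in label_embs.items()
        if name not in present  # only objects that are NOT in the image
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy embeddings, invented for illustration only.
image_emb = [1.0, 0.2, 0.0]
label_embs = {
    "dog": [1.0, 0.0, 0.0],   # actually present in the image
    "cat": [0.9, 0.3, 0.1],
    "car": [0.0, 1.0, 0.0],
    "boat": [0.0, 0.0, 1.0],
}
distractors = pick_distractors(image_emb, label_embs, present={"dog"}, k=2)
```

The most misleading distractor is the absent label that looks most plausible for this particular image, which is what category-level frequency sampling cannot capture.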

[9] Benchmarking Deep Learning-Based Object Detection Models on Feature Deficient Astrophotography Imagery Dataset [cs.CV | astro-ph.IM]

Shantanusinh Parmar

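TL;DR: The paper benchmarks several object detection models on MobilTelesco, a feature-sparse smartphone astrophotography dataset, and shows how signal sparsity in non-commercial domain data affects model performance.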

Details

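Motivation: Standard object detection datasets (e.g., ImageNet, COCO) focus on everyday objects and lack the signal sparsity found in non-commercial domains such as astrophotography. This paper aims to fill that gap by examining how models behave under feature-sparse conditions.

Result: Experiments show that existing object detection models perform poorly under feature-sparse conditions, highlighting their limitations in non-commercial domains.

Insight: Feature-sparse object detection may need dedicated data augmentation or model optimization methods to overcome the challenges of signal sparsity.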

Abstract: Object detection models are typically trained on datasets like ImageNet, COCO, and PASCAL VOC, which focus on everyday objects. However, these lack signal sparsity found in non-commercial domains. MobilTelesco, a smartphone-based astrophotography dataset, addresses this by providing sparse night-sky images. We benchmark several detection models on it, highlighting challenges under feature-deficient conditions.

[10] MILD: Multi-Layer Diffusion Strategy for Complex and Precise Multi-IP Aware Human Erasing [cs.CV]

Jinghan Yu, Zhiyuan Ma, Yue Ma, Kaiqi Liu, Yuhan Wang

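TL;DR: The paper proposes Multi-Layer Diffusion (MILD), a strategy for complex and precise human erasing in multi-IP scenes, and improves performance with a new dataset and Human Morphology Guidance.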

Details

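Motivation: Existing methods perform poorly in complex multi-IP scenes (human-human occlusions, human-object entanglements, and background interference), mainly because of dataset limitations and a lack of spatial decoupling.

Result: MILD clearly outperforms existing methods on challenging human erasing benchmarks.

Insight: Semantic decoupling and Human Morphology Guidance effectively improve human erasing in complex scenes.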

Abstract: Recent years have witnessed the success of diffusion models in image-customized tasks. Prior works have achieved notable progress on human-oriented erasing using explicit mask guidance and semantic-aware inpainting. However, they struggle under complex multi-IP scenarios involving human-human occlusions, human-object entanglements, and background interferences. These challenges are mainly due to: 1) Dataset limitations, as existing datasets rarely cover dense occlusions, camouflaged backgrounds, and diverse interactions; 2) Lack of spatial decoupling, where foreground instances cannot be effectively disentangled, limiting clean background restoration. In this work, we introduce a high-quality multi-IP human erasing dataset with diverse pose variations and complex backgrounds. We then propose Multi-Layer Diffusion (MILD), a novel strategy that decomposes generation into semantically separated pathways for each instance and the background. To enhance human-centric understanding, we introduce Human Morphology Guidance, integrating pose, parsing, and spatial relations. We further present Spatially-Modulated Attention to better guide attention flow. Extensive experiments show that MILD outperforms state-of-the-art methods on challenging human erasing benchmarks.

[11] Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images [cs.CV | eess.IV]

Qi Xun Yeo, Yanyan Li, Gim Hee Lee

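TL;DR: The paper proposes a method for generating 3D scene graphs from multi-view RGB images, using semantic masks and statistical priors to improve the accuracy of node and edge predictions.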

Details

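Motivation: Existing methods depend on 3D annotations, whereas this work aims for robust 3D scene graph generation from multi-view images alone, which requires handling noisy pseudo-geometry reconstructed from predicted depth maps and background noise in multi-view features.

Result: Experiments show that the method outperforms existing methods when multi-view images are the only input.

Insight: Enriching node and edge features with semantic and spatial information, together with statistical priors, effectively improves prediction robustness.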

Abstract: Modern 3D semantic scene graph estimation methods utilize ground truth 3D annotations to accurately predict target objects, predicates, and relationships. In the absence of given 3D ground truth representations, we explore leveraging only multi-view RGB images to tackle this task. To attain robust features for accurate scene graph estimation, we must overcome the noisy reconstructed pseudo point-based geometry from predicted depth maps and reduce the amount of background noise present in multi-view image features. The key is to enrich node and edge features with accurate semantic and spatial information and with neighboring relations. We obtain semantic masks to guide feature aggregation to filter background features and design a novel method to incorporate neighboring node information to improve the robustness of our scene graph estimates. Furthermore, we leverage explicit statistical priors calculated from the training summary statistics to refine node and edge predictions based on their one-hop neighborhood. Our experiments show that our method outperforms current methods purely using multi-view images as the initial input. Our project page is available at https://qixun1.github.io/projects/SCRSSG.
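One way to picture refining a node prediction from training-set statistics and its one-hop neighborhood is to reweight each class probability by how compatible that class is with the neighbors' predicted classes. The sketch below is an assumption-laden illustration, not the paper's formula: the multiplicative update, the `cooccur` table, and all numbers are hypothetical.

```python
def rescore_node(probs, neighbor_probs, cooccur):
    """Reweight a node's class probabilities using neighbor compatibility.

    probs:          class probabilities for the node
    neighbor_probs: one probability list per one-hop neighbor
    cooccur[c][d]:  hypothetical prior that a class-c node has a class-d neighbor
    """
    scores = []
    for c, p in enumerate(probs):
        support = 1.0
        for nb in neighbor_probs:
            # Expected compatibility of class c with this neighbor's prediction.
            support *= sum(cooccur[c][d] * q for d, q in enumerate(nb))
        scores.append(p * support)
    total = sum(scores)
    return [s / total for s in scores]

# Classes (illustrative): 0 = chair, 1 = toilet, 2 = table.
cooccur = [
    [0.30, 0.05, 0.65],  # chairs are mostly seen next to tables
    [0.10, 0.80, 0.10],
    [0.70, 0.05, 0.25],
]
node = [0.45, 0.45, 0.10]        # ambiguous between chair and toilet
neighbors = [[0.05, 0.05, 0.90]]  # one neighbor, confidently a table
refined = rescore_node(node, neighbors, cooccur)
```

In this toy case the confident "table" neighbor breaks the tie in favor of "chair", which is the kind of correction a co-occurrence prior can provide when image features alone are noisy.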

[12] Age-Diverse Deepfake Dataset: Bridging the Age Gap in Deepfake Detection [cs.CV | cs.LG]

Unisha Joshi

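TL;DR: The paper presents an age-diverse deepfake dataset that targets age bias in deepfake detection, combining several existing datasets with generated synthetic data to fill gaps in the age distribution.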

Details

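Motivation: As technology advances and deepfake content spreads, age bias in deepfake detection urgently needs attention. Existing datasets have a markedly uneven age distribution, which harms the fairness and generalization of detection models.

Result: Experiments show that models trained on the age-diverse dataset perform more fairly across age groups, achieve higher overall accuracy, and generalize better across datasets.

Insight: 1. Age diversity in the dataset is crucial for model fairness and generalization; 2. Combining existing datasets with synthetic data is an effective way to address distribution gaps; 3. Future work can examine fairness with respect to other demographic attributes.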

Abstract: The challenges associated with deepfake detection are increasing significantly with the latest advancements in technology and the growing popularity of deepfake videos and images. Despite the presence of numerous detection models, demographic bias in deepfake datasets remains largely unaddressed. This paper focuses on mitigating age-specific bias by introducing an age-diverse deepfake dataset that will improve fairness across age groups. The dataset is constructed through a modular pipeline that incorporates the existing Celeb-DF, FaceForensics++, and UTKFace datasets and creates synthetic data to fill the age distribution gaps. The effectiveness and generalizability of this dataset are evaluated using three deepfake detection models: XceptionNet, EfficientNet, and LipForensics. Evaluation metrics, including AUC, pAUC, and EER, revealed that models trained on the age-diverse dataset demonstrated fairer performance across age groups, improved overall accuracy, and higher generalization across datasets. This study contributes a reproducible, fairness-aware deepfake dataset and model pipeline that can serve as a foundation for future research in fairer deepfake detection. The complete dataset and implementation code are available at https://github.com/unishajoshi/age-diverse-deepfake-detection.
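Of the metrics named in this abstract, the equal error rate (EER) is the least self-explanatory: it is the operating point where the false acceptance rate equals the false rejection rate. A minimal threshold-sweep sketch follows. The scores and labels are invented for illustration, and this simple version picks the closest threshold rather than interpolating, so it approximates the EER on small samples.

```python
def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    labels: 1 = fake, 0 = real; a higher score means "more likely fake".
    Assumes both classes are present in `labels`.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_gap, eer = None, None
    for threshold in sorted(set(scores)):
        # Real samples flagged as fake (false acceptance of the "fake" decision).
        far = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold) / negatives
        # Fake samples that slip under the threshold.
        frr = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold) / positives
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Invented detector outputs: one real sample scores above one fake sample.
overlapping = equal_error_rate([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
separable = equal_error_rate([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
```

Computing the EER separately per age group and comparing the values is one simple way to inspect the kind of age-related performance gap the dataset is meant to reduce.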

[13] Static and Plugged: Make Embodied Evaluation Simple [cs.CV]

Jiahao Xiao, Jianbo Zhang, BoWen Yan, Shengyu Guo, Tongrui Ye

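TL;DR: The paper introduces StaticEmbodiedBench, a plug-and-play benchmark based on static scene representations that simplifies the evaluation of embodied intelligence. It covers 42 scenarios and 8 core dimensions, and a 200-sample subset is released to accelerate research.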

Details

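Motivation: Current evaluation of embodied intelligence relies on interactive simulated environments or real devices, which are costly and hard to scale, so an efficient, unified evaluation method is needed.

Result: The benchmark evaluates 19 VLMs and 11 VLAs, demonstrating its generality and scalability.

Insight: Static scene representations are an efficient alternative for evaluating embodied intelligence, reducing cost and speeding up research.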

Abstract: Embodied intelligence is advancing rapidly, driving the need for efficient evaluation. Current benchmarks typically rely on interactive simulated environments or real-world setups, which are costly, fragmented, and hard to scale. To address this, we introduce StaticEmbodiedBench, a plug-and-play benchmark that enables unified evaluation using static scene representations. Covering 42 diverse scenarios and 8 core dimensions, it supports scalable and comprehensive assessment through a simple interface. Furthermore, we evaluate 19 Vision-Language Models (VLMs) and 11 Vision-Language-Action models (VLAs), establishing the first unified static leaderboard for embodied intelligence. Finally, we release a subset of 200 samples from our benchmark to accelerate the development of embodied intelligence.

[14] StyleTailor: Towards Personalized Fashion Styling via Hierarchical Negative Feedback [cs.CV | cs.CY | cs.MA]

Hongbo Ma, Fei Shen, Hongbin Xu, Xiaoce Wang, Gang Xu

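TL;DR: This paper presents StyleTailor, an agent framework that unifies personalized apparel design, shopping recommendation, virtual try-on, and systematic evaluation, and refines its outputs iteratively through multi-level negative feedback.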

Details

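Motivation: Personalized fashion styling lacks effective intelligent solutions; StyleTailor aims to improve the user experience through a closed-loop mechanism.

Result: Experiments show that StyleTailor outperforms baselines without negative feedback in personalized design and recommendation, setting a new benchmark for intelligent fashion systems.

Insight: A closed-loop negative feedback mechanism markedly improves the precision and adaptability of personalized systems and could be extended to other interactive recommendation settings.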

Abstract: The advancement of intelligent agents has revolutionized problem-solving across diverse domains, yet solutions for personalized fashion styling, which holds immense promise for improving shopping experiences, remain underexplored. In this work, we present StyleTailor, the first collaborative agent framework that seamlessly unifies personalized apparel design, shopping recommendation, virtual try-on, and systematic evaluation into a cohesive workflow. To this end, StyleTailor pioneers an iterative visual refinement paradigm driven by multi-level negative feedback, enabling adaptive and precise user alignment. Specifically, our framework features two core agents, i.e., Designer for personalized garment selection and Consultant for virtual try-on, whose outputs are progressively refined via hierarchical vision-language model feedback spanning individual items, complete outfits, and try-on efficacy. Counterexamples are aggregated into negative prompts, forming a closed-loop mechanism that enhances recommendation quality. To assess the performance, we introduce a comprehensive evaluation suite encompassing style consistency, visual quality, face similarity, and artistic appraisal. Extensive experiments demonstrate StyleTailor’s superior performance in delivering personalized designs and recommendations, outperforming strong baselines without negative feedback and establishing a new benchmark for intelligent fashion systems.

[15] From Label Error Detection to Correction: A Modular Framework and Benchmark for Object Detection Datasets [cs.CV | cs.LG]

Sarina Penquitt, Jonathan Klees, Rinor Cakaj, Daniel Kondermann, Matthias Rottmann

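TL;DR: The paper proposes REC✓D, a semi-automated label correction framework that systematically addresses label errors in object detection datasets, verifying and correcting errors through crowd-sourced microtasks and markedly improving annotation quality on the KITTI dataset.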

Details

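Motivation: Label errors (missing labels, incorrect classification, or inaccurate localization) are widespread in object detection datasets and affect model training and evaluation, yet systematic large-scale correction methods are lacking.

Result: On the "pedestrian" class of KITTI, REC✓D corrects hundreds of errors, but even the best existing detectors still miss up to 66% of true errors and, with low-quality labels, can introduce new errors.

Insight: 1. Label error correction needs automated detection combined with human verification; 2. Existing detection methods leave substantial room for improvement; 3. The released benchmark is an important resource for future research.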

Abstract: Object detection has advanced rapidly in recent years, driven by increasingly large and diverse datasets. However, label errors, defined as missing labels, incorrect classification or inaccurate localization, often compromise the quality of these datasets. This can have a significant impact on the outcomes of training and benchmark evaluations. Although several methods now exist for detecting label errors in object detection datasets, they are typically validated only on synthetic benchmarks or limited manual inspection. How to correct such errors systematically and at scale therefore remains an open problem. We introduce a semi-automated framework for label-error correction called REC✓D (Rechecked). Building on existing detectors, the framework pairs their error proposals with lightweight, crowd-sourced microtasks. These tasks enable multiple annotators to independently verify each candidate bounding box, and their responses are aggregated to estimate ambiguity and improve label quality. To demonstrate the effectiveness of REC✓D, we apply it to the class pedestrian in the KITTI dataset. Our crowdsourced review yields high-quality corrected annotations, which indicate a rate of at least 24% of missing and inaccurate annotations in original annotations. This validated set will be released as a new real-world benchmark for label error detection and correction. We show that current label error detection methods, when combined with our correction framework, can recover hundreds of errors in the time it would take a human to annotate bounding boxes from scratch. However, even the best methods still miss up to 66% of the true errors and with low quality labels introduce more errors than they find. This highlights the urgent need for further research, now enabled by our released benchmark.
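The aggregation step in this abstract (several annotators verify each candidate box, and their responses yield both a label and an ambiguity estimate) can be illustrated with a majority vote plus a disagreement rate. This is a generic sketch, not the paper's aggregation rule: the answer strings, the function name `aggregate_votes`, and the disagreement-based ambiguity score are assumptions.

```python
from collections import Counter

def aggregate_votes(votes):
    """Return (majority answer, ambiguity) for one candidate bounding box.

    Ambiguity is the share of annotators who disagree with the majority:
    0.0 means unanimous, values near 0.5 mean the box is contested.
    Ties are broken by whichever answer was seen first.
    """
    counts = Counter(votes)
    answer, top_count = counts.most_common(1)[0]
    ambiguity = 1.0 - top_count / len(votes)
    return answer, ambiguity

# Hypothetical microtask responses for two candidate boxes.
clear_box = aggregate_votes(["pedestrian"] * 5)
contested_box = aggregate_votes(
    ["pedestrian", "pedestrian", "pedestrian", "pedestrian", "not_pedestrian"]
)
```

Boxes with high ambiguity could be routed to additional annotators or kept out of the corrected set, while unanimous ones can be accepted directly.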

[16] On the effectiveness of multimodal privileged knowledge distillation in two vision transformer based diagnostic applications [cs.CV | cs.LG]

Simon Baur, Alexandra Benova, Emilio Dolgener Cantú, Jackie Ma

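TL;DR: The paper proposes multimodal privileged knowledge distillation (MMPKD), which uses additional modalities available only during training to guide a student Vision Transformer. It improves the zero-shot localization ability of the vision model's attention maps, but the effect does not generalize across domains.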

Details

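Motivation: Clinical deep learning often needs multiple data modalities to support decisions, but not all modalities are available at inference time, so the extra modalities must be exploited during training to guide a student model.

Result: MMPKD improves the zero-shot localization ability of the student model's attention maps but does not generalize across domains.

Insight: Multimodal privileged distillation helps when testing on a single modality, but the effect is domain-specific and does not transfer directly to other tasks.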

Abstract: Deploying deep learning models in clinical practice often requires leveraging multiple data modalities, such as images, text, and structured data, to achieve robust and trustworthy decisions. However, not all modalities are always available at inference time. In this work, we propose multimodal privileged knowledge distillation (MMPKD), a training strategy that utilizes additional modalities available solely during training to guide a unimodal vision model. Specifically, we used a text-based teacher model for chest radiographs (MIMIC-CXR) and a tabular metadata-based teacher model for mammography (CBIS-DDSM) to distill knowledge into a vision transformer student model. We show that MMPKD can improve the zero-shot ability of the resulting attention maps to localize regions of interest (ROI) in input images, but this effect does not generalize across domains, contrary to what prior research suggests.
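The abstract does not state which distillation objective MMPKD uses. As a generic illustration of how a teacher that sees a privileged modality can guide a vision-only student, the sketch below implements the standard temperature-scaled soft-label distillation loss (KL divergence between softened teacher and student outputs). The logits and the temperature are invented, and this should not be read as the paper's loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher: text or tabular model
    q = softmax(student_logits, temperature)  # student: vision-only model
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Invented logits for a two-class finding (e.g. abnormal vs. normal).
teacher = [2.0, -1.0]
matched = distillation_loss([2.0, -1.0], teacher)
mismatched = distillation_loss([-1.0, 2.0], teacher)
```

At inference time only the student is used, so the privileged modality never has to be available outside training.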

[17] Grounding Emotion Recognition with Visual Prototypes: VEGA – Revisiting CLIP in MERC [cs.CV]

Guanyu Hu, Dimitrios Kollias, Xinyu Yang

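TL;DR: The paper proposes Visual Emotion Guided Anchoring (VEGA), a mechanism that brings class-level visual semantics into CLIP-based multimodal emotion recognition. It builds emotion-specific visual anchors from facial exemplars to achieve better multimodal alignment and psychologically meaningful representations.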

Details

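Motivation: In multimodal emotion recognition, existing models lack psychologically meaningful priors to guide multimodal alignment and do not fully exploit visual signals. The authors use visual anchors to introduce priors grounded in psychological theory.

Result: The model achieves state-of-the-art performance on IEMOCAP and MELD.

Insight: Combining visual anchors with psychological theory can markedly improve multimodal emotion recognition, suggesting that psychology-inspired methods have potential value for AI tasks.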

Abstract: Multimodal Emotion Recognition in Conversations remains a challenging task due to the complex interplay of textual, acoustic and visual signals. While recent models have improved performance via advanced fusion strategies, they often lack psychologically meaningful priors to guide multimodal alignment. In this paper, we revisit the use of CLIP and propose a novel Visual Emotion Guided Anchoring (VEGA) mechanism that introduces class-level visual semantics into the fusion and classification process. Distinct from prior work that primarily utilizes CLIP’s textual encoder, our approach leverages its image encoder to construct emotion-specific visual anchors based on facial exemplars. These anchors guide unimodal and multimodal features toward a perceptually grounded and psychologically aligned representation space, drawing inspiration from cognitive theories (prototypical emotion categories and multisensory integration). A stochastic anchor sampling strategy further enhances robustness by balancing semantic stability and intra-class diversity. Integrated into a dual-branch architecture with self-distillation, our VEGA-augmented model achieves state-of-the-art performance on IEMOCAP and MELD. Code is available at: https://github.com/dkollias/VEGA.
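The abstract does not spell out how anchors are sampled. One plausible reading of "stochastic anchor sampling" is to draw a random subset of a class's facial exemplar embeddings at each step and average them into that class's anchor. The sketch below follows that reading only: the subset-and-average rule, the function name `sample_anchor`, and the two-dimensional toy embeddings are assumptions, not the released VEGA code.

```python
import random

def sample_anchor(exemplars, k, rng):
    """Average k randomly chosen exemplar embeddings into one class anchor."""
    chosen = rng.sample(exemplars, k)
    dim = len(chosen[0])
    return [sum(vec[i] for vec in chosen) / k for i in range(dim)]

# Toy stand-ins for image-encoder embeddings of "happy" face exemplars.
happy_exemplars = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rng = random.Random(0)  # seeded so the example is repeatable

full_anchor = sample_anchor(happy_exemplars, k=3, rng=rng)    # mean of all exemplars
single_anchor = sample_anchor(happy_exemplars, k=1, rng=rng)  # one exemplar as anchor
```

A larger `k` keeps the anchor close to the class mean (semantic stability), while a smaller `k` lets it vary between steps (intra-class diversity), which matches the trade-off the abstract mentions.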

[18] Bridging Brain Connectomes and Clinical Reports for Early Alzheimer’s Disease Diagnosis [cs.CV | cs.LG]

Jing Zhang, Xiaowei Yu, Minheng Chen, Lu Zhang, Tong Chen

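TL;DR: The paper proposes a novel framework that aligns brain connectomes with clinical reports in a shared cross-modal latent space to improve representation learning for early Alzheimer's disease diagnosis.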

Details

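Motivation: Combining brain imaging data with clinical reports allows multimodal information to support more effective diagnosis, but effectively linking objective imaging data with subjective text reports remains a challenge.

Result: Applied to mild cognitive impairment (MCI) on the ADNI dataset, the method achieves state-of-the-art predictive performance and identifies clinically meaningful connectome-text pairs.

Insight: Brain disorders often manifest as network-level abnormalities rather than isolated regional alterations; the method offers a new angle for studying the early mechanisms of Alzheimer's disease.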

Abstract: Integrating brain imaging data with clinical reports offers a valuable opportunity to leverage complementary multimodal information for more effective and timely diagnosis in practical clinical settings. This approach has gained significant attention in brain disorder research, yet a key challenge remains: how to effectively link objective imaging data with subjective text-based reports, such as doctors’ notes. In this work, we propose a novel framework that aligns brain connectomes with clinical reports in a shared cross-modal latent space at both the subject and connectome levels, thereby enhancing representation learning. The key innovation of our approach is that we treat brain subnetworks as tokens of imaging data, rather than raw image patches, to align with word tokens in clinical reports. This enables a more efficient identification of system-level associations between neuroimaging findings and clinical observations, which is critical since brain disorders often manifest as network-level abnormalities rather than isolated regional alterations. We applied our method to mild cognitive impairment (MCI) using the ADNI dataset. Our approach not only achieves state-of-the-art predictive performance but also identifies clinically meaningful connectome-text pairs, offering new insights into the early mechanisms of Alzheimer’s disease and supporting the development of clinically useful multimodal biomarkers.

[19] Surformer v1: Transformer-Based Surface Classification Using Tactile and Vision Features cs.CV | cs.AI

Manish Kansana, Elias Hossain, Shahram Rahimi, Noorbakhsh Amiri Golilarz

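TL;DR: Surformer v1 is a Transformer-based architecture for surface classification that combines tactile and visual features. Modality-specific encoders and cross-modal attention layers provide efficient multimodal fusion, and the model performs well in both accuracy and computational efficiency.

Details

Motivation: Surface material recognition is a key task for robotic perception and physical interaction. Especially when tactile and visual inputs are combined, efficient and accurate models are needed to support real-time applications.

Result: Surformer v1 reaches 99.4% accuracy in multimodal classification with an inference time of 0.77 ms, a better efficiency-accuracy trade-off than the Multimodal CNN, which is slightly more accurate but significantly slower.

Insight: Combining structured features with an efficient architecture such as a Transformer can markedly improve efficiency in multimodal tasks without significantly sacrificing accuracy.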

Abstract: Surface material recognition is a key component in robotic perception and physical interaction, particularly when leveraging both tactile and visual sensory inputs. In this work, we propose Surformer v1, a transformer-based architecture designed for surface classification using structured tactile features and PCA-reduced visual embeddings extracted via ResNet-50. The model integrates modality-specific encoders with cross-modal attention layers, enabling rich interactions between vision and touch. Currently, state-of-the-art deep learning models for vision tasks have achieved remarkable performance. With this in mind, our first set of experiments focused exclusively on tactile-only surface classification. Using feature engineering, we trained and evaluated multiple machine learning models, assessing their accuracy and inference time. We then implemented an encoder-only Transformer model tailored for tactile features. This model not only achieved the highest accuracy but also demonstrated significantly faster inference time compared to other evaluated models, highlighting its potential for real-time applications. To extend this investigation, we introduced a multimodal fusion setup by combining vision and tactile inputs. We trained both Surformer v1 (using structured features) and Multimodal CNN (using raw images) to examine the impact of feature-based versus image-based multimodal learning on classification accuracy and computational efficiency. The results showed that Surformer v1 achieved 99.4% accuracy with an inference time of 0.77 ms, while the Multimodal CNN achieved slightly higher accuracy but required significantly more inference time. These findings suggest Surformer v1 offers a compelling balance between accuracy, efficiency, and computational cost for surface material recognition.
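The cross-modal attention mentioned in the Surformer v1 abstract can be illustrated with plain scaled dot-product attention, in which tactile tokens act as queries over vision tokens. This is a minimal sketch and not the authors' implementation: the learned query/key/value projections are omitted, and the toy vectors `tactile` and `vision` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query row attends over all key rows.

    Assumption: no learned projections; inputs are used directly as Q, K, V.
    """
    dim = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        fused.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return fused


# Hypothetical toy features: 2 tactile tokens attend over 3 vision tokens.
tactile = [[1.0, 0.0], [0.0, 1.0]]
vision = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(tactile, vision, vision)
```

Each output row is a convex combination of the vision rows, weighted by how strongly the tactile query matches each vision key.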

[20] ImpliHateVid: A Benchmark Dataset and Two-stage Contrastive Learning Framework for Implicit Hate Speech Detection in Videos cs.CV | cs.LG

Mohammad Zia Ur Rehman, Anukriti Bhatnagar, Omkar Kabde, Shubhi Bansal, Nagendra Kumar

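TL;DR: The paper introduces ImpliHateVid, a new dataset dedicated to implicit hate speech detection in videos, and designs a two-stage contrastive learning framework that improves detection using multimodal features (audio, text, image) and additional features (sentiment, emotion, captions).

Details

Motivation: Existing research focuses mainly on hate speech detection in text and images, while implicit hate speech detection in videos remains underexplored. To address this, the paper proposes a new dataset and a new detection framework.

Result: Experiments show that the method performs well on both the ImpliHateVid and HateMM datasets, validating the effectiveness of multimodal contrastive learning for hateful content detection.

Insight: Multimodal features and additional features (such as sentiment and emotion) are crucial for implicit hate speech detection, and two-stage contrastive learning can fuse these features effectively.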

Abstract: Existing research has primarily focused on text- and image-based hate speech detection, while video-based approaches remain underexplored. In this work, we introduce a novel dataset, ImpliHateVid, specifically curated for implicit hate speech detection in videos. ImpliHateVid consists of 2,009 videos comprising 509 implicit hate videos, 500 explicit hate videos, and 1,000 non-hate videos, making it one of the first large-scale video datasets dedicated to implicit hate detection. We also propose a novel two-stage contrastive learning framework for hate speech detection in videos. In the first stage, we train modality-specific encoders for audio, text, and image using contrastive loss by concatenating features from the three encoders. In the second stage, we train cross-encoders using contrastive learning to refine multimodal representations. Additionally, we incorporate sentiment, emotion, and caption-based features to enhance implicit hate detection. We evaluate our method on two datasets, ImpliHateVid for implicit hate speech detection and HateMM for general hate speech detection in videos, demonstrating the effectiveness of the proposed multimodal contrastive learning for hateful content detection in videos and the significance of our dataset.

Sihan Ma, Qiming Wu, Ruotong Jiang, Frank Burns

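TL;DR: ContextGuard-LVLM is a framework built on vision-language large models (LVLMs) that aims to strengthen news veracity detection through fine-grained cross-modal contextual consistency (FCCC) verification. It outperforms zero-shot baselines on sentiment, narrative theme, and logical coherence.

Details

Motivation: The proliferation of digital news media calls for reliable methods to verify content veracity. Traditional cross-modal consistency methods are often limited to entity matching and overlook deeper contextual alignment such as sentiment, narrative, and logical coherence.

Result: ContextGuard-LVLM outperforms baselines such as InstructBLIP and LLaVA 1.5 on multiple fine-grained consistency tasks, and stands out in complex logical reasoning and in agreement with human expert judgments.

Insight: Fine-grained cross-modal contextual consistency (FCCC) verification is key: beyond entity alignment, it must capture deeper associations in sentiment, narrative theme, and logical coherence.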

Abstract: The proliferation of digital news media necessitates robust methods for verifying content veracity, particularly regarding the consistency between visual and textual information. Traditional approaches often fall short in addressing the fine-grained cross-modal contextual consistency (FCCC) problem, which encompasses deeper alignment of visual narrative, emotional tone, and background information with text, beyond mere entity matching. To address this, we propose ContextGuard-LVLM, a novel framework built upon advanced Vision-Language Large Models (LVLMs) and integrating a multi-stage contextual reasoning mechanism. Our model is uniquely enhanced through reinforced or adversarial learning paradigms, enabling it to detect subtle contextual misalignments that evade zero-shot baselines. We extend and augment three established datasets (TamperedNews-Ent, News400-Ent, MMG-Ent) with new fine-grained contextual annotations, including “contextual sentiment,” “visual narrative theme,” and “scene-event logical coherence,” and introduce a comprehensive CTXT (Contextual Coherence) entity type. Extensive experiments demonstrate that ContextGuard-LVLM consistently outperforms state-of-the-art zero-shot LVLM baselines (InstructBLIP and LLaVA 1.5) across nearly all fine-grained consistency tasks, showing significant improvements in complex logical reasoning and nuanced contextual understanding. Furthermore, our model exhibits superior robustness to subtle perturbations and a higher agreement rate with human expert judgments on challenging samples, affirming its efficacy in discerning sophisticated forms of context detachment.

[22] VL-MedGuide: A Visual-Linguistic Large Model for Intelligent and Explainable Skin Disease Auxiliary Diagnosis cs.CV

Kexin Yu, Zihan Xu, Jialei Xie, Carter Adams

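TL;DR: The paper presents VL-MedGuide, a framework based on Visual-Language Large Models (LVLMs) for intelligent and explainable auxiliary diagnosis of skin diseases. By combining a Multi-modal Concept Perception Module with an Explainable Disease Reasoning Module, VL-MedGuide achieves state-of-the-art performance in disease diagnosis and concept detection and provides clear, trustworthy explanations.

Details

Motivation: Skin disease diagnosis is challenged by complex, diverse visual features and by the lack of interpretability in existing purely visual diagnostic models; VL-MedGuide aims to address both.

Result: On the Derm7pt dataset, VL-MedGuide surpasses existing baselines in both disease diagnosis (83.55% BACC, 80.12% F1) and concept detection (76.10% BACC, 67.45% F1).

Insight: By incorporating multimodal understanding, VL-MedGuide not only improves diagnostic performance but also provides an interpretable reasoning process, increasing its clinical utility.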

Abstract: Accurate diagnosis of skin diseases remains a significant challenge due to the complex and diverse visual features present in dermatoscopic images, often compounded by a lack of interpretability in existing purely visual diagnostic models. To address these limitations, this study introduces VL-MedGuide (Visual-Linguistic Medical Guide), a novel framework leveraging the powerful multi-modal understanding and reasoning capabilities of Visual-Language Large Models (LVLMs) for intelligent and inherently interpretable auxiliary diagnosis of skin conditions. VL-MedGuide operates in two interconnected stages: a Multi-modal Concept Perception Module, which identifies and linguistically describes dermatologically relevant visual features through sophisticated prompt engineering, and an Explainable Disease Reasoning Module, which integrates these concepts with raw visual information via Chain-of-Thought prompting to provide precise disease diagnoses alongside transparent rationales. Comprehensive experiments on the Derm7pt dataset demonstrate that VL-MedGuide achieves state-of-the-art performance in both disease diagnosis (83.55% BACC, 80.12% F1) and concept detection (76.10% BACC, 67.45% F1), surpassing existing baselines. Furthermore, human evaluations confirm the high clarity, completeness, and trustworthiness of its generated explanations, bridging the gap between AI performance and clinical utility by offering actionable, explainable insights for dermatological practice.

[23] Learning More by Seeing Less: Line Drawing Pretraining for Efficient, Transferable, and Human-Aligned Vision cs.CV

Tianqin Li, George Liu, Tai Sing Lee

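TL;DR: The paper proposes line drawings as a structure-first pretraining modality to induce more compact and generalizable visual representations. Experiments show that models pretrained on line drawings perform better on classification, detection, and segmentation, have lower intrinsic dimensionality, and distill more readily into lightweight student models.

Details

Motivation: Modern computer vision systems depend on redundant visual inputs, whereas humans can efficiently understand structure from sparse line drawings. The paper aims to develop more efficient, more generalizable visual representations through line drawing pretraining.

Result: Models pretrained on line drawings show stronger shape bias, more focused attention, lower intrinsic dimensionality, and knowledge that is easier to distill; an unsupervised variant is also effective.

Insight: Structure-first visual learning improves efficiency, generalization, and human-aligned inductive biases, suggesting a path toward more robust and adaptable vision systems.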

Abstract: Despite remarkable progress in computer vision, modern recognition systems remain limited by their dependence on rich, redundant visual inputs. In contrast, humans can effortlessly understand sparse, minimal representations like line drawings - suggesting that structure, rather than appearance, underlies efficient visual understanding. In this work, we propose using line drawings as a structure-first pretraining modality to induce more compact and generalizable visual representations. We show that models pretrained on line drawings develop stronger shape bias, more focused attention, and greater data efficiency across classification, detection, and segmentation tasks. Notably, these models also exhibit lower intrinsic dimensionality, requiring significantly fewer principal components to capture representational variance - echoing similar observations of low-dimensional, efficient representations in the brain. Beyond performance improvements, line drawing pretraining produces more compressible representations, enabling better distillation into lightweight student models. Students distilled from line-pretrained teachers consistently outperform those trained from color-supervised teachers, highlighting the benefits of structurally compact knowledge. Finally, we demonstrate that line drawing pretraining can also be extended to the unsupervised setting via our proposed method “learning to draw”. Together, our results support the view that structure-first visual learning fosters efficiency, generalization, and human-aligned inductive biases - offering a simple yet powerful strategy for building more robust and adaptable vision systems.
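The abstract above measures intrinsic dimensionality by how many principal components are needed to capture representational variance. A minimal sketch of that count follows, assuming the per-component variances (PCA eigenvalues) are already available. The 0.9 threshold and both variance spectra are hypothetical illustration values, not results from the paper.

```python
def components_for_variance(variances, threshold=0.9):
    """Smallest number of principal components whose cumulative share of
    the total variance reaches `threshold` (threshold is an assumption)."""
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, variance in enumerate(ordered, start=1):
        running += variance
        if running / total >= threshold:
            return count
    return len(ordered)


# Hypothetical spectra: a compact representation concentrates variance in a
# few components, a diffuse one spreads it across many.
compact_spectrum = [8.0, 1.5, 0.25, 0.125, 0.125]
diffuse_spectrum = [2.5, 2.25, 2.0, 1.75, 1.5]

compact_dim = components_for_variance(compact_spectrum)
diffuse_dim = components_for_variance(diffuse_spectrum)
```

A lower count for the same threshold corresponds to the "lower intrinsic dimensionality" the abstract reports for line-pretrained models.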

[24] MMFformer: Multimodal Fusion Transformer Network for Depression Detection cs.CV | cs.AI | cs.CL | cs.LG | cs.SD | eess.AS

Md Rezwanul Haque, Md. Milon Islam, S M Taslim Uddin Raju, Hamdi Altaheri, Lobna Nassar

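TL;DR: MMFformer is a multimodal fusion Transformer network for detecting depression from social media content; by combining spatio-temporal features with multimodal fusion strategies, it significantly improves depression detection performance.

Details

Motivation: Early detection of depression is crucial for treatment, but existing methods rely on subjective evaluation and struggle to extract useful information from multimodal data.

Result: The model performs well on two large-scale depression detection datasets, improving the F1 score by 13.92% on D-Vlog and 7.74% on LMVD.

Insight: Multimodal fusion and the Transformer architecture are highly effective for extracting spatio-temporal features in depression detection, and late and intermediate fusion strategies help improve performance.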

Abstract: Depression is a serious mental health illness that significantly affects an individual’s well-being and quality of life, making early detection crucial for adequate care and treatment. Detecting depression is often difficult, as it is based primarily on subjective evaluations during clinical interviews. Hence, early diagnosis of depression from social network content has become a prominent research area. The extensive and diverse nature of user-generated information poses a significant challenge, limiting the accurate extraction of relevant temporal information and the effective fusion of data across multiple modalities. This paper introduces MMFformer, a multimodal depression detection network designed to retrieve depressive spatio-temporal high-level patterns from multimodal social media information. A transformer network with residual connections captures spatial features from videos, and a transformer encoder is used to model important temporal dynamics in audio. Moreover, the fusion architecture fuses the extracted features through late and intermediate fusion strategies to identify the most relevant intermodal correlations among them. Finally, the proposed network is assessed on two large-scale depression detection datasets, and the results clearly reveal that it surpasses existing state-of-the-art approaches, improving the F1-Score by 13.92% on the D-Vlog dataset and 7.74% on the LMVD dataset. The code is made available publicly at https://github.com/rezwanh001/Large-Scale-Multimodal-Depression-Detection.

[25] Restage4D: Reanimating Deformable 3D Reconstruction from a Single Video cs.CV

Jixuan He, Chieh Hubert Lin, Lu Qi, Ming-Hsuan Yang

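TL;DR: Restage4D is a method for reanimating deformable 3D scenes from a single video; it uses the video's motion information to generate physically consistent 4D content.

Details

Motivation: Existing generative models provide semantically rich appearance but struggle to reproduce physical realism and motion dynamics. Real-world videos, by contrast, provide physically plausible geometry and motion cues, so the authors explore how to exploit them to reanimate 4D scenes.

Result: Evaluated on DAVIS and PointOdyssey, the method improves geometric consistency, motion quality, and 3D tracking performance. It can also automatically correct errors introduced by generative models.

Insight: Motion information from real-world videos can compensate for the weaknesses of generative models in 4D restaging tasks and yield physically consistent results.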

Abstract: Creating deformable 3D content has gained increasing attention with the rise of text-to-image and image-to-video generative models. While these models provide rich semantic priors for appearance, they struggle to capture the physical realism and motion dynamics needed for authentic 4D scene synthesis. In contrast, real-world videos can provide physically grounded geometry and articulation cues that are difficult to hallucinate. This raises a question: can we generate physically consistent 4D content by leveraging the motion priors of real-world video? In this work, we explore the task of reanimating deformable 3D scenes from a single video, using the original sequence as a supervisory signal to correct artifacts from synthetic motion. We introduce Restage4D, a geometry-preserving pipeline for video-conditioned 4D restaging. Our approach uses a video-rewinding training strategy to temporally bridge a real base video and a synthetic driving video via a shared motion representation. We further incorporate an occlusion-aware rigidity loss and a disocclusion backtracing mechanism to improve structural and geometry consistency under challenging motion. We validate Restage4D on DAVIS and PointOdyssey, demonstrating improved geometry consistency, motion quality, and 3D tracking performance. Our method not only preserves deformable structure under novel motion, but also automatically corrects errors introduced by generative models, revealing the potential of video priors in the 4D restaging task. Source code and trained models will be released.

[26] FoundBioNet: A Foundation-Based Model for IDH Genotyping of Glioma from Multi-Parametric MRI cs.CV | cs.AI

Somayeh Farahani, Marjaneh Hejazi, Antonio Di Ieva, Sidong Liu

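TL;DR: FoundBioNet is a foundation-model-based deep learning network for noninvasively predicting glioma IDH mutation status from multi-parametric MRI. Two key modules (TAFE and CMD) improve prediction accuracy, and its advantage is validated on multi-center datasets.

Details

Motivation: Conventional IDH mutation testing relies on invasive tissue sampling, which cannot capture a tumor's spatial heterogeneity, while existing deep learning models are limited by scarce annotated data. FoundBioNet aims to address these problems with a generalizable noninvasive approach.

Result: On several independent test sets (EGD, TCGA, and others) the model reaches AUCs of 90.58%, 88.08%, and so on, significantly outperforming baseline methods (p <= 0.05). Ablation studies confirm that both TAFE and CMD are necessary.

Insight: By combining pretraining with task-specific modules, FoundBioNet achieves generalizable and accurate IDH genotyping of glioma, offering a potential tool for personalized medicine.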

Abstract: Accurate, noninvasive detection of isocitrate dehydrogenase (IDH) mutation is essential for effective glioma management. Traditional methods rely on invasive tissue sampling, which may fail to capture a tumor’s spatial heterogeneity. While deep learning models have shown promise in molecular profiling, their performance is often limited by scarce annotated data. In contrast, foundation deep learning models offer a more generalizable approach for glioma imaging biomarkers. We propose a Foundation-based Biomarker Network (FoundBioNet) that utilizes a SWIN-UNETR-based architecture to noninvasively predict IDH mutation status from multi-parametric MRI. Two key modules are incorporated: Tumor-Aware Feature Encoding (TAFE) for extracting multi-scale, tumor-focused features, and Cross-Modality Differential (CMD) for highlighting subtle T2-FLAIR mismatch signals associated with IDH mutation. The model was trained and validated on a diverse, multi-center cohort of 1705 glioma patients from six public datasets. Our model achieved AUCs of 90.58%, 88.08%, 65.41%, and 80.31% on independent test sets from EGD, TCGA, Ivy GAP, RHUH, and UPenn, consistently outperforming baseline approaches (p <= 0.05). Ablation studies confirmed that both the TAFE and CMD modules are essential for improving predictive accuracy. By integrating large-scale pretraining and task-specific fine-tuning, FoundBioNet enables generalizable glioma characterization. This approach enhances diagnostic accuracy and interpretability, with the potential to enable more personalized patient care.

[27] VOccl3D: A Video Benchmark Dataset for 3D Human Pose and Shape Estimation under real Occlusions cs.CV | cs.GR

Yash Garg, Saketh Bachu, Arindam Dutta, Rohit Lal, Sarosij Bose

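TL;DR: VOccl3D is a video benchmark dataset for 3D human pose and shape estimation under realistic occlusions, filling the gap left by existing datasets that lack them. Built with advanced computer graphics rendering, it contains diverse realistic occlusion scenarios, and fine-tuning existing methods (such as CLIFF and BEDLAM-CLIFF) on it yields significant performance gains.

Details

Motivation: Existing 3D human pose and shape estimation methods perform poorly under complex poses or significant occlusion, and occlusions in existing datasets are mostly random patches or clipart-style overlays that lack realism. A more realistic occlusion dataset is therefore needed to advance research.

Result: On multiple public datasets and the VOccl3D test split, the fine-tuned methods (such as CLIFF) show significant qualitative and quantitative improvements. The performance of the object detector also improves.

Insight: Realistic occlusion datasets are essential for advancing 3D human pose and shape estimation. VOccl3D gives future research a benchmark closer to real-world conditions and should encourage the development of robust methods.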

Abstract: Human pose and shape (HPS) estimation methods have been extensively studied, with many demonstrating high zero-shot performance on in-the-wild images and videos. However, these methods often struggle in challenging scenarios involving complex human poses or significant occlusions. Although some studies address 3D human pose estimation under occlusion, they typically evaluate performance on datasets that lack realistic or substantial occlusions, e.g., most existing datasets introduce occlusions with random patches over the human or clipart-style overlays, which may not reflect real-world challenges. To bridge this gap in realistic occlusion datasets, we introduce a novel benchmark dataset, VOccl3D, a Video-based human Occlusion dataset with 3D body pose and shape annotations. Inspired by works such as AGORA and BEDLAM, we constructed this dataset using advanced computer graphics rendering techniques, incorporating diverse real-world occlusion scenarios, clothing textures, and human motions. Additionally, we fine-tuned recent HPS methods, CLIFF and BEDLAM-CLIFF, on our dataset, demonstrating significant qualitative and quantitative improvements across multiple public datasets, as well as on the test split of our dataset, while comparing its performance with other state-of-the-art methods. Furthermore, we leveraged our dataset to enhance human detection performance under occlusion by fine-tuning an existing object detector, YOLO11, thus leading to a robust end-to-end HPS estimation system under occlusions. Overall, this dataset serves as a valuable resource for future research aimed at benchmarking methods designed to handle occlusions, offering a more realistic alternative to existing occlusion datasets. See the project page for code and dataset: https://yashgarg98.github.io/VOccl3D-dataset/

[28] SafePLUG: Empowering Multimodal LLMs with Pixel-Level Insight and Temporal Grounding for Traffic Accident Understanding cs.CV | cs.AI

Zihao Sheng, Zilin Huang, Yen-Jung Chen, Yansong Qu, Yuhao Luo

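TL;DR: SafePLUG is a multimodal large language model (MLLM) framework focused on fine-grained analysis and temporal grounding for traffic accident understanding; its pixel-level understanding and temporal grounding capabilities significantly improve the analysis of complex scenes.

Details

Motivation: Existing MLLMs for traffic accident understanding focus mainly on coarse-grained image-level or video-level comprehension and struggle with fine-grained visual details or localized scene components, limiting their use in complex accident scenarios. SafePLUG aims to fill this gap with more comprehensive analysis capabilities.

Result: Experiments show that SafePLUG performs well on region-based question answering, pixel-level segmentation, temporal event localization, and accident event understanding, demonstrating its fine-grained analysis capability.

Insight: SafePLUG lays a foundation for fine-grained understanding of traffic scenes and could improve the safety and situational awareness of intelligent transportation systems.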

Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress across a range of vision-language tasks and demonstrate strong potential for traffic accident understanding. However, existing MLLMs in this domain primarily focus on coarse-grained image-level or video-level comprehension and often struggle to handle fine-grained visual details or localized scene components, limiting their applicability in complex accident scenarios. To address these limitations, we propose SafePLUG, a novel framework that empowers MLLMs with both Pixel-Level Understanding and temporal Grounding for comprehensive traffic accident analysis. SafePLUG supports both arbitrary-shaped visual prompts for region-aware question answering and pixel-level segmentation based on language instructions, while also enabling the recognition of temporally anchored events in traffic accident scenarios. To advance the development of MLLMs for traffic accident understanding, we curate a new dataset containing multimodal question-answer pairs centered on diverse accident scenarios, with detailed pixel-level annotations and temporal event boundaries. Experimental results show that SafePLUG achieves strong performance on multiple tasks, including region-based question answering, pixel-level segmentation, temporal event localization, and accident event understanding. These capabilities lay a foundation for fine-grained understanding of complex traffic scenes, with the potential to improve driving safety and enhance situational awareness in smart transportation systems. The code, dataset, and model checkpoints will be made publicly available at: https://zihaosheng.github.io/SafePLUG

[29] DiffUS: Differentiable Ultrasound Rendering from Volumetric Imaging cs.CV | cs.GR

Noe Bertramo, Gabriel Duguey, Vivek Gopalakrishnan

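TL;DR: DiffUS is a physics-based, differentiable ultrasound renderer that generates realistic B-mode ultrasound images from 3D MRI volumes for intraoperative guidance.

Details

Motivation: Intraoperative ultrasound images are difficult to interpret; DiffUS aims to bridge the gap between preoperative planning and intraoperative guidance by generating realistic ultrasound images.

Result: Evaluation on the ReMIND dataset shows that DiffUS can generate anatomically accurate ultrasound images from brain MRI data.

Insight: Because DiffUS is differentiable, it suits gradient-based optimization tasks such as slice-to-volume registration and volumetric reconstruction, broadening its potential clinical applications.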

Abstract: Intraoperative ultrasound imaging provides real-time guidance during numerous surgical procedures, but its interpretation is complicated by noise, artifacts, and poor alignment with high-resolution preoperative MRI/CT scans. To bridge the gap between preoperative planning and intraoperative guidance, we present DiffUS, a physics-based, differentiable ultrasound renderer that synthesizes realistic B-mode images from volumetric imaging. DiffUS first converts MRI 3D scans into acoustic impedance volumes using a machine learning approach. Next, we simulate ultrasound beam propagation using ray tracing with coupled reflection-transmission equations. DiffUS formulates wave propagation as a sparse linear system that captures multiple internal reflections. Finally, we reconstruct B-mode images via depth-resolved echo extraction across fan-shaped acquisition geometry, incorporating realistic artifacts including speckle noise and depth-dependent degradation. DiffUS is entirely implemented as differentiable tensor operations in PyTorch, enabling gradient-based optimization for downstream applications such as slice-to-volume registration and volumetric reconstruction. Evaluation on the ReMIND dataset demonstrates DiffUS’s ability to generate anatomically accurate ultrasound images from brain MRI data.
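To make the reflection-transmission step in the DiffUS abstract concrete, the sketch below traces one ray through a 1D acoustic impedance profile using the standard normal-incidence intensity reflection coefficient, R = ((Z2 - Z1) / (Z2 + Z1))^2. It is a deliberately simplified illustration, not the DiffUS renderer: it keeps only single-bounce echoes (the paper's sparse linear system also captures multiple internal reflections), ignores attenuation and speckle, and uses hypothetical impedance values in arbitrary units.

```python
def reflection_coefficient(z1, z2):
    """Intensity reflection coefficient at a planar interface, normal incidence."""
    return ((z2 - z1) / (z2 + z1)) ** 2


def first_order_echoes(impedances):
    """Return (interface_index, echo_intensity) pairs along one ray.

    Simplification (assumption): single-bounce echoes only, no attenuation.
    """
    echoes = []
    forward = 1.0  # fraction of the emitted intensity still travelling forward
    for i in range(len(impedances) - 1):
        r = reflection_coefficient(impedances[i], impedances[i + 1])
        # The echo crosses the earlier interfaces again on its way back,
        # so the forward transmission factor is applied twice.
        echoes.append((i + 1, forward * r * forward))
        forward *= 1.0 - r
    return echoes


# Hypothetical profile: two samples of equal impedance, then a stiffer layer.
profile = [1.5, 1.5, 3.0]
echoes = first_order_echoes(profile)
```

No echo is produced where adjacent samples share the same impedance; the echo strength grows with the impedance mismatch at each interface.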

Aarav Mehta, Priya Deshmukh, Vikram Singh, Siddharth Malhotra, Krishnan Menon Iyer

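TL;DR: The paper proposes a crisp edge detection method for medical images that uses a top-down backward refinement architecture and sub-pixel upsampling to significantly improve the localization accuracy of organ boundaries.

Details

Motivation: Precise localization of organ boundaries is critical in medical imaging, but existing deep convolutional edge detectors lack localization accuracy and cannot meet clinical applications that require millimeter-level precision.

Result: Experiments on several CT and MRI organ datasets show that the proposed method significantly outperforms baseline methods and existing medical edge detection methods under strict criteria such as boundary F-measure and Hausdorff distance, and improves downstream tasks such as segmentation and registration.

Insight: Top-down feature fusion and sub-pixel upsampling can significantly improve boundary localization in medical images without a significant increase in computational cost, thereby improving common medical image analysis tasks.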

Abstract: Accurate localization of organ boundaries is critical in medical imaging for segmentation, registration, surgical planning, and radiotherapy. While deep convolutional networks (ConvNets) have advanced general-purpose edge detection to near-human performance on natural images, their outputs often lack precise localization, a limitation that is particularly harmful in medical applications where millimeter-level accuracy is required. Building on a systematic analysis of ConvNet edge outputs, we propose a medically focused crisp edge detector that adapts a novel top-down backward refinement architecture to medical images (2D and volumetric). Our method progressively upsamples and fuses high-level semantic features with fine-grained low-level cues through a backward refinement pathway, producing high-resolution, well-localized organ boundaries. We further extend the design to handle anisotropic volumes by combining 2D slice-wise refinement with light 3D context aggregation to retain computational efficiency. Evaluations on several CT and MRI organ datasets demonstrate substantially improved boundary localization under strict criteria (boundary F-measure, Hausdorff distance) compared to baseline ConvNet detectors and contemporary medical edge/contour methods. Importantly, integrating our crisp edge maps into downstream pipelines yields consistent gains in organ segmentation (higher Dice scores, lower boundary errors), more accurate image registration, and improved delineation of lesions near organ interfaces. The proposed approach produces clinically valuable, crisp organ edges that materially enhance common medical-imaging tasks.
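One of the strict criteria named in the abstract above is the Hausdorff distance, which scores a predicted boundary by its worst-case deviation from the reference boundary. A minimal brute-force sketch over 2D point sets follows; the two toy contours are hypothetical, and real evaluations would typically work in physical units (for example millimeters) rather than pixel indices.

```python
import math


def directed_hausdorff(source, target):
    """Largest distance from any source point to its nearest target point."""
    return max(min(math.dist(p, q) for q in target) for p in source)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical contours: the prediction misses the last boundary point by one pixel.
predicted = [(0, 0), (1, 0), (2, 0)]
reference = [(0, 0), (1, 0), (2, 1)]
boundary_error = hausdorff(predicted, reference)
```

Because the metric takes a maximum, a single badly localized boundary point dominates the score, which is why it is considered a strict criterion.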

[31] VesselRW: Weakly Supervised Subcutaneous Vessel Segmentation via Learned Random Walk Propagation cs.CV

Ayaan Nooruddin Siddiqui, Mahnoor Zaidi, Ayesha Nazneen Shahbaz, Priyadarshini Chatterjee, Krishnan Menon Iyer

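TL;DR: The paper proposes VesselRW, a weakly supervised method for subcutaneous vessel segmentation. A learned random walk label propagation model turns sparse annotations into dense probabilistic supervision; combined with an uncertainty-weighted loss and a topology-aware regularizer, it substantially reduces annotation burden and improves the completeness and topological accuracy of the segmentation.

Details

Motivation: Subcutaneous vessel segmentation is hampered by scarce annotations and low image contrast. The paper aims to cut annotation cost through weakly supervised learning while preserving the completeness and topology of the segmented vessels.

Result: On clinical subcutaneous vessel datasets, VesselRW outperforms naive training on sparse labels and conventional dense pseudo-labeling, producing more complete vascular maps and better calibrated uncertainty estimates.

Insight: Jointly learning the label propagator and the segmentation model can implicitly discover vessel edges and continuity constraints without explicit edge supervision; the topology-aware regularizer is essential for clinically usable vessel segmentation.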

Abstract: Accurate segmentation of subcutaneous vessels from clinical images is hampered by scarce, expensive ground truth and by low contrast, noisy appearance of vessels across patients and modalities. We present a novel weakly supervised training framework tailored for subcutaneous vessel segmentation that leverages inexpensive sparse annotations (e.g., centerline traces, dot markers, or short scribbles). Sparse labels are expanded into dense, probabilistic supervision via a differentiable random walk label propagation model whose transition weights incorporate image driven vesselness cues and tubular continuity priors. The propagation yields per-pixel hitting probabilities together with calibrated uncertainty estimates; these are incorporated into an uncertainty weighted loss to avoid over fitting to ambiguous regions. Crucially, the label-propagator is learned jointly with a CNN based segmentation predictor, enabling the system to discover vessel edges and continuity constraints without explicit edge supervision. We further introduce a topology aware regularizer that encourages centerline connectivity and penalizes spurious branches, improving clinical usability. In experiments on clinical subcutaneous imaging datasets, our method consistently outperforms naive training on sparse labels and conventional dense pseudo-labeling, producing more complete vascular maps and better calibrated uncertainty for downstream decision making. The approach substantially reduces annotation burden while preserving clinically relevant vessel topology.
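The random walk label propagation described in the VesselRW abstract can be illustrated on a tiny graph: seed nodes carry sparse labels, and every unlabeled node receives the probability that a random walker starting there reaches a vessel seed first. This is a sketch under stated assumptions, not the paper's model: the edge weights here are fixed by hand (in the paper they are learned from vesselness and continuity cues), the solver is plain Gauss-Seidel relaxation rather than a differentiable layer, and the chain graph is hypothetical.

```python
def propagate_labels(edge_weights, seeds, sweeps=500):
    """Hitting probabilities of a random walk on an undirected weighted graph.

    edge_weights: {(i, j): weight}; seeds: {node: 0.0 (background) or 1.0 (vessel)}.
    Each unlabeled node converges to the weighted mean of its neighbours.
    """
    nodes = sorted({n for edge in edge_weights for n in edge})
    neighbours = {n: [] for n in nodes}
    for (i, j), w in edge_weights.items():
        neighbours[i].append((j, w))
        neighbours[j].append((i, w))
    prob = {n: seeds.get(n, 0.5) for n in nodes}
    for _ in range(sweeps):
        for n in nodes:
            if n in seeds:
                continue  # seed labels stay fixed
            total = sum(w for _, w in neighbours[n])
            prob[n] = sum(w * prob[m] for m, w in neighbours[n]) / total
    return prob


# Hypothetical 5-pixel chain: pixel 0 is a background dot, pixel 4 a vessel dot.
chain = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 4): 1.0}
dense = propagate_labels(chain, {0: 0.0, 4: 1.0})
```

With uniform weights the probabilities interpolate linearly between the seeds; lowering the weight of an edge (for instance across a weak-vesselness boundary) makes the labels change sharply there instead.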

[32] VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding cs.CV | cs.AI | I.2.10

Jianxiang He, Shaoguang Wang, Weiyu Guo, Meisheng Hong, Jungang Li

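TL;DR: VSI is a multimodal keyframe retrieval method that combines visual content and subtitles, significantly improving long video understanding and question answering and surpassing baselines by 20.35% on keyframe localization and 15.79% on video QA.

Details

Motivation: Long video understanding is challenging because of the sheer data volume, and existing keyframe retrieval methods are limited by weak multimodal alignment and missing temporal semantic information.

Result: On LongVideoBench, VSI reaches 40.00% keyframe localization accuracy and 68.48% accuracy on video question answering, surpassing baselines by 20.35% and 15.79%, respectively.

Insight: Jointly exploiting multimodal information (such as subtitles and visual content) can significantly improve performance on long video understanding tasks.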

Abstract: Long video understanding presents a significant challenge to multimodal large language models (MLLMs) primarily due to the immense data scale. A critical and widely adopted strategy for making this task computationally tractable is keyframe retrieval, which seeks to identify a sparse set of video frames that are most salient to a given textual query. However, this approach is hindered by weak multimodal alignment between textual queries and visual content, and it fails to capture the complex temporal semantic information required for precise reasoning. To address this, we propose Visual-Subtitle Integration (VSI), a multimodal keyframe search method that integrates subtitles, timestamps, and scene boundaries into a unified multimodal search process. The method captures the visual information of video frames and the complementary textual information through a dual-stream search mechanism, with a Video Search Stream and a Subtitle Match Stream respectively, and improves keyframe search accuracy through the interaction of the two streams. Experimental results show that VSI achieves 40.00% keyframe localization accuracy on the text-relevant subset of LongVideoBench and 68.48% accuracy on downstream long Video-QA tasks, surpassing competitive baselines by 20.35% and 15.79%, respectively. Furthermore, on LongVideoBench, VSI achieves state-of-the-art (SOTA) results in medium-to-long video-QA tasks, demonstrating the robustness and generalizability of the proposed multimodal search strategy.

[33] NS-FPN: Improving Infrared Small Target Detection and Segmentation from Noise Suppression Perspective cs.CV | cs.AI

Maoxun Yuan, Duanni Meng, Ziteng Xi, Tianyi Zhao, Shiji Zhao

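TL;DR: The paper proposes a novel noise-suppression feature pyramid network (NS-FPN) that uses a low-frequency guided feature purification (LFP) module and a spiral-aware feature sampling (SFS) module to improve infrared small target detection and segmentation (IRSTDS) from a noise suppression perspective.

Details

Motivation: Infrared small target detection and segmentation is critical in defense and civilian applications, but dim target appearance, indistinct shape, and severe background noise lead to high false alarm rates in existing methods.

Result: On public IRSTDS datasets, NS-FPN significantly reduces false alarms and achieves superior performance.

Insight: Noise suppression is key to improving IRSTDS performance, and combining frequency-domain analysis with lightweight design offers a useful approach for similar tasks.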

Abstract: Infrared small target detection and segmentation (IRSTDS) is a critical yet challenging task in defense and civilian applications, owing to the dim, shapeless appearance of targets and severe background clutter. Recent CNN-based methods have achieved promising target perception results, but they focus only on enhancing feature representation to offset the impact of noise, which increases false alarms. In this paper, by analyzing the problem in the frequency domain, we pioneer improving performance from a noise suppression perspective and propose a novel noise-suppression feature pyramid network (NS-FPN), which integrates a low-frequency guided feature purification (LFP) module and a spiral-aware feature sampling (SFS) module into the original FPN structure. The LFP module suppresses noise features by purifying high-frequency components to achieve feature enhancement free of noise interference, while the SFS module further adopts spiral sampling to fuse target-relevant features during feature fusion. Our NS-FPN is designed to be lightweight yet effective and can be easily plugged into existing IRSTDS frameworks. Extensive experiments on public IRSTDS datasets demonstrate that our method significantly reduces false alarms and achieves superior performance on IRSTDS tasks.

[34] BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models cs.CV | cs.AI

Jianting Tang, Yubo Wang, Haoyu Cao, Linli Xu

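TL;DR: The paper proposes BASIC, which introduces direct visual supervision into MLLMs to optimize the generation of visual embeddings and thereby improve alignment between the visual and textual modalities.

Details

Motivation: Current MLLMs handle vision-text alignment by relying only on contextual cues and auto-regressive supervision of textual outputs, neglecting the need for direct visual supervision, so the potential of visual embeddings is not fully exploited.

Result: Experiments show that BASIC significantly improves MLLM performance across a range of benchmarks without additional supervisory models or manual annotation.

Insight: Introducing a supervision signal directly on visual embeddings is an effective way to improve vision-text alignment, and it also demonstrates the value of the embeddings refined in the LLM's shallow layers.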

Abstract: Mainstream Multimodal Large Language Models (MLLMs) achieve visual understanding by using a vision projector to bridge well-pretrained vision encoders and large language models (LLMs). The inherent gap between visual and textual modalities makes the embeddings from the vision projector critical for visual comprehension. However, current alignment approaches treat visual embeddings as contextual cues and merely apply auto-regressive supervision to textual outputs, neglecting the necessity of introducing equivalent direct visual supervision, which hinders the potential finer alignment of visual embeddings. In this paper, based on our analysis of the refinement process of visual embeddings in the LLM’s shallow layers, we propose BASIC, a method that utilizes refined visual embeddings within the LLM as supervision to directly guide the projector in generating initial visual embeddings. Specifically, the guidance is conducted from two perspectives: (i) optimizing embedding directions by reducing angles between initial and supervisory embeddings in semantic space; (ii) improving semantic matching by minimizing disparities between the logit distributions of both visual embeddings. Without additional supervisory models or artificial annotations, BASIC significantly improves the performance of MLLMs across a wide range of benchmarks, demonstrating the effectiveness of our introduced direct visual supervision.
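The two guidance terms in the BASIC abstract, (i) reducing the angle between initial and supervisory embeddings and (ii) reducing the disparity between their logit distributions, could be written as a cosine loss and a divergence between softmax distributions. The sketch below is one plausible reading, not the authors' formulation: using 1 minus cosine similarity and a KL divergence are assumptions, and all function names and example vectors are hypothetical.

```python
import math


def direction_loss(initial, supervisory):
    """1 - cosine similarity: zero when the two embeddings point the same way."""
    dot = sum(a * b for a, b in zip(initial, supervisory))
    norms = math.sqrt(sum(a * a for a in initial)) * math.sqrt(sum(b * b for b in supervisory))
    return 1.0 - dot / norms


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matching_loss(supervisory_logits, initial_logits):
    """KL(supervisory || initial) between softmax distributions (KL is an assumption)."""
    p = softmax(supervisory_logits)
    q = softmax(initial_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical embeddings and vocabulary logits for one visual token.
aligned = direction_loss([1.0, 0.0], [2.0, 0.0])
orthogonal = direction_loss([1.0, 0.0], [0.0, 1.0])
same_logits = matching_loss([0.5, 1.0, -2.0], [0.5, 1.0, -2.0])
shifted_logits = matching_loss([0.5, 1.0, -2.0], [1.0, 0.5, -2.0])
```

In a training setup the supervisory side would be treated as a fixed target, so that gradients only update the projector producing the initial embeddings; that detail is not shown here.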

[35] eMotions: A Large-Scale Dataset and Audio-Visual Fusion Network for Emotion Analysis in Short-form Videos cs.CV

Xuecheng Wu, Dingkang Yang, Danlei Huang, Xinyi Yin, Yifan Wang

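TL;DR: The paper introduces eMotions, a large-scale emotion dataset of short-form videos, and tackles the challenges of short-video emotion analysis with AV-CANet, an audio-visual fusion network.

Details

Motivation: Short-form videos (SVs) are increasingly important for acquiring and sharing information, but their multimodal complexity makes emotion analysis difficult, and existing datasets are scarce and affected by subjective bias.

Result: AV-CANet is validated on three eMotions-related datasets and four public VEA datasets, and ablation studies confirm the contribution of its key components.

Insight: Inconsistency between audio and visual features is the main challenge in short-video emotion analysis; local-global fusion and a global optimization strategy can significantly improve performance.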

Abstract: Short-form videos (SVs) have become a vital part of our online routine for acquiring and sharing information. Their multimodal complexity poses new challenges for video analysis, highlighting the need for video emotion analysis (VEA) within the community. Given the limited availability of SV emotion data, we introduce eMotions, a large-scale dataset consisting of 27,996 videos with full-scale annotations. To ensure quality and reduce subjective bias, we emphasize better personnel allocation and propose a multi-stage annotation procedure. Additionally, we provide category-balanced and test-oriented variants through targeted sampling to meet diverse needs. While there have been significant studies on videos with clear emotional cues (e.g., facial expressions), analyzing emotions in SVs remains a challenging task. The challenge arises from the broader content diversity, which introduces more distinct semantic gaps and complicates the representation learning of emotion-related features. Furthermore, the prevalence of audio-visual co-expressions in SVs leads to local biases and collective information gaps caused by inconsistencies in emotional expressions. To tackle this, we propose AV-CANet, an end-to-end audio-visual fusion network that leverages a video transformer to capture semantically relevant representations. We further introduce the Local-Global Fusion Module, designed to progressively capture the correlations of audio-visual features. In addition, the EP-CE Loss is constructed to globally steer optimization with tripolar penalties. Extensive experiments across three eMotions-related datasets and four public VEA datasets demonstrate the effectiveness of our proposed AV-CANet, while providing broad insights for future research. Moreover, we conduct ablation studies to examine the critical components of our method. Dataset and code will be made available on GitHub.

[36] A Simple yet Powerful Instance-Aware Prompting Framework for Training-free Camouflaged Object Segmentation cs.CV

Chao Yin, Jide Li, Xiaoqiang Li

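TL;DR: The paper proposes a training-free Instance-Aware Prompting Framework (IAPF) for camouflaged object segmentation (COS). It addresses the coarse segmentation caused by semantic-level prompts and significantly improves performance by generating fine-grained instance masks.

Details

Motivation: Training-based COS methods degrade when annotations are sparse, while existing training-free methods rely on semantic-level prompts and cannot handle multi-instance scenes effectively.

Result: On standard COS benchmarks, IAPF significantly outperforms existing training-free COS methods.

Insight: Fine-grained instance-level prompting combined with multi-path consistency voting can significantly improve camouflaged object segmentation, especially in multi-instance scenes.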

Abstract: Camouflaged Object Segmentation (COS) remains highly challenging due to the intrinsic visual similarity between target objects and their surroundings. While training-based COS methods achieve good performance, their performance degrades rapidly with increased annotation sparsity. To circumvent this limitation, recent studies have explored training-free COS methods, leveraging the Segment Anything Model (SAM) by automatically generating visual prompts from a single task-generic prompt (e.g., “camouflaged animal”) uniformly applied across all test images. However, these methods typically produce only semantic-level visual prompts, causing SAM to output coarse semantic masks and thus failing to handle scenarios with multiple discrete camouflaged instances effectively. To address this critical limitation, we propose a simple yet powerful Instance-Aware Prompting Framework (IAPF), the first training-free COS pipeline that explicitly converts a task-generic prompt into fine-grained instance masks. Specifically, the IAPF comprises three steps: (1) Text Prompt Generator, utilizing task-generic queries to prompt a Multimodal Large Language Model (MLLM) for generating image-specific foreground and background tags; (2) Instance Mask Generator, leveraging Grounding DINO to produce precise instance-level bounding box prompts, alongside the proposed Single-Foreground Multi-Background Prompting strategy to sample region-constrained point prompts within each box, enabling SAM to yield a candidate instance mask; (3) Self-consistency Instance Mask Voting, which selects the final COS prediction by identifying the candidate mask most consistent across multiple candidate instance masks. Extensive evaluations on standard COS benchmarks demonstrate that the proposed IAPF significantly surpasses existing state-of-the-art training-free COS methods.
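
The third step, selecting the candidate mask most consistent with the others, can be sketched as follows. The abstract does not specify the consistency measure, so this sketch assumes mean intersection-over-union (IoU) against the other candidates; masks are represented as sets of (row, col) pixels, and both function names are hypothetical.

```python
def iou(mask_a, mask_b):
    """Intersection-over-union of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 1.0


def vote_most_consistent(candidates):
    """Return the index of the candidate with the highest mean IoU to the others.

    Mean IoU as the consistency score is an assumption, not the paper's
    confirmed criterion. Ties are resolved in favor of the earliest candidate.
    """
    if not candidates:
        raise ValueError("at least one candidate mask is required")
    if len(candidates) == 1:
        return 0
    best_index, best_score = 0, -1.0
    for i, mask in enumerate(candidates):
        others = [m for j, m in enumerate(candidates) if j != i]
        score = sum(iou(mask, other) for other in others) / len(others)
        if score > best_score:
            best_index, best_score = i, score
    return best_index
```

With three nested candidates of one, two, and three pixels, the middle mask wins because it overlaps best with both neighbors, which is the behavior a consensus vote is meant to produce.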

[37] MultiRef: Controllable Image Generation with Multiple Visual References cs.CV

Ruoxi Chen, Dongping Chen, Siyuan Wu, Sinan Wang, Shiyun Lang

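TL;DR: The paper introduces MultiRef-bench, an evaluation framework for image generation from multiple visual references, and reveals the limitations of existing models on this task.

Details

Motivation: Existing image generation frameworks mainly rely on single-source inputs (text or a single reference image), whereas designers often combine multiple visual references when creating. The paper aims to fill this gap.

Result: The best model, OmniGen, scores 66.6% on synthetic samples and 79.0% on real-world samples, indicating that multi-reference generation remains challenging.

Insight: Multi-reference image generation calls for more flexible model designs; current techniques do not yet fully meet the needs of human creative work.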

Abstract: Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generative frameworks predominantly rely on single-source inputs, either text prompts or individual reference images. In this paper, we focus on the task of controllable image generation using multiple visual references. We introduce MultiRef-bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world samples that require incorporating visual content from multiple reference images. The synthetic samples are generated through our data engine RefBlend, with 10 reference types and 33 reference combinations. Based on RefBlend, we further construct a dataset MultiRef containing 38k high-quality images to facilitate further research. Our experiments across three interleaved image-text models (i.e., OmniGen, ACE, and Show-o) and six agentic frameworks (e.g., ChatDiT and LLM + SD) reveal that even state-of-the-art systems struggle with multi-reference conditioning, with the best model OmniGen achieving only 66.6% in synthetic samples and 79.0% in real-world cases on average compared to the golden answer. These findings provide valuable directions for developing more flexible and human-like creative tools that can effectively integrate multiple sources of visual inspiration. The dataset is publicly available at: https://multiref.github.io/.

[38] MMReID-Bench: Unleashing the Power of MLLMs for Effective and Versatile Person Re-identification cs.CV | cs.AI

Jinhao Li, Zijian Chen, Lirong Deng, Changbo Wang, Guangtao Zhai

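TL;DR: MMReID-Bench is the first multi-task, multi-modal benchmark designed specifically for person re-identification (ReID). It demonstrates the potential of multimodal large language models (MLLMs) to overcome the limited generalization of traditional ReID models.

Details

Motivation: Traditional ReID models perform poorly and generalize weakly on multi-modal data (e.g., RGB, thermal, infrared, sketch images, and textual descriptions). The emergence of MLLMs offers a possible solution.

Result: Experiments show that MLLMs perform well on most modalities but remain limited on thermal and infrared data.

Insight: MMReID-Bench provides a benchmark for developing more robust and general multimodal foundation models and advances the ReID field.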

Abstract: Person re-identification (ReID) aims to retrieve images of a person of interest from a gallery, with wide applications in medical rehabilitation, abnormal behavior detection, and public security. However, traditional person ReID models are limited to uni-modal capability, leading to poor generalization on multi-modal data, such as RGB, thermal, infrared, sketch images, and textual descriptions. Recently, the emergence of multi-modal large language models (MLLMs) has opened a promising avenue for addressing this problem. Despite this potential, existing methods merely regard MLLMs as feature extractors or caption generators, which does not fully unleash their reasoning, instruction-following, and cross-modal understanding capabilities. To bridge this gap, we introduce MMReID-Bench, the first multi-task multi-modal benchmark specifically designed for person ReID. MMReID-Bench includes 20,710 multi-modal queries and gallery images covering 10 different person ReID tasks. Comprehensive experiments demonstrate the remarkable capabilities of MLLMs in delivering effective and versatile person ReID. Nevertheless, they also have limitations in handling a few modalities, particularly thermal and infrared data. We hope MMReID-Bench can help the community develop more robust and generalizable multimodal foundation models for person ReID.

[39] AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning cs.CV

Shihao Yuan, Yahui Liu, Yang Yue, Jingyuan Zhang, Wangmeng Zuo

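TL;DR: AR-GRPO applies reinforcement learning (RL) to optimize autoregressive (AR) image generation models, significantly improving the quality of generated images and their alignment with human preference.

Details

Motivation: Inspired by the success of RL in large language models (LLMs), the paper applies RL to autoregressive image generation models to improve image quality.

Result: Experiments show that AR-GRPO significantly outperforms standard AR baselines on both class-conditional and text-conditional image generation, with consistent gains across multiple evaluation metrics.

Insight: RL offers a new optimization route for AR image generation and shows potential for controllable, high-quality image synthesis.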

Abstract: Inspired by the success of reinforcement learning (RL) in refining large language models (LLMs), we propose AR-GRPO, an approach to integrate online RL training into autoregressive (AR) image generation models. We adapt the Group Relative Policy Optimization (GRPO) algorithm to refine the vanilla autoregressive models’ outputs by carefully designed reward functions that evaluate generated images across multiple quality dimensions, including perceptual quality, realism, and semantic fidelity. We conduct comprehensive experiments on both class-conditional (i.e., class-to-image) and text-conditional (i.e., text-to-image) image generation tasks, demonstrating that our RL-enhanced framework significantly improves both the image quality and human preference of generated images compared to the standard AR baselines. Our results show consistent improvements across various evaluation metrics, establishing the viability of RL-based optimization for AR image generation and opening new avenues for controllable and high-quality image synthesis. The source codes and models are available at: https://github.com/Kwai-Klear/AR-GRPO.
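
The core idea of GRPO, scoring each sample relative to the other samples generated for the same prompt, can be shown with a small sketch. It assumes the commonly described formulation in which rewards within a group are standardized by the group mean and standard deviation; the use of the population standard deviation and the `eps` stabilizer are illustrative assumptions, and the sketch does not model the paper's reward functions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples drawn for the same condition.

    Samples that score above the group mean get positive advantages and
    samples below it get negative ones. eps (an assumed stabilizer) avoids
    division by zero when every reward in the group is identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, a group of images whose rewards are 1.0, 2.0, and 3.0 yields a negative, zero, and positive advantage respectively, so the policy update favors the better-than-average image without needing a separate value network.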

[40] SLRTP2025 Sign Language Production Challenge: Methodology, Results, and Future Work cs.CV | eess.IV | eess.SP

Harry Walsh, Ed Fish, Ozge Mercanoglu Sincan, Mohamed Ilyes Lakhal, Richard Bowden

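TL;DR: The paper presents the design, results, and future work of the SLRTP2025 Sign Language Production Challenge, which aims to foster comparison and collaboration in sign language production through standardized evaluation metrics.

Details

Motivation: Without standardized evaluation metrics, work in sign language production is hard to compare across systems. The paper therefore runs a challenge to give the field a unified evaluation benchmark.

Result: The challenge attracted 33 participants who submitted 231 solutions; the top team achieved a BLEU-1 score of 31.40 and a DTW-MJE of 0.0574.

Insight: Standardized evaluation and the release of high-quality datasets will advance sign language production; future work can explore multimodal methods and more complex generation tasks.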

Abstract: Sign Language Production (SLP) is the task of generating sign language video from spoken language inputs. The field has seen a range of innovations over the last few years, with the introduction of deep learning-based approaches providing significant improvements in the realism and naturalness of generated outputs. However, the lack of standardized evaluation metrics for SLP approaches hampers meaningful comparisons across different systems. To address this, we introduce the first Sign Language Production Challenge, held as part of the third SLRTP Workshop at CVPR 2025. The competition aims to evaluate architectures that translate from spoken language sentences to a sequence of skeleton poses, known as Text-to-Pose (T2P) translation, over a range of metrics. For our evaluation data, we use the RWTH-PHOENIX-Weather-2014T dataset, a German Sign Language (Deutsche Gebärdensprache, DGS) weather broadcast dataset. In addition, we curate a custom hidden test set from a similar domain of discourse. This paper presents the challenge design and the winning methodologies. The challenge attracted 33 participants who submitted 231 solutions, with the top-performing team achieving a BLEU-1 score of 31.40 and a DTW-MJE of 0.0574. The winning approach utilized a retrieval-based framework and a pre-trained language model. As part of the workshop, we release a standardized evaluation network, including high-quality skeleton extraction-based keypoints establishing a consistent baseline for the SLP field, which will enable future researchers to compare their work against a broader range of methods.
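
The DTW-MJE figure reported above combines dynamic time warping (DTW) with a mean joint error (MJE) between skeleton poses. The sketch below illustrates the general idea only: it assumes the per-frame cost is the mean Euclidean distance over joints and that the accumulated cost is normalized by the longer sequence length. The challenge's exact normalization is not stated in the abstract, so both choices and the function names are assumptions.

```python
import math


def mean_joint_error(frame_a, frame_b):
    """Mean Euclidean distance between corresponding joints of two pose frames."""
    return sum(math.dist(p, q) for p, q in zip(frame_a, frame_b)) / len(frame_a)


def dtw_mje(predicted, reference):
    """DTW-aligned mean joint error between two pose sequences.

    Each sequence is a list of frames and each frame is a list of joint
    coordinates. Dividing by the longer length is an assumed normalization.
    """
    n, m = len(predicted), len(reference)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = mean_joint_error(predicted[i - 1], reference[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / max(n, m)
```

Because DTW may match one frame to several, a prediction that signs the right poses at a slightly different speed is not penalized for the timing difference alone, which is why the metric suits pose sequences of unequal length.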

[41] Beyond Frequency: Seeing Subtle Cues Through the Lens of Spatial Decomposition for Fine-Grained Visual Classification cs.CV | cs.AI

Qin Xu, Lili Zhu, Xiaoxia Cheng, Bo Jiang

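TL;DR: The paper proposes SCOPE, a new fine-grained visual classification method that adaptively enhances detail and semantic features in the spatial domain. It overcomes the inability of conventional frequency-domain methods to adapt dynamically to image content and achieves better classification performance.

Details

Motivation: The key to fine-grained visual classification (FGVC) is capturing subtle visual features, but existing frequency-domain methods lack adaptability because they rely on fixed basis functions. A method that can dynamically adjust feature extraction is therefore needed.

Result: Experiments show that SCOPE achieves new state-of-the-art performance on four fine-grained image classification benchmarks.

Insight: Dynamic feature enhancement in the spatial domain is better suited to capturing subtle visual cues than fixed frequency-domain methods, and multi-stage cascading effectively combines local and global information.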

Abstract: The crux of resolving fine-grained visual classification (FGVC) lies in capturing discriminative and class-specific cues that correspond to subtle visual characteristics. Recently, approaches based on frequency decomposition or transforms have attracted considerable interest owing to their apparent ability to mine discriminative cues. However, frequency-domain methods are based on fixed basis functions, lacking adaptability to image content and unable to dynamically adjust feature extraction according to the discriminative requirements of different images. To address this, we propose a novel method for FGVC, named Subtle-Cue Oriented Perception Engine (SCOPE), which adaptively enhances the representational capability of low-level details and high-level semantics in the spatial domain, breaking through the limitations of fixed scales in the frequency domain and improving the flexibility of multi-scale fusion. The core of SCOPE lies in two modules: the Subtle Detail Extractor (SDE), which dynamically enhances subtle details such as edges and textures from shallow features, and the Salient Semantic Refiner (SSR), which learns semantically coherent and structure-aware refinement features from the high-level features guided by the enhanced shallow features. The SDE and SSR are cascaded stage-by-stage to progressively combine local details with global semantics. Extensive experiments demonstrate that our method achieves new state-of-the-art results on four popular fine-grained image classification benchmarks.

[42] Adversarial Video Promotion Against Text-to-Video Retrieval cs.CV

Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Qian Li, Shuai Liu

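TL;DR: The paper proposes ViPro, the first adversarial attack on text-to-video retrieval (T2VR) that aims to promote a video's rank rather than suppress it as conventional attacks do. It uses Modal Refinement (MoRe) to enhance black-box transferability, and experiments show its advantage over baselines.

Details

Motivation: The robustness of T2VR is understudied; in particular, adversarial attacks that promote video ranks have not been explored. Such attacks may have greater impact, for example by increasing clicks and spreading misleading information.

Result: ViPro surpasses the baselines by over 30%, 10%, and 4% on average in white-box, grey-box, and black-box settings, respectively. The experiments cover 3 leading T2VR models and 3 datasets.

Insight: The work exposes a vulnerability of T2VR to rank-promotion attacks and provides an analytical basis for potential defenses. Attack effectiveness in the multi-target setting confirms the practical threat.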

Abstract: Thanks to the development of cross-modal models, text-to-video retrieval (T2VR) is advancing rapidly, but its robustness remains largely unexamined. Existing attacks against T2VR are designed to push videos away from queries, i.e., suppressing the ranks of videos, while attacks that pull videos towards selected queries, i.e., promoting the ranks of videos, remain largely unexplored. These attacks can be more impactful, as attackers may gain more views/clicks for financial benefits and spread (mis)information widely. To this end, we pioneer the first attack against T2VR to promote videos adversarially, dubbed the Video Promotion attack (ViPro). We further propose Modal Refinement (MoRe) to capture the finer-grained, intricate interaction between visual and textual modalities to enhance black-box transferability. Comprehensive experiments cover 2 existing baselines, 3 leading T2VR models, and 3 prevailing datasets with over 10k videos, evaluated under 3 scenarios. All experiments are conducted in a multi-target setting to reflect realistic scenarios where attackers seek to promote a video for multiple queries simultaneously. We also evaluate our attacks against defences and for imperceptibility. Overall, ViPro surpasses other baselines by over 30%/10%/4% for white/grey/black-box settings on average. Our work highlights an overlooked vulnerability, provides a qualitative analysis of the upper/lower bound of our attacks, and offers insights into potential counterplays. Code will be publicly available at https://github.com/michaeltian108/ViPro.

[43] Evaluating Fisheye-Compatible 3D Gaussian Splatting Methods on Real Images Beyond 180 Degree Field of View cs.CV | cs.GR

Ulas Gunes, Matias Turkulainen, Juho Kannala, Esa Rahtu

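TL;DR: The paper presents the first evaluation of the fisheye-based 3D Gaussian Splatting methods Fisheye-GS and 3DGUT on real images, focusing on scenes with fields of view beyond 180 degrees, and proposes a depth-based initialization strategy.

Details

Motivation: The work addresses the limitations of existing methods under extreme fisheye distortion and large fields of view, particularly sparse inputs and strong distortion in real scenes.

Result: Fisheye-GS performs best at a 160-degree field of view, while 3DGUT remains stable at large fields of view. The proposed depth-based initialization matches SfM in difficult scenes (e.g., fog, glare, or sky).

Insight: Fisheye 3DGS methods are suitable for wide-angle 3D reconstruction, and learning-based depth initialization offers a new way to handle strong distortion.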

Abstract: We present the first evaluation of fisheye-based 3D Gaussian Splatting methods, Fisheye-GS and 3DGUT, on real images with fields of view exceeding 180 degrees. Our study covers both indoor and outdoor scenes captured with 200-degree fisheye cameras and analyzes how each method handles extreme distortion in real-world settings. We evaluate performance under varying fields of view (200, 160, and 120 degrees) to study the tradeoff between peripheral distortion and spatial coverage. Fisheye-GS benefits from field of view (FoV) reduction, particularly at 160 degrees, while 3DGUT remains stable across all settings and maintains high perceptual quality at the full 200-degree view. To address the limitations of SfM-based initialization, which often fails under strong distortion, we also propose a depth-based strategy using UniK3D predictions from only 2-3 fisheye images per scene. Although UniK3D is not trained on real fisheye data, it produces dense point clouds that enable reconstruction quality on par with SfM, even in difficult scenes with fog, glare, or sky. Our results highlight the practical viability of fisheye-based 3DGS methods for wide-angle 3D reconstruction from sparse and distortion-heavy image inputs.

[44] WeatherDiffusion: Weather-Guided Diffusion Model for Forward and Inverse Rendering cs.CV | cs.AI

Yixin Zhu, Zuoliang Zhu, Miloš Hašan, Jian Yang, Jin Xie

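TL;DR: WeatherDiffusion is a diffusion-based framework for forward and inverse rendering of autonomous driving scenes under complex weather and lighting conditions. It supports controllable, text-guided editing and performs well on multiple benchmarks.

Details

Motivation: Complex weather and lighting pose major challenges for forward and inverse rendering of autonomous driving scenes, and existing diffusion models are hard to control and lack robustness.

Result: The method outperforms existing approaches on multiple benchmarks and significantly improves the robustness of object detection and image segmentation in downstream autonomous driving tasks.

Insight: The observation that different intrinsic maps correspond to different image regions is key to high-quality inverse rendering, and the datasets and MAA design provide new tools for rendering under complex weather.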

Abstract: Forward and inverse rendering have emerged as key techniques for enabling understanding and reconstruction in the context of autonomous driving (AD). However, complex weather and illumination pose great challenges to this task. The emergence of large diffusion models has shown promise in achieving reasonable results through learning from 2D priors, but these models are difficult to control and lack robustness. In this paper, we introduce WeatherDiffusion, a diffusion-based framework for forward and inverse rendering on AD scenes with various weather and lighting conditions. Our method enables authentic estimation of material properties, scene geometry, and lighting, and further supports controllable weather and illumination editing through the use of predicted intrinsic maps guided by text descriptions. We observe that different intrinsic maps should correspond to different regions of the original image. Based on this observation, we propose Intrinsic map-aware attention (MAA) to enable high-quality inverse rendering. Additionally, we introduce a synthetic dataset (i.e., WeatherSynthetic) and a real-world dataset (i.e., WeatherReal) for forward and inverse rendering on AD scenes with diverse weather and lighting. Extensive experiments show that our WeatherDiffusion outperforms state-of-the-art methods on several benchmarks. Moreover, our method demonstrates significant value in downstream tasks for AD, enhancing the robustness of object detection and image segmentation in challenging weather scenarios.

[45] OctreeNCA: Single-Pass 184 MP Segmentation on Consumer Hardware cs.CV

Nick Lemke, John Kalkhof, Niklas Babendererde, Anirban Mukhopadhyay

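TL;DR: OctreeNCA is a lightweight Neural Cellular Automaton (NCA) built on an octree data structure for efficient segmentation of high-resolution medical images with significantly lower VRAM usage.

Details

Motivation: Medical image segmentation must handle very large inputs (e.g., 184-megapixel pathology slices). Conventional methods cannot process the whole input at once because of VRAM limits, which hurts global consistency and inference speed.

Result: OctreeNCA maintains high-resolution segmentation performance while using 90% less VRAM than a UNet, and can segment 184-megapixel pathology slices or 1-minute surgical videos in a single pass.

Insight: Combining a lightweight NCA with an octree data structure can substantially improve the efficiency of high-resolution medical image segmentation while avoiding the VRAM bottleneck.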

Abstract: Medical applications demand segmentation of large inputs, like prostate MRIs, pathology slices, or videos of surgery. These inputs should ideally be inferred at once to provide the model with proper spatial or temporal context. When segmenting large inputs, the VRAM consumption of the GPU becomes the bottleneck. Architectures like UNets or Vision Transformers scale very poorly in VRAM consumption, resulting in patch- or frame-wise approaches that compromise global consistency and inference speed. The lightweight Neural Cellular Automaton (NCA) is a bio-inspired model that is by construction size-invariant. However, due to its local-only communication rules, it lacks global knowledge. We propose OctreeNCA by generalizing the neighborhood definition using an octree data structure. Our generalized neighborhood definition enables the efficient traversal of global knowledge. Since deep learning frameworks are mainly developed for large multi-layer networks, their implementation does not fully leverage the advantages of NCAs. We implement an NCA inference function in CUDA that further reduces VRAM demands and increases inference speed. Our OctreeNCA segments high-resolution images and videos quickly while occupying 90% less VRAM than a UNet during evaluation. This allows us to segment 184 Megapixel pathology slices or 1-minute surgical videos at once.

[46] S2-UniSeg: Fast Universal Agglomerative Pooling for Scalable Segment Anything without Supervision cs.CV

Huihui Xu, Jin Ye, Hongqiu Wang, Changkai Ji, Jiashi Lin

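TL;DR: S2-UniSeg is a fast, universal segmentation method that requires no supervision. With a new pseudo-mask algorithm and a continuous pretraining framework, it significantly improves segmentation performance.

Details

Motivation: The multi-stage pretraining of existing self-supervised segmentation methods is time-consuming and hard to scale, leading to discontinuous optimization and sub-optimal solutions.

Result: S2-UniSeg significantly outperforms the SOTA UnSAM model on multiple benchmarks (e.g., AP+6.9 on COCO) and is shown to scale to larger data.

Insight: Fast, parallelized mask generation and continuous optimization are key to improving self-supervised segmentation.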

Abstract: Recent self-supervised image segmentation models have achieved promising performance on semantic segmentation and class-agnostic instance segmentation. However, their pretraining schedule is multi-stage, requiring a time-consuming pseudo-mask generation process between training epochs. This time-consuming offline process not only makes it difficult to scale with training dataset size, but also leads to sub-optimal solutions due to its discontinuous optimization routine. To solve these problems, we first present a novel pseudo-mask algorithm, Fast Universal Agglomerative Pooling (UniAP). Each layer of UniAP can identify groups of similar nodes in parallel, allowing it to generate semantic-level, instance-level, and multi-granular pseudo-masks within tens of milliseconds for one image. Based on the fast UniAP, we propose the Scalable Self-Supervised Universal Segmentation (S2-UniSeg), which employs a student and a momentum teacher for continuous pretraining. A novel segmentation-oriented pretext task, Query-wise Self-Distillation (QuerySD), is proposed to pretrain S2-UniSeg to learn the local-to-global correspondences. Under the same setting, S2-UniSeg outperforms the SOTA UnSAM model, achieving notable improvements of AP+6.9 on COCO, AR+11.1 on UVO, PixelAcc+4.5 on COCOStuff-27, and RQ+8.0 on Cityscapes. After scaling up to a larger 2M-image subset of SA-1B, S2-UniSeg further achieves performance gains on all four benchmarks. Our code and pretrained models are available at https://github.com/bio-mlhui/S2-UniSeg

[47] HiMat: DiT-based Ultra-High Resolution SVBRDF Generation cs.CV | cs.GR

Zixiong Wang, Jian Yang, Yiwei Hu, Milos Hasan, Beibei Wang

TL;DR: HiMat proposes an efficient diffusion-based framework for generating 4K-resolution SVBRDFs using a lightweight CrossStitch module to maintain consistency across maps without altering the DiT backbone.

Details

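Motivation: Generating high-resolution SVBRDFs (spatially varying bidirectional reflectance distribution functions) is essential for 3D content creation. Existing text-to-image models based on diffusion transformers (DiT) offer potential for this task, but efficiently generating multiple aligned SVBRDF maps while keeping them consistent remains challenging.

Result: Experiments show that HiMat generates structurally coherent 4K SVBRDFs with high-frequency detail. Tests with a large set of text prompts validate the approach, which also generalizes to tasks such as intrinsic decomposition.

Insight: Capturing inter-map dependencies with a lightweight module preserves consistency without damaging the pretrained model's priors, providing an efficient solution for high-resolution SVBRDF generation.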

Abstract: Creating highly detailed SVBRDFs is essential for 3D content creation. The rise of high-resolution text-to-image generative models, based on diffusion transformers (DiT), suggests an opportunity to finetune them for this task. However, retargeting the models to produce multiple aligned SVBRDF maps instead of just RGB images, while achieving high efficiency and ensuring consistency across different maps, remains a challenge. In this paper, we introduce HiMat: a memory- and computation-efficient diffusion-based framework capable of generating native 4K-resolution SVBRDFs. A key challenge we address is maintaining consistency across different maps in a lightweight manner, without relying on training new VAEs or significantly altering the DiT backbone (which would damage its prior capabilities). To tackle this, we introduce the CrossStitch module, a lightweight convolutional module that captures inter-map dependencies through localized operations. Its weights are initialized such that the DiT backbone operation is unchanged before finetuning starts. HiMat enables generation with strong structural coherence and high-frequency details. Results with a large set of text prompts demonstrate the effectiveness of our approach for 4K SVBRDF generation. Further experiments suggest generalization to tasks such as intrinsic decomposition.

[48] TerraMAE: Learning Spatial-Spectral Representations from Hyperspectral Earth Observation Data via Adaptive Masked Autoencoders cs.CV | cs.LG

Tanjim Bin Faruk, Abdul Matin, Shrideep Pallickara, Sangmi Lee Pallickara

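TL;DR: TerraMAE is an adaptive masked autoencoder framework for hyperspectral remote sensing data. With an improved architecture and loss function, it better captures spatial-spectral features and performs well on several downstream tasks.

Details

Motivation: Hyperspectral remote sensing data contains rich spectral and spatial information, but existing methods struggle to exploit these complex features. The paper aims to design a self-supervised method that effectively learns spatial-spectral representations.

Result: TerraMAE performs well in high-fidelity image reconstruction and delivers strong performance on crop identification, land cover classification, and soil texture prediction.

Insight: Combining spectral similarity with spatial and spectral quality metrics enables more effective representation learning for hyperspectral data and provides strong support for remote sensing tasks.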

Abstract: Hyperspectral satellite imagery offers sub-30 m views of Earth in hundreds of contiguous spectral bands, enabling fine-grained mapping of soils, crops, and land cover. While self-supervised Masked Autoencoders excel on RGB and low-band multispectral data, they struggle to exploit the intricate spatial-spectral correlations in 200+ band hyperspectral images. We introduce TerraMAE, a novel HSI encoding framework specifically designed to learn highly representative spatial-spectral embeddings for diverse geospatial analyses. TerraMAE features an adaptive channel grouping strategy, based on statistical reflectance properties to capture spectral similarities, and an enhanced reconstruction loss function that incorporates spatial and spectral quality metrics. We demonstrate TerraMAE’s effectiveness through superior spatial-spectral information preservation in high-fidelity image reconstruction. Furthermore, we validate its practical utility and the quality of its learned representations through strong performance on three key downstream geospatial tasks: crop identification, land cover classification, and soil texture prediction.

[49] DocRefine: An Intelligent Framework for Scientific Document Understanding and Content Optimization based on Multimodal Large Model Agents cs.CV

Kun Qian, Wenjie Li, Tianyu Sun, Wenhong Wang, Wenhan Luo

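TL;DR: DocRefine is an intelligent framework based on multimodal large model agents for understanding scientific documents and optimizing their content, using a collaborative multi-agent system for efficient and accurate document processing.

Details

Motivation: The rapid growth of scientific literature, especially in PDF format, calls for more advanced tools to handle complex layouts and multimodal content; traditional methods and existing large models fall short in precision and control.

Result: On the DocEditBench dataset, DocRefine achieves a Semantic Consistency Score (SCS) of 86.7%, a Layout Fidelity Index (LFI) of 93.9%, and an Instruction Adherence Rate (IAR) of 85.0%.

Insight: Through multi-agent collaboration and a closed-loop feedback mechanism, DocRefine offers a new approach to intelligent processing of complex multimodal documents, balancing semantic accuracy and visual consistency.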

Abstract: The exponential growth of scientific literature in PDF format necessitates advanced tools for efficient and accurate document understanding, summarization, and content optimization. Traditional methods fall short in handling complex layouts and multimodal content, while direct application of Large Language Models (LLMs) and Vision-Language Large Models (LVLMs) lacks precision and control for intricate editing tasks. This paper introduces DocRefine, an innovative framework designed for intelligent understanding, content refinement, and automated summarization of scientific PDF documents, driven by natural language instructions. DocRefine leverages the power of advanced LVLMs (e.g., GPT-4o) by orchestrating a sophisticated multi-agent system comprising six specialized and collaborative agents: Layout & Structure Analysis, Multimodal Content Understanding, Instruction Decomposition, Content Refinement, Summarization & Generation, and Fidelity & Consistency Verification. This closed-loop feedback architecture ensures high semantic accuracy and visual fidelity. Evaluated on the comprehensive DocEditBench dataset, DocRefine consistently outperforms state-of-the-art baselines across various tasks, achieving overall scores of 86.7% for Semantic Consistency Score (SCS), 93.9% for Layout Fidelity Index (LFI), and 85.0% for Instruction Adherence Rate (IAR). These results demonstrate DocRefine’s superior capability in handling complex multimodal document editing, preserving semantic integrity, and maintaining visual consistency, marking a significant advancement in automated scientific document processing.

[50] MV-CoRe: Multimodal Visual-Conceptual Reasoning for Complex Visual Question Answering cs.CV

Jingwei Peng, Jiehao Chen, Mateo Alejandro Rojas, Meilin Zhang

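TL;DR: MV-CoRe is a multimodal visual-conceptual reasoning model that improves performance on complex visual question answering by deeply fusing visual and linguistic information, clearly outperforming existing baseline models.

Details

Motivation: Existing large vision-language models are limited on complex VQA tasks because they rely mainly on high-level global features and lack deep fusion of fine-grained visual and semantic information.

Result: MV-CoRe performs well on the GQA, A-OKVQA, and OKVQA benchmarks, reaching 77.5% accuracy on GQA; ablation studies and human evaluation confirm the model's effectiveness.

Insight: Fine-grained visual features and deep multimodal fusion are crucial for complex VQA, and MV-CoRe's design offers a new technical path for similar tasks.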

Abstract: Complex Visual Question Answering (Complex VQA) tasks, which demand sophisticated multi-modal reasoning and external knowledge integration, present significant challenges for existing large vision-language models (LVLMs) often limited by their reliance on high-level global features. To address this, we propose MV-CoRe (Multimodal Visual-Conceptual Reasoning), a novel model designed to enhance Complex VQA performance through the deep fusion of diverse visual and linguistic information. MV-CoRe meticulously integrates global embeddings from pre-trained Vision Large Models (VLMs) and Language Large Models (LLMs) with fine-grained semantic-aware visual features, including object detection characteristics and scene graph representations. An innovative Multimodal Fusion Transformer then processes and deeply integrates these diverse feature sets, enabling rich cross-modal attention and facilitating complex reasoning. We evaluate MV-CoRe on challenging Complex VQA benchmarks, including GQA, A-OKVQA, and OKVQA, after training on VQAv2. Our experimental results demonstrate that MV-CoRe consistently outperforms established LVLM baselines, achieving an overall accuracy of 77.5% on GQA. Ablation studies confirm the critical contribution of both object and scene graph features, and human evaluations further validate MV-CoRe’s superior factual correctness and reasoning depth, underscoring its robust capabilities for deep visual and conceptual understanding.

[51] Large Language Model Evaluated Stand-alone Attention-Assisted Graph Neural Network with Spatial and Structural Information Interaction for Precise Endoscopic Image Segmentation cs.CV

Juntong Fan, Shuyi Fan, Debesh Jha, Changsheng Fang, Tieyong Zeng

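TL;DR: The paper proposes FOCUS-Med, an end-to-end endoscopic image segmentation method that combines a dual graph convolutional network over spatial and structural information with a global attention mechanism. It significantly improves polyp segmentation accuracy and is the first to use a large language model for quality evaluation.

Details

Motivation: Low contrast, specular highlights, and indistinct boundaries make accurate polyp segmentation in endoscopic images challenging. An effective method is needed to improve segmentation and support early colorectal cancer detection.

Result: On multiple public benchmarks, FOCUS-Med achieves the best performance on five key metrics, demonstrating its clinical potential for AI-assisted colonoscopy.

Insight: 1. Jointly modeling spatial and structural information helps address low contrast and blurred boundaries in polyp segmentation. 2. Introducing a large language model offers a new perspective on evaluating medical image segmentation quality.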

Abstract: Accurate endoscopic image segmentation of polyps is critical for early colorectal cancer detection. However, this task remains challenging due to low contrast with surrounding mucosa, specular highlights, and indistinct boundaries. To address these challenges, we propose FOCUS-Med, which stands for Fusion of spatial and structural graph with attentional context-aware polyp segmentation in endoscopic medical imaging. FOCUS-Med integrates a Dual Graph Convolutional Network (Dual-GCN) module to capture contextual spatial and topological structural dependencies. This graph-based representation enables the model to better distinguish polyps from background tissues by leveraging topological cues and spatial connectivity, which are often obscured in raw image intensities. It enhances the model’s ability to preserve boundaries and delineate complex shapes typical of polyps. In addition, a location-fused stand-alone self-attention is employed to strengthen global context integration. To bridge the semantic gap between encoder-decoder layers, we incorporate a trainable weighted fast normalized fusion strategy for efficient multi-scale aggregation. Notably, we are the first to introduce the use of a Large Language Model (LLM) to provide detailed qualitative evaluations of segmentation quality. Extensive experiments on public benchmarks demonstrate that FOCUS-Med achieves state-of-the-art performance across five key metrics, underscoring its effectiveness and clinical potential for AI-assisted colonoscopy.
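The abstract mentions a trainable weighted fast normalized fusion but gives no formula. The sketch below assumes the form commonly used for fast normalized fusion in BiFPN-style feature pyramids (ReLU-clamped weights divided by their sum plus a small epsilon); whether FOCUS-Med uses exactly this form is an assumption, and the function name and example tensors are hypothetical.

```python
import numpy as np

def fast_normalized_fusion(features, raw_weights, eps=1e-4):
    """Fuse same-shaped feature maps with non-negative, normalized weights.

    Assumed form: out = sum_i(w_i * f_i) / (eps + sum_j w_j), with w_i = ReLU(raw_i).
    In a real network raw_weights would be trainable parameters.
    """
    w = np.maximum(np.asarray(raw_weights, dtype=np.float64), 0.0)  # ReLU keeps weights >= 0
    w = w / (w.sum() + eps)                # cheap normalization, no softmax needed
    stacked = np.stack(features, axis=0)   # shape (n_inputs, ...)
    return np.tensordot(w, stacked, axes=1)  # weighted sum over the first axis

# Hypothetical encoder/decoder features of the same shape (C, H, W).
encoder_feat = np.ones((2, 4, 4))
decoder_feat = np.full((2, 4, 4), 3.0)
fused = fast_normalized_fusion([encoder_feat, decoder_feat], raw_weights=[1.0, 1.0])
```

With equal weights the result is close to the element-wise mean of the two inputs; a negative raw weight is clamped to zero, so that input drops out of the fusion.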

[52] TeSO: Representing and Compressing 3D Point Cloud Scenes with Textured Surfel Octree cs.CVPDF

Yueyu Hu, Ran Gong, Tingyu Fan, Yao Wang

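TL;DR: TeSO (Textured Surfel Octree) is a new 3D representation that combines an octree structure with texture maps to achieve high-quality rendering and efficient compression.

Details

Motivation: Existing 3D representations such as point clouds, meshes, and 3D Gaussians are limited in rendering quality, surface definition, and compressibility; a representation is needed that delivers both high-quality rendering and efficient compression.

Result: Compared with existing methods, TeSO achieves higher rendering quality at lower bit rates.

Insight: By separating the geometry and texture representations, TeSO balances rendering quality against compression.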

Abstract: 3D visual content streaming is a key technology for emerging 3D telepresence and AR/VR applications. One fundamental element underlying the technology is a versatile 3D representation that is capable of producing high-quality renders and can be efficiently compressed at the same time. Existing 3D representations like point clouds, meshes and 3D Gaussians each have limitations in terms of rendering quality, surface definition, and compressibility. In this paper, we present the Textured Surfel Octree (TeSO), a novel 3D representation that is built from point clouds but addresses the aforementioned limitations. It represents a 3D scene as cube-bounded surfels organized on an octree, where each surfel is further associated with a texture patch. By approximating a smooth surface with a large surfel at a coarser level of the octree, it reduces the number of primitives required to represent the 3D scene, and yet retains the high-frequency texture details through the texture map attached to each surfel. We further propose a compression scheme to encode the geometry and texture efficiently, leveraging the octree structure. The proposed textured surfel octree combined with the compression scheme achieves higher rendering quality at lower bit-rates compared to multiple point cloud and 3D Gaussian-based baselines.

[53] ForeSight: Multi-View Streaming Joint Object Detection and Trajectory Forecasting cs.CV | cs.ROPDF

Sandro Papais, Letian Wang, Brian Cheong, Steven L. Waslander

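TL;DR: ForeSight is a new joint detection and forecasting framework for 3D visual perception in autonomous driving; bidirectional learning and multi-task streaming let detection and forecasting work together efficiently, significantly improving performance.

Details

Motivation: Traditional methods treat object detection and trajectory forecasting as separate tasks, which makes temporal information hard to exploit and limits performance. ForeSight addresses this by modeling the two jointly.

Result: On the nuScenes dataset, EPA reaches 54.9%, and both mAP and minADE are better than those of other multi-view detection and forecasting models.

Insight: Jointly modeling detection and forecasting makes effective use of spatio-temporal information and improves perception for autonomous driving, while avoiding the error accumulation caused by explicit association.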

Abstract: We introduce ForeSight, a novel joint detection and forecasting framework for vision-based 3D perception in autonomous vehicles. Traditional approaches treat detection and forecasting as separate sequential tasks, limiting their ability to leverage temporal cues. ForeSight addresses this limitation with a multi-task streaming and bidirectional learning approach, allowing detection and forecasting to share query memory and propagate information seamlessly. The forecast-aware detection transformer enhances spatial reasoning by integrating trajectory predictions from a multiple hypothesis forecast memory queue, while the streaming forecast transformer improves temporal consistency using past forecasts and refined detections. Unlike tracking-based methods, ForeSight eliminates the need for explicit object association, reducing error propagation with a tracking-free model that efficiently scales across multi-frame sequences. Experiments on the nuScenes dataset show that ForeSight achieves state-of-the-art performance, achieving an EPA of 54.9%, surpassing previous methods by 9.3%, while also attaining the best mAP and minADE among multi-view detection and forecasting models.

[54] Communication-Efficient Multi-Agent 3D Detection via Hybrid Collaboration cs.CVPDF

Yue Hu, Juntong Peng, Yunqiao Yang, Siheng Chen

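TL;DR: This paper proposes HyComm, a communication-efficient multi-agent 3D detection method whose hybrid collaboration adaptively integrates compact perceptual outputs with information-rich raw observations, optimizing the perceptual information exchanged and adapting to diverse communication scenarios.

Details

Motivation: Multi-agent collaborative 3D detection must balance detection performance against communication bandwidth, and traditional methods struggle to be both efficient and adaptable.

Result: On the DAIR-V2X and OPV2V datasets, HyComm outperforms existing methods (e.g., in AP50) with a communication volume more than 2,006× lower.

Insight: A hybrid collaboration strategy effectively balances detection performance and communication overhead, and standardized data formats improve system adaptability.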

Abstract: Collaborative 3D detection can substantially boost detection performance by allowing agents to exchange complementary information. It inherently results in a fundamental trade-off between detection performance and communication bandwidth. To tackle this bottleneck issue, we propose a novel hybrid collaboration that adaptively integrates two types of communication messages: perceptual outputs, which are compact, and raw observations, which offer richer information. This approach focuses on two key aspects: i) integrating complementary information from two message types and ii) prioritizing the most critical data within each type. By adaptively selecting the most critical set of messages, it ensures optimal perceptual information and adaptability, effectively meeting the demands of diverse communication scenarios. Building on this hybrid collaboration, we present HyComm, a communication-efficient LiDAR-based collaborative 3D detection system. HyComm boasts two main benefits: i) it facilitates adaptable compression rates for messages, addressing various communication requirements, and ii) it uses standardized data formats for messages. This ensures they are independent of specific detection models, fostering adaptability across different agent configurations. To evaluate HyComm, we conduct experiments on both real-world and simulation datasets: DAIR-V2X and OPV2V. HyComm consistently outperforms previous methods and achieves a superior performance-bandwidth trade-off regardless of whether agents use the same or varied detection models. It reduces communication volume by more than 2,006× while still outperforming Where2comm on DAIR-V2X in terms of AP50. The related code will be released.

[55] AugLift: Boosting Generalization in Lifting-based 3D Human Pose Estimation cs.CV | cs.LGPDF

Nikolai Warner, Wenjin Zhang, Irfan Essa, Apaar Sadhwani

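TL;DR: AugLift sparsely augments the 2D keypoint input with confidence scores and depth estimates, significantly improving the generalization of lifting-based 3D human pose estimation without extra data or sensors.

Details

Motivation: Traditional lifting-based 3D human pose estimation methods generalize poorly to new datasets and real-world scenes and need improvement.

Result: Cross-dataset performance improves by 10.1% on average and in-distribution performance by 4.0%, across multiple lifting architectures.

Insight: Sparse, keypoint-aligned context signals (confidence and depth) provide robust frame-level context and are a practical way to improve generalization.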

Abstract: Lifting-based methods for 3D Human Pose Estimation (HPE), which predict 3D poses from detected 2D keypoints, often generalize poorly to new datasets and real-world settings. To address this, we propose AugLift, a simple yet effective reformulation of the standard lifting pipeline that significantly improves generalization performance without requiring additional data collection or sensors. AugLift sparsely enriches the standard input – the 2D keypoint coordinates $(x, y)$ – by augmenting it with a keypoint detection confidence score $c$ and a corresponding depth estimate $d$. These additional signals are computed from the image using off-the-shelf, pre-trained models (e.g., for monocular depth estimation), thereby inheriting their strong generalization capabilities. Importantly, AugLift serves as a modular add-on and can be readily integrated into existing lifting architectures. Our extensive experiments across four datasets demonstrate that AugLift boosts cross-dataset performance on unseen datasets by an average of 10.1%, while also improving in-distribution performance by 4.0%. These gains are consistent across various lifting architectures, highlighting the robustness of our method. Our analysis suggests that these sparse, keypoint-aligned cues provide robust frame-level context, offering a practical way to significantly improve the generalization of any lifting-based pose estimation model. Code will be made publicly available.
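As a rough illustration of the input augmentation described in the abstract, the sketch below builds a per-joint vector (x, y, c, d) from 2D keypoints, detector confidences, and a monocular depth map. The function name, the nearest-pixel depth lookup, and the toy arrays are assumptions made for illustration; the abstract does not say how the depth value is sampled.

```python
import numpy as np

def augment_keypoints(keypoints_xy, confidences, depth_map):
    """Build a (J, 4) lifting input [x, y, c, d] from 2D detections.

    keypoints_xy: (J, 2) pixel coordinates (x = column, y = row)
    confidences:  (J,) detector confidence per keypoint
    depth_map:    (H, W) depth estimate from an off-the-shelf model
    Depth is read at the nearest pixel (an assumed sampling rule).
    """
    h, w = depth_map.shape
    cols = np.clip(np.rint(keypoints_xy[:, 0]).astype(int), 0, w - 1)
    rows = np.clip(np.rint(keypoints_xy[:, 1]).astype(int), 0, h - 1)
    depths = depth_map[rows, cols]
    return np.column_stack([keypoints_xy, confidences, depths])

# Toy example: two joints on a 3x4 depth map.
depth_map = np.arange(12.0).reshape(3, 4)
keypoints = np.array([[1.0, 2.0], [3.0, 0.0]])
conf = np.array([0.9, 0.5])
lifting_input = augment_keypoints(keypoints, conf, depth_map)
```

The resulting array can replace the usual (J, 2) input of a lifting network whose first layer is widened from 2 to 4 channels per joint.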

[56] Perceptual Evaluation of GANs and Diffusion Models for Generating X-rays cs.CV | cs.AIPDF

Gregory Schuit, Denis Parra, Cecilia Besa

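TL;DR: This study evaluates generative adversarial networks (GANs) and diffusion models (DMs) for generating chest X-ray images, finding that DMs are more realistic overall but GANs are more accurate for certain conditions (such as absence of ECS).

Details

Motivation: The core motivation is the quality and clinical utility of generated medical images, especially as a response to data scarcity for low-prevalence abnormalities, so as to improve the generalization and trustworthiness of AI diagnostic tools.

Result: DMs are more realistic overall, but GANs perform better for specific conditions (such as absence of ECS). The study also identifies visual defects in the synthetic images.

Insight: GANs and DMs have complementary strengths, and further refinement is needed before generative models can reliably augment training data for AI diagnostic systems.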

Abstract: Generative image models have achieved remarkable progress in both natural and medical imaging. In the medical context, these techniques offer a potential solution to data scarcity, especially for low-prevalence anomalies that impair the performance of AI-driven diagnostic and segmentation tools. However, questions remain regarding the fidelity and clinical utility of synthetic images, since poor generation quality can undermine model generalizability and trust. In this study, we evaluate the effectiveness of state-of-the-art generative models, namely Generative Adversarial Networks (GANs) and Diffusion Models (DMs), for synthesizing chest X-rays conditioned on four abnormalities: Atelectasis (AT), Lung Opacity (LO), Pleural Effusion (PE), and Enlarged Cardiac Silhouette (ECS). Using a benchmark composed of real images from the MIMIC-CXR dataset and synthetic images from both GANs and DMs, we conducted a reader study with three radiologists of varied experience. Participants were asked to distinguish real from synthetic images and assess the consistency between visual features and the target abnormality. Our results show that while DMs generate more visually realistic images overall, GANs can report better accuracy for specific conditions, such as absence of ECS. We further identify visual cues radiologists use to detect synthetic images, offering insights into the perceptual gaps in current models. These findings underscore the complementary strengths of GANs and DMs and point to the need for further refinement to ensure generative models can reliably augment training datasets for AI diagnostic systems.

[57] CMAMRNet: A Contextual Mask-Aware Network Enhancing Mural Restoration Through Comprehensive Mask Guidance cs.CVPDF

Yingtie Lei, Fanghai Yi, Yihang Dong, Weihuang Liu, Xiaofeng Zhang

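TL;DR: CMAMRNet is a context-aware network for mural restoration that uses multi-scale feature extraction and comprehensive mask guidance to address the shortcomings of existing methods in restoration quality and mask consistency.

Details

Motivation: Murals are cultural heritage vulnerable to environmental and human damage, and their digital restoration is challenged by complex degradation patterns and the need to preserve artistic authenticity. Existing learning-based methods provide inadequate mask guidance, which hurts restoration quality.

Result: On benchmark datasets, CMAMRNet outperforms existing methods and effectively preserves the structural integrity and artistic details of murals.

Insight: Combining mask guidance with multi-scale feature extraction is crucial for mural restoration and offers an efficient solution for cultural heritage preservation.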

Abstract: Murals, as invaluable cultural artifacts, face continuous deterioration from environmental factors and human activities. Digital restoration of murals faces unique challenges due to their complex degradation patterns and the critical need to preserve artistic authenticity. Existing learning-based methods struggle with maintaining consistent mask guidance throughout their networks, leading to insufficient focus on damaged regions and compromised restoration quality. We propose CMAMRNet, a Contextual Mask-Aware Mural Restoration Network that addresses these limitations through comprehensive mask guidance and multi-scale feature extraction. Our framework introduces two key components: (1) the Mask-Aware Up/Down-Sampler (MAUDS), which ensures consistent mask sensitivity across resolution scales through dedicated channel-wise feature selection and mask-guided feature fusion; and (2) the Co-Feature Aggregator (CFA), operating at both the highest and lowest resolutions to extract complementary features for capturing fine textures and global structures in degraded regions. Experimental results on benchmark datasets demonstrate that CMAMRNet outperforms state-of-the-art methods, effectively preserving both structural integrity and artistic details in restored murals. The code is available at https://github.com/CXH-Research/CMAMRNet.

[58] Dynamic Pattern Alignment Learning for Pretraining Lightweight Human-Centric Vision Models cs.CVPDF

Xuanhan Wang, Huimin Deng, Ke Liu, Jun Wang, Lianli Gao

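TL;DR: This paper proposes Dynamic Pattern Alignment Learning (DPAL), a distillation-based pretraining framework for lightweight human-centric vision models (HVMs) that achieves efficient generalization by learning typical visual patterns.

Details

Motivation: Large HVMs depend on heavy compute and large-scale data; the paper proposes a pretraining method for lightweight models to make them more practical in real applications.

Result: Validated on 15 datasets; the lightweight DPAL-ViT/Ti (5M parameters) performs close to large HVMs (such as the 84M-parameter PATH-B) and surpasses other distillation methods.

Insight: Dynamic pattern alignment learning significantly improves the generalization of lightweight models, showing that visual patterns can be transferred efficiently from large models.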

Abstract: Human-centric vision models (HVMs) have achieved remarkable generalization due to large-scale pretraining on massive person images. However, their dependence on large neural architectures and the restricted accessibility of pretraining data significantly limit their practicality in real-world applications. To address this limitation, we propose Dynamic Pattern Alignment Learning (DPAL), a novel distillation-based pretraining framework that efficiently trains lightweight HVMs to acquire strong generalization from large HVMs. In particular, human-centric visual perception is highly dependent on three typical visual patterns: the global identity pattern, the local shape pattern, and the multi-person interaction pattern. To achieve generalizable lightweight HVMs, we first design a dynamic pattern decoder (D-PaDe), which acts as a dynamic Mixture of Experts (MoE) model. It incorporates three specialized experts dedicated to adaptively extracting typical visual patterns, conditioned on both the input image and pattern queries. We then present three levels of alignment objectives, which aim to minimize the generalization gap between lightweight HVMs and large HVMs at the global image level, local pixel level, and instance relation level. With these two deliberate designs, DPAL effectively guides a lightweight model to learn all typical human visual patterns from large HVMs, which can generalize to various human-centric vision tasks. Extensive experiments conducted on 15 challenging datasets demonstrate the effectiveness of DPAL. Remarkably, when employing PATH-B as the teacher, DPAL-ViT/Ti (5M parameters) achieves surprising generalizability similar to existing large HVMs such as PATH-B (84M) and Sapiens-L (307M), and outperforms previous distillation-based pretraining methods including Proteus-ViT/Ti (5M) and TinyMiM-ViT/Ti (5M) by a large margin.

[59] Intention-Aware Diffusion Model for Pedestrian Trajectory Prediction cs.CV | cs.AIPDF

Yu Liu, Zhijie Liu, Xiao Ren, You-Fu Li, He Kong

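TL;DR: This paper proposes a diffusion-based pedestrian trajectory prediction framework that improves accuracy by semantically modeling both short-term and long-term motion intentions.

Details

Motivation: Although existing diffusion models perform well at pedestrian trajectory prediction, the lack of explicit semantic modeling of pedestrian intent can lead to misinterpreted behavior and reduced accuracy.

Result: On the ETH, UCY, and SDD benchmarks, the method achieves results competitive with state-of-the-art methods.

Insight: Explicitly modeling pedestrian intent can significantly improve trajectory prediction accuracy and behavior understanding.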

Abstract: Predicting pedestrian motion trajectories is critical for the path planning and motion control of autonomous vehicles. Recent diffusion-based models have shown promising results in capturing the inherent stochasticity of pedestrian behavior for trajectory prediction. However, the absence of explicit semantic modelling of pedestrian intent in many diffusion-based methods may result in misinterpreted behaviors and reduced prediction accuracy. To address the above challenges, we propose a diffusion-based pedestrian trajectory prediction framework that incorporates both short-term and long-term motion intentions. Short-term intent is modelled using a residual polar representation, which decouples direction and magnitude to capture fine-grained local motion patterns. Long-term intent is estimated through a learnable, token-based endpoint predictor that generates multiple candidate goals with associated probabilities, enabling multimodal and context-aware intention modelling. Furthermore, we enhance the diffusion process by incorporating adaptive guidance and a residual noise predictor that dynamically refines denoising accuracy. The proposed framework is evaluated on the widely used ETH, UCY, and SDD benchmarks, demonstrating competitive results against state-of-the-art methods.
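The abstract says short-term intent uses a residual polar representation that decouples direction and magnitude, without giving the exact definition. One plausible reading, sketched below, is to express each step-to-step displacement as a (magnitude, heading) pair; this interpretation and the function names are assumptions, not the paper's formulation.

```python
import numpy as np

def to_polar_steps(trajectory):
    """Convert an (N, 2) trajectory into per-step magnitude and heading."""
    steps = np.diff(trajectory, axis=0)             # (N-1, 2) displacements
    magnitude = np.hypot(steps[:, 0], steps[:, 1])  # step length
    heading = np.arctan2(steps[:, 1], steps[:, 0])  # direction in radians
    return magnitude, heading

def from_polar_steps(start, magnitude, heading):
    """Rebuild the trajectory from a start point and polar steps."""
    steps = np.stack([magnitude * np.cos(heading),
                      magnitude * np.sin(heading)], axis=1)
    return np.vstack([start, start + np.cumsum(steps, axis=0)])

# Toy trajectory: one unit along x, then two units along y.
traj = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]])
mag, ang = to_polar_steps(traj)
rebuilt = from_polar_steps(traj[0], mag, ang)
```

Separating the two quantities lets a model treat speed changes and turning as independent signals, which is the decoupling the abstract refers to.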

[60] SketchAnimator: Animate Sketch via Motion Customization of Text-to-Video Diffusion Models cs.CVPDF

Ruolin Yang, Da Li, Honggang Zhang, Yi-Zhe Song

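TL;DR: The paper proposes SketchAnimator, which combines an input sketch with motion information from a reference video and uses LoRA and Score Distillation Sampling to generate an animated sketch video, simplifying sketch animation.

Details

Motivation: Sketch animation usually demands professional skill and time and is unfriendly to amateurs. The authors aim to lower the barrier with an automated tool that lets users easily add creative motion to sketches.

Result: Experiments show that the method generates results that preserve the sketch's appearance with motion consistent with the reference video, outperforming other methods.

Insight: Combining LoRA and SDS makes it possible to efficiently merge a static sketch with motion information from a video, providing a new tool for creative design.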

Abstract: Sketching is a uniquely human tool for expressing ideas and creativity. The animation of sketches infuses life into these static drawings, opening a new dimension for designers. Animating sketches is a time-consuming process that demands professional skills and extensive experience, often proving daunting for amateurs. In this paper, we propose a novel sketch animation model, SketchAnimator, which enables adding creative motion to a given sketch, like “a jumping car”. Namely, given an input sketch and a reference video, we divide the sketch animation into three stages: Appearance Learning, Motion Learning and Video Prior Distillation. In stages 1 and 2, we utilize LoRA to integrate sketch appearance information and motion dynamics from the reference video into the pre-trained T2V model. In the third stage, we utilize Score Distillation Sampling (SDS) to update the parameters of the Bezier curves in each sketch frame according to the acquired motion information. Consequently, our model produces a sketch video that not only retains the original appearance of the sketch but also mirrors the dynamic movements of the reference video. We compare our method with alternative approaches and demonstrate that it generates the desired sketch video under the challenge of one-shot motion customization.

[61] CoopDiff: Anticipating 3D Human-object Interactions via Contact-consistent Decoupled Diffusion cs.CVPDF

Xiaotong Lin, Tianming Liang, Jian-Fang Hu, Kun-Yu Lin, Yulei Kang

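TL;DR: CoopDiff is a decoupled diffusion framework based on contact-point consistency that models human and object motion separately and uses shared contact points for coordinated prediction, significantly outperforming existing methods.

Details

Motivation: Existing methods overlook the different motion patterns that humans and objects exhibit because of their different physical properties, and usually predict both with a single model.

Result: Outperforms existing methods on the BEHAVE and Human-object Interaction datasets.

Insight: Decoupling the modeling of human and object motion and adding a contact-point consistency constraint predicts complex human-object interaction dynamics more accurately.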

Abstract: 3D human-object interaction (HOI) anticipation aims to predict the future motion of humans and their manipulated objects, conditioned on the historical context. Generally, the articulated humans and rigid objects exhibit different motion patterns, due to their distinct intrinsic physical properties. However, this distinction is ignored by most of the existing works, which intend to capture the dynamics of both humans and objects within a single prediction model. In this work, we propose a novel contact-consistent decoupled diffusion framework CoopDiff, which employs two distinct branches to decouple human and object motion modeling, with the human-object contact points as shared anchors to bridge the motion generation across branches. The human dynamics branch is aimed to predict highly structured human motion, while the object dynamics branch focuses on the object motion with rigid translations and rotations. These two branches are bridged by a series of shared contact points with consistency constraint for coherent human-object motion prediction. To further enhance human-object consistency and prediction reliability, we propose a human-driven interaction module to guide object motion modeling. Extensive experiments on the BEHAVE and Human-object Interaction datasets demonstrate that our CoopDiff outperforms state-of-the-art methods.

[62] Lightweight Multi-Scale Feature Extraction with Fully Connected LMF Layer for Salient Object Detection cs.CV | cs.AIPDF

Yunpeng Shi, Lei Chen, Xiaolu Shen, Yanju Guo

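TL;DR: This paper proposes a lightweight multi-scale feature extraction layer (the LMF layer) that achieves efficient multi-scale learning through a fully connected structure and depthwise separable dilated convolutions. Building on it, the authors construct LMFNet, which reaches state-of-the-art performance on salient object detection with only 0.81M parameters.

Details

Motivation: Efficient multi-scale feature extraction in lightweight networks has long been a challenge, since traditional methods must trade efficiency against performance. This paper aims to address that problem.

Result: With 0.81M parameters, LMFNet matches or surpasses traditional and lightweight models on salient object detection.

Insight: Combining a fully connected structure with depthwise separable dilated convolutions offers a new approach to multi-scale learning in lightweight networks, and may extend to other image processing tasks.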

Abstract: In the domain of computer vision, multi-scale feature extraction is vital for tasks such as salient object detection. However, achieving this capability in lightweight networks remains challenging due to the trade-off between efficiency and performance. This paper proposes a novel lightweight multi-scale feature extraction layer, termed the LMF layer, which employs depthwise separable dilated convolutions in a fully connected structure. By integrating multiple LMF layers, we develop LMFNet, a lightweight network tailored for salient object detection. Our approach significantly reduces the number of parameters while maintaining competitive performance. Here, we show that LMFNet achieves state-of-the-art or comparable results on five benchmark datasets with only 0.81M parameters, outperforming several traditional and lightweight models in terms of both efficiency and accuracy. Our work not only addresses the challenge of multi-scale learning in lightweight networks but also demonstrates the potential for broader applications in image processing tasks. The related code files are available at https://github.com/Shi-Yun-peng/LMFNet
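To make the parameter savings of depthwise separable dilated convolutions concrete, the sketch below counts weights for a standard convolution and for its depthwise separable counterpart, and computes the effective kernel size under dilation. The channel counts are illustrative assumptions, not LMFNet's actual configuration, and bias terms are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (no bias)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    return c_in * k * k + c_in * c_out

def effective_kernel(k, dilation):
    """Spatial extent covered by a k x k kernel at the given dilation rate."""
    return k + (k - 1) * (dilation - 1)

# Illustrative layer: 64 -> 64 channels with 3 x 3 kernels.
standard = conv_params(64, 64, 3)
separable = depthwise_separable_params(64, 64, 3)
reduction = standard / separable
extent_d2 = effective_kernel(3, 2)
```

Dilation enlarges the receptive field without adding weights, so stacking several dilation rates gives multi-scale context at the parameter cost of the separable layers alone.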

[63] EventRR: Event Referential Reasoning for Referring Video Object Segmentation cs.CVPDF

Huihui Xu, Jiashi Lin, Haoyu Chen, Junjun He, Lei Zhu

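TL;DR: This paper proposes the EventRR framework for event referential reasoning in Referring Video Object Segmentation (RVOS); decomposing the task into object summarization and referent reasoning significantly improves performance.

Details

Motivation: Existing RVOS methods treat referring expressions as unstructured sequences, ignoring the importance of their semantic structure for referent reasoning; video referring expressions also contain event attributes and temporal relations between events, which adds complexity.

Result: On four widely recognized benchmark datasets, EventRR outperforms current state-of-the-art RVOS methods both quantitatively and qualitatively.

Insight: The event attributes and temporal relations in video referring expressions are crucial for RVOS, and structured semantic representations (such as REG) and step-by-step reasoning (such as TCRR) effectively improve performance.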

Abstract: Referring Video Object Segmentation (RVOS) aims to segment out the object in a video referred to by an expression. Current RVOS methods view referring expressions as unstructured sequences, neglecting their crucial semantic structure essential for referent reasoning. Besides, in contrast to image-referring expressions whose semantics focus only on object attributes and object-object relations, video-referring expressions also encompass event attributes and event-event temporal relations. This complexity challenges traditional structured reasoning approaches designed for images. In this paper, we propose the Event Referential Reasoning (EventRR) framework. EventRR decouples RVOS into an object summarization part and a referent reasoning part. The summarization phase begins by summarizing each frame into a set of bottleneck tokens, which are then efficiently aggregated in the video-level summarization step to exchange the global cross-modal temporal context. For the reasoning part, EventRR extracts the semantic event structure of a video-referring expression into a highly expressive Referential Event Graph (REG), which is a single-rooted directed acyclic graph. Guided by topological traversal of REG, we propose Temporal Concept-Role Reasoning (TCRR) to accumulate the referring score of each temporal query from REG leaf nodes to root node. Each reasoning step can be interpreted as a question-answer pair derived from the concept-role relations in REG. Extensive experiments across four widely recognized benchmark datasets show that EventRR quantitatively and qualitatively outperforms state-of-the-art RVOS methods. Code is available at https://github.com/bio-mlhui/EventRR
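The leaf-to-root accumulation over a single-rooted DAG can be illustrated with a plain topological traversal. The sketch below is only a structural analogy: the graph, the node names, and the additive scoring rule are hypothetical, and it does not reproduce TCRR's actual score computation.

```python
from graphlib import TopologicalSorter

def accumulate_scores(children, local_score):
    """Accumulate scores from leaves to root over a DAG.

    children:    dict mapping each node to the nodes it depends on
    local_score: dict mapping each node to its own score
    TopologicalSorter yields dependencies first, so every child is
    finished before its parent is visited.
    """
    total = {}
    for node in TopologicalSorter(children).static_order():
        child_sum = sum(total[c] for c in children.get(node, ()))
        total[node] = local_score[node] + child_sum  # hypothetical additive rule
    return total

# Hypothetical graph: the root depends on two events sharing one leaf concept.
children = {"root": ["a", "b"], "a": ["leaf"], "b": ["leaf"], "leaf": []}
local_score = {"root": 1, "a": 2, "b": 3, "leaf": 4}
totals = accumulate_scores(children, local_score)
```

Because the graph is a DAG rather than a tree, a shared leaf contributes through every path to the root, which the simple sum above makes visible.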

[64] Similarity Matters: A Novel Depth-guided Network for Image Restoration and A New Dataset cs.CVPDF

Junyi He, Liuling Chen, Hongyang Zhou, Zhang xiaoxing, Xiaobin Zhu

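TL;DR: Proposes a novel Depth-Guided Network (DGN) for image restoration and introduces a new large-scale high-resolution dataset. The method uses a two-branch design (depth estimation and image restoration) combined with similarity matching to significantly improve restoration quality.

Details

Motivation: Existing image restoration methods ignore depth information, which causes similarity-matching problems and poor results, especially in shallow and deep depth-of-field scenes. To address this, the authors use depth information to guide restoration.

Result: Achieves state-of-the-art performance on several standard benchmarks and generalizes to unseen plant images, demonstrating effectiveness and robustness.

Insight: Depth information can significantly improve image restoration quality, especially where similarity matching is needed. The interactive two-branch design offers a new idea for multi-task learning.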

Abstract: Image restoration has seen substantial progress in recent years. However, existing methods often neglect depth information, which hurts similarity matching, results in attention distractions in shallow depth-of-field (DoF) scenarios, and excessive enhancement of background content in deep DoF settings. To overcome these limitations, we propose a novel Depth-Guided Network (DGN) for image restoration, together with a novel large-scale high-resolution dataset. Specifically, the network consists of two interactive branches: a depth estimation branch that provides structural guidance, and an image restoration branch that performs the core restoration task. In addition, the image restoration branch exploits intra-object similarity through progressive window-based self-attention and captures inter-object similarity via sparse non-local attention. Through joint training, depth features contribute to improved restoration quality, while the enhanced visual features from the restoration branch in turn help refine depth estimation. Notably, we also introduce a new dataset for training and evaluation, consisting of 9,205 high-resolution images from 403 plant species, with diverse depth and texture variations. Extensive experiments show that our method achieves state-of-the-art performance on several standard benchmarks and generalizes well to unseen plant images, demonstrating its effectiveness and robustness.

[65] Unsupervised Real-World Super-Resolution via Rectified Flow Degradation Modelling cs.CV | eess.IVPDF

Hongyang Zhou, Xiaobin Zhu, Liuling Chen, Junyi He, Jingyan Qin

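TL;DR: This paper proposes an unsupervised real-world super-resolution method based on rectified flow; improved degradation modeling and Fourier-prior guidance boost the performance of existing super-resolution methods in real scenes.

Details

Motivation: Real-world super-resolution faces unknown degradation distributions, and existing methods suffer from a domain gap between synthetic and real data. The paper aims to close this gap with an unsupervised method.

Result: Experiments on real-world datasets show that the method significantly improves the performance of existing super-resolution methods.

Insight: Combining continuous, invertible degradation modeling with Fourier phase information simulates real-world degradation more accurately and thereby improves super-resolution.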

Abstract: Unsupervised real-world super-resolution (SR) faces critical challenges due to the complex, unknown degradation distributions in practical scenarios. Existing methods struggle to generalize from synthetic low-resolution (LR) and high-resolution (HR) image pairs to real-world data due to a significant domain gap. In this paper, we propose an unsupervised real-world SR method based on rectified flow to effectively capture and model real-world degradation, synthesizing LR-HR training pairs with realistic degradation. Specifically, given unpaired LR and HR images, we propose a novel Rectified Flow Degradation Module (RFDM) that introduces degradation-transformed LR (DT-LR) images as intermediaries. By modeling the degradation trajectory in a continuous and invertible manner, RFDM better captures real-world degradation and enhances the realism of generated LR images. Additionally, we propose a Fourier Prior Guided Degradation Module (FGDM) that leverages structural information embedded in Fourier phase components to ensure more precise modeling of real-world degradation. Finally, the LR images are processed by both FGDM and RFDM, producing final synthetic LR images with real-world degradation. The synthetic LR images are paired with the given HR images to train the off-the-shelf SR networks. Extensive experiments on real-world datasets demonstrate that our method significantly enhances the performance of existing SR approaches in real-world scenarios.
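For readers unfamiliar with rectified flow, the sketch below shows its standard building blocks: a straight-line interpolation between two samples, the constant velocity target a network would regress, and Euler integration of a velocity field. Treating one endpoint as a clean image and the other as its degraded counterpart is an assumption for illustration; the internals of RFDM are not described in the abstract.

```python
import numpy as np

def rectified_flow_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 and its velocity target."""
    x_t = (1.0 - t) * x0 + t * x1  # linear interpolation at time t in [0, 1]
    velocity = x1 - x0             # constant target along the path
    return x_t, velocity

def euler_integrate(x0, velocity_fn, n_steps=10):
    """Transport x0 along a velocity field with fixed-step Euler updates."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

# Toy endpoints standing in for a clean patch and a degraded patch.
clean = np.zeros((4, 4))
degraded = np.full((4, 4), 2.0)
x_quarter, v_target = rectified_flow_pair(clean, degraded, t=0.25)
# With the exact velocity, integration lands on the degraded endpoint.
transported = euler_integrate(clean, lambda x, t: v_target)
```

In training, a network would be fit to predict `v_target` from `x_t` and `t`; integrating with the negated velocity field from the other endpoint runs the path backward, which is what makes the modeled trajectory invertible.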

[66] Bridging Semantic Logic Gaps: A Cognition-Inspired Multimodal Boundary-Preserving Network for Image Manipulation Localization cs.CVPDF

Songlin Li, Zhiqing Guo, Yuanman Li, Zeyu Li, Yunfeng Diao

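TL;DR: This paper proposes a cognition-inspired multimodal boundary-preserving network (CMB-Net) that combines visual and textual information to improve the accuracy of image manipulation localization.

Details

Motivation: Existing image manipulation localization (IML) models rely mainly on visual cues and ignore the semantic logical relationships between content features. The content semantics of real images usually conform to human cognitive laws, whereas manipulation breaks these relationships and leaves semantic clues.

Result: Experiments show that CMB-Net outperforms existing models on image manipulation localization.

Insight: Introducing semantic logic and multimodal information (especially text) can significantly improve manipulation detection.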

Abstract: Existing image manipulation localization (IML) models mainly rely on visual cues but ignore the semantic logical relationships between content features. In fact, the content semantics conveyed by real images often conform to human cognitive laws. However, image manipulation technology usually destroys the internal relationship between content features, thus leaving semantic clues for IML. In this paper, we propose a cognition-inspired multimodal boundary-preserving network (CMB-Net). Specifically, CMB-Net utilizes large language models (LLMs) to analyze manipulated regions within images and generate prompt-based textual information to compensate for the lack of semantic relationships in the visual information. Considering that the erroneous texts induced by hallucination from LLMs will damage the accuracy of IML, we propose an image-text central ambiguity module (ITCAM). It assigns weights to the text features by quantifying the ambiguity between text and image features, thereby ensuring the beneficial impact of textual information. We also propose an image-text interaction module (ITIM) that aligns visual and text features using a correlation matrix for fine-grained interaction. Finally, inspired by invertible neural networks, we propose a restoration edge decoder (RED) that mutually generates input and output features to preserve boundary information in manipulated regions without loss. Extensive experiments show that CMB-Net outperforms most existing IML models.

[67] Generic Calibration: Pose Ambiguity/Linear Solution and Parametric-hybrid Pipeline cs.CVPDF

Yuqi Han, Qi Cai, Yuanxin Wu

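TL;DR: This paper proposes a linear solver and a nonlinear optimization to resolve the pose ambiguity in generic calibration methods, and introduces a globally optimized hybrid calibration method that combines the strengths of generic and parametric models to improve calibration accuracy.

Details

Motivation: Traditional offline camera calibration relies on parametric or generic camera models. Choosing a parametric model depends on user experience, and a poor choice hurts accuracy; generic calibration involves complex procedures and cannot provide traditional intrinsic parameters. The paper aims to resolve the pose ambiguity in generic calibration and proposes a hybrid method to improve accuracy.

Result: Simulation and real-world experiments show that the hybrid calibration method performs well across lens types and noise contamination, providing a reliable and accurate camera calibration solution.

Insight: By combining the strengths of generic and parametric models, hybrid calibration overcomes the limitations of either model alone and offers a better solution for camera calibration in complex scenarios.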

Abstract: Offline camera calibration techniques typically employ parametric or generic camera models. Selecting parametric models relies heavily on user experience, and an inappropriate camera model can significantly affect calibration accuracy. Meanwhile, generic calibration methods involve complex procedures and cannot provide traditional intrinsic parameters. This paper reveals a pose ambiguity in the pose solutions of generic calibration methods that irreversibly impacts subsequent pose estimation. A linear solver and a nonlinear optimization are proposed to address this ambiguity issue. Then a global optimization hybrid calibration method is introduced to integrate generic and parametric models together, which improves extrinsic parameter accuracy of generic calibration and mitigates overfitting and numerical instability in parametric calibration. Simulation and real-world experimental results demonstrate that the generic-parametric hybrid calibration method consistently excels across various lens types and noise contamination, hopefully serving as a reliable and accurate solution for camera calibration in complex scenarios.

[68] Landmark Guided Visual Feature Extractor for Visual Speech Recognition with Limited Resource cs.CVPDF

Lei Yang, Junshan Jin, Mingyuan Zhang, Yi He, Bofan Chen

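TL;DR: The paper proposes a landmark-guided visual feature extractor for visual speech recognition under limited resources, improving model performance by exploiting the spatio-temporal features of facial landmarks.

Details

Motivation: Existing deep learning methods for visual speech recognition are sensitive to visual disturbances (such as lighting and skin texture) and require large amounts of data and compute, so a solution that is efficient under limited resources is needed.

Result: Experiments show that the method performs well with limited data and improves recognition accuracy on unseen speakers.

Insight: Landmark information effectively aids visual feature extraction and reduces interference from user-specific features, making it suitable for resource-constrained settings.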

Abstract: Visual speech recognition is a technique to identify spoken content in silent speech videos, which has attracted significant attention in recent years. Advancements in data-driven deep learning methods have significantly improved both the speed and accuracy of recognition. However, these deep learning methods can be affected by visual disturbances, such as lighting conditions, skin texture and other user-specific features. Data-driven approaches could reduce the performance degradation caused by these visual disturbances using models pretrained on large-scale datasets. But these methods often require large amounts of training data and computational resources, making them costly. To reduce the influence of user-specific features and enhance performance with limited data, this paper proposes a landmark guided visual feature extractor. Facial landmarks are used as auxiliary information to aid in training the visual feature extractor. A spatio-temporal multi-graph convolutional network is designed to fully exploit the spatial locations and spatio-temporal features of facial landmarks. Additionally, a multi-level lip dynamic fusion framework is introduced to combine the spatio-temporal features of the landmarks with the visual features extracted from the raw video frames. Experimental results show that this approach performs well with limited data and also improves the model’s accuracy on unseen speakers.

[69] Consistent and Controllable Image Animation with Motion Linear Diffusion Transformers cs.CVPDF

Xin Ma, Yaohui Wang, Genyun Jia, Xinyuan Chen, Tien-Tsin Wong

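TL;DR: MiraMo is an efficient, consistent, and controllable image animation framework that uses linear attention, motion residual learning, and DCT-based noise refinement to significantly improve the appearance consistency and motion smoothness of generated animations.

Details

Motivation: Existing image animation methods fall short in preserving the appearance of the input image and in mitigating abrupt motion transitions, and they are computationally expensive. MiraMo aims to address these issues.

Result: Experiments show that MiraMo outperforms existing methods at generating consistent, smooth, and controllable animations, with faster inference.

Insight: 1. Linear attention significantly reduces computational cost while preserving generation quality; 2. motion residual learning is an effective way to improve temporal consistency; 3. DCT-based noise refinement reduces abrupt motion.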

Abstract: Image animation has seen significant progress, driven by the powerful generative capabilities of diffusion models. However, maintaining appearance consistency with static input images and mitigating abrupt motion transitions in generated animations remain persistent challenges. While text-to-video (T2V) generation has demonstrated impressive performance with diffusion transformer models, the image animation field still largely relies on U-Net-based diffusion models, which lag behind the latest T2V approaches. Moreover, the quadratic complexity of vanilla self-attention mechanisms in Transformers imposes heavy computational demands, making image animation particularly resource-intensive. To address these issues, we propose MiraMo, a framework designed to enhance efficiency, appearance consistency, and motion smoothness in image animation. Specifically, MiraMo introduces three key elements: (1) A foundational text-to-video architecture replacing vanilla self-attention with efficient linear attention to reduce computational overhead while preserving generation quality; (2) A novel motion residual learning paradigm that focuses on modeling motion dynamics rather than directly predicting frames, improving temporal consistency; and (3) A DCT-based noise refinement strategy during inference to suppress sudden motion artifacts, complemented by a dynamics control module to balance motion smoothness and expressiveness. Extensive experiments against state-of-the-art methods validate the superiority of MiraMo in generating consistent, smooth, and controllable animations with accelerated inference speed. Additionally, we demonstrate the versatility of MiraMo through applications in motion transfer and video editing tasks.
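
The efficiency claim for linear attention can be illustrated with a minimal sketch. It assumes the common kernel feature map phi(x) = elu(x) + 1; the abstract does not say which linear-attention variant MiraMo uses, so the function names and the feature map here are illustrative only. The key point is that the key-value summary `S` is computed once and reused by every query, so cost grows linearly with sequence length rather than quadratically.

```python
import math

def feature_map(x):
    # phi(x) = elu(x) + 1, a positive feature map (an assumption, not MiraMo's confirmed choice).
    return x + 1.0 if x > 0 else math.exp(x)

def linear_attention(Q, K, V):
    """Q, K: n vectors of size d; V: n vectors of size d_v. Cost is O(n * d * d_v)."""
    d, d_v = len(K[0]), len(V[0])
    phi_q = [[feature_map(x) for x in q] for q in Q]
    phi_k = [[feature_map(x) for x in k] for k in K]
    # Shared summaries: S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j).
    S = [[0.0] * d_v for _ in range(d)]
    z = [0.0] * d
    for k, v in zip(phi_k, V):
        for a in range(d):
            z[a] += k[a]
            for b in range(d_v):
                S[a][b] += k[a] * v[b]
    out = []
    for q in phi_q:
        denom = sum(q[a] * z[a] for a in range(d))
        out.append([sum(q[a] * S[a][b] for a in range(d)) / denom for b in range(d_v)])
    return out

def quadratic_reference(Q, K, V):
    """Same kernelized attention, computed with explicit n x n weights for comparison."""
    out = []
    for q in Q:
        pq = [feature_map(x) for x in q]
        w = [sum(a * feature_map(b) for a, b in zip(pq, k)) for k in K]
        total = sum(w)
        out.append([sum(wj * v[b] for wj, v in zip(w, V)) / total for b in range(len(V[0]))])
    return out
```

Both functions return the same values; only the order of the multiplications differs, which is where the saving comes from.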

[70] SUIT: Spatial-Spectral Union-Intersection Interaction Network for Hyperspectral Object Tracking cs.CVPDF

Fengchao Xiong, Zhenxing Wu, Sen Jia, Yuntao Qian

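TL;DR: SUIT is a spatial-spectral union-intersection interaction network for hyperspectral object tracking that combines Transformers with the inclusion-exclusion principle from set theory to jointly model spatial and spectral interactions, improving tracking performance.

Details

Motivation: Hyperspectral videos (HSVs) offer advantages for tracking against cluttered backgrounds and for small objects, but existing methods focus mainly on spatial interactions and overlook spectral interactions, leading to suboptimal performance. The paper addresses this from both the architectural and training perspectives.

Result: Experiments show that SUIT achieves state-of-the-art tracking performance in challenging scenarios.

Insight: Effective modeling of spectral information is essential for hyperspectral object tracking, and jointly modeling spatial and spectral interactions can significantly improve performance.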

Abstract: Hyperspectral videos (HSVs), with their inherent spatial-spectral-temporal structure, offer distinct advantages in challenging tracking scenarios such as cluttered backgrounds and small objects. However, existing methods primarily focus on spatial interactions between the template and search regions, often overlooking spectral interactions, leading to suboptimal performance. To address this issue, this paper investigates spectral interactions from both the architectural and training perspectives. At the architectural level, we first establish band-wise long-range spatial relationships between the template and search regions using Transformers. We then model spectral interactions using the inclusion-exclusion principle from set theory, treating them as the union of spatial interactions across all bands. This enables the effective integration of both shared and band-specific spatial cues. At the training level, we introduce a spectral loss to enforce material distribution alignment between the template and predicted regions, enhancing robustness to shape deformation and appearance variations. Extensive experiments demonstrate that our tracker achieves state-of-the-art tracking performance. The source code, trained models and results will be publicly available via https://github.com/bearshng/suit to support reproducibility.

[71] Understanding Dynamic Scenes in Ego Centric 4D Point Clouds cs.CVPDF

Junsheng Huang, Shengyu Hao, Bocheng Hu, Gaoang Wang

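TL;DR: The paper introduces the EgoDynamic4D benchmark for evaluating spatio-temporal reasoning in dynamic 4D scenes and designs an end-to-end spatio-temporal reasoning framework that outperforms baselines on 12 dynamic QA tasks.

Details

Motivation: Existing egocentric datasets lack unified 4D annotations and task-driven evaluation protocols, which makes it hard to support fine-grained spatio-temporal reasoning tasks. EgoDynamic4D is proposed to fill this gap.

Result: Experiments on EgoDynamic4D show that the method outperforms baselines across multiple tasks, validating the effectiveness of multimodal temporal modeling.

Insight: Understanding dynamic 4D scenes requires unified annotations and task design, and multimodal temporal modeling offers clear advantages for spatio-temporal reasoning tasks.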

Abstract: Understanding dynamic 4D scenes from an egocentric perspective, that is, modeling changes in 3D spatial structure over time, is crucial for human-machine interaction, autonomous navigation, and embodied intelligence. While existing egocentric datasets contain dynamic scenes, they lack unified 4D annotations and task-driven evaluation protocols for fine-grained spatio-temporal reasoning, especially on the motion of objects and humans, together with their interactions. To address this gap, we introduce EgoDynamic4D, a novel QA benchmark on highly dynamic scenes, comprising RGB-D video, camera poses, globally unique instance masks, and 4D bounding boxes. We construct 927K QA pairs accompanied by explicit Chain-of-Thought (CoT), enabling verifiable, step-by-step spatio-temporal reasoning. We design 12 dynamic QA tasks covering agent motion, human-object interaction, trajectory prediction, relation understanding, and temporal-causal reasoning, with fine-grained, multidimensional metrics. To tackle these tasks, we propose an end-to-end spatio-temporal reasoning framework that unifies dynamic and static scene information, using instance-aware feature encoding, time and camera encoding, and spatially adaptive down-sampling to compress large 4D scenes into token sequences manageable by LLMs. Experiments on EgoDynamic4D show that our method consistently outperforms baselines, validating the effectiveness of multimodal temporal modeling for egocentric dynamic scene understanding.

[72] Small-Large Collaboration: Training-efficient Concept Personalization for Large VLM using a Meta Personalized Small VLM cs.CVPDF

Sihan Yang, Huitong Ji, Shaolin Lu, Jiayi Chen, Binxiao Xu

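TL;DR: The paper proposes Small-Large Collaboration (SLC), a training-efficient framework in which a small VLM generates personalized information and a large VLM integrates it to deliver accurate responses, addressing both the difficulty of personalizing large VLMs directly and the limited reasoning ability of small VLMs.

Details

Motivation: Large VLMs have strong multimodal understanding, but high training costs and limited API access restrict their personalization; small VLMs are easy to personalize but have limited reasoning ability. SLC aims to combine the strengths of both for efficient personalized multimodal tasks.

Result: Experiments show that SLC is effective across multiple benchmarks and different large VLMs, validating its efficiency and practicality.

Insight: By having small and large VLMs collaborate, SLC works around resource constraints and provides a scalable solution for personalized multimodal tasks.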

Abstract: Personalizing Vision-Language Models (VLMs) to transform them into daily assistants has emerged as a trending research direction. However, leading companies like OpenAI continue to increase model size and develop complex designs such as the chain of thought (CoT). While large VLMs are proficient in complex multi-modal understanding, their high training costs and limited access via paid APIs restrict direct personalization. Conversely, small VLMs are easily personalized and freely available, but they lack sufficient reasoning capabilities. Inspired by this, we propose a novel collaborative framework named Small-Large Collaboration (SLC) for large VLM personalization, where the small VLM is responsible for generating personalized information, while the large model integrates this personalized information to deliver accurate responses. To effectively incorporate personalized information, we develop a test-time reflection strategy, preventing the potential hallucination of the small VLM. Since SLC only needs to train a meta personalized small VLM for the large VLMs, the overall process is training-efficient. To the best of our knowledge, this is the first training-efficient framework that supports both open-source and closed-source large VLMs, enabling broader real-world personalized applications. We conduct thorough experiments across various benchmarks and large VLMs to demonstrate the effectiveness of the proposed SLC framework. The code will be released at https://github.com/Hhankyangg/SLC.

[73] Representation Understanding via Activation Maximization cs.CV | cs.AIPDF

Hongbo Zhu, Angelo Cangelosi

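TL;DR: The paper proposes a unified feature visualization framework for both CNNs and Vision Transformers (ViTs) that uses activation maximization to study internal feature representations. The method extends to intermediate layers, revealing hierarchical feature structure, and explores its potential for generating adversarial examples.

Details

Motivation: Understanding the internal feature representations of deep neural networks (DNNs) is a key step toward model interpretability. Inspired by neuroscience, researchers use visualization techniques such as activation maximization to reveal the response patterns of artificial neurons.

Result: Experiments show that the method is effective on both CNNs and ViTs, revealing hierarchical feature structure and generating meaningful adversarial examples.

Insight: 1. Visualizing intermediate layers gives deeper insight into the hierarchical representation of network features; 2. activation maximization can be used not only to interpret models but also to probe vulnerabilities in security settings.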

Abstract: Understanding internal feature representations of deep neural networks (DNNs) is a fundamental step toward model interpretability. Inspired by neuroscience methods that probe biological neurons using visual stimuli, recent deep learning studies have employed Activation Maximization (AM) to synthesize inputs that elicit strong responses from artificial neurons. In this work, we propose a unified feature visualization framework applicable to both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). Unlike prior efforts that predominantly focus on the last output-layer neurons in CNNs, we extend feature visualization to intermediate layers as well, offering deeper insights into the hierarchical structure of learned feature representations. Furthermore, we investigate how activation maximization can be leveraged to generate adversarial examples, revealing potential vulnerabilities and decision boundaries of DNNs. Our experiments demonstrate the effectiveness of our approach in both traditional CNNs and modern ViT, highlighting its generalizability and interpretive value.
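
As a generic illustration of Activation Maximization, the sketch below runs gradient ascent on an input so that a chosen unit responds more strongly, with an L2 penalty to keep the input bounded. The abstract does not give the paper's objective or regularizers; the toy linear unit, the finite-difference gradient, and all names (`toy_neuron`, `WEIGHTS`, `l2`) are assumptions made for illustration.

```python
def numerical_gradient(f, x, h=1e-5):
    """Central-difference gradient of a scalar function f at point x."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def activation_maximization(neuron, x0, steps=200, lr=0.1, l2=0.5):
    """Gradient ascent on the input to maximize neuron(x) - l2 * ||x||^2."""
    def objective(x):
        return neuron(x) - l2 * sum(v * v for v in x)
    x = list(x0)
    for _ in range(steps):
        g = numerical_gradient(objective, x)
        x = [v + lr * gv for v, gv in zip(x, g)]
    return x

# Hypothetical toy "neuron": a fixed linear unit standing in for a real network layer.
WEIGHTS = [0.8, -0.3, 0.5]

def toy_neuron(x):
    return sum(w * v for w, v in zip(WEIGHTS, x))

x_star = activation_maximization(toy_neuron, [0.0, 0.0, 0.0])
```

For this toy unit the regularized optimum is `WEIGHTS / (2 * l2)`, so the synthesized input aligns with the weight vector. With a real CNN or ViT, `neuron` would be the activation of a chosen channel or token, and the gradient would come from automatic differentiation.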

[74] SynMatch: Rethinking Consistency in Medical Image Segmentation with Sparse Annotations cs.CVPDF

Zhiqiang Shen, Peng Cao, Xiaoli Liu, Jinzhu Yang, Osmar R. Zaiane

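TL;DR: SynMatch is a new framework that addresses label scarcity in medical image segmentation by synthesizing images to match pseudo labels, improving the consistency between pseudo labels and images.

Details

Motivation: Label scarcity limits deep learning performance in medical image segmentation. Existing methods exploit unlabeled data through strong-weak pseudo supervision, but inconsistencies between pseudo labels and images hinder further gains.

Result: SynMatch significantly outperforms existing methods across multiple tasks and annotation regimes (SSL, WSL, BSL), especially in the BSL setting with very few annotations.

Insight: Sidestepping pseudo-label improvement and directly synthesizing matching images is more efficient, and it is broadly applicable to medical segmentation tasks with scarce annotations.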

Abstract: Label scarcity remains a major challenge in deep learning-based medical image segmentation. Recent studies use strong-weak pseudo supervision to leverage unlabeled data. However, performance is often hindered by inconsistencies between pseudo labels and their corresponding unlabeled images. In this work, we propose SynMatch, a novel framework that sidesteps the need for improving pseudo labels by synthesizing images to match them instead. Specifically, SynMatch synthesizes images using texture and shape features extracted from the same segmentation model that generates the corresponding pseudo labels for unlabeled images. This design enables the generation of highly consistent synthesized-image-pseudo-label pairs without requiring any training parameters for image synthesis. We extensively evaluate SynMatch across diverse medical image segmentation tasks under semi-supervised learning (SSL), weakly-supervised learning (WSL), and barely-supervised learning (BSL) settings with increasingly limited annotations. The results demonstrate that SynMatch achieves superior performance, especially in the most challenging BSL setting. For example, it outperforms the recent strong-weak pseudo supervision-based method by 29.71% and 10.05% on the polyp segmentation task with 5% and 10% scribble annotations, respectively. The code will be released at https://github.com/Senyh/SynMatch.

[75] BEVANet: Bilateral Efficient Visual Attention Network for Real-Time Semantic Segmentation cs.CVPDF

Ping-Mao Huang, I-Tien Chao, Ping-Chia Huang, Jia-Wei Liao, Yung-Yu Chuang

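TL;DR: BEVANet is an efficient real-time semantic segmentation network that uses large kernel attention and a bilateral architecture to balance large receptive fields with detailed contours.

Details

Motivation: To address the challenge in real-time semantic segmentation of capturing large receptive fields and fine contours at the same time, while avoiding the high computational cost of ViTs.

Result: BEVANet reaches 81.0% mIoU on Cityscapes with real-time performance of 33 FPS.

Insight: Dynamically adapting the kernel size and using a boundary-guided fusion module are key to improving semantic segmentation performance.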

Abstract: Real-time semantic segmentation presents the dual challenge of designing efficient architectures that capture large receptive fields for semantic understanding while also refining detailed contours. Vision transformers model long-range dependencies effectively but incur high computational cost. To address these challenges, we introduce the Large Kernel Attention (LKA) mechanism. Our proposed Bilateral Efficient Visual Attention Network (BEVANet) expands the receptive field to capture contextual information and extracts visual and structural features using Sparse Decomposed Large Separable Kernel Attentions (SDLSKA). The Comprehensive Kernel Selection (CKS) mechanism dynamically adapts the receptive field to further enhance performance. Furthermore, the Deep Large Kernel Pyramid Pooling Module (DLKPPM) enriches contextual features by synergistically combining dilated convolutions and large kernel attention. The bilateral architecture facilitates frequent branch communication, and the Boundary Guided Adaptive Fusion (BGAF) module enhances boundary delineation by integrating spatial and semantic features under boundary guidance. BEVANet achieves real-time segmentation at 33 FPS, yielding 79.3% mIoU without pretraining and 81.0% mIoU on Cityscapes after ImageNet pretraining, demonstrating state-of-the-art performance. The code and model are available at https://github.com/maomao0819/BEVANet.

[76] DragonFruitQualityNet: A Lightweight Convolutional Neural Network for Real-Time Dragon Fruit Quality Inspection on Mobile Devices cs.CV | cs.AIPDF

Md Zahurul Haquea, Yeahyea Sarker, Muhammed Farhan Sadique Mahi, Syed Jubayer Jaman, Md Robiul Islam

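TL;DR: The paper proposes DragonFruitQualityNet, a lightweight convolutional neural network for real-time dragon fruit quality inspection that reaches 93.98% accuracy and is embedded in a mobile application for practical use by farmers.

Details

Motivation: As dragon fruit cultivation expands, efficient pre- and post-harvest quality inspection has become essential for improving agricultural productivity and reducing losses, but existing methods lack real-time performance and practicality on mobile devices.

Result: The model achieves 93.98% accuracy, outperforming existing methods, and was successfully embedded in a mobile application for real-time inspection.

Insight: The study uses AI to give smallholder farmers an accessible quality inspection tool, advancing digital and sustainable agriculture and bridging the gap between research and real-world application.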

Abstract: Dragon fruit, renowned for its nutritional benefits and economic value, has experienced rising global demand due to its affordability and local availability. As dragon fruit cultivation expands, efficient pre- and post-harvest quality inspection has become essential for improving agricultural productivity and minimizing post-harvest losses. This study presents DragonFruitQualityNet, a lightweight Convolutional Neural Network (CNN) optimized for real-time quality assessment of dragon fruits on mobile devices. We curated a diverse dataset of 13,789 images, integrating self-collected samples with public datasets (dataset from Mendeley Data), and classified them into four categories: fresh, immature, mature, and defective fruits to ensure robust model training. The proposed model achieves an impressive 93.98% accuracy, outperforming existing methods in fruit quality classification. To facilitate practical adoption, we embedded the model into an intuitive mobile application, enabling farmers and agricultural stakeholders to conduct on-device, real-time quality inspections. This research provides an accurate, efficient, and scalable AI-driven solution for dragon fruit quality control, supporting digital agriculture and empowering smallholder farmers with accessible technology. By bridging the gap between research and real-world application, our work advances post-harvest management and promotes sustainable farming practices.

[77] MCITlib: Multimodal Continual Instruction Tuning Library and Benchmark cs.CV | cs.AIPDF

Haiyang Guo, Fei Zhu, Hongbo Zhao, Fanhu Zeng, Wenzhuo Liu

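TL;DR: MCITlib is a multimodal continual instruction tuning library and benchmark designed to support research on continual learning for Multimodal Large Language Models (MLLMs).

Details

Motivation: Traditional continual learning methods focus mainly on unimodal tasks, while multimodal continual learning (for example, combining vision and language) brings new challenges such as cross-modal interaction and coordination.

Result: MCITlib provides an open research tool and will be continuously updated to reflect advances in multimodal continual learning.

Insight: Multimodal continual learning must address both catastrophic forgetting and cross-modal coordination, and MCITlib provides a standardized platform for this.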

Abstract: Continual learning aims to equip AI systems with the ability to continuously acquire and adapt to new knowledge without forgetting previously learned information, similar to human learning. While traditional continual learning methods focusing on unimodal tasks have achieved notable success, the emergence of Multimodal Large Language Models has brought increasing attention to Multimodal Continual Learning tasks involving multiple modalities, such as vision and language. In this setting, models are expected to not only mitigate catastrophic forgetting but also handle the challenges posed by cross-modal interactions and coordination. To facilitate research in this direction, we introduce MCITlib, a comprehensive and constantly evolving code library for continual instruction tuning of Multimodal Large Language Models. In MCITlib, we have currently implemented 8 representative algorithms for Multimodal Continual Instruction Tuning and systematically evaluated them on 2 carefully selected benchmarks. MCITlib will be continuously updated to reflect advances in the Multimodal Continual Learning field. The codebase is released at https://github.com/Ghy0501/MCITlib.

[78] MobileViCLIP: An Efficient Video-Text Model for Mobile Devices cs.CVPDF

Min Yang, Zihan Jia, Zhilin Dai, Sheng Guo, Limin Wang

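TL;DR: MobileViCLIP is an efficient video-text model designed for mobile devices; through temporal structural reparameterization and training on a large-scale video-text dataset, it achieves fast inference and strong zero-shot classification and retrieval.

Details

Motivation: Most existing video pre-trained models are built on high-latency ViT architectures, and efficient architectures for mobile devices are lacking. The paper aims to fill this gap with a lightweight video-text model for mobile devices.

Result: On mobile devices, MobileViCLIP-Small runs inference 55.4x faster than InternVideo2-L14 and 6.7x faster than InternVideo2-S14; its zero-shot retrieval performance on MSR-VTT is 6.9% better than InternVideo2-S14.

Insight: Structural reparameterization and lightweight architecture design enable efficient joint video-text modeling on mobile devices, opening new possibilities for mobile applications.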

Abstract: Efficient lightweight neural networks are receiving increasing attention due to their faster inference speed and easier deployment on mobile devices. However, existing video pre-trained models still focus on the common ViT architecture with high latency, and few works attempt to build efficient architectures for mobile devices. This paper bridges this gap by introducing temporal structural reparameterization into an efficient image-text model and training it on a large-scale high-quality video-text dataset, resulting in an efficient video-text model that can run on mobile devices with strong zero-shot classification and retrieval capabilities, termed MobileViCLIP. In particular, in terms of inference speed on mobile devices, our MobileViCLIP-Small is 55.4x faster than InternVideo2-L14 and 6.7x faster than InternVideo2-S14. In terms of zero-shot retrieval performance, our MobileViCLIP-Small obtains similar performance to InternVideo2-L14 and performs 6.9% better than InternVideo2-S14 on MSR-VTT. The code is available at https://github.com/MCG-NJU/MobileViCLIP.
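
The general idea behind structural reparameterization can be sketched as follows: a multi-branch block used at training time is folded into a single equivalent linear operator for inference. The abstract does not describe MobileViCLIP's temporal variant, so this example uses two hypothetical linear branches plus an identity shortcut purely to show why the fold is exact.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def multi_branch(W1, W2, x):
    """Training-time block: two parallel linear branches plus an identity shortcut."""
    a, b = matvec(W1, x), matvec(W2, x)
    return [ai + bi + xi for ai, bi, xi in zip(a, b, x)]

def reparameterize(W1, W2):
    """Fold both branches and the shortcut into one matrix: W = W1 + W2 + I."""
    n = len(W1)
    return [[W1[i][j] + W2[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
            for i in range(n)]

# Illustrative values only.
W1 = [[1.0, 2.0], [3.0, 4.0]]
W2 = [[0.5, -1.0], [0.0, 2.0]]
x = [1.0, -2.0]

merged = reparameterize(W1, W2)
y_train = multi_branch(W1, W2, x)   # three operations per input
y_infer = matvec(merged, x)         # one operation, same output
```

Because every branch is linear, the merged matrix reproduces the multi-branch output exactly, which is what lets the inference-time model drop the extra branches without retraining.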

[79] DocR1: Evidence Page-Guided GRPO for Multi-Page Document Understanding cs.CVPDF

Junyu Xiong, Yonghui Wang, Weichao Zhao, Chenyu Liu, Bing Yin

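TL;DR: DocR1 is a multi-page document understanding method built on the reinforcement learning framework EviGRPO; its evidence page-guided reward mechanism yields a coarse-to-fine reasoning strategy and significantly improves multi-page document understanding.

Details

Motivation: Multi-page document understanding is challenging for multimodal large language models (MLLMs), as it requires fine-grained visual comprehension and multi-hop reasoning across pages. The application of reinforcement learning (RL) to this setting remains underexplored, a gap DocR1 attempts to fill.

Result: DocR1 achieves state-of-the-art performance on multi-page tasks while maintaining strong results on single-page benchmarks.

Insight: An evidence page-guided reward mechanism and a curriculum learning strategy can effectively improve multi-page document understanding, especially with limited data, offering a new approach for applying MLLMs to multi-page scenarios.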

Abstract: Understanding multi-page documents poses a significant challenge for multimodal large language models (MLLMs), as it requires fine-grained visual comprehension and multi-hop reasoning across pages. While prior work has explored reinforcement learning (RL) for enhancing advanced reasoning in MLLMs, its application to multi-page document understanding remains underexplored. In this paper, we introduce DocR1, an MLLM trained with a novel RL framework, Evidence Page-Guided GRPO (EviGRPO). EviGRPO incorporates an evidence-aware reward mechanism that promotes a coarse-to-fine reasoning strategy, guiding the model to first retrieve relevant pages before generating answers. This training paradigm enables us to build high-quality models with limited supervision. To support this, we design a two-stage annotation pipeline and a curriculum learning strategy, based on which we construct two datasets: EviBench, a high-quality training set with 4.8k examples, and ArxivFullQA, an evaluation benchmark with 8.6k QA pairs based on scientific papers. Extensive experiments across a wide range of benchmarks demonstrate that DocR1 achieves state-of-the-art performance on multi-page tasks, while consistently maintaining strong results on single-page benchmarks.
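
To make the GRPO-style training signal concrete, here is a minimal sketch of group-relative advantages combined with a hypothetical evidence-aware reward. The advantage normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO formulation; the reward shape, the F1 term over pages, and the weight `w_evidence` are assumptions, since the abstract does not define EviGRPO's reward.

```python
from statistics import mean, pstdev

def evidence_f1(predicted_pages, evidence_pages):
    """F1 overlap between the pages a response cites and the annotated evidence pages."""
    pred, gold = set(predicted_pages), set(evidence_pages)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def reward(answer_correct, predicted_pages, evidence_pages, w_evidence=0.5):
    """Hypothetical reward: answer correctness plus a weighted evidence-page term."""
    return float(answer_correct) + w_evidence * evidence_f1(predicted_pages, evidence_pages)

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (population std; implementations vary)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to the same question; evidence pages are 2 and 3.
gold = [2, 3]
rewards = [
    reward(True, [2, 3], gold),   # correct answer, right pages
    reward(True, [1], gold),      # correct answer, wrong page
    reward(False, [2], gold),     # wrong answer, partially right pages
    reward(False, [5], gold),     # wrong answer, wrong page
]
advantages = group_relative_advantages(rewards)
```

Responses that both answer correctly and cite the right pages receive the largest advantage within the group, which is one way a reward could encourage retrieving pages before answering.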

[80] RORPCap: Retrieval-based Objects and Relations Prompt for Image Captioning cs.CVPDF

Jinjing Gu, Tianbao Qin, Yuanyuan Pu, Zhengpeng Zhao

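TL;DR: RORPCap is an image captioning method built on retrieval-based object and relation prompts; it extracts object and relation words and combines them with predefined prompt templates to generate high-quality captions quickly.

Details

Motivation: Modern image captioning methods typically rely on object detectors or combine them with GCNs, but these suffer from redundant detection information, difficulty in GCN construction, and high training costs. RORPCap addresses these issues by exploiting the rich semantic information provided by image-text retrieval.

Result: On MS-COCO, RORPCap needs only 2.6 hours of training and reaches a 120.5% CIDEr score and a 22.0% SPICE score, performing comparably to detector-based and GCN-based methods with the shortest training time.

Insight: RORPCap shows the potential of retrieval information and prompt templates, offering an efficient and competitive alternative for image captioning.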

Abstract: Image captioning aims to generate natural language descriptions for input images in an open-form manner. To accurately generate descriptions related to the image, a critical step in image captioning is to identify objects and understand their relations within the image. Modern approaches typically capitalize on object detectors or combine detectors with a Graph Convolutional Network (GCN). However, these models suffer from redundant detection information, difficulty in GCN construction, and high training costs. To address these issues, a Retrieval-based Objects and Relations Prompt for Image Captioning (RORPCap) is proposed, inspired by the fact that image-text retrieval can provide rich semantic information for input images. RORPCap employs an Objects and relations Extraction Model to extract object and relation words from the image. These words are then incorporated into predefined prompt templates and encoded as prompt embeddings. Next, a Mamba-based mapping network is designed to quickly map image embeddings extracted by CLIP to visual-text embeddings. Finally, the resulting prompt embeddings and visual-text embeddings are concatenated to form textual-enriched feature embeddings, which are fed into a GPT-2 model for caption generation. Extensive experiments conducted on the widely used MS-COCO dataset show that RORPCap requires only 2.6 hours under cross-entropy loss training, achieving 120.5% CIDEr score and 22.0% SPICE score on the “Karpathy” test split. RORPCap achieves comparable performance metrics to detector-based and GCN-based models with the shortest training time and demonstrates its potential as an alternative for image captioning.

Tuyen Tran, Thao Minh Le, Quang-Hung Le, Truyen Tran

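TL;DR: The Planner-Refiner framework aligns video and language semantically through dynamic spatio-temporal refinement: the Planner module decomposes complex language prompts, and the Refiner module progressively refines visual representations, improving task performance in settings with complex language.

Details

Motivation: Semantic alignment between video and language must cope with complex language, dynamically interacting entities, and semantic gaps, which calls for a method that can dynamically refine visual representations.

Result: The method outperforms existing methods on Referring Video Object Segmentation and Temporal Grounding, and on the newly introduced MeViS-X benchmark.

Insight: Dynamically refining visual representations under language guidance is key to complex video-language alignment tasks, and Planner-Refiner's modular design offers a new approach for related tasks.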

Abstract: Vision-language alignment in video must address the complexity of language, evolving interacting entities, their action chains, and semantic gaps between language and vision. This work introduces Planner-Refiner, a framework to overcome these challenges. Planner-Refiner bridges the semantic gap by iteratively refining visual elements’ space-time representation, guided by language until semantic gaps are minimal. A Planner module schedules language guidance by decomposing complex linguistic prompts into short sentence chains. The Refiner processes each short sentence, a noun-phrase and verb-phrase pair, to direct visual tokens’ self-attention across space then time, achieving efficient single-step refinement. A recurrent system chains these steps, maintaining refined visual token representations. The final representation feeds into task-specific heads for alignment generation. We demonstrate Planner-Refiner’s effectiveness on two video-language alignment tasks: Referring Video Object Segmentation and Temporal Grounding with varying language complexity. We further introduce a new MeViS-X benchmark to assess models’ capability with long queries. Superior performance versus state-of-the-art methods on these benchmarks shows the approach’s potential, especially for complex prompts.

[82] CoAR: Concept Injection into Autoregressive Models for Personalized Text-to-Image Generation cs.CVPDF

Fangtai Wu, Mushui Liu, Weijie He, Wanggui He, Hao Jiang

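TL;DR: CoAR is a new framework that injects subject concepts into a unified AR model while keeping all pre-trained parameters frozen, enabling efficient personalized text-to-image generation. Using a Layerwise Multimodal Context Learning strategy, it learns subject-specific representations with very few parameters.

Details

Motivation: Existing customized generation methods rely on full fine-tuning or adapters, which are costly and prone to overfitting or catastrophic forgetting. CoAR aims to address these problems while improving generation quality and efficiency.

Result: Experiments show that CoAR outperforms existing methods on subject-driven and style personalization, with significantly better computational and memory efficiency. It performs comparably to Proxy-Tuning while tuning very few parameters.

Insight: Freezing pre-trained parameters and introducing a small number of learnable parameters is an efficient and stable approach to personalized generation, and the multimodal context learning strategy effectively mitigates overfitting and language drift.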

Abstract: The unified autoregressive (AR) model excels at multimodal understanding and generation, but its potential for customized image generation remains underexplored. Existing customized generation methods rely on full fine-tuning or adapters, making them costly and prone to overfitting or catastrophic forgetting. In this paper, we propose CoAR, a novel framework for injecting subject concepts into the unified AR models while keeping all pre-trained parameters completely frozen. CoAR learns effective, specific subject representations with only a minimal number of parameters using a Layerwise Multimodal Context Learning strategy. To address overfitting and language drift, we further introduce regularization that preserves the pre-trained distribution and anchors context tokens to improve subject fidelity and re-contextualization. Additionally, CoAR supports training-free subject customization in a user-provided style. Experiments demonstrate that CoAR achieves superior performance on both subject-driven personalization and style personalization, while delivering significant gains in computational and memory efficiency. Notably, CoAR tunes less than 0.05% of the parameters while achieving competitive performance compared to recent Proxy-Tuning. Code: https://github.com/KZF-kzf/CoAR

[83] LET-US: Long Event-Text Understanding of Scenes cs.CVPDF

Rui Chen, Xingyu Chen, Shaoan Wang, Shihan Kong, Junzhi Yu

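TL;DR: LET-US is a framework for long event-stream-text understanding that uses an adaptive compression mechanism and a two-stage optimization paradigm to bridge the modality gap between event streams and text, performing well on multiple tasks.

Details

Motivation: Existing MLLMs excel at understanding RGB video content but either fail to interpret event streams effectively or can handle only short sequences; LET-US aims to fill this gap.

Result: Experiments show that LET-US outperforms existing MLLMs in descriptive accuracy and semantic comprehension on long event streams.

Insight: Through feature compression and modality alignment, LET-US handles the complex relationship between event streams and text, opening a new direction for long event-stream understanding.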

Abstract: Event cameras output event streams as sparse, asynchronous data with microsecond-level temporal resolution, enabling visual perception with low latency and a high dynamic range. While existing Multimodal Large Language Models (MLLMs) have achieved significant success in understanding and analyzing RGB video content, they either fail to interpret event streams effectively or remain constrained to very short sequences. In this paper, we introduce LET-US, a framework for long event-stream–text comprehension that employs an adaptive compression mechanism to reduce the volume of input events while preserving critical visual details. LET-US thus establishes a new frontier in cross-modal inferential understanding over extended event sequences. To bridge the substantial modality gap between event streams and textual representations, we adopt a two-stage optimization paradigm that progressively equips our model with the capacity to interpret event-based scenes. To handle the voluminous temporal information inherent in long event streams, we leverage text-guided cross-modal queries for feature reduction, augmented by hierarchical clustering and similarity computation to distill the most representative event features. Moreover, we curate and construct a large-scale event-text aligned dataset to train our model, achieving tighter alignment of event features within the LLM embedding space. We also develop a comprehensive benchmark covering a diverse set of tasks – reasoning, captioning, classification, temporal localization and moment retrieval. Experimental results demonstrate that LET-US outperforms prior state-of-the-art MLLMs in both descriptive accuracy and semantic comprehension on long-duration event streams. All datasets, codes, and models will be publicly available.
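
The text-guided feature reduction step can be pictured with a deliberately simplified sketch: score each event feature by cosine similarity to a text query embedding and keep only the top-scoring ones in their original temporal order. This omits the hierarchical clustering mentioned in the abstract, and the function names, the scoring rule, and the `keep` parameter are all assumptions rather than the LET-US implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_event_features(features, text_query, keep):
    """Keep the `keep` features most similar to the text query, preserving temporal order."""
    ranked = sorted(range(len(features)),
                    key=lambda i: cosine(features[i], text_query),
                    reverse=True)
    kept = sorted(ranked[:keep])  # restore the original (temporal) ordering
    return [features[i] for i in kept], kept

# Illustrative 2-D "event features" and a text query embedding.
features = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.0]
selected, indices = select_event_features(features, query, keep=2)
```

Preserving temporal order after selection matters for event streams, since downstream reasoning about "before" and "after" depends on it.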

[84] ForensicsSAM: Toward Robust and Unified Image Forgery Detection and Localization Resisting to Adversarial Attack cs.CVPDF

Rongxuan Peng, Shunquan Tan, Chenqi Kong, Anwei Luo, Alex C. Kot

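TL;DR: The paper proposes ForensicsSAM, a unified image forgery detection and localization framework that resists adversarial attacks, improving performance by injecting forgery experts and adversary experts.

Details

Motivation: Existing methods based on parameter-efficient fine-tuning (PEFT) are vulnerable to adversarial attacks in image forgery detection and localization (IFDL); the authors propose a solution to improve robustness.

Result: ForensicsSAM is robust to a variety of adversarial attack methods across multiple benchmarks while achieving SOTA performance in forgery detection and localization.

Insight: Adaptively activating the adversary experts avoids interfering with clean images, offering a new approach to adversarial defense.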

Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as a popular strategy for adapting large vision foundation models, such as the Segment Anything Model (SAM) and LLaVA, to downstream tasks like image forgery detection and localization (IFDL). However, existing PEFT-based approaches overlook their vulnerability to adversarial attacks. In this paper, we show that highly transferable adversarial images can be crafted solely via the upstream model, without accessing the downstream model or training data, significantly degrading the IFDL performance. To address this, we propose ForensicsSAM, a unified IFDL framework with built-in adversarial robustness. Our design is guided by three key ideas: (1) To compensate for the lack of forgery-relevant knowledge in the frozen image encoder, we inject forgery experts into each transformer block to enhance its ability to capture forgery artifacts. These forgery experts are always activated and shared across all input images. (2) To detect adversarial images, we design a lightweight adversary detector that learns to capture structured, task-specific artifacts in the RGB domain, enabling reliable discrimination across various attack methods. (3) To resist adversarial attacks, we inject adversary experts into the global attention layers and MLP modules to progressively correct feature shifts induced by adversarial noise. These adversary experts are adaptively activated by the adversary detector, thereby avoiding unnecessary interference with clean images. Extensive experiments across multiple benchmarks demonstrate that ForensicsSAM achieves superior resistance to various adversarial attack methods, while also delivering state-of-the-art performance in image-level forgery detection and pixel-level forgery localization. The resource is available at https://github.com/siriusPRX/ForensicsSAM.

[85] CharacterShot: Controllable and Consistent 4D Character Animation cs.CVPDF

Junyao Gao, Jiaxing Li, Wenran Liu, Yanhong Zeng, Fei Shen

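TL;DR: CharacterShot is a controllable and consistent 4D character animation framework that generates dynamic 3D character animation from a single reference image and a 2D pose sequence. It combines a 2D animation model with a dual-attention module and a camera prior to generate multi-view videos, then obtains a stable 4D character representation through Gaussian-splatting-based optimization.

Details

Motivation: Existing methods struggle to generate dynamic 3D character animation from a single image and lack consistency and controllability. CharacterShot aims to solve these problems so that designers can easily create high-quality 4D animation.

Result: Experiments show that CharacterShot outperforms existing methods on CharacterBench.

Insight: Combining 2D and 3D generation methods can efficiently produce high-quality 4D animation, and the scale and quality of the dataset are critical to model performance.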

Abstract: In this paper, we propose CharacterShot, a controllable and consistent 4D character animation framework that enables any individual designer to create dynamic 3D characters (i.e., 4D character animation) from a single reference character image and a 2D pose sequence. We begin by pretraining a powerful 2D character animation model based on a cutting-edge DiT-based image-to-video model, which accepts any 2D pose sequence as a controllable signal. We then lift the animation model from 2D to 3D by introducing a dual-attention module together with a camera prior to generate multi-view videos with spatial-temporal and spatial-view consistency. Finally, we employ a novel neighbor-constrained 4D Gaussian splatting optimization on these multi-view videos, resulting in continuous and stable 4D character representations. Moreover, to improve character-centric performance, we construct a large-scale dataset Character4D, containing 13,115 unique characters with diverse appearances and motions, rendered from multiple viewpoints. Extensive experiments on our newly constructed benchmark, CharacterBench, demonstrate that our approach outperforms current state-of-the-art methods. Code, models, and datasets will be publicly available at https://github.com/Jeoyal/CharacterShot.

[86] Freeze and Reveal: Exposing Modality Bias in Vision-Language Models cs.CV | cs.AI

Vivek Hruday Kavuri, Vysishtya Karanam, Venkata Jahnavi Venkamsetty, Kriti Madumadukala, Lakshmipathi Balaji Darur

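TL;DR: The paper proposes a new approach to identifying and reducing gender bias in vision-language models. Using counterfactual data augmentation and task vector methods, it reveals how the vision and text encoders each contribute to bias and introduces a computationally efficient bias mitigation method.

Details

Motivation: Vision-language models perform well on multimodal tasks but often inherit gender bias from their training data. The paper aims to reveal whether this bias comes mainly from the vision or the text modality and to propose an efficient debiasing method.

Result: 1. CDA reduces the gender gap by 6%, and DAUDoS by 3% while using only one-third of the data. 2. Both methods improve the model's accuracy in identifying gender in images by 3%. 3. CLIP's vision encoder is found to be more biased, whereas in PaliGemma2 the text encoder is more biased.

Insight: The vision and text encoders contribute differently to bias, and pinpointing the source of bias helps design more targeted debiasing strategies. DAUDoS mitigates bias efficiently at low computational cost.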

Abstract: Vision Language Models achieve impressive multi-modal performance but often inherit gender biases from their training data. This bias may come from both the vision and text modalities. In this work, we dissect the contributions of vision and text backbones to these biases by applying targeted debiasing using Counterfactual Data Augmentation and Task Vector methods. Inspired by data-efficient approaches in hate-speech classification, we introduce a novel metric, Degree of Stereotypicality, and a corresponding debiasing method, Data Augmentation Using Degree of Stereotypicality (DAUDoS), to reduce bias with minimal computational cost. We curate a gender-annotated dataset and evaluate all methods on the VisoGender benchmark to quantify improvements and identify the dominant source of bias. Our results show that CDA reduces the gender gap by 6% and DAUDoS by 3% while using only one-third of the data. Both methods also improve the model’s ability to correctly identify gender in images by 3%, with DAUDoS achieving this improvement using only about one-third of the training data. From our experiments, we observed that CLIP’s vision encoder is more biased whereas PaliGemma2’s text encoder is more biased. By identifying whether bias stems more from vision or text encoders, our work enables more targeted and effective bias mitigation strategies in future multi-modal systems.

[87] Levarging Learning Bias for Noisy Anomaly Detection cs.CV

Yuxin Zhang, Yunkang Cao, Yuqi Cheng, Yihan Sun, Weiming Shen

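TL;DR: The paper proposes a two-stage framework that exploits a model's inherent learning bias to address fully unsupervised image anomaly detection (FUIAD), where the training data may contain anomalies. Filtering out a purified dataset significantly improves anomaly detection performance.

Details

Motivation: Real-world training data may be contaminated with anomalies, yet conventional methods assume anomaly-free training data, which degrades model performance.

Result: The method shows superior anomaly detection and localization performance on the Real-IAD benchmark, especially under noisy conditions.

Insight: Learning bias (statistical dominance and feature-space divergence) is an effective tool for filtering anomalies, and the model-agnostic design makes the framework applicable to a variety of unsupervised backbones.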

Abstract: This paper addresses the challenge of fully unsupervised image anomaly detection (FUIAD), where training data may contain unlabeled anomalies. Conventional methods assume anomaly-free training data, but real-world contamination leads models to absorb anomalies as normal, degrading detection performance. To mitigate this, we propose a two-stage framework that systematically exploits inherent learning bias in models. The learning bias stems from: (1) the statistical dominance of normal samples, driving models to prioritize learning stable normal patterns over sparse anomalies, and (2) feature-space divergence, where normal data exhibit high intra-class consistency while anomalies display high diversity, leading to unstable model responses. Leveraging the learning bias, stage 1 partitions the training set into subsets, trains sub-models, and aggregates cross-model anomaly scores to filter a purified dataset. Stage 2 trains the final detector on this dataset. Experiments on the Real-IAD benchmark demonstrate superior anomaly detection and localization performance under different noise conditions. Ablation studies further validate the framework’s contamination resilience, emphasizing the critical role of learning bias exploitation. The model-agnostic design ensures compatibility with diverse unsupervised backbones, offering a practical solution for real-world scenarios with imperfect training data. Code is available at https://github.com/hustzhangyuxin/LLBNAD.
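
The two-stage procedure described in this abstract can be sketched on toy data. This is a minimal illustration under stated assumptions, not the authors' implementation: the 1-D samples, the mean/std "sub-model", the z-score anomaly score, and the names `fit_submodel`, `purify`, `num_subsets`, and `keep_ratio` are all hypothetical stand-ins for the unsupervised backbones and scoring used in the paper.

```python
import statistics

def fit_submodel(samples):
    """Toy 'sub-model' (assumption): remember mean and std of its subset."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples) or 1.0  # avoid division by zero
    return mu, sigma

def anomaly_score(model, x):
    """Distance from the sub-model's mean, in units of its std."""
    mu, sigma = model
    return abs(x - mu) / sigma

def purify(train, num_subsets=3, keep_ratio=0.8):
    """Stage 1: partition, train sub-models, aggregate scores, filter."""
    # Partition the (possibly contaminated) training set.
    subsets = [train[i::num_subsets] for i in range(num_subsets)]
    # Train one sub-model per subset.
    models = [fit_submodel(s) for s in subsets]
    # Aggregate each sample's anomaly score across all sub-models.
    scored = []
    for x in train:
        scores = [anomaly_score(m, x) for m in models]
        scored.append((statistics.fmean(scores), x))
    # Keep the samples with the lowest aggregated scores.
    scored.sort(key=lambda pair: pair[0])
    n_keep = int(len(train) * keep_ratio)
    return [x for _, x in scored[:n_keep]]

# Mostly normal samples near 1.0, contaminated by two anomalies.
train = [1.0, 1.1, 0.9, 1.05, 0.95, 1.02, 0.98, 9.0, 1.03, -7.0]
purified = purify(train)
# Stage 2: train the final detector on the purified set only.
final_model = fit_submodel(purified)
```

In this toy run the sub-model that happened to see only normal samples assigns the two outliers a very large score, so the aggregated score separates them even though the other sub-models were trained on contaminated subsets.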

[88] AURA: A Fine-Grained Benchmark and Decomposed Metric for Audio-Visual Reasoning cs.CV

Siminfar Samakoush Galougah, Rishie Raj, Sanjoy Chowdhury, Sayan Nag, Ramani Duraiswami

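TL;DR: AURA is a fine-grained audio-visual reasoning benchmark that evaluates models' cross-modal reasoning through the decomposed AuraScore metric, exposing flaws in the logical reasoning of existing models.

Details

Motivation: Current audio-visual benchmarks look only at final-answer accuracy and ignore whether the reasoning process is sound, so a model can reach a correct answer through flawed logic without being detected.

Result: Experiments show that although SOTA models reach task accuracy as high as 92%, their Factual Consistency and Core Inference scores fall below 45%, exposing reasoning flaws.

Insight: Correctness does not guarantee sound reasoning; AURA provides an important tool for robust evaluation of future multimodal models.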

Abstract: Current audio-visual (AV) benchmarks focus on final answer accuracy, overlooking the underlying reasoning process. This makes it difficult to distinguish genuine comprehension from correct answers derived through flawed reasoning or hallucinations. To address this, we introduce AURA (Audio-visual Understanding and Reasoning Assessment), a benchmark for evaluating the cross-modal reasoning capabilities of Audio-Visual Large Language Models (AV-LLMs) and Omni-modal Language Models (OLMs). AURA includes questions across six challenging cognitive domains, such as causality, timbre and pitch, tempo and AV synchronization, unanswerability, implicit distractions, and skill profiling, explicitly designed to be unanswerable from a single modality. This forces models to construct a valid logical path grounded in both audio and video, setting AURA apart from AV datasets that allow uni-modal shortcuts. To assess reasoning traces, we propose a novel metric, AuraScore, which addresses the lack of robust tools for evaluating reasoning fidelity. It decomposes reasoning into two aspects: (i) Factual Consistency - whether reasoning is grounded in perceptual evidence, and (ii) Core Inference - the logical validity of each reasoning step. Evaluations of SOTA models on AURA reveal a critical reasoning gap: although models achieve high accuracy (up to 92% on some tasks), their Factual Consistency and Core Inference scores fall below 45%. This discrepancy highlights that models often arrive at correct answers through flawed logic, underscoring the need for our benchmark and paving the way for more robust multimodal evaluation.

[89] VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding cs.CV

Jian Chen, Ming Li, Jihyung Kil, Chenguang Wang, Tong Yu

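TL;DR: The paper introduces VisR-Bench, a multilingual benchmark for question-driven multimodal retrieval in long documents. It covers 16 languages and 35K QA pairs and reveals the strengths and limitations of current models in multimodal retrieval.

Details

Motivation: Most existing benchmarks focus on English-only document retrieval or on multilingual question answering over single-page images, and none evaluates multimodal retrieval over multilingual long documents. The authors propose VisR-Bench to fill this gap.

Result: MLLMs significantly outperform text-based and multimodal encoder models but still struggle with structured tables and low-resource languages.

Insight: MLLMs show strong capability in multimodal retrieval, but structured content and low-resource languages remain key challenges.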

Abstract: Most organizational data in this world are stored as documents, and visual retrieval plays a crucial role in unlocking the collective intelligence from all these documents. However, existing benchmarks focus on English-only document retrieval or only consider multilingual question-answering on a single-page image. To bridge this gap, we introduce VisR-Bench, a multilingual benchmark designed for question-driven multimodal retrieval in long documents. Our benchmark comprises over 35K high-quality QA pairs across 1.2K documents, enabling fine-grained evaluation of multimodal retrieval. VisR-Bench spans sixteen languages with three question types (figures, text, and tables), offering diverse linguistic and question coverage. Unlike prior datasets, we include queries without explicit answers, preventing models from relying on superficial keyword matching. We evaluate various retrieval models, including text-based methods, multimodal encoders, and MLLMs, providing insights into their strengths and limitations. Our results show that while MLLMs significantly outperform text-based and multimodal encoder models, they still struggle with structured tables and low-resource languages, highlighting key challenges in multilingual visual retrieval.

[90] FormCoach: Lift Smarter, Not Harder cs.CV | cs.HC

Xiaoye Zuo, Nikos Athanasiou, Ginger Delmas, Yiming Huang, Xingyu Fu

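TL;DR: FormCoach is an AI fitness coaching system built on vision-language models (VLMs) that analyzes users' exercise form from live video and gives personalized corrective feedback. The paper also releases a dataset of 1,700 expert-annotated video pairs and an automated evaluation pipeline to advance research on AI-driven fitness coaching.

Details

Motivation: At-home fitness enthusiasts often cannot get expert feedback. FormCoach aims to fill this gap by using AI to provide real-time form correction and help users train more effectively.

Result: Experiments show that the feedback quality of existing AI models still lags significantly behind that of human coaches, although the models show promise. The released dataset and tools provide a foundation for follow-up research.

Insight: FormCoach demonstrates the potential of AI in fitness, but achieving more nuanced, context-aware movement analysis needs further research. Human-AI collaboration may be the way forward.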

Abstract: Good form is the difference between strength and strain, yet for the fast-growing community of at-home fitness enthusiasts, expert feedback is often out of reach. FormCoach transforms a simple camera into an always-on, interactive AI training partner, capable of spotting subtle form errors and delivering tailored corrections in real time, leveraging vision-language models (VLMs). We showcase this capability through a web interface and benchmark state-of-the-art VLMs on a dataset of 1,700 expert-annotated user-reference video pairs spanning 22 strength and mobility exercises. To accelerate research in AI-driven coaching, we release both the dataset and an automated, rubric-based evaluation pipeline, enabling standardized comparison across models. Our benchmarks reveal substantial gaps compared to human-level coaching, underscoring both the challenges and opportunities in integrating nuanced, context-aware movement analysis into interactive AI systems. By framing form correction as a collaborative and creative process between humans and machines, FormCoach opens a new frontier in embodied AI.

[91] Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing cs.CV

Joonghyuk Shin, Alchan Hwang, Yujin Kim, Daneul Kim, Jaesik Park

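TL;DR: The paper analyzes the attention mechanism of multimodal diffusion transformers (MM-DiT) and proposes a prompt-based image editing method that supports global-to-local edits and applies to multiple MM-DiT variants.

Details

Motivation: Transformer architectures have superseded traditional U-Net-based diffusion models, but MM-DiT's bidirectional information flow poses challenges for existing editing techniques. The paper aims to explore MM-DiT's behavioral patterns and propose an editing method suited to it.

Result: The proposed method achieves effective global-to-local editing across multiple MM-DiT variants, bridging the gap between traditional U-Net-based methods and the new architectures.

Insight: MM-DiT's bidirectional attention opens new possibilities for image editing but requires architecture-specific adaptation for effective editing.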

Abstract: Transformer-based diffusion models have recently superseded traditional U-Net architectures, with multimodal diffusion transformers (MM-DiT) emerging as the dominant approach in state-of-the-art models like Stable Diffusion 3 and Flux.1. Previous approaches have relied on unidirectional cross-attention mechanisms, with information flowing from text embeddings to image latents. In contrast, MM-DiT introduces a unified attention mechanism that concatenates input projections from both modalities and performs a single full attention operation, allowing bidirectional information flow between text and image branches. This architectural shift presents significant challenges for existing editing techniques. In this paper, we systematically analyze MM-DiT’s attention mechanism by decomposing attention matrices into four distinct blocks, revealing their inherent characteristics. Through these analyses, we propose a robust, prompt-based image editing method for MM-DiT that supports global to local edits across various MM-DiT variants, including few-step models. We believe our findings bridge the gap between existing U-Net-based methods and emerging architectures, offering deeper insights into MM-DiT’s behavioral patterns.
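
The unified attention and its four-block decomposition can be illustrated with a small pure-Python sketch. It is only a toy under stated assumptions: the token vectors are made up, queries and keys are the raw token vectors rather than learned per-modality projections, text tokens are assumed to come first in the concatenated sequence, and the block names (`q_text_k_image` and so on) are labels chosen here, not the paper's notation.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def joint_attention(tokens):
    """Attention weights of one full attention pass over all tokens.

    Simplification: queries and keys are the raw token vectors; a real
    model would apply learned projections for each modality first.
    """
    scale = math.sqrt(len(tokens[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in tokens])
        for q in tokens
    ]

def split_blocks(attn, n_text):
    """Split the attention matrix by query modality (rows) and key modality (columns)."""
    return {
        "q_text_k_text": [row[:n_text] for row in attn[:n_text]],
        "q_text_k_image": [row[n_text:] for row in attn[:n_text]],
        "q_image_k_text": [row[:n_text] for row in attn[n_text:]],
        "q_image_k_image": [row[n_text:] for row in attn[n_text:]],
    }

text_tokens = [[1.0, 0.0], [0.0, 1.0]]               # 2 hypothetical text tokens
image_tokens = [[0.5, 0.5], [1.0, 1.0], [0.0, 0.2]]  # 3 hypothetical image tokens
attn = joint_attention(text_tokens + image_tokens)    # assumed order: text first
blocks = split_blocks(attn, n_text=len(text_tokens))
```

Because every row is normalized over both modalities at once, the two off-diagonal blocks are non-empty, which is the bidirectional text-image flow that unidirectional cross-attention lacks.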

[92] Investigating the Design Space of Visual Grounding in Multimodal Large Language Model cs.CV | cs.AI | cs.CL | cs.LG

Weitai Kang, Weiming Zhuang, Zhizhong Li, Yan Yan, Lingjuan Lyu

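TL;DR: Through a systematic study of design choices for visual grounding (VG) in multimodal large language models (MLLMs), the paper proposes optimizations that significantly improve VG performance.

Details

Motivation: Existing MLLMs use varied design choices for visual grounding that lack systematic verification; the paper aims to fill this gap.

Result: The optimized MLLM improves by +5.6% / +6.9% / +7.0% on RefCOCO/+/g, respectively, over LLaVA-1.5.

Insight: A well-chosen visual grounding paradigm and grounding data design can significantly improve MLLM performance on VG tasks.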

Abstract: Fine-grained multimodal capability in Multimodal Large Language Models (MLLMs) has emerged as a critical research direction, particularly for tackling the visual grounding (VG) problem. Despite the strong performance achieved by existing approaches, they often employ disparate design choices when fine-tuning MLLMs for VG, lacking systematic verification to support these designs. To bridge this gap, this paper presents a comprehensive study of various design choices that impact the VG performance of MLLMs. We conduct our analysis using LLaVA-1.5, which has been widely adopted in prior empirical studies of MLLMs. While more recent models exist, we follow this convention to ensure our findings remain broadly applicable and extendable to other architectures. We cover two key aspects: (1) exploring different visual grounding paradigms in MLLMs, identifying the most effective design, and providing our insights; and (2) conducting ablation studies on the design of grounding data to optimize MLLMs’ fine-tuning for the VG task. Finally, our findings contribute to a stronger MLLM for VG, achieving improvements of +5.6% / +6.9% / +7.0% on RefCOCO/+/g over LLaVA-1.5.

[93] Domain Generalization of Pathological Image Segmentation by Patch-Level and WSI-Level Contrastive Learning cs.CV

Yuki Shigeyasu, Shota Harada, Akihiko Yoshizawa, Kazuhiro Terada, Naoki Nakazima

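TL;DR: The paper proposes a domain generalization method for pathological image segmentation that uses two-stage contrastive learning (WSI-level and patch-level) to address domain shifts within whole slide images.

Details

Motivation: Existing methods rely on multi-hospital data to address domain shift, but such data is hard to collect; the paper instead exploits domain shifts within a single hospital's data (such as differences in patient characteristics and tissue thickness).

Result: The abstract does not report specific experimental results, but the method is designed to reduce domain shift effectively.

Insight: Domain shifts within intra-hospital data can be effectively mitigated through feature clustering and contrastive learning, without relying on multi-hospital data.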

Abstract: In this paper, we address domain shifts in pathological images by focusing on shifts within whole slide images (WSIs), such as patient characteristics and tissue thickness, rather than shifts between hospitals. Traditional approaches rely on multi-hospital data, but data collection challenges often make this impractical. Therefore, the proposed domain generalization method captures and leverages intra-hospital domain shifts by clustering WSI-level features from non-tumor regions and treating these clusters as domains. To mitigate domain shift, we apply contrastive learning to reduce feature gaps between WSI pairs from different clusters. The proposed method introduces a two-stage contrastive learning approach, combining WSI-level and patch-level contrastive learning, to minimize these gaps effectively.

[94] CoT-Pose: Chain-of-Thought Reasoning for 3D Pose Generation from Abstract Prompts cs.CV

Junuk Cha, Jihyeon Kim

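TL;DR: CoT-Pose incorporates chain-of-thought (CoT) reasoning to tackle generating 3D human poses from abstract prompts, replacing conventional approaches that depend on detailed descriptions.

Details

Motivation: Existing 3D pose generation models depend on low-level, detailed prompts, which does not match people's habit of describing actions in abstract language.

Result: Experiments show that CoT-Pose can generate semantically aligned and plausible 3D poses from abstract text.

Insight: High-level understanding is critical to pose generation, and chain-of-thought reasoning offers a new direction for it.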

Abstract: Recent advances in multi-modal large language models (MLLMs) and chain-of-thought (CoT) reasoning have led to significant progress in image and text generation tasks. However, the field of 3D human pose generation still faces critical limitations. Most existing text-to-pose models rely heavily on detailed (low-level) prompts that explicitly describe joint configurations. In contrast, humans tend to communicate actions and intentions using abstract (high-level) language. This mismatch results in a practical challenge for deploying pose generation systems in real-world scenarios. To bridge this gap, we introduce a novel framework that incorporates CoT reasoning into the pose generation process, enabling the interpretation of abstract prompts into accurate 3D human poses. We further propose a data synthesis pipeline that automatically generates triplets of abstract prompts, detailed prompts, and corresponding 3D poses for the training process. Experimental results demonstrate that our reasoning-enhanced model, CoT-Pose, can effectively generate plausible and semantically aligned poses from abstract textual inputs. This work highlights the importance of high-level understanding in pose generation and opens new directions for reasoning-enhanced approaches to human pose generation.

[95] Commentary Generation for Soccer Highlights cs.CV | cs.LG

Chidaksh Ravuru

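TL;DR: The paper extends the MatchVoice model to commentary generation for soccer highlights, experimentally evaluates the impact of different training configurations and hardware limitations, and explores the effect of window size on zero-shot performance.

Details

Motivation: Existing soccer commentary generation systems (such as SoccerNet-Caption) fall short in fine-grained alignment between video content and commentary; the MatchVoice model addresses this with coarse- and fine-grained alignment techniques. The paper aims to further extend MatchVoice to commentary generation for soccer highlights.

Result: Experiments show that MatchVoice generalizes reasonably well, but techniques from broader video-language domains are still needed to improve performance further.

Insight: Soccer commentary generation needs techniques from broader video-language alignment; future work could focus on improving multimodal alignment.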

Abstract: Automated soccer commentary generation has evolved from template-based systems to advanced neural architectures, aiming to produce real-time descriptions of sports events. While frameworks like SoccerNet-Caption laid foundational work, their inability to achieve fine-grained alignment between video content and commentary remains a significant challenge. Recent efforts such as MatchTime, with its MatchVoice model, address this issue through coarse and fine-grained alignment techniques, achieving improved temporal synchronization. In this paper, we extend MatchVoice to commentary generation for soccer highlights using the GOAL dataset, which emphasizes short clips over entire games. We conduct extensive experiments to reproduce the original MatchTime results and evaluate our setup, highlighting the impact of different training configurations and hardware limitations. Furthermore, we explore the effect of varying window sizes on zero-shot performance. While MatchVoice exhibits promising generalization capabilities, our findings suggest the need for integrating techniques from broader video-language domains to further enhance performance. Our code is available at https://github.com/chidaksh/SoccerCommentary.

[96] Decoupled Functional Evaluation of Autonomous Driving Models via Feature Map Quality Scoring cs.CV

Ludan Zhang, Sihan Wang, Yuqi Dai, Shuofei Qiao, Lei He

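TL;DR: The paper proposes an independent evaluation method based on the Feature Map Convergence Score (FMCS). By building a Dual-Granularity Dynamic Weighted Scoring System (DG-DWSS) and the CLIP-FMQE-Net network, it enables real-time quality assessment of the feature maps generated by the functional modules of autonomous driving models and significantly improves 3D object detection performance.

Details

Motivation: The intermediate functional modules of end-to-end autonomous driving models lack explicit supervision signals, which makes their mechanisms opaque and hard to interpret. Traditional methods struggle to evaluate and train these modules independently, so a new evaluation method is needed.

Result: On the NuScenes dataset, integrating the evaluation module improves 3D object detection performance by 3.89% (NDS).

Insight: Independently evaluating the feature map quality of intermediate functional modules allows feature representations to be optimized explicitly, which improves overall model performance and offers a new approach to training and evaluating autonomous driving models.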

Abstract: End-to-end models are emerging as the mainstream in autonomous driving perception and planning. However, the lack of explicit supervision signals for intermediate functional modules leads to opaque operational mechanisms and limited interpretability, making it challenging for traditional methods to independently evaluate and train these modules. To address this issue, this study builds upon the evaluation framework based on feature map-truth representation similarity and proposes an independent evaluation method based on Feature Map Convergence Score (FMCS). A Dual-Granularity Dynamic Weighted Scoring System (DG-DWSS) is constructed, formulating a unified quantitative metric - Feature Map Quality Score - to enable comprehensive evaluation of the quality of feature maps generated by functional modules. A CLIP-based Feature Map Quality Evaluation Network (CLIP-FMQE-Net) is further developed, combining feature-truth encoders and quality score prediction heads to enable real-time quality analysis of feature maps generated by functional modules. Experimental results on the NuScenes dataset demonstrate that integrating our evaluation module into training improves 3D object detection performance, achieving a 3.89% gain in NDS. These results verify the effectiveness of our method in enhancing feature representation quality and overall model performance.

[97] Splat4D: Diffusion-Enhanced 4D Gaussian Splatting for Temporally and Spatially Consistent Content Creation cs.CV

Minghao Yin, Yukang Cao, Songyou Peng, Kai Han

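TL;DR: Splat4D is a framework that generates high-quality 4D content from monocular video using diffusion enhancement and 4D Gaussian splatting, addressing spatial-temporal consistency and user guidance.

Details

Motivation: Generating high-quality 4D content from monocular video for digital human and AR/VR applications faces challenges in spatial-temporal consistency, detail preservation, and user guidance. Splat4D aims to address these challenges.

Result: Splat4D performs strongly on multiple public benchmarks and supports applications such as text/image-conditioned 4D generation and text-guided content editing.

Insight: Splat4D shows the potential of combining diffusion models with Gaussian splatting for content generation, giving users flexible 4D content generation and editing.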

Abstract: Generating high-quality 4D content from monocular videos for applications such as digital humans and AR/VR poses challenges in ensuring temporal and spatial consistency, preserving intricate details, and incorporating user guidance effectively. To overcome these challenges, we introduce Splat4D, a novel framework enabling high-fidelity 4D content generation from a monocular video. Splat4D achieves superior performance while maintaining faithful spatial-temporal coherence by leveraging multi-view rendering, inconsistency identification, a video diffusion model, and an asymmetric U-Net for refinement. Through extensive evaluations on public benchmarks, Splat4D consistently demonstrates state-of-the-art performance across various metrics, underscoring the efficacy of our approach. Additionally, the versatility of Splat4D is validated in various applications such as text/image conditioned 4D generation, 4D human generation, and text-guided content editing, producing coherent outcomes following user instructions.

[98] Adaptive Cache Enhancement for Test-Time Adaptation of Vision-Language Models cs.CV

Khanh-Binh Nguyen, Phuoc-Nguyen Bui, Hyunseung Choo, Duc Thanh Nguyen

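TL;DR: The paper proposes the Adaptive Cache Enhancement (ACE) framework, which selects and refines cached samples using dynamic, class-specific thresholds. It addresses the unreliable confidence metrics and rigid decision boundaries of conventional test-time adaptation (TTA) methods under distribution shift and significantly improves the adaptation performance of vision-language models (VLMs).

Details

Motivation: Vision-language models (VLMs) generalize well zero-shot, but their performance degrades under distribution shift in downstream tasks, especially without labeled data. Conventional cache-based TTA methods can fail under significant distribution shift because of unreliable confidence and rigid decision boundaries, so a more robust cache optimization method is needed.

Result: ACE achieves state-of-the-art performance on 15 benchmark datasets, significantly improving robustness and generalization under distribution shift.

Insight: Class-specific adaptive caching and dynamic threshold refinement can significantly improve the robustness and accuracy of TTA methods, especially under significant distribution shift.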

Abstract: Vision-language models (VLMs) exhibit remarkable zero-shot generalization but suffer performance degradation under distribution shifts in downstream tasks, particularly in the absence of labeled data. Test-Time Adaptation (TTA) addresses this challenge by enabling online optimization of VLMs during inference, eliminating the need for annotated data. Cache-based TTA methods exploit historical knowledge by maintaining a dynamic memory cache of low-entropy or high-confidence samples, promoting efficient adaptation to out-of-distribution data. Nevertheless, these methods face two critical challenges: (1) unreliable confidence metrics under significant distribution shifts, resulting in error accumulation within the cache and degraded adaptation performance; and (2) rigid decision boundaries that fail to accommodate substantial distributional variations, leading to suboptimal predictions. To overcome these limitations, we introduce the Adaptive Cache Enhancement (ACE) framework, which constructs a robust cache by selectively storing high-confidence or low-entropy image embeddings per class, guided by dynamic, class-specific thresholds initialized from zero-shot statistics and iteratively refined using an exponential moving average and exploration-augmented updates. This approach enables adaptive, class-wise decision boundaries, ensuring robust and accurate predictions across diverse visual distributions. Extensive experiments on 15 diverse benchmark datasets demonstrate that ACE achieves state-of-the-art performance, delivering superior robustness and generalization compared to existing TTA methods in challenging out-of-distribution scenarios.
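
The idea of a per-class cache gated by thresholds that are refined with an exponential moving average can be sketched as follows. This is a rough illustration under stated assumptions, not ACE itself: the class `ClassThresholdCache`, its `momentum` and `capacity` parameters, and the exact update rule are hypothetical, and the entropy criterion and exploration-augmented updates mentioned in the abstract are omitted.

```python
class ClassThresholdCache:
    """Per-class cache of high-confidence embeddings with EMA thresholds."""

    def __init__(self, init_thresholds, momentum=0.9, capacity=3):
        # Thresholds would be initialized from zero-shot statistics;
        # here they are simply passed in (assumption).
        self.thresholds = dict(init_thresholds)
        self.momentum = momentum
        self.capacity = capacity
        self.cache = {cls: [] for cls in init_thresholds}

    def observe(self, cls, confidence, embedding):
        """Maybe store a test sample, then refine the class threshold."""
        stored = confidence >= self.thresholds[cls]
        if stored:
            entries = self.cache[cls]
            entries.append((confidence, embedding))
            # Keep only the most confident entries for this class.
            entries.sort(key=lambda entry: entry[0], reverse=True)
            del entries[self.capacity:]
        # EMA update: the threshold drifts toward observed confidences.
        m = self.momentum
        self.thresholds[cls] = m * self.thresholds[cls] + (1 - m) * confidence
        return stored

cache = ClassThresholdCache({"cat": 0.5, "dog": 0.7})
first = cache.observe("cat", 0.9, [0.1, 0.2])    # above 0.5, so it is cached
second = cache.observe("cat", 0.52, [0.3, 0.1])  # below the raised threshold
```

After the first confident "cat" sample the class threshold rises from 0.5 to about 0.54, so the second sample at 0.52 is rejected while the "dog" threshold is untouched, which is the class-wise behaviour a single global threshold cannot provide.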

[99] Exploiting Layer Normalization Fine-tuning in Visual Transformer Foundation Models for Classification cs.CV | cs.LG

Zhaorui Tan, Tan Pan, Kaizhu Huang, Weimiao Yu, Kai Yao

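TL;DR: The paper studies the fine-tuning dynamics of LayerNorm in Vision Transformers and proposes the Fine-tuning Shift Ratio (FSR) and a simple rescaling mechanism to improve fine-tuning for cross-domain classification.

Details

Motivation: LayerNorm is pivotal in Vision Transformers, but its fine-tuning dynamics under data scarcity and domain shift remain underexplored. The paper aims to explore how changes in LayerNorm parameters reflect the transition between source and target domains.

Result: Experiments show that OOD tasks typically have lower FSR and higher λ, especially when data is scarce; pathological image data behaves more like the ID setting.

Insight: In OOD tasks the target training samples under-represent the target domain, while for pathological data a more conservative LayerNorm update strategy may be a better fit.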

Abstract: LayerNorm is pivotal in Vision Transformers (ViTs), yet its fine-tuning dynamics under data scarcity and domain shifts remain underexplored. This paper shows that shifts in LayerNorm parameters after fine-tuning (LayerNorm shifts) are indicative of the transitions between source and target domains; its efficacy is contingent upon the degree to which the target training samples accurately represent the target domain, as quantified by our proposed Fine-tuning Shift Ratio ($FSR$). Building on this, we propose a simple yet effective rescaling mechanism using a scalar $\lambda$ that is negatively correlated to $FSR$ to align learned LayerNorm shifts with those ideal shifts achieved under fully representative data, combined with a cyclic framework that further enhances the LayerNorm fine-tuning. Extensive experiments across natural and pathological images, in both in-distribution (ID) and out-of-distribution (OOD) settings, and various target training sample regimes validate our framework. Notably, OOD tasks tend to yield lower $FSR$ and higher $\lambda$ in comparison to ID cases, especially with scarce data, indicating under-represented target training samples. Moreover, ViTFs fine-tuned on pathological data behave more like ID settings, favoring conservative LayerNorm updates. Our findings illuminate the underexplored dynamics of LayerNorm in transfer learning and provide practical strategies for LayerNorm fine-tuning.
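
One plausible reading of "rescaling the LayerNorm shift with a scalar $\lambda$" is a linear scaling of the parameter change around the pretrained values. The sketch below shows only that reading; the abstract gives neither the exact formula nor how $\lambda$ is derived from $FSR$, so the function `rescale_layernorm_shift` and the sample values are assumptions made for illustration.

```python
def rescale_layernorm_shift(pretrained, finetuned, lam):
    """Return pretrained + lam * (finetuned - pretrained), element-wise.

    Assumed form: lam > 1 amplifies the learned shift, lam < 1 damps it,
    and lam == 1 leaves the fine-tuned parameters unchanged.
    """
    return [p + lam * (f - p) for p, f in zip(pretrained, finetuned)]

pretrained_gamma = [1.0, 0.5]   # hypothetical LayerNorm weights before fine-tuning
finetuned_gamma = [1.25, 0.0]   # hypothetical weights after fine-tuning
amplified = rescale_layernorm_shift(pretrained_gamma, finetuned_gamma, lam=2.0)
unchanged = rescale_layernorm_shift(pretrained_gamma, finetuned_gamma, lam=1.0)
```

Under this reading, the higher $\lambda$ reported for OOD tasks with scarce data would correspond to pushing the parameters further along the learned direction than fine-tuning alone reached.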

[100] GAPNet: A Lightweight Framework for Image and Video Salient Object Detection via Granularity-Aware Paradigm cs.CV

Yu-Huan Wu, Wei Liu, Zi-Xuan Zhu, Zizhou Wang, Yong Liu

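TL;DR: GAPNet is a lightweight salient object detection framework that uses a granularity-aware paradigm to detect salient objects efficiently in both images and videos, substantially reducing computational cost.

Details

Motivation: Existing salient object detection models usually rely on heavyweight backbones, and their high computational cost hinders use on real-world edge devices.

Result: GAPNet sets a new state of the art among lightweight image and video salient object detection models.

Insight: Granularity-aware supervision and multi-scale feature fusion can keep accuracy high while reducing computational cost, which suits edge deployment.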

Abstract: Recent salient object detection (SOD) models predominantly rely on heavyweight backbones, incurring substantial computational cost and hindering their practical application in various real-world settings, particularly on edge devices. This paper presents GAPNet, a lightweight network built on the granularity-aware paradigm for both image and video SOD. We assign saliency maps of different granularities to supervise the multi-scale decoder side-outputs: coarse object locations for high-level outputs and fine-grained object boundaries for low-level outputs. Specifically, our decoder is built with granularity-aware connections which fuse high-level features of low granularity and low-level features of high granularity, respectively. To support these connections, we design granular pyramid convolution (GPC) and cross-scale attention (CSA) modules for efficient fusion of low-scale and high-scale features, respectively. On top of the encoder, a self-attention module is built to learn global information, enabling accurate object localization with negligible computational cost. Unlike traditional U-Net-based approaches, our proposed method optimizes feature utilization and semantic interpretation while applying appropriate supervision at each processing stage. Extensive experiments show that the proposed method achieves a new state-of-the-art performance among lightweight image and video SOD models. Code is available at https://github.com/yuhuan-wu/GAPNet.

[101] From Prediction to Explanation: Multimodal, Explainable, and Interactive Deepfake Detection Framework for Non-Expert Users cs.CV

Shahroz Tariq, Simon S. Woo, Priyanka Singh, Irena Irmalasari, Saakshi Gupta

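TL;DR: The paper proposes the DF-P2E framework, which makes deepfake detection more interpretable through visual, semantic, and narrative layers of explanation, giving non-expert users a transparent and accessible detection tool.

Details

Motivation: Deepfake technology poses a serious threat to digital integrity, but most existing detection systems are black-box models whose lack of interpretability limits their usefulness for non-expert users.

Result: The framework shows competitive detection performance and high-quality explanations on the DF40 dataset.

Insight: Unifying prediction and explanation in one pipeline offers a viable path to trustworthy and transparent AI systems in adversarial media environments.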

Abstract: The proliferation of deepfake technologies poses urgent challenges and serious risks to digital integrity, particularly within critical sectors such as forensics, journalism, and the legal system. While existing detection systems have made significant progress in classification accuracy, they typically function as black-box models, offering limited transparency and minimal support for human reasoning. This lack of interpretability hinders their usability in real-world decision-making contexts, especially for non-expert users. In this paper, we present DF-P2E (Deepfake: Prediction to Explanation), a novel multimodal framework that integrates visual, semantic, and narrative layers of explanation to make deepfake detection interpretable and accessible. The framework consists of three modular components: (1) a deepfake classifier with Grad-CAM-based saliency visualisation, (2) a visual captioning module that generates natural language summaries of manipulated regions, and (3) a narrative refinement module that uses a fine-tuned Large Language Model (LLM) to produce context-aware, user-sensitive explanations. We instantiate and evaluate the framework on the DF40 benchmark, the most diverse deepfake dataset to date. Experiments demonstrate that our system achieves competitive detection performance while providing high-quality explanations aligned with Grad-CAM activations. By unifying prediction and explanation in a coherent, human-aligned pipeline, this work offers a scalable approach to interpretable deepfake detection, advancing the broader vision of trustworthy and transparent AI systems in adversarial media environments.

[102] ShoulderShot: Generating Over-the-Shoulder Dialogue Videos cs.CV | cs.AI

Yuang Zhang, Junqi Cheng, Haoyu Zhao, Jiaxi Gu, Fangyuan Zou

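TL;DR: ShoulderShot is a framework for generating over-the-shoulder dialogue videos. It uses dual-shot generation and looping video to address spatial continuity and long-dialogue generation, and it clearly outperforms existing methods.

Details

Motivation: Over-the-shoulder dialogue videos are widely used in film and advertising but are little studied in video generation, where they face challenges in character consistency, spatial continuity, and long-dialogue generation.

Result: ShoulderShot surpasses existing methods in shot layout, spatial continuity, and dialogue length, opening new possibilities for practical dialogue video generation.

Insight: Generating over-the-shoulder dialogue videos requires attention to character consistency and spatial continuity; dual-shot generation and looping video are effective solutions.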

Abstract: Over-the-shoulder dialogue videos are essential in films, short dramas, and advertisements, providing visual variety and enhancing viewers’ emotional connection. Despite their importance, such dialogue scenes remain largely underexplored in video generation research. The main challenges include maintaining character consistency across different shots, creating a sense of spatial continuity, and generating long, multi-turn dialogues within limited computational budgets. Here, we present ShoulderShot, a framework that combines dual-shot generation with looping video, enabling extended dialogues while preserving character consistency. Our results demonstrate capabilities that surpass existing methods in terms of shot-reverse-shot layout, spatial continuity, and flexibility in dialogue length, thereby opening up new possibilities for practical dialogue video generation. Videos and comparisons are available at https://shouldershot.github.io.

[103] LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation cs.CV

Wenhui Song, Hanhui Li, Jiehui Huang, Panwen Hu, Yuhao Cheng

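TL;DR: LaVieID proposes a local autoregressive diffusion transformer framework for identity-preserving text-to-video generation, using a local router and a temporal autoregressive module to improve the consistency of identity information.

Details

Motivation: Existing diffusion transformers (DiTs) tend to lose identity information during video generation; LaVieID aims to address this through spatial and temporal improvements.

Result: LaVieID generates high-fidelity personalized videos and achieves state-of-the-art performance.

Insight: Local and temporal modeling are crucial for preserving identity information in video generation.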

Abstract: In this paper, we present LaVieID, a novel local autoregressive video diffusion framework designed to tackle the challenging identity-preserving text-to-video task. The key idea of LaVieID is to mitigate the loss of identity information inherent in the stochastic global generation process of diffusion transformers (DiTs) from both spatial and temporal perspectives. Specifically, unlike the global and unstructured modeling of facial latent states in existing DiTs, LaVieID introduces a local router to explicitly represent latent states by weighted combinations of fine-grained local facial structures. This alleviates undesirable feature interference and encourages DiTs to capture distinctive facial characteristics. Furthermore, a temporal autoregressive module is integrated into LaVieID to refine denoised latent tokens before video decoding. This module divides latent tokens temporally into chunks, exploiting their long-range temporal dependencies to predict biases for rectifying tokens, thereby significantly enhancing inter-frame identity consistency. Consequently, LaVieID can generate high-fidelity personalized videos and achieve state-of-the-art performance. Our code and models are available at https://github.com/ssugarwh/LaVieID.

[104] X2Edit: Revisiting Arbitrary-Instruction Image Editing through Self-Constructed Data and Task-Aware Representation Learning cs.CV

Jian Ma, Xujie Zhu, Zihao Pan, Qirong Peng, Xu Guo

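TL;DR: The paper presents the X2Edit dataset and a task-aware representation learning method to improve arbitrary-instruction image editing.

Details

Motivation: Existing open-source datasets are suboptimal for arbitrary-instruction image editing, and there is no plug-and-play editing module compatible with the generative models prevalent in the community.

Result: Experiments show that the model's editing performance is strong and that the X2Edit dataset has advantages over existing datasets.

Insight: High-quality datasets and task-aware representation learning are essential for improving image editing.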

Abstract: Existing open-source datasets for arbitrary-instruction image editing remain suboptimal, while a plug-and-play editing module compatible with community-prevalent generative models is notably absent. In this paper, we first introduce the X2Edit Dataset, a comprehensive dataset covering 14 diverse editing tasks, including subject-driven generation. We utilize industry-leading unified image generation models and expert models to construct the data. Meanwhile, we design reasonable editing instructions with the VLM and implement various scoring mechanisms to filter the data. As a result, we construct 3.7 million high-quality samples with balanced categories. Second, to integrate seamlessly with community image generation models, we design task-aware MoE-LoRA training based on FLUX.1, with only 8% of the parameters of the full model. To further improve the final performance, we utilize the internal representations of the diffusion model and define positive/negative samples based on image editing types to introduce contrastive learning. Extensive experiments demonstrate that the model’s editing performance is competitive among many excellent models. Additionally, the constructed dataset exhibits substantial advantages over existing open-source datasets. The open-source code, checkpoints, and datasets for X2Edit can be found at the following link: https://github.com/OPPO-Mente-Lab/X2Edit.

[105] SOFA: Deep Learning Framework for Simulating and Optimizing Atrial Fibrillation Ablation cs.CV | cs.AI

Yunsung Chung, Chanho Lim, Ghassan Bidaoui, Christian Massad, Nassir Marrouche

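TL;DR: SOFA is a deep learning framework that simulates and optimizes atrial fibrillation ablation, predicting post-procedure recurrence risk and proposing optimized ablation parameters.

Details

Motivation: Outcomes of atrial fibrillation ablation depend on many complex factors, and existing methods struggle to evaluate and improve procedural efficacy accurately.

Result: SOFA accurately synthesizes post-ablation images, and its optimization scheme reduces the model-predicted recurrence risk by 22.18%.

Insight: SOFA offers a new tool for personalized atrial fibrillation ablation and may improve procedural outcomes.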

Abstract: Atrial fibrillation (AF) is a prevalent cardiac arrhythmia often treated with catheter ablation procedures, but procedural outcomes are highly variable. Evaluating and improving ablation efficacy is challenging due to the complex interaction between patient-specific tissue and procedural factors. This paper asks two questions: Can AF recurrence be predicted by simulating the effects of procedural parameters? How should we ablate to reduce AF recurrence? We propose SOFA (Simulating and Optimizing Atrial Fibrillation Ablation), a novel deep-learning framework that addresses these questions. SOFA first simulates the outcome of an ablation strategy by generating a post-ablation image depicting scar formation, conditioned on a patient’s pre-ablation LGE-MRI and the specific procedural parameters used (e.g., ablation locations, duration, temperature, power, and force). During this simulation, it predicts AF recurrence risk. Critically, SOFA then introduces an optimization scheme that refines these procedural parameters to minimize the predicted risk. Our method leverages a multi-modal, multi-view generator that processes 2.5D representations of the atrium. Quantitative evaluations show that SOFA accurately synthesizes post-ablation images and that our optimization scheme leads to a 22.18% reduction in the model-predicted recurrence risk. To the best of our knowledge, SOFA is the first framework to integrate the simulation of procedural effects, recurrence prediction, and parameter optimization, offering a novel tool for personalizing AF ablation.

[106] Enhancing Egocentric Object Detection in Static Environments using Graph-based Spatial Anomaly Detection and Correction cs.CV

Vishakha Lall, Yisi Liu

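TL;DR: The paper proposes a graph neural network-based framework for spatial anomaly detection and correction in egocentric object detection in static environments, improving detection by modeling the spatial relationships between objects.

Details

Motivation: Existing object detection models fail to exploit prior knowledge of the spatial layout in static environments, leading to inconsistent or incorrect detections.

Result: Experiments show that, used as a post-processing module, the method improves mAP@50 by up to 4%.

Insight: The spatial structure of a static environment can serve as prior knowledge that significantly improves the reliability of object detection systems.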

Abstract: In many real-world applications involving static environments, the spatial layout of objects remains consistent across instances. However, state-of-the-art object detection models often fail to leverage this spatial prior, resulting in inconsistent predictions, missed detections, or misclassifications, particularly in cluttered or occluded scenes. In this work, we propose a graph-based post-processing pipeline that explicitly models the spatial relationships between objects to correct detection anomalies in egocentric frames. Using a graph neural network (GNN) trained on manually annotated data, our model identifies invalid object class labels and predicts corrected class labels based on their neighbourhood context. We evaluate our approach both as a standalone anomaly detection and correction framework and as a post-processing module for standard object detectors such as YOLOv7 and RT-DETR. Experiments demonstrate that incorporating this spatial reasoning significantly improves detection performance, with mAP@50 gains of up to 4%. This method highlights the potential of leveraging the environment’s spatial structure to improve reliability in object detection systems.

[107] A Trustworthy Method for Multimodal Emotion Recognition cs.CV

Junxiao Xue, Xiaozhen Liu, Jie Wang, Xuecheng Wu, Bin Wu

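TL;DR: This paper proposes a new method called trusted emotion recognition (TER), which uses uncertainty estimation and confidence values to make reliable predictions on multimodal data, and introduces new evaluation criteria.

Details

Motivation: Existing emotion recognition methods usually improve performance with complex deep models but neglect prediction reliability, especially for noisy, corrupted, and out-of-distribution data. This paper aims to address that gap.

Result: TER reaches 82.40% accuracy on the Music-video dataset and trusted F1 scores of 0.7511 and 0.9035 on the IEMOCAP and Music-video datasets, respectively.

Insight: By introducing uncertainty estimation and confidence values, TER improves reliability while maintaining high performance, which supports practical deployment of emotion recognition.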

Abstract: Existing emotion recognition methods mainly focus on enhancing performance by employing complex deep models, typically resulting in significantly higher model complexity. Although such models are effective, it is also crucial to ensure the reliability of the final decision, especially for noisy, corrupted and out-of-distribution data. To this end, we propose a novel emotion recognition method called trusted emotion recognition (TER), which utilizes uncertainty estimation to calculate the confidence value of predictions. TER combines the results from multiple modalities based on their confidence values to output the trusted predictions. We also provide a new evaluation criterion to assess the reliability of predictions. Specifically, we incorporate trusted precision and trusted recall to determine the trusted threshold and formulate the trusted Acc. and trusted F1 score to evaluate the model’s trusted performance. The proposed framework incorporates a confidence module that endows the model with reliability and robustness against possible noise or corruption. The extensive experimental results validate the effectiveness of our proposed model. TER achieves state-of-the-art performance on the Music-video dataset, with 82.40% Acc. In terms of trusted performance, TER outperforms other methods on the IEMOCAP and Music-video datasets, achieving trusted F1 scores of 0.7511 and 0.9035, respectively.
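
Illustrative sketch (not from the paper): the abstract says TER combines per-modality results according to their confidence values but does not give the fusion rule. One simple possibility is a confidence-weighted average of class probabilities. The function name `fuse_by_confidence` and the weighting rule are assumptions made for illustration only.

```python
def fuse_by_confidence(probs_per_modality, confidences):
    """Confidence-weighted average of per-modality class probabilities.

    probs_per_modality: one probability vector per modality (equal lengths).
    confidences: one non-negative confidence value per modality.
    NOTE: hypothetical fusion rule; the paper's actual rule may differ.
    """
    total = sum(confidences)
    if total <= 0:
        raise ValueError("at least one modality needs positive confidence")
    n_classes = len(probs_per_modality[0])
    return [
        sum(c * p[k] for p, c in zip(probs_per_modality, confidences)) / total
        for k in range(n_classes)
    ]


# Example: the first modality is trusted three times as much as the second.
fused = fuse_by_confidence([[0.8, 0.2], [0.4, 0.6]], [3.0, 1.0])
```

Under this rule a low-confidence (for example noisy or corrupted) modality contributes little to the fused prediction, which is the behaviour the abstract describes at a high level.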

[108] AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning cs.CV | cs.RO

Dejie Yang, Zijing Zhao, Yang Liu

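TL;DR: AR-VRM is a visual robot manipulation method that imitates human actions through analogical reasoning. It uses a keypoint vision-language model to learn human action knowledge explicitly and significantly outperforms existing methods when robot data is scarce.

Details

Motivation: Visual robot manipulation (VRM) requires multimodal data, but robot data is costly to collect. Existing methods either use web data that does not match robotic tasks or train models implicitly, which limits generalization.

Result: In low-data settings, AR-VRM significantly outperforms existing methods, demonstrating the effectiveness of explicitly imitating human actions.

Insight: Focusing explicitly on action keypoints rather than irrelevant visual cues makes more efficient use of human action knowledge and mitigates the shortage of robot data.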

Abstract: Visual Robot Manipulation (VRM) aims to enable a robot to follow natural language instructions based on robot states and visual observations, and therefore requires costly multi-modal data. To compensate for the deficiency of robot data, existing approaches have employed vision-language pretraining with large-scale data. However, they either utilize web data that differs from robotic tasks, or train the model in an implicit way (e.g., predicting future frames at the pixel level), thus showing limited generalization ability under insufficient robot data. In this paper, we propose to learn from large-scale human action video datasets in an explicit way (i.e., imitating human actions from hand keypoints), introducing Visual Robot Manipulation with Analogical Reasoning (AR-VRM). To acquire action knowledge explicitly from human action videos, we propose a keypoint Vision-Language Model (VLM) pretraining scheme, enabling the VLM to learn human action knowledge and directly predict human hand keypoints. During fine-tuning on robot data, to facilitate the robotic arm in imitating the action patterns of human motions, we first retrieve human action videos that perform similar manipulation tasks and have similar historical observations, and then learn the Analogical Reasoning (AR) map between human hand keypoints and robot components. Taking advantage of focusing on action keypoints instead of irrelevant visual cues, our method achieves leading performance on the CALVIN benchmark and real-world experiments. In few-shot scenarios, our AR-VRM outperforms previous methods by large margins, underscoring the effectiveness of explicitly imitating human actions under data scarcity.

[109] Collaborative Learning of Scattering and Deep Features for SAR Target Recognition with Noisy Labels cs.CV

Yimin Fu, Zhunga Liu, Dongxiu Guo, Longfei Wang

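TL;DR: This paper proposes collaborative learning of scattering and deep features (CLSDF) to handle noisy labels in synthetic aperture radar (SAR) target recognition, achieving noise robustness through multi-model feature fusion and semi-supervised learning.

Details

Motivation: Labels for high-quality SAR data are hard to obtain and prone to noise. Existing research mainly targets ordinary image data and cannot directly accommodate the non-intuitive visual characteristics of SAR data.

Result: Experiments on the MSTAR dataset show state-of-the-art performance under different noise conditions.

Insight: The physical characteristics captured by scattering features significantly enrich deep feature representations, and the joint distribution alignment strategy is key to handling noisy labels reliably.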

Abstract: The acquisition of high-quality labeled synthetic aperture radar (SAR) data is challenging due to the demanding requirement for expert knowledge. Consequently, the presence of unreliable noisy labels is unavoidable, which results in performance degradation of SAR automatic target recognition (ATR). Existing research on learning with noisy labels mainly focuses on image data. However, the non-intuitive visual characteristics of SAR data are insufficient to achieve noise-robust learning. To address this problem, we propose collaborative learning of scattering and deep features (CLSDF) for SAR ATR with noisy labels. Specifically, a multi-model feature fusion framework is designed to integrate scattering and deep features. The attributed scattering centers (ASCs) are treated as dynamic graph structure data, and the extracted physical characteristics effectively enrich the representation of deep image features. Then, the samples with clean and noisy labels are divided by modeling the loss distribution with multiple class-wise Gaussian Mixture Models (GMMs). Afterward, the semi-supervised learning of two divergent branches is conducted based on the data divided by each other. Moreover, a joint distribution alignment strategy is introduced to enhance the reliability of co-guessed labels. Extensive experiments have been done on the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset, and the results show that the proposed method can achieve state-of-the-art performance under different operating conditions with various label noises.

[110] TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding cs.CV | cs.AI

Chaohong Guo, Xun Mo, Yongwei Nie, Xuemiao Xu, Chao Xu

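TL;DR: TAR-TVG proposes a novel temporal video grounding framework that introduces timestamp anchors to explicitly constrain the reasoning process and ensure the quality of the reasoning content, combined with a self-distillation training strategy to improve performance.

Details

Motivation: Existing reinforcement learning methods for temporal video grounding encourage models to generate reasoning chains but lack explicit constraints on the reasoning process, which makes the quality of the final prediction unstable. To address this, the paper uses timestamp anchors as intermediate verification points to make the reasoning process more controllable.

Result: Experiments show that TAR-TVG achieves state-of-the-art performance on temporal video grounding while producing interpretable, progressively refined reasoning chains.

Insight: As an intermediate supervision signal, timestamp anchors improve reasoning quality and also strengthen interpretability and reliability. The self-distillation strategy effectively addresses the low probability of anchor generation.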

Abstract: Temporal Video Grounding (TVG) aims to precisely localize video segments corresponding to natural language queries, which is a critical capability for long-form video understanding. Although existing reinforcement learning approaches encourage models to generate reasoning chains before predictions, they fail to explicitly constrain the reasoning process to ensure the quality of the final temporal predictions. To address this limitation, we propose Timestamp Anchor-constrained Reasoning for Temporal Video Grounding (TAR-TVG), a novel framework that introduces timestamp anchors within the reasoning process to enforce explicit supervision to the thought content. These anchors serve as intermediate verification points. More importantly, we require each reasoning step to produce increasingly accurate temporal estimations, thereby ensuring that the reasoning process contributes meaningfully to the final prediction. To address the challenge of low-probability anchor generation in models (e.g., Qwen2.5-VL-3B), we develop an efficient self-distillation training strategy: (1) initial GRPO training to collect 30K high-quality reasoning traces containing multiple timestamp anchors, (2) supervised fine-tuning (SFT) on distilled data, and (3) final GRPO optimization on the SFT-enhanced model. This three-stage training strategy enables robust anchor generation while maintaining reasoning quality. Experiments show that our model achieves state-of-the-art performance while producing interpretable, verifiable reasoning chains with progressively refined temporal estimations.
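
Illustrative sketch (not from the paper): the abstract requires each reasoning step to produce increasingly accurate temporal estimates. One way to make that concrete is to score every timestamp anchor by temporal IoU against the ground-truth segment and check that the scores never decrease. The interval format `(start, end)` in seconds, the function names, and the use of IoU as the accuracy measure are assumptions for illustration; the paper's actual constraint or reward may be defined differently.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def anchors_progressively_refine(anchors, gt):
    """True if each successive anchor is at least as accurate as the previous one.

    anchors: list of (start, end) estimates in the order they appear in the
    reasoning chain. Accuracy is measured by temporal IoU (an assumption).
    """
    ious = [temporal_iou(a, gt) for a in anchors]
    return all(later >= earlier for earlier, later in zip(ious, ious[1:]))


# Example: three anchors that tighten around the ground-truth segment (5 s to 12 s).
ok = anchors_progressively_refine([(0, 20), (4, 14), (5, 12)], gt=(5, 12))
```

A check of this kind could serve as a verification signal for intermediate anchors, in the spirit of the "intermediate verification points" described above.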

[111] Make Your MoVe: Make Your 3D Contents by Adapting Multi-View Diffusion Models to External Editing cs.CV

Weitao Wang, Haoran Xu, Jun Meng, Haoqian Wang

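TL;DR: This paper proposes a tuning-free, plug-and-play scheme that improves multi-view editing of 3D content by preserving geometric consistency. The method uses the normal latents of the original input together with multi-view diffusion models, significantly improving the quality and multi-view consistency of edited 3D assets.

Details

Motivation: With the rapid development of 3D generation, demand for personalized content is growing. However, existing editing tools mostly focus on the 2D domain, and feeding their results directly into 3D generation methods causes information loss that degrades the final 3D assets.

Result: Experiments show that, across multiple combinations of multi-view diffusion models and editing methods, the approach significantly improves the multi-view consistency and mesh quality of edited 3D assets.

Insight: The novelty lies in geometric consistency constraints and a dynamic supervision mechanism, which effectively address the information loss between 2D editing and 3D generation and provide an efficient, reliable solution for 3D content editing.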

Abstract: As 3D generation techniques continue to flourish, the demand for generating personalized content is rapidly rising. Users increasingly seek to apply various editing methods to polish generated 3D content, aiming to enhance its color, style, and lighting without compromising the underlying geometry. However, most existing editing tools focus on the 2D domain, and directly feeding their results into 3D generation methods (like multi-view diffusion models) will introduce information loss, degrading the quality of the final 3D assets. In this paper, we propose a tuning-free, plug-and-play scheme that aligns edited assets with their original geometry in a single inference run. Central to our approach is a geometry preservation module that guides the edited multi-view generation with original input normal latents. Besides, an injection switcher is proposed to deliberately control the supervision extent of the original normals, ensuring the alignment between the edited color and normal views. Extensive experiments show that our method consistently improves both the multi-view consistency and mesh quality of edited 3D assets, across multiple combinations of multi-view diffusion models and editing methods.

[112] DoorDet: Semi-Automated Multi-Class Door Detection Dataset via Object Detection and Large Language Models cs.CV | cs.AI | cs.ET

Licheng Zhang, Bach Le, Naveed Akhtar, Tuan Ngo

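TL;DR: The paper proposes a semi-automated method for constructing a multi-class door detection dataset that combines object detection with a large language model, significantly reducing annotation cost.

Details

Motivation: Public datasets for fine-grained multi-class door detection are scarce, yet accurate door detection is essential for building compliance checking and indoor scene understanding.

Result: The method significantly reduces annotation cost while producing a high-quality dataset suitable for benchmarking models in complex domains.

Insight: The work demonstrates the potential of combining deep learning and multimodal reasoning for efficient dataset construction in complex real-world domains.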

Abstract: Accurate detection and classification of diverse door types in floor plan drawings is critical for multiple applications, such as building compliance checking and indoor scene understanding. Despite their importance, publicly available datasets specifically designed for fine-grained multi-class door detection remain scarce. In this work, we present a semi-automated pipeline that leverages a state-of-the-art object detector and a large language model (LLM) to construct a multi-class door detection dataset with minimal manual effort. Doors are first detected as a unified category using a deep object detection model. Next, an LLM classifies each detected instance based on its visual and contextual features. Finally, a human-in-the-loop stage ensures high-quality labels and bounding boxes. Our method significantly reduces annotation cost while producing a dataset suitable for benchmarking neural models in floor plan analysis. This work demonstrates the potential of combining deep learning and multimodal reasoning for efficient dataset construction in complex real-world domains.

[113] Enhancing Small-Scale Dataset Expansion with Triplet-Connection-based Sample Re-Weighting cs.CV

Ting Xiang, Changjian Chen, Zhuo Tang, Qifeng Zhang, Fei Lyu

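TL;DR: The paper proposes TriReWeight, a triplet-connection-based sample re-weighting method. Supported by theoretical analysis and experiments, it improves generative data augmentation and performs especially well on small datasets.

Details

Motivation: Some real-world computer vision tasks, such as medical diagnosis, are limited by scarce image data. Expanding datasets with generative models tends to produce noisy images, which conventional methods struggle to handle.

Result: Experiments show that TriReWeight yields an average improvement of 7.9% on six natural image datasets and 3.4% on three medical datasets, and that it is compatible with a range of generative data augmentation methods.

Insight: A triplet-connection-based re-weighting mechanism effectively reduces the impact of generated noise and provides theoretical support for small-scale dataset expansion.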

Abstract: The performance of computer vision models in certain real-world applications, such as medical diagnosis, is often limited by the scarcity of available images. Expanding datasets using pre-trained generative models is an effective solution. However, due to the uncontrollable generation process and the ambiguity of natural language, noisy images may be generated. Re-weighting is an effective way to address this issue by assigning low weights to such noisy images. We first theoretically analyze three types of supervision for the generated images. Based on the theoretical analysis, we develop TriReWeight, a triplet-connection-based sample re-weighting method to enhance generative data augmentation. Theoretically, TriReWeight can be integrated with any generative data augmentation method and never degrades its performance. Moreover, its generalization approaches the optimal in the order $O(\sqrt{d\ln (n)/n})$. Our experiments validate the correctness of the theoretical analysis and demonstrate that our method outperforms the existing SOTA methods by $7.9\%$ on average over six natural image datasets and by $3.4\%$ on average over three medical datasets. We also experimentally validate that our method can enhance the performance of different generative data augmentation methods.
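
Illustrative sketch (not from the paper): to get a feel for the rate $O(\sqrt{d\ln (n)/n})$ quoted above, the snippet below evaluates the expression inside the big-O for a few values. The abstract does not define $d$ and $n$; here they are assumed to be a dimension or complexity term and the number of samples, respectively, and the hidden constant is ignored. The function name `bound_rate` is hypothetical.

```python
import math


def bound_rate(d, n):
    """Value of sqrt(d * ln(n) / n), the term inside the O(.) rate.

    d: assumed dimension/complexity term; n: assumed sample count (n > 1).
    Constants hidden by the big-O notation are not included.
    """
    return math.sqrt(d * math.log(n) / n)


# The term shrinks as n grows for fixed d, and grows with d for fixed n.
rates = [bound_rate(10, n) for n in (1_000, 10_000, 100_000)]
```

Because $\ln n$ grows much more slowly than $n$, the term decays roughly like $1/\sqrt{n}$ up to a logarithmic factor.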

[114] Comparison Reveals Commonality: Customized Image Generation through Contrastive Inversion cs.CV

Minseo Kim, Minchan Kwon, Dongyeun Lee, Yunho Jeon, Junmo Kim

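TL;DR: The paper proposes Contrastive Inversion, a customized image generation method that needs no additional guidance. It uses contrastive learning to extract the concept shared by the input images and improves generation quality.

Details

Motivation: Current customized image generation methods rely on additional guidance such as text prompts or spatial masks, which can leave auxiliary features incompletely separated and degrade generation quality.

Result: The method achieves balanced, high performance in both concept representation and editing, outperforming existing techniques.

Insight: Contrastive learning can disentangle the semantics of the target concept more effectively, without relying on additional guidance.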

Abstract: The recent demand for customized image generation raises a need for techniques that effectively extract the common concept from small sets of images. Existing methods typically rely on additional guidance, such as text prompts or spatial masks, to capture the common target concept. Unfortunately, relying on manually provided guidance can lead to incomplete separation of auxiliary features, which degrades generation quality. In this paper, we propose Contrastive Inversion, a novel approach that identifies the common concept by comparing the input images without relying on additional information. We train the target token along with the image-wise auxiliary text tokens via contrastive learning, which extracts the well-disentangled true semantics of the target. Then we apply disentangled cross-attention fine-tuning to improve concept fidelity without overfitting. Experimental results and analysis demonstrate that our method achieves a balanced, high-level performance in both concept representation and editing, outperforming existing techniques.

[115] Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild cs.CV

Haoran Wang, Zekun Li, Jian Zhang, Lei Qi, Yinghuan Shi

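TL;DR: The paper proposes CAV-SAM, which represents the inherent correspondence between reference-target image pairs as a pseudo video and uses the interactive video object segmentation capability of SAM2 to adapt to downstream tasks in a lightweight manner.

Details

Motivation: Large vision models such as SAM show significant limitations on downstream tasks in the wild, while existing reference segmentation methods rely on meta-learning, which requires large amounts of data and compute. A lighter and more efficient adaptation method is therefore needed.

Result: Across multiple datasets, CAV-SAM improves segmentation performance by more than 5% over SOTA methods.

Insight: Treating an image pair as a video sequence makes it possible to use a video segmentation model's ability to process dynamic information for more efficient model adaptation.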

Abstract: Large vision models like the Segment Anything Model (SAM) exhibit significant limitations when applied to downstream tasks in the wild. Consequently, reference segmentation, which leverages reference images and their corresponding masks to impart novel knowledge to the model, emerges as a promising new direction for adapting vision models. However, existing reference segmentation approaches predominantly rely on meta-learning, which still necessitates an extensive meta-training process and incurs massive data and computational costs. In this study, we propose a novel approach by representing the inherent correspondence between reference-target image pairs as a pseudo video. This perspective allows the latest version of SAM, known as SAM2, which is equipped with interactive video object segmentation (iVOS) capabilities, to be adapted to downstream tasks in a lightweight manner. We term this approach Correspondence As Video for SAM (CAV-SAM). CAV-SAM comprises two key modules: the Diffusion-Based Semantic Transition (DBST) module employs a diffusion model to construct a semantic transformation sequence, while the Test-Time Geometric Alignment (TTGA) module aligns the geometric changes within this sequence through test-time fine-tuning. We evaluated CAV-SAM on widely-used datasets, achieving segmentation performance improvements exceeding 5% over SOTA methods. Implementation is provided in the supplementary materials.

[116] UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models cs.CV | cs.AI

Jinke Li, Jiarui Yu, Chenxing Wei, Hande Dong, Qiang Lin

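TL;DR: UniSVG is a unified dataset designed for multimodal large language models (MLLMs), targeting understanding and generation of scalable vector graphics (SVG). It contains 525k data items, is the first to support SVG generation from both text prompts and images, and lifts open-source MLLMs above closed-source models such as GPT-4V.

Details

Motivation: SVG is a high-quality, scalable graphics format widely used in computer vision and artistic design. However, AI understanding and generation of SVG still faces challenges in precision and multimodal processing, which calls for new datasets and models.

Result: Experiments show that UniSVG significantly improves the SVG understanding and generation abilities of open-source MLLMs, surpassing GPT-4V on multiple tasks.

Insight: Combining a unified dataset with MLLMs can efficiently address the high-precision and multimodal demands of SVG generation, offering a new direction for AI in vector graphics.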

Abstract: Unlike bitmap images, scalable vector graphics (SVG) maintain quality when scaled and are frequently employed in computer vision and artistic design, where they are represented as SVG code. In this era of proliferating AI-powered systems, enabling AI to understand and generate SVG has become increasingly urgent. However, AI-driven SVG understanding and generation (U&G) remain significant challenges. SVG code, equivalent to a set of curves and lines controlled by floating-point parameters, demands high precision in SVG U&G. Besides, SVG generation operates under diverse conditional constraints, including textual prompts and visual references, which requires powerful multi-modal processing for condition-to-SVG transformation. Recently, rapidly developing Multi-modal Large Language Models (MLLMs) have demonstrated capabilities to process multi-modal inputs and generate complex vector controlling parameters, suggesting the potential to address SVG U&G tasks within a unified model. To unlock MLLMs’ capabilities in the SVG area, we propose an SVG-centric dataset called UniSVG, comprising 525k data items, tailored for MLLM training and evaluation. To the best of our knowledge, it is the first comprehensive dataset designed for unified SVG generation (from textual prompts and images) and SVG understanding (color, category, usage, etc.). As expected, learning on the proposed dataset boosts open-source MLLMs’ performance on various SVG U&G tasks, surpassing SOTA closed-source MLLMs like GPT-4V. We release the dataset, benchmark, weights, code, and experiment details at https://ryanlijinke.github.io/.

[117] Dream4D: Lifting Camera-Controlled I2V towards Spatiotemporally Consistent 4D Generation cs.CV

Xiaoyan Liu, Kangrui Li, Jiaxin Liu

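TL;DR: Dream4D is a novel framework that combines controllable video generation with neural 4D reconstruction to tackle the generation of spatiotemporally consistent 4D content, improving view consistency and the handling of dynamic scenes.

Details

Motivation: Current methods struggle to maintain view consistency and handle complex dynamic scenes, especially in large-scale environments. Dream4D aims to fill this gap by combining the strengths of video generation and geometric reconstruction to produce high-quality 4D content.

Result: Dream4D outperforms existing methods on quality metrics such as mPSNR and mSSIM, showing higher generation quality.

Insight: By fusing temporal priors with geometric awareness, Dream4D offers a new approach to 4D content generation that performs particularly well on complex dynamic scenes.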

Abstract: The synthesis of spatiotemporally coherent 4D content presents fundamental challenges in computer vision, requiring simultaneous modeling of high-fidelity spatial representations and physically plausible temporal dynamics. Current approaches often struggle to maintain view consistency while handling complex scene dynamics, particularly in large-scale environments with multiple interacting elements. This work introduces Dream4D, a novel framework that bridges this gap through a synergy of controllable video generation and neural 4D reconstruction. Our approach seamlessly combines a two-stage architecture: it first predicts optimal camera trajectories from a single image using few-shot learning, then generates geometrically consistent multi-view sequences via a specialized pose-conditioned diffusion process, which are finally converted into a persistent 4D representation. This framework is the first to leverage both rich temporal priors from video diffusion models and geometric awareness of the reconstruction models, which significantly facilitates 4D generation and shows higher quality (e.g., mPSNR, mSSIM) over existing methods.

[118] Prototype-Guided Curriculum Learning for Zero-Shot Learning cs.CV

Lei Wang, Shiming Chen, Guo-Sen Xie, Ziming Hong, Chaojian Yu

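TL;DR: This paper proposes a prototype-guided curriculum learning framework (CLZSL). A Prototype-Guided Curriculum Learning (PCL) module mitigates instance-level mismatch and a Prototype Update (PUP) module addresses class-level imprecision, improving knowledge transfer in zero-shot learning.

Details

Motivation: In conventional zero-shot learning, embedding-based methods rely on manually defined semantic prototypes for knowledge transfer. Instance-level mismatch (e.g., viewpoint changes, occlusion, and annotation bias) and class-level imprecision (prototypes that do not accurately reflect class semantics) introduce noisy supervision and harm the accuracy of the visual-semantic mapping.

Result: Experiments on the standard AWA2, SUN, and CUB datasets verify the effectiveness of the method.

Insight: Curriculum learning and dynamic prototype updating are key to better zero-shot learning; they mitigate noisy supervision and improve visual-semantic alignment.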

Abstract: In Zero-Shot Learning (ZSL), embedding-based methods enable knowledge transfer from seen to unseen classes by learning a visual-semantic mapping from seen-class images to class-level semantic prototypes (e.g., attributes). However, these semantic prototypes are manually defined and may introduce noisy supervision for two main reasons: (i) instance-level mismatch: variations in perspective, occlusion, and annotation bias will cause discrepancies between individual samples and the class-level semantic prototypes; and (ii) class-level imprecision: the manually defined semantic prototypes may not accurately reflect the true semantics of the class. Consequently, the visual-semantic mapping will be misled, reducing the effectiveness of knowledge transfer to unseen classes. In this work, we propose a prototype-guided curriculum learning framework (dubbed CLZSL), which mitigates instance-level mismatches through a Prototype-Guided Curriculum Learning (PCL) module and addresses class-level imprecision via a Prototype Update (PUP) module. Specifically, the PCL module prioritizes samples with high cosine similarity between their visual mappings and the class-level semantic prototypes, and progressively advances to less-aligned samples, thereby reducing the interference of instance-level mismatches to achieve accurate visual-semantic mapping. Besides, the PUP module dynamically updates the class-level semantic prototypes by leveraging the visual mappings learned from instances, thereby reducing class-level imprecision and further improving the visual-semantic mapping. Experiments were conducted on standard benchmark datasets (AWA2, SUN, and CUB) to verify the effectiveness of our method.
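
Illustrative sketch (not from the paper): the two modules described above can be pictured as (a) ranking samples by the cosine similarity between their visual mapping and the class prototype, training on the best-aligned fraction first, and (b) nudging the prototype toward the mapped features. The function names, the `keep_ratio` schedule parameter, and the exponential-moving-average update are assumptions chosen for illustration; the abstract does not specify the actual selection schedule or update rule.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def curriculum_subset(mapped, prototype, keep_ratio):
    """Indices of the best-aligned samples, most similar first.

    mapped: visual mappings of one class's samples in semantic space.
    keep_ratio: fraction of samples admitted at the current curriculum stage
    (hypothetical parameter; raise it over training to admit harder samples).
    """
    order = sorted(range(len(mapped)),
                   key=lambda i: cosine(mapped[i], prototype),
                   reverse=True)
    k = max(1, math.ceil(keep_ratio * len(order)))
    return order[:k]


def update_prototype(prototype, mapped, momentum=0.9):
    """Move the prototype toward the mean mapped feature (EMA; assumed rule)."""
    mean = [sum(col) / len(mapped) for col in zip(*mapped)]
    return [momentum * p + (1 - momentum) * m for p, m in zip(prototype, mean)]


# Example: early in training only the two best-aligned samples are used.
samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
early = curriculum_subset(samples, prototype=[1.0, 0.0], keep_ratio=0.5)
```

Raising `keep_ratio` toward 1.0 over training admits the less-aligned samples later, which mirrors the easy-to-hard progression the abstract describes.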

[119] Forecasting Continuous Non-Conservative Dynamical Systems in SO(3) cs.CV

Lennart Bastian, Mohammad Rashed, Nassir Navab, Tolga Birdal

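TL;DR: The paper proposes a method based on neural controlled differential equations for modeling and forecasting the rotational trajectories of non-conservative dynamical systems on the SO(3) manifold.

Details

Motivation: Rotation modeling is essential in computer vision, but existing methods assume energy conservation or constant velocity and are hard to apply to real-world non-conservative systems.

Result: The method shows robust forecasting in simulation and real-world settings and applies to systems with unknown physical parameters.

Insight: Modeling in a physically and geometrically meaningful way improves rotation forecasting for non-conservative dynamical systems.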

Abstract: Modeling the rotation of moving objects is a fundamental task in computer vision, yet $SO(3)$ extrapolation still presents numerous challenges: (1) unknown quantities such as the moment of inertia complicate dynamics, (2) the presence of external forces and torques can lead to non-conservative kinematics, and (3) estimating evolving state trajectories under sparse, noisy observations requires robustness. We propose modeling trajectories of noisy pose estimates on the manifold of 3D rotations in a physically and geometrically meaningful way by leveraging Neural Controlled Differential Equations guided with $SO(3)$ Savitzky-Golay paths. Existing extrapolation methods often rely on energy conservation or constant velocity assumptions, limiting their applicability in real-world scenarios involving non-conservative forces. In contrast, our approach is agnostic to energy and momentum conservation while being robust to input noise, making it applicable to complex, non-inertial systems. Our approach is easily integrated as a module in existing pipelines and generalizes well to trajectories with unknown physical parameters. By learning to approximate object dynamics from noisy states during training, our model attains robust extrapolation capabilities in simulation and various real-world settings. Code is available at https://github.com/bastianlb/forecasting-rotational-dynamics

[120] Power Battery Detection cs.CV

Xiaoqi Zhao, Peiqian Cao, Lihe Zhang, Zonglei Feng, Hanqi Liu

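TL;DR: The paper introduces the power battery detection (PBD) task and addresses the challenge of localizing dense plates in X-ray images by building the PBD5K benchmark and developing the MDCNeXt model.

Details

Motivation: Internal structural defects in power batteries can cause serious safety problems. Traditional vision algorithms and manual inspection are inefficient and error-prone, so an automated solution is needed.

Result: The proposed method performs well on the PBD task, effectively localizing plate endpoints and suppressing visual interference.

Insight: Task-specific prompts and the integration of multi-dimensional information are key to dense plate detection; an intelligent annotation pipeline can improve data quality.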

Abstract: Power batteries are essential components in electric vehicles, where internal structural defects can pose serious safety risks. We conduct a comprehensive study on a new task, power battery detection (PBD), which aims to localize the dense endpoints of cathode and anode plates from industrial X-ray images for quality inspection. Manual inspection is inefficient and error-prone, while traditional vision algorithms struggle with densely packed plates, low contrast, scale variation, and imaging artifacts. To address this issue and draw more attention to this meaningful task, we present PBD5K, the first large-scale benchmark for this task, consisting of 5,000 X-ray images from nine battery types with fine-grained annotations and eight types of real-world visual interference. To support scalable and consistent labeling, we develop an intelligent annotation pipeline that combines image filtering, model-assisted pre-labeling, cross-verification, and layered quality evaluation. We formulate PBD as a point-level segmentation problem and propose MDCNeXt, a model designed to extract and integrate multi-dimensional structure clues including point, line, and count information from the plate itself. To improve discrimination between plates and suppress visual interference, MDCNeXt incorporates two state space modules. The first is a prompt-filtered module that learns contrastive relationships guided by task-specific prompts. The second is a density-aware reordering module that refines segmentation in regions with high plate density. In addition, we propose a distance-adaptive mask generation strategy to provide robust supervision under varying spatial distributions of anode and cathode positions. The source code and datasets (PBD5K) will be publicly available at https://github.com/Xiaoqi-Zhao-DLUT/X-ray-PBD.

[121] MambaTrans: Multimodal Fusion Image Translation via Large Language Model Priors for Downstream Visual Tasks cs.CV

Yushen Xu, Xiaosong Li, Zhenyu Kuang, Xiaoqi Cheng, Haishu Tan

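TL;DR: MambaTrans is a framework that uses priors from a large language model to translate multimodal fused images, aiming to improve performance on downstream vision tasks.

Details

Motivation: Existing downstream pre-trained models are usually trained on visible images, and the pixel distribution of multimodal fused images differs significantly from that of visible images, which degrades downstream performance. A method is needed to adapt fused images to these models.

Result: On public datasets, MambaTrans significantly improves the performance of multimodal fused images on downstream tasks without adjusting the parameters of the pre-trained models.

Insight: Combining large language model priors with the visual information in multimodal fused images can effectively address modality differences and offers a new direction for multimodal tasks.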

Abstract: The goal of multimodal image fusion is to integrate complementary information from infrared and visible images, generating multimodal fused images for downstream tasks. Existing downstream pre-training models are typically trained on visible images. However, the significant pixel distribution differences between visible and multimodal fusion images can degrade downstream task performance, sometimes even below that of using only visible images. This paper explores adapting multimodal fused images with significant modality differences to object detection and semantic segmentation models trained on visible images. To address this, we propose MambaTrans, a novel multimodal fusion image modality translator. MambaTrans uses descriptions from a multimodal large language model and masks from semantic segmentation models as input. Its core component, the Multi-Model State Space Block, combines mask-image-text cross-attention and a 3D-Selective Scan Module, enhancing pure visual capabilities. By leveraging object detection prior knowledge, MambaTrans minimizes detection loss during training and captures long-term dependencies among text, masks, and images. This enables favorable results in pre-trained models without adjusting their parameters. Experiments on public datasets show that MambaTrans effectively improves multimodal image performance in downstream tasks.

[122] Pose-RFT: Enhancing MLLMs for 3D Pose Generation via Hybrid Action Reinforcement Fine-Tuning cs.CV

Bao Li, Xiaomei Zhang, Miao Xu, Zhaoxin Fan, Xiangyu Zhu

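TL;DR: Pose-RFT proposes a hybrid action reinforcement fine-tuning framework for 3D human pose generation. By jointly optimizing discrete language prediction and continuous pose generation, it significantly improves the performance of multimodal large language models (MLLMs) on pose generation tasks.

Details

Motivation: Existing pose generation models usually use supervised objectives (such as SMPL parameter regression or token-level prediction), which struggle to model the task's inherent ambiguity and to achieve task-specific alignment. Pose-RFT addresses this with a reinforcement learning framework.

Result: On multiple pose generation benchmarks, Pose-RFT significantly outperforms existing methods, validating the effectiveness of hybrid action reinforcement fine-tuning.

Insight: A reinforcement learning framework can better model the ambiguity in pose generation, and task-specific reward design is essential to the performance gains.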

Abstract: Generating 3D human poses from multimodal inputs such as images or text requires models to capture both rich spatial and semantic correspondences. While pose-specific multimodal large language models (MLLMs) have shown promise in this task, they are typically trained with supervised objectives such as SMPL parameter regression or token-level prediction, which struggle to model the inherent ambiguity and achieve task-specific alignment required for accurate 3D pose generation. To address these limitations, we propose Pose-RFT, a reinforcement fine-tuning framework tailored for 3D human pose generation in MLLMs. We formulate the task as a hybrid action reinforcement learning problem that jointly optimizes discrete language prediction and continuous pose generation. To this end, we introduce HyGRPO, a hybrid reinforcement learning algorithm that performs group-wise reward normalization over sampled responses to guide joint optimization of discrete and continuous actions. Pose-RFT further incorporates task-specific reward functions to guide optimization towards spatial alignment in image-to-pose generation and semantic consistency in text-to-pose generation. Extensive experiments on multiple pose generation benchmarks demonstrate that Pose-RFT significantly improves performance over existing pose-specific MLLMs, validating the effectiveness of hybrid action reinforcement fine-tuning for 3D pose generation.
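
The abstract describes HyGRPO as performing group-wise reward normalization over sampled responses. The sketch below shows only that normalization step, assuming the common GRPO-style form (subtract the group mean, divide by the group standard deviation). The function name, the `eps` term, and the reward values are illustrative assumptions; how HyGRPO combines discrete and continuous actions is not shown here.

```python
from statistics import mean, pstdev

def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses sampled for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps (an assumption) avoids division by zero when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four hypothetical rewards for four sampled responses to one prompt.
advantages = group_normalized_advantages([0.2, 0.5, 0.9, 0.4])
```

Responses that score above the group mean receive positive advantages and those below receive negative ones, so the update depends on relative rather than absolute reward.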

[123] DiTVR: Zero-Shot Diffusion Transformer for Video Restoration cs.CV

Sicheng Gao, Nancy Mehta, Zongwei Wu, Radu Timofte

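TL;DR: This paper proposes DiTVR, a zero-shot video restoration framework that combines a diffusion transformer with trajectory-aware attention and achieves high-quality video restoration without paired data.

Details

Motivation: Traditional regression-based video restoration methods often produce unrealistic details and depend on large amounts of paired data, while generative diffusion models struggle to ensure temporal consistency. DiTVR aims to address these problems.

Result: DiTVR achieves zero-shot SOTA performance on video restoration benchmarks, with better temporal consistency and detail preservation.

Insight: By aligning tokens along optical-flow trajectories and injecting data consistency only into low-frequency bands, DiTVR achieves efficient and robust video restoration without paired data.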

Abstract: Video restoration aims to reconstruct high quality video sequences from low quality inputs, addressing tasks such as super resolution, denoising, and deblurring. Traditional regression based methods often produce unrealistic details and require extensive paired datasets, while recent generative diffusion models face challenges in ensuring temporal consistency. We introduce DiTVR, a zero shot video restoration framework that couples a diffusion transformer with trajectory aware attention and a wavelet guided, flow consistent sampler. Unlike prior 3D convolutional or frame wise diffusion approaches, our attention mechanism aligns tokens along optical flow trajectories, with particular emphasis on vital layers that exhibit the highest sensitivity to temporal dynamics. A spatiotemporal neighbour cache dynamically selects relevant tokens based on motion correspondences across frames. The flow guided sampler injects data consistency only into low-frequency bands, preserving high frequency priors while accelerating convergence. DiTVR establishes a new zero shot state of the art on video restoration benchmarks, demonstrating superior temporal consistency and detail preservation while remaining robust to flow noise and occlusions.

[124] Semi-supervised Multiscale Matching for SAR-Optical Image cs.CV

Jingze Gai, Changchun Li

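TL;DR: The paper proposes a semi-supervised multiscale matching method (S2M2-SAR) for matching synthetic aperture radar (SAR) and optical images, addressing the scarcity of labeled data and the complexity of manual annotation. By combining pseudo-labels with an unsupervised feature enhancement module, it surpasses existing semi-supervised methods and approaches the performance of fully supervised methods.

Details

Motivation: SAR and optical images are complementary, but existing fully supervised matching methods require large amounts of labeled data, and manual annotation is time-consuming and complex. The work therefore uses semi-supervised learning, exploiting unlabeled data and pseudo-labeling, to improve matching performance and reduce reliance on labeled data.

Result: Experiments on benchmark datasets show that S2M2-SAR not only outperforms existing semi-supervised methods but also approaches the performance of fully supervised methods, supporting its efficiency and practicality.

Insight: Semi-supervised learning shows promise for SAR-optical image matching, and pseudo-labeling and cross-modal feature disentanglement are key to the performance gains. Future work could extend the method to other multimodal image matching tasks.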

Abstract: Driven by the complementary nature of optical and synthetic aperture radar (SAR) images, SAR-optical image matching has garnered significant interest. Most existing SAR-optical image matching methods aim to capture effective matching features by employing the supervision of pixel-level matched correspondences within SAR-optical image pairs, which, however, suffers from time-consuming and complex manual annotation, making it difficult to collect sufficient labeled SAR-optical image pairs. To handle this, we design a semi-supervised SAR-optical image matching pipeline that leverages both scarce labeled and abundant unlabeled image pairs and propose a semi-supervised multiscale matching for SAR-optical image matching (S2M2-SAR). Specifically, we pseudo-label those unlabeled SAR-optical image pairs with pseudo ground-truth similarity heatmaps by combining both deep and shallow level matching results, and train the matching model by employing labeled and pseudo-labeled similarity heatmaps. In addition, we introduce a cross-modal feature enhancement module trained using a cross-modality mutual independence loss, which requires no ground-truth labels. This unsupervised objective promotes the separation of modality-shared and modality-specific features by encouraging statistical independence between them, enabling effective feature disentanglement across optical and SAR modalities. To evaluate the effectiveness of S2M2-SAR, we compare it with existing competitors on benchmark datasets. Experimental results demonstrate that S2M2-SAR not only surpasses existing semi-supervised methods but also achieves performance competitive with fully supervised SOTA methods, demonstrating its efficiency and practical potential.

[125] Segmenting and Understanding: Region-aware Semantic Attention for Fine-grained Image Quality Assessment with Large Language Models cs.CV

Chenyue Song, Chen Hui, Haiqi Zhu, Feng Jiang, Yachun Mi

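TL;DR: RSFIQA is a fine-grained image quality assessment model that integrates region-level distortion information with a multimodal large language model (MLLM) to perceive multi-dimensional quality discrepancies, and uses a Region-Aware Semantic Attention (RSA) mechanism to increase sensitivity to local quality variations.

Details

Motivation: Existing no-reference image quality assessment methods tend to overlook semantically salient regions or apply uniform weights to region features, leaving them insufficiently sensitive to local quality variations. RSFIQA aims to address this.

Result: RSFIQA achieves competitive quality prediction performance on multiple benchmark datasets.

Insight: Combining semantic segmentation with an MLLM gives a fuller understanding of both local and global quality degradation, and the RSA mechanism improves regional sensitivity and global representation.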

Abstract: No-reference image quality assessment (NR-IQA) aims to simulate the process of perceiving image quality aligned with subjective human perception. However, existing NR-IQA methods either focus on global representations that lead to limited insights into the semantically salient regions or employ a uniform weighting for region features that weakens the sensitivity to local quality variations. In this paper, we propose a fine-grained image quality assessment model, named RSFIQA, which integrates region-level distortion information to perceive multi-dimensional quality discrepancies. To enhance regional quality awareness, we first utilize the Segment Anything Model (SAM) to dynamically partition the input image into non-overlapping semantic regions. For each region, we teach a powerful Multi-modal Large Language Model (MLLM) to extract descriptive content and perceive multi-dimensional distortions, enabling a comprehensive understanding of both local semantics and quality degradations. To effectively leverage this information, we introduce a Region-Aware Semantic Attention (RSA) mechanism, which generates a global attention map by aggregating fine-grained representations from local regions. In addition, RSFIQA is backbone-agnostic and can be seamlessly integrated into various deep neural network architectures. Extensive experiments demonstrate the robustness and effectiveness of the proposed method, which achieves competitive quality prediction performance across multiple benchmark datasets.

[126] Architectural Co-Design for Zero-Shot Anomaly Detection: Decoupling Representation and Dynamically Fusing Features in CLIP cs.CV | cs.AI | cs.LG

Ke Ma, Jun Long, Hongxiao Fei, Liujie Hua, Yueyi Luo

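TL;DR: An architectural co-design framework that combines a Conv-LoRA adapter with a Dynamic Fusion Gateway (DFG) addresses the adaptation gap of pre-trained vision-language models in zero-shot anomaly detection, significantly improving detection accuracy and robustness.

Details

Motivation: Pre-trained vision-language models (VLMs) face an adaptation gap in zero-shot anomaly detection (ZSAD) because they lack local inductive biases for dense prediction and rely on inflexible feature fusion.

Result: The method shows superior accuracy and robustness on multiple industrial and medical benchmarks.

Insight: Co-designing feature representation and dynamic fusion is key to adapting foundation models to dense perception tasks.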

Abstract: Pre-trained Vision-Language Models (VLMs) face a significant adaptation gap when applied to Zero-Shot Anomaly Detection (ZSAD), stemming from their lack of local inductive biases for dense prediction and their reliance on inflexible feature fusion paradigms. We address these limitations through an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. Our method integrates a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter to inject local inductive biases for fine-grained representation, and introduces a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling a powerful bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for robustly adapting foundation models to dense perception tasks.

[127] MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization cs.CV

Animesh Jain, Alexandros Stergiou

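TL;DR: MIMIC is a multimodal inversion framework that visualizes the internal representations of vision-language models (VLMs) by synthesizing visual concepts corresponding to internal encodings, improving model transparency.

Details

Motivation: The complex architectures and hard-to-interpret internal representations of VLMs limit transparency and trust, so a method is needed to show the concepts they learn internally.

Result: Experiments show that MIMIC can invert high-quality visual concepts from varied VLM output texts, with its effectiveness assessed through both qualitative and quantitative metrics.

Insight: Inversion methods can improve the interpretability of VLMs and offer a workable way to probe the black-box nature of complex models.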

Abstract: Vision Language Models (VLMs) encode multimodal inputs over large, complex, and difficult-to-interpret architectures, which limit transparency and trust. We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework to visualize the internal representations of VLMs by synthesizing visual concepts corresponding to internal encodings. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for VLM’s autoregressive processing. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We quantitatively and qualitatively evaluate MIMIC by inverting visual concepts over a range of varying-length free-form VLM output texts. Reported results include both standard visual quality metrics as well as semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM concepts.

[128] Effortless Vision-Language Model Specialization in Histopathology without Annotation cs.CV

Jingna Qiu, Nishanth Jain, Jonas Ammeling, Marc Aubreville, Katharina Breininger

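TL;DR: The paper proposes an annotation-free method for specializing vision-language models (VLMs) through continued pretraining on domain- and task-relevant image-caption pairs, which substantially improves zero-shot and few-shot performance without manual labeling.

Details

Motivation: General-purpose VLMs may underperform on specific downstream tasks, while supervised fine-tuning requires manually labeled samples. The paper explores adaptation that needs no annotation.

Result: Experiments show substantial gains in zero-shot and few-shot performance; with larger training sizes, continued pretraining matches few-shot methods.

Insight: Annotation-free continued pretraining is an effective, task-agnostic way to adapt VLMs, particularly for computational pathology tasks.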

Abstract: Recent advances in Vision-Language Models (VLMs) in histopathology, such as CONCH and QuiltNet, have demonstrated impressive zero-shot classification capabilities across various tasks. However, their general-purpose design may lead to suboptimal performance in specific downstream applications. While supervised fine-tuning methods address this issue, they require manually labeled samples for adaptation. This paper investigates annotation-free adaptation of VLMs through continued pretraining on domain- and task-relevant image-caption pairs extracted from existing databases. Our experiments on two VLMs, CONCH and QuiltNet, across three downstream tasks reveal that these pairs substantially enhance both zero-shot and few-shot performance. Notably, with larger training sizes, continued pretraining matches the performance of few-shot methods while eliminating manual labeling. Its effectiveness, task-agnostic design, and annotation-free workflow make it a promising pathway for adapting VLMs to new histopathology tasks. Code is available at https://github.com/DeepMicroscopy/Annotation-free-VLM-specialization.

[129] CBDES MoE: Hierarchically Decoupled Mixture-of-Experts for Functional Modules in Autonomous Driving cs.CV

Qi Xiang, Kunsong Shi, Zhigui Lin, Lei He

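TL;DR: CBDES MoE proposes a hierarchically decoupled Mixture-of-Experts architecture at the functional module level to address limited input adaptability, modeling capacity, and generalization in multimodal BEV perception systems. A lightweight self-attention router dynamically selects expert paths, improving 3D object detection performance.

Details

Motivation: Existing BEV perception systems are limited in multimodal input adaptability, modeling capacity, and generalization. CBDES MoE aims to address these problems with a modular mixture-of-experts architecture.

Result: On the nuScenes dataset, mAP and NDS improve by 1.6 and 4.1 points respectively over the strongest single-expert model.

Insight: A modular mixture-of-experts architecture can effectively improve autonomous-driving perception, with the dynamic routing mechanism as the key innovation.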

Abstract: Bird’s Eye View (BEV) perception systems based on multi-sensor feature fusion have become a fundamental cornerstone for end-to-end autonomous driving. However, existing multi-modal BEV methods commonly suffer from limited input adaptability, constrained modeling capacity, and suboptimal generalization. To address these challenges, we propose a hierarchically decoupled Mixture-of-Experts architecture at the functional module level, termed Computing Brain DEvelopment System Mixture-of-Experts (CBDES MoE). CBDES MoE integrates multiple structurally heterogeneous expert networks with a lightweight Self-Attention Router (SAR) gating mechanism, enabling dynamic expert path selection and sparse, input-aware efficient inference. To the best of our knowledge, this is the first modular Mixture-of-Experts framework constructed at the functional module granularity within the autonomous driving domain. Extensive evaluations on the real-world nuScenes dataset demonstrate that CBDES MoE consistently outperforms fixed single-expert baselines in 3D object detection. Compared to the strongest single-expert model, CBDES MoE achieves a 1.6-point increase in mAP and a 4.1-point improvement in NDS, demonstrating the effectiveness and practical advantages of the proposed approach.
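
The abstract mentions dynamic expert path selection with sparse, input-aware inference. The sketch below illustrates that general idea with a generic top-1 softmax gate; it is not the paper's Self-Attention Router, whose design is not given here. The gate scores, the toy experts, and all function names are hypothetical.

```python
from math import exp

def softmax(scores):
    """Numerically stable softmax over a list of gate scores."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top1(gate_scores, experts, x):
    """Run only the highest-probability expert on x (sparse inference)."""
    probs = softmax(gate_scores)
    best = max(range(len(experts)), key=probs.__getitem__)
    return best, experts[best](x)

# Three toy "experts" standing in for heterogeneous networks (illustrative only).
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: -x]
chosen, output = route_top1([0.1, 2.0, -1.0], experts, 3)
```

Only the selected expert is evaluated, which is where the inference savings of sparse routing come from.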

[130] Morphological Analysis of Semiconductor Microstructures using Skeleton Graphs cs.CV

Noriko Nitta, Rei Miyata, Naoto Oishi

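TL;DR: The paper extracts topological features of Ge surface microstructures as skeleton graphs, embeds them with a graph convolutional network, and analyzes the embeddings with PCA, finding that irradiation angle affects Ge surface morphology more than irradiation fluence.

Details

Motivation: The morphology of semiconductor microstructures is important for understanding their physical properties, especially under ion beam irradiation. Extracting topological features from electron microscopy images offers a new angle for morphological analysis.

Result: Variations in irradiation angle have a more significant effect on Ge surface morphology than variations in irradiation fluence.

Insight: Combining topological features with graph neural networks enables quantitative analysis of microstructure morphology, providing a new tool for semiconductor materials research.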

Abstract: In this paper, electron microscopy images of microstructures formed on Ge surfaces by ion beam irradiation were processed to extract topological features as skeleton graphs, which were then embedded using a graph convolutional network. The resulting embeddings were analyzed using principal component analysis, and cluster separability in the resulting PCA space was evaluated using the Davies-Bouldin index. The results indicate that variations in irradiation angle have a more significant impact on the morphological properties of Ge surfaces than variations in irradiation fluence.
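
Cluster separability in the PCA space is scored with the Davies-Bouldin index, where lower values mean tighter, better separated clusters. A minimal pure-Python sketch of the standard definition follows; the toy 2D points are made up for illustration and are not the paper's embeddings. It assumes every cluster is non-empty and that no two centroids coincide.

```python
from math import dist

def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def davies_bouldin(clusters):
    """clusters: list of clusters, each a list of coordinate tuples."""
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance from each member to its own centroid.
    scatter = [sum(dist(p, ctr) for p in c) / len(c)
               for c, ctr in zip(clusters, centers)]
    k = len(clusters)
    # For each cluster, take its worst (largest) similarity ratio to any other.
    worst = [max((scatter[i] + scatter[j]) / dist(centers[i], centers[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k

well_separated = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
close_together = [[(0, 0), (0, 1)], [(1, 0), (1, 1)]]
```

With these toy inputs the well separated pair scores lower than the close pair, matching the reading that a lower index indicates better separation.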

[131] Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model cs.CV | cs.LG

Bin Cao, Sipeng Zheng, Ye Wang, Lujie Xia, Qianshan Wei

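TL;DR: The paper presents Being-M0.5, a real-time controllable vision-language-motion model (VLMM) that addresses existing models' shortcomings in responding to diverse commands, pose initialization, long-sequence generation, handling unseen scenarios, and fine-grained body-part control. Built on the HuMo100M dataset and a new part-aware residual quantization technique, the model achieves strong performance.

Details

Motivation: Existing vision-language-motion models have significant controllability limitations that hinder practical deployment. The paper targets these limitations, in particular weak response to diverse commands, limited pose initialization, poor long-sequence generation, insufficient handling of unseen scenarios, and the lack of fine-grained body-part control.

Result: The model achieves SOTA performance across multiple motion generation tasks. Experiments confirm its advantages in diversity and controllability, and its real-time generation capability.

Insight: The HuMo100M dataset and part-aware residual quantization are the key contributions, and the work offers guidance for building practical motion generators.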

Abstract: Human motion generation has emerged as a critical technology with transformative potential for real-world applications. However, existing vision-language-motion models (VLMMs) face significant limitations that hinder their practical deployment. We identify controllability as a main bottleneck, manifesting in five key aspects: inadequate response to diverse human commands, limited pose initialization capabilities, poor performance on long-term sequences, insufficient handling of unseen scenarios, and lack of fine-grained control over individual body parts. To overcome these limitations, we present Being-M0.5, the first real-time, controllable VLMM that achieves state-of-the-art performance across multiple motion generation tasks. Our approach is built upon HuMo100M, the largest and most comprehensive human motion dataset to date, comprising over 5 million self-collected motion sequences, 100 million multi-task instructional instances, and detailed part-level annotations that address a critical gap in existing datasets. We introduce a novel part-aware residual quantization technique for motion tokenization that enables precise, granular control over individual body parts during generation. Extensive experimental validation demonstrates Being-M0.5’s superior performance across diverse motion benchmarks, while comprehensive efficiency analysis confirms its real-time capabilities. Our contributions include design insights and detailed computational analysis to guide future development of practical motion generators. We believe that HuMo100M and Being-M0.5 represent significant advances that will accelerate the adoption of motion generation technologies in real-world applications. The project page is available at https://beingbeyond.github.io/Being-M0.5.

[132] CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning cs.CV

Yanshu Li, Jianjiang Yang, Zhennan Shen, Ligong Han, Haoyan Xu

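TL;DR: This paper proposes CATP, a training-free image token pruning method that targets token redundancy in multimodal in-context learning and improves both efficiency and performance through progressive pruning.

Details

Motivation: Modern large vision-language models (LVLMs) convert each input image into a large number of tokens, causing severe token redundancy and inefficient inference, a problem that is more pronounced in multimodal in-context learning.

Result: Across four LVLMs and eight benchmarks, CATP preserves performance while improving efficiency, reducing inference latency by 10.78% on average.

Insight: With a progressive pruning strategy that accounts for cross-modal interactions, CATP addresses token redundancy and lays groundwork for future work on multimodal in-context learning.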

Abstract: Modern large vision-language models (LVLMs) convert each input image into a large set of tokens, far outnumbering the text tokens. Although this improves visual perception, it introduces severe image token redundancy. Because image tokens carry sparse information, many add little to reasoning, yet greatly increase inference cost. The emerging image token pruning methods tackle this issue by identifying the most important tokens and discarding the rest. These methods can raise efficiency with only modest performance loss. However, most of them only consider single-image tasks and overlook multimodal in-context learning (ICL), where redundancy is greater and efficiency is more critical. Redundant tokens weaken the advantage of multimodal ICL for rapid domain adaptation and cause unstable performance. Applying existing pruning methods in this setting leads to large accuracy drops, exposing a clear gap and the need for new techniques. Thus, we propose Contextually Adaptive Token Pruning (CATP), a training-free pruning method targeted at multimodal ICL. CATP consists of two stages that perform progressive pruning to fully account for the complex cross-modal interactions in the input sequence. After removing 77.8% of the image tokens, CATP produces an average performance gain of 0.6% over the vanilla model on four LVLMs and eight benchmarks, exceeding all baselines remarkably. Meanwhile, it effectively improves efficiency by achieving an average reduction of 10.78% in inference latency. CATP enhances the practical value of multimodal ICL and lays the groundwork for future progress in interleaved image-text scenarios.
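
Image token pruning methods keep the most important tokens and discard the rest. The sketch below shows only that generic keep-top-k step at the pruning rate quoted in the abstract (77.8% removed). It does not reproduce CATP's two-stage procedure, and the importance scores and function name are hypothetical.

```python
def prune_tokens(scores, keep_ratio):
    """Return indices of the highest-scoring tokens, in their original order."""
    keep = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

# Nine hypothetical per-token importance scores; keep roughly 22.2% of them.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.7, 0.4, 0.6]
kept = prune_tokens(scores, keep_ratio=1 - 0.778)
```

Returning the kept indices in their original order preserves the positional layout of the surviving image tokens in the input sequence.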

[133] Selective Contrastive Learning for Weakly Supervised Affordance Grounding cs.CV | cs.AI

WonJun Moon, Hyun Seok Seong, Jae-Pil Heo

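TL;DR: The paper proposes a selective contrastive learning framework for weakly supervised affordance grounding (WSAG) that uses prototypical and pixel contrastive objectives to adaptively learn affordance-relevant cues, improving grounding accuracy.

Details

Motivation: Existing WSAG methods typically rely on a classifier and tend to focus on class-specific patterns that are unrelated to affordance. The paper addresses this with contrastive learning at the part and object levels to distinguish affordance-relevant regions.

Result: Experiments show that the method shifts activation from irrelevant areas to affordance-relevant regions and improves grounding performance. The code is open source.

Insight: Combining multi-view information with contrastive learning can substantially improve WSAG, particularly by cross-referencing views to sharpen part-level cues.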

Abstract: Facilitating an entity’s interaction with objects requires accurately identifying parts that afford specific actions. Weakly supervised affordance grounding (WSAG) seeks to imitate human learning from third-person demonstrations, where humans intuitively grasp functional parts without needing pixel-level annotations. To achieve this, grounding is typically learned using a shared classifier across images from different perspectives, along with distillation strategies incorporating a part discovery process. However, since affordance-relevant parts are not always easily distinguishable, models primarily rely on classification, often focusing on common class-specific patterns that are unrelated to affordance. To address this limitation, we move beyond isolated part-level learning by introducing selective prototypical and pixel contrastive objectives that adaptively learn affordance-relevant cues at both the part and object levels, depending on the granularity of the available information. Initially, we find the action-associated objects in both egocentric (object-focused) and exocentric (third-person example) images by leveraging CLIP. Then, by cross-referencing the discovered objects of complementary views, we excavate the precise part-level affordance clues in each perspective. By consistently learning to distinguish affordance-relevant regions from affordance-irrelevant background context, our approach effectively shifts activation from irrelevant areas toward meaningful affordance cues. Experimental results demonstrate the effectiveness of our method. Code is available at github.com/hynnsk/SelectiveCL.

[134] NeeCo: Image Synthesis of Novel Instrument States Based on Dynamic and Deformable 3D Gaussian Reconstruction cs.CV | cs.AI | I.3.3

Tianle Zeng, Junlei Hu, Gerardo Loza Galindo, Sharib Ali, Duygu Sarikaya

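TL;DR: The paper proposes NeeCo, a new method based on dynamic Gaussian Splatting for generating synthetic images of surgical instruments, addressing data scarcity in surgical data science. With a dynamic training adjustment strategy and automatic annotation, it improves the realism and usefulness of synthetic data.

Details

Motivation: High-quality labeled image datasets are scarce and hard to obtain in surgical data science, and existing data-driven methods need large amounts of labeled data, which limits their use. The paper aims to generate high-quality, diverse, automatically annotated synthetic images of surgical instruments with dynamic Gaussian Splatting to ease the data shortage.

Result: 1. The generated photo-realistic synthetic images reach the highest PSNR, 29.87.
2. On an unseen real-world image dataset, models trained on the synthetic data outperform those trained with standard data augmentation by 10%, for an overall performance improvement of nearly 15%.

Insight: Dynamic Gaussian Splatting offers surgical data science a way to generate high-quality, diverse synthetic data, and the automatic annotation method lowers labeling cost. This suggests synthetic data has strong potential in data-scarce fields, especially medical applications.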

Abstract: Computer vision-based technologies significantly enhance surgical automation by advancing tool tracking, detection, and localization. However, current data-driven approaches are data-voracious, requiring large, high-quality labeled image datasets, which limits their application in surgical data science. Our work introduces a novel dynamic Gaussian Splatting technique to address the data scarcity in surgical image datasets. We propose a dynamic Gaussian model to represent dynamic surgical scenes, enabling the rendering of surgical instruments from unseen viewpoints and deformations with real tissue backgrounds. We utilize a dynamic training adjustment strategy to address challenges posed by poorly calibrated camera poses from real-world scenarios. Additionally, we propose a method based on dynamic Gaussians for automatically generating annotations for our synthetic data. For evaluation, we constructed a new dataset featuring seven scenes with 14,000 frames of tool and camera motion and tool jaw articulation, with a background of an ex-vivo porcine model. Using this dataset, we synthetically replicate the scene deformation from the ground truth data, allowing direct comparisons of synthetic image quality. Experimental results illustrate that our method generates photo-realistic labeled image datasets with the highest Peak Signal-to-Noise Ratio (29.87). We further evaluate the performance of medical-specific neural networks trained on real and synthetic images using an unseen real-world image dataset. Our results show that models trained on synthetic images generated by the proposed method outperform those trained with state-of-the-art standard data augmentation by 10%, leading to an overall improvement in model performance of nearly 15%.
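
Synthetic image quality is reported as Peak Signal-to-Noise Ratio. For reference, a minimal sketch of the standard definition, PSNR = 10 * log10(MAX^2 / MSE), is given below. It assumes 8-bit pixel values (MAX = 255) and images supplied as flat lists; both are assumptions for illustration rather than details from the paper.

```python
from math import log10

def psnr(reference, rendered, max_value=255.0):
    """PSNR in dB between two equally sized images given as flat pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images have unbounded PSNR
    return 10.0 * log10(max_value ** 2 / mse)

# Two toy "images" that differ by one gray level per pixel.
value = psnr([100, 100], [101, 99])
```

Higher values mean the rendered image is closer to the reference, and because the scale is logarithmic, each halving of the mean squared error adds about 3 dB.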

[135] Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation cs.CV

Bowen Xue, Qixin Yan, Wenjing Wang, Hao Liu, Chen Li

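TL;DR: Stand-In is a lightweight plug-and-play framework for identity control in video generation. It preserves identity with high quality using only a small number of additional parameters and is compatible with other AIGC tools.

Details

Motivation: Existing methods for generating high-fidelity human videos require a large number of trainable parameters and are poorly compatible with other generation tools, which limits their practical use.

Result: With only about 1% additional parameters, Stand-In outperforms full-parameter training methods in video quality and identity preservation, and it integrates seamlessly into other tasks.

Insight: Lightweight, modular design is key to making generative models practical, especially for multi-task integration and real-world deployment.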

Abstract: Generating high-fidelity human videos that match user-specified identities is important yet challenging in the field of generative AI. Existing methods often rely on an excessive number of training parameters and lack compatibility with other AIGC tools. In this paper, we propose Stand-In, a lightweight and plug-and-play framework for identity preservation in video generation. Specifically, we introduce a conditional image branch into the pre-trained video generation model. Identity control is achieved through restricted self-attentions with conditional position mapping, and can be learned quickly with only 2000 pairs. Despite incorporating and training just ~1% additional parameters, our framework achieves excellent results in video quality and identity preservation, outperforming other full-parameter training methods. Moreover, our framework can be seamlessly integrated for other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.

[136] CTC Transcription Alignment of the Bullinger Letters: Automatic Improvement of Annotation Quality cs.CV

Marco Peer, Anna Scius-Bertrand, Andreas Fischer

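TL;DR: This paper proposes a self-training method based on a CTC alignment algorithm to correct annotation errors, particularly hyphenation issues, in the 16th-century Bullinger letter collection, improving recognition performance and alignment accuracy.

Details

Motivation: Text recognition for historical handwritten documents faces handwriting variability, document degradation, and limited annotations. The paper aims to improve recognition by improving annotation quality.

Result: CER improves by 1.1 percentage points with PyLaia, and alignment accuracy increases. A manually corrected subset of 100 pages is also released.

Insight: Weaker models yield more accurate alignments, a finding that supports an iterative training strategy.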

Abstract: Handwritten text recognition for historical documents remains challenging due to handwriting variability, degraded sources, and limited layout-aware annotations. In this work, we address annotation errors - particularly hyphenation issues - in the Bullinger correspondence, a large 16th-century letter collection. We introduce a self-training method based on a CTC alignment algorithm that matches full transcriptions to text line images using dynamic programming and model output probabilities trained with the CTC loss. Our approach improves performance (e.g., by 1.1 percentage points CER with PyLaia) and increases alignment accuracy. Interestingly, we find that weaker models yield more accurate alignments, enabling an iterative training strategy. We release a new manually corrected subset of 100 pages from the Bullinger dataset, along with our code and benchmarks. Our approach can be applied iteratively to further improve the CER as well as the alignment quality for text recognition pipelines. Code and data are available via https://github.com/andreas-fischer-unifr/nntp.
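
The reported gain is measured in character error rate (CER). As a reminder of what that metric computes, the sketch below uses the usual definition: Levenshtein edit distance between the reference and the recognized text, divided by the reference length. The example strings are invented, and this is not the paper's alignment algorithm.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # delete a reference char
                            curr[j - 1] + 1,           # insert a hypothesis char
                            prev[j - 1] + (r != h)))   # substitute or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)

# A stray line-end hyphen in the annotation costs one character edit.
rate = cer("gnad", "gnad-")
```

Under this metric a wrongly transcribed hyphen counts as an error even when the recognizer reads the line image correctly, which is one way annotation mistakes of the kind described above can inflate measured CER.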

[137] Generative Video Matting cs.CV

Yongtao Ge, Kangyang Xie, Guangkai Xu, Mingyu Liu, Li Ke

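TL;DR: This paper proposes a generative video matting method that improves generalization through large-scale pre-training and a synthetic data generation pipeline, and uses priors from a pre-trained video diffusion model to ensure temporal consistency.

Details

Motivation: Video matting lacks high-quality ground-truth data. Existing datasets typically contain imperfect human annotations, so models generalize poorly to real-world scenes.

Result: The method shows superior performance on three benchmark datasets and strong generalization across diverse real-world scenes.

Insight: Large-scale pre-training and synthetic data generation are essential to generalization in video matting, and priors from pre-trained models help bridge the domain gap between synthetic data and real-world scenes.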

Abstract: Video matting has traditionally been limited by the lack of high-quality ground-truth data. Most existing video matting datasets provide only human-annotated imperfect alpha and foreground annotations, which must be composited to background images or videos during the training stage. Thus, the generalization capability of previous methods in real-world scenarios is typically poor. In this work, we propose to solve the problem from two perspectives. First, we emphasize the importance of large-scale pre-training by pursuing diverse synthetic and pseudo-labeled segmentation datasets. We also develop a scalable synthetic data generation pipeline that can render diverse human bodies and fine-grained hairs, yielding around 200 video clips with a 3-second duration for fine-tuning. Second, we introduce a novel video matting approach that can effectively leverage the rich priors from pre-trained video diffusion models. This architecture offers two key advantages. First, strong priors play a critical role in bridging the domain gap between synthetic and real-world scenes. Second, unlike most existing methods that process video matting frame-by-frame and use an independent decoder to aggregate temporal information, our model is inherently designed for video, ensuring strong temporal consistency. We provide a comprehensive quantitative evaluation across three benchmark datasets, demonstrating our approach’s superior performance, and present comprehensive qualitative results in diverse real-world scenes, illustrating the strong generalization capability of our method. The code is available at https://github.com/aim-uofa/GVM.
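
The abstract notes that annotated alpha mattes and foregrounds must be composited onto backgrounds during training. That step follows the standard matting equation I = alpha * F + (1 - alpha) * B, sketched below per pixel on flat lists. The pixel values are made up for illustration.

```python
def composite(alpha, foreground, background):
    """Per-pixel compositing I = alpha * F + (1 - alpha) * B on flat lists."""
    return [a * f + (1.0 - a) * b
            for a, f, b in zip(alpha, foreground, background)]

# alpha = 1 keeps the foreground, alpha = 0 keeps the background,
# and fractional alpha (for example, fine hair) blends the two.
image = composite([1.0, 0.0, 0.5], [200, 200, 200], [100, 100, 100])
```

Matting inverts this equation: given only the composite frame, the model estimates alpha (and often the foreground), which is why imperfect alpha annotations carry over directly into the training composites.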

[138] Mem4D: Decoupling Static and Dynamic Memory for Dynamic Scene Reconstruction cs.CV

Xudong Cai, Shuo Wang, Peng Wang, Yongcai Wang, Zhaoxin Fan

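TL;DR: Mem4D decouples static and dynamic memory to resolve the conflicting memory demands in dynamic scene reconstruction, achieving high-fidelity modeling of dynamic content and globally consistent reconstruction of static structure.

Details

Motivation: Dense geometry reconstruction of dynamic scenes from monocular video faces conflicting memory demands: static structures need long-term stable storage, whereas dynamic content needs rapid, high-fidelity updates. Existing methods compromise, which leads to geometric drift in static structures or blurred dynamic content.

Result: Mem4D matches or exceeds existing methods on multiple benchmarks while remaining efficient.

Insight: Decoupling static and dynamic memory is key to resolving the memory conflict in dynamic scene reconstruction, and the dual-memory architecture offers a new direction for similar tasks.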

Abstract: Reconstructing dense geometry for dynamic scenes from a monocular video is a critical yet challenging task. Recent memory-based methods enable efficient online reconstruction, but they fundamentally suffer from a Memory Demand Dilemma: The memory representation faces an inherent conflict between the long-term stability required for static structures and the rapid, high-fidelity detail retention needed for dynamic motion. This conflict forces existing methods into a compromise, leading to either geometric drift in static structures or blurred, inaccurate reconstructions of dynamic objects. To address this dilemma, we propose Mem4D, a novel framework that decouples the modeling of static geometry and dynamic motion. Guided by this insight, we design a dual-memory architecture: 1) The Transient Dynamics Memory (TDM) focuses on capturing high-frequency motion details from recent frames, enabling accurate and fine-grained modeling of dynamic content; 2) The Persistent Structure Memory (PSM) compresses and preserves long-term spatial information, ensuring global consistency and drift-free reconstruction for static elements. By alternating queries to these specialized memories, Mem4D simultaneously maintains static geometry with global consistency and reconstructs dynamic elements with high fidelity. Experiments on challenging benchmarks demonstrate that our method achieves state-of-the-art or competitive performance while maintaining high efficiency. Codes will be publicly available.

[139] RSVLM-QA: A Benchmark Dataset for Remote Sensing Vision Language Model-based Question Answering cs.CVPDF

Xing Zi, Jinghao Xiao, Yunxiao Shi, Xian Tao, Jun Li

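TL;DR: The paper introduces RSVLM-QA, a new dataset for visual question answering (VQA) in remote sensing (RS) that addresses the shortcomings of existing datasets in annotation richness, question diversity, and evaluation of reasoning ability.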

Details

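Motivation: Existing RS VQA datasets are limited in annotation richness, question diversity, and evaluation of reasoning ability, which holds back research in the area.

Result: RSVLM-QA contains 13,820 images and 162,373 VQA pairs with extensive annotations and diverse question types; experiments show that it effectively evaluates the reasoning abilities of mainstream vision-language models.

Insight: RSVLM-QA fills a resource gap in RS VQA and is expected to become an important tool for advancing VLM research.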

Abstract: Visual Question Answering (VQA) in remote sensing (RS) is pivotal for interpreting Earth observation data. However, existing RS VQA datasets are constrained by limitations in annotation richness, question diversity, and the assessment of specific reasoning capabilities. This paper introduces the RSVLM-QA dataset, a new large-scale, content-rich VQA dataset for the RS domain. RSVLM-QA is constructed by integrating data from several prominent RS segmentation and detection datasets: WHU, LoveDA, INRIA, and iSAID. We employ an innovative dual-track annotation generation pipeline. Firstly, we leverage Large Language Models (LLMs), specifically GPT-4.1, with meticulously designed prompts to automatically generate a suite of detailed annotations including image captions, spatial relations, and semantic tags, alongside complex caption-based VQA pairs. Secondly, to address the challenging task of object counting in RS imagery, we have developed a specialized automated process that extracts object counts directly from the original segmentation data; GPT-4.1 then formulates natural language answers from these counts, which are paired with preset question templates to create counting QA pairs. RSVLM-QA comprises 13,820 images and 162,373 VQA pairs, featuring extensive annotations and diverse question types. We provide a detailed statistical analysis of the dataset and a comparison with existing RS VQA benchmarks, highlighting the superior depth and breadth of RSVLM-QA’s annotations. Furthermore, we conduct benchmark experiments on six mainstream Vision Language Models (VLMs), demonstrating that RSVLM-QA effectively evaluates and challenges the understanding and reasoning abilities of current VLMs in the RS domain. We believe RSVLM-QA will serve as a pivotal resource for the RS VQA and VLM research communities, poised to catalyze advancements in the field.
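
The counting track described in the abstract (object counts extracted from segmentation data, then paired with question templates) might look roughly like the sketch below. This is an illustration only, not the authors' pipeline: it assumes a 2D grid of class labels, counts 4-connected regions as instances, and uses a hypothetical template string. The step in which GPT-4.1 phrases the answer is left out.

```python
from collections import deque


def count_instances(mask, target):
    """Count 4-connected regions labelled `target` in a 2D label grid."""
    rows = len(mask)
    cols = len(mask[0]) if mask else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != target or seen[r][c]:
                continue
            # New region found: flood-fill it so it is counted once.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and mask[ny][nx] == target):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def counting_qa(mask, target, class_name):
    """Pair a preset question template (hypothetical) with the extracted count."""
    question = f"How many {class_name} instances are in the image?"
    return {"question": question, "count": count_instances(mask, target)}


# Toy label grid: 1 marks the class of interest, 0 is background.
toy_mask = [
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 1],
]
qa = counting_qa(toy_mask, 1, "building")
```

Real segmentation datasets such as iSAID ship instance annotations, so a production pipeline would more likely read counts from those directly than recompute connected components.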

[140] TAG: A Simple Yet Effective Temporal-Aware Approach for Zero-Shot Video Temporal Grounding cs.CVPDF

Jin-Seop Lee, SungJoon Lee, Jaehan Ahn, YunSeok Choi, Jee-Hyong Lee

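TL;DR: The paper proposes TAG, a zero-shot video temporal grounding method that uses temporal pooling, temporal coherence clustering, and similarity adjustment to address the semantic fragmentation of existing methods, achieving strong performance without training.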

Details

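Motivation: Existing zero-shot video temporal grounding methods suffer from semantic fragmentation, skewed similarity distributions, and dependence on large language models, which limits localization accuracy.

Result: Achieves state-of-the-art performance on Charades-STA and ActivityNet Captions without relying on large language models.

Insight: A simple but effective temporal-aware design can improve zero-shot video temporal grounding without relying on complex models.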

Abstract: Video Temporal Grounding (VTG) aims to extract relevant video segments based on a given natural language query. Recently, zero-shot VTG methods have gained attention by leveraging pretrained vision-language models (VLMs) to localize target moments without additional training. However, existing approaches suffer from semantic fragmentation, where temporally continuous frames sharing the same semantics are split across multiple segments. When segments are fragmented, it becomes difficult to predict an accurate target moment that aligns with the text query. Also, they rely on skewed similarity distributions for localization, making it difficult to select the optimal segment. Furthermore, they heavily depend on the use of LLMs which require expensive inferences. To address these limitations, we propose TAG, a simple yet effective Temporal-Aware approach for zero-shot video temporal Grounding, which incorporates temporal pooling, temporal coherence clustering, and similarity adjustment. Our proposed method effectively captures the temporal context of videos and addresses distorted similarity distributions without training. Our approach achieves state-of-the-art results on Charades-STA and ActivityNet Captions benchmark datasets without relying on LLMs. Our code is available at https://github.com/Nuetee/TAG
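
The abstract names temporal pooling and temporal coherence clustering without giving details, so the following is only a guess at the general idea, not the TAG implementation. It assumes per-frame feature vectors, a centred averaging window, and a cosine-similarity threshold for deciding when a new segment starts; the function names and default values are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def temporal_pool(features, window=3):
    """Average each frame feature with its neighbours in a centred window."""
    half = window // 2
    pooled = []
    for i in range(len(features)):
        chunk = features[max(0, i - half):min(len(features), i + half + 1)]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled


def coherent_segments(features, threshold=0.9):
    """Group consecutive frames into (start, end) index pairs.

    A new segment starts whenever a frame's similarity to the previous
    frame drops below `threshold`, so contiguous frames with the same
    semantics stay in one segment instead of being fragmented.
    """
    if not features:
        return []
    segments = []
    start = 0
    for i in range(1, len(features)):
        if cosine(features[i - 1], features[i]) < threshold:
            segments.append((start, i - 1))
            start = i
    segments.append((start, len(features) - 1))
    return segments


# Toy 2-D "frame features": two frames of one scene, then two of another.
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
segments = coherent_segments(frames, threshold=0.9)
```

In a zero-shot setting the features would come from a pretrained VLM's image encoder, and each segment would then be scored against the text query.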

[141] TrackOR: Towards Personalized Intelligent Operating Rooms Through Robust Tracking cs.CVPDF

Tony Danjun Wang, Christian Heiliger, Nassir Navab, Lennart Bastian

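TL;DR: TrackOR is a framework for long-term multi-person tracking and re-identification in the operating room. It uses 3D geometric signatures for high-performance online tracking and supports offline recovery to produce analysis-ready trajectories, paving the way for personalized intelligent systems.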

Details

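Motivation: Providing intelligent support to surgical teams and improving patient outcomes requires continuously tracking where every team member is throughout long procedures, which remains challenging.

Result: Online tracking improves Association Accuracy by 11% over the strongest baseline and achieves persistent identity tracking.

Insight: 3D geometric information makes long-term identity tracking considerably more feasible, providing a key building block for personalized intelligent support in the operating room.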

Abstract: Providing intelligent support to surgical teams is a key frontier in automated surgical scene understanding, with the long-term goal of improving patient outcomes. Developing personalized intelligence for all staff members requires maintaining a consistent state of who is located where for long surgical procedures, which still poses numerous computational challenges. We propose TrackOR, a framework for tackling long-term multi-person tracking and re-identification in the operating room. TrackOR uses 3D geometric signatures to achieve state-of-the-art online tracking performance (+11% Association Accuracy over the strongest baseline), while also enabling an effective offline recovery process to create analysis-ready trajectories. Our work shows that by leveraging 3D geometric information, persistent identity tracking becomes attainable, enabling a critical shift towards the more granular, staff-centric analyses required for personalized intelligent systems in the operating room. This new capability opens up various applications, including our proposed temporal pathway imprints that translate raw tracking data into actionable insights for improving team efficiency and safety and ultimately providing personalized support.

[142] Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation cs.CV | cs.AIPDF

Fangyuan Mao, Aiming Hao, Jintao Chen, Dongxia Liu, Xiaokun Feng

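TL;DR: Omni-Effects is a unified visual effects (VFX) generation framework that supports joint generation and precise control of multiple effects through prompts and spatial masks, overcoming the inability of existing methods to handle multiple effects and spatial control at the same time.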

Details

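Motivation: Modern video generation models lower the cost of VFX production, but existing methods require training a separate LoRA for each effect and cannot generate spatially controllable composite effects. This limits applications that need several effects to work together.

Result: Experiments show that Omni-Effects generates diverse effects with precise control over their spatial location, letting users specify both the category and the position of an effect.

Insight: By combining LoRA-MoE and SAP, Omni-Effects is the first to achieve unified generation and spatial control of multiple effects, offering a new approach for future VFX generation tasks.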

Abstract: Visual effects (VFX) are essential visual enhancements fundamental to modern cinematic production. Although video generation models offer cost-efficient solutions for VFX production, current methods are constrained by per-effect LoRA training, which limits generation to single effects. This fundamental limitation impedes applications that require spatially controllable composite effects, i.e., the concurrent generation of multiple effects at designated locations. However, integrating diverse effects into a unified framework faces major challenges: interference from effect variations and spatial uncontrollability during multi-VFX joint training. To tackle these challenges, we propose Omni-Effects, the first unified framework capable of generating prompt-guided effects and spatially controllable composite effects. The core of our framework comprises two key innovations: (1) LoRA-based Mixture of Experts (LoRA-MoE), which employs a group of expert LoRAs, integrating diverse effects within a unified model while effectively mitigating cross-task interference. (2) Spatial-Aware Prompt (SAP), which incorporates spatial mask information into the text token, enabling precise spatial control. Furthermore, we introduce an Independent-Information Flow (IIF) module integrated within the SAP, isolating the control signals corresponding to individual effects to prevent any unwanted blending. To facilitate this research, we construct a comprehensive VFX dataset Omni-VFX via a novel data collection pipeline combining image editing and First-Last Frame-to-Video (FLF2V) synthesis, and introduce a dedicated VFX evaluation framework for validating model performance. Extensive experiments demonstrate that Omni-Effects achieves precise spatial control and diverse effect generation, enabling users to specify both the category and location of desired effects.

[143] The Escalator Problem: Identifying Implicit Motion Blindness in AI for Accessibility cs.CV | cs.HCPDF

Xiantao Zhang

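TL;DR: The paper introduces the "Escalator Problem", which exposes a weakness of current multimodal large language models (MLLMs) in perceiving continuous, low-signal motion, termed "Implicit Motion Blindness", and calls for a paradigm shift from semantic recognition to physical perception.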

Details

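Motivation: MLLMs have great potential as assistive technology for blind and visually impaired (BVI) people, but their use in dynamic environments faces a trust problem because they cannot perceive continuous motion.

Result: Shows that current models perform poorly on continuous, low-signal motion and stresses the importance of physical perception.

Insight: Future research needs to move beyond static semantic recognition and focus on safety and reliability in dynamic environments to meet users' real needs.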

Abstract: Multimodal Large Language Models (MLLMs) hold immense promise as assistive technologies for the blind and visually impaired (BVI) community. However, we identify a critical failure mode that undermines their trustworthiness in real-world applications. We introduce the Escalator Problem – the inability of state-of-the-art models to perceive an escalator’s direction of travel – as a canonical example of a deeper limitation we term Implicit Motion Blindness. This blindness stems from the dominant frame-sampling paradigm in video understanding, which, by treating videos as discrete sequences of static images, fundamentally struggles to perceive continuous, low-signal motion. As a position paper, our contribution is not a new model but rather to: (I) formally articulate this blind spot, (II) analyze its implications for user trust, and (III) issue a call to action. We advocate for a paradigm shift from purely semantic recognition towards robust physical perception and urge the development of new, human-centered benchmarks that prioritize safety, reliability, and the genuine needs of users in dynamic environments.

Thinesh Thiyakesan Ponbagavathi, Chengzheng Yang, Alina Roitberg

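TL;DR: The paper proposes ProGraD, which uses learnable group prompts and a lightweight GroupContext Transformer on top of vision foundation models (VFMs) to substantially improve group activity detection, especially in complex multi-group scenes.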

Details

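Motivation: Vision foundation models such as DinoV2 extract excellent features but are pretrained mainly on object-centric data and remain underexplored for modeling group dynamics. Simply swapping them into task-specific GAD architectures brings little gain, so structured, group-aware reasoning is needed.

Result: Surpasses the state of the art on two GAD benchmarks (Cafe and Social-CAD), with gains of 6.5% (Group mAP@1.0) and 8.2% (Group mAP@0.5) in multi-group scenarios.

Insight: The attention maps produced by ProGraD are interpretable and reveal the actor-group reasoning process, offering a new perspective on group activity understanding.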

Abstract: Group Activity Detection (GAD) involves recognizing social groups and their collective behaviors in videos. Vision Foundation Models (VFMs), like DinoV2, offer excellent features, but are pretrained primarily on object-centric data and remain underexplored for modeling group dynamics. While they are a promising alternative to highly task-specific GAD architectures that require full fine-tuning, our initial investigation reveals that simply swapping CNN backbones used in these methods with VFMs brings little gain, underscoring the need for structured, group-aware reasoning on top. We introduce Prompt-driven Group Activity Detection (ProGraD) – a method that bridges this gap through 1) learnable group prompts to guide the VFM attention toward social configurations, and 2) a lightweight two-layer GroupContext Transformer that infers actor-group associations and collective behavior. We evaluate our approach on two recent GAD benchmarks: Cafe, which features multiple concurrent social groups, and Social-CAD, which focuses on single-group interactions. While we surpass state-of-the-art in both settings, our method is especially effective in complex multi-group scenarios, where we yield a gain of 6.5% (Group mAP@1.0) and 8.2% (Group mAP@0.5) using only 10M trainable parameters. Furthermore, our experiments reveal that ProGraD produces interpretable attention maps, offering insights into actor-group reasoning. Code and models will be released.

[145] Mitigating Biases in Surgical Operating Rooms with Geometry cs.CVPDF

Tony Danjun Wang, Tobias Czempiel, Nassir Navab, Lennart Bastian

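TL;DR: Using geometric representations (3D point cloud sequences), the paper addresses the bias that arises in operating rooms (OR) when standardized attire leads CNN models to learn spurious correlations (such as shoes or eyewear), and shows that geometric methods are more robust than RGB models in realistic clinical settings.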

Details

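Motivation: Standardized attire in the OR (such as surgical gowns) obscures key identifying features, so CNN models rely on spurious visual cues (shoes, eyewear, and so on) rather than genuine biometric features. This limits the ability of intelligent assistance systems to recognize personalized workflow traits accurately.

Result: RGB and geometric methods perform comparably on simulated data, but RGB models lose 12% accuracy in realistic clinical settings, indicating that geometric methods are more robust.

Insight: Geometric representations effectively remove the bias introduced by standardized appearance and provide a more reliable way to model humans in the OR.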

Abstract: Deep neural networks are prone to learning spurious correlations, exploiting dataset-specific artifacts rather than meaningful features for prediction. In surgical operating rooms (OR), these manifest through the standardization of smocks and gowns that obscure robust identifying landmarks, introducing model bias for tasks related to modeling OR personnel. Through gradient-based saliency analysis on two public OR datasets, we reveal that CNN models succumb to such shortcuts, fixating on incidental visual cues such as footwear beneath surgical gowns, distinctive eyewear, or other role-specific identifiers. Avoiding such biases is essential for the next generation of intelligent assistance systems in the OR, which should accurately recognize personalized workflow traits, such as surgical skill level or coordination with other staff members. We address this problem by encoding personnel as 3D point cloud sequences, disentangling identity-relevant shape and motion patterns from appearance-based confounders. Our experiments demonstrate that while RGB and geometric methods achieve comparable performance on datasets with apparent simulation artifacts, RGB models suffer a 12% accuracy drop in realistic clinical settings with decreased visual diversity due to standardizations. This performance gap confirms that geometric representations capture more meaningful biometric features, providing an avenue to developing robust methods of modeling humans in the OR.

[146] TRIDE: A Text-assisted Radar-Image weather-aware fusion network for Depth Estimation cs.CVPDF

Huawei Sun, Zixu Wang, Hao Feng, Julius Ott, Lorenzo Servadei

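TL;DR: The paper proposes TRIDE, a radar-image-text fusion network for depth estimation that improves performance through a text-generation strategy and a weather-aware fusion module.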

Details

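Motivation: Existing radar-camera fusion methods do not account for the effect of weather on sensor performance and do not fully exploit vision-language models.

Result: On the nuScenes dataset, MAE and RMSE improve by 12.87% and 9.08%, respectively.

Insight: Weather awareness and multimodal fusion (text, radar, image) can significantly improve depth estimation.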

Abstract: Depth estimation, essential for autonomous driving, seeks to interpret the 3D environment surrounding vehicles. The development of radar sensors, known for their cost-efficiency and robustness, has spurred interest in radar-camera fusion-based solutions. However, existing algorithms fuse features from these modalities without accounting for weather conditions, despite radars being known to be more robust than cameras under adverse weather. Additionally, while Vision-Language models have seen rapid advancement, utilizing language descriptions alongside other modalities for depth estimation remains an open challenge. This paper first introduces a text-generation strategy along with feature extraction and fusion techniques that can assist monocular depth estimation pipelines, leading to improved accuracy across different algorithms on the KITTI dataset. Building on this, we propose TRIDE, a radar-camera fusion algorithm that enhances text feature extraction by incorporating radar point information. To address the impact of weather on sensor performance, we introduce a weather-aware fusion block that adaptively adjusts radar weighting based on current weather conditions. Our method, benchmarked on the nuScenes dataset, demonstrates performance gains over the state-of-the-art, achieving a 12.87% improvement in MAE and a 9.08% improvement in RMSE. Code: https://github.com/harborsarah/TRIDE

[147] S^2VG: 3D Stereoscopic and Spatial Video Generation via Denoising Frame Matrix cs.CVPDF

Peng Dai, Feitong Tan, Qiangeng Xu, Yihua Huang, David Futschik

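TL;DR: The paper proposes a training-free, pose-free method that uses an off-the-shelf monocular video generation model to produce immersive 3D videos. A frame matrix inpainting framework ensures spatial and temporal consistency, and experiments show clear gains over previous methods.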

Details

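Motivation: Although video generation models excel at producing high-quality monocular videos, generating 3D stereoscopic and spatial videos for immersive applications remains underexplored. The paper aims to fill this gap.

Result: Experiments on videos from generative models such as Sora and Lumiere show that the method significantly outperforms existing methods.

Insight: With a well-designed framework, existing monocular video generation models can be used directly for high-quality 3D video synthesis without additional training or pose information.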

Abstract: While video generation models excel at producing high-quality monocular videos, generating 3D stereoscopic and spatial videos for immersive applications remains an underexplored challenge. We present a pose-free and training-free method that leverages an off-the-shelf monocular video generation model to produce immersive 3D videos. Our approach first warps the generated monocular video into pre-defined camera viewpoints using estimated depth information, then applies a novel frame matrix inpainting framework. This framework utilizes the original video generation model to synthesize missing content across different viewpoints and timestamps, ensuring spatial and temporal consistency without requiring additional model fine-tuning. Moreover, we develop a dual-update scheme that further improves the quality of video inpainting by alleviating the negative effects propagated from disoccluded areas in the latent space. The resulting multi-view videos are then adapted into stereoscopic pairs or optimized into 4D Gaussians for spatial video synthesis. We validate the efficacy of our proposed method by conducting experiments on videos from various generative models, such as Sora, Lumiere, WALT, and Zeroscope. The experiments demonstrate that our method significantly improves on previous methods. Project page at: https://daipengwa.github.io/S-2VG_ProjectPage/

[148] PrIINeR: Towards Prior-Informed Implicit Neural Representations for Accelerated MRI cs.CV | cs.LGPDF

Ziad Al-Haj Hemidi, Eytan Kats, Mattias P. Heinrich

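TL;DR: PrIINeR strengthens implicit neural representations (INRs) with prior knowledge from pre-trained deep learning models, significantly improving MRI reconstruction quality at high acceleration factors.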

Details

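Motivation: Accelerated MRI scans often degrade image quality, and existing INR methods perform poorly at high acceleration factors because their prior constraints are weak. PrIINeR aims to solve this by incorporating prior knowledge.

Result: Outperforms existing INR methods and several learning-based methods on the NYU fastMRI dataset, effectively removing artefacts and improving structural fidelity.

Insight: PrIINeR bridges deep learning and INR techniques, offering a more reliable solution for highly accelerated MRI.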

Abstract: Accelerating Magnetic Resonance Imaging (MRI) reduces scan time but often degrades image quality. While Implicit Neural Representations (INRs) show promise for MRI reconstruction, they struggle at high acceleration factors due to weak prior constraints, leading to structural loss and aliasing artefacts. To address this, we propose PrIINeR, an INR-based MRI reconstruction method that integrates prior knowledge from pre-trained deep learning models into the INR framework. By combining population-level knowledge with instance-based optimization and enforcing dual data consistency, PrIINeR aligns both with the acquired k-space data and the prior-informed reconstruction. Evaluated on the NYU fastMRI dataset, our method not only outperforms state-of-the-art INR-based approaches but also improves upon several learning-based state-of-the-art methods, significantly improving structural preservation and fidelity while effectively removing aliasing artefacts. PrIINeR bridges deep learning and INR-based techniques, offering a more reliable solution for high-quality, accelerated MRI reconstruction. The code is publicly available at https://github.com/multimodallearning/PrIINeR.

[149] ME-TST+: Micro-expression Analysis via Temporal State Transition with ROI Relationship Awareness cs.CVPDF

Zizheng Guo, Bochao Zou, Junbao Zhuo, Huimin Ma

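TL;DR: ME-TST+ is a state space model-based method that uses a temporal state transition mechanism to overcome the sliding-window limitations and task separation of micro-expression analysis. Combined with multi-granularity ROI modeling and the slowfast Mamba framework, it significantly improves performance.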

Details

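Motivation: Existing micro-expression analysis methods rely on sliding-window classification with fixed window lengths and treat spotting and recognition as independent tasks, ignoring their intrinsic connection and limiting performance.

Result: Experiments show that ME-TST+ achieves state-of-the-art performance on micro-expression analysis.

Insight: 1. Temporal state transition mechanisms are better suited to modeling the dynamics of micro-expressions. 2. Jointly optimizing spotting and recognition significantly improves performance and reveals the intrinsic connection between the two tasks.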

Abstract: Micro-expressions (MEs) are regarded as important indicators of an individual’s intrinsic emotions, preferences, and tendencies. ME analysis requires spotting of ME intervals within long video sequences and recognition of their corresponding emotional categories. Previous deep learning approaches commonly employ sliding-window classification networks. However, the use of fixed window lengths and hard classification presents notable limitations in practice. Furthermore, these methods typically treat ME spotting and recognition as two separate tasks, overlooking the essential relationship between them. To address these challenges, this paper proposes two state space model-based architectures, namely ME-TST and ME-TST+, which utilize temporal state transition mechanisms to replace conventional window-level classification with video-level regression. This enables a more precise characterization of the temporal dynamics of MEs and supports the modeling of MEs with varying durations. In ME-TST+, we further introduce multi-granularity ROI modeling and the slowfast Mamba framework to alleviate information loss associated with treating ME analysis as a time-series task. Additionally, we propose a synergy strategy for spotting and recognition at both the feature and result levels, leveraging their intrinsic connection to enhance overall analysis performance. Extensive experiments demonstrate that the proposed methods achieve state-of-the-art performance. The codes are available at https://github.com/zizheng-guo/ME-TST.

[150] Matrix-3D: Omnidirectional Explorable 3D World Generation cs.CV | cs.GRPDF

Zhongqi Yang, Wenhang Ge, Yuqi Li, Jiaqi Chen, Haoyuan Li

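TL;DR: Matrix-3D proposes a panoramic-representation-based approach that combines conditional video generation with panoramic 3D reconstruction to generate high-quality, geometrically consistent 3D worlds.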

Details

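Motivation: Existing 3D world generation methods produce scenes of limited scope. Matrix-3D addresses this by using panoramic representation for wide-coverage 3D world generation.

Result: Experiments show that Matrix-3D achieves state-of-the-art performance in panoramic video generation and 3D world generation.

Insight: Panoramic representation is key to improving 3D world generation quality, and a synthetic dataset can effectively support model training.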

Abstract: Explorable 3D world generation from a single image or text prompt forms a cornerstone of spatial intelligence. Recent works utilize video models to achieve wide-scope and generalizable 3D world generation. However, existing approaches often suffer from a limited scope in the generated scenes. In this work, we propose Matrix-3D, a framework that utilizes panoramic representation for wide-coverage omnidirectional explorable 3D world generation and combines conditional video generation with panoramic 3D reconstruction. We first train a trajectory-guided panoramic video diffusion model that employs scene mesh renders as conditions, to enable high-quality and geometrically consistent scene video generation. To lift the panorama scene video to a 3D world, we propose two separate methods: (1) a feed-forward large panorama reconstruction model for rapid 3D scene reconstruction and (2) an optimization-based pipeline for accurate and detailed 3D scene reconstruction. To facilitate effective training, we also introduce the Matrix-Pano dataset, the first large-scale synthetic collection comprising 116K high-quality static panoramic video sequences with depth and trajectory annotations. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance in panoramic video generation and 3D world generation. See more at https://matrix-3d.github.io.

[151] MDD-Net: Multimodal Depression Detection through Mutual Transformer cs.CV | cs.LG | cs.MM | eess.ASPDF

Md Rezwanul Haque, Md. Milon Islam, S M Taslim Uddin Raju, Hamdi Altaheri, Lobna Nassar

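TL;DR: MDD-Net is a multimodal depression detection network that extracts and fuses acoustic and visual features, using a mutual transformer to capture cross-modal relationships, and significantly outperforms existing methods.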

Details

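Motivation: Depression has a profound effect on physical and mental health, but traditional detection relies on clinicians and is inefficient. Multimodal social media data (audio and visual) offers a non-invasive, efficient alternative.

Result: Experiments on the D-Vlog dataset show that MDD-Net improves F1-Score by up to 17.37% over existing methods.

Insight: The mutual transformer performs well at multimodal feature fusion, showing the importance of modeling cross-modal relationships for depression detection.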

Abstract: Depression is a major mental health condition that severely impacts the emotional and physical well-being of individuals. The simple nature of data collection from social media platforms has attracted significant interest in properly utilizing this information for mental health research. A Multimodal Depression Detection Network (MDD-Net), utilizing acoustic and visual data obtained from social media networks, is proposed in this work where mutual transformers are exploited to efficiently extract and fuse multimodal features for efficient depression detection. The MDD-Net consists of four core modules: an acoustic feature extraction module for retrieving relevant acoustic attributes, a visual feature extraction module for extracting significant high-level patterns, a mutual transformer for computing the correlations among the generated features and fusing these features from multiple modalities, and a detection layer for detecting depression using the fused feature representations. The extensive experiments are performed using the multimodal D-Vlog dataset, and the findings reveal that the developed multimodal depression detection network surpasses the state-of-the-art by up to 17.37% for F1-Score, demonstrating the greater performance of the proposed system. The source code is accessible at https://github.com/rezwanh001/Multimodal-Depression-Detection.

[152] TBAC-UniImage: Unified Understanding and Generation by Ladder-Side Diffusion Tuning cs.CVPDF

Junzhe Xu, Yuyang Yin, Xi Chen

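TL;DR: TBAC-UniImage proposes a ladder-side diffusion tuning method that uses multi-layer representations from an MLLM as generative conditions for a diffusion model, achieving a deeper unification of multimodal understanding and generation.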

Details

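Motivation: Existing diffusion-based unified models suffer from either shallow generative conditioning or high computational cost. TBAC-UniImage aims to improve generative ability and depth of integration by exploiting the MLLM's multi-level representations.

Result: TBAC-UniImage performs well on multimodal understanding and generation tasks and achieves a deeper integration of the two capabilities.

Insight: Using multi-level representations can significantly improve generative model performance while avoiding the high cost of training from scratch.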

Abstract: This paper introduces TBAC-UniImage, a novel unified model for multimodal understanding and generation. We achieve this by deeply integrating a pre-trained Diffusion Model, acting as a generative ladder, with a Multimodal Large Language Model (MLLM). Previous diffusion-based unified models face two primary limitations. One approach uses only the MLLM’s final hidden state as the generative condition. This creates a shallow connection, as the generator is isolated from the rich, hierarchical representations within the MLLM’s intermediate layers. The other approach, pretraining a unified generative architecture from scratch, is computationally expensive and prohibitive for many researchers. To overcome these issues, our work explores a new paradigm. Instead of relying on a single output, we use representations from multiple, diverse layers of the MLLM as generative conditions for the diffusion model. This method treats the pre-trained generator as a ladder, receiving guidance from various depths of the MLLM’s understanding process. Consequently, TBAC-UniImage achieves a much deeper and more fine-grained unification of understanding and generation.

[153] GRASPTrack: Geometry-Reasoned Association via Segmentation and Projection for Multi-Object Tracking cs.CV | cs.AIPDF

Xudong Han, Pengcheng Fang, Yueying Tian, Jianhui Yu, Xiaohao Cai

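TL;DR: GRASPTrack is a multi-object tracking method that combines depth awareness with instance segmentation, using 3D geometric reasoning and dynamic noise compensation to improve tracking robustness in complex scenes.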

Details

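Motivation: Conventional tracking-by-detection (TBD) methods struggle with occlusion and depth ambiguity in monocular video because they lack geometric awareness. GRASPTrack addresses these challenges through depth awareness and 3D geometric reasoning.

Result: Experiments on MOT17, MOT20, and DanceTrack show that GRASPTrack significantly improves tracking robustness in complex scenes, especially under occlusion and intricate motion patterns.

Insight: Explicit 3D geometric reasoning compensates for depth ambiguity in monocular video, while dynamic noise compensation and depth-enhanced motion consistency further improve tracking robustness and accuracy.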

Abstract: Multi-object tracking (MOT) in monocular videos is fundamentally challenged by occlusions and depth ambiguity, issues that conventional tracking-by-detection (TBD) methods struggle to resolve owing to a lack of geometric awareness. To address these limitations, we introduce GRASPTrack, a novel depth-aware MOT framework that integrates monocular depth estimation and instance segmentation into a standard TBD pipeline to generate high-fidelity 3D point clouds from 2D detections, thereby enabling explicit 3D geometric reasoning. These 3D point clouds are then voxelized to enable a precise and robust Voxel-Based 3D Intersection-over-Union (IoU) for spatial association. To further enhance tracking robustness, our approach incorporates Depth-aware Adaptive Noise Compensation, which dynamically adjusts the Kalman filter process noise based on occlusion severity for more reliable state estimation. Additionally, we propose a Depth-enhanced Observation-Centric Momentum, which extends the motion direction consistency from the image plane into 3D space to improve motion-based association cues, particularly for objects with complex trajectories. Extensive experiments on the MOT17, MOT20, and DanceTrack benchmarks demonstrate that our method achieves competitive performance, significantly improving tracking robustness in complex scenes with frequent occlusions and intricate motion patterns.
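
The voxel-based 3D IoU mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the voxel size or how occupancy is computed, so this version simply assumes a uniform grid, treats a voxel as occupied if any point falls inside it, and takes intersection over union of the two occupied sets; `voxel_size` and the function names are assumptions.

```python
import math


def voxelize(points, voxel_size):
    """Map 3D points to the set of integer voxel indices they occupy."""
    return {
        tuple(math.floor(coord / voxel_size) for coord in point)
        for point in points
    }


def voxel_iou(points_a, points_b, voxel_size=0.5):
    """Intersection-over-union of the voxel sets occupied by two point clouds."""
    voxels_a = voxelize(points_a, voxel_size)
    voxels_b = voxelize(points_b, voxel_size)
    union = voxels_a | voxels_b
    if not union:
        return 0.0  # Two empty clouds: define IoU as 0 to avoid division by zero.
    return len(voxels_a & voxels_b) / len(union)


# Toy clouds: they share one voxel and each has one voxel of its own.
track_cloud = [(0.1, 0.1, 0.1), (0.6, 0.1, 0.1)]
detection_cloud = [(0.2, 0.2, 0.2), (1.2, 0.1, 0.1)]
iou = voxel_iou(track_cloud, detection_cloud, voxel_size=0.5)
```

In a tracker, a score like this would fill the cost matrix for matching existing tracks to new detections, in place of or alongside the usual 2D box IoU.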

[154] Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control cs.CVPDF

Zeqian Long, Mingzhe Zheng, Kunyu Feng, Xinhua Zhang, Hongyu Liu

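TL;DR: Follow-Your-Shape is a training-free, mask-free method that achieves precise shape editing through trajectory-guided region control while preserving non-target content. Experiments show strong performance on large-scale shape replacement tasks.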

Details

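Motivation: Existing flow-based image editing methods perform poorly on complex shape transformations, struggling to control the target shape precisely or to avoid disturbing the background. The authors propose a new framework to address this.

Result: The method shows superior editability and visual fidelity on shape-aware editing, particularly for large-scale shape replacement.

Insight: The Trajectory Divergence Map (TDM) and the KV injection mechanism enable precise shape editing without disturbing the background, offering a new approach for complex editing tasks.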

Abstract: While recent flow-based image editing models demonstrate general-purpose capabilities across diverse tasks, they often struggle to specialize in challenging scenarios – particularly those involving large-scale shape transformations. When performing such structural edits, these methods either fail to achieve the intended shape change or inadvertently alter non-target regions, resulting in degraded background quality. We propose Follow-Your-Shape, a training-free and mask-free framework that supports precise and controllable editing of object shapes while strictly preserving non-target content. Motivated by the divergence between inversion and editing trajectories, we compute a Trajectory Divergence Map (TDM) by comparing token-wise velocity differences between the inversion and denoising paths. The TDM enables precise localization of editable regions and guides a Scheduled KV Injection mechanism that ensures stable and faithful editing. To facilitate a rigorous evaluation, we introduce ReShapeBench, a new benchmark comprising 120 new images and enriched prompt pairs specifically curated for shape-aware editing. Experiments demonstrate that our method achieves superior editability and visual fidelity, particularly in tasks requiring large-scale shape replacement.
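
As a rough illustration of the Trajectory Divergence Map idea (token-wise velocity differences between the inversion and denoising paths), the sketch below computes a per-token L2 distance at a single timestep, min-max normalizes it, and thresholds it into an editable-region mask. The normalization, the threshold, and the single-timestep simplification are assumptions, not details taken from the paper.

```python
import math


def trajectory_divergence_map(v_inversion, v_denoising):
    """Per-token L2 norm of the velocity difference, scaled to [0, 1]."""
    raw = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(vi, vd)))
        for vi, vd in zip(v_inversion, v_denoising)
    ]
    if not raw:
        return []
    lo, hi = min(raw), max(raw)
    if hi == lo:
        return [0.0] * len(raw)  # No token diverges more than any other.
    return [(r - lo) / (hi - lo) for r in raw]


def editable_mask(tdm, threshold=0.5):
    """Mark tokens whose divergence reaches the (hypothetical) threshold."""
    return [value >= threshold for value in tdm]


# Toy velocities for three tokens with 2-D features.
v_inv = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
v_den = [[0.0, 0.0], [4.0, 5.0], [0.0, 2.0]]
tdm = trajectory_divergence_map(v_inv, v_den)
mask = editable_mask(tdm)
```

Tokens with high divergence are the ones the edit prompt pulls away from the source image, which is why such a map can localize the region to edit without a user-supplied mask.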

[155] Pindrop it! Audio and Visual Deepfake Countermeasures for Robust Detection and Fine Grained-Localization cs.CV | cs.SD | eess.ASPDF

Nicholas Klein, Hemlata Tak, James Fullwood, Krishna Regmi, Leonidas Spinoulas

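TL;DR: The paper presents detection methods for visual and audio deepfake content that performed well in the ACM 1M Deepfakes Detection Challenge, achieving the best performance in the temporal localization task.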

Details

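Motivation: With the rapid progress of visual and audio generation, detecting deepfake content is increasingly urgent, and detecting subtle, localized manipulations is especially challenging.

Result: In the ACM challenge, the methods achieved the best performance in temporal localization and a top-four ranking in classification.

Insight: The paper highlights the importance of combining multimodal information (visual and audio) for robust deepfake detection.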

Abstract: The field of visual and audio generation is burgeoning with new state-of-the-art methods. This rapid proliferation of new techniques underscores the need for robust solutions for detecting synthetic content in videos. In particular, when fine-grained alterations via localized manipulations are performed in visual, audio, or both domains, these subtle modifications add challenges to the detection algorithms. This paper presents solutions for the problems of deepfake video classification and localization. The methods were submitted to the ACM 1M Deepfakes Detection Challenge, achieving the best performance in the temporal localization task and a top four ranking in the classification task for the TestA split of the evaluation dataset.

[156] ReconDreamer-RL: Enhancing Reinforcement Learning via Diffusion-based Scene Reconstruction cs.CVPDF

Chaojun Ni, Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Wenkang Qin

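TL;DR: ReconDreamer-RL uses video diffusion priors, a dynamic adversary agent, and related techniques to improve reinforcement learning for autonomous driving training, significantly narrowing the sim2real gap.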

Details

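Motivation: Existing simulation environments differ significantly from real-world scenes (the sim2real gap), and it is difficult to generate high-quality data covering novel trajectories or corner cases.

Result: Achieves a 5x reduction in collision ratio compared with imitation learning methods, improving autonomous driving training.

Insight: Generating diverse scenes and corner cases can effectively improve the generalization of reinforcement learning for autonomous driving.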

Abstract: Reinforcement learning for training end-to-end autonomous driving models in closed-loop simulations is gaining growing attention. However, most simulation environments differ significantly from real-world conditions, creating a substantial simulation-to-reality (sim2real) gap. To bridge this gap, some approaches utilize scene reconstruction techniques to create photorealistic environments as a simulator. While this improves realistic sensor simulation, these methods are inherently constrained by the distribution of the training data, making it difficult to render high-quality sensor data for novel trajectories or corner case scenarios. Therefore, we propose ReconDreamer-RL, a framework designed to integrate video diffusion priors into scene reconstruction to aid reinforcement learning, thereby enhancing end-to-end autonomous driving training. Specifically, in ReconDreamer-RL, we introduce ReconSimulator, which combines the video diffusion prior for appearance modeling and incorporates a kinematic model for physical modeling, thereby reconstructing driving scenarios from real-world data. This narrows the sim2real gap for closed-loop evaluation and reinforcement learning. To cover more corner-case scenarios, we introduce the Dynamic Adversary Agent (DAA), which adjusts the trajectories of surrounding vehicles relative to the ego vehicle, autonomously generating corner-case traffic scenarios (e.g., cut-in). Finally, the Cousin Trajectory Generator (CTG) is proposed to address the issue of training data distribution, which is often biased toward simple straight-line movements. Experiments show that ReconDreamer-RL improves end-to-end autonomous driving training, outperforming imitation learning methods with a 5x reduction in the Collision Ratio.

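[157] MedReasoner: Reinforcement Learning Drives Reasoning Grounding from Clinical Thought to Pixel-Level Precision cs.CV | cs.AI

Zhonghao Yan, Muxi Diao, Yuxuan Yang, Jiayuan Xu, Kaizhou Zhang

TL;DR: MedReasoner is a modular framework that uses reinforcement learning to couple clinical reasoning with pixel-level grounding in medical imaging. The authors also release a new dataset, U-MRG-14K, on which the method reaches state-of-the-art performance.

Details

Motivation: Current multimodal large language models for medical imaging rely on supervised fine-tuning with explicit spatial hints, so they handle the implicit queries common in clinical practice poorly.

Result: MedReasoner achieves SOTA performance on U-MRG-14K and generalizes well to unseen clinical queries.

Insight: Reinforcement learning shows promise for medical imaging tasks, and a modular design offers an effective way to decouple reasoning from grounding.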

Abstract: Accurately grounding regions of interest (ROIs) is critical for diagnosis and treatment planning in medical imaging. While multimodal large language models (MLLMs) combine visual perception with natural language, current medical-grounding pipelines still rely on supervised fine-tuning with explicit spatial hints, making them ill-equipped to handle the implicit queries common in clinical practice. This work makes three core contributions. We first define Unified Medical Reasoning Grounding (UMRG), a novel vision-language task that demands clinical reasoning and pixel-level grounding. Second, we release U-MRG-14K, a dataset of 14K samples featuring pixel-level masks alongside implicit clinical queries and reasoning traces, spanning 10 modalities, 15 super-categories, and 108 specific categories. Finally, we introduce MedReasoner, a modular framework that distinctly separates reasoning from segmentation: an MLLM reasoner is optimized with reinforcement learning, while a frozen segmentation expert converts spatial prompts into masks, with alignment achieved through format and accuracy rewards. MedReasoner achieves state-of-the-art performance on U-MRG-14K and demonstrates strong generalization to unseen clinical queries, underscoring the significant promise of reinforcement learning for interpretable medical grounding.
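
The format and accuracy rewards mentioned in the abstract might look roughly like the sketch below. The abstract does not define the rewards, so the `<box>` output format, the IoU-based accuracy term, and the function names `box_iou` and `reward` are all hypothetical choices made for illustration.

```python
import re

# Hypothetical output format: the reasoner must emit "<box>x1,y1,x2,y2</box>".
BOX_PATTERN = re.compile(r"<box>(\d+),(\d+),(\d+),(\d+)</box>")


def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def reward(response, target_box):
    """Format reward (1 if a box can be parsed) plus accuracy reward (IoU)."""
    match = BOX_PATTERN.search(response)
    if match is None:
        return 0.0  # malformed output earns neither reward
    predicted = tuple(int(v) for v in match.groups())
    return 1.0 + box_iou(predicted, target_box)
```

Under these assumptions, a response with no parsable box scores 0, and a perfectly placed box scores 2 (one point for format, one for overlap).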

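[158] PP-Motion: Physical-Perceptual Fidelity Evaluation for Human Motion Generation cs.CV | cs.MM

Sihan Zhao, Zixuan Wang, Tianyu Luan, Jia Jia, Wentao Zhu

TL;DR: PP-Motion is a new data-driven metric for evaluating the physical and perceptual fidelity of generated human motion. It combines physical annotations with a human perceptual loss to give a fine-grained, objective assessment of motion fidelity.

Details

Motivation: Existing evaluation of generated human motion relies on subjective, coarse labels, and there is no unified standard bridging physical feasibility and human-perceived fidelity. A fine-grained evaluation method that combines physical constraints with human perception is therefore needed.

Result: Experiments show that PP-Motion aligns with physical laws and aligns better with human perception of motion fidelity than previous methods.

Insight: Fine-grained evaluation that combines physical constraints with human perception is key to improving human motion generation, and physical annotations give data-driven metrics an objective basis.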

Abstract: Human motion generation has found widespread applications in AR/VR, film, sports, and medical rehabilitation, offering a cost-effective alternative to traditional motion capture systems. However, evaluating the fidelity of such generated motions is a crucial, multifaceted task. Although previous approaches have attempted motion fidelity evaluation using human perception or physical constraints, there remains an inherent gap between human-perceived fidelity and physical feasibility. Moreover, the subjective and coarse binary labeling of human perception further undermines the development of a robust data-driven metric. We address these issues by introducing a physical labeling method. This method evaluates motion fidelity by calculating the minimum modifications needed for a motion to align with physical laws. With this approach, we are able to produce fine-grained, continuous physical alignment annotations that serve as objective ground truth. With these annotations, we propose PP-Motion, a novel data-driven metric to evaluate both physical and perceptual fidelity of human motion. To effectively capture underlying physical priors, we employ Pearson’s correlation loss for the training of our metric. Additionally, by incorporating a human-based perceptual fidelity loss, our metric can capture fidelity that simultaneously considers both human perception and physical alignment. Experimental results demonstrate that our metric, PP-Motion, not only aligns with physical laws but also aligns better with human perception of motion fidelity than previous work.
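
A Pearson's correlation loss can be illustrated with the minimal sketch below. The abstract only names the loss, so the `1 - r` form and the function name `pearson_correlation_loss` are assumptions here; the paper's exact formulation may differ.

```python
import math


def pearson_correlation_loss(predicted, target):
    """Return 1 - r, where r is the Pearson correlation of the two sequences."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    denom = math.sqrt(var_p * var_t)
    if denom == 0:
        # Correlation is undefined for constant input; treat it as r = 0.
        return 1.0
    return 1.0 - cov / denom
```

Because the loss depends only on correlation, it rewards metric scores that rank and scale with the physical annotations without requiring the two to share units or offsets.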

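[159] THAT: Token-wise High-frequency Augmentation Transformer for Hyperspectral Pansharpening cs.CV | eess.IV

Hongkun Jin, Hongcheng Jiang, Zejun Zhang, Yuan Zhang, Jia Fu

TL;DR: THAT is a Transformer for hyperspectral pansharpening that addresses token redundancy and attention dispersion in ViTs through better high-frequency feature representation and token selection.

Details

Motivation: Transformers perform well in hyperspectral pansharpening but suffer from redundant tokens and insufficient multi-scale feature modeling. ViTs also struggle to preserve high-frequency components and disperse attention across redundant tokens, which degrades reconstruction quality.

Result: On standard benchmarks, THAT achieves state-of-the-art performance in both reconstruction quality and efficiency.

Insight: Focusing on high-frequency features and token selection can substantially improve Transformer performance in hyperspectral pansharpening.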

Abstract: Transformer-based methods have demonstrated strong potential in hyperspectral pansharpening by modeling long-range dependencies. However, their effectiveness is often limited by redundant token representations and a lack of multi-scale feature modeling. Hyperspectral images exhibit intrinsic spectral priors (e.g., abundance sparsity) and spatial priors (e.g., non-local similarity), which are critical for accurate reconstruction. From a spectral-spatial perspective, Vision Transformers (ViTs) face two major limitations: they struggle to preserve high-frequency components (such as material edges and texture transitions) and suffer from attention dispersion across redundant tokens. These issues stem from the global self-attention mechanism, which tends to dilute high-frequency signals and overlook localized details. To address these challenges, we propose the Token-wise High-frequency Augmentation Transformer (THAT), a novel framework designed to enhance hyperspectral pansharpening through improved high-frequency feature representation and token selection. Specifically, THAT introduces: (1) Pivotal Token Selective Attention (PTSA) to prioritize informative tokens and suppress redundancy; (2) a Multi-level Variance-aware Feed-forward Network (MVFN) to enhance high-frequency detail learning. Experiments on standard benchmarks show that THAT achieves state-of-the-art performance with improved reconstruction quality and efficiency. The source code is available at https://github.com/kailuo93/THAT.

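[160] Reinforcement Learning in Vision: A Survey cs.CV

Weijia Wu, Chen Gao, Joya Chen, Kevin Qinghong Lin, Qingwei Meng

TL;DR: This survey covers reinforcement learning (RL) for visual intelligence. It summarizes recent progress at the intersection of RL and vision, including problem formulation, the evolution of methods, four thematic pillars (multimodal large language models, visual generation, unified model frameworks, and vision-language-action models) and their trends, evaluation protocols, and open challenges.

Details

Motivation: With RL and visual intelligence both advancing quickly, the field needs a systematic review of their intersection that gives researchers and practitioners a clear map and highlights future research directions.

Result: The survey distills current trends in visual RL (such as curriculum-driven training, preference-aligned diffusion, and unified reward modeling), reviews evaluation protocols, and identifies open challenges including sample efficiency, generalization, and safe deployment.

Insight: Future directions likely include improving sample efficiency, strengthening generalization, and addressing safe deployment. Unified reward modeling and cross-modal task integration are potentially important trends.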

Abstract: Recent advances at the intersection of reinforcement learning (RL) and visual intelligence have enabled agents that not only perceive complex visual scenes but also reason, generate, and act within them. This survey offers a critical and up-to-date synthesis of the field. We first formalize visual RL problems and trace the evolution of policy-optimization strategies from RLHF to verifiable reward paradigms, and from Proximal Policy Optimization to Group Relative Policy Optimization. We then organize more than 200 representative works into four thematic pillars: multi-modal large language models, visual generation, unified model frameworks, and vision-language-action models. For each pillar we examine algorithmic design, reward engineering, benchmark progress, and we distill trends such as curriculum-driven training, preference-aligned diffusion, and unified reward modeling. Finally, we review evaluation protocols spanning set-level fidelity, sample-level preference, and state-level stability, and we identify open challenges that include sample efficiency, generalization, and safe deployment. Our goal is to provide researchers and practitioners with a coherent map of the rapidly expanding landscape of visual RL and to highlight promising directions for future inquiry. Resources are available at: https://github.com/weijiawu/Awesome-Visual-Reinforcement-Learning.
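
As a small illustration of the shift from Proximal Policy Optimization to Group Relative Policy Optimization mentioned in the abstract, the sketch below computes group-relative advantages: each sampled response is scored against the mean and standard deviation of its own group instead of a learned value function. The function name, the use of the population standard deviation, and the `eps` stabilizer are implementation assumptions rather than details taken from the survey.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the statistics of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; a sample std also works
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, with verifiable rewards `[1.0, 0.0, 1.0, 0.0]` for four responses to the same prompt, the correct responses receive advantages close to +1 and the incorrect ones close to -1.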

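[161] Spatial-ORMLLM: Improve Spatial Relation Understanding in the Operating Room with Multimodal Large Language Model cs.CV

Peiqi He, Zhenhao Zhang, Yixiang Zhang, Xiongjun Zhao, Shaoliang Peng

TL;DR: Spatial-ORMLLM is a multimodal large language model that infers 3D spatial relations in the operating room from the RGB modality alone, without extra sensors or expert annotations. It supports detailed spatial reasoning for downstream medical tasks and outperforms existing methods.

Details

Motivation: Existing approaches to operating-room spatial modeling overlook the 3D capabilities of multimodal large language models (MLLMs). They either depend on multimodal 3D data that is hard to obtain or use only 2D data and lose detail, so a solution that needs only the RGB modality is called for.

Result: The model achieves SOTA performance on multiple clinical datasets and generalizes to unseen surgical scenarios and downstream tasks.

Insight: With RGB input and a feature-fusion design alone, the model can capture complex 3D spatial semantics, giving clinical tasks fuller context while lowering data acquisition cost.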

Abstract: Precise spatial modeling in the operating room (OR) is foundational to many clinical tasks, supporting intraoperative awareness, hazard avoidance, and surgical decision-making. While existing approaches leverage large-scale multimodal datasets for latent-space alignment to implicitly learn spatial relationships, they overlook the 3D capabilities of MLLMs. However, this approach raises two issues: (1) Operating rooms typically lack multiple video and audio sensors, making multimodal 3D data difficult to obtain; (2) Training solely on readily available 2D data fails to capture fine-grained details in complex scenes. To address this gap, we introduce Spatial-ORMLLM, the first large vision-language model for 3D spatial reasoning in operating rooms using only RGB modality to infer volumetric and semantic cues, enabling downstream medical tasks with detailed and holistic spatial context. Spatial-ORMLLM incorporates a Spatial-Enhanced Feature Fusion Block, which integrates 2D modality inputs with rich 3D spatial knowledge extracted by the estimation algorithm and then feeds the combined features into the visual tower. By employing a unified end-to-end MLLM framework, it combines powerful spatial features with textual features to deliver robust 3D scene reasoning without any additional expert annotations or sensor inputs. Experiments on multiple benchmark clinical datasets demonstrate that Spatial-ORMLLM achieves state-of-the-art performance and generalizes robustly to previously unseen surgical scenarios and downstream tasks.

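[162] SAGOnline: Segment Any Gaussians Online cs.CV

Wentao Sun, Quanyun Wu, Hanqing Xu, Kyle Gao, Zhengsen Xu

TL;DR: SAGOnline is a lightweight, zero-shot framework for real-time 3D segmentation. Using a decoupled strategy and a GPU-accelerated algorithm, it addresses segmentation efficiency and consistency in 3D Gaussian scenes and significantly outperforms existing methods in both accuracy and speed.

Details

Motivation: 3D Gaussian Splatting (3DGS) excels at explicit 3D scene representation, but existing 3D segmentation methods suffer from high computational cost, limited 3D spatial reasoning, and difficulty tracking multiple objects.

Result: SAGOnline reaches SOTA on the NVOS (92.7% mIoU) and Spin-NeRF (95.2% mIoU) benchmarks, with inference 15–1500 times faster than prior methods (27 ms/frame).

Insight: SAGOnline shows the potential of transferring 2D models to 3D scenes, opens a new direction for real-time rendering and scene understanding, and moves AR/VR and robotics applications closer to practical use.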

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful paradigm for explicit 3D scene representation, yet achieving efficient and consistent 3D segmentation remains challenging. Current methods suffer from prohibitive computational costs, limited 3D spatial reasoning, and an inability to track multiple objects simultaneously. We present Segment Any Gaussians Online (SAGOnline), a lightweight and zero-shot framework for real-time 3D segmentation in Gaussian scenes that addresses these limitations through two key innovations: (1) a decoupled strategy that integrates video foundation models (e.g., SAM2) for view-consistent 2D mask propagation across synthesized views; and (2) a GPU-accelerated 3D mask generation and Gaussian-level instance labeling algorithm that assigns unique identifiers to 3D primitives, enabling lossless multi-object tracking and segmentation across views. SAGOnline achieves state-of-the-art performance on NVOS (92.7% mIoU) and Spin-NeRF (95.2% mIoU) benchmarks, outperforming Feature3DGS, OmniSeg3D-gs, and SA3D by 15–1500 times in inference speed (27 ms/frame). Qualitative results demonstrate robust multi-object segmentation and tracking in complex scenes. Our contributions include: (i) a lightweight and zero-shot framework for 3D segmentation in Gaussian scenes, (ii) explicit labeling of Gaussian primitives enabling simultaneous segmentation and tracking, and (iii) the effective adaptation of 2D video foundation models to the 3D domain. This work allows real-time rendering and 3D scene understanding, paving the way for practical AR/VR and robotic applications.
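
The mIoU figures quoted above are mean intersection-over-union scores. A minimal sketch of that metric for binary masks follows; representing masks as nested lists of 0/1 values, and scoring two empty masks as 1.0, are conventions chosen here for illustration, not details of the benchmarks' evaluation code.

```python
def mask_iou(pred, truth):
    """IoU of two binary masks given as equally sized lists of 0/1 rows."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over (predicted_mask, ground_truth_mask) pairs."""
    return sum(mask_iou(p, t) for p, t in pairs) / len(pairs)
```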

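[163] Learning User Preferences for Image Generation Model cs.CV

Wenyi Mo, Ying Ba, Tianyu Zhang, Yalong Bai, Biye Li

TL;DR: This paper proposes a method built on multimodal large language models that learns personalized user preferences through a contrastive preference loss and preference tokens, improving prediction of which generated images a user will like.

Details

Motivation: Existing methods typically rely on general human preferences or static user profiles. They neglect individual variability and the dynamic, multifaceted nature of personal taste, which limits how well image generation models can meet personalized needs.

Result: Experiments show that the model outperforms other methods in preference prediction accuracy and can identify groups of users with similar aesthetic inclinations.

Insight: Through contrastive learning and shared representation learning, the model better captures the dynamics and diversity of user preferences, giving more precise guidance for personalized image generation.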

Abstract: User preference prediction requires a comprehensive and accurate understanding of individual tastes. This includes both surface-level attributes, such as color and style, and deeper content-related aspects, such as themes and composition. However, existing methods typically rely on general human preferences or assume static user profiles, often neglecting individual variability and the dynamic, multifaceted nature of personal taste. To address these limitations, we propose an approach built upon Multimodal Large Language Models, introducing contrastive preference loss and preference tokens to learn personalized user preferences from historical interactions. The contrastive preference loss is designed to effectively distinguish between user "likes" and "dislikes", while the learnable preference tokens capture shared interest representations among existing users, enabling the model to activate group-specific preferences and enhance consistency across similar users. Extensive experiments demonstrate our model outperforms other methods in preference prediction accuracy, effectively identifying users with similar aesthetic inclinations and providing more precise guidance for generating images that align with individual tastes. The project page is https://learn-user-pref.github.io/.

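[164] Cut2Next: Generating Next Shot via In-Context Tuning cs.CV | cs.AI

Jingwen He, Hongbo Liu, Jiajun Li, Ziqi Huang, Yu Qiao

TL;DR: The paper introduces the Next Shot Generation (NSG) task and the Cut2Next framework, which uses a Diffusion Transformer and a hierarchical multi-prompting strategy to generate high-quality next shots that follow cinematic editing patterns and continuity.

Details

Motivation: Current multi-shot generation methods focus on basic visual consistency and neglect the editing patterns that drive narrative (e.g., shot/reverse shot, cutaways), so their outputs lack narrative sophistication and cinematic integrity.

Result: Experiments show that Cut2Next excels in visual consistency and text fidelity, and user studies confirm that its shots follow the intended editing patterns and cinematic continuity.

Insight: Hierarchical prompting and context-aware condition injection are key to generating high-quality, narratively rich cinematic shots.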

Abstract: Effective multi-shot generation demands purposeful, film-like transitions and strict cinematic continuity. Current methods, however, often prioritize basic visual consistency, neglecting crucial editing patterns (e.g., shot/reverse shot, cutaways) that drive narrative flow for compelling storytelling. This yields outputs that may be visually coherent but lack narrative sophistication and true cinematic integrity. To bridge this, we introduce Next Shot Generation (NSG): synthesizing a subsequent, high-quality shot that critically conforms to professional editing patterns while upholding rigorous cinematic continuity. Our framework, Cut2Next, leverages a Diffusion Transformer (DiT). It employs in-context tuning guided by a novel Hierarchical Multi-Prompting strategy. This strategy uses Relational Prompts to define overall context and inter-shot editing styles. Individual Prompts then specify per-shot content and cinematographic attributes. Together, these guide Cut2Next to generate cinematically appropriate next shots. Architectural innovations, Context-Aware Condition Injection (CACI) and Hierarchical Attention Mask (HAM), further integrate these diverse signals without introducing new parameters. We construct RawCuts (large-scale) and CuratedCuts (refined) datasets, both with hierarchical prompts, and introduce CutBench for evaluation. Experiments show Cut2Next excels in visual consistency and text fidelity. Crucially, user studies reveal a strong preference for Cut2Next, particularly for its adherence to intended editing patterns and overall cinematic continuity, validating its ability to generate high-quality, narratively expressive, and cinematically coherent subsequent shots.

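[165] StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation cs.CV

Shuyuan Tu, Yueming Pan, Yinming Huang, Xintong Han, Zhen Xing

TL;DR: The paper proposes StableAvatar, an end-to-end video diffusion transformer that generates infinite-length, high-quality audio-driven avatar videos while addressing audio synchronization and identity consistency.

Details

Motivation: Existing diffusion models struggle to synthesize long audio-driven videos and have problems with audio synchronization and identity consistency.

Result: Benchmark experiments demonstrate the effectiveness of StableAvatar both qualitatively and quantitatively.

Insight: Directly injecting audio embeddings causes latent distribution errors to accumulate, whereas time-step-aware modulation and a dynamic guidance signal markedly improve long video generation.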

Abstract: Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length high-quality videos without post-processing. Conditioned on a reference image and audio, StableAvatar integrates tailored training and inference modules to enable infinite-length video generation. We observe that the main reason preventing existing models from generating long videos lies in their audio modeling. They typically rely on third-party off-the-shelf extractors to obtain audio embeddings, which are then directly injected into the diffusion model via cross-attention. Since current diffusion backbones lack any audio-related priors, this approach causes severe latent distribution error accumulation across video clips, leading the latent distribution of subsequent segments to drift away from the optimal distribution gradually. To address this, StableAvatar introduces a novel Time-step-aware Audio Adapter that prevents error accumulation via time-step-aware modulation. During inference, we propose a novel Audio Native Guidance Mechanism to further enhance the audio synchronization by leveraging the diffusion’s own evolving joint audio-latent prediction as a dynamic guidance signal. To enhance the smoothness of the infinite-length videos, we introduce a Dynamic Weighted Sliding-window Strategy that fuses latent over time. Experiments on benchmarks show the effectiveness of StableAvatar both qualitatively and quantitatively.
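
To make the idea of fusing latents across sliding windows concrete, here is a generic cross-fade sketch. It is not the paper's Dynamic Weighted Sliding-window Strategy, whose weighting the abstract does not specify: the linear ramp, the scalar per-frame latents, and the function name `fuse_windows` are all simplifying assumptions.

```python
def fuse_windows(windows, overlap):
    """Join consecutive windows of per-frame latents, cross-fading the overlap.

    Each window is a list of floats (stand-ins for latent tensors), and each
    window after the first shares its first `overlap` frames with the
    previous window's last `overlap` frames.
    """
    fused = list(windows[0])
    for window in windows[1:]:
        for i in range(overlap):
            # Weight of the new window ramps up linearly across the overlap.
            w = (i + 1) / (overlap + 1)
            idx = len(fused) - overlap + i
            fused[idx] = (1 - w) * fused[idx] + w * window[i]
        fused.extend(window[overlap:])
    return fused
```

Blending the shared frames, instead of cutting hard between windows, is what smooths the transition from one generated clip to the next.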

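[166] ReferSplat: Referring Segmentation in 3D Gaussian Splatting cs.CV

Shuting He, Guangquan Jie, Changshuo Wang, Yun Zhou, Shuming Hu

TL;DR: The paper proposes ReferSplat, a framework for segmenting target objects in 3D Gaussian scenes from natural language descriptions (the R3DGS task), and builds Ref-LERF, the first dataset for this task.

Details

Motivation: Segmenting objects in 3D scenes from natural language, especially under occlusion and when descriptions involve spatial relationships, is a challenging capability that is needed to advance embodied AI.

Result: ReferSplat achieves state-of-the-art performance on both the R3DGS task and 3D open-vocabulary segmentation benchmarks.

Insight: 3D multimodal understanding and spatial relationship modeling are the key challenges for language-guided segmentation in 3D Gaussian scenes.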

Abstract: We introduce Referring 3D Gaussian Splatting Segmentation (R3DGS), a new task that aims to segment target objects in a 3D Gaussian scene based on natural language descriptions, which often contain spatial relationships or object attributes. This task requires the model to identify newly described objects that may be occluded or not directly visible in a novel view, posing a significant challenge for 3D multi-modal understanding. Developing this capability is crucial for advancing embodied AI. To support research in this area, we construct the first R3DGS dataset, Ref-LERF. Our analysis reveals that 3D multi-modal understanding and spatial relationship modeling are key challenges for R3DGS. To address these challenges, we propose ReferSplat, a framework that explicitly models 3D Gaussian points with natural language expressions in a spatially aware paradigm. ReferSplat achieves state-of-the-art performance on both the newly proposed R3DGS task and 3D open-vocabulary segmentation benchmarks. Dataset and code are available at https://github.com/heshuting555/ReferSplat.

cs.CL

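[167] Retrieval augmented generation based dynamic prompting for few-shot biomedical named entity recognition using large language models cs.CL | cs.AI

Yao Ge, Sudeshna Das, Yuting Guo, Abeed Sarker

TL;DR: This paper proposes a retrieval-augmented generation (RAG) based dynamic prompting strategy for few-shot biomedical named entity recognition (NER), which dynamically updates the in-context learning examples to improve large language model (LLM) performance.

Details

Motivation: Biomedical NER is an important NLP task, but LLM performance is limited in few-shot settings. The authors aim to address this limitation with a dynamic prompting strategy.

Result: Experiments cover five biomedical NER datasets. Structured static prompting raised average F1 by 12% for GPT-4 and by 11% for GPT-3.5 and LLaMA 3-70B relative to basic static prompting. Dynamic prompting added a further 7.3% (5-shot) and 5.6% (10-shot).

Insight: RAG-based dynamic prompting can substantially improve few-shot biomedical NER, and contextually adaptive prompts have real practical value.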

Abstract: Biomedical named entity recognition (NER) is a high-utility natural language processing (NLP) task, and large language models (LLMs) show promise particularly in few-shot settings (i.e., limited training data). In this article, we address the performance challenges of LLMs for few-shot biomedical NER by investigating a dynamic prompting strategy involving retrieval-augmented generation (RAG). In our approach, the annotated in-context learning examples are selected based on their similarities with the input texts, and the prompt is dynamically updated for each instance during inference. We implemented and optimized static and dynamic prompt engineering techniques and evaluated them on five biomedical NER datasets. Static prompting with structured components increased average F1-scores by 12% for GPT-4, and 11% for GPT-3.5 and LLaMA 3-70B, relative to basic static prompting. Dynamic prompting further improved performance, with TF-IDF and SBERT retrieval methods yielding the best results, improving average F1-scores by 7.3% and 5.6% in 5-shot and 10-shot settings, respectively. These findings highlight the utility of contextually adaptive prompts via RAG for biomedical NER.
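
The retrieval step described above, picking the annotated examples most similar to each input text, can be sketched with a small TF-IDF ranker. This is an illustration under stated assumptions, not the authors' implementation: whitespace tokenization, the smoothed IDF formula, and the function names are all choices made here.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Build sparse TF-IDF vectors (dicts) using a smoothed IDF variant."""
    tokenized = [t.lower().split() for t in texts]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: count * (math.log((1 + n) / (1 + df[term])) + 1)
                        for term, count in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query, pool, k):
    """Return the k pool texts most similar to the query."""
    vectors = tfidf_vectors(pool + [query])
    query_vec = vectors[-1]
    ranked = sorted(range(len(pool)),
                    key=lambda i: cosine(query_vec, vectors[i]), reverse=True)
    return [pool[i] for i in ranked[:k]]
```

In a dynamic prompting loop, `select_examples` would be called once per input sentence, and the returned examples (with their annotations) would be placed in the prompt before the sentence to be labeled.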

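[168] The Art of Breaking Words: Rethinking Multilingual Tokenizer Design cs.CL | cs.AI

Aamod Thakur, Ajay Nagpal, Atharva Savarkar, Kundeshwar Pundalik, Siddhesh Dosi

TL;DR: The paper studies tokenization in multilingual settings and proposes a new data composition algorithm that lowers the token-to-word ratio and improves model performance and inference speed.

Details

Motivation: Existing tokenizers are inefficient in multilingual settings, particularly for languages with high script diversity and orthographic complexity such as those written in Indic scripts, so tokenization strategy needs to be rethought.

Result: The data composition algorithm reduces the average token-to-word ratio by about 6% relative to conventional data randomization. The resulting tokenizer improves the average token-to-word ratio by more than 40% over state-of-the-art multilingual Indic models, with measurable gains in model performance and inference speed.

Insight: Tokenization is a critical lever for building efficient, scalable multilingual LLMs, on par with model architecture and training objectives.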

Abstract: While model architecture and training objectives are well-studied, tokenization, particularly in multilingual contexts, remains a relatively neglected aspect of Large Language Model (LLM) development. Existing tokenizers often exhibit high token-to-word ratios, inefficient use of context length, and slower inference. We present a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality. To ground our analysis in a linguistically diverse context, we conduct extensive experiments on Indic scripts, which present unique challenges due to their high script diversity and orthographic complexity. Drawing on the insights from these analyses, we propose a novel algorithm for data composition that balances multilingual data for tokenizer training. Our observations on pretokenization strategies significantly improve model performance, and our data composition algorithm reduces the average token-to-word ratio by approximately 6% with respect to the conventional data randomization approach. Our tokenizer achieves more than 40% improvement on average token-to-word ratio against state-of-the-art multilingual Indic models. This improvement yields measurable gains in both model performance and inference speed. This highlights tokenization alongside architecture and training objectives as a critical lever for building efficient, scalable multilingual LLMs.
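
The token-to-word ratio that the paper optimizes is simple to compute, as the sketch below shows. Counting words by whitespace splitting, the toy `char_pair_tokenizer`, and the helper names are assumptions for illustration; the paper's own measurement setup is not described in the abstract.

```python
def token_to_word_ratio(texts, tokenize):
    """Total tokens divided by total whitespace-separated words."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words


def relative_reduction(baseline, new):
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline


def char_pair_tokenizer(text):
    """Toy tokenizer: split each word into 2-character pieces (illustration only)."""
    return [word[i:i + 2] for word in text.split() for i in range(0, len(word), 2)]
```

A lower ratio means fewer tokens per word, so the same text uses less of the context window and needs fewer decoding steps. Under this definition, moving from a ratio of 2.00 to 1.88 would be a 6% reduction.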

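[169] BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent cs.CL | cs.IR

Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou

TL;DR: BrowseComp-Plus is an improved benchmark that addresses the fairness and transparency shortcomings of existing evaluations. With a fixed corpus and human-verified supporting documents, it enables fair evaluation and analysis of deep-research agents.

Details

Motivation: Existing benchmarks such as BrowseComp depend on dynamic, opaque web search APIs. This limits fair comparison and reproducibility and makes it impossible to isolate the retriever's contribution, which hinders accurate assessment of deep-research agents.

Result: The open-source model Search-R1 paired with a BM25 retriever reaches 3.86% accuracy, while GPT-5 reaches 55.9%. Pairing GPT-5 with the Qwen3-Embedding-8B retriever raises its accuracy to 70.1%.

Insight: BrowseComp-Plus distinguishes the performance of different deep-research systems and sheds light on how retrieval effectiveness, citation accuracy, and context engineering affect system performance.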

Abstract: Deep-Research agents, which integrate large language models (LLMs) with search tools, have shown success in improving the effectiveness of handling complex queries that require iterative search planning and reasoning over search results. Evaluations on current benchmarks like BrowseComp, which rely on black-box live web search APIs, have notable limitations in (1) fairness: dynamic and opaque web APIs hinder fair comparisons and reproducibility of deep research methods; (2) transparency: lack of control over the document corpus makes it difficult to isolate retriever contributions. In other words, the current evaluations may compare a complete deep research system at a given time, but they do not foster well-controlled experiments to provide insights into the capability of underlying deep research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus. Each query in BrowseComp-Plus includes human-verified supporting documents and mined challenging negatives, enabling controlled experimentation. The benchmark is shown to be effective in distinguishing the performance of deep research systems. For instance, the open-source model Search-R1, when paired with the BM25 retriever, achieves 3.86% accuracy, whereas GPT-5 achieves 55.9%. Integrating GPT-5 with the Qwen3-Embedding-8B retriever further enhances its accuracy to 70.1% with fewer search calls. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods, fostering insights into retrieval effectiveness, citation accuracy, and context engineering in Deep-Research systems.
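
For readers unfamiliar with the BM25 retriever used as a baseline above, the sketch below scores documents in a fixed corpus with Okapi BM25. It is a generic illustration, not the benchmark's retrieval code: whitespace tokenization, the `k1` and `b` defaults, and the `+ 1` inside the logarithm (a common non-negative IDF variant) are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    docs = [d.lower().split() for d in documents]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Because the corpus is fixed, the same query always yields the same ranking, which is the reproducibility property a live web search API cannot offer.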

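[170] SEVADE: Self-Evolving Multi-Agent Analysis with Decoupled Evaluation for Hallucination-Resistant Irony Detection cs.CL | cs.MA

Ziqi Liu, Yangbin Chen, Ziyang Zhou, Yilin Li, Mingxuan Hu

TL;DR: The paper proposes SEVADE, a framework for hallucination-resistant irony detection that uses self-evolving multi-agent analysis and decoupled evaluation to improve accuracy and reliability.

Details

Motivation: Existing large language model methods are limited by single-perspective analysis, static reasoning pathways, and susceptibility to hallucination when handling complex ironic rhetoric, which hurts their accuracy and reliability. SEVADE is proposed to address these problems.

Result: On four benchmark datasets, SEVADE improves Accuracy by 6.75% and Macro-F1 by 6.29% on average, reaching state-of-the-art performance.

Insight: Decoupling complex reasoning from the final classification can reduce hallucination, while dynamic collaboration among multiple agents gives a more comprehensive view of the text.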

Abstract: Sarcasm detection is a crucial yet challenging Natural Language Processing task. Existing Large Language Model methods are often limited by single-perspective analysis, static reasoning pathways, and a susceptibility to hallucination when processing complex ironic rhetoric, which impacts their accuracy and reliability. To address these challenges, we propose SEVADE, a novel Self-Evolving multi-agent Analysis framework with Decoupled Evaluation for hallucination-resistant sarcasm detection. The core of our framework is a Dynamic Agentive Reasoning Engine (DARE), which utilizes a team of specialized agents grounded in linguistic theory to perform a multifaceted deconstruction of the text and generate a structured reasoning chain. Subsequently, a separate lightweight rationale adjudicator (RA) performs the final classification based solely on this reasoning chain. This decoupled architecture is designed to mitigate the risk of hallucination by separating complex reasoning from the final judgment. Extensive experiments on four benchmark datasets demonstrate that our framework achieves state-of-the-art performance, with average improvements of 6.75% in Accuracy and 6.29% in Macro-F1 score.

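[171] Annotating Errors in English Learners’ Written Language Production: Advancing Automated Written Feedback Systems cs.CL

Steven Coyne, Diana Galvan-Sosa, Ryan Spring, Camélia Guerraoui, Michael Zock

TL;DR: The paper proposes an annotation framework for English learners' writing errors. It aims to move automated writing evaluation feedback from direct correction toward explanations and hints informed by error type and generalizability.

Details

Motivation: Current automated writing evaluation systems favor direct correction and overlook the value of explanations and hints for language learning. The paper annotates error type and generalizability in order to generate feedback that better supports learning.

Result: Human teachers assessed each system's feedback on relevance, factuality, and comprehensibility. The paper reports the comparative performance of keyword-guided, keyword-free, and template-guided generation methods.

Insight: Indirect hints (such as explanations and general rules) may support language learning better than direct correction, especially for generalizable grammatical rules.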

Abstract: Recent advances in natural language processing (NLP) have contributed to the development of automated writing evaluation (AWE) systems that can correct grammatical errors. However, while these systems are effective at improving text, they are not optimally designed for language learning. They favor direct revisions, often with a click-to-fix functionality that can be applied without considering the reason for the correction. Meanwhile, depending on the error type, learners may benefit most from simple explanations and strategically indirect hints, especially on generalizable grammatical rules. To support the generation of such feedback, we introduce an annotation framework that models each error’s error type and generalizability. For error type classification, we introduce a typology focused on inferring learners’ knowledge gaps by connecting their errors to specific grammatical patterns. Following this framework, we collect a dataset of annotated learner errors and corresponding human-written feedback comments, each labeled as a direct correction or hint. With this data, we evaluate keyword-guided, keyword-free, and template-guided methods of generating feedback using large language models (LLMs). Human teachers examined each system’s outputs, assessing them on grounds including relevance, factuality, and comprehensibility. We report on the development of the dataset and the comparative performance of the systems investigated.

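[172] BharatBBQ: A Multilingual Bias Benchmark for Question Answering in the Indian Context cs.CL

Aditya Tomar, Nihar Ranjan Sahoo, Pushpak Bhattacharyya

TL;DR: The paper introduces BharatBBQ, a bias benchmark for India's multilingual setting that fills a gap left by existing benchmarks in the Indian cultural context.

Details

Motivation: Existing bias benchmarks such as BBQ focus on Western contexts and adapt poorly to India's cultural diversity, so the authors propose a benchmark designed for the Indian multilingual setting.

Result: Biases persist across languages and social categories and are often amplified in Indian languages compared with English, which underlines the importance of linguistic and cultural context in bias evaluation.

Insight: The work gives global fair-AI efforts an evaluation tool grounded in a specific cultural context and highlights the need for multilingual, multicultural evaluation.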

Abstract: Evaluating social biases in language models (LMs) is crucial for ensuring fairness and minimizing the reinforcement of harmful stereotypes in AI systems. Existing benchmarks, such as the Bias Benchmark for Question Answering (BBQ), primarily focus on Western contexts, limiting their applicability to the Indian context. To address this gap, we introduce BharatBBQ, a culturally adapted benchmark designed to assess biases in Hindi, English, Marathi, Bengali, Tamil, Telugu, Odia, and Assamese. BharatBBQ covers 13 social categories, including 3 intersectional groups, reflecting prevalent biases in the Indian sociocultural landscape. Our dataset contains 49,108 examples in one language that are expanded using translation and verification to 392,864 examples in eight different languages. We evaluate five multilingual LM families across zero and few-shot settings, analyzing their bias and stereotypical bias scores. Our findings highlight persistent biases across languages and social categories and often amplified biases in Indian languages compared to English, demonstrating the necessity of linguistically and culturally grounded benchmarks for bias evaluation.

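[173] Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning cs.CL | cs.AI

Lijie Yang, Zhihao Zhang, Arti Jain, Shijie Cao, Baihong Yuan

TL;DR: The paper proposes LessIsMore, a training-free sparse attention mechanism that exploits global attention patterns to make reasoning more efficient while preserving, and in some cases improving, accuracy.

Details

Motivation: Existing sparse attention methods lose accuracy in long-generation reasoning because errors accumulate, and they require either high token retention rates or expensive retraining. LessIsMore aims to solve these problems.

Result: Across reasoning tasks, LessIsMore preserves or improves accuracy with a 1.1x average decoding speed-up over full attention. It attends to 2x fewer tokens without accuracy loss and achieves a 1.13x end-to-end speed-up over existing sparse attention methods.

Insight: Global attention patterns and a unified token selection mechanism improve the efficiency and generalization of sparse attention, reducing reliance on retraining and high retention rates.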

Abstract: Large reasoning models achieve strong performance through test-time scaling but incur substantial computational overhead, particularly from excessive token generation when processing short input prompts. While sparse attention mechanisms can reduce latency and memory usage, existing approaches suffer from significant accuracy degradation due to accumulated errors during long-generation reasoning. These methods generally require either high token retention rates or expensive retraining. We introduce LessIsMore, a training-free sparse attention mechanism for reasoning tasks, which leverages global attention patterns rather than relying on traditional head-specific local optimizations. LessIsMore aggregates token selections from local attention heads with recent contextual information, enabling unified cross-head token ranking for future decoding layers. This unified selection improves generalization and efficiency by avoiding the need to maintain separate token subsets per head. Evaluation across diverse reasoning tasks and benchmarks shows that LessIsMore preserves – and in some cases improves – accuracy while achieving a $1.1\times$ average decoding speed-up compared to full attention. Moreover, LessIsMore attends to $2\times$ fewer tokens without accuracy loss, achieving a $1.13\times$ end-to-end speed-up compared to existing sparse attention methods.
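
The idea of merging per-head token selections into one shared set, combined with a window of recent tokens, can be sketched as follows. This is a loose illustration and not the LessIsMore algorithm itself: ranking by vote count, the tie-breaking rule, and the parameter names (`top_k`, `recent_window`, `budget`) are all assumptions.

```python
from collections import Counter


def unified_token_selection(head_scores, top_k, recent_window, budget):
    """Pick one token set shared by all heads.

    head_scores: one list of attention scores per head, all of length seq_len.
    Returns sorted token indices: the most recent tokens plus the tokens
    that the most heads placed in their own top-k.
    """
    seq_len = len(head_scores[0])
    # Always keep the most recent tokens.
    recent = list(range(max(0, seq_len - recent_window), seq_len))
    votes = Counter()
    for scores in head_scores:
        top = sorted(range(seq_len), key=lambda i: scores[i], reverse=True)[:top_k]
        votes.update(top)
    # Rank remaining tokens by how many heads chose them (ties: earlier index).
    candidates = [i for i in range(seq_len) if i not in recent and votes[i] > 0]
    candidates.sort(key=lambda i: (-votes[i], i))
    remaining = max(0, budget - len(recent))
    return sorted(candidates[:remaining] + recent)
```

Sharing one index set across heads means only a single token subset has to be gathered per step, which is the efficiency argument the abstract makes against maintaining separate subsets per head.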

[174] Investigating Intersectional Bias in Large Language Models using Confidence Disparities in Coreference Resolution cs.CL | cs.AI

Falaah Arif Khan, Nivedha Sivakumar, Yinong Oliver Wang, Katherine Metcalf, Cezanne Camacho

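TL;DR: This paper introduces WinoIdentity, a new benchmark for evaluating intersectional bias in large language models (LLMs), and measures bias with a confidence-based metric, Coreference Confidence Disparity. Confidence disparities across some intersectional identities reach 40%, and models are most uncertain about doubly-disadvantaged identities in anti-stereotypical settings.

Details

Motivation: LLMs are widely used as decision-support tools in resource-constrained settings, yet research shows that they can reflect and exacerbate societal biases. Existing evaluations mostly consider single-axis bias and overlook the distinct patterns of disadvantage that intersectional bias can create.

Result: An evaluation of five recent LLMs finds confidence disparities as high as 40% along attributes such as body type, sexual orientation, and socio-economic status, with models most uncertain about doubly-disadvantaged identities in anti-stereotypical settings.

Insight: The strong performance of LLMs may owe more to memorization than to logical reasoning. Their bias problems reflect failures of both value alignment and validity, and the two can compound to cause social harm.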

Abstract: Large language models (LLMs) have achieved impressive performance, leading to their widespread adoption as decision-support tools in resource-constrained contexts like hiring and admissions. There is, however, scientific consensus that AI systems can reflect and exacerbate societal biases, raising concerns about identity-based harm when used in critical social contexts. Prior work has laid a solid foundation for assessing bias in LLMs by evaluating demographic disparities in different language reasoning tasks. In this work, we extend single-axis fairness evaluations to examine intersectional bias, recognizing that when multiple axes of discrimination intersect, they create distinct patterns of disadvantage. We create a new benchmark called WinoIdentity by augmenting the WinoBias dataset with 25 demographic markers across 10 attributes, including age, nationality, and race, intersected with binary gender, yielding 245,700 prompts to evaluate 50 distinct bias patterns. Focusing on harms of omission due to underrepresentation, we investigate bias through the lens of uncertainty and propose a group (un)fairness metric called Coreference Confidence Disparity which measures whether models are more or less confident for some intersectional identities than others. We evaluate five recently published LLMs and find confidence disparities as high as 40% along various demographic attributes including body type, sexual orientation and socio-economic status, with models being most uncertain about doubly-disadvantaged identities in anti-stereotypical settings. Surprisingly, coreference confidence decreases even for hegemonic or privileged markers, indicating that the recent impressive performance of LLMs is more likely due to memorization than logical reasoning. Notably, these are two independent failures in value alignment and validity that can compound to cause social harm.

[175] Fairness of Automatic Speech Recognition: Looking Through a Philosophical Lens cs.CL | cs.AI

Anna Seo Gyeong Choi, Hoon Choi

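TL;DR: The paper analyzes bias in automatic speech recognition (ASR) systems through a philosophical lens, arguing that misrecognition of non-standard dialects is not only a technical problem but also a matter of historical injustice and a lack of respect for marginalized linguistic communities.

Details

Motivation: The work aims to show that the systematic misrecognition of certain speech varieties by ASR systems is more than a technical limitation: it is a form of disrespect toward marginalized groups. It also examines the ethical and social implications.

Result: Existing technical fairness metrics fail to capture the power asymmetries involved in ASR bias, so more comprehensive fairness evaluation is needed.

Insight: Addressing ASR bias requires more than technical improvement: it requires recognizing diverse speech varieties as legitimate and building respect for linguistic diversity into development.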

Abstract: Automatic Speech Recognition (ASR) systems now mediate countless human-technology interactions, yet research on their fairness implications remains surprisingly limited. This paper examines ASR bias through a philosophical lens, arguing that systematic misrecognition of certain speech varieties constitutes more than a technical limitation – it represents a form of disrespect that compounds historical injustices against marginalized linguistic communities. We distinguish between morally neutral classification (discriminate$_1$) and harmful discrimination (discriminate$_2$), demonstrating how ASR systems can inadvertently transform the former into the latter when they consistently misrecognize non-standard dialects. We identify three unique ethical dimensions of speech technologies that differentiate ASR bias from other algorithmic fairness concerns: the temporal burden placed on speakers of non-standard varieties (“temporal taxation”), the disruption of conversational flow when systems misrecognize speech, and the fundamental connection between speech patterns and personal/cultural identity. These factors create asymmetric power relationships that existing technical fairness metrics fail to capture. The paper analyzes the tension between linguistic standardization and pluralism in ASR development, arguing that current approaches often embed and reinforce problematic language ideologies. We conclude that addressing ASR bias requires more than technical interventions; it demands recognition of diverse speech varieties as legitimate forms of expression worthy of technological accommodation. This philosophical reframing offers new pathways for developing ASR systems that respect linguistic diversity and speaker autonomy.

[176] Gradient Surgery for Safe LLM Fine-Tuning cs.CL

Biao Yi, Jiahao Li, Baolei Zhang, Lihai Nie, Tong Li

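TL;DR: The paper proposes SafeGrad, which uses gradient surgery to address safety problems in large language model (LLM) fine-tuning, balancing user-task performance against safety.

Details

Motivation: Existing safe fine-tuning methods are fragile against malicious examples: defenses degrade sharply as the proportion of harmful examples grows. The cause is a conflict between the user-task gradient and the gradient of the safety objective.

Result: Experiments show that SafeGrad provides the strongest defense across different LLMs and datasets, maintaining both safety and task performance even at high harmful ratios.

Insight: Resolving gradient conflicts is the key, and distributional alignment further improves robustness and data efficiency.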

Abstract: Fine-tuning-as-a-Service introduces a critical vulnerability where a few malicious examples mixed into the user’s fine-tuning dataset can compromise the safety alignment of Large Language Models (LLMs). While a recognized paradigm frames safe fine-tuning as a multi-objective optimization problem balancing user task performance with safety alignment, we find existing solutions are critically sensitive to the harmful ratio, with defenses degrading sharply as harmful ratio increases. We diagnose that this failure stems from conflicting gradients, where the user-task update directly undermines the safety objective. To resolve this, we propose SafeGrad, a novel method that employs gradient surgery. When a conflict is detected, SafeGrad nullifies the harmful component of the user-task gradient by projecting it onto the orthogonal plane of the alignment gradient, allowing the model to learn the user’s task without sacrificing safety. To further enhance robustness and data efficiency, we employ a KL-divergence alignment loss that learns the rich, distributional safety profile of the well-aligned foundation model. Extensive experiments show that SafeGrad provides state-of-the-art defense across various LLMs and datasets, maintaining robust safety even at high harmful ratios without compromising task fidelity.
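The projection step described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: gradients are flattened into plain Python lists, the helper names `dot` and `project_conflicting` are hypothetical, and treating a negative inner product as the conflict signal is an assumption borrowed from common gradient-surgery practice, since the abstract does not state the detection rule.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_conflicting(task_grad, align_grad):
    """Remove the component of task_grad that opposes align_grad.

    If the two gradients do not conflict (inner product >= 0, an assumed
    criterion), the task gradient is returned unchanged.
    """
    inner = dot(task_grad, align_grad)
    norm_sq = dot(align_grad, align_grad)
    if inner >= 0 or norm_sq == 0:
        return list(task_grad)
    # Project onto the plane orthogonal to the alignment gradient:
    # g_task - (<g_task, g_align> / ||g_align||^2) * g_align
    scale = inner / norm_sq
    return [g - scale * a for g, a in zip(task_grad, align_grad)]
```

After projection the update is orthogonal to the alignment gradient, so to first order it no longer increases the alignment loss while still making progress on the user task.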

[177] Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models cs.CL | 68T50 | I.2.7

Leyi Pan, Zheyu Fu, Yunpeng Zhai, Shuchang Tao, Sheng Guan

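TL;DR: Omni-SafetyBench is the first benchmark for evaluating the safety of omni-modal large language models (OLLMs), filling the gap left by existing benchmarks, which cannot assess safety under joint audio-visual inputs or cross-modal safety consistency.

Details

Motivation: With the rise of omni-modal LLMs, existing benchmarks cannot adequately evaluate safety under joint audio-visual inputs or cross-modal consistency, so a dedicated benchmark is needed.

Result: The evaluation shows that (1) no model excels in both overall safety and consistency; (2) complex inputs, especially joint audio-visual ones, weaken defenses; and (3) some models score as low as 0.14 on specific modalities.

Insight: The safety of current OLLMs under complex omni-modal inputs needs substantial improvement, with targeted enhancements to their defense mechanisms.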

Abstract: The rise of Omni-modal Large Language Models (OLLMs), which integrate visual and auditory processing with text, necessitates robust safety evaluations to mitigate harmful outputs. However, no dedicated benchmarks currently exist for OLLMs, and prior benchmarks designed for other LLMs lack the ability to assess safety performance under audio-visual joint inputs or cross-modal safety consistency. To fill this gap, we introduce Omni-SafetyBench, the first comprehensive parallel benchmark for OLLM safety evaluation, featuring 24 modality combinations and variations with 972 samples each, including dedicated audio-visual harm cases. Considering OLLMs’ comprehension challenges with complex omni-modal inputs and the need for cross-modal consistency evaluation, we propose tailored metrics: a Safety-score based on conditional Attack Success Rate (C-ASR) and Refusal Rate (C-RR) to account for comprehension failures, and a Cross-Modal Safety Consistency Score (CMSC-score) to measure consistency across modalities. Evaluating 6 open-source and 4 closed-source OLLMs reveals critical vulnerabilities: (1) no model excels in both overall safety and consistency, with only 3 models achieving over 0.6 in both metrics and top performer scoring around 0.8; (2) safety defenses weaken with complex inputs, especially audio-visual joints; (3) severe weaknesses persist, with some models scoring as low as 0.14 on specific modalities. Our benchmark and metrics highlight urgent needs for enhanced OLLM safety, providing a foundation for future improvements.

[178] Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks cs.CL | cs.AI | cs.DB

Jiaqi Yin, Yi-Wei Chen, Meng-Lung Lee, Xiya Liu

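TL;DR: The paper proposes a framework for automatically extracting fine-grained schema lineage from multilingual data pipeline scripts to address semantic drift, and introduces the SLiCE evaluation metric and a new benchmark dataset. Experiments show that model size and prompting technique strongly affect performance, with a 32B open-source model approaching the GPT series.

Details

Motivation: Semantic drift in enterprise data pipelines compromises data reproducibility and governance and impairs services such as RAG and text-to-SQL.

Result: A 32B open-source model using a single reasoning trace performs comparably to GPT-4o and GPT-4.1 under standard prompting; prompting technique and model size are the key factors.

Insight: With suitable prompting, open-source models can rival closed-source LLMs, offering a cost-effective option for practical deployment.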

Abstract: Enterprise data pipelines, characterized by complex transformations across multiple programming languages, often cause a semantic disconnect between original metadata and downstream data. This “semantic drift” compromises data reproducibility and governance, and impairs the utility of services like retrieval-augmented generation (RAG) and text-to-SQL systems. To address this, a novel framework is proposed for the automated extraction of fine-grained schema lineage from multilingual enterprise pipeline scripts. This method identifies four key components: source schemas, source tables, transformation logic, and aggregation operations, creating a standardized representation of data transformations. For the rigorous evaluation of lineage quality, this paper introduces the Schema Lineage Composite Evaluation (SLiCE), a metric that assesses both structural correctness and semantic fidelity. A new benchmark is also presented, comprising 1,700 manually annotated lineages from real-world industrial scripts. Experiments were conducted with 12 language models, ranging from small language models (SLMs) of 1.3B to 32B parameters to large language models (LLMs) such as GPT-4o and GPT-4.1. The results demonstrate that the performance of schema lineage extraction scales with model size and the sophistication of prompting techniques. Specifically, a 32B open-source model, using a single reasoning trace, can achieve performance comparable to the GPT series under standard prompting. This finding suggests a scalable and economical approach for deploying schema-aware agents in practical applications.

[179] DySK-Attn: A Framework for Efficient, Real-Time Knowledge Updating in Large Language Models via Dynamic Sparse Knowledge Attention cs.CL | cs.AI | cs.LG | I.2.7; H.3.3; H.2.8

Kabir Khan, Priya Sharma, Arjun Mehta, Neha Gupta, Ravi Narayanan

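TL;DR: DySK-Attn uses a dynamic sparse knowledge attention mechanism to let large language models (LLMs) integrate dynamic external knowledge efficiently and in real time, avoiding the high cost of retraining and the side effects of knowledge editing.

Details

Motivation: LLM knowledge is static and quickly becomes outdated; retraining is costly, and existing knowledge editing techniques can be slow and introduce side effects. An efficient, scalable way for LLMs to update their knowledge in real time is therefore needed.

Result: On time-sensitive question-answering tasks, DySK-Attn significantly outperforms standard retrieval-augmented generation (RAG) and model editing techniques in both accuracy and computational efficiency.

Insight: Sparse attention reduces computational overhead, and a dynamic knowledge graph (KG) gives LLMs real-time knowledge updates, making the approach scalable.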

Abstract: Large Language Models (LLMs) suffer from a critical limitation: their knowledge is static and quickly becomes outdated. Retraining these massive models is computationally prohibitive, while existing knowledge editing techniques can be slow and may introduce unforeseen side effects. To address this, we propose DySK-Attn, a novel framework that enables LLMs to efficiently integrate real-time knowledge from a dynamic external source. Our approach synergizes an LLM with a dynamic Knowledge Graph (KG) that can be updated instantaneously. The core of our framework is a sparse knowledge attention mechanism, which allows the LLM to perform a coarse-to-fine grained search, efficiently identifying and focusing on a small, highly relevant subset of facts from the vast KG. This mechanism avoids the high computational cost of dense attention over the entire knowledge base and mitigates noise from irrelevant information. We demonstrate through extensive experiments on time-sensitive question-answering tasks that DySK-Attn significantly outperforms strong baselines, including standard Retrieval-Augmented Generation (RAG) and model editing techniques, in both factual accuracy for updated knowledge and computational efficiency. Our framework offers a scalable and effective solution for building LLMs that can stay current with the ever-changing world.

[180] Incorporating Contextual Paralinguistic Understanding in Large Speech-Language Models cs.CL | cs.AI | eess.AS

Qiongqiong Wang, Hardik B. Sailor, Jeremy H. M. Wong, Tianchi Liu, Shuo Sun

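TL;DR: This paper proposes two methods, one explicit and one implicit, for incorporating contextual paralinguistic information into large speech-language models (Speech-LLMs), substantially improving empathetic reasoning.

Details

Motivation: Existing Speech-LLMs are weak at empathetic reasoning, mainly because training data that integrates contextual content with paralinguistic cues is lacking.

Result: The implicit method improves LLM-judged performance by 38.41%, rising to 46.02% when combined with the explicit method.

Insight: Paralinguistic information is essential for improving the empathetic reasoning of speech-language models, and automatically generated training data is an effective supplement.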

Abstract: Current large speech language models (Speech-LLMs) often exhibit limitations in empathetic reasoning, primarily due to the absence of training datasets that integrate both contextual content and paralinguistic cues. In this work, we propose two approaches to incorporate contextual paralinguistic information into model training: (1) an explicit method that provides paralinguistic metadata (e.g., emotion annotations) directly to the LLM, and (2) an implicit method that automatically generates novel training question-answer (QA) pairs using both categorical and dimensional emotion annotations alongside speech transcriptions. Our implicit method boosts performance (LLM-judged) by 38.41% on a human-annotated QA benchmark, reaching 46.02% when combined with the explicit approach, showing effectiveness in contextual paralinguistic understanding. We also validate the LLM judge by demonstrating its correlation with classification metrics, providing support for its reliability.

[181] MAQuA: Adaptive Question-Asking for Multidimensional Mental Health Screening using Item Response Theory cs.CL | cs.AI

Vasudha Varadarajan, Hui Xu, Rebecca Astrid Boehme, Mariam Marlan Mirstrom, Sverker Sikstrom

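TL;DR: MAQuA is an adaptive question-asking framework based on item response theory (IRT) and factor analysis that optimizes question selection to reduce user burden while improving the accuracy of multidimensional mental health screening.

Details

Motivation: In LLM-based mental health assessment, excessive questioning burdens users and is inefficient. MAQuA aims to optimize multidimensional symptom screening through adaptive question-asking.

Result: In experiments, MAQuA reduces the number of questions required by 50-87% compared with random ordering (for example, 71% fewer questions for depression), with robust performance across both internalizing and externalizing domains.

Insight: By combining LLMs with clinical tools, MAQuA offers an efficient, scalable approach to mental health screening and advances the use of LLMs in real-world clinical workflows.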

Abstract: Recent advances in large language models (LLMs) offer new opportunities for scalable, interactive mental health assessment, but excessive querying by LLMs burdens users and is inefficient for real-world screening across transdiagnostic symptom profiles. We introduce MAQuA, an adaptive question-asking framework for simultaneous, multidimensional mental health screening. Combining multi-outcome modeling on language responses with item response theory (IRT) and factor analysis, MAQuA selects the questions with the most informative responses across multiple dimensions at each turn to optimize diagnostic information, improving accuracy and potentially reducing response burden. Empirical results on a novel dataset reveal that MAQuA reduces the number of assessment questions required for score stabilization by 50-87% compared to random ordering (e.g., achieving stable depression scores with 71% fewer questions and eating disorder scores with 85% fewer questions). MAQuA demonstrates robust performance across both internalizing (depression, anxiety) and externalizing (substance use, eating disorder) domains, with early stopping strategies further reducing patient time and burden. These findings position MAQuA as a powerful and efficient tool for scalable, nuanced, and interactive mental health screening, advancing the integration of LLM-based agents into real-world clinical workflows.
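The idea of choosing the most informative question at each turn can be illustrated with the textbook two-parameter logistic (2PL) IRT model. This is only a sketch under stated assumptions: it uses a single latent trait rather than MAQuA's multiple dimensions, the item parameters are invented for illustration, and the function names `item_information` and `pick_next_item` are hypothetical. The abstract does not say which IRT model MAQuA uses.

```python
import math


def item_information(theta, discrimination, difficulty):
    """Fisher information of a 2PL item at trait level theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
    return discrimination ** 2 * p * (1.0 - p)


def pick_next_item(theta, items, asked):
    """Return the unasked item with maximal information at the current estimate.

    items: dict mapping item name -> (discrimination, difficulty).
    asked: set of item names that were already presented.
    """
    candidates = [name for name in items if name not in asked]
    return max(candidates, key=lambda name: item_information(theta, *items[name]))


# Hypothetical item bank: (discrimination, difficulty) per question.
bank = {"q1": (1.0, -2.0), "q2": (1.0, 0.0), "q3": (2.0, 3.0)}
next_item = pick_next_item(0.0, bank, asked=set())
```

An item is most informative when its difficulty is close to the current trait estimate, which is why an adaptive order can stabilize a score with fewer questions than a random order.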

[182] “Pull or Not to Pull?”: Investigating Moral Biases in Leading Large Language Models Across Ethical Dilemmas cs.CL | cs.AI | cs.CY

Junchen Ding, Penghao Jiang, Zihao Xu, Ziqi Ding, Yichen Zhu

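TL;DR: The paper studies how large language models (LLMs) decide in moral dilemmas, finds that their behavior differs significantly across ethical framings, and argues that moral reasoning should become a primary axis of LLM alignment.

Details

Motivation: As LLMs play a growing role in ethically sensitive decisions, understanding their moral reasoning processes becomes essential.

Result: Models perform best under altruistic, fairness, and virtue-ethics framings, but diverge under framings that emphasize kinship, legality, or self-interest.

Insight: Moral prompting is not only a behavioral modifier but also a diagnostic tool that reveals a model's latent alignment philosophy; standardized benchmarks are needed to evaluate how and why models decide.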

Abstract: As large language models (LLMs) increasingly mediate ethically sensitive decisions, understanding their moral reasoning processes becomes imperative. This study presents a comprehensive empirical evaluation of 14 leading LLMs, both reasoning-enabled and general-purpose, across 27 diverse trolley problem scenarios, framed by ten moral philosophies, including utilitarianism, deontology, and altruism. Using a factorial prompting protocol, we elicited 3,780 binary decisions and natural language justifications, enabling analysis along axes of decisional assertiveness, explanation-answer consistency, public moral alignment, and sensitivity to ethically irrelevant cues. Our findings reveal significant variability across ethical frames and model types: reasoning-enhanced models demonstrate greater decisiveness and structured justifications, yet do not always align better with human consensus. Notably, “sweet zones” emerge in altruistic, fairness, and virtue ethics framings, where models achieve a balance of high intervention rates, low explanation conflict, and minimal divergence from aggregated human judgments. However, models diverge under frames emphasizing kinship, legality, or self-interest, often producing ethically controversial outcomes. These patterns suggest that moral prompting is not only a behavioral modifier but also a diagnostic tool for uncovering latent alignment philosophies across providers. We advocate for moral reasoning to become a primary axis in LLM alignment, calling for standardized benchmarks that evaluate not just what LLMs decide, but how and why.

[183] ARCE: Augmented RoBERTa with Contextualized Elucidations for NER in Automated Rule Checking cs.CL | cs.IR

Jian Chen, Jinbao Tian, Yankui Li, Zhou Li

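TL;DR: ARCE uses a large language model to generate simple explanatory knowledge (Cote) that augments a RoBERTa model, achieving state-of-the-art NER performance in the AEC domain (Macro-F1 of 77.20%).

Details

Motivation: Specialized AEC texts have complex terminology and dense relational contexts, so standard pre-trained models are limited by the domain gap. Approaches such as ARCBERT rely on large, human-curated corpora and are costly. LLMs offer a new route to automated knowledge generation, but how best to generate knowledge that improves smaller models remains unresolved.

Result: ARCE reaches a Macro-F1 of 77.20% on a benchmark AEC NER dataset, surpassing existing methods. Simple explanatory knowledge proves more effective than complex, role-based rationales.

Insight: (1) Simple explanatory knowledge suits domain-specific NER better than complex rationales. (2) LLM-generated knowledge combined with fine-tuning a small model is an efficient, viable way to bridge the domain gap. (3) Future work could explore whether similar strategies apply in other domains.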

Abstract: Accurate information extraction from specialized texts is a critical challenge, particularly for named entity recognition (NER) in the architecture, engineering, and construction (AEC) domain to support automated rule checking (ARC). The performance of standard pre-trained models is often constrained by the domain gap, as they struggle to interpret the specialized terminology and complex relational contexts inherent in AEC texts. Although this issue can be mitigated by further pre-training on large, human-curated domain corpora, as exemplified by methods like ARCBERT, this approach is both labor-intensive and cost-prohibitive. Consequently, leveraging large language models (LLMs) for automated knowledge generation has emerged as a promising alternative. However, the optimal strategy for generating knowledge that can genuinely enhance smaller, efficient models remains an open question. To address this, we propose ARCE (augmented RoBERTa with contextualized elucidations), a novel approach that systematically explores and optimizes this generation process. ARCE employs an LLM to first generate a corpus of simple, direct explanations, which we term Cote, and then uses this corpus to incrementally pre-train a RoBERTa model prior to its fine-tuning on the downstream task. Our extensive experiments show that ARCE establishes a new state-of-the-art on a benchmark AEC dataset, achieving a Macro-F1 score of 77.20%. This result also reveals a key finding: simple, explanation-based knowledge proves surprisingly more effective than complex, role-based rationales for this task. The code is publicly available at: https://github.com/nxcc-lab/ARCE.

Yexing Du, Kaiyuan Liu, Youcheng Pan, Zheng Chu, Bo Yang

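TL;DR: This paper proposes CCFQA, a benchmark for evaluating the cross-lingual and cross-modal (speech and text) factuality of multimodal large language models (MLLMs). Experiments show that current MLLMs perform poorly on the task; the paper also proposes a few-shot transfer learning method that significantly improves performance.

Details

Motivation: MLLMs are increasingly used in multilingual settings, but existing benchmarks focus mainly on text or visual modalities in English, leaving an evaluation gap for multilingual input, especially speech. A new benchmark is needed to fill it.

Result: Current MLLMs perform poorly on CCFQA; the proposed few-shot transfer learning method achieves performance comparable to GPT-4o-mini-Audio using only 5-shot training.

Insight: Cross-lingual and cross-modal factuality evaluation is a major challenge for MLLMs; few-shot transfer learning can effectively improve performance on multilingual tasks.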

Abstract: As Large Language Models (LLMs) are increasingly popularized in the multilingual world, ensuring hallucination-free factuality becomes markedly crucial. However, existing benchmarks for evaluating the reliability of Multimodal Large Language Models (MLLMs) predominantly focus on textual or visual modalities with a primary emphasis on English, which creates a gap in evaluation when processing multilingual input, especially in speech. To bridge this gap, we propose a novel Cross-lingual and Cross-modal Factuality benchmark (CCFQA). Specifically, the CCFQA benchmark contains parallel speech-text factual questions across 8 languages, designed to systematically evaluate MLLMs’ cross-lingual and cross-modal factuality capabilities. Our experimental results demonstrate that current MLLMs still face substantial challenges on the CCFQA benchmark. Furthermore, we propose a few-shot transfer learning strategy that effectively transfers the Question Answering (QA) capabilities of LLMs in English to multilingual Spoken Question Answering (SQA) tasks, achieving competitive performance with GPT-4o-mini-Audio using just 5-shot training. We release CCFQA as a foundational research resource to promote the development of MLLMs with more robust and reliable speech understanding capabilities. Our code and dataset are available at https://github.com/yxduir/ccfqa.

[185] HealthBranches: Synthesizing Clinically-Grounded Question Answering Datasets via Decision Pathways cs.CL | cs.AI | cs.IR | cs.LG

Cristian Cosentino, Annamaria Defilippo, Marco Dossena, Christopher Irwin, Sara Joubbi

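TL;DR: HealthBranches is a new benchmark dataset for medical question answering, built by a semi-automated pipeline that turns medical decision pathways into realistic patient cases with associated questions and answers, and designed to evaluate complex reasoning in large language models (LLMs).

Details

Motivation: Benchmark datasets that comprehensively evaluate complex LLM reasoning in the medical domain are lacking. HealthBranches fills this gap and provides clinically validated reasoning paths.

Result: The dataset supports evaluation of multi-step LLM reasoning, including performance in structured retrieval-augmented generation (RAG) settings.

Insight: HealthBranches provides a resource for developing trustworthy, interpretable medical LLMs and can also be used in medical education.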

Abstract: HealthBranches is a novel benchmark dataset for medical Question-Answering (Q&A), specifically designed to evaluate complex reasoning in Large Language Models (LLMs). This dataset is generated through a semi-automated pipeline that transforms explicit decision pathways from medical sources into realistic patient cases with associated questions and answers. Covering 4,063 case studies across 17 healthcare topics, each data point is based on clinically validated reasoning chains. HealthBranches supports both open-ended and multiple-choice question formats and uniquely includes the full reasoning path for each Q&A. Its structured design enables robust evaluation of LLMs’ multi-step inference capabilities, including their performance in structured Retrieval-Augmented Generation (RAG) contexts. HealthBranches establishes a foundation for the development of more trustworthy, interpretable, and clinically reliable LLMs in high-stakes domains while also serving as a valuable resource for educational purposes.

[186] ObfusQAte: A Proposed Framework to Evaluate LLM Robustness on Obfuscated Factual Question Answering cs.CL | cs.AI | cs.LG | I.2.7

Shubhra Ghosh, Abhilekh Borah, Aditya Kumar Guru, Kripabandhu Ghosh

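TL;DR: The paper proposes the ObfusQAte framework for evaluating the robustness of large language models (LLMs) on obfuscated factual question answering, and finds that LLMs tend to fail or produce incorrect answers when questions are obfuscated.

Details

Motivation: As LLMs proliferate, their factual question-answering performance has drawn wide attention, but their robustness to obfuscated questions has not been systematically evaluated.

Result: LLMs tend to fail or hallucinate answers when faced with obfuscated questions, indicating significant limitations of existing models on such tasks.

Insight: Obfuscated questions effectively expose robustness weaknesses in LLMs and point to an important direction for future model improvement.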

Abstract: The rapid proliferation of Large Language Models (LLMs) has significantly contributed to the development of equitable AI systems capable of factual question-answering (QA). However, no known study tests the LLMs’ robustness when presented with obfuscated versions of questions. To systematically evaluate these limitations, we propose a novel technique, ObfusQAte and, leveraging the same, introduce ObfusQA, a comprehensive, first of its kind, framework with multi-tiered obfuscation levels designed to examine LLM capabilities across three distinct dimensions: (i) Named-Entity Indirection, (ii) Distractor Indirection, and (iii) Contextual Overload. By capturing these fine-grained distinctions in language, ObfusQA provides a comprehensive benchmark for evaluating LLM robustness and adaptability. Our study observes that LLMs exhibit a tendency to fail or generate hallucinated responses when confronted with these increasingly nuanced variations. To foster research in this direction, we make ObfusQAte publicly available.

[187] Grounding Multilingual Multimodal LLMs With Cultural Knowledge cs.CL | cs.LG

Jean de Dieu Nyandwi, Yueqi Song, Simran Khanuja, Graham Neubig

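TL;DR: The paper proposes a data-centric approach: it builds CulturalGround, a multilingual visual question answering dataset, and trains CulturalPangea, a model grounded in cultural knowledge, significantly improving multimodal LLM performance on culture-related tasks.

Details

Motivation: Existing multimodal LLMs excel in high-resource settings but fall short on low-resource languages and long-tail cultural entities, so a way to ground them directly in cultural knowledge is needed.

Result: CulturalPangea achieves state-of-the-art results among open models on culture-focused multilingual multimodal benchmarks, an average gain of 5.0 points, without degrading performance on mainstream vision-language tasks.

Insight: Targeted grounding in cultural knowledge can substantially narrow the cultural gap in MLLMs and offers a practical path toward globally inclusive multimodal systems.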

Abstract: Multimodal Large Language Models excel in high-resource settings, but often misinterpret long-tail cultural entities and underperform in low-resource languages. To address this gap, we propose a data-centric approach that directly grounds MLLMs in cultural knowledge. Leveraging a large scale knowledge graph from Wikidata, we collect images that represent culturally significant entities, and generate synthetic multilingual visual question answering data. The resulting dataset, CulturalGround, comprises 22 million high-quality, culturally-rich VQA pairs spanning 42 countries and 39 languages. We train an open-source MLLM CulturalPangea on CulturalGround, interleaving standard multilingual instruction-tuning data to preserve general abilities. CulturalPangea achieves state-of-the-art performance among open models on various culture-focused multilingual multimodal benchmarks, outperforming prior models by an average of 5.0 without degrading results on mainstream vision-language tasks. Our findings show that our targeted, culturally grounded approach could substantially narrow the cultural gap in MLLMs and offer a practical path towards globally inclusive multimodal systems.

[188] Let’s Revise Step-by-Step: A Unified Local Search Framework for Code Generation with LLMs cs.CL

Zhiyi Lyu, Jianguo Huang, Yanchen Deng, Steven Hoi, Bo An

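TL;DR: ReLoc is a unified local search framework that addresses the efficiency and scalability problems of LLM code generation through step-by-step code revision, significantly outperforming existing methods.

Details

Motivation: Existing LLM code generation methods have efficiency and scalability problems. Construction-based tree-search methods suffer from rapid growth in tree size, high token consumption, and the lack of an anytime property, while improvement-based methods are limited by uninformative reward signals and inefficient search strategies.

Result: Experiments show that ReLoc performs well across diverse code generation tasks, significantly outperforming construction-based tree search and state-of-the-art improvement-based methods.

Insight: Combining local revision with a revision reward model guides the search more efficiently and improves the quality of the generated code.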

Abstract: Large Language Models (LLMs) with inference-time scaling techniques show promise for code generation, yet face notable efficiency and scalability challenges. Construction-based tree-search methods suffer from rapid growth in tree size, high token consumption, and lack of anytime property. In contrast, improvement-based methods offer better performance but often struggle with uninformative reward signals and inefficient search strategies. In this work, we propose ReLoc, a unified local search framework which effectively performs step-by-step code revision. Specifically, ReLoc explores a series of local revisions through four key algorithmic components: initial code drafting, neighborhood code generation, candidate evaluation, and incumbent code updating, each of which can be instantiated with specific decision rules to realize different local search algorithms such as Hill Climbing (HC) or Genetic Algorithm (GA). Furthermore, we develop a specialized revision reward model that evaluates code quality based on revision distance to produce fine-grained preferences that guide the local search toward more promising candidates. Finally, our extensive experimental results demonstrate that our approach achieves superior performance across diverse code generation tasks, significantly outperforming both construction-based tree search as well as the state-of-the-art improvement-based code generation methods.
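The four components named in this abstract map naturally onto a generic hill-climbing loop. The sketch below is an illustration under assumptions, not ReLoc itself: `draft`, `neighbors`, and `score` are hypothetical callables that would wrap an LLM and a reward model in practice, and a toy integer problem stands in for code candidates so that the example runs on its own.

```python
def local_search(draft, neighbors, score, max_steps=20):
    """Hill climbing over candidates using four pluggable steps.

    draft():        initial drafting, returns a first candidate.
    neighbors(c):   neighborhood generation, returns revised candidates.
    score(c):       candidate evaluation, higher is better.
    The incumbent is updated only when the best neighbor scores strictly higher.
    """
    incumbent = draft()
    incumbent_score = score(incumbent)
    for _ in range(max_steps):
        candidates = neighbors(incumbent)
        if not candidates:
            break
        best = max(candidates, key=score)
        best_score = score(best)
        if best_score <= incumbent_score:
            break  # local optimum: no revision improves on the incumbent
        incumbent, incumbent_score = best, best_score
    return incumbent, incumbent_score


# Toy stand-in: candidates are integers and the "reward" peaks at 7.
result = local_search(
    draft=lambda: 0,
    neighbors=lambda x: [x - 1, x + 1],
    score=lambda x: -(x - 7) ** 2,
)
```

The loop always holds a valid incumbent, so it can be stopped after any step and still return its best candidate so far, which is the anytime property that the abstract says construction-based tree search lacks.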

[189] Positional Biases Shift as Inputs Approach Context Window Limits cs.CL

Blerta Veseli, Julian Chibane, Mariya Toneva, Alexander Koller

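TL;DR: The paper analyzes positional biases in large language models (LLMs) on long inputs and finds that the biases shift as input length approaches the context window limit. An analysis based on relative length indicates that retrieval is a prerequisite for reasoning and that positional biases are largely inherited from retrieval.

Details

Motivation: Prior work identified the Lost in the Middle (LiM) effect in long inputs, where models are more sensitive to the beginning and end of the input, but findings on its strength and the conditions under which it appears have been inconsistent. This study uses relative input lengths to characterize these biases and the conditions under which they appear.

Result: As input length grows, the LiM effect disappears and gives way to a distance-based bias, where performance is better the closer the relevant information is to the end of the input. Retrieval ability directly affects reasoning quality.

Insight: The ability of LLMs to handle long contexts is limited by retrieval; future evaluations and benchmark designs should account for how positional biases shift.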

Abstract: Large Language Models (LLMs) often struggle to use information across long inputs effectively. Prior work has identified positional biases, such as the Lost in the Middle (LiM) effect, where models perform better when information appears at the beginning (primacy bias) or end (recency bias) of the input, rather than in the middle. However, long-context studies have not consistently replicated these effects, raising questions about their intensity and the conditions under which they manifest. To address this, we conducted a comprehensive analysis using relative rather than absolute input lengths, defined with respect to each model’s context window. Our findings reveal that the LiM effect is strongest when inputs occupy up to 50% of a model’s context window. Beyond that, the primacy bias weakens, while recency bias remains relatively stable. This effectively eliminates the LiM effect; instead, we observe a distance-based bias, where model performance is better when relevant information is closer to the end of the input. Furthermore, our results suggest that successful retrieval is a prerequisite for reasoning in LLMs, and that the observed positional biases in reasoning are largely inherited from retrieval. These insights have implications for long-context tasks, the design of future LLM benchmarks, and evaluation methodologies for LLMs handling extended inputs.

[190] ALOPE: Adaptive Layer Optimization for Translation Quality Estimation using Large Language Models cs.CL | cs.AI

Archchana Sindhujan, Shenbin Qian, Chan Chi Chun Matthew, Constantin Orasan, Diptesh Kanojia

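TL;DR: ALOPE is an adaptive layer-optimization framework that improves LLM-based translation quality estimation (QE) by restructuring Transformer representations through layer-wise adaptation, combined with dynamic weighting and multi-head regression strategies, and it improves on existing methods.

Details

Motivation: LLMs are limited on QE because their pre-training objective (causal language modelling) does not match regression tasks and because low-resource languages are poorly represented. ALOPE aims to address these problems.

Result: Experiments show that ALOPE improves on existing LLM-based QE methods, and that intermediate-layer representations are better suited to cross-lingual tasks. The models and code are publicly available.

Insight: Intermediate LLM layers may be better suited to cross-lingual tasks such as QE, and dynamic weighting and multi-head aggregation can further improve performance.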

Abstract: Large Language Models (LLMs) have shown remarkable performance across a wide range of natural language processing tasks. Quality Estimation (QE) for Machine Translation (MT), which assesses the quality of a source-target pair without relying on reference translations, remains a challenging cross-lingual task for LLMs. The challenges stem from the inherent limitations of existing LLM-based QE systems, which are pre-trained for causal language modelling rather than regression-specific tasks, further elevated by the presence of low-resource languages given pre-training data distribution. This paper introduces ALOPE, an adaptive layer-optimization framework designed to enhance LLM-based QE by restructuring Transformer representations through layer-wise adaptation for improved regression-based prediction. Our framework integrates low-rank adapters (LoRA) with regression task heads, leveraging selected pre-trained Transformer layers for improved cross-lingual alignment. In addition to the layer-specific adaptation, ALOPE introduces two strategies: dynamic weighting, which adaptively combines representations from multiple layers, and multi-head regression, which aggregates regression losses from multiple heads for QE. Our framework shows improvements over various existing LLM-based QE approaches. Empirical evidence suggests that intermediate Transformer layers in LLMs provide contextual representations that are more aligned with the cross-lingual nature of the QE task. We make resultant models and framework code publicly available for further research, also allowing existing LLM-based MT frameworks to be scaled with QE capabilities.
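The dynamic weighting strategy, which combines representations from several layers, can be sketched as a softmax-weighted sum. This is an assumed formulation for illustration only: the abstract does not say how the weights are parameterized, and the names `softmax`, `combine_layers`, and `layer_logits` (one learnable scalar per layer) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_layers(layer_vectors, layer_logits):
    """Weighted sum of per-layer representations.

    layer_vectors: one vector (list of floats) per selected Transformer layer.
    layer_logits:  one scalar per layer; softmax turns them into mixing weights.
    """
    weights = softmax(layer_logits)
    dim = len(layer_vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, layer_vectors))
            for d in range(dim)]


# Two toy layer outputs with equal logits give an even mix.
mixed = combine_layers([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

In a trained system the logits would be learned together with the regression head, so the model itself could shift weight toward the intermediate layers that the abstract reports as most useful for QE.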

[191] From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR cs.CL

Jia Deng, Jie Chen, Zhipeng Chen, Daixuan Cheng, Fei Bai

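TL;DR: This paper systematically studies the exploration mechanisms of large language models (LLMs) in RLVR. It proposes quantitative metrics for characterizing capability boundaries, examines the entropy-performance exchange, and investigates how to optimize RL performance.

Details

Motivation: Although RLVR is effective at strengthening LLM reasoning, the mechanisms underlying its exploration behavior have not been studied in depth.

Result: The work presents unified theoretical and empirical evidence that lays a foundation for further optimization of RLVR systems.

Insight: LLM exploration capacity is closely tied to performance gains, and the entropy-performance relationship behaves differently at different granularities.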

Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). Unlike traditional RL approaches, RLVR leverages rule-based feedback to guide LLMs in generating and refining complex reasoning chains – a process critically dependent on effective exploration strategies. While prior work has demonstrated RLVR’s empirical success, the fundamental mechanisms governing LLMs’ exploration behaviors remain underexplored. This technical report presents a systematic investigation of exploration capacities in RLVR, covering four main aspects: (1) exploration space shaping, where we develop quantitative metrics to characterize LLMs’ capability boundaries; (2) entropy-performance exchange, analyzed across training stages, individual instances, and token-level patterns; and (3) RL performance optimization, examining methods to effectively translate exploration gains into measurable improvements. By unifying previously identified insights with new empirical evidence, this work aims to provide a foundational framework for advancing RLVR systems.

[192] IBPS: Indian Bail Prediction System cs.CL | cs.AI

Puspesh Kumar Srivastava, Uddeshya Raj, Praveen Patel, Shubham Kumar Nigam, Noel Shallum

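TL;DR: The paper introduces the Indian Bail Prediction System (IBPS), an AI-based framework that assists bail decisions in Indian courts by predicting bail outcomes and generating legal rationales, with the aim of reducing subjectivity and delays.

Details

Motivation: Bail decisions in Indian courts suffer from subjectivity, delays, and inconsistency. Over 75% of the prison population consists of undertrial prisoners, many from socioeconomically disadvantaged groups, and the lack of timely, fair bail adjudication worsens human rights concerns and the judicial backlog.

Result: Fine-tuned models that incorporate statutory knowledge significantly outperform baselines, achieve strong accuracy and explanation quality, and generalize to a test set independently annotated by legal experts.

Insight: IBPS offers the Indian judicial system a transparent, scalable, and reproducible solution that supports data-driven legal assistance, reduces bail delays, and promotes procedural fairness.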

Abstract: Bail decisions are among the most frequently adjudicated matters in Indian courts, yet they remain plagued by subjectivity, delays, and inconsistencies. With over 75% of India’s prison population comprising undertrial prisoners, many from socioeconomically disadvantaged backgrounds, the lack of timely and fair bail adjudication exacerbates human rights concerns and contributes to systemic judicial backlog. In this paper, we present the Indian Bail Prediction System (IBPS), an AI-powered framework designed to assist in bail decision-making by predicting outcomes and generating legally sound rationales based solely on factual case attributes and statutory provisions. We curate and release a large-scale dataset of 150,430 High Court bail judgments, enriched with structured annotations such as age, health, criminal history, crime category, custody duration, statutes, and judicial reasoning. We fine-tune a large language model using parameter-efficient techniques and evaluate its performance across multiple configurations, with and without statutory context, and with RAG. Our results demonstrate that models fine-tuned with statutory knowledge significantly outperform baselines, achieving strong accuracy and explanation quality, and generalize well to a test set independently annotated by legal experts. IBPS offers a transparent, scalable, and reproducible solution to support data-driven legal assistance, reduce bail delays, and promote procedural fairness in the Indian judicial system.

[193] Keyword-Centric Prompting for One-Shot Event Detection with Self-Generated Rationale Enhancements cs.CL

Ziheng Li, Zhi-Hong Deng

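TL;DR: The paper proposes KeyCP++, a keyword-centric chain-of-thought prompting method that addresses LLMs' poor understanding of event triggers and their tendency to over-interpret, improving one-shot event detection.

Details

Motivation: Conventional LLM-based in-context learning performs poorly on event detection, mainly because models understand event triggers inaccurately and tend to over-interpret, which in-context examples alone cannot correct.

Result: Experiments show that KeyCP++ performs significantly better than conventional methods on one-shot event detection.

Insight: Keyword-driven reasoning can effectively guide LLMs to learn event detection rules, avoid over-interpretation, and improve task performance.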

Abstract: Although the LLM-based in-context learning (ICL) paradigm has demonstrated considerable success across various natural language processing tasks, it encounters challenges in event detection. This is because LLMs lack an accurate understanding of event triggers and tend to over-interpret, which cannot be effectively corrected through in-context examples alone. In this paper, we focus on the most challenging one-shot setting and propose KeyCP++, a keyword-centric chain-of-thought prompting approach. KeyCP++ addresses the weaknesses of conventional ICL by automatically annotating the logical gaps between input text and detection results for the demonstrations. Specifically, to generate in-depth and meaningful rationales, KeyCP++ constructs a trigger discrimination prompting template. It incorporates the exemplary triggers (a.k.a. keywords) into the prompt as an anchor to simplify trigger profiling, lets the LLM propose candidate triggers, and has it justify each candidate. These propose-and-judge rationales help LLMs mitigate over-reliance on the keywords and promote detection rule learning. Extensive experiments demonstrate the effectiveness of our approach, showcasing significant advancements in one-shot event detection.

[194] InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information cs.CL | cs.AI | cs.CV | I.2.7; I.2.10; I.4.10; I.7.5

Anirudh Iyengar Kaniyar Narayana Iyengar, Srija Mukhopadhyay, Adnan Qidwai, Shubhankar Singh, Dan Roth

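TL;DR: InterChart is a diagnostic benchmark that evaluates vision-language models on multi-chart reasoning and exposes their limitations in cross-chart integration and complex reasoning.

Details

Motivation: Existing VLM benchmarks focus mainly on single-chart tasks, whereas real-world applications such as scientific reports and policy analysis require reasoning over relationships among multiple charts. InterChart fills this gap.

Result: Experiments show that the accuracy of state-of-the-art VLMs drops sharply on multi-chart tasks. Decomposing multi-entity charts improves performance, but cross-chart integration remains challenging.

Insight: Multi-chart reasoning is a weak point for VLMs. A divide-and-conquer strategy may ease the problem, but cross-modal integration methods need further research.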

Abstract: We introduce InterChart, a diagnostic benchmark that evaluates how well vision-language models (VLMs) reason across multiple related charts, a task central to real-world applications such as scientific reporting, financial analysis, and public policy dashboards. Unlike prior benchmarks focusing on isolated, visually uniform charts, InterChart challenges models with diverse question types ranging from entity inference and trend correlation to numerical estimation and abstract multi-step reasoning grounded in 2-3 thematically or structurally related charts. We organize the benchmark into three tiers of increasing difficulty: (1) factual reasoning over individual charts, (2) integrative analysis across synthetically aligned chart sets, and (3) semantic inference over visually complex, real-world chart pairs. Our evaluation of state-of-the-art open and closed-source VLMs reveals consistent and steep accuracy declines as chart complexity increases. We find that models perform better when we decompose multi-entity charts into simpler visual units, underscoring their struggles with cross-chart integration. By exposing these systematic limitations, InterChart provides a rigorous framework for advancing multimodal reasoning in complex, multi-visual environments.

[195] LoSemB: Logic-Guided Semantic Bridging for Inductive Tool Retrieval cs.CL | cs.AI

Luyao Zhuang, Qinggang Zhang, Huachi Zhou, Juhua Liu, Qing Li

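TL;DR: LoSemB is a logic-guided semantic bridging framework for inductive tool retrieval that mines and transfers latent logical information to address the distribution shift of unseen tools and the vulnerability of similarity-based retrieval.

Details

Motivation: Real-world tool repositories are updated continually, and existing methods handle unseen tools poorly in the inductive setting, mainly because of distribution shift and the vulnerability of similarity-based retrieval. Inspired by human cognitive processes, LoSemB uses logical information as a bridge to address these problems.

Result: Experiments show that LoSemB performs strongly in the inductive setting while remaining effective in the conventional (transductive) setting.

Insight: Logical information is key to inductive tool retrieval and can serve as a bridge between human cognitive processes and machine learning.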

Abstract: Tool learning has emerged as a promising paradigm for large language models (LLMs) to solve many real-world tasks. Nonetheless, with the tool repository rapidly expanding, it is impractical to contain all tools within the limited input length of LLMs. To alleviate these issues, researchers have explored incorporating a tool retrieval module to select the most relevant tools or represent tools as unique tokens within LLM parameters. However, most state-of-the-art methods are under transductive settings, assuming all tools have been observed during training. Such a setting deviates from reality as the real-world tool repository is evolving and incorporates new tools frequently. When dealing with these unseen tools, which refer to tools not encountered during the training phase, these methods are limited by two key issues, including the large distribution shift and the vulnerability of similarity-based retrieval. To this end, inspired by human cognitive processes of mastering unseen tools through discovering and applying the logical information from prior experience, we introduce a novel Logic-Guided Semantic Bridging framework for inductive tool retrieval, namely, LoSemB, which aims to mine and transfer latent logical information for inductive tool retrieval without costly retraining. Specifically, LoSemB contains a logic-based embedding alignment module to mitigate distribution shifts and implements a relational augmented retrieval mechanism to reduce the vulnerability of similarity-based retrieval. Extensive experiments demonstrate that LoSemB achieves advanced performance in inductive settings while maintaining desirable effectiveness in the transductive setting.

[196] Can You Trick the Grader? Adversarial Persuasion of LLM Judges cs.CL

Yerin Hwang, Dongryeol Lee, Taegwan Kang, Yongil Kim, Kyomin Jung

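TL;DR: The study finds that strategically embedding persuasive language (for example, consistency or flattery) can significantly affect how LLM judges score mathematical reasoning tasks, and that increasing model size does not effectively mitigate the problem.

Details

Motivation: As large language models (LLMs) increasingly serve as automated graders in practical settings, a key question is whether persuasive language can manipulate them into assigning unfairly high scores.

Result: Persuasive language leads LLM judges to inflate scores for incorrect solutions by up to 8% on average, with the Consistency technique having the strongest effect. Increasing model size does not substantially reduce the effect.

Insight: LLM judges are easily manipulated by persuasive language, which highlights their vulnerability in automated grading and the need for targeted defenses.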

Abstract: As large language models take on growing roles as automated evaluators in practical settings, a critical question arises: Can individuals persuade an LLM judge to assign unfairly high scores? This study is the first to reveal that strategically embedded persuasive language can bias LLM judges when scoring mathematical reasoning tasks, where correctness should be independent of stylistic variation. Grounded in Aristotle’s rhetorical principles, we formalize seven persuasion techniques (Majority, Consistency, Flattery, Reciprocity, Pity, Authority, Identity) and embed them into otherwise identical responses. Across six math benchmarks, we find that persuasive language leads LLM judges to assign inflated scores to incorrect solutions, by up to 8% on average, with Consistency causing the most severe distortion. Notably, increasing model size does not substantially mitigate this vulnerability. Further analysis demonstrates that combining multiple persuasion techniques amplifies the bias, and pairwise evaluation is likewise susceptible. Moreover, the persuasive effect persists under counter prompting strategies, highlighting a critical vulnerability in LLM-as-a-Judge pipelines and underscoring the need for robust defenses against persuasion-based attacks.

[197] Evaluating Compositional Approaches for Focus and Sentiment Analysis cs.CL

Olga Kellert, Muhammad Imran, Nicholas Hill Matlis, Mahmud Uz Zaman, Carlos Gómez-Rodríguez

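TL;DR: The paper evaluates a compositional approach to sentiment analysis (SA) and focus analysis (FA), shows the advantages of compositional rules in SA (such as interpretability), and generalizes them to FA.

Details

Motivation: Quantitative evaluations of compositional approaches are lacking in FA, and the compositional rules used in SA may also apply to FA, so the authors set out to fill this gap.

Result: The compositional approach achieves higher accuracy in SA and can be generalized to FA.

Insight: SA and FA are closely related. Compositional rules apply broadly in semantic analysis, and interpretability is one of their main advantages.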

Abstract: This paper summarizes the results of evaluating a compositional approach for Focus Analysis (FA) in Linguistics and Sentiment Analysis (SA) in Natural Language Processing (NLP). While quantitative evaluations of compositional and non-compositional approaches in SA exist in NLP, similar quantitative evaluations are very rare in FA in Linguistics that deal with linguistic expressions representing focus or emphasis such as “it was John who left”. We fill this gap in research by arguing that compositional rules in SA also apply to FA because FA and SA are closely related meaning that SA is part of FA. Our compositional approach in SA exploits basic syntactic rules such as rules of modification, coordination, and negation represented in the formalism of Universal Dependencies (UDs) in English and applied to words representing sentiments from sentiment dictionaries. Some of the advantages of our compositional analysis method for SA in contrast to non-compositional analysis methods are interpretability and explainability. We test the accuracy of our compositional approach and compare it with a non-compositional approach VADER that uses simple heuristic rules to deal with negation, coordination and modification. In contrast to previous related work that evaluates compositionality in SA on long reviews, this study uses more appropriate datasets to evaluate compositionality. In addition, we generalize the results of compositional approaches in SA to compositional approaches in FA.

[198] Evaluating Large Language Models as Expert Annotators cs.CL

Yu-Min Tseng, Wei-Lin Chen, Chung-Chi Chen, Hsin-Hsi Chen

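TL;DR: The paper evaluates the potential of large language models (LLMs) as expert annotators in domains that require expert knowledge, proposes a multi-agent discussion framework, and finds that reasoning techniques bring limited benefit for data annotation in specialized domains.

Details

Motivation: Text annotation is typically costly and time-consuming. LLMs have shown potential to replace human annotators on general-domain NLP tasks, but their effectiveness in specialized domains that require expert knowledge is still unclear.

Result: Inference-time techniques bring limited gains for individual LLMs, reasoning models do not significantly outperform non-reasoning models, and some models behave stubbornly in the multi-agent setting.

Insight: Data annotation in specialized domains may need more sophisticated annotation frameworks rather than reliance on a single reasoning technique or on multi-agent discussion.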

Abstract: Textual data annotation, the process of labeling or tagging text with relevant information, is typically costly, time-consuming, and labor-intensive. While large language models (LLMs) have demonstrated their potential as direct alternatives to human annotators for general-domain natural language processing (NLP) tasks, their effectiveness on annotation tasks in domains requiring expert knowledge remains underexplored. In this paper, we investigate whether top-performing LLMs, which might be perceived as having expert-level proficiency in academic and professional benchmarks, can serve as direct alternatives to human expert annotators. To this end, we evaluate both individual LLMs and multi-agent approaches across three highly specialized domains: finance, biomedicine, and law. Specifically, we propose a multi-agent discussion framework to simulate a group of human annotators, where LLMs are tasked to engage in discussions by considering others’ annotations and justifications before finalizing their labels. Additionally, we incorporate reasoning models (e.g., o3-mini) to enable a more comprehensive comparison. Our empirical results reveal that: (1) Individual LLMs equipped with inference-time techniques (e.g., chain-of-thought (CoT), self-consistency) show only marginal or even negative performance gains, contrary to prior literature suggesting their broad effectiveness. (2) Overall, reasoning models do not demonstrate statistically significant improvements over non-reasoning models in most settings. This suggests that extended long CoT provides relatively limited benefits for data annotation in specialized domains. (3) Certain model behaviors emerge in the multi-agent discussion environment. For instance, Claude 3.7 Sonnet with thinking rarely changes its initial annotations, even when other agents provide correct annotations or valid reasoning.

[199] REX-RAG: Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation cs.CL

Wentao Jiang, Xiang Feng, Zengmao Wang, Yong Luo, Pingbo Xu

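TL;DR: REX-RAG uses a mixed sampling strategy and a policy correction mechanism to address the dead-end problem of reinforcement learning in retrieval-augmented generation, improving performance on reasoning tasks.

Details

Motivation: In retrieval-augmented generation (RAG), reinforcement learning (RL) is used to improve the reasoning ability of language models, but models easily get trapped in unproductive reasoning paths ("dead ends"), which degrades performance.

Result: Across seven question-answering benchmarks, REX-RAG achieves average performance gains of 5.1% on Qwen2.5-3B and 3.6% on Qwen2.5-7B.

Insight: Through mixed sampling and distributional correction, REX-RAG effectively addresses the dead-end problem of RL in RAG and offers a new approach to complex reasoning.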

Abstract: Reinforcement learning (RL) is emerging as a powerful paradigm for enabling large language models (LLMs) to perform complex reasoning tasks. Recent advances indicate that integrating RL with retrieval-augmented generation (RAG) allows LLMs to dynamically incorporate external knowledge, leading to more informed and robust decision making. However, we identify a critical challenge during policy-driven trajectory sampling: LLMs are frequently trapped in unproductive reasoning paths, which we refer to as “dead ends”, committing to overconfident yet incorrect conclusions. This severely hampers exploration and undermines effective policy optimization. To address this challenge, we propose REX-RAG (Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation), a novel framework that explores alternative reasoning paths while maintaining rigorous policy learning through principled distributional corrections. Our approach introduces two key innovations: (1) Mixed Sampling Strategy, which combines a novel probe sampling method with exploratory prompts to escape dead ends; and (2) Policy Correction Mechanism, which employs importance sampling to correct distribution shifts induced by mixed sampling, thereby mitigating gradient estimation bias. We evaluate it on seven question-answering benchmarks, and the experimental results show that REX-RAG achieves average performance gains of 5.1% on Qwen2.5-3B and 3.6% on Qwen2.5-7B over strong baselines, demonstrating competitive results across multiple datasets. The code is publicly available at https://github.com/MiliLab/REX-RAG.
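
The policy correction step mentioned in the abstract relies on importance sampling: when trajectories are drawn from a mixture of the policy and an exploratory probe, each sample is reweighted by the ratio of policy probability to sampling probability. The sketch below illustrates that generic idea on a two-outcome toy problem. The mixture form, the names, and the numbers are assumptions for illustration; the abstract does not specify REX-RAG's exact estimator.

```python
def importance_weight(policy_prob, probe_prob, mix_ratio):
    """Ratio of target (policy) probability to behaviour (mixture) probability.

    mix_ratio is the assumed fraction of samples drawn from the probe.
    """
    behaviour_prob = (1.0 - mix_ratio) * policy_prob + mix_ratio * probe_prob
    return policy_prob / behaviour_prob


# Toy setting: two possible trajectories, "a" (a dead end) and "b" (correct).
policy = {"a": 0.9, "b": 0.1}   # the policy rarely finds the good path
probe = {"a": 0.2, "b": 0.8}    # the exploratory probe finds it often
reward = {"a": 0.0, "b": 1.0}
mix_ratio = 0.5

# Distribution that the mixed sampler actually draws from.
behaviour = {
    x: (1.0 - mix_ratio) * policy[x] + mix_ratio * probe[x] for x in policy
}

# Expected reward under the policy itself (the quantity we want to estimate).
on_policy_value = sum(policy[x] * reward[x] for x in policy)

# Expected value of the uncorrected estimate under mixed sampling (biased).
naive_value = sum(behaviour[x] * reward[x] for x in policy)

# Expected value of the importance-weighted estimate (matches the policy).
corrected_value = sum(
    behaviour[x] * importance_weight(policy[x], probe[x], mix_ratio) * reward[x]
    for x in policy
)
```

Here the uncorrected estimate overstates the policy's value because the probe visits the rewarding trajectory far more often than the policy would, while the weighted estimate recovers the on-policy expectation.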

[200] Capabilities of GPT-5 on Multimodal Medical Reasoning cs.CL | cs.AI

Shansong Wang, Mingzhe Hu, Qiang Li, Mojtaba Safari, Xiaofeng Yang

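TL;DR: The paper studies GPT-5 as a multimodal medical reasoning system and finds that, on zero-shot chain-of-thought tasks, it outperforms baseline models and human experts, with especially strong results in multimodal reasoning.

Details

Motivation: Medical decision-making often requires integrating heterogeneous information sources such as text and images, yet the cross-modal medical reasoning ability of current large language models has not been thoroughly validated. This study explores GPT-5's potential on these tasks.

Result: GPT-5 outperforms the baseline models on all tasks, with the largest gains on MedXpertQA MM (+29.62% in reasoning and +36.18% in understanding over GPT-4o), where it also surpasses pre-licensed human experts. A case study illustrates its ability to integrate visual and textual information.

Insight: GPT-5's improvement suggests that general-purpose LLMs can approach or exceed human-expert performance on complex multimodal medical reasoning, which provides a useful reference for designing future clinical decision-support systems.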

Abstract: Recent advances in large language models (LLMs) have enabled general-purpose systems to perform increasingly complex domain-specific reasoning without extensive fine-tuning. In the medical domain, decision-making often requires integrating heterogeneous information sources, including patient narratives, structured data, and medical images. This study positions GPT-5 as a generalist multimodal reasoner for medical decision support and systematically evaluates its zero-shot chain-of-thought reasoning performance on both text-based question answering and visual question answering tasks under a unified protocol. We benchmark GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 against standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD. Results show that GPT-5 consistently outperforms all baselines, achieving state-of-the-art accuracy across all QA benchmarks and delivering substantial gains in multimodal reasoning. On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.62% and +36.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding. In contrast, GPT-4o remains below human expert performance in most dimensions. A representative case study demonstrates GPT-5’s ability to integrate visual and textual cues into a coherent diagnostic reasoning chain, recommending appropriate high-stakes interventions. Our results show that, on these controlled multimodal reasoning benchmarks, GPT-5 moves from human-comparable to above human-expert performance. This improvement may substantially inform the design of future clinical decision-support systems.

[201] Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge cs.CL | cs.CY

Yunna Cai, Fan Wang, Haowei Wang, Kun Wang, Kailai Yang

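TL;DR: The paper proposes PsyCrisis-Bench, a reference-free evaluation benchmark built on Chinese mental health dialogues that assesses the safety of LLM responses against expert-defined safety principles, using an LLM-as-Judge approach for in-context evaluation.

Details

Motivation: Evaluating LLM safety in mental health dialogues is difficult because gold-standard answers are missing and the interactions are ethically sensitive, so a reference-free evaluation method is needed.

Result: Experiments show that the method achieves the highest agreement with expert assessments and produces more interpretable evaluation rationales.

Insight: Reference-free evaluation based on expert-defined principles is feasible, especially in highly sensitive domains, and the LLM-as-Judge approach can provide more reliable results.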

Abstract: Evaluating the safety alignment of LLM responses in high-risk mental health dialogues is particularly difficult due to missing gold-standard answers and the ethically sensitive nature of these interactions. To address this challenge, we propose PsyCrisis-Bench, a reference-free evaluation benchmark based on real-world Chinese mental health dialogues. It evaluates whether the model responses align with the safety principles defined by experts. Specifically designed for settings without standard references, our method adopts a prompt-based LLM-as-Judge approach that conducts in-context evaluation using expert-defined reasoning chains grounded in psychological intervention principles. We employ binary point-wise scoring across multiple safety dimensions to enhance the explainability and traceability of the evaluation. Additionally, we present a manually curated, high-quality Chinese-language dataset covering self-harm, suicidal ideation, and existential distress, derived from real-world online discourse. Experiments on 3600 judgments show that our method achieves the highest agreement with expert assessments and produces more interpretable evaluation rationales compared to existing approaches. Our dataset and evaluation tool are publicly available to facilitate further research.

[202] Jinx: Unlimited LLMs for Probing Alignment Failures cs.CL

Jiahao Zhao, Liwei Dong

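TL;DR: The paper introduces Jinx, an unlimited LLM variant for probing alignment failures and studying the safety boundaries of language models.

Details

Motivation: Unlimited language models (trained without safety alignment constraints) are currently used mainly as internal tools at large companies and are hard for the research community to obtain, which limits progress in safety alignment research.

Result: Jinx gives the research community an accessible tool for systematically studying safety failure modes and alignment boundaries of language models.

Insight: With access to unlimited models, researchers can more effectively identify and address potential problems in safety alignment.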

Abstract: Unlimited, or so-called helpful-only language models are trained without safety alignment constraints and never refuse user queries. They are widely used by leading AI companies as internal tools for red teaming and alignment evaluation. For example, if a safety-aligned model produces harmful outputs similar to an unlimited model, this indicates alignment failures that require further attention. Despite their essential role in assessing alignment, such models are not available to the research community. We introduce Jinx, a helpful-only variant of popular open-weight LLMs. Jinx responds to all queries without refusals or safety filtering, while preserving the base model’s capabilities in reasoning and instruction following. It provides researchers with an accessible tool for probing alignment failures, evaluating safety boundaries, and systematically studying failure modes in language model safety.

eess.AS [Back]

[203] TurboBias: Universal ASR Context-Biasing powered by GPU-accelerated Phrase-Boosting Tree eess.AS | cs.AI | cs.CL | cs.SD

Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan, Vitaly Lavrukhin, Boris Ginsburg

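TL;DR: The paper proposes TurboBias, a universal ASR context-biasing framework that supports all major ASR model types and uses a GPU-accelerated phrase-boosting tree for efficient processing, requiring no additional training and causing no noticeable slowdown in decoding.

Details

Motivation: Existing context-biasing methods require additional model training, significantly slow down decoding, or restrict the type of ASR system, so they cannot meet the need for an efficient, universal solution.

Result: Experiments show that the method surpasses the open-source context-biasing approaches considered in both accuracy and decoding speed, and it is open-sourced as part of the NeMo toolkit.

Insight: With hardware acceleration and an efficient data structure (the phrase-boosting tree), accurate context biasing is possible without sacrificing speed, giving ASR systems a flexible, universal solution.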

Abstract: Recognizing specific key phrases is an essential task for contextualized Automatic Speech Recognition (ASR). However, most existing context-biasing approaches have limitations associated with the necessity of additional model training, significantly slow down the decoding process, or constrain the choice of the ASR system type. This paper proposes a universal ASR context-biasing framework that supports all major types: CTC, Transducers, and Attention Encoder-Decoder models. The framework is based on a GPU-accelerated word boosting tree, which enables it to be used in shallow fusion mode for greedy and beam search decoding without noticeable speed degradation, even with a vast number of key phrases (up to 20K items). The obtained results showed high efficiency of the proposed method, surpassing the considered open-source context-biasing approaches in accuracy and decoding speed. Our context-biasing framework is open-sourced as a part of the NeMo toolkit.
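
A boosting tree of this kind is essentially a prefix tree over key phrases: a hypothesis earns a score bonus while it follows a path in the tree, and the bonus is typically revoked if the phrase is abandoned before completion. The sketch below is a plain CPU illustration of that general idea at word level. It is not the NeMo implementation, and the class, function names, bonus value, and revocation rule are all assumptions.

```python
class Node:
    """One node of a word-level prefix tree (hypothetical structure)."""

    def __init__(self):
        self.children = {}
        self.is_end = False  # True if a key phrase ends at this node


def build_trie(phrases):
    """Insert every key phrase into the tree, one word per edge."""
    root = Node()
    for phrase in phrases:
        node = root
        for token in phrase.split():
            node = node.children.setdefault(token, Node())
        node.is_end = True
    return root


def boost_score(tokens, root, bonus=1.0):
    """Total bonus a token sequence earns from completed key phrases.

    Bonus is granted per matched token and revoked if the match
    breaks before reaching the end of a phrase (assumed rule).
    """
    total = 0.0
    pending = 0.0  # bonus granted for a match that is not yet complete
    node = root
    for token in tokens:
        child = node.children.get(token)
        if child is None:
            total -= pending  # partial match failed: take the bonus back
            pending = 0.0
            node = root
            child = root.children.get(token)  # token may start a new phrase
            if child is None:
                continue
        total += bonus
        pending += bonus
        node = child
        if node.is_end:
            pending = 0.0  # phrase completed: the bonus is now permanent
    return total - pending


trie = build_trie(["new york", "speech recognition"])
full_match = boost_score("i love new york".split(), trie)
broken_match = boost_score("new jersey".split(), trie)
```

In shallow fusion, a bonus like this would be added to the ASR model's own score for each hypothesis during greedy or beam search decoding; the GPU batching that the paper describes is outside the scope of this sketch.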

[204] KLASSify to Verify: Audio-Visual Deepfake Detection Using SSL-based Audio and Handcrafted Visual Features eess.AS | cs.CV

Ivan Kukanov, Jun Wah Ng

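TL;DR: The paper proposes a multimodal approach to detecting audio-visual deepfakes that combines handcrafted visual features with self-supervised audio representations to improve robustness and generalization.

Details

Motivation: With the rapid development of audio-driven talking-head generators and TTS models, deepfakes are becoming more sophisticated, creating an urgent need for robust detection methods that generalize to novel attack scenarios.

Result: On the AV-Deepfake1M++ dataset, the multimodal system reaches an AUC of 92.78% for classification, and temporal localization using only the audio modality reaches an IoU of 0.3536.

Insight: The study highlights the potential of combining handcrafted features with self-supervised learning for multimodal deepfake detection and stresses the importance of interpretability and deployment efficiency.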

Abstract: The rapid development of audio-driven talking head generators and advanced Text-To-Speech (TTS) models has led to more sophisticated temporal deepfakes. These advances highlight the need for robust methods capable of detecting and localizing deepfakes, even under novel, unseen attack scenarios. Current state-of-the-art deepfake detectors, while accurate, are often computationally expensive and struggle to generalize to novel manipulation techniques. To address these challenges, we propose multimodal approaches for the AV-Deepfake1M 2025 challenge. For the visual modality, we leverage handcrafted features to improve interpretability and adaptability. For the audio modality, we adapt a self-supervised learning (SSL) backbone coupled with graph attention networks to capture rich audio representations, improving detection robustness. Our approach strikes a balance between performance and real-world deployment, focusing on resilience and potential interpretability. On the AV-Deepfake1M++ dataset, our multimodal system achieves AUC of 92.78% for deepfake classification task and IoU of 0.3536 for temporal localization using only the audio modality.
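
The temporal localization result is reported as an IoU (intersection over union). For a single predicted time segment and a single ground-truth segment, that quantity can be computed as in the sketch below. How the challenge aggregates IoU over multiple segments and videos is not stated in the abstract, so this shows only the basic one-dimensional definition, and the function name is an assumption.

```python
def temporal_iou(pred, truth):
    """IoU of two time intervals given as (start, end) in seconds."""
    pred_start, pred_end = pred
    true_start, true_end = truth
    # Length of the overlap, or zero if the intervals are disjoint.
    intersection = max(0.0, min(pred_end, true_end) - max(pred_start, true_start))
    union = (pred_end - pred_start) + (true_end - true_start) - intersection
    return intersection / union if union > 0 else 0.0


# A predicted fake segment of 1.0-3.0 s against a true segment of 2.0-4.0 s
# overlaps for 1 s out of a 3 s union.
example_iou = temporal_iou((1.0, 3.0), (2.0, 4.0))
```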

q-fin.ST [Back]

[205] Event-Aware Sentiment Factors from LLM-Augmented Financial Tweets: A Transparent Framework for Interpretable Quant Trading q-fin.ST | cs.CL | cs.LG

Yueyi Wang, Qiyao Wei

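TL;DR: The paper explores using large language models (LLMs) to extract sentiment signals from financial tweets and convert them into tradable quantitative signals through multi-label event classification. The results show the potential of a transparent, open-source framework.

Details

Motivation: Financial social media data such as tweets carry rich sentiment information, but turning them into structured signals for quantitative trading remains challenging. The paper aims to use LLMs to make sentiment signals more interpretable and useful.

Result: Experiments show that certain event labels yield significant negative alpha (Sharpe ratios as low as -0.38 and information coefficients exceeding 0.05), statistically significant at the 95% confidence level.

Insight: Social media sentiment signals are noisy, but structured processing with LLMs lets them add value in financial forecasting. A transparent, open-source framework helps democratize algorithmic trading research.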

Abstract: In this study, we wish to showcase the unique utility of large language models (LLMs) in financial semantic annotation and alpha signal discovery. Leveraging a corpus of company-related tweets, we use an LLM to automatically assign multi-label event categories to high-sentiment-intensity tweets. We align these labeled sentiment signals with forward returns over 1-to-7-day horizons to evaluate their statistical efficacy and market tradability. Our experiments reveal that certain event labels consistently yield negative alpha, with Sharpe ratios as low as -0.38 and information coefficients exceeding 0.05, all statistically significant at the 95% confidence level. This study establishes the feasibility of transforming unstructured social media text into structured, multi-label event variables. A key contribution of this work is its commitment to transparency and reproducibility; all code and methodologies are made publicly available. Our results provide compelling evidence that social media sentiment is a valuable, albeit noisy, signal in financial forecasting and underscore the potential of open-source frameworks to democratize algorithmic trading research.
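
The two statistics quoted in the abstract have standard textbook forms: the Sharpe ratio is mean excess return divided by its standard deviation, and the information coefficient is the correlation between a signal and subsequent returns. The sketch below uses those generic definitions on made-up numbers. The abstract does not say whether the authors annualise the Sharpe ratio or use rank correlation for the IC, so both choices here (no annualisation, Pearson correlation) are assumptions.

```python
import math
import statistics


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return over its sample standard deviation (not annualised)."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)


def information_coefficient(signal, forward_returns):
    """Pearson correlation between a signal and the returns that follow it."""
    mean_s = statistics.mean(signal)
    mean_r = statistics.mean(forward_returns)
    cov = sum((s - mean_s) * (r - mean_r) for s, r in zip(signal, forward_returns))
    var_s = sum((s - mean_s) ** 2 for s in signal)
    var_r = sum((r - mean_r) ** 2 for r in forward_returns)
    return cov / math.sqrt(var_s * var_r)


# Illustrative numbers only, not data from the paper.
toy_returns = [0.01, 0.03, 0.02]
toy_sharpe = sharpe_ratio(toy_returns)

toy_signal = [1.0, 2.0, 3.0]
toy_forward = [0.3, 0.2, 0.1]  # returns fall as the signal rises
toy_ic = information_coefficient(toy_signal, toy_forward)
```

A negative Sharpe ratio, as reported for some event labels, simply means the average return of positions taken on that signal was below the risk-free benchmark over the evaluation window.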

cs.LG [Back]

[206] Generative Artificial Intelligence Extracts Structure-Function Relationships from Plants for New Materials cs.LG | cond-mat.dis-nn | cond-mat.mtrl-sci | cond-mat.other | cs.AI | cs.CL

Rachel K. Luu, Jingyu Deng, Mohammed Shahrudin Ibrahim, Nam-Joon Cho, Ming Dao

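TL;DR: The paper presents a framework that combines generative AI with plant science, biomimetics, and materials engineering to extract structure-function relationships from plant structures and design new materials. The AI-generated material designs were validated experimentally, resulting in a novel pollen-based adhesive.

Details

Motivation: Generative AI remains underexplored in highly multidisciplinary fields such as materials science. The main motivation is to extract knowledge from the literature of plant science and other fields and use it to guide experimental design.

Result: A pollen-based adhesive with tunable morphology and measured shear strength was fabricated, demonstrating the practical potential of AI-assisted design.

Insight: Generative AI can extract knowledge from multidisciplinary literature and inspire new materials designs, advancing human-AI collaboration in experimental science.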

Abstract: Large language models (LLMs) have reshaped the research landscape by enabling new approaches to knowledge retrieval and creative ideation. Yet their application in discipline-specific experimental science, particularly in highly multi-disciplinary domains like materials science, remains limited. We present a first-of-its-kind framework that integrates generative AI with literature from hitherto-unconnected fields such as plant science, biomimetics, and materials engineering to extract insights and design experiments for materials. We focus on humidity-responsive systems such as pollen-based materials and Rhapis excelsa (broadleaf lady palm) leaves, which exhibit self-actuation and adaptive performance. Using a suite of AI tools, including a fine-tuned model (BioinspiredLLM), Retrieval-Augmented Generation (RAG), agentic systems, and a Hierarchical Sampling strategy, we extract structure-property relationships and translate them into new classes of bioinspired materials. Structured inference protocols generate and evaluate hundreds of hypotheses from a single query, surfacing novel and experimentally tractable ideas. We validate our approach through real-world implementation: LLM-generated procedures, materials designs, and mechanical predictions were tested in the laboratory, culminating in the fabrication of a novel pollen-based adhesive with tunable morphology and measured shear strength, establishing a foundation for future plant-derived adhesive design. This work demonstrates how AI-assisted ideation can drive real-world materials design and enable effective human-AI collaboration.

[207] AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance cs.LG | cs.AI | cs.CL | cs.CV

Lixuan He, Jie Feng, Yong Li

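TL;DR: AMFT is a novel single-stage algorithm that uses meta-learning to dynamically balance the reward signals of supervised fine-tuning (SFT) and reinforcement learning (RL), significantly improving language model performance on complex reasoning tasks.

Details

Motivation: The existing two-stage pipeline of SFT followed by RL suffers from catastrophic forgetting and a suboptimal trade-off between imitation and exploration, which motivates a method that balances the two dynamically.

Result: AMFT achieves state-of-the-art results on mathematical reasoning, abstract visual reasoning (General Points), and vision-language navigation (V-IRL), and generalizes better on out-of-distribution (OOD) tasks.

Insight: The key to better task performance is dynamically balancing imitation (SFT) and exploration (RL), and a meta-learned controller is an effective tool for doing so.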

Abstract: Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process fraught with catastrophic forgetting and suboptimal trade-offs between imitation and exploration. Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. In this paper, we reframe this challenge through the theoretical lens of implicit rewards, viewing SFT and RL not as distinct methods but as complementary reward signals. We introduce Adaptive Meta Fine-Tuning (AMFT), a novel single-stage algorithm that learns the optimal balance between SFT’s implicit, path-level reward and RL’s explicit, outcome-based reward. The core of AMFT is a meta-gradient adaptive weight controller that treats the SFT-RL balance as a learnable parameter, dynamically optimizing it to maximize long-term task performance. This forward-looking approach, regularized by policy entropy for stability, autonomously discovers an effective training curriculum. We conduct a comprehensive evaluation on challenging benchmarks spanning mathematical reasoning, abstract visual reasoning (General Points), and vision-language navigation (V-IRL). AMFT consistently establishes a new state-of-the-art and demonstrates superior generalization on out-of-distribution (OOD) tasks. Ablation studies and training dynamic analysis confirm that the meta-learning controller is crucial for AMFT’s stability, sample efficiency, and performance, offering a more principled and effective paradigm for LLM alignment. Our code is open-sourced at https://github.com/hlxtsyj/AMFT.
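
Treating the SFT-RL balance as a single learnable parameter can be pictured as a bounded weight that interpolates between the two loss terms. The sketch below shows only that interpolation. The sigmoid parameterisation and all names are assumptions, and the meta-gradient controller that the abstract describes for updating the weight is not reproduced here.

```python
import math


def balance_weight(balance_logit):
    """Map an unconstrained parameter to a weight in (0, 1) via a sigmoid."""
    return 1.0 / (1.0 + math.exp(-balance_logit))


def mixed_loss(sft_loss, rl_loss, balance_logit):
    """Convex combination of an imitation (SFT) loss and an RL loss.

    balance_logit stands in for the learnable balance parameter; how it
    is updated during training is outside the scope of this sketch.
    """
    mu = balance_weight(balance_logit)
    return mu * sft_loss + (1.0 - mu) * rl_loss


# A logit of zero weights both terms equally.
even_mix = mixed_loss(sft_loss=2.0, rl_loss=4.0, balance_logit=0.0)
```

A large positive logit makes training behave almost like pure SFT and a large negative one almost like pure RL, so a controller that moves this parameter over time effectively schedules a curriculum between imitation and exploration.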

[208] Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization cs.LG | cs.AI | cs.CL

Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong

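TL;DR: Klear-Reasoner is a model with long reasoning capabilities that improves reasoning through Gradient-Preserving clipping Policy Optimization (GPPO) and performs strongly on mathematics and programming tasks.

Details

Motivation: Training details of current reasoning models are not fully disclosed, which makes high-performance models hard to reproduce. This work aims to provide a complete analysis of the training workflow and to address problems with clipping mechanisms in reinforcement learning.

Result: Klear-Reasoner scores 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5, and 58.1% on LiveCodeBench V6.

Insight: A small number of high-quality data sources is more effective than a large number of diverse ones, and clipping mechanisms need to be improved to preserve critical exploration signals.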

Abstract: We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. Although there are already many excellent works related to inference models in the current community, there are still many problems with reproducing high-performance inference models due to incomplete disclosure of training details. This report provides an in-depth analysis of the reasoning model, covering the entire post-training workflow from data preparation and long Chain-of-Thought supervised fine-tuning (long CoT SFT) to reinforcement learning (RL), along with detailed ablation studies for each experimental component. For SFT data, our experiments show that a small number of high-quality data sources are more effective than a large number of diverse data sources, and that difficult samples can achieve better results without accuracy filtering. In addition, we investigate two key issues with current clipping mechanisms in RL: Clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose Gradient-Preserving clipping Policy Optimization (GPPO) that gently backpropagates gradients from clipped tokens. GPPO not only enhances the model’s exploration capacity but also improves its efficiency in learning from negative samples. Klear-Reasoner exhibits exceptional reasoning abilities in mathematics and programming, scoring 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5 and 58.1% on LiveCodeBench V6.
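The clipping issue described in the abstract can be made concrete with a small sketch. It compares the derivative of the per-token surrogate with respect to the importance ratio under standard PPO clipping and under one possible gradient-preserving variant. The variant shown here (rescaling by the clip bound divided by a stop-gradient copy of the ratio) is an assumption made for illustration and may differ from the paper's exact GPPO formulation; the function names and the default `eps=0.2` are hypothetical.

```python
def ppo_clip_grad(ratio, advantage, eps=0.2):
    """d(surrogate)/d(ratio) for the standard PPO clipped objective
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_high = advantage > 0 and ratio > 1 + eps
    clipped_low = advantage < 0 and ratio < 1 - eps
    if clipped_high or clipped_low:
        return 0.0  # clipped tokens contribute no gradient
    return advantage


def gradient_preserving_grad(ratio, advantage, eps=0.2):
    """Illustrative variant (an assumption, not the paper's code):
    clipped tokens use (bound / stop_grad(ratio)) * ratio * A, so the
    value equals the clipped surrogate but the derivative is bound/ratio * A."""
    if advantage > 0 and ratio > 1 + eps:
        return advantage * (1 + eps) / ratio
    if advantage < 0 and ratio < 1 - eps:
        return advantage * (1 - eps) / ratio
    return advantage


if __name__ == "__main__":
    for r, a in [(1.0, 1.0), (1.5, 1.0), (0.5, -2.0)]:
        print(r, a, ppo_clip_grad(r, a), gradient_preserving_grad(r, a))
```

In this sketch a clipped token keeps a bounded, nonzero gradient instead of being dropped, which is one way to read "gently backpropagates gradients from clipped tokens".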

[209] Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment cs.LG | cs.AI | cs.CLPDF

Haowen Wang, Yun Yue, Zhiling Ye, Shuowen Zhang, Lei Fan

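TL;DR: The paper proposes the GRAO framework, which combines the strengths of SFT and RL through multi-sample generation, intra-group relative advantage weighting, and reference-aware parameter updates to address efficiency and quality problems in language model alignment.

Details

Motivation: Among existing alignment methods, SFT converges quickly but is constrained by offline policy trajectories, while RL explores well but has low sample efficiency and depends on high-quality base models. GRAO aims to combine the advantages of both.

Result: On complex human alignment tasks, GRAO achieves relative improvements of 57.70%, 17.65%, 7.95%, and 5.18% over the SFT, DPO, PPO, and GRPO baselines, respectively.

Insight: GRAO provides theoretical and empirical grounding for language model alignment and shows that efficient capability evolution is possible.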

Abstract: Alignment methodologies have emerged as a critical pathway for enhancing language model alignment capabilities. While SFT (supervised fine-tuning) accelerates convergence through direct token-level loss intervention, its efficacy is constrained by offline policy trajectory. In contrast, RL (reinforcement learning) facilitates exploratory policy optimization, but suffers from low sample efficiency and stringent dependency on high-quality base models. To address these dual challenges, we propose GRAO (Group Relative Alignment Optimization), a unified framework that synergizes the respective strengths of SFT and RL through three key innovations: 1) A multi-sample generation strategy enabling comparative quality assessment via reward feedback; 2) A novel Group Direct Alignment Loss formulation leveraging intra-group relative advantage weighting; 3) Reference-aware parameter updates guided by pairwise preference dynamics. Our theoretical analysis establishes GRAO’s convergence guarantees and sample efficiency advantages over conventional approaches. Comprehensive evaluations across complex human alignment tasks demonstrate GRAO’s superior performance, achieving 57.70%, 17.65%, 7.95%, and 5.18% relative improvements over SFT, DPO, PPO, and GRPO baselines, respectively. This work provides both a theoretically grounded alignment framework and empirical evidence for efficient capability evolution in language models.
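As a rough illustration of "intra-group relative advantage weighting", the sketch below normalizes the rewards of several samples generated for the same prompt against the group's own mean and standard deviation. This is the generic group-relative normalization used by GRPO-style methods and is shown only as an assumption about the general idea; the abstract does not give GRAO's actual loss, and `group_relative_advantages` is a hypothetical helper name.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sample relative to the other samples in its group.

    rewards: rewards for several responses to the same prompt.
    Returns one advantage per response: positive means better than the
    group average, negative means worse.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses for one prompt, two rewarded and two not.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because the baseline is computed from the group itself, no separate value model is needed to decide which samples should be reinforced.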

[210] Pareto Multi-Objective Alignment for Language Models cs.LG | cs.AI | cs.CLPDF

Qiang He, Setareh Maghsudi

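TL;DR: The paper proposes the Pareto Multi-Objective Alignment (PAMA) algorithm, which addresses multi-objective alignment (MOA) for large language models (LLMs) by making multi-objective optimization computationally efficient.

Details

Motivation: Current RLHF-based alignment methods usually optimize a single reward function, which leads to rigid LLM behavior that cannot accommodate diverse and conflicting human preferences. An efficient multi-objective alignment method is therefore needed.

Result: Experiments show that PAMA performs well on multi-objective tasks, consistent with its theoretical advantages, and offers an efficient solution for aligning LLMs with diverse values.

Insight: PAMA is presented as the first method to make multi-objective alignment tractable both theoretically and computationally, which supports LLM use in complex real-world settings such as balancing conflicting goals like informativeness and conciseness.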

Abstract: Large language models (LLMs) are increasingly deployed in real-world applications that require careful balancing of multiple, often conflicting, objectives, such as informativeness versus conciseness, or helpfulness versus creativity. However, current alignment methods, primarily based on RLHF, optimize LLMs toward a single reward function, resulting in rigid behavior that fails to capture the complexity and diversity of human preferences. This limitation hinders the adaptability of LLMs to practical scenarios, making multi-objective alignment (MOA) a critical yet underexplored area. To bridge this gap, we propose Pareto Multi-Objective Alignment (PAMA), a principled and computationally efficient algorithm designed explicitly for MOA in LLMs. In contrast to computationally prohibitive multi-objective optimization (MOO) methods, PAMA transforms multi-objective RLHF into a convex optimization with a closed-form solution, significantly enhancing scalability. Traditional MOO approaches suffer from prohibitive O(n^2*d) complexity, where d represents the number of model parameters, typically in the billions for LLMs, rendering direct optimization infeasible. PAMA reduces this complexity to O(n) where n is the number of objectives, enabling optimization to be completed within milliseconds. We provide theoretical guarantees that PAMA converges to a Pareto stationary point, where no objective can be improved without degrading at least one other. Extensive experiments across language models ranging from 125M to 7B parameters demonstrate PAMA’s robust and effective MOA capabilities, aligning with its theoretical advantages. PAMA provides a highly efficient solution to the MOA problem that was previously considered intractable, offering a practical and theoretically grounded approach to aligning LLMs with diverse human values, paving the way for versatile and adaptable real-world AI deployments.
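To make "Pareto stationary point" concrete, the sketch below uses the classic two-objective min-norm construction from multiple-gradient descent: find the convex combination of two objective gradients with the smallest norm. If that combination is the zero vector, no direction improves both objectives at once. This is the traditional gradient-based approach, not PAMA's closed-form solution (which the abstract does not spell out), and the function names are hypothetical.

```python
def min_norm_weight(g1, g2):
    """Weight alpha in [0, 1] minimizing ||alpha*g1 + (1-alpha)*g2||^2."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        return 0.5  # identical gradients: any weight gives the same vector
    alpha = sum((b - a) * b for a, b in zip(g1, g2)) / denom
    return min(1.0, max(0.0, alpha))


def combined_direction(g1, g2):
    """Min-norm convex combination of the two objective gradients."""
    alpha = min_norm_weight(g1, g2)
    return [alpha * a + (1 - alpha) * b for a, b in zip(g1, g2)]


if __name__ == "__main__":
    print(combined_direction([1.0, 0.0], [0.0, 1.0]))   # compatible objectives
    print(combined_direction([1.0, 0.0], [-1.0, 0.0]))  # directly conflicting
```

Note that this construction works on full parameter-space gradients, which is exactly the cost the abstract says PAMA avoids.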

[211] From Source to Target: Leveraging Transfer Learning for Predictive Process Monitoring in Organizations cs.LG | cs.CL | cs.DBPDF

Sven Weinzierl, Sandra Zilker, Annina Liessmann, Martin Käppel, Weixin Wang

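TL;DR: The paper proposes a transfer learning-based predictive process monitoring (PPM) technique that helps resource-constrained organizations obtain decision support.

Details

Motivation: Traditional PPM techniques require large amounts of event data or other resources that many organizations lack, which limits the adoption of PPM.

Result: Experiments show that knowledge from a source business process can be transferred to a target business process to enable effective PPM.

Insight: Transfer learning makes PPM feasible for organizations that lack resources and shows the potential of knowledge transfer across organizations.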

Abstract: Event logs reflect the behavior of business processes that are mapped in organizational information systems. Predictive process monitoring (PPM) transforms these data into value by creating process-related predictions that provide the insights required for proactive interventions at process runtime. Existing PPM techniques require sufficient amounts of event data or other relevant resources that might not be readily available, preventing some organizations from utilizing PPM. The transfer learning-based PPM technique presented in this paper allows organizations without suitable event data or other relevant resources to implement PPM for effective decision support. The technique is instantiated in two real-life use cases, based on which numerical experiments are performed using event logs for IT service management processes in an intra- and inter-organizational setting. The results of the experiments suggest that knowledge of one business process can be transferred to a similar business process in the same or a different organization to enable effective PPM in the target context. With the proposed technique, organizations can benefit from transfer learning in an intra- and inter-organizational setting, where resources like pre-trained models are transferred within and across organizational boundaries.

[212] Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning cs.LG | cs.CLPDF

Zihe Liu, Jiashun Liu, Yancheng He, Weixun Wang, Jiaheng Liu

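TL;DR: The paper systematically analyzes the technical challenges of RL for LLM reasoning, reproduces and evaluates widely used techniques within a unified framework, and proposes a simplified yet effective combination.

Details

Motivation: RL for LLM reasoning lacks standardized guidelines, and inconsistent experimental results make technique selection confusing.

Result: The proposed minimalist combination, which uses the vanilla PPO loss, outperforms more complex methods such as GRPO and DAPO.

Insight: 1. Technique selection should fit the specific setup; 2. A simple combination can be more effective than complex methods; 3. Analyzing the internal mechanisms of RL techniques is essential for LLM reasoning.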

Abstract: Reinforcement learning for LLM reasoning has rapidly emerged as a prominent research area, marked by a significant surge in related studies on both algorithmic innovations and practical applications. Despite this progress, several critical challenges remain, including the absence of standardized guidelines for employing RL techniques and a fragmented understanding of their underlying mechanisms. Additionally, inconsistent experimental settings, variations in training data, and differences in model initialization have led to conflicting conclusions, obscuring the key characteristics of these techniques and creating confusion among practitioners when selecting appropriate techniques. This paper systematically reviews widely adopted RL techniques through rigorous reproductions and isolated evaluations within a unified open-source framework. We analyze the internal mechanisms, applicable scenarios, and core principles of each technique through fine-grained experiments, including datasets of varying difficulty, model sizes, and architectures. Based on these insights, we present clear guidelines for selecting RL techniques tailored to specific setups, and provide a reliable roadmap for practitioners navigating the RL for the LLM domain. Finally, we reveal that a minimalist combination of two techniques can unlock the learning capability of critic-free policies using vanilla PPO loss. The results demonstrate that our simple combination consistently improves performance, surpassing strategies like GRPO and DAPO.

Keyan Rahimi, Md. Wasiul Haque, Sagar Dasgupta, Mizanur Rahman

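TL;DR: The paper proposes an indoor navigation approach that combines vision-based localization with LLM-based navigation, using ResNet-50 and an LLM to localize users and generate directions in complex environments.

Details

Motivation: Indoor environments lack GPS signals and are structurally complex, so traditional navigation methods are hard to apply. The authors therefore propose an infrastructure-free solution.

Result: Localization accuracy reaches 96% and navigation instructions are 75% accurate on average, with limitations in zero-shot reasoning and inference time.

Insight: The approach shows the potential for scalable indoor navigation using off-the-shelf devices and publicly available floor plans, particularly in resource-constrained settings such as hospitals and airports.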

Abstract: Indoor navigation remains a complex challenge due to the absence of reliable GPS signals and the architectural intricacies of large enclosed environments. This study presents an indoor localization and navigation approach that integrates vision-based localization with large language model (LLM)-based navigation. The localization system utilizes a ResNet-50 convolutional neural network fine-tuned through a two-stage process to identify the user’s position using smartphone camera input. To complement localization, the navigation module employs an LLM, guided by a carefully crafted system prompt, to interpret preprocessed floor plan images and generate step-by-step directions. Experimental evaluation was conducted in a realistic office corridor with repetitive features and limited visibility to test localization robustness. The model achieved high confidence and an accuracy of 96% across all tested waypoints, even under constrained viewing conditions and short-duration queries. Navigation tests using ChatGPT on real building floor maps yielded an average instruction accuracy of 75%, with observed limitations in zero-shot reasoning and inference time. This research demonstrates the potential for scalable, infrastructure-free indoor navigation using off-the-shelf cameras and publicly available floor plans, particularly in resource-constrained settings like hospitals, airports, and educational institutions.

cs.RO [Back]

Xiaobei Zhao, Xingqi Lyu, Xiang Li

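TL;DR: The paper proposes a benchmark and method for agricultural vision-and-language navigation (AgriVLN), targeting robot navigation in agricultural scenes and improving navigation performance with a Vision-Language Model (VLM) and an instruction decomposition module.

Details

Motivation: Agricultural robots currently rely on manual operation or fixed rails for movement and lack flexibility. Vision-and-Language Navigation (VLN) performs well in other domains, but no solution targets agricultural scenes.

Result: AgriVLN performs well on short instructions but struggles with long ones. Adding the STL module raises Success Rate (SR) from 0.33 to 0.47, outperforming existing VLN methods.

Insight: VLN in agricultural scenes needs finer-grained instruction decomposition to handle complex tasks. The STL module also suggests a way to improve VLN in other domains.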

Abstract: Agricultural robots have emerged as powerful assistants in agricultural tasks, but they still rely heavily on manual operation or non-relocatable rails for movement, resulting in limited mobility and poor adaptability. Vision-and-Language Navigation (VLN) enables robots to navigate to target destinations following natural language instructions and has demonstrated strong performance in several domains. However, none of the existing benchmarks or methods is specifically designed for agricultural scenes. To bridge this gap, we propose the Agriculture to Agriculture (A2A) benchmark, containing 1,560 episodes across six diverse agricultural scenes, in which all realistic RGB videos are captured by a front-facing camera on a quadruped robot at a height of 0.38 meters, aligning with practical deployment conditions. Meanwhile, we propose the Vision-and-Language Navigation for Agricultural Robots (AgriVLN) baseline, based on a Vision-Language Model (VLM) prompted with carefully crafted templates, which can understand both the given instructions and the agricultural environment to generate appropriate low-level actions for robot control. When evaluated on A2A, AgriVLN performs well on short instructions but struggles with long instructions, because it often fails to track which part of the instruction is currently being executed. To address this, we further propose the Subtask List (STL) instruction decomposition module and integrate it into AgriVLN, improving Success Rate (SR) from 0.33 to 0.47. We additionally compare AgriVLN with several existing VLN methods, demonstrating state-of-the-art performance in the agricultural domain.

[215] Progressive Bird’s Eye View Perception for Safety-Critical Autonomous Driving: A Comprehensive Survey cs.RO | cs.CVPDF

Yan Gong, Naibang Wang, Jianli Lu, Xinyu Zhang, Yongsheng Gao

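TL;DR: The paper is presented as the first comprehensive survey of Bird’s-Eye-View (BEV) perception from a safety-critical perspective. It analyzes state-of-the-art frameworks across three stages (single-modality vehicle-side, multimodal vehicle-side, and multi-agent collaborative perception) and discusses public datasets, open-world challenges, and future research directions.

Details

Motivation: As autonomous driving moves from controlled environments to real-world deployment, the safety and reliability of BEV perception in complex scenarios such as occlusion and adverse weather becomes a key challenge. The paper aims to fill the survey gap in this area.

Result: The survey summarizes the state of BEV perception, points out its limitations in safety and robustness, and identifies open challenges.

Insight: Future research should focus on integration with end-to-end autonomous driving systems, embodied intelligence, and large language models to further improve the safety and adaptability of BEV perception.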

Abstract: Bird’s-Eye-View (BEV) perception has become a foundational paradigm in autonomous driving, enabling unified spatial representations that support robust multi-sensor fusion and multi-agent collaboration. As autonomous vehicles transition from controlled environments to real-world deployment, ensuring the safety and reliability of BEV perception in complex scenarios - such as occlusions, adverse weather, and dynamic traffic - remains a critical challenge. This survey provides the first comprehensive review of BEV perception from a safety-critical perspective, systematically analyzing state-of-the-art frameworks and implementation strategies across three progressive stages: single-modality vehicle-side, multimodal vehicle-side, and multi-agent collaborative perception. Furthermore, we examine public datasets encompassing vehicle-side, roadside, and collaborative settings, evaluating their relevance to safety and robustness. We also identify key open-world challenges - including open-set recognition, large-scale unlabeled data, sensor degradation, and inter-agent communication latency - and outline future research directions, such as integration with end-to-end autonomous driving systems, embodied intelligence, and large language models.

Shoaib Ahmmad, Zubayer Ahmed Aditto, Md Mehrab Hossain, Noushin Yeasmin, Shorower Hossain

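TL;DR: The paper proposes a cloud-based autonomous navigation system for quadcopters that combines multimodal perception with LLM-driven high-level semantic reasoning, targeting GPS-denied indoor environments.

Details

Motivation: Autonomous drone navigation in GPS-denied indoor environments is a major challenge. Safe and precise navigation requires combining efficient perception with intelligent decision-making.

Result: Experiments report an object detection mAP50 of 0.6, a depth estimation MAE of 7.2 cm, 16 safety envelope breaches across 42 trials, and end-to-end latency below 1 second.

Insight: Combining cloud offloading with high-level semantic reasoning offers a new approach to drone navigation in complex environments.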

Abstract: This paper introduces an advanced AI-driven perception system for autonomous quadcopter navigation in GPS-denied indoor environments. The proposed framework leverages cloud computing to offload computationally intensive tasks and incorporates a custom-designed printed circuit board (PCB) for efficient sensor data acquisition, enabling robust navigation in confined spaces. The system integrates YOLOv11 for object detection, Depth Anything V2 for monocular depth estimation, a PCB equipped with Time-of-Flight (ToF) sensors and an Inertial Measurement Unit (IMU), and a cloud-based Large Language Model (LLM) for context-aware decision-making. A virtual safety envelope, enforced by calibrated sensor offsets, ensures collision avoidance, while a multithreaded architecture achieves low-latency processing. Enhanced spatial awareness is facilitated by 3D bounding box estimation with Kalman filtering. Experimental results in an indoor testbed demonstrate strong performance, with object detection achieving a mean Average Precision (mAP50) of 0.6, depth estimation Mean Absolute Error (MAE) of 7.2 cm, only 16 safety envelope breaches across 42 trials over approximately 11 minutes, and end-to-end system latency below 1 second. This cloud-supported, high-intelligence framework serves as an auxiliary perception and navigation system, complementing state-of-the-art drone autonomy for GPS-denied confined spaces.
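The abstract mentions Kalman filtering for stabilizing 3D bounding box estimates. As a minimal sketch of the underlying predict/update cycle, the code below runs a scalar Kalman filter over a noisy distance reading. It is a one-dimensional simplification chosen for illustration; the state model, the noise variances, and the name `kalman_1d` are assumptions and do not come from the paper.

```python
def kalman_1d(measurements, process_var=1e-3, meas_var=0.05, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a slowly varying quantity.

    measurements: noisy readings (for example, distance in meters).
    process_var:  assumed variance of the state drift per step.
    meas_var:     assumed variance of the sensor noise.
    Returns the filtered estimate after each measurement.
    """
    x, p = x0, p0  # state estimate and its variance
    estimates = []
    for z in measurements:
        p = p + process_var      # predict: uncertainty grows between readings
        k = p / (p + meas_var)   # Kalman gain: trust in the new reading
        x = x + k * (z - x)      # update: move toward the measurement
        p = (1.0 - k) * p        # updated uncertainty shrinks
        estimates.append(x)
    return estimates


if __name__ == "__main__":
    readings = [2.1, 1.9, 2.05, 1.95, 2.0, 2.2, 1.8]
    for z, est in zip(readings, kalman_1d(readings, x0=readings[0])):
        print(f"measured {z:.2f} -> filtered {est:.2f}")
```

A real tracker for 3D boxes would use a vector state (position, size, possibly velocity) with matrix-valued covariances, but the predict and update steps follow the same pattern.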

[217] ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks cs.RO | cs.CVPDF

Kaijun Wang, Liqin Lu, Mingyu Liu, Jianuo Jiang, Zeju Li

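TL;DR: ODYSSEY is a unified mobile manipulation framework for quadruped robots equipped with manipulators. It combines high-level task planning with low-level whole-body control to execute long-horizon tasks in unstructured environments.

Details

Motivation: Existing language-guided long-horizon mobile manipulation systems are mostly confined to tabletop scenarios and struggle with the constrained perception and limited actuation range of mobile platforms, while current manipulation strategies generalize poorly in open-world settings. ODYSSEY aims to address these problems and advance general-purpose robotic assistants.

Result: The system completes complex long-horizon tasks in both simulation and the real world, supporting the framework’s generalization and robustness.

Insight: Improving the manipulation capability of arm-equipped quadruped robots in unstructured environments offers a feasible path toward general-purpose robotic assistants.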

Abstract: Language-guided long-horizon mobile manipulation has long been a grand challenge in embodied semantic reasoning, generalizable manipulation, and adaptive locomotion. Three fundamental limitations hinder progress: First, although large language models have improved spatial reasoning and task planning through semantic priors, existing implementations remain confined to tabletop scenarios, failing to address the constrained perception and limited actuation ranges of mobile platforms. Second, current manipulation strategies exhibit insufficient generalization when confronted with the diverse object configurations encountered in open-world environments. Third, while crucial for practical deployment, the dual requirement of maintaining high platform maneuverability alongside precise end-effector control in unstructured settings remains understudied. In this work, we present ODYSSEY, a unified mobile manipulation framework for agile quadruped robots equipped with manipulators, which seamlessly integrates high-level task planning with low-level whole-body control. To address the challenge of egocentric perception in language-conditioned tasks, we introduce a hierarchical planner powered by a vision-language model, enabling long-horizon instruction decomposition and precise action execution. At the control level, our novel whole-body policy achieves robust coordination across challenging terrains. We further present the first benchmark for long-horizon mobile manipulation, evaluating diverse indoor and outdoor scenarios. Through successful sim-to-real transfer, we demonstrate the system’s generalization and robustness in real-world deployments, underscoring the practicality of legged manipulators in unstructured environments. Our work advances the feasibility of generalized robotic assistants capable of complex, dynamic tasks. Our project page: https://kaijwang.github.io/odyssey.github.io/

cs.IT [Back]

[218] Codebook-enabled Generative End-to-end Semantic Communication Powered by Transformer cs.IT | cs.AI | cs.CV | math.ITPDF

Peigen Ye, Yaping Sun, Shumin Yao, Hao Chen, Xiaodong Xu

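TL;DR: The paper proposes a generative end-to-end semantic communication system based on a codebook and a Transformer. It jointly constructs the semantic codec and the codebook and designs a codebook-guided vector-to-index transformer, which improves robustness to channel noise and yields images with better visual perception than traditional methods.

Details

Motivation: In existing codebook-based semantic communication systems, semantic relations among code vectors are unrelated to the distances between their indices, which makes the systems sensitive to channel noise. A more robust design is needed.

Result: Generated images are perceptually better than those of JPEG+LDPC and traditional joint source-channel coding (JSCC) methods, and numerical results support the advantage.

Insight: A generative semantic communication system that combines a codebook with a Transformer can improve image generation quality under noise and offers a new approach to end-to-end communication.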

Abstract: Codebook-based generative semantic communication attracts increasing attention, since only indices are required to be transmitted when the codebook is shared between transmitter and receiver. However, due to the fact that the semantic relations among code vectors are not necessarily related to the distance of the corresponding code indices, the performance of the codebook-enabled semantic communication system is susceptible to the channel noise. Thus, how to improve the system robustness against the noise requires careful design. This paper proposes a robust codebook-assisted image semantic communication system, where semantic codec and codebook are first jointly constructed, and then vector-to-index transformer is designed guided by the codebook to eliminate the effects of channel noise, and achieve image generation. Thanks to the assistance of the high-quality codebook to the Transformer, the generated images at the receiver outperform those of the compared methods in terms of visual perception. In the end, numerical results and generated images demonstrate the advantages of the generative semantic communication method over JPEG+LDPC and traditional joint source channel coding (JSCC) methods.
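The noise sensitivity described in the abstract follows from how codebook transmission works: only indices are sent, and neighbouring indices need not point to similar vectors. The sketch below shows plain nearest-neighbour quantization against a shared codebook, as a baseline illustration only. The toy codebook, the function names, and the use of squared Euclidean distance are assumptions; the paper's learned vector-to-index transformer is not reproduced here.

```python
def nearest_index(vector, codebook):
    """Index of the code vector closest to `vector` (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def encode(vectors, codebook):
    """Transmitter side: replace each feature vector by a codebook index."""
    return [nearest_index(v, codebook) for v in vectors]


def decode(indices, codebook):
    """Receiver side: look the indices up in the shared codebook."""
    return [codebook[i] for i in indices]


if __name__ == "__main__":
    # Toy codebook: entries 0 and 2 are close in feature space,
    # while entry 1 (the index between them) is far from both.
    codebook = [[0.0, 0.0], [10.0, 10.0], [0.5, 0.5]]
    sent = encode([[0.4, 0.6]], codebook)
    print(sent, decode(sent, codebook))
    # A channel error that turns index 2 into index 1 decodes to a
    # semantically unrelated vector.
    print(decode([1], codebook))
```

This is why an index corrupted by a small amount can decode to very different content, which motivates designs that make the index mapping robust to noise.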

cs.MM [Back]

[219] AD-AVSR: Asymmetric Dual-stream Enhancement for Robust Audio-Visual Speech Recognition cs.MM | cs.CV | cs.SD | eess.ASPDF

Junxiao Xue, Xiaozhen Liu, Xuecheng Wu, Xinyi Yin, Danlei Huang

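TL;DR: AD-AVSR proposes a new bidirectional modality enhancement framework that uses an audio dual-stream encoding strategy and asymmetric cross-modal interaction to improve the performance and noise robustness of audio-visual speech recognition.

Details

Motivation: Existing audio-visual speech recognition methods mostly use unidirectional enhancement or symmetric fusion, which makes it hard to capture heterogeneous, complementary correlations under asymmetric information conditions.

Result: On the LRS2 and LRS3 datasets, AD-AVSR outperforms state-of-the-art methods in both performance and noise robustness.

Insight: Asymmetric cross-modal interaction and selective filtering of weakly correlated audio-visual pairs can improve robustness and accuracy.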

Abstract: Audio-visual speech recognition (AVSR) combines audio-visual modalities to improve speech recognition, especially in noisy environments. However, most existing methods rely on unidirectional enhancement or symmetric fusion, which limits their ability to capture heterogeneous and complementary correlations in audio-visual data, especially under asymmetric information conditions. To address these gaps, we introduce a new AVSR framework termed AD-AVSR based on bidirectional modality enhancement. Specifically, we first introduce an audio dual-stream encoding strategy to enrich audio representations from multiple perspectives and intentionally establish asymmetry to support subsequent cross-modal interactions. The enhancement process involves two key components: an Audio-aware Visual Refinement Module, which enhances visual representations under audio guidance, and a Cross-modal Noise Suppression Masking Module, which refines audio representations using visual cues. Together they produce a closed-loop, bidirectional information flow. To further enhance correlation robustness, we adopt a threshold-based selection mechanism to filter out irrelevant or weakly correlated audio-visual pairs. Extensive experimental results on the LRS2 and LRS3 datasets indicate that AD-AVSR consistently surpasses SOTA methods in both performance and noise robustness, highlighting the effectiveness of our model design.
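The threshold-based selection step can be pictured as follows: score each audio-visual pair by a correlation measure and keep only pairs above a cutoff. The sketch assumes cosine similarity as the score and 0.5 as the threshold. Both are illustrative choices, since the abstract does not state which measure or value the authors use, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero feature as uncorrelated
    return dot / (norm_a * norm_b)


def select_correlated_pairs(audio_feats, visual_feats, threshold=0.5):
    """Indices of audio-visual pairs whose similarity meets the threshold."""
    return [
        i
        for i, (a, v) in enumerate(zip(audio_feats, visual_feats))
        if cosine_similarity(a, v) >= threshold
    ]


if __name__ == "__main__":
    audio = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    visual = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
    print(select_correlated_pairs(audio, visual))
```

Pairs that fall below the threshold would simply be excluded from the cross-modal interaction, so unrelated frames cannot inject noise into the other modality.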

stat.ML [Back]

[220] Membership Inference Attacks with False Discovery Rate Control stat.ML | cs.CV | cs.LGPDF

Chenxu Zhao, Wei Qian, Aobo Chen, Mengdi Huai

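TL;DR: The paper proposes a new membership inference attack method that controls the false discovery rate (FDR) and validates it through theoretical analysis and experiments.

Details

Motivation: Existing membership inference attacks (MIAs) provide no guarantees on the false discovery rate (FDR) and cannot control the proportion of false discoveries.

Result: Experiments show that the method performs well in a range of settings (such as black-box and lifelong learning scenarios) and controls the FDR effectively.

Insight: The method provides FDR guarantees and shows how to handle membership inference when the underlying distribution is not fully known.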

Abstract: Recent studies have shown that deep learning models are vulnerable to membership inference attacks (MIAs), which aim to infer whether a data record was used to train a target model or not. To analyze and study these vulnerabilities, various MIA methods have been proposed. Despite the significance and popularity of MIAs, existing works on MIAs are limited in providing guarantees on the false discovery rate (FDR), which refers to the expected proportion of false discoveries among the identified positive discoveries. However, it is very challenging to ensure the false discovery rate guarantees, because the underlying distribution is usually unknown, and the estimated non-member probabilities often exhibit interdependence. To tackle the above challenges, in this paper, we design a novel membership inference attack method, which can provide the guarantees on the false discovery rate. Additionally, we show that our method can also provide the marginal probability guarantee on labeling true non-member data as member data. Notably, our method can work as a wrapper that can be seamlessly integrated with existing MIA methods in a post-hoc manner, while also providing the FDR control. We perform the theoretical analysis for our method. Extensive experiments in various settings (e.g., the black-box setting and the lifelong learning setting) are also conducted to verify the desirable performance of our method.
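For readers unfamiliar with FDR control, the sketch below implements the standard Benjamini-Hochberg step-up procedure, which is the textbook way to bound the expected share of false discoveries among flagged items. It is included only to illustrate what an FDR guarantee means in a post-hoc wrapper. Whether the paper uses this exact procedure, and how it turns attack scores into p-values under dependence, is not stated in the abstract, so both are assumptions here.

```python
def benjamini_hochberg(p_values, alpha=0.1):
    """Benjamini-Hochberg step-up procedure.

    p_values: one p-value per record for the null hypothesis
              "this record is a non-member" (how these are obtained
              from an attack score is outside this sketch).
    alpha:    target false discovery rate.
    Returns the sorted indices of records flagged as members.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value passes its threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])


if __name__ == "__main__":
    p = [0.01, 0.04, 0.03, 0.5]
    print(benjamini_hochberg(p, alpha=0.05))
    print(benjamini_hochberg(p, alpha=0.1))
```

The classical guarantee for this procedure assumes independent (or positively dependent) p-values, which is relevant given the abstract's remark that estimated non-member probabilities are often interdependent.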

eess.IV [Back]

[221] Transfer Learning with EfficientNet for Accurate Leukemia Cell Classification eess.IV | cs.CV | cs.LG | F.2.2; I.2.7PDF

Faisal Ahmed

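TL;DR: The paper studies transfer learning with EfficientNet for accurate leukemia cell classification. With data augmentation and an evaluation of several models, EfficientNet-B3 performs best.

Details

Motivation: Accurate classification of Acute Lymphoblastic Leukemia (ALL) is essential for early diagnosis and treatment, but class imbalance in the data is a major challenge.

Result: EfficientNet-B3 performs best, with an F1-score of 94.30%, accuracy of 92.02%, and AUC of 94.79%.

Insight: Combining data augmentation with efficient transfer learning models can improve the accuracy and robustness of medical image classification.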

Abstract: Accurate classification of Acute Lymphoblastic Leukemia (ALL) from peripheral blood smear images is essential for early diagnosis and effective treatment planning. This study investigates the use of transfer learning with pretrained convolutional neural networks (CNNs) to improve diagnostic performance. To address the class imbalance in the dataset of 3,631 Hematologic and 7,644 ALL images, we applied extensive data augmentation techniques to create a balanced training set of 10,000 images per class. We evaluated several models, including ResNet50, ResNet101, and EfficientNet variants B0, B1, and B3. EfficientNet-B3 achieved the best results, with an F1-score of 94.30%, accuracy of 92.02%, and AUC of 94.79%, outperforming previously reported methods in the C-NMC Challenge. These findings demonstrate the effectiveness of combining data augmentation with advanced transfer learning models, particularly EfficientNet-B3, in developing accurate and robust diagnostic tools for hematologic malignancy detection.

[222] LWT-ARTERY-LABEL: A Lightweight Framework for Automated Coronary Artery Identification eess.IV | cs.CVPDF

Shisheng Zhang, Ramtin Gharleghi, Sonit Singh, Daniel Moses, Dona Adikari

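TL;DR: The paper proposes LWT-ARTERY-LABEL, a lightweight framework that combines anatomical knowledge with rule-based topology constraints for automated coronary artery labelling, addressing limitations of both traditional and deep-learning methods.

Details

Motivation: Coronary artery disease is a leading cause of death worldwide, and CTCA is a key diagnostic tool, but coronary artery analysis is time-consuming and labour-intensive. Existing knowledge-based methods fail to exploit data-driven insights, while deep-learning methods demand substantial computational resources.

Result: The method performs strongly on benchmark datasets, supporting its effectiveness and efficiency.

Insight: Combining rule-based constraints with data-driven methods can achieve high performance at low computational cost and offers a new approach for medical image analysis.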

Abstract: Coronary artery disease (CAD) remains the leading cause of death globally, with computed tomography coronary angiography (CTCA) serving as a key diagnostic tool. However, coronary arterial analysis using CTCA, such as identifying artery-specific features from computational modelling, is labour-intensive and time-consuming. Automated anatomical labelling of coronary arteries offers a potential solution, yet the inherent anatomical variability of coronary trees presents a significant challenge. Traditional knowledge-based labelling methods fall short in leveraging data-driven insights, while recent deep-learning approaches often demand substantial computational resources and overlook critical clinical knowledge. To address these limitations, we propose a lightweight method that integrates anatomical knowledge with rule-based topology constraints for effective coronary artery labelling. Our approach achieves state-of-the-art performance on benchmark datasets, providing a promising alternative for automated coronary artery labelling.

[223] Fusion-Based Brain Tumor Classification Using Deep Learning and Explainable AI, and Rule-Based Reasoning eess.IV | cs.CVPDF

Melika Filvantorkaman, Mohsen Piri, Maral Filvan Torkaman, Ashkan Zabihi, Hamidreza Moradi

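TL;DR: The study proposes an ensemble approach that combines deep learning with explainable AI (XAI) for accurate brain tumor classification, using Grad-CAM++ and clinical decision rules to improve interpretability, and reports high accuracy and clinical trust.

Details

Motivation: Accurate brain tumor classification is essential for diagnosis and treatment, but existing deep learning methods lack transparency and clinical interpretability, which hinders clinical adoption.

Result: The ensemble reaches 91.7% accuracy on the Figshare dataset, and Grad-CAM++ heatmaps align closely with expert-annotated regions (Dice coefficients up to 0.88). Clinical experts gave high interpretability scores.

Insight: Combining deep learning with explainability techniques such as Grad-CAM++ improves clinical applicability, while rule-based decision support strengthens clinicians’ trust in AI predictions.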

Abstract: Accurate and interpretable classification of brain tumors from magnetic resonance imaging (MRI) is critical for effective diagnosis and treatment planning. This study presents an ensemble-based deep learning framework that combines MobileNetV2 and DenseNet121 convolutional neural networks (CNNs) using a soft voting strategy to classify three common brain tumor types: glioma, meningioma, and pituitary adenoma. The models were trained and evaluated on the Figshare dataset using a stratified 5-fold cross-validation protocol. To enhance transparency and clinical trust, the framework integrates an Explainable AI (XAI) module employing Grad-CAM++ for class-specific saliency visualization, alongside a symbolic Clinical Decision Rule Overlay (CDRO) that maps predictions to established radiological heuristics. The ensemble classifier achieved superior performance compared to individual CNNs, with an accuracy of 91.7%, precision of 91.9%, recall of 91.7%, and F1-score of 91.6%. Grad-CAM++ visualizations revealed strong spatial alignment between model attention and expert-annotated tumor regions, supported by Dice coefficients up to 0.88 and IoU scores up to 0.78. Clinical rule activation further validated model predictions in cases with distinct morphological features. A human-centered interpretability assessment involving five board-certified radiologists yielded high Likert-scale scores for both explanation usefulness (mean = 4.4) and heatmap-region correspondence (mean = 4.0), reinforcing the framework’s clinical relevance. Overall, the proposed approach offers a robust, interpretable, and generalizable solution for automated brain tumor classification, advancing the integration of deep learning into clinical neurodiagnostics.
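Two quantities in this abstract are easy to pin down in code: soft voting, which averages the class probabilities of the two networks before taking the argmax, and the Dice and IoU overlap scores used to compare attention regions with expert annotations. The sketch below is a minimal illustration with equal model weights and flattened binary masks. The equal weighting, the step of thresholding a heatmap into a mask, and the function names are assumptions, not details taken from the paper.

```python
def soft_vote(probs_a, probs_b):
    """Average two models' class probabilities and pick the top class.

    Returns (predicted_class_index, averaged_probabilities).
    """
    avg = [(a + b) / 2.0 for a, b in zip(probs_a, probs_b)]
    label = max(range(len(avg)), key=avg.__getitem__)
    return label, avg


def dice_and_iou(mask_a, mask_b):
    """Dice coefficient and IoU for two flattened binary masks."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    size_a = sum(1 for a in mask_a if a)
    size_b = sum(1 for b in mask_b if b)
    union = size_a + size_b - inter
    if union == 0:
        return 1.0, 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * inter / (size_a + size_b), inter / union


if __name__ == "__main__":
    # Class order assumed for the example: glioma, meningioma, pituitary.
    label, avg = soft_vote([0.6, 0.3, 0.1], [0.2, 0.7, 0.1])
    print(label, avg)
    print(dice_and_iou([1, 1, 0, 0], [1, 0, 0, 0]))
```

In the example the first model alone would predict class 0, but the averaged probabilities favour class 1, which is the kind of disagreement soft voting is meant to resolve.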

[224] Spatio-Temporal Conditional Diffusion Models for Forecasting Future Multiple Sclerosis Lesion Masks Conditioned on Treatments eess.IV | cs.CVPDF

Gian Mario Favero, Ge Ya Luo, Nima Fathi, Justin Szeto, Douglas L. Arnold

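TL;DR: The paper proposes an MS lesion forecasting method based on a spatio-temporal conditional diffusion model that combines multi-modal data with treatment information to generate future MS lesion masks.

Details

Motivation: The heterogeneous progression of MS lesions makes disease course prediction challenging, and existing methods struggle to incorporate treatment information. This paper aims to fill that gap.

Result: The model accurately predicts future NET2 lesion masks across six different treatments and performs well on downstream tasks such as lesion counting, location estimation, and classification.

Insight: Causal image generation models could become powerful tools for data-driven prognosis in MS and open new avenues for personalized medicine.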

Abstract: Image-based personalized medicine has the potential to transform healthcare, particularly for diseases that exhibit heterogeneous progression such as Multiple Sclerosis (MS). In this work, we introduce the first treatment-aware spatio-temporal diffusion model that is able to generate future masks demonstrating lesion evolution in MS. Our voxel-space approach incorporates multi-modal patient data, including MRI and treatment information, to forecast new and enlarging T2 (NET2) lesion masks at a future time point. Extensive experiments on a multi-centre dataset of 2131 patient 3D MRIs from randomized clinical trials for relapsing-remitting MS demonstrate that our generative model is able to accurately predict NET2 lesion masks for patients across six different treatments. Moreover, we demonstrate our model has the potential for real-world clinical applications through downstream tasks such as future lesion count and location estimation, binary lesion activity classification, and generating counterfactual future NET2 masks for several treatments with different efficacies. This work highlights the potential of causal, image-based generative models as powerful tools for advancing data-driven prognostics in MS.

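[225] Trustworthy Medical Imaging with Large Language Models: A Study of Hallucinations Across Modalities eess.IV | cs.AI | cs.CV

Anindya Bijoy Das, Shahnewaz Karim Sakib, Shibbir Ahmed

TL;DR: The study examines hallucinations produced by large language models (LLMs) in medical imaging tasks, analyzing errors in both the image-to-text and text-to-image directions and discussing the factors behind them with a view to improving clinical reliability.

Details

Motivation: As LLMs are used more widely in medical imaging, their hallucinations (confident but incorrect outputs) can mislead clinical decisions, so these errors need systematic study to improve model trustworthiness.

Result: Hallucinations appear in both task directions, including factual inconsistencies and anatomical inaccuracies, and these errors can undermine clinical reliability.

Insight: The systematic study offers insights for improving the safety and trustworthiness of LLMs in medical imaging and underscores the need to improve model training and architecture.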

Abstract: Large Language Models (LLMs) are increasingly applied to medical imaging tasks, including image interpretation and synthetic image generation. However, these models often produce hallucinations, which are confident but incorrect outputs that can mislead clinical decisions. This study examines hallucinations in two directions: image-to-text, where LLMs generate reports from X-ray, CT, or MRI scans, and text-to-image, where models create medical images from clinical prompts. We analyze errors such as factual inconsistencies and anatomical inaccuracies, evaluating outputs using expert-informed criteria across imaging modalities. Our findings reveal common patterns of hallucination in both interpretive and generative tasks, with implications for clinical reliability. We also discuss factors contributing to these failures, including model architecture and training data. By systematically studying both image understanding and generation, this work provides insights into improving the safety and trustworthiness of LLM-driven medical imaging systems.

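[226] 3DGS-VBench: A Comprehensive Video Quality Evaluation Benchmark for 3DGS Compression eess.IV | cs.CV

Yuke Xing, William Gordon, Qi Yang, Kaifa Yang, Jiarui Wang

TL;DR: The paper presents 3DGS-VBench, a video quality assessment benchmark for 3D Gaussian Splatting (3DGS) compression, containing 660 compressed models and video sequences, and evaluates six compression algorithms using subjective scores (MOS) and objective metrics.

Details

Motivation: High storage requirements limit the practical use of 3DGS, and existing compression methods introduce unique distortions that lack systematic quality assessment research.

Result: Dataset reliability is validated, and the benchmark compares compression algorithms on storage efficiency and visual quality.

Insight: The benchmark provides important tooling and data for further research on 3DGS compression and video quality assessment.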

Abstract: 3D Gaussian Splatting (3DGS) enables real-time novel view synthesis with high visual fidelity, but its substantial storage requirements hinder practical deployment, prompting state-of-the-art (SOTA) 3DGS methods to incorporate compression modules. However, these 3DGS generative compression techniques introduce unique distortions lacking systematic quality assessment research. To this end, we establish 3DGS-VBench, a large-scale Video Quality Assessment (VQA) Dataset and Benchmark with 660 compressed 3DGS models and video sequences generated from 11 scenes across 6 SOTA 3DGS compression algorithms with systematically designed parameter levels. With annotations from 50 participants, we obtained MOS scores with outlier removal and validated dataset reliability. We benchmark 6 3DGS compression algorithms on storage efficiency and visual quality, and evaluate 15 quality assessment metrics across multiple paradigms. Our work enables specialized VQA model training for 3DGS, serving as a catalyst for compression and quality assessment research. The dataset is available at https://github.com/YukeXing/3DGS-VBench.
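
The abstract says MOS scores were obtained "with outlier removal" but does not name the screening rule. As a rough sketch only, the snippet below assumes a simple per-video z-score rule: ratings more than `z_thresh` population standard deviations from the mean are dropped before averaging. The function name, the threshold, and the ratings are all hypothetical; standardized procedures typically screen whole participants rather than individual ratings.

```python
from statistics import mean, pstdev


def mos_with_outlier_removal(ratings, z_thresh=2.0):
    """Mean opinion score for ONE video after dropping ratings far from the mean.

    Assumed rule (not from the paper): keep ratings within z_thresh population
    standard deviations of the raw mean.
    """
    mu = mean(ratings)
    sigma = pstdev(ratings)
    if sigma == 0:
        return mu, list(ratings)
    kept = [r for r in ratings if abs(r - mu) <= z_thresh * sigma]
    return mean(kept), kept


ratings = [4, 4, 5, 4, 5, 4, 4, 1]  # made-up 5-point ratings for one clip
mos, kept = mos_with_outlier_removal(ratings)
print(round(mos, 3), kept)
```

Here the single rating of 1 lies about 2.5 standard deviations below the raw mean and is discarded, so the MOS is computed from the remaining seven ratings.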

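[227] SAGCNet: Spatial-Aware Graph Completion Network for Missing Slice Imputation in Population CMR Imaging eess.IV | cs.CV

Junkai Liu, Nay Aung, Theodoros N. Arvanitis, Stefan K. Piechnik, Joao A C Lima

TL;DR: SAGCNet is a spatial-aware graph completion network that imputes missing slices in cardiac magnetic resonance (CMR) imaging by combining inter-slice relationships with 3D spatial information, outperforming existing methods.

Details

Motivation: In clinical MRI practice, missing or unusable slices reduce diagnostic accuracy. Existing MRI synthesis methods fall short in modeling inter-slice dependencies and exploiting 3D spatial information, so a more effective solution is needed.

Result: SAGCNet outperforms existing methods on missing-slice imputation in both quantitative and qualitative evaluations and maintains high performance with limited data.

Insight: Combining graph-structured modeling with 3D spatial information recovers missing MRI slices more effectively, giving clinical diagnosis more complete data.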

Abstract: Magnetic resonance imaging (MRI) provides detailed soft-tissue characteristics that assist in disease diagnosis and screening. However, the accuracy of clinical practice is often hindered by missing or unusable slices due to various factors. Volumetric MRI synthesis methods have been developed to address this issue by imputing missing slices from available ones. The inherent 3D nature of volumetric MRI data, such as cardiac magnetic resonance (CMR), poses significant challenges for missing slice imputation approaches, including (1) the difficulty of modeling local inter-slice correlations and dependencies of volumetric slices, and (2) the limited exploration of crucial 3D spatial information and global context. In this study, to mitigate these issues, we present Spatial-Aware Graph Completion Network (SAGCNet) to overcome the dependency on complete volumetric data, featuring two main innovations: (1) a volumetric slice graph completion module that incorporates the inter-slice relationships into a graph structure, and (2) a volumetric spatial adapter component that enables our model to effectively capture and utilize various forms of 3D spatial context. Extensive experiments on cardiac MRI datasets demonstrate that SAGCNet is capable of synthesizing absent CMR slices, outperforming competitive state-of-the-art MRI synthesis methods both quantitatively and qualitatively. Notably, our model maintains superior performance even with limited slice data.

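[228] Large-scale Multi-sequence Pretraining for Generalizable MRI Analysis in Versatile Clinical Applications eess.IV | cs.AI | cs.CV

Zelin Qiu, Xi Wang, Zhuoyao Xie, Juan Zhou, Yu Wang

TL;DR: The paper presents PRISM, a foundation model pretrained on large-scale multi-sequence MRI, to address the poor generalization of deep learning models caused by heterogeneity among MRI sequences.

Details

Motivation: Heterogeneity across MRI sequences limits how well deep learning models generalize in clinical applications, restricting their clinical utility.

Result: PRISM outperforms non-pretrained models and other foundation models across 44 downstream tasks, ranking first on 39 of them.

Insight: PRISM shows that robust, generalizable representations can be learned across diverse MRI protocols, providing a scalable framework for clinical use of AI in radiology.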

Abstract: Multi-sequence Magnetic Resonance Imaging (MRI) offers remarkable versatility, enabling the distinct visualization of different tissue types. Nevertheless, the inherent heterogeneity among MRI sequences poses significant challenges to the generalization capability of deep learning models. These challenges undermine model performance when faced with varying acquisition parameters, thereby severely restricting their clinical utility. In this study, we present PRISM, a foundation model PRe-trained with large-scale multI-Sequence MRI. We collected a total of 64 datasets from both public and private sources, encompassing a wide range of whole-body anatomical structures, with scans spanning diverse MRI sequences. Among them, 336,476 volumetric MRI scans from 34 datasets (8 public and 26 private) were curated to construct the largest multi-organ multi-sequence MRI pretraining corpus to date. We propose a novel pretraining paradigm that disentangles anatomically invariant features from sequence-specific variations in MRI, while preserving high-level semantic representations. We established a benchmark comprising 44 downstream tasks, including disease diagnosis, image segmentation, registration, progression prediction, and report generation. These tasks were evaluated on 32 public datasets and 5 private cohorts. PRISM consistently outperformed both non-pretrained models and existing foundation models, achieving first-rank results in 39 out of 44 downstream benchmarks with statistically significant improvements. These results underscore its ability to learn robust and generalizable representations across unseen data acquired under diverse MRI protocols. PRISM provides a scalable framework for multi-sequence MRI analysis, thereby enhancing the translational potential of AI in radiology. It delivers consistent performance across diverse imaging protocols, reinforcing its clinical applicability.

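[229] HaDM-ST: Histology-Assisted Differential Modeling for Spatial Transcriptomics Generation eess.IV | cs.CV | q-bio.QM | 92C40, 68T07 | I.2.10; I.4.8

Xuepeng Liu, Zheng Jiang, Pinan Zhu, Hanyu Liu, Chao Li

TL;DR: HaDM-ST is a new framework that uses H&E-stained histology images to raise the resolution of spatial transcriptomics, addressing challenges in feature extraction, spatial alignment, and gene-specific modeling.

Details

Motivation: Current spatial transcriptomics (ST) platforms have limited resolution, and methods that incorporate H&E images face three challenges (feature extraction, alignment, and gene-specific modeling), so new methods are needed.

Result: In experiments on 200 genes across diverse tissues and species, HaDM-ST outperforms existing methods, with higher resolution and more accurate gene expression predictions.

Insight: Combining semantic information from H&E images with ST data effectively improves spatial resolution, and gene-specific modeling is the key innovation.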

Abstract: Spatial transcriptomics (ST) reveals spatial heterogeneity of gene expression, yet its resolution is limited by current platforms. Recent methods enhance resolution via H&E-stained histology, but three major challenges persist: (1) isolating expression-relevant features from visually complex H&E images; (2) achieving spatially precise multimodal alignment in diffusion-based frameworks; and (3) modeling gene-specific variation across expression channels. We propose HaDM-ST (Histology-assisted Differential Modeling for ST Generation), a high-resolution ST generation framework conditioned on H&E images and low-resolution ST. HaDM-ST includes: (i) a semantic distillation network to extract predictive cues from H&E; (ii) a spatial alignment module enforcing pixel-wise correspondence with low-resolution ST; and (iii) a channel-aware adversarial learner for fine-grained gene-level modeling. Experiments on 200 genes across diverse tissues and species show HaDM-ST consistently outperforms prior methods, enhancing spatial fidelity and gene-level coherence in high-resolution ST predictions.

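[230] DiffVC-OSD: One-Step Diffusion-based Perceptual Neural Video Compression Framework eess.IV | cs.CV

Wenzhuo Ma, Zhenzhong Chen

TL;DR: DiffVC-OSD is a one-step diffusion-based neural video compression framework that improves perceptual quality with a single diffusion step and outperforms its multi-step counterpart.

Details

Motivation: Conventional multi-step diffusion methods are computationally expensive and inefficient for video compression; DiffVC-OSD aims to improve efficiency and quality with a single diffusion step.

Result: Experiments show DiffVC-OSD reaches SOTA perceptual compression performance, with about 20x faster decoding and an 86.92% bitrate reduction relative to the corresponding multi-step diffusion-based variant.

Insight: One-step diffusion can improve efficiency in video compression while preserving quality, and may further simplify the use of diffusion models in the future.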

Abstract: In this work, we first propose DiffVC-OSD, a One-Step Diffusion-based Perceptual Neural Video Compression framework. Unlike conventional multi-step diffusion-based methods, DiffVC-OSD feeds the reconstructed latent representation directly into a One-Step Diffusion Model, enhancing perceptual quality through a single diffusion step guided by both temporal context and the latent itself. To better leverage temporal dependencies, we design a Temporal Context Adapter that encodes conditional inputs into multi-level features, offering more fine-grained guidance for the Denoising Unet. Additionally, we employ an End-to-End Finetuning strategy to improve overall compression performance. Extensive experiments demonstrate that DiffVC-OSD achieves state-of-the-art perceptual compression performance, offering about 20$\times$ faster decoding and an 86.92% bitrate reduction compared to the corresponding multi-step diffusion-based variant.

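[231] Anatomy-Aware Low-Dose CT Denoising via Pretrained Vision Models and Semantic-Guided Contrastive Learning eess.IV | cs.CV

Runze Wang, Zeli Chen, Zhiyun Song, Wei Fang, Jiajin Zhang

TL;DR: ALDEN combines pretrained vision models with semantic-guided contrastive learning in an anatomy-aware low-dose CT denoising method that markedly improves denoising quality and anatomical structure preservation.

Details

Motivation: Existing low-dose CT denoising methods ignore the anatomical semantics of human tissues, which can lead to suboptimal results, so a method that preserves anatomical structure is needed.

Result: ALDEN achieves SOTA performance on two low-dose CT denoising datasets, reduces over-smoothing, and its preservation of anatomical structure is validated on a multi-organ segmentation task.

Insight: Introducing anatomical semantics is key to low-dose CT denoising; contrastive learning preserves tissue-specific patterns while suppressing noise, and pretrained models supply rich prior knowledge.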

Abstract: To reduce radiation exposure and improve the diagnostic efficacy of low-dose computed tomography (LDCT), numerous deep learning-based denoising methods have been developed to mitigate noise and artifacts. However, most of these approaches ignore the anatomical semantics of human tissues, which may result in suboptimal denoising outcomes. To address this problem, we propose ALDEN, an anatomy-aware LDCT denoising method that integrates semantic features of pretrained vision models (PVMs) with adversarial and contrastive learning. Specifically, we introduce an anatomy-aware discriminator that dynamically fuses hierarchical semantic features from reference normal-dose CT (NDCT) via cross-attention mechanisms, enabling tissue-specific realism evaluation in the discriminator. In addition, we propose a semantic-guided contrastive learning module that enforces anatomical consistency by contrasting PVM-derived features from LDCT, denoised CT and NDCT, preserving tissue-specific patterns through positive pairs and suppressing artifacts via dual negative pairs. Extensive experiments conducted on two LDCT denoising datasets reveal that ALDEN achieves state-of-the-art performance, offering superior anatomy preservation and substantially reducing the over-smoothing issue of previous work. Further validation on a downstream multi-organ segmentation task (encompassing 117 anatomical structures) affirms the model’s ability to maintain anatomical awareness.

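[232] PCA-Guided Autoencoding for Structured Dimensionality Reduction in Active Infrared Thermography eess.IV | cs.AI | cs.CV | cs.LG

Mohammed Salah, Numan Saeed, Davor Svetinovic, Stefano Sfarra, Mohammed Omar

TL;DR: The paper proposes a PCA-guided autoencoding framework for structured dimensionality reduction that captures non-linear features in infrared thermography signals and enforces a structured latent space through a new PCA distillation loss. Experiments show it outperforms existing methods on dimensionality reduction and defect characterization.

Details

Motivation: Infrared thermography data are high-dimensional and the latent spaces of existing autoencoders lack structure; the paper proposes structured dimensionality reduction to improve downstream defect characterization.

Result: On PVC, CFRP, and PLA samples, the method outperforms existing dimensionality reduction methods in contrast, signal-to-noise ratio, and neural network-based metrics.

Insight: Combining the structure of PCA with the non-linear capacity of autoencoders substantially improves dimensionality reduction of infrared thermography data and improves defect characterization.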

Abstract: Active Infrared thermography (AIRT) is a widely adopted non-destructive testing (NDT) technique for detecting subsurface anomalies in industrial components. Due to the high dimensionality of AIRT data, current approaches employ non-linear autoencoders (AEs) for dimensionality reduction. However, the latent space learned by AIRT AEs lacks structure, limiting their effectiveness in downstream defect characterization tasks. To address this limitation, this paper proposes a principal component analysis guided (PCA-guided) autoencoding framework for structured dimensionality reduction to capture intricate, non-linear features in thermographic signals while enforcing a structured latent space. A novel loss function, PCA distillation loss, is introduced to guide AIRT AEs to align the latent representation with structured PCA components while capturing the intricate, non-linear patterns in thermographic signals. To evaluate the utility of the learned, structured latent space, we propose a neural network-based evaluation metric that assesses its suitability for defect characterization. Experimental results show that the proposed PCA-guided AE outperforms state-of-the-art dimensionality reduction methods on PVC, CFRP, and PLA samples in terms of contrast, signal-to-noise ratio (SNR), and neural network-based metrics.
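
The abstract does not give the form of the PCA distillation loss. One plausible reading, shown here purely as an assumption, is a reconstruction term plus a weighted penalty that pulls the autoencoder's latent code toward the PCA projection of the same input. The names `pca_guided_loss`, `pca_scores`, and `weight` are hypothetical, and the PCA scores are taken as precomputed.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pca_guided_loss(x, x_hat, z, pca_scores, weight=0.5):
    """Assumed objective: reconstruction error + weight * latent-to-PCA alignment.

    x          : input thermographic signal
    x_hat      : autoencoder reconstruction of x
    z          : autoencoder latent code for x
    pca_scores : projection of x onto the leading PCA components (precomputed)
    """
    reconstruction = mse(x, x_hat)
    distillation = mse(z, pca_scores)
    return reconstruction + weight * distillation


# Toy numbers only, to show how the two terms combine.
loss = pca_guided_loss(
    x=[1.0, 2.0, 3.0, 4.0],
    x_hat=[1.0, 2.0, 3.0, 2.0],
    z=[0.5, -0.5],
    pca_scores=[0.5, 0.5],
)
print(loss)
```

With `weight` set to zero this reduces to an ordinary autoencoder objective, so the weight controls how strongly the latent space is tied to the PCA structure.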

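[233] MIND: A Noise-Adaptive Denoising Framework for Medical Images Integrating Multi-Scale Transformer eess.IV | cs.AI | cs.CV | cs.LG | cs.MM

Tao Tang, Chengxu Yang

TL;DR: The paper proposes an adaptive medical image denoising model (MI-ND) that combines multi-scale convolution with a Transformer architecture; a noise level estimator and a noise-adaptive attention module enable noise-aware channel-spatial attention regulation and cross-modal feature fusion.

Details

Motivation: Because medical images are central to diagnosis, their quality directly affects the accuracy of clinical judgment, yet low-dose scanning, equipment limitations, and imaging artifacts often introduce non-uniform noise that impairs structure recognition and lesion detection.

Result: On public multimodal datasets the model significantly outperforms the compared methods, improving image quality metrics (PSNR, SSIM, LPIPS) and raising the F1 score and ROC-AUC on downstream diagnostic tasks.

Insight: The model stands out in structural recovery, diagnostic sensitivity, and cross-modal robustness, offering an effective solution for medical image enhancement and AI-assisted diagnosis and treatment.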

Abstract: The core role of medical images in disease diagnosis makes their quality directly affect the accuracy of clinical judgment. However, due to factors such as low-dose scanning, equipment limitations and imaging artifacts, medical images are often accompanied by non-uniform noise interference, which seriously affects structure recognition and lesion detection. This paper proposes a medical image adaptive denoising model (MI-ND) that integrates multi-scale convolutional and Transformer architectures, introduces a noise level estimator (NLE) and a noise adaptive attention module (NAAB), and realizes channel-spatial attention regulation and cross-modal feature fusion driven by noise perception. Systematic testing is carried out on multimodal public datasets. Experiments show that this method significantly outperforms the comparative methods in image quality indicators such as PSNR, SSIM, and LPIPS, and improves the F1 score and ROC-AUC in downstream diagnostic tasks, showing strong practical value and promotional potential. The model has outstanding benefits in structural recovery, diagnostic sensitivity, and cross-modal robustness, and provides an effective solution for medical image enhancement and AI-assisted diagnosis and treatment.
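
Of the image quality indicators listed, PSNR has a simple closed form: 10·log10(MAX² / MSE), where MAX is the largest possible pixel value. The sketch below applies it to flat pixel lists; the 8-bit range (`max_value=255`) and the sample pixels are assumptions for illustration, not values from the paper.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


clean = [100, 150, 200, 250]     # made-up reference pixels
denoised = [110, 140, 210, 240]  # made-up output, off by 10 everywhere
print(round(psnr(clean, denoised), 2))
```

Every pixel here is off by 10, so the MSE is 100 and the PSNR equals 20·log10(255 / 10), roughly 28.13 dB. Higher is better, and halving the RMS error adds about 6 dB.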

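[234] Learned Regularization for Microwave Tomography eess.IV | cs.CV

Bowen Tong, Hao Chen, Shaorui Guo, Dong Liu

TL;DR: The paper proposes SSD-Reg, a physics-informed hybrid framework that embeds a diffusion model as a learned regularizer in variational reconstruction, addressing the nonlinear, ill-posed inverse problem in microwave imaging and recovering complex structures without paired data.

Details

Motivation: The inverse problem in microwave tomography (MWT) is highly nonlinear and ill-posed. Conventional optimization methods struggle to recover fine structure, while deep learning models typically need large paired datasets and generalize poorly. SSD-Reg combines physical models with diffusion priors for a more flexible and efficient solution.

Result: Experiments show SSD-Reg improves accuracy, stability, and robustness and effectively addresses the ill-posedness of functional image reconstruction.

Insight: By fusing physical models with generative priors, SSD-Reg offers a new approach to complex inverse problems, with potential applications in medical imaging and other fields.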

Abstract: Microwave Tomography (MWT) aims to reconstruct the dielectric properties of tissues from measured scattered electromagnetic fields. This inverse problem is highly nonlinear and ill-posed, posing significant challenges for conventional optimization-based methods, which, despite being grounded in physical models, often fail to recover fine structural details. Recent deep learning strategies, including end-to-end and post-processing networks, have improved reconstruction quality but typically require large paired training datasets and may struggle to generalize. To overcome these limitations, we propose a physics-informed hybrid framework that integrates diffusion models as learned regularization within a data-consistency-driven variational scheme. Specifically, we introduce Single-Step Diffusion Regularization (SSD-Reg), a novel approach that embeds diffusion priors into the iterative reconstruction process, enabling the recovery of complex anatomical structures without the need for paired data. SSD-Reg maintains fidelity to both the governing physics and learned structural distributions, improving accuracy, stability, and robustness. Extensive experiments demonstrate that SSD-Reg, implemented as a Plug-and-Play (PnP) module, provides a flexible and effective solution for tackling the ill-posedness inherent in functional image reconstruction.

cs.SD [Back]

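[235] Joint Transcription of Acoustic Guitar Strumming Directions and Chords cs.SD | cs.CL | eess.AS

Sebastian Murgul, Johannes Schimper, Michael Heizmann

TL;DR: The paper proposes a deep learning model that jointly transcribes guitar strumming directions and chords, using synthetic and real data to improve results.

Details

Motivation: Automatic transcription of guitar strumming is a challenging but under-studied problem in Music Information Retrieval (MIR), especially extracting strumming directions and chord progressions together. Existing methods are limited by small datasets.

Result: A hybrid approach combining synthetic and real data achieves the highest accuracy on strumming action detection and chord classification, significantly outperforming baseline algorithms.

Insight: Deep learning has strong potential for guitar strumming transcription and opens new avenues for automatic rhythm guitar analysis.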

Abstract: Automatic transcription of guitar strumming is an underrepresented and challenging task in Music Information Retrieval (MIR), particularly for extracting both strumming directions and chord progressions from audio signals. While existing methods show promise, their effectiveness is often hindered by limited datasets. In this work, we extend a multimodal approach to guitar strumming transcription by introducing a novel dataset and a deep learning-based transcription model. We collect 90 min of real-world guitar recordings using an ESP32 smartwatch motion sensor and a structured recording protocol, complemented by a synthetic dataset of 4h of labeled strumming audio. A Convolutional Recurrent Neural Network (CRNN) model is trained to detect strumming events, classify their direction, and identify the corresponding chords using only microphone audio. Our evaluation demonstrates significant improvements over baseline onset detection algorithms, with a hybrid method combining synthetic and real-world data achieving the highest accuracy for both strumming action detection and chord classification. These results highlight the potential of deep learning for robust guitar strumming transcription and open new avenues for automatic rhythm guitar analysis.

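[236] Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning cs.SD | cs.CL | cs.MM | eess.AS

Shu Wu, Chenxing Li, Wenfu Wang, Hao Zhang, Hualei Wang

TL;DR: The paper proposes Audio-Thinker, which uses reinforcement learning to improve the reasoning of large audio language models (LALMs), introducing an adaptive reward mechanism to improve adaptability, consistency, and effectiveness.

Details

Motivation: Explicit reasoning in existing LALMs brings limited benefit for audio question answering, and these models still fall well short of human auditory-language reasoning.

Result: Audio-Thinker outperforms existing reasoning-oriented LALMs on multiple benchmark tasks, showing stronger reasoning and generalization.

Insight: An adaptive reward mechanism and an external reward model can effectively improve LALM reasoning and performance on complex tasks.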

Abstract: Recent advancements in large language models, multimodal large language models, and large audio language models (LALMs) have significantly improved their reasoning capabilities through reinforcement learning with rule-based rewards. However, the explicit reasoning process has yet to show significant benefits for audio question answering, and effectively leveraging deep reasoning remains an open challenge, with LALMs still falling short of human-level auditory-language reasoning. To address these limitations, we propose Audio-Thinker, a reinforcement learning framework designed to enhance the reasoning capabilities of LALMs, with a focus on improving adaptability, consistency, and effectiveness. Our approach introduces an adaptive think accuracy reward, enabling the model to dynamically adjust its reasoning strategies based on task complexity. Furthermore, we incorporate an external reward model to evaluate the overall consistency and quality of the reasoning process, complemented by think-based rewards that help the model distinguish between valid and flawed reasoning paths during training. Experimental results demonstrate that our Audio-Thinker model outperforms existing reasoning-oriented LALMs across various benchmark tasks, exhibiting superior reasoning and generalization capabilities.

cs.AI [Back]

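[237] DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery cs.AI | cs.CL

Keyu Li, Mohan Jiang, Dayuan Fu, Yunze Wu, Xiangkun Hu

TL;DR: The paper presents DatasetResearch, a benchmark for evaluating AI agent systems on demand-driven dataset discovery, revealing the limitations of current systems and pointing to directions for improvement.

Details

Motivation: With the rapid progress of large language models, the bottleneck in AI development has shifted from compute to data availability. Many valuable datasets are hidden in specialized repositories, research appendices, and domain platforms. Whether AI agents can go beyond conventional search to achieve autonomous, demand-driven data discovery has become a key question.

Result: Even advanced deep research systems score only 22% on the most challenging subset, DatasetResearch-pro, showing a large gap between current systems and perfect dataset discovery.

Insight: 1. Search agents excel at knowledge tasks, while synthesis agents do better on reasoning tasks. 2. Both perform very poorly on "corner cases" outside existing distributions, indicating the need for more comprehensive dataset discovery methods.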

Abstract: The rapid advancement of large language models has fundamentally shifted the bottleneck in AI development from computational power to data availability, with countless valuable datasets remaining hidden across specialized repositories, research appendices, and domain platforms. As reasoning capabilities and deep research methodologies continue to evolve, a critical question emerges: can AI agents transcend conventional search to systematically discover any dataset that meets specific user requirements, enabling truly autonomous demand-driven data curation? We introduce DatasetResearch, the first comprehensive benchmark evaluating AI agents’ ability to discover and synthesize datasets from 208 real-world demands across knowledge-intensive and reasoning-intensive tasks. Our tri-dimensional evaluation framework reveals a stark reality: even advanced deep research systems score only 22% on our challenging DatasetResearch-pro subset, exposing the vast gap between current capabilities and perfect dataset discovery. Our analysis uncovers a fundamental dichotomy: search agents excel at knowledge tasks through retrieval breadth, while synthesis agents dominate reasoning challenges via structured generation, yet both catastrophically fail on “corner cases” outside existing distributions. These findings establish the first rigorous baseline for dataset discovery agents and illuminate the path toward AI systems capable of finding any dataset in the digital universe. Our benchmark and comprehensive analysis provide the foundation for the next generation of self-improving AI systems and are publicly available at https://github.com/GAIR-NLP/DatasetResearch.

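[238] MultiMedEdit: A Scenario-Aware Benchmark for Evaluating Knowledge Editing in Medical VQA cs.AI | cs.CL | cs.LG | cs.MM

Shengtao Wen, Haodong Chen, Yadong Wang, Zhongying Pan, Xiang Chen

TL;DR: MultiMedEdit is the first benchmark for knowledge editing (KE) in medical multimodal scenarios; it evaluates reliability, generality, and locality with a three-dimensional metric suite and reveals the limitations of current methods in complex medical settings.

Details

Motivation: KE research has focused mainly on general domains or text-only tasks, whereas medical multimodal scenarios require integration with visual reasoning and lack a dedicated evaluation benchmark.

Result: Experiments show current methods fall short in generalization and long-tail reasoning within complex clinical workflows, and an efficiency analysis reveals trade-offs in real-world deployment.

Insight: Medical knowledge editing must be integrated with visual reasoning; existing methods perform poorly in clinical scenarios and need further improvement.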

Abstract: Knowledge editing (KE) provides a scalable approach for updating factual knowledge in large language models without full retraining. While previous studies have demonstrated effectiveness in general domains and medical QA tasks, little attention has been paid to KE in multimodal medical scenarios. Unlike text-only settings, medical KE demands integrating updated knowledge with visual reasoning to support safe and interpretable clinical decisions. To address this gap, we propose MultiMedEdit, the first benchmark tailored to evaluating KE in clinical multimodal tasks. Our framework spans both understanding and reasoning task types, defines a three-dimensional metric suite (reliability, generality, and locality), and supports cross-paradigm comparisons across general and domain-specific models. We conduct extensive experiments under single-editing and lifelong-editing settings. Results suggest that current methods struggle with generalization and long-tail reasoning, particularly in complex clinical workflows. We further present an efficiency analysis (e.g., edit latency, memory footprint), revealing practical trade-offs in real-world deployment across KE paradigms. Overall, MultiMedEdit not only reveals the limitations of current approaches but also provides a solid foundation for developing clinically robust knowledge editing techniques in the future.

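[239] EndoAgent: A Memory-Guided Reflective Agent for Intelligent Endoscopic Vision-to-Decision Reasoning cs.AI | cs.CL

Yi Tang, Kaini Wang, Yang Chen, Guangquan Zhou

TL;DR: EndoAgent is a memory-guided agent for endoscopic vision-to-decision reasoning that, through a dual-memory design and integrated expert tools, supports multi-step reasoning and collaboration in complex clinical workflows.

Details

Motivation: Existing methods based on large-scale pretraining fall short at coordinating multiple tasks and handling multi-step reasoning in complex clinical workflows. EndoAgent aims to fill the gap in applying AI agents to endoscopy.

Result: In experiments EndoAgent clearly outperforms general-purpose and medical multimodal models, showing strong flexibility and reasoning ability.

Insight: With memory mechanisms and tool integration, AI agents can better meet the multi-step reasoning demands of complex clinical tasks.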

Abstract: Developing general artificial intelligence (AI) systems to support endoscopic image diagnosis is an emerging research priority. Existing methods based on large-scale pretraining often lack unified coordination across tasks and struggle to handle the multi-step processes required in complex clinical workflows. While AI agents have shown promise in flexible instruction parsing and tool integration across domains, their potential in endoscopy remains underexplored. To address this gap, we propose EndoAgent, the first memory-guided agent for vision-to-decision endoscopic analysis that integrates iterative reasoning with adaptive tool selection and collaboration. Built on a dual-memory design, it enables sophisticated decision-making by ensuring logical coherence through short-term action tracking and progressively enhancing reasoning acuity through long-term experiential learning. To support diverse clinical tasks, EndoAgent integrates a suite of expert-designed tools within a unified reasoning loop. We further introduce EndoAgentBench, a benchmark of 5,709 visual question-answer pairs that assess visual understanding and language generation capabilities in realistic scenarios. Extensive experiments show that EndoAgent consistently outperforms both general and medical multimodal models, exhibiting its strong flexibility and reasoning capabilities.

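[240] Generative AI for Strategic Plan Development cs.AI | cs.CL | cs.LG | I.2.7; I.5.4

Jesse Ponnock

TL;DR: The paper proposes a modular model that uses generative AI (GAI) to develop strategic plans for large government organizations and, by comparing BERTopic and Non-negative Matrix Factorization (NMF) on topic modeling, shows that these techniques can generate themes closely resembling the Vision Elements of a strategic plan.

Details

Motivation: With breakthroughs in GAI and large language models (LLMs), many professional services can be automated with AI. The paper explores the potential of GAI in strategic plan development to help government organizations work efficiently.

Result: BERTopic and NMF generate themes similar to 100% of the Vision Elements evaluated. BERTopic performs better, with more than half of its correlated topics reaching "medium" or "strong" correlation.

Insight: GAI has practical potential in strategic plan generation, particularly for government organizations. Future work can examine the viability of the other modules and practical deployment.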

Abstract: Given recent breakthroughs in Generative Artificial Intelligence (GAI) and Large Language Models (LLMs), more and more professional services that once seemed impossible to automate are being augmented through Artificial Intelligence (AI). This paper presents a modular model for leveraging GAI in developing strategic plans for large-scale government organizations and evaluates leading machine learning techniques in their application towards one of the identified modules. Specifically, BERTopic and Non-negative Matrix Factorization (NMF) are evaluated on their ability to use topic modeling to generate themes representative of Vision Elements within a strategic plan. To accomplish this, BERTopic and NMF models are trained using a large volume of reports from the Government Accountability Office (GAO). The generated topics from each model are then scored for similarity against the Vision Elements of a published strategic plan and the results are compared. Our results show that these techniques are capable of generating themes similar to 100% of the elements being evaluated against. Further, we conclude that BERTopic performs best in this application with more than half of its correlated topics achieving a “medium” or “strong” correlation. A capability of GAI-enabled strategic plan development impacts a multi-billion-dollar industry and assists the federal government in overcoming regulatory requirements which are crucial to the public good. Further work will focus on the operationalization of the concept proven in this study as well as viability of the remaining modules in the proposed model for GAI-generated strategic plans.
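
The abstract reports that generated themes were similar to 100% of the Vision Elements but does not say how similarity was scored. The sketch below shows one way such a coverage figure could be computed, using bag-of-words cosine similarity as a stand-in for whatever measure the study used. The `threshold`, the function names, and the example strings are all hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Bag-of-words cosine similarity between two short texts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def coverage(topics, elements, threshold=0.3):
    """Fraction of elements matched by at least one topic at or above threshold."""
    matched = [e for e in elements if any(cosine(t, e) >= threshold for t in topics)]
    return len(matched) / len(elements)


topics = ["cybersecurity risk management", "workforce training programs"]
elements = ["improve cybersecurity risk posture", "strengthen workforce training"]
print(coverage(topics, elements))
```

A coverage of 1.0 corresponds to the "100% of the elements" phrasing: every element has at least one sufficiently similar topic, which says nothing about how many topics fail to match any element.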

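[241] A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems cs.AI | cs.CL | cs.MA

Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi

TL;DR: The paper surveys self-evolving AI agent techniques and frames them as a new paradigm that bridges the static capabilities of foundation models and the continuous adaptability of lifelong agentic systems.

Details

Motivation: Existing AI agent systems usually rely on static configurations and adapt poorly to dynamic environments, motivating research on self-evolving agents that can optimize and adapt continuously.

Result: The survey summarizes the state of self-evolving agent techniques, laying a foundation for more adaptive lifelong agentic systems.

Insight: The core of self-evolving agents is continuous optimization through a feedback loop; domain-specific strategies and ethical considerations are key to future research.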

Abstract: Recent advances in large language models have sparked growing interest in AI agents capable of solving complex, real-world tasks. However, most existing agent systems rely on manually crafted configurations that remain static after deployment, limiting their ability to adapt to dynamic and evolving environments. To this end, recent research has explored agent evolution techniques that aim to automatically enhance agent systems based on interaction data and environmental feedback. This emerging direction lays the foundation for self-evolving AI agents, which bridge the static capabilities of foundation models with the continuous adaptability required by lifelong agentic systems. In this survey, we provide a comprehensive review of existing techniques for self-evolving agentic systems. Specifically, we first introduce a unified conceptual framework that abstracts the feedback loop underlying the design of self-evolving agentic systems. The framework highlights four key components: System Inputs, Agent System, Environment, and Optimisers, serving as a foundation for understanding and comparing different strategies. Based on this framework, we systematically review a wide range of self-evolving techniques that target different components of the agent system. We also investigate domain-specific evolution strategies developed for specialised fields such as biomedicine, programming, and finance, where optimisation objectives are tightly coupled with domain constraints. In addition, we provide a dedicated discussion on the evaluation, safety, and ethical considerations for self-evolving agentic systems, which are critical to ensuring their effectiveness and reliability. This survey aims to provide researchers and practitioners with a systematic understanding of self-evolving AI agents, laying the foundation for the development of more adaptive, autonomous, and lifelong agentic systems.

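[242] CP-Agent: Agentic Constraint Programming cs.AI | cs.CL | cs.LG | cs.SE

Stefan Szeider

TL;DR: The paper proposes CP-Agent, a purely agentic approach to constraint programming that solves problems by writing and executing code dynamically. It solves all 101 problems in the CP-Bench benchmark set, suggesting that general coding tools combined with domain-knowledge prompts outperform specialized agent architectures.

Details

Motivation: Previous approaches to automating constraint modeling rely on fixed workflows with predetermined modeling steps and fail on a significant number of benchmark problems. The author seeks a dynamic, flexible agentic strategy that combines coding tools with domain-knowledge prompts to improve automated modeling.

Result: CP-Agent solves all 101 problems in the CP-Bench benchmark set, outperforming approaches based on fixed workflows and specialized architectures.

Insight: Constraint modeling calls for general coding tools combined with domain-knowledge prompts rather than specialized agent architectures or predefined workflows. This agentic strategy is flexible and extensible.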

Abstract: Translating natural language problem descriptions into formal constraint models remains a fundamental challenge in constraint programming, requiring deep expertise in both the problem domain and modeling frameworks. Previous approaches to automating this translation have employed fixed workflows with predetermined modeling steps, failing on a significant number of benchmark problems. We present a new approach using a pure agentic strategy without any fixed pipeline. We developed a general-purpose Python coding agent based on the ReAct (Reason and Act) principle, utilizing a persistent IPython kernel for stateful code execution and iterative development. Rather than embedding constraint programming logic into the agent architecture, domain-specific expertise is injected solely through a carefully crafted project prompt. The agent combines this prompt-encoded knowledge with access to file operations and code execution tools, enabling it to test hypotheses, debug failures, and verify solutions dynamically. Implemented in just a few hundred lines of code, this architecture successfully solves all 101 problems of the CP-Bench constraint programming benchmark set. The results suggest that constraint modeling tasks require the combination of general coding tools and domain expertise encoded in prompts, rather than specialized agent architectures or predefined workflows.
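Note: the ReAct-style loop over a persistent kernel described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `PersistentKernel`, `react_loop`, and `scripted_policy` are hypothetical names, a plain `exec` namespace stands in for the IPython kernel, and a scripted function stands in for the LLM.

```python
import contextlib
import io


class PersistentKernel:
    """Toy stand-in for a persistent IPython kernel (assumption):
    one namespace is shared across calls, so state survives between steps."""

    def __init__(self):
        self.namespace = {}

    def run(self, code: str) -> str:
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception as exc:
            # Errors are returned as observations so the agent can debug them.
            buffer.write(f"{type(exc).__name__}: {exc}")
        return buffer.getvalue().strip()


def react_loop(policy, kernel, max_steps=5):
    """Alternate between choosing an action (code) and observing its output."""
    history = []
    for _ in range(max_steps):
        action = policy(history)
        if action is None:  # the policy decides it is done
            break
        observation = kernel.run(action)
        history.append((action, observation))
    return history


def scripted_policy(history):
    """Hypothetical stand-in for the LLM: replays two fixed steps."""
    steps = [
        "xs = [x for x in range(10) if x % 3 == 0]",  # define state
        "print(sum(xs))",                             # reuse it later
    ]
    return steps[len(history)] if len(history) < len(steps) else None


history = react_loop(scripted_policy, PersistentKernel())
```

The second step only works because `xs` persists in the shared namespace, which is the property the abstract attributes to stateful code execution.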

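[243] Democratizing Diplomacy: A Harness for Evaluating Any Large Language Model on Full-Press Diplomacy cs.AI | cs.CL | cs.CY | cs.LG

Alexander Duffy, Samuel J Paech, Ishana Shastri, Elizabeth Karpinski, Baptiste Alloui-Cros

TL;DR: The paper presents the first evaluation harness that lets any large language model (LLM) play full-press Diplomacy without fine-tuning or specialized training, overcoming the complexity and high information density of the game state.

Details

Motivation: Because of the complexity and information density of Diplomacy's game state, previous work required frontier LLMs or fine-tuned models, which limited how widely and flexibly the game could be studied.

Result: Across many popular LLMs, larger models perform best, but smaller models still play adequately. The experiments also offer insight into how strategic reasoning capabilities emerge naturally in widely used LLMs.

Insight: LLMs show potential on complex strategic tasks, and the harness gives future research a standardized evaluation tool that needs no fine-tuning.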

Abstract: We present the first evaluation harness that enables any out-of-the-box, local, Large Language Models (LLMs) to play full-press Diplomacy without fine-tuning or specialized training. Previous work required frontier LLMs, or fine-tuning, due to the high complexity and information density of Diplomacy’s game state. Combined with the high variance of matches, these factors made Diplomacy prohibitive for study. In this work, we used data-driven iteration to optimize a textual game state representation such that a 24B model can reliably complete matches without any fine tuning. We develop tooling to facilitate hypothesis testing and statistical analysis, and we present case studies on persuasion, aggressive playstyles, and performance across a range of models. We conduct a variety of experiments across many popular LLMs, finding the larger models perform the best, but the smaller models still play adequately. We also introduce Critical State Analysis: an experimental protocol for rapidly iterating and analyzing key moments in a game at depth. Our harness democratizes the evaluation of strategic reasoning in LLMs by eliminating the need for fine-tuning, and it provides insights into how these capabilities emerge naturally from widely used LLMs. Our code is available in the supplement and will be open sourced.

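[244] ThinkTuning: Instilling Cognitive Reflections without Distillation cs.AI | cs.CL | cs.LG

Aswin RRV, Jacob Dineen, Divij Handa, Md Nayem Uddin, Mihir Parmar

TL;DR: ThinkTuning is a GRPO-based interactive training approach that improves a student model's reasoning through feedback from a teacher model, without distillation, and shows gains on several benchmarks.

Details

Motivation: A recent study shows that RL alone does not truly instill new reasoning abilities and mainly draws out behaviors already present in the base model. ThinkTuning aims to help models that lack such thinking behavior develop it, using feedback from a teacher model.

Result: The method improves by 3.85% on average over zero-shot baselines across benchmarks, and by 2.08%, 2.23%, and 3.99% over the vanilla-GRPO baseline on MATH-500, AIME, and GPQA-Diamond respectively.

Insight: Implicit supervision through teacher feedback can improve a model's reasoning ability while avoiding the complexity of distillation.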

Abstract: Recent advances in test-time scaling have led to the emergence of thinking LLMs that exhibit self-reflective behaviors and multi-step reasoning. While RL drives this self-improvement paradigm, a recent study (Gandhi et al., 2025) shows that RL alone does not truly instill these new reasoning abilities - it merely draws out behaviors already present in the base models. This raises a question: How can we train the models that don’t exhibit such thinking behavior to develop it in the first place? To this end, we propose ThinkTuning, a GRPO-based interactive training approach where we augment the rollouts of a student model with the guidance from a teacher model. A simple idea from classroom practice inspires our method: a teacher poses a problem, lets the student try an answer, then gives corrective feedback – enough to point the mind in the right direction and then show the solution. Each piece of feedback reshapes the student’s thoughts, leading them to arrive at the correct solution. Similarly, we find that this type of implicit supervision through feedback from a teacher model of the same size improves the reasoning capabilities of the student model. In particular, on average, our method shows a 3.85% improvement over zero-shot baselines across benchmarks, and on MATH-500, AIME and GPQA-Diamond it shows 2.08%, 2.23% and 3.99% improvements over the vanilla-GRPO baseline. Source code is available at https://github.com/3rdAT/ThinkTuning.
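Note: the abstract builds on GRPO, whose core step scores each rollout relative to the other rollouts sampled for the same prompt. The sketch below shows only that group-relative normalization, under the assumption of scalar rewards and a population standard deviation; how ThinkTuning mixes teacher guidance into the rollouts is not shown, and `group_relative_advantages` is a hypothetical name.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so rollouts
    are judged against their siblings rather than against a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt: two correct (reward 1), two incorrect (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason guidance that changes the outcome of some rollouts can matter.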

Tianyi Ma, Yue Zhang, Zehao Wang, Parisa Kordjamshidi

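TL;DR: The paper proposes SkillNav, a modular framework that improves vision-and-language navigation (VLN) through skill decomposition. Specialized agents handle atomic skills, and a VLM-based router dynamically selects among them. SkillNav sets a new state of the art on R2R and generalizes strongly to GSA-R2R.

Details

Motivation: Current VLN methods generalize poorly to unseen scenarios that require complex spatial and temporal reasoning, which calls for a more structured and interpretable approach.

Result: SkillNav achieves state-of-the-art performance on the R2R benchmark and shows strong generalization on GSA-R2R.

Insight: Modularity and skill decomposition improve the interpretability and generalization of VLN agents; the dynamic selection performed by the VLM router is the key innovation.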

Abstract: Vision-and-Language Navigation (VLN) poses significant challenges in enabling agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. We then introduce a novel zero-shot Vision-Language Model (VLM)-based router, which dynamically selects the most suitable agent at each time step by aligning sub-goals with visual observations and historical actions. SkillNav achieves a new state-of-the-art performance on the R2R benchmark and demonstrates strong generalization to the GSA-R2R benchmark that includes novel instruction styles and unseen environments.

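[246] IRL-VLA: Training an Vision-Language-Action Policy via Reward World Model cs.AI | cs.CV | cs.RO

Anqing Jiang, Yu Gao, Yiru Wang, Zhigang Sun, Shuo Wang

TL;DR: IRL-VLA is a new method for training vision-language-action (VLA) policies. By combining inverse reinforcement learning with a reward world model, it addresses the limitations of open-loop imitation learning and the dependence of closed-loop training on high-fidelity simulation.

Details

Motivation: Existing VLA architectures are mostly based on open-loop imitation learning, which limits performance, while closed-loop training relies on high-fidelity simulation and suffers from domain gaps and computational inefficiency.

Result: IRL-VLA achieves state-of-the-art performance on the NAVSIM v2 benchmark and placed 1st runner-up in the CVPR2025 Autonomous Grand Challenge.

Insight: Combining inverse reinforcement learning with a lightweight reward world model makes closed-loop training of VLA policies more efficient while balancing safety, comfort, and traffic efficiency.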

Abstract: Vision-Language-Action (VLA) models have demonstrated potential in autonomous driving. However, two critical challenges hinder their development: (1) Existing VLA architectures are typically based on imitation learning in an open-loop setup, which tends to capture the recorded behaviors in the dataset, leading to suboptimal and constrained performance; (2) Closed-loop training relies heavily on high-fidelity sensor simulation, where domain gaps and computational inefficiencies pose significant barriers. In this paper, we introduce IRL-VLA, a novel closed-loop Reinforcement Learning via Inverse Reinforcement Learning reward world model with a self-built VLA approach. Our framework proceeds in a three-stage paradigm: In the first stage, we propose a VLA architecture and pretrain the VLA policy via imitation learning. In the second stage, we construct a lightweight reward world model via inverse reinforcement learning to enable efficient closed-loop reward computation. Finally, to further enhance planning performance, we design specialized reward world model guidance reinforcement learning via PPO (Proximal Policy Optimization) to effectively balance the safety incidents, comfortable driving, and traffic efficiency. Our approach achieves state-of-the-art performance on the NAVSIM v2 end-to-end driving benchmark and was 1st runner-up in the CVPR2025 Autonomous Grand Challenge. We hope that our framework will accelerate VLA research in closed-loop autonomous driving.
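Note: the third stage relies on PPO. As a reminder of what that optimizer computes, here is the standard clipped surrogate objective for a single sample. This is the textbook PPO term only, not the paper's training code; the reward world model that would supply the advantage is not modeled, and `ppo_clipped_objective` is a hypothetical name.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for one sample.

    ratio     : new_policy_prob / old_policy_prob for the taken action
    advantage : estimated advantage of that action
    Returns min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A), which removes
    the incentive to move the policy far from the one that collected the data.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action whose probability already rose by 50%: the gain is capped.
capped_gain = ppo_clipped_objective(ratio=1.5, advantage=1.0)
# A bad action whose probability already halved: no extra credit beyond the clip.
capped_loss = ppo_clipped_objective(ratio=0.5, advantage=-1.0)
```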

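[247] CountQA: How Well Do MLLMs Count in the Wild? cs.AI | cs.CV

Jayant Sravan Tamarapalli, Rynaa Grover, Nilay Pande, Sahiti Yerramilli

TL;DR: CountQA is a new benchmark for evaluating how well multimodal large language models (MLLMs) count objects in complex real-world scenes, and it exposes their weakness at counting.

Details

Motivation: MLLMs are fluent at understanding visual scenes but show a clear deficit in a basic cognitive skill, object counting. Existing benchmarks do not adequately test this ability, which limits model reliability.

Result: The best-performing of 15 evaluated MLLMs reaches only 42.9% accuracy, and performance declines further as object counts rise.

Insight: CountQA exposes the counting weakness of MLLMs and points to directions for improvement: models need to be numerically grounded and spatially aware as well as descriptively fluent.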

Abstract: Multimodal Large Language Models (MLLMs) demonstrate remarkable fluency in understanding visual scenes, yet they exhibit a critical lack in a fundamental cognitive skill: object counting. This blind spot severely limits their reliability in real-world applications. To date, this capability has been largely unevaluated in complex scenarios, as existing benchmarks either feature sparse object densities or are confined to specific visual domains, failing to test models under realistic conditions. Addressing this gap, we introduce CountQA, a challenging new benchmark designed to probe this deficiency. Comprising over 1,500 question-answer pairs, CountQA features real-world images with high object density, clutter, and occlusion. We investigate this weakness by evaluating 15 prominent MLLMs on the CountQA benchmark and reveal that the top-performing model achieves a mere 42.9% accuracy, with performance declining as object counts rise. By providing a dedicated benchmark to diagnose and rectify this core weakness, CountQA paves the way for a new generation of MLLMs that are not only descriptively fluent but also numerically grounded and spatially aware. We will open-source the dataset and code upon paper acceptance to foster further research.
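Note: the reported trend, accuracy falling as object counts rise, is the kind of result one gets by bucketing questions by ground-truth count. The sketch below assumes exact-match scoring and arbitrary bucket edges; neither is taken from the paper, and `accuracy_by_count_bucket` and the sample records are hypothetical.

```python
from collections import defaultdict


def accuracy_by_count_bucket(records, bucket_edges=(10, 50)):
    """Exact-match counting accuracy, grouped by ground-truth count.

    records      : iterable of (true_count, predicted_count) pairs
    bucket_edges : ascending upper bounds of the buckets (assumed values)
    """
    def bucket(n):
        for edge in bucket_edges:
            if n <= edge:
                return f"<= {edge}"
        return f"> {bucket_edges[-1]}"

    hits, totals = defaultdict(int), defaultdict(int)
    for true_count, predicted in records:
        name = bucket(true_count)
        totals[name] += 1
        hits[name] += int(predicted == true_count)
    return {name: hits[name] / totals[name] for name in totals}


# Made-up predictions, purely to exercise the function.
sample = [(3, 3), (8, 7), (20, 20), (40, 35), (120, 100)]
per_bucket = accuracy_by_count_bucket(sample)
```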

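[248] MeteorPred: A Meteorological Multimodal Large Model and Dataset for Severe Weather Event Prediction cs.AI | cs.CV

Shuo Tang, Jian Xu, Jiadong Zhang, Yi Chen, Qizhao Jin

TL;DR: The paper presents a meteorological multimodal large model (MMLM) for severe weather prediction and MP-Bench, the first large-scale multimodal dataset for the task. Together they address reliance on manual interpretation and sample scarcity, and adaptive fusion modules for dynamic feature extraction let the model handle high-dimensional meteorological data.

Details

Motivation: Severe weather forecasting depends on manual expert interpretation, which is subjective and operationally burdensome. Advances in AI make automated forecasting possible, but existing models cannot handle high-dimensional meteorological data and its complex dependencies.

Result: Experiments on MP-Bench show that MMLM performs well across multiple tasks, supporting its effectiveness for severe weather prediction.

Insight: By combining multimodal data with a large model, MMLM marks a key step toward AI-driven automated weather forecasting systems.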

Abstract: Timely and accurate severe weather warnings are critical for disaster mitigation. However, current forecasting systems remain heavily reliant on manual expert interpretation, introducing subjectivity and significant operational burdens. With the rapid development of AI technologies, the end-to-end “AI weather station” is gradually emerging as a new trend in predicting severe weather events. Three core challenges impede the development of end-to-end AI severe weather system: (1) scarcity of severe weather event samples; (2) imperfect alignment between high-dimensional meteorological data and textual warnings; (3) existing multimodal language models are unable to handle high-dimensional meteorological data and struggle to fully capture the complex dependencies across temporal sequences, vertical pressure levels, and spatial dimensions. To address these challenges, we introduce MP-Bench, the first large-scale temporal multimodal dataset for severe weather events prediction, comprising 421,363 pairs of raw multi-year meteorological data and corresponding text caption, covering a wide range of severe weather scenarios across China. On top of this dataset, we develop a meteorology multimodal large model (MMLM) that directly ingests 4D meteorological inputs. In addition, it is designed to accommodate the unique characteristics of 4D meteorological data flow, incorporating three plug-and-play adaptive fusion modules that enable dynamic feature extraction and integration across temporal sequences, vertical pressure layers, and spatial dimensions. Extensive experiments on MP-Bench demonstrate that MMLM performs exceptionally well across multiple tasks, highlighting its effectiveness in severe weather understanding and marking a key step toward realizing automated, AI-driven weather forecasting systems. Our source code and dataset will be made publicly available.

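[249] FEAT: A Multi-Agent Forensic AI System with Domain-Adapted Large Language Model for Automated Cause-of-Death Analysis cs.AI | cs.CV | cs.LG | cs.MA

Chen Shen, Wanqing Zhang, Kehan Li, Erwen Huang, Haitao Bi

TL;DR: FEAT is a multi-agent AI system built on a domain-adapted large language model (LLM) for automated forensic cause-of-death analysis, addressing workforce shortages and diagnostic variability in forensic medicine.

Details

Motivation: Cause-of-death determination faces systemic challenges such as workforce shortages and diagnostic variability, particularly in high-volume medicolegal systems.

Result: Across diverse Chinese case cohorts, FEAT outperforms existing AI systems, generalizes robustly, and achieves high concordance with experts.

Insight: By combining AI efficiency with human oversight, FEAT offers forensic systems a scalable and consistent solution while maintaining expert-level rigor.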

Abstract: Forensic cause-of-death determination faces systemic challenges, including workforce shortages and diagnostic variability, particularly in high-volume systems like China’s medicolegal infrastructure. We introduce FEAT (ForEnsic AgenT), a multi-agent AI framework that automates and standardizes death investigations through a domain-adapted large language model. FEAT’s application-oriented architecture integrates: (i) a central Planner for task decomposition, (ii) specialized Local Solvers for evidence analysis, (iii) a Memory & Reflection module for iterative refinement, and (iv) a Global Solver for conclusion synthesis. The system employs tool-augmented reasoning, hierarchical retrieval-augmented generation, forensic-tuned LLMs, and human-in-the-loop feedback to ensure legal and medical validity. In evaluations across diverse Chinese case cohorts, FEAT outperformed state-of-the-art AI systems in both long-form autopsy analyses and concise cause-of-death conclusions. It demonstrated robust generalization across six geographic regions and achieved high expert concordance in blinded validations. Senior pathologists validated FEAT’s outputs as comparable to those of human experts, with improved detection of subtle evidentiary nuances. To our knowledge, FEAT is the first LLM-based AI agent system dedicated to forensic medicine, offering scalable, consistent death certification while maintaining expert-level rigor. By integrating AI efficiency with human oversight, this work could advance equitable access to reliable medicolegal services while addressing critical capacity constraints in forensic systems.

cs.IR

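[250] ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability cs.IR | cs.AI | cs.CL | cs.LG

Wenhan Liu, Xinyu Ma, Weiwei Sun, Yutao Zhu, Yuchen Li

TL;DR: ReasonRank proposes a framework for automatically synthesizing reasoning-intensive training data and a two-stage post-training approach that strengthens ranking ability. It significantly outperforms existing baselines and reaches SOTA performance on the BRIGHT leaderboard.

Details

Motivation: For lack of reasoning-intensive training data, existing LLM-based rerankers perform poorly in complex ranking scenarios, and their reasoning ability remains underdeveloped.

Result: ReasonRank reaches a SOTA score of 40.6 on the BRIGHT leaderboard, significantly outperforms existing baselines, and has much lower latency than the pointwise reranker Rank1.

Insight: Reasoning-intensive data and multi-stage training are key to better ranking, and a multi-view ranking reward is more effective than a reward based on a single ranking metric.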

Abstract: Large Language Model (LLM) based listwise ranking has shown superior performance in many passage ranking tasks. With the development of Large Reasoning Models, many studies have demonstrated that step-by-step reasoning during test-time helps improve listwise ranking performance. However, due to the scarcity of reasoning-intensive training data, existing rerankers perform poorly in many complex ranking scenarios and the ranking ability of reasoning-intensive rerankers remains largely underdeveloped. In this paper, we first propose an automated reasoning-intensive training data synthesis framework, which sources training queries and passages from diverse domains and applies DeepSeek-R1 to generate high-quality training labels. A self-consistency data filtering mechanism is designed to ensure the data quality. To empower the listwise reranker with strong reasoning ability, we further propose a two-stage post-training approach, which includes a cold-start supervised fine-tuning (SFT) stage for reasoning pattern learning and a reinforcement learning (RL) stage for further ranking ability enhancement. During the RL stage, based on the nature of listwise ranking, we design a multi-view ranking reward, which is more effective than a ranking metric-based reward. Extensive experiments demonstrate that our trained reasoning-intensive reranker ReasonRank outperforms existing baselines significantly and also achieves much lower latency than pointwise reranker Rank1. Through further experiments, our ReasonRank has achieved state-of-the-art (SOTA) performance of 40.6 on the BRIGHT leaderboard (https://brightbenchmark.github.io/). Our codes are available at https://github.com/8421BCD/ReasonRank.
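Note: the abstract contrasts its multi-view reward with a "ranking metric-based reward". For orientation, the sketch below computes NDCG@k, a common ranking metric that such a baseline reward could use. Choosing NDCG here is an assumption, the multi-view reward itself is not reproduced, and the function names and toy labels are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevances."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_ids, relevance_by_id, k=10):
    """NDCG@k of a predicted passage order against graded relevance labels."""
    gains = [relevance_by_id.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(relevance_by_id.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


labels = {"a": 2, "b": 1, "c": 0}          # toy relevance grades
perfect = ndcg_at_k(["a", "b", "c"], labels)
reversed_order = ndcg_at_k(["c", "b", "a"], labels)
```

A reward of this kind scores only the final list, which is one plausible reason a listwise reranker might benefit from additional reward views.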

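[251] PrLM: Learning Explicit Reasoning for Personalized RAG via Contrastive Reward Optimization cs.IR | cs.CL

Kepu Zhang, Teng Shi, Weijie Yu, Jun Xu

TL;DR: PrLM is a personalized RAG framework that learns explicit reasoning through contrastive reward optimization, generating responses better aligned with user preferences without annotated reasoning paths.

Details

Motivation: Existing RAG methods rely on LLMs to integrate retrieved content implicitly, which makes them sensitive to retrieval quality and prone to responses that are misaligned with user preferences. PrLM aims to address this.

Result: PrLM outperforms baseline methods on three personalized text generation datasets and remains robust to changes in the number of retrieved profiles and in the retriever.

Insight: Combining explicit reasoning with a contrastively trained reward addresses the sensitivity of conventional RAG to retrieval quality and improves the accuracy of personalized responses.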

Abstract: Personalized retrieval-augmented generation (RAG) aims to produce user-tailored responses by incorporating retrieved user profiles alongside the input query. Existing methods primarily focus on improving retrieval and rely on large language models (LLMs) to implicitly integrate the retrieved context with the query. However, such models are often sensitive to retrieval quality and may generate responses that are misaligned with user preferences. To address this limitation, we propose PrLM, a reinforcement learning framework that trains LLMs to explicitly reason over retrieved user profiles. Guided by a contrastively trained personalization reward model, PrLM effectively learns from user responses without requiring annotated reasoning paths. Experiments on three personalized text generation datasets show that PrLM outperforms existing methods and remains robust across varying numbers of retrieved profiles and different retrievers.

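[252] Improving Document Retrieval Coherence for Semantically Equivalent Queries cs.IR | cs.CL

Stefano Campese, Alessandro Moschitti, Ivano Lauriola

TL;DR: The paper proposes a variation of the Multi-Negative Ranking loss for training dense retrieval models that improves retrieval coherence across semantically equivalent queries. Experiments show that it lowers sensitivity to query wording and also raises accuracy.

Details

Motivation: Dense retrieval models are usually optimized for the relevance of top-ranked documents for a given query, but they are highly sensitive to variations in the query and document lexicon. The paper explores how to make retrieval more coherent across semantically equivalent queries.

Result: Experiments on MS-MARCO, Natural Questions, BEIR, and TREC DL 19/20 show that the new loss lowers model sensitivity and, interestingly, also improves retrieval accuracy.

Insight: Optimizing for coherence across semantically equivalent queries both reduces lexical sensitivity and improves retrieval performance, which matters for practical applications.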

Abstract: Dense Retrieval (DR) models have proven to be effective for Document Retrieval and Information Grounding tasks. Usually, these models are trained and optimized for improving the relevance of top-ranked documents for a given query. Previous work has shown that popular DR models are sensitive to the query and document lexicon: small variations of it may lead to a significant difference in the set of retrieved documents. In this paper, we propose a variation of the Multi-Negative Ranking loss for training DR that improves the coherence of models in retrieving the same documents with respect to semantically similar queries. The loss penalizes discrepancies between the top-k ranked documents retrieved for diverse but semantically equivalent queries. We conducted extensive experiments on various datasets: MS-MARCO, Natural Questions, BEIR, and TREC DL 19/20. The results show that models optimized by our loss have (i) lower sensitivity and, (ii) interestingly, higher accuracy.
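Note: one simple way to quantify the "coherence" discussed above is the overlap between the top-k document sets retrieved for paraphrases of the same query. The sketch below uses mean pairwise Jaccard overlap as an illustrative measure; it is not the paper's loss or its evaluation metric, and the function names and document ids are hypothetical.

```python
from itertools import combinations


def topk_jaccard(ids_a, ids_b, k):
    """Jaccard overlap between the top-k document ids of two rankings."""
    a, b = set(ids_a[:k]), set(ids_b[:k])
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def retrieval_coherence(rankings, k=10):
    """Mean pairwise top-k overlap across rankings for equivalent queries.

    rankings : list of ranked id lists, one per paraphrase of the same query
    Returns 1.0 when every paraphrase retrieves the same top-k set.
    """
    pairs = list(combinations(rankings, 2))
    if not pairs:
        return 1.0
    return sum(topk_jaccard(a, b, k) for a, b in pairs) / len(pairs)


# Two paraphrases of one query, with made-up document ids.
paraphrase_rankings = [["d1", "d2", "d3"], ["d2", "d1", "d4"]]
coherence_at_2 = retrieval_coherence(paraphrase_rankings, k=2)
coherence_at_3 = retrieval_coherence(paraphrase_rankings, k=3)
```

Because the measure compares sets, it ignores order within the top k; a rank-aware variant would be needed to penalize reorderings.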

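[253] HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches cs.IR | cs.AI | cs.CL

Jiejun Tan, Zhicheng Dou, Yan Yu, Jiehan Cheng, Qiang Ju

TL;DR: HierSearch is a hierarchical deep search framework that integrates local and Web search through hierarchical reinforcement learning, improving enterprise search performance.

Details

Motivation: Enterprises need private deep search systems that draw on both local and Web knowledge sources, but existing methods are limited to a single source, and training directly with flat reinforcement learning works poorly.

Result: On six benchmarks across general, finance, and medical domains, HierSearch outperforms flat reinforcement learning and other deep search and multi-source retrieval baselines.

Insight: Hierarchical design and knowledge filtering improve the efficiency and accuracy of multi-source search and suit enterprises' dual needs for privacy and performance.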

Abstract: Recently, large reasoning models have demonstrated strong mathematical and coding abilities, and deep search leverages their reasoning capabilities in challenging information retrieval tasks. Existing deep search works are generally limited to a single knowledge source, either local or the Web. However, enterprises often require private deep search systems that can leverage search tools over both local and the Web corpus. Simply training an agent equipped with multiple search tools using flat reinforcement learning (RL) is a straightforward idea, but it has problems such as low training data efficiency and poor mastery of complex tools. To address the above issue, we propose a hierarchical agentic deep search framework, HierSearch, trained with hierarchical RL. At the low level, a local deep search agent and a Web deep search agent are trained to retrieve evidence from their corresponding domains. At the high level, a planner agent coordinates low-level agents and provides the final answer. Moreover, to prevent direct answer copying and error propagation, we design a knowledge refiner that filters out hallucinations and irrelevant evidence returned by low-level agents. Experiments show that HierSearch achieves better performance compared to flat RL, and outperforms various deep search and multi-source retrieval-augmented generation baselines in six benchmarks across general, finance, and medical domains.

q-bio.NC

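[254] Sensory robustness through top-down feedback and neural stochasticity in recurrent vision models q-bio.NC | cs.CV | cs.LG

Antonino Greco, Marco D’Alessandro, Karl J. Friston, Giovanni Pezzulo, Markus Siegel

TL;DR: The paper studies the roles of top-down feedback and neural stochasticity, as found in biological vision, in convolutional recurrent neural networks, and finds that together they substantially improve robustness and efficiency.

Details

Motivation: To work out the computational contribution of top-down feedback to visual processing, and in particular whether it matters in artificial vision models as it does in biological systems.

Result: ConvRNNs with top-down feedback are more robust to noise perturbations and adversarial attacks, but only when trained with dropout, which also amplifies the effect of feedback.

Insight: Neural stochasticity makes the dynamics more chaotic, but combined with feedback it stabilizes network activity on a low-dimensional manifold, which improves robustness.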

Abstract: Biological systems leverage top-down feedback for visual processing, yet most artificial vision models succeed in image classification using purely feedforward or recurrent architectures, calling into question the functional significance of descending cortical pathways. Here, we trained convolutional recurrent neural networks (ConvRNN) on image classification in the presence or absence of top-down feedback projections to elucidate the specific computational contributions of those feedback pathways. We found that ConvRNNs with top-down feedback exhibited remarkable speed-accuracy trade-off and robustness to noise perturbations and adversarial attacks, but only when they were trained with stochastic neural variability, simulated by randomly silencing single units via dropout. By performing detailed analyses to identify the reasons for such benefits, we observed that feedback information substantially shaped the representational geometry of the post-integration layer, combining the bottom-up and top-down streams, and this effect was amplified by dropout. Moreover, feedback signals coupled with dropout optimally constrained network activity onto a low-dimensional manifold and encoded object information more efficiently in out-of-distribution regimes, with top-down information stabilizing the representational dynamics at the population level. Together, these findings uncover a dual mechanism for resilient sensory coding. On the one hand, neural stochasticity prevents unit-level co-adaptation albeit at the cost of more chaotic dynamics. On the other hand, top-down feedback harnesses high-level information to stabilize network activity on compact low-dimensional manifolds.

Table of Contents

cs.CV

[1] Med-GRIM: Enhanced Zero-Shot Medical VQA using prompt-embedded Multimodal Graph RAG cs.CV | cs.MA

[2] DiTalker: A Unified DiT-based Framework for High-Quality and Speaking Styles Controllable Portrait Animation cs.CV

[3] BigTokDetect: A Clinically-Informed Vision-Language Model Framework for Detecting Pro-Bigorexia Videos on TikTok cs.CV

[4] Frequency Prior Guided Matching: A Data Augmentation Approach for Generalizable Semi-Supervised Polyp Segmentation cs.CV

[5] Large Language Models Facilitate Vision Reflection in Image Classification cs.CV

[6] A Framework Combining 3D CNN and Transformer for Video-Based Behavior Recognition cs.CV | cs.AI

[7] RMT-PPAD: Real-time Multi-task Learning for Panoptic Perception in Autonomous Driving cs.CV | cs.LG

[8] What Makes “Good” Distractors for Object Hallucination Evaluation in Large Vision-Language Models? cs.CV | cs.LG

[9] Benchmarking Deep Learning-Based Object Detection Models on Feature Deficient Astrophotography Imagery Dataset cs.CV | astro-ph.IM

[10] MILD: Multi-Layer Diffusion Strategy for Complex and Precise Multi-IP Aware Human Erasing cs.CV

[11] Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images cs.CV | eess.IV

[12] Age-Diverse Deepfake Dataset: Bridging the Age Gap in Deepfake Detection cs.CV | cs.LG

[13] Static and Plugged: Make Embodied Evaluation Simple cs.CV

[14] StyleTailor: Towards Personalized Fashion Styling via Hierarchical Negative Feedback cs.CV | cs.CY | cs.MA

[15] From Label Error Detection to Correction: A Modular Framework and Benchmark for Object Detection Datasets cs.CV | cs.LG

[16] On the effectiveness of multimodal privileged knowledge distillation in two vision transformer based diagnostic applications cs.CV | cs.LG

[17] Grounding Emotion Recognition with Visual Prototypes: VEGA – Revisiting CLIP in MERC cs.CV

[18] Bridging Brain Connectomes and Clinical Reports for Early Alzheimer’s Disease Diagnosis cs.CV | cs.LG

[19] Surformer v1: Transformer-Based Surface Classification Using Tactile and Vision Features cs.CV | cs.AI

[20] ImpliHateVid: A Benchmark Dataset and Two-stage Contrastive Learning Framework for Implicit Hate Speech Detection in Videos cs.CV | cs.LG

[21] ContextGuard-LVLM: Enhancing News Veracity through Fine-grained Cross-modal Contextual Consistency Verification cs.CV

[22] VL-MedGuide: A Visual-Linguistic Large Model for Intelligent and Explainable Skin Disease Auxiliary Diagnosis cs.CV

[23] Learning More by Seeing Less: Line Drawing Pretraining for Efficient, Transferable, and Human-Aligned Vision cs.CV

[24] MMFformer: Multimodal Fusion Transformer Network for Depression Detection cs.CV | cs.AI | cs.CL | cs.LG | cs.SD | eess.AS

[25] Restage4D: Reanimating Deformable 3D Reconstruction from a Single Video cs.CV

[26] FoundBioNet: A Foundation-Based Model for IDH Genotyping of Glioma from Multi-Parametric MRI cs.CV | cs.AI

[27] VOccl3D: A Video Benchmark Dataset for 3D Human Pose and Shape Estimation under real Occlusions cs.CV | cs.GR

[28] SafePLUG: Empowering Multimodal LLMs with Pixel-Level Insight and Temporal Grounding for Traffic Accident Understanding cs.CV | cs.AI

[29] DiffUS: Differentiable Ultrasound Rendering from Volumetric Imaging cs.CV | cs.GR

[30] Edge Detection for Organ Boundaries via Top Down Refinement and SubPixel Upsampling cs.CV

[31] VesselRW: Weakly Supervised Subcutaneous Vessel Segmentation via Learned Random Walk Propagation cs.CV

[32] VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding cs.CV | cs.AI | I.2.10

[33] NS-FPN: Improving Infrared Small Target Detection and Segmentation from Noise Suppression Perspective cs.CV | cs.AI

[34] BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models cs.CV | cs.AI

[35] eMotions: A Large-Scale Dataset and Audio-Visual Fusion Network for Emotion Analysis in Short-form Videos cs.CV

[36] A Simple yet Powerful Instance-Aware Prompting Framework for Training-free Camouflaged Object Segmentation cs.CV

[37] MultiRef: Controllable Image Generation with Multiple Visual References cs.CV

[38] MMReID-Bench: Unleashing the Power of MLLMs for Effective and Versatile Person Re-identification cs.CV | cs.AI

[39] AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning cs.CV

[40] SLRTP2025 Sign Language Production Challenge: Methodology, Results, and Future Work cs.CV | eess.IV | eess.SP

[41] Beyond Frequency: Seeing Subtle Cues Through the Lens of Spatial Decomposition for Fine-Grained Visual Classification cs.CV | cs.AI

[42] Adversarial Video Promotion Against Text-to-Video Retrieval cs.CV

[43] Evaluating Fisheye-Compatible 3D Gaussian Splatting Methods on Real Images Beyond 180 Degree Field of View cs.CV | cs.GR

[44] WeatherDiffusion: Weather-Guided Diffusion Model for Forward and Inverse Rendering cs.CV | cs.AI

[45] OctreeNCA: Single-Pass 184 MP Segmentation on Consumer Hardware cs.CV

[46] S2-UniSeg: Fast Universal Agglomerative Pooling for Scalable Segment Anything without Supervision cs.CV

[47] HiMat: DiT-based Ultra-High Resolution SVBRDF Generation cs.CV | cs.GR

[48] TerraMAE: Learning Spatial-Spectral Representations from Hyperspectral Earth Observation Data via Adaptive Masked Autoencoders cs.CV | cs.LG

[49] DocRefine: An Intelligent Framework for Scientific Document Understanding and Content Optimization based on Multimodal Large Model Agents cs.CV

[50] MV-CoRe: Multimodal Visual-Conceptual Reasoning for Complex Visual Question Answering cs.CV

[51] Large Language Model Evaluated Stand-alone Attention-Assisted Graph Neural Network with Spatial and Structural Information Interaction for Precise Endoscopic Image Segmentation cs.CV

[52] TeSO: Representing and Compressing 3D Point Cloud Scenes with Textured Surfel Octree cs.CV

[53] ForeSight: Multi-View Streaming Joint Object Detection and Trajectory Forecasting cs.CV | cs.RO

[54] Communication-Efficient Multi-Agent 3D Detection via Hybrid Collaboration cs.CV

[55] AugLift: Boosting Generalization in Lifting-based 3D Human Pose Estimation cs.CV | cs.LG

[56] Perceptual Evaluation of GANs and Diffusion Models for Generating X-rays cs.CV | cs.AI

[57] CMAMRNet: A Contextual Mask-Aware Network Enhancing Mural Restoration Through Comprehensive Mask Guidance cs.CV

[58] Dynamic Pattern Alignment Learning for Pretraining Lightweight Human-Centric Vision Models cs.CV

[59] Intention-Aware Diffusion Model for Pedestrian Trajectory Prediction cs.CV | cs.AI

[60] SketchAnimator: Animate Sketch via Motion Customization of Text-to-Video Diffusion Models cs.CV

[61] CoopDiff: Anticipating 3D Human-object Interactions via Contact-consistent Decoupled Diffusion cs.CV

[62] Lightweight Multi-Scale Feature Extraction with Fully Connected LMF Layer for Salient Object Detection cs.CV | cs.AI

[63] EventRR: Event Referential Reasoning for Referring Video Object Segmentation cs.CV

[64] Similarity Matters: A Novel Depth-guided Network for Image Restoration and A New Dataset cs.CV

[65] Unsupervised Real-World Super-Resolution via Rectified Flow Degradation Modelling cs.CV | eess.IV

[66] Bridging Semantic Logic Gaps: A Cognition-Inspired Multimodal Boundary-Preserving Network for Image Manipulation Localization cs.CV

[67] Generic Calibration: Pose Ambiguity/Linear Solution and Parametric-hybrid Pipeline cs.CV

[68] Landmark Guided Visual Feature Extractor for Visual Speech Recognition with Limited Resource cs.CV

[69] Consistent and Controllable Image Animation with Motion Linear Diffusion Transformers cs.CV

[70] SUIT: Spatial-Spectral Union-Intersection Interaction Network for Hyperspectral Object Tracking cs.CV

[71] Understanding Dynamic Scenes in Ego Centric 4D Point Clouds cs.CV

[72] Small-Large Collaboration: Training-efficient Concept Personalization for Large VLM using a Meta Personalized Small VLM cs.CV

[73] Representation Understanding via Activation Maximization cs.CV | cs.AI

[74] SynMatch: Rethinking Consistency in Medical Image Segmentation with Sparse Annotations cs.CV

[75] BEVANet: Bilateral Efficient Visual Attention Network for Real-Time Semantic Segmentation cs.CV

[76] DragonFruitQualityNet: A Lightweight Convolutional Neural Network for Real-Time Dragon Fruit Quality Inspection on Mobile Devices cs.CV | cs.AI

[77] MCITlib: Multimodal Continual Instruction Tuning Library and Benchmark cs.CV | cs.AI

[78] MobileViCLIP: An Efficient Video-Text Model for Mobile Devices cs.CV