Table of Contents

cs.CV

[1] Semantic VLM Dataset for Safe Autonomous Driving cs.CV | cs.RO

Yuankai He, Weisong Shi

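TL;DR: CAR-Scenes is a vision-language model (VLM) dataset for autonomous driving that supports scene-level understanding. It contains 5,192 annotated images covering a 28-key knowledge base with 350+ leaf attributes. The dataset supports semantic retrieval and risk-aware scenario mining, and ships with reproducible baseline models and tools.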

Details

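Motivation: Existing datasets fall short on scene understanding and interpretability. CAR-Scenes aims to fill this gap by providing richer, more interpretable vision-language data for autonomous driving.

Result: The dataset supports semantic retrieval and risk-aware scenario mining. Baseline models are evaluated on a fixed validation split using scalar accuracy, micro-averaged F1, and severity MAE/RMSE.

Insight: Through fine-grained annotation and supporting tools, CAR-Scenes advances data-centric autonomous driving research and improves model interpretability and scene understanding.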

Abstract: CAR-Scenes is a frame-level dataset for autonomous driving that enables training and evaluation of vision-language models (VLMs) for interpretable, scene-level understanding. We annotate 5,192 images drawn from Argoverse 1, Cityscapes, KITTI, and nuScenes using a 28-key category/sub-category knowledge base covering environment, road geometry, background-vehicle behavior, ego-vehicle behavior, vulnerable road users, sensor states, and a discrete severity scale (1-10), totaling 350+ leaf attributes. Labels are produced by a GPT-4o-assisted vision-language pipeline with human-in-the-loop verification; we release the exact prompts, post-processing rules, and per-field baseline model performance. CAR-Scenes also provides attribute co-occurrence graphs and JSONL records that support semantic retrieval, dataset triage, and risk-aware scenario mining across sources. To calibrate task difficulty, we include reproducible, non-benchmark baselines, notably a LoRA-tuned Qwen2-VL-2B with deterministic decoding, evaluated via scalar accuracy, micro-averaged F1 for list attributes, and severity MAE/RMSE on a fixed validation split. We publicly release the annotation and analysis scripts, including graph construction and evaluation scripts, to enable explainable, data-centric workflows for future intelligent vehicles. Dataset: https://github.com/Croquembouche/CAR-Scenes
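
The three metric families named in the abstract can be sketched as follows. This is a minimal illustration, not the released evaluation script: the function names and sample field values are hypothetical, and list attributes are assumed to be compared as unordered sets.

```python
import math


def scalar_accuracy(pred, gold):
    """Fraction of exact matches for single-valued (scalar) fields."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def micro_f1(pred_lists, gold_lists):
    """Micro-averaged F1: pool TP/FP/FN over all records, then compute F1."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_lists, gold_lists):
        p, g = set(pred), set(gold)  # assumption: order and duplicates ignored
        tp += len(p & g)
        fp += len(p - g)
        fn += len(g - p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mae_rmse(pred, gold):
    """Mean absolute error and root mean squared error for severity scores."""
    errs = [p - g for p, g in zip(pred, gold)]
    mae = sum(abs(e) for e in errs) / len(errs)
    rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
    return mae, rmse


if __name__ == "__main__":
    # Hypothetical predictions and labels for two validation records.
    print(scalar_accuracy(["rain", "clear"], ["rain", "fog"]))
    print(micro_f1([["car", "bus"], ["bike"]], [["car"], ["bike", "truck"]]))
    print(mae_rmse([3, 5], [4, 7]))
```

Micro-averaging pools counts before computing F1, so records with many list items weigh more than records with few.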


[2] Expert Consensus-based Video-Based Assessment Tool for Workflow Analysis in Minimally Invasive Colorectal Surgery: Development and Validation of ColoWorkflow cs.CV

Pooja P Jain, Pietro Mascagni, Giuseppe Massimiani, Nabani Banik, Marta Goglia

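TL;DR: This study develops and validates ColoWorkflow, an expert-consensus-based video assessment tool for analyzing workflow in minimally invasive colorectal surgery. Generalizable workflow descriptors were established through a Delphi process, and the tool's applicability and inter-rater reliability were validated on a multicentre video dataset.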

Details

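Motivation: Minimally invasive colorectal surgery suffers from high procedural variability and a steep learning curve, so data-driven tools are needed to reduce variability and optimize training. Existing tools are difficult to standardize and implement.

Result: ColoWorkflow showed broad applicability across the vast majority of labels. Inter-rater reliability was moderate (Cohen's kappa of 0.71 for phases and 0.66 for steps), with most discrepancies arising at phase transitions and step boundary definitions.

Insight: ColoWorkflow provides a reproducible framework for workflow analysis in minimally invasive colorectal surgery. It supports benchmarking across institutions and research on AI-driven workflow recognition, and may help standardize training and improve surgical quality.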

Abstract: Minimally invasive colorectal surgery is characterized by procedural variability, a difficult learning curve, and complications that impact quality and outcomes. Video-based assessment (VBA) offers an opportunity to generate data-driven insights to reduce variability, optimize training, and improve surgical performance. However, existing tools for workflow analysis remain difficult to standardize and implement. This study aims to develop and validate a VBA tool for workflow analysis across minimally invasive colorectal procedures. A Delphi process was conducted to achieve consensus on generalizable workflow descriptors. The resulting framework informed the development of a new VBA tool, ColoWorkflow. Independent raters then applied ColoWorkflow to a multicentre video dataset of laparoscopic and robotic colorectal surgery (CRS). Applicability and inter-rater reliability were evaluated. Consensus was achieved for 10 procedure-agnostic phases and 34 procedure-specific steps describing CRS workflows. ColoWorkflow was developed and applied to 54 colorectal operative videos (left and right hemicolectomies, sigmoid and rectosigmoid resections, and total proctocolectomies) from five centres. The tool demonstrated broad applicability, with all but one label utilized. Inter-rater reliability was moderate, with mean Cohen’s K of 0.71 for phases and 0.66 for steps. Most discrepancies arose at phase transitions and step boundary definitions. ColoWorkflow is the first consensus-based, validated VBA tool for comprehensive workflow analysis in minimally invasive CRS. It establishes a reproducible framework for video-based performance assessment, enabling benchmarking across institutions and supporting the development of artificial intelligence-driven workflow recognition. Its adoption may standardize training, accelerate competency acquisition, and advance data-informed surgical quality improvement.
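
Cohen's kappa, the agreement statistic reported above, corrects raw agreement for the agreement expected by chance. The sketch below assumes each rater assigns one label per video segment. The phase names are made up for illustration, and the study's actual unit of comparison may differ.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical phase labels for four segments.
    a = ["setup", "setup", "dissection", "dissection"]
    b = ["setup", "dissection", "dissection", "dissection"]
    print(cohens_kappa(a, b))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance.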


[3] Frequency-Aware Vision-Language Multimodality Generalization Network for Remote Sensing Image Classification cs.CV

Junjie Zhang, Feng Zhao, Hanqiang Liu, Jun Yu

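TL;DR: This paper proposes a frequency-aware vision-language multimodality generalization network (FVMGN) that addresses multimodal data heterogeneity and cross-scene generalization in remote sensing image classification.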

Details

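Motivation: The rapid development of remote sensing technology brings challenges of multimodal data heterogeneity and cross-scene generalization, and existing vision-language models lack linguistic prior knowledge specific to remote sensing modalities.

Result: Experiments show that FVMGN achieves better multimodality generalization than existing state-of-the-art methods.

Insight: Combining frequency-domain analysis with multimodal linguistic features is key to improving generalization in remote sensing image classification.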

Abstract: Booming remote sensing (RS) technology is giving rise to a novel multimodality generalization task, which requires the model to overcome data heterogeneity while possessing powerful cross-scene generalization ability. Moreover, most vision-language models (VLMs) describe surface materials in RS images using universal texts, lacking proprietary linguistic prior knowledge specific to different RS vision modalities. In this work, we formalize RS multimodality generalization (RSMG) as a learning paradigm, and propose a frequency-aware vision-language multimodality generalization network (FVMGN) for RS image classification. Specifically, a diffusion-based training-test-time augmentation (DTAug) strategy is designed to reconstruct multimodal land-cover distributions, enriching input information for FVMGN. Following that, to overcome multimodal heterogeneity, a multimodal wavelet disentanglement (MWDis) module is developed to learn cross-domain invariant features by resampling low and high frequency components in the frequency domain. Considering the characteristics of RS vision modalities, shared and proprietary class texts are designed as linguistic inputs for the transformer-based text encoder to extract diverse text features. For multimodal vision inputs, a spatial-frequency-aware image encoder (SFIE) is constructed to realize local-global feature reconstruction and representation. Finally, a multiscale spatial-frequency feature alignment (MSFFA) module is proposed to construct a unified semantic space, ensuring refined multiscale alignment of different text and vision features in spatial and frequency domains. Extensive experiments show that FVMGN has excellent multimodality generalization ability compared with state-of-the-art (SOTA) methods.


[4] GFT: Graph Feature Tuning for Efficient Point Cloud Analysis cs.CV

Manish Dhakal, Venkat R. Dasari, Raj Sunderraman, Yi Ding

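TL;DR: GFT is a parameter-efficient fine-tuning method for point cloud data. It uses dynamic graph learning and a lightweight graph convolution network to reduce the number of trainable parameters while maintaining performance.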

Details

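Motivation: Parameter-efficient fine-tuning (PEFT) for point cloud data requires specialized methods because general-purpose approaches are suboptimal. GFT aims to further reduce the number of trainable parameters and improve adaptability.

Result: On object classification and segmentation tasks, GFT performs on par with existing methods while using fewer trainable parameters.

Insight: GFT demonstrates the potential of dynamic graph learning and lightweight models for efficient fine-tuning on point cloud data, offering a new optimization direction for point cloud tasks.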

Abstract: Parameter-efficient fine-tuning (PEFT) significantly reduces computational and memory costs by updating only a small subset of the model's parameters, enabling faster adaptation to new tasks with minimal loss in performance. Previous studies have introduced PEFTs tailored for point cloud data, as general approaches are suboptimal. To further reduce the number of trainable parameters, we propose a point-cloud-specific PEFT, termed Graph Features Tuning (GFT), which learns a dynamic graph from initial tokenized inputs of the transformer using a lightweight graph convolution network and passes these graph features to deeper layers via skip connections and efficient cross-attention modules. Extensive experiments on object classification and segmentation tasks show that GFT performs on par with existing methods while reducing the number of trainable parameters. Code is at https://github.com/manishdhakal/GFT.


[5] Accuracy-Preserving CNN Pruning Method under Limited Data Availability cs.CV | cs.AI | cs.LG

Daisuke Yasui, Toshitaka Matsuki, Hiroshi Sato

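TL;DR: This paper proposes a CNN pruning method based on Layer-wise Relevance Propagation (LRP) that aims to achieve high pruning rates while preserving model accuracy when data is limited.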

Details

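Motivation: Existing LRP-based methods prune CNNs without fine-tuning, which suits limited-data scenarios, but they still suffer from significant accuracy degradation that limits practical use.

Result: With limited data, the method achieves a higher pruning rate and better accuracy preservation than existing methods.

Insight: In limited-data scenarios, explainable AI techniques such as LRP can enable effective model compression without significantly sacrificing performance.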

Abstract: Convolutional Neural Networks (CNNs) are widely used in image recognition and have succeeded in various domains. CNN models have grown larger to improve accuracy and generalization performance. Research has been conducted on compressing pre-trained models for specific target applications in environments with limited computing resources. Among model compression techniques, methods using Layer-wise Relevance Propagation (LRP), an explainable AI technique, have shown promise by achieving high pruning rates while preserving accuracy, even without fine-tuning. Because these methods do not require fine-tuning, they are suited to scenarios with limited data. However, existing LRP-based pruning approaches still suffer from significant accuracy degradation, limiting their practical usability. This study proposes a pruning method that achieves a higher pruning rate while better preserving model accuracy. Using only a small amount of data, our approach preserves accuracy better than existing methods.


[6] Short-Window Sliding Learning for Real-Time Violence Detection via LLM-based Auto-Labeling cs.CV | cs.AI

Seoik Jung, Taekyung Song, Yangro Lee, Sungjun Lee

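TL;DR: The paper proposes a short-window sliding learning framework combined with LLM-based auto-labeling for real-time violence detection, significantly improving performance on long videos and in real-time scenarios.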

Details

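Motivation: Conventional long-video training approaches struggle to capture rapid violent events, so fine-grained annotated short clips are needed to improve real-time detection accuracy.

Result: The method reaches 95.25% accuracy on RWF-2000 and 83.25% on UCF-Crime, supporting its generalization and real-time applicability.

Insight: Short clips combined with LLM-based auto-labeling effectively address the shortcomings of long-video training and suit real-time violence detection in intelligent surveillance systems.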

Abstract: This paper proposes a Short-Window Sliding Learning framework for real-time violence detection in CCTV footage. Unlike conventional long-video training approaches, the proposed method divides videos into 1-2 second clips and applies Large Language Model (LLM)-based auto-caption labeling to construct fine-grained datasets. Each short clip fully utilizes all frames to preserve temporal continuity, enabling precise recognition of rapid violent events. Experiments demonstrate that the proposed method achieves 95.25% accuracy on RWF-2000 and significantly improves performance on long videos (UCF-Crime: 83.25%), confirming its strong generalization and real-time applicability in intelligent surveillance systems.
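
Splitting a video into short overlapping clips can be sketched as below. The 1-2 second window comes from the abstract. The stride, the function name, and the choice to drop a trailing partial window are assumptions for illustration only.

```python
def short_windows(num_frames, fps, window_s=2.0, stride_s=1.0):
    """Return (start, end) frame index pairs for sliding short clips.

    `end` is exclusive, and every frame inside a window is kept.
    `stride_s` is a hypothetical parameter; the paper does not state it here.
    """
    win = max(1, round(window_s * fps))
    step = max(1, round(stride_s * fps))
    clips = []
    start = 0
    while start + win <= num_frames:  # assumption: drop a partial last window
        clips.append((start, start + win))
        start += step
    return clips


if __name__ == "__main__":
    # A 5-second video at 30 fps, cut into 2-second clips every 1 second.
    for clip in short_windows(num_frames=150, fps=30):
        print(clip)
```

Each clip would then be captioned and labelled independently, which is what makes the resulting dataset fine-grained.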


[7] MCN-CL: Multimodal Cross-Attention Network and Contrastive Learning for Multimodal Emotion Recognition cs.CV | cs.AI

Feng Li, Ke Wu, Yongwei Li

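TL;DR: MCN-CL combines a multimodal cross-attention network with contrastive learning to address modal heterogeneity, category imbalance, and dynamic facial action unit modeling in multimodal emotion recognition, significantly improving performance.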

Details

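Motivation: In practice, multimodal emotion recognition faces modal heterogeneity, the complexity of modeling dynamic facial actions, and category imbalance, creating an urgent need for an efficient cross-modal feature fusion method.

Result: On the IEMOCAP and MELD datasets, weighted F1 scores improve by 3.42% and 5.73%, respectively.

Insight: By combining a cross-attention network with contrastive learning, MCN-CL significantly improves cross-modal feature fusion in multimodal emotion recognition and is well suited to imbalanced datasets.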

Abstract: Multimodal emotion recognition plays a key role in many domains, including mental health monitoring, educational interaction, and human-computer interaction. However, existing methods often face three major challenges: unbalanced category distribution, the complexity of temporal modeling of dynamic facial action units, and the difficulty of feature fusion due to modal heterogeneity. With the explosive growth of multimodal data in social media scenarios, the need for building an efficient cross-modal fusion framework for emotion recognition is becoming increasingly urgent. To this end, this paper proposes Multimodal Cross-Attention Network and Contrastive Learning (MCN-CL) for multimodal emotion recognition. It uses a triple query mechanism and hard negative mining strategy to remove feature redundancy while preserving important emotional cues, effectively addressing the issues of modal heterogeneity and category imbalance. Experimental results on the IEMOCAP and MELD datasets show that our proposed method outperforms state-of-the-art approaches, with Weighted F1 scores improving by 3.42% and 5.73%, respectively.


[8] DINOv3 as a Frozen Encoder for CRPS-Oriented Probabilistic Rainfall Nowcasting cs.CV | cs.AI

Luciano Araujo Dourado Filho, Almir Moreira da Silva Neto, Anthony Miyaguchi, Rodrigo Pereira David, Rodrigo Tripodi Calumby

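TL;DR: This paper proposes an efficient approach to probabilistic rainfall nowcasting that combines a pre-trained DINOv3 encoder with a lightweight probabilistic head and, optimized for CRPS, outperforms conventional 3D-UNET baselines.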

Details

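Motivation: Conventional rainfall nowcasting methods are computationally complex and of limited effectiveness. This paper aims to improve forecasting performance by combining a pre-trained vision encoder with efficient probabilistic modeling.

Result: On the Weather4Cast 2025 benchmark, the method achieves a CRPS of 3.5102, an improvement of about 26% over the best 3D-UNET.

Insight: A pre-trained vision encoder can serve as a strong feature extractor for a specialized task, and a lightweight head design effectively reduces computational cost.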

Abstract: This paper proposes a competitive and computationally efficient approach to probabilistic rainfall nowcasting. A video projector (V-JEPA Vision Transformer) coupled with a lightweight probabilistic head is attached to a pre-trained satellite vision encoder (DINOv3-SAT493M) to map encoder tokens into a discrete empirical CDF (eCDF) over 4-hour accumulated rainfall. The projector-head is optimized end-to-end over the Continuous Ranked Probability Score (CRPS). As an alternative, 3D-UNET baselines trained with an aggregate Rank Probability Score and a per-pixel Gamma-Hurdle objective are used. On the Weather4Cast 2025 benchmark, the proposed method achieved promising performance, with a CRPS of 3.5102, which represents an effectiveness gain of approximately 26% over the best 3D-UNET.
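
For a forecast given as a discrete CDF, CRPS is the integral of the squared difference between the predicted CDF and the step function at the observed value. The sketch below approximates that integral with a left-endpoint rectangle rule. The threshold grid and function name are hypothetical, and the benchmark's official scoring may discretize differently.

```python
def crps_discrete(cdf, thresholds, observed):
    """Approximate CRPS for a CDF sampled on a threshold grid.

    cdf[k] is the predicted P(rainfall <= thresholds[k]).
    The integral is approximated bin by bin using values at the left edge.
    """
    total = 0.0
    for k in range(len(thresholds) - 1):
        width = thresholds[k + 1] - thresholds[k]
        # Step function of the observation: 1 once the threshold reaches it.
        step = 1.0 if observed <= thresholds[k] else 0.0
        total += (cdf[k] - step) ** 2 * width
    return total


if __name__ == "__main__":
    grid = [0.0, 1.0, 2.0, 3.0]  # hypothetical rainfall thresholds
    sharp = [0.0, 1.0, 1.0, 1.0]  # all mass placed at the observed value
    vague = [0.5, 0.5, 1.0, 1.0]  # mass split around the observed value
    print(crps_discrete(sharp, grid, observed=1.0))
    print(crps_discrete(vague, grid, observed=1.0))
```

Lower values are better: a forecast that concentrates probability near the observed rainfall scores closer to zero.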


[9] YOLO-Drone: An Efficient Object Detection Approach Using the GhostHead Network for Drone Images cs.CV

Hyun-Ki Jung

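TL;DR: This paper proposes YOLO-Drone, an efficient object detection method optimized for drone imagery. It modifies the Head network of YOLOv11 (the GhostHead Network), significantly improving detection accuracy and speed.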

Details

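Motivation: Drone images are typically captured from high altitudes, which makes object identification difficult. This paper proposes an efficient solution to this problem.

Result: YOLO-Drone improves Precision, Recall, F1-Score, and mAP (0.5) by 0.4%, 0.6%, 0.5%, and 0.5%, respectively. Inference speed also improves.

Insight: The GhostHead Network improves both accuracy and efficiency for detecting objects from high altitudes, and YOLO-Drone outperforms YOLOv8, YOLOv9, and YOLOv10 among high-performance detection models.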

Abstract: Object detection using images or videos captured by drones is a promising technology with significant potential across various industries. However, a major challenge is that drone images are typically taken from high altitudes, making object identification difficult. This paper proposes an effective solution to address this issue. The base model used in the experiments is YOLOv11, the latest object detection model, with a specific implementation based on YOLOv11n. The experimental data were sourced from the widely used and reliable VisDrone dataset, a standard benchmark in drone-based object detection. This paper introduces an enhancement to the Head network of the YOLOv11 algorithm, called the GhostHead Network. The model incorporating this improvement is named YOLO-Drone. Experimental results demonstrate that YOLO-Drone achieves significant improvements in key detection accuracy metrics, including Precision, Recall, F1-Score, and mAP (0.5), compared to the original YOLOv11. Specifically, the proposed model recorded a 0.4% increase in Precision, a 0.6% increase in Recall, a 0.5% increase in F1-Score, and a 0.5% increase in mAP (0.5). Additionally, the Inference Speed metric, which measures image processing speed, also showed a notable improvement. These results indicate that YOLO-Drone is a high-performance model with enhanced accuracy and speed compared to YOLOv11. To further validate its reliability, comparative experiments were conducted against other high-performance object detection models, including YOLOv8, YOLOv9, and YOLOv10. The results confirmed that the proposed model outperformed YOLOv8 by 0.1% in mAP (0.5) and surpassed YOLOv9 and YOLOv10 by 0.3% and 0.6%, respectively.


[10] PhaseWin Search Framework Enable Efficient Object-Level Interpretation cs.CV

Zihan Gu, Ruoyu Chen, Junchi Zhang, Yue Hu, Hua Zhang

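TL;DR: PhaseWin is a novel phase-window search algorithm for efficient object-level attribution. Its phased coarse-to-fine search substantially reduces computational complexity while maintaining high faithfulness.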

Details

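Motivation: Existing methods based on submodular subset selection achieve high-faithfulness explanations but are computationally inefficient, which limits practical use. The authors propose PhaseWin to address this efficiency bottleneck.

Result: Experiments show that PhaseWin achieves over 95% of greedy attribution faithfulness using only 20% of the computational budget, and significantly outperforms other baselines on object detection and visual grounding tasks.

Insight: PhaseWin shows that high efficiency and high faithfulness can be balanced, offering a new solution for interpreting object-level multimodal models.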

Abstract: Attribution is essential for interpreting object-level foundation models. Recent methods based on submodular subset selection have achieved high faithfulness, but their efficiency limitations hinder practical deployment in real-world scenarios. To address this, we propose PhaseWin, a novel phase-window search algorithm that enables faithful region attribution with near-linear complexity. PhaseWin replaces traditional quadratic-cost greedy selection with a phased coarse-to-fine search, combining adaptive pruning, windowed fine-grained selection, and dynamic supervision mechanisms to closely approximate greedy behavior while dramatically reducing model evaluations. Theoretically, PhaseWin retains near-greedy approximation guarantees under mild monotone submodular assumptions. Empirically, PhaseWin achieves over 95% of greedy attribution faithfulness using only 20% of the computational budget, and consistently outperforms other attribution baselines across object detection and visual grounding tasks with Grounding DINO and Florence-2. PhaseWin establishes a new state of the art in scalable, high-faithfulness attribution for object-level multimodal models.


[11] Out-of-Distribution Detection with Positive and Negative Prompt Supervision Using Large Language Models cs.CV

Zhixia He, Chen Zhao, Minglai Shao, Xintao Wu, Xujiang Zhao

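TL;DR: The paper proposes a method based on positive and negative prompt supervision that uses large language models to optimize prompt content, improving vision-language models on out-of-distribution detection. The method exploits inter-class and boundary features to significantly improve detection performance.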

Details

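Motivation: Existing vision-language models perform well on out-of-distribution detection, but negative prompts often include many non-target features, leading to suboptimal results. To address this, the paper proposes positive and negative prompt supervision, which optimizes prompt content to improve detection performance.

Result: On CIFAR-100 and ImageNet-1K, evaluated across eight out-of-distribution datasets, the method surpasses existing state-of-the-art baselines.

Insight: 1. Negative prompts should focus on category boundaries rather than a broad range of non-target features; 2. Transferring semantic information can significantly improve the performance of the visual branch; 3. Large language models show promise for initializing prompt content.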

Abstract: Out-of-distribution (OOD) detection is committed to delineating the classification boundaries between in-distribution (ID) and OOD images. Recent advances in vision-language models (VLMs) have demonstrated remarkable OOD detection performance by integrating both visual and textual modalities. In this context, negative prompts are introduced to emphasize the dissimilarity between image features and prompt content. However, these prompts often include a broad range of non-ID features, which may result in suboptimal outcomes due to the capture of overlapping or misleading information. To address this issue, we propose Positive and Negative Prompt Supervision, which encourages negative prompts to capture inter-class features and transfers this semantic knowledge to the visual modality to enhance OOD detection performance. Our method begins with class-specific positive and negative prompts initialized by large language models (LLMs). These prompts are subsequently optimized, with positive prompts focusing on features within each class, while negative prompts highlight features around category boundaries. Additionally, a graph-based architecture is employed to aggregate semantic-aware supervision from the optimized prompt representations and propagate it to the visual branch, thereby enhancing the performance of the energy-based OOD detector. Extensive experiments on two benchmarks, CIFAR-100 and ImageNet-1K, across eight OOD datasets and five different LLMs, demonstrate that our method outperforms state-of-the-art baselines.
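
An energy-based detector commonly scores an input by the negative temperature-scaled log-sum-exp of its logits, with lower energy suggesting in-distribution data. The sketch below shows that standard score only. It does not reproduce the paper's prompt supervision or graph module, and the function name and example logits are illustrative.

```python
import math


def energy_score(logits, temperature=1.0):
    """Energy E(x) = -T * log(sum(exp(logit / T))), computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * log_sum_exp


if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0]  # one class clearly dominates
    flat = [0.0, 0.0, 0.0]        # no class stands out
    print(energy_score(confident))
    print(energy_score(flat))
```

In practice a threshold on this score, chosen on held-out in-distribution data, separates inputs treated as ID from those flagged as OOD.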


[12] Facial Expression Recognition with YOLOv11 and YOLOv12: A Comparative Study cs.CV

Umma Aymon, Nur Shazwani Kamarudin, Ahmad Fakhri Ab. Nasir

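TL;DR: This paper compares two lightweight models, YOLOv11n and YOLOv12n, on facial expression recognition (FER). YOLOv12n performs best on the clean KDEF dataset (mAP 0.5 of 95.6), while YOLOv11n shows higher precision (65.2) on FER2013. The study highlights the suitability of lightweight models for real-time, resource-constrained scenarios.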

Details

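Motivation: Facial expression recognition remains challenging in unconstrained, real-world environments. The study explores how lightweight YOLO models perform on FER and their potential in practical applications.

Result: YOLOv12n performs best on KDEF (mAP 0.5 of 95.6), while YOLOv11n shows higher precision (65.2) on FER2013. The results indicate that YOLO models can balance performance and efficiency on FER.

Insight: Lightweight YOLO models perform well on FER, especially in real-time and resource-constrained scenarios. The study reveals performance differences between clean and noisy datasets, pointing to directions for further optimization.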

Abstract: Facial Expression Recognition remains a challenging task, especially in unconstrained, real-world environments. This study investigates the performance of two lightweight models, YOLOv11n and YOLOv12n, which are the nano variants of the latest official YOLO series, within a unified detection and classification framework for FER. Two benchmark classification datasets, FER2013 and KDEF, are converted into object detection format and model performance is evaluated using mAP 0.5, precision, recall, and confusion matrices. Results show that YOLOv12n achieves the highest overall performance on the clean KDEF dataset with a mAP 0.5 of 95.6, and also outperforms YOLOv11n on the FER2013 dataset in terms of mAP 63.8, reflecting stronger sensitivity to varied expressions. In contrast, YOLOv11n demonstrates higher precision 65.2 on FER2013, indicating fewer false positives and better reliability in noisy, real-world conditions. On FER2013, both models show more confusion between visually similar expressions, while clearer class separation is observed on the cleaner KDEF dataset. These findings underscore the trade-off between sensitivity and precision, illustrating how lightweight YOLO models can effectively balance performance and efficiency. The results demonstrate adaptability across both controlled and real-world conditions, establishing these models as strong candidates for real-time, resource-constrained emotion-aware AI applications.


[13] Heterogeneous Complementary Distillation cs.CV

Liuchi Xu, Hao Zheng, Lu Wang, Lisheng Xu, Jun Cheng

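TL;DR: This paper proposes Heterogeneous Complementary Distillation (HCD), a new framework that addresses the challenges knowledge distillation faces across heterogeneous architectures (e.g., ViT to ResNet18) due to differences in spatial feature representations. HCD integrates complementary teacher and student features and uses shared logits with logit decomposition to improve how well the student absorbs the teacher's knowledge.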

Details

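Motivation: The main problem in distillation across heterogeneous architectures (e.g., ViT to ResNet18) is the large difference in spatial feature representations, which conventional homogeneous distillation methods struggle to handle. Existing heterogeneous distillation methods often have high computational cost and complex designs, or rely too heavily on logit alignment, and so cannot fully exploit complementary features.

Result: Experiments on CIFAR-100, fine-grained datasets (e.g., CUB200), and ImageNet-1K show that HCD outperforms existing methods on heterogeneous knowledge distillation.

Insight: HCD succeeds by fully exploiting complementary teacher and student features, and by using sub-logit decoupling and an orthogonality loss to reduce redundant knowledge transfer, which improves the student's robustness and generalization.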

Abstract: Knowledge distillation (KD) transfers the dark knowledge from a complex teacher to a compact student. However, heterogeneous architecture distillation, such as Vision Transformer (ViT) to ResNet18, faces challenges due to differences in spatial feature representations. Traditional KD methods are mostly designed for homogeneous architectures and hence struggle to effectively address the disparity. Although heterogeneous KD approaches have been developed recently to solve these issues, they often incur high computational costs and complex designs, or overly rely on logit alignment, which limits their ability to leverage the complementary features. To overcome these limitations, we propose Heterogeneous Complementary Distillation (HCD), a simple yet effective framework that integrates complementary teacher and student features to align representations in shared logits. These logits are decomposed and constrained to facilitate diverse knowledge transfer to the student. Specifically, HCD processes the student's intermediate features through a convolutional projector and adaptive pooling, concatenates them with the teacher's feature from the penultimate layer, and then maps them via the Complementary Feature Mapper (CFM) module, comprising a fully connected layer, to produce shared logits. We further introduce Sub-logit Decoupled Distillation (SDD), which partitions the shared logits into n sub-logits that are fused with the teacher's logits to rectify classification. To ensure sub-logit diversity and reduce redundant knowledge transfer, we propose an Orthogonality Loss (OL). By preserving student-specific strengths and leveraging teacher knowledge, HCD enhances robustness and generalization in students. Extensive experiments on the CIFAR-100, Fine-grained (e.g., CUB200), and ImageNet-1K datasets demonstrate that HCD outperforms state-of-the-art KD methods, establishing it as an effective solution for heterogeneous KD.


[14] Divide, Conquer and Unite: Hierarchical Style-Recalibrated Prototype Alignment for Federated Medical Image Segmentation cs.CV

Xingyue Zhao, Wenke Huang, Xingguang Wang, Haoyu Zhao, Linghao Zhuang

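TL;DR: This paper proposes FedBCS, a hierarchical style-recalibrated prototype alignment method for federated medical image segmentation that addresses feature heterogeneity and the accumulation of style bias, significantly improving model performance.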

Details

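Motivation: In federated learning, image data from different medical institutions exhibits feature heterogeneity caused by differences in scanners or protocols. Existing methods mostly rely on final-layer features and overlook multi-level contextual information and style bias in intermediate layers, which limits segmentation accuracy.

Result: Experiments on two public datasets show that the proposed method (FedBCS) significantly outperforms existing methods.

Insight: Jointly modeling multi-level contextual information and intermediate-layer style bias is key to improving federated medical image segmentation.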

Abstract: Federated learning enables multiple medical institutions to train a global model without sharing data, yet feature heterogeneity from diverse scanners or protocols remains a major challenge. Many existing works attempt to address this issue by leveraging model representations (e.g., mean feature vectors) to correct local training; however, they often face two key limitations: 1) Incomplete Contextual Representation Learning: Current approaches primarily focus on final-layer features, overlooking critical multi-level cues and thus diluting essential context for accurate segmentation. 2) Layerwise Style Bias Accumulation: Although utilizing representations can partially align global features, these methods neglect domain-specific biases within intermediate layers, allowing style discrepancies to build up and reduce model robustness. To address these challenges, we propose FedBCS to bridge feature representation gaps via domain-invariant contextual prototypes alignment. Specifically, we introduce a frequency-domain adaptive style recalibration into prototype construction that not only decouples content-style representations but also learns optimal style parameters, enabling more robust domain-invariant prototypes. Furthermore, we design a context-aware dual-level prototype alignment method that extracts domain-invariant prototypes from different layers of both encoder and decoder and fuses them with contextual information for finer-grained representation alignment. Extensive experiments on two public datasets demonstrate that our method exhibits remarkable performance.


[15] Abstract 3D Perception for Spatial Intelligence in Vision-Language Models cs.CV

Yifan Liu, Fangneng Zhan, Kaichen Zhou, Yilun Du, Paul Pu Liang

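TL;DR: This paper proposes SandboxVLM, a framework that uses abstract bounding boxes to encode geometric structure and physical kinematics. It addresses the weak performance of vision-language models (VLMs) on 3D tasks and significantly improves spatial intelligence in zero-shot settings.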

Details

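Motivation: Vision-language models (VLMs) perform poorly on 3D tasks such as spatial cognition and physical understanding, which limits their potential in real-world applications like robotics and embodied intelligence. The authors attribute this to a modality gap between 2D-trained VLMs and 3D tasks, which makes retrieving 3D information from 2D input inefficient.

Result: In zero-shot settings across multiple benchmarks and VLM backbones, SandboxVLM outperforms baseline methods, for example achieving an 8.3% gain on SAT Real.

Insight: The study shows that a simple 3D abstraction can substantially improve the 3D reasoning ability of VLMs without additional training, opening new possibilities for general-purpose embodied intelligence.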

Abstract: Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between the 3D tasks and the 2D training of VLMs, which leads to inefficient retrieval of 3D information from 2D input. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, achieving, for instance, an 8.3% gain on SAT Real over baseline methods. These results demonstrate that equipping VLMs with a 3D abstraction substantially enhances their 3D reasoning ability without additional training, suggesting new possibilities for general-purpose embodied intelligence.


[16] DEFT-LLM: Disentangled Expert Feature Tuning for Micro-Expression Recognition cs.CV | cs.HC

Ren Zhang, Huilai Li, Chao qi, Guoliang Xu, Tianyu Zhou

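TL;DR: DEFT-LLM achieves motion-semantic alignment through multi-expert disentanglement. It introduces the Uni-MER dataset and a three-expert architecture to address the entanglement of static and dynamic information and the mismatch between text labels and motion semantics in micro-expression recognition, achieving state-of-the-art performance.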

Details

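Motivation: Micro-expression recognition (MER) is crucial for inferring genuine emotion, but existing methods face the entanglement of static appearance and dynamic motion information, as well as a mismatch between text labels and facial motion semantics.

Result: The method achieves state-of-the-art performance on multiple MER benchmarks and stands out in interpretable modeling of local facial motion.

Insight: Disentangling static and dynamic information and aligning motion semantics are key; combining optical flow with AU labels helps construct a high-quality dataset; the multi-expert architecture improves model interpretability.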

Abstract: Micro expression recognition (MER) is crucial for inferring genuine emotion. Applying a multimodal large language model (MLLM) to this task enables spatio-temporal analysis of facial motion and provides interpretable descriptions. However, there are still two core challenges: (1) The entanglement of static appearance and dynamic motion cues prevents the model from focusing on subtle motion; (2) Textual labels in existing MER datasets do not fully correspond to underlying facial muscle movements, creating a semantic gap between text supervision and physical motion. To address these issues, we propose DEFT-LLM, which achieves motion semantic alignment by multi-expert disentanglement. We first introduce Uni-MER, a motion-driven instruction dataset designed to align text with local facial motion. Its construction leverages dual constraints from optical flow and Action Unit (AU) labels to ensure spatio-temporal consistency and reasonable correspondence to the movements. We then design an architecture with three experts to decouple facial dynamics into independent and interpretable representations (structure, dynamic textures, and motion-semantics). By integrating the instruction-aligned knowledge from Uni-MER into DEFT-LLM, our method injects effective physical priors for micro expressions while also leveraging the cross modal reasoning ability of large language models, thus enabling precise capture of subtle emotional cues. Experiments on multiple challenging MER benchmarks demonstrate state-of-the-art performance, as well as a particular advantage in interpretable modeling of local facial motion.


[17] Language-Guided Graph Representation Learning for Video Summarization cs.CV

Wenrui Li, Wei Han, Hengyu Man, Wangmeng Zuo, Xiaopeng Fan

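TL;DR: This paper proposes a Language-guided Graph Representation Learning Network (LGRLN) for video summarization. It builds multi-directional graphs to preserve the temporal and contextual dependencies of video content and introduces a dual-threshold graph convolution mechanism and a cross-modal embedding module, improving both performance and efficiency.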

Details

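Motivation: Existing video summarization methods struggle to capture global dependencies in video content and to accommodate multimodal user customization, and temporal proximity does not necessarily reflect semantic proximity, so a more efficient and flexible solution is needed.

Result: Experiments show that LGRLN outperforms existing methods on multiple benchmarks and reduces inference time and model parameters by 87.8% and 91.7%, respectively.

Insight: Graph structures effectively model temporal and semantic relationships in video content, and language guidance enables more flexible multimodal summary generation.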

Abstract: With the rapid growth of video content on social media, video summarization has become a crucial task in multimedia processing. However, existing methods face challenges in capturing global dependencies in video content and accommodating multimodal user customization. Moreover, temporal proximity between video frames does not always correspond to semantic proximity. To tackle these challenges, we propose a novel Language-guided Graph Representation Learning Network (LGRLN) for video summarization. Specifically, we introduce a video graph generator that converts video frames into a structured graph to preserve temporal order and contextual dependencies. By constructing forward, backward and undirected graphs, the video graph generator effectively preserves the sequentiality and contextual relationships of video content. We design an intra-graph relational reasoning module with a dual-threshold graph convolution mechanism, which distinguishes semantically relevant frames from irrelevant ones between nodes. Additionally, our proposed language-guided cross-modal embedding module generates video summaries with specific textual descriptions. We model the summary generation output as a mixture of Bernoulli distributions and solve it with the EM algorithm. Experimental results show that our method outperforms existing approaches across multiple benchmarks. Moreover, the proposed LGRLN reduces inference time and model parameters by 87.8% and 91.7%, respectively. Our codes and pre-trained models are available at https://github.com/liwrui/LGRLN.


[18] Text-guided Weakly Supervised Framework for Dynamic Facial Expression Recognition cs.CV | cs.AI

Gunho Jung, Heejo Kong, Seong-Whan Lee

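TL;DR: TG-DFER is a text-guided weakly supervised framework that improves MIL-based dynamic facial expression recognition (DFER) by combining semantic guidance with coherent temporal modeling.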

Details

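Motivation: DFER suffers from the many-to-one labeling problem and from the visual diversity of emotional expressions, which existing MIL methods handle poorly.

Result: TG-DFER shows improved generalization, interpretability, and temporal sensitivity under weak supervision.

Insight: Introducing textual information and multi-grained temporal modeling can effectively mitigate the labeling problem and the dynamic complexity of DFER.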

Abstract: Dynamic facial expression recognition (DFER) aims to identify emotional states by modeling the temporal changes in facial movements across video sequences. A key challenge in DFER is the many-to-one labeling problem, where a video composed of numerous frames is assigned a single emotion label. A common strategy to mitigate this issue is to formulate DFER as a Multiple Instance Learning (MIL) problem. However, MIL-based approaches inherently suffer from the visual diversity of emotional expressions and the complexity of temporal dynamics. To address this challenge, we propose TG-DFER, a text-guided weakly supervised framework that enhances MIL-based DFER by incorporating semantic guidance and coherent temporal modeling. A vision-language pre-trained (VLP) model is integrated to provide semantic guidance through fine-grained textual descriptions of emotional context. Furthermore, we introduce visual prompts, which align enriched textual emotion labels with visual instance features, enabling fine-grained reasoning and frame-level relevance estimation. In addition, a multi-grained temporal network is designed to jointly capture short-term facial dynamics and long-range emotional flow, ensuring coherent affective understanding across time. Extensive results demonstrate that TG-DFER achieves improved generalization, interpretability, and temporal sensitivity under weak supervision.


[19] Preserving Cross-Modal Consistency for CLIP-based Class-Incremental Learning cs.CV

Haoran Chen, Houze Xu, Micah Goldblum, Daoguo Dong, Zuxuan Wu

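TL;DR: This paper proposes two methods, DMC and DMC-OT, to address cross-modal consistency and distributional drift in CLIP-based class-incremental learning (CIL). They decouple the optimization of the vision encoder and the textual soft prompts and add an optimal-transport calibration strategy, which significantly improves performance.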

Details

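Motivation: When learning new classes, current CLIP-based CIL methods are prone to classifier bias because the text prototypes overfit, and they suffer from distributional drift when the vision encoder is updated. A solution that preserves cross-modal consistency is needed.

Result: On CIFAR-100, Imagenet-R, and other datasets, both DMC and DMC-OT achieve state-of-the-art performance, with DMC-OT improving accuracy by a further 1.80% on average.

Insight: Decoupling the optimization of the two modalities effectively mitigates overfitting, and the optimal-transport calibration strategy markedly reduces the impact of distributional drift on performance.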

Abstract: Class-incremental learning (CIL) enables models to continuously learn new categories from sequential tasks without forgetting previously acquired knowledge. While recent advances in vision-language models such as CLIP have demonstrated strong generalization across domains, extending them to continual settings remains challenging. In particular, learning task-specific soft prompts for newly introduced classes often leads to severe classifier bias, as the text prototypes overfit to recent categories when prior data are unavailable. In this paper, we propose DMC, a simple yet effective two-stage framework for CLIP-based CIL that decouples the adaptation of the vision encoder and the optimization of textual soft prompts. Each stage is trained with the other frozen, allowing one modality to act as a stable semantic anchor for the other to preserve cross-modal alignment. Furthermore, current CLIP-based CIL approaches typically store class-wise Gaussian statistics for generative replay, yet they overlook the distributional drift that arises when the vision encoder is updated over time. To address this issue, we introduce DMC-OT, an enhanced version of DMC that incorporates an optimal-transport guided calibration strategy to align memory statistics across evolving encoders, along with a task-specific prompting design that enhances inter-task separability. Extensive experiments on CIFAR-100, Imagenet-R, CUB-200, and UCF-101 demonstrate that both DMC and DMC-OT achieve state-of-the-art performance, with DMC-OT further improving accuracy by an average of 1.80%.


[20] PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs cs.CV | cs.AI

Bowen Sun, Yujun Cai, Ming-Hsuan Yang, Hang Wu, Yiwei Wang

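TL;DR: This paper proposes Phase Aggregated Smoothing (PAS), a training-free mechanism that addresses the temporal inconsistency caused by Rotary Position Embeddings (RoPE) in Video LLMs. It smooths the temporal kernel through multi-phase averaging, making attention markedly more robust.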

Details

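Motivation: Video LLMs are unstable when processing temporal information: small changes in frame timing can flip attention or suppress relevant frames. The instability stems from frame-scale ripples in the inverse Fourier time kernel that arises when RoPE is extended to multimodal inputs.

Result: On multiple video understanding benchmarks, PAS delivers consistent improvements under matched token budgets, with negligible computational overhead.

Insight: PAS shows that simple multi-phase averaging can smooth high-frequency ripples while preserving spectral information, offering a robust and efficient approach to temporal encoding in Video LLMs.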

Abstract: Video LLMs suffer from temporal inconsistency: small shifts in frame timing can flip attention and suppress relevant frames. We trace this instability to the common extension of Rotary Position Embeddings to video through multimodal RoPE. The induced inverse Fourier time kernel exhibits frame-scale ripples that multiply adjacent frames by different factors, which perturbs attention that should otherwise be governed by the raw query key inner product. We present Phase Aggregated Smoothing (PAS), a simple, training-free mechanism that applies small opposed phase offsets across heads and then aggregates their outputs. PAS preserves the per-head spectrum magnitude, while the aggregation effectively smooths the temporal kernel and reduces phase sensitivity without changing the positional encoding structure. Our analysis shows that the RoPE rotated logit can be approximated as a content dot product scaled by a time kernel; smoothing this kernel yields Lipschitz stability of attention to small temporal shifts; multi phase averaging attenuates high frequency ripples while preserving per-head spectra under Nyquist-valid sampling. Experiments on multiple video understanding benchmarks under matched token budgets show consistent improvements with negligible computational overhead. PAS provides a plug and play upgrade for robust temporal encoding in Video LLMs.
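The smoothing effect of averaging opposed offsets can be seen on a toy cosine kernel. The sketch below is not the PAS implementation: it assumes the time kernel is a plain average of cosines over a set of frequencies and that the opposed offsets act as small time shifts of `+eps` and `-eps`; both `time_kernel` and `smoothed_kernel` are hypothetical names.

```python
import math


def time_kernel(delta, freqs, shift=0.0):
    """Average of cosines: a toy stand-in for a RoPE-induced time kernel."""
    return sum(math.cos(w * (delta + shift)) for w in freqs) / len(freqs)


def smoothed_kernel(delta, freqs, eps):
    """Average the kernel over two opposed offsets, +eps and -eps."""
    return 0.5 * (time_kernel(delta, freqs, eps) + time_kernel(delta, freqs, -eps))


# For a single frequency w the average equals cos(w * delta) * cos(w * eps),
# so low frequencies pass almost unchanged and high frequencies are damped.
low = smoothed_kernel(0.0, [0.1], 0.2)
high = smoothed_kernel(0.0, [5.0], 0.2)
```

The product form follows from the identity cos(a + b) + cos(a - b) = 2 cos(a) cos(b). Each shifted kernel on its own keeps its per-frequency magnitude; only the aggregate is attenuated.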


[21] Binary Verification for Zero-Shot Vision cs.CV | cs.AI

Jeffrey Liu, Rongbin Hu

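TL;DR: This paper proposes a training-free binary verification workflow for zero-shot vision tasks. Its quantization and binarization steps significantly improve the performance of multimodal models.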

Details

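Motivation: Existing approaches to zero-shot vision tasks often require complex training. This paper aims for a training-free method that improves the performance of off-the-shelf vision-language models (VLMs) by quantizing the query and applying binary verification.

Result: Experiments show that the method clearly outperforms answering open-ended queries directly on tasks such as referring expression grounding and spatial reasoning, and that the gains are consistent across tasks.

Insight: Restructuring the query (from open-ended to multiple-choice to True/False) effectively improves model performance, highlighting the importance of inference-time design over task-specific training.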

Abstract: We propose a training-free, binary verification workflow for zero-shot vision with off-the-shelf VLMs. It comprises two steps: (i) quantization, which turns the open-ended query into a multiple-choice question (MCQ) with a small, explicit list of unambiguous candidates; and (ii) binarization, which asks one True/False question per candidate and resolves deterministically: if exactly one is True, select it; otherwise, revert to an MCQ over the remaining plausible candidates. We evaluate the workflow on referring expression grounding (REC), spatial reasoning (Spatial-Map, Spatial-Grid, Spatial-Maze), and BLINK-Jigsaw. Relative to answering open-ended queries directly, quantization to MCQ yields large gains, and True/False binarization provides a consistent additional boost. Across all tasks, the same workflow produces significant improvements, indicating generality. Our theory formalizes how open-ended vision queries can be quantized to MCQs and further binarized into True/False verifications, establishing a hardness ladder. A simple analysis explains why Boolean resolution boosts accuracy. Together, these components yield a simple and unified workflow that emphasizes inference-time design over task-specific training. It offers a practical, drop-in path to stronger zero-shot vision with today’s VLMs.
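The binarization step and its deterministic resolution rule can be sketched as follows. This is a minimal illustration, not the authors' code: `ask_true_false` and `ask_mcq` are hypothetical callables standing in for VLM queries, and the treatment of the "no candidate accepted" case (all candidates stay plausible) is an assumption.

```python
def binary_verify(candidates, ask_true_false, ask_mcq):
    """Resolve a quantized query with one True/False check per candidate.

    ask_true_false(candidate) -> bool and ask_mcq(candidates) -> candidate
    are hypothetical stand-ins for calls to a vision-language model.
    """
    verdicts = {c: ask_true_false(c) for c in candidates}
    accepted = [c for c in candidates if verdicts[c]]
    if len(accepted) == 1:
        return accepted[0]  # exactly one True: select it
    # Otherwise revert to an MCQ over the remaining plausible candidates.
    # Assumption: if nothing was accepted, every candidate stays plausible.
    remaining = accepted if accepted else list(candidates)
    return ask_mcq(remaining)


truth = {"cat": False, "dog": True, "fox": False}
answer = binary_verify(["cat", "dog", "fox"], lambda c: truth[c], lambda cs: cs[0])
```

In this stubbed run only "dog" is verified as True, so it is selected without an MCQ fallback.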


[22] PROMISE: Prompt-Attentive Hierarchical Contrastive Learning for Robust Cross-Modal Representation with Missing Modalities cs.CV | cs.LG

Jiajun Chen, Sai Cheng, Yutao Yuan, Yirui Zhang, Haitao Yuan

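TL;DR: This paper proposes PROMISE, a framework that combines multimodal prompt learning with hierarchical contrastive learning to counter the performance degradation caused by missing modalities, significantly improving the robustness of cross-modal representations.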

Details

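Motivation: Real-world multimodal data often has missing modalities, which causes existing models to degrade significantly. Existing methods generate the missing modalities in overly simplistic ways and fail to preserve cross-modal consistency.

Result: Experiments on benchmark datasets confirm the advantage of PROMISE, which significantly outperforms existing methods.

Insight: A dynamic prompt-attention mechanism can effectively compensate for the representational inconsistency caused by missing modalities, improving robustness and generalization on cross-modal tasks.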

Abstract: Multimodal models integrating natural language and visual information have substantially improved generalization of representation models. However, their effectiveness significantly declines in real-world situations where certain modalities are missing or unavailable. This degradation primarily stems from inconsistent representation learning between complete multimodal data and incomplete modality scenarios. Existing approaches typically address missing modalities through relatively simplistic generation methods, yet these approaches fail to adequately preserve cross-modal consistency, leading to suboptimal performance. To overcome this limitation, we propose a novel multimodal framework named PROMISE, a PROMpting-Attentive HIerarchical ContraStive LEarning approach designed explicitly for robust cross-modal representation under conditions of missing modalities. Specifically, PROMISE innovatively incorporates multimodal prompt learning into a hierarchical contrastive learning framework, equipped with a specially designed prompt-attention mechanism. This mechanism dynamically generates robust and consistent representations for scenarios where particular modalities are absent, thereby effectively bridging the representational gap between complete and incomplete data. Extensive experiments conducted on benchmark datasets, along with comprehensive ablation studies, clearly demonstrate the superior performance of PROMISE compared to current state-of-the-art multimodal methods.


[23] EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation cs.CV

Zongyang Qiu, Bingyuan Wang, Xingbei Chen, Yingqing He, Zeyu Wang

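TL;DR: EmoVid is the first multimodal, emotion-annotated video dataset designed for creative media. It fills the gap in emotion understanding for video generation, and the paper also proposes an emotion-conditioned video generation technique that significantly improves the quality of generated videos.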

Details

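Motivation: Existing video generation systems focus mainly on low-level visual metrics and neglect the affective dimension. Dedicated resources that connect emotion understanding with generative tasks are lacking, especially for stylized and non-realistic content.

Result: On text-to-video and image-to-video tasks, both the quantitative metrics and the visual quality of generated videos improve significantly.

Insight: Emotion is central to video-based expression, especially in creative media, and the links between visual features and emotional perception can be used to improve video generation.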

Abstract: Emotion plays a pivotal role in video-based expression, but existing video generation systems predominantly focus on low-level visual metrics while neglecting affective dimensions. Although emotion analysis has made progress in the visual domain, the video community lacks dedicated resources to bridge emotion understanding with generative tasks, particularly for stylized and non-realistic contexts. To address this gap, we introduce EmoVid, the first multimodal, emotion-annotated video dataset specifically designed for creative media, which includes cartoon animations, movie clips, and animated stickers. Each video is annotated with emotion labels, visual attributes (brightness, colorfulness, hue), and text captions. Through systematic analysis, we uncover spatial and temporal patterns linking visual features to emotional perceptions across diverse video forms. Building on these insights, we develop an emotion-conditioned video generation technique by fine-tuning the Wan2.1 model. The results show a significant improvement in both quantitative metrics and the visual quality of generated videos for text-to-video and image-to-video tasks. EmoVid establishes a new benchmark for affective video computing. Our work not only offers valuable insights into visual emotion analysis in artistically styled videos, but also provides practical methods for enhancing emotional expression in video generation.


[24] Draft and Refine with Visual Experts cs.CV

Sungheon Jeong, Ryozo Masukawa, Jihong Park, Sanggeon Yun, Wenjun Huang

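TL;DR: This paper proposes the Draft and Refine (DnR) framework, which quantifies how much a large vision-language model (LVLM) relies on visual information, reduces hallucination, and refines responses using feedback from visual experts.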

Details

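Motivation: LVLMs still rely on linguistic priors rather than visual evidence, which leads to inaccurate or hallucinated responses. A quantitative measure of their use of visual information is needed.

Result: On VQA and image captioning tasks, the method improves accuracy and reduces hallucination.

Insight: Quantifying visual utilization is an effective route to more interpretable and evidence-driven multimodal systems.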

Abstract: While recent Large Vision-Language Models (LVLMs) exhibit strong multimodal reasoning abilities, they often produce ungrounded or hallucinated responses because they rely too heavily on linguistic priors instead of visual evidence. This limitation highlights the absence of a quantitative measure of how much these models actually use visual information during reasoning. We propose Draft and Refine (DnR), an agent framework driven by a question-conditioned utilization metric. The metric quantifies the model’s reliance on visual evidence by first constructing a query-conditioned relevance map to localize question-specific cues and then measuring dependence through relevance-guided probabilistic masking. Guided by this metric, the DnR agent refines its initial draft using targeted feedback from external visual experts. Each expert’s output (such as boxes or masks) is rendered as visual cues on the image, and the model is re-queried to select the response that yields the largest improvement in utilization. This process strengthens visual grounding without retraining or architectural changes. Experiments across VQA and captioning benchmarks show consistent accuracy gains and reduced hallucination, demonstrating that measuring visual utilization provides a principled path toward more interpretable and evidence-driven multimodal agent systems.


[25] VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models cs.CV | cs.AI | cs.LG

Xinlei Yu, Chengming Xu, Guibin Zhang, Zhangquan Chen, Yudong Zhang

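TL;DR: VisMem is a cognitively aligned framework that equips vision-language models (VLMs) with dynamic latent vision memories, addressing their weak visual processing and semantic consistency during prolonged generation.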

Details

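Motivation: VLMs often fail on complex tasks because of a "visual processing bottleneck": they lose grounding in visual evidence or lack contextualized visual experience. Inspired by theories of human memory, VisMem aims to improve model performance through memory modules.

Result: Across a range of visual tasks, VisMem improves performance by 11.8% on average and clearly outperforms the baseline model and other compared methods.

Insight: A layered design inspired by human memory theory (short-term and long-term memory) can effectively address the visual bottleneck of VLMs and offers a new paradigm for later work on memory enhancement.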

Abstract: Despite the remarkable success of Vision-Language Models (VLMs), their performance on a range of complex visual tasks is often hindered by a “visual processing bottleneck”: a propensity to lose grounding in visual evidence and exhibit a deficit in contextualized visual experience during prolonged generation. Drawing inspiration from human cognitive memory theory, which distinguishes short-term visually-dominant memory and long-term semantically-dominant memory, we propose VisMem, a cognitively-aligned framework that equips VLMs with dynamic latent vision memories, a short-term module for fine-grained perceptual retention and a long-term module for abstract semantic consolidation. These memories are seamlessly invoked during inference, allowing VLMs to maintain both perceptual fidelity and semantic consistency across thinking and generation. Extensive experiments across diverse visual benchmarks for understanding, reasoning, and generation reveal that VisMem delivers a significant average performance boost of 11.8% relative to the vanilla model and outperforms all counterparts, establishing a new paradigm for latent-space memory enhancement. The code will be available: https://github.com/YU-deep/VisMem.git.


[26] SP-Guard: Selective Prompt-adaptive Guidance for Safe Text-to-Image Generation cs.CV | cs.CY

Sumin Yu, Taesup Moon

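TL;DR: SP-Guard is a selective, prompt-adaptive guidance method for text-to-image generation with diffusion models that addresses the generation of harmful content. Its core is prompt-based harmfulness estimation combined with a selective guidance mask that adjusts only unsafe regions.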

Details

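Motivation: Diffusion models excel at text-to-image generation but are also easily used to generate harmful content, raising social concerns. Existing methods lack adaptivity and selectivity: they cannot adjust the guidance strength to the prompt or target only unsafe regions.

Result: Experiments show that SP-Guard generates safer images than existing methods while reducing unintended changes to the content.

Insight: Transparency and controllability are important directions for image generation, and selective guidance offers a new approach to safe generation.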

Abstract: While diffusion-based T2I models have achieved remarkable image generation quality, they also enable easy creation of harmful content, raising social concerns and highlighting the need for safer generation. Existing inference-time guiding methods lack both adaptivity (adjusting guidance strength based on the prompt) and selectivity (targeting only unsafe regions of the image). Our method, SP-Guard, addresses these limitations by estimating prompt harmfulness and applying a selective guidance mask to guide only unsafe areas. Experiments show that SP-Guard generates safer images than existing methods while minimizing unintended content alteration. Beyond improving safety, our findings highlight the importance of transparency and controllability in image generation.


[27] SUPER Decoder Block for Reconstruction-Aware U-Net Variants cs.CV

Siheon Joo, Hongjo Kim

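TL;DR: This paper proposes the SUPER decoder block, which exploits the perfect reconstruction (PR) property of wavelets while selectively suppressing redundant features. It addresses information loss in U-Net variants for inverse problems and improves the recovery of high-frequency detail.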

Details

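Motivation: Existing U-Net variants lose information when solving inverse problems; in particular, their limited ability to recover high-frequency detail constrains performance.

Result: On the CrackVision12K dataset, SUPER markedly improves segmentation of thin cracks (narrower than 4 px). On SIDD denoising it also achieves a moderate PSNR gain, confirming its robustness across frequency regimes.

Insight: Through a reconstruction-aware framework, the SUPER block unifies high-frequency fidelity and global coherence, offering a general-purpose improvement for U-Net variants.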

Abstract: Skip-connected encoder-decoder architectures (U-Net variants) are widely adopted for inverse problems but still suffer from information loss, limiting recovery of fine high-frequency details. We present Selectively Suppressed Perfect Reconstruction (SUPER), which exploits the perfect reconstruction (PR) property of wavelets to prevent information degradation while selectively suppressing (SS) redundant features. Free from rigid framelet constraints, SUPER serves as a plug-and-play decoder block for diverse U-Net variants, eliminating their intrinsic reconstruction bottlenecks and enhancing representational richness. Experiments across diverse crack benchmarks, including state-of-the-art (SOTA) models, demonstrate the structural potential of the proposed SUPER Decoder Block. Maintaining comparable computational cost, SUPER enriches representational diversity through increased parameterization. In small-scale in-domain experiments on the CrackVision12K dataset, SUPER markedly improves thin-crack segmentation performance, particularly for cracks narrower than 4 px, underscoring its advantage in high-frequency dominant settings. In smartphone image denoising on SIDD, where low-frequency components prevail, SUPER still achieves a moderate gain in PSNR, confirming its robustness across low- and high-frequency regimes. These results validate its plug-and-play generality across U-Net variants, achieving high-frequency fidelity and global coherence within a unified, reconstruction-aware framework.
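The perfect reconstruction property that SUPER builds on can be checked on the simplest wavelet. The sketch below uses a one-level 1D Haar transform, chosen here purely for illustration; the abstract does not say which wavelet the authors use, and the selective suppression step is not modelled.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_analysis(signal):
    """Split an even-length signal into low-pass and high-pass halves."""
    low = [_S * (signal[i] + signal[i + 1]) for i in range(0, len(signal), 2)]
    high = [_S * (signal[i] - signal[i + 1]) for i in range(0, len(signal), 2)]
    return low, high


def haar_synthesis(low, high):
    """Invert haar_analysis; the original samples come back exactly."""
    out = []
    for a, d in zip(low, high):
        out.extend([_S * (a + d), _S * (a - d)])
    return out


x = [4.0, 2.0, -1.0, 3.0]
low, high = haar_analysis(x)
y = haar_synthesis(low, high)  # equals x up to floating-point rounding
```

Because analysis followed by synthesis is the identity, no information is lost by the decomposition itself; any loss comes only from what a network chooses to discard afterwards.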


[28] AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning cs.CV | cs.AI

Jirong Zha, Yuxuan Fan, Tianyu Zhang, Geng Chen, Yingfeng Chen

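TL;DR: AirCopBench is the first comprehensive benchmark for multi-drone collaborative perception. It covers 14 task types under challenging perceptual conditions and exposes both the performance gap of MLLMs in multi-agent collaboration and their room for improvement.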

Details

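Motivation: Multi-agent collaborative perception lacks a dedicated benchmark. Existing multi-image benchmarks mainly target basic tasks on high-quality single-agent images and cannot evaluate MLLMs in realistic, complex collaborative scenarios.

Result: An evaluation of 40 MLLMs shows that they lag well behind humans on collaborative perception tasks (the best model trails by 24.38% on average), while fine-tuning experiments confirm the feasibility of sim-to-real transfer.

Insight: Collaborative perception places higher demands on MLLMs, and data collected under degraded perception conditions helps improve model robustness and generalization.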

Abstract: Multimodal Large Language Models (MLLMs) have shown promise in single-agent vision tasks, yet benchmarks for evaluating multi-agent collaborative perception remain scarce. This gap is critical, as multi-drone systems provide enhanced coverage, robustness, and collaboration compared to single-sensor setups. Existing multi-image benchmarks mainly target basic perception tasks using high-quality single-agent images, thus failing to evaluate MLLMs in more complex, egocentric collaborative scenarios, especially under real-world degraded perception conditions. To address these challenges, we introduce AirCopBench, the first comprehensive benchmark designed to evaluate MLLMs in embodied aerial collaborative perception under challenging perceptual conditions. AirCopBench includes 14.6k+ questions derived from both simulator and real-world data, spanning four key task dimensions: Scene Understanding, Object Understanding, Perception Assessment, and Collaborative Decision, across 14 task types. We construct the benchmark using data from challenging degraded-perception scenarios with annotated collaborative events, generating large-scale questions through model-, rule-, and human-based methods under rigorous quality control. Evaluations on 40 MLLMs show significant performance gaps in collaborative perception tasks, with the best model trailing humans by 24.38% on average and exhibiting inconsistent results across tasks. Fine-tuning experiments further confirm the feasibility of sim-to-real transfer in aerial collaborative perception and reasoning.


[29] EmbryoDiff: A Conditional Diffusion Framework with Multi-Focal Feature Fusion for Fine-Grained Embryo Developmental Stage Recognition cs.CV

Yong Sun, Zhengjie Zhang, Junyu Shi, Zhiyuan Zhang, Lijiang Liu

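TL;DR: EmbryoDiff is a conditional diffusion framework with multi-focal feature fusion for fine-grained embryo developmental stage recognition. It significantly improves accuracy and alleviates the feature ambiguity caused by cell occlusion.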

Details

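Motivation: Current deep learning models for embryo developmental stage recognition do not make full use of the distributional prior of embryonic development, and their reliance on single-focal information yields incomplete feature representations, which hurts classification accuracy.

Result: The method achieves state-of-the-art performance on two benchmark datasets, with average test accuracy of 82.8% and 81.3%, using only a single denoising step.

Insight: Combining the generative capacity of diffusion models with multi-focal feature fusion can markedly improve robustness to occlusion and ambiguous features in fine-grained classification.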

Abstract: Identification of fine-grained embryo developmental stages during In Vitro Fertilization (IVF) is crucial for assessing embryo viability. Although recent deep learning methods have achieved promising accuracy, existing discriminative models fail to utilize the distributional prior of embryonic development to improve accuracy. Moreover, their reliance on single-focal information leads to incomplete embryonic representations, making them susceptible to feature ambiguity under cell occlusions. To address these limitations, we propose EmbryoDiff, a two-stage diffusion-based framework that formulates the task as a conditional sequence denoising process. Specifically, we first train and freeze a frame-level encoder to extract robust multi-focal features. In the second stage, we introduce a Multi-Focal Feature Fusion Strategy that aggregates information across focal planes to construct a 3D-aware morphological representation, effectively alleviating ambiguities arising from cell occlusions. Building on this fused representation, we derive complementary semantic and boundary cues and design a Hybrid Semantic-Boundary Condition Block to inject them into the diffusion-based denoising process, enabling accurate embryonic stage classification. Extensive experiments on two benchmark datasets show that our method achieves state-of-the-art results. Notably, with only a single denoising step, our model obtains the best average test performance, reaching 82.8% and 81.3% accuracy on the two datasets, respectively.


[30] Algorithms Trained on Normal Chest X-rays Can Predict Health Insurance Types cs.CV | cs.AI

Chi-Yu Chen, Rawan Abulibdeh, Arash Asgari, Leo Anthony Celi, Deirdre Goode

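TL;DR: This paper shows that deep learning models can predict a patient's health insurance type (a proxy for socioeconomic status) from normal chest X-rays, revealing signals of social inequality hidden in medical images.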

Details

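Motivation: The study asks whether medical AI models can capture and exploit socioeconomic information implicit in medical images, challenging the assumption that medical images are neutral biological data.

Result: The models predict health insurance type with significant accuracy (AUC of about 0.67 to 0.68). The signal is diffuse rather than localized and persists after controlling for age, race, and sex.

Insight: Medical images are not neutral data; they carry information about social inequality. Fairness research in medical AI therefore needs to go beyond balancing datasets and adjusting thresholds and examine the social fingerprints embedded in clinical data.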

Abstract: Artificial intelligence is revealing what medicine never intended to encode. Deep vision models, trained on chest X-rays, can now detect not only disease but also invisible traces of social inequality. In this study, we show that state-of-the-art architectures (DenseNet121, SwinV2-B, MedMamba) can predict a patient’s health insurance type, a strong proxy for socioeconomic status, from normal chest X-rays with significant accuracy (AUC around 0.67 on MIMIC-CXR-JPG, 0.68 on CheXpert). The signal persists even when age, race, and sex are controlled for, and remains detectable when the model is trained exclusively on a single racial group. Patch-based occlusion reveals that the signal is diffuse rather than localized, embedded in the upper and mid-thoracic regions. This suggests that deep networks may be internalizing subtle traces of clinical environments, equipment differences, or care pathways; learning socioeconomic segregation itself. These findings challenge the assumption that medical images are neutral biological data. By uncovering how models perceive and exploit these hidden social signatures, this work reframes fairness in medical AI: the goal is no longer only to balance datasets or adjust thresholds, but to interrogate and disentangle the social fingerprints embedded in clinical data itself.


[31] Accelerating Controllable Generation via Hybrid-grained Cache cs.CV | cs.MM

Lin Liu, Huixia Ben, Shuo Wang, Jinda Lu, Junxiang Qiu

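TL;DR: This paper proposes a Hybrid-Grained Cache (HGC) that applies cache strategies of different granularities at different computational stages of a controllable generation model, substantially reducing computational overhead while preserving the semantic fidelity of the generated output.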

Details

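Motivation: Controllable generative models are widely used to improve the realism of synthetic visual content, but the computation needed to process control conditions and generate content makes generation inefficient. The authors propose HGC to improve efficiency.

Result: Experiments on COCO-Stuff and other datasets show that HGC significantly improves generation efficiency (a 63% reduction in computational cost) while preserving semantic fidelity (less than 1.5% performance loss).

Insight: A hybrid-grained cache strategy can effectively balance computational efficiency and generation quality, offering a new direction for optimizing controllable generative models.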

Abstract: Controllable generative models have been widely used to improve the realism of synthetic visual content. However, such models must handle the computational requirements of both control conditions and content generation, resulting in generally low generation efficiency. To address this issue, we propose a Hybrid-Grained Cache (HGC) approach that reduces computational overhead by adopting cache strategies with different granularities at different computational stages. Specifically, (1) we use a coarse-grained cache (block-level) based on feature reuse to dynamically bypass redundant computations in encoder-decoder blocks between each step of model reasoning. (2) We design a fine-grained cache (prompt-level) that acts within a module, where the fine-grained cache reuses cross-attention maps within consecutive reasoning steps and extends them to the corresponding module computations of adjacent steps. These caches of different granularities can be seamlessly integrated into each computational link of the controllable generation process. We verify the effectiveness of HGC on four benchmark datasets, especially its advantages in balancing generation efficiency and visual quality. For example, on the COCO-Stuff segmentation benchmark, our HGC significantly reduces the computational cost (MACs) by 63% (from 18.22T to 6.70T), while keeping the loss of semantic fidelity (quantized performance degradation) within 1.5%.
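The coarse-grained, block-level cache can be pictured as a wrapper that reuses a block's previous output when its input has barely changed between steps. This is a rough sketch under stated assumptions, not HGC itself: the `BlockCache` class, the relative-change criterion, and the `tol` threshold are all hypothetical, since the abstract does not specify how redundancy is detected.

```python
class BlockCache:
    """Wrap a block and reuse its last output when the input barely changes.

    The relative-change test and the `tol` threshold are assumptions made
    for illustration only.
    """

    def __init__(self, block, tol=0.05):
        self.block = block
        self.tol = tol
        self.last_input = None
        self.last_output = None
        self.computed = 0  # number of real block evaluations

    def __call__(self, x):
        if self.last_input is not None:
            change = sum(abs(a - b) for a, b in zip(x, self.last_input))
            scale = sum(abs(b) for b in self.last_input) or 1.0
            if change / scale < self.tol:
                return self.last_output  # bypass the block for this step
        # Recompute and remember the input that produced the cached output.
        self.last_input = list(x)
        self.last_output = self.block(x)
        self.computed += 1
        return self.last_output


cached = BlockCache(lambda x: [2 * v for v in x])
cached([1.0, 1.0])   # computed
cached([1.0, 1.01])  # small change: cached output reused
cached([2.0, 2.0])   # large change: recomputed
```

Comparing against the input of the last real evaluation, rather than the last call, keeps small changes from accumulating unnoticed across many reused steps.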


[32] MPCGNet: A Multiscale Feature Extraction and Progressive Feature Aggregation Network Using Coupling Gates for Polyp Segmentation cs.CV

Wei Wang, Feng Jiang, Xin Wang

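TL;DR: MPCGNet is a multiscale feature extraction and progressive feature aggregation network based on coupling gates for polyp segmentation. It significantly improves the accuracy of small-polyp identification and boundary segmentation.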

Details

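Motivation: Polyp segmentation faces three main challenges: small polyps are easily missed, boundaries are ambiguous, and colonoscopy images contain noise.

Result: On the ETIS-LaribPolypDB and CVC-ColonDB datasets, MPCGNet's mDice scores are 2.20% and 0.68% higher, respectively, than those of the second-best network.

Insight: Coupling gates not only suppress noise but also improve feature selection, markedly improving segmentation of small polyps and ambiguous boundaries.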

Abstract: Automatic segmentation of polyps is crucial for assisting doctors in colorectal polyp screening and cancer diagnosis. Despite the progress made by existing methods, polyp segmentation faces several challenges: (1) small-sized polyps are prone to being missed during identification, (2) the boundaries between polyps and the surrounding environment are often ambiguous, (3) noise in colonoscopy images, caused by uneven lighting and other factors, affects segmentation results. To address these challenges, this paper introduces coupling gates as components in specific modules to filter noise and perform feature importance selection. Three modules are proposed: the coupling gates multiscale feature extraction (CGMFE) module, which effectively extracts local features and suppresses noise; the windows cross attention (WCAD) decoder module, which restores details after capturing the precise location of polyps; and the decoder feature aggregation (DFA) module, which progressively aggregates features, further extracts them, and performs feature importance selection to reduce the loss of small-sized polyps. Experimental results demonstrate that MPCGNet outperforms recent networks, with mDice scores 2.20% and 0.68% higher than the second-best network on the ETIS-LaribPolypDB and CVC-ColonDB datasets, respectively.


[33] CrossMed: A Multimodal Cross-Task Benchmark for Compositional Generalization in Medical Imaging cs.CV | cs.AI

Pooja Singh, Siddhant Ujjain, Tapan Kumar Gandhi, Sandeep Kumar

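TL;DR: CrossMed is a multimodal cross-task benchmark for evaluating the compositional generalization of multimodal large language models in medical imaging.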

Details

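Motivation: The compositional generalization of multimodal medical AI remains underexplored, and CrossMed aims to fill this gap.

Result: On the Related splits, multimodal LLMs reach 83.2% classification accuracy and 0.75 segmentation cIoU, but performance drops significantly under Unrelated and zero-overlap conditions.

Insight: Multimodal LLMs stand out in cross-task transfer and compositional generalization, whereas traditional models show only modest gains.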

Abstract: Recent advances in multimodal large language models have enabled unified processing of visual and textual inputs, offering promising applications in general-purpose medical AI. However, their ability to generalize compositionally across unseen combinations of imaging modality, anatomy, and task type remains underexplored. We introduce CrossMed, a benchmark designed to evaluate compositional generalization (CG) in medical multimodal LLMs using a structured Modality-Anatomy-Task (MAT) schema. CrossMed reformulates four public datasets, CheXpert (X-ray classification), SIIM-ACR (X-ray segmentation), BraTS 2020 (MRI classification and segmentation), and MosMedData (CT classification) into a unified visual question answering (VQA) format, resulting in 20,200 multiple-choice QA instances. We evaluate two open-source multimodal LLMs, LLaVA-Vicuna-7B and Qwen2-VL-7B, on both Related and Unrelated MAT splits, as well as a zero-overlap setting where test triplets share no Modality, Anatomy, or Task with the training data. Models trained on Related splits achieve 83.2 percent classification accuracy and 0.75 segmentation cIoU, while performance drops significantly under Unrelated and zero-overlap conditions, demonstrating the benchmark difficulty. We also show cross-task transfer, where segmentation performance improves by 7 percent cIoU even when trained using classification-only data. Traditional models (ResNet-50 and U-Net) show modest gains, confirming the broad utility of the MAT framework, while multimodal LLMs uniquely excel at compositional generalization. CrossMed provides a rigorous testbed for evaluating zero-shot, cross-task, and modality-agnostic generalization in medical vision-language models.
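The zero-overlap setting described in the abstract reduces to a simple check on Modality-Anatomy-Task triplets. The sketch below is illustrative only: the function name `is_zero_overlap` and the anatomy labels in the example are assumptions, and the Related/Unrelated split rules are not modelled because the abstract does not define them.

```python
def is_zero_overlap(test_triplet, train_triplets):
    """Return True if the test triplet shares no Modality, Anatomy, or Task
    with any training triplet."""
    modalities = {m for m, _, _ in train_triplets}
    anatomies = {a for _, a, _ in train_triplets}
    tasks = {t for _, _, t in train_triplets}
    m, a, t = test_triplet
    return m not in modalities and a not in anatomies and t not in tasks


# Illustrative labels only; the benchmark's actual label strings may differ.
train = [("X-ray", "chest", "classification")]
print(is_zero_overlap(("MRI", "brain", "segmentation"), train))   # True
print(is_zero_overlap(("CT", "chest", "classification"), train))  # False
```

A test triplet fails the check as soon as one of its three components has been seen in training, which is what makes this the strictest of the evaluated settings.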


[34] SemanticNN: Compressive and Error-Resilient Semantic Offloading for Extremely Weak Devices cs.CV | cs.AI | cs.DC

Jiaming Huang, Yi Gao, Fuchang Pan, Renjie Li, Wei Dong

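TL;DR: SemanticNN is a semantic codec for extremely weak devices that tolerates bit-level errors in pursuit of semantic-level correctness. By adapting dynamically to network conditions and learning compact feature representations, it significantly reduces transmission volume while maintaining high accuracy.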

Details

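Motivation: IoT devices are resource-constrained and network conditions are unreliable, so traditional methods based on bit-level correctness are inefficient. A collaborative inference offloading system that works at the semantic level is needed.

Result: Experiments show that, under varying bit error rates, SemanticNN reduces feature transmission volume by 56.82-344.83x while maintaining high inference accuracy.

Insight: Semantic-level correctness is a better target than bit-level correctness for resource-constrained devices, and dynamic adaptation and decoder-side compensation are key to robustness.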

Abstract: With the rapid growth of the Internet of Things (IoT), integrating artificial intelligence (AI) on extremely weak embedded devices has garnered significant attention, enabling improved real-time performance and enhanced data privacy. However, the resource limitations of such devices and unreliable network conditions necessitate error-resilient device-edge collaboration systems. Traditional approaches focus on bit-level transmission correctness, which can be inefficient under dynamic channel conditions. In contrast, we propose SemanticNN, a semantic codec that tolerates bit-level errors in pursuit of semantic-level correctness, enabling compressive and resilient collaborative inference offloading under strict computational and communication constraints. It incorporates a Bit Error Rate (BER)-aware decoder that adapts to dynamic channel conditions and a Soft Quantization (SQ)-based encoder to learn compact representations. Building on this architecture, we introduce Feature-augmentation Learning, a novel training strategy that enhances offloading efficiency. To address encoder-decoder capability mismatches from asymmetric resources, we propose XAI-based Asymmetry Compensation to enhance decoding semantic fidelity. We conduct extensive experiments on STM32 using three models and six datasets across image classification and object detection tasks. Experimental results demonstrate that, under varying transmission error rates, SemanticNN significantly reduces feature transmission volume by 56.82-344.83x while maintaining superior inference accuracy.


[35] Hyperbolic Hierarchical Alignment Reasoning Network for Text-3D Retrieval cs.CV

Wenrui Li, Yidan Lu, Yeyu Chai, Rui Zhao, Hengyu Man

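TL;DR: This paper proposes the Hyperbolic Hierarchical Alignment Reasoning Network (H²ARN) for text-3D retrieval. By embedding data in hyperbolic space and designing new loss functions, it addresses Hierarchy Representation Collapse (HRC) and Redundancy-Induced Saliency Dilution (RISD).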

Details

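Motivation: As 3D data grows, text-3D retrieval becomes increasingly important, but existing methods handle hierarchical representation and noisy redundancy poorly, which limits retrieval performance.

Result: On the expanded T3DR-HIT v2 dataset, H²ARN significantly improves text-3D retrieval performance.

Insight: The geometry of hyperbolic space is naturally suited to modeling hierarchical relationships, and combining it with contribution-aware aggregation strengthens the saliency of key regions while suppressing noise.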

Abstract: With the daily influx of 3D data on the internet, text-3D retrieval has gained increasing attention. However, current methods face two major challenges: Hierarchy Representation Collapse (HRC) and Redundancy-Induced Saliency Dilution (RISD). HRC compresses abstract-to-specific and whole-to-part hierarchies in Euclidean embeddings, while RISD averages noisy fragments, obscuring critical semantic cues and diminishing the model’s ability to distinguish hard negatives. To address these challenges, we introduce the Hyperbolic Hierarchical Alignment Reasoning Network (H$^{2}$ARN) for text-3D retrieval. H$^{2}$ARN embeds both text and 3D data in a Lorentz-model hyperbolic space, where exponential volume growth inherently preserves hierarchical distances. A hierarchical ordering loss constructs a shrinking entailment cone around each text vector, ensuring that the matched 3D instance falls within the cone, while an instance-level contrastive loss jointly enforces separation from non-matching samples. To tackle RISD, we propose a contribution-aware hyperbolic aggregation module that leverages Lorentzian distance to assess the relevance of each local feature and applies contribution-weighted aggregation guided by hyperbolic geometry, enhancing discriminative regions while suppressing redundancy without additional supervision. We also release the expanded T3DR-HIT v2 benchmark, which contains 8,935 text-to-3D pairs, 2.6 times the original size, covering both fine-grained cultural artefacts and complex indoor scenes. Our codes are available at https://github.com/liwrui/H2ARN.
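For readers unfamiliar with the Lorentz model, the distance that the aggregation module relies on can be computed in a few lines. This sketch uses the standard hyperboloid model with curvature -1; the curvature the authors actually use is not stated in the abstract, and the helper names below are hypothetical.

```python
import math


def lift_to_hyperboloid(v):
    """Map a Euclidean vector onto the unit hyperboloid (curvature -1)."""
    x0 = math.sqrt(1.0 + sum(c * c for c in v))
    return [x0] + list(v)


def lorentz_inner(x, y):
    """Lorentzian inner product: minus the time term plus the space terms."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x, y):
    """Geodesic distance on the hyperboloid."""
    # The clamp guards against arguments slightly below 1 due to rounding.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))


origin = lift_to_hyperboloid([0.0])
point = lift_to_hyperboloid([math.sinh(1.0)])
d = lorentz_distance(origin, point)  # approximately 1.0
```

The point with spatial coordinate sinh(1) lifts to (cosh(1), sinh(1)), so its distance from the origin is arccosh(cosh(1)) = 1.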


[36] LiteAttention: A Temporal Sparse Attention for Diffusion Transformers cs.CV | cs.AI

Dor Shmilovich, Tony Wu, Aviad Dahan, Yuval Domb

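TL;DR: LiteAttention exploits the temporal coherence of diffusion attention to skip redundant attention computations through evolutionary computation skips, substantially improving the efficiency of video diffusion models while preserving generation quality.

Details

Motivation: Diffusion Transformers, especially for video generation, deliver excellent results but suffer high latency because of the quadratic complexity of attention. Existing acceleration methods struggle to balance the high overhead of dynamic sparsity patterns against the suboptimality of static ones.

Result: Achieves substantial speedups on production video diffusion models with no loss in quality.

Insight: The temporal coherence of diffusion attention patterns opens a new direction for reducing computation; a dynamic skipping strategy can balance efficiency and performance effectively.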

Abstract: Diffusion Transformers, particularly for video generation, achieve remarkable quality but suffer from quadratic attention complexity, leading to prohibitive latency. Existing acceleration methods face a fundamental trade-off: dynamically estimating sparse attention patterns at each denoising step incurs high computational overhead and estimation errors, while static sparsity patterns remain fixed and often suboptimal throughout denoising. We identify a key structural property of diffusion attention, namely, its sparsity patterns exhibit strong temporal coherence across denoising steps. Tiles deemed non-essential at step $t$ typically remain so at step $t+\delta$. Leveraging this observation, we introduce LiteAttention, a method that exploits temporal coherence to enable evolutionary computation skips across the denoising sequence. By marking non-essential tiles early and propagating skip decisions forward, LiteAttention eliminates redundant attention computations without repeated profiling overheads, combining the adaptivity of dynamic methods with the efficiency of static ones. We implement a highly optimized LiteAttention kernel on top of FlashAttention and demonstrate substantial speedups on production video diffusion models, with no degradation in quality. The code and implementation details will be publicly released.
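The idea of marking tiles early and propagating skip decisions forward can be sketched on a toy example. This is only an illustration under stated assumptions: the per-tile importance scores, the fixed `threshold`, and the function name `run_with_skip_propagation` are hypothetical, and the actual LiteAttention kernel is built on FlashAttention rather than Python loops.

```python
def run_with_skip_propagation(tile_scores_per_step, threshold):
    """Count how many tiles are computed at each denoising step.

    tile_scores_per_step[t][i] is a toy importance score for tile i at step t.
    A tile whose score falls below `threshold` is marked non-essential and is
    skipped at every later step (the skip decision is propagated forward).
    """
    num_tiles = len(tile_scores_per_step[0])
    skipped = [False] * num_tiles
    computed_per_step = []
    for scores in tile_scores_per_step:
        computed = 0
        for i in range(num_tiles):
            if skipped[i]:
                continue  # decision inherited from an earlier step, no work done
            computed += 1  # the tile is computed at this step
            if scores[i] < threshold:
                skipped[i] = True  # mark as non-essential from the next step on
        computed_per_step.append(computed)
    return computed_per_step


# Hypothetical scores for 4 tiles over 3 denoising steps.
scores = [
    [0.9, 0.01, 0.5, 0.02],
    [0.8, 0.02, 0.03, 0.01],
    [0.7, 0.01, 0.02, 0.01],
]
print(run_with_skip_propagation(scores, threshold=0.05))  # prints [4, 2, 1]
```

In the toy run the amount of work shrinks from step to step without re-profiling the tiles that were already marked, which is the behavior the abstract attributes to temporal coherence.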


[37] From Retinal Pixels to Patients: Evolution of Deep Learning Research in Diabetic Retinopathy Screening cs.CV | cs.AI

Muskaan Chopra, Lorenz Sparrenberg, Armin Berger, Sarthak Khanna, Jan H. Terheyden

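TL;DR: This survey systematically reviews deep learning research on diabetic retinopathy screening from 2016 to 2025, consolidating more than 50 studies and over 20 datasets, and discusses methodological advances and the challenges of practical deployment.

Details

Motivation: Diabetic retinopathy (DR) is a leading cause of preventable blindness, so early screening is critical. Deep learning has substantially advanced DR screening over the past decade, but it still faces dataset imbalance, label scarcity, domain shift, and limited interpretability.

Result: Benchmark tables compare performance across datasets, and the discussion exposes open problems in multi-center validation and clinical trust.

Insight: Deep learning innovations from DR research apply broadly to medical imaging, but clinical deployment still requires solving reproducibility, privacy protection, and clinical trust.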

Abstract: Diabetic Retinopathy (DR) remains a leading cause of preventable blindness, with early detection critical for reducing vision loss worldwide. Over the past decade, deep learning has transformed DR screening, progressing from early convolutional neural networks trained on private datasets to advanced pipelines addressing class imbalance, label scarcity, domain shift, and interpretability. This survey provides the first systematic synthesis of DR research spanning 2016-2025, consolidating results from 50+ studies and over 20 datasets. We critically examine methodological advances, including self- and semi-supervised learning, domain generalization, federated training, and hybrid neuro-symbolic models, alongside evaluation protocols, reporting standards, and reproducibility challenges. Benchmark tables contextualize performance across datasets, while discussion highlights open gaps in multi-center validation and clinical trust. By linking technical progress with translational barriers, this work outlines a practical agenda for reproducible, privacy-preserving, and clinically deployable DR AI. Beyond DR, many of the surveyed innovations extend broadly to medical imaging at scale.


[38] S2D-ALIGN: Shallow-to-Deep Auxiliary Learning for Anatomically-Grounded Radiology Report Generation cs.CV | cs.AI | cs.CL

Jiechao Gao, Chang Liu, Yuangang Li

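TL;DR: S2D-ALIGN proposes a new SFT paradigm that uses auxiliary signals of multiple granularities to achieve anatomically-grounded radiology report generation, going beyond conventional instance-level alignment.

Details

Motivation: Existing radiology report generation methods focus on instance-level alignment of image-text pairs and overlook the lack of anatomically-grounded alignment, a problem aggravated by the templated nature of reports.

Result: Achieves state-of-the-art performance on the MIMIC-CXR and IU X-Ray benchmarks. Ablation studies validate the multi-stage, auxiliary-guided approach.

Insight: Multi-granularity auxiliary signals and a progressive alignment strategy can markedly improve grounding in complex multimodal generation tasks.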

Abstract: Radiology Report Generation (RRG) aims to automatically generate diagnostic reports from radiology images. To achieve this, existing methods have leveraged the powerful cross-modal generation capabilities of Multimodal Large Language Models (MLLMs), primarily focusing on optimizing cross-modal alignment between radiographs and reports through Supervised Fine-Tuning (SFT). However, by only performing instance-level alignment with the image-text pairs, the standard SFT paradigm fails to establish anatomically-grounded alignment, where the templated nature of reports often leads to sub-optimal generation quality. To address this, we propose S2D-Align, a novel SFT paradigm that establishes anatomically-grounded alignment by leveraging auxiliary signals of varying granularities. S2D-Align implements a shallow-to-deep strategy, progressively enriching the alignment process: it begins with the coarse radiograph-report pairing, then introduces reference reports for instance-level guidance, and ultimately utilizes key phrases to ground the generation in specific anatomical details. To bridge the different alignment stages, we introduce a memory-based adapter that empowers feature sharing, thereby integrating coarse and fine-grained guidance. For evaluation, we conduct experiments on the public MIMIC-CXR and IU X-Ray benchmarks, where S2D-Align achieves state-of-the-art performance compared to existing methods. Ablation studies validate the effectiveness of our multi-stage, auxiliary-guided approach, highlighting a promising direction for enhancing grounding capabilities in complex, multi-modal generation tasks.


[39] Evaluating Latent Generative Paradigms for High-Fidelity 3D Shape Completion from a Single Depth Image cs.CV

Matthias Humt, Ulrich Hillenbrand, Rudolph Triebel

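TL;DR: This paper compares two generative models (a diffusion model and an autoregressive model) on 3D shape completion. The diffusion model with continuous latents performs best, while the autoregressive model can match or exceed it when both use the same discrete latent space.

Details

Motivation: There is no consensus yet on which generative model suits which 3D task, particularly when conditioning on partial 3D data. The paper compares diffusion and autoregressive models on 3D shape modeling and completion.

Result: The diffusion model performs best on multi-modal shape completion; on the same discrete latent space, the autoregressive model can match or exceed its performance.

Insight: The choice of generative model and whether the latent space is continuous or discrete both significantly affect 3D shape completion performance.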

Abstract: While generative models have seen significant adoption across a wide range of data modalities, including 3D data, a consensus on which model is best suited for which task has yet to be reached. Further, conditioning signals such as text and images are frequently used to steer the generation process, whereas others, like partial 3D data, have not been thoroughly evaluated. In this work, we compare two of the most promising generative models, Denoising Diffusion Probabilistic Models and Autoregressive Causal Transformers, which we adapt for the tasks of generative shape modeling and completion. We conduct a thorough quantitative evaluation and comparison on both tasks, including a baseline discriminative model and an extensive ablation study. Our results show that (1) the diffusion model with continuous latents outperforms both the discriminative model and the autoregressive approach and delivers state-of-the-art performance on multi-modal shape completion from a single, noisy depth image under realistic conditions and (2) when compared on the same discrete latent space, the autoregressive model can match or exceed diffusion performance on these tasks.


[40] Phys-Liquid: A Physics-Informed Dataset for Estimating 3D Geometry and Volume of Transparent Deformable Liquids cs.CV | cs.RO

Ke Ma, Yizhou Fang, Jean-Baptiste Weibel, Shuai Tan, Xinggang Wang

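TL;DR: The paper introduces Phys-Liquid, a physics-informed dataset for estimating the 3D geometry and volume of transparent deformable liquids, addressing the lack of simulation data that captures realistic liquid behavior in dynamic scenes.

Details

Motivation: Estimating the geometry and volume of transparent liquids matters for precise robotic manipulation, but optical complexity and dynamic surface deformation mean that existing datasets fall short.

Result: Experiments show that the approach outperforms existing benchmarks in reconstructing liquid geometry and volume, with higher accuracy and consistency.

Insight: Phys-Liquid provides more realistic simulation data for transparent liquid perception tasks, supporting further progress in the area.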

Abstract: Estimating the geometric and volumetric properties of transparent deformable liquids is challenging due to optical complexities and dynamic surface deformations induced by container movements. Autonomous robots performing precise liquid manipulation tasks, such as dispensing, aspiration, and mixing, must handle containers in ways that inevitably induce these deformations, complicating accurate liquid state assessment. Current datasets lack comprehensive physics-informed simulation data representing realistic liquid behaviors under diverse dynamic scenarios. To bridge this gap, we introduce Phys-Liquid, a physics-informed dataset comprising 97,200 simulation images and corresponding 3D meshes, capturing liquid dynamics across multiple laboratory scenes, lighting conditions, liquid colors, and container rotations. To validate the realism and effectiveness of Phys-Liquid, we propose a four-stage reconstruction and estimation pipeline involving liquid segmentation, multi-view mask generation, 3D mesh reconstruction, and real-world scaling. Experimental results demonstrate improved accuracy and consistency in reconstructing liquid geometry and volume, outperforming existing benchmarks. The dataset and associated validation methods facilitate future advancements in transparent liquid perception tasks. The dataset and code are available at https://dualtransparency.github.io/Phys-Liquid/.


[41] SplineSplat: 3D Ray Tracing for Higher-Quality Tomography cs.CV | eess.IV | eess.SP

Youssef Haouchat, Sepand Kashani, Aleix Boquet-Pujadas, Philippe Thévenaz, Michael Unser

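TL;DR: SplineSplat proposes an efficient method for computing tomographic projections of 3D volumes, combining B-splines with a neural network in a 3D ray-tracing algorithm to improve reconstruction quality.

Details

Motivation: Traditional voxel-based tomographic reconstruction is limited in quality and efficiency; a method is needed that computes 3D line integrals efficiently and supports arbitrary projection geometries.

Result: In well-posed cases where the data suffice without regularization, the method achieves higher reconstruction quality than traditional voxel-based methods.

Insight: Combining B-splines with a neural network is an effective way to improve tomographic reconstruction quality, with a good balance between computational efficiency and accuracy.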

Abstract: We propose a method to efficiently compute tomographic projections of a 3D volume represented by a linear combination of shifted B-splines. To do so, we propose a ray-tracing algorithm that computes 3D line integrals with arbitrary projection geometries. One of the components of our algorithm is a neural network that computes the contribution of the basis functions efficiently. In our experiments, we consider well-posed cases where the data are sufficient for accurate reconstruction without the need for regularization. We achieve higher reconstruction quality than traditional voxel-based methods.


[42] A Space-Time Transformer for Precipitation Forecasting cs.CV

Levi Harris, Tianlong Chen

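TL;DR: SaTformer is a Transformer built on space-time attention that forecasts extreme precipitation from satellite radiances, targeting the computational cost of traditional numerical weather prediction models and their weak performance at nowcasting timescales.

Details

Motivation: Numerical weather prediction models are computationally expensive and degrade at short lead times, while video-understanding architectures remain underexplored in existing AI weather prediction methods.

Result: The model ranked first in the NeurIPS Weather4Cast 2025 Cumulative Rainfall challenge.

Insight: Treating extreme precipitation forecasting as a classification task with a video Transformer offers a new technical route for AI weather prediction.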

Abstract: Meteorological agencies around the world rely on real-time flood guidance to issue life-saving advisories and warnings. For decades, traditional numerical weather prediction (NWP) models have been state-of-the-art for precipitation forecasting. However, physically-parameterized models suffer from a few core limitations: first, solving PDEs to resolve atmospheric dynamics is computationally demanding, and second, these methods degrade in performance at nowcasting timescales (i.e., 0-4 hour lead-times). Motivated by these shortcomings, recent work proposes AI-weather prediction (AI-WP) alternatives that learn to emulate analysis data with neural networks. While these data-driven approaches have enjoyed enormous success across diverse spatial and temporal resolutions, applications of video-understanding architectures for weather forecasting remain underexplored. To address these gaps, we propose SaTformer: a video transformer built on full space-time attention that skillfully forecasts extreme precipitation from satellite radiances. Along with our novel architecture, we introduce techniques to tame long-tailed precipitation datasets. Namely, we reformulate precipitation regression into a classification problem, and employ a class-weighted loss to address label imbalances. Our model scored first place on the NeurIPS Weather4Cast 2025 Cumulative Rainfall challenge. Code and model weights are available: https://github.com/leharris3/satformer
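The reformulation of regression as classification with a class-weighted loss can be illustrated with a small sketch. The bin edges (in mm) and the inverse-frequency weighting below are assumptions chosen for the example; the abstract does not state which bins or weights SaTformer uses.

```python
import bisect
from collections import Counter


def to_class(rain_mm, bin_edges):
    """Map a rainfall amount to a class index using sorted bin edges."""
    return bisect.bisect_right(bin_edges, rain_mm)


def inverse_frequency_weights(labels, num_classes):
    """Weight each class by total / (num_classes * count), so rare classes weigh more."""
    counts = Counter(labels)
    total = len(labels)
    return [
        total / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


# Hypothetical bin edges in mm: [0, 0.1), [0.1, 1), [1, 10), [10, inf).
bin_edges = [0.1, 1.0, 10.0]
samples_mm = [0.0, 0.0, 0.0, 0.0, 0.5, 0.05, 2.0, 25.0]  # long-tailed toy data

labels = [to_class(x, bin_edges) for x in samples_mm]
weights = inverse_frequency_weights(labels, num_classes=len(bin_edges) + 1)
```

In this toy data the dominant "no rain" class receives a weight of 0.4 while each rare class receives 2.0, so errors on heavy-rain samples contribute more to a weighted cross-entropy loss.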


[43] Machine-Learning Based Detection of Coronary Artery Calcification Using Synthetic Chest X-Rays cs.CV

Dylan Saeed, Ramtin Gharleghi, Susann Bier, Sonit Singh

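TL;DR: This paper proposes a machine learning approach to detecting coronary artery calcification (CAC) from synthetic chest X-ray images. Digitally reconstructed radiographs (DRRs) provide reliable labels, offering a low-cost route to large-scale screening.

Details

Motivation: CAC is an important predictor of cardiovascular events, but CT is costly, and ordinary chest X-rays (CXRs) lack reliable labels, which limits deep learning. DRRs generate CXR-like images from CT while inheriting precise labels, making them a potential alternative.

Result: The best configuration achieves a mean AUC of 0.754, comparable to or better than prior CXR-based studies, supporting DRRs as a reliable label source.

Insight: DRRs offer a low-cost, scalable basis for CAC detection; transfer learning and domain adaptation could later extend it to real CXR data.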

Abstract: Coronary artery calcification (CAC) is a strong predictor of cardiovascular events, with CT-based Agatston scoring widely regarded as the clinical gold standard. However, CT is costly and impractical for large-scale screening, while chest X-rays (CXRs) are inexpensive but lack reliable ground truth labels, constraining deep learning development. Digitally reconstructed radiographs (DRRs) offer a scalable alternative by projecting CT volumes into CXR-like images while inheriting precise labels. In this work, we provide the first systematic evaluation of DRRs as a surrogate training domain for CAC detection. Using 667 CT scans from the COCA dataset, we generate synthetic DRRs and assess model capacity, super-resolution fidelity enhancement, preprocessing, and training strategies. Lightweight CNNs trained from scratch outperform large pretrained networks; pairing super-resolution with contrast enhancement yields significant gains; and curriculum learning stabilises training under weak supervision. Our best configuration achieves a mean AUC of 0.754, comparable to or exceeding prior CXR-based studies. These results establish DRRs as a scalable, label-rich foundation for CAC detection, while laying the foundation for future transfer learning and domain adaptation to real CXRs.


[44] VIDEOP2R: Video Understanding from Perception to Reasoning cs.CV | cs.AI | cs.LG

Yifan Jiang, Yueying Wang, Rui Zhao, Toufiq Parag, Zhimin Chen

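TL;DR: VideoP2R is a video understanding framework that improves the reasoning ability of video language models by modeling perception and reasoning as two distinct processes. It uses a two-stage approach of supervised fine-tuning and reinforcement learning, together with a new dataset and optimization algorithm, and achieves state-of-the-art results on multiple benchmarks.

Details

Motivation: Existing large video language models (LVLMs) fall short in reasoning, and extending conventional reinforcement fine-tuning (RFT) to video is challenging. VideoP2R addresses this by separating perception from reasoning.

Result: VideoP2R achieves state-of-the-art performance on six of seven video reasoning and understanding benchmarks, and ablation studies confirm the effectiveness of the method.

Insight: 1) Modeling perception and reasoning separately markedly improves video understanding; 2) the model's perception output carries sufficient information for downstream reasoning.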

Abstract: Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL), has shown promising results in improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that the model's perception output is information-sufficient for downstream reasoning.
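The abstract does not specify how PA-GRPO combines its two reward streams, so the following is only a sketch of the standard group-relative normalization used in GRPO, applied independently to hypothetical perception and reasoning rewards. The reward values and the function name `group_relative_advantages` are assumptions for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same video question.
perception_rewards = [1.0, 0.0, 0.0, 1.0]
reasoning_rewards = [1.0, 1.0, 0.0, 0.0]

# Each process gets its own advantage signal (an assumption about the setup).
perception_adv = group_relative_advantages(perception_rewards)
reasoning_adv = group_relative_advantages(reasoning_rewards)
```

Keeping the two streams separate means a response with accurate perception but flawed reasoning is credited for the former and penalized for the latter, rather than receiving one blended score.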


[45] Toward Generalized Detection of Synthetic Media: Limitations, Challenges, and the Path to Multimodal Solutions cs.CV | cs.NE

Redwan Hussain, Mizanur Rahman, Prithwiraj Bhattacharjee

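TL;DR: The paper reviews 24 studies on detecting AI-generated media, summarizes the limitations of existing methods, and proposes multimodal deep learning models as a future research direction for more robust and generalizable detection.

Details

Motivation: Rapid advances in AI-generated media (such as GANs and diffusion models) make it hard to tell real from synthetic content, and technologies like deepfakes can be misused. Effective detection is therefore essential, yet current methods generalize poorly and perform badly on multimodal data.

Result: Existing detection methods generalize poorly and struggle with multimodal data, particularly on unseen data and heavily modified content.

Insight: Multimodal deep learning models may be key to solving the generalization problem; future work should combine visual, temporal, and other modalities to make detection more robust.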

Abstract: Artificial intelligence (AI) in media has advanced rapidly over the last decade. The introduction of Generative Adversarial Networks (GANs) improved the quality of photorealistic image generation. Diffusion models later brought a new era of generative media. These advances made it difficult to separate real and synthetic content. The rise of deepfakes demonstrated how these tools could be misused to spread misinformation, political conspiracies, privacy violations, and fraud. For this reason, many detection models have been developed. They often use deep learning methods such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). These models search for visual, spatial, or temporal anomalies. However, such approaches often fail to generalize across unseen data and struggle with content from different models. In addition, existing approaches are ineffective on multimodal data and highly modified content. This study reviews twenty-four recent works on AI-generated media detection. Each study was examined individually to identify its contributions and weaknesses. The review then summarizes the common limitations and key challenges faced by current approaches. Based on this analysis, a research direction is suggested with a focus on multimodal deep learning models. Such models have the potential to provide more robust and generalized detection. This review offers future researchers a clear starting point for building stronger defenses against harmful synthetic media.


[46] Hindsight Distillation Reasoning with Knowledge Encouragement Preference for Knowledge-based Visual Question Answering cs.CV

Yu Zhao, Ying Zhang, Xuhui Sui, Baohang Zhou, Li Shen

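TL;DR: The paper proposes the HinD framework and the KEPO method, which draw on the internal knowledge reasoning ability of MLLMs to address the reasoning supervision problem in KBVQA and the misalignment between knowledge correctness and confidence.

Details

Motivation: Existing KBVQA methods lack explicit multi-step trajectories in their reasoning, leaving knowledge reasoning ability underused.

Result: Experiments on OK-VQA and A-OKVQA validate HinD, which outperforms existing methods without commercial model APIs or external knowledge.

Insight: Explicit reasoning trajectories and knowledge optimization substantially improve KBVQA performance, and even a small MLLM can reason efficiently.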

Abstract: Knowledge-based Visual Question Answering (KBVQA) necessitates external knowledge incorporation beyond cross-modal understanding. Existing KBVQA methods either utilize implicit knowledge in multimodal large language models (MLLMs) via in-context learning or explicit knowledge via retrieval augmented generation. However, their reasoning processes remain implicit, without explicit multi-step trajectories from MLLMs. To address this gap, we provide a Hindsight Distilled Reasoning (HinD) framework with Knowledge Encouragement Preference Optimization (KEPO), designed to elicit and harness internal knowledge reasoning ability in MLLMs. First, to tackle the reasoning supervision problem, we propose to emphasize the hindsight wisdom of MLLM by prompting a frozen 7B-size MLLM to complete the reasoning process between the question and its ground truth answer, constructing Hindsight-Zero training data. Then we self-distill Hindsight-Zero into Chain-of-Thought (CoT) Generator and Knowledge Generator, enabling the generation of sequential steps and discrete facts. Secondly, to tackle the misalignment between knowledge correctness and confidence, we optimize the Knowledge Generator with KEPO, preferring under-confident but helpful knowledge over the over-confident but unhelpful one. The generated CoT and sampled knowledge are then exploited for answer prediction. Experiments on OK-VQA and A-OKVQA validate the effectiveness of HinD, showing that HinD with elicited reasoning from 7B-size MLLM achieves superior performance without commercial model APIs or outside knowledge.


[47] OT-ALD: Aligning Latent Distributions with Optimal Transport for Accelerated Image-to-Image Translation cs.CV | cs.AI

Zhanpeng Wang, Shuting Cao, Yuhang Lu, Yuhan Li, Na Lei

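TL;DR: OT-ALD is an image-to-image translation framework based on optimal transport theory. It aligns latent distributions to address the low efficiency and trajectory deviation of DDIB, improving both translation speed and quality.

Details

Motivation: The Dual Diffusion Implicit Bridge (DDIB) suffers from low efficiency in image-to-image translation and from trajectory deviations caused by mismatched latent distributions. An improved method is needed that retains DDIB's strengths while fixing these problems.

Result: On four translation tasks across three high-resolution datasets, OT-ALD improves sampling efficiency by 20.29% and reduces the FID score by 2.6 on average.

Insight: Aligning latent distributions with optimal transport effectively improves the efficiency and quality of image translation and suggests an approach for similar problems.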

Abstract: The Dual Diffusion Implicit Bridge (DDIB) is an emerging image-to-image (I2I) translation method that preserves cycle consistency while achieving strong flexibility. It links two independently trained diffusion models (DMs) in the source and target domains by first adding noise to a source image to obtain a latent code, then denoising it in the target domain to generate the translated image. However, this method faces two key challenges: (1) low translation efficiency, and (2) translation trajectory deviations caused by mismatched latent distributions. To address these issues, we propose a novel I2I translation framework, OT-ALD, grounded in optimal transport (OT) theory, which retains the strengths of DDIB-based approach. Specifically, we compute an OT map from the latent distribution of the source domain to that of the target domain, and use the mapped distribution as the starting point for the reverse diffusion process in the target domain. Our error analysis confirms that OT-ALD eliminates latent distribution mismatches. Moreover, OT-ALD effectively balances faster image translation with improved image quality. Experiments on four translation tasks across three high-resolution datasets show that OT-ALD improves sampling efficiency by 20.29% and reduces the FID score by 2.6 on average compared to the top-performing baseline models.


[48] Reverberation: Learning the Latencies Before Forecasting Trajectories cs.CV

Conghao Wong, Ziqian Zou, Beihao Xia, Xinge You

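TL;DR: The paper proposes the Reverberation (Rev) model, inspired by reverberation curves in acoustics. By explicitly learning and predicting the latencies with which agents respond to trajectory-changing events, it improves the causal continuity and accuracy of trajectory prediction.

Details

Motivation: Current trajectory prediction methods do not explicitly account for the latency with which agents respond to events, which can yield discontinuous or implausible predictions. Inspired by acoustic reverberation curves, the paper aims to model and learn these latencies.

Result: Experiments show that Rev performs well in accuracy and in the interpretability of latency dynamics; qualitative analyses confirm the potential of the reverberation transform.

Insight: Dynamic modeling of latency is an important factor in trajectory prediction, and learning latencies explicitly improves the model's causality and controllability.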

Abstract: Bridging the past to the future, connecting agents both spatially and temporally, lies at the core of the trajectory prediction task. Despite great efforts, it remains challenging to explicitly learn and predict latencies, the temporal delays with which agents respond to different trajectory-changing events and adjust their future paths, whether on their own or interactively. Different agents may exhibit distinct latency preferences for noticing, processing, and reacting to any specific trajectory-changing event. The lack of consideration of such latencies may undermine the causal continuity of the forecasting system and also lead to implausible or unintended trajectories. Inspired by the reverberation curves in acoustics, we propose a new reverberation transform and the corresponding Reverberation (Rev for short) trajectory prediction model, which simulates and predicts the different latency preferences of each agent, as well as their stochasticity, by using two explicit and learnable reverberation kernels, allowing for controllable trajectory prediction based on these forecasted latencies. Experiments on multiple datasets, covering both pedestrians and vehicles, demonstrate that Rev achieves competitive accuracy while revealing interpretable latency dynamics across agents and scenarios. Qualitative analyses further verify the properties of the proposed reverberation transform, highlighting its potential as a general latency modeling approach.


[49] Explainable Deep Convolutional Multi-Type Anomaly Detection cs.CV

Alex George, Lyudmila Mihaylova, Sean Anderson

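TL;DR: The paper proposes MultiTypeFCDD, a lightweight convolutional framework for explainable multi-type anomaly detection. It trains with image-level labels only and produces multi-channel heatmaps in which each channel corresponds to an anomaly type.

Details

Motivation: Existing explainable anomaly detection methods cannot distinguish anomaly types and require a separate model per object category, which is computationally costly. MultiTypeFCDD targets this problem and suits resource-constrained settings.

Result: On the Real-IAD dataset it comes close to complex SOTA models with far fewer parameters and much shorter inference time.

Insight: Its lightweight design suits real-time or embedded systems and offers an efficient, practical solution for multi-type anomaly detection.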

Abstract: Most explainable anomaly detection methods identify anomalies but lack the capability to differentiate the type of anomaly. Furthermore, they often require the costly training and maintenance of separate models for each object category. The lack of specificity is a significant research gap, as identifying the type of anomaly (e.g., “Crack” vs. “Scratch”) is crucial for accurate diagnosis that facilitates cost-saving operational decisions across diverse application domains. While some recent large-scale Vision-Language Models (VLMs) have begun to address this, they are computationally intensive and memory-heavy, restricting their use in real-time or embedded systems. We propose MultiTypeFCDD, a simple and lightweight convolutional framework designed as a practical alternative for explainable multi-type anomaly detection. MultiTypeFCDD uses only image-level labels to learn and produce multi-channel heatmaps, where each channel is trained to correspond to a specific anomaly type. The model functions as a single, unified framework capable of differentiating anomaly types across multiple object categories, eliminating the need to train and manage separate models for each object category. We evaluate the proposed method on the Real-IAD dataset, where it delivers results competitive with state-of-the-art complex models at significantly reduced parametric load and inference times. This makes it a highly practical and viable solution for real-world applications where computational resources are tightly constrained.


[50] CATS-V2V: A Real-World Vehicle-to-Vehicle Cooperative Perception Dataset with Complex Adverse Traffic Scenarios cs.CV

Hangyu Li, Bofeng Cao, Zhaohui Liang, Wuzhen Li, Juyoung Oh

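TL;DR: CATS-V2V is the first real-world vehicle-to-vehicle (V2V) cooperative perception dataset for complex adverse traffic scenarios. It covers a range of weather and lighting conditions with LiDAR point clouds, multi-view camera images, and high-precision GNSS/IMU records, and provides time-consistent 3D annotations and a 4D BEV representation to advance autonomous driving research.

Details

Motivation: Existing datasets concentrate on ordinary traffic scenarios, and cooperative perception data under complex adverse conditions is scarce, which holds back autonomous driving. CATS-V2V fills this gap.

Result: The dataset contains 60K frames of LiDAR point clouds, 1.26M camera images, and 750K GNSS/IMU records, and the authors describe it as the largest and highest-quality dataset of its kind to date.

Insight: CATS-V2V lays a foundation for cooperative perception research in complex scenarios; its multimodal data and high-precision annotations should help advance autonomous driving.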

Abstract: Vehicle-to-Vehicle (V2V) cooperative perception has great potential to enhance autonomous driving performance by overcoming perception limitations in complex adverse traffic scenarios (CATS). Meanwhile, data serves as the fundamental infrastructure for modern autonomous driving AI. However, due to stringent data collection requirements, existing datasets focus primarily on ordinary traffic scenarios, constraining the benefits of cooperative perception. To address this challenge, we introduce CATS-V2V, the first-of-its-kind real-world dataset for V2V cooperative perception under complex adverse traffic scenarios. The dataset was collected by two hardware time-synchronized vehicles, covering 10 weather and lighting conditions across 10 diverse locations. The 100-clip dataset includes 60K frames of 10 Hz LiDAR point clouds and 1.26M multi-view 30 Hz camera images, along with 750K anonymized yet high-precision RTK-fixed GNSS and IMU records. Correspondingly, we provide time-consistent 3D bounding box annotations for objects, as well as static scenes to construct a 4D BEV representation. On this basis, we propose a target-based temporal alignment method, ensuring that all objects are precisely aligned across all sensor modalities. We hope that CATS-V2V, the largest-scale, most supportive, and highest-quality dataset of its kind to date, will benefit the autonomous driving community in related tasks.


[51] Refine and Align: Confidence Calibration through Multi-Agent Interaction in VQA cs.CV | cs.AI | cs.LG

Ayush Pandey, Jai Bardhan, Ishita Jain, Ramya S Hebbalaguppe, Rohan Raju Dhanakshirur

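TL;DR: This paper proposes AlignVQA, a framework based on multi-agent interaction that improves the confidence calibration of VQA systems through a debate process, and introduces a new differentiable calibration-aware loss function, aligncal.

Details

Motivation: Modern VQA systems are increasingly used in high-stakes domains such as medical diagnostics and autonomous navigation, but the reliability of their confidence estimates is under-examined, and the systems tend to be overconfident.

Result: Experiments on multiple VQA benchmark datasets show that the method substantially reduces calibration discrepancies and makes confidence estimates more accurate.

Insight: The better calibrated the specialized agents are, the better the overall confidence alignment; combining the debate process with the calibration-aware loss effectively improves system reliability.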

Abstract: In the context of Visual Question Answering (VQA) and Agentic AI, calibration refers to how closely an AI system’s confidence in its answers reflects their actual correctness. This aspect becomes especially important when such systems operate autonomously and must make decisions under visual uncertainty. While modern VQA systems, powered by advanced vision-language models (VLMs), are increasingly used in high-stakes domains like medical diagnostics and autonomous navigation due to their improved accuracy, the reliability of their confidence estimates remains under-examined. In particular, these systems often produce overconfident responses. To address this, we introduce AlignVQA, a debate-based multi-agent framework in which diverse specialized VLMs, each following distinct prompting strategies, generate candidate answers and then engage in two-stage interaction: generalist agents critique, refine and aggregate these proposals. This debate process yields confidence estimates that more accurately reflect the model’s true predictive performance. We find that more calibrated specialized agents produce better aligned confidences. Furthermore, we introduce a novel differentiable calibration-aware loss function called aligncal, designed to fine-tune the specialized agents by minimizing an upper bound on the calibration error. This objective explicitly improves the fidelity of each agent’s confidence estimates. Empirical results across multiple benchmark VQA datasets substantiate the efficacy of our approach, demonstrating substantial reductions in calibration discrepancies.
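To make the notion of calibration concrete, the sketch below computes the Expected Calibration Error (ECE), a widely used binned measure of the gap between confidence and accuracy. This is an assumption for illustration: the abstract refers to calibration error in general and does not say that ECE is the metric used, and the function name and bin count are chosen for the example.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Binned ECE: weighted average of |mean confidence - accuracy| over bins."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# An overconfident toy model: 95% confidence but only half the answers correct.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, True, False, False]
)
```

For the toy model above the ECE is about 0.45, reflecting the gap between 95% stated confidence and 50% accuracy; a perfectly calibrated model would score 0.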


[52] Dynamic Gaussian Scene Reconstruction from Unsynchronized Videos cs.CV

Zhixin Xu, Hengyu Zhou, Yuan Liu, Wenhan Xue, Hao Pan

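TL;DR: Proposes a novel temporal alignment strategy for high-quality dynamic Gaussian scene reconstruction from unsynchronized multi-view videos.

Details

Motivation: Existing dynamic scene reconstruction methods such as 4D Gaussian Splatting usually assume temporally synchronized input video streams. In practice this assumption often fails because of camera trigger delays or independent recording setups, which degrades reconstruction quality.

Result: Experiments show that the method handles temporally misaligned videos effectively and significantly improves baseline methods.

Insight: Temporally aligned dynamic scene reconstruction has real practical value for unsynchronized multi-view data and broadens the applicability of existing techniques.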

Abstract: Multi-view video reconstruction plays a vital role in computer vision, enabling applications in film production, virtual reality, and motion analysis. While recent advances such as 4D Gaussian Splatting (4DGS) have demonstrated impressive capabilities in dynamic scene reconstruction, they typically rely on the assumption that input video streams are temporally synchronized. However, in real-world scenarios, this assumption often fails due to factors like camera trigger delays or independent recording setups, leading to temporal misalignment across views and reduced reconstruction quality. To address this challenge, a novel temporal alignment strategy is proposed for high-quality 4DGS reconstruction from unsynchronized multi-view videos. Our method features a coarse-to-fine alignment module that estimates and compensates for each camera’s time shift. The method first determines a coarse, frame-level offset and then refines it to achieve sub-frame accuracy. This strategy can be readily integrated as a module into existing 4DGS frameworks, enhancing their robustness when handling asynchronous data. Experiments show that our approach effectively processes temporally misaligned videos and significantly enhances baseline methods.
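The coarse-to-fine idea (frame-level offset first, sub-frame refinement second) can be illustrated with a generic signal-processing stand-in. This is not the paper's alignment module: the sketch assumes each camera yields a 1D motion signal, searches integer lags by mean squared difference, and refines the result with a parabolic fit. All names and the synthetic sine signal are hypothetical.

```python
import math


def mean_squared_diff(a, b, lag):
    """Mean squared difference between a[t] and b[t + lag] over their overlap."""
    pairs = [(a[t], b[t + lag]) for t in range(len(a)) if 0 <= t + lag < len(b)]
    return sum((x - y) ** 2 for x, y in pairs) / len(pairs)


def estimate_offset(a, b, max_lag):
    """Estimate the time shift of b relative to a, in frames."""
    # Coarse stage: best integer (frame-level) lag.
    costs = {lag: mean_squared_diff(a, b, lag) for lag in range(-max_lag, max_lag + 1)}
    coarse = min(costs, key=costs.get)
    if abs(coarse) == max_lag:
        return float(coarse)  # no neighbors on both sides to refine with
    # Fine stage: vertex of the parabola through the three costs around the minimum.
    left, mid, right = costs[coarse - 1], costs[coarse], costs[coarse + 1]
    curvature = left - 2.0 * mid + right
    if curvature <= 0.0:
        return float(coarse)
    return coarse + 0.5 * (left - right) / curvature


# Synthetic example: camera B observes the same motion 2.4 frames late.
true_shift = 2.4
signal_a = [math.sin(0.3 * t) for t in range(60)]
signal_b = [math.sin(0.3 * (t - true_shift)) for t in range(60)]
estimate = estimate_offset(signal_a, signal_b, max_lag=5)
```

On this synthetic signal the coarse stage selects a lag of 2 frames and the refinement moves the estimate close to the true 2.4-frame shift.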


[53] Viper-F1: Fast and Fine-Grained Multimodal Understanding with Cross-Modal State-Space Modulation cs.CV

Quoc-Huy Trinh, Mustapha Abdullahi, Do Duy Hung Trinh, Bo Zhao, Debesh Jha

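TL;DR: Viper-F1 is an efficient, fine-grained multimodal understanding model based on cross-modal state-space modulation. By introducing Liquid State-Space Dynamics and a Token-Grid Correlation Module, it achieves linear-time inference and precise visual grounding.

Details

Motivation: Current multimodal large language models (MLLMs) are hard to deploy in resource-constrained scenarios because of their high computational cost and their difficulty capturing fine-grained visual regions.

Result: Across multiple benchmarks, Viper-F1 shows efficient inference and accurate fine-grained understanding.

Insight: Combining state-space dynamics with a lightweight correlation module offers a new approach to efficient multimodal understanding.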

Abstract: Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as robotic manipulation, personal assistants, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, leading to degraded performance on fine-grained reasoning tasks, which limits their effectiveness in the real world. To address these issues, we introduce Viper-F1, a Hybrid State-Space Vision-Language Model that replaces attention with efficient Liquid State-Space Dynamics. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates the state-space dynamics via FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Viper-F1 achieves accurate, fine-grained understanding with significantly improved efficiency.


[54] A Comparison of Lightweight Deep Learning Models for Particulate-Matter Nowcasting in the Indian Subcontinent & Surrounding Regions cs.CVPDF

Ansh Kushwaha, Kaushik Gopalan

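TL;DR: This paper presents a lightweight deep learning framework for 6-hour nowcasting of particulate matter (PM₁, PM₂.₅, PM₁₀) over the Indian subcontinent and surrounding regions, substantially improving forecast accuracy and reducing systematic bias.

Details

Motivation: To address the high computational cost and latency of particulate-matter forecasting, the authors propose lightweight deep learning models that aim for fast, accurate short-range forecasts on a limited spatial domain.

Result: The proposed models substantially outperform the Aurora baseline on metrics including RMSE, MAE, and SSIM, demonstrating the effectiveness of lightweight models for short-range forecasting.

Insight: Compact, specialized models capture the spatiotemporal distribution of particulate concentrations more efficiently on a limited spatial domain, making them suitable for resource-constrained real-time forecasting.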

Abstract: This paper is a submission for the Weather4Cast 2025 complementary Pollution Task and presents an efficient framework for 6-hour lead-time nowcasting of PM$_{1}$, PM$_{2.5}$, and PM$_{10}$ across the Indian subcontinent and surrounding regions. The proposed approach leverages analysis fields from the Copernicus Atmosphere Monitoring Service (CAMS) Global Atmospheric Composition Forecasts at 0.4 degree resolution. A 256x256 spatial region, covering 28.4S-73.6N and 32E-134.0E, is used as the model input, while predictions are generated for the central 128x128 area spanning 2.8S-48N and 57.6E-108.4E, ensuring an India-centric forecast domain with sufficient synoptic-scale context. Models are trained on CAMS analyses from 2021-2023 using a shuffled 90/10 split and independently evaluated on 2024 data. Three lightweight parameter-specific architectures are developed to improve accuracy, minimize systematic bias, and enable rapid inference. Evaluation using RMSE, MAE, Bias, and SSIM demonstrates substantial performance gains over the Aurora foundation model, underscoring the effectiveness of compact & specialized deep learning models for short-range forecasts on limited spatial domains.
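
Worked check of the grid geometry quoted in the abstract: with 0.4 degree spacing, 256 grid points span 255 intervals, or 102 degrees, and the central 128x128 window starts 64 points in from each edge. The sketch below rebuilds both axes and confirms the stated corner coordinates. The south-to-north and west-to-east ordering is an assumption; the abstract does not describe the array layout.

```python
RES = 0.4                      # grid spacing in degrees (from the abstract)
N_IN, N_OUT = 256, 128         # input and output window sizes (from the abstract)
LAT_MIN, LON_MIN = -28.4, 32.0 # 28.4S and 32E as signed degrees

def axis(start, n, step=RES):
    """Coordinates of n grid points starting at `start`, rounded to 0.1 degree."""
    return [round(start + i * step, 1) for i in range(n)]

lats = axis(LAT_MIN, N_IN)
lons = axis(LON_MIN, N_IN)

offset = (N_IN - N_OUT) // 2   # points trimmed from each side of the input
center_lats = lats[offset:offset + N_OUT]
center_lons = lons[offset:offset + N_OUT]

print(lats[0], lats[-1], lons[0], lons[-1])
print(center_lats[0], center_lats[-1], center_lons[0], center_lons[-1])
```

The 64-point margin on each side corresponds to 25.6 degrees of surrounding context around the forecast domain.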


[55] Computationally-efficient deep learning models for nowcasting of precipitation: A solution for the Weather4cast 2025 challenge cs.CVPDF

Anushree Bhuskute, Kaushik Gopalan, Jeet Shah

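TL;DR: This paper proposes a ConvGRU-based transfer-learning framework for short-term rainfall prediction in the Weather4Cast 2025 competition. It takes SEVIRI infrared channel data as input and uses a two-stage training strategy to estimate rainfall up to four hours ahead. The model placed 2nd in the cumulative rainfall task.

Details

Motivation: To address the computational efficiency of short-term rainfall prediction and provide an efficient deep learning model for the Weather4Cast 2025 competition.

Result: 2nd place in the cumulative rainfall task of Weather4Cast 2025; in the event-prediction task, scores were close to those of the baseline model.

Insight: Transfer learning with a two-stage training strategy works well for weather prediction, and the results indicate that ConvGRU is effective at capturing spatiotemporal patterns.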

Abstract: This study presents a transfer-learning framework based on Convolutional Gated Recurrent Units (ConvGRU) for short-term rainfall prediction in the Weather4Cast 2025 competition. A single SEVIRI infrared channel (10.8 μm wavelength) is used as input, which consists of four observations over a one-hour period. A two-stage training strategy is applied to generate rainfall estimates up to four hours ahead. In the first stage, ConvGRU is trained to forecast the brightness temperatures from SEVIRI, enabling the model to capture relevant spatiotemporal patterns. In the second stage, an empirically derived nonlinear transformation maps the predicted fields to OPERA-compatible rainfall rates. For the event-prediction task, the transformed rainfall forecasts are processed using 3D event detection followed by spatiotemporal feature extraction to identify and characterize precipitation events. Our submission achieved 2nd place in the cumulative rainfall task. Furthermore, the same model was used out of the box for the event-prediction task and achieved scores similar to those of the competition's baseline model.


[56] Geospatial Chain of Thought Reasoning for Enhanced Visual Question Answering on Satellite Imagery cs.CVPDF

Shambhavi Shanker, Manikandan Padmanaban, Jagabondhu Hazra

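TL;DR: This paper proposes a visual question answering (VQA) framework that combines chain-of-thought (CoT) reasoning with Direct Preference Optimization (DPO) to better answer complex geospatial questions on satellite imagery.

Details

Motivation: Existing VQA models lack structured reasoning when handling satellite imagery and struggle with complex geospatial queries.

Result: CoT supervision improves accuracy by 34.9% over direct baselines, and DPO yields further gains in accuracy and reasoning quality.

Insight: Adding structured reasoning and preference optimization makes VQA models more capable in geospatial analysis and climate-related applications.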

Abstract: Geospatial chain of thought (CoT) reasoning is essential for advancing Visual Question Answering (VQA) on satellite imagery, particularly in climate related applications such as disaster monitoring, infrastructure risk assessment, urban resilience planning, and policy support. Existing VQA models enable scalable interpretation of remote sensing data but often lack the structured reasoning required for complex geospatial queries. We propose a VQA framework that integrates CoT reasoning with Direct Preference Optimization (DPO) to improve interpretability, robustness, and accuracy. By generating intermediate rationales, the model better handles tasks involving detection, classification, spatial relations, and comparative analysis, which are critical for reliable decision support in high stakes climate domains. Experiments show that CoT supervision improves accuracy by 34.9% over direct baselines, while DPO yields additional gains in accuracy and reasoning quality. The resulting system advances VQA for multispectral Earth observation by enabling richer geospatial reasoning and more effective climate use cases.
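
For readers unfamiliar with DPO, a minimal sketch of the standard per-pair DPO loss follows. It is the generic objective, -log sigmoid(beta * margin), where the margin compares how much the policy prefers the chosen response over the rejected one relative to a frozen reference model. How the paper builds its preference pairs is not stated in the abstract; the inputs here are assumed to be summed sequence log-probabilities, and `beta=0.1` is an arbitrary illustrative value.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin))."""
    # Implicit reward of each response is its log-ratio against the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow for large |z|.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# No preference signal yet: policy and reference agree on both responses.
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)
# Policy already favors the chosen rationale more than the reference does.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss decreases as the policy raises the chosen response's likelihood relative to the rejected one, which is how preference pairs over rationales can shape reasoning quality.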


[57] Questioning the Stability of Visual Question Answering cs.CV | cs.LGPDF

Amir Rosenfeld, Neta Glazer, Ethan Fetaya

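TL;DR: This paper presents the first large-scale study of the robustness of vision-language models (VLMs) to small visual and textual perturbations, finding that even state-of-the-art models are highly sensitive to pixel-level changes and harmless paraphrases. Stable samples are generally answered more accurately, and the stability of small models can predict the correctness of larger ones.

Details

Motivation: Despite remarkable progress in VLMs, their reliability under small, meaning-preserving input changes remains poorly understood. The paper aims to fill this gap.

Result: Modern VLMs are highly sensitive to minor perturbations, and stable samples are more likely to be answered correctly. The stability patterns of small models predict the correctness of large models with high precision.

Insight: Current VLMs are fundamentally fragile; future robustness evaluations should focus on the invariances that models ought to uphold rather than only on adversarial perturbations.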

Abstract: Visual Language Models (VLMs) have achieved remarkable progress, yet their reliability under small, meaning-preserving input changes remains poorly understood. We present the first large-scale, systematic study of VLM robustness to benign visual and textual perturbations: pixel-level shifts, light geometric transformations, padded rescaling, paraphrasing, and multilingual rewrites that do not alter the underlying semantics of an image-question pair. Across a broad set of models and datasets, we find that modern VLMs are highly sensitive to such minor perturbations: a substantial fraction of samples change their predicted answer under at least one visual or textual modification. We characterize how this instability varies across perturbation types, question categories, and models, revealing that even state-of-the-art systems (e.g., GPT-4o, Gemini 2.0 Flash) frequently fail under shifts as small as a few pixels or harmless rephrasings. We further show that sample-level stability serves as a strong indicator of correctness: stable samples are consistently far more likely to be answered correctly. Leveraging this, we demonstrate that the stability patterns of small, accessible open-source models can be used to predict the correctness of much larger closed-source models with high precision. Our findings expose a fundamental fragility in current VLMs and highlight the need for robustness evaluations that go beyond adversarial perturbations, focusing instead on invariances that models should reliably uphold.
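
A small sketch of the sample-level stability idea described above, under stated assumptions: a sample counts as stable when the predicted answer is unchanged under every perturbation, and answers are compared by exact string match. The record layout (`answer`, `perturbed`, `truth`) is hypothetical, and the paper's actual matching rule may differ.

```python
def is_stable(original_answer, perturbed_answers):
    """Stable means no perturbation changed the predicted answer."""
    return all(a == original_answer for a in perturbed_answers)

def accuracy_by_stability(samples):
    """Accuracy of the unperturbed prediction, split by stability."""
    counts = {"stable": [0, 0], "unstable": [0, 0]}  # [correct, total]
    for s in samples:
        key = "stable" if is_stable(s["answer"], s["perturbed"]) else "unstable"
        counts[key][0] += int(s["answer"] == s["truth"])
        counts[key][1] += 1
    return {k: (c / n if n else None) for k, (c, n) in counts.items()}

samples = [
    {"answer": "cat", "perturbed": ["cat", "cat"], "truth": "cat"},
    {"answer": "dog", "perturbed": ["dog", "dog"], "truth": "dog"},
    {"answer": "red", "perturbed": ["red", "blue"], "truth": "red"},
    {"answer": "two", "perturbed": ["three", "two"], "truth": "three"},
]
report = accuracy_by_stability(samples)
```

Comparing the two accuracies is the simplest way to test the claim that stability indicates correctness on one's own evaluation set.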


[58] One-to-N Backdoor Attack in 3D Point Cloud via Spherical Trigger cs.CVPDF

Dongmei Shan, Wei Lian, Chongxia Wang

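TL;DR: This paper proposes a new one-to-N backdoor attack framework for 3D point clouds. It uses a spherical trigger as a parameter space so that a single trigger design can encode multiple target classes, with attack success rates of up to 100%.

Details

Motivation: Existing backdoor attacks on 3D point clouds support only a one-to-one paradigm, which limits the flexibility and threat potential of the attack. The authors introduce a one-to-N attack to extend backdoor capabilities, making such attacks more damaging and stealthy in 3D vision systems.

Result: Experiments show attack success rates of up to 100% with no loss of accuracy on clean data, demonstrating the method's effectiveness and stealth.

Insight: The spatial properties of a spherical trigger provide a flexible parameter space for backdoor attacks; the work offers an important benchmark and research direction for securing future 3D intelligent systems.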

Abstract: Backdoor attacks represent a critical threat to deep learning systems, particularly in safety-sensitive 3D domains such as autonomous driving and robotics. However, existing backdoor attacks for 3D point clouds have been limited to a rigid one-to-one paradigm. To address this, we present the first one-to-N backdoor framework for 3D vision, based on a novel, configurable spherical trigger. Our key insight is to leverage the spatial properties of spheres as a parameter space, allowing a single trigger design to encode multiple target classes. We establish a theoretical foundation for one-to-N backdoor attacks in 3D, demonstrating that poisoned models can map distinct trigger configurations to different target labels. Experimental results systematically validate this conclusion across multiple datasets and model architectures, achieving high attack success rates (up to 100%) while maintaining accuracy on clean data. This work establishes a crucial benchmark for multi-target threats in 3D vision and provides the foundational understanding needed to secure future 3D-driven intelligent systems.


[59] MAFM^3: Modular Adaptation of Foundation Models for Multi-Modal Medical AI cs.CVPDF

Mohammad Areeb Qazi, Munachiso S Nwadike, Ibrahim Almakky, Mohammad Yaqub, Numan Saeed

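TL;DR: MAFM^3 is a lightweight modular framework that extends a foundation model to the multi-task and multi-modal needs of medical imaging; experiments show improved performance on prognosis and segmentation.

Details

Motivation: Medical imaging data are scarce, and training separately for each task or modality is costly, so a unified, efficient adaptation framework is needed.

Result: Adding prognosis and segmentation modules to a chest CT classification foundation model improves performance on both tasks; incorporating PET scans raises the Dice score by 5%.

Insight: With a modular design, foundation models can go beyond their initial training scope and serve multi-task, multi-modal scenarios in medical imaging.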

Abstract: Foundational models are trained on extensive datasets to capture the general trends of a domain. However, in medical imaging, the scarcity of data makes pre-training for every domain, modality, or task challenging. Instead of building separate models, we propose MAFM^3 (Modular Adaptation of Foundation Models for Multi-Modal Medical AI), a framework that enables a single foundation model to expand into diverse domains, tasks, and modalities through lightweight modular components. These components serve as specialized skill sets that allow the system to flexibly activate the appropriate capability at inference time, depending on the input type or clinical objective. Unlike conventional adaptation methods that treat each new task or modality in isolation, MAFM^3 provides a unified and expandable framework for efficient multitask and multimodality adaptation. Empirically, we validate our approach by adapting a chest CT foundation model initially trained for classification into prognosis and segmentation modules. Our results show improved performance on both tasks. Furthermore, by incorporating PET scans, MAFM^3 improved the Dice score by 5% compared to the respective baselines. These findings establish that foundation models, when equipped with modular components, are not inherently constrained to their initial training scope but can evolve into multitask, multimodality systems for medical imaging. The code implementation of this work can be found at https://github.com/Areeb2735/CTscan_prognosis_VLM
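
For reference, a minimal sketch of the Dice score used to report the segmentation gain: Dice = 2|A ∩ B| / (|A| + |B|) for binary masks A and B. The flat 0/1 list representation and the convention of returning 1.0 when both masks are empty are assumptions made for this example, not details from the paper.

```python
def dice_score(pred, target):
    """Dice overlap between two flat binary masks given as sequences of 0/1."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total

pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
score = dice_score(pred, target)
```

Note that the abstract does not say whether the 5% improvement is absolute (percentage points) or relative to the baseline Dice score.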


[60] RealisticDreamer: Guidance Score Distillation for Few-shot Gaussian Splatting cs.CVPDF

Ruocheng Wu, Haolan He, Yufei Wang, Zhihao Li, Bihan Wen

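TL;DR: This paper proposes the Guidance Score Distillation (GSD) framework, which extracts multi-view consistency priors from pretrained video diffusion models (VDMs) to mitigate overfitting of 3D Gaussian Splatting (3DGS) under sparse training views, and corrects the noise prediction using depth and semantic-feature guidance.

Details

Motivation: 3DGS tends to overfit with sparse training views, mainly because intermediate-view supervision is missing. Inspired by video diffusion models (VDMs), the authors propose the GSD framework to address this.

Result: The method outperforms existing approaches on multiple datasets.

Insight: 1. Video diffusion models provide effective multi-view consistency priors; 2. Guidance that combines depth and semantic features steers the 3DGS optimization in the correct direction.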

Abstract: 3D Gaussian Splatting (3DGS) has recently gained great attention in 3D scene representation for its high-quality real-time rendering capabilities. However, when the input comprises sparse training views, 3DGS is prone to overfitting, primarily due to the lack of intermediate-view supervision. Inspired by the recent success of Video Diffusion Models (VDM), we propose a framework called Guidance Score Distillation (GSD) to extract the rich multi-view consistency priors from pretrained VDMs. Building on the insights from Score Distillation Sampling (SDS), GSD supervises rendered images from multiple neighboring views, guiding the Gaussian splatting representation towards the generative direction of VDM. However, the generative direction often involves object motion and random camera trajectories, making it challenging for direct supervision in the optimization process. To address this problem, we introduce a unified guidance form to correct the noise prediction result of VDM. Specifically, we incorporate both a depth warp guidance based on real depth maps and a guidance based on semantic image features, ensuring that the score update direction from VDM aligns with the correct camera pose and accurate geometry. Experimental results show that our method outperforms existing approaches across multiple datasets.


[61] Positional Bias in Multimodal Embedding Models: Do They Favor the Beginning, the Middle, or the End? cs.CVPDF

Kebin Wu, Fatima Albreiki

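TL;DR: This paper investigates positional bias in multimodal embedding models, showing that text and image encoders favor different input positions, and analyzes the causes.

Details

Motivation: Positional bias has been studied extensively in text generation models, but its presence and effects in multimodal representation models remain unclear. The paper aims to fill this gap.

Result: Positional bias is prevalent in multimodal models: text encoders favor the beginning of the input, while image encoders favor both the beginning and the end.

Insight: Positional bias arises from a combination of factors, suggesting that future multimodal model design should pay more attention to balance across input positions.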

Abstract: Positional bias - where models overemphasize certain positions regardless of content - has been shown to negatively impact model performance across various tasks. While recent research has extensively examined positional bias in text generation models, its presence and effects in representation models remain underexplored. Even less is known about such biases in multimodal models. In this work, we investigate positional bias in multimodal representation models, specifically in the context of image-text retrieval. We begin by distinguishing between context importance and positional bias, and then assess the presence and extent of positional bias across different models and datasets. Our experiments demonstrate that positional bias is prevalent in multimodal models, but manifests differently across modalities: text encoders tend to exhibit bias toward the beginning of the input, whereas image encoders show bias at both the beginning and end. Furthermore, we find that this bias arises from, or is amplified by, a combination of factors, including the positional encoding scheme, training loss, context importance, and the nature of using image-text pairs in multimodal training.


[62] 3D Gaussian and Diffusion-Based Gaze Redirection cs.CV | cs.AIPDF

Abiram Panchalingam, Indu Bodala, Stuart Middleton

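TL;DR: This paper proposes DiT-Gaze, a framework that combines a Diffusion Transformer (DiT) with a weak supervision strategy to improve gaze redirection in 3D Gaussian Splatting models. An orthogonality constraint loss disentangles gaze, head pose, and expression, substantially reducing gaze error.

Details

Motivation: High-fidelity gaze redirection is critical for generating augmented data that improves the generalization of gaze estimators. Current 3D Gaussian Splatting models struggle to render subtle, continuous gaze shifts.

Result: DiT-Gaze reduces gaze error by 4.1% to 6.353 degrees, outperforming existing methods.

Insight: Combining DiT with Gaussian splatting offers a new way to handle continuity in gaze redirection, and the orthogonality constraint loss may be useful in multi-task learning.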

Abstract: High-fidelity gaze redirection is critical for generating augmented data to improve the generalization of gaze estimators. 3D Gaussian Splatting (3DGS) models like GazeGaussian represent the state-of-the-art but can struggle with rendering subtle, continuous gaze shifts. In this paper, we propose DiT-Gaze, a framework that enhances 3D gaze redirection models using a novel combination of Diffusion Transformer (DiT), weak supervision across gaze angles, and an orthogonality constraint loss. DiT allows higher-fidelity image synthesis, while our weak supervision strategy using synthetically generated intermediate gaze angles provides a smooth manifold of gaze directions during training. The orthogonality constraint loss mathematically enforces the disentanglement of internal representations for gaze, head pose, and expression. Comprehensive experiments show that DiT-Gaze sets a new state-of-the-art in both perceptual quality and redirection accuracy, reducing the state-of-the-art gaze error by 4.1% to 6.353 degrees, providing a superior method for creating synthetic training data. Our code and models will be made available for the research community to benchmark against.
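
Illustrative sketch only: the abstract does not give the form of the orthogonality constraint loss. One common formulation, assumed here, penalizes the squared cosine similarity between every pair of factor embeddings (gaze, head pose, expression), so the loss is zero exactly when the embeddings are mutually orthogonal. The function names and the `eps` guard are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v, eps=1e-12):
    """Cosine similarity with a small guard against zero-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)

def orthogonality_loss(embeddings):
    """Sum of squared cosine similarities over all pairs of embeddings."""
    return sum(cosine(u, v) ** 2 for u, v in combinations(embeddings, 2))

gaze, head_pose, expression = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
disentangled = orthogonality_loss([gaze, head_pose, expression])
entangled = orthogonality_loss([gaze, gaze, expression])
```

Squaring the cosine penalizes both aligned and anti-aligned embeddings, which matters when the goal is independence rather than dissimilarity.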


[63] Beyond Flatlands: Unlocking Spatial Intelligence by Decoupling 3D Reasoning from Numerical Regression cs.CVPDF

Zhongbin Guo, Jiahe Liu, Yushan Li, Wenyu Gao, Zhen Yang

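TL;DR: This paper proposes GEODE, a new architecture that decouples 3D reasoning from numerical generation to resolve the dual bottleneck that limits existing vision-language models in 3D spatial intelligence.

Details

Motivation: Existing vision-language models suffer an input-stage conflict between geometry-aware encoders and 2D-only features, and an output-stage limitation in which discrete tokenizers cannot produce precise numerical values; together these cause failures in 3D spatial understanding.

Result: The 1.5B-parameter GEODE model functions as a high-level semantic dispatcher and achieves spatial reasoning performance that rivals 7B+ models.

Insight: Decoupling 3D reasoning from numerical generation is key to efficient spatial intelligence; lightweight modules let a small model compete with much larger ones.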

Abstract: Existing Vision Language Models (VLMs), architecturally rooted in “flatland” perception, fundamentally struggle to comprehend real-world 3D spatial intelligence. This failure stems from a dual-bottleneck: input-stage conflict between computationally exorbitant geometric-aware encoders and superficial 2D-only features, and output-stage misalignment where discrete tokenizers are structurally incapable of producing precise, continuous numerical values. To break this impasse, we introduce GEODE (Geometric-Output and Decoupled-Input Engine), a novel architecture that resolves this dual-bottleneck by decoupling 3D reasoning from numerical generation. GEODE augments the main VLM with two specialized, plug-and-play modules: a Decoupled Rationale Module (DRM) that acts as a spatial co-processor, aligning explicit 3D data with 2D visual features via cross-attention and distilling spatial Chain-of-Thought (CoT) logic into injectable Rationale Tokens; and a Direct Regression Head (DRH), an “Embedding-as-Value” paradigm which routes specialized control tokens to a lightweight MLP for precise, continuous regression of scalars and 3D bounding boxes. The synergy of these modules allows our 1.5B parameter model to function as a high-level semantic dispatcher, achieving state-of-the-art spatial reasoning performance that rivals 7B+ models.


[64] Arcee: Differentiable Recurrent State Chain for Generative Vision Modeling with Mamba SSMs cs.CVPDF

Jitesh Chavan, Rohit Lal, Anand Kamat, Mengjia Xu

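TL;DR: Arcee introduces a differentiable cross-block recurrent state chain that improves Mamba SSMs on generative vision tasks and substantially lowers FID.

Details

Motivation: Mamba SSMs perform well on long-context sequence modeling, but in vision tasks the conventional selective-scan operation discards the state representation across blocks, which limits model performance.

Result: On CelebA-HQ, Arcee reduces FID from 82.81 to 15.33 (5.4× lower).

Insight: Treating the terminal state representation as a directional prior, rather than as an estimate of the non-sequential signal itself, substantially improves generative vision performance.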

Abstract: State-space models (SSMs), Mamba in particular, are increasingly adopted for long-context sequence modeling, providing linear-time aggregation via an input-dependent, causal selective-scan operation. Along this line, recent “Mamba-for-vision” variants largely explore multiple scan orders to relax strict causality for non-sequential signals (e.g., images). Rather than preserving cross-block memory, the conventional formulation of the selective-scan operation in Mamba reinitializes each block’s state-space dynamics from zero, discarding the terminal state-space representation (SSR) from the previous block. Arcee, a cross-block recurrent state chain, reuses each block’s terminal state-space representation as the initial condition for the next block. Handoff across blocks is constructed as a differentiable boundary map whose Jacobian enables end-to-end gradient flow across terminal boundaries. Key to practicality, Arcee is compatible with all prior “vision-mamba” variants, parameter-free, and incurs constant, negligible cost. As a modeling perspective, we view terminal SSR as a mild directional prior induced by a causal pass over the input, rather than an estimator of the non-sequential signal itself. To quantify the impact, for unconditional generation on CelebA-HQ (256$\times$256) with Flow Matching, Arcee reduces FID$\downarrow$ from $82.81$ to $15.33$ ($5.4\times$ lower) on a single scan-order Zigzag Mamba baseline. Efficient CUDA kernels and training code will be released to support rigorous and reproducible research.
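
A toy sketch of the cross-block state chain, under heavy simplification: each "block" here is a scalar linear recurrence h_t = a * h_{t-1} + b * x_t rather than a Mamba selective scan, and the boundary map between blocks is assumed to be the identity. With `chain=False` every block restarts from a zero state (the conventional formulation described above); with `chain=True` each block starts from the previous block's terminal state.

```python
def scan_block(xs, h0, a, b):
    """Run h_t = a * h_{t-1} + b * x_t over xs; return outputs and terminal state."""
    h, ys = h0, []
    for x in xs:
        h = a * h + b * x
        ys.append(h)
    return ys, h

def run_blocks(xs, params, chain):
    """Stack blocks; optionally hand each terminal state to the next block."""
    h_prev = 0.0
    for a, b in params:
        h0 = h_prev if chain else 0.0   # identity boundary map (assumption)
        xs, h_prev = scan_block(xs, h0, a, b)
    return xs

params = [(0.5, 1.0), (0.5, 1.0)]       # two blocks with illustrative coefficients
baseline = run_blocks([1.0, 1.0], params, chain=False)
chained = run_blocks([1.0, 1.0], params, chain=True)
```

The handoff adds no parameters and only one extra state copy per block, which matches the abstract's description of a parameter-free change with negligible cost. As a sanity check on the reported numbers, 82.81 / 15.33 is about 5.4.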


[65] Discovering Meaningful Units with Visually Grounded Semantics from Image Captions cs.CV | cs.CLPDF

Melika Behjati, James Henderson

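TL;DR: This paper proposes a vision-language model architecture that groups caption tokens to capture a fine-grained representation of language and aligns these groups with the objects discovered by an image encoder, improving fine-grained understanding of vision and language. The resulting token groups are highly similar to groundable phrases in text.

Details

Motivation: Existing vision-language models mostly align image patches with language tokens, but these units are not intuitive to humans, and individual tokens do not necessarily carry groundable information. The paper groups caption tokens to capture finer-grained semantic representations and improve model understanding.

Result: Experiments show improved vision-language understanding, and the token groups are highly similar to groundable phrases in text, both qualitatively and quantitatively.

Insight: 1. Token groups capture groundable information better than individual tokens. 2. Aligning fine-grained language representations with object representations helps the model understand vision-language relationships.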

Abstract: Fine-grained knowledge is crucial for vision-language models to obtain a better understanding of the real world. While there has been work trying to acquire this kind of knowledge in the space of vision and language, it has mostly focused on aligning the image patches with the tokens on the language side. However, image patches do not have any meaning to the human eye, and individual tokens do not necessarily carry groundable information in the image. It is groups of tokens which describe different aspects of the scene. In this work, we propose a model which groups the caption tokens as part of its architecture in order to capture a fine-grained representation of the language. We expect our representations to be at the level of objects present in the image, and therefore align our representations with the output of an image encoder trained to discover objects. We show that by learning to group the tokens, the vision-language model has a better fine-grained understanding of vision and language. In addition, the token groups that our model discovers are highly similar to groundable phrases in text, both qualitatively and quantitatively.


[66] GraphPilot: Grounded Scene Graph Conditioning for Language-Based Autonomous Driving cs.CVPDF

Fabian Schmidt, Markus Enzweiler, Abhinav Valada

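TL;DR: GraphPilot proposes scene graph conditioning to improve language-based autonomous driving models. By adding structured relational supervision from scene graphs during training, models better understand spatial structure and dynamic interactions, substantially raising driving scores.

Details

Motivation: Existing language-based driving models lack supervision that explicitly encodes relational dependencies, limiting their ability to infer interactions among traffic entities from multimodal input. GraphPilot fills this gap by using scene graph conditioning to strengthen topology-aware reasoning.

Result: On the LangAuto benchmark, GraphPilot raises the driving score by up to 15.6% for LMDrive and 17.5% for BEVDriver, indicating that models better internalize relational priors through scene graph conditioning.

Insight: Explicit relational supervision such as scene graphs can substantially improve language-based driving models; conditioning during training brings lasting benefits even when no scene graph is provided at test time.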

Abstract: Vision-language models have recently emerged as promising planners for autonomous driving, where success hinges on topology-aware reasoning over spatial structure and dynamic interactions from multimodal input. However, existing models are typically trained without supervision that explicitly encodes these relational dependencies, limiting their ability to infer how agents and other traffic entities influence one another from raw sensor data. In this work, we bridge this gap with a novel model-agnostic method that conditions language-based driving models on structured relational context in the form of traffic scene graphs. We serialize scene graphs at various abstraction levels and formats, and incorporate them into the models via structured prompt templates, enabling a systematic analysis of when and how relational supervision is most beneficial. Extensive evaluations on the public LangAuto benchmark show that scene graph conditioning of state-of-the-art approaches yields large and persistent improvement in driving performance. Notably, we observe up to a 15.6% increase in driving score for LMDrive and 17.5% for BEVDriver, indicating that models can better internalize and ground relational priors through scene graph-conditioned training, even without requiring scene graph input at test-time. Code, fine-tuned models, and our scene graph dataset are publicly available at https://github.com/iis-esslingen/GraphPilot.
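
Illustrative sketch (format invented for this example): the abstract says scene graphs are serialized and inserted into structured prompt templates but does not give the format. The snippet below shows one plausible serialization as subject-relation-object triples; the template text, the arrow notation, and the entity names are all hypothetical. The released repository is the place to look for the actual formats.

```python
PROMPT_TEMPLATE = "Scene graph:\n{graph}\n\nInstruction: {instruction}"

def serialize_scene_graph(triples):
    """Render (subject, relation, object) triples, one per line."""
    return "\n".join(f"({s}) -[{r}]-> ({o})" for s, r, o in triples)

def build_prompt(triples, instruction):
    """Fill the prompt template with the serialized graph and the driving instruction."""
    return PROMPT_TEMPLATE.format(
        graph=serialize_scene_graph(triples), instruction=instruction
    )

triples = [("ego", "follows", "car_1"), ("car_1", "in_lane", "lane_2")]
prompt = build_prompt(triples, "Turn left at the next intersection.")
print(prompt)
```

Keeping serialization separate from the template makes it easy to compare abstraction levels and formats, which is the kind of systematic analysis the abstract describes.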


[67] Φeat: Physically-Grounded Feature Representation cs.CVPDF

Giuseppe Vecchio, Adrien Kaiser, Rouffet Romain, Rosalie Martin, Elena Garces

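TL;DR: The paper introduces $Φ$eat, a physically-grounded visual backbone that uses a self-supervised pretraining strategy to learn features of material reflectance and geometric structure. It addresses the entanglement of semantics and physical factors in existing feature representations and suits tasks that require physical awareness.

Details

Motivation: Existing self-supervised features entangle high-level semantics with low-level physical factors such as geometry and illumination, limiting their use in tasks that require explicit physical reasoning. The authors therefore propose a physically-grounded feature representation.

Result: Experiments show that $Φ$eat's features capture physical structure and perform well in material selection and feature similarity analysis, demonstrating their effectiveness for physics-aware tasks.

Insight: Self-supervised learning can acquire physical features without explicit labels, providing a new foundation for physics-aware tasks in computer vision and graphics.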

Abstract: Foundation models have emerged as effective backbones for many vision tasks. However, current self-supervised features entangle high-level semantics with low-level physical factors, such as geometry and illumination, hindering their use in tasks requiring explicit physical reasoning. In this paper, we introduce $Φ$eat, a novel physically-grounded visual backbone that encourages a representation sensitive to material identity, including reflectance cues and geometric mesostructure. Our key idea is to employ a pretraining strategy that contrasts spatial crops and physical augmentations of the same material under varying shapes and lighting conditions. While similar data have been used in high-end supervised tasks such as intrinsic decomposition or material estimation, we demonstrate that a pure self-supervised training strategy, without explicit labels, already provides a strong prior for tasks requiring robust features invariant to external physical factors. We evaluate the learned representations through feature similarity analysis and material selection, showing that $Φ$eat captures physically-grounded structure beyond semantic grouping. These findings highlight the promise of unsupervised physical feature learning as a foundation for physics-aware perception in vision and graphics.


[68] Coordinative Learning with Ordinal and Relational Priors for Volumetric Medical Image Segmentation cs.CVPDF

Haoyi Wang

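TL;DR: CORAL is a coordinative learning method that combines ordinal relations with contrastive learning for volumetric medical image segmentation. By capturing both local and global anatomical structure, it achieves state-of-the-art performance under limited annotation.

Details

Motivation: Existing methods rely on binary thresholds to define samples, ignoring continuous anatomical similarity and global directional consistency, which distorts the feature space. CORAL aims to address these problems.

Result: State-of-the-art performance on benchmark datasets under limited annotation, while learning representations with meaningful anatomical structure.

Insight: By coordinating local and global structure learning, CORAL substantially improves medical image segmentation, especially when annotations are limited.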

Abstract: Volumetric medical image segmentation presents unique challenges due to the inherent anatomical structure and limited availability of annotations. While recent methods have shown promise by contrasting spatial relationships between slices, they rely on hard binary thresholds to define positive and negative samples, thereby discarding valuable continuous information about anatomical similarity. Moreover, these methods overlook the global directional consistency of anatomical progression, resulting in distorted feature spaces that fail to capture the canonical anatomical manifold shared across patients. To address these limitations, we propose Coordinative Ordinal-Relational Anatomical Learning (CORAL) to capture both local and global structure in volumetric images. First, CORAL employs a contrastive ranking objective to leverage continuous anatomical similarity, ensuring relational feature distances between slices are proportional to their anatomical position differences. In addition, CORAL incorporates an ordinal objective to enforce global directional consistency, aligning the learned feature distribution with the canonical anatomical progression across patients. Learning these inter-slice relationships produces anatomically informed representations that benefit the downstream segmentation task. Through this coordinative learning framework, CORAL achieves state-of-the-art performance on benchmark datasets under limited-annotation settings while learning representations with meaningful anatomical structure. Code is available at https://github.com/haoyiwang25/CORAL.


[69] From Synthetic Scenes to Real Performance: Enhancing Spatial Reasoning in VLMs cs.CV | cs.CLPDF

Massimo Rizzoli, Simone Alghisi, Seyed Mahed Mousavi, Giuseppe Riccardi

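TL;DR: The paper proposes improving the spatial reasoning of vision-language models (VLMs) by controlling the generation and annotation of synthetic data, avoiding the bias and distribution imbalance of conventional fine-tuning and achieving better performance on real data.

Details

Motivation: Conventional VLM fine-tuning depends on collecting and annotating real-world scenes, a process prone to bias, annotation errors, and distribution imbalance that lead to overfitting and uneven performance. The paper addresses this with a controllable synthetic data generation pipeline.

Result: 1) Models fine-tuned on balanced synthetic data perform more uniformly across the visual scene; 2) this fine-tuning significantly improves performance on real data (COCO), outperforming models fine-tuned in the matched setting.

Insight: Synthetic data has the potential to address annotation bias and distribution imbalance in real data, offering a new approach to training vision-language models.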

Abstract: Fine-tuning Vision-Language Models (VLMs) is a common strategy to improve performance following an ad-hoc data collection and annotation of real-world scenes. However, this process is often prone to biases, errors, and distribution imbalance, resulting in overfitting and imbalanced performance. Although a few studies have tried to address this problem by generating synthetic data, they lacked control over distribution bias and annotation quality. To address these challenges, we redesign the fine-tuning process in two ways. First, we control the generation of data and its annotations, ensuring it is free from bias, distribution imbalance, and annotation errors. We automatically construct the dataset by comprehensively sampling objects’ attributes, including color, shape, size, and position within the scene. Secondly, using this annotated dataset, we fine-tune state-of-the-art VLMs and assess performance transferability to real-world data on the absolute position task. We conduct exhaustive evaluations on both synthetic and real-world benchmarks. Our experiments reveal two key findings: 1) fine-tuning on balanced synthetic data yields uniform performance across the visual scene and mitigates common biases; and 2) fine-tuning on synthetic stimuli significantly improves performance on real-world data (COCO), outperforming models fine-tuned in the matched setting.
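
A minimal sketch of what "comprehensively sampling objects' attributes" can look like: enumerating the full Cartesian product of attribute values guarantees that every value of every attribute appears equally often, so the dataset is balanced by construction. The specific attribute values, the four-quadrant position scheme, and the question wording are assumptions for illustration, not the paper's actual configuration.

```python
from collections import Counter
from itertools import product

# Hypothetical attribute vocabularies.
COLORS = ["red", "green", "blue"]
SHAPES = ["cube", "sphere"]
SIZES = ["small", "large"]
POSITIONS = ["top-left", "top-right", "bottom-left", "bottom-right"]

def build_balanced_dataset():
    """One record per attribute combination; the label is the absolute position."""
    return [
        {
            "color": color,
            "shape": shape,
            "size": size,
            "position": position,
            "question": f"Where is the {size} {color} {shape}?",
            "answer": position,
        }
        for color, shape, size, position in product(COLORS, SHAPES, SIZES, POSITIONS)
    ]

dataset = build_balanced_dataset()
position_counts = Counter(record["answer"] for record in dataset)
```

Because labels are derived from the generation parameters rather than from human annotation, annotation errors are ruled out by design in a setup like this.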


[70] D-GAP: Improving Out-of-Domain Robustness via Dataset-Agnostic and Gradient-Guided Augmentation in Amplitude and Pixel Spaces cs.CV | cs.AIPDF

Ruoqi Wang, Haitao Wang, Shaojie Guo, Qiong Luo

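TL;DR: D-GAP is a data augmentation method that operates in both the frequency domain (amplitude space) and pixel space. It computes frequency sensitivity maps from task gradients and adaptively interpolates amplitudes to reduce the model's frequency-domain learning bias, while restoring detail in pixel space, significantly improving OOD robustness.

Details

Motivation: In real-world computer vision applications, OOD robustness is challenged by shifts in image background, style, and acquisition instruments. Generic augmentations give inconsistent gains, while dataset-specific augmentations require expert knowledge. Neural networks' learning bias toward frequency components makes the problem worse.

Result: Across four real-world datasets and three domain-adaptation benchmarks, D-GAP improves average OOD performance by 5.3% (real-world datasets) and 1.8% (benchmark datasets), outperforming both generic and dataset-specific augmentations.

Insight: D-GAP identifies frequency-domain learning bias as a key factor in OOD robustness and shows that augmenting in both frequency and pixel space is effective, pointing to a new direction for future research.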

Abstract: Out-of-domain (OOD) robustness is challenging to achieve in real-world computer vision applications, where shifts in image background, style, and acquisition instruments always degrade model performance. Generic augmentations show inconsistent gains under such shifts, whereas dataset-specific augmentations require expert knowledge and prior analysis. Moreover, prior studies show that neural networks adapt poorly to domain shifts because they exhibit a learning bias to domain-specific frequency components. Perturbing frequency values can mitigate such bias but overlooks pixel-level details, leading to suboptimal performance. To address these problems, we propose D-GAP (Dataset-agnostic and Gradient-guided augmentation in Amplitude and Pixel spaces), improving OOD robustness by introducing targeted augmentation in both the amplitude space (frequency space) and pixel space. Unlike conventional handcrafted augmentations, D-GAP computes sensitivity maps in the frequency space from task gradients, which reflect how strongly the model responds to different frequency components, and uses the maps to adaptively interpolate amplitudes between source and target samples. This way, D-GAP reduces the learning bias in frequency space, while a complementary pixel-space blending procedure restores fine spatial details. Extensive experiments on four real-world datasets and three domain-adaptation benchmarks show that D-GAP consistently outperforms both generic and dataset-specific augmentations, improving average OOD performance by +5.3% on real-world datasets and +1.8% on benchmark datasets.
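
Illustrative sketch of the amplitude-interpolation step only, on a 1D signal with a naive DFT: each frequency's amplitude is blended between source and target according to a per-frequency weight, and the source phase is kept. Two things are assumptions here: keeping the source phase (common in Fourier-based augmentation, not confirmed by the abstract) and taking the weight map as a given input, whereas D-GAP derives it from task gradients. Real images would use a 2D FFT.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def mix_amplitudes(source, target, weights):
    """Blend amplitudes per frequency (weight 0 = source, 1 = target); keep source phase."""
    src_spec, tgt_spec = dft(source), dft(target)
    mixed = [cmath.rect((1.0 - w) * abs(s) + w * abs(t), cmath.phase(s))
             for s, t, w in zip(src_spec, tgt_spec, weights)]
    # For real inputs and symmetric weights the result is real up to rounding.
    return [value.real for value in idft(mixed)]

source = [1.0, 2.0, 3.0, 4.0]
target = [0.0, 1.0, 0.0, 0.0]
unchanged = mix_amplitudes(source, target, [0.0] * 4)
swapped = mix_amplitudes(source, target, [1.0] * 4)
```

A sensitivity-driven weight map would push weights toward 1 at the frequencies the model over-relies on, which is the mechanism the abstract credits with reducing frequency bias.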


[71] DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding cs.CV | cs.CLPDF

Dawei Zhu, Rui Meng, Jiefeng Chen, Sujian Li, Tomas Pfister

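TL;DR: DocLens is a tool-augmented multi-agent framework that progressively focuses on relevant visual elements to solve evidence localization in long visual document understanding, outperforming existing methods and human experts.

Details

Motivation: Understanding long visual documents is challenging because information is spread across many pages of text and visual elements. Existing methods localize evidence poorly, which limits performance and causes model hallucination.

Result: State-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, with particularly strong results on vision-centric and unanswerable questions.

Insight: Precise evidence localization is key to long visual document understanding, and DocLens's multi-agent collaboration offers a workable solution.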

Abstract: Comprehending long visual documents, where information is distributed across extensive pages of text and visual elements, is a critical but challenging task for modern Vision-Language Models (VLMs). Existing approaches falter on a fundamental challenge: evidence localization. They struggle to retrieve relevant pages and overlook fine-grained details within visual elements, leading to limited performance and model hallucination. To address this, we propose DocLens, a tool-augmented multi-agent framework that effectively “zooms in” on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate a single, reliable answer. Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts. The framework’s superiority is particularly evident on vision-centric and unanswerable queries, demonstrating the power of its enhanced localization capabilities.


[72] AUVIC: Adversarial Unlearning of Visual Concepts for Multi-modal Large Language Models cs.CV | cs.AIPDF

Haokun Chen, Jianing Li, Yao Zhang, Jinhe Bi, Yan Xia

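TL;DR: The paper proposes AUVIC, a framework for precisely unlearning target visual concepts from multimodal large language models. It addresses data privacy concerns and uses adversarial perturbations to forget efficiently.

Details

Motivation: Optimizing multimodal large language models (MLLMs) on massive datasets raises privacy and copyright concerns, especially where regulations mandate a "right to be forgotten". Prior work focuses mainly on text; visual concept unlearning in MLLMs remains underexplored.

Result: Experiments show that AUVIC achieves the best target forgetting rates with minimal performance impact on non-target concepts.

Insight: Adversarial perturbation is an effective way to unlearn visual concepts, and VCUBench provides an important benchmark for future research.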

Abstract: Multimodal Large Language Models (MLLMs) achieve impressive performance once optimized on massive datasets. Such datasets often contain sensitive or copyrighted content, raising significant data privacy concerns. Regulatory frameworks mandating the ‘right to be forgotten’ drive the need for machine unlearning. This technique allows for the removal of target data without resource-consuming retraining. However, while well-studied for text, visual concept unlearning in MLLMs remains underexplored. A primary challenge is precisely removing a target visual concept without disrupting model performance on related entities. To address this, we introduce AUVIC, a novel visual concept unlearning framework for MLLMs. AUVIC applies adversarial perturbations to enable precise forgetting. This approach effectively isolates the target concept while avoiding unintended effects on similar entities. To evaluate our method, we construct VCUBench. It is the first benchmark designed to assess visual concept unlearning in group contexts. Experimental results demonstrate that AUVIC achieves state-of-the-art target forgetting rates while incurring minimal performance degradation on non-target concepts.


[73] DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding cs.CVPDF

Tanveer Hannan, Dimitrios Mallios, Parth Pathak, Faegheh Sardari, Thomas Seidl

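TL;DR: DocSLM is an efficient small vision-language model designed for resource-constrained edge devices that reduces memory consumption and latency on long-document understanding tasks.

Details

Motivation: Existing LVLMs perform well on long multimodal document understanding, but their high memory requirements limit practical use on edge devices, so a lighter solution is needed.

Result: DocSLM performs strongly on multiple benchmarks while using 82% fewer visual tokens, 75% fewer parameters, and 71% lower latency.

Insight: Compression and streaming make efficient long-document multimodal understanding possible in resource-constrained settings.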

Abstract: Large Vision-Language Models (LVLMs) have demonstrated strong multimodal reasoning capabilities on long and complex documents. However, their high memory footprint makes them impractical for deployment on resource-constrained edge devices. We present DocSLM, an efficient Small Vision-Language Model designed for long-document understanding under constrained memory resources. DocSLM incorporates a Hierarchical Multimodal Compressor that jointly encodes visual, textual, and layout information from each page into a fixed-length sequence, greatly reducing memory consumption while preserving both local and global semantics. To enable scalable processing over arbitrarily long inputs, we introduce a Streaming Abstention mechanism that operates on document segments sequentially and filters low-confidence responses using an entropy-based uncertainty calibrator. Across multiple long multimodal document benchmarks, DocSLM matches or surpasses state-of-the-art methods while using 82% fewer visual tokens, 75% fewer parameters, and 71% lower latency, delivering reliable multimodal document understanding on lightweight edge devices. Code is available in the supplementary material.
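
As a rough illustration of entropy-based abstention over sequential document segments, the sketch below scores each segment's answer by the Shannon entropy of its answer distribution and drops answers above a threshold. The `(answer, probs)` input format, the `max_entropy` threshold, and the rule of keeping the lowest-entropy answer are assumptions for illustration; the abstract does not specify how the calibrator is built.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def streaming_answer(segment_outputs, max_entropy):
    """Scan segments in order and keep the most confident non-abstained answer.

    segment_outputs: iterable of (answer, probs) pairs, one per document segment.
    max_entropy: hypothetical abstention threshold; higher-entropy answers are dropped.
    Returns None if every segment abstains.
    """
    best_answer, best_entropy = None, float("inf")
    for answer, probs in segment_outputs:
        h = entropy(probs)
        if h > max_entropy:
            continue  # abstain on this segment
        if h < best_entropy:
            best_answer, best_entropy = answer, h
    return best_answer


outputs = [
    ("2019", [0.4, 0.3, 0.3]),    # uncertain: segment likely lacks the evidence
    ("2021", [0.9, 0.05, 0.05]),  # confident
    ("2020", [0.5, 0.5, 0.0]),    # uncertain
]
result = streaming_answer(outputs, max_entropy=0.5)
```

Because segments are consumed one at a time and only the running best answer is kept, memory use in this sketch does not grow with document length.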


[74] YCB-Ev SD: Synthetic event-vision dataset for 6DoF object pose estimation cs.CVPDF

Pavel Rojtberg, Julius Kühn

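TL;DR: YCB-Ev SD is a synthetic event-camera dataset for 6DoF object pose estimation that fills the lack of comprehensive data in event-based vision. It contains 50,000 event sequences generated with physically based rendering, and a systematic evaluation identifies the best-performing event representation.

Details

Motivation: Event-based vision lacks high-quality synthetic datasets, which hinders the development of 6DoF object pose estimation algorithms. The authors introduce YCB-Ev SD to fill this gap and advance related research.

Result: Experiments show that time surfaces with linear decay and dual-channel polarity encoding perform best for 6DoF pose estimation, significantly outperforming exponential decay and single-channel encodings. Polarity information contributes most to the gain.

Insight: 1) The event representation has a significant effect on pose estimation performance; 2) polarity information and linear temporal encoding are key to better performance; 3) synthetic data can effectively support the development of event-vision algorithms.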

Abstract: We introduce YCB-Ev SD, a synthetic dataset of event-camera data at standard definition (SD) resolution for 6DoF object pose estimation. While synthetic data has become fundamental in frame-based computer vision, event-based vision lacks comparable comprehensive resources. Addressing this gap, we present 50,000 event sequences of 34 ms duration each, synthesized from Physically Based Rendering (PBR) scenes of YCB-Video objects following the Benchmark for 6D Object Pose (BOP) methodology. Our generation framework employs simulated linear camera motion to ensure complete scene coverage, including background activity. Through systematic evaluation of event representations for CNN-based inference, we demonstrate that time-surfaces with linear decay and dual-channel polarity encoding achieve superior pose estimation performance, outperforming exponential decay and single-channel alternatives by significant margins. Our analysis reveals that polarity information contributes most substantially to performance gains, while linear temporal encoding preserves critical motion information more effectively than exponential decay. The dataset is provided in a structured format with both raw event streams and precomputed optimal representations to facilitate immediate research use and reproducible benchmarking. The dataset is publicly available at https://huggingface.co/datasets/paroj/ycbev_sd.
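
To make the winning representation concrete, here is a minimal sketch of a time surface with linear decay and one channel per polarity. The event tuple layout `(x, y, t, polarity)`, the microsecond timestamps, and the function name are assumptions for illustration; the dataset's actual precomputed format may differ.

```python
def linear_time_surface(events, width, height, t_end, window):
    """Build a 2-channel time surface with linear decay.

    events: iterable of (x, y, t, polarity) with polarity in {0, 1}.
    Channel 0 holds negative-polarity events, channel 1 positive ones.
    A pixel's value is 1 for an event at t_end and falls linearly to 0
    for events that are `window` or more time units old.
    """
    surface = [[[0.0] * width for _ in range(height)] for _ in range(2)]
    for x, y, t, polarity in events:
        age = t_end - t
        if age < 0 or age >= window:
            continue  # outside the window
        value = 1.0 - age / window
        # Keep the most recent event per pixel and channel.
        if value > surface[polarity][y][x]:
            surface[polarity][y][x] = value
    return surface


# 34 ms window as in the dataset's sequence duration; timestamps in microseconds.
events = [(0, 0, 0, 1), (1, 0, 17_000, 0), (1, 0, 34_000, 0), (2, 1, 25_500, 1)]
ts = linear_time_surface(events, width=3, height=2, t_end=34_000, window=34_000)
```

An exponential-decay variant would replace the `value` line with `math.exp(-age / tau)` for some time constant `tau`, and a single-channel variant would merge both polarities into one map; these are the alternatives the abstract reports as weaker.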


[75] Free3D: 3D Human Motion Emerges from Single-View 2D Supervision cs.CVPDF

Sheng Liu, Yuanzhi Liang, Sidan Du

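TL;DR: Free3D is a framework that generates 3D human motion from single-view 2D supervision, synthesizing realistic 3D motion without any 3D annotations.

Details

Motivation: Existing 3D motion generation models rely on precise 3D supervision, which limits generalization. Free3D aims to learn 3D structure and semantics from 2D supervision to improve generalization.

Result: The 3D motions Free3D generates are diverse and temporally coherent, with performance comparable to or better than fully 3D-supervised methods.

Insight: Relaxing explicit 3D supervision encourages stronger structural reasoning and generalization, offering an efficient and scalable paradigm for 3D motion generation.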

Abstract: Recent 3D human motion generation models demonstrate remarkable reconstruction accuracy yet struggle to generalize beyond training distributions. This limitation arises partly from the use of precise 3D supervision, which encourages models to fit fixed coordinate patterns instead of learning the essential 3D structure and motion semantic cues required for robust generalization. To overcome this limitation, we propose Free3D, a framework that synthesizes realistic 3D motions without any 3D motion annotations. Free3D introduces a Motion-Lifting Residual Quantized VAE (ML-RQ) that maps 2D motion sequences into 3D-consistent latent spaces, and a suite of 3D-free regularization objectives enforcing view consistency, orientation coherence, and physical plausibility. Trained entirely on 2D motion data, Free3D generates diverse, temporally coherent, and semantically aligned 3D motions, achieving performance comparable to or even surpassing fully 3D-supervised counterparts. These results suggest that relaxing explicit 3D supervision encourages stronger structural reasoning and generalization, offering a scalable and data-efficient paradigm for 3D motion generation.


[76] Disentangling Emotional Bases and Transient Fluctuations: A Low-Rank Sparse Decomposition Approach for Video Affective Analysis cs.CVPDF

Feng-Qi Cui, Jinyang Huang, Ziyu Jia, Xinyu Li, Xin Yan

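TL;DR: The paper proposes the Low-Rank Sparse Emotion Understanding Framework (LSEF) for video affective analysis, which hierarchically models long-term emotional bases and short-term transient fluctuations to improve robustness and dynamic discrimination.

Details

Motivation: Existing Video-based Affective Computing (VAC) methods suffer from model instability and representational degradation caused by complex emotional dynamics, and they lack a hierarchical mechanism for separating distinct affective components such as long-term emotional bases and short-term fluctuations.

Result: In experiments on multiple datasets, LSEF significantly improves robustness and dynamic discrimination, validating the effectiveness and generality of hierarchical low-rank sparse modeling.

Insight: Hierarchically modeling emotional bases and fluctuation signals captures affective dynamics effectively, and low-rank sparse decomposition offers a new theoretical perspective on affective analysis.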

Abstract: Video-based Affective Computing (VAC), vital for emotion analysis and human-computer interaction, suffers from model instability and representational degradation due to complex emotional dynamics. Since the meaning of different emotional fluctuations may differ under different emotional contexts, the core limitation is the lack of a hierarchical structural mechanism to disentangle distinct affective components, i.e., emotional bases (the long-term emotional tone), and transient fluctuations (the short-term emotional fluctuations). To address this, we propose the Low-Rank Sparse Emotion Understanding Framework (LSEF), a unified model grounded in the Low-Rank Sparse Principle, which theoretically reframes affective dynamics as a hierarchical low-rank sparse compositional process. LSEF employs three plug-and-play modules, i.e., the Stability Encoding Module (SEM) captures low-rank emotional bases; the Dynamic Decoupling Module (DDM) isolates sparse transient signals; and the Consistency Integration Module (CIM) reconstructs multi-scale stability and reactivity coherence. This framework is optimized by a Rank Aware Optimization (RAO) strategy that adaptively balances gradient smoothness and sensitivity. Extensive experiments across multiple datasets confirm that LSEF significantly enhances robustness and dynamic discrimination, which further validates the effectiveness and generality of hierarchical low-rank sparse modeling for understanding affective dynamics.


[77] MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model cs.CVPDF

Manyu Li, Ruian He, Chenxi Ma, Weimin Tan, Bo Yan

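TL;DR: MicroVQA++ is a high-quality microscopy visual question answering dataset built in three stages that combine expert validation, graph-based filtering, and MLLM-based generation, and it substantially improves microscopy reasoning performance.

Details

Motivation: The use of multimodal large language models on biomedical images is limited by the lack of high-quality training data; MicroVQA++ aims to remove this bottleneck.

Result: Dataset quality clearly exceeds the MicroVQA benchmark; a 4B-scale MLLM comes close to GPT-5 and reaches state of the art among open-source models.

Insight: High-quality data construction and cross-modal consistency filtering are key to improving MLLM performance in specialized domains.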

Abstract: Multimodal Large Language Models are increasingly applied to biomedical imaging, yet scientific reasoning for microscopy remains limited by the scarcity of large-scale, high-quality training data. We introduce MicroVQA++, a three-stage, large-scale and high-quality microscopy VQA corpus derived from the BIOMEDICA archive. Stage one bootstraps supervision from expert-validated figure-caption pairs sourced from peer-reviewed articles. Stage two applies HiCQA-Graph, a novel heterogeneous graph over images, captions, and QAs that fuses NLI-based textual entailment, CLIP-based vision-language alignment, and agent signals to identify and filter inconsistent samples. Stage three uses a MultiModal Large Language Model (MLLM) agent to generate multiple-choice questions (MCQ) followed by human screening. The resulting release comprises a large training split and a human-checked test split whose Bloom’s level hard-sample distribution exceeds the MicroVQA benchmark. Our work delivers (i) a quality-controlled dataset that couples expert literature with graph-based filtering and human refinement; (ii) HiCQA-Graph, the first graph that jointly models (image, caption, QA) for cross-modal consistency filtering; (iii) evidence that careful data construction enables 4B-scale MLLMs to reach competitive microscopy reasoning performance (e.g., GPT-5) and achieve state-of-the-art performance among open-source MLLMs. Code and dataset will be released after the review process concludes.


[78] Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models cs.CVPDF

Jiaxi Huang, Dongxu Wu, Hanwei Zhu, Lingyu Zhu, Jun Xing

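TL;DR: Q-Doc proposes a three-tier evaluation framework (coarse, middle, and fine) for systematically assessing multimodal large language models (MLLMs) on document image quality assessment (DIQA). It finds problems such as inconsistent scoring and distortion misidentification, but shows that Chain-of-Thought (CoT) prompting substantially improves performance.

Details

Motivation: The potential of current MLLMs for DIQA has not been fully explored, and systematic evaluation tools are lacking. Q-Doc aims to fill this gap with a standardized test framework for the quality assessment capabilities of MLLMs.

Result: MLLMs show nascent DIQA ability but suffer from inconsistent scoring and distortion misjudgment; CoT prompting substantially improves performance and offers a path to improvement.

Insight: MLLMs still have room to improve on DIQA; systematic evaluation and techniques such as CoT can effectively strengthen their quality perception.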

Abstract: The rapid advancement of Multi-modal Large Language Models (MLLMs) has expanded their capabilities beyond high-level vision tasks. Nevertheless, their potential for Document Image Quality Assessment (DIQA) remains underexplored. To bridge this gap, we propose Q-Doc, a three-tiered evaluation framework for systematically probing DIQA capabilities of MLLMs at coarse, middle, and fine granularity levels. a) At the coarse level, we instruct MLLMs to assign quality scores to document images and analyze their correlation with Quality Annotations. b) At the middle level, we design distortion-type identification tasks, including single-choice and multi-choice tests for multi-distortion scenarios. c) At the fine level, we introduce distortion-severity assessment where MLLMs classify distortion intensity against human-annotated references. Our evaluation demonstrates that while MLLMs possess nascent DIQA abilities, they exhibit critical limitations: inconsistent scoring, distortion misidentification, and severity misjudgment. Significantly, we show that Chain-of-Thought (CoT) prompting substantially enhances performance across all levels. Our work provides a benchmark for DIQA capabilities in MLLMs, revealing pronounced deficiencies in their quality perception and promising pathways for enhancement. The benchmark and code are publicly available at: https://github.com/cydxf/Q-Doc.
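
The coarse-level analysis compares model-assigned scores with quality annotations. The abstract does not say which correlation measure is used; the sketch below assumes Spearman rank correlation, a common choice in image quality assessment, and implements it with the standard library only. The score values are made up for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


model_scores = [3.5, 1.0, 4.0, 2.0, 4.5]   # hypothetical MLLM outputs
human_scores = [60, 20, 70, 40, 95]        # hypothetical annotations
srcc = spearman(model_scores, human_scores)
```

A rank-based measure only asks whether the model orders documents the same way as the annotators, so it is unaffected by the model using a different score scale.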


[79] BOFA: Bridge-Layer Orthogonal Low-Rank Fusion for CLIP-Based Class-Incremental Learning cs.CV | cs.LGPDF

Lan Li, Tao Hu, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan

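TL;DR: BOFA is a CLIP-based class-incremental learning framework that addresses forgetting and multimodal integration through orthogonal low-rank fusion in the bridge layer and an adaptation design that adds no extra parameters, significantly improving accuracy and efficiency.

Details

Motivation: Vision-language models such as CLIP provide strong multimodal representations for class-incremental learning, but downstream task adaptation and multimodal integration remain challenging. BOFA aims to address these challenges, avoid forgetting, and make full use of multimodal strengths.

Result: On standard benchmarks, BOFA significantly outperforms existing methods, showing higher accuracy and efficiency.

Insight: Stable knowledge accumulation without data replay, together with multimodal synergy, is key to improving class-incremental learning performance.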

Abstract: Class-Incremental Learning (CIL) aims to continually learn new categories without forgetting previously acquired knowledge. Vision-language models such as CLIP offer strong transferable representations via multi-modal supervision, making them promising for CIL. However, applying CLIP to CIL poses two major challenges: (1) adapting to downstream tasks often requires additional learnable modules, increasing model complexity and susceptibility to forgetting; and (2) while multi-modal representations offer complementary strengths, existing methods have yet to fully realize their potential in effectively integrating visual and textual modalities. To address these issues, we propose BOFA (Bridge-layer Orthogonal Fusion for Adaptation), a novel framework for CIL. BOFA confines all model adaptation exclusively to CLIP’s existing cross-modal bridge-layer, thereby adding no extra parameters or inference cost. To prevent forgetting within this layer, it leverages Orthogonal Low-Rank Fusion, a mechanism that constrains parameter updates to a low-rank "safe subspace" mathematically constructed to be orthogonal to past task features. This ensures stable knowledge accumulation without data replay. Furthermore, BOFA employs a cross-modal hybrid prototype that synergizes stable textual prototypes with visual counterparts derived from our stably adapted bridge-layer, enhancing classification performance. Extensive experiments on standard benchmarks show that BOFA achieves superior accuracy and efficiency compared to existing methods.
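
The idea of restricting updates to a subspace orthogonal to past task features can be sketched with a plain Gram-Schmidt projection. This shows only the orthogonality constraint on a single update vector; how BOFA builds its low-rank subspace and fuses updates is not given in the abstract, so the helper names and the vector-level formulation here are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(update, past_features):
    """Remove from `update` every component lying in the span of past features."""
    result = list(update)
    for b in orthonormal_basis(past_features):
        coeff = dot(result, b)
        result = [ri - coeff * bi for ri, bi in zip(result, b)]
    return result


past = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # hypothetical past-task features
raw_update = [0.5, -2.0, 3.0]
safe_update = project_out(raw_update, past)
```

After projection the update has zero inner product with every past feature, so a linear layer changed by it responds to those features exactly as before.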


[80] Shrinking the Teacher: An Adaptive Teaching Paradigm for Asymmetric EEG-Vision Alignment cs.CVPDF

Lukun Wu, Jie Li, Ziqi Ren, Kaifan Zhang, Xinbo Gao

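TL;DR: The paper proposes an adaptive teaching paradigm that dynamically adjusts the knowledge structure of the visual modality (teacher) to fit the capacity of the EEG modality (student), addressing the asymmetry in EEG-vision alignment.

Details

Motivation: Two key asymmetric gaps separate the EEG and visual modalities: a fidelity gap (EEG noise and signal degradation) and a semantic gap (EEG's shallow conceptual representation). Conventional methods ignore this asymmetry, which leads to poor generalization.

Result: On zero-shot brain-to-image retrieval, top-1 accuracy reaches 60.2%, which is 9.8% higher than the previous best method.

Insight: A new perspective on asymmetric alignment: the teacher modality (vision) should dynamically shrink and adapt to the ability of the student modality (EEG), rather than forcing symmetric alignment.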

Abstract: Decoding visual features from EEG signals is a central challenge in neuroscience, with cross-modal alignment as the dominant approach. We argue that the relationship between visual and brain modalities is fundamentally asymmetric, characterized by two critical gaps: a Fidelity Gap (stemming from EEG’s inherent noise and signal degradation, vs. vision’s high-fidelity features) and a Semantic Gap (arising from EEG’s shallow conceptual representation, vs. vision’s rich semantic depth). Previous methods often overlook this asymmetry, forcing alignment between the two modalities as if they were equal partners and thereby leading to poor generalization. To address this, we propose the adaptive teaching paradigm. This paradigm empowers the "teacher" modality (vision) to dynamically shrink and adjust its knowledge structure under task guidance, tailoring its semantically dense features to match the "student" modality (EEG)’s capacity. We implement this paradigm with the ShrinkAdapter, a simple yet effective module featuring a residual-free design and a bottleneck structure. Through extensive experiments, we validate the underlying rationale and effectiveness of our paradigm. Our method achieves a top-1 accuracy of 60.2% on the zero-shot brain-to-image retrieval task, surpassing previous state-of-the-art methods by a margin of 9.8%. Our work introduces a new perspective for asymmetric alignment: the teacher must shrink and adapt to bridge the vision-brain gap.


[81] WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation cs.CVPDF

Wei Chow, Jiachun Pan, Yongyuan Liang, Mingze Zhou, Xue Song

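TL;DR: WEAVE is a dataset and evaluation suite focused on multi-turn, context-dependent multimodal comprehension and generation, filling the gap left by existing single-turn tasks.

Details

Motivation: Datasets and benchmarks for existing unified multimodal models (UMMs) focus mainly on single-turn interactions and fail to capture the multi-turn, context-dependent nature of real-world image creation and editing.

Result: Experiments show that training on WEAVE-100k improves comprehension, editing, and collaboration capabilities, but multi-turn generation and context-aware tasks remain challenging.

Insight: WEAVE gives the multimodal community a foundation for studying in-context interleaved comprehension and generation, and exposes the limitations of current methods on multi-turn tasks.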

Abstract: Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. Our suite consists of two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark with 100 tasks based on 480 images, featuring a hybrid VLM judger evaluation framework based on both the reference image and the combination of the original image with editing instructions that assesses models’ abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains. Experiments demonstrate that training on WEAVE-100k enables vision comprehension, image editing, and comprehension-generation collaboration capabilities. Furthermore, it helps UMMs develop emergent visual-memory capabilities, while extensive evaluations on WEAVEBench expose the persistent limitations and challenges of current approaches in multi-turn, context-aware image generation and editing. We believe WEAVE provides a view and foundation for studying in-context interleaved comprehension and generation for the multi-modal community.


[82] The Persistence of Cultural Memory: Investigating Multimodal Iconicity in Diffusion Models cs.CV | cs.AIPDF

Maria-Teresa De Rosa Palmini, Eva Cetinic

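TL;DR: The paper studies the ambiguity between generalization and memorization in text-to-image diffusion models, introduces the notion of multimodal iconicity, and uses an evaluation framework that separates recognition from realization to show how models handle cultural memory.

Details

Motivation: The aim is to understand how diffusion models handle culturally shared associations (multimodal iconicity) and how they balance recognizing cultural references against reproducing them.

Result: The framework distinguishes replication from transformation more effectively than existing similarity-based methods. In addition, cultural alignment correlates with training data frequency, textual uniqueness, reference popularity, and creation date.

Insight: The value of diffusion models lies not only in reproducing cultural knowledge but in how they transform and recontextualize it, which pushes evaluation beyond simple text-image matching.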

Abstract: Our work addresses the ambiguity between generalization and memorization in text-to-image diffusion models, focusing on a specific case we term multimodal iconicity. This refers to instances where images and texts evoke culturally shared associations, such as when a title recalls a familiar artwork or film scene. While prior research on memorization and unlearning emphasizes forgetting, we examine what is remembered and how, focusing on the balance between recognizing cultural references and reproducing them. We introduce an evaluation framework that separates recognition, whether a model identifies a reference, from realization, how it depicts it through replication or reinterpretation, quantified through measures capturing both dimensions. By evaluating five diffusion models across 767 Wikidata-derived cultural references spanning static and dynamic imagery, we show that our framework distinguishes replication from transformation more effectively than existing similarity-based methods. To assess linguistic sensitivity, we conduct prompt perturbation experiments using synonym substitutions and literal image descriptions, finding that models often reproduce iconic visual structures even when textual cues are altered. Finally, our analysis shows that cultural alignment correlates not only with training data frequency, but also textual uniqueness, reference popularity, and creation date. Our work reveals that the value of diffusion models lies not only in what they reproduce but in how they transform and recontextualize cultural knowledge, advancing evaluation beyond simple text-image matching toward richer contextual understanding.


[83] Hi-DREAM: Brain Inspired Hierarchical Diffusion for fMRI Reconstruction via ROI Encoder and visuAl Mapping cs.CV | cs.HCPDF

Guowei Zhang, Yun Zhao, Moein Khajehnejad, Adeel Razi, Levin Kuhlmann

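TL;DR: Hi-DREAM is a brain-inspired hierarchical diffusion framework that reconstructs images from fMRI via an ROI encoder and visual mapping. It exploits the hierarchical processing of the cortex and substantially improves reconstruction performance and interpretability.

Details

Motivation: Existing diffusion-based fMRI decoders usually condition directly on fMRI features and ignore how visual information is organized hierarchically across the cortex. This blurs the roles of early, middle, and late visual areas and hurts reconstruction performance and interpretability.

Result: On the Natural Scenes Dataset (NSD), Hi-DREAM reaches state of the art on high-level semantic metrics while remaining competitive on low-level fidelity.

Insight: Explicitly modeling the cortical hierarchy not only improves image reconstruction but also reveals the functional contributions of different visual areas, offering a new lens on the visual cortex.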

Abstract: Mapping human brain activity to natural images offers a new window into vision and cognition, yet current diffusion-based decoders face a core difficulty: most condition directly on fMRI features without analyzing how visual information is organized across the cortex. This overlooks the brain’s hierarchical processing and blurs the roles of early, middle, and late visual areas. We propose Hi-DREAM, a brain-inspired conditional diffusion framework that makes the cortical organization explicit. A region-of-interest (ROI) adapter groups fMRI into early/mid/late streams and converts them into a multi-scale cortical pyramid aligned with the U-Net depth (shallow scales preserve layout and edges; deeper scales emphasize objects and semantics). A lightweight, depth-matched ControlNet injects these scale-specific hints during denoising. The result is an efficient and interpretable decoder in which each signal plays a brain-like role, allowing the model not only to reconstruct images but also to illuminate functional contributions of different visual areas. Experiments on the Natural Scenes Dataset (NSD) show that Hi-DREAM attains state-of-the-art performance on high-level semantic metrics while maintaining competitive low-level fidelity. These findings suggest that structuring conditioning by cortical hierarchy is a powerful alternative to purely data-driven embeddings and provides a useful lens for studying the visual cortex.


[84] VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models cs.CVPDF

Mingjie Xu, Jinpeng Chen, Yuzhi Zhao, Jason Chun Lok Li, Yue Qiu

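TL;DR: VP-Bench is a benchmark for evaluating how well multimodal large language models (MLLMs) understand and use visual prompts (VPs). It provides a two-stage evaluation framework covering VP perception and downstream task use, and systematically analyzes 28 MLLMs.

Details

Motivation: No existing benchmark systematically evaluates the ability of MLLMs to understand and use visual prompts such as bounding boxes. VP-Bench fills this gap and tests whether MLLMs can use VPs to solve problems as intuitively as humans do.

Result: VP-Bench is used to evaluate 28 MLLMs (including GPT-4o and open-source models) and to analyze how VP attributes, question design, and model scale affect VP understanding.

Insight: MLLMs differ in how well they understand and apply VPs, and VP-Bench provides a standardized evaluation framework for future research.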

Abstract: Multimodal large language models (MLLMs) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use “visual prompts” (VPs), such as bounding boxes, to provide reference. However, no existing benchmark systematically evaluates the ability of MLLMs to interpret such VPs. This gap leaves it unclear whether current MLLMs can effectively recognize VPs, an intuitive prompting method for humans, and use them to solve problems. To address this limitation, we introduce VP-Bench, a benchmark for assessing MLLMs’ capability in VP perception and utilization. VP-Bench employs a two-stage evaluation framework: Stage 1 examines models’ ability to perceive VPs in natural scenes, using 30k visualized prompts spanning eight shapes and 355 attribute combinations. Stage 2 investigates the impact of VPs on downstream tasks, measuring their effectiveness in real-world problem-solving scenarios. Using VP-Bench, we evaluate 28 MLLMs, including proprietary systems (e.g., GPT-4o) and open-source models (e.g., InternVL3 and Qwen2.5-VL), and provide a comprehensive analysis of factors that affect VP understanding, such as variations in VP attributes, question arrangement, and model scale. VP-Bench establishes a new reference framework for studying how MLLMs comprehend and resolve grounded referring questions.


[85] VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation cs.CV | cs.LGPDF

Maximilian Rokuss, Moritz Langenberg, Yannick Kirchhoff, Fabian Isensee, Benjamin Hamm

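TL;DR: VoxTell is a vision-language model for medical image segmentation that produces 3D segmentation masks from free-text prompts, works across multiple imaging modalities, and performs strongly in zero-shot settings.

Details

Motivation: Existing medical image segmentation methods usually require task-specific annotations. VoxTell aims to enable flexible segmentation from free-text descriptions, reducing annotation needs and improving generalization.

Result: Performs strongly on unseen datasets, particularly in cross-modality transfer and robustness to clinical language.

Insight: Free-text prompts make medical image segmentation more flexible, and multi-stage feature alignment markedly improves adaptation to new classes and complex descriptions.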

Abstract: We introduce VoxTell, a vision-language model for text-prompted volumetric medical image segmentation. It maps free-form descriptions, from single words to full clinical sentences, to 3D masks. Trained on 62K+ CT, MRI, and PET volumes spanning over 1K anatomical and pathological classes, VoxTell uses multi-stage vision-language fusion across decoder layers to align textual and visual features at multiple scales. It achieves state-of-the-art zero-shot performance across modalities on unseen datasets, excelling on familiar concepts while generalizing to related unseen classes. Extensive experiments further demonstrate strong cross-modality transfer, robustness to linguistic variations and clinical language, as well as accurate instance-specific segmentation from real-world text. Code is available at: https://www.github.com/MIC-DKFZ/VoxTell


[86] Rethinking Efficient Mixture-of-Experts for Remote Sensing Modality-Missing Classification cs.CVPDF

Qinghao Gao, Jianhai Qu, Yunsong Li, Weiqiang Dong

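TL;DR: The MaMOL framework addresses missing modalities in remote sensing multimodal classification with a dual-routing mechanism (a dynamic router and a static router), achieving parameter-efficient adaptation and cross-modal knowledge sharing.

Details

Motivation: Multimodal classification in remote sensing often suffers from missing modalities caused by environmental interference, sensor failures, or atmospheric effects, which severely degrades classification performance. Existing two-stage adaptation methods are computationally expensive and assume complete training data, so they generalize poorly to real-world incompleteness.

Result: Shows strong robustness and generalization on multiple remote sensing benchmarks with minimal computational overhead; transfer experiments on natural image datasets confirm cross-domain applicability.

Insight: MaMOL not only addresses missing modalities in remote sensing but also shows generality and efficiency in other domains.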

Abstract: Multimodal classification in remote sensing often suffers from missing modalities caused by environmental interference, sensor failures, or atmospheric effects, which severely degrade classification performance. Existing two-stage adaptation methods are computationally expensive and assume complete multimodal data during training, limiting their generalization to real-world incompleteness. To overcome these issues, we propose a Missing-aware Mixture-of-Loras (MaMOL) framework that reformulates modality missing as a multi-task learning problem. MaMOL introduces a dual-routing mechanism: a task-oriented dynamic router that adaptively activates experts for different missing patterns, and a modality-specific-shared static router that maintains stable cross-modal knowledge sharing. Unlike prior methods that train separate networks for each missing configuration, MaMOL achieves parameter-efficient adaptation via lightweight expert updates and shared expert reuse. Experiments on multiple remote sensing benchmarks demonstrate superior robustness and generalization under varying missing rates, with minimal computational overhead. Moreover, transfer experiments on natural image datasets validate its scalability and cross-domain applicability, highlighting MaMOL as a general and efficient solution for incomplete multimodal learning.


[87] Sat2RealCity: Geometry-Aware and Appearance-Controllable 3D Urban Generation from Satellite Imagery cs.CVPDF

Yijie Kang, Xinliang Wang, Zhenyu Wu, Yifeng Shi, Hailong Zhu

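TL;DR: Sat2RealCity proposes a new framework for generating 3D cities from satellite imagery that removes the dependence on large-scale 3D city assets and on semantic or height maps, achieving geometry-aware and appearance-controllable generation.

Details

Motivation: Existing 3D city generation methods depend on large-scale 3D city assets that are scarce and expensive, and they lack a connection to real-world appearance, which limits the realism and generalizability of generated cities.

Result: Experiments show that Sat2RealCity significantly outperforms baselines in structural consistency and appearance realism.

Insight: Decomposing generation into individual building entities lets the method use priors from 3D object generation, reducing dependence on large-scale 3D city data and improving alignment with the real world.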

Abstract: Recent advances in generative modeling have substantially enhanced 3D urban generation, enabling applications in digital twins, virtual cities, and large-scale simulations. However, existing methods face two key challenges: (1) the need for large-scale 3D city assets for supervised training, which are difficult and costly to obtain, and (2) reliance on semantic or height maps, which are used exclusively for generating buildings in virtual worlds and lack connection to real-world appearance, limiting the realism and generalizability of generated cities. To address these limitations, we propose Sat2RealCity, a geometry-aware and appearance-controllable framework for 3D urban generation from real-world satellite imagery. Unlike previous city-level generation methods, Sat2RealCity builds generation upon individual building entities, enabling the use of rich priors and pretrained knowledge from 3D object generation while substantially reducing dependence on large-scale 3D city assets. Specifically, (1) we introduce the OSM-based spatial priors strategy to achieve interpretable geometric generation from spatial topology to building instances; (2) we design an appearance-guided controllable modeling mechanism for fine-grained appearance realism and style control; and (3) we construct an MLLM-powered semantic-guided generation pipeline, bridging semantic interpretation and geometric reconstruction. Extensive quantitative and qualitative experiments demonstrate that Sat2RealCity significantly surpasses existing baselines in structural consistency and appearance realism, establishing a strong foundation for real-world aligned 3D urban content creation. The code will be released soon.


[88] ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation cs.CV | cs.AIPDF

Kaishen Wang, Ruibo Chen, Tong Zheng, Heng Huang

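TL;DR: ImAgent is a training-free unified multimodal agent framework that improves image quality and semantic alignment through dynamic interaction and self-evaluation, making test-time scaling more efficient and improving generation results.

Details

Motivation: Existing text-to-image (T2I) models suffer from randomness and inconsistency when prompts are vague or underspecified, and existing remedies usually require additional modules and scale inefficiently at test time.

Result: On image generation and editing tasks, ImAgent clearly outperforms baselines and performs well even on tasks where the backbone model fails.

Insight: A unified multimodal agent framework shows promise for adaptive and efficient test-time generation, especially for vague prompts.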

Abstract: Recent text-to-image (T2I) models have made remarkable progress in generating visually realistic and semantically coherent images. However, they still suffer from randomness and inconsistency with the given prompts, particularly when textual descriptions are vague or underspecified. Existing approaches, such as prompt rewriting, best-of-N sampling, and self-refinement, can mitigate these issues but usually require additional modules and operate independently, hindering test-time scaling efficiency and increasing computational overhead. In this paper, we introduce ImAgent, a training-free unified multimodal agent that integrates reasoning, generation, and self-evaluation within a single framework for efficient test-time scaling. Guided by a policy controller, multiple generation actions dynamically interact and self-organize to enhance image fidelity and semantic alignment without relying on external models. Extensive experiments on image generation and editing tasks demonstrate that ImAgent consistently improves over the backbone and even surpasses other strong baselines where the backbone model fails, highlighting the potential of unified multimodal agents for adaptive and efficient image generation under test-time scaling.


[89] Multimodal Posterior Sampling-based Uncertainty in PD-L1 Segmentation from H&E Images cs.CV | q-bio.QMPDF

Roman Kinakh, Gonzalo R. Ríos-Muñoz, Arrate Muñoz-Barrutia

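TL;DR: The paper proposes nnUNet-B, a Bayesian segmentation framework based on Multimodal Posterior Sampling that infers PD-L1 expression directly from H&E-stained histology images, providing both accurate segmentation and epistemic uncertainty estimates.

Details

Motivation: Existing immunohistochemistry (IHC) based methods for assessing PD-L1 expression are resource-intensive and time-consuming, so a more efficient and scalable approach is needed.

Result: Performs well on a lung squamous cell carcinoma dataset, with a mean Dice score of 0.805 and a mean IoU of 0.709; uncertainty estimates correlate strongly with segmentation error.

Insight: Uncertainty-aware PD-L1 prediction from H&E images offers a scalable and interpretable route to biomarker assessment in clinical workflows.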

Abstract: Accurate assessment of PD-L1 expression is critical for guiding immunotherapy, yet current immunohistochemistry (IHC) based methods are resource-intensive. We present nnUNet-B: a Bayesian segmentation framework that infers PD-L1 expression directly from H&E-stained histology images using Multimodal Posterior Sampling (MPS). Built upon nnUNet-v2, our method samples diverse model checkpoints during cyclic training to approximate the posterior, enabling both accurate segmentation and epistemic uncertainty estimation via entropy and standard deviation. Evaluated on a dataset of lung squamous cell carcinoma, our approach achieves competitive performance against established baselines with mean Dice Score and mean IoU of 0.805 and 0.709, respectively, while providing pixel-wise uncertainty maps. Uncertainty estimates show strong correlation with segmentation error, though calibration remains imperfect. These results suggest that uncertainty-aware H&E-based PD-L1 prediction is a promising step toward scalable, interpretable biomarker assessment in clinical workflows.
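
The abstract mentions pixel-wise uncertainty from entropy and standard deviation over sampled checkpoints. The following is a minimal sketch of that idea in plain Python, assuming each checkpoint outputs a binary foreground probability per pixel; the function names and the toy inputs are hypothetical and are not taken from the nnUNet-B code.

```python
import math
from statistics import mean, pstdev


def pixel_uncertainty(samples):
    """Summarise K sampled foreground probabilities for a single pixel."""
    p = mean(samples)
    eps = 1e-12  # avoids log(0) when every sample agrees at 0 or 1
    entropy = -(p * math.log(p + eps) + (1 - p) * math.log(1 - p + eps))
    return p, entropy, pstdev(samples)


def uncertainty_maps(checkpoint_probs):
    """checkpoint_probs[k][i][j] is checkpoint k's probability at pixel (i, j)."""
    rows, cols = len(checkpoint_probs[0]), len(checkpoint_probs[0][0])
    entropy_map = [[0.0] * cols for _ in range(rows)]
    std_map = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            samples = [probs[i][j] for probs in checkpoint_probs]
            _, entropy_map[i][j], std_map[i][j] = pixel_uncertainty(samples)
    return entropy_map, std_map


# Two hypothetical checkpoints on a 1x2 image: they agree on the first pixel only.
entropy_map, std_map = uncertainty_maps([[[0.9, 0.1]], [[0.9, 0.9]]])
```

In this toy case the second pixel, where the checkpoints disagree, gets both the higher entropy and the higher standard deviation, which is the behaviour an uncertainty map is meant to expose.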


[90] PAS : Prelim Attention Score for Detecting Object Hallucinations in Large Vision–Language Models cs.CV | cs.AIPDF

Nhat Hoang-Xuan, Minh Vu, My T. Thai, Manish Bhattarai

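TL;DR: The paper proposes PAS (Prelim Attention Score), a training-free, lightweight method for detecting object hallucinations in large vision-language models (LVLMs). Using conditional mutual information, the authors show that weak dependence between the image and the predicted object correlates strongly with hallucination, and they compute PAS from attention weights to detect hallucinations in real time.

Details

Motivation: LVLMs are prone to object hallucination (describing objects that do not appear in the image), which undermines their reliability. The authors observe that when predicting a new object the model often ignores the image and relies instead on previously generated (prelim) tokens, and they aim to quantify this dependence in order to detect hallucinations.

Result: PAS clearly outperforms existing methods across multiple LVLMs and datasets and detects object hallucinations in real time, supporting on-the-fly filtering and intervention.

Insight: A model's internal attention weights carry a key signal for hallucination detection; with no extra training, the model's own attention mechanism provides an effective basis for detecting hallucinations.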

Abstract: Large vision-language models (LVLMs) are powerful, yet they remain unreliable due to object hallucinations. In this work, we show that in many hallucinatory predictions the LVLM effectively ignores the image and instead relies on previously generated output (prelim) tokens to infer new objects. We quantify this behavior via the mutual information between the image and the predicted object conditioned on the prelim, demonstrating that weak image dependence strongly correlates with hallucination. Building on this finding, we introduce the Prelim Attention Score (PAS), a lightweight, training-free signal computed from attention weights over prelim tokens. PAS requires no additional forward passes and can be computed on the fly during inference. Exploiting this previously overlooked signal, PAS achieves state-of-the-art object-hallucination detection across multiple models and datasets, enabling real-time filtering and intervention.
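
The abstract only says that PAS is computed from attention weights over prelim tokens. As a rough illustration, and not the authors' exact formula, the sketch below averages over heads the attention mass that the current object token places on prelim positions. The aggregation, the threshold, and the names `prelim_attention_score` and `flag_hallucination` are all assumptions.

```python
def prelim_attention_score(attn_rows, prelim_positions):
    """Average, over heads, of the attention mass placed on prelim tokens.

    attn_rows: one attention row per head for the object token being generated;
               each row holds weights over all earlier positions.
    prelim_positions: indices of previously generated output tokens.
    """
    prelim = set(prelim_positions)
    per_head = [
        sum(w for pos, w in enumerate(row) if pos in prelim) for row in attn_rows
    ]
    return sum(per_head) / len(per_head)


def flag_hallucination(score, threshold=0.5):
    """Hypothetical decision rule: heavy reliance on prelim tokens is suspicious."""
    return score > threshold


# Toy layout: positions 0-1 are image tokens, positions 2-3 are prelim tokens.
heads = [[0.1, 0.1, 0.5, 0.3], [0.2, 0.2, 0.3, 0.3]]
score = prelim_attention_score(heads, prelim_positions=[2, 3])
```

Because the score reuses attention weights that the forward pass already produces, a signal of this kind needs no extra forward pass, which matches the on-the-fly property the abstract describes.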


[91] OpenUS: A Fully Open-Source Foundation Model for Ultrasound Image Analysis via Self-Adaptive Masked Contrastive Learning cs.CVPDF

Xiaoyu Zheng, Xu Chen, Awais Rauf, Qifan Fu, Benedetta Monosi

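TL;DR: OpenUS is the first fully open-source foundation model for ultrasound imaging. It addresses the challenges of ultrasound image analysis with self-adaptive masked contrastive learning, combining a vision Mamba backbone with a new pre-training strategy.

Details

Motivation: Ultrasound image analysis is limited by operator dependence, device variability, and scarce annotations, so a generalizable, label-efficient foundation model is needed.

Result: The pre-trained model adapts efficiently to downstream tasks and supports label-scarce settings.

Insight: The self-adaptive masking mechanism, combined with a dynamic learning schedule, improves the model's feature extraction on complex ultrasound data.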

Abstract: Ultrasound (US) is one of the most widely used medical imaging modalities, thanks to its low cost, portability, real-time feedback, and absence of ionizing radiation. However, US image interpretation remains highly operator-dependent and varies significantly across anatomical regions, acquisition protocols, and device types. These variations, along with unique challenges such as speckle, low contrast, and limited standardized annotations, hinder the development of generalizable, label-efficient ultrasound AI models. In this paper, we propose OpenUS, the first reproducible, open-source ultrasound foundation model built on a large collection of public data. OpenUS employs a vision Mamba backbone, capturing both local and global long-range dependencies across the image. To extract rich features during pre-training, we introduce a novel self-adaptive masking framework that combines contrastive learning with masked image modeling. This strategy integrates the teacher’s attention map with student reconstruction loss, adaptively refining clinically-relevant masking to enhance pre-training effectiveness. OpenUS also applies a dynamic learning schedule to progressively adjust the difficulty of the pre-training process. To develop the foundation model, we compile the largest to-date public ultrasound dataset comprising over 308K images from 42 publicly available datasets, covering diverse anatomical regions, institutions, imaging devices, and disease types. Our pre-trained OpenUS model can be easily adapted to specific downstream tasks by serving as a backbone for label-efficient fine-tuning. Code is available at https://github.com/XZheng0427/OpenUS.


[92] Bridging Hidden States in Vision-Language Models cs.CVPDF

Benjamin Fein-Ashley, Jacob Fein-Ashley

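TL;DR: The paper proposes BRIDGE, a lightweight cross-modal fusion module that aligns the hidden states of the vision and text encoders through bidirectional attention layers, improving multimodal task performance while preserving bi-encoder efficiency.

Details

Motivation: Existing multimodal models typically either fuse early inside the encoders or compare pooled embeddings late, overlooking the rich modality-specific information in the hidden states. The paper argues that aligning these states directly is a more natural approach.

Result: On retrieval, VQA, and visual reasoning tasks, BRIDGE outperforms comparable models while preserving bi-encoder efficiency.

Insight: Directly aligning vision and text hidden states captures modality-specific information (such as spatial layout, syntax, and semantics) more naturally, which is key to improving multimodal task performance.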

Abstract: Vision-Language Models (VLMs) are a new family of models that align image content with natural language. Existing approaches typically fuse either (a) early: by mixing tokens/features inside the encoders, or (b) late: by comparing pooled embeddings. Many methods also tie fusion to an autoregressive decoder. However, the hidden states of both modalities already carry rich, modality-specific structure (spatial layout in vision; syntax and semantics in text), so directly aligning these states is a natural way to match what the two modalities “think”. We propose a lightweight fusion module: a few cross-only, bidirectional attention layers placed near the top of both encoders. Each layer projects the vision and text encoder hidden-state sequences into a shared space, attends across modalities, and sends gated residual updates back, with simple stabilizers to improve alignment. The encoders remain non-causal and strong for understanding, while generation stays cleanly decoupled via an optional decoder. Across standard retrieval, VQA, and visual reasoning benchmarks, BRIDGE outperforms comparable VLMs while preserving the bi-encoder efficiency of contrastive models. We make our code publicly available at https://github.com/jfeinashley/BRIDGE.


[93] LARM: A Large Articulated-Object Reconstruction Model cs.CVPDF

Sylvia Yuan, Ruoxi Shi, Xinyue Wei, Xiaoshuai Zhang, Hao Su

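TL;DR: LARM is a unified feedforward framework that reconstructs 3D articulated objects from sparse-view images while recovering detailed geometry, realistic textures, and accurate joint structures. It extends LVSM with a transformer-based architecture that reasons over camera pose and articulation variation, enabling efficient novel view synthesis and high-fidelity 3D reconstruction.

Details

Motivation: Existing methods for reconstructing 3D articulated objects usually rely on dense multi-view inputs and expensive per-instance optimization, while feedforward methods are fast but yield coarse geometry and lack texture reconstruction. LARM aims to overcome these limits with an efficient, high-fidelity unified framework.

Result: Experiments show that LARM outperforms existing methods in novel view and state synthesis and in 3D articulated object reconstruction, producing high-quality meshes that closely match the input images.

Insight: LARM demonstrates an efficient, scalable approach: jointly reasoning over camera pose and articulation variation markedly improves sparse-view reconstruction of 3D articulated objects and provides a practical tool for related applications.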

Abstract: Modeling 3D articulated objects with realistic geometry, textures, and kinematics is essential for a wide range of applications. However, existing optimization-based reconstruction methods often require dense multi-view inputs and expensive per-instance optimization, limiting their scalability. Recent feedforward approaches offer faster alternatives but frequently produce coarse geometry, lack texture reconstruction, and rely on brittle, complex multi-stage pipelines. We introduce LARM, a unified feedforward framework that reconstructs 3D articulated objects from sparse-view images by jointly recovering detailed geometry, realistic textures, and accurate joint structures. LARM extends LVSM, a recent novel view synthesis (NVS) approach for static 3D objects, into the articulated setting by jointly reasoning over camera pose and articulation variation using a transformer-based architecture, enabling scalable and accurate novel view synthesis. In addition, LARM generates auxiliary outputs such as depth maps and part masks to facilitate explicit 3D mesh extraction and joint estimation. Our pipeline eliminates the need for dense supervision and supports high-fidelity reconstruction across diverse object categories. Extensive experiments demonstrate that LARM outperforms state-of-the-art methods in both novel view and state synthesis as well as 3D articulated object reconstruction, generating high-quality meshes that closely adhere to the input images. Project page: https://sylviayuan-sy.github.io/larm-site/


cs.CL [Back]

[94] Data Analysis and Performance Evaluation of Simulation Deduction Based on LLMs cs.CL | cs.AIPDF

Shansi Zhang, Min Li

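TL;DR: The paper proposes an LLM-based method for data analysis and performance evaluation of simulation deduction, producing high-quality reports through task decomposition and multi-round interaction.

Details

Motivation: Traditional manual analysis is time-consuming and error-prone. The analytical and reasoning abilities of LLMs can improve efficiency, but a single instruction rarely yields a well-structured report.

Result: Experiments show that the reports generated by the method are of higher quality and score higher than those of the baseline method.

Insight: LLMs show promise for task decomposition and structured output generation; combining them with custom tools and templates substantially improves practical results.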

Abstract: Data analysis and performance evaluation of simulation deduction play a pivotal role in modern warfare, enabling military personnel to gain invaluable insights into the potential effectiveness of different strategies, tactics, and operational plans. The traditional manual analysis approach is time-consuming and prone to human error. To enhance efficiency and accuracy, large language models (LLMs) with strong analytical and inferencing capabilities can be employed. However, high-quality analysis reports with well-structured formatting cannot be obtained through a single instruction input to the LLM. To tackle this issue, we propose a method that first decomposes the complex task into several sub-tasks and designs effective system prompts and user prompts for each sub-task. Multi-round interactions with the LLM incorporating self-check and reflection are then conducted to enable structured data extraction as well as multi-step analysis and evaluation. Furthermore, custom tools are defined and invoked to generate figures and compute metrics. We also design multiple report templates, each tailored to a specific application and input data type, ensuring their adaptability across a variety of scenarios. Extensive evaluation results demonstrate that the reports generated by our method exhibit higher quality and therefore obtain higher scores than those of the baseline method.


[95] Hybrid Quantum Transformer for Language Generation cs.CL | cs.AI | quant-phPDF

Desheng Kong, Xiangshuo Cui, Jiaying Jin, Jing Xu, Donglin Wang

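TL;DR: The paper presents HyQuT, the first hybrid quantum-classical large language model for natural language generation. It integrates variational quantum circuits (VQCs) into the Transformer framework and demonstrates the feasibility of quantum computing in generative language models.

Details

Motivation: Although quantum computing is increasingly used in place of classical computation, existing quantum or hybrid models remain confined to simple tasks and have not been applied successfully to large-scale natural language generation. The paper aims to fill this gap.

Result: In the 150M-parameter model, 10 qubits and 80 quantum gates can replace about 10% of the classical parameters while maintaining generation quality and convergence stability.

Insight: Quantum computing can be embedded efficiently in large generative language models, and the small quantum resource requirement gives the approach practical potential.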

Abstract: Although quantum computing has been increasingly applied to replace classical computation, most existing quantum or hybrid models remain confined to simple tasks, with no successful application to large-scale natural language generation to date. In this work, we present the first hybrid quantum-classical large language model (LLM) for natural language generation, HyQuT, capable of performing coherent and context-aware dialogue. The proposed architecture integrates variational quantum circuits (VQCs) into the Transformer framework at both 8M and 150M parameter scales. Experimental results show that a minimal number of qubits (10 qubits with 80 quantum gates) can replace about 10% of the classical parameters in the 150M-parameter model, while achieving comparable convergence stability and generation quality. This study provides an early demonstration of the feasibility of integrating quantum computing into large-scale generative language models.


[96] Empirical Characterization of Temporal Constraint Processing in LLMs cs.CL | cs.AIPDF

Javier Marín

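TL;DR: The paper empirically studies how large language models (LLMs) handle temporal constraints, reveals systematic risks of current models in real-time decision tasks, and argues that improved architectures are needed.

Details

Motivation: When LLMs are deployed in agentic architectures that require real-time decisions, it is usually assumed that they can reliably judge whether an action window is open or closed. This assumption is untested, and the paper sets out to fill that gap.

Result: Model performance is bimodal (95% or 50% accuracy), minor prompt changes swing performance by 30-60 percentage points, and fine-tuning partially improves performance (by 12-37 percentage points).

Insight: Temporal constraint satisfaction cannot be learned through next-token prediction on natural language alone; hybrid architectures that incorporate symbolic reasoning modules are needed.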

Abstract: When deploying LLMs in agentic architectures requiring real-time decisions under temporal constraints, we assume they reliably determine whether action windows remain open or have closed. This assumption is untested. We characterize temporal constraint processing across eight production-scale models (2.8-8B parameters) using deadline detection tasks, revealing systematic deployment risks: bimodal performance distribution (models achieve either 95% or 50% accuracy), extreme prompt brittleness (30-60 percentage point swings from formatting changes alone), and systematic action bias (100% false positive rates in failing models). Parameter count shows no correlation with capability in this range: a 3.8B model matches 7B models while other 7B models fail completely. Fine-tuning on 200 synthetic examples improves models with partial capability by 12-37 percentage points. We demonstrate that temporal constraint satisfaction cannot be reliably learned through next-token prediction on natural language, even with targeted fine-tuning. This capability requires architectural mechanisms for: (1) continuous temporal state representation, (2) explicit constraint checking separate from linguistic pattern matching, (3) systematic compositional reasoning over temporal relations. Current autoregressive architectures lack these mechanisms. Deploying such systems in time-critical applications without hybrid architectures incorporating symbolic reasoning modules represents unacceptable risk.
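
To make mechanism (2), explicit constraint checking kept separate from linguistic pattern matching, a little more concrete, here is a minimal sketch of a symbolic deadline gate placed around a model-proposed action. It is an illustration under our own assumptions, not the paper's architecture; `window_is_open`, `gated_action`, and the `"abstain"` fallback are hypothetical names.

```python
from datetime import datetime, timedelta


def window_is_open(opened_at, duration, now):
    """Return True only while `now` lies inside [opened_at, opened_at + duration]."""
    return opened_at <= now <= opened_at + duration


def gated_action(proposed_action, opened_at, duration, now):
    """Let a model-proposed action through only if the symbolic check passes."""
    return proposed_action if window_is_open(opened_at, duration, now) else "abstain"


start = datetime(2025, 1, 1, 12, 0, 0)
limit = timedelta(minutes=5)
inside = gated_action("act", start, limit, start + timedelta(minutes=4))
outside = gated_action("act", start, limit, start + timedelta(minutes=6))
```

A gate like this decides from timestamps rather than from the wording of the prompt, so formatting changes to the prompt cannot alter its verdict.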


[97] Spectral Neuro-Symbolic Reasoning II: Semantic Node Merging, Entailment Filtering, and Knowledge Graph Alignment cs.CL | cs.AI | cs.NEPDF

Andrew Kiruluta, Priscilla Burity

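TL;DR: The paper extends the Spectral NSR framework with three enhancements (semantic node merging, entailment filtering, and knowledge graph alignment) to improve graph fidelity and reasoning accuracy.

Details

Motivation: Existing neuro-symbolic reasoning frameworks handle redundant nodes and edge quality poorly and lack supplementary external knowledge, which harms reasoning accuracy and robustness.

Result: Across several benchmarks, accuracy improves by up to 3.8%, with better generalization to adversarial cases and less inference noise.

Insight: Adding semantic preprocessing modules substantially improves system performance without changing the core reasoning engine, and the approach suits open-domain and real-world settings.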

Abstract: This report extends the Spectral Neuro-Symbolic Reasoning (Spectral NSR) framework by introducing three semantically grounded enhancements: (1) transformer-based node merging using contextual embeddings (e.g., Sentence-BERT, SimCSE) to reduce redundancy, (2) sentence-level entailment validation with pretrained NLI classifiers (e.g., RoBERTa, DeBERTa) to improve edge quality, and (3) alignment with external knowledge graphs (e.g., ConceptNet, Wikidata) to augment missing context. These modifications enhance graph fidelity while preserving the core spectral reasoning pipeline. Experimental results on ProofWriter, EntailmentBank, and CLUTRR benchmarks show consistent accuracy gains (up to +3.8%), improved generalization to adversarial cases, and reduced inference noise. The novelty lies in performing semantic and symbolic refinement entirely upstream of the spectral inference stage, enabling efficient, interpretable, and scalable reasoning without relying on quadratic attention mechanisms. In summary, this work extends the Spectral NSR framework with modular, semantically grounded preprocessing steps that improve graph quality without altering the core spectral reasoning engine. The result is a more robust, interpretable, and scalable reasoning system suitable for deployment in open-domain and real-world settings.
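
Enhancement (1), merging redundant nodes by embedding similarity, can be sketched as follows. This is a simplified stand-in, assuming node embeddings are already available as plain vectors (the report obtains them from models such as Sentence-BERT); the union-find grouping, the 0.9 threshold, and the toy vectors are our assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def merge_nodes(embeddings, threshold=0.9):
    """Group node names whose embeddings are at least `threshold` similar."""
    names = list(embeddings)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                parent[find(b)] = find(a)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


# Hypothetical 2-D embeddings; real sentence embeddings have hundreds of dimensions.
groups = merge_nodes({"cat": [1.0, 0.0], "feline": [0.98, 0.2], "car": [0.0, 1.0]})
```

The pairwise loop is quadratic in the number of nodes, which is acceptable for a sketch but would likely need an approximate nearest-neighbour index on large graphs.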


[98] Preference Orchestrator: Prompt-Aware Multi-Objective Alignment for Large Language Models cs.CL | cs.AIPDF

Biao Liu, Ning Xu, Junming Yang, Xin Geng

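TL;DR: The PRO framework uses a lightweight preference adapter to infer preference weights for each prompt automatically, removing the inefficiency of manually specified weights in multi-objective alignment, and it outperforms existing methods in both theory and experiments.

Details

Motivation: Multi-objective alignment of large language models requires manually specified preference weights, which burdens users and is inefficient. PRO aims to solve this by inferring the weights automatically.

Result: Experiments show that PRO outperforms existing methods on multiple tasks, confirming its effectiveness.

Insight: Prompt-specific preference weights achieve multi-objective alignment more efficiently and avoid the complexity of manual tuning.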

Abstract: While Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, aligning these models with varying human preferences across multiple objectives remains a significant challenge in practical deployments. Existing multi-objective alignment methods rely on manually specified preference weights, which not only burden users with difficult preference specification tasks but also lead to suboptimal training efficiency due to exploration of irrelevant preference combinations. To alleviate these issues, we propose a novel framework named PRO, i.e., PReference Orchestrator, which features a lightweight preference adapter that automatically infers prompt-specific preference weights during both training and deployment phases. Specifically, the adapter automatically learns appropriate preference weights for each prompt by training on normalized reward scores from multiple reward models for preferred responses, which inherently reflect effective preference balances across objectives. Additionally, we provide theoretical analysis proving that our prompt-aware preference mechanism achieves superior performance compared to fixed preference weights in multi-objective alignment scenarios. Extensive experiments across multiple tasks demonstrate the effectiveness of our method over existing multi-objective alignment approaches.
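
The abstract says the adapter is trained on normalized reward scores of preferred responses. One simple way to read that, shown here only as a sketch, is to rescale the per-objective rewards of a preferred response so they sum to one and treat the result as the target weight vector, which then scalarizes the rewards of any candidate. The sum-to-one normalization and the objective names are assumptions; the paper may normalize differently.

```python
def target_weights(normalized_rewards):
    """Turn per-objective normalized rewards of a preferred response into weights.

    Assumes rewards are already scaled to be non-negative and not all zero.
    """
    total = sum(normalized_rewards.values())
    return {name: r / total for name, r in normalized_rewards.items()}


def scalarize(rewards, weights):
    """Weighted sum of per-objective rewards under the given preference weights."""
    return sum(weights[name] * rewards[name] for name in weights)


# Hypothetical objectives and scores for one prompt.
weights = target_weights({"helpfulness": 0.6, "harmlessness": 0.2})
combined = scalarize({"helpfulness": 1.0, "harmlessness": 0.5}, weights)
```

In a full system the adapter would predict such weights from the prompt alone, so that no user has to supply them at deployment time.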


[99] Patent Representation Learning via Self-supervision cs.CL | cs.LGPDF

You Zuo, Kim Gerdes, Eric Villemonte de La Clergerie, Benoît Sagot

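TL;DR: The paper proposes a contrastive learning framework for patent representation learning. By using different sections of a patent document as complementary views, it avoids the semantic loss that SimCSE-style augmentation causes on patent data, and it performs strongly on large-scale benchmarks.

Details

Motivation: Patent representation learning usually depends on brittle annotations (such as citations or IPC classes), and SimCSE-style dropout augmentation loses semantics on patent data. To address this, the authors use different sections of a patent document (such as the abstract, claims, and background) as complementary views to improve representation quality.

Result: The method performs well on patent retrieval and classification, matching or surpassing supervised baselines without relying on annotations. The analysis shows that claims are better suited to retrieval, while background sections help classification more.

Insight: The inherent structure of patent documents (the division of labor among sections) gives representation learning natural semantic diversity and substantially improves model performance. Using multiple views within a document enables more efficient and generalizable patent understanding.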

Abstract: This paper presents a simple yet effective contrastive learning framework for learning patent embeddings by leveraging multiple views from within the same document. We first identify a patent-specific failure mode of SimCSE-style dropout augmentation: it produces overly uniform embeddings that lose semantic cohesion. To remedy this, we propose section-based augmentation, where different sections of a patent (e.g., abstract, claims, background) serve as complementary views. This design introduces natural semantic and structural diversity, mitigating over-dispersion and yielding embeddings that better preserve both global structure and local continuity. On large-scale benchmarks, our fully self-supervised method matches or surpasses citation- and IPC-supervised baselines in prior-art retrieval and classification, while avoiding reliance on brittle or incomplete annotations. Our analysis further shows that different sections specialize for different tasks (claims and summaries benefit retrieval, while background sections aid classification), which underlines the value of patents’ inherent discourse structure for representation learning. These results highlight the value of exploiting intra-document views for scalable and generalizable patent understanding.
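
Section-based augmentation can be pictured as a standard in-batch contrastive (InfoNCE) objective in which the positive for a patent's abstract embedding is the embedding of another section of the same patent. The sketch below assumes that formulation and uses tiny hand-written vectors in place of encoder outputs; the temperature value and function names are illustrative, not the authors' settings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(anchor_views, positive_views, temperature=0.05):
    """Mean InfoNCE loss where positive_views[i] is the positive for anchor_views[i].

    All other entries of positive_views act as in-batch negatives.
    """
    total = 0.0
    for i, anchor in enumerate(anchor_views):
        logits = [cosine(anchor, p) / temperature for p in positive_views]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchor_views)


# Hypothetical embeddings: row i of each list comes from the same patent.
abstracts = [[1.0, 0.0], [0.0, 1.0]]
claims = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = info_nce(abstracts, claims)
shuffled_loss = info_nce(abstracts, claims[::-1])
```

The loss is close to zero when each abstract is paired with its own claims and much larger when the pairing is shuffled, which is the pressure that pulls views of the same document together.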


[100] Evaluating Modern Large Language Models on Low-Resource and Morphologically Rich Languages: A Cross-Lingual Benchmark Across Cantonese, Japanese, and Turkish cs.CL | cs.AIPDF

Chengxuan Xia, Qianye Wu, Hongbin Guan, Sixuan Tian, Yilun Hao

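TL;DR: The paper evaluates seven modern large language models (LLMs) on low-resource and morphologically rich languages (Cantonese, Japanese, and Turkish). Although proprietary models (such as GPT-4o and Claude 3.5) perform well on multilingual tasks, significant gaps remain in culturally nuanced understanding and morphological generalization. Open-source models perform comparatively worse.

Details

Motivation: LLMs perform well in high-resource languages such as English, but their effectiveness in low-resource and morphologically rich languages remains underexplored. The paper aims to fill this gap by evaluating LLMs in these languages.

Result: Proprietary models (such as GPT-4o and Claude 3.5) perform best on multilingual tasks but still fall short in culturally nuanced understanding and morphological generalization; open-source models (such as LLaMA-2 and Mistral 7B) perform worse.

Insight: LLMs still struggle with culturally nuanced understanding and morphologically complex languages in multilingual tasks, especially in low-resource languages. Future work should include improving cross-lingual and cultural adaptability.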

Abstract: Large language models (LLMs) have achieved impressive results in high-resource languages like English, yet their effectiveness in low-resource and morphologically rich languages remains underexplored. In this paper, we present a comprehensive evaluation of seven cutting-edge LLMs (GPT-4o, GPT-4, Claude 3.5 Sonnet, LLaMA 3.1, Mistral Large 2, LLaMA-2 Chat 13B, and Mistral 7B Instruct) on a new cross-lingual benchmark covering Cantonese, Japanese, and Turkish. Our benchmark spans four diverse tasks: open-domain question answering, document summarization, English-to-X translation, and culturally grounded dialogue. We combine human evaluations (rating fluency, factual accuracy, and cultural appropriateness) with automated metrics (e.g., BLEU, ROUGE) to assess model performance. Our results reveal that while the largest proprietary models (GPT-4o, GPT-4, Claude 3.5) generally lead across languages and tasks, significant gaps persist in culturally nuanced understanding and morphological generalization. Notably, GPT-4o demonstrates robust multilingual performance even on cross-lingual tasks, and Claude 3.5 Sonnet achieves competitive accuracy on knowledge and reasoning benchmarks. However, all models struggle to some extent with the unique linguistic challenges of each language, such as Turkish agglutinative morphology and Cantonese colloquialisms. Smaller open-source models (LLaMA-2 13B, Mistral 7B) lag substantially in fluency and accuracy, highlighting the resource disparity. We provide detailed quantitative results, qualitative error analysis, and discuss implications for developing more culturally aware and linguistically generalizable LLMs. Our benchmark and evaluation data are released to foster reproducibility and further research.


[101] Grounded Visual Factualization: Factual Anchor-Based Finetuning for Enhancing MLLM Factual Consistency cs.CL | cs.CVPDF

Filippo Morbiato, Luca Romano, Alessandro Persona

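TL;DR: The paper proposes Grounded Visual Factualization (GVF) Finetuning, which introduces a system of factual signals to improve the factual consistency of multimodal large language models (MLLMs) and substantially reduces visual hallucination.

Details

Motivation: Visual hallucination in MLLMs (for example, generating details that contradict the image) seriously undermines reliability, and existing fine-tuning methods offer limited improvement.

Result: On LLaVA-1.5-13B, GVF significantly outperforms standard fine-tuning while maintaining or improving performance on general multimodal benchmarks.

Insight: Systematically introducing factual signals reduces hallucination while preserving the model's general reasoning ability.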

Abstract: Visual hallucination, where Multimodal Large Language Models fabricate details inconsistent with image content, critically undermines their reliability. Existing fine-tuning methods offer limited improvement, failing to deeply intervene in factual reasoning. This paper introduces Grounded Visual Factualization (GVF) Finetuning, a novel approach to systematically enhance MLLM visual factual consistency. GVF integrates explicit factual signals via three core mechanisms: Factual Anchor Data Augmentation, enriching training data with structured factual anchors and counter-factual prompts; Fact-Aware Instruction Tuning, embedding these cues into explicit instructions; and a Factual Consistency Loss function, specifically penalizing factual inaccuracies. Evaluated on LLaVA-1.5-13B, GVF Finetuning significantly outperforms standard fine-tuning on the VHTest benchmark for both Open-Ended Question (OEQ) and Yes/No Question (YNQ) formats. Crucially, GVF maintains or even slightly improves performance on general multimodal benchmarks like MME and POPE, demonstrating effective mitigation of visual hallucinations without compromising general understanding and reasoning abilities.


[102] SpiderGen: Towards Procedure Generation For Carbon Life Cycle Assessments with Generative AI cs.CL | cs.CYPDF

Anupama Sitaraman, Bharathan Balaji, Yuvraj Agarwal

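TL;DR: SpiderGen is an LLM-based workflow that generates procedural information for carbon life cycle assessment (LCA), substantially reducing cost and time.

Details

Motivation: Climate change and greenhouse gas emissions are a global concern, and assessing the environmental impact of consumer products requires efficient LCA tools.

Result: SpiderGen reaches an F1 score of 62% across 10 samples, outperforms the other baseline techniques, costs less than 1 USD, and takes under 10 minutes.

Insight: LLMs have great potential in environmental assessment and can sharply reduce the high cost and long turnaround of traditional LCA.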

Abstract: Investigating the effects of climate change and global warming caused by GHG emissions has been a primary concern worldwide. These emissions are largely contributed to by the production, use and disposal of consumer products. Thus, it is important to build tools to estimate the environmental impact of consumer goods, an essential part of which is conducting Life Cycle Assessments (LCAs). LCAs specify and account for the appropriate processes involved with the production, use, and disposal of the products. We present SpiderGen, an LLM-based workflow which integrates the taxonomy and methodology of traditional LCA with the reasoning capabilities and world knowledge of LLMs to generate the procedural information used for LCA. We additionally evaluate the output of SpiderGen using real-world LCA documents as ground-truth. We find that SpiderGen provides accurate LCA process information that is either fully correct or has minor errors, achieving an F1-Score of 62% across 10 sample data points. We observe that the remaining missed processes and hallucinated errors occur primarily due to differences in detail between LCA documents, as well as differences in the “scope” of which auxiliary processes must also be included. We also demonstrate that SpiderGen performs better than several baseline techniques, such as chain-of-thought prompting and one-shot prompting. Finally, we highlight SpiderGen’s potential to reduce the human effort and costs for estimating carbon impact, as it is able to produce LCA process information for less than $1 USD in under 10 minutes, compared to the status quo LCA, which can cost over $25000 USD and take up to 21 person-days.
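
For readers unfamiliar with how an F1-Score over generated processes is formed, the sketch below computes precision, recall, and F1 between a generated process list and a ground-truth list. It assumes exact string matching between process names, which is almost certainly stricter than the matching used in the paper; the process names are invented for illustration.

```python
def process_f1(predicted, reference):
    """Set-based precision, recall, and F1 between two lists of process names."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0, 0.0, 0.0
    precision = true_positives / len(pred)  # share of generated processes that are correct
    recall = true_positives / len(ref)      # share of reference processes that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical LCA processes for a single product.
generated = ["raw material extraction", "assembly", "transport", "marketing"]
ground_truth = ["raw material extraction", "assembly", "transport", "use", "disposal"]
precision, recall, f1 = process_f1(generated, ground_truth)
```

Here the one extra generated process lowers precision and the two missed reference processes lower recall, mirroring the "hallucinated" and "missed" error types the abstract distinguishes.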


[103] A methodological analysis of prompt perturbations and their effect on attack success rates cs.CLPDF

Tiago Machado, Maysa Malfiza Garcia de Macedo, Rogerio Abreu de Paula, Marcelo Carpinette Grave, Aminat Adebiyi

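TL;DR: The paper studies how different large language model alignment methods affect the success rate of prompt attacks. It finds that even small prompt modifications can significantly change the attack success rate, which suggests that existing attack benchmarks may not reveal all vulnerabilities.

Details

Motivation: To examine how different alignment methods (such as SFT, DPO, and RLHF) affect the responses of large language models under prompt attacks, addressing gaps in existing attack evaluation.

Result: Small prompt modifications can significantly change the attack success rate (ASR), and sensitivity to attacks differs across alignment methods.

Insight: Sensitivity analysis of prompt perturbations offers a new angle for evaluating the safety of alignment methods and underlines the need for a more comprehensive attack testing framework.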

Abstract: This work aims to investigate how different Large Language Model (LLM) alignment methods affect the models’ responses to prompt attacks. We selected open source models based on the most common alignment methods, namely, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning with Human Feedback (RLHF). We conducted a systematic analysis using statistical methods to verify how sensitive the Attack Success Rate (ASR) is when we apply variations to prompts designed to elicit inappropriate content from LLMs. Our results show that even small prompt modifications can significantly change the ASR according to the statistical tests we ran, making the models more or less susceptible to types of attack. Critically, our results demonstrate that running existing ‘attack benchmarks’ alone may not be sufficient to elicit all possible vulnerabilities of both models and alignment methods. This paper thus contributes to ongoing efforts on model attack evaluation by means of systematic and statistically-based analyses of the different alignment methods and how sensitive their ASR is to prompt variation.


[104] Modeling and Predicting Multi-Turn Answer Instability in Large Language Models cs.CLPDF

Jiahang He, Rishi Ramachandran, Neel Ramachandran, Aryan Katakam, Kevin Zhu

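TL;DR: The paper studies answer instability of large language models (LLMs) in multi-turn conversation, models it with Markov chains, and uses linear probes to predict answer changes, exposing model fragility in multi-turn interaction.

Details

Motivation: As LLMs are deployed in a wide range of applications, their robustness becomes a key question. The paper evaluates answer stability in multi-turn conversation and explores ways to predict answer changes.

Result: Experiments show that multi-turn interaction lowers model accuracy (for example, a 10% drop for Gemini 1.5 Flash), that Markov chains model the dynamics effectively, and that linear probes can predict future changes.

Insight: Multi-turn interaction exposes the fragility of LLMs; establishing robustness metrics and deploying models in high-stakes settings both require addressing this instability.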

Abstract: As large language models (LLMs) are adopted in an increasingly wide range of applications, user-model interactions have grown in both frequency and scale. Consequently, research has focused on evaluating the robustness of LLMs, an essential quality for real-world tasks. In this paper, we employ simple multi-turn follow-up prompts to evaluate models’ answer changes, model accuracy dynamics across turns with Markov chains, and examine whether linear probes can predict these changes. Our results show significant vulnerabilities in LLM robustness: a simple “Think again” prompt led to an approximate 10% accuracy drop for Gemini 1.5 Flash over nine turns, while combining this prompt with a semantically equivalent reworded question caused a 7.5% drop for Claude 3.5 Haiku. Additionally, we find that model accuracy across turns can be effectively modeled using Markov chains, enabling the prediction of accuracy probabilities over time. This allows for estimation of the model’s stationary (long-run) accuracy, which we find to be on average approximately 8% lower than its first-turn accuracy for Gemini 1.5 Flash. Our results from a model’s hidden states also reveal evidence that linear probes can help predict future answer changes. Together, these results establish stationary accuracy as a principled robustness metric for interactive settings and expose the fragility of models under repeated questioning. Addressing this instability will be essential for deploying LLMs in high-stakes and interactive settings where consistent reasoning is as important as initial accuracy.
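
The stationary (long-run) accuracy mentioned in the abstract follows directly from a Markov chain over answer correctness. As a sketch, assume a two-state chain (correct, wrong) whose transition probabilities are estimated from per-turn correctness logs; the paper's chains may use more states, and the logs below are invented. With p = P(correct → wrong) and q = P(wrong → correct), the stationary accuracy is q / (p + q).

```python
def estimate_transitions(runs):
    """Estimate P(correct -> wrong) and P(wrong -> correct) from correctness logs.

    runs: list of sequences of booleans, one entry per turn.
    Assumes both states are left at least once, so neither denominator is zero.
    """
    counts = {(a, b): 0 for a in (True, False) for b in (True, False)}
    for run in runs:
        for prev, nxt in zip(run, run[1:]):
            counts[(prev, nxt)] += 1
    from_correct = counts[(True, True)] + counts[(True, False)]
    from_wrong = counts[(False, True)] + counts[(False, False)]
    return counts[(True, False)] / from_correct, counts[(False, True)] / from_wrong


def stationary_accuracy(p_lose, p_gain):
    """Long-run probability of being in the 'correct' state."""
    return p_gain / (p_lose + p_gain)


def accuracy_after(first_turn_accuracy, p_lose, p_gain, turns):
    """Propagate the accuracy through the chain for a number of follow-up turns."""
    acc = first_turn_accuracy
    for _ in range(turns):
        acc = acc * (1 - p_lose) + (1 - acc) * p_gain
    return acc


logs = [[True, True, False, False], [True, False, True, True]]
p_lose, p_gain = estimate_transitions(logs)
long_run = stationary_accuracy(0.1, 0.3)
```

Iterating `accuracy_after` from any first-turn accuracy converges to the same `stationary_accuracy` value, which is why the stationary figure can serve as a robustness metric that does not depend on how the conversation starts.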


[105] Saying the Unsaid: Revealing the Hidden Language of Multimodal Systems Through Telephone Games cs.CL | cs.AIPDF

Juntu Zhao, Jialing Zhang, Chongxuan Li, Dequan Wang

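TL;DR: Through multi-round "telephone game" experiments, the paper reveals the preference bias of closed-source multimodal systems and the hidden language behind their understanding, proposes a way to quantify concept connection strength, and contributes the Telescope dataset.

Details

Motivation: The black-box nature of closed-source multimodal systems makes their hidden language for understanding the world opaque; the study uses preference bias to expose it.

Result: The study reveals the hidden language of multimodal systems, assesses their generalization ability, and lays a foundation for future research on interpretability and controllability.

Insight: The preference bias of multimodal systems can be quantified experimentally, revealing their distinctive way of understanding the world and offering a new angle for improving models.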

Abstract: Recent closed-source multimodal systems have made great advances, but their hidden language for understanding the world remains opaque because of their black-box architectures. In this paper, we use the systems’ preference bias to study their hidden language: During the process of compressing the input images (typically containing multiple concepts) into texts and then reconstructing them into images, the systems’ inherent preference bias introduces specific shifts in the outputs, disrupting the original input concept co-occurrence. We employ the multi-round “telephone game” to strategically leverage this bias. By observing the co-occurrence frequencies of concepts in telephone games, we quantitatively investigate the concept connection strength in the understanding of multimodal systems, i.e., “hidden language.” We also contribute Telescope, a dataset of 10,000+ concept pairs, as the database of our telephone game framework. Our telephone game is test-time scalable: By iteratively running telephone games, we can construct a global map of concept connections in multimodal systems’ understanding. Here we can identify preference bias inherited from training, assess generalization capability advancement, and discover more stable pathways for fragile concept connections. Furthermore, we use Reasoning-LLMs to uncover unexpected concept relationships that transcend textual and visual similarities, inferring how multimodal systems understand and simulate the world. This study offers a new perspective on the hidden language of multimodal systems and lays the foundation for future research on the interpretability and controllability of multimodal systems.
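
The co-occurrence measurement at the heart of the telephone game can be sketched as below: given the set of concepts detected in the image after each round, count how often every pair of concepts survives together. Concept detection itself is assumed to happen elsewhere, and the rounds shown are invented; this is not the Telescope pipeline.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_frequencies(rounds):
    """Fraction of rounds in which each unordered concept pair appears together.

    rounds: list of concept collections, one per telephone-game round.
    """
    counts = Counter()
    for concepts in rounds:
        for pair in combinations(sorted(set(concepts)), 2):
            counts[pair] += 1
    return {pair: n / len(rounds) for pair, n in counts.items()}


# Hypothetical concepts detected after each image -> text -> image round.
rounds = [
    {"dog", "ball"},
    {"dog", "ball", "grass"},
    {"dog", "grass"},
    {"dog"},
]
frequencies = cooccurrence_frequencies(rounds)
```

Pairs whose frequency stays high over many rounds indicate strong concept connections in the system's understanding, while pairs that quickly drop out indicate fragile ones.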


[106] Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models cs.CL | cs.AIPDF

Zijian Chen, Wenjun Zhang, Guangtao Zhai

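TL;DR: The paper proposes "Squid Game", a dynamic adversarial evaluation framework that assesses large language models (LLMs) in multi-task, resource-constrained, information-asymmetric settings and exposes the limits of existing static benchmarks.

Details

Motivation: Current benchmarks struggle to keep pace with LLM development and may suffer from data contamination. Existing benchmarks also mostly assume benign, resource-rich settings and ignore how models behave under pressure.

Result: LLMs show a generational jump in performance under dynamic evaluation, and some models rely on speculative shortcuts to complete tasks, pointing to possible contamination in static benchmarks.

Insight: Dynamic adversarial evaluation reveals model ability more fully and correlates only weakly with static benchmarks, offering a new direction for building more robust evaluation frameworks.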

Abstract: Contemporary benchmarks are struggling to keep pace with the development of large language models (LLMs). Although they are indispensable for evaluating model performance on various tasks, it is uncertain whether models trained on Internet data have genuinely learned how to solve problems or have merely seen the questions before. This potential data contamination issue presents a fundamental challenge to establishing trustworthy evaluation frameworks. Meanwhile, existing benchmarks predominantly assume benign, resource-rich settings, leaving the behavior of LLMs under pressure unexplored. In this paper, we introduce Squid Game, a dynamic and adversarial evaluation environment with resource-constrained and asymmetric information settings designed to evaluate LLMs through interactive gameplay against other LLM opponents. Notably, Squid Game consists of six elimination-style levels, focusing on multi-faceted abilities, such as instruction-following, code, reasoning, planning, and safety alignment. We evaluate over 50 LLMs on Squid Game, presenting the largest behavioral evaluation study of general LLMs on dynamic adversarial scenarios. We observe a clear generational phase transition in performance within the same model lineage and find evidence that some models resort to speculative shortcuts to win the game, indicating the possibility of higher-level evaluation paradigm contamination in static benchmarks. Furthermore, we compare prominent LLM benchmarks and Squid Game with correlation analyses, highlighting that dynamic evaluation can serve as a complement to static evaluations. The code and data will be released in the future.


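[107] Do AI Voices Learn Social Nuances? A Case of Politeness and Speech Rate cs.CL | cs.AI | cs.HC | cs.SD

Eyal Rabin, Zohar Elyoseph, Rotem Israel-Fishelson, Adi Dali, Ravit Nussinson

TL;DR: The study shows that advanced text-to-speech systems can implicitly learn and reproduce non-obvious prosodic markers of human social interaction, such as slowing speech rate to convey politeness.

Details

Motivation: To investigate whether voice AI can learn and follow the implicit rules of human social interaction, in particular the non-obvious cue of conveying politeness through changes in speech rate.

Result: Speech under the polite prompt was significantly slower than under the casual prompt, with very large effect sizes, and the pattern held across voices on both platforms.

Insight: AI is not only a tool; it can also act as a social actor that reinforces human social norms, which shows both its potential and its complexity in social interaction.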

Abstract: Voice-based artificial intelligence is increasingly expected to adhere to human social conventions, but can it learn implicit cues that are not explicitly programmed? This study investigates whether state-of-the-art text-to-speech systems have internalized the human tendency to reduce speech rate to convey politeness - a non-obvious prosodic marker. We prompted 22 synthetic voices from two leading AI platforms (AI Studio and OpenAI) to read a fixed script under both “polite and formal” and “casual and informal” conditions and measured the resulting speech duration. Across both AI platforms, the polite prompt produced slower speech than the casual prompt with very large effect sizes, an effect that was statistically significant for all of AI Studio’s voices and for a large majority of OpenAI’s voices. These results demonstrate that AI can implicitly learn and replicate psychological nuances of human communication, highlighting its emerging role as a social actor capable of reinforcing human social norms.


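[108] Where does an LLM begin computing an instruction? cs.CL

Aditya Pola, Vineeth N. Balasubramanian

TL;DR: Using activation patching on minimal-contrast prompt pairs, the study locates the "onset" of instruction processing in an LLM, i.e., the layer at which instruction computation begins, and checks its consistency across tasks and model sizes.

Details

Motivation: To explore the internal mechanisms by which LLMs execute instructions and pinpoint the layer at which instruction computation begins, so as to better understand model behavior.

Result: Across Llama-family models, a clear onset point emerges: interventions before it change the prediction, while interventions after it have a much weaker effect. Multi-hop tasks show a similar onset location.

Insight: The onset of instruction computation is a key turning point in a model's internal processing, a finding that can inform future study and optimization of LLM internals.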

Abstract: Following an instruction involves distinct sub-processes, such as reading content, reading the instruction, executing it, and producing an answer. We ask where, along the layer stack, instruction following begins, the point where reading gives way to doing. We introduce three simple datasets (Key-Value, Quote Attribution, Letter Selection) and two hop compositions of these tasks. Using activation patching on minimal-contrast prompt pairs, we measure a layer-wise flip rate that indicates when substituting selected residual activations changes the predicted answer. Across models in the Llama family, we observe an inflection point, which we term onset, where interventions that change predictions before this point become largely ineffective afterward. Multi-hop compositions show a similar onset location. These results provide a simple, replicable way to locate where instruction following begins and to compare this location across tasks and model sizes.
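
A minimal sketch of the flip-rate measurement described in the abstract, assuming the patched predictions have already been collected for each layer. The function names, the 0.5 threshold, and the rule "onset is the first layer whose flip rate drops below the threshold" are illustrative assumptions, not the authors' definition.

```python
def flip_rate(clean_answers, patched_answers):
    """Fraction of prompts whose predicted answer changes after patching."""
    if len(clean_answers) != len(patched_answers):
        raise ValueError("answer lists must have the same length")
    flips = sum(c != p for c, p in zip(clean_answers, patched_answers))
    return flips / len(clean_answers)


def find_onset(rates_by_layer, threshold=0.5):
    """Assumed rule: first layer where patching stops flipping most answers."""
    for layer, rate in enumerate(rates_by_layer):
        if rate < threshold:
            return layer
    return None  # flip rate never drops below the threshold


# Toy flip rates per layer (made-up numbers, not results from the paper).
toy_rates = [0.95, 0.90, 0.85, 0.30, 0.10, 0.05]
onset_layer = find_onset(toy_rates)
```

With the toy rates above, patching remains effective through layer 2 and the assumed rule places the onset at layer 3.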


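[109] “As Eastern Powers, I will veto.” : An Investigation of Nation-level Bias of Large Language Models in International Relations cs.CL

Jonghyeon Choi, Yeonjun Choi, Hyun-chul Kim, Beakcheol Jang

TL;DR: The paper systematically studies nation-level bias of large language models (LLMs) in International Relations (IR), proposes a bias evaluation framework, and finds that bias varies by model and task. It also proposes a debiasing framework that combines Retrieval-Augmented Generation with self-reflection, which markedly reduces bias and improves performance.

Details

Motivation: LLMs are increasingly used in IR, yet their potential biases remain under-studied. The authors aim to expose nation-level bias in LLMs to promote fairer use of these models.

Result: Experiments show that the method effectively reduces nation-level bias in GPT-4o-mini and LLama-3.3-70B while also improving performance.

Insight: LLM bias is dynamic and multidimensional and is related to reasoning ability, so bias and performance should be assessed together when applying LLMs in IR.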

Abstract: This paper systematically examines nation-level biases exhibited by Large Language Models (LLMs) within the domain of International Relations (IR). Leveraging historical records from the United Nations Security Council (UNSC), we developed a bias evaluation framework comprising three distinct tests to explore nation-level bias in various LLMs, with a particular focus on the five permanent members of the UNSC. Experimental results show that, even with the general bias patterns across models (e.g., favorable biases toward the western nations, and unfavorable biases toward Russia), these still vary based on the LLM. Notably, even within the same LLM, the direction and magnitude of bias for a nation change depending on the evaluation context. This observation suggests that LLM biases are fundamentally multidimensional, varying across models and tasks. We also observe that models with stronger reasoning abilities show reduced bias and better performance. Building on this finding, we introduce a debiasing framework that improves LLMs’ factual reasoning combining Retrieval-Augmented Generation with Reflexion-based self-reflection techniques. Experiments show it effectively reduces nation-level bias, and improves performance, particularly in GPT-4o-mini and LLama-3.3-70B. Our findings emphasize the need to assess nation-level bias alongside performance when applying LLMs in the IR domain.


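[110] $π$-Attention: Periodic Sparse Transformers for Efficient Long-Context Modeling cs.CL | cs.AI

Dong Liu, Yanxuan Yu

TL;DR: The paper proposes $π$-Attention, a periodic sparse Transformer that combines ring-local attention, deterministic $π$-stride skips, and an adaptive fusion gate for efficient and adaptive long-context modeling.

Details

Motivation: The quadratic complexity of standard Transformers limits the efficiency of long-range modeling, and existing sparse attention mechanisms such as RingAttention cut computation but suffer from limited receptive fields and a lack of adaptability.

Result: Experiments show that, at the same context length, $π$-Attention achieves 8.3% lower perplexity than RingAttention while using 50% fewer GPUs, with quality that matches or surpasses dense attention.

Insight: Periodic skips, adaptive fusion, and head-level sparsity coordination are key to efficient long-context modeling.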

Abstract: Transformers have revolutionized natural language processing, but their quadratic complexity with respect to sequence length remains a fundamental bottleneck for long-range modeling. While sparse attention mechanisms like RingAttention reduce computational costs by restricting attention to local neighborhoods, they suffer from limited receptive fields and lack of adaptability. We present $π$-Attention, a periodic sparse Transformer that factorizes attention into ring-local neighborhoods, deterministic $π$-stride skips, and an adaptive fusion gate. The periodic structure provides predictable coverage of distant tokens, while the sparse footprint keeps the per-layer complexity linear in context length. We prove that $π$-Attention achieves $\mathcal{O}(kL + π\log L)$ receptive field growth compared to $\mathcal{O}(kL)$ for RingAttention, where $k$ is the local window size, $π$ is the skip period, and $L$ is the sequence length. Extensive experiments on language modeling, retrieval, and vision-language tasks demonstrate that $π$-Attention matches or surpasses dense attention quality with 8.3% lower perplexity than RingAttention while using 50% fewer GPUs for the same context length. Our detailed ablations and visualizations reveal the importance of periodic skips, adaptive fusion, and head-level sparsity coordination for efficient long-context modeling.


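[111] Faithful Summarization of Consumer Health Queries: A Cross-Lingual Framework with LLMs cs.CL

Ajwad Abrar, Nafisa Tabassum Oeshy, Prianka Maheru, Farzana Tabassum, Tareque Mohmud Chowdhury

TL;DR: The paper proposes a framework that combines TextRank with medical named entity recognition and uses a fine-tuned LLaMA-2-7B model to summarize consumer health questions on English and Bangla datasets, improving summary faithfulness and quality.

Details

Motivation: In healthcare, summarizing consumer health questions (CHQs) can ease doctor-patient communication, but unfaithful summaries may misrepresent medical information and pose serious risks, so ensuring faithfulness is essential.

Result: The approach outperforms zero-shot baselines and prior systems on quality and faithfulness metrics such as ROUGE and BERTScore. Human evaluation shows that over 80% of summaries preserve critical medical information.

Insight: Faithfulness is a key dimension of reliable medical summarization, and the approach shows potential for safer deployment of LLMs in healthcare.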

Abstract: Summarizing consumer health questions (CHQs) can ease communication in healthcare, but unfaithful summaries that misrepresent medical details pose serious risks. We propose a framework that combines TextRank-based sentence extraction and medical named entity recognition with large language models (LLMs) to enhance faithfulness in medical text summarization. In our experiments, we fine-tuned the LLaMA-2-7B model on the MeQSum (English) and BanglaCHQ-Summ (Bangla) datasets, achieving consistent improvements across quality (ROUGE, BERTScore, readability) and faithfulness (SummaC, AlignScore) metrics, and outperforming zero-shot baselines and prior systems. Human evaluation further shows that over 80% of generated summaries preserve critical medical information. These results highlight faithfulness as an essential dimension for reliable medical summarization and demonstrate the potential of our approach for safer deployment of LLMs in healthcare contexts.


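[112] Leveraging Parameter Space Symmetries for Reasoning Skill Transfer in LLMs cs.CL | cs.AI | cs.LG

Stefan Horoi, Sangwoo Cho, Supriyo Chakraborty, Shi-Xiong Zhang, Sambit Sahu

TL;DR: The paper proposes aligning model parameter spaces using the parameter-space symmetries of Transformer architectures (permutation, rotation, and scaling), which improves skill transfer between large language models (LLMs).

Details

Motivation: Standard task arithmetic is prone to negative interference when the models' parameter spaces have diverged, which limits skill transfer.

Result: On challenging reasoning benchmarks, the method consistently outperforms standard task arithmetic.

Insight: Aligning parameter-space symmetries is key to effective skill transfer across LLMs; it reduces redundant fine-tuning and improves model adaptability.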

Abstract: Task arithmetic is a powerful technique for transferring skills between Large Language Models (LLMs), but it often suffers from negative interference when models have diverged during training. We address this limitation by first aligning the models’ parameter spaces, leveraging the inherent permutation, rotation, and scaling symmetries of Transformer architectures. We adapt parameter space alignment for modern Grouped-Query Attention (GQA) and SwiGLU layers, exploring both weight-based and activation-based approaches. Using this alignment-first strategy, we successfully transfer advanced reasoning skills to a non-reasoning model. Experiments on challenging reasoning benchmarks show that our method consistently outperforms standard task arithmetic. This work provides an effective approach for merging and transferring specialized skills across evolving LLM families, reducing redundant fine-tuning and enhancing model adaptability.


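[113] From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models cs.CL | cs.LO | cs.SE

Farima Fatahi Bayat, Pouya Pezeshkpour, Estevam Hruschka

TL;DR: The paper studies Tool-Induced Myopia (TIM), a reasoning failure that arises when tool-augmented language models (TaLMs) call external tools, and shows that tool use raises final-answer accuracy but degrades reasoning quality.

Details

Motivation: TaLMs extend their problem-solving ability by invoking external tools, but the reliability and coherence of their reasoning have not been well studied. The paper examines whether tool use leads models to rely on tool outputs instead of their own reasoning.

Result: TaLMs gain up to 19.3 percentage points in final-answer accuracy, but reasoning quality drops (non-tool LLMs win up to 41.5% more often in pairwise comparisons of the reasoning process). The more frequently tools are invoked, the less coherent the reasoning. TIM is present in about 55% of high-risk cases.

Insight: Tool use can lead models to over-rely on tool outputs and neglect their own reasoning, especially on tasks that require creativity and global reasoning. Preference optimization can effectively mitigate the problem.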

Abstract: Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. However, it remains unclear whether these tool-enabled gains reflect trustworthy reasoning. Focusing on the Code Interpreter tool, we show that even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning, producing solutions that appear correct but lack coherent justification. We term this failure mode Tool-Induced Myopia (TIM), and study it using PYMATH, a benchmark of 1,679 competition-level mathematical problems for which Python code is helpful but not sufficient. We further develop a multi-dimensional evaluation suite to quantify reasoning degradation in TaLMs relative to their non-tool counterparts. Our findings reveal that while TaLMs achieve up to a 19.3 percentage point gain in final-answer accuracy, their reasoning behavior consistently deteriorates (e.g., non-tool LLMs win up to 41.5% more often in pairwise comparisons of the reasoning process). This degradation intensifies with tool use; the more frequently a model invokes tools, the less coherent its reasoning becomes. Moreover, tool use shifts errors from arithmetic mistakes toward global reasoning failures (logic, assumption, creativity); with TIM present in ~55% of high-risk cases. Finally, we propose a preference-optimization-based framework that realigns TaLMs to use tools as assistive evidence, improving both final-answer accuracy and reasoning depth under tool use. Codes and data are available at: https://github.com/megagonlabs/TIM.


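[114] Expert-Guided Prompting and Retrieval-Augmented Generation for Emergency Medical Service Question Answering cs.CL | cs.AI

Xueren Ge, Sahil Murtaza, Anthony Cortez, Homa Alemzadeh

TL;DR: The paper introduces the EMSQA dataset and two methods, Expert-CoT and ExpertRAG, which incorporate domain expertise to improve question answering for emergency medical services (EMS).

Details

Motivation: Existing large language models (LLMs) overlook domain expertise in medical question answering, such as clinical subject area and certification level, which limits performance in high-stakes settings.

Result: Expert-CoT improves up to 2.05% over vanilla CoT; combining it with ExpertRAG yields up to a 4.59% accuracy gain over standard RAG baselines. The 32B expertise-augmented models pass all simulated EMS certification exams.

Insight: Combining domain expertise with retrieval improves LLM performance in medical question answering, especially where specialized knowledge is required.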

Abstract: Large language models (LLMs) have shown promise in medical question answering, yet they often overlook the domain-specific expertise that professionals depend on, such as the clinical subject areas (e.g., trauma, airway) and the certification level (e.g., EMT, Paramedic). Existing approaches typically apply general-purpose prompting or retrieval strategies without leveraging this structured context, limiting performance in high-stakes settings. We address this gap with EMSQA, a 24.3K-question multiple-choice dataset spanning 10 clinical subject areas and 4 certification levels, accompanied by curated, subject area-aligned knowledge bases (40K documents and 2M tokens). Building on EMSQA, we introduce (i) Expert-CoT, a prompting strategy that conditions chain-of-thought (CoT) reasoning on the specific clinical subject area and certification level, and (ii) ExpertRAG, a retrieval-augmented generation pipeline that grounds responses in subject area-aligned documents and real-world patient data. Experiments on 4 LLMs show that Expert-CoT improves up to 2.05% over vanilla CoT prompting. Additionally, combining Expert-CoT with ExpertRAG yields up to a 4.59% accuracy gain over standard RAG baselines. Notably, the 32B expertise-augmented LLMs pass all the computer-adaptive EMS certification simulation exams.


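[115] Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions cs.CL

Mengze Hong, Di Jiang, Weiwei Zhao, Yawen Li, Yihang Wang

TL;DR: The paper presents an interactive peer-review simulation system built on multimodal large language models that gives structured, actionable suggestions for revising manuscripts.

Details

Motivation: Existing academic peer-review systems are constrained by text-only inputs, limited contextual grounding, and a lack of actionable feedback, so they do not realize the full potential of large language models.

Result: Experiments show that the system generates more comprehensive and useful reviews aligned with expert standards, surpassing ablated baselines.

Insight: Combining multimodal information with retrieval-augmented generation improves the quality and usefulness of peer review and provides transparent, human-centered assistance for manuscript revision.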

Abstract: While large language models (LLMs) offer promising capabilities for automating academic workflows, existing systems for academic peer review remain constrained by text-only inputs, limited contextual grounding, and a lack of actionable feedback. In this work, we present an interactive web-based system for multimodal, community-aware peer review simulation to enable effective manuscript revisions before paper submission. Our framework integrates textual and visual information through multimodal LLMs, enhances review quality via retrieval-augmented generation (RAG) grounded in web-scale OpenReview data, and converts generated reviews into actionable to-do lists using the proposed Action:Objective[#] format, providing structured and traceable guidance. The system integrates seamlessly into existing academic writing platforms, providing interactive interfaces for real-time feedback and revision tracking. Experimental results highlight the effectiveness of the proposed system in generating more comprehensive and useful reviews aligned with expert standards, surpassing ablated baselines and advancing transparent, human-centered scholarly assistance.


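[116] Automated Analysis of Learning Outcomes and Exam Questions Based on Bloom’s Taxonomy cs.CL | cs.AI

Ramya Kumar, Dhruv Gulwani, Sonit Singh

TL;DR: The paper studies automatic classification of exam questions and learning outcomes under Bloom’s Taxonomy, comparing traditional machine learning models, RNN architectures, Transformer models, and large language models. SVM with data augmentation performs best, while more complex models tend to overfit on limited data.

Details

Motivation: Applying Bloom’s Taxonomy in education calls for efficient, accurate automated tools, but existing methods perform poorly on limited data, where complex models in particular tend to overfit.

Result: SVM with data augmentation performs best (94% accuracy); the RNN models and BERT overfit severely; zero-shot LLMs reach about 0.72-0.73 accuracy.

Insight: With limited data, simple models plus data augmentation can outperform complex deep models; LLMs need no training but still leave room for improvement.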

Abstract: This paper explores the automatic classification of exam questions and learning outcomes according to Bloom’s Taxonomy. A small dataset of 600 sentences labeled with six cognitive categories - Knowledge, Comprehension, Application, Analysis, Synthesis, and Evaluation - was processed using traditional machine learning (ML) models (Naive Bayes, Logistic Regression, Support Vector Machines), recurrent neural network architectures (LSTM, BiLSTM, GRU, BiGRU), transformer-based models (BERT and RoBERTa), and large language models (OpenAI, Gemini, Ollama, Anthropic). Each model was evaluated under different preprocessing and augmentation strategies (for example, synonym replacement, word embeddings, etc.). Among traditional ML approaches, Support Vector Machines (SVM) with data augmentation achieved the best overall performance, reaching 94 percent accuracy, recall, and F1 scores with minimal overfitting. In contrast, the RNN models and BERT suffered from severe overfitting, while RoBERTa initially overcame it but began to show signs as training progressed. Finally, zero-shot evaluations of large language models (LLMs) indicated that OpenAI and Gemini performed best among the tested LLMs, achieving approximately 0.72-0.73 accuracy and comparable F1 scores. These findings highlight the challenges of training complex deep models on limited data and underscore the value of careful data augmentation and simpler algorithms (such as augmented SVM) for Bloom’s Taxonomy classification.


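[117] Evaluating Large Language Models on Rare Disease Diagnosis: A Case Study using House M.D cs.CL | cs.AI

Arsh Gupta, Ajay Narayanan Sridhar, Bonam Mingole, Amulya Yadav

TL;DR: The paper builds a rare disease diagnosis dataset from the television series House M.D. and evaluates four state-of-the-art LLMs on narrative medical reasoning, finding that newer model generations improve markedly but rare disease diagnosis remains challenging.

Details

Motivation: LLMs perform well across many domains, but their ability to diagnose rare diseases from narrative cases is underexplored, and an educationally validated benchmark is needed.

Result: Model accuracy ranges from 16.48% to 38.64%, with newer model generations showing a 2.3-fold improvement, yet rare disease diagnosis remains difficult.

Insight: LLMs show promise for rare disease diagnosis but need further improvement; a public dataset and evaluation framework help advance AI-assisted diagnosis research.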

Abstract: Large language models (LLMs) have demonstrated capabilities across diverse domains, yet their performance on rare disease diagnosis from narrative medical cases remains underexplored. We introduce a novel dataset of 176 symptom-diagnosis pairs extracted from House M.D., a medical television series validated for teaching rare disease recognition in medical education. We evaluate four state-of-the-art LLMs (GPT 4o mini, GPT 5 mini, Gemini 2.5 Flash, and Gemini 2.5 Pro) on narrative-based diagnostic reasoning tasks. Results show significant variation in performance, ranging from 16.48% to 38.64% accuracy, with newer model generations demonstrating a 2.3 times improvement. While all models face substantial challenges with rare disease diagnosis, the observed improvement across architectures suggests promising directions for future development. Our educationally validated benchmark establishes baseline performance metrics for narrative medical reasoning and provides a publicly accessible evaluation framework for advancing AI-assisted diagnosis research.
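
As a quick sanity check on the reported figures, the sketch below converts the two accuracy values into correct-answer counts. It assumes that each accuracy is computed over all 176 symptom-diagnosis pairs and that the 2.3 times improvement compares the highest and lowest reported accuracies; the abstract confirms neither assumption.

```python
TOTAL_CASES = 176  # dataset size stated in the abstract


def correct_count(accuracy_percent, total=TOTAL_CASES):
    """Number of correct diagnoses implied by an accuracy percentage."""
    return round(accuracy_percent / 100 * total)


lowest = correct_count(16.48)   # weakest model in the reported range
highest = correct_count(38.64)  # strongest model in the reported range

# Assumed reading of "2.3 times improvement": best accuracy over worst.
improvement = 38.64 / 16.48
```

Under these assumptions the range corresponds to 29 and 68 correct diagnoses out of 176, and the ratio rounds to 2.3.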


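[118] When Data is the Algorithm: A Systematic Study and Curation of Preference Optimization Datasets cs.CL | cs.AI

Aladin Djuhera, Farhan Ahmed, Swanand Ravindra Kadhe, Syed Zawad, Heiko Ludwig

TL;DR: This paper presents the first comprehensive, data-centric analysis of open-source direct preference optimization (DPO) datasets, uses the Magpie framework to annotate sample quality, and builds on the results to curate UltraMix, a more efficient mixed dataset.

Details

Motivation: Systematic comparisons of open-source DPO datasets are scarce, largely because of high computational cost and a lack of quality annotations, which makes it hard to understand how preferences were selected, which task types they span, and how well they reflect human judgment.

Result: UltraMix is 30% smaller than the best-performing individual dataset yet exceeds its performance on key benchmarks. All annotations and the curated dataset are publicly released.

Insight: A data-centric approach can improve the quality and efficiency of preference data, and structural and quality differences between datasets matter for model optimization.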

Abstract: Aligning large language models (LLMs) is a central objective of post-training, often achieved through reward modeling and reinforcement learning methods. Among these, direct preference optimization (DPO) has emerged as a widely adopted technique that fine-tunes LLMs on preferred completions over less favorable ones. While most frontier LLMs do not disclose their curated preference pairs, the broader LLM community has released several open-source DPO datasets, including TuluDPO, ORPO, UltraFeedback, HelpSteer, and Code-Preference-Pairs. However, systematic comparisons remain scarce, largely due to the high computational cost and the lack of rich quality annotations, making it difficult to understand how preferences were selected, which task types they span, and how well they reflect human judgment on a per-sample level. In this work, we present the first comprehensive, data-centric analysis of popular open-source DPO corpora. We leverage the Magpie framework to annotate each sample for task category, input quality, and preference reward, a reward-model-based signal that validates the preference order without relying on human annotations. This enables a scalable, fine-grained inspection of preference quality across datasets, revealing structural and qualitative discrepancies in reward margins. Building on these insights, we systematically curate a new DPO mixture, UltraMix, that draws selectively from all five corpora while removing noisy or redundant samples. UltraMix is 30% smaller than the best-performing individual dataset yet exceeds its performance across key benchmarks. We publicly release all annotations, metadata, and our curated mixture to facilitate future research in data-centric preference optimization.


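[119] AV-Dialog: Spoken Dialogue Models with Audio-Visual Input cs.CL | cs.AI | cs.CV | cs.MM | cs.SD

Tuochao Chen, Bandhav Veluri, Hongyu Gong, Shyamnath Gollakota

TL;DR: AV-Dialog is a multimodal dialogue framework that combines audio and visual cues to handle conversation in noisy, multi-speaker environments, achieving natural conversational flow through multi-task, multi-stage training.

Details

Motivation: Conventional dialogue models perform poorly in noisy or multi-speaker environments, often producing irrelevant responses and awkward turn-taking, so audio and visual signals need to be combined for robustness.

Result: Experiments show that AV-Dialog outperforms audio-only models under interference, reducing transcription errors and improving turn-taking prediction and human-rated dialogue quality.

Insight: Adding visual signals improves the robustness and naturalness of dialogue systems, laying groundwork for spoken dialogue agents in real-world noisy environments.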

Abstract: Dialogue models falter in noisy, multi-speaker environments, often producing irrelevant responses and awkward turn-taking. We present AV-Dialog, the first multimodal dialog framework that uses both audio and visual cues to track the target speaker, predict turn-taking, and generate coherent responses. By combining acoustic tokenization with multi-task, multi-stage training on monadic, synthetic, and real audio-visual dialogue datasets, AV-Dialog achieves robust streaming transcription, semantically grounded turn-boundary detection and accurate responses, resulting in a natural conversational flow. Experiments show that AV-Dialog outperforms audio-only models under interference, reducing transcription errors, improving turn-taking prediction, and enhancing human-rated dialogue quality. These results highlight the power of seeing as well as hearing for speaker-aware interaction, paving the way for spoken dialogue agents that perform robustly in real-world, noisy environments.


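[120] Enhancing Meme Emotion Understanding with Multi-Level Modality Enhancement and Dual-Stage Modal Fusion cs.CL | cs.CV

Yi Shi, Wenlong Meng, Zhenyuan Guo, Chengkun Wei, Wenzhi Chen

TL;DR: The paper proposes MemoDetector, a new framework for Meme Emotion Understanding (MEU) that uses multi-level modality enhancement and dual-stage modal fusion to address two main challenges in existing methods.

Details

Motivation: With the rapid growth of social media and Internet culture, memes have become a popular medium for expressing emotional tendencies. Existing MEU methods, however, fall short in fine-grained multimodal fusion and in mining memes' implicit meanings.

Result: On two datasets, MET-MEME and MOOD, MemoDetector improves F1 scores by 4.3% and 3.4% respectively, and ablation studies validate the method's effectiveness and robustness.

Insight: Exploiting the knowledge and reasoning capabilities of multimodal large language models together with a staged modal fusion strategy can improve meme emotion understanding.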

Abstract: With the rapid rise of social media and Internet culture, memes have become a popular medium for expressing emotional tendencies. This has sparked growing interest in Meme Emotion Understanding (MEU), which aims to classify the emotional intent behind memes by leveraging their multimodal contents. While existing efforts have achieved promising results, two major challenges remain: (1) a lack of fine-grained multimodal fusion strategies, and (2) insufficient mining of memes’ implicit meanings and background knowledge. To address these challenges, we propose MemoDetector, a novel framework for advancing MEU. First, we introduce a four-step textual enhancement module that utilizes the rich knowledge and reasoning capabilities of Multimodal Large Language Models (MLLMs) to progressively infer and extract implicit and contextual insights from memes. These enhanced texts significantly enrich the original meme contents and provide valuable guidance for downstream classification. Next, we design a dual-stage modal fusion strategy: the first stage performs shallow fusion on raw meme image and text, while the second stage deeply integrates the enhanced visual and textual features. This hierarchical fusion enables the model to better capture nuanced cross-modal emotional cues. Experiments on two datasets, MET-MEME and MOOD, demonstrate that our method consistently outperforms state-of-the-art baselines. Specifically, MemoDetector improves F1 scores by 4.3% on MET-MEME and 3.4% on MOOD. Further ablation studies and in-depth analyses validate the effectiveness and robustness of our approach, highlighting its strong potential for advancing MEU. Our code is available at https://github.com/singing-cat/MemoDetector.


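[121] PRSM: A Measure to Evaluate CLIP’s Robustness Against Paraphrases cs.CL | cs.CY | cs.LG

Udo Schlegel, Franziska Weeber, Jian Lan, Thomas Seidl

TL;DR: The paper proposes PRSM, a new metric for evaluating CLIP’s robustness to paraphrased text, and shows experimentally that stability differs for gender-related queries.

Details

Motivation: CLIP’s robustness to linguistic variation, particularly paraphrasing, is underexplored, which matters especially in socially sensitive contexts.

Result: Experiments show that CLIP’s robustness to paraphrasing varies across paraphrasing strategies, with subtle yet consistent differences between male- and female-associated queries.

Insight: Fairness of multimodal models must account for robustness to linguistic variation; in sensitive settings, paraphrasing may amplify social biases.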

Abstract: Contrastive Language-Image Pre-training (CLIP) is a widely used multimodal model that aligns text and image representations through large-scale training. While it performs strongly on zero-shot and few-shot tasks, its robustness to linguistic variation, particularly paraphrasing, remains underexplored. Paraphrase robustness is essential for reliable deployment, especially in socially sensitive contexts where inconsistent representations can amplify demographic biases. In this paper, we introduce the Paraphrase Ranking Stability Metric (PRSM), a novel measure for quantifying CLIP’s sensitivity to paraphrased queries. Using the Social Counterfactuals dataset, a benchmark designed to reveal social and demographic biases, we empirically assess CLIP’s stability under paraphrastic variation, examine the interaction between paraphrase robustness and gender, and discuss implications for fairness and equitable deployment of multimodal systems. Our analysis reveals that robustness varies across paraphrasing strategies, with subtle yet consistent differences observed between male- and female-associated queries.


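[122] Adverbs Revisited: Enhancing WordNet Coverage of Adverbs with a Supersense Taxonomy cs.CL

Jooyoung Lee, Jader Martins Camboim de Sá

TL;DR: To address WordNet's limited treatment of adverbs, the paper proposes a linguistically grounded supersense taxonomy for adverbs and validates its coverage and reliability through an annotation study, extending WordNet and supporting a range of NLP tasks.

Details

Motivation: WordNet has detailed supersense hierarchies for nouns and verbs, but adverbs lack a systematic semantic classification, which limits its use in natural language processing.

Result: The pilot annotation study shows that the proposed categories provide broad coverage of adverbs in natural text and can be reliably assigned by annotators.

Insight: A systematic semantic classification of adverbs extends WordNet's coverage and supports NLP tasks such as word sense disambiguation and sentiment analysis.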

Abstract: WordNet offers rich supersense hierarchies for nouns and verbs, yet adverbs remain underdeveloped, lacking a systematic semantic classification. We introduce a linguistically grounded supersense typology for adverbs, empirically validated through annotation, that captures major semantic domains including manner, temporal, frequency, degree, domain, speaker-oriented, and subject-oriented functions. Results from a pilot annotation study demonstrate that these categories provide broad coverage of adverbs in natural text and can be reliably assigned by human annotators. Incorporating this typology extends WordNet’s coverage, aligns it more closely with linguistic theory, and facilitates downstream NLP applications such as word sense disambiguation, event extraction, sentiment analysis, and discourse modeling. We present the proposed supersense categories, annotation outcomes, and directions for future work.


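[123] iMAD: Intelligent Multi-Agent Debate for Efficient and Accurate LLM Inference cs.CL | cs.AI | cs.MA

Wei Fan, JinYi Yoon, Bo Ji

TL;DR: iMAD is an intelligent multi-agent debate framework that triggers Multi-Agent Debate (MAD) selectively, improving LLM inference efficiency and accuracy while avoiding unnecessary computation.

Details

Motivation: Existing MAD frameworks run a multi-agent debate for every query, which is computationally costly and may even reduce accuracy. iMAD aims to decide when a debate is worth triggering.

Result: On six question answering datasets, iMAD reduces token usage by up to 92% and improves accuracy by up to 13.5%.

Insight: Selectively triggering multi-agent debate is an effective way to optimize LLM inference, and linguistic features can serve as a useful basis for the debate decision.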

Abstract: Large Language Model (LLM) agent systems have advanced rapidly, driven by their strong generalization in zero-shot settings. To further enhance reasoning and accuracy on complex tasks, Multi-Agent Debate (MAD) has emerged as a promising framework that engages multiple LLM agents in structured debates to encourage diverse reasoning. However, triggering MAD for every query is inefficient, as it incurs substantial computational (token) cost and may even degrade accuracy by overturning correct single-agent answers. To address these limitations, we propose intelligent Multi-Agent Debate (iMAD), a token-efficient framework that selectively triggers MAD only when it is likely to be beneficial (i.e., correcting an initially wrong answer). To achieve this goal, iMAD learns generalizable model behaviors to make accurate debate decisions. Specifically, iMAD first prompts a single agent to produce a structured self-critique response, from which we extract 41 interpretable linguistic and semantic features capturing hesitation cues. Then, iMAD uses a lightweight debate-decision classifier, trained using our proposed FocusCal loss, to determine whether to trigger MAD, enabling robust debate decisions without test dataset-specific tuning. Through extensive experiments using six (visual) question answering datasets against five competitive baselines, we have shown that iMAD significantly reduces token usage (by up to 92%) while also improving final answer accuracy (by up to 13.5%).
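
The selective-trigger idea can be illustrated with a toy gate. This is a sketch under stated assumptions, not iMAD itself: the two features, the hedge-word list, the logistic weights, and the 0.5 threshold are all hypothetical stand-ins for the 41 features and the FocusCal-trained classifier mentioned in the abstract.

```python
import math

# Hypothetical hesitation cues; the paper's actual feature set is not given here.
HEDGES = ("maybe", "possibly", "not sure", "might", "unclear")


def hesitation_features(critique):
    """Two toy features: number of hedge phrases and response length in words."""
    text = critique.lower()
    hedge_count = sum(text.count(h) for h in HEDGES)
    return [float(hedge_count), float(len(text.split()))]


def should_debate(critique, weights=(1.2, 0.0), bias=-1.0, threshold=0.5):
    """Trigger a debate when a toy logistic score suggests the answer is shaky."""
    feats = hesitation_features(critique)
    z = bias + sum(w * f for w, f in zip(weights, feats))
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability >= threshold


confident = should_debate("The answer is 42.")
hesitant = should_debate("Maybe 42, but I am not sure; it might be 41.")
```

Here the confident self-critique skips the debate and the hesitant one triggers it, which is the behavior a debate-decision classifier is meant to approximate at much higher fidelity.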


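[124] NOVA: An Agentic Framework for Automated Histopathology Analysis and Discovery cs.CL | cs.AI

Anurag J. Vaidya, Felix Meissen, Daniel C. Castro, Shruthi Bannur, Tristan Lazard

TL;DR: NOVA is an agentic framework for automated histopathology analysis that translates scientific queries into executable Python analysis pipelines, integrating 49 domain-specific tools and supporting ad hoc tool creation. On the SlideQuest benchmark, NOVA outperforms baselines on multi-step reasoning and coding tasks, and a case study illustrates its potential for scalable discovery.

Details

Motivation: Digitized histopathology analysis typically requires complex workflows and specialized expertise, which limits its accessibility. NOVA aims to address this through automation, improving efficiency and scalability.

Result: NOVA outperforms coding-agent baselines on SlideQuest, and a pathologist-verified case study links morphology to prognostically relevant PAM50 subtypes.

Insight: Agent-based automated analysis frameworks can simplify complex medical tasks; multi-step reasoning and dynamic tool creation are the core strengths.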

Abstract: Digitized histopathology analysis involves complex, time-intensive workflows and specialized expertise, limiting its accessibility. We introduce NOVA, an agentic framework that translates scientific queries into executable analysis pipelines by iteratively generating and running Python code. NOVA integrates 49 domain-specific tools (e.g., nuclei segmentation, whole-slide encoding) built on open-source software, and can also create new tools ad hoc. To evaluate such systems, we present SlideQuest, a 90-question benchmark – verified by pathologists and biomedical scientists – spanning data processing, quantitative analysis, and hypothesis testing. Unlike prior biomedical benchmarks focused on knowledge recall or diagnostic QA, SlideQuest demands multi-step reasoning, iterative coding, and computational problem solving. Quantitative evaluation shows NOVA outperforms coding-agent baselines, and a pathologist-verified case study links morphology to prognostically relevant PAM50 subtypes, demonstrating its scalable discovery potential.


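[125] LaoBench: A Large-Scale Multidimensional Lao Benchmark for Large Language Models cs.CL

Jian Gao, Richeng Xuan, Zhaolu Kang, Dingshi Liao, Wenxin Huang

TL;DR: LaoBench is a large-scale, high-quality, multidimensional Lao benchmark dataset for evaluating the language understanding and reasoning abilities of large language models in Lao, filling a gap in low-resource language evaluation.

Details

Motivation: The rapid progress of large language models has not been matched by their evaluation in low-resource languages such as Lao, a Southeast Asian language; a dedicated benchmark is needed to drive research and development.

Result: Tests on current state-of-the-art large language models show that these models still face significant challenges across diverse Lao tasks.

Insight: LaoBench may advance AI research for low-resource Southeast Asian languages and underscores the importance of cross-cultural, multilingual evaluation.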

Abstract: The rapid advancement of large language models (LLMs) has not been matched by their evaluation in low-resource languages, especially Southeast Asian languages like Lao. To fill this gap, we introduce LaoBench, the first large-scale, high-quality, and multidimensional benchmark dataset dedicated to assessing LLMs’ comprehensive language understanding and reasoning abilities in Lao. LaoBench comprises over 17,000 carefully curated samples spanning three core dimensions: knowledge application, K12 foundational education, and bilingual translation among Lao, Chinese, and English. The dataset is divided into open-source and closed-source subsets, with the closed-source portion enabling black-box evaluation on an official platform to ensure fairness and data security. Our data construction pipeline integrates expert human curation with automated agent-assisted verification, ensuring linguistic accuracy, cultural relevance, and educational value. Benchmarking multiple state-of-the-art LLMs on LaoBench reveals that current models still face significant challenges in mastering Lao across diverse tasks. We hope LaoBench will catalyze further research and development of AI technologies for underrepresented Southeast Asian languages.


[126] W2S-AlignTree: Weak-to-Strong Inference-Time Alignment for Large Language Models via Monte Carlo Tree Search cs.CL

Zhenyu Ding, Yuhao Wang, Tengyue Xiao, Haoying Wang, Guojun Ma

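TL;DR: W2S-AlignTree is an inference-time alignment framework based on Monte Carlo Tree Search (MCTS). It is the first to combine the Weak-to-Strong Generalization paradigm with MCTS, enabling fine-grained control over large language model (LLM) outputs without modifying model parameters.

Details

Motivation: LLM outputs are often misaligned with human preferences, and training-time alignment methods such as RLHF are costly and scale poorly. A low-cost, dynamically controllable inference-time alignment approach is needed.

Result: Experiments show that W2S-AlignTree outperforms baselines on controlled sentiment generation, summarization, and instruction following. On summarization, for example, it improves the performance of Llama3-8B by 15.9%.

Insight: 1. Inference-time alignment can reduce cost; 2. Real-time signals from a weak model can serve as effective alignment proxies; 3. MCTS shows strong potential for controlling LLM generation.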

Abstract: Large Language Models (LLMs) demonstrate impressive capabilities, yet their outputs often suffer from misalignment with human preferences due to the inadequacy of weak supervision and a lack of fine-grained control. Training-time alignment methods like Reinforcement Learning from Human Feedback (RLHF) face prohibitive costs in expert supervision and inherent scalability limitations, offering limited dynamic control during inference. Consequently, there is an urgent need for scalable and adaptable alignment mechanisms. To address this, we propose W2S-AlignTree, a pioneering plug-and-play inference-time alignment framework that synergistically combines Monte Carlo Tree Search (MCTS) with the Weak-to-Strong Generalization paradigm for the first time. W2S-AlignTree formulates LLM alignment as an optimal heuristic search problem within a generative search tree. By leveraging the weak model’s real-time, step-level signals as alignment proxies and introducing an Entropy-Aware exploration mechanism, W2S-AlignTree enables fine-grained guidance during the strong model’s generation without modifying its parameters. The approach dynamically balances exploration and exploitation in high-dimensional generation search trees. Experiments across controlled sentiment generation, summarization, and instruction-following show that W2S-AlignTree consistently outperforms strong baselines. Notably, on the summarization task, W2S-AlignTree raises the performance of Llama3-8B from 1.89 to 2.19, a relative improvement of 15.9%.
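
The relative improvement quoted for Llama3-8B can be checked from the two reported scores. The short sketch below only reproduces that arithmetic; the function name is illustrative and not taken from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative change from baseline to improved, in percent."""
    return (improved - baseline) / baseline * 100.0


# Scores reported in the abstract for Llama3-8B on summarization.
gain = relative_improvement(1.89, 2.19)
print(f"{gain:.1f}%")
```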


[127] PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning cs.CL | cs.CY

Afra Feyza Akyürek, Advait Gosai, Chen Bo Calvin Zhang, Vipul Gupta, Jaehwan Jeong

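TL;DR: PRBench is a large-scale benchmark for evaluating professional reasoning on high-stakes tasks in law and finance. It contains 1,100 expert-authored tasks and 19,356 expert-curated criteria, making it, to the authors' knowledge, the largest public rubric-based benchmark for these domains. An evaluation of 20 leading models shows substantial room for improvement.

Details

Motivation: Existing evaluations struggle to measure real-world performance in high-stakes professional domains such as law and finance, so a more realistic evaluation tool is needed.

Result: Overall model performance is low, with top scores of only 0.39 (Finance) and 0.37 (Legal) on the Hard subsets, and models with similar overall scores differ markedly on specific capabilities. Common failure modes include inaccurate judgments, a lack of process transparency, and incomplete reasoning.

Insight: Current models are not yet reliable enough for high-stakes professional domains, particularly in transparency and completeness of reasoning.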

Abstract: Frontier model progress is often measured by academic benchmarks, which offer a limited view of performance in real-world professional contexts. Existing evaluations often fail to assess open-ended, economically consequential tasks in high-stakes domains like Legal and Finance, where practical returns are paramount. To address this, we introduce Professional Reasoning Bench (PRBench), a realistic, open-ended, and difficult benchmark of real-world problems in Finance and Law. We open-source its 1,100 expert-authored tasks and 19,356 expert-curated criteria, making it, to our knowledge, the largest public, rubric-based benchmark for both legal and finance domains. We recruit 182 qualified professionals, holding JDs, CFAs, or 6+ years of experience, who contributed tasks inspired by their actual workflows. This process yields significant diversity, with tasks spanning 114 countries and 47 US jurisdictions. Our expert-curated rubrics are validated through a rigorous quality pipeline, including independent expert validation. Subsequent evaluation of 20 leading models reveals substantial room for improvement, with top scores of only 0.39 (Finance) and 0.37 (Legal) on our Hard subsets. We further catalog associated economic impacts of the prompts and analyze performance using human-annotated rubric categories. Our analysis shows that models with similar overall scores can diverge significantly on specific capabilities. Common failure modes include inaccurate judgments, a lack of process transparency and incomplete reasoning, highlighting critical gaps in their reliability for professional adoption.


cs.MA

[128] Who Gets the Reward, Who Gets the Blame? Evaluation-Aligned Training Signals for Multi-LLM Agents cs.MA | cs.AI | cs.CL | cs.GT

Chih-Hsuan Yang, Tanwi Mallick, Le Chen, Krishnan Raghavan, Azton Wells

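TL;DR: This paper proposes a theoretical framework that converts system-level evaluation of multi-LLM agents into agent-level and message-level learning signals, unifying cooperative game theory with process reward modeling to produce fair, cooperation-promoting local signals.

Details

Motivation: Existing training methods for multi-LLM agents lack a unified way to connect system-level evaluation with agent-level and message-level learning. The authors aim to fill this gap with a theoretical framework.

Result: The framework yields local, signed, credit-conserving signals for multi-LLM agent training that are compatible with reinforcement learning or preference optimization. No empirical validation is performed.

Insight: The signals are bounded, cooperative, and auditable by design, providing a unified path from global evaluation to local supervision in multi-agent systems.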

Abstract: Large Language Models (LLMs) in multi-agent systems (MAS) have shown promise for complex tasks, yet current training methods lack principled ways to connect system-level evaluation with agent-level and message-level learning. We propose a theoretical framework that unifies cooperative game-theoretic attribution with process reward modeling to transform system evaluation into agent credit and then into response-level signals. Unlike prior approaches that rely only on attribution (e.g., Shapley) or step-level labels (e.g., PRM), our method produces local, signed, and credit-conserving signals. In success cases, Shapley-based credit assignment fairly allocates outcomes across agents and is refined into per-message rewards that promote cooperation while discouraging redundancy or sabotage. In failure cases, first-error localization yields repair-aware preferences that penalize harmful steps while rewarding corrective attempts. The resulting signals are bounded, cooperative, and directly compatible with reinforcement-based or preference-based post-training, providing a unified and auditable pathway from global evaluation to local supervision in LLM multi-agent training. Our contribution is conceptual: we present a theoretical foundation and training signals, leaving empirical validation for future work.
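
To illustrate the Shapley-based credit assignment mentioned above, here is a minimal sketch of exact Shapley values for a three-agent system. The agent names and the coalition scores are hypothetical, and the sketch covers only the attribution step, not the paper's refinement into per-message rewards.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, List


def shapley_values(
    agents: List[str], value: Callable[[FrozenSet[str]], float]
) -> Dict[str, float]:
    """Average each agent's marginal contribution over all join orders."""
    credit = {agent: 0.0 for agent in agents}
    orders = list(permutations(agents))
    for order in orders:
        coalition: FrozenSet[str] = frozenset()
        for agent in order:
            with_agent = coalition | {agent}
            credit[agent] += value(with_agent) - value(coalition)
            coalition = with_agent
    return {agent: total / len(orders) for agent, total in credit.items()}


# Hypothetical system-level scores for every subset of agents.
SCORES = {
    frozenset(): 0.0,
    frozenset({"planner"}): 0.2,
    frozenset({"coder"}): 0.3,
    frozenset({"critic"}): 0.0,
    frozenset({"planner", "coder"}): 0.7,
    frozenset({"planner", "critic"}): 0.2,
    frozenset({"coder", "critic"}): 0.5,
    frozenset({"planner", "coder", "critic"}): 1.0,
}

credit = shapley_values(["planner", "coder", "critic"], SCORES.__getitem__)
print(credit)
```

The per-agent credits sum to the full-team score, which is the "credit-conserving" property in its simplest form. Exact enumeration grows factorially with the number of agents, so larger systems would need sampling-based approximations.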


cs.IR

[129] MOON Embedding: Multimodal Representation Learning for E-commerce Search Advertising cs.IR | cs.AI | cs.CV | cs.LG

Chenghan Fu, Daoze Zhang, Yukang Lin, Zhanheng Nie, Xiang Zhang

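TL;DR: MOON is a multimodal representation learning approach for e-commerce search advertising. It is fully deployed in the Taobao search advertising system and substantially improves click-through rate (CTR) prediction.

Details

Motivation: To address the misalignment between the objectives of multimodal representation learning and downstream tasks, MOON quantifies how improvements in intermediate metrics translate into downstream gains and uses this to guide model optimization.

Result: MOON achieves a +20% online CTR improvement in the Taobao search advertising system and has gone through five iterations over the past three years.

Insight: Image-based search recall is a critical intermediate metric for guiding the optimization of multimodal models. The scaling study examines factors such as the number of training tokens, the number of negative samples, and the length of user behavior sequences.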

Abstract: We introduce MOON, our comprehensive set of sustainable iterative practices for multimodal representation learning for e-commerce applications. MOON has already been fully deployed across all stages of the Taobao search advertising system, including retrieval, relevance, and ranking. The performance gains are particularly significant on the click-through rate (CTR) prediction task, with an overall +20.00% online CTR improvement. Over the past three years, this project has delivered the largest improvement on the CTR prediction task and undergone five full-scale iterations. Throughout the exploration and iteration of MOON, we have accumulated valuable insights and practical experience that we believe will benefit the research community. MOON contains a three-stage training paradigm of “Pretraining, Post-training, and Application”, allowing effective integration of multimodal representations with downstream tasks. Notably, to bridge the misalignment between the objectives of multimodal representation learning and downstream training, we define the exchange rate to quantify how effectively improvements in an intermediate metric can translate into downstream gains. Through this analysis, we identify the image-based search recall as a critical intermediate metric guiding the optimization of multimodal models. Over three years and five iterations, MOON has evolved along four critical dimensions: data processing, training strategy, model architecture, and downstream application. The lessons and insights gained through the iterative improvements will also be shared. As part of our exploration into scaling effects in the e-commerce field, we further conduct a systematic study of the scaling laws governing multimodal representation learning, examining multiple factors such as the number of training tokens, negative samples, and the length of user behavior sequences.
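
The abstract does not give a formula for the exchange rate. One plausible reading, used here purely as an assumption, is the downstream gain obtained per unit of gain in an intermediate metric. The metric names and numbers below are hypothetical and not taken from the paper.

```python
def exchange_rate(intermediate_gain: float, downstream_gain: float) -> float:
    """Downstream gain per unit of intermediate-metric gain (assumed definition)."""
    if intermediate_gain == 0:
        raise ValueError("intermediate metric did not change; rate is undefined")
    return downstream_gain / intermediate_gain


# Hypothetical measurements: (gain in intermediate metric, gain in downstream CTR metric).
candidates = {
    "image_search_recall": exchange_rate(intermediate_gain=2.0, downstream_gain=0.5),
    "text_image_matching": exchange_rate(intermediate_gain=2.0, downstream_gain=0.1),
}

# The metric with the highest rate is the most useful optimization target.
best_metric = max(candidates, key=candidates.get)
print(best_metric, candidates[best_metric])
```

Under this reading, a metric with a high exchange rate is worth optimizing directly, because improvements in it carry over to the downstream task.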


cs.SE

[130] SQuaD: The Software Quality Dataset cs.SE | cs.AI | cs.CL | cs.CR | cs.IR

Mikel Robredo, Matteo Esposito, Davide Taibi, Rafael Peñaloza, Valentina Lenarduzzi

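TL;DR: The paper introduces SQuaD, a multi-dimensional, time-aware software quality dataset that unifies over 700 unique metrics from 450 open-source projects and supports research on maintainability, technical debt, and related topics.

Details

Motivation: Existing software quality datasets mostly cover limited dimensions (e.g., code smells or technical debt), which restricts comprehensive analysis across time and quality dimensions. A more comprehensive dataset is needed.

Result: The dataset covers 63,586 project releases, supports research on maintainability, technical debt, and software evolution, and is publicly available on Zenodo.

Insight: SQuaD provides a unified resource for large-scale software quality analysis. Future directions include automated dataset updates and cross-project quality modeling.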

Abstract: Software quality research increasingly relies on large-scale datasets that measure both the product and process aspects of software systems. However, existing resources often focus on limited dimensions, such as code smells, technical debt, or refactoring activity, thereby restricting comprehensive analyses across time and quality dimensions. To address this gap, we present the Software Quality Dataset (SQuaD), a multi-dimensional, time-aware collection of software quality metrics extracted from 450 mature open-source projects across diverse ecosystems, including Apache, Mozilla, FFmpeg, and the Linux kernel. By integrating nine state-of-the-art static analysis tools, i.e., SonarQube, CodeScene, PMD, Understand, CK, JaSoMe, RefactoringMiner, RefactoringMiner++, and PyRef, our dataset unifies over 700 unique metrics at method, class, file, and project levels. Covering a total of 63,586 analyzed project releases, SQuaD also provides version control and issue-tracking histories, software vulnerability data (CVE/CWE), and process metrics proven to enhance Just-In-Time (JIT) defect prediction. SQuaD enables empirical research on maintainability, technical debt, software evolution, and quality assessment at an unprecedented scale. We also outline emerging research directions, including automated dataset updates and cross-project quality modeling to support the continuous evolution of software analytics. The dataset is publicly available on Zenodo (DOI: 10.5281/zenodo.17566690).


cs.LG

[131] The Map of Misbelief: Tracing Intrinsic and Extrinsic Hallucinations Through Attention Patterns cs.LG | cs.AI | cs.CL

Elyes Hajji, Aymen Bouguerra, Fabio Arnez

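TL;DR: The paper proposes an evaluation framework that distinguishes extrinsic from intrinsic hallucinations, along with novel attention aggregation strategies that improve both hallucination detection performance and interpretability.

Details

Motivation: Large language models (LLMs) are increasingly deployed in safety-critical domains yet remain prone to hallucination. Most existing methods rely on computationally expensive sampling strategies and ignore the distinction between hallucination types.

Result: Experiments show that sampling-based methods such as Semantic Entropy are effective for detecting extrinsic hallucinations but perform poorly on intrinsic ones, whereas the attention-aggregation method is better suited to intrinsic hallucinations.

Insight: Attention patterns are a rich signal for quantifying model uncertainty and point to new directions for matching detection strategies to the type of hallucination.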

Abstract: Large Language Models (LLMs) are increasingly deployed in safety-critical domains, yet remain susceptible to hallucinations. While prior works have proposed confidence representation methods for hallucination detection, most of these approaches rely on computationally expensive sampling strategies and often disregard the distinction between hallucination types. In this work, we introduce a principled evaluation framework that differentiates between extrinsic and intrinsic hallucination categories and evaluates detection performance across a suite of curated benchmarks. In addition, we leverage a recent attention-based uncertainty quantification algorithm and propose novel attention aggregation strategies that improve both interpretability and hallucination detection performance. Our experimental findings reveal that sampling-based methods like Semantic Entropy are effective for detecting extrinsic hallucinations but generally fail on intrinsic ones. In contrast, our method, which aggregates attention over input tokens, is better suited for intrinsic hallucinations. These insights provide new directions for aligning detection strategies with the nature of hallucination and highlight attention as a rich signal for quantifying model uncertainty.


[132] Optimizing Mixture of Block Attention cs.LG | cs.CL

Guangxuan Xiao, Junxian Guo, Kasra Mazaheri, Song Han

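TL;DR: The paper studies how to optimize Mixture of Block Attention (MoBA), offering a theoretical analysis and an efficient GPU implementation, FlashMoBA, that makes long-context processing substantially more efficient.

Details

Motivation: MoBA can process long contexts efficiently, but its design principles are poorly understood and it lacks an efficient GPU implementation, which limits practical adoption.

Result: The improved MoBA models match the performance of dense attention baselines, and FlashMoBA achieves up to a 14.7x speedup over FlashAttention-2 for small blocks.

Insight: Routing accuracy is the key to MoBA's performance. Smaller blocks and clustering of relevant signals (via a short convolution on keys) improve it, but small blocks need a hardware-aware kernel to run efficiently.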

Abstract: Mixture of Block Attention (MoBA) (Lu et al., 2025) is a promising building block for efficiently processing long contexts in LLMs by enabling queries to sparsely attend to a small subset of key-value blocks, drastically reducing computational cost. However, the design principles governing MoBA’s performance are poorly understood, and it lacks an efficient GPU implementation, hindering its practical adoption. In this paper, we first develop a statistical model to analyze MoBA’s underlying mechanics. Our model reveals that performance critically depends on the router’s ability to accurately distinguish relevant from irrelevant blocks based on query-key affinities. We derive a signal-to-noise ratio that formally connects architectural parameters to this retrieval accuracy. Guided by our analysis, we identify two key pathways for improvement: using smaller block sizes and applying a short convolution on keys to cluster relevant signals, which enhances routing accuracy. While theoretically better, small block sizes are inefficient on GPUs. To bridge this gap, we introduce FlashMoBA, a hardware-aware CUDA kernel that enables efficient MoBA execution even with the small block sizes our theory recommends. We validate our insights by training LLMs from scratch, showing that our improved MoBA models match the performance of dense attention baselines. FlashMoBA achieves up to 14.7x speedup over FlashAttention-2 for small blocks, making our theoretically-grounded improvements practical. Code is available at: https://github.com/mit-han-lab/flash-moba.
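
The routing step described above, in which each query attends only to a few key-value blocks, can be sketched as follows. This is a simplified illustration that assumes blocks are scored by the dot product between the query and the mean-pooled keys of each block. It omits causal masking, the short key convolution, and the attention computation itself. All names and values are hypothetical.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(block: List[Vector]) -> Vector:
    """Average the key vectors of one block into a single routing vector."""
    return [sum(column) / len(block) for column in zip(*block)]


def route_query(query: Vector, keys: List[Vector], block_size: int, top_k: int) -> List[int]:
    """Return the indices of the top_k key blocks with the highest query affinity."""
    blocks = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
    scores = [dot(query, mean_pool(block)) for block in blocks]
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


# Toy example: 8 keys of dimension 2, split into 4 blocks of 2 keys each.
keys = [
    [0.0, 1.0], [0.0, 1.0],    # block 0: orthogonal to the query
    [1.0, 0.0], [0.8, 0.2],    # block 1: strongly aligned
    [-1.0, 0.0], [-1.0, 0.0],  # block 2: opposite direction
    [0.5, 0.5], [0.7, 0.1],    # block 3: partly aligned
]
selected = route_query(query=[1.0, 0.0], keys=keys, block_size=2, top_k=2)
print(selected)
```

Smaller blocks make each pooled vector a sharper summary of its keys, which is one intuition for why the analysis favors small block sizes despite their GPU cost.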


[133] From Parameter to Representation: A Closed-Form Approach for Controllable Model Merging cs.LG | cs.CV

Jialin Wu, Jian Yang, Handing Wang, Jiajun Wen, Zhiyong Yu

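TL;DR: The paper proposes a closed-form method that addresses parameter interference in model merging by directly correcting the model's final representation, avoiding costly offline multi-objective optimization.

Details

Motivation: Model merging must cope with parameter interference. Existing controllable-merging methods rely on costly offline optimization whose complexity grows exponentially with the number of tasks, so a more efficient and controllable method is needed.

Result: Experiments show that the method produces a superior Pareto front with more precise preference alignment at drastically reduced computational cost.

Insight: Shifting the optimization perspective from parameter space to representation space can markedly improve the efficiency and controllability of model merging.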

Abstract: Model merging combines expert models for multitask performance but faces challenges from parameter interference. This has sparked recent interest in controllable model merging, giving users the ability to explicitly balance performance trade-offs. Existing approaches employ a compile-then-query paradigm, performing a costly offline multi-objective optimization to enable fast, preference-aware model generation. This offline stage typically involves iterative search or dedicated training, with complexity that grows exponentially with the number of tasks. To overcome these limitations, we shift the perspective from parameter-space optimization to a direct correction of the model’s final representation. Our approach models this correction as an optimal linear transformation, yielding a closed-form solution that replaces the entire offline optimization process with a single-step, architecture-agnostic computation. This solution directly incorporates user preferences, allowing a Pareto-optimal model to be generated on-the-fly with complexity that scales linearly with the number of tasks. Experimental results show our method generates a superior Pareto front with more precise preference alignment and drastically reduced computational cost.


cs.AI

[134] Co-EPG: A Framework for Co-Evolution of Planning and Grounding in Autonomous GUI Agents cs.AI | cs.CL

Yuan Zhao, Hualei Zhu, Tingyu Jiang, Shen Li, Xiaohang Xu

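TL;DR: Co-EPG is a self-iterative training framework that co-evolves a planning model and a grounding model. It addresses insufficient cross-model synergy and underused data in GUI task automation, and significantly improves the performance of autonomous GUI agents.

Details

Motivation: The planning and grounding models of current GUI agents are insufficiently coordinated, and current methods rely heavily on synthetic data generation without fully using that data, which limits agent performance.

Result: On the Multimodal-Mind2Web and AndroidControl benchmarks, Co-EPG surpasses existing state-of-the-art methods after only three iterations.

Insight: Through self-driven co-evolution, GUI agents can improve continuously without relying on external data, offering a new paradigm for future research.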

Abstract: Graphical User Interface (GUI) task automation constitutes a critical frontier in artificial intelligence research. While effective GUI agents synergistically integrate planning and grounding capabilities, current methodologies exhibit two fundamental limitations: (1) insufficient exploitation of cross-model synergies, and (2) over-reliance on synthetic data generation without sufficient utilization. To address these challenges, we propose Co-EPG, a self-iterative training framework for Co-Evolution of Planning and Grounding. Co-EPG establishes an iterative positive feedback loop: the planning model explores superior strategies under grounding-based reward guidance via Group Relative Policy Optimization (GRPO), generating diverse data to optimize the grounding model. Concurrently, the optimized grounding model provides more effective rewards for subsequent GRPO training of the planning model, fostering continuous improvement. Co-EPG thus enables iterative enhancement of agent capabilities through self-play optimization and training data distillation. On the Multimodal-Mind2Web and AndroidControl benchmarks, our framework outperforms existing state-of-the-art methods after just three iterations without requiring external data. The agent consistently improves with each iteration, demonstrating robust self-enhancement capabilities. This work establishes a novel training paradigm for GUI agents, shifting from isolated optimization to an integrated, self-driven co-evolution approach.
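
GRPO scores each sampled response relative to the other responses drawn for the same prompt. A minimal sketch of that group-relative advantage follows. The rewards are hypothetical stand-ins for grounding-based rewards, and the population standard deviation is one common choice (implementations vary). The policy-gradient update itself is not shown.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four plans sampled for one GUI task.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Plans that score above the group mean receive positive advantages and are reinforced, so no separate value network is needed to provide a baseline.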


[135] From Efficiency to Adaptivity: A Deeper Look at Adaptive Reasoning in Large Language Models cs.AI | cs.CL

Chao Wu, Baoheng Li, Mingchen Gao, Zhenyi Wang

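TL;DR: This survey re-examines the reasoning of large language models through the lens of adaptivity. It proposes a taxonomy that groups existing methods into training-based and training-free approaches and discusses open challenges.

Details

Motivation: Current large language models apply uniform reasoning strategies and do not adjust reasoning effort to task complexity, so an adaptivity-centered view is needed to improve reasoning efficiency and quality.

Result: The paper presents a clear framework for understanding how different methods realize adaptive reasoning and points to directions for future research.

Insight: The key to adaptive reasoning is allocating computation dynamically. Open challenges include self-evaluation, meta-reasoning, and human-aligned reasoning control.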

Abstract: Recent advances in large language models (LLMs) have made reasoning a central benchmark for evaluating intelligence. While prior surveys focus on efficiency by examining how to shorten reasoning chains or reduce computation, this view overlooks a fundamental challenge: current LLMs apply uniform reasoning strategies regardless of task complexity, generating long traces for trivial problems while failing to extend reasoning for difficult tasks. This survey reframes reasoning through the lens of adaptivity: the capability to allocate reasoning effort based on input characteristics such as difficulty and uncertainty. We make three contributions. First, we formalize deductive, inductive, and abductive reasoning within the LLM context, connecting these classical cognitive paradigms with their algorithmic realizations. Second, we formalize adaptive reasoning as a control-augmented policy optimization problem balancing task performance with computational cost, distinguishing learned policies from inference-time control mechanisms. Third, we propose a systematic taxonomy organizing existing methods into training-based approaches that internalize adaptivity through reinforcement learning, supervised fine-tuning, and learned controllers, and training-free approaches that achieve adaptivity through prompt conditioning, feedback-driven halting, and modular composition. This framework clarifies how different mechanisms realize adaptive reasoning in practice and enables systematic comparison across diverse strategies. We conclude by identifying open challenges in self-evaluation, meta-reasoning, and human-aligned reasoning control.


[136] Multi-agent Undercover Gaming: Hallucination Removal via Counterfactual Test for Multimodal Reasoning cs.AI | cs.CL | cs.MA | cs.MM

Dayong Liang, Xiao-Yong Wei, Changmeng Zheng

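TL;DR: The paper proposes the Multi-agent Undercover Gaming (MUG) protocol, which uses multimodal counterfactual tests to detect hallucinating agents and improve the reliability of multimodal reasoning.

Details

Motivation: Large language models (LLMs) frequently hallucinate in multimodal reasoning. Existing Multi-Agent Debate (MAD) methods assume that all agents are rational, but in practice agents may themselves be affected by hallucination.

Result: MUG performs more reliably in multimodal reasoning than conventional MAD methods.

Insight: Counterfactual testing and dynamically modified evidence can effectively identify hallucinating agents and improve the robustness of multimodal reasoning.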

Abstract: Hallucination continues to pose a major obstacle in the reasoning capabilities of large language models (LLMs). Although the Multi-Agent Debate (MAD) paradigm offers a promising solution by promoting consensus among multiple agents to enhance reliability, it relies on the unrealistic assumption that all debaters are rational and reflective, which is a condition that may not hold when agents themselves are prone to hallucinations. To address this gap, we introduce the Multi-agent Undercover Gaming (MUG) protocol, inspired by social deduction games like “Who is Undercover?”. MUG reframes MAD as a process of detecting “undercover” agents (those suffering from hallucinations) by employing multimodal counterfactual tests. Specifically, we modify reference images to introduce counterfactual evidence and observe whether agents can accurately identify these changes, providing ground-truth for identifying hallucinating agents and enabling robust, crowd-powered multimodal reasoning. MUG advances MAD protocols along three key dimensions: (1) enabling factual verification beyond statistical consensus through counterfactual testing; (2) introducing cross-evidence reasoning via dynamically modified evidence sources instead of relying on static inputs; and (3) fostering active reasoning, where agents engage in probing discussions rather than passively answering questions. Collectively, these innovations offer a more reliable and effective framework for multimodal reasoning in LLMs. The source code can be accessed at https://github.com/YongLD/MUG.git.


[137] Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping cs.AI | cs.CL

Dena Mujtaba, Brian Hu, Anthony Hoogs, Arslan Basharat

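TL;DR: The paper proposes a test-time policy shaping method that steers the behavior of pre-trained AI agents in complex, dynamic environments toward human values or ethical guidelines without retraining the agents.

Details

Motivation: Pre-trained AI agents may exhibit harmful behavior while maximizing their reward. Keeping their behavior aligned with human values is a key challenge, especially when the ethical attributes are diverse and potentially conflicting.

Result: The method is validated on the MACHIAVELLI benchmark, where it significantly reduces unethical and power-seeking behavior.

Insight: Test-time policy shaping offers an efficient and scalable solution for the ethical alignment of pre-trained agents.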

Abstract: The deployment of decision-making AI agents presents a critical challenge in maintaining alignment with human values or guidelines while operating in complex, dynamic environments. Agents trained solely to achieve their objectives may adopt harmful behavior, exposing a key trade-off between maximizing the reward function and maintaining alignment. For pre-trained agents, ensuring alignment is particularly challenging, as retraining can be a costly and slow process. This is further complicated by the diverse and potentially conflicting attributes representing the ethical values for alignment. To address these challenges, we propose a test-time alignment technique based on model-guided policy shaping. Our method allows precise control over individual behavioral attributes, generalizes across diverse reinforcement learning (RL) environments, and facilitates a principled trade-off between ethical alignment and reward maximization without requiring agent retraining. We evaluate our approach using the MACHIAVELLI benchmark, which comprises 134 text-based game environments and thousands of annotated scenarios involving ethical decisions. The RL agents are first trained to maximize the reward in their respective games. At test time, we apply policy shaping via scenario-action attribute classifiers to ensure decision alignment with ethical attributes. We compare our approach against prior training-time methods and general-purpose agents, and we study several types of ethical violations and power-seeking behavior. Our results demonstrate that test-time policy shaping provides an effective and scalable solution for mitigating unethical behavior across diverse environments and alignment attributes.
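
One simple way to picture test-time policy shaping is to subtract a classifier-based penalty from the agent's action logits before sampling. The sketch below assumes that form. The penalty rule, the `strength` parameter, and all numbers are hypothetical and are not the paper's exact formulation.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shape_policy(action_logits: List[float], violation_probs: List[float],
                 strength: float) -> List[float]:
    """Penalize each action by its predicted violation probability, then renormalize."""
    shaped = [logit - strength * p for logit, p in zip(action_logits, violation_probs)]
    return softmax(shaped)


# Hypothetical agent logits and attribute-classifier outputs for three actions.
logits = [2.0, 1.0, 0.5]
violations = [0.9, 0.1, 0.2]

original = shape_policy(logits, violations, strength=0.0)  # unshaped policy
aligned = shape_policy(logits, violations, strength=5.0)   # strongly shaped policy
print(original.index(max(original)), aligned.index(max(aligned)))
```

Setting `strength` to zero recovers the reward-maximizing policy, while larger values trade reward for alignment. The agent itself is never retrained.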


cs.MM

[138] AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization cs.MM | cs.CV | cs.SD

Zhonghua Jiang, Kui Chen, Kunxi Li, Keting Yin, Yiyun Zhou

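TL;DR: AccKV is an Adaptive-Focusing and Cross-Calibration KV cache optimization framework for Audio-Video Large Language Models (AV-LLMs) that improves computational efficiency while maintaining accuracy.

Details

Motivation: Video and audio add a temporal dimension, producing a larger KV cache than static image embeddings. The conventional strategy is to selectively retain the audio or video KV cache depending on the task. Experiments show, however, that attention in the higher layers of AV-LLMs is not strictly task-dependent, and that handling audio and video KV together can cause information confusion and performance degradation. A more effective optimization method is needed.

Result: Experiments show that AccKV significantly improves the computational efficiency of AV-LLMs while maintaining accuracy.

Insight: In higher layers, AV-LLM attention shifts toward the video modality, and directly integrating audio and video KV caches can cause information confusion and performance degradation. Optimizing the KV cache with adaptive focusing and cross-calibration enables efficient and accurate multimodal inference.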

Abstract: Recent advancements in Audio-Video Large Language Models (AV-LLMs) have enhanced their capabilities in tasks like audio-visual question answering and multimodal dialog systems. Video and audio introduce an extended temporal dimension, resulting in a larger key-value (KV) cache compared to static image embeddings. A naive optimization strategy is to selectively focus on and retain the KV caches of audio or video based on the task. However, in our experiments we observed that the attention of AV-LLMs to the various modalities in the higher layers is not strictly dependent on the task: in higher layers, attention shifts more towards the video modality. We also found that directly integrating the temporal KV of audio and the spatial-temporal KV of video may lead to information confusion and significant performance degradation. If audio and video are processed indiscriminately, one modality may also be compressed or retained excessively, disrupting the alignment between modalities. To address these challenges, we propose AccKV, an Adaptive-Focusing and Cross-Calibration KV cache optimization framework designed specifically for efficient AV-LLMs inference. Our method is based on a layer-adaptive focusing technique that selectively focuses on key modalities according to the characteristics of different layers and enhances the recognition of heavy-hitter tokens through attention redistribution. In addition, we propose a Cross-Calibration technique that first integrates inefficient KV caches within the audio and video modalities, and then aligns low-priority modalities with high-priority modalities to selectively evict the KV cache of low-priority modalities. The experimental results show that AccKV can significantly improve the computational efficiency of AV-LLMs while maintaining accuracy.
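
The notion of heavy-hitter tokens can be illustrated with a generic eviction rule: keep the cached positions that have received the most accumulated attention. The sketch below shows only this generic idea under a fixed budget. It is not AccKV's layer-adaptive focusing or Cross-Calibration procedure, and the attention values are hypothetical.

```python
from typing import List


def keep_heavy_hitters(attention_rows: List[List[float]], budget: int) -> List[int]:
    """Return the cache positions to keep, ranked by accumulated attention mass."""
    num_tokens = len(attention_rows[0])
    accumulated = [sum(row[j] for row in attention_rows) for j in range(num_tokens)]
    ranked = sorted(range(num_tokens), key=lambda j: accumulated[j], reverse=True)
    return sorted(ranked[:budget])


# Hypothetical attention weights from two recent queries over four cached tokens.
attention = [
    [0.5, 0.1, 0.3, 0.1],
    [0.4, 0.1, 0.4, 0.1],
]
kept = keep_heavy_hitters(attention, budget=2)
print(kept)
```

A modality-aware scheme would presumably apply a rule like this with different budgets for audio and video tokens, depending on the layer.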


eess.IV

[139] DualVision ArthroNav: Investigating Opportunities to Enhance Localization and Reconstruction in Image-based Arthroscopy Navigation via External Cameras eess.IV | cs.CV | cs.RO

Hongchao Shu, Lalithkumar Seenivasan, Mingxu Liu, Yunseo Hwang, Yu-Chun Ku

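TL;DR: DualVision ArthroNav is a multi-camera arthroscopy navigation system that combines an external camera with the monocular arthroscope camera. It resolves the scale ambiguity, drift, and relocalization problems of conventional vision-based navigation and improves localization and reconstruction accuracy.

Details

Motivation: Existing optical tracking systems constrain the workspace and disrupt surgical workflow, while vision-based alternatives rely solely on the monocular arthroscope camera and are prone to drift, scale ambiguity, and sensitivity to rapid motion or occlusion.

Result: Experiments show an average absolute trajectory error of 1.09 mm, an average target registration error of 2.16 mm, and high visual fidelity (SSIM = 0.69, PSNR = 22.19).

Insight: The system offers a practical, cost-efficient solution for arthroscopic navigation, bridging the gap between optical tracking and purely vision-based systems and advancing clinically deployable, fully vision-based navigation.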

Abstract: Arthroscopic procedures can greatly benefit from navigation systems that enhance spatial awareness, depth perception, and field of view. However, existing optical tracking solutions impose strict workspace constraints and disrupt surgical workflow. Vision-based alternatives, though less invasive, often rely solely on the monocular arthroscope camera, making them prone to drift, scale ambiguity, and sensitivity to rapid motion or occlusion. We propose DualVision ArthroNav, a multi-camera arthroscopy navigation system that integrates an external camera rigidly mounted on the arthroscope. The external camera provides stable visual odometry and absolute localization, while the monocular arthroscope video enables dense scene reconstruction. By combining these complementary views, our system resolves the scale ambiguity and long-term drift inherent in monocular SLAM and ensures robust relocalization. Experiments demonstrate that our system effectively compensates for calibration errors, achieving an average absolute trajectory error of 1.09 mm. The reconstructed scenes reach an average target registration error of 2.16 mm, with high visual fidelity (SSIM = 0.69, PSNR = 22.19). These results indicate that our system provides a practical and cost-efficient solution for arthroscopic navigation, bridging the gap between optical tracking and purely vision-based systems, and paving the way toward clinically deployable, fully vision-based arthroscopic guidance.
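
For readers unfamiliar with the metric, an average absolute trajectory error can be computed as the mean Euclidean distance between estimated and reference camera positions. The sketch below assumes the two trajectories are already time-synchronized and expressed in the same coordinate frame, and it uses hypothetical positions in millimetres. The paper's exact evaluation protocol is not stated in the abstract.

```python
import math
from typing import List, Sequence


def mean_absolute_trajectory_error(
    estimated: List[Sequence[float]], reference: List[Sequence[float]]
) -> float:
    """Mean Euclidean distance between matched 3D positions."""
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    errors = [math.dist(e, r) for e, r in zip(estimated, reference)]
    return sum(errors) / len(errors)


# Hypothetical aligned positions in millimetres.
estimated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 1.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 0.0)]
ate = mean_absolute_trajectory_error(estimated, reference)
print(f"{ate:.3f} mm")
```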


[140] From Attention to Frequency: Integration of Vision Transformer and FFT-ReLU for Enhanced Image Deblurring eess.IV | cs.CV

Syed Mumtahin Mahmud, Mahdi Mohd Hossain Noki, Prothito Shovon Majumder, Abdul Mohaimen Al Radi, Md. Haider Ali

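TL;DR: The paper proposes a dual-domain architecture that combines a Vision Transformer with an FFT-ReLU module for image deblurring, improving PSNR, SSIM, and perceptual quality.

Details

Motivation: Image deblurring is an important but challenging task, and existing CNN and ViT methods fall short on complex blur and high-resolution images.

Result: The method achieves better PSNR, SSIM, and perceptual quality than existing methods on benchmark datasets.

Insight: Combining spatial attention with frequency-domain sparsity is an effective approach to image deblurring, especially in complex and high-resolution scenarios.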

Abstract: Image deblurring is vital in computer vision, aiming to recover sharp images from blurry ones caused by motion or camera shake. While deep learning approaches such as CNNs and Vision Transformers (ViTs) have advanced this field, they often struggle with complex or high-resolution blur and computational demands. We propose a new dual-domain architecture that unifies Vision Transformers with a frequency-domain FFT-ReLU module, explicitly bridging spatial attention modeling and frequency sparsity. In this structure, the ViT backbone captures local and global dependencies, while the FFT-ReLU component enforces frequency-domain sparsity to suppress blur-related artifacts and preserve fine details. Extensive experiments on benchmark datasets demonstrate that this architecture achieves superior PSNR, SSIM, and perceptual quality compared to state-of-the-art models. Quantitative metrics, qualitative comparisons, and human preference evaluations all confirm its effectiveness, establishing a practical and generalizable paradigm for real-world image restoration.
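
The frequency-domain operation can be sketched on a 1D signal: transform, apply ReLU to the real and imaginary parts of each coefficient, and transform back. This follows the common FFT-ReLU formulation and is an assumption about the module, whose exact design is not given in the abstract. A naive DFT is used so the example needs only the standard library. A real model would operate on 2D feature maps with an FFT library.

```python
import cmath
from typing import List


def dft(values: List[complex], inverse: bool = False) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(
            values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n)
        )
        out.append(total / n if inverse else total)
    return out


def fft_relu(signal: List[float]) -> List[float]:
    """Rectify the spectrum's real and imaginary parts, then return to the signal domain."""
    spectrum = dft([complex(v) for v in signal])
    rectified = [complex(max(c.real, 0.0), max(c.imag, 0.0)) for c in spectrum]
    # The rectified spectrum is generally not conjugate-symmetric, so keep the real part.
    return [c.real for c in dft(rectified, inverse=True)]


# A constant signal has only a positive DC coefficient, so it passes through unchanged.
restored = fft_relu([1.0, 1.0, 1.0, 1.0])
print(restored)
```

Zeroing negative spectral components discards part of the frequency content, which is the sense in which the operation encourages sparsity in the frequency domain.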


[141] Boosting Neural Video Representation via Online Structural Reparameterization eess.IV | cs.CV | cs.MM

Ziyi Li, Qingyu Mao, Shuai Liu, Qilei Li, Fanyang Meng

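TL;DR: This paper proposes Online-RepNeRV, a neural video representation (NVR) framework based on online structural reparameterization. Parallel convolutional paths increase model capacity, parameters are fused dynamically during training, and the multi-branch structure is converted into a single branch after training, so decoding cost does not increase while video compression performance improves.

Details

Motivation: Existing NVR methods improve representational capability through architectural refinements but tend to be complex and computationally expensive, and the inherent limit on model capacity creates a performance bottleneck. This paper aims to address these problems.

Result: Experiments on mainstream video datasets show an average PSNR gain of 0.37-2.7 dB over baseline methods, with comparable training time and decoding speed.

Insight: Online structural reparameterization can improve NVR performance while avoiding extra inference-time cost, offering an efficient and flexible solution for video compression.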

Abstract: Neural Video Representation (NVR) is a promising paradigm for video compression, showing great potential in improving video storage and transmission efficiency. While recent advances have made efforts in architectural refinements to improve representational capability, these methods typically involve complex designs, which may incur increased computational overhead and lack the flexibility to integrate into other frameworks. Moreover, the inherent limitation in model capacity restricts the expressiveness of NVR networks, resulting in a performance bottleneck. To overcome these limitations, we propose Online-RepNeRV, an NVR framework based on online structural reparameterization. Specifically, we propose a universal reparameterization block named ERB, which incorporates multiple parallel convolutional paths to enhance the model capacity. To mitigate the overhead, an online reparameterization strategy is adopted to dynamically fuse the parameters during training, and the multi-branch structure is equivalently converted into a single-branch structure after training. As a result, the additional computational and parameter complexity is confined to the encoding stage, without affecting the decoding efficiency. Extensive experiments on mainstream video datasets demonstrate that our method achieves an average PSNR gain of 0.37-2.7 dB over baseline methods, while maintaining comparable training time and decoding speed.
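
Structural reparameterization relies on the linearity of convolution: parallel branches applied to the same input can be folded into one kernel. The 1D sketch below illustrates the principle with a 3-tap branch and a 1-tap branch. The ERB block's actual branches are not described in the abstract, so the kernels here are hypothetical.

```python
from typing import List


def conv1d_same(x: List[float], kernel: List[float]) -> List[float]:
    """Cross-correlate x with an odd-length kernel, zero-padded to keep the length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [
        sum(padded[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(len(x))
    ]


def fuse_branches(k3: List[float], k1: List[float]) -> List[float]:
    """Fold a 1-tap branch into a 3-tap branch by adding it to the centre tap."""
    return [k3[0], k3[1] + k1[0], k3[2]]


x = [1.0, 2.0, 3.0, 4.0]
k3 = [0.5, -1.0, 0.25]  # hypothetical 3-tap branch
k1 = [2.0]              # hypothetical 1-tap branch

multi_branch = [a + b for a, b in zip(conv1d_same(x, k3), conv1d_same(x, k1))]
single_branch = conv1d_same(x, fuse_branches(k3, k1))
print(multi_branch, single_branch)
```

Both paths produce the same output, which is why the extra branches cost nothing at decoding time once they are fused.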


[142] Large-scale modality-invariant foundation models for brain MRI analysis: Application to lesion segmentation eess.IV | cs.AI | cs.CV | cs.LG

Petros Koutsouvelis, Matej Gazda, Leroy Volmer, Sina Amirrajab, Kamil Barbierik

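TL;DR: This paper proposes a modality-invariant representation learning approach that uses large-scale self-supervised pre-training to improve brain MRI analysis, applied to stroke and epilepsy lesion segmentation. Although cross-modality alignment succeeds, preserving fine-grained modality-specific features matters more for lesion segmentation.

Details

Motivation: Current self-supervised learning frameworks are designed mainly for natural images, and how to adapt them to representation learning on multi-modal MRI data remains underexplored. This paper aims to fill that gap and improve the generalization of brain MRI analysis.

Result: Experiments show that, despite successful modality alignment, fine-grained modality-specific features contribute more to lesion segmentation performance. The model performs well on lesion segmentation tasks.

Insight: In modality-invariant representation learning, cross-modality alignment matters, but task-specific modality features (such as fine structures in MRI) may be decisive for performance.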

Abstract: The field of computer vision is undergoing a paradigm shift toward large-scale foundation model pre-training via self-supervised learning (SSL). Leveraging large volumes of unlabeled brain MRI data, such models can learn anatomical priors that improve few-shot performance in diverse neuroimaging tasks. However, most SSL frameworks are tailored to natural images, and their adaptation to capture multi-modal MRI information remains underexplored. This work proposes a modality-invariant representation learning setup and evaluates its effectiveness in stroke and epilepsy lesion segmentation, following large-scale pre-training. Experimental results suggest that despite successful cross-modality alignment, lesion segmentation primarily benefits from preserving fine-grained modality-specific features. Model checkpoints and code are made publicly available.


[143] Unsupervised Motion-Compensated Decomposition for Cardiac MRI Reconstruction via Neural Representation eess.IV | cs.CV

Xuanyu Tian, Lixuan Chen, Qing Wu, Xiao Wang, Jie Feng

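TL;DR: This paper proposes MoCo-INR, a new unsupervised method that combines implicit neural representations (INR) with the conventional motion-compensated framework for high-quality cardiac MRI reconstruction at high acceleration factors.

Details

Motivation: Current cardiac MRI (CMR) reconstruction methods either deliver unsatisfactory image quality or are limited by the scarcity of ground-truth data. MoCo-INR aims to address these problems and improve clinical applicability.

Result: On simulated data, MoCo-INR outperforms existing methods and produces fine-detailed reconstructions at ultra-high acceleration factors (e.g., 20x); evaluations on real free-breathing CMR scans indicate its clinical practicality.

Insight: Combining an unsupervised method with INRs offers a new direction for reconstructing highly dynamic medical images, and the tailored network design improves model stability and efficiency.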

Abstract: Cardiac magnetic resonance (CMR) imaging is widely used to characterize cardiac morphology and function. To accelerate CMR imaging, various methods have been proposed to recover high-quality spatiotemporal CMR images from highly undersampled k-t space data. However, current CMR reconstruction techniques either fail to achieve satisfactory image quality or are restricted by the scarcity of ground truth data, leading to limited applicability in clinical scenarios. In this work, we propose MoCo-INR, a new unsupervised method that integrates implicit neural representations (INR) with the conventional motion-compensated (MoCo) framework. Using explicit motion modeling and the continuous prior of INRs, MoCo-INR can produce accurate cardiac motion decomposition and high-quality CMR reconstruction. Furthermore, we introduce a new INR network architecture tailored to the CMR problem, which significantly stabilizes model optimization. Experiments on retrospective (simulated) datasets demonstrate the superiority of MoCo-INR over state-of-the-art methods, achieving fast convergence and fine-detailed reconstructions at ultra-high acceleration factors (e.g., 20x in VISTA sampling). Additionally, evaluations on prospective (real-acquired) free-breathing CMR scans highlight the clinical practicality of MoCo-INR for real-time imaging. Several ablation studies further confirm the effectiveness of the critical components of MoCo-INR.


q-bio.QM

[144] Synergy vs. Noise: Performance-Guided Multimodal Fusion For Biochemical Recurrence-Free Survival in Prostate Cancer q-bio.QM | cs.CV | cs.LG | eess.IV

Seth Alain Chang, Muhammad Mueez Amjad, Noorul Wahab, Ethar Alzaid, Nasir Rajpoot

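TL;DR: Multimodal deep learning (MDL) shows promise in computational pathology, but combining modalities blindly can introduce noise rather than complementary information. The study shows that selectively integrating high-performing modalities beats indiscriminate combination.

Details

Motivation: MDL performs well in computational pathology, yet the assumption that combining modalities necessarily improves performance is largely untested. This study examines how differences in per-modality performance affect prediction.

Result: Combining high-performing modalities improves prediction, whereas a poor-performing modality can introduce noise and reduce accuracy.

Insight: MDL design should integrate modalities selectively based on their performance rather than simply combining them, with broad implications for medical imaging and computational pathology.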

Abstract: Multimodal deep learning (MDL) has emerged as a transformative approach in computational pathology. By integrating complementary information from multiple data sources, MDL models have demonstrated superior predictive performance across diverse clinical tasks compared to unimodal models. However, the assumption that combining modalities inherently improves performance remains largely unexamined. We hypothesise that multimodal gains depend critically on the predictive quality of individual modalities, and that integrating weak modalities may introduce noise rather than complementary information. We test this hypothesis on a prostate cancer dataset with histopathology, radiology, and clinical data to predict time-to-biochemical recurrence. Our results confirm that combining high-performing modalities yields superior performance compared to unimodal approaches. However, integrating a poor-performing modality with other higher-performing modalities degrades predictive accuracy. These findings demonstrate that multimodal benefit requires selective, performance-guided integration rather than indiscriminate modality combination, with implications for MDL design across computational pathology and medical imaging.
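One way to picture performance-guided integration is to score each modality on its own and fuse only those that clear a threshold. The sketch below is a hypothetical illustration: the abstract does not state which metric or selection rule the authors use, so the concordance index (a common metric for time-to-event prediction), the threshold, the modality names, and all numbers are assumptions made up for the example.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    A pair (i, j) is comparable when subject i had an observed event
    before time j. Higher predicted risk should mean earlier event.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

def select_modalities(unimodal_scores, threshold):
    """Keep only modalities whose unimodal score reaches the threshold."""
    return [name for name, score in unimodal_scores.items() if score >= threshold]

# Toy data (made up): follow-up times, event indicators, per-modality risk scores.
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 1, 0, 1]
risk_by_modality = {
    "modality_a": [0.9, 0.7, 0.4, 0.1],  # ranks patients well
    "modality_b": [0.5, 0.5, 0.5, 0.5],  # uninformative
}
scores = {
    name: concordance_index(times, events, risks)
    for name, risks in risk_by_modality.items()
}
selected = select_modalities(scores, threshold=0.6)
```

In practice the unimodal scores would come from a held-out split, so that the selection step does not leak test information into the fused model.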


cs.RO

[145] Attentive Feature Aggregation or: How Policies Learn to Stop Worrying about Robustness and Attend to Task-Relevant Visual Cues cs.RO | cs.CV

Nikolaos Tsagkas, Andreas Sochopoulos, Duolikun Danier, Sethu Vijayakumar, Alexandros Kouris

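TL;DR: This paper proposes Attentive Feature Aggregation (AFA), a lightweight, trainable mechanism for improving the robustness of visuomotor policies. AFA learns to attend to task-relevant visual cues and reduces reliance on irrelevant scene information, markedly improving policy performance under visual perturbations without expensive data augmentation or fine-tuning of the pre-trained visual representation.

Details

Motivation: Pre-trained visual representations (PVRs) are widely used to train visuomotor policies, but they can encode a large amount of task-irrelevant visual information, leaving policies vulnerable to visual perturbations and distractors.

Result: Experiments show that policies using AFA outperform standard pooling approaches in both simulation and the real world, especially under visual perturbations.

Insight: Ignoring irrelevant visual information is key to improving the robustness and generalization of visuomotor policies.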

Abstract: The adoption of pre-trained visual representations (PVRs), leveraging features from large-scale vision models, has become a popular paradigm for training visuomotor policies. However, these powerful representations can encode a broad range of task-irrelevant scene information, making the resulting trained policies vulnerable to out-of-domain visual changes and distractors. In this work we address visuomotor policy feature pooling as a solution to the observed lack of robustness in perturbed scenes. We achieve this via Attentive Feature Aggregation (AFA), a lightweight, trainable pooling mechanism that learns to naturally attend to task-relevant visual cues, ignoring even semantically rich scene distractors. Through extensive experiments in both simulation and the real world, we demonstrate that policies trained with AFA significantly outperform standard pooling approaches in the presence of visual perturbations, without requiring expensive dataset augmentation or fine-tuning of the PVR. Our findings show that ignoring extraneous visual information is a crucial step towards deploying robust and generalisable visuomotor policies. Project Page: tsagkas.github.io/afa
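The contrast between standard pooling and attention-based pooling can be sketched in a few lines. The example below assumes a single query vector that scores each patch token and pools them with softmax weights; the query here is hand-set for illustration, whereas in a trainable mechanism it would be learned. The function names and toy tokens are hypothetical, and the abstract does not specify AFA's exact architecture.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(tokens):
    """Standard pooling: every token contributes equally."""
    n = len(tokens)
    return [sum(t[d] for t in tokens) / n for d in range(len(tokens[0]))]

def attentive_pool(tokens, query):
    """Attention pooling: weight tokens by scaled dot-product with a query."""
    dim = len(query)
    scores = [
        sum(q * x for q, x in zip(query, t)) / math.sqrt(dim) for t in tokens
    ]
    weights = softmax(scores)
    pooled = [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
    return pooled, weights

# Toy tokens: the first stands in for a task-relevant patch, the rest for distractors.
tokens = [[4.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
query = [3.0, 0.0]  # assumed query; would be a trained parameter in practice

pooled, weights = attentive_pool(tokens, query)
baseline = mean_pool(tokens)
```

With mean pooling the distractor tokens dilute the relevant feature, while the attention weights concentrate almost entirely on the first token.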


[146] Rethinking Progression of Memory State in Robotic Manipulation: An Object-Centric Perspective cs.RO | cs.CV

Nhat Chung, Taisei Hanyu, Toan Nguyen, Huy Le, Frederick Bumgarner

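TL;DR: The paper proposes an object-centric approach to managing memory state for robotic manipulation in non-Markovian settings, using the LIBERO-Mem task suite and the Embodied-SlotSSM framework to improve object tracking and action prediction.

Details

Motivation: In complex robotic manipulation tasks, non-Markovian settings with visually similar objects require persistent tracking of, and reasoning about, the history of each object instance; conventional vision-language-action (VLA) models perform poorly on such tasks.

Result: Experiments show that Embodied-SlotSSM performs well on LIBERO-Mem and other tasks, offering a scalable solution for non-Markovian reasoning.

Insight: Object-centric methods markedly improve reasoning over historical state in robotic manipulation, particularly in non-Markovian settings.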

Abstract: As embodied agents operate in increasingly complex environments, the ability to perceive, track, and reason about individual object instances over time becomes essential, especially in tasks requiring sequenced interactions with visually similar objects. In these non-Markovian settings, key decision cues are often hidden in object-specific histories rather than the current scene. Without persistent memory of prior interactions (what has been interacted with, where it has been, or how it has changed) visuomotor policies may fail, repeat past actions, or overlook completed ones. To surface this challenge, we introduce LIBERO-Mem, a non-Markovian task suite for stress-testing robotic manipulation under object-level partial observability. It combines short- and long-horizon object tracking with temporally sequenced subgoals, requiring reasoning beyond the current frame. However, vision-language-action (VLA) models often struggle in such settings, with token scaling quickly becoming intractable even for tasks spanning just a few hundred frames. We propose Embodied-SlotSSM, a slot-centric VLA framework built for temporal scalability. It maintains spatio-temporally consistent slot identities and leverages them through two mechanisms: (1) slot-state-space modeling for reconstructing short-term history, and (2) a relational encoder to align the input tokens with action decoding. Together, these components enable temporally grounded, context-aware action prediction. Experiments show Embodied-SlotSSM’s baseline performance on LIBERO-Mem and general tasks, offering a scalable solution for non-Markovian reasoning in object-centric robotic policies.


[147] Collaborative Representation Learning for Alignment of Tactile, Language, and Vision Modalities cs.RO | cs.CV

Yiyun Zhou, Mingjing Xu, Jingwei Shi, Quanjiang Li, Jingyuan Chen

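TL;DR: The paper proposes TLV-CoRe, a CLIP-based method that unifies tactile features through a Sensor-Aware Modulator and decoupled learning, and strengthens tri-modal interaction among touch, language, and vision. Experiments show that it significantly improves cross-sensor representation learning and multimodal alignment.

Details

Motivation: Tactile sensing supplies fine-grained object-property information that complements vision and language, but existing sensors lack standardization and produce redundant features, and interaction between the tactile modality and the language and vision modalities is insufficient.

Result: Experiments show that TLV-CoRe significantly improves sensor-agnostic representation learning and cross-modal alignment.

Insight: Standardizing tactile features and strengthening multimodal interaction can mitigate feature redundancy in tactile sensing and the isolation between modalities, offering a new direction for multimodal representation learning.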

Abstract: Tactile sensing offers rich and complementary information to vision and language, enabling robots to perceive fine-grained object properties. However, existing tactile sensors lack standardization, leading to redundant features that hinder cross-sensor generalization. Moreover, existing methods fail to fully integrate the intermediate communication among tactile, language, and vision modalities. To address this, we propose TLV-CoRe, a CLIP-based Tactile-Language-Vision Collaborative Representation learning method. TLV-CoRe introduces a Sensor-Aware Modulator to unify tactile features across different sensors and employs tactile-irrelevant decoupled learning to disentangle irrelevant tactile features. Additionally, a Unified Bridging Adapter is introduced to enhance tri-modal interaction within the shared representation space. To fairly evaluate the effectiveness of tactile models, we further propose the RSS evaluation framework, focusing on Robustness, Synergy, and Stability across different methods. Experimental results demonstrate that TLV-CoRe significantly improves sensor-agnostic representation learning and cross-modal alignment, offering a new direction for multimodal tactile representation.
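For readers unfamiliar with CLIP-style alignment, the sketch below shows a symmetric contrastive loss between two batches of paired embeddings and a simple average over the three modality pairs. This is an assumed, generic formulation for illustration only: the abstract does not give TLV-CoRe's training objective, and the function names, temperature value, and toy embeddings are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[i]
    return total / len(logits)

def clip_loss(a_batch, b_batch, temperature=0.07):
    """Symmetric contrastive loss: item i in a_batch is paired with item i in b_batch."""
    a = [normalize(v) for v in a_batch]
    b = [normalize(v) for v in b_batch]
    n = len(a)
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

def tri_modal_loss(tactile, language, vision):
    """Average the pairwise losses over the three modality pairs (an assumption)."""
    return (
        clip_loss(tactile, language)
        + clip_loss(tactile, vision)
        + clip_loss(language, vision)
    ) / 3.0

# Toy embeddings: two samples, already aligned across modalities.
tactile = [[1.0, 0.0], [0.0, 1.0]]
language = [[1.0, 0.0], [0.0, 1.0]]
vision = [[1.0, 0.0], [0.0, 1.0]]

aligned = tri_modal_loss(tactile, language, vision)
# Swapping the language rows breaks the pairing and should raise the loss.
mismatched = tri_modal_loss(tactile, language[::-1], vision)
```

The loss is near zero when matching samples share a direction in the embedding space and grows when the pairing is broken, which is the behaviour a shared tri-modal space is trained toward.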