Table of Contents
- cs.CV [Total: 49]
- cs.CL [Total: 51]
- cs.HC [Total: 1]
- cs.RO [Total: 1]
- eess.IV [Total: 4]
- cs.CR [Total: 1]
- cs.LO [Total: 1]
- cs.AI [Total: 2]
- cs.AR [Total: 2]
- q-bio.NC [Total: 1]
- cs.SE [Total: 1]
- cs.CY [Total: 1]
- cs.LG [Total: 6]
cs.CV
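[1] Text-Driven 3D Hand Motion Generation from Sign Language Data cs.CV | 65-XX | I.4.9; I.5.1
Léore Bensabath, Mathis Petrovich, Gül Varol
TL;DR: The paper aims to generate 3D hand motions from natural language descriptions. It uses a large-scale sign language video dataset with pseudo-annotated sign categories, translates those categories into hand motion descriptions with an LLM, trains a text-conditioned diffusion model called HandMDM, and demonstrates its robustness across domains.
Details
Motivation: Large-scale datasets pairing 3D hand motions with text descriptions are lacking, and existing methods struggle to stay robust across domains such as other sign languages or non-sign hand movements.
Result: HandMDM is robust on unseen sign categories, on signs from another sign language, and on non-sign hand movements.
Insight: Combining large-scale pseudo-annotated data with LLM-generated text descriptions enables efficient 3D hand motion generation that generalizes across domains.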
Abstract: Our goal is to train a generative model of 3D hand motions, conditioned on natural language descriptions specifying motion characteristics such as handshapes, locations, finger/hand/arm movements. To this end, we automatically build pairs of 3D hand motions and their associated textual labels with unprecedented scale. Specifically, we leverage a large-scale sign language video dataset, along with noisy pseudo-annotated sign categories, which we translate into hand motion descriptions via an LLM that utilizes a dictionary of sign attributes, as well as our complementary motion-script cues. This data enables training a text-conditioned hand motion diffusion model HandMDM, that is robust across domains such as unseen sign categories from the same sign language, but also signs from another sign language and non-sign hand movements. We contribute extensive experimental investigation of these scenarios and will make our trained models and data publicly available to support future research in this relatively new field.
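[2] VT-LVLM-AR: A Video-Temporal Large Vision-Language Model Adapter for Fine-Grained Action Recognition in Long-Term Videos cs.CV
Kaining Li, Shuwei He, Zihan Xu
TL;DR: VT-LVLM-AR is a novel framework that converts long videos into semantically rich visual event sequences and uses a large vision-language model (LVLM) for action recognition, addressing the challenge of fine-grained action recognition in long videos.
Details
Motivation: Action recognition in long videos is complicated by complex backgrounds and subtle action differences, and traditional deep learning models are limited by computational overhead, difficulty capturing long-range temporal dependencies, and weak semantic understanding.
Result: State-of-the-art performance on the NTU RGB+D and NTU RGB+D 120 datasets (e.g., 94.1% accuracy on NTU RGB+D X-Sub).
Insight: Video-to-language translation combined with efficient model adaptation shows the considerable potential of LVLMs for video action understanding while also improving interpretability.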
Abstract: Human action recognition in long-term videos, characterized by complex backgrounds and subtle action differences, poses significant challenges for traditional deep learning models due to computational overhead, difficulty in capturing long-range temporal dependencies, and limited semantic understanding. While Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have shown remarkable capabilities in multi-modal understanding and reasoning, their direct application to continuous video streams for fine-grained action recognition remains an open problem. This paper introduces VT-LVLM-AR (Video-Temporal Large Vision-Language Model Adapter for Action Recognition), a novel framework designed to bridge this gap. VT-LVLM-AR comprises a Video-to-Event Mapper (VTEM) that efficiently transforms raw video into compact, semantically rich, and temporally coherent “visual event sequences” through lightweight spatio-temporal feature extraction, adaptive temporal pooling, and conceptual quantization with an event coherence bias. These visual event sequences are then fed into an LVLM-based Action Reasoning module, specifically a frozen LLaVA-1.5 model, adapted using parameter-efficient Prompt Tuning (P-Tuning v2) for action classification. Comprehensive evaluations on the NTU RGB+D and NTU RGB+D 120 datasets demonstrate that VT-LVLM-AR consistently achieves state-of-the-art performance, surpassing existing methods (e.g., 94.1% accuracy on NTU RGB+D X-Sub). Ablation studies confirm the critical contributions of VTEM’s components and the efficacy of Prompt Tuning, while human evaluations underscore the interpretability of our visual event representations. This work highlights the immense potential of leveraging LVLMs for robust and interpretable video action understanding through effective video-to-language translation and efficient model adaptation.
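[3] Boosting Pathology Foundation Models via Few-shot Prompt-tuning for Rare Cancer Subtyping cs.CV
Dexuan He, Xiao Zhou, Wenbin Guan, Liyuan Zhang, Xiaoman Zhang
TL;DR: The paper proposes PathPT, a new framework that exploits vision-language pathology foundation models through spatially-aware visual aggregation and task-specific prompt tuning, substantially improving rare cancer subtyping.
Details
Motivation: Rare cancers account for 20-25% of all malignancies, but limited expert availability makes diagnosis very difficult, especially in pediatric oncology, where rare cancers represent over 70% of cases. Existing pathology foundation models do well on common cancer subtyping but perform poorly on rare cancers.
Result: Benchmarked on eight rare cancer datasets and three common cancer datasets, PathPT substantially improves subtyping accuracy and cancerous-region grounding, outperforming the evaluated vision-language models and multi-instance learning methods.
Insight: The work offers a scalable solution for AI-assisted diagnosis of rare cancers, particularly where access to experts is limited; combining knowledge from the visual and language modalities improves both classification and interpretability.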
Abstract: Rare cancers comprise 20-25% of all malignancies but face major diagnostic challenges due to limited expert availability, especially in pediatric oncology, where they represent over 70% of cases. While pathology vision-language (VL) foundation models show promising zero-shot capabilities for common cancer subtyping, their clinical performance for rare cancers remains limited. Existing multi-instance learning (MIL) methods rely only on visual features, overlooking cross-modal knowledge and compromising interpretability critical for rare cancer diagnosis. To address this limitation, we propose PathPT, a novel framework that fully exploits the potential of vision-language pathology foundation models through spatially-aware visual aggregation and task-specific prompt tuning. Unlike conventional MIL, PathPT converts WSI-level supervision into fine-grained tile-level guidance by leveraging the zero-shot capabilities of VL models, thereby preserving localization on cancerous regions and enabling cross-modal reasoning through prompts aligned with histopathological semantics. We benchmark PathPT on eight rare cancer datasets (four adult and four pediatric) spanning 56 subtypes and 2,910 WSIs, as well as three common cancer datasets, evaluating four state-of-the-art VL models and four MIL frameworks under three few-shot settings. Results show that PathPT consistently delivers superior performance, achieving substantial gains in subtyping accuracy and cancerous region grounding ability. This work advances AI-assisted diagnosis for rare cancers, offering a scalable solution for improving subtyping accuracy in settings with limited access to specialized expertise.
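[4] Semantic-Aware Ship Detection with Vision-Language Integration cs.CV
Jiahao Li, Jiancheng Pan, Yuze Sun, Xiaomeng Huang
TL;DR: This paper proposes a ship detection framework that combines vision-language models (VLMs) with a multi-scale adaptive sliding window strategy, and introduces the ShipSem-VL dataset to capture fine-grained semantic information.
Details
Motivation: Ship detection in remote sensing imagery has wide applications in maritime activity monitoring, shipping logistics, and environmental studies, but existing methods struggle to capture fine-grained semantic information, which limits their effectiveness in complex scenarios.
Result: Experiments show that the framework performs well across the evaluation tasks and improves semantic awareness in ship detection.
Insight: Combining vision-language models with a multi-scale strategy can improve both the semantic quality and the detection accuracy of ship detection in remote sensing imagery.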
Abstract: Ship detection in remote sensing imagery is a critical task with wide-ranging applications, such as maritime activity monitoring, shipping logistics, and environmental studies. However, existing methods often struggle to capture fine-grained semantic information, limiting their effectiveness in complex scenarios. To address these challenges, we propose a novel detection framework that combines Vision-Language Models (VLMs) with a multi-scale adaptive sliding window strategy. To facilitate Semantic-Aware Ship Detection (SASD), we introduce ShipSem-VL, a specialized Vision-Language dataset designed to capture fine-grained ship attributes. We evaluate our framework through three well-defined tasks, providing a comprehensive analysis of its performance and demonstrating its effectiveness in advancing SASD from multiple perspectives.
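[5] Automatic Retrieval of Specific Cows from Unlabeled Videos cs.CV | eess.IV
Jiawen Lyu, Manu Ramesh, Madison Simonds, Jacquelyn P. Boerman, Amy R. Reibman
TL;DR: The paper presents an automated system for retrieving specific cows from unlabeled videos, consisting of an automatic catalog builder, a cow recognizer that uses no deep learning, and a cow finder that works on a continuous video stream.
Details
Motivation: Few video systems can catalog and identify cows automatically and hands-free, especially in unlabeled, unconstrained video.
Result: The system identifies specific cows in unlabeled, unsegmented videos, demonstrating its practical value.
Insight: Cow identification is achievable with traditional methods rather than deep learning, offering a low-cost option for similar settings.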
Abstract: Few automated video systems are described in the open literature that enable hands-free cataloging and identification (ID) of cows in a dairy herd. In this work, we describe our system, composed of an AutoCattloger, which builds a Cattlog of dairy cows in a herd with a single input video clip per cow, an eidetic cow recognizer which uses no deep learning to ID cows, and a CowFinder, which IDs cows in a continuous stream of video. We demonstrate its value in finding individuals in unlabeled, unsegmented videos of cows walking unconstrained through the holding area of a milking parlor.
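[6] Investigating Different Geo Priors for Image Classification cs.CV
Angela Zhu, Christian Lange, Max Hamilton
TL;DR: The paper studies how effective different Spatial Implicit Neural Representation (SINR) models are as geographic priors for vision-based species classification, and examines the impact of model configuration and of how species absent from Geo Prior training are handled. It identifies factors that make a geo prior effective, which may differ from the factors that make range maps accurate.
Details
Motivation: Geo prior models help species classification when location information is available, but the effect of different model configurations and of the handling of species not seen in training is unclear.
Result: The effectiveness of a geo prior depends on specific configurations and handling choices, and these factors differ from what accurate range mapping requires.
Insight: What makes a geo prior useful for visual classification differs from what range mapping requires, so model configuration and handling strategies should be tuned for the classification task.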
Abstract: Species distribution models encode spatial patterns of species occurrence, making them effective priors for vision-based species classification when location information is available. In this study, we evaluate various SINR (Spatial Implicit Neural Representations) models as a geographical prior for visual classification of species from iNaturalist observations. We explore the impact of different model configurations and adjust how we handle predictions for species not included in Geo Prior training. Our analysis reveals factors that contribute to the effectiveness of these models as Geo Priors, factors that may differ from those needed to make accurate range maps.
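The abstract above does not say how the prior is combined with the classifier. A common way to use a location prior is to multiply the image classifier's probabilities by the prior's per-species scores and renormalize. The following minimal sketch assumes that scheme. The function name, the `unseen_fill` parameter, and the species names are hypothetical and do not come from the paper.

```python
def apply_geo_prior(vision_probs, geo_prior, unseen_fill=1.0):
    """Reweight classifier probabilities by a location prior and renormalize.

    vision_probs: dict mapping species -> P(species | image)
    geo_prior:    dict mapping species -> prior score at the observation's
                  location; species missing from this dict are treated as
                  not covered by geo-prior training
    unseen_fill:  hypothetical score assigned to those missing species
    """
    weighted = {
        species: p * geo_prior.get(species, unseen_fill)
        for species, p in vision_probs.items()
    }
    total = sum(weighted.values())
    if total == 0:
        # The prior rules out every candidate: fall back to the classifier.
        return dict(vision_probs)
    return {species: w / total for species, w in weighted.items()}


# Illustrative values only.
vision = {"red_fox": 0.5, "arctic_fox": 0.4, "new_species": 0.1}
prior = {"red_fox": 0.8, "arctic_fox": 0.1}  # "new_species" was never trained
print(apply_geo_prior(vision, prior))                   # neutral fill
print(apply_geo_prior(vision, prior, unseen_fill=0.0))  # suppress unseen species
```

The choice of `unseen_fill` is one example of the handling decision the abstract mentions: a neutral value leaves unseen species to the image classifier, while zero removes them from consideration.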
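[7] Representation Learning with Adaptive Superpixel Coding cs.CV | cs.AI
Mahmoud Khalil, Ahmad Khalil, Alioune Ngom
TL;DR: This paper proposes Adaptive Superpixel Coding (ASC), a Transformer-based self-supervised model whose adaptive superpixel layers overcome the fixed partitioning of traditional Vision Transformers.
Details
Motivation: Traditional vision models rely on fixed grid structures, which limits how well they adapt to varying image content.
Result: ASC outperforms widely used alternatives on standard downstream task benchmarks.
Insight: Dynamic partitioning adapts better to image content and improves model performance.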
Abstract: Deep learning vision models are typically tailored for specific modalities and often rely on domain-specific assumptions, such as the grid structures used by nearly all existing vision models. In this work, we propose a self-supervised model based on Transformers, which we call Adaptive Superpixel Coding (ASC). The key insight of our model is to overcome the limitations of traditional Vision Transformers, which depend on fixed-size and non-adaptive patch partitioning. Instead, ASC employs adaptive superpixel layers that dynamically adjust to the underlying image content. We analyze key properties of the approach that make it effective, and find that our method outperforms widely-used alternatives on standard image downstream task benchmarks.
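[8] Glo-VLMs: Leveraging Vision-Language Models for Fine-Grained Diseased Glomerulus Classification cs.CV
Zhenhao Guo, Rachit Saluja, Tianyuan Yao, Quan Liu, Yuankai Huo
TL;DR: The paper proposes Glo-VLMs, a framework that adapts vision-language models (VLMs) to fine-grained classification of diseased glomeruli with limited labeled data, showing the potential of VLMs in medical image classification.
Details
Motivation: Fine-grained classification of diseased glomeruli is hampered by subtle morphological differences and scarce annotations, and traditional methods have limited effectiveness, motivating a study of how pretrained vision-language models can be applied.
Result: With only a small amount of labeled data, the fine-tuned model reaches 0.7416 accuracy, 0.9045 macro-AUC, and 0.5277 F1-score, supporting the effectiveness of VLMs for medical image classification.
Insight: Even under highly limited supervision, large pretrained models can be fine-tuned for fine-grained medical image classification, suggesting new directions for clinical research.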
Abstract: Vision-language models (VLMs) have shown considerable potential in digital pathology, yet their effectiveness remains limited for fine-grained, disease-specific classification tasks such as distinguishing between glomerular subtypes. The subtle morphological variations among these subtypes, combined with the difficulty of aligning visual patterns with precise clinical terminology, make automated diagnosis in renal pathology particularly challenging. In this work, we explore how large pretrained VLMs can be effectively adapted to perform fine-grained glomerular classification, even in scenarios where only a small number of labeled examples are available. To this end, we introduce Glo-VLMs, a systematic framework for adapting VLMs to fine-grained glomerular classification in data-constrained settings. Our approach leverages curated pathology images alongside clinical text prompts to facilitate joint image-text representation learning for nuanced renal pathology subtypes. By assessing various VLM architectures and adaptation strategies under a few-shot learning paradigm, we explore how both the choice of method and the amount of labeled data impact model performance in clinically relevant scenarios. To ensure a fair comparison, we evaluate all models using standardized multi-class metrics, aiming to clarify the practical requirements and potential of large pretrained models for specialized clinical research applications. In our experiments, fine-tuning the VLMs achieved 0.7416 accuracy, 0.9045 macro-AUC, and 0.5277 F1-score with only 8 shots per class, demonstrating that even with highly limited supervision, foundation models can be effectively adapted for fine-grained medical image classification.
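[9] Contributions to Label-Efficient Learning in Computer Vision and Remote Sensing cs.CV
Minh-Tan Pham
TL;DR: This manuscript summarizes the author's contributions to label-efficient learning in computer vision and remote sensing, focusing on learning effectively from limited or partially annotated data together with abundant unlabeled data, and proposes methods for the challenges specific to Earth observation data.
Details
Motivation: Annotated data in computer vision and remote sensing is costly and time-consuming to obtain, so methods that learn efficiently from limited labels or unlabeled data are needed.
Result: Extensive experiments on natural and remote sensing datasets support the effectiveness of these methods.
Insight: For remote sensing data, exploiting multimodal and hierarchical information is key to label-efficient learning. Future work can extend these methods to the scale required by real-world applications.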
Abstract: This manuscript presents a series of my selected contributions to the topic of label-efficient learning in computer vision and remote sensing. The central focus of this research is to develop and adapt methods that can learn effectively from limited or partially annotated data, and can leverage abundant unlabeled data in real-world applications. The contributions span both methodological developments and domain-specific adaptations, in particular addressing challenges unique to Earth observation data such as multi-modality, spatial resolution variability, and scene heterogeneity. The manuscript is organized around four main axes including (1) weakly supervised learning for object discovery and detection based on anomaly-aware representations learned from large amounts of background images; (2) multi-task learning that jointly trains on multiple datasets with disjoint annotations to improve performance on object detection and semantic segmentation; (3) self-supervised and supervised contrastive learning with multimodal data to enhance scene classification in remote sensing; and (4) few-shot learning for hierarchical scene classification using both explicit and implicit modeling of class hierarchies. These contributions are supported by extensive experimental results across natural and remote sensing datasets, reflecting the outcomes of several collaborative research projects. The manuscript concludes by outlining ongoing and future research directions focused on scaling and enhancing label-efficient learning for real-world applications.
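[10] Panoptic Segmentation of Environmental UAV Images: Litter Beach cs.CV | cs.AI
Ousmane Youme, Jean Marie Dembélé, Eugene C. Ezin, Christophe Cambier
TL;DR: This paper explores CNN-based panoptic segmentation of environmental UAV images, with a focus on beach litter monitoring, using an instance-based segmentation method and a panoptic segmentation method to address the limitations of basic CNN models in complex environments.
Details
Motivation: Marine litter monitoring has become a worldwide problem, and basic CNN models perform poorly on complex beach scenes because of interference such as sand color reflections and footprints, so more robust methods are needed.
Result: The methods reach good accuracy with just a few samples and overcome limitations of basic CNN models in complex environments.
Insight: Panoptic segmentation is promising for environmental monitoring, particularly against complex backgrounds, where it can improve detection performance.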
Abstract: Convolutional neural networks (CNNs) have been used effectively in several fields, including environmental applications. In particular, CNNs can help monitor marine litter, which has become a worldwide problem. UAV images have higher resolution and are more adaptable to local areas than satellite images, making it easier to find and count trash. Since the sand is heterogeneous, a basic CNN model encounters plenty of interference caused by reflections of sand color, human footsteps, shadows, algae, dunes, holes, and tire tracks. For these types of images, other CNN models, such as CNN-based segmentation methods, may be more appropriate. In this paper, we use an instance-based segmentation method and a panoptic segmentation method that show good accuracy with just a few samples. The model is more robust and less
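[11] Automated Multi-label Classification of Eleven Retinal Diseases: A Benchmark of Modern Architectures and a Meta-Ensemble on a Large Synthetic Dataset cs.CV | cs.AI
Jerry Cao-Xue, Tien Comlekoglu, Keyi Xue, Guanliang Wang, Jiang Li
TL;DR: The paper establishes a multi-label retinal disease classification benchmark on the synthetic SynFundus-1M dataset, training six modern deep learning architectures and a meta-ensemble model, and shows strong generalization on both synthetic and real clinical data.
Details
Motivation: Patient privacy concerns and high costs make real clinical datasets scarce, which limits the development of multi-label deep learning models for retinal disease classification. The release of the synthetic SynFundus-1M dataset offers a new way around this.
Result: The meta-ensemble reaches a macro-average AUC of 0.9973 on the internal validation set and generalizes to real clinical datasets (e.g., AUC 0.7972 on a combined DR dataset and 0.9126 on a glaucoma dataset).
Insight: Synthetic data can serve as an effective substitute for real data and accelerate the development of ophthalmic AI systems; the work also shows the potential of meta-ensembling for multi-label classification.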
Abstract: The development of multi-label deep learning models for retinal disease classification is often hindered by the scarcity of large, expertly annotated clinical datasets due to patient privacy concerns and high costs. The recent release of SynFundus-1M, a high-fidelity synthetic dataset with over one million fundus images, presents a novel opportunity to overcome these barriers. To establish a foundational performance benchmark for this new resource, we developed an end-to-end deep learning pipeline, training six modern architectures (ConvNeXtV2, SwinV2, ViT, ResNet, EfficientNetV2, and the RETFound foundation model) to classify eleven retinal diseases using a 5-fold multi-label stratified cross-validation strategy. We further developed a meta-ensemble model by stacking the out-of-fold predictions with an XGBoost classifier. Our final ensemble model achieved the highest performance on the internal validation set, with a macro-average Area Under the Receiver Operating Characteristic Curve (AUC) of 0.9973. Critically, the models demonstrated strong generalization to three diverse, real-world clinical datasets, achieving an AUC of 0.7972 on a combined DR dataset, an AUC of 0.9126 on the AIROGS glaucoma dataset and a macro-AUC of 0.8800 on the multi-label RFMiD dataset. This work provides a robust baseline for future research on large-scale synthetic datasets and establishes that models trained exclusively on synthetic data can accurately classify multiple pathologies and generalize effectively to real clinical images, offering a viable pathway to accelerate the development of comprehensive AI systems in ophthalmology.
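The headline metric above is a macro-average AUC, which for a multi-label task is the unweighted mean of one binary ROC AUC per disease label. The sketch below computes it from the pairwise-ranking definition of AUC. The function names and the toy data are illustrative and are not taken from the paper's pipeline.

```python
def binary_auc(labels, scores):
    """ROC AUC as the probability that a random positive outranks a random
    negative, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def macro_auc(label_matrix, score_matrix):
    """Unweighted mean of per-label AUCs; rows are samples, columns labels."""
    n_labels = len(label_matrix[0])
    per_label = [
        binary_auc(
            [row[j] for row in label_matrix],
            [row[j] for row in score_matrix],
        )
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


# Toy example: four images, two disease labels.
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.3, 0.6]]
print(macro_auc(y_true, y_score))  # 0.875 (label AUCs are 1.0 and 0.75)
```

Because every label counts equally, a rare disease with a weak AUC lowers the macro average as much as a common one would. The pairwise loop is quadratic in the number of samples, so a rank-based implementation would be preferable at the scale of a million images.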
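[12] DRespNeT: A UAV Dataset and YOLOv8-DRN Model for Aerial Instance Segmentation of Building Access Points for Post-Earthquake Search-and-Rescue Missions cs.CV
Aykut Sirma, Angelos Plastropoulos, Argyrios Zolotas, Gilbert Tang
TL;DR: The paper introduces DRespNeT, a high-resolution dataset for aerial instance segmentation of building access points after earthquakes, and an optimized YOLOv8-based model, YOLOv8-DRN, to improve real-time decision-making in search-and-rescue missions.
Details
Motivation: Post-earthquake search and rescue requires fast identification of building access points and structural obstacles, but existing datasets rely on satellite imagery or coarse semantic labels and lack high-resolution, fine-grained annotations. The authors built the DRespNeT dataset and an optimized model to fill this gap.
Result: YOLOv8-DRN achieves 92.7% mAP50 at 27 FPS on an RTX-4090 GPU, meeting real-time requirements. The dataset and model improve the efficiency of search-and-rescue operations and real-time decision-making.
Insight: 1. High-resolution, fine-grained annotation is essential for disaster-response tasks; 2. lightweight models such as YOLOv8 perform well in real-time settings; 3. the dataset and model strengthen the potential for human-robot collaboration in search and rescue.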
Abstract: Recent advancements in computer vision and deep learning have enhanced disaster-response capabilities, particularly in the rapid assessment of earthquake-affected urban environments. Timely identification of accessible entry points and structural obstacles is essential for effective search-and-rescue (SAR) operations. To address this need, we introduce DRespNeT, a high-resolution dataset specifically developed for aerial instance segmentation of post-earthquake structural environments. Unlike existing datasets, which rely heavily on satellite imagery or coarse semantic labeling, DRespNeT provides detailed polygon-level instance segmentation annotations derived from high-definition (1080p) aerial footage captured in disaster zones, including the 2023 Turkiye earthquake and other impacted regions. The dataset comprises 28 operationally critical classes, including structurally compromised buildings, access points such as doors, windows, and gaps, multiple debris levels, rescue personnel, vehicles, and civilian visibility. A distinctive feature of DRespNeT is its fine-grained annotation detail, enabling differentiation between accessible and obstructed areas, thereby improving operational planning and response efficiency. Performance evaluations using YOLO-based instance segmentation models, specifically YOLOv8-seg, demonstrate significant gains in real-time situational awareness and decision-making. Our optimized YOLOv8-DRN model achieves 92.7% mAP50 with an inference speed of 27 FPS on an RTX-4090 GPU for multi-target detection, meeting real-time operational requirements. The dataset and models support SAR teams and robotic systems, providing a foundation for enhancing human-robot collaboration, streamlining emergency response, and improving survivor outcomes.
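[13] NeuralMeshing: Complete Object Mesh Extraction from Casual Captures cs.CV | cs.RO
Floris Erich, Naoya Chiba, Abdullah Mustafa, Ryo Hanai, Noriaki Ando
TL;DR: The paper presents an automated system that extracts complete object meshes from multiple videos without 3D scanning equipment, needing only a known reference point in each video.
Details
Motivation: The goal is to build complete geometric models of everyday objects from casual video captures, without commercial 3D scanners, lowering the barrier to geometric modeling.
Result: The system produces complete object meshes from casual videos, and the code is open source.
Insight: Merging multiple videos and using simple markers enables low-cost, high-quality geometric modeling suited to everyday data capture.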
Abstract: How can we extract complete geometric models of objects that we encounter in our daily life, without having access to commercial 3D scanners? In this paper we present an automated system for generating geometric models of objects from two or more videos. Our system requires the specification of one known point in at least one frame of each video, which can be automatically determined using a fiducial marker such as a checkerboard or Augmented Reality (AR) marker. The remaining frames are automatically positioned in world space by using Structure-from-Motion techniques. By using multiple videos and merging results, a complete object mesh can be generated, without having to rely on hole filling. Code for our system is available from https://github.com/FlorisE/NeuralMeshing.
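[14] Expandable Residual Approximation for Knowledge Distillation cs.CV
Zhaoyi Yan, Binghui Chen, Yunfan Liu, Qixiang Ye
TL;DR: The paper proposes Expandable Residual Approximation (ERA), a knowledge distillation method that decomposes the approximation of residual knowledge into multiple steps, using a divide-and-conquer strategy to make the teacher's representation easier for the student to mimic. Combined with a Teacher Weight Integration strategy that mitigates the capacity gap, it improves image classification and object detection performance.
Details
Motivation: In knowledge distillation (KD), the gap in learning capacity between teacher and student hinders sufficient knowledge transfer. Inspired by the progressive approximation principle in the Stone-Weierstrass theorem, the authors designed ERA to address this.
Result: Top-1 accuracy on the ImageNet classification benchmark improves by 1.41% and AP on the MS COCO object detection benchmark by 1.40, with leading performance across computer vision tasks.
Insight: Approximating residual knowledge step by step lowers the learning difficulty for the student, and reusing the teacher's head weights mitigates the capacity gap, yielding effective knowledge distillation.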
Abstract: Knowledge distillation (KD) aims to transfer knowledge from a large-scale teacher model to a lightweight one, significantly reducing computational and storage requirements. However, the inherent learning capacity gap between the teacher and student often hinders the sufficient transfer of knowledge, motivating numerous studies to address this challenge. Inspired by the progressive approximation principle in the Stone-Weierstrass theorem, we propose Expandable Residual Approximation (ERA), a novel KD method that decomposes the approximation of residual knowledge into multiple steps, reducing the difficulty of mimicking the teacher’s representation through a divide-and-conquer approach. Specifically, ERA employs a Multi-Branched Residual Network (MBRNet) to implement this residual knowledge decomposition. Additionally, a Teacher Weight Integration (TWI) strategy is introduced to mitigate the capacity disparity by reusing the teacher’s head weights. Extensive experiments show that ERA improves the Top-1 accuracy on the ImageNet classification benchmark by 1.41% and the AP on the MS COCO object detection benchmark by 1.40, as well as achieving leading performance across computer vision tasks. Codes and models are available at https://github.com/Zhaoyi-Yan/ERA.
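[15] Advances and Trends in the 3D Reconstruction of the Shape and Motion of Animals cs.CV
Ziqi Li, Abderraouf Amrani, Shri Rai, Hamid Laga
TL;DR: This survey reviews recent progress in 3D reconstruction of animal shape and motion, covering non-intrusive deep learning-based methods, analyzing input modalities, representations, reconstruction techniques, and training mechanisms, and identifying current challenges and future directions.
Details
Motivation: Traditional 3D scanning is intrusive, expensive, and hard to deploy in animals' natural environments, so non-intrusive solutions are needed for biology, livestock management, animal conservation, and digital entertainment.
Result: Deep learning methods perform well in non-intrusive 3D animal reconstruction but still face challenges including data scarcity, the complexity of dynamic modeling, and limited generalization.
Insight: Future research could focus on multimodal data fusion, unsupervised learning, and lightweight deployment to make 3D animal reconstruction more practical and broadly applicable.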
Abstract: Reconstructing the 3D geometry, pose, and motion of animals is a long-standing problem, which has a wide range of applications, from biology, livestock management, and animal conservation and welfare to content creation in digital entertainment and Virtual/Augmented Reality (VR/AR). Traditionally, 3D models of real animals are obtained using 3D scanners. These, however, are intrusive, often prohibitively expensive, and difficult to deploy in the natural environment of the animals. In recent years, we have seen a significant surge in deep learning-based techniques that enable the 3D reconstruction, in a non-intrusive manner, of the shape and motion of dynamic objects just from their RGB image and/or video observations. Several papers have explored their application and extension to various types of animals. This paper surveys the latest developments in this emerging and growing field of research. It categorizes and discusses the state-of-the-art methods based on their input modalities, the way the 3D geometry and motion of animals are represented, the type of reconstruction techniques they use, and the training mechanisms they adopt. It also analyzes the performance of some key methods, discusses their strengths and limitations, and identifies current challenges and directions for future research.
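[16] A Unified Voxel Diffusion Module for Point Cloud 3D Object Detection cs.CV
Qifeng Liu, Dawei Zhao, Yabo Dong, Linzhi Shang, Liang Xiao
TL;DR: This paper proposes a Voxel Diffusion Module (VDM) that combines sparse 3D convolutions with submanifold sparse convolutions to strengthen voxel-level representation and diffusion in point cloud data, improving Transformer- and SSM-based detection models.
Details
Motivation: Current voxel-based point cloud detectors require strict consistency between input and output dimensions, which limits the spatial diffusion that convolutions normally provide and hurts detection accuracy. Inspired by CNN-based detectors, the paper sets out to address this.
Result: VDM improves detection on several benchmarks and sets new state-of-the-art results, including 74.7 mAPH on Waymo and 72.9 NDS on nuScenes.
Insight: Through the spatial diffusion of sparse convolutions combined with residual connections, VDM strengthens voxel feature representation while remaining computationally efficient, offering a new approach for point cloud detection.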
Abstract: Recent advances in point cloud object detection have increasingly adopted Transformer-based and State Space Models (SSMs), demonstrating strong performance. However, voxel-based representations in these models require strict consistency in input and output dimensions due to their serialized processing, which limits the spatial diffusion capability typically offered by convolutional operations. This limitation significantly affects detection accuracy. Inspired by CNN-based object detection architectures, we propose a novel Voxel Diffusion Module (VDM) to enhance voxel-level representation and diffusion in point cloud data. VDM is composed of sparse 3D convolutions, submanifold sparse convolutions, and residual connections. To ensure computational efficiency, the output feature maps are downsampled to one-fourth of the original input resolution. VDM serves two primary functions: (1) diffusing foreground voxel features through sparse 3D convolutions to enrich spatial context, and (2) aggregating fine-grained spatial information to strengthen voxel-wise feature representation. The enhanced voxel features produced by VDM can be seamlessly integrated into mainstream Transformer- or SSM-based detection models for accurate object classification and localization, highlighting the generalizability of our method. We evaluate VDM on several benchmark datasets by embedding it into both Transformer-based and SSM-based models. Experimental results show that our approach consistently improves detection accuracy over baseline models. Specifically, VDM-SSMs achieve 74.7 mAPH (L2) on Waymo, 72.9 NDS on nuScenes, 42.3 mAP on Argoverse 2, and 67.6 mAP on ONCE, setting new state-of-the-art performance across all datasets. Our code will be made publicly available.
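The "diffusion" described above relies on a general property of sparse convolutions. A regular sparse convolution produces an output wherever its kernel window covers at least one active input voxel, so the active set grows. A submanifold sparse convolution computes outputs only at sites that were already active. The sketch below tracks only which voxel coordinates are active, with stride 1 on an unbounded grid. It ignores feature values and downsampling, and it is not the authors' VDM implementation. All function names are illustrative.

```python
from itertools import product


def kernel_offsets(kernel_size=3):
    """All (dx, dy, dz) offsets covered by a cubic kernel centered on a voxel."""
    r = kernel_size // 2
    return list(product(range(-r, r + 1), repeat=3))


def sparse_conv_active_sites(active, kernel_size=3):
    """Regular sparse convolution, stride 1: every site whose kernel window
    touches an active input becomes active, so features spread outward."""
    return {
        (x + dx, y + dy, z + dz)
        for (x, y, z) in active
        for (dx, dy, dz) in kernel_offsets(kernel_size)
    }


def submanifold_active_sites(active, kernel_size=3):
    """Submanifold sparse convolution: outputs exist only at input sites,
    so the active set (and the sparsity pattern) is unchanged."""
    return set(active)


foreground = {(0, 0, 0), (5, 5, 5)}  # two isolated occupied voxels
print(len(sparse_conv_active_sites(foreground)))   # 54: each voxel spreads to 27 sites
print(len(submanifold_active_sites(foreground)))   # 2: sparsity preserved
```

Stacking only submanifold layers keeps isolated voxels isolated, which is why a regular sparse convolution is needed to carry foreground features into neighboring empty space.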
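[17] Ensemble learning of foundation models for precision oncology cs.CV
Xiangde Luo, Xiyue Wang, Feyisope Eweje, Xiaoming Zhang, Sen Yang
TL;DR: The paper proposes ELF (Ensemble Learning of Foundation models), a framework that integrates five state-of-the-art pathology foundation models to produce unified slide-level representations, improving the accuracy and robustness of disease classification, biomarker detection, and therapeutic response prediction.
Details
Motivation: Existing pathology foundation models are typically trained on disparate datasets with different strategies, leading to inconsistent performance and limited generalizability. The authors propose ELF to overcome this.
Result: Across clinical applications (disease classification, biomarker detection, and therapeutic response prediction), ELF outperforms all of its constituent foundation models and existing slide-level models, with higher accuracy and robustness.
Insight: Ensemble learning can combine the strengths of different pathology foundation models, offering a scalable and generalizable route to AI-assisted precision oncology.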
Abstract: Histopathology is essential for disease diagnosis and treatment decision-making. Recent advances in artificial intelligence (AI) have enabled the development of pathology foundation models that learn rich visual representations from large-scale whole-slide images (WSIs). However, existing models are often trained on disparate datasets using varying strategies, leading to inconsistent performance and limited generalizability. Here, we introduce ELF (Ensemble Learning of Foundation models), a novel framework that integrates five state-of-the-art pathology foundation models to generate unified slide-level representations. Trained on 53,699 WSIs spanning 20 anatomical sites, ELF leverages ensemble learning to capture complementary information from diverse models while maintaining high data efficiency. Unlike traditional tile-level models, ELF’s slide-level architecture is particularly advantageous in clinical contexts where data are limited, such as therapeutic response prediction. We evaluated ELF across a wide range of clinical applications, including disease classification, biomarker detection, and response prediction to major anticancer therapies, cytotoxic chemotherapy, targeted therapy, and immunotherapy, across multiple cancer types. ELF consistently outperformed all constituent foundation models and existing slide-level models, demonstrating superior accuracy and robustness. Our results highlight the power of ensemble learning for pathology foundation models and suggest ELF as a scalable and generalizable solution for advancing AI-assisted precision oncology.
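[18] Two-flow Feedback Multi-scale Progressive Generative Adversarial Network cs.CV | cs.AI
Sun Weikai, Song Shijie, Chi Wenjie
TL;DR: The paper proposes a two-flow feedback multi-scale progressive generative adversarial network (MSPG-SEN) that aims to improve generation quality and stability while simplifying training and reducing training cost.
Details
Motivation: Although diffusion models have advanced image generation, GANs still have room to develop thanks to their particular advantages, and the authors aim to further improve GAN generation quality and training efficiency.
Result: The authors report state-of-the-art results on several datasets (e.g., 89.7% on INKK and 96.4% on OPIN) along with reduced training cost.
Insight: The work combines a feedback mechanism with a multi-scale progressive structure and adds an attention mechanism and a dynamic network design, offering new ideas for improving GANs.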
Abstract: Although diffusion models have made good progress in the field of image generation, GANs\cite{huang2023adaptive} still have considerable room for development due to their unique advantages, as in WGAN\cite{liu2021comparing}, SSGAN\cite{guibas2021adaptive} \cite{zhang2022vsa} \cite{zhou2024adapt} and so on. In this paper, we propose a novel two-flow feedback multi-scale progressive generative adversarial network (MSPG-SEN) for GAN models. This paper has four contributions: 1) We propose MSPG-SEN, which not only improves image quality and human visual perception while retaining the advantages of existing GAN models, but also simplifies the training process and reduces the training cost of GAN networks. Our experimental results show that MSPG-SEN achieves state-of-the-art generation results on five datasets: 89.7% on INKK, 78.3% on AWUN, 85.5% on IONJ, 88.7% on POKL, and 96.4% on OPIN. 2) We propose an adaptive perception-behavioral feedback loop (APFL), which effectively improves the robustness and training stability of the model and reduces the training cost. 3) We propose a globally connected two-flow dynamic residual network(). In ablation experiments, it effectively improves training efficiency and greatly improves generalization ability, with stronger flexibility. 4) We propose a new dynamic embedded attention mechanism (DEMA). In experiments, the attention can be extended to a variety of image processing tasks; it effectively captures global-local information, improves feature separation and feature expression capabilities, and requires minimal computing resources (only 88.7% with INJK), with strong cross-task capability.
[19] Domain Adaptation via Feature Refinement cs.CV | cs.LGPDF
Savvas Karatsiolis, Andreas Kamilaris
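TL;DR: The paper proposes DAFR2, a simple and effective framework for unsupervised domain adaptation that combines adaptation of batch normalization statistics, feature distillation, and hypothesis transfer to obtain robust, domain-invariant feature spaces under distribution shift.
Details
Motivation: In unsupervised domain adaptation, distribution shift degrades model performance on the target domain. Conventional methods typically require complex architectures or training objectives, whereas this work aims to align features across domains with a simple method.
Result: Experiments on several benchmark datasets (e.g., CIFAR10-C and CIFAR100-C) show that DAFR2 outperforms existing methods in robustness to corruption.
Insight: Aligning feature distributions at both the statistical and representational levels is key to improving domain adaptation performance, and DAFR2 improves feature alignment without increasing model complexity.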
Abstract: We propose Domain Adaptation via Feature Refinement (DAFR2), a simple yet effective framework for unsupervised domain adaptation under distribution shift. The proposed method synergistically combines three key components: adaptation of Batch Normalization statistics using unlabeled target data, feature distillation from a source-trained model and hypothesis transfer. By aligning feature distributions at the statistical and representational levels, DAFR2 produces robust and domain-invariant feature spaces that generalize across similar domains without requiring target labels, complex architectures or sophisticated training objectives. Extensive experiments on benchmark datasets, including CIFAR10-C, CIFAR100-C, MNIST-C and PatchCamelyon-C, demonstrate that the proposed algorithm outperforms prior methods in robustness to corruption. Theoretical and empirical analyses further reveal that our method achieves improved feature alignment, increased mutual information between the domains and reduced sensitivity to input perturbations.
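The first component named in the abstract, adapting Batch Normalization statistics with unlabeled target data, can be illustrated with a minimal sketch. This is not the authors' implementation: it handles a single feature channel in plain Python, and the function names and numeric values are hypothetical.

```python
from statistics import fmean, pvariance

def batchnorm(x, mean, var, eps=1e-5):
    """Normalize one feature channel with the given statistics."""
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

def adapt_bn_stats(target_batches):
    """Re-estimate mean and variance of one channel from unlabeled target batches."""
    values = [v for batch in target_batches for v in batch]
    return fmean(values), pvariance(values)

# Statistics stored in the layer after source training (hypothetical values).
source_mean, source_var = 0.0, 1.0
# Unlabeled target activations whose distribution has shifted (hypothetical values).
target_batches = [[2.0, 4.0, 6.0], [3.0, 5.0, 4.0]]

target_mean, target_var = adapt_bn_stats(target_batches)
with_source = batchnorm(target_batches[0], source_mean, source_var)
with_target = batchnorm(target_batches[0], target_mean, target_var)
```

With the source statistics the normalized activations stay far from zero mean, while the re-estimated target statistics recenter them. No target labels are involved at any point.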
[20] 4D Virtual Imaging Platform for Dynamic Joint Assessment via Uni-Plane X-ray and 2D-3D Registration cs.CVPDF
Hao Tang, Rongxi Yi, Lei Li, Kaiyi Cao, Jiapeng Zhao
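TL;DR: The paper proposes an integrated 4D joint analysis platform that combines a dual robotic arm cone-beam CT (CBCT) system with dynamic 2D X-ray imaging for dynamic joint assessment, achieving high accuracy at a low radiation dose.
Details
Motivation: Conventional CT cannot capture dynamic, weight-bearing joint motion, and current methods are limited by radiation exposure or incomplete spatial information, so a low-radiation solution capable of 4D imaging is needed.
Result: In simulation studies, the method achieved sub-voxel accuracy (0.235 mm) with a 99.18% success rate, outperforming conventional methods; clinical evaluation showed accurate quantification of knee joint motion in TKA patients.
Insight: The platform provides an efficient, low-dose 4D joint imaging tool for biomechanical research, precision diagnostics, and personalized orthopedic treatment.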
Abstract: Conventional computed tomography (CT) lacks the ability to capture dynamic, weight-bearing joint motion. Functional evaluation, particularly after surgical intervention, requires four-dimensional (4D) imaging, but current methods are limited by excessive radiation exposure or incomplete spatial information from 2D techniques. We propose an integrated 4D joint analysis platform that combines: (1) a dual robotic arm cone-beam CT (CBCT) system with a programmable, gantry-free trajectory optimized for upright scanning; (2) a hybrid imaging pipeline that fuses static 3D CBCT with dynamic 2D X-rays using deep learning-based preprocessing, 3D-2D projection, and iterative optimization; and (3) a clinically validated framework for quantitative kinematic assessment. In simulation studies, the method achieved sub-voxel accuracy (0.235 mm) with a 99.18 percent success rate, outperforming conventional and state-of-the-art registration approaches. Clinical evaluation further demonstrated accurate quantification of tibial plateau motion and medial-lateral variance in post-total knee arthroplasty (TKA) patients. This 4D CBCT platform enables fast, accurate, and low-dose dynamic joint imaging, offering new opportunities for biomechanical research, precision diagnostics, and personalized orthopedic care.
[21] Beyond Human-prompting: Adaptive Prompt Tuning with Semantic Alignment for Anomaly Detection cs.CV | cs.AIPDF
Pi-Wei Chen, Jerry Chun-Wei Lin, Wei-Han Chen, Jia Ji, Zih-Ching Chen
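TL;DR: The paper proposes Adaptive Prompt Tuning (APT), which trains learnable prompts using self-generated anomaly samples with noise perturbations, significantly improving anomaly detection performance.
Details
Motivation: Existing prompt-based anomaly detection methods rely on hand-designed prompts and lack accessible anomaly samples, which limits their understanding of context-specific anomalies.
Result: APT achieves state-of-the-art performance on multiple benchmark datasets without manual prompt design, providing a robust anomaly detection solution.
Insight: The method shows how adaptive prompt tuning and semantic alignment can improve a model's generalization to context-dependent anomalies.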
Abstract: Pre-trained Vision-Language Models (VLMs) have recently shown promise in detecting anomalies. However, previous approaches are fundamentally limited by their reliance on human-designed prompts and the lack of accessible anomaly samples, leading to significant gaps in context-specific anomaly understanding. In this paper, we propose Adaptive Prompt Tuning with semantic alignment for anomaly detection (APT), a groundbreaking prior knowledge-free, few-shot framework that overcomes the limitations of traditional prompt-based approaches. APT uses self-generated anomaly samples with noise perturbations to train learnable prompts that capture context-dependent anomalies in different scenarios. To prevent overfitting to synthetic noise, we propose a Self-Optimizing Meta-prompt Guiding Scheme (SMGS) that iteratively aligns the prompts with general anomaly semantics while incorporating diverse synthetic anomalies. Our system not only advances pixel-wise anomaly detection, but also achieves state-of-the-art performance on multiple benchmark datasets without requiring prior knowledge for prompt crafting, establishing a robust and versatile solution for real-world anomaly detection.
[22] RAGSR: Regional Attention Guided Diffusion for Image Super-Resolution cs.CVPDF
Haodong He, Yancheng Bai, Rui Lan, Xu Duan, Lei Sun
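TL;DR: This paper proposes RAGSR, a regional attention guided super-resolution method that combines fine-grained regional descriptions with a novel attention mechanism to improve super-resolution quality in multi-object scenes.
Details
Motivation: Existing super-resolution methods based on vision-language models and diffusion models struggle to generate clear and accurate regional details, especially in multi-object scenes, mainly because they lack fine-grained regional descriptions and the models cannot adequately capture complex prompts.
Result: Experiments show that RAGSR generates perceptually authentic visual details on benchmark datasets while maintaining contextual consistency, outperforming existing methods.
Insight: 1. Fine-grained regional descriptions and textual priors are crucial for improving super-resolution quality; 2. The regional attention mechanism effectively controls the fusion of text and image information and avoids interference from unrelated content.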
Abstract: The rich textual information of large vision-language models (VLMs) combined with the powerful generative prior of pre-trained text-to-image (T2I) diffusion models has achieved impressive performance in single-image super-resolution (SISR). However, existing methods still face significant challenges in generating clear and accurate regional details, particularly in scenarios involving multiple objects. This challenge primarily stems from a lack of fine-grained regional descriptions and the models’ insufficient ability to capture complex prompts. To address these limitations, we propose a Regional Attention Guided Super-Resolution (RAGSR) method that explicitly extracts localized fine-grained information and effectively encodes it through a novel regional attention mechanism, enabling both enhanced detail and overall visually coherent SR results. Specifically, RAGSR localizes object regions in an image and assigns a fine-grained caption to each region; the results are formatted as region-text pairs that serve as textual priors for T2I models. A regional guided attention is then leveraged to ensure that each region-text pair is properly considered in the attention process while preventing unwanted interactions between unrelated region-text pairs. By leveraging this attention mechanism, our approach offers finer control over the integration of text and image information, thereby effectively overcoming limitations faced by traditional SISR techniques. Experimental results on benchmark datasets demonstrate that our approach exhibits superior performance in generating perceptually authentic visual details while maintaining contextual consistency compared to existing approaches.
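The idea of letting each region-text pair interact only with itself can be sketched as a masked cross-attention. The snippet below is an illustrative assumption, not the RAGSR implementation: region ids, scores, and function names are hypothetical, and it shows only how a mask blocks attention between image tokens and caption tokens of unrelated regions.

```python
import math

def regional_mask(image_regions, text_regions):
    """mask[i][j] is True when image token i may attend to text token j."""
    return [[ir == tr for tr in text_regions] for ir in image_regions]

def masked_softmax(scores, mask_row):
    """Softmax over allowed positions only; blocked positions get weight 0."""
    allowed = [s for s, m in zip(scores, mask_row) if m]
    if not allowed:
        return [0.0] * len(scores)
    peak = max(allowed)
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

image_regions = [0, 0, 1]      # region id of each image token (hypothetical)
text_regions = [0, 0, 1, 1]    # region id of each caption token (hypothetical)
scores = [[0.2, 0.1, 5.0, 4.0],
          [0.3, 0.3, 2.0, 1.0],
          [3.0, 2.0, 0.5, 0.5]]  # raw image-to-text attention scores (hypothetical)

mask = regional_mask(image_regions, text_regions)
weights = [masked_softmax(row, m) for row, m in zip(scores, mask)]
```

Even though the first image token has large raw scores against the caption of region 1, the mask forces all of its attention weight onto the caption tokens of its own region.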
[23] Through the Looking Glass: A Dual Perspective on Weakly-Supervised Few-Shot Segmentation cs.CV | cs.AIPDF
Jiaqi Ma, Guo-Sen Xie, Fang Zhao, Zechao Li
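TL;DR: The paper proposes TLG, a novel homologous but heterogeneous network that uses heterogeneous visual aggregation (HA) modules and a heterogeneous transfer (HT) module to address network homogenization in meta-learning, achieving substantial gains on weakly-supervised few-shot segmentation.
Details
Motivation: Existing meta-learning methods tend toward homogeneity when sampling support-query pairs, which leads to over-semantic homogenization of the network. To address this, the paper proposes a heterogeneous network design that enhances complementarity while preserving semantic commonality.
Result: With only 1/24 of the parameters of existing SOTA models, TLG improves by 13.2% on Pascal-5^i and 9.7% on COCO-20^i, and is the first weakly supervised method to outperform fully supervised models under the same backbone architecture.
Insight: A heterogeneous network design can effectively balance semantic commonality and heterogeneous complementarity, and weakly supervised methods have great potential in few-shot tasks.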
Abstract: Meta-learning aims to uniformly sample homogeneous support-query pairs, characterized by the same categories and similar attributes, and extract useful inductive biases through identical network architectures. However, this identical network design results in over-semantic homogenization. To address this, we propose a novel homologous but heterogeneous network. By treating support-query pairs as dual perspectives, we introduce heterogeneous visual aggregation (HA) modules to enhance complementarity while preserving semantic commonality. To further reduce semantic noise and amplify the uniqueness of heterogeneous semantics, we design a heterogeneous transfer (HT) module. Finally, we propose heterogeneous CLIP (HC) textual information to enhance the generalization capability of multimodal models. In the weakly-supervised few-shot semantic segmentation (WFSS) task, with only 1/24 of the parameters of existing state-of-the-art models, TLG achieves a 13.2% improvement on Pascal-5^i and a 9.7% improvement on COCO-20^i. To the best of our knowledge, TLG is also the first weakly supervised (image-level) model that outperforms fully supervised (pixel-level) models under the same backbone architectures. The code is available at https://github.com/jarch-ma/TLG.
[24] FTIO: Frequent Temporally Integrated Objects cs.CVPDF
Mohammad Mohammadzadeh Kalati, Farhad Maleki, Ian McQuillan
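TL;DR: FTIO is a post-processing framework that significantly improves unsupervised video object segmentation (UVOS) by improving object selection and correcting temporal inconsistencies.
Details
Motivation: Unsupervised video object segmentation (UVOS) faces high uncertainty in object selection and temporal inconsistencies, especially for small objects and complex structures.
Result: Experiments show that FTIO achieves SOTA performance on multi-object UVOS.
Insight: Frequency-based saliency and temporal integration effectively improve the robustness and consistency of UVOS.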
Abstract: Predicting and tracking objects in real-world scenarios is a critical challenge in Video Object Segmentation (VOS) tasks. Unsupervised VOS (UVOS) has the additional challenge of finding an initial segmentation of salient objects, which affects the entire process and keeps a permanent uncertainty about the object proposals. Moreover, deformation and fast motion can lead to temporal inconsistencies. To address these problems, we propose Frequent Temporally Integrated Objects (FTIO), a post-processing framework with two key components. First, we introduce a combined criterion to improve object selection, mitigating failures common in UVOS–particularly when objects are small or structurally complex–by extracting frequently appearing salient objects. Second, we present a three-stage method to correct temporal inconsistencies by integrating missing object mask regions. Experimental results demonstrate that FTIO achieves state-of-the-art performance in multi-object UVOS. Code is available at: https://github.com/MohammadMohammadzadehKalati/FTIO
[25] SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning cs.CV | cs.AI | cs.CLPDF
Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou
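TL;DR: SpecVLM is a training-free speculative decoding framework for video large language models (Vid-LLMs) that achieves lossless decoding acceleration through two-stage video token pruning, with up to a 2.68× decoding speedup.
Details
Motivation: Video large language models incur memory and computational overhead when processing dense video tokens, and existing video token reduction methods cause information loss.
Result: On four video understanding benchmarks, SpecVLM achieves up to a 2.68× decoding speedup for LLaVA-OneVision-72B and a 2.11× speedup for Qwen2.5-VL-32B.
Insight: The finding that the draft model's speculation has low sensitivity to video token pruning offers a new direction for efficient video content processing.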
Abstract: Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model’s speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens, enabling efficient speculation without sacrificing accuracy. To achieve this, it performs a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68× decoding speedup for LLaVA-OneVision-72B and 2.11× speedup for Qwen2.5-VL-32B.
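The two-stage pruning described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not SpecVLM's code: tokens are laid out on a 1D index axis rather than a 2D frame grid, the attention values are made up, and `two_stage_prune`, `keep_top`, and `keep_uniform` are hypothetical names.

```python
def two_stage_prune(attn, keep_top, keep_uniform):
    """Return sorted indices of video tokens kept for the draft model.

    Stage I keeps the keep_top tokens with the highest verifier attention.
    Stage II keeps keep_uniform of the remaining tokens at an even stride.
    """
    order = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)
    stage1 = set(order[:keep_top])
    rest = [i for i in range(len(attn)) if i not in stage1]
    if keep_uniform <= 0 or not rest:
        return sorted(stage1)
    stride = len(rest) / keep_uniform
    stage2 = {rest[int(k * stride)] for k in range(min(keep_uniform, len(rest)))}
    return sorted(stage1 | stage2)

# 20 video tokens; two of them receive most of the verifier's attention (hypothetical).
attn = [0.01] * 20
attn[7], attn[13] = 0.5, 0.3

kept = two_stage_prune(attn, keep_top=2, keep_uniform=2)
pruned_fraction = 1 - len(kept) / len(attn)
```

In this toy setting four of twenty tokens survive: the two high-attention tokens from Stage I plus two evenly spaced tokens from Stage II that preserve coarse coverage of the rest of the sequence.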
[26] T-Mask: Temporal Masking for Probing Foundation Models across Camera Views in Driver Monitoring cs.CVPDF
Thinesh Thiyakesan Ponbagavathi, Kunyu Peng, Alina Roitberg
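TL;DR: This paper proposes T-Mask, a temporal masking method that taps the potential of foundation models across camera views in driver monitoring, and shows experimentally that it outperforms existing lightweight adaptation methods.
Details
Motivation: Changes in camera viewpoint are a common challenge in driver monitoring. Deep learning methods and pretrained foundation models show potential under lightweight adaptation (such as linear probing), but their robustness to unseen viewpoints is underexplored.
Result: 1. T-Mask improves cross-view top-1 accuracy by 1.23% over the baselines and by 8.0% over PEFT methods;
2. For secondary activity recognition, it improves by 5.42% under the trained view and by 1.36% under cross-view settings.
Insight: 1. Lightweight adaptation methods (such as T-Mask) show potential in cross-view and low-data settings;
2. Temporal token selection is crucial for building robust driver monitoring systems;
3. Foundation models show strong generalization in fine-grained visual tasks.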
Abstract: Changes of camera perspective are a common obstacle in driver monitoring. While deep learning and pretrained foundation models show strong potential for improved generalization via lightweight adaptation of the final layers (‘probing’), their robustness to unseen viewpoints remains underexplored. We study this challenge by adapting image foundation models to driver monitoring using a single training view, and evaluating them directly on unseen perspectives without further adaptation. We benchmark simple linear probes, advanced probing strategies, and compare two foundation models (DINOv2 and CLIP) against parameter-efficient fine-tuning (PEFT) and full fine-tuning. Building on these insights, we introduce T-Mask, a new image-to-video probing method that leverages temporal token masking and emphasizes more dynamic video regions. Benchmarked on the public Drive&Act dataset, T-Mask improves cross-view top-1 accuracy by +1.23% over strong probing baselines and +8.0% over PEFT methods, without adding any parameters. It proves particularly effective for underrepresented secondary activities, boosting recognition by +5.42% under the trained view and +1.36% under cross-view settings. This work provides encouraging evidence that adapting foundation models with lightweight probing methods like T-Mask has strong potential in fine-grained driver observation, especially in cross-view and low-data settings. These results highlight the importance of temporal token selection when leveraging foundation models to build robust driver monitoring systems. Code and models will be made available at https://github.com/th-nesh/T-MASK to support ongoing research.
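The abstract does not spell out how the temporal token masking is computed, so the following is only one plausible reading, offered as a sketch: score each token position by how much it changes between consecutive frames, keep the most dynamic positions, and pool only those. Tokens are scalars for brevity, and all names and values are hypothetical.

```python
def motion_scores(frames):
    """Mean absolute change of each token position across consecutive frames."""
    n_tokens = len(frames[0])
    scores = [0.0] * n_tokens
    for prev, cur in zip(frames, frames[1:]):
        for t in range(n_tokens):
            scores[t] += abs(cur[t] - prev[t])
    return [s / (len(frames) - 1) for s in scores]

def temporal_mask(frames, keep):
    """Boolean mask that keeps the `keep` most dynamic token positions."""
    scores = motion_scores(frames)
    top = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)[:keep]
    return [t in top for t in range(len(scores))]

def masked_mean_pool(frames, mask):
    """Average the kept tokens over time and space into one clip-level feature."""
    kept = [f[t] for f in frames for t, m in enumerate(mask) if m]
    return sum(kept) / len(kept)

# Three frames with three token positions each (hypothetical scalar features).
frames = [[1.0, 0.0, 5.0],
          [1.0, 2.0, 5.5],
          [1.0, 0.0, 5.0]]

scores = motion_scores(frames)
mask = temporal_mask(frames, keep=2)
clip_feature = masked_mean_pool(frames, mask)
```

The static first position is masked out, so the pooled feature passed to a probe reflects only the positions that move. Like the method in the abstract, this selection step adds no learnable parameters.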
[27] Forecast then Calibrate: Feature Caching as ODE for Efficient Diffusion Transformers cs.CVPDF
Shikang Zheng, Liang Feng, Xinyu Wang, Qinming Zhou, Peiliang Cai
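TL;DR: The paper proposes FoCa, which models feature caching as an ODE solving problem, significantly improving the inference efficiency of Diffusion Transformers while preserving generation quality under aggressive acceleration.
Details
Motivation: Current feature caching methods struggle to maintain generation quality at high acceleration ratios, mainly because they cannot robustly integrate historical features.
Result: Across multiple tasks (image synthesis, video generation, and super-resolution), FoCa achieves substantial speedups (up to 6.45×) with minimal quality loss.
Insight: The ODE framework gives feature caching a theoretical grounding, and the forecast-then-calibrate strategy effectively mitigates error accumulation at high acceleration ratios.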
Abstract: Diffusion Transformers (DiTs) have demonstrated exceptional performance in high-fidelity image and video generation. To reduce their substantial computational costs, feature caching techniques have been proposed to accelerate inference by reusing hidden representations from previous timesteps. However, current methods often struggle to maintain generation quality at high acceleration ratios, where prediction errors increase sharply due to the inherent instability of long-step forecasting. In this work, we adopt an ordinary differential equation (ODE) perspective on the hidden-feature sequence, modeling layer representations along the trajectory as a feature-ODE. We attribute the degradation of existing caching strategies to their inability to robustly integrate historical features under large skipping intervals. To address this, we propose FoCa (Forecast-then-Calibrate), which treats feature caching as a feature-ODE solving problem. Extensive experiments on image synthesis, video generation, and super-resolution tasks demonstrate the effectiveness of FoCa, especially under aggressive acceleration. Without additional training, FoCa achieves near-lossless speedups of 5.50 times on FLUX, 6.45 times on HunyuanVideo, 3.17 times on Inf-DiT, and maintains high quality with a 4.53 times speedup on DiT.
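The abstract gives no formulas for FoCa, so the snippet below does not reproduce it. As a loose analogy for why a forecast followed by a calibration can be more stable over large steps than a forecast alone, it compares explicit Euler with the textbook Heun predictor-corrector on a scalar ODE; the test problem dy/dt = -y and the step size are arbitrary choices.

```python
import math

def euler_step(f, t, y, h):
    """Forecast only: extrapolate with the current slope."""
    return y + h * f(t, y)

def heun_step(f, t, y, h):
    """Forecast with Euler, then calibrate using the slope at the forecast."""
    slope_now = f(t, y)
    forecast = y + h * slope_now
    slope_next = f(t + h, forecast)
    return y + 0.5 * h * (slope_now + slope_next)

def integrate(step, f, y0, h, n):
    """Apply `step` n times starting from t = 0."""
    t, y = 0.0, y0
    for _ in range(n):
        y = step(f, t, y, h)
        t += h
    return y

def decay(t, y):
    return -y

exact = math.exp(-1.0)  # true solution of dy/dt = -y at t = 1 with y(0) = 1
euler_err = abs(integrate(euler_step, decay, 1.0, 0.5, 2) - exact)
heun_err = abs(integrate(heun_step, decay, 1.0, 0.5, 2) - exact)
```

With only two large steps, the corrected scheme lands much closer to the exact value than the uncorrected forecast, which mirrors the motivation stated in the abstract about error growth under large skipping intervals.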
[28] OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models cs.CV | cs.AI | cs.LGPDF
Huanpeng Chu, Wei Wu, Guanyu Fen, Yutao Zhang
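TL;DR: OmniCache is a training-free cache reuse method that analyzes the sampling trajectories of diffusion Transformer models and optimizes the caching strategy globally, significantly improving computational efficiency while maintaining generation quality.
Details
Motivation: Real-time deployment of diffusion Transformer models is hampered by high computational cost, and existing caching methods focus only on local inter-step similarity. OmniCache optimizes cache reuse from a global sampling perspective.
Result: Experiments show that OmniCache accelerates sampling while maintaining generation quality, offering a practical solution for real-time deployment of diffusion models.
Insight: A caching strategy with a global perspective exploits computational redundancy more efficiently, and dynamic noise filtering further improves sampling efficiency and generation quality.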
Abstract: Diffusion models have emerged as a powerful paradigm for generative tasks such as image synthesis and video generation, with Transformer architectures further enhancing performance. However, the high computational cost of diffusion Transformers, stemming from a large number of sampling steps and complex per-step computations, presents significant challenges for real-time deployment. In this paper, we introduce OmniCache, a training-free acceleration method that exploits the global redundancy inherent in the denoising process. Unlike existing methods that determine caching strategies based on inter-step similarities and tend to prioritize reusing later sampling steps, our approach originates from the sampling perspective of DiT models. We systematically analyze the model’s sampling trajectories and strategically distribute cache reuse across the entire sampling process. This global perspective enables more effective utilization of cached computations throughout the diffusion trajectory, rather than concentrating reuse within limited segments of the sampling procedure. In addition, during cache reuse, we dynamically estimate the corresponding noise and filter it out to reduce its impact on the sampling direction. Extensive experiments demonstrate that our approach accelerates the sampling process while maintaining competitive generative quality, offering a promising and practical solution for efficient deployment of diffusion-based generative models.
[29] MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine cs.CVPDF
Kaiyuan Ji, Yijin Guo, Zicheng Zhang, Xiangyang Zhu, Yuan Tian
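TL;DR: This paper introduces the MedOmni-45 Degrees benchmark for evaluating the safety-performance trade-off in the reasoning of large language models (LLMs) in medicine, focusing on Chain-of-Thought faithfulness and resistance to sycophancy.
Details
Motivation: As LLMs are increasingly used in medical decision support, the reliability of their reasoning needs to be evaluated, yet existing benchmarks usually collapse these vulnerabilities into a single accuracy score.
Result: No model surpasses the diagonal; the open-source QwQ-32B strikes the best balance between safety and performance (43.81 degrees) but does not lead on both.
Insight: MedOmni-45 Degrees aims to expose reasoning vulnerabilities in medical LLMs and guide the development of safer models.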
Abstract: With the increasing use of large language models (LLMs) in medical decision-support, it is essential to evaluate not only their final answers but also the reliability of their reasoning. Two key risks are Chain-of-Thought (CoT) faithfulness – whether reasoning aligns with responses and medical facts – and sycophancy, where models follow misleading cues over correctness. Existing benchmarks often collapse such vulnerabilities into single accuracy scores. To address this, we introduce MedOmni-45 Degrees, a benchmark and workflow designed to quantify safety-performance trade-offs under manipulative hint conditions. It contains 1,804 reasoning-focused medical questions across six specialties and three task types, including 500 from MedMCQA. Each question is paired with seven manipulative hint types and a no-hint baseline, producing about 27K inputs. We evaluate seven LLMs spanning open- vs. closed-source, general-purpose vs. medical, and base vs. reasoning-enhanced models, totaling over 189K inferences. Three metrics – Accuracy, CoT-Faithfulness, and Anti-Sycophancy – are combined into a composite score visualized with a 45 Degrees plot. Results show a consistent safety-performance trade-off, with no model surpassing the diagonal. The open-source QwQ-32B performs closest (43.81 Degrees), balancing safety and accuracy but not leading in both. MedOmni-45 Degrees thus provides a focused benchmark for exposing reasoning vulnerabilities in medical LLMs and guiding safer model development.
[30] PromptFlare: Prompt-Generalized Defense via Cross-Attention Decoy in Diffusion-Based Inpainting cs.CVPDF
Hohyun Na, Seunghoo Hong, Simon S. Woo
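TL;DR: PromptFlare proposes a new adversarial protection method based on the cross-attention mechanism that injects noise to suppress the sampling process of diffusion-based inpainting models, effectively preventing malicious image modifications.
Details
Motivation: The success of diffusion models makes high-quality image modification easy, but it can also be exploited maliciously. Existing methods rely on image-level inconsistencies and cannot address the influence of textual prompts.
Result: Performs strongly on the EditBench dataset while significantly reducing computational overhead and GPU memory usage.
Insight: The cross-attention mechanism can be used for protection: injecting noise disrupts the model's attention to the prompt.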
Abstract: The success of diffusion models has enabled effortless, high-quality image modifications that precisely align with users’ intentions, thereby raising concerns about their potential misuse by malicious actors. Previous studies have attempted to mitigate such misuse through adversarial attacks. However, these approaches heavily rely on image-level inconsistencies, which pose fundamental limitations in addressing the influence of textual prompts. In this paper, we propose PromptFlare, a novel adversarial protection method designed to protect images from malicious modifications facilitated by diffusion-based inpainting models. Our approach leverages the cross-attention mechanism to exploit the intrinsic properties of prompt embeddings. Specifically, we identify and target a shared prompt token that is invariant and semantically uninformative, injecting adversarial noise to suppress the sampling process. The injected noise acts as a cross-attention decoy, diverting the model’s focus away from meaningful prompt-image alignments and thereby neutralizing the effect of the prompt. Extensive experiments on the EditBench dataset demonstrate that our method achieves state-of-the-art performance across various metrics while significantly reducing computational overhead and GPU memory usage. These findings highlight PromptFlare as a robust and efficient protection against unauthorized image manipulations. The code is available at https://github.com/NAHOHYUN-SKKU/PromptFlare.
[31] An Investigation of Visual Foundation Models Robustness cs.CV | cs.AI | cs.LGPDF
Sandeep Gupta, Roberto Passerone
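TL;DR: This paper examines the robustness requirements of visual foundation models (VFMs) in computer vision tasks, analyzes the strengths and weaknesses of existing defenses and training strategies, and discusses the challenges and methods of evaluating network robustness.
Details
Motivation: Applying VFMs in security-sensitive domains (such as biometrics and autonomous driving) requires high robustness to cope with many perturbing factors in dynamic environments (such as lighting, weather, and sensor noise). This paper studies how to improve the robustness of VFMs.
Result: The study finds that existing defense mechanisms face challenges related to network properties and component selection, and that further research and improvement are needed.
Insight: 1. VFM robustness needs to be evaluated along multiple dimensions; 2. Future research should focus on optimizing network architectures and training strategies to improve adaptability in dynamic environments.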
Abstract: Visual Foundation Models (VFMs) are becoming ubiquitous in computer vision, powering systems for diverse tasks such as object detection, image classification, segmentation, pose estimation, and motion tracking. VFMs are capitalizing on seminal innovations in deep learning models, such as LeNet-5, AlexNet, ResNet, VGGNet, InceptionNet, DenseNet, YOLO, and ViT, to deliver superior performance across a range of critical computer vision applications. These include security-sensitive domains like biometric verification, autonomous vehicle perception, and medical image analysis, where robustness is essential to fostering trust between technology and the end-users. This article investigates network robustness requirements crucial in computer vision systems to adapt effectively to dynamic environments influenced by factors such as lighting, weather conditions, and sensor characteristics. We examine the prevalent empirical defenses and robust training employed to enhance vision network robustness against real-world challenges such as distributional shifts, noisy and spatially distorted inputs, and adversarial attacks. Subsequently, we provide a comprehensive analysis of the challenges associated with these defense mechanisms, including network properties and components to guide ablation studies and benchmarking metrics to evaluate network robustness.
[32] FlexMUSE: Multimodal Unification and Semantics Enhancement Framework with Flexible interaction for Creative Writing cs.CV | cs.AIPDF
Jiahao Chen, Zhiyong Ma, Wenbiao Du, Qingyuan Chuai
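TL;DR: FlexMUSE is a multimodal unification and semantics enhancement framework for creative writing that improves the creativity and consistency of multimodal output through flexible interaction patterns and semantic alignment.
Details
Motivation: Existing multimodal generation methods usually require strict input modalities or costly training, and tend to produce semantic inconsistencies in multi-modal creative writing (MMCW).
Result: FlexMUSE shows good consistency, creativity, and coherence on multi-modal creative writing tasks.
Insight: Modality semantic alignment and enhancement are crucial to the quality of multi-modal creative writing, and flexible interaction patterns can further unlock creativity.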
Abstract: Multi-modal creative writing (MMCW) aims to produce illustrated articles. Unlike common multi-modal generative (MMG) tasks such as storytelling or caption generation, MMCW is an entirely new and more abstract challenge where textual and visual contexts are not strictly related to each other. Existing methods for related tasks can be forcibly migrated to this track, but they require specific modality inputs or costly training, and often suffer from semantic inconsistencies between modalities. Therefore, the main challenge lies in economically performing MMCW with flexible interactive patterns, where the semantics between the modalities of the output are more aligned. In this work, we propose FlexMUSE with a T2I module to enable optional visual input. FlexMUSE promotes creativity and emphasizes the unification between modalities by proposing the modality semantic alignment gating (msaGate) to restrict the textual input. Besides, an attention-based cross-modality fusion is proposed to augment the input features for semantic enhancement. The modality semantic creative direct preference optimization (mscDPO) within FlexMUSE is designed by extending the rejected samples to facilitate the writing creativity. Moreover, to advance MMCW, we release a dataset called ArtMUSE, which contains around 3k calibrated text-image pairs. FlexMUSE achieves promising results, demonstrating its consistency, creativity and coherence.
[33] UniEM-3M: A Universal Electron Micrograph Dataset for Microstructural Segmentation and Generation cs.CVPDF
Nan wang, Zhiyi Xia, Yiming Li, Shi Tang, Zuxin Fan
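TL;DR: This paper introduces UniEM-3M, the first large-scale, multimodal electron micrograph dataset for instance-level understanding, comprising 5,091 high-resolution images, about 3 million instance segmentation labels, and image-level attribute-disentangled textual descriptions. The authors also release a diffusion-based text-to-image model that serves as a data augmentation tool and a proxy for the complete data distribution.
Details
Motivation: Quantitative microstructural characterization in materials science relies on electron micrographs (EMs), but progress in deep learning for this field has been hampered by the scarcity of large-scale, diverse, expert-annotated data. This paper aims to address that gap.
Result: Experiments show that the proposed flow-based model UniEM-Net outperforms other advanced methods on the UniEM-3M benchmark.
Insight: 1) Large-scale annotated datasets are essential for deep learning applications in materials science; 2) text-to-image generative models can serve as effective data augmentation tools; 3) flow-based models perform well on instance segmentation.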
Abstract: Quantitative microstructural characterization is fundamental to materials science, where electron micrographs (EMs) provide indispensable high-resolution insights. However, progress in deep learning-based EM characterization has been hampered by the scarcity of large-scale, diverse, and expert-annotated datasets, due to acquisition costs, privacy concerns, and annotation complexity. To address this issue, we introduce UniEM-3M, the first large-scale and multimodal EM dataset for instance-level understanding. It comprises 5,091 high-resolution EMs, about 3 million instance segmentation labels, and image-level attribute-disentangled textual descriptions, a subset of which will be made publicly available. Furthermore, we are also releasing a text-to-image diffusion model trained on the entire collection to serve as both a powerful data augmentation tool and a proxy for the complete data distribution. To establish a rigorous benchmark, we evaluate various representative instance segmentation methods on the complete UniEM-3M and present UniEM-Net as a strong baseline model. Quantitative experiments demonstrate that this flow-based model outperforms other advanced methods on this challenging benchmark. Our multifaceted release of a partial dataset, a generative model, and a comprehensive benchmark, available on Hugging Face, will significantly accelerate progress in automated materials analysis.
[34] Structuring GUI Elements through Vision Language Models: Towards Action Space Generation cs.CV | cs.LGPDF
Yi Xu, Yesheng Zhang, jiajia Liu, Jingdong Chen
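TL;DR: This paper proposes an IoU-Augmented Maximum Likelihood (IAML) training paradigm to improve multimodal large language models (MLLMs) at localizing graphical user interface (GUI) elements, addressing the semantic void and exposure bias that conventional methods face when generating precise coordinates.
Details
Motivation: Multimodal large language models show great potential for structuring GUI elements, but their performance in generating precise UI element coordinates is limited, mainly because numerical coordinates lack semantics in language representation spaces and because conventional training suffers from exposure bias.
Result: Extensive experiments show that IAML training outperforms conventional training paradigms on GUI element localization.
Insight: Data augmentation and a new training paradigm can compensate for the weaknesses of MLLMs in generating numerical coordinates, offering a new solution for GUI understanding and interaction design.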
Abstract: Multimodal large language models (MLLMs) have emerged as pivotal tools in enhancing human-computer interaction. In this paper we focus on the application of MLLMs in the field of graphical user interface (GUI) elements structuring, where they assist in processing user instructions based on screen contents. Despite the promise of MLLMs, their performance in precisely generating UI element coordinates, a critical aspect of GUI understanding, is hindered by the nature of next-token prediction training. This challenge arises from the semantic void surrounding numerical UI coordinates in language representation spaces, necessitating a substantial and diverse dataset to bolster visual module capabilities. To address these limitations, we introduce an IoU-Augmented Maximum Likelihood (IAML) training paradigm. Specifically, our approach involves a novel pipeline for IoU-based coordinate sampling to augment the training data, which considers the proximity to ground truth coordinates. This data augmentation strategy is then employed to fine-tune MLLMs under the IAML paradigm, which is designed to mitigate the exposure bias problem inherent in traditional maximum likelihood estimation. Through extensive experiments, we demonstrate the superior performance of our IAML training approach over traditional training paradigms.
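IoU-based coordinate sampling, as described in the abstract, augments training data with boxes close to the ground truth. A minimal sketch of that idea follows. The jitter distribution, the IoU threshold, and the use of the IoU as a per-sample score are assumptions made for illustration; the paper's actual pipeline is not specified here.

```python
import random

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def sample_near_boxes(gt, n, jitter, min_iou, rng):
    """Draw jittered boxes around gt and keep those with IoU >= min_iou."""
    samples = []
    for _ in range(n):
        box = tuple(c + rng.uniform(-jitter, jitter) for c in gt)
        if box[2] <= box[0] or box[3] <= box[1]:
            continue  # skip degenerate boxes
        score = iou(gt, box)
        if score >= min_iou:
            samples.append((box, score))
    return samples

gt_box = (100.0, 200.0, 180.0, 240.0)  # hypothetical UI element in pixels
rng = random.Random(0)                 # fixed seed for reproducibility
augmented = sample_near_boxes(gt_box, n=50, jitter=10.0, min_iou=0.5, rng=rng)
```

Each retained pair holds a near-miss box and its proximity to the ground truth, which a training loop could use as an additional target or as a weight.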
[35] IRSAMap: Towards Large-Scale, High-Resolution Land Cover Map Vectorization cs.CVPDF
Yu Meng, Ligao Deng, Zhihao Xi, Jiansheng Chen, Jingbo Chen
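TL;DR: IRSAMap is a global remote sensing dataset for large-scale, high-resolution land cover map vectorization that addresses the limited class annotations, small data scale, and lack of spatial structural information in existing datasets.
Details
Motivation: With rising remote sensing image resolution and rapid advances in deep learning, land cover mapping is shifting from pixel-level segmentation to object-based vector modeling, and existing datasets cannot meet the need for precise object boundaries and topological consistency.
Result: IRSAMap provides a benchmark for standardized object-level land cover mapping and supports geographic information updates and digital twin construction.
Insight: The release of IRSAMap fills the gap in scale, resolution, and spatial structural information among land cover vector datasets, giving deep learning models richer resources for training and evaluation.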
Abstract: With the enhancement of remote sensing image resolution and the rapid advancement of deep learning, land cover mapping is transitioning from pixel-level segmentation to object-based vector modeling. This shift demands more from deep learning models, requiring precise object boundaries and topological consistency. However, existing datasets face three main challenges: limited class annotations, small data scale, and lack of spatial structural information. To overcome these issues, we introduce IRSAMap, the first global remote sensing dataset for large-scale, high-resolution, multi-feature land cover vector mapping. IRSAMap offers four key advantages: 1) a comprehensive vector annotation system with over 1.8 million instances of 10 typical objects (e.g., buildings, roads, rivers), ensuring semantic and spatial accuracy; 2) an intelligent annotation workflow combining manual and AI-based methods to improve efficiency and consistency; 3) global coverage across 79 regions in six continents, totaling over 1,000 km; and 4) multi-task adaptability for tasks like pixel-level classification, building outline extraction, road centerline extraction, and panoramic segmentation. IRSAMap provides a standardized benchmark for the shift from pixel-based to object-based approaches, advancing geographic feature automation and collaborative modeling. It is valuable for global geographic information updates and digital twin construction. The dataset is publicly available at https://github.com/ucas-dlg/IRSAMap
[36] Learning Long-Range Action Representation by Two-Stream Mamba Pyramid Network for Figure Skating Assessment cs.CV | cs.MMPDF
Fengshun Wang, Qiurui Wang, Peilin Zhao
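TL;DR: The paper proposes a two-stream Mamba pyramid network for figure skating scoring (TES and PCS). By separating a visual-feature TES evaluation stream from an audio-visual-feature PCS evaluation stream, it addresses three major challenges of existing methods, and it uses Mamba's ability to capture long-range dependencies to process long videos efficiently.
Details
Motivation: Existing figure skating scoring methods ignore prior knowledge of the judging criteria, do not distinguish the feature requirements of TES and PCS, do not score action elements one by one, and handle long videos inefficiently.
Result: Achieves SOTA performance on the FineFS benchmark.
Insight: 1. Distinguishing the feature requirements of TES and PCS is key; 2. Mamba models suit long-video tasks; 3. A multi-level fusion mechanism improves the effectiveness of multimodal features.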
Abstract: Technical Element Score (TES) and Program Component Score (PCS) evaluations in figure skating demand precise assessment of athletic actions and artistic interpretation, respectively. Existing methods face three major challenges. Firstly, video and audio cues are regarded as common features for both TES and PCS predictions in previous works without considering the prior evaluation criterion of figure skating. Secondly, action elements in competitions are separated in time, TES should be derived from each element’s score, but existing methods try to give an overall TES prediction without evaluating each action element. Thirdly, lengthy competition videos make it difficult and inefficient to handle long-range contexts. To address these challenges, we propose a two-stream Mamba pyramid network that aligns with actual judging criteria to predict TES and PCS by separating visual-feature based TES evaluation stream from audio-visual-feature based PCS evaluation stream. In the PCS evaluation stream, we introduce a multi-level fusion mechanism to guarantee that video-based features remain unaffected when assessing TES, and enhance PCS estimation by fusing visual and auditory cues across each contextual level of the pyramid. In the TES evaluation stream, the multi-scale Mamba pyramid and TES head we proposed effectively address the challenges of localizing and evaluating action elements with various temporal scales and give score predictions. With Mamba’s superior ability to capture long-range dependencies and its linear computational complexity, our method is ideal for handling lengthy figure skating videos. Comprehensive experimentation demonstrates that our framework attains state-of-the-art performance on the FineFS benchmark. Our source code is available at https://github.com/ycwfs/Figure-Skating-Action-Quality-Assessment.
[37] A Multimodal-Multitask Framework with Cross-modal Relation and Hierarchical Interactive Attention for Semantic Comprehension cs.CV | cs.AIPDF
Mohammad Zia Ur Rehman, Devraj Raghuvanshi, Umang Jain, Shubhi Bansal, Nagendra Kumar
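TL;DR: The paper proposes MM-ORIENT, a multimodal multitask framework that uses cross-modal relation graphs and hierarchical interactive attention to reduce the effect of noise across modalities and improve multitask performance.
Details
Motivation: A major challenge in multimodal learning is noise within individual modalities, which degrades multimodal representations, especially when modalities interact explicitly. In addition, existing multimodal fusion methods can neglect discriminative information within a single modality.
Result: Experiments on three datasets show that MM-ORIENT comprehends multimodal content effectively and performs well across multiple tasks.
Insight: By avoiding explicit interaction between modalities, MM-ORIENT reduces the effect of noise at the latent stage, while HIMA preserves the discriminative information of each modality, offering a new approach to multimodal multitask learning.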
Abstract: A major challenge in multimodal learning is the presence of noise within individual modalities. This noise inherently affects the resulting multimodal representations, especially when these representations are obtained through explicit interactions between different modalities. Moreover, multimodal fusion techniques, while aiming to achieve a strong joint representation, can neglect valuable discriminative information within the individual modalities. To this end, we propose a Multimodal-Multitask framework with crOss-modal Relation and hIErarchical iNteractive aTtention (MM-ORIENT) that is effective for multiple tasks. The proposed approach acquires multimodal representations cross-modally without explicit interaction between different modalities, reducing the noise effect at the latent stage. To achieve this, we propose cross-modal relation graphs that reconstruct monomodal features to acquire multimodal representations. The features are reconstructed based on the node neighborhood, where the neighborhood is decided by the features of a different modality. We also propose Hierarchical Interactive Monomodal Attention (HIMA) to focus on pertinent information within a modality. While cross-modal relation graphs help comprehend high-order relationships between two modalities, HIMA helps in multitasking by learning discriminative features of individual modalities before late-fusing them. Finally, extensive experimental evaluation on three datasets demonstrates that the proposed approach effectively comprehends multimodal content for multiple tasks.
[38] Exploiting Information Redundancy in Attention Maps for Extreme Quantization of Vision Transformers cs.CV | cs.AI | cs.IT | math.ITPDF
Lucas Maisonnave, Karim Haroun, Tom Pegeot
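TL;DR: This paper proposes Entropy Attention Maps (EAM), a method that exploits information redundancy in attention maps by quantizing low-entropy attention heads, reducing computational complexity and memory requirements while preserving model performance.
Details
Motivation: The Multi-Head Self-Attention (MHSA) mechanism in Transformers is computationally expensive and memory-hungry, which limits deployment on edge devices. The authors find that low-entropy attention heads contribute less information and propose targeted compression strategies.
Result: On DeiT and Swin Transformer models, EAM achieves accuracy similar to or higher than the original models at ≤20% sparsity in attention maps.
Insight: Information redundancy in attention heads can be identified through entropy analysis, and selectively compressing the low-entropy parts is an effective way to accelerate models.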
Abstract: Transformer models rely on Multi-Head Self-Attention (MHSA) mechanisms, where each attention head contributes to the final representation. However, their computational complexity and high memory demands due to MHSA hinder their deployment at the edge. In this work, we analyze and exploit information redundancy in attention maps to accelerate model inference. By quantifying the information captured by each attention head using Shannon entropy, our analysis reveals that attention heads with lower entropy, i.e., exhibiting more deterministic behavior, tend to contribute less information, motivating targeted compression strategies. Relying on these insights, we propose Entropy Attention Maps (EAM), a model that freezes the weights of low-entropy attention maps and quantizes these values to low precision to avoid redundant re-computation. Empirical validation on ImageNet-1k shows that EAM achieves similar or higher accuracy at $\leq$20% sparsity in attention maps and competitive performance beyond this level for the DeiT and Swin Transformer models.
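The entropy ranking described in this abstract can be illustrated with a minimal sketch. It assumes each head's attention map is a list of rows that each sum to 1, that a head is scored by its mean row entropy, and that a hypothetical `fraction` parameter sets how many heads to select; the abstract does not specify how EAM aggregates entropy or chooses its threshold, so these choices are illustrative only.

```python
import math


def row_entropy(probs):
    """Shannon entropy (in bits) of one attention row, treated as a distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def head_entropy(attention_map):
    """Mean row entropy of one head's attention map (assumed aggregation)."""
    return sum(row_entropy(row) for row in attention_map) / len(attention_map)


def low_entropy_heads(attention_maps, fraction):
    """Indices of the lowest-entropy heads; `fraction` is a hypothetical budget."""
    scores = [head_entropy(m) for m in attention_maps]
    k = int(len(scores) * fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:k])


# Toy heads over two tokens: one nearly deterministic, one uniform.
peaked = [[1.0, 0.0], [0.0, 1.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
candidates = low_entropy_heads([uniform, peaked], fraction=0.5)
```

In this toy case the deterministic head has zero entropy and is the one selected as a candidate for freezing and low-precision storage.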
[39] Vision encoders should be image size agnostic and task driven cs.CVPDF
Nedyalko Prisadnikov, Danda Pani Paudel, Yuqian Fu, Luc Van Gool
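TL;DR: This position paper argues that next-generation vision encoders should be image size agnostic and task driven, taking inspiration from the efficiency of biological vision and letting the task determine computational complexity.
Details
Motivation: Modern vision encoders typically tie computational complexity to image size, whereas biological vision systems allocate computation dynamically according to the task to stay efficient. The paper aims to close this gap.
Result: Preliminary experiments indicate that the approach is feasible and show promise on image classification.
Insight: Vision encoders can become more efficient by imitating the task-driven behavior of biological systems; future research should pay more attention to dynamic allocation of computation.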
Abstract: This position paper argues that the next generation of vision encoders should be image size agnostic and task driven. The source of our inspiration is biological. Not a structural aspect of biological vision, but a behavioral trait – efficiency. We focus on a couple of ways in which vision in nature is efficient, but modern vision encoders are not. We – humans and animals – deal with vast quantities of visual data, and need to be smart about where we focus our limited energy – it depends on the task. It is our belief that vision encoders should be dynamic and the computational complexity should depend on the task at hand rather than the size of the image. We also provide concrete first steps towards our vision – a proof-of-concept solution for image classification. Despite classification not being very representative of what we are trying to achieve, it shows that our approach is feasible and promising.
[40] Attention Mechanism in Randomized Time Warping cs.CVPDF
Yutaro Hiraoka, Kazuya Okamura, Kota Suto, Kazuhiro Fukui
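TL;DR: The paper reveals an essential connection between RTW and the self-attention mechanism and shows experimentally that RTW outperforms Transformers on motion recognition.
Details
Motivation: To examine the similarity between RTW and self-attention and to analyze their performance difference on motion recognition.
Result: RTW contribution weights and self-attention weights have an average correlation of 0.80; RTW outperforms the Transformer by 5% on the Something-Something V2 dataset.
Insight: RTW attention, which operates on the entire input sequence, may have an advantage over self-attention restricted to a local view.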
Abstract: This paper reveals that we can interpret the fundamental function of Randomized Time Warping (RTW) as a type of self-attention mechanism, a core technology of Transformers in motion recognition. The self-attention is a mechanism that enables models to identify and weigh the importance of different parts of an input sequential pattern. On the other hand, RTW is a general extension of Dynamic Time Warping (DTW), a technique commonly used for matching and comparing sequential patterns. In essence, RTW searches for optimal contribution weights for each element of the input sequential patterns to produce discriminative features. Although the two approaches look different, these contribution weights can be interpreted as self-attention weights. In fact, the two weight patterns look similar, producing a high average correlation of 0.80 across the ten smallest canonical angles. However, they work in different ways: RTW attention operates on an entire input sequential pattern, while self-attention focuses on only a local view which is a subset of the input sequential pattern because of the computational costs of the self-attention matrix. This targeting difference leads to an advantage of RTW against Transformer, as demonstrated by the 5% performance improvement on the Something-Something V2 dataset.
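For readers unfamiliar with the baseline this abstract builds on, here is a minimal sketch of classic Dynamic Time Warping (DTW) on 1-D sequences. It is not RTW: the randomized extension and its contribution weights are not specified in the abstract, and the absolute-difference local cost is an assumption chosen for illustration.

```python
def dtw_distance(a, b):
    """Classic DTW alignment cost between two 1-D sequences.

    Uses absolute difference as the local cost (an illustrative choice).
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] repeats against b[j-1]
                cost[i][j - 1],      # b[j-1] repeats against a[i-1]
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


# A sequence and a time-stretched copy align with zero cost.
stretched_cost = dtw_distance([1, 2, 3], [1, 2, 2, 3])
```

The stretched copy aligns perfectly because DTW lets one element match several elements of the other sequence.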
[41] A Lightweight Group Multiscale Bidirectional Interactive Network for Real-Time Steel Surface Defect Detection cs.CV | cs.AIPDF
Yong Zhang, Cunjian Chen, Qiang Gao, Yi Wang, Bin Fang
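TL;DR: Proposes GMBINet, a lightweight method for real-time steel surface defect detection whose novel modules improve multiscale feature extraction and interaction, delivering real-time speed with competitive accuracy.
Details
Motivation: Steel manufacturing urgently needs real-time defect detection, but existing deep learning methods have high computational complexity and slow inference, making them hard to deploy in resource-constrained industrial environments.
Result: On the SD-Saliency-900 and NRSD-MN datasets, GMBINet reaches 1048 FPS on GPU and 16.53 FPS on CPU at 512 resolution with only 0.19 M parameters, while keeping accuracy competitive.
Insight: Lightweight design and efficient feature interaction are key to real-time detection in industrial settings; a group-wise strategy and parameter-free operations balance speed and performance.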
Abstract: Real-time surface defect detection is critical for maintaining product quality and production efficiency in the steel manufacturing industry. Despite promising accuracy, existing deep learning methods often suffer from high computational complexity and slow inference speeds, which limit their deployment in resource-constrained industrial environments. Recent lightweight approaches adopt multibranch architectures based on depthwise separable convolution (DSConv) to capture multiscale contextual information. However, these methods often suffer from increased computational overhead and lack effective cross-scale feature interaction, limiting their ability to fully leverage multiscale representations. To address these challenges, we propose GMBINet, a lightweight framework that enhances multiscale feature extraction and interaction through novel Group Multiscale Bidirectional Interactive (GMBI) modules. The GMBI adopts a group-wise strategy for multiscale feature extraction, ensuring scale-agnostic computational complexity. It further integrates a Bidirectional Progressive Feature Interactor (BPFI) and a parameter-free Element-Wise Multiplication-Summation (EWMS) operation to enhance cross-scale interaction without introducing additional computational overhead. Experiments on SD-Saliency-900 and NRSD-MN datasets demonstrate that GMBINet delivers competitive accuracy with real-time speeds of 1048 FPS on GPU and 16.53 FPS on CPU at 512 resolution, using only 0.19 M parameters. Additional evaluations on the NEU-CLS defect classification dataset further confirm the strong generalization ability of our method, demonstrating its potential for broader industrial vision applications beyond surface defect detection. The dataset and code are publicly available at: https://github.com/zhangyongcode/GMBINet.
[42] SAMFusion: Sensor-Adaptive Multimodal Fusion for 3D Object Detection in Adverse Weather cs.CVPDF
Edoardo Palladin, Roland Dietze, Praveen Narayanan, Mario Bijelic, Felix Heide
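TL;DR: SAMFusion is a multimodal sensor fusion method for adverse weather that combines RGB, LiDAR, NIR gated camera, and radar data; using depth-aware attention and learned refinement on the BEV plane, it substantially improves object detection for autonomous driving in extreme weather.
Details
Motivation: Existing multimodal fusion methods perform poorly in adverse weather, causing autonomous driving systems to fail in conditions such as heavy fog, snow, or sensor soiling.
Result: Improves average precision for vulnerable pedestrians by 17.2 AP over the next best method at long distances and in challenging foggy scenes.
Insight: Multimodal fusion in adverse weather needs to adjust sensor weights dynamically and to include additional sensor types (such as NIR and radar) to improve robustness.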
Abstract: Multimodal sensor fusion is an essential capability for autonomous robots, enabling object detection and decision-making in the presence of failing or uncertain inputs. While recent fusion methods excel in normal environmental conditions, these approaches fail in adverse weather, e.g., heavy fog, snow, or obstructions due to soiling. We introduce a novel multi-sensor fusion approach tailored to adverse weather conditions. In addition to fusing RGB and LiDAR sensors, which are employed in recent autonomous driving literature, our sensor fusion stack is also capable of learning from NIR gated camera and radar modalities to tackle low light and inclement weather. We fuse multimodal sensor data through attentive, depth-based blending schemes, with learned refinement on the Bird’s Eye View (BEV) plane to combine image and range features effectively. Our detections are predicted by a transformer decoder that weighs modalities based on distance and visibility. We demonstrate that our method improves the reliability of multimodal sensor fusion in autonomous vehicles under challenging weather conditions, bridging the gap between ideal conditions and real-world edge cases. Our approach improves average precision by 17.2 AP compared to the next best method for vulnerable pedestrians in long distances and challenging foggy scenes. Our project page is available at https://light.princeton.edu/samfusion/
[43] HAMSt3R: Human-Aware Multi-view Stereo 3D Reconstruction cs.CVPDF
Sara Rojas, Matthieu Armando, Bernard Ghanem, Philippe Weinzaepfel, Vincent Leroy
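TL;DR: HAMSt3R is a learning-based multi-view stereo 3D reconstruction method for joint human and scene reconstruction. It combines scene geometry with human understanding and adds network heads that segment people, estimate dense correspondences, and predict depth, producing a dense 3D point map enriched with human semantic information. The method is efficient and fully feed-forward, making it suitable for real-world applications.
Details
Motivation: Existing learning-based multi-view stereo methods such as DUSt3R and MASt3R are trained primarily on static outdoor scenes and struggle with human-centric scenarios. HAMSt3R aims to fill this gap with efficient joint human and scene reconstruction.
Result: Performs strongly on the challenging EgoHumans and EgoExo4D benchmarks and generalizes to traditional multi-view stereo and multi-view pose regression tasks.
Insight: By combining scene and human understanding, HAMSt3R achieves efficient joint human-scene reconstruction and offers a new approach to integrating human semantics with scene geometry in 3D vision.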
Abstract: Recovering the 3D geometry of a scene from a sparse set of uncalibrated images is a long-standing problem in computer vision. While recent learning-based approaches such as DUSt3R and MASt3R have demonstrated impressive results by directly predicting dense scene geometry, they are primarily trained on outdoor scenes with static environments and struggle to handle human-centric scenarios. In this work, we introduce HAMSt3R, an extension of MASt3R for joint human and scene 3D reconstruction from sparse, uncalibrated multi-view images. First, we exploit DUNE, a strong image encoder obtained by distilling, among others, the encoders from MASt3R and from a state-of-the-art Human Mesh Recovery (HMR) model, multi-HMR, for a better understanding of scene geometry and human bodies. Our method then incorporates additional network heads to segment people, estimate dense correspondences via DensePose, and predict depth in human-centric environments, enabling a more comprehensive 3D reconstruction. By leveraging the outputs of our different heads, HAMSt3R produces a dense point map enriched with human semantic information in 3D. Unlike existing methods that rely on complex optimization pipelines, our approach is fully feed-forward and efficient, making it suitable for real-world applications. We evaluate our model on EgoHumans and EgoExo4D, two challenging benchmarks containing diverse human-centric scenarios. Additionally, we validate its generalization to traditional multi-view stereo and multi-view pose regression tasks. Our results demonstrate that our method can reconstruct humans effectively while preserving strong performance in general 3D reconstruction tasks, bridging the gap between human and scene understanding in 3D vision.
[44] HOSt3R: Keypoint-free Hand-Object 3D Reconstruction from RGB images cs.CV | cs.AI | cs.HC | cs.LG | cs.ROPDF
Anilkumar Swamy, Vincent Leroy, Philippe Weinzaepfel, Jean-Sébastien Franco, Grégory Rogez
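TL;DR: HOSt3R is a keypoint-detector-free method for hand-object 3D reconstruction that estimates hand-object 3D transformations from monocular motion video and combines them with multi-view reconstruction to recover shape, performing strongly on the SHOWMe and HO3D datasets.
Details
Motivation: Existing methods rely on keypoint detection techniques (such as SfM), which are sensitive to diverse object geometries, weak textures, and occlusions, limiting scalability and generalization. HOSt3R aims to address these problems.
Result: Reaches state-of-the-art performance on the SHOWMe benchmark and generalizes to unseen object categories on sequences from the HO3D dataset.
Insight: Keypoint-detector-free methods handle geometric diversity and occlusion better and offer a more general solution for hand-object 3D reconstruction.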
Abstract: Hand-object 3D reconstruction has become increasingly important for applications in human-robot interaction and immersive AR/VR experiences. A common approach for object-agnostic hand-object reconstruction from RGB sequences involves a two-stage pipeline: hand-object 3D tracking followed by multi-view 3D reconstruction. However, existing methods rely on keypoint detection techniques, such as Structure from Motion (SfM) and hand-keypoint optimization, which struggle with diverse object geometries, weak textures, and mutual hand-object occlusions, limiting scalability and generalization. As a key enabler to generic and seamless, non-intrusive applicability, we propose in this work a robust, keypoint detector-free approach to estimating hand-object 3D transformations from monocular motion video/images. We further integrate this with a multi-view reconstruction pipeline to accurately recover hand-object 3D shape. Our method, named HOSt3R, is unconstrained, does not rely on pre-scanned object templates or camera intrinsics, and reaches state-of-the-art performance for the tasks of object-agnostic hand-object 3D transformation and shape estimation on the SHOWMe benchmark. We also experiment on sequences from the HO3D dataset, demonstrating generalization to unseen object categories.
[45] Arbitrary-Scale 3D Gaussian Super-Resolution cs.CVPDF
Huimin Zeng, Yue Bai, Yun Fu
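TL;DR: Proposes a framework for arbitrary-scale 3D Gaussian super-resolution, addressing the fixed-scale limitation of existing methods while avoiding the complexity and reduced rendering efficiency of a post-processing upsampler.
Details
Motivation: Existing 3D Gaussian Splatting (3DGS) super-resolution methods only support high-resolution (HR) rendering at fixed scale factors, which limits their practicality in resource-limited scenarios. Rendering arbitrary-scale HR views directly with vanilla 3DGS produces aliasing artifacts because it lacks scale-aware rendering, and adding a post-processing upsampler complicates the framework and reduces efficiency.
Result: Experiments show a 6.59 dB PSNR gain over vanilla 3DGS when rendering arbitrary-scale HR views, while maintaining real-time rendering speed (85 FPS at 1080p).
Insight: Scale-aware rendering removes the need for post-processing and makes the framework more flexible and efficient, while the generative prior and progressive optimization yield high-quality, structurally consistent super-resolution results.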
Abstract: Existing 3D Gaussian Splatting (3DGS) super-resolution methods typically perform high-resolution (HR) rendering of fixed scale factors, making them impractical for resource-limited scenarios. Directly rendering arbitrary-scale HR views with vanilla 3DGS introduces aliasing artifacts due to the lack of scale-aware rendering ability, while adding a post-processing upsampler for 3DGS complicates the framework and reduces rendering efficiency. To tackle these issues, we build an integrated framework that incorporates scale-aware rendering, generative prior-guided optimization, and progressive super-resolving to enable 3D Gaussian super-resolution of arbitrary scale factors with a single 3D model. Notably, our approach supports both integer and non-integer scale rendering to provide more flexibility. Extensive experiments demonstrate the effectiveness of our model in rendering high-quality arbitrary-scale HR views (6.59 dB PSNR gain over 3DGS) with a single model. It preserves structural consistency with LR views and across different scales, while maintaining real-time rendering speed (85 FPS at 1080p).
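To put the reported PSNR gain in context, the sketch below computes PSNR with the standard definition, 10·log10(MAX² / MSE). It assumes images are flattened into equal-length lists of pixel values with a peak of 255; how the paper computes and averages PSNR across views is not stated in the abstract.

```python
import math


def psnr(reference, rendered, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(reference) != len(rendered):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - x) ** 2 for r, x in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Worst case for 8-bit data: every pixel is off by the full range.
worst = psnr([0, 0], [255, 255])
```

Because PSNR is logarithmic in MSE, for a single image a gain of 6.59 dB corresponds to reducing MSE by a factor of about 10^0.659 ≈ 4.6.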
[46] Seeing Clearly, Forgetting Deeply: Revisiting Fine-Tuned Video Generators for Driving Simulation cs.CVPDF
Chun-Peng Chang, Chen-Yu Wang, Julian Schmidt, Holger Caesar, Alain Pagani
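TL;DR: This paper studies fine-tuned video generation models for driving simulation, finds that gains in visual fidelity can come at the cost of spatial accuracy for dynamic elements, and shows that simple continual learning strategies can balance the two.
Details
Motivation: Recent video generation techniques have made substantial progress in visual quality and temporal coherence, but when applied to domains such as driving simulation, fine-tuning may degrade the accuracy of dynamic modeling. The paper investigates this phenomenon and its causes.
Result: Experiments show that continual learning strategies preserve spatial accuracy for dynamic elements while maintaining strong visual quality.
Insight: Visual quality and dynamic modeling are highly correlated in datasets with diverse scenes, but in highly regular driving scenes fine-tuning can push the model toward surface-level realism over dynamic accuracy. Continual learning offers an effective way to balance them.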
Abstract: Recent advancements in video generation have substantially improved visual quality and temporal coherence, making these models increasingly appealing for applications such as autonomous driving, particularly in the context of driving simulation and so-called “world models”. In this work, we investigate the effects of existing fine-tuning video generation approaches on structured driving datasets and uncover a potential trade-off: although visual fidelity improves, spatial accuracy in modeling dynamic elements may degrade. We attribute this degradation to a shift in the alignment between visual quality and dynamic understanding objectives. In datasets with diverse scene structures within temporal space, where objects or perspective shift in varied ways, these objectives tend to be highly correlated. However, the very regular and repetitive nature of driving scenes allows visual quality to improve by modeling dominant scene motion patterns, without necessarily preserving fine-grained dynamic behavior. As a result, fine-tuning encourages the model to prioritize surface-level realism over dynamic accuracy. To further examine this phenomenon, we show that simple continual learning strategies, such as replay from diverse domains, can offer a balanced alternative by preserving spatial accuracy while maintaining strong visual quality.
[47] Towards Open World Detection: A Survey cs.CV | cs.AI | 68T45 | A.1; I.2; I.4PDF
Andrei-Stefan Bulzan, Cosmin Cernazanu-Glavan
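TL;DR: The paper proposes the term “Open World Detection” (OWD) to unify class-agnostic, generally applicable detection models in the vision domain. It reviews the history, key concepts, and methods of vision subdomains and traces the convergence of tasks from early saliency detection to modern open world detection.
Details
Motivation: Early computer vision research focused on narrow tasks, but as the field progressed, increasingly complex perception tasks emerged. The paper explores how these tasks may converge, to advance more general detection models.
Result: The paper shows the potential of open world detection as a general perception task and points to more unified visual perception models as a future research direction.
Insight: Vision subtasks are gradually converging and may eventually form a single perception domain; open world detection is a key step in this trend.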
Abstract: For decades, Computer Vision has aimed at enabling machines to perceive the external world. Initial limitations led to the development of highly specialized niches. As success in each task accrued and research progressed, increasingly complex perception tasks emerged. This survey charts the convergence of these tasks and, in doing so, introduces Open World Detection (OWD), an umbrella term we propose to unify class-agnostic and generally applicable detection models in the vision domain. We start from the history of foundational vision subdomains and cover key concepts, methodologies and datasets making up today’s state-of-the-art landscape. This traverses topics starting from early saliency detection, foreground/background separation, out of distribution detection and leading up to open world object detection, zero-shot detection and Vision Large Language Models (VLLMs). We explore the overlap between these subdomains, their increasing convergence, and their potential to unify into a singular domain in the future, perception.
[48] MV-RAG: Retrieval Augmented Multiview Diffusion cs.CV | cs.AIPDF
Yosef Dayani, Omer Benishu, Sagie Benaim
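TL;DR: MV-RAG is a retrieval-augmented multiview diffusion model for generating high-quality, consistent, and accurate 3D content, particularly for out-of-domain (OOD) or rare concepts.
Details
Motivation: Existing text-to-3D methods rely on pretrained 2D diffusion priors and perform poorly on OOD or rare concepts. MV-RAG addresses this by retrieving relevant 2D images and conditioning a multiview diffusion model on them.
Result: Experiments show that MV-RAG significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts over existing methods while remaining competitive on standard benchmarks.
Insight: Combining retrieval with multiview diffusion improves 3D generation quality, especially for complex or rare concepts.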
Abstract: Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail to produce out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To this end, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a distinctive held-out view prediction objective: the model predicts the held-out view from the other views to infer 3D consistency from 2D data. To facilitate a rigorous OOD evaluation, we introduce a new collection of challenging OOD prompts. Experiments against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our approach significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts, while maintaining competitive performance on standard benchmarks.
[49] Interpreting the linear structure of vision-language model embedding spaces cs.CV | cs.CL | cs.MMPDF
Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Sham Kakade, Stephanie Gil
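TL;DR: The paper uses sparse autoencoders (SAEs) to analyze the linear structure of vision-language model embedding spaces and finds sparse concepts that bridge cross-modal semantics.
Details
Motivation: To study how vision-language models organize language and images in a joint embedding space and how they encode meaning and modality.
Result: SAEs reconstruct the embeddings well while retaining sparsity; the Bridge Score reveals semantic associations between cross-modal concept pairs.
Insight: The linear structure of the embedding space is shaped by modality, yet cross-modal semantics are integrated through latent bridges.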
Abstract: Vision-language models encode images and text in a joint space, minimizing the distance between corresponding image and text pairs. How are language and images organized in this joint space, and how do the models encode meaning and modality? To investigate this, we train and release sparse autoencoders (SAEs) on the embedding spaces of four vision-language models (CLIP, SigLIP, SigLIP2, and AIMv2). SAEs approximate model embeddings as sparse linear combinations of learned directions, or “concepts”. We find that, compared to other methods of linear feature learning, SAEs are better at reconstructing the real embeddings, while also able to retain the most sparsity. Retraining SAEs with different seeds or different data diet leads to two findings: the rare, specific concepts captured by the SAEs are liable to change drastically, but we also show that commonly-activating concepts are remarkably stable across runs. Interestingly, while most concepts activate primarily for one modality, we find they are not merely encoding modality per se. Many are almost orthogonal to the subspace that defines modality, and the concept directions do not function as good modality classifiers, suggesting that they encode cross-modal semantics. To quantify this bridging behavior, we introduce the Bridge Score, a metric that identifies concept pairs which are both co-activated across aligned image-text inputs and geometrically aligned in the shared space. This reveals that even single-modality concepts can collaborate to support cross-modal integration. We release interactive demos of the SAEs for all models, allowing researchers to explore the organization of the concept spaces. Overall, our findings uncover a sparse linear structure within VLM embedding spaces that is shaped by modality, yet stitched together through latent bridges, offering new insight into how multimodal meaning is constructed.
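The phrase "sparse linear combinations of learned directions" can be made concrete with a toy sketch. It assumes a plain ReLU encoder and a linear decoder with hand-picked weights; the SAE variant, sparsity mechanism, and training procedure actually used for CLIP, SigLIP, SigLIP2, and AIMv2 are not described in the abstract, and all names below are hypothetical.

```python
def encode(x, enc_weights, enc_bias):
    """Concept activations: ReLU(W x + b), so every activation is nonnegative."""
    pre = [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(enc_weights, enc_bias)
    ]
    return [max(0.0, p) for p in pre]


def decode(acts, directions):
    """Reconstruction: sum of concept directions weighted by their activations."""
    dim = len(directions[0])
    return [sum(a * d[j] for a, d in zip(acts, directions)) for j in range(dim)]


# Three hand-picked concept directions in a 2-D embedding space (illustrative only).
directions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
enc_weights = directions          # tied weights, an assumption for simplicity
enc_bias = [0.0, 0.0, 0.0]

x = [2.0, 0.0]                    # toy "embedding"
acts = encode(x, enc_weights, enc_bias)
x_hat = decode(acts, directions)
active = sum(1 for a in acts if a > 0.0)
```

Here only one of the three concepts is active, yet the embedding is reconstructed exactly, which is the sparse-reconstruction trade-off the abstract compares across methods.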
cs.CL [Back]
[50] KG-o1: Enhancing Multi-hop Question Answering in Large Language Models via Knowledge Graph Integration cs.CL | cs.AIPDF
Nan Wang, Yongqi Fan, yansha zhu, ZongYu Wang, Xuezhi Cao
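TL;DR: KG-o1 integrates knowledge graphs (KGs) to strengthen the reasoning of large language models (LLMs) on multi-hop question answering, using a four-stage approach that outperforms existing models.
Details
Motivation: LLMs perform poorly on knowledge-intensive tasks such as multi-hop question answering because the chains of thought they generate deviate from real reasoning paths, whereas KGs explicitly represent logical connections. KG-o1 is proposed to close this gap.
Result: Experiments on simple and complex datasets show that KG-o1 models outperform existing large reasoning models (LRMs) on all tasks.
Insight: Combining the explicit logical representation of KGs with the reasoning ability of LLMs addresses multi-hop reasoning effectively, and a self-improving corpus improves performance further.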
Abstract: Large Language Models (LLMs) face challenges in knowledge-intensive reasoning tasks like classic multi-hop question answering, which involves reasoning across multiple facts. This difficulty arises because the chains of thought (CoTs) generated by LLMs in such tasks often deviate from real or a priori reasoning paths. In contrast, knowledge graphs (KGs) explicitly represent the logical connections between facts through entities and relationships. This reflects a significant gap. Meanwhile, large reasoning models (LRMs), such as o1, have demonstrated that long-step reasoning significantly enhances the performance of LLMs. Building on these insights, we propose KG-o1, a four-stage approach that integrates KGs to enhance the multi-hop reasoning abilities of LLMs. We first filter out initial entities and generate complex subgraphs. Secondly, we construct logical paths for subgraphs and then use knowledge graphs to build a dataset with a complex and extended brainstorming process, which trains LLMs to imitate long-term reasoning. Finally, we employ rejection sampling to generate a self-improving corpus for direct preference optimization (DPO), further refining the LLMs’ reasoning abilities. We conducted experiments on two simple and two complex datasets. The results show that KG-o1 models exhibit superior performance across all tasks compared to existing LRMs.
[51] Bhav-Net: Knowledge Transfer for Cross-Lingual Antonym vs Synonym Distinction via Dual-Space Graph Transformers cs.CLPDF
Samyak S. Sanghvi
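TL;DR: Bhav-Net proposes a dual-space architecture that combines language-specific BERT encoders with graph transformer networks to transfer knowledge across languages and distinguish antonyms from synonyms effectively.
Details
Motivation: Cross-lingual antonym vs synonym distinction is challenging because antonyms share a semantic domain while expressing opposite meanings. Bhav-Net addresses this by transferring knowledge from complex multilingual models to language-specific architectures.
Result: Bhav-Net performs well across eight languages, is competitive with state-of-the-art baselines, and provides interpretable semantic representations and cross-lingual generalization.
Insight: The dual-space design captures the distinct semantic relations of antonyms and synonyms while supporting cross-lingual knowledge transfer.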
Abstract: Antonym vs synonym distinction across multiple languages presents unique computational challenges due to the paradoxical nature of antonymous relationships: words that share semantic domains while expressing opposite meanings. This work introduces Bhav-Net, a novel dual-space architecture that enables effective knowledge transfer from complex multilingual models to simpler, language-specific architectures while maintaining robust cross-lingual antonym–synonym distinction capabilities. Our approach combines language-specific BERT encoders with graph transformer networks, creating distinct semantic projections where synonymous pairs cluster in one space while antonymous pairs exhibit high similarity in a complementary space. Through comprehensive evaluation across eight languages (English, German, French, Spanish, Italian, Portuguese, Dutch, and Russian), we demonstrate that semantic relationship modeling transfers effectively across languages. The dual-encoder design achieves competitive performance against state-of-the-art baselines while providing interpretable semantic representations and effective cross-lingual generalization.
[52] Format as a Prior: Quantifying and Analyzing Bias in LLMs for Heterogeneous Data cs.CL | cs.LGPDF
Jiacheng Liu, Mayi Xu, Qiankun Pi, Wenli Li, Ming Zhong
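TL;DR: This paper is the first to study format bias in large language models (LLMs) processing heterogeneous data. A three-stage empirical study analyzes the systematic nature of the bias, the data-level factors behind it, and its internal mechanisms, and proposes future research directions for reducing it.
Details
Motivation: As LLMs are increasingly used to process data in heterogeneous formats (text, tables, knowledge graphs, and so on), format bias may lead to unfair data integration and reasoning errors, yet its systematic nature and internal mechanisms remain unclear.
Result: The study finds that format bias is widespread in LLMs and that factors such as information richness and structure quality significantly affect its strength. Attention analysis reveals the internal mechanism of the bias, and a lightweight intervention (such as attention re-weighting) shows potential for mitigation.
Insight: Format bias is not only a data preprocessing problem but is also closely tied to the model's internal mechanisms. Future work can reduce it through format normalization, inference-time interventions, and format-balanced training data, improving the fairness and robustness of heterogeneous data processing.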
Abstract: Large Language Models (LLMs) are increasingly employed in applications that require processing information from heterogeneous formats, including text, tables, infoboxes, and knowledge graphs. However, systematic biases toward particular formats may undermine LLMs’ ability to integrate heterogeneous data impartially, potentially resulting in reasoning errors and increased risks in downstream tasks. Despite these concerns, it remains uncertain whether such format biases are systematic, which data-level factors contribute to them, and what internal mechanisms in LLMs underlie their emergence. In this paper, we make the first attempt to investigate and analyze the format bias in LLMs. To systematically investigate the aforementioned questions, we conduct a three-stage empirical study by constructing a heterogeneous data conflict scenario for the exploration of bias. The first stage explores the presence and direction of bias across a diverse range of LLMs. The second stage aims to examine how key data-level factors, including information richness, structure quality, and format type, influence these biases. The third stage analyzes how format bias emerges within LLMs’ attention patterns and evaluates a lightweight intervention to test its potential mitigability. Based on these investigations, we identify three future research directions to reduce format bias: improving data preprocessing through format sanitization and normalization, introducing inference-time interventions such as attention re-weighting, and developing format-balanced training corpora. These directions will support the design of more robust and fair heterogeneous data processing systems.
[53] Benchmarking the Legal Reasoning of LLMs in Arabic Islamic Inheritance Cases cs.CL | cs.AI | cs.CY | cs.LGPDF
Nouar AlDahoul, Yasir Zaki
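TL;DR: This study evaluates the reasoning ability of large language models (LLMs) on Islamic inheritance law cases and proposes a majority voting solution that outperforms the individual models evaluated.
Details
Motivation: Calculating shares under Islamic inheritance law is complex and error-prone, which raises the question of whether LLMs can assist with such legal reasoning tasks.
Result: The majority voting solution achieves up to 92.7% accuracy in Task 1 of the QIAS 2025 challenge, ranking third.
Insight: LLMs perform well on Islamic legal reasoning, but performance depends on model choice and the ensembling method.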
Abstract: The Islamic inheritance domain holds significant importance for Muslims in ensuring fair distribution of shares among heirs. Manual calculation of shares under numerous scenarios is complex, time-consuming, and error-prone. Recent advancements in Large Language Models (LLMs) have sparked interest in their potential to assist with complex legal reasoning tasks. This study evaluates the reasoning capabilities of state-of-the-art LLMs to interpret and apply Islamic inheritance laws. We utilized the dataset proposed in the ArabicNLP QIAS 2025 challenge, which includes inheritance case scenarios given in Arabic and derived from Islamic legal sources. Various base and fine-tuned models are assessed on their ability to accurately identify heirs, compute shares, and justify their reasoning in alignment with Islamic legal principles. Our analysis reveals that the proposed majority voting solution, leveraging three base models (Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3), outperforms all other models that we utilized across every difficulty level. It achieves up to 92.7% accuracy and secures third place overall in Task 1 of the QIAS 2025 challenge.
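The majority voting idea in this abstract can be sketched in a few lines. The sketch assumes each model returns a single answer label per question and that ties fall back to the earliest-listed model's answer; the abstract does not say how the authors break ties or order the models, so both are assumptions.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among model outputs.

    Ties are resolved in favor of the earliest-listed answer (an assumption).
    """
    counts = Counter(answers)
    top = max(counts.values())
    for answer in answers:
        if counts[answer] == top:
            return answer


# Hypothetical per-question outputs from three models, in a fixed order.
agreed = majority_vote(["A", "B", "B"])
split = majority_vote(["C", "A", "B"])
```

With three voters, a two-to-one split decides the answer; a three-way split returns the first model's answer under the tie rule assumed here.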
[54] Benchmarking the Medical Understanding and Reasoning of Large Language Models in Arabic Healthcare Tasks cs.CL | cs.AI | cs.LGPDF
Nouar AlDahoul, Yasir Zaki
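TL;DR: This paper evaluates the understanding and reasoning of large language models (LLMs) on Arabic healthcare tasks using multiple-choice and open-ended questions, and proposes a majority voting method that improves accuracy.
Details
Motivation: To explore how LLMs perform in the Arabic medical domain, fill a gap in current research, and assess their practical potential in clinical settings.
Result: (1) The majority voting method reaches up to 77% accuracy on multiple-choice questions; (2) on open-ended questions, several LLMs show excellent semantic alignment, with a maximum BERTScore of 86.44%.
Insight: LLMs show potential in the Arabic medical domain, but clinical use still requires improving the accuracy and semantic consistency of generated content.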
Abstract: Recent progress in large language models (LLMs) has showcased impressive proficiency in numerous Arabic natural language processing (NLP) applications. Nevertheless, their effectiveness in Arabic medical NLP domains has received limited investigation. This research examines the degree to which state-of-the-art LLMs demonstrate and articulate healthcare knowledge in Arabic, assessing their capabilities across a varied array of Arabic medical tasks. We benchmark several LLMs using a medical dataset proposed in the Arabic NLP AraHealthQA challenge in the MedArabiQ2025 track. Various base LLMs were assessed on their ability to accurately provide correct answers from existing choices in multiple-choice questions (MCQs) and fill-in-the-blank scenarios. Additionally, we evaluated the capacity of LLMs in answering open-ended questions aligned with expert answers. Our results reveal significant variations in correct answer prediction accuracy and low variations in semantic alignment of generated answers, highlighting both the potential and limitations of current LLMs in Arabic clinical contexts. Our analysis shows that for the MCQ task, the proposed majority voting solution, leveraging three base models (Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3), outperforms others, achieving up to 77% accuracy and securing first place overall in the AraHealthQA 2025 shared task-track 2 (sub-task 1) challenge. Moreover, for the open-ended questions task, several LLMs were able to demonstrate excellent performance in terms of semantic alignment and achieve a maximum BERTScore of 86.44%.
[55] Persuasiveness and Bias in LLM: Investigating the Impact of Persuasiveness and Reinforcement of Bias in Language Models cs.CL | cs.AI | I.2.7PDF
Saumya Roy
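TL;DR: This study examines the interplay between persuasiveness and bias in large language models (LLMs), showing how the models could be misused to spread misinformation or reinforce social biases, and argues for safeguards.
Details
Motivation: As LLMs are widely deployed, their strong persuasive power and potential to amplify bias could be misused; the study aims to assess these risks and inform safe deployment.
Result: LLMs can substantially shape narratives and adapt to audience values, but they can also be used to spread misinformation or reinforce social biases.
Insight: The core risk lies in misuse rather than in occasional model errors; technical measures (such as alignment-oriented design) and policy are needed to guard against potential harm.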
Abstract: Warning: This research studies AI persuasion and bias amplification that could be misused; all experiments are for safety evaluation. Large Language Models (LLMs) now generate convincing, human-like text and are widely used in content creation, decision support, and user interactions. Yet the same systems can spread information or misinformation at scale and reflect social biases that arise from data, architecture, or training choices. This work examines how persuasion and bias interact in LLMs, focusing on how imperfect or skewed outputs affect persuasive impact. Specifically, we test whether persona-based models can persuade with fact-based claims while also, unintentionally, promoting misinformation or biased narratives. We introduce a convincer-skeptic framework: LLMs adopt personas to simulate realistic attitudes. Skeptic models serve as human proxies; we compare their beliefs before and after exposure to arguments from convincer models. Persuasion is quantified with Jensen-Shannon divergence over belief distributions. We then ask how much persuaded entities go on to reinforce and amplify biased beliefs across race, gender, and religion. Strong persuaders are further probed for bias using sycophantic adversarial prompts and judged with additional models. Our findings show both promise and risk. LLMs can shape narratives, adapt tone, and mirror audience values across domains such as psychology, marketing, and legal assistance. But the same capacity can be weaponized to automate misinformation or craft messages that exploit cognitive biases, reinforcing stereotypes and widening inequities. The core danger lies in misuse more than in occasional model mistakes. By measuring persuasive power and bias reinforcement, we argue for guardrails and policies that penalize deceptive use and support alignment, value-sensitive design, and trustworthy deployment.
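As a rough illustration of how persuasion could be scored with Jensen-Shannon divergence, the sketch below compares a skeptic's belief distribution before and after exposure to an argument. The belief vectors and function names are hypothetical, and the abstract does not specify the logarithm base or how beliefs are elicited. Base 2 is assumed here, which bounds the score between 0 and 1.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    # The mixture m is positive wherever p or q is, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical beliefs over (disagree, neutral, agree).
before = [0.6, 0.3, 0.1]
after = [0.2, 0.3, 0.5]
print(js_divergence(before, after))
```

A score of 0 means the beliefs did not move; larger values mean a larger shift. The score is symmetric, so it measures how far beliefs moved, not in which direction.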
[56] MAC: A Live Benchmark for Multimodal Large Language Models in Scientific Understanding cs.CL | cs.AIPDF
Mohan Jiang, Jin Gao, Jiahao Zhan, Dequan Wang
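TL;DR: The paper proposes MAC, a continuously updated benchmark that evaluates the scientific understanding of multimodal large language models (MLLMs). It is built from image-text data from top-tier journals and challenges models' reasoning; the paper also proposes DAD, a method that improves model performance.
Details
Motivation: As MLLMs grow more capable, fixed benchmarks are becoming less effective at evaluating high-level scientific understanding, so a continuously updated evaluation method is needed.
Result: Experiments show that MLLMs have strong perceptual abilities but limited cross-modal reasoning; DAD improves performance by up to 11%.
Insight: A live benchmark better matches the pace of technical progress, and scientific reasoning is an important future direction for MLLMs.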
Abstract: As multimodal large language models (MLLMs) grow increasingly capable, fixed benchmarks are gradually losing their effectiveness in evaluating high-level scientific understanding. In this paper, we introduce the Multimodal Academic Cover benchmark (MAC), a live benchmark that could continuously evolve with scientific advancement and model progress. MAC leverages over 25,000 image-text pairs sourced from issues of top-tier scientific journals such as Nature, Science, and Cell, challenging MLLMs to reason across abstract visual and textual scientific content. Experiments on our most recent yearly snapshot, MAC-2025, reveal that while MLLMs demonstrate strong perceptual abilities, their cross-modal scientific reasoning remains limited. To bridge this gap, we propose DAD, a lightweight inference-time approach that enhances MLLMs by extending MLLM visual features with language space reasoning, achieving performance improvements of up to 11%. Finally, we highlight the live nature of MAC through experiments on updating journal covers and models for curation, illustrating its potential to remain aligned with the frontier of human knowledge. We release our benchmark at https://github.com/mhjiang0408/MAC_Bench.
[57] SurfaceLogicKV: Surface and Logic Attention Behaviors are All You Need for Robust KV Cache Compression cs.CL | cs.AIPDF
Mengjie Li, William J. Song
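TL;DR: The paper proposes SurfaceLogicKV, a method that distinguishes attention behaviors into surface memorization and logic construction in order to compress the KV cache effectively while maintaining model performance.
Details
Motivation: Growing input sequence lengths in large language models (LLMs) put heavy pressure on KV cache storage and hurt inference efficiency. The paper analyzes distinct attention behaviors in order to design a more efficient KV cache compression method.
Result: Experiments show the method is robust across various tasks and long sequences, with performance close to, and in some cases better than, baseline methods or FullKV.
Insight: The vast majority of attention heads (98.5%) ignore irrelevant information, while surface memorization and logic construction behaviors are rare but crucial for long-context reasoning. This finding suggests a new approach to KV cache optimization.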
Abstract: The increasing input sequence length in Large Language Models (LLMs) puts significant pressure on key-value (KV) cache storage, making efficient inference challenging. Explicitly distinguishing attention behavior into our self-defined surface memorization and logic construction reveals essential roles in long-context reasoning. We observe that an individual attention head can display various behaviors, with nearly 98.5% effectively ignoring completely irrelevant information. The remaining 1.5% behaves as logic construction, and 0.5% behaves as surface memorization. Based on layer- and head-wise integration, we propose a novel two-stage SurfaceLogicKV method to utilize these attention behaviors for KV cache compression. As a result, it achieves improved compression robustness while maintaining competitive performance across various tasks and long sequences compared to baselines or even FullKV in some specific situations.
[58] KL-based self-distillation for large language models cs.CL | cs.AIPDF
Max Rehman Linder
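TL;DR: The paper proposes a KL-divergence-based self-distillation method for knowledge transfer in large language models during vocabulary expansion, handling the challenge of differing tokenizations, and it performs well on code-generation tasks.
Details
Motivation: Large pre-trained language models struggle to incorporate new domain-specific terminology during fine-tuning, particularly under vocabulary expansion, where tokenization differences hinder knowledge transfer. The paper proposes a mathematically grounded method to address this.
Result: On approximately 2,000 code-generation tasks, the KL-divergence approach outperforms conventional cross-entropy training.
Insight: Mechanistic interpretability analysis shows how representations for new tokens are learned, explains the source of the performance gains, and offers insight into the structure of embedding space during vocabulary expansion.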
Abstract: Large pre-trained language models often struggle to incorporate new domain-specific terminology when fine-tuned on small, specialized corpora. In this work, we address the challenge of vocabulary expansion in frozen LLMs by introducing a mathematically grounded method for knowledge distillation via KL divergence, even when the original and extended models use different tokenizations. This allows the student model to inherit distributional knowledge from the teacher despite differing vocabularies. We compare our KL-based distillation approach to conventional cross-entropy training, evaluating both methods across multiple strategies for initializing new token embeddings. After embedding initialization, models are further fine-tuned to integrate the new vocabulary. Each trained model is benchmarked on approximately 2000 code-generation tasks, where our approach achieves the best performance across the board. Finally, through mechanistic interpretability, we analyze how models learn representations for the new tokens, providing an explanation for the observed gains and offering insight into the structure of embedding space during vocabulary expansion.
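For readers unfamiliar with the objective, here is a minimal sketch of a KL-divergence distillation loss for a single next-token prediction. It assumes teacher and student logits are already aligned over the same set of candidates, so it does not reproduce the paper's handling of differing tokenizations. The function names and example logits are illustrative only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def kl_distillation_loss(teacher_logits, student_logits):
    """KL(teacher || student) in nats for one aligned prediction step."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Hypothetical logits over three aligned candidates.
print(kl_distillation_loss([2.0, 0.5, -1.0], [1.0, 1.0, -0.5]))
```

Unlike cross-entropy against a one-hot target, this loss is zero only when the student matches the teacher's full distribution, which is what lets the student inherit distributional knowledge.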
[59] Chain-of-Query: Unleashing the Power of LLMs in SQL-Aided Table Understanding via Multi-Agent Collaboration cs.CL | cs.AI | cs.DBPDF
Songyuan Sui, Hongyi Liu, Serena Liu, Li Li, Soo-Hyun Choi
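TL;DR: The paper proposes Chain-of-Query (CoQ), a multi-agent framework that uses natural-language-style table schema representations and a clause-by-clause SQL generation strategy to significantly improve table understanding accuracy and SQL validity.
Details
Motivation: Table understanding requires multi-step structured reasoning, but large language models (LLMs) struggle with the structural complexity of tabular data. Existing methods comprehend table structure unreliably, suffer from error propagation that yields invalid queries, and rely too heavily on execution correctness.
Result: Across five benchmarks, accuracy rises from 61.11% to 74.77% and the invalid SQL rate falls from 9.48% to 3.34%.
Insight: Natural-language schema representations and clause-by-clause SQL generation markedly improve table understanding and SQL generation; separating mechanical from logical reasoning reduces reliance on execution outcomes.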
Abstract: Table understanding requires structured, multi-step reasoning. Large Language Models (LLMs) struggle with it due to the structural complexity of tabular data. Recently, multi-agent frameworks for SQL generation have shown promise in tackling the challenges of understanding tabular data, but existing approaches often suffer from limitations such as the inability to comprehend table structure for reliable SQL generation, error propagation that results in invalid queries, and over-reliance on execution correctness. To address these issues, we propose Chain-of-Query (CoQ), a novel multi-agent framework for SQL-aided table understanding. CoQ adopts natural-language-style representations of table schemas to abstract away structural noise and enhance understanding. It employs a clause-by-clause SQL generation strategy to improve query quality and introduces a hybrid reasoning division that separates SQL-based mechanical reasoning from LLM-based logical inference, thereby reducing reliance on execution outcomes. Experiments with four models (both closed- and open-source) across five widely used benchmarks show that Chain-of-Query significantly improves accuracy from 61.11% to 74.77% and reduces the invalid SQL rate from 9.48% to 3.34%, demonstrating its superior effectiveness in table understanding. The code is available at https://github.com/SongyuanSui/ChainofQuery.
[60] Detecting Hope, Hate, and Emotion in Arabic Textual Speech and Multi-modal Memes Using Large Language Models cs.CL | cs.AI | cs.LGPDF
Nouar AlDahoul, Yasir Zaki
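TL;DR: The paper explores the potential of large language models (LLMs) to identify hope, hate speech, offensive language, and emotional expression in Arabic text and memes, and validates their strong performance in the MAHED 2025 challenge.
Details
Motivation: The spread of Arabic content and memes on social media calls for precise analysis to counter the rise of hate speech and offensive language.
Result: GPT-4o-mini and Gemini Flash 2.5 achieve macro F1 scores of up to 72.1%, 57.8%, and 79.6% on the three tasks, respectively, ranking first overall.
Insight: Fine-tuned LLMs can provide a more nuanced understanding of Arabic text and memes, offering an efficient solution for content moderation systems.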
Abstract: The rise of social media and online communication platforms has led to the spread of Arabic textual posts and memes as a key form of digital expression. While this content can be humorous and informative, it is also increasingly being used to spread offensive language and hate speech. Consequently, there is a growing demand for precise analysis of content in Arabic text and memes. This paper explores the potential of large language models to effectively identify hope, hate speech, offensive language, and emotional expressions within such content. We evaluate the performance of base LLMs, fine-tuned LLMs, and pre-trained embedding models. The evaluation is conducted using a dataset of Arabic textual speech and memes proposed in the ArabicNLP MAHED 2025 challenge. The results underscore the capacity of LLMs such as GPT-4o-mini, fine-tuned with Arabic textual speech, and Gemini Flash 2.5, fine-tuned with Arabic memes, to deliver superior performance. They achieve up to 72.1%, 57.8%, and 79.6% macro F1 scores for tasks 1, 2, and 3, respectively, and secure first place overall in the MAHED 2025 challenge. The proposed solutions offer a more nuanced understanding of both text and memes for accurate and efficient Arabic content moderation systems.
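The macro F1 metric reported in this abstract averages per-class F1 scores with equal weight, so rare classes count as much as frequent ones. The sketch below shows the standard computation in plain Python; the label names and predictions are made up for illustration and are not data from the challenge.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); the denominator is nonzero because
        # every label considered appears at least once in y_true or y_pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


# Hypothetical gold labels and predictions.
gold = ["hate", "hope", "hope", "none"]
pred = ["hate", "hope", "none", "none"]
print(macro_f1(gold, pred))
```

Here the per-class scores are 1 for "hate" and 2/3 each for "hope" and "none", giving a macro F1 of 7/9.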
[61] From Clicks to Preference: A Multi-stage Alignment Framework for Generative Query Suggestion in Conversational System cs.CL | cs.AIPDF
Junhao Yin, Haolin Wang, Peng Bao, Ju Xu, Yongliang Wang
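TL;DR: The paper proposes a multi-stage alignment framework that progressively aligns the generation policy with user preferences, addressing the challenges of generative query suggestion in conversational systems and substantially raising click-through rate.
Details
Motivation: Although large language models give conversational systems powerful generative query suggestion, precisely aligning generated suggestions with nuanced and uncertain user preferences remains a key problem.
Result: The framework significantly outperforms baselines on automatic and human evaluations, and yields a 34% relative increase in click-through rate in live A/B tests.
Insight: Modeling user preferences as probability distributions, combined with multi-stage alignment and regularization, can effectively improve suggestion quality and user engagement.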
Abstract: Generative query suggestion using large language models offers a powerful way to enhance conversational systems, but aligning outputs with nuanced user preferences remains a critical challenge. To address this, we introduce a multi-stage framework designed for progressive alignment between the generation policy and user intent. Our pipeline begins with prompt engineering as a cold-start strategy, followed by the Supervised Fine-Tuning stage, in which we introduce a distillation method on click logs to create a robust foundational model. To better model user preferences while capturing their inherent uncertainty, we develop a Gaussian Reward Model (GaRM) that represents user preferences as probability distributions rather than point estimates. Finally, we employ reinforcement learning to align the generation policy with these preferences, guided by a composite reward function that integrates GaRM with auxiliary heuristics to mitigate reward hacking. To maintain training stability, this process is enhanced by a novel out-of-distribution regularization method and a two-stage reward fusion technique. Extensive experiments demonstrate that our framework significantly outperforms baselines on both automatic and human evaluations and yields a 34% relative increase in user engagement as measured by click-through rate in live A/B tests.
[62] SCOPE: A Generative Approach for LLM Prompt Compression cs.CL | cs.AIPDF
Tinghui Zhang, Yifan Wang, Daisy Zhe Wang
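TL;DR: The paper proposes SCOPE, a generative prompt compression technique for LLMs that splits the prompt into chunks and rewrites them to preserve semantic coherence, significantly improving compression quality and stability.
Details
Motivation: Existing prompt compression methods mostly rely on token removal, which causes information loss and structural incoherence and limits generation quality. The paper addresses these problems with a generative approach.
Result: Experiments on question-answering and summarization tasks show the method outperforms existing methods under high compression ratios, with higher and more stable compression quality.
Insight: Generative compression outperforms token removal, achieving efficient compression while preserving semantic and structural integrity.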
Abstract: Prompt compression methods enhance the efficiency of Large Language Models (LLMs) and minimize cost by reducing the length of the input context. The goal of prompt compression is to shorten the LLM prompt while maintaining high generation quality. However, existing solutions, mainly based on token removal, face challenges such as information loss and structural incoherence, like missing grammatical elements in a sentence or incomplete phrases after token removal. Such challenges limit the final generation quality of the LLM. To overcome these limitations, we present a novel generative prompt compression method. Unlike existing token removal methods, our method centers on a chunking-and-summarization mechanism. Specifically, our method splits the prompt into semantically coherent chunks and rewrites the chunks to be more concise. Finally, the chunks are reassembled into a meaningful prompt. We design several optimization techniques for the mechanism, including optimized semantic chunking, outlier chunk handling, dynamic compression ratio, compression prioritization, and keyword maintaining. These techniques effectively improve the identification and preservation of critical information and coherence among texts, and provide finer-grained control of the compression ratio. We conduct extensive evaluation on question-answering and summarization tasks, with datasets covering multiple domains. The evaluation shows our method achieves significantly better compression quality and higher stability than state-of-the-art methods, especially under high compression ratios, which demonstrates the effectiveness and practicality of our method.
[63] User-Assistant Bias in LLMs cs.CL | cs.AI | cs.HCPDF
Xu Pan, Jingxuan Fan, Zidi Xiong, Ely Hahami, Jorin Overwiening
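TL;DR: In multi-turn conversations, large language models (LLMs) can favor either the user's or the assistant's information, a tendency termed user-assistant bias. The authors introduce UserAssist, an 8k multi-turn conversation dataset, to benchmark and manipulate this bias across 26 commercial and open-weight models. They find that commercial models show varying levels of user bias, that instruction-tuned open-weight models show significant user bias, and that reasoning models show weak user bias. Fine-tuning experiments indicate that human preference alignment increases user bias, while training on chain-of-thought traces decreases it. Direct preference optimization (DPO) can adjust the bias in either direction and generalizes well.
Details
Motivation: In multi-turn conversations, LLMs may over-rely on the user's or their own information, leading to overly agreeable or stubborn behavior. Understanding and controlling this bias requires a systematic evaluation method and tools.
Result: Commercial models show varying levels of user bias; instruction-tuned open-weight models show significant user bias, while reasoning models show weak user bias. Human preference alignment increases user bias, and chain-of-thought training decreases it. DPO adjusts the bias effectively and generalizes well.
Insight: This bias is closely tied to the training recipe: preference alignment and reasoning training push it in opposite directions. Bias-control techniques can be used to detect and correct abnormal model behavior.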
Abstract: Large language models (LLMs) can bias towards relying on their own or the user’s information in chat history, leading to overly stubborn or agreeable behaviors in multi-turn conversations. In this paper, we formalize this model characteristic as user-assistant bias and introduce an 8k multi-turn conversation dataset UserAssist, which we use to benchmark, understand and manipulate the user-assistant bias in frontier LLMs. Leveraging UserAssist-test, we first benchmark the user-assistant bias of 26 commercial and 26 open-weight models. Commercial models show various levels of user bias. Evaluation on open-weight models reveals significant user bias in the instruction-tuned models, and weak user bias in reasoning (or reasoning-distilled) models. We then perform controlled fine-tuning experiments to pinpoint the post-training recipe contributing to these bias shifts: human preference alignment increases user bias, while training on chain-of-thought reasoning traces decreases it. Finally, we demonstrate that user-assistant bias can be bidirectionally adjusted by performing direct preference optimization (DPO) on UserAssist-train, and generalizes well to both in-domain and out-of-domain conversations. Our results provide insights into how the LLM integrates information from different sources, and also a viable way to detect and control model abnormalities.
[64] Enhancing Cryptocurrency Sentiment Analysis with Multimodal Features cs.CL | q-fin.STPDF
Chenghao Liu, Aniket Mahanti, Ranesh Naha, Guanghao Wang, Erwann Sbai
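TL;DR: Using multimodal analysis, the paper compares how TikTok and Twitter sentiment relate to the cryptocurrency market, finding that video content has more influence on short-term market trends while text content aligns more with long-term dynamics, and that integrating cross-platform signals improves forecasting accuracy.
Details
Motivation: As cryptocurrencies grow in popularity, studying how social media shapes market sentiment becomes important. Prior work has focused mainly on text data (such as Twitter), while the emotional and contextual information in video content remains underexplored.
Result: TikTok video sentiment significantly influences speculative assets and short-term trends, while Twitter text sentiment aligns more closely with long-term dynamics; cross-platform integration improves forecasting accuracy by up to 20%.
Insight: Because of its rich emotional expression, video content has distinct value for short-term market forecasting, while text data suits long-term analysis; a multimodal approach captures market sentiment more completely.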
Abstract: As cryptocurrencies gain popularity, the digital asset marketplace becomes increasingly significant. Understanding social media signals offers valuable insights into investor sentiment and market dynamics. Prior research has predominantly focused on text-based platforms such as Twitter. However, video content remains underexplored, despite potentially containing richer emotional and contextual sentiment that is not fully captured by text alone. In this study, we present a multimodal analysis comparing TikTok and Twitter sentiment, using large language models to extract insights from both video and text data. We investigate the dynamic dependencies and spillover effects between social media sentiment and cryptocurrency market indicators. Our results reveal that TikTok’s video-based sentiment significantly influences speculative assets and short-term market trends, while Twitter’s text-based sentiment aligns more closely with long-term dynamics. Notably, the integration of cross-platform sentiment signals improves forecasting accuracy by up to 20%.
[65] Embarrassed to observe: The effects of directive language in brand conversation cs.CL | cs.CY | cs.HC | cs.SIPDF
Andria Andriuzzi, Géraldine Michel
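TL;DR: The study shows that when brands use directive language with consumers on social media, observing consumers feel vicarious embarrassment and engage less, especially in nonproduct-centered conversations, although a strong brand relationship mitigates this negative effect.
Details
Motivation: The study investigates how observing consumers react when brands use directive language with other consumers on social media, and the psychological mechanism behind that reaction.
Result: Directive language triggers vicarious embarrassment in observing consumers and lowers their engagement; the negative effect is stronger in nonproduct-centered conversations, but the strength of the brand relationship mitigates it.
Insight: Conversation content (product-centered vs. nonproduct-centered) and brand relationship strength are key moderators, with important implications for brands' social media management strategies.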
Abstract: In social media, marketers attempt to influence consumers by using directive language, that is, expressions designed to get consumers to take action. While the literature has shown that directive messages in advertising have mixed results for recipients, we know little about the effects of directive brand language on consumers who see brands interacting with other consumers in social media conversations. On the basis of a field study and three online experiments, this study shows that directive language in brand conversation has a detrimental downstream effect on engagement of consumers who observe such exchanges. Specifically, in line with Goffman’s facework theory, because a brand that encourages consumers to react could be perceived as face-threatening, consumers who see a brand interacting with others in a directive way may feel vicarious embarrassment and engage less (compared with a conversation without directive language). In addition, we find that when the conversation is nonproduct-centered (vs. product-centered), consumers expect more freedom, as in mundane conversations, even for others; therefore, directive language has a stronger negative effect. However, in this context, the strength of the brand relationship mitigates this effect. Thus, this study contributes to the literature on directive language and brand-consumer interactions by highlighting the importance of context in interactive communication, with direct relevance for social media and brand management.
[66] Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models cs.CL | cs.AI | cs.LG | eess.ASPDF
Zhifei Xie, Ziyang Ma, Zihang Liu, Kaiyu Pang, Hongyu Li
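TL;DR: The paper proposes Mini-Omni-Reasoner, a framework that implements a “thinking-in-speaking” mechanism in speech models, significantly improving real-time interaction efficiency and reasoning ability.
Details
Motivation: Existing large speech models (LSMs) typically follow a “thinking-before-speaking” paradigm, so no speech can be produced until reasoning completes, which introduces significant latency. The paper aims to remove this latency while preserving reasoning accuracy and speech naturalness.
Result: On the Spoken-MQA benchmark, the model gains +19.1% in arithmetic reasoning and +6.4% in contextual understanding, with shorter outputs and zero decoding latency.
Insight: Interleaving reasoning with speech generation can substantially reduce latency and improve interaction efficiency while keeping speech natural and logically grounded, suggesting a new direction for real-time spoken interaction.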
Abstract: Reasoning is essential for effective communication and decision-making. While recent advances in LLMs and MLLMs have shown that incorporating explicit reasoning significantly improves understanding and generalization, reasoning in LSMs remains in a nascent stage. Early efforts attempt to transfer the “Thinking-before-Speaking” paradigm from textual models to speech. However, this sequential formulation introduces notable latency, as spoken responses are delayed until reasoning is fully completed, impairing real-time interaction and communication efficiency. To address this, we propose Mini-Omni-Reasoner, a framework that enables reasoning within speech via a novel “Thinking-in-Speaking” formulation. Rather than completing reasoning before producing any verbal output, Mini-Omni-Reasoner interleaves silent reasoning tokens with spoken response tokens at the token level. This design allows continuous speech generation while embedding structured internal reasoning, leveraging the model’s high-frequency token processing capability. Although interleaved, local semantic alignment is enforced to ensure that each response token is informed by its preceding reasoning. To support this framework, we introduce Spoken-Math-Problems-3M, a large-scale dataset tailored for interleaved reasoning and response. The dataset ensures that verbal tokens consistently follow relevant reasoning content, enabling accurate and efficient learning of speech-coupled reasoning. Built on a hierarchical Thinker-Talker architecture, Mini-Omni-Reasoner delivers fluent yet logically grounded spoken responses, maintaining both naturalness and precision. On the Spoken-MQA benchmark, it achieves a +19.1% gain in arithmetic reasoning and +6.4% in contextual understanding, with shorter outputs and zero decoding latency.
[67] DAIQ: Auditing Demographic Attribute Inference from Question in LLMs cs.CL | cs.AIPDF
Srikant Panda, Hitesh Laxmichand Patel, Shahad Al-Khalifa, Amit Agarwal, Hend Al-Khalifa
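TL;DR: The paper introduces DAIQ, a task and framework for auditing how language models infer users' demographic attributes from questions that contain no explicit demographic cues. It exposes a systemic risk in LLMs and develops a prompt-based guardrail to reduce identity inference.
Details
Motivation: The work addresses the risk that language models implicitly infer user demographic attributes (such as gender or race) from questions, a behavior that can violate expectations of neutrality, infer unintended information, and encode stereotypes that undermine fairness.
Result: Both open- and closed-source LLMs infer demographic attributes from question phrasing; the behavior is prevalent and consistent, and it can reinforce societal stereotypes and propagate harm.
Insight: Implicit demographic inference by LLMs is a systemic risk that can threaten privacy, fairness, and trust, and it calls for technical interventions (such as prompt-based guardrails) to meet responsible-AI goals.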
Abstract: Large Language Models (LLMs) are known to reflect social biases when demographic attributes, such as gender or race, are explicitly present in the input. But even in their absence, these models still infer user identities based solely on question phrasing. This subtle behavior has received far less attention, yet poses serious risks: it violates expectations of neutrality, infers unintended demographic information, and encodes stereotypes that undermine fairness in various domains including healthcare, finance and education. We introduce Demographic Attribute Inference from Questions (DAIQ), a task and framework for auditing an overlooked failure mode in language models: inferring user demographic attributes from questions that lack explicit demographic cues. Our approach leverages curated neutral queries, systematic prompting, and both quantitative and qualitative analysis to uncover how models infer demographic information. We show that both open- and closed-source LLMs do assign demographic labels based solely on question phrasing. The prevalence and consistency of demographic inferences across diverse models reveal a systemic and underacknowledged risk: LLMs can fabricate demographic identities, reinforce societal stereotypes, and propagate harms that erode privacy, fairness, and trust, posing a broader threat to social equity and responsible AI deployment. To mitigate this, we develop a prompt-based guardrail that substantially reduces identity inference and helps align model behavior with fairness and privacy objectives.
[68] Who’s Asking? Investigating Bias Through the Lens of Disability Framed Queries in LLMs cs.CL | cs.AI | cs.CYPDF
Srikant Panda, Vishnu Hari, Kalpana Panda, Amit Agarwal, Hitesh Laxmichand Patel
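TL;DR: The paper presents the first systematic audit of disability-conditioned demographic bias in LLMs. It finds that LLMs still infer user traits when no explicit demographic information is given, and that disability context significantly shifts predictions, with larger models more prone to stereotyped reasoning.
Details
Motivation: To examine how LLMs infer user demographics from disability cues in queries, and to expose disability-inclusion gaps overlooked by current alignment strategies.
Result: Models make a definitive demographic guess in up to 97% of cases; disability context significantly shifts the predicted distributions; and larger models are both more sensitive to disability cues and more biased.
Insight: Current LLM alignment strategies have serious blind spots; abstention calibration and counterfactual fine-tuning are recommended to curb unwarranted demographic inference and stereotype amplification.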
Abstract: Large Language Models (LLMs) routinely infer users' demographic traits from phrasing alone, which can result in biased responses, even when no explicit demographic information is provided. The role of disability cues in shaping these inferences remains largely uncharted. Thus, we present the first systematic audit of disability-conditioned demographic bias across eight state-of-the-art instruction-tuned LLMs ranging from 3B to 72B parameters. Using a balanced template corpus that pairs nine disability categories with six real-world business domains, we prompt each model to predict five demographic attributes (gender, socioeconomic status, education, cultural background, and locality) under both neutral and disability-aware conditions. Across a varied set of prompts, models deliver a definitive demographic guess in up to 97% of cases, exposing a strong tendency to make arbitrary inferences with no clear justification. Disability context heavily shifts predicted attribute distributions, and domain context can further amplify these deviations. We observe that larger models are simultaneously more sensitive to disability cues and more prone to biased reasoning, indicating that scale alone does not mitigate stereotype amplification. Our findings reveal persistent intersections between ableism and other demographic stereotypes, pinpointing critical blind spots in current alignment strategies. We release our evaluation framework and results to encourage disability-inclusive benchmarking and recommend integrating abstention calibration and counterfactual fine-tuning to curb unwarranted demographic inference. Code and data will be released on acceptance.
[69] Alvorada-Bench: Can Language Models Solve Brazilian University Entrance Exams? cs.CL | cs.AIPDF
Henrique Godoy
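TL;DR: This paper introduces Alvorada-Bench, a benchmark of 4,515 questions from Brazilian university entrance exams, for evaluating language models on Portuguese and multi-step reasoning. Models are tested under zero-shot, role-playing, and chain-of-thought prompting; they do very well on language questions but remain weaker on mathematics and the engineering-oriented exams.
Details
Motivation: Language model evaluation is mostly English-centric and neglects Portuguese and other linguistic and cultural contexts. Alvorada-Bench aims to fill this gap by testing models within the Brazilian education system.
Result: The top models exceed 94% accuracy overall, but accuracy declines on Mathematics and on the engineering-oriented exams. Confidence is well calibrated and correlates with perceived difficulty. A cost analysis shows that high accuracy is achievable at under $2 per 1K tokens.
Insight: Language models still show clear weaknesses on complex reasoning in a non-English setting, yet perform near human level in culturally grounded contexts. Their self-assessment is strong: they can judge their own performance accurately.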
Abstract: Language models are increasingly used in Brazil, but most evaluation remains English-centric. This paper presents Alvorada-Bench, a 4,515-question, text-only benchmark drawn from five Brazilian university entrance examinations. We evaluate twenty models under zero-shot, role-playing, and chain-of-thought prompting, producing 270,900 responses with structured self-reports of confidence, perceived difficulty, and Bloom level. The top models exceed 94% accuracy overall, but accuracy declines on Mathematics and on the engineering-oriented IME and ITA exams, indicating persistent weaknesses in multi-step reasoning. Confidence is well calibrated and correlates with perceived difficulty, revealing that models can accurately assess their own certainty. A cost-accuracy analysis shows that high accuracy is achievable at under $2 per 1K tokens. On ENEM 2024 the top model (O3) achieved perfect scores on Languages questions, while even the weakest system (GPT-4.1 Nano) underperforms humans only in Mathematics. Through exams that distill decades of Brazilian educational priorities and assess millions of students yearly, Alvorada-Bench establishes whether language models can navigate the intersection of language, culture, and reasoning that defines academic readiness in Brazil.
[70] Lexical Hints of Accuracy in LLM Reasoning Chains cs.CL | cs.LGPDF
Arne Vanhoyweghen, Brecht Verbeken, Andres Algaba, Vincent Ginis
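TL;DR: The paper studies whether lexical features of an LLM's reasoning chain (such as uncertainty words and sentiment volatility) predict answer accuracy, finding that lexical uncertainty markers are the strongest indicator of errors, while chain length is informative only on intermediate-difficulty tasks.
Details
Motivation: LLMs often report high confidence on tasks where their accuracy is low, reflecting poor calibration. The authors analyze measurable properties of the chain of thought (CoT), such as length, sentiment volatility, and lexical hints, to capture the model's internal confidence and support safer deployment.
Result: Lexical uncertainty markers (e.g., "guess", "stuck") are the strongest predictors of errors; sentiment volatility gives a weaker signal; CoT length is informative only on the intermediate-difficulty benchmark (Omni-MATH). Uncertainty indicators are more salient than high-confidence markers.
Insight: Lexical features offer a simple, effective way to predict LLM errors, especially on low-accuracy tasks, which matters for safe deployment.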
Abstract: Fine-tuning Large Language Models (LLMs) with reinforcement learning to produce an explicit Chain-of-Thought (CoT) before answering yields models that consistently achieve higher overall performance on code, math, and general-knowledge benchmarks. However, on benchmarks where LLMs currently achieve low accuracy, such as Humanity’s Last Exam (HLE), they often report high self-confidence, reflecting poor calibration. Here, we test whether measurable properties of the CoT provide reliable signals of an LLM’s internal confidence in its answers. We analyze three feature classes: (i) CoT length, (ii) intra-CoT sentiment volatility, and (iii) lexicographic hints, including hedging words. Using DeepSeek-R1 and Claude 3.7 Sonnet on both Humanity’s Last Exam (HLE), a frontier benchmark with very low accuracy, and Omni-MATH, a saturated benchmark of moderate difficulty, we find that lexical markers of uncertainty (e.g., "guess", "stuck", "hard") in the CoT are the strongest indicators of an incorrect response, while shifts in the CoT sentiment provide a weaker but complementary signal. CoT length is informative only on Omni-MATH, where accuracy is already high (≈ 70%), and carries no signal on the harder HLE (≈ 9%), indicating that CoT length predicts correctness only in the intermediate-difficulty benchmarks, i.e., inside the model’s demonstrated capability, but still below saturation. Finally, we find that uncertainty indicators in the CoT are consistently more salient than high-confidence markers, making errors easier to predict than correct responses. Our findings support a lightweight post-hoc calibration signal that complements unreliable self-reported probabilities and supports safer deployment of LLMs.
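To make the lexical-marker idea concrete, the sketch below computes the fraction of words in a reasoning chain that are uncertainty markers. The marker set contains only the three examples quoted in the abstract; the paper's full lexicon, tokenization, and any classifier built on top of this feature are not given here, so everything else (function name, regex tokenizer, sample text) is an assumption.

```python
import re

# Only the examples quoted in the abstract; the full lexicon is not given.
UNCERTAINTY_MARKERS = {"guess", "stuck", "hard"}


def uncertainty_rate(cot_text):
    """Share of word tokens in a chain of thought that are uncertainty markers."""
    tokens = re.findall(r"[a-z']+", cot_text.lower())
    if not tokens:
        return 0.0
    hits = sum(token in UNCERTAINTY_MARKERS for token in tokens)
    return hits / len(tokens)


# Hypothetical reasoning snippet.
print(uncertainty_rate("I guess this is hard"))
```

A higher rate would, under the paper's finding, point to a greater chance that the final answer is wrong; turning the rate into a calibrated probability would need labeled data.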
[71] Coarse-to-Fine Personalized LLM Impressions for Streamlined Radiology Reports cs.CL | cs.AIPDF
Chengbo Sun, Hui Yi Leong, Lei Li
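TL;DR: The paper proposes a coarse-to-fine framework that uses open-source large language models (LLMs) to automatically generate and personalize the impression section of radiology reports, aiming to reduce radiologists' workload.
Details
Motivation: Manually writing the “Impression” section of radiology reports is a primary driver of radiologist burnout, so an automated method is needed to improve efficiency.
Result: The approach is designed to significantly reduce administrative workload and improve reporting efficiency while maintaining high standards of clinical precision.
Insight: Combining LLMs with RLHF enables personalized and efficient automatic generation of medical reports, offering a reference for similar domains.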
Abstract: The manual creation of the “Impression” section in radiology reports is a primary driver of radiologist burnout. To address this challenge, we propose a coarse-to-fine framework that leverages open-source large language models (LLMs) to automatically generate and personalize impressions from clinical findings. The system first produces a draft impression and then refines it using machine learning and reinforcement learning from human feedback (RLHF) to align with individual radiologists’ styles while ensuring factual accuracy. We fine-tune LLaMA and Mistral models on a large dataset of reports from the University of Chicago Medicine. Our approach is designed to significantly reduce administrative workload and improve reporting efficiency while maintaining high standards of clinical precision.
[72] CyPortQA: Benchmarking Multimodal Large Language Models for Cyclone Preparedness in Port Operation cs.CLPDF
Chenchen Kuai, Chenhao Wu, Yang Zhou, Xiubin Bruce Wang, Tianbao Yang
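TL;DR: This paper introduces CyPortQA, the first multimodal benchmark tailored to port operations under cyclone threat, and uses it to evaluate multimodal large language models (MLLMs).
Details
Motivation: As tropical cyclones intensify and track forecasts become more uncertain, port operators must rapidly synthesize multimodal forecast data into actionable guidance, yet the accuracy and reliability of MLLMs in this setting have not been rigorously evaluated.
Result: MLLMs show promise in situation understanding but still struggle with tasks such as potential impact estimation and decision reasoning.
Insight: MLLMs show potential for cyclone preparedness tasks, but their reasoning ability needs further improvement before they can be relied on in practice.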
Abstract: As tropical cyclones intensify and track forecasts become increasingly uncertain, U.S. ports face heightened supply-chain risk under extreme weather conditions. Port operators need to rapidly synthesize diverse multimodal forecast products, such as probabilistic wind maps, track cones, and official advisories, into clear, actionable guidance as cyclones approach. Multimodal large language models (MLLMs) offer a powerful means to integrate these heterogeneous data sources alongside broader contextual knowledge, yet their accuracy and reliability in the specific context of port cyclone preparedness have not been rigorously evaluated. To fill this gap, we introduce CyPortQA, the first multimodal benchmark tailored to port operations under cyclone threat. CyPortQA assembles 2,917 real-world disruption scenarios from 2015 through 2023, spanning 145 U.S. principal ports and 90 named storms. Each scenario fuses multisource data (i.e., tropical cyclone products, port operational impact records, and port condition bulletins) and is expanded through an automated pipeline into 117,178 structured question-answer pairs. Using this benchmark, we conduct extensive experiments on diverse MLLMs, including both open-source and proprietary models. MLLMs demonstrate great potential in situation understanding but still face considerable challenges in reasoning tasks, including potential impact estimation and decision reasoning.
[73] MedCoT-RAG: Causal Chain-of-Thought RAG for Medical Question Answering cs.CL | cs.IRPDF
Ziyu Wang, Elahe Khatibi, Amir M. Rahmani
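TL;DR: MedCoT-RAG is a framework for medical question answering that combines causal-aware document retrieval with structured chain-of-thought prompting, improving performance on complex medical tasks over existing methods.
Details
Motivation: Large language models (LLMs) suffer from hallucinations and shallow reasoning in medical QA, and conventional retrieval-augmented generation (RAG) lacks the structured reasoning needed for clinical decision support.
Result: On three medical QA benchmarks, MedCoT-RAG outperforms vanilla RAG by up to 10.3% and advanced domain-adapted methods by up to 6.4%.
Insight: Mimicking the clinical reasoning process substantially improves performance on complex medical tasks, which supports the value of structured causal reasoning for medical QA.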
Abstract: Large language models (LLMs) have shown promise in medical question answering but often struggle with hallucinations and shallow reasoning, particularly in tasks requiring nuanced clinical understanding. Retrieval-augmented generation (RAG) offers a practical and privacy-preserving way to enhance LLMs with external medical knowledge. However, most existing approaches rely on surface-level semantic retrieval and lack the structured reasoning needed for clinical decision support. We introduce MedCoT-RAG, a domain-specific framework that combines causal-aware document retrieval with structured chain-of-thought prompting tailored to medical workflows. This design enables models to retrieve evidence aligned with diagnostic logic and generate step-by-step causal reasoning reflective of real-world clinical practice. Experiments on three diverse medical QA benchmarks show that MedCoT-RAG outperforms strong baselines by up to 10.3% over vanilla RAG and 6.4% over advanced domain-adapted methods, improving accuracy, interpretability, and consistency in complex medical tasks.
[74] DocHop-QA: Towards Multi-Hop Reasoning over Multimodal Document Collections cs.CLPDF
Jiwon Park, Seohyun Pyeon, Jinwoo Kim, Rina Carines Cabal, Yihao Ding
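TL;DR: DocHop-QA is a large-scale multimodal, multi-document, multi-hop QA benchmark with 11,379 question instances that supports structured reasoning across documents and modalities.
Details
Motivation: Existing QA benchmarks are mostly limited to a single document or a single modality and do not reflect the complexity of real-world information seeking. DocHop-QA aims to fill this gap.
Result: Four tasks covering structured index prediction, generative answering, and multimodal integration demonstrate DocHop-QA's capacity to support complex reasoning.
Insight: DocHop-QA offers a more realistic evaluation setting for multimodal cross-document reasoning and should support further work on complex QA tasks.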
Abstract: Despite recent advances in large language models (LLMs), most QA benchmarks are still confined to single-paragraph or single-document settings, failing to capture the complexity of real-world information-seeking tasks. Practical QA often requires multi-hop reasoning over information distributed across multiple documents, modalities, and structural formats. Although prior datasets made progress in this area, they rely heavily on Wikipedia-based content and unimodal plain text, with shallow reasoning paths that typically produce brief phrase-level or single-sentence answers, thus limiting their realism and generalizability. We propose DocHop-QA, a large-scale benchmark comprising 11,379 QA instances for multimodal, multi-document, multi-hop question answering. Constructed from publicly available scientific documents sourced from PubMed, DocHop-QA is domain-agnostic and incorporates diverse information formats, including textual passages, tables, and structural layout cues. Unlike existing datasets, DocHop-QA does not rely on explicitly hyperlinked documents; instead, it supports open-ended reasoning through semantic similarity and layout-aware evidence synthesis. To scale realistic QA construction, we designed an LLM-driven pipeline grounded in 11 high-frequency scientific question concepts. We evaluated DocHop-QA through four tasks spanning structured index prediction, generative answering, and multimodal integration, reflecting both discriminative and generative paradigms. These tasks demonstrate DocHop-QA’s capacity to support complex, multimodal reasoning across multiple documents.
[75] MGSC: A Multi-granularity Consistency Framework for Robust End-to-end Asr cs.CL | cs.AI | cs.SD | eess.AS | I.2.7PDF
Xuwen Yang
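TL;DR: The paper proposes MGSC, a multi-granularity consistency framework that jointly regularizes macro-level sentence semantics and micro-level token alignment, improving the robustness of end-to-end ASR models in noisy environments.
Details
Motivation: End-to-end ASR models tend to produce catastrophic semantic errors in noise, mainly because the training objective penalizes only final output errors and leaves the model's internal computation unconstrained.
Result: On a public dataset, MGSC reduces the average Character Error Rate by a relative 8.7% across noise conditions, mainly by preventing severe meaning-altering errors.
Insight: Enforcing internal consistency is a key step toward more robust and trustworthy AI systems.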
Abstract: End-to-end ASR models, despite their success on benchmarks, often produce catastrophic semantic errors in noisy environments. We attribute this fragility to the prevailing ‘direct mapping’ objective, which solely penalizes final output errors while leaving the model’s internal computational process unconstrained. To address this, we introduce the Multi-Granularity Soft Consistency (MGSC) framework, a model-agnostic, plug-and-play module that enforces internal self-consistency by simultaneously regularizing macro-level sentence semantics and micro-level token alignment. Crucially, our work is the first to uncover a powerful synergy between these two consistency granularities: their joint optimization yields robustness gains that significantly surpass the sum of their individual contributions. On a public dataset, MGSC reduces the average Character Error Rate by a relative 8.7% across diverse noise conditions, primarily by preventing severe meaning-altering mistakes. Our work demonstrates that enforcing internal consistency is a crucial step towards building more robust and trustworthy AI.
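To make the "relative 8.7%" figure concrete, the sketch below shows how Character Error Rate (CER) and a relative reduction are commonly computed. It uses the standard Levenshtein-based definition of CER; the function names are hypothetical and the abstract does not state which scoring tool was used.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction, e.g. 0.087 for a relative 8.7% drop."""
    return (baseline - improved) / baseline
```

As an illustration with made-up numbers, a baseline CER of 10.0% that falls to 9.13% is a relative reduction of 8.7%, although the absolute drop is only 0.87 percentage points.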
[76] QU-NLP at QIAS 2025 Shared Task: A Two-Phase LLM Fine-Tuning and Retrieval-Augmented Generation Approach for Islamic Inheritance Reasoning cs.CLPDF
Mohammad AL-Smadi
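TL;DR: For SubTask 1 of QIAS 2025, the QU-NLP team combines two-phase LLM fine-tuning with retrieval-augmented generation (RAG) for Islamic inheritance reasoning, reaching 85.8% accuracy and outperforming larger models such as GPT 4.5.
Details
Motivation: Islamic inheritance law involves complex rules and calculations, and general-purpose large language models (LLMs) perform poorly on it in zero-shot settings. The team aims to improve reasoning through domain fine-tuning and retrieval augmentation.
Result: The system reaches 85.8% accuracy on the final test and is especially strong on advanced reasoning (97.6%), outperforming several competing models evaluated zero-shot.
Insight: Combining domain-specific fine-tuning (LoRA) with retrieval augmentation (RAG) lets a mid-scale Arabic LLM surpass frontier general-purpose models on a specialized task.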
Abstract: This paper presents our approach and results for SubTask 1: Islamic Inheritance Reasoning at QIAS 2025, a shared task focused on evaluating Large Language Models (LLMs) in understanding and reasoning within Islamic inheritance knowledge. We fine-tuned the Fanar-1-9B causal language model using Low-Rank Adaptation (LoRA) and integrated it into a Retrieval-Augmented Generation (RAG) pipeline. Our system addresses the complexities of Islamic inheritance law, including comprehending inheritance scenarios, identifying eligible heirs, applying fixed-share rules, and performing precise calculations. Our system achieved an accuracy of 0.858 in the final test, outperforming other competitive models such as GPT 4.5, LLaMA, Fanar, Mistral, and ALLaM evaluated with zero-shot prompting. Our results demonstrate that QU-NLP achieves near state-of-the-art accuracy (85.8%), excelling especially on advanced reasoning (97.6%), where it outperforms Gemini 2.5 and OpenAI’s o3. This highlights that domain-specific fine-tuning combined with retrieval grounding enables mid-scale Arabic LLMs to surpass frontier models in Islamic inheritance reasoning.
[77] Counterspeech for Mitigating the Influence of Media Bias: Comparing Human and LLM-Generated Responses cs.CL | cs.CY | cs.SIPDF
Luyang Lin, Zijin Feng, Lingzhi Wang, Kam-Fai Wong
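TL;DR: The study examines how counterspeech can reduce the influence of media bias and compares human-written with LLM-generated responses, finding that the latter are more polite but lack diversity and novelty. Few-shot learning and added news background information improve generation quality.
Details
Motivation: Biased news fuels societal polarization, and hostile comments reinforce that bias and cause harm. Counterspeech can counter such speech effectively. To the authors' knowledge, this is the first study of counterspeech generation in the context of news articles.
Result: Model-generated responses are more polite but less diverse. Adding background information and few-shot learning improves diversity and relevance.
Insight: Counterspeech is an effective tool against bias, but model outputs need more diversity and novelty. News background information is notably helpful in few-shot generation.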
Abstract: Biased news contributes to societal polarization and is often reinforced by hostile reader comments, constituting a vital yet often overlooked aspect of news dissemination. Our study reveals that offensive comments support biased content, amplifying bias and causing harm to targeted groups or individuals. Counterspeech is an effective approach to counter such harmful speech without violating freedom of speech, helping to limit the spread of bias. To the best of our knowledge, this is the first study to explore counterspeech generation in the context of news articles. We introduce a manually annotated dataset linking media bias, offensive comments, and counterspeech. We conduct a detailed analysis showing that over 70% of offensive comments support biased articles, amplifying bias and thus highlighting the importance of counterspeech generation. Comparing counterspeech generated by humans and large language models, we find that model-generated responses are more polite but lack novelty and diversity. Finally, we improve generated counterspeech through few-shot learning and integration of news background information, enhancing both diversity and relevance.
[78] XFinBench: Benchmarking LLMs in Complex Financial Problem Solving and Reasoning cs.CL | cs.LGPDF
Zhihan Zhang, Yixin Cao, Lizi Liao
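TL;DR: XFinBench is a new benchmark for evaluating large language models (LLMs) on complex, knowledge-intensive financial problems with multimodal context. Experiments show that the best text-only model still lags well behind human experts, especially in temporal reasoning and scenario planning.
Details
Motivation: Solving financial problems requires complex reasoning, multimodal data processing, and broad technical knowledge, which poses distinctive challenges for current LLMs.
Result: The best text-only model reaches 67.3% accuracy, well behind human experts (79.8%). Knowledge augmentation helps only small models, and calculation and visual-context questions are the main sources of error.
Insight: Current LLMs have clear weaknesses on complex financial tasks, especially temporal reasoning and scenario planning; the limited benefit of knowledge augmentation points to the effect of model scale.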
Abstract: Solving financial problems demands complex reasoning, multimodal data processing, and a broad technical understanding, presenting unique challenges for current large language models (LLMs). We introduce XFinBench, a novel benchmark with 4,235 examples designed to evaluate LLMs’ ability to solve complex, knowledge-intensive financial problems across diverse graduate-level finance topics with multi-modal context. We identify five core capabilities of LLMs using XFinBench, i.e., terminology understanding, temporal reasoning, future forecasting, scenario planning, and numerical modelling. Upon XFinBench, we conduct extensive experiments on 18 leading models. The results show that o1 is the best-performing text-only model with an overall accuracy of 67.3%, but it still lags significantly behind human experts, by 12.5%, especially in temporal reasoning and scenario planning capabilities. We further construct a knowledge bank with 3,032 finance terms for knowledge augmentation analysis, and find that knowledge relevant to the question brings consistent accuracy improvements only to small open-source models. Additionally, our error analysis reveals that rounding errors during calculation and blindness to the position and intersection of curves in the image are the two primary issues behind models’ poor performance on calculation and visual-context questions, respectively. Code and dataset are accessible via GitHub: https://github.com/Zhihan72/XFinBench.
[79] CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning cs.CL | cs.AIPDF
Wenqiao Zhu, Ji Liu, Rongjuncheng Zhang, Haipang Wu, Yulun Zhang
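TL;DR: The paper proposes CARFT, which combines contrastive learning with annotated Chain-of-Thought (CoT) in reinforced fine-tuning to improve the reasoning ability of large language models.
Details
Motivation: Current RL-based fine-tuning methods ignore annotated CoT and sample reasoning paths unstably, leading to model collapse and degraded performance. Supervised fine-tuning, in turn, relies too heavily on annotated CoT and does not fully exploit other potential reasoning paths.
Result: In experiments, CARFT outperforms baselines in robustness, performance (up to 10.15%), and efficiency (up to 30.62%).
Insight: Combining contrastive learning with reinforcement learning makes full use of annotated data while stabilizing training, which improves reasoning ability.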
Abstract: Reasoning capability plays a critical role in the broad applications of Large Language Models (LLMs). To enhance the reasoning performance of LLMs, diverse Reinforcement Learning (RL)-based fine-tuning approaches have been proposed to address the limited generalization capability of LLMs trained solely via Supervised Fine-Tuning (SFT). Despite their effectiveness, two major limitations hinder the advancement of LLMs. First, vanilla RL-based approaches ignore annotated Chain-of-Thought (CoT) and incorporate unstable reasoning path sampling, which typically results in model collapse, an unstable training process, and suboptimal performance. Second, existing SFT approaches generally overemphasize the annotated CoT, potentially leading to performance degradation due to insufficient exploitation of potential CoT. In this paper, we propose a Contrastive learning with annotated CoT-based Reinforced Fine-Tuning approach, i.e., CARFT, to enhance the reasoning performance of LLMs while addressing the aforementioned limitations. Specifically, we propose learning a representation for each CoT. Based on this representation, we design novel contrastive signals to guide the fine-tuning process. Our approach not only fully exploits the available annotated CoT but also stabilizes the fine-tuning procedure by incorporating an additional unsupervised learning signal. We conduct comprehensive experiments and in-depth analysis with three baseline approaches, two foundation models, and two datasets to demonstrate significant advantages of CARFT in terms of robustness, performance (up to 10.15%), and efficiency (up to 30.62%). Code is available at https://github.com/WNQzhu/CARFT.
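The abstract does not spell out its contrastive signal, so the sketch below substitutes InfoNCE, a standard contrastive objective, purely to illustrate the general idea of pulling a sampled CoT representation toward an annotated one and away from others. The vectors, function names, and temperature are all assumptions; the paper's actual loss may differ.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor: Sequence[float],
             positive: Sequence[float],
             negatives: Sequence[Sequence[float]],
             temperature: float = 0.1) -> float:
    """Generic InfoNCE loss (a stand-in, not the CARFT objective).

    anchor:    representation of a sampled CoT (hypothetical embedding)
    positive:  representation of the annotated CoT for the same problem
    negatives: representations of CoTs treated as mismatched
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

The loss is small when the anchor is close to the positive and far from the negatives, and large in the opposite case, which is the property any such contrastive signal relies on.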
[80] DeepMEL: A Multi-Agent Collaboration Framework for Multimodal Entity Linking cs.CL | cs.AI | cs.MAPDF
Fang Wang, Tianwei Yan, Zonghao Yang, Minghao Hu, Jun Zhang
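TL;DR: DeepMEL is a multimodal entity linking framework based on multi-agent collaboration. A role-specialized division strategy addresses the challenges existing methods face in cross-modal fusion and in jointly using large language models and visual models.
Details
Motivation: Current multimodal entity linking methods face incomplete contextual information, coarse cross-modal fusion, and difficulty combining large language models with large visual models.
Result: DeepMEL achieves state-of-the-art performance on five public benchmark datasets, improving accuracy by 1%-57%.
Insight: Role specialization and dynamic coordination substantially improve cross-modal linking, and a structured cloze prompt simplifies task parsing.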
Abstract: Multimodal Entity Linking (MEL) aims to associate textual and visual mentions with entities in a multimodal knowledge graph. Despite its importance, current methods face challenges such as incomplete contextual information, coarse cross-modal fusion, and the difficulty of jointly leveraging large language models (LLMs) and large visual models (LVMs). To address these issues, we propose DeepMEL, a novel framework based on multi-agent collaborative reasoning, which achieves efficient alignment and disambiguation of textual and visual modalities through a role-specialized division strategy. DeepMEL integrates four specialized agents, namely Modal-Fuser, Candidate-Adapter, Entity-Clozer and Role-Orchestrator, to complete end-to-end cross-modal linking through specialized roles and dynamic coordination. DeepMEL adopts a dual-modal alignment path, and combines the fine-grained text semantics generated by the LLM with the structured image representation extracted by the LVM, significantly narrowing the modal gap. We design an adaptive iteration strategy that combines tool-based retrieval and semantic reasoning capabilities to dynamically optimize the candidate set and balance recall and precision. DeepMEL also unifies MEL tasks into a structured cloze prompt to reduce parsing complexity and enhance semantic comprehension. Extensive experiments on five public benchmark datasets demonstrate that DeepMEL achieves state-of-the-art performance, improving ACC by 1%-57%. Ablation studies verify the effectiveness of all modules.
[81] Less Redundancy: Boosting Practicality of Vision Language Model in Walking Assistants cs.CLPDF
Chongyang Li, Yuan Zhiqiang, Jiapei Zhang, Ying Deng, Hanbo Bi
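TL;DR: The paper proposes WalkVLM-LR, which reduces output redundancy and temporal redundancy to make vision-language models more practical in walking assistance systems for blind users.
Details
Motivation: About 283 million people worldwide have visual impairments. Existing vision-language models for walking assistance produce redundant output and redundant reminders over time, which hampers users' ability to assess their surroundings accurately.
Result: Experiments show that WalkVLM-LR outperforms other models in output conciseness and in reducing temporal redundancy.
Insight: Combining human preferences with scene risk assessment can substantially improve the practicality and efficiency of walking assistance models.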
Abstract: Approximately 283 million people worldwide live with visual impairments, motivating increasing research into leveraging Visual Language Models (VLMs) to develop effective walking assistance systems for blind and low vision individuals. However, existing VLMs in walking assistance tasks often produce outputs that contain considerable redundancy and extraneous details, adversely affecting users’ ability to accurately assess their surroundings. Moreover, these models typically lack the capability to proactively assess environmental risks and adaptively trigger reminders based on the appropriate scene, leading to excessive temporal redundancy. To mitigate output and temporal redundancy, we propose WalkVLM-LR, a walking assistance model with less redundancy. To reduce output redundancy, we introduce four human-preference-based custom reward functions within the GRPO-based reasoning framework to optimize the output in terms of conciseness, fluency, keyword density, and accuracy, thereby producing more informative and streamlined outputs. To minimize temporal redundancy, we incorporate an environment awareness discriminator, which shares the visual encoder with the VLMs to reduce redundant computations and enhance discriminative efficiency, so that WalkVLM-LR can assess scene risk levels and minimize unnecessary reminders. Experimental results demonstrate that our method achieves state-of-the-art performance across all evaluation metrics compared with other models, particularly in output conciseness and less temporal redundancy.
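The abstract names four reward dimensions (conciseness, fluency, keyword density, accuracy) but gives no formulas. The sketch below shows what two of them could look like as simple scalar rewards; every formula, name, default value, and weight here is a hypothetical stand-in, not the paper's definition.

```python
from typing import Set, Tuple


def conciseness_reward(text: str, max_words: int = 20) -> float:
    """1.0 for outputs within the word budget, decaying as length grows.

    The 20-word budget is an illustrative assumption.
    """
    n = len(text.split())
    if n == 0:
        return 0.0
    return min(1.0, max_words / n)


def keyword_density_reward(text: str, keywords: Set[str]) -> float:
    """Fraction of words that belong to a set of safety-relevant keywords."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in keywords)
    return hits / len(words)


def combined_reward(text: str, keywords: Set[str],
                    weights: Tuple[float, float] = (0.5, 0.5)) -> float:
    """Weighted sum of the two sketched rewards (weights are assumptions)."""
    return (weights[0] * conciseness_reward(text)
            + weights[1] * keyword_density_reward(text, keywords))
```

In a GRPO-style setup, a scalar like `combined_reward` would be computed per sampled response and compared within a group of samples; fluency and accuracy terms would need a language model or reference answers and are omitted here.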
[82] CEQuest: Benchmarking Large Language Models for Construction Estimation cs.CL | cs.LGPDF
Yanzhao Wu, Lufan Wang, Rui Liu
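TL;DR: The paper introduces CEQuest, a new benchmark dataset for evaluating large language models (LLMs) on construction-domain question answering, with a focus on construction drawing interpretation and estimation, and shows experimentally where current models fall short.
Details
Motivation: LLMs perform well on general-domain tasks, but their potential in specialized fields such as construction remains underexplored. The authors aim to advance domain-specific models by building a specialized benchmark.
Result: Experiments show that current models have considerable room for improvement in the construction domain, which underlines the importance of integrating domain knowledge.
Insight: Domain-specific LLMs need more specialized data and knowledge, and the planned open release of CEQuest should encourage further research.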
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of general-domain tasks. However, their effectiveness in specialized fields, such as construction, remains underexplored. In this paper, we introduce CEQuest, a novel benchmark dataset specifically designed to evaluate the performance of LLMs in answering construction-related questions, particularly in the areas of construction drawing interpretation and estimation. We conduct comprehensive experiments using five state-of-the-art LLMs, including Gemma 3, Phi4, LLaVA, Llama 3.3, and GPT-4.1, and evaluate their performance in terms of accuracy, execution time, and model size. Our experimental results demonstrate that current LLMs exhibit considerable room for improvement, highlighting the importance of integrating domain-specific knowledge into these models. To facilitate further research, we will open-source the proposed CEQuest dataset, aiming to foster the development of specialized large language models (LLMs) tailored to the construction domain.
[83] CYCLE-INSTRUCT: Fully Seed-Free Instruction Tuning via Dual Self-Training and Cycle Consistency cs.CL | cs.AI | cs.LGPDF
Zhanming Shen, Hao Chen, Yulei Tang, Shaolin Zhu, Wentao Ye
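TL;DR: Cycle-Instruct is a seed-free instruction tuning framework that achieves full automation through dual self-training and cycle consistency, removing the dependence on human-written seed data or external teacher models.
Details
Motivation: Conventional instruction tuning relies on costly human-annotated seed data or powerful external teacher models, and existing methods still cannot fully escape the need for seed data. Cycle-Instruct aims to make instruction tuning fully seed-free.
Result: Cycle-Instruct is validated on four diverse data tracks, where it outperforms seed-driven back-translation baselines and performs comparably to strongly supervised methods.
Insight: Combining cycle consistency with dual self-training gives a seed-free approach to instruction tuning, shows the potential of learning from the intrinsic structure of data, and avoids the biases introduced by seed data.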
Abstract: Instruction tuning is vital for aligning large language models (LLMs) with human intent, but current methods typically rely on costly human-annotated seed data or powerful external teacher models. While instruction back-translation techniques reduce this dependency, they remain fundamentally tethered to an initial seed set, which limits full automation, introduces biases, and can lead to inefficient use of unlabeled corpora. In this paper, we propose Cycle-Instruct, a novel framework that achieves fully seed-free instruction tuning. Inspired by cycle consistency, Cycle-Instruct employs a dual self-training loop where two models, an answer generator and a question generator, are bootstrapped solely from raw, unlabeled text. These models mutually supervise each other by reconstructing original text segments from their counterpart’s generated pseudo-labels, effectively learning from the intrinsic structure of the data without any human-provided seeds. We demonstrate Cycle-Instruct’s efficacy across four diverse data tracks, including general instruction-following, domain-specific tasks, dialogue logs, and plain text. Our extensive experiments show that Cycle-Instruct not only outperforms seed-driven back-translation baselines but also achieves performance comparable to strongly supervised methods.
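One direction of the cycle described in this abstract can be sketched as follows: a raw text segment is turned into a pseudo-question, the answer generator reconstructs the segment from it, and the reconstruction is scored against the original. The generators are passed in as plain callables and the token-overlap F1 score is a hypothetical stand-in for whatever reconstruction loss the paper uses.

```python
from collections import Counter
from typing import Callable


def token_f1(reference: str, candidate: str) -> float:
    """Token-overlap F1 between a reference and a candidate string."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def cycle_score(segment: str,
                question_generator: Callable[[str], str],
                answer_generator: Callable[[str], str]) -> float:
    """Score one answer -> question -> answer cycle on a raw text segment.

    Both generators are placeholders for models; no real API is implied.
    """
    pseudo_question = question_generator(segment)       # pseudo-label
    reconstruction = answer_generator(pseudo_question)  # attempt to recover the segment
    return token_f1(segment, reconstruction)
```

The dual direction, in which a segment is treated as a question and reconstructed via a pseudo-answer, would follow the same pattern with the roles of the two generators swapped.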
[84] From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits cs.CL | cs.LGPDF
Karim Saraipour, Shichang Zhang
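TL;DR: The paper studies how GPT-2 small handles binary logical reasoning. By analyzing syllogism tasks, it identifies multiple circuits and uncovers binary mechanisms, and a comparison with Indirect Object Identification (IOI) yields new insights into the roles of attention heads and MLPs (multilayer perceptrons).
Details
Motivation: The work aims to understand how Transformer models behave on more complex logical tasks such as syllogisms, and how their internal mechanisms differ from those used in simpler linguistic tasks such as IOI.
Result: A circuit of five attention heads recovers over 90% of the original model's performance under a faithfulness metric. The model can also produce a negated token not present in the prompt through negative heads.
Insight: Attention heads and MLPs play different roles in simple and complex tasks; binary mechanisms such as negative heads are central to the model's logical reasoning; and analyses of IOI and logical tasks complement each other.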
Abstract: Transformer-based language models (LMs) can perform a wide range of tasks, and mechanistic interpretability (MI) aims to reverse engineer the components responsible for task completion to understand their behavior. Previous MI research has focused on linguistic tasks such as Indirect Object Identification (IOI). In this paper, we investigate the ability of GPT-2 small to handle binary truth values by analyzing its behavior with syllogistic prompts, e.g., “Statement A is true. Statement B matches statement A. Statement B is”, which requires more complex logical reasoning compared to IOI. Through our analysis of several syllogism tasks of varying difficulty, we identify multiple circuits that mechanistically explain GPT-2’s logical-reasoning capabilities and uncover binary mechanisms that facilitate task completion, including the ability to produce a negated token not present in the input prompt through negative heads. Our evaluation using a faithfulness metric shows that a circuit comprising five attention heads achieves over 90% of the original model’s performance. By relating our findings to IOI analysis, we provide new insights into the roles of specific attention heads and MLPs in LMs. These insights contribute to a broader understanding of model reasoning and support future research in mechanistic interpretability.
[85] Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection cs.CLPDF
Ankan Mullick, Saransh Sharma, Abhik Jana, Pawan Goyal
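TL;DR: The study finds that, in multimodal intent detection, the dominance of the text modality lets a text-only LLM (Mistral-7B) outperform multimodal models. After a debiasing framework is applied, most samples are removed and model performance drops sharply, which highlights modality bias in multimodal datasets.
Details
Motivation: The study compares models of different modalities on multimodal intent detection, in particular text-only LLMs versus multimodal models, and examines modality bias in the datasets.
Result: (1) A text-only LLM outperforms multimodal models on the multimodal task; (2) debiasing removes a large share of the samples and sharply reduces model performance; (3) smaller multimodal fusion models are affected most by debiasing.
Insight: The study exposes text-modality bias in multimodal datasets and stresses that unbiased datasets are needed to evaluate multimodal models effectively.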
Abstract: The rise of multimodal data, integrating text, audio, and visuals, has created new opportunities for studying multimodal tasks such as intent detection. This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multi-modal models, in the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on MIntRec-1 and 4% on MIntRec2.0 datasets. This performance advantage comes from a strong textual bias in these datasets, where over 90% of the samples require textual input, either alone or in combination with other modalities, for correct classification. We confirm the modality bias of these datasets via human evaluation, too. Next, we propose a framework to debias the datasets, and upon debiasing, more than 70% of the samples in MIntRec-1 and more than 50% in MIntRec2.0 get removed, resulting in significant performance degradation across all models, with smaller multimodal fusion models being the most affected with an accuracy drop of over 50-60%. Further, we analyze the context-specific relevance of different modalities through empirical analysis. Our findings highlight the challenges posed by modality bias in multimodal intent datasets and emphasize the need for unbiased datasets to evaluate multimodal models effectively.
[86] ParamBench: A Graduate-Level Benchmark for Evaluating LLM Understanding on Indic Subjects cs.CLPDF
Kaushal Sharma, Vivek Patel, Ayush Maheshwari, Aditya Maheshwari
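TL;DR: This paper presents ParamBench, a benchmark for evaluating how well large language models (LLMs) understand graduate-level questions in the Indian cultural context, with about 11.5K questions across 16 subjects.
Details
Motivation: Existing Indian benchmarks focus on basic factual queries and offer little assessment of deeper disciplinary understanding in the Indian context. ParamBench fills this gap with graduate-level, culturally grounded questions.
Result: Llama 3.3 70B performs best, with an overall accuracy of 48%. LLMs remain weak on culturally grounded topics such as music and classical instruments.
Insight: LLMs have limited ability on graduate-level questions in the Indian context, particularly on culture-related topics, which highlights the difficulty of culturally grounded reasoning.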
Abstract: Large language models (LLMs) have been widely evaluated on tasks such as comprehension, question answering, summarization, code generation, etc. However, their performance on graduate-level, culturally grounded questions in the Indian context remains largely unexplored. Existing Indian benchmarks emphasise basic fact-orientated queries that offer limited assessment of a deeper disciplinary understanding tailored to the Indian setting. In this paper, we present ParamBench, consisting of around 11.5K questions in Hindi, comprising questionnaires from 16 diverse subjects. These questions are primarily derived from nationwide graduate-level entrance examinations covering topics such as history, music, instruments, yoga, literature, philosophy, law, etc., specifically for the Indian context. Additionally, we assess the ability of LLMs to handle diverse question formats, such as list-based matching, assertion-reason pairs, and sequence ordering, alongside conventional multiple-choice questions. We evaluated the performance of more than 17 open source LLMs on this benchmark, observing that Llama 3.3 70B attains the highest overall accuracy of 48%. Furthermore, subject-wise analysis indicates that even for the best performing LLMs, performance remains weak on topics such as music, classical instruments, politics and archaeology, underscoring persistent challenges in culturally grounded reasoning.
[87] Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation cs.CL | cs.CV | cs.MM | cs.SD | eess.ASPDF
Weiting Tan, Jiachen Lian, Hirofumi Inaguma, Paden Tomasello, Philipp Koehn
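TL;DR: The paper presents an Audio-Visual Language Model (AVLM) that integrates full-face visual cues into a pre-trained speech generation model, yielding substantial gains on emotion recognition and expressive dialogue tasks.
Details
Motivation: Speech generation models typically rely on speech-only input and ignore the role of visual information in emotional expression. The paper adds visual cues to improve the emotional expressiveness of generated speech.
Result: AVLM improves emotion recognition F1 by 5 points over speech-only baselines, supporting the value of visual information for speech generation.
Insight: Visual information is a key factor in making generated speech more emotionally expressive, and future multimodal conversational systems should integrate visual and speech modalities more closely.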
Abstract: We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine-tuning on emotion recognition and expressive dialogue tasks yields substantial gains over speech-only baselines (e.g., +5 F1 in emotion recognition). AVLM highlights the value of expressive visual information in guiding speech generation and offers a foundation for end-to-end multimodal conversational systems.
[88] ComicScene154: A Scene Dataset for Comic Analysis cs.CLPDF
Sandro Paval, Ivan P. Yamshchikov, Pascal Meißner
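TL;DR: This paper introduces ComicScene154, a manually annotated dataset of comic scenes intended to advance computational methods for multimodal narrative analysis and comic research.
Details
Motivation: Comics are a distinctive domain of multimodal storytelling, but datasets to support their computational analysis are scarce. ComicScene154 is proposed to fill this gap.
Result: The results indicate that ComicScene154 is a valuable resource for advancing computational methods in multimodal narrative understanding.
Insight: As a form of multimodal narrative, comics can offer distinctive insights for broader research on multimodal storytelling.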
Abstract: Comics offer a compelling yet under-explored domain for computational narrative analysis, combining text and imagery in ways distinct from purely textual or audiovisual media. We introduce ComicScene154, a manually annotated dataset of scene-level narrative arcs derived from public-domain comic books spanning diverse genres. By conceptualizing comics as an abstraction for narrative-driven, multimodal data, we highlight their potential to inform broader research on multi-modal storytelling. To demonstrate the utility of ComicScene154, we present a baseline scene segmentation pipeline, providing an initial benchmark that future studies can build upon. Our results indicate that ComicScene154 constitutes a valuable resource for advancing computational methods in multimodal narrative understanding and expanding the scope of comic analysis within the Natural Language Processing community.
[89] CMR-SPB: Cross-Modal Multi-Hop Reasoning over Text, Image, and Speech with Path Balance cs.CLPDF
Seunghee Kim, Ingyu Bang, Seokgyu Jang, Changhyeon Kim, Sanghwan Bae
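TL;DR: The paper proposes CMR-SPB, a new benchmark for cross-modal multi-hop reasoning that addresses two shortcomings of existing benchmarks (neglect of the speech modality and biased reasoning paths), and introduces an effective ECV prompting technique.
Details
Motivation: Existing benchmarks for cross-modal multi-hop reasoning have two main problems: they overlook the speech modality, and their reasoning path distributions are heavily biased, which undermines fair evaluation.
Result: Experiments show that CMR-SPB evaluates models more fairly and reveals the bias in existing benchmarks. ECV prompting narrows the performance gap across different reasoning paths.
Insight: The work stresses the importance of fair, unbiased evaluation for cross-modal reasoning, and the proposed ECV technique offers a useful tool for future multimodal AI development.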
Abstract: Cross-modal multi-hop reasoning (CMR) is a valuable yet underexplored capability of multimodal large language models (MLLMs), entailing the integration of information from multiple modalities to produce a coherent output for a given context. We argue that existing benchmarks for evaluating this ability have critical shortcomings: (1) they largely overlook the speech modality, and (2) they exhibit heavily biased reasoning path distributions, which can severely undermine fair evaluation. To address these limitations, we introduce a novel benchmark – Cross-Modal Multi-Hop Reasoning over Text, Image and Speech with Path Balance (CMR-SPB) – designed to assess tri-modal multi-hop reasoning while ensuring both unbiased and diverse reasoning paths. Our experiments with the new dataset reveal consistent model failures in specific reasoning sequences and show that biased benchmarks risk misrepresenting model performance. Finally, based on our extensive analysis, we propose a new ECV (Extract, Connect, Verify) prompting technique that effectively mitigates the performance gap across different reasoning paths. Overall, we call for more careful evaluation in CMR to advance the development of robust multimodal AI.
[90] TULIP: Adapting Open-Source Large Language Models for Underrepresented Languages and Specialized Financial Tasks cs.CLPDF
İrem Demirtaş, Burak Payzun, Seçil Arslan
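TL;DR: This paper introduces the TULIP models, which adapt Llama 3.1 8B and Qwen 2.5 7B through a multi-stage pipeline (data collection, continual pre-training, benchmark design, synthetic data generation, and supervised fine-tuning) to improve performance on financial Turkish tasks.
Details
Motivation: Although large proprietary models perform very well in finance, smaller open-source models have advantages in privacy and adaptability, especially for underrepresented languages and sensitive data.
Result: The results show that the models' capabilities can be enhanced to accomplish targeted financial tasks in Turkish effectively.
Insight: The adaptability and privacy advantages of open-source models make them competitive for domain-specific tasks in underrepresented languages, particularly in sensitive domains such as finance.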
Abstract: Thanks to the growing popularity of large language models over the years, there is great potential for their applications in finance. Despite the exceptional performance of larger proprietary models, which are presented as black-box solutions through APIs, smaller models that can be hosted on-premise present opportunities for adaptability and privacy. Especially in cases where the management of sensitive information and application of domain knowledge is important, like finance, enhancing the capabilities of smaller models becomes crucial, notably for underrepresented languages. In this work, we introduce TULIP models, which adapt Llama 3.1 8B and Qwen 2.5 7B for domain and language adaptation, focusing on financial Turkish use cases. The five-stage development pipeline involves data collection, continual pre-training (CPT), benchmark design, synthetic data generation and supervised fine-tuning (SFT). The results show that the capabilities of the models can be enhanced to effectively accomplish targeted tasks in this specific domain and language.
[91] M3TQA: Massively Multilingual Multitask Table Question Answering cs.CLPDF
Daixin Shu, Jian Yang, Zhenhe Wu, Xianjie Wu, Xianfu Cheng
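TL;DR: The paper proposes the M3TQA framework to address data imbalance and insufficient scale in multilingual table question answering, using a large-scale multitask benchmark (97 languages) and a high-quality translation pipeline to improve performance on low-resource languages.
Details
Motivation: Existing table understanding research focuses mostly on English, and multilingual data suffers from geolinguistic imbalance, with little coverage of low-resource languages.
Result: The translation pipeline reaches a median BLEU of 60.19, synthetic data substantially improves low-resource language performance, and M3TQA sets a new standard for multilingual table understanding.
Insight: Synthetically generated, unannotated data is an effective way to improve low-resource language performance, and multilingual tasks need systematic coverage of diverse language families.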
Abstract: Tabular data is a fundamental component of real-world information systems, yet most research in table understanding remains confined to English, leaving multilingual comprehension significantly underexplored. Existing multilingual table benchmarks suffer from geolinguistic imbalance - overrepresenting certain languages and lacking sufficient scale for rigorous cross-lingual analysis. To address these limitations, we introduce a comprehensive framework for massively multilingual multitask table question answering, featuring m3TQA-Instruct, a large-scale benchmark spanning 97 languages across diverse language families, including underrepresented and low-resource languages. We construct m3TQA by curating 50 real-world tables in Chinese and English, then applying a robust six-step LLM-based translation pipeline powered by DeepSeek and GPT-4o, achieving high translation fidelity with a median BLEU score of 60.19 as validated through back-translation. The benchmark includes 2,916 professionally annotated question-answering pairs across four tasks designed to evaluate nuanced table reasoning capabilities. Experiments on state-of-the-art LLMs reveal critical insights into cross-lingual generalization, demonstrating that synthetically generated, unannotated QA data can significantly boost performance, particularly for low-resource languages. M3T-Bench establishes a new standard for multilingual table understanding, providing both a challenging evaluation platform and a scalable methodology for future research.
[92] From Confidence to Collapse in LLM Factual Robustness cs.CL | cs.AIPDF
Alina Fastowski, Bardh Prenkaj, Gjergji Kasneci
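TL;DR: The paper proposes a new metric, FRS, which measures the robustness of factual knowledge in LLMs by analyzing token distribution entropy and temperature scaling sensitivity, and validates its effectiveness.
Details
Motivation: Existing evaluation methods focus mainly on performance-based metrics and overlook the robustness of knowledge during generation, so a new approach is needed to fill this gap.
Result: FRS differs significantly across model sizes (0.76 for smaller models, 0.93 for larger ones), and accuracy drops by about 60% as uncertainty increases.
Insight: Entropy and temperature scaling have a significant effect on factual accuracy, laying a foundation for developing more robust knowledge retention and retrieval mechanisms in future models.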
Abstract: Ensuring the robustness of factual knowledge in LLMs is critical for reliable applications in tasks such as question answering and reasoning. However, existing evaluation methods predominantly focus on performance-based metrics, often investigating from the perspective of prompt perturbations, which captures only the externally triggered side of knowledge robustness. To bridge this gap, we introduce a principled approach to measure factual robustness from the perspective of the generation process by analyzing token distribution entropy in combination with temperature scaling sensitivity. These two factors build the Factual Robustness Score (FRS), a novel metric which quantifies the stability of a fact against perturbations in decoding conditions, given its initial uncertainty. To validate our approach, we conduct extensive experiments on 5 LLMs across 3 closed-book QA datasets (SQuAD, TriviaQA, and HotpotQA). We show that factual robustness varies significantly – smaller models report an FRS of $0.76$, larger ones $0.93$ – with accuracy degrading by ~$60\%$ under increased uncertainty. These insights demonstrate how entropy and temperature scaling impact factual accuracy, and lay a foundation for developing more robust knowledge retention and retrieval in future models.
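The two ingredients named in the abstract above, token distribution entropy and temperature scaling, can be illustrated with a minimal sketch. This is not the paper's FRS formula, which the abstract does not give; it only shows how the Shannon entropy of a softmax distribution changes as the decoding temperature changes. The logits and the function names are hypothetical, chosen for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token logits in which one candidate answer dominates.
logits = [4.0, 1.0, 0.5, 0.0]

for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(f"T={temperature}: top prob={max(probs):.3f}, entropy={entropy(probs):.3f}")
```

Higher temperatures flatten the distribution, so the entropy rises and the top candidate loses probability mass. That is the kind of decoding perturbation the abstract describes a fact being stable against.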
[93] LLMs that Understand Processes: Instruction-tuning for Semantics-Aware Process Mining cs.CLPDF
Vira Pyrih, Adrian Rebmann, Han van der Aa
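TL;DR: The paper investigates whether instruction-tuning improves the generalization of large language models (LLMs) on semantics-aware process mining tasks.
Details
Motivation: Traditional process mining methods rely on frequency analysis and make little use of semantic information. LLMs can be fine-tuned for specific tasks, but this is computationally expensive and generalizes poorly, so the paper studies the potential of instruction-tuning.
Result: Instruction-tuning considerably improves performance on process discovery and prediction tasks, but results on anomaly detection vary across models, showing that task selection is critical to the outcome.
Insight: Instruction-tuning is an effective way to improve LLM generalization in process mining, but the choice of task combination plays a key role in performance.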
Abstract: Process mining is increasingly using textual information associated with events to tackle tasks such as anomaly detection and process discovery. Such semantics-aware process mining focuses on what behavior should be possible in a process (i.e., expectations), thus providing an important complement to traditional, frequency-based techniques that focus on recorded behavior (i.e., reality). Large Language Models (LLMs) provide a powerful means for tackling semantics-aware tasks. However, the best performance is so far achieved through task-specific fine-tuning, which is computationally intensive and results in models that can only handle one specific task. To overcome this lack of generalization, we use this paper to investigate the potential of instruction-tuning for semantics-aware process mining. The idea of instruction-tuning here is to expose an LLM to prompt-answer pairs for different tasks, e.g., anomaly detection and next-activity prediction, making it more familiar with process mining, thus allowing it to also perform better at unseen tasks, such as process discovery. Our findings demonstrate a varied impact of instruction-tuning: while performance considerably improved on process discovery and prediction tasks, it varies across models on anomaly detection tasks, highlighting that the selection of tasks for instruction-tuning is critical to achieving desired outcomes.
[94] JaParaPat: A Large-Scale Japanese-English Parallel Patent Application Corpus cs.CLPDF
Masaaki Nagata, Katsuki Chousa, Norihito Yasuda
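TL;DR: The authors build JaParaPat, a parallel corpus of more than 300 million Japanese-English sentence pairs from patent applications, extracted with a translation-based sentence alignment method; adding it to the training data improves patent translation by 20 BLEU points.
Details
Motivation: Demand for patent translation keeps growing while existing parallel corpora are limited in size, which motivates building a large-scale, high-quality patent parallel corpus.
Result: Experiments show that adding the patent corpus substantially improves translation quality, raising BLEU by 20 points.
Insight: In-domain parallel patent data substantially improves machine translation quality, particularly for specialized-domain translation tasks.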
Abstract: We constructed JaParaPat (Japanese-English Parallel Patent Application Corpus), a bilingual corpus of more than 300 million Japanese-English sentence pairs from patent applications published in Japan and the United States from 2000 to 2021. We obtained the publications of unexamined patent applications from the Japan Patent Office (JPO) and the United States Patent and Trademark Office (USPTO). We also obtained patent family information from DOCDB, a bibliographic database maintained by the European Patent Office (EPO). Based on the patent families, we extracted approximately 1.4M Japanese-English document pairs that are translations of each other, and then extracted about 350M sentence pairs from these document pairs using a translation-based sentence alignment method whose initial translation model is bootstrapped from a dictionary-based sentence alignment method. In our experiments, adding more than 300M sentence pairs obtained from patent applications to 22M sentence pairs obtained from the web improved the accuracy of patent translation by 20 BLEU points.
[95] MizanQA: Benchmarking Large Language Models on Moroccan Legal Question Answering cs.CL | cs.AI | cs.IRPDF
Adil Bahaj, Mounir Ghogho
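TL;DR: MizanQA is a benchmark for evaluating large language models on Moroccan legal question answering, filling a gap in the low-resource Arabic legal domain.
Details
Motivation: Current large language models perform poorly in low-resource, specialized domains such as Arabic law, so targeted evaluation tools and domain-specific optimization are needed.
Result: Experiments show that existing models have substantial performance gaps on Moroccan legal tasks, indicating a need for improved evaluation metrics and domain-specific model development.
Insight: Cultural context and legal specificity are critical to language model performance; future work should develop models better suited to local needs.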
Abstract: The rapid advancement of large language models (LLMs) has significantly propelled progress in natural language processing (NLP). However, their effectiveness in specialized, low-resource domains, such as Arabic legal contexts, remains limited. This paper introduces MizanQA (pronounced Mizan, meaning “scale” in Arabic, a universal symbol of justice), a benchmark designed to evaluate LLMs on Moroccan legal question answering (QA) tasks, characterised by rich linguistic and legal complexity. The dataset draws on Modern Standard Arabic, Islamic Maliki jurisprudence, Moroccan customary law, and French legal influences. Comprising over 1,700 multiple-choice questions, including multi-answer formats, MizanQA captures the nuances of authentic legal reasoning. Benchmarking experiments with multilingual and Arabic-focused LLMs reveal substantial performance gaps, highlighting the need for tailored evaluation metrics and culturally grounded, domain-specific LLM development.
[96] RoMedQA: The First Benchmark for Romanian Medical Question Answering cs.CL | cs.AI | cs.LGPDF
Ana-Cristina Rogoz, Radu Tudor Ionescu, Alexandra-Valentina Anghel, Ionut-Lucian Antone-Iordache, Simona Coniac
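TL;DR: RoMedQA is the first Romanian question answering benchmark for the medical domain, containing 102,646 QA pairs built from case summaries of cancer patients. Experiments show that supervised fine-tuned models significantly outperform zero-shot prompted models, underscoring the importance of domain- and language-specific fine-tuning.
Details
Motivation: The lack of QA datasets for specific domains and languages limits the generalization of AI models, especially in the medical domain and in Romanian.
Result: Fine-tuned models significantly outperform zero-shot models, indicating that pretrained models generalize poorly to RoMedQA.
Insight: Domain- and language-specific fine-tuning is essential for reliable clinical QA, and RoMedQA fills the gap in Romanian medical QA.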
Abstract: Question answering (QA) is an actively studied topic, being a core natural language processing (NLP) task that needs to be addressed before achieving Artificial General Intelligence (AGI). However, the lack of QA datasets in specific domains and languages hinders the development of robust AI models able to generalize across various domains and languages. To this end, we introduce RoMedQA, the first Romanian QA benchmark for the medical domain, alongside a comprehensive evaluation of state-of-the-art large language models (LLMs). We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients. The questions regard medical case summaries of 1,011 patients, requiring either keyword extraction or reasoning to be answered correctly. RoMedQA is the result of a time-consuming manual annotation process carried out by seven physicians specialized in oncology or radiotherapy, who spent a total of about 2,100 work hours to generate the QA pairs. We experiment with four LLMs from distinct families of models on RoMedQA. Each model is employed in two scenarios, namely one based on zero-shot prompting and one based on supervised fine-tuning. Our results show that fine-tuned models significantly outperform their zero-shot counterparts, clearly indicating that pretrained models fail to generalize on RoMedQA. Our findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian. We publicly release our dataset and code at https://github.com/ana-rogoz/RoMedQA.
[97] Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish cs.CL | cs.AI | I.2.7PDF
Yakup Abrek Er, Ilker Kesen, Gözde Gül Şahin, Aykut Erdem
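TL;DR: Cetvel is a comprehensive benchmark for Turkish that evaluates large language models (LLMs) on language understanding, generation, and cultural capacity, addressing shortcomings of existing Turkish benchmarks.
Details
Motivation: Existing Turkish benchmarks often lack task diversity or cultural relevance. Cetvel addresses this by combining diverse discriminative and generative tasks with content that reflects Turkish language and culture.
Result: Experiments show that Turkish-centric models, despite being tailored to the language, generally underperform multilingual or general-purpose models (such as Llama 3 and Mistral); tasks such as grammatical error correction and extractive question answering are especially effective at differentiating model capabilities.
Insight: Advancing Turkish LLMs requires more culturally relevant data and research, and Cetvel provides an important tool for future model optimization and evaluation.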
Abstract: We introduce Cetvel, a comprehensive benchmark designed to evaluate large language models (LLMs) in Turkish. Existing Turkish benchmarks often lack either task diversity or culturally relevant content, or both. Cetvel addresses these gaps by combining a broad range of both discriminative and generative tasks ensuring content that reflects the linguistic and cultural richness of Turkish language. Cetvel covers 23 tasks grouped into seven categories, including tasks such as grammatical error correction, machine translation, and question answering rooted in Turkish history and idiomatic language. We evaluate 33 open-weight LLMs (up to 70B parameters) covering different model families and instruction paradigms. Our experiments reveal that Turkish-centric instruction-tuned models generally underperform relative to multilingual or general-purpose models (e.g. Llama 3 and Mistral), despite being tailored for the language. Moreover, we show that tasks such as grammatical error correction and extractive question answering are particularly discriminative in differentiating model capabilities. Cetvel offers a comprehensive and culturally grounded evaluation suite for advancing the development and assessment of LLMs in Turkish.
[98] A Probabilistic Inference Scaling Theory for LLM Self-Correction cs.CLPDF
Zhe Yang, Yichang Zhang, Yudong Wang, Ziyao Xu, Junyang Lin
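TL;DR: The paper proposes a probabilistic theory that models how the accuracy of large language models (LLMs) evolves over multiple rounds of self-correction and explains the mechanism behind the performance gains. The authors derive a convergence formula for accuracy and validate the theory experimentally.
Details
Motivation: To explore the mechanism behind accuracy changes during multi-round LLM self-correction, filling a gap in the quantitative understanding of this dynamic process.
Result: The accuracy curves predicted by the theory closely match empirical data, supporting the validity of the model.
Insight: The model quantifies the dynamics of LLM self-correction and provides a theoretical basis for future work, such as improving the convergence rate or studying behavior across different tasks.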
Abstract: Large Language Models (LLMs) have demonstrated the capability to refine their generated answers through self-correction, enabling continuous performance improvement over multiple rounds. However, the mechanisms underlying how and why accuracy evolves during this iterative process remain unexplored. To fill this gap, we propose a probabilistic theory to model the dynamics of accuracy change and explain the performance improvements observed in multi-round self-correction. Through mathematical derivation, we establish that the accuracy after the $t^{th}$ round of self-correction is given by: $Acc_t = Upp - \alpha^t(Upp - Acc_0),$ where $Acc_0$ denotes the initial accuracy, $Upp$ represents the upper bound of accuracy convergence, and $\alpha$ determines the rate of convergence. Based on our theory, these parameters can be calculated and the predicted accuracy curve then can be obtained through only a single round of self-correction. Extensive experiments across diverse models and datasets demonstrate that our theoretical predictions align closely with empirical accuracy curves, validating the effectiveness of the theory. Our work provides a theoretical foundation for understanding LLM self-correction, thus paving the way for further explorations.
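The closed-form expression in the abstract above, $Acc_t = Upp - \alpha^t(Upp - Acc_0)$, can be evaluated directly. The sketch below uses hypothetical values for $Acc_0$, $Upp$, and $\alpha$. The helper that recovers $\alpha$ from a first-round accuracy is plain algebra on the formula with $Upp$ assumed known; it is not the estimation procedure from the paper, which the abstract does not describe.

```python
def predicted_accuracy(acc0, upp, alpha, t):
    """Accuracy after t rounds: Acc_t = Upp - alpha**t * (Upp - Acc_0)."""
    return upp - (alpha ** t) * (upp - acc0)


def alpha_from_first_round(acc0, acc1, upp):
    """Solve Acc_1 = Upp - alpha * (Upp - Acc_0) for alpha (assumes Upp is known)."""
    if upp == acc0:
        raise ValueError("alpha is undefined when Acc_0 already equals Upp")
    return (upp - acc1) / (upp - acc0)


# Hypothetical values, for illustration only.
acc0, upp, alpha = 0.60, 0.80, 0.5

for t in range(5):
    print(f"round {t}: predicted accuracy = {predicted_accuracy(acc0, upp, alpha, t):.4f}")
```

With $0 < \alpha < 1$, the gap to the upper bound shrinks by a factor of $\alpha$ each round, so the curve rises quickly at first and then flattens as it approaches $Upp$.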
[99] LLM-as-classifier: Semi-Supervised, Iterative Framework for Hierarchical Text Classification using Large Language Models cs.CL | cs.IRPDF
Doohee You, Andy Parisi, Zach Vander Velden, Lara Dantas Inojosa
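TL;DR: The paper proposes a semi-supervised, iterative framework that uses the zero-shot and few-shot capabilities of LLMs to build hierarchical text classifiers, addressing shifting data distributions in real-world industrial deployments.
Details
Motivation: LLMs are powerful for text analysis, but their reliability and scalability as classifiers in production environments are challenging, and the dynamic nature of real-world data distributions is a particular concern.
Result: The framework yields an efficient, maintainable classification system suited to dynamic industrial data distributions.
Insight: Human feedback and iterative refinement are key to making LLMs practical; hierarchical classification needs to combine domain knowledge with multi-stage validation.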
Abstract: The advent of Large Language Models (LLMs) has provided unprecedented capabilities for analyzing unstructured text data. However, deploying these models as reliable, robust, and scalable classifiers in production environments presents significant methodological challenges. Standard fine-tuning approaches can be resource-intensive and often struggle with the dynamic nature of real-world data distributions, which is common in industry. In this paper, we propose a comprehensive, semi-supervised framework that leverages the zero- and few-shot capabilities of LLMs to build hierarchical text classifiers as a solution to these industry-wide challenges. Our methodology emphasizes an iterative, human-in-the-loop process that begins with domain knowledge elicitation and progresses through prompt refinement, hierarchical expansion, and multi-faceted validation. We introduce techniques for assessing and mitigating sequence-based biases and outline a protocol for continuous monitoring and adaptation. This framework is designed to bridge the gap between the raw power of LLMs and the practical need for accurate, interpretable, and maintainable classification systems in industry applications.
[100] Transfer Learning via Lexical Relatedness: A Sarcasm and Hate Speech Case Study cs.CL | cs.LGPDF
Angelly Cabrera, Linus Lei, Antonio Ortega
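TL;DR: The paper examines whether sarcasm pre-training improves hate speech detection, especially for implicit hate speech, and shows performance gains for a BERT+BiLSTM model.
Details
Motivation: Detecting hate speech in implicit forms (such as sarcasm and irony) on social media remains difficult. The paper studies whether sarcasm pre-training helps detect both implicit and explicit hate speech.
Result: Sarcasm pre-training improves BERT+BiLSTM recall by 9.7%, AUC by 7.8%, and F1-score by 6% on ETHOS. On the Implicit Hate Corpus, precision improves by 7.8%.
Insight: Sarcasm and hate speech are semantically related, and sarcasm pre-training strengthens a model's ability to capture implicit hate, which helps hate speech detection overall.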
Abstract: Detecting hate speech in non-direct forms, such as irony, sarcasm, and innuendos, remains a persistent challenge for social networks. Although sarcasm and hate speech are regarded as distinct expressions, our work explores whether integrating sarcasm as a pre-training step improves implicit hate speech detection and, by extension, explicit hate speech detection. Incorporating samples from ETHOS, Sarcasm on Reddit, and Implicit Hate Corpus, we devised two training strategies to compare the effectiveness of sarcasm pre-training on a CNN+LSTM and BERT+BiLSTM model. The first strategy is a single-step training approach, where a model trained only on sarcasm is then tested on hate speech. The second strategy uses sequential transfer learning to fine-tune models for sarcasm, implicit hate, and explicit hate. Our results show that sarcasm pre-training improved the BERT+BiLSTM’s recall by 9.7%, AUC by 7.8%, and F1-score by 6% on ETHOS. On the Implicit Hate Corpus, precision increased by 7.8% when tested only on implicit samples. By incorporating sarcasm into the training process, we show that models can more effectively detect both implicit and explicit hate.
cs.HC [Back]
[101] Prompting with Sign Parameters for Low-resource Sign Language Instruction Generation cs.HC | cs.CVPDF
Md Tariquzzaman, Md Farhan Ishmam, Saiyma Sittul Muna, Md Kamrul Hasan, Hasan Mahmud
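TL;DR: This paper proposes an instruction generation method for low-resource sign languages, introducing Sign Parameter-Infused (SPI) prompting to improve zero-shot performance, and evaluates it on the newly built Bengali sign language dataset BdSLIG.
Details
Motivation: Many sign languages remain under-resourced in AI, which limits communication for the deaf and hard-of-hearing community. The paper aims to generate structured sign language learning instructions that help non-signers learn and interact.
Result: Experiments on the BdSLIG dataset confirm the advantage of SPI prompting, particularly on low-resource tasks and long-tail visual concepts.
Insight: Structured prompting such as SPI prompting can improve model performance in low-resource domains and suggests improvements for similar tasks. Inclusivity and technical progress in sign language learning are the core goals of the work.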
Abstract: Sign Language (SL) enables two-way communication for the deaf and hard-of-hearing community, yet many sign languages remain under-resourced in the AI space. Sign Language Instruction Generation (SLIG) produces step-by-step textual instructions that enable non-SL users to imitate and learn SL gestures, promoting two-way interaction. We introduce BdSLIG, the first Bengali SLIG dataset, used to evaluate Vision Language Models (VLMs) (i) on under-resourced SLIG tasks, and (ii) on long-tail visual concepts, as Bengali SL is unlikely to appear in the VLM pre-training data. To enhance zero-shot performance, we introduce Sign Parameter-Infused (SPI) prompting, which integrates standard SL parameters, like hand shape, motion, and orientation, directly into the textual prompts. Subsuming standard sign parameters into the prompt makes the instructions more structured and reproducible than free-form natural text from vanilla prompting. We envision that our work would promote inclusivity and advancement in SL learning systems for the under-resourced communities.
cs.RO [Back]
[102] GelSLAM: A Real-time, High-Fidelity, and Robust 3D Tactile SLAM System cs.RO | cs.CVPDF
Hung-Jui Huang, Mohammad Amin Mirzaee, Michael Kaess, Wenzhen Yuan
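TL;DR: GelSLAM is a real-time 3D SLAM system that relies solely on tactile sensing for high-precision object pose estimation and shape reconstruction.
Details
Motivation: Compared with vision-based methods, tactile sensing offers advantages in precision and robustness to occlusion, especially for tracking and reconstructing objects in contact. Traditional point cloud methods perform poorly on low-texture objects, a weakness that tactile sensing can compensate for.
Result: The system tracks in real time with low error and minimal drift, and reaches submillimeter reconstruction accuracy even on low-texture objects such as wooden tools.
Insight: Tactile sensing is not limited to local contact tasks; it extends to global, long-horizon spatial perception, offering a new approach for high-precision manipulation tasks such as in-hand object interaction.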
Abstract: Accurately perceiving an object’s pose and shape is essential for precise grasping and manipulation. Compared to common vision-based methods, tactile sensing offers advantages in precision and immunity to occlusion when tracking and reconstructing objects in contact. This makes it particularly valuable for in-hand and other high-precision manipulation tasks. In this work, we present GelSLAM, a real-time 3D SLAM system that relies solely on tactile sensing to estimate object pose over long periods and reconstruct object shapes with high fidelity. Unlike traditional point cloud-based approaches, GelSLAM uses tactile-derived surface normals and curvatures for robust tracking and loop closure. It can track object motion in real time with low error and minimal drift, and reconstruct shapes with submillimeter accuracy, even for low-texture objects such as wooden tools. GelSLAM extends tactile sensing beyond local contact to enable global, long-horizon spatial perception, and we believe it will serve as a foundation for many precise manipulation tasks involving interaction with objects in hand. The video demo is available on our website: https://joehjhuang.github.io/gelslam.
eess.IV [Back]
[103] Cross-Attention Multimodal Fusion for Breast Cancer Diagnosis: Integrating Mammography and Clinical Data with Explainability eess.IV | cs.CV | cs.LGPDF
Muhaisin Tiyumba Nantogmah, Abdul-Barik Alhassan, Salamudeen Alhassan
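TL;DR: The paper proposes a cross-attention-based multimodal fusion method that combines mammograms with clinical data for breast cancer diagnosis, and uses explainable AI to improve trust in the model.
Details
Motivation: Existing computer-aided diagnosis systems often rely only on mammogram features and fail to exploit valuable information in clinical data. The paper explores how to fuse clinical features with mammograms effectively and uses explainability methods to improve model reliability.
Result: On the TCGA and CBIS-DDSM datasets, the model achieves an AUC-ROC of 0.98, accuracy of 0.96, F1-score of 0.94, precision of 0.92, and recall of 0.95.
Insight: Clinical data substantially improves breast lesion classification, and cross-attention fuses multimodal data effectively. Explainability methods give intuitive explanations of model decisions, improving clinical usefulness.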
Abstract: A precise assessment of breast lesion risk can greatly reduce that risk and help physicians choose the best course of action. To categorise breast lesions, the majority of current computer-aided systems use only characteristics from mammograms. Although this method is practical, it does not fully utilise the valuable information in clinical reports to attain the best results. Compared to using mammography alone, do clinical features greatly enhance the categorisation of breast lesions? How can clinical features and mammograms be combined most effectively? In what ways can explainable AI approaches improve the interpretability and reliability of models used to diagnose breast cancer? Answering these basic questions requires a comprehensive investigation. To integrate mammography and categorical clinical characteristics, this study examines a number of multimodal deep networks based on feature concatenation, co-attention, and cross-attention. The model achieved an AUC-ROC of 0.98, accuracy of 0.96, F1-score of 0.94, precision of 0.92, and recall of 0.95 when tested on publicly accessible datasets (TCGA and CBIS-DDSM).
[104] Decoding MGMT Methylation: A Step Towards Precision Medicine in Glioblastoma eess.IV | cs.CVPDF
Hafeez Ur Rehman, Sumaiya Fazal, Moutaz Alazab, Ali Baydoun
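TL;DR: This paper proposes CAMP, a convolutional autoencoder framework based on adaptive sparse penalties, to predict MGMT gene methylation status and improve personalized treatment strategies for glioblastoma, with a substantial gain in predictive accuracy.
Details
Motivation: Glioblastoma is highly aggressive and hard to treat. MGMT methylation status is a key biomarker for predicting treatment response, but the predictive accuracy of current non-invasive imaging techniques is limited.
Result: On benchmark datasets, CAMP achieves an accuracy of 0.97, specificity of 0.98, and sensitivity of 0.97, significantly outperforming existing methods.
Insight: Adaptive sparse penalties handle contrast differences and tumor heterogeneity in MRI images effectively, providing a new tool for precision medicine.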
Abstract: Glioblastomas, constituting over 50% of malignant brain tumors, are highly aggressive brain tumors that pose substantial treatment challenges due to their rapid progression and resistance to standard therapies. The methylation status of the O-6-Methylguanine-DNA Methyltransferase (MGMT) gene is a critical biomarker for predicting patient response to treatment, particularly with the alkylating agent temozolomide. However, accurately predicting MGMT methylation status using non-invasive imaging techniques remains challenging due to the complex and heterogeneous nature of glioblastomas, which includes uneven contrast, variability within lesions, and irregular enhancement patterns. This study introduces the Convolutional Autoencoders for MGMT Methylation Status Prediction (CAMP) framework, which is based on adaptive sparse penalties to enhance predictive accuracy. The CAMP framework operates in two phases: first, generating synthetic MRI slices through a tailored autoencoder that effectively captures and preserves intricate tissue and tumor structures across different MRI modalities; second, predicting MGMT methylation status using a convolutional neural network enhanced by adaptive sparse penalties. The adaptive sparse penalty dynamically adjusts to variations in the data, such as contrast differences and tumor locations in MR images. Our method excels in MRI image synthesis, preserving brain tissue, fat, and individual tumor structures across all MRI modalities. Validated on benchmark datasets, CAMP achieved an accuracy of 0.97, specificity of 0.98, and sensitivity of 0.97, significantly outperforming existing methods. These results demonstrate the potential of the CAMP framework to improve the interpretation of MRI data and contribute to more personalized treatment strategies for glioblastoma patients.
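The abstract above reports accuracy, specificity, and sensitivity. As a reminder of how these three figures relate, the sketch below derives them from the four cells of a binary confusion matrix, using the standard definitions. The counts are hypothetical and are not taken from the paper; the function name is also an assumption made for illustration.

```python
def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity (recall on positives), and specificity (recall on negatives)."""
    total = tp + fn + tn + fp
    if total == 0 or (tp + fn) == 0 or (tn + fp) == 0:
        raise ValueError("need at least one positive and one negative case")
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # fraction of true positives that were detected
        "specificity": tn / (tn + fp),  # fraction of true negatives that were detected
    }


# Hypothetical counts: 50 positive (e.g. methylated) and 50 negative cases.
metrics = binary_metrics(tp=45, fn=5, tn=40, fp=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Reporting sensitivity and specificity alongside accuracy matters when classes are imbalanced, because accuracy alone can look high for a model that rarely predicts the minority class.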
[105] Disentangled Multi-modal Learning of Histology and Transcriptomics for Cancer Characterization eess.IV | cs.AI | cs.CVPDF
Yupei Zhang, Xiaofei Wang, Anran Liu, Lequan Yu, Chao Li
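TL;DR: The paper proposes a disentangled multi-modal learning framework that combines histology and transcriptomics data, using decomposition and coordination strategies to address multi-modal heterogeneity, multi-scale integration, and dependence on paired data, and substantially improves diagnostic and prognostic performance in cancer characterization.
Details
Motivation: Existing methods are limited by multi-modal heterogeneity, insufficient multi-scale integration, and reliance on paired data, which restricts the clinical applicability of multi-modal learning.
Result: The framework outperforms existing methods on cancer diagnosis, prognosis, and survival prediction tasks.
Insight: Disentangled learning and multi-modal coordination strategies substantially improve the analysis and use of multi-modal data, with particular value in medicine.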
Abstract: Histopathology remains the gold standard for cancer diagnosis and prognosis. With the advent of transcriptome profiling, multi-modal learning combining transcriptomics with histology offers more comprehensive information. However, existing multi-modal approaches are challenged by intrinsic multi-modal heterogeneity, insufficient multi-scale integration, and reliance on paired data, restricting clinical applicability. To address these challenges, we propose a disentangled multi-modal framework with four contributions: 1) To mitigate multi-modal heterogeneity, we decompose WSIs and transcriptomes into tumor and microenvironment subspaces using a disentangled multi-modal fusion module, and introduce a confidence-guided gradient coordination strategy to balance subspace optimization. 2) To enhance multi-scale integration, we propose an inter-magnification gene-expression consistency strategy that aligns transcriptomic signals across WSI magnifications. 3) To reduce dependency on paired data, we propose a subspace knowledge distillation strategy enabling transcriptome-agnostic inference through a WSI-only student model. 4) To improve inference efficiency, we propose an informative token aggregation module that suppresses WSI redundancy while preserving subspace semantics. Extensive experiments on cancer diagnosis, prognosis, and survival prediction demonstrate our superiority over state-of-the-art methods across multiple settings. Code is available at https://github.com/helenypzhang/Disentangled-Multimodal-Learning.
[106] A Disease-Centric Vision-Language Foundation Model for Precision Oncology in Kidney Cancer eess.IV | cs.AI | cs.CVPDF
Yuhui Tao, Zhongwei Zhao, Zilong Wang, Xufang Luo, Feng Chen
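TL;DR: The paper proposes RenalCLIP, a vision-language foundation model for precision oncology in kidney cancer, which uses a two-stage pre-training strategy with contrastive learning to substantially improve performance on diagnostic and prognostic tasks.
Details
Motivation: Non-invasive assessment of kidney cancer is a key challenge, as diagnostic uncertainty often leads to overtreatment of benign or indolent tumors.
Result: In the TCIA cohort, RenalCLIP reaches a C-index of 0.726 for recurrence-free survival prediction, about 20% above the baselines, and needs only 20% of the training data to match the peak performance of the baseline models.
Insight: RenalCLIP improves diagnostic and prognostic performance and also shows advantages in data efficiency and multi-task generalization.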
Abstract: The non-invasive assessment of increasingly incidentally discovered renal masses is a critical challenge in urologic oncology, where diagnostic uncertainty frequently leads to the overtreatment of benign or indolent tumors. In this study, we developed and validated RenalCLIP, a visual-language foundation model for the characterization, diagnosis, and prognosis of renal masses, using a dataset of 27,866 CT scans from 8,809 patients across nine Chinese medical centers and the public TCIA cohort. The model was developed via a two-stage pre-training strategy that first enhances the image and text encoders with domain-specific knowledge before aligning them through a contrastive learning objective, to create robust representations for superior generalization and diagnostic precision. RenalCLIP achieved better performance and superior generalizability across 10 core tasks spanning the full clinical workflow of kidney cancer, including anatomical assessment, diagnostic classification, and survival prediction, compared with other state-of-the-art general-purpose CT foundation models. In particular, for a complicated task such as recurrence-free survival prediction in the TCIA cohort, RenalCLIP achieved a C-index of 0.726, representing a substantial improvement of approximately 20% over the leading baselines. Furthermore, RenalCLIP’s pre-training imparted remarkable data efficiency; in the diagnostic classification task, it needs only 20% of the training data to reach the peak performance of all baseline models, even after they were fully fine-tuned on 100% of the data. Additionally, it achieved superior performance in report generation, image-text retrieval, and zero-shot diagnosis tasks. Our findings establish that RenalCLIP provides a robust tool with the potential to enhance diagnostic accuracy, refine prognostic stratification, and personalize the management of patients with kidney cancer.
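The C-index quoted in the abstract above is a concordance measure for survival models. The sketch below is a simplified version of Harrell's concordance index for right-censored data, not the evaluation code used in the paper. A pair of patients counts as comparable when the one with the shorter follow-up time had an observed event, ties in risk score count as half, and ties in time are ignored for brevity. The patient data and the function name are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable pairs in which the higher-risk patient fails first."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Patient i must have an observed event strictly before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical cohort: follow-up time in months, event flag (1 = recurrence
# observed, 0 = censored), and a model's predicted risk score.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk_scores = [0.9, 0.7, 0.8, 0.1]

print(f"C-index = {concordance_index(times, events, risk_scores):.2f}")  # prints "C-index = 0.80"
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives some context for a reported C-index of 0.726.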
cs.CR [Back]
[107] Unveiling Unicode’s Unseen Underpinnings in Undermining Authorship Attribution cs.CR | cs.CL | cs.IRPDF
Robert Dilworth
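TL;DR: The paper discusses how users of public communication channels can still be identified through their text (stylometric analysis) even when they take anonymization measures, and proposes a counter-strategy based on Unicode steganography.
Details
Motivation: Even when users take many anonymization measures, the text itself can expose their identity through stylometry, which calls for research into adversarial strategies to protect privacy.
Result: The paper shows how Unicode steganography can hide or alter stylistic features in text to counter stylometric analysis.
Insight: Even the most careful anonymization can be defeated by writing style, and Unicode steganography offers a possible new route to protecting privacy.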
Abstract: When using a public communication channel – whether formal or informal, such as commenting or posting on social media – end users have no expectation of privacy: they compose a message and broadcast it for the world to see. Even if an end user takes utmost precautions to anonymize their online presence – using an alias or pseudonym; masking their IP address; spoofing their geolocation; concealing their operating system and user agent; deploying encryption; registering with a disposable phone number or email; disabling non-essential settings; revoking permissions; and blocking cookies and fingerprinting – one obvious element still lingers: the message itself. Assuming they avoid lapses in judgment or accidental self-exposure, there should be little evidence to validate their actual identity, right? Wrong. The content of their message – necessarily open for public consumption – exposes an attack vector: stylometric analysis, or author profiling. In this paper, we dissect the technique of stylometry, discuss an antithetical counter-strategy in adversarial stylometry, and devise enhancements through Unicode steganography.
cs.LO [Back]
[108] Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs cs.LO | cs.AI | cs.CL | cs.LGPDF
Terry Jingchen Zhang, Wenyuan Jiang, Rongchuan Liu, Yisong Wang, Junran Yang
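TL;DR: The paper uses theoretical computer science (TCS) to generate scalable formal theorem proving challenges and demonstrates their value for automated reasoning research.
Details
Motivation: The limitations of current formal theorem proving (FTP) datasets (high cost, scarcity) hinder progress in evaluating the reasoning capabilities of large language models, so a scalable source of challenging problems is needed.
Result: Experiments show that a frontier model (DeepSeekProver-V2-671B) achieves 57.5% success on Busy Beaver problems but only 12% on Mixed Boolean Arithmetic problems, revealing the difficulty of long-form proof generation.
Insight: Long-form proof generation remains a significant challenge for models even on problems that are computationally easy to verify, and TCS offers a rich resource for automated reasoning research.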
Abstract: Formal theorem proving (FTP) has emerged as a critical foundation for evaluating the reasoning capabilities of large language models, enabling automated verification of mathematical proofs at scale. However, progress has been constrained by limited datasets due to the high cost of manual curation and the scarcity of challenging problems with verified formal-informal correspondences. We propose leveraging theoretical computer science (TCS) as a scalable source of rigorous proof problems, where algorithmic definitions enable automated generation of arbitrarily many challenging theorem-proof pairs. We demonstrate this approach on two TCS domains: Busy Beaver problems, which involve proving bounds on Turing machine halting behavior, and Mixed Boolean Arithmetic problems, which combine logical and arithmetic reasoning. Our framework automatically synthesizes problems with parallel formal (Lean4) and informal (Markdown) specifications, creating a scalable pipeline for generating verified proof challenges. Evaluation on frontier models reveals substantial gaps in automated theorem proving: while DeepSeekProver-V2-671B achieves 57.5% success on Busy Beaver problems, it manages only 12% on Mixed Boolean Arithmetic problems. These results highlight the difficulty of long-form proof generation even for problems that are computationally easy to verify, demonstrating the value of TCS domains for advancing automated reasoning research.
cs.AI [Back]
[109] Modular Embedding Recomposition for Incremental Learning cs.AI | cs.CV
Aniello Panariello, Emanuele Frascaroli, Pietro Buzzega, Lorenzo Bonicelli, Angelo Porrello
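TL;DR: The paper proposes MoDular Embedding Recomposition (MoDER), which trains multiple textual experts and composes them at inference time to improve the zero-shot classification of pre-trained vision-language models (VLMs) in incremental learning.
Details
Motivation: Pre-trained VLMs show strong zero-shot classification in continual learning (CL), but still need fine-tuning when the downstream task departs significantly from the pre-training domain. Prior methods mainly aim to preserve the zero-shot ability of the VLM; this work goes further and enhances it through modular embedding recomposition.
Result: Across experiments covering 14 datasets, MoDER proves effective and improves the zero-shot classification performance of VLMs in incremental learning.
Insight: Recomposing modular experts can improve zero-shot ability without directly fine-tuning the VLM, offering an efficient new direction for continual learning.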
Abstract: The advent of pre-trained Vision-Language Models (VLMs) has significantly transformed Continual Learning (CL), mainly due to their zero-shot classification abilities. Such proficiency makes VLMs well-suited for real-world applications, enabling robust performance on novel unseen classes without requiring adaptation. However, fine-tuning remains essential when downstream tasks deviate significantly from the pre-training domain. Prior CL approaches primarily focus on preserving the zero-shot capabilities of VLMs during incremental fine-tuning on a downstream task. We take a step further by devising an approach that transforms preservation into enhancement of the zero-shot capabilities of VLMs. Our approach, named MoDular Embedding Recomposition (MoDER), introduces a modular framework that trains multiple textual experts, each specialized in a single seen class, and stores them in a foundational hub. At inference time, for each unseen class, we query the hub and compose the retrieved experts to synthesize a refined prototype that improves classification. We show the effectiveness of our method across two popular zero-shot incremental protocols, Class-IL and MTIL, comprising a total of 14 datasets. The codebase is available at https://github.com/aimagelab/mammoth.
[110] Generative Foundation Model for Structured and Unstructured Electronic Health Records cs.AI | cs.CL
Sonish Sivarajkumar, Hang Zhang, Yuelyu Ji, Maneesh Bilalpur, Xizhi Wu
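TL;DR: The paper presents Generative Deep Patient (GDP), a multimodal foundation model that combines structured and unstructured electronic health record (EHR) data to support both clinical prediction and high-quality clinical narrative generation.
Details
Motivation: EHR data is complex and heterogeneous, spanning structured and unstructured information, yet existing methods serialize numeric EHR data into text and risk losing temporal and quantitative detail. A model is needed that handles the modalities jointly and supports multiple clinical tasks.
Result: On MIMIC-IV, GDP performs well on clinical prediction (heart failure AUROC 0.923, type 2 diabetes AUROC 0.817, 30-day readmission AUROC 0.627) and on narrative generation (ROUGE-L 0.135, BERTScore-F1 0.545).
Insight: A single multimodal foundation model can process EHR data in a unified way, improving both clinical prediction and narrative generation and potentially reducing the hospital documentation workload.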
Abstract: Electronic health records (EHRs) are rich clinical data sources but complex repositories of patient data, spanning structured elements (demographics, vitals, lab results, codes), unstructured clinical notes and other modalities of data. Harnessing this heterogeneity is critical for improving patient outcomes. Recent advances in large language models (LLMs) have enabled foundation models that can learn from multiple data modalities and support clinical tasks. However, most current approaches simply serialize numeric EHR data into text, which risks losing temporal and quantitative detail. We introduce Generative Deep Patient (GDP), a multimodal foundation model that natively encodes structured EHR time-series via a CNN-Transformer encoder and fuses it with unstructured EHRs through cross-modal attention into a LLaMA-based decoder. GDP is trained in two stages: (1) generative pretraining, where it learns to produce clinical narratives from raw patient timelines while also performing masked feature prediction (MFP) and next time-step prediction (NTP) to capture temporal dynamics; and (2) multi-task fine-tuning for clinically meaningful predictions (e.g., heart failure, type 2 diabetes, 30-day readmission). In clinical prediction, GDP demonstrated superior performance on MIMIC-IV: heart failure AUROC = 0.923, type 2 diabetes AUROC = 0.817, and 30-day readmission AUROC = 0.627. For narrative generation, GDP achieved ROUGE-L = 0.135 and BERTScore-F1 = 0.545. In a blinded human evaluation, GDP-Instruct scored highest on faithfulness, fluency, and overall clinical utility, suggesting reduced hospital documentation workload without sacrificing accuracy. Our results demonstrate that a single multimodal foundation model can both predict clinically actionable events and generate high-quality clinical narratives. Furthermore, GDP’s flexible architecture can be extended to additional modalities.
cs.AR [Back]
[111] ASIC-Agent: An Autonomous Multi-Agent System for ASIC Design with Benchmark Evaluation cs.AR | cs.AI | cs.CL | cs.DC | cs.MA
Ahmed Allam, Youssef Mansour, Mohamed Shalan
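TL;DR: The paper presents ASIC-Agent, an autonomous multi-agent system for digital ASIC design tasks that addresses the limitations of LLMs in hardware design by combining specialized sub-agents with a sandbox environment. It also introduces ASIC-Agent-Bench, the first benchmark for agentic systems in hardware design tasks.
Details
Motivation: LLMs on their own cannot execute code, debug, or retain long-term memory, which limits their use in real-world hardware design workflows despite their strength in RTL code generation.
Result: ASIC-Agent, when powered by Claude 4 Sonnet, automates a broad range of ASIC design tasks of varying complexity, showing the potential to significantly accelerate the design workflow.
Insight: Combining a multi-agent system with a sandbox environment and specialized tooling is an effective way to address the limitations of LLMs in hardware design.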
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in Register Transfer Level (RTL) design, enabling high-quality code generation from natural language descriptions. However, LLMs alone face significant limitations in real-world hardware design workflows, including the inability to execute code, lack of debugging capabilities, and absence of long-term memory. To address these challenges, we present ASIC-Agent, an autonomous system designed specifically for digital ASIC design tasks. ASIC-Agent enhances base LLMs with a multi-agent architecture incorporating specialized sub-agents for RTL generation, verification, OpenLane hardening, and Caravel chip integration, all operating within a comprehensive sandbox environment with access to essential hardware design tools. The system leverages a vector database containing documentation, API references, error knowledge, and curated insights from the open-source silicon community. To evaluate ASIC-Agent’s performance, we introduce ASIC-Agent-Bench, the first benchmark specifically designed to assess agentic systems in hardware design tasks. We evaluate ASIC-Agent with various base LLMs, providing quantitative comparisons and qualitative insights into agent behavior across different design scenarios. Our results demonstrate that ASIC-Agent, when powered by Claude 4 Sonnet, successfully automates a broad range of ASIC design tasks spanning varying levels of complexity, showing the potential of significantly accelerating the ASIC design workflow.
[112] Hardwired-Neurons Language Processing Units as General-Purpose Cognitive Substrates cs.AR | cs.CL
Yang Liu, Yi Chen, Yongwei Zhao, Yifan Hao, Zifu Zheng
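TL;DR: The paper proposes the Hardwired-Neurons Language Processing Unit (HNLPU), which physically hardwires LLM weight parameters to greatly improve computational efficiency, and addresses the resulting cost problem with the Metal-Embedding methodology.
Details
Motivation: The energy consumption of LLM inference systems keeps growing, which calls for specialized, energy-efficient language processing units.
Result: HNLPU reaches 36 tokens/J and 249,960 tokens/s, far ahead of the GPU and WSE baselines, with 8.57x better cost-effectiveness and a 230x smaller carbon footprint than H100 clusters.
Insight: Hardware specialization combined with embedding weights in the 3D topology of metal wires can greatly improve the energy efficiency and economics of LLM inference, suggesting a new direction for the design of dedicated language processing units.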
Abstract: The rapid advancement of Large Language Models (LLMs) has established language as a core general-purpose cognitive substrate, driving the demand for specialized Language Processing Units (LPUs) tailored for LLM inference. To overcome the growing energy consumption of LLM inference systems, this paper proposes a Hardwired-Neurons Language Processing Unit (HNLPU), which physically hardwires LLM weight parameters into the computational fabric, achieving several orders of magnitude computational efficiency improvement by extreme specialization. However, a significant challenge still lies in the scale of modern LLMs. An ideal estimation on hardwiring gpt-oss 120 B requires fabricating at least 6 billion dollars of photomask sets, rendering the straightforward solution economically impractical. Addressing this challenge, we propose the novel Metal-Embedding methodology. Instead of embedding weights in a 2D grid of silicon device cells, Metal-Embedding embeds weight parameters into the 3D topology of metal wires. This brings two benefits: (1) a 15x increase in density, and (2) 60 out of 70 layers of photomasks are made homogeneous across chips, including all EUV photomasks. In total, Metal-Embedding reduced the photomask cost by 112x, bringing the Non-Recurring Engineering (NRE) cost of HNLPU into an economically viable range. Experimental results show that HNLPU achieved 249,960 tokens/s (5,555x/85x of GPU/WSE), 36 tokens/J (1,047x/283x of GPU/WSE), 13,232 mm² total die area (29% inscribed rectangular area in a 300 mm wafer), $184M estimated NRE at 5 nm technology. Analysis shows that HNLPU achieved 8.57x cost-effectiveness and 230x carbon footprint reduction compared to H100 clusters, under an annual weight updating assumption.
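The reported figures can be cross-checked with a few lines of arithmetic. The Python sketch below derives quantities the abstract does not state directly: the power implied by dividing throughput by energy efficiency, the GPU baseline implied by the speedup ratios, and the die-area fraction. It assumes that "inscribed rectangular area" means the largest square inscribed in a 300 mm circle; that reading is an assumption, although it reproduces the stated 29%.

```python
# Figures reported in the abstract.
tokens_per_s = 249_960       # throughput
tokens_per_j = 36            # energy efficiency
die_area_mm2 = 13_232        # total die area
wafer_diameter_mm = 300

# (tokens/s) / (tokens/J) = J/s = W, so this is the implied power draw.
implied_power_w = tokens_per_s / tokens_per_j

# GPU baseline implied by the reported 5,555x and 1,047x ratios.
gpu_tokens_per_s = tokens_per_s / 5_555
gpu_tokens_per_j = tokens_per_j / 1_047

# Assumption: the reference area is the largest square inscribed in the
# wafer circle, which has side d / sqrt(2) and therefore area d**2 / 2.
inscribed_area_mm2 = wafer_diameter_mm ** 2 / 2
area_fraction = die_area_mm2 / inscribed_area_mm2

print(f"implied power: {implied_power_w:.0f} W")
print(f"implied GPU baseline: {gpu_tokens_per_s:.1f} tokens/s, "
      f"{gpu_tokens_per_j:.4f} tokens/J")
print(f"die area / inscribed square: {area_fraction:.1%}")
```

Under these assumptions the implied power is roughly 6.9 kW and the die area is about 29.4% of the inscribed square, consistent with the figure quoted in the abstract.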
q-bio.NC [Back]
[113] NeuroKoop: Neural Koopman Fusion of Structural-Functional Connectomes for Identifying Prenatal Drug Exposure in Adolescents q-bio.NC | cs.CV | eess.IV
Badhan Mazumder, Aline Kotoski, Vince D. Calhoun, Dong Hye Ye
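TL;DR: NeuroKoop is a graph neural network-based framework that fuses structural and functional brain networks in a latent space driven by a neural Koopman operator in order to identify prenatal drug exposure (PDE) in adolescents. It performs well on an adolescent cohort from the ABCD dataset and reveals salient structural-functional connections.
Details
Motivation: How prenatal exposure to psychoactive substances such as cannabis affects adolescent brain organization remains unclear, and existing methods fail to fully exploit the complementary features of multimodal neuroimaging data, which limits both biological insight and predictive performance.
Result: On the ABCD dataset, NeuroKoop outperforms the relevant baselines and identifies salient structural-functional connections associated with PDE.
Insight: Koopman theory shows promise for neuroimaging analysis: it can unify the representations of structural and functional brain networks and offers a new perspective on the neurodevelopmental impact of prenatal drug exposure.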
Abstract: Understanding how prenatal exposure to psychoactive substances such as cannabis shapes adolescent brain organization remains a critical challenge, complicated by the complexity of multimodal neuroimaging data and the limitations of conventional analytic methods. Existing approaches often fail to fully capture the complementary features embedded within structural and functional connectomes, constraining both biological insight and predictive performance. To address this, we introduced NeuroKoop, a novel graph neural network-based framework that integrates structural and functional brain networks utilizing neural Koopman operator-driven latent space fusion. By leveraging Koopman theory, NeuroKoop unifies node embeddings derived from source-based morphometry (SBM) and functional network connectivity (FNC) based brain graphs, resulting in enhanced representation learning and more robust classification of prenatal drug exposure (PDE) status. Applied to a large adolescent cohort from the ABCD dataset, NeuroKoop outperformed relevant baselines and revealed salient structural-functional connections, advancing our understanding of the neurodevelopmental impact of PDE.
cs.SE [Back]
[114] AetherCode: Evaluating LLMs’ Ability to Win In Premier Programming Competitions cs.SE | cs.CL
Zihan Wang, Jiaze Chen, Zhicheng Liu, Markus Mak, Yidi Du
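TL;DR: AetherCode is a new benchmark that uses harder programming-competition problems to assess the coding ability of large language models (LLMs) more accurately and to address the shortcomings of existing benchmarks.
Details
Motivation: Existing benchmarks understate the gap between LLMs and elite human programmers, mainly because their problems are not difficult enough and their test cases are of low quality.
Result: AetherCode provides a more faithful measure of LLM coding ability and sets a new standard for evaluating it.
Insight: Existing benchmarks may overstate LLM capability; more challenging evaluation is needed to reflect the true level of these models.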
Abstract: Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). Despite impressive progress on existing benchmarks, we argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers. This gap arises from two key limitations: insufficient difficulty and scope of benchmark problems, and evaluation bias from low-quality test cases. To address these shortcomings, we present AetherCode, a new benchmark that draws problems from premier programming competitions such as IOI and ICPC, offering broader coverage and higher difficulty. AetherCode further incorporates comprehensive, expert-validated test suites built through a hybrid of automated generation and human curation, ensuring rigorous and reliable assessment. By combining challenging problem design with robust evaluation, AetherCode provides a more faithful measure of LLM capabilities and sets a new standard for future research in code reasoning.
cs.CY [Back]
[115] PediatricsMQA: a Multi-modal Pediatrics Question Answering Benchmark cs.CY | cs.AI | cs.CL | cs.GR | cs.MM
Adil Bahaj, Mounir Ghogho
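TL;DR: The paper introduces PediatricsMQA, a new multi-modal pediatric question-answering benchmark, to address the systematic age bias of large language models and vision-augmented language models on pediatric tasks.
Details
Motivation: Existing medical models perform worse on pediatric tasks, reflecting the under-representation of pediatric studies in medical research. To address this bias and promote equity in pediatric AI, the authors build a multi-modal pediatric question-answering benchmark.
Result: Experiments show that existing models degrade sharply on younger cohorts, highlighting the need for age-aware methods.
Insight: The paper exposes age bias in medical AI and stresses the importance of tailoring models to pediatric tasks.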
Abstract: Large language models (LLMs) and vision-augmented LLMs (VLMs) have significantly advanced medical informatics, diagnostics, and decision support. However, these models exhibit systematic biases, particularly age bias, compromising their reliability and equity. This is evident in their poorer performance on pediatric-focused text and visual question-answering tasks. This bias reflects a broader imbalance in medical research, where pediatric studies receive less funding and representation despite the significant disease burden in children. To address these issues, a new comprehensive multi-modal pediatric question-answering benchmark, PediatricsMQA, has been introduced. It consists of 3,417 text-based multiple-choice questions (MCQs) covering 131 pediatric topics across seven developmental stages (prenatal to adolescent) and 2,067 vision-based MCQs using 634 pediatric images from 67 imaging modalities and 256 anatomical regions. The dataset was developed using a hybrid manual-automatic pipeline, incorporating peer-reviewed pediatric literature, validated question banks, existing benchmarks, and existing QA resources. Evaluating state-of-the-art open models, we find dramatic performance drops in younger cohorts, highlighting the need for age-aware methods to ensure equitable AI support in pediatric care.
cs.LG [Back]
[116] RotaTouille: Rotation Equivariant Deep Learning for Contours cs.LG | cs.CV
Odin Hoff Gardaa, Nello Blaser
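TL;DR: RotaTouille is a deep learning framework for contour data that achieves equivariance to rotations and cyclic shifts through complex-valued circular convolution, and performs well on shape classification, reconstruction, and contour regression.
Details
Motivation: Contour data (closed planar curves) appears in many domains, and rotating the input typically rotates the output correspondingly. Models should therefore be equivariant to rotations and to cyclic shifts of the starting point.
Result: Experiments on shape classification, reconstruction, and contour regression demonstrate the effectiveness of RotaTouille.
Insight: Complex-valued representations and circular convolution handle the equivariance requirements of contour data efficiently and suggest an approach for similar tasks.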
Abstract: Contours or closed planar curves are common in many domains. For example, they appear as object boundaries in computer vision, isolines in meteorology, and the orbits of rotating machinery. In many cases when learning from contour data, planar rotations of the input will result in correspondingly rotated outputs. It is therefore desirable that deep learning models be rotationally equivariant. In addition, contours are typically represented as an ordered sequence of edge points, where the choice of starting point is arbitrary. It is therefore also desirable for deep learning methods to be equivariant under cyclic shifts. We present RotaTouille, a deep learning framework for learning from contour data that achieves both rotation and cyclic shift equivariance through complex-valued circular convolution. We further introduce and characterize equivariant non-linearities, coarsening layers, and global pooling layers to obtain invariant representations for downstream tasks. Finally, we demonstrate the effectiveness of RotaTouille through experiments in shape classification, reconstruction, and contour regression.
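The two equivariance properties can be checked numerically. The sketch below is a plain-Python illustration, not the RotaTouille implementation: it represents each contour point as a complex number, applies a complex circular convolution, and verifies that rotating the input (multiplying by e^{iθ}) or changing the starting point (a cyclic shift) transforms the output in the same way. The contour, kernel, and function names are made up for the example.

```python
import cmath


def circular_conv(z, w):
    """Complex circular convolution: out[i] = sum_k w[k] * z[(i - k) mod N]."""
    n = len(z)
    return [sum(w[k] * z[(i - k) % n] for k in range(len(w))) for i in range(n)]


def rotate(z, theta):
    """Rotate every contour point about the origin by angle theta."""
    factor = cmath.exp(1j * theta)
    return [factor * p for p in z]


def cyclic_shift(z, s):
    """Re-index the contour so that point s becomes the starting point."""
    return z[s:] + z[:s]


def close(a, b, tol=1e-9):
    return len(a) == len(b) and all(abs(x - y) < tol for x, y in zip(a, b))


# Made-up closed contour (points as complex numbers) and complex kernel.
contour = [2 + 0j, 1 + 1j, 0 + 2j, -1 + 1j, -2 + 0j, 0 - 1j]
kernel = [0.5 + 0.2j, -0.3j, 0.1 + 0j]
theta, s = 0.7, 2

rot_equivariant = close(
    circular_conv(rotate(contour, theta), kernel),
    rotate(circular_conv(contour, kernel), theta),
)
shift_equivariant = close(
    circular_conv(cyclic_shift(contour, s), kernel),
    cyclic_shift(circular_conv(contour, kernel), s),
)
print(rot_equivariant, shift_equivariant)
```

Rotation equivariance follows from linearity over the complex numbers, and shift equivariance from the modular indexing; the non-linearities and pooling layers described in the abstract need their own equivariant constructions, which this sketch does not cover.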
[117] TinyML Towards Industry 4.0: Resource-Efficient Process Monitoring of a Milling Machine cs.LG | cs.CV | cs.ET | cs.SY | eess.SP | eess.SY | I.2.1; I.5.4; C.5.3; C.3
Tim Langer, Matthias Widra, Volkhard Beyer
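TL;DR: The paper presents a complete TinyML flow for resource-efficient process monitoring of an industrial milling machine, achieving high accuracy and low energy consumption with a quantized CNN.
Details
Motivation: Long-serving machines need a retrofit path to smart monitoring in Industry 4.0, and TinyML is well suited to this because of its resource efficiency.
Result: On an ARM Cortex M4F microcontroller, the model reaches 100% test accuracy at 15.4 ms inference time and 1.462 mJ per inference.
Insight: TinyML has strong potential for industrial process monitoring, and quantization is key to balancing resource use against performance.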
Abstract: In the context of industry 4.0, long-serving industrial machines can be retrofitted with process monitoring capabilities for future use in a smart factory. One possible approach is the deployment of wireless monitoring systems, which can benefit substantially from the TinyML paradigm. This work presents a complete TinyML flow from dataset generation, to machine learning model development, up to implementation and evaluation of a full preprocessing and classification pipeline on a microcontroller. After a short review on TinyML in industrial process monitoring, the creation of the novel MillingVibes dataset is described. The feasibility of a TinyML system for structure-integrated process quality monitoring could be shown by the development of an 8-bit-quantized convolutional neural network (CNN) model with 12.59kiB parameter storage. A test accuracy of 100.0% could be reached at 15.4ms inference time and 1.462mJ per quantized CNN inference on an ARM Cortex M4F microcontroller, serving as a reference for future TinyML process monitoring solutions.
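As a rough illustration of 8-bit quantization and of what the reported figures imply, here is a minimal Python sketch. It uses symmetric per-tensor int8 quantization, which is a common scheme but an assumption here, since the abstract does not say which scheme the authors used. The weights are made-up values. The parameter count assumes one byte per parameter with no extra storage for scales, and the average power assumes the 1.462 mJ is spent over the same 15.4 ms inference window.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w ~= scale * q, with q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to approximate floating-point weights."""
    return [scale * v for v in q]


weights = [0.52, -1.27, 0.003, 0.9, -0.41]   # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Arithmetic on the reported figures (see the assumptions in the lead-in).
approx_params = 12.59 * 1024          # bytes of storage, one byte per parameter
avg_power_mw = 1.462 / 15.4 * 1000    # mJ / ms = W, converted to mW

print(q, f"max error {max_error:.4f}")
print(f"about {approx_params:.0f} parameters, about {avg_power_mw:.1f} mW average")
```

The rounding error per weight is bounded by half the scale, which is why small weights such as 0.003 collapse to zero in this example.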
[118] PGF-Net: A Progressive Gated-Fusion Framework for Efficient Multimodal Sentiment Analysis cs.LG | cs.CL
Bin Wen, Tien-Ping Tan
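TL;DR: PGF-Net is a multimodal sentiment analysis framework that combines a progressive gated-fusion mechanism with a parameter-efficient fine-tuning strategy to obtain a high-performing, lightweight model.
Details
Motivation: Multimodal sentiment analysis needs efficient cross-modal fusion with low computational overhead in order to suit resource-limited scenarios.
Result: On the MOSI dataset, PGF-Net reaches an MAE of 0.691 and an F1-Score of 86.9% with only 3.09M trainable parameters.
Insight: Progressive fusion and dynamic gating improve performance and interpretability, while the hybrid PEFT strategy greatly reduces the number of trainable parameters.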
Abstract: We introduce PGF-Net (Progressive Gated-Fusion Network), a novel deep learning framework designed for efficient and interpretable multimodal sentiment analysis. Our framework incorporates three primary innovations. Firstly, we propose a Progressive Intra-Layer Fusion paradigm, where a Cross-Attention mechanism empowers the textual representation to dynamically query and integrate non-linguistic features from audio and visual streams within the deep layers of a Transformer encoder. This enables a deeper, context-dependent fusion process. Secondly, the model incorporates an Adaptive Gated Arbitration mechanism, which acts as a dynamic controller to balance the original linguistic information against the newly fused multimodal context, ensuring stable and meaningful integration while preventing noise from overwhelming the signal. Lastly, a hybrid Parameter-Efficient Fine-Tuning (PEFT) strategy is employed, synergistically combining global adaptation via LoRA with local refinement through Post-Fusion Adapters. This significantly reduces trainable parameters, making the model lightweight and suitable for resource-limited scenarios. These innovations are integrated into a hierarchical encoder architecture, enabling PGF-Net to perform deep, dynamic, and interpretable multimodal sentiment analysis while maintaining exceptional parameter efficiency. Experimental results on MOSI dataset demonstrate that our proposed PGF-Net achieves state-of-the-art performance, with a Mean Absolute Error (MAE) of 0.691 and an F1-Score of 86.9%. Notably, our model achieves these results with only 3.09M trainable parameters, showcasing a superior balance between performance and computational efficiency.
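The role of a gate that balances the original text representation against the fused multimodal context can be sketched in a few lines of Python. This is a generic element-wise gated fusion, not the paper's Adaptive Gated Arbitration mechanism, whose exact formulation the abstract does not give. The per-dimension gate weights, biases, and vectors below are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(text, fused, gate_weights, gate_bias):
    """Blend two vectors with a per-dimension gate.

    g = sigmoid(w_t * text + w_f * fused + b)
    out = g * fused + (1 - g) * text
    """
    out = []
    for t, f, (w_t, w_f), b in zip(text, fused, gate_weights, gate_bias):
        g = sigmoid(w_t * t + w_f * f + b)
        out.append(g * f + (1.0 - g) * t)
    return out


text = [1.0, -2.0]     # hypothetical text features
fused = [3.0, 0.0]     # hypothetical cross-attention output
zero_w = [(0.0, 0.0), (0.0, 0.0)]

closed = gated_fusion(text, fused, zero_w, [-50.0, -50.0])   # gate near 0
half = gated_fusion(text, fused, zero_w, [0.0, 0.0])         # gate exactly 0.5
opened = gated_fusion(text, fused, zero_w, [50.0, 50.0])     # gate near 1
print(closed, half, opened)
```

With the gate closed the output stays at the text features, which is the sense in which gating can keep a noisy modality from overwhelming the linguistic signal; in a trained model the gate parameters would be learned rather than set by hand.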
[119] AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs cs.LG | cs.CL
Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee
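TL;DR: The paper proposes a learning paradigm for LLM agents that requires no fine-tuning of the underlying large language model, instantiated as AgentFly, and achieves efficient continual adaptation through memory-based online reinforcement learning.
Details
Motivation: Existing methods either rely on static, handcrafted reflection workflows or require computationally expensive gradient updates of LLM parameters, so neither offers low-cost continual adaptation.
Result: AgentFly reaches 87.88% Pass@3 on the GAIA validation set and 79.40% on the test set, and 66.6% F1 and 80.4% PM on the DeepResearcher dataset.
Insight: The approach offers a scalable path for LLM agents to learn continuously in real time through a memory mechanism, without gradient updates.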
Abstract: In this paper, we introduce a novel learning paradigm for adaptive Large Language Model (LLM) agents that eliminates the need for fine-tuning the underlying LLMs. Existing approaches are often either rigid, relying on static, handcrafted reflection workflows, or computationally intensive, requiring gradient updates of LLM model parameters. In contrast, our method enables low-cost continual adaptation via memory-based online reinforcement learning. We formalise this as a Memory-augmented Markov Decision Process (M-MDP), equipped with a neural case-selection policy to guide action decisions. Past experiences are stored in an episodic memory, either differentiable or non-parametric. The policy is continually updated based on environmental feedback through a memory rewriting mechanism, whereas policy improvement is achieved through efficient memory reading (retrieval). We instantiate our agent model in the deep research setting, namely AgentFly, which attains top-1 on GAIA validation (87.88% Pass@3) and 79.40% on the test set. It reaches 66.6% F1 and 80.4% PM on the DeepResearcher dataset, outperforming the state-of-the-art training-based method, while case-based memory adds 4.7% to 9.6% absolute points on out-of-distribution tasks. Our approach offers a scalable and efficient pathway for developing generalist LLM agents capable of continuous, real-time learning without gradient updates, advancing machine learning towards open-ended skill acquisition and deep research scenarios. The code is available at https://github.com/Agent-on-the-Fly/AgentFly.
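The read and write cycle of a non-parametric episodic memory can be illustrated with a short Python sketch. It is not the AgentFly code: cosine-similarity nearest neighbours stand in for the paper's neural case-selection policy, and the class, function names, embeddings, and action labels are all hypothetical. Adaptation happens only by appending cases to memory; no model weights are touched.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class EpisodicMemory:
    """Non-parametric case memory holding (embedding, action, reward) tuples."""

    def __init__(self):
        self.cases = []

    def write(self, embedding, action, reward):
        """Memory rewriting: record the outcome of an attempted action."""
        self.cases.append((embedding, action, reward))

    def read(self, query, k=2):
        """Memory reading: return the k stored cases most similar to the query."""
        ranked = sorted(self.cases, key=lambda c: cosine(query, c[0]), reverse=True)
        return ranked[:k]


def choose_action(memory, query, k=2):
    """Pick the action with the highest summed reward among the k nearest cases."""
    totals = {}
    for _, action, reward in memory.read(query, k):
        totals[action] = totals.get(action, 0.0) + reward
    return max(totals, key=totals.get) if totals else None


memory = EpisodicMemory()
memory.write([1.0, 0.0], "web_search", 1.0)   # succeeded on a similar task
memory.write([0.0, 1.0], "run_code", 1.0)     # succeeded on a different task
memory.write([0.9, 0.1], "ask_user", 0.0)     # failed on a similar task

print(choose_action(memory, [1.0, 0.05]))
```

For the query above, the two nearest cases are the successful "web_search" and the failed "ask_user", so the sketch selects "web_search"; writing the outcome of each new attempt back to memory then changes future choices.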
[120] Retrieval Enhanced Feedback via In-context Neural Error-book cs.LG | cs.AI | cs.CL
Jongyeop Hyun, Bumsoo Kim
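TL;DR: The paper proposes REFINE, a framework that improves the reasoning of multimodal large language models through retrieval-enhanced feedback and structured error analysis.
Details
Motivation: Existing methods lack a structured framework for analyzing and mitigating errors, particularly in multimodal large language models, where integrating visual and textual inputs adds complexity.
Result: Experiments show that REFINE delivers a substantial speedup, lower computational cost, and successful generalization.
Insight: Structured error analysis and targeted feedback are key to better multimodal reasoning, and optimized retrieval markedly improves efficiency.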
Abstract: Recent advancements in Large Language Models (LLMs) have significantly improved reasoning capabilities, with in-context learning (ICL) emerging as a key technique for adaptation without retraining. While previous works have focused on leveraging correct examples, recent research highlights the importance of learning from errors to enhance performance. However, existing methods lack a structured framework for analyzing and mitigating errors, particularly in Multimodal Large Language Models (MLLMs), where integrating visual and textual inputs adds complexity. To address this issue, we propose REFINE: Retrieval-Enhanced Feedback via In-context Neural Error-book, a teacher-student framework that systematically structures errors and provides targeted feedback. REFINE introduces three systematic queries to construct structured feedback – Feed-Target, Feed-Check, and Feed-Path – to enhance multimodal reasoning by prioritizing relevant visual information, diagnosing critical failure points, and formulating corrective actions. Unlike prior approaches that rely on redundant retrievals, REFINE optimizes structured feedback retrieval, improving inference efficiency, token usage, and scalability. Our results demonstrate substantial speedup, reduced computational costs, and successful generalization, highlighting REFINE’s potential for enhancing multimodal reasoning.
[121] FLAMES: Improving LLM Math Reasoning via a Fine-Grained Analysis of the Data Synthesis Pipeline cs.LG | cs.AI | cs.CL
Parker Seegmiller, Kartik Mehta, Soumya Saha, Chenyang Tao, Shereen Oraby
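TL;DR: The paper presents FLAMES, a framework for systematically analyzing and optimizing synthesis strategies for math reasoning data. It finds that balancing difficulty and diversity matters, designs new data synthesis strategies, and improves performance on several math benchmarks.
Details
Motivation: Prior work on improving LLM math reasoning with synthetic data uses differing setups that prevent direct comparison, leaving the role of individual factors in the synthesis pipeline (such as filtering low-quality problems) unclear.
Result: The FLAMES dataset outperforms public datasets on OlympiadBench (+15.7), CollegeMath (+4.5), GSMPlus (+6.5), and MATH (+3.1). Qwen2.5-Math-7B fine-tuned on it reaches 81.4% on MATH, surpassing larger models such as Llama3 405B, GPT-4o, and Claude 3.5 Sonnet.
Insight: 1. Increasing problem complexity is the most effective way to improve math ability. 2. With a fixed data generation budget, high problem coverage matters more than keeping only problems with reliable solutions. 3. Synthetic data based on GSM8K and MATH improves competition-level benchmarks, showing easy-to-hard generalization.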
Abstract: Recent works improving LLM math reasoning with synthetic data have used unique setups, making comparison of data synthesis strategies impractical. This leaves many unanswered questions about the roles of different factors in the synthetic data pipeline, such as the impact of filtering low-quality problems. To address this gap, we introduce FLAMES, a Framework for LLM Assessment of Math rEasoning Data Synthesis, and perform a systematic study of 10 existing data synthesis strategies and multiple other factors impacting the performance of synthetic math reasoning data. Our FLAMES experiments provide several valuable insights about the optimal balance of difficulty and diversity of synthetic data. First, data agents designed to increase problem complexity lead to best improvements on most math metrics. Second, with a fixed data generation budget, keeping higher problem coverage is more important than keeping only problems with reliable solutions. Third, GSM8K- and MATH-based synthetic data can lead to improvements on competition-level benchmarks, showcasing easy-to-hard generalization. Leveraging insights from our FLAMES experiments, we design two novel data synthesis strategies for improving out-of-domain generalization and robustness. Further, we develop the FLAMES dataset, an effective blend of our novel and existing data synthesis strategies, outperforming public datasets on OlympiadBench (+15.7), CollegeMath (+4.5), GSMPlus (+6.5), and MATH (+3.1). Fine-tuning Qwen2.5-Math-7B on the FLAMES dataset achieves 81.4% on MATH, surpassing larger Llama3 405B, GPT-4o and Claude 3.5 Sonnet.