cs.CV

[1] RU-Net for Automatic Characterization of TRISO Fuel Cross Sections cs.CV | cs.AI

Lu Cai, Fei Xu, Min Xian, Yalei Tang, Shoukun Sun

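TL;DR: The paper proposes RU-Net, a convolutional neural network (CNN) model that automatically segments microscopic cross-sectional images of TRISO fuel particles, improving analysis efficiency and the objectivity of the results.

Motivation: Manual analysis of microscopic TRISO fuel particle images is time-consuming and subjective, so an automated approach is needed to speed up data analysis and improve accuracy.

Result: In preliminary results, RU-Net achieves the best Intersection over Union (IoU) among the compared models, and it can substantially reduce manual labor and improve the objectivity of the segmentation.

Insight: CNNs have considerable potential for microscopic image analysis, and the RU-Net design may offer ideas for similar segmentation tasks.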

Abstract: During irradiation, phenomena such as kernel swelling and buffer densification may impact the performance of tristructural isotropic (TRISO) particle fuel. Post-irradiation microscopy is often used to identify these irradiation-induced morphologic changes. However, each fuel compact generally contains thousands of TRISO particles. Manually performing the work to get statistical information on these phenomena is cumbersome and subjective. To reduce the subjectivity inherent in that process and to accelerate data analysis, we used convolutional neural networks (CNNs) to automatically segment cross-sectional images of microscopic TRISO layers. CNNs are a class of machine-learning algorithms specifically designed for processing structured grid data. They have gained popularity in recent years due to their remarkable performance in various computer vision tasks, including image classification, object detection, and image segmentation. In this research, we generated a large irradiated TRISO layer dataset with more than 2,000 microscopic images of cross-sectional TRISO particles and the corresponding annotated images. Based on these annotated images, we used different CNNs to automatically segment different TRISO layers. These CNNs include RU-Net (developed in this study), as well as three existing architectures: U-Net, Residual Network (ResNet), and Attention U-Net. The preliminary results show that the model based on RU-Net performs best in terms of Intersection over Union (IoU). Using CNN models, we can expedite the analysis of TRISO particle cross sections, significantly reducing the manual labor involved and improving the objectivity of the segmentation results.
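
Note: the abstract ranks models by Intersection over Union (IoU). The following is a minimal sketch of a per-class IoU computation on label maps, not the authors' evaluation code. The function name `iou_per_class`, the nested-list input format, and the class indices (for example 0 = background, 1 = kernel, 2 = buffer) are assumptions made for illustration.

```python
def iou_per_class(pred, target, num_classes):
    """Return the IoU of each class for two label maps of equal shape.

    pred and target are 2D lists of integer class labels (hypothetical
    encoding, e.g. 0 = background, 1 = kernel, 2 = buffer).
    """
    inter = [0] * num_classes  # pixels where both maps agree on the class
    union = [0] * num_classes  # pixels where either map has the class
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                inter[p] += 1
                union[p] += 1
            else:
                union[p] += 1
                union[t] += 1
    # A class absent from both maps has an undefined IoU; report NaN.
    return [i / u if u else float("nan") for i, u in zip(inter, union)]


pred = [[0, 1], [1, 2]]
target = [[0, 1], [2, 2]]
scores = iou_per_class(pred, target, num_classes=3)  # [1.0, 0.5, 0.5]
```

Averaging the per-class values gives a mean IoU; whether the paper reports per-layer or averaged IoU is not stated in the abstract.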


[2] Modular, On-Site Solutions with Lightweight Anomaly Detection for Sustainable Nutrient Management in Agriculture cs.CV | cs.AI

Abigail R. Cohen, Yuming Sun, Zhihao Qin, Harsh S. Muriki, Zihao Xiao

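TL;DR: The study investigates a modular, lightweight anomaly detection solution for agriculture that combines efficient nutrient management with multispectral imaging; it proposes a tiered detection pipeline and analyzes the trade-off between energy consumption and accuracy.

Motivation: Conventional nutrient management requires lengthy analyses and cannot support real-time optimization. Multispectral imaging is fast but computationally expensive, which limits its use in resource-constrained settings.

Result: The autoencoder (AE) achieved 73% net detection of T3 samples at low energy cost; the Vision Transformer (ViT) outperformed Random Forest (RF) on some nutrient estimates (phosphorus and calcium) but consumed more energy.

Insight: A modular design can balance energy consumption and accuracy, making edge computing practical for agriculture.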

Abstract: Efficient nutrient management is critical for crop growth and sustainable resource consumption (e.g., nitrogen, energy). Current approaches require lengthy analyses, preventing real-time optimization; similarly, imaging facilitates rapid phenotyping but can be computationally intensive, preventing deployment under resource constraints. This study proposes a flexible, tiered pipeline for anomaly detection and status estimation (fresh weight, dry mass, and tissue nutrients), including a comprehensive energy analysis of approaches that span the efficiency-accuracy spectrum. Using a nutrient depletion experiment with three treatments (T1-100%, T2-50%, and T3-25% fertilizer strength) and multispectral imaging (MSI), we developed a hierarchical pipeline using an autoencoder (AE) for early warning. Further, we compared two status estimation modules of different complexity for more detailed analysis: vegetation index (VI) features with machine learning (Random Forest, RF) and raw whole-image deep learning (Vision Transformer, ViT). Results demonstrated high-efficiency anomaly detection (73% net detection of T3 samples 9 days after transplanting) at substantially lower energy than embodied energy in wasted nitrogen. The state estimation modules show trade-offs, with ViT outperforming RF on phosphorus and calcium estimation (R2 0.61 vs. 0.58, 0.48 vs. 0.35) at higher energy cost. With our modular pipeline, this work opens opportunities for edge diagnostics and practical opportunities for agricultural sustainability.
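
Note: the abstract compares ViT and RF by R2, which is read here as the coefficient of determination. A minimal sketch of that metric follows, assuming the standard definition 1 - SS_res / SS_tot. The function name `r2_score` and the sample values are illustrative only and do not come from the paper.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured vs. estimated values (made-up numbers).
measured = [1.0, 2.0, 3.0]
estimated = [1.0, 2.0, 4.0]
score = r2_score(measured, estimated)  # 1 - 1/2 = 0.5
```

A value of 1 means perfect prediction and 0 means no better than predicting the mean, which is the scale on which the reported 0.61 vs. 0.58 and 0.48 vs. 0.35 should be read.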


[3] Humor in Pixels: Benchmarking Large Multimodal Models Understanding of Online Comics cs.CV | cs.AI | cs.CL

Yuriel Ryan, Rui Yang Tan, Kenny Tsu Wei Choo, Roy Ka-Wei Lee

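TL;DR: PixelHumor is a benchmark dataset for evaluating how well large multimodal models (LMMs) understand multimodal humor and narrative sequences; it reveals the limits of current models in integrating visual and textual cues.

Motivation: Understanding humor is a core aspect of social intelligence, yet it remains a major challenge for LMMs. Their multimodal humor understanding and narrative abilities need systematic evaluation.

Result: Top LMMs reach only 61% accuracy on the panel sequencing task, far below human performance, which highlights current models' weaknesses in multimodal understanding and narrative coherence.

Insight: PixelHumor provides a rigorous evaluation framework for developing more socially intelligent LMMs and underscores the need to improve visual-textual integration and narrative reasoning.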

Abstract: Understanding humor is a core aspect of social intelligence, yet it remains a significant challenge for Large Multimodal Models (LMMs). We introduce PixelHumor, a benchmark dataset of 2,800 annotated multi-panel comics designed to evaluate LMMs’ ability to interpret multimodal humor and recognize narrative sequences. Experiments with state-of-the-art LMMs reveal substantial gaps: for instance, top models achieve only 61% accuracy in panel sequencing, far below human performance. This underscores critical limitations in current models’ integration of visual and textual cues for coherent narrative and humor understanding. By providing a rigorous framework for evaluating multimodal contextual and narrative reasoning, PixelHumor aims to drive the development of LMMs that better engage in natural, socially aware interactions.


[4] OnlineHOI: Towards Online Human-Object Interaction Generation and Perception cs.CV | cs.AI | cs.RO

Yihong Ji, Yunze Liu, Yiyao Zhuo, Weijiang Yu, Fei Ma

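TL;DR: The paper proposes two new tasks, online human-object interaction (HOI) generation and perception, and introduces OnlineHOI, a framework built on Mamba with a memory mechanism, to address the poor performance of existing offline methods in online settings.

Motivation: Existing HOI methods target offline settings and cannot cope with the online constraint that only current and historical data are available.

Result: The method achieves state-of-the-art results on the Core4D and OAKINK2 online generation tasks and on the HOI4D online perception task.

Insight: Online settings require processing data as a stream, and combining Mamba with a memory mechanism offers a new way to handle such tasks.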

Abstract: The perception and generation of Human-Object Interaction (HOI) are crucial for fields such as robotics, AR/VR, and human behavior understanding. However, current approaches model this task in an offline setting, where information at each time step can be drawn from the entire interaction sequence. In contrast, in real-world scenarios, the information available at each time step comes only from the current moment and historical data, i.e., an online setting. We find that offline methods perform poorly in an online context. Based on this observation, we propose two new tasks: Online HOI Generation and Perception. To address this task, we introduce the OnlineHOI framework, a network architecture based on the Mamba framework that employs a memory mechanism. By leveraging Mamba’s powerful modeling capabilities for streaming data and the Memory mechanism’s efficient integration of historical information, we achieve state-of-the-art results on the Core4D and OAKINK2 online generation tasks, as well as the online HOI4D perception task.


[5] EfficientNet-Based Multi-Class Detection of Real, Deepfake, and Plastic Surgery Faces cs.CV

Li Kun, Milena Radenkovic

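TL;DR: The paper proposes an EfficientNet-based multi-class detection method that distinguishes real faces, Deepfake-generated faces, and faces altered by plastic surgery, in response to the potential harm Deepfake technology poses to society.

Motivation: Misuse of Deepfake technology has serious social consequences, including privacy violations, damage to the reputations of public figures, and threats to national security. An efficient method is therefore needed to detect and distinguish real faces from forged or modified ones.

Result: The proposed method performs well on the multi-class face detection task, identifying different types of forged or modified faces with high accuracy and offering an effective approach to Deepfake detection.

Insight: Combining an efficient deep learning architecture such as EfficientNet with a multi-class classification strategy can markedly improve detection of sophisticated forgeries and points to directions for future work.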

Abstract: Currently, deep learning has been utilised to tackle several difficulties in our everyday lives. It not only exhibits progress in computer vision but also constitutes the foundation for several revolutionary technologies. Nonetheless, similar to all phenomena, the use of deep learning in diverse domains has produced a multifaceted interaction of advantages and disadvantages for human society. Deepfake technology has advanced, significantly impacting social life. However, developments in this technology can affect privacy, the reputations of prominent personalities, and national security via software development. It can produce indistinguishable counterfeit photographs and films, potentially impairing the functionality of facial recognition systems, so presenting a significant risk. The improper application of deepfake technology produces several detrimental effects on society. Face-swapping programs mislead users by altering persons’ appearances or expressions to fulfil particular aims or to appropriate personal information. Deepfake technology permeates daily life through such techniques. Certain individuals endeavour to sabotage election campaigns or subvert prominent political figures by creating deceptive pictures to influence public perception, causing significant harm to a nation’s political and economic structure.


[6] PATIMT-Bench: A Multi-Scenario Benchmark for Position-Aware Text Image Machine Translation in Large Vision-Language Models cs.CV | cs.AI

Wanru Zhuang, Wenbo Li, Zhibin Lan, Xu Han, Peng Li

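TL;DR: The paper introduces the position-aware text image machine translation (PATIMT) task and builds the PATIMT-Bench benchmark, which supports fine-grained, layout-preserving translation; an adaptive OCR refinement pipeline and data augmentation improve model performance.

Motivation: Conventional TIMT ignores position information and covers limited scenarios, so it falls short of practical needs. PATIMT aims to address these issues and support finer-grained translation.

Result: After fine-tuning, compact LVLMs achieve state-of-the-art performance on both sub-tasks, and the results demonstrate the scalability and generalizability of the training data.

Insight: The practical value of PATIMT lies in its fine-grained, layout-preserving translation; the adaptive OCR pipeline and data diversity are key to the performance gains.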

Abstract: Text Image Machine Translation (TIMT) aims to translate texts embedded within an image into another language. Current TIMT studies primarily focus on providing translations for all the text within an image, while neglecting to provide bounding boxes and covering limited scenarios. In this work, we extend traditional TIMT into position-aware TIMT (PATIMT), aiming to support fine-grained and layout-preserving translation, which holds great practical value but remains largely unexplored. This task comprises two key sub-tasks: region-specific translation and full-image translation with grounding. To support existing models on PATIMT and conduct fair evaluation, we construct the PATIMT benchmark (PATIMT-Bench), which consists of 10 diverse real-world scenarios. Specifically, we introduce an Adaptive Image OCR Refinement Pipeline, which adaptively selects appropriate OCR tools based on scenario and refines the results of text-rich images. To ensure evaluation reliability, we further construct a test set, which contains 1,200 high-quality instances manually annotated and reviewed by human experts. After fine-tuning on our data, compact Large Vision-Language Models (LVLMs) achieve state-of-the-art performance on both sub-tasks. Experimental results also highlight the scalability and generalizability of our training data.


[7] Deep learning for 3D point cloud processing – from approaches, tasks to its implications on urban and environmental applications cs.CV

Zhenxin Zhang, Zhihua Xu, Yuwei Cao, Ningli Xu, Shuye Wang

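TL;DR: This survey examines deep learning for 3D point cloud processing, covering key tasks such as scene completion, registration, semantic segmentation, and modeling, and analyzes its practical value and challenges in urban and environmental applications.

Motivation: Point cloud processing matters for mapping, environmental monitoring, and related fields, but existing surveys focus mostly on network architectures and overlook practical issues such as very large data volumes, diverse scene content, and non-uniform point density.

Result: The review finds that current deep learning methods still need improvement to handle practical issues such as large-scale data and non-uniform point density, especially in urban and environmental applications.

Insight: The paper highlights the challenges that must be solved to turn deep-learning-based point cloud processing into real applications, including algorithm optimization and multimodal data fusion.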

Abstract: Point cloud processing, a fundamental task in geomatics and computer vision, supports tasks and applications at different scales from air to ground, including mapping, environmental monitoring, urban/tree structure modeling, automated driving, robotics, and disaster response. Due to the rapid development of deep learning, point cloud processing algorithms are now almost exclusively dominated by learning-based approaches, most of which have yet to be transitioned into real-world practice. Existing surveys primarily focus on the ever-updating network architectures that accommodate unordered point clouds, largely ignoring their practical value in typical point cloud processing applications, in which extra-large data volumes, diverse scene contents, varying point density, and data modality need to be considered. In this paper, we provide a meta review of deep learning approaches and datasets that cover a selection of critical point cloud processing tasks in use, such as scene completion, registration, semantic segmentation, and modeling. By reviewing a broad range of urban and environmental applications these tasks can support, we identify gaps to be closed as these methods are transformed into applications and draw concluding remarks on both the algorithmic and practical aspects of the surveyed methods.


[8] Evaluating Robustness of Vision-Language Models Under Noisy Conditions cs.CV

Purushoth, Alireza

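TL;DR: The paper presents a comprehensive evaluation framework that tests the robustness of several state-of-the-art vision-language models (VLMs) under noisy conditions, revealing nuanced trade-offs among model size, dataset characteristics, and noise type.

Motivation: VLMs perform well on multimodal tasks, but their robustness under noisy conditions has not been studied adequately. The paper aims to fill this gap.

Result: Experiments show that (1) the descriptiveness of ground-truth captions significantly affects performance; (2) larger models such as LLaVA are better at semantic understanding but do not always outperform smaller models; and (3) JPEG compression and motion blur dramatically degrade performance across models.

Insight: Robustness depends not only on model size but also on dataset characteristics and noise type; the study offers a benchmark for future work on robust multimodal learning.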

Abstract: Vision-Language Models (VLMs) have attained exceptional success across multimodal tasks such as image captioning and visual question answering. However, their robustness under noisy conditions remains unfamiliar. In this study, we present a comprehensive evaluation framework to evaluate the performance of several state-of-the-art VLMs under controlled perturbations, including lighting variation, motion blur, and compression artifacts. We used both lexical-based metrics (BLEU, METEOR, ROUGE, CIDEr) and neural-based similarity measures using sentence embeddings to quantify semantic alignment. Our experiments span diverse datasets, revealing key insights: (1) descriptiveness of ground-truth captions significantly influences model performance; (2) larger models like LLaVA excel in semantic understanding but do not universally outperform smaller models; and (3) certain noise types, such as JPEG compression and motion blur, dramatically degrade performance across models. Our findings highlight the nuanced trade-offs between model size, dataset characteristics, and noise resilience, offering a standardized benchmark for future robust multimodal learning.


[9] Axis-Aligned 3D Stalk Diameter Estimation from RGB-D Imagery cs.CV

Benjamin Vail, Rahul Harsha Cheppally, Ajay Sharda, Sidharth Rai

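TL;DR: The paper presents a geometry-aware computer vision pipeline that estimates crop stalk diameter from RGB-D imagery, offering a reliable and scalable solution for high-throughput phenotyping.

Motivation: Traditional stalk diameter measurement is labor-intensive and error-prone, making it unsuitable for high-throughput phenotyping. The paper addresses this with computer vision.

Result: The method mitigates the effects of curvature, occlusion, and image noise and achieves accurate stalk diameter estimation.

Insight: Geometry-aware computer vision can markedly improve the efficiency and reliability of agricultural phenotyping.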

Abstract: Accurate, high-throughput phenotyping is a critical component of modern crop breeding programs, especially for improving traits such as mechanical stability, biomass production, and disease resistance. Stalk diameter is a key structural trait, but traditional measurement methods are labor-intensive, error-prone, and unsuitable for scalable phenotyping. In this paper, we present a geometry-aware computer vision pipeline for estimating stalk diameter from RGB-D imagery. Our method integrates deep learning-based instance segmentation, 3D point cloud reconstruction, and axis-aligned slicing via Principal Component Analysis (PCA) to perform robust diameter estimation. By mitigating the effects of curvature, occlusion, and image noise, this approach offers a scalable and reliable solution to support high-throughput phenotyping in breeding and agronomic research.
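
Note: the abstract names axis-aligned slicing via PCA as the step that makes diameter estimation robust to stalk orientation. The sketch below illustrates that general idea on a synthetic point cloud and is not the authors' pipeline. It assumes the points cover the full circumference of the stalk (an RGB-D camera would see only one side), finds the principal axis by power iteration, and takes twice the mean radial distance per slice as the diameter. All function names and parameters are hypothetical.

```python
import math


def principal_axis(points, iters=100):
    """Return centered points and the dominant eigenvector of their covariance."""
    n = len(points)
    c = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - c[i] for i in range(3)] for p in points]
    cov = [[sum(q[i] * q[j] for q in centered) / n for j in range(3)]
           for i in range(3)]
    v = [1.0, 1.0, 1.0]  # assumed not orthogonal to the true axis
    for _ in range(iters):  # power iteration on the 3x3 covariance matrix
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return centered, v


def slice_diameters(points, num_slices=5):
    """Slice the cloud along its principal axis; estimate a diameter per slice."""
    centered, axis = principal_axis(points)
    ts, rs = [], []
    for q in centered:
        t = sum(q[i] * axis[i] for i in range(3))      # position along the axis
        perp = [q[i] - t * axis[i] for i in range(3)]  # offset from the axis
        ts.append(t)
        rs.append(math.sqrt(sum(x * x for x in perp)))
    t_min, t_max = min(ts), max(ts)
    width = (t_max - t_min) / num_slices
    bins = [[] for _ in range(num_slices)]
    for t, r in zip(ts, rs):
        k = min(int((t - t_min) / width), num_slices - 1)
        bins[k].append(r)
    # Diameter = 2 * mean radial distance (valid for full-circumference points).
    return [2.0 * sum(b) / len(b) for b in bins if b]


def synthetic_stalk(radius=0.5, length=10.0, n_len=50, n_ang=24, tilt=0.3):
    """Points on a cylinder surface, tilted about the x-axis by `tilt` radians."""
    pts = []
    for i in range(n_len):
        z = length * i / (n_len - 1)
        for j in range(n_ang):
            a = 2 * math.pi * j / n_ang
            x, y = radius * math.cos(a), radius * math.sin(a)
            pts.append((x,
                        y * math.cos(tilt) - z * math.sin(tilt),
                        y * math.sin(tilt) + z * math.cos(tilt)))
    return pts


diameters = slice_diameters(synthetic_stalk())  # each close to 1.0
```

Because the slices are taken perpendicular to the estimated axis rather than to an image axis, the tilt of the synthetic stalk does not inflate the estimate.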


[10] Explicit Multimodal Graph Modeling for Human-Object Interaction Detection cs.CV

Wenxuan Ji, Haichao Shi, Xiao-Yu Zhang

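TL;DR: The paper proposes Multimodal Graph Network Modeling (MGNM), a graph-neural-network-based multimodal approach that explicitly models the relational structure of human-object interaction (HOI), improving HOI detection and achieving state-of-the-art results on the HICO-DET and V-COCO benchmarks.

Motivation: Transformer architectures do not explicitly model the relational structure in HOI detection, which hinders interaction recognition, whereas graph neural networks (GNNs) are naturally suited to modeling such relationships.

Result: MGNM achieves state-of-the-art performance on HICO-DET and V-COCO; paired with a more advanced object detector it improves further, while balancing performance on rare and non-rare classes.

Insight: Explicitly modeling relational structure is crucial for HOI tasks, and fusing multimodal features effectively improves interaction recognition.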

Abstract: Transformer-based methods have recently become the prevailing approach for Human-Object Interaction (HOI) detection. However, the Transformer architecture does not explicitly model the relational structures inherent in HOI detection, which impedes the recognition of interactions. In contrast, Graph Neural Networks (GNNs) are inherently better suited for this task, as they explicitly model the relationships between human-object pairs. Therefore, in this paper, we propose Multimodal Graph Network Modeling (MGNM) that leverages GNN-based relational structures to enhance HOI detection. Specifically, we design a multimodal graph network framework that explicitly models the HOI task in a four-stage graph structure. Furthermore, we introduce a multi-level feature interaction mechanism within our graph network. This mechanism leverages multi-level vision and language features to enhance information propagation across human-object pairs. Consequently, our proposed MGNM achieves state-of-the-art performance on two widely used benchmarks: HICO-DET and V-COCO. Moreover, when integrated with a more advanced object detector, our method demonstrates a significant performance gain and maintains an effective balance between rare and non-rare classes.


[11] VQT-Light: Lightweight HDR Illumination Map Prediction with Richer Texture cs.CV

Kunliang Xie

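TL;DR: VQT-Light is a lightweight framework based on the VQVAE and ViT architectures for predicting high dynamic range (HDR) illumination maps; through discrete feature extraction and global context capture it delivers texture-rich, efficient lighting estimation.

Motivation: Existing methods trade off texture recovery in the illumination map against running speed, and struggle to deliver both high texture fidelity and real-time performance.

Result: VQT-Light runs at 40 FPS and outperforms existing methods on multiple evaluation metrics, predicting illumination with higher texture fidelity.

Insight: Discrete feature representations and global modeling can markedly improve both the quality and the efficiency of lighting estimation.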

Abstract: Accurate lighting estimation is a significant yet challenging task in computer vision and graphics. However, existing methods either struggle to restore detailed textures of illumination map, or face challenges in running speed and texture fidelity. To tackle this problem, we propose a novel framework (VQT-Light) based on VQVAE and ViT architecture. VQT-Light includes two modules: feature extraction and lighting estimation. First, we take advantages of VQVAE to extract discrete features of illumination map rather than continuous features to avoid “posterior collapse”. Second, we capture global context and dependencies of input image through ViT rather than CNNs to improve the prediction of illumination outside the field of view. Combining the above two modules, we formulate the lighting estimation as a multiclass classification task, which plays a key role in our pipeline. As a result, our model predicts light map with richer texture and better fidelity while keeping lightweight and fast. VQT-Light achieves an inference speed of 40FPS and improves multiple evaluation metrics. Qualitative and quantitative experiments demonstrate that the proposed method realizes superior results compared to existing state-of-the-art methods.


[12] Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations cs.CV | cs.AI

Jinjie Shen, Yaxiong Wang, Lechao Cheng, Nan Pu, Zhun Zhong

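TL;DR: The paper presents the first Semantic-Aligned Multimodal Manipulation (SAMM) dataset and develops a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework that significantly improves detection performance.

Motivation: Existing datasets artificially break cross-modal alignment, which makes detection too easy and fails to reflect real-world multimodal manipulation; more realistic datasets and methods are needed.

Result: On SAMM, RamDG achieves 2.06% higher detection accuracy than existing methods.

Insight: Real-world multimodal manipulations maintain semantic consistency, and the crude disruption in existing datasets biases detection. SAMM and RamDG offer a more realistic benchmark and solution.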

Abstract: The detection and grounding of manipulated content in multimodal data has emerged as a critical challenge in media forensics. While existing benchmarks demonstrate technical progress, they suffer from misalignment artifacts that poorly reflect real-world manipulation patterns: practical attacks typically maintain semantic consistency across modalities, whereas current datasets artificially disrupt cross-modal alignment, creating easily detectable anomalies. To bridge this gap, we pioneer the detection of semantically-coordinated manipulations where visual edits are systematically paired with semantically consistent textual descriptions. Our approach begins with constructing the first Semantic-Aligned Multimodal Manipulation (SAMM) dataset, generated through a two-stage pipeline: 1) applying state-of-the-art image manipulations, followed by 2) generation of contextually-plausible textual narratives that reinforce the visual deception. Building on this foundation, we propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework. RamDG commences by harnessing external knowledge repositories to retrieve contextual evidence, which serves as the auxiliary texts and encoded together with the inputs through our image forgery grounding and deep manipulation detection modules to trace all manipulations. Extensive experiments demonstrate our framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on SAMM compared to state-of-the-art approaches. The dataset and code are publicly available at https://github.com/shen8424/SAMM-RamDG-CAP.


[13] A Comparative Study of YOLOv8 to YOLOv11 Performance in Underwater Vision Tasks cs.CV | cs.AI

Gordon Hung, Ivan Felipe Rodriguez

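TL;DR: The paper compares YOLOv8 through YOLOv11 on underwater vision tasks, finds that lightweight YOLOv10 offers the best speed-accuracy trade-off, and provides a reproducible benchmark.

Motivation: Autonomous underwater vehicles (AUVs) rely on computer vision for underwater tasks, but underwater imagery suffers from light attenuation, turbidity, and class imbalance. Performance of YOLO detectors on terrestrial benchmarks may not carry over to the underwater domain, so a systematic comparison is needed.

Result: Accuracy saturates after YOLOv9, while inference speed improves markedly; YOLOv10 offers the best speed-accuracy trade-off.

Insight: Successive YOLO releases mainly improve efficiency rather than accuracy, and lightweight models are better suited to resource-constrained underwater applications.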

Abstract: Autonomous underwater vehicles (AUVs) increasingly rely on on-board computer-vision systems for tasks such as habitat mapping, ecological monitoring, and infrastructure inspection. However, underwater imagery is hindered by light attenuation, turbidity, and severe class imbalance, while the computational resources available on AUVs are limited. One-stage detectors from the YOLO family are attractive because they fuse localization and classification in a single, low-latency network; however, their terrestrial benchmarks (COCO, PASCAL-VOC, Open Images) leave open the question of how successive YOLO releases perform in the marine domain. We curate two openly available datasets that span contrasting operating conditions: a Coral Disease set (4,480 images, 18 classes) and a Fish Species set (7,500 images, 20 classes). For each dataset, we create four training regimes (25 %, 50 %, 75 %, 100 % of the images) while keeping balanced validation and test partitions fixed. We train YOLOv8-s, YOLOv9-s, YOLOv10-s, and YOLOv11-s with identical hyperparameters (100 epochs, 640 px input, batch = 16, T4 GPU) and evaluate precision, recall, mAP50, mAP50-95, per-image inference time, and frames-per-second (FPS). Post-hoc Grad-CAM visualizations probe feature utilization and localization faithfulness. Across both datasets, accuracy saturates after YOLOv9, suggesting architectural innovations primarily target efficiency rather than accuracy. Inference speed, however, improves markedly. Our results (i) provide the first controlled comparison of recent YOLO variants on underwater imagery, (ii) show that lightweight YOLOv10 offers the best speed-accuracy trade-off for embedded AUV deployment, and (iii) deliver an open, reproducible benchmark and codebase to accelerate future marine-vision research.


[14] RIS-FUSION: Rethinking Text-Driven Infrared and Visible Image Fusion from the Perspective of Referring Image Segmentation cs.CV

Siju Ma, Changsiyu Gong, Xiaofeng Fan, Yong Ma, Chengjie Jiang

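TL;DR: The paper proposes RIS-FUSION, a cascaded framework that unifies text-driven infrared and visible image fusion with referring image segmentation (RIS); its LangGatedFusion module injects textual features to strengthen semantic alignment, and the paper also introduces the large-scale MM-RIS dataset.

Motivation: Current text-driven infrared and visible image fusion methods lack a goal-aligned task for supervising and evaluating how much the text input contributes to the fusion result. The authors observe that RIS and text-driven fusion share a common objective: highlighting the object the text refers to.

Result: Experiments show that RIS-FUSION achieves state-of-the-art performance, outperforming existing methods by more than 11% in mIoU.

Insight: Combining RIS with image fusion enables better text-driven multimodal alignment and target focus.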

Abstract: Text-driven infrared and visible image fusion has gained attention for enabling natural language to guide the fusion process. However, existing methods lack a goal-aligned task to supervise and evaluate how effectively the input text contributes to the fusion outcome. We observe that referring image segmentation (RIS) and text-driven fusion share a common objective: highlighting the object referred to by the text. Motivated by this, we propose RIS-FUSION, a cascaded framework that unifies fusion and RIS through joint optimization. At its core is the LangGatedFusion module, which injects textual features into the fusion backbone to enhance semantic alignment. To support multimodal referring image segmentation task, we introduce MM-RIS, a large-scale benchmark with 12.5k training and 3.5k testing triplets, each consisting of an infrared-visible image pair, a segmentation mask, and a referring expression. Extensive experiments show that RIS-FUSION achieves state-of-the-art performance, outperforming existing methods by over 11% in mIoU. Code and dataset will be released at https://github.com/SijuMa2003/RIS-FUSION.


[15] Learning by Imagining: Debiased Feature Augmentation for Compositional Zero-Shot Learning cs.CV

Haozhe Zhang, Chenchen Jing, Mingyu Liu, Qingsheng Wang, Hao Chen

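TL;DR: The paper proposes Debiased Feature Augmentation (DeFA), which combines a disentangle-and-reconstruct feature augmentation framework with a debiasing strategy to address entanglement and long-tailed distributions in compositional zero-shot learning.

Motivation: Compositional zero-shot learning (CZSL) is challenging because attributes and objects are entangled and real-world data is long-tailed. DeFA is inspired by neuroscientific findings that imagination and perception share similar neural processes.

Result: On three widely used datasets, DeFA achieves state-of-the-art performance in both closed-world and open-world settings.

Insight: Imagination-driven feature synthesis can effectively mitigate data distribution bias and improve compositional generalization.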

Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize unseen attribute-object compositions by learning prior knowledge of seen primitives, i.e., attributes and objects. Learning generalizable compositional representations in CZSL remains challenging due to the entangled nature of attributes and objects as well as the prevalence of long-tailed distributions in real-world data. Inspired by neuroscientific findings that imagination and perception share similar neural processes, we propose a novel approach called Debiased Feature Augmentation (DeFA) to address these challenges. The proposed DeFA integrates a disentangle-and-reconstruct framework for feature augmentation with a debiasing strategy. DeFA explicitly leverages the prior knowledge of seen attributes and objects by synthesizing high-fidelity composition features to support compositional generalization. Extensive experiments on three widely used datasets demonstrate that DeFA achieves state-of-the-art performance in both closed-world and open-world settings.


[16] AsyMoE: Leveraging Modal Asymmetry for Enhanced Expert Specialization in Large Vision-Language Models cs.CV | cs.RO

Heng Zhang, Haichuan Hu, Yaomin Shen, Weihao Yu, Yilei Yuan

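TL;DR: AsyMoE proposes a new architecture that leverages modal asymmetry to enhance expert specialization, addressing the asymmetry between visual and linguistic processing.

Motivation: Existing MoE approaches struggle with the asymmetry between the visual and language modalities: language experts in deeper layers progressively lose contextual grounding and rely on parametric knowledge rather than the provided modality information.

Result: AsyMoE improves accuracy by 26.58% over vanilla MoE and by 15.45% over modality-specific MoE, with 25.45% fewer activated parameters than dense models.

Insight: Modal asymmetry is a key issue in cross-modal modeling, and specialized expert groups can effectively balance modality-specific features and cross-modal interactions.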

Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. However, existing Mixture of Experts (MoE) approaches face challenges due to the asymmetry between visual and linguistic processing. Visual information is spatially complete, while language requires maintaining sequential context. As a result, MoE models struggle to balance modality-specific features and cross-modal interactions. Through systematic analysis, we observe that language experts in deeper layers progressively lose contextual grounding and rely more on parametric knowledge rather than utilizing the provided visual and linguistic information. To address this, we propose AsyMoE, a novel architecture that models this asymmetry using three specialized expert groups. We design intra-modality experts for modality-specific processing, hyperbolic inter-modality experts for hierarchical cross-modal interactions, and evidence-priority language experts to suppress parametric biases and maintain contextual grounding. Extensive experiments demonstrate that AsyMoE achieves 26.58% and 15.45% accuracy improvements over vanilla MoE and modality-specific MoE respectively, with 25.45% fewer activated parameters than dense models.


[17] EvoEmpirBench: Dynamic Spatial Reasoning with Agent-ExpVer cs.CV

Pukun Zhao, Longxiang Wang, Miaowei Wang, Chen Chen, Fanqing Zhou

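TL;DR: The paper introduces two dynamic spatial reasoning benchmarks (maze navigation and match-2 elimination) to evaluate models' spatial understanding and adaptive planning in locally observable, dynamically changing environments. The authors also propose a subjective experience-based memory mechanism for cross-task experience transfer. Experiments reveal the limitations of current mainstream models in dynamic spatial reasoning and long-term memory.

Motivation: Existing spatial reasoning benchmarks focus on static or globally observable environments and fail to capture the challenges of long-horizon reasoning and memory use under partial observability and dynamic change.

Result: Experiments show that mainstream models perform poorly on dynamic spatial tasks, supporting the usefulness of the benchmarks and the memory mechanism.

Insight: Reasoning in dynamic environments requires models to update their cognition and strategy continuously, and existing models still have room to improve in long-term memory and adaptive planning.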

Abstract: Most existing spatial reasoning benchmarks focus on static or globally observable environments, failing to capture the challenges of long-horizon reasoning and memory utilization under partial observability and dynamic changes. We introduce two dynamic spatial benchmarks, locally observable maze navigation and match-2 elimination that systematically evaluate models’ abilities in spatial understanding and adaptive planning when local perception, environment feedback, and global objectives are tightly coupled. Each action triggers structural changes in the environment, requiring continuous update of cognition and strategy. We further propose a subjective experience-based memory mechanism for cross-task experience transfer and validation. Experiments show that our benchmarks reveal key limitations of mainstream models in dynamic spatial reasoning and long-term memory, providing a comprehensive platform for future methodological advances. Our code and data are available at https://anonymous.4open.science/r/EvoEmpirBench-143C/.


[18] SPGen: Spherical Projection as Consistent and Flexible Representation for Single Image 3D Shape Generation cs.CV

Jingdong Zhang, Weikai Chen, Yuan Liu, Jionghao Wang, Zhengming Yu

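TL;DR: SPGen proposes a novel Spherical Projection (SP) representation for generating 3D shapes from a single image; it addresses the limits of existing methods in view consistency and complex structure representation and significantly outperforms baselines in geometric quality and computational efficiency.

Motivation: Existing single-view 3D generative models typically rely on multiview diffusion priors but fall short on view consistency and on representing complex internal structure. SPGen aims to provide a more consistent, flexible, and efficient representation.

Result: Experiments show that SPGen significantly outperforms existing baselines in geometric quality and computational efficiency.

Insight: The spherical projection representation offers a new paradigm for 3D shape generation that is particularly suited to complex internal structures and topologies while avoiding multiview inconsistency.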

Abstract: Existing single-view 3D generative models typically adopt multiview diffusion priors to reconstruct object surfaces, yet they remain prone to inter-view inconsistencies and are unable to faithfully represent complex internal structure or nontrivial topologies. In particular, we encode geometry information by projecting it onto a bounding sphere and unwrapping it into a compact and structural multi-layer 2D Spherical Projection (SP) representation. Operating solely in the image domain, SPGen offers three key advantages simultaneously: (1) Consistency. The injective SP mapping encodes surface geometry with a single viewpoint which naturally eliminates view inconsistency and ambiguity; (2) Flexibility. Multi-layer SP maps represent nested internal structures and support direct lifting to watertight or open 3D surfaces; (3) Efficiency. The image-domain formulation allows the direct inheritance of powerful 2D diffusion priors and enables efficient finetuning with limited computational resources. Extensive experiments demonstrate that SPGen significantly outperforms existing baselines in geometric quality and computational efficiency.
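
Note: the abstract describes projecting geometry onto a bounding sphere and unwrapping it into multi-layer 2D maps. The sketch below shows one plausible reading of that idea and is not SPGen's actual parameterization, which the abstract does not specify. It assumes an equirectangular unwrapping (polar angle to rows, azimuth to columns), stores the radial distance of each surface point, and orders layers from outermost to innermost. The names `to_sp_pixel` and `build_sp_layers` are hypothetical.

```python
import math


def to_sp_pixel(point, height, width):
    """Map a 3D point (relative to the sphere center) to (row, col, radius)."""
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("the sphere center has no direction")
    polar = math.acos(z / r)                     # 0 at +z pole, pi at -z pole
    azimuth = math.atan2(y, x) % (2 * math.pi)   # wrapped into [0, 2*pi)
    row = min(int(polar / math.pi * height), height - 1)
    col = min(int(azimuth / (2 * math.pi) * width), width - 1)
    return row, col, r


def build_sp_layers(points, height, width, num_layers):
    """Build `num_layers` radius maps; layer 0 holds the outermost hit per pixel."""
    hits = {}
    for p in points:
        row, col, r = to_sp_pixel(p, height, width)
        hits.setdefault((row, col), []).append(r)
    # 0.0 marks "no surface along this direction" (assumed convention).
    layers = [[[0.0] * width for _ in range(height)] for _ in range(num_layers)]
    for (row, col), radii in hits.items():
        for k, r in enumerate(sorted(radii, reverse=True)[:num_layers]):
            layers[k][row][col] = r
    return layers


# Two nested surfaces along the +x direction: an outer shell and an inner part.
layers = build_sp_layers([(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
                         height=4, width=8, num_layers=2)
# layers[0][2][0] == 2.0 (outer), layers[1][2][0] == 1.0 (inner)
```

Storing several radii per direction is what lets a 2D map describe nested internal structure, which a single depth map per view cannot.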


[19] Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models cs.CV | cs.AI

Yunhan Zhao, Xiang Zheng, Xingjun Ma

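TL;DR: The paper shows that incorporating weak defenses into the attack pipeline can significantly improve the effectiveness and efficiency of jailbreak attacks on vision-language models (VLMs), and proposes Defense2Attack, which strengthens jailbreaks through a visual optimizer, a textual optimizer, and reinforcement fine-tuning.

Motivation: Despite their strong capabilities, VLMs are vulnerable to jailbreak attacks, which limits their safety. Existing attacks leave room for improvement in effectiveness and efficiency, in particular in how existing defense mechanisms can be turned around to strengthen attacks.

Result: Experiments on four VLMs and four safety benchmarks show that Defense2Attack achieves better jailbreak performance than existing methods in a single attempt.

Insight: Feeding defense mechanisms back into the attack pipeline offers a new way to strengthen jailbreak attacks and exposes potential weaknesses in model safety design.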

Abstract: Despite their superb capabilities, Vision-Language Models (VLMs) have been shown to be vulnerable to jailbreak attacks. While recent jailbreaks have achieved notable progress, their effectiveness and efficiency can still be improved. In this work, we reveal an interesting phenomenon: incorporating weak defense into the attack pipeline can significantly enhance both the effectiveness and the efficiency of jailbreaks on VLMs. Building on this insight, we propose Defense2Attack, a novel jailbreak method that bypasses the safety guardrails of VLMs by leveraging defensive patterns to guide jailbreak prompt design. Specifically, Defense2Attack consists of three key components: (1) a visual optimizer that embeds universal adversarial perturbations with affirmative and encouraging semantics; (2) a textual optimizer that refines the input using a defense-styled prompt; and (3) a red-team suffix generator that enhances the jailbreak through reinforcement fine-tuning. We empirically evaluate our method on four VLMs and four safety benchmarks. The results demonstrate that Defense2Attack achieves superior jailbreak performance in a single attempt, outperforming state-of-the-art attack methods that often require multiple tries. Our work offers a new perspective on jailbreaking VLMs.


[20] Effective Gaussian Management for High-fidelity Object Reconstruction cs.CVPDF

Jiateng Liu, Hao Gao, Jiu-Cheng Xie, Chi-Man Pun, Jian Xiong

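TL;DR: This paper proposes an effective Gaussian management approach for high-fidelity object reconstruction. By dynamically activating spherical harmonics (SHs) or normals under the supervision of a surface reconstruction module, it mitigates the gradient conflicts caused by dual supervision. It also proposes a lightweight Gaussian representation that balances representational capacity against parameter count through adaptive SH-order adjustment and task-decoupled pruning.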

Details

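Motivation: Existing Gaussian Splatting (GS) methods assign attributes indiscriminately, which leads to gradient conflicts under dual supervision and hurts reconstruction quality and efficiency. The paper aims to address these problems through dynamic management and a lightweight representation.

Result: Experiments show that the method outperforms existing approaches in both reconstruction quality and efficiency, improving performance while using significantly fewer parameters.

Insight: Dynamic management and a lightweight representation are effective ways to address gradient conflicts and parameter redundancy in Gaussian-based reconstruction. The approach is model-agnostic and can be integrated into other frameworks.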

Abstract: This paper proposes an effective Gaussian management approach for high-fidelity object reconstruction. Departing from recent Gaussian Splatting (GS) methods that employ indiscriminate attribute assignment, our approach introduces a novel densification strategy that dynamically activates spherical harmonics (SHs) or normals under the supervision of a surface reconstruction module, which effectively mitigates the gradient conflicts caused by dual supervision and achieves superior reconstruction results. To further improve representation efficiency, we develop a lightweight Gaussian representation that adaptively adjusts the SH orders of each Gaussian based on gradient magnitudes and performs task-decoupled pruning to remove Gaussians with minimal impact on a reconstruction task without sacrificing others, which balances the representational capacity with parameter quantity. Notably, our management approach is model-agnostic and can be seamlessly integrated into other frameworks, enhancing performance while reducing model size. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art approaches in both reconstruction quality and efficiency, achieving superior performance with significantly fewer parameters.
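To make the parameter trade-off behind per-Gaussian SH orders concrete: a Gaussian that stores spherical harmonics up to degree L needs (L + 1)^2 coefficients per color channel. The sketch below counts those coefficients and applies a purely hypothetical rule that picks the degree from a gradient magnitude. The thresholds, the function names, and the direction of the rule (larger gradient, higher degree) are assumptions for illustration; the abstract does not specify the authors' actual criterion.

```python
def sh_coefficient_count(degree, channels=3):
    """Number of SH coefficients stored for one Gaussian at the given degree."""
    return channels * (degree + 1) ** 2


def choose_sh_degree(gradient_magnitude, thresholds=(1e-4, 1e-3, 1e-2)):
    """Hypothetical rule: the degree equals the number of thresholds reached."""
    return sum(gradient_magnitude >= t for t in thresholds)


def total_sh_parameters(gradient_magnitudes):
    """Total SH parameters when each Gaussian gets its own degree."""
    return sum(
        sh_coefficient_count(choose_sh_degree(g)) for g in gradient_magnitudes
    )


# Made-up gradient magnitudes for three Gaussians.
example_gradients = [0.0, 5e-3, 1.0]
adaptive_total = total_sh_parameters(example_gradients)
fixed_total = len(example_gradients) * sh_coefficient_count(3)
```

With these made-up inputs the adaptive assignment stores fewer coefficients than giving every Gaussian degree 3, which is the kind of saving the lightweight representation targets.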


[21] Modelling and analysis of the 8 filters from the “master key filters hypothesis” for depthwise-separable deep networks in relation to idealized receptive fields based on scale-space theory cs.CVPDF

Tony Lindeberg, Zahra Babaiee, Peyman M. Kiasari

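TL;DR: By analysing the 8 "master key filters" found in depthwise-separable convolutional networks, this paper proposes idealized receptive field models based on discrete scale-space theory and shows that the learned filters can be approximated by the discrete analogue of the Gaussian kernel.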

Details

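Motivation: The work asks whether the filters learned in depthwise-separable convolutional networks can be approximated by idealized models (such as Gaussian kernels and difference operators applied to them), which would simplify network design and improve theoretical understanding.

Result: The learned filters are well approximated by the discrete analogue of the Gaussian kernel, and the idealized models have good predictive properties when they replace the learned filters in the network.

Insight: The study reveals a close link between the filters learned in depthwise-separable convolutional networks and the idealized receptive fields of scale-space theory (such as Gaussian kernels), providing new theoretical support for network design.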

Abstract: This paper presents the results of analysing and modelling a set of 8 "master key filters", which have been extracted by applying a clustering approach to the receptive fields learned in depthwise-separable deep networks based on the ConvNeXt architecture. For this purpose, we first compute spatial spread measures in terms of weighted mean values and weighted variances of the absolute values of the learned filters, which support the working hypotheses that: (i) the learned filters can be modelled by separable filtering operations over the spatial domain, and that (ii) the spatial offsets of those learned filters that are non-centered are rather close to half a grid unit. Then, we model the clustered "master key filters" in terms of difference operators applied to a spatial smoothing operation in terms of the discrete analogue of the Gaussian kernel, and demonstrate that the resulting idealized models of the receptive fields show good qualitative similarity to the learned filters. This modelling is performed in two different ways: (i) using possibly different values of the scale parameters in the coordinate directions for each filter, and (ii) using the same value of the scale parameter in both coordinate directions. Then, we perform the actual model fitting by either (i) requiring spatial spread measures in terms of spatial variances of the absolute values of the receptive fields to be equal, or (ii) minimizing the discrete $l_1$- or $l_2$-norms between the idealized receptive field models and the learned filters. Complementary experimental results then demonstrate the idealized models of receptive fields have good predictive properties for replacing the learned filters by idealized filters in depthwise-separable deep networks, thus showing that the learned filters in depthwise-separable deep networks can be well approximated by discrete scale-space filters.
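As a rough illustration of the smoothing kernel named in the abstract: in discrete scale-space theory, the discrete analogue of the Gaussian kernel is T(n; s) = e^{-s} I_n(s), where I_n is the modified Bessel function of integer order and s is the scale parameter (equal to the variance of the kernel). The sketch below evaluates it with the standard library only, using the integral representation of I_n. The function names, the sample count, and the simple first difference are illustrative assumptions, not the authors' code or their fitted models.

```python
import math


def discrete_gaussian(n, s, samples=256):
    """T(n; s) = exp(-s) * I_n(s), computed by the trapezoidal rule.

    Uses I_n(s) = (1/pi) * integral over [0, pi] of
    exp(s*cos(theta)) * cos(n*theta), with exp(-s) folded into the
    integrand so that it stays within [-1, 1].
    """
    h = math.pi / samples
    total = 0.0
    for k in range(samples + 1):
        theta = k * h
        weight = 0.5 if k in (0, samples) else 1.0  # trapezoid end points
        total += weight * math.exp(s * (math.cos(theta) - 1.0)) * math.cos(n * theta)
    return total * h / math.pi


def kernel(s, radius):
    """Sampled 1-D kernel on the integer grid -radius..radius."""
    return [discrete_gaussian(n, s) for n in range(-radius, radius + 1)]


def first_difference(values):
    """Non-centered first-order difference, one possible difference operator."""
    return [b - a for a, b in zip(values, values[1:])]


smoothing = kernel(1.0, 12)
derivative_like = first_difference(smoothing)
```

A separable 2-D model then follows by taking the outer product of two such 1-D kernels, possibly with different scale values per coordinate direction, as in modelling variant (i) of the abstract.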


[22] What Makes a Good Generated Image? Investigating Human and Multimodal LLM Image Preference Alignment cs.CVPDF

Rishab Parthasarathy, Jasmine Collins, Cory Stephenson

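TL;DR: The paper studies how human and multimodal large language model (LLM) preferences differ in image quality assessment, focusing on aesthetics, lack of artifacts, anatomical accuracy, compositional correctness, object adherence, and style. Using a dataset built from synthetically generated image pairs, it finds that the correlations between these attributes differ markedly between humans and LLMs.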

Details

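Motivation: Automated evaluation of generative text-to-image models is a challenging problem. Multimodal LLMs are being used to judge image quality, but how their judgments of individual image attributes differ from those of humans is unclear. The paper investigates the preferences of humans and LLMs in image quality assessment and the differences between them.

Result: Humans easily judge every image quality attribute (for example, aesthetics and composition), whereas multimodal LLMs are weak at judging some attributes, such as anatomical accuracy. The relationships between image quality attributes are also much weaker for LLMs than for humans.

Insight: Multimodal LLMs assess image quality quite differently from humans, especially for complex visual attributes such as anatomical accuracy. This points to a direction for improving LLM-based image evaluation.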

Abstract: Automated evaluation of generative text-to-image models remains a challenging problem. Recent works have proposed using multimodal LLMs to judge the quality of images, but these works offer little insight into how multimodal LLMs make use of concepts relevant to humans, such as image style or composition, to generate their overall assessment. In this work, we study what attributes of an image (specifically aesthetics, lack of artifacts, anatomical accuracy, compositional correctness, object adherence, and style) are important for both LLMs and humans to make judgments on image quality. We first curate a dataset of human preferences using synthetically generated image pairs. We use inter-task correlation between each pair of image quality attributes to understand which attributes are related in making human judgments. Repeating the same analysis with LLMs, we find that the relationships between image quality attributes are much weaker. Finally, we study individual image quality attributes by generating synthetic datasets with a high degree of control for each axis. Humans are able to easily judge the quality of an image with respect to all of the specific image quality attributes (e.g., a high- vs. low-aesthetic image); however, we find that some attributes, such as anatomical accuracy, are much more difficult for multimodal LLMs to learn to judge. Taken together, these findings reveal interesting differences between how humans and multimodal LLMs perceive images.


[23] Recurrent Cross-View Object Geo-Localization cs.CVPDF

Xiaohan Zhang, Si-Yuan Cao, Xiaokai Bai, Yiming Li, Zhangkai Shen

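TL;DR: The paper proposes ReCOT, a Recurrent Cross-view Object geo-localization Transformer that iteratively refines the predicted location. Combined with SAM-based knowledge distillation and a hierarchical attention mechanism, it improves performance significantly while reducing the parameter count.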

Details

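Motivation: Existing methods treat cross-view object geo-localization as a one-shot detection task, which makes them vulnerable to feature noise and leaves them without a mechanism for error correction. A method that refines its prediction step by step is therefore needed.

Result: ReCOT achieves state-of-the-art (SOTA) performance on standard CVOGL benchmarks while reducing parameters by 60% compared to previous SOTA approaches.

Insight: Recurrent refinement and hierarchical attention markedly improve localization accuracy, and knowledge distillation provides semantic guidance without additional inference cost.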

Abstract: Cross-view object geo-localization (CVOGL) aims to determine the location of a specific object in high-resolution satellite imagery given a query image with a point prompt. Existing approaches treat CVOGL as a one-shot detection task, directly regressing object locations from cross-view information aggregation, but they are vulnerable to feature noise and lack mechanisms for error correction. In this paper, we propose ReCOT, a Recurrent Cross-view Object geo-localization Transformer, which reformulates CVOGL as a recurrent localization task. ReCOT introduces a set of learnable tokens that encode task-specific intent from the query image and prompt embeddings, and iteratively attend to the reference features to refine the predicted location. To enhance this recurrent process, we incorporate two complementary modules: (1) a SAM-based knowledge distillation strategy that transfers segmentation priors from the Segment Anything Model (SAM) to provide clearer semantic guidance without additional inference cost, and (2) a Reference Feature Enhancement Module (RFEM) that introduces a hierarchical attention to emphasize object-relevant regions in the reference features. Extensive experiments on standard CVOGL benchmarks demonstrate that ReCOT achieves state-of-the-art (SOTA) performance while reducing parameters by 60% compared to previous SOTA approaches.


[24] A-TDOM: Active TDOM via On-the-Fly 3DGS cs.CVPDF

Yiwei Xu, Xiang Wang, Yifei Yu, Wentian Gan, Luca Morelli

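TL;DR: A-TDOM is a near real-time TDOM generation method based on On-the-Fly 3DGS optimization. By optimizing the 3DGS field incrementally as each new image arrives, it addresses the latency and quality problems of traditional methods.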

Details

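Motivation: Traditional TDOM generation relies on a complex offline pipeline, which introduces long delays. Quality also degrades with inaccurate camera poses or occlusions, so real-time requirements cannot be met.

Result: Experiments show that A-TDOM processes each new image within seconds while maintaining acceptable rendering quality and geometric accuracy, supporting near real-time applications.

Insight: On-the-fly 3DGS optimization offers a new approach to generating geospatial products in real time, and may extend to other dynamic scene reconstruction tasks in the future.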

Abstract: True Digital Orthophoto Map (TDOM) serves as a crucial geospatial product in various fields such as urban management, city planning, land surveying, etc. However, traditional TDOM generation methods generally rely on a complex offline photogrammetric pipeline, resulting in delays that hinder real-time applications. Moreover, the quality of TDOM may degrade due to various challenges, such as inaccurate camera poses or Digital Surface Model (DSM) and scene occlusions. To address these challenges, this work introduces A-TDOM, a near real-time TDOM generation method based on On-the-Fly 3DGS optimization. As each image is acquired, its pose and sparse point cloud are computed via On-the-Fly SfM. Then new Gaussians are integrated and optimized into previously unseen or coarsely reconstructed regions. By integrating with orthogonal splatting, A-TDOM can render just after each update of a new 3DGS field. Initial experiments on multiple benchmarks show that the proposed A-TDOM is capable of actively rendering TDOM in near real-time, with 3DGS optimization for each new image in seconds while maintaining acceptable rendering quality and TDOM geometric accuracy.


[25] DyGLNet: Hybrid Global-Local Feature Fusion with Dynamic Upsampling for Medical Image Segmentation cs.CVPDF

Yican Zhao, Ce Wang, You Hao, Lei Li, Tianli Liao

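TL;DR: DyGLNet fuses global and local features with a dynamic upsampling mechanism for efficient and accurate medical image segmentation. Its SHDCBlock and DyFusionUp modules provide multi-scale feature modeling and high-fidelity reconstruction while reducing computational overhead.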

Details

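Motivation: Medical image segmentation is challenged by multi-scale lesion variability, ill-defined boundaries, and high computational demands. DyGLNet aims to improve segmentation accuracy and efficiency by fusing global and local features and upsampling dynamically.

Result: DyGLNet outperforms existing methods on seven public datasets, particularly in boundary accuracy and small-object segmentation, with lower computational complexity.

Insight: Combining self-attention with multi-scale convolutions captures the complex features of medical images effectively, the dynamic upsampling mechanism improves feature reconstruction, and the lightweight design suits clinical use.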

Abstract: Medical image segmentation grapples with challenges including multi-scale lesion variability, ill-defined tissue boundaries, and computationally intensive processing demands. This paper proposes the DyGLNet, which achieves efficient and accurate segmentation by fusing global and local features with a dynamic upsampling mechanism. The model innovatively designs a hybrid feature extraction module (SHDCBlock), combining single-head self-attention and multi-scale dilated convolutions to model local details and global context collaboratively. We further introduce a dynamic adaptive upsampling module (DyFusionUp) to realize high-fidelity reconstruction of feature maps based on learnable offsets. Then, a lightweight design is adopted to reduce computational overhead. Experiments on seven public datasets demonstrate that DyGLNet outperforms existing methods, particularly excelling in boundary accuracy and small-object segmentation. Meanwhile, it exhibits lower computation complexity, enabling an efficient and reliable solution for clinical medical image analysis. The code will be made available soon.


[26] BATR-FST: Bi-Level Adaptive Token Refinement for Few-Shot Transformers cs.CV | cs.LGPDF

Mohammed Al-Habib, Zuping Zhang, Abdulrahman Noman

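TL;DR: BATR-FST proposes a two-stage Bi-Level Adaptive Token Refinement approach that improves Vision Transformers in few-shot learning, refining token representations and inductive bias across a pre-training phase and a meta-fine-tuning phase.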

Details

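Motivation: In few-shot learning, Vision Transformers struggle to refine token-level interactions, have little training data, and lack a strong inductive bias. Existing methods depend on inflexible token matching or simple similarity measures, which limits the integration of global context and local features.

Result: Experiments on three benchmark few-shot datasets show that BATR-FST achieves superior results in both 1-shot and 5-shot scenarios.

Insight: The bi-level refinement mechanism and the balance between global and local features are key to effective few-shot learning. Graph Token Propagation and the Class Separation Penalty strengthen the model's discriminative capability.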

Abstract: Vision Transformers (ViTs) have shown significant promise in computer vision applications. However, their performance in few-shot learning is limited by challenges in refining token-level interactions, struggling with limited training data, and developing a strong inductive bias. Existing methods often depend on inflexible token matching or basic similarity measures, which limit the effective incorporation of global context and localized feature refinement. To address these challenges, we propose Bi-Level Adaptive Token Refinement for Few-Shot Transformers (BATR-FST), a two-stage approach that progressively improves token representations and maintains a robust inductive bias for few-shot classification. During the pre-training phase, Masked Image Modeling (MIM) provides Vision Transformers (ViTs) with transferable patch-level representations by recreating masked image regions, providing a robust basis for subsequent adaptation. In the meta-fine-tuning phase, BATR-FST incorporates a Bi-Level Adaptive Token Refinement module that utilizes Token Clustering to capture localized interactions, Uncertainty-Aware Token Weighting to prioritize dependable features, and a Bi-Level Attention mechanism to balance intra-cluster and inter-cluster relationships, thereby facilitating thorough token refinement. Furthermore, Graph Token Propagation ensures semantic consistency between support and query instances, while a Class Separation Penalty preserves different class borders, enhancing discriminative capability. Extensive experiments on three benchmark few-shot datasets demonstrate that BATR-FST achieves superior results in both 1-shot and 5-shot scenarios and improves the few-shot classification via transformers.


[27] CECT-Mamba: a Hierarchical Contrast-enhanced-aware Model for Pancreatic Tumor Subtyping from Multi-phase CECT cs.CV | cs.AIPDF

Zhifang Gong, Shuo Gao, Ben Zhao, Yingjing Xu, Yijun Yang

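TL;DR: The paper proposes CECT-Mamba, a Mamba-based hierarchical contrast-enhanced-aware model that automatically classifies pancreatic tumor subtypes from multi-phase CECT data. Spatial and temporal sampling sequences together with a similarity-guided refinement module significantly improve subtyping accuracy.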

Details

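Motivation: The high heterogeneity and variability of pancreatic tumors make precise subtype diagnosis difficult. Existing methods fail to exploit the contextual information across multiple CECT phases, which limits their performance.

Result: On 270 clinical cases, the model distinguishes pancreatic ductal adenocarcinoma (PDAC) from pancreatic neuroendocrine tumors (PNETs) with an accuracy of 97.4% and an AUC of 98.6%.

Insight: Exploiting the learnability and simplicity of Mamba together with multi-phase CECT data can significantly improve pancreatic tumor subtype diagnosis.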

Abstract: Contrast-enhanced computed tomography (CECT) is the primary imaging technique that provides valuable spatial-temporal information about lesions, enabling the accurate diagnosis and subclassification of pancreatic tumors. However, the high heterogeneity and variability of pancreatic tumors still pose substantial challenges for precise subtyping diagnosis. Previous methods fail to effectively explore the contextual information across multiple CECT phases commonly used in radiologists’ diagnostic workflows, thereby limiting their performance. In this paper, we introduce, for the first time, an automatic way to combine the multi-phase CECT data to discriminate between pancreatic tumor subtypes, among which the key is using Mamba with promising learnability and simplicity to encourage both temporal and spatial modeling from multi-phase CECT. Specifically, we propose a dual hierarchical contrast-enhanced-aware Mamba module incorporating two novel spatial and temporal sampling sequences to explore intra and inter-phase contrast variations of lesions. A similarity-guided refinement module is also imposed into the temporal scanning modeling to emphasize the learning on local tumor regions with more obvious temporal variations. Moreover, we design the space complementary integrator and multi-granularity fusion module to encode and aggregate the semantics across different scales, achieving more efficient learning for subtyping pancreatic tumors. The experimental results on an in-house dataset of 270 clinical cases achieve an accuracy of 97.4% and an AUC of 98.6% in distinguishing between pancreatic ductal adenocarcinoma (PDAC) and pancreatic neuroendocrine tumors (PNETs), demonstrating its potential as a more accurate and efficient tool.


[28] Modeling the Multivariate Relationship with Contextualized Representations for Effective Human-Object Interaction Detection cs.CVPDF

Zhehao Li, Yucheng Qian, Chong Wang, Yinghao Lu, Zhihao Yang

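TL;DR: The paper proposes a Contextualized Representation Learning Network that combines affordance-guided reasoning with contextual prompts to improve multivariate relationship modeling in human-object interaction (HOI) detection, performing especially well on interactions that involve tools.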

Details

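Motivation: Existing two-stage HOI detection methods model context incompletely and cannot fully capture complex interactions, such as those that depend on tools.

Result: The method achieves superior performance on the HICO-Det and V-COCO datasets in most scenarios.

Insight: Modeling functional roles (affordances) and aligning language with visual content enables more reliable reasoning about complex interactions.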

Abstract: Human-Object Interaction (HOI) detection aims to simultaneously localize human-object pairs and recognize their interactions. While recent two-stage approaches have made significant progress, they still face challenges due to incomplete context modeling. In this work, we introduce a Contextualized Representation Learning Network that integrates both affordance-guided reasoning and contextual prompts with visual cues to better capture complex interactions. We enhance the conventional HOI detection framework by expanding it beyond simple human-object pairs to include multivariate relationships involving auxiliary entities like tools. Specifically, we explicitly model the functional role (affordance) of these auxiliary objects through triplet structures <human, tool, object>. This enables our model to identify tool-dependent interactions such as ‘filling’. Furthermore, the learnable prompt is enriched with instance categories and subsequently integrated with contextual visual features using an attention mechanism. This process aligns language with image content at both global and regional levels. These contextualized representations equip the model with enriched relational cues for more reliable reasoning over complex, context-dependent interactions. Our proposed method demonstrates superior performance on both the HICO-Det and V-COCO datasets in most scenarios. Codes will be released upon acceptance.


[29] Double Helix Diffusion for Cross-Domain Anomaly Image Generation cs.CVPDF

Linchun Wu, Qin Zou, Xianbiao Qi, Bo Du, Zhongyuan Wang

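TL;DR: The paper proposes Double Helix Diffusion (DH-Diff) for cross-domain anomaly image generation. It addresses the structural inconsistency and feature entanglement of existing methods and significantly improves the realism and diversity of generated images.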

Details

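Motivation: Visual anomaly inspection in manufacturing lacks real anomaly samples, and existing synthesis methods suffer from structural inconsistency and feature entanglement, which limits how well detectors can be trained.

Result: Experiments show that DH-Diff significantly outperforms existing methods in diversity and authenticity and improves downstream anomaly detection performance.

Insight: With a domain-decoupled attention mechanism and semantic alignment, a generative model can mitigate feature entanglement while preserving the structural authenticity of the image.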

Abstract: Visual anomaly inspection is critical in manufacturing, yet hampered by the scarcity of real anomaly samples for training robust detectors. Synthetic data generation presents a viable strategy for data augmentation; however, current methods remain constrained by two principal limitations: 1) the generation of anomalies that are structurally inconsistent with the normal background, and 2) the presence of undesirable feature entanglement between synthesized images and their corresponding annotation masks, which undermines the perceptual realism of the output. This paper introduces Double Helix Diffusion (DH-Diff), a novel cross-domain generative framework designed to simultaneously synthesize high-fidelity anomaly images and their pixel-level annotation masks, explicitly addressing these challenges. DH-Diff employs a unique architecture inspired by a double helix, cycling through distinct modules for feature separation, connection, and merging. Specifically, a domain-decoupled attention mechanism mitigates feature entanglement by enhancing image and annotation features independently, and meanwhile a semantic score map alignment module ensures structural authenticity by coherently integrating anomaly foregrounds. DH-Diff offers flexible control via text prompts and optional graphical guidance. Extensive experiments demonstrate that DH-Diff significantly outperforms state-of-the-art methods in diversity and authenticity, leading to significant improvements in downstream anomaly detection performance.


[30] Superpixel Anything: A general object-based framework for accurate yet regular superpixel segmentation cs.CVPDF

Julien Walther, Rémi Giraud, Michaël Clément

TL;DR: The paper proposes SPAM (SuperPixel Anything Model), a general superpixel segmentation framework that achieves high accuracy while keeping superpixels regular.

Details

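Motivation: Traditional superpixel methods rely on low-level features, while deep learning methods exploit high-level features but sacrifice superpixel regularity. SPAM aims to balance the two.

Result: Experiments show that SPAM outperforms existing methods on segmentation tasks both qualitatively and quantitatively.

Insight: Combining high-level features with regularity constraints can significantly improve the accuracy and practicality of superpixel segmentation.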

Abstract: Superpixels are widely used in computer vision to simplify image representation and reduce computational complexity. While traditional methods rely on low-level features, deep learning-based approaches leverage high-level features but also tend to sacrifice regularity of superpixels to capture complex objects, leading to accurate but less interpretable segmentations. In this work, we introduce SPAM (SuperPixel Anything Model), a versatile framework for segmenting images into accurate yet regular superpixels. We train a model to extract image features for superpixel generation, and at inference, we leverage a large-scale pretrained model for semantic-agnostic segmentation to ensure that superpixels align with object masks. SPAM can handle any prior high-level segmentation, resolving uncertainty regions, and is able to interactively focus on specific objects. Comprehensive experiments demonstrate that SPAM qualitatively and quantitatively outperforms state-of-the-art methods on segmentation tasks, making it a valuable and robust tool for various applications. Code and pre-trained models are available here: https://github.com/waldo-j/spam.


[31] SAGA: Selective Adaptive Gating for Efficient and Expressive Linear Attention cs.CVPDF

Yuan Cao, Dong Wang

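TL;DR: The paper proposes SAGA, which improves linear attention with selective adaptive gating. It addresses the feature redundancy and query-alignment problems of conventional linear attention and significantly improves computational efficiency and model performance.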

Details

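Motivation: Softmax attention in Transformers becomes a bottleneck on high-resolution images because of its quadratic complexity, while existing linear attention methods compress KV information uniformly, causing feature redundancy and loss of alignment with the query.

Result: At a resolution of 1280×1280, SAGA achieves a 1.76× improvement in throughput and a 2.69× reduction in peak GPU memory compared to PVT-T, and it improves top-1 accuracy on ImageNet by up to 4.4%.

Insight: Selective gating improves how linear attention aggregates information, delivering both efficiency and expressiveness and outperforming conventional uniform compression.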

Abstract: While the Transformer architecture excels at modeling long-range dependencies, contributing to its widespread adoption in vision tasks, the quadratic complexity of softmax-based attention mechanisms imposes a major bottleneck, particularly when processing high-resolution images. Linear attention presents a promising alternative by reformulating the attention computation from $(QK)V$ to $Q(KV)$, thereby reducing the complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$ while preserving the global receptive field. However, most existing methods compress historical key-value (KV) information uniformly, which can lead to feature redundancy and the loss of directional alignment with the query (Q). This uniform compression results in low-rank $KV$ feature maps, contributing to a performance gap compared to softmax attention. To mitigate this limitation, we propose Selective Adaptive GAting for Efficient and Expressive Linear Attention (SAGA), which introduces input-adaptive learnable gates to selectively modulate information aggregation into the $KV$ feature map. These gates enhance semantic diversity and alleviate the low-rank constraint inherent in conventional linear attention. Additionally, we propose an efficient Hadamard-product decomposition method for gate computation, which introduces no additional memory overhead. Experiments demonstrate that SAGA achieves a 1.76$\times$ improvement in throughput and a 2.69$\times$ reduction in peak GPU memory compared to PVT-T at a resolution of $1280 \times 1280$. Moreover, it improves top-1 accuracy by up to 4.4% on the ImageNet dataset, demonstrating both computational efficiency and model effectiveness.
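The reordering from $(QK)V$ to $Q(KV)$ rests only on the associativity of matrix multiplication once the softmax is dropped. The sketch below checks this on tiny matrices in plain Python. It omits the feature map and normalization that practical linear attention uses, and the per-token scalar `gates` argument is a hypothetical stand-in for gating the aggregation into the KV summary, not SAGA's actual gate design.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def quadratic_attention(Q, K, V):
    """(Q K^T) V: forms an N x N score matrix first, so cost grows as N^2."""
    return matmul(matmul(Q, transpose(K)), V)


def linear_attention(Q, K, V, gates=None):
    """Q (K^T V): forms a d x d_v summary first, so cost grows linearly in N.

    gates is a hypothetical per-token weight applied to each k_i^T v_i term.
    """
    if gates is None:
        gates = [1.0] * len(K)
    gated_K = [[g * x for x in row] for g, row in zip(gates, K)]
    return matmul(Q, matmul(transpose(gated_K), V))


# Three tokens with two-dimensional queries, keys, and values (made-up numbers).
Q = [[1, 2], [3, 4], [5, 6]]
K = [[1, 0], [0, 1], [1, 1]]
V = [[1, 2], [3, 4], [5, 6]]
same_result = quadratic_attention(Q, K, V) == linear_attention(Q, K, V)
```

With all gates equal to one, every token contributes equally to the summary, which corresponds to the uniform compression the abstract criticizes. Setting a gate to zero removes that token's contribution entirely.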


[32] Data Scaling Laws for Radiology Foundation Models cs.CV | cs.AIPDF

Maximilian Ilse, Harshita Sharma, Anton Schwaighofer, Sam Bond-Taylor, Fernando Pérez-García

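TL;DR: The paper studies how medical imaging foundation models behave at large data scale, comparing continual pretraining of two vision encoders (MI2 and RAD-DINO) on chest X-ray data. MI2 performs better on finding-related tasks, while RAD-DINO is stronger on tube-related tasks. The results also highlight the value of structured supervision and center-specific continual pretraining.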

Details

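Motivation: Medical imaging foundation models are usually trained on relatively small datasets, which limits our understanding of how performance depends on data scale. The paper explores the effect of data scale on these models.

Result: MI2 scales more effectively for finding-related tasks, while RAD-DINO is stronger on tube-related tasks. Continually pretraining MI2 with both reports and structured labels using UniCL improves performance further. For some tasks, as few as 30k in-domain samples are enough to surpass open-weights foundation models.

Insight: Medical institutions can obtain significant performance gains through center-specific continual pretraining and structured supervision, without relying on large public datasets.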

Abstract: Foundation vision encoders such as CLIP and DINOv2, trained on web-scale data, exhibit strong transfer performance across tasks and datasets. However, medical imaging foundation models remain constrained by smaller datasets, limiting our understanding of how data scale and pretraining paradigms affect performance in this setting. In this work, we systematically study continual pretraining of two vision encoders, MedImageInsight (MI2) and RAD-DINO representing the two major encoder paradigms CLIP and DINOv2, on up to 3.5M chest x-rays from a single institution, holding compute and evaluation protocols constant. We evaluate on classification (radiology findings, lines and tubes), segmentation (lines and tubes), and radiology report generation. While prior work has primarily focused on tasks related to radiology findings, we include lines and tubes tasks to counterbalance this bias and evaluate a model’s ability to extract features that preserve continuity along elongated structures. Our experiments show that MI2 scales more effectively for finding-related tasks, while RAD-DINO is stronger on tube-related tasks. Surprisingly, continually pretraining MI2 with both reports and structured labels using UniCL improves performance, underscoring the value of structured supervision at scale. We further show that for some tasks, as few as 30k in-domain samples are sufficient to surpass open-weights foundation models. These results highlight the utility of center-specific continual pretraining, enabling medical institutions to derive significant performance gains by utilizing in-domain data.


[33] Exploring Metric Fusion for Evaluation of NeRFs cs.CVPDF

Shreyas Shivakumara, Gabriel Eilertsen, Karljohan Lundin Palmerius

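TL;DR: The paper studies how to combine two metrics, DISTS and VMAF, to assess the quality of NeRF-generated images, using normalization and fusion strategies to improve the correlation between the metric and subjective scores.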

Details

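Motivation: NeRF-generated images exhibit unique artifacts, and no single existing metric performs well across all datasets. Combining metrics based on different perceptual methods may therefore improve evaluation.

Result: The fusion metrics are tested on two datasets under three configurations to assess their robustness and generalizability, and their correlation with subjective scores is significantly better than that of the individual metrics.

Insight: Metric fusion can compensate for the limitations of individual metrics. Combining metrics based on different perceptual methods gives a more complete assessment of NeRF-generated image quality.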

Abstract: Neural Radiance Fields (NeRFs) have demonstrated significant potential in synthesizing novel viewpoints. Evaluating the NeRF-generated outputs, however, remains a challenge due to the unique artifacts they exhibit, and no individual metric performs well across all datasets. We hypothesize that combining two successful metrics, Deep Image Structure and Texture Similarity (DISTS) and Video Multi-Method Assessment Fusion (VMAF), based on different perceptual methods, can overcome the limitations of individual metrics and achieve improved correlation with subjective quality scores. We experiment with two normalization strategies for the individual metrics and two fusion strategies to evaluate their impact on the resulting correlation with the subjective scores. The proposed pipeline is tested on two distinct datasets, Synthetic and Outdoor, and its performance is evaluated across three different configurations. We present a detailed analysis comparing the correlation coefficients of fusion methods and individual scores with subjective scores to demonstrate the robustness and generalizability of the fusion metrics.
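The normalize-then-fuse pipeline described above can be sketched as follows. The abstract does not name its two normalization strategies or its two fusion strategies, so the min-max and z-score normalizations and the arithmetic-mean fusion here are assumptions chosen for illustration. The only metric-specific fact used is that DISTS is a distance (lower is better) while VMAF is a quality score (higher is better), so DISTS is negated before normalization.

```python
import math


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standardize values to zero mean and unit (population) variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def fuse(dists, vmaf, normalize=min_max):
    """Average the two normalized metrics so that higher means better quality."""
    quality_from_dists = normalize([-d for d in dists])  # flip the distance
    quality_from_vmaf = normalize(vmaf)
    return [(a + b) / 2 for a, b in zip(quality_from_dists, quality_from_vmaf)]


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

In use, one would compute `fuse(dists, vmaf)` over all rendered images of a dataset and compare `pearson(fused, subjective_scores)` against the same correlation for each metric alone. A rank correlation could replace Pearson's coefficient without changing the structure.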


[34] Leveraging Large Language Models to Effectively Generate Visual Data for Canine Musculoskeletal Diagnoses cs.CVPDF

Martin Thißen, Thi Ngoc Diep Tran, Barbara Esteve Ratsch, Ben Joel Schönbein, Ute Trapp

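TL;DR: The paper explores using large language models (LLMs) to generate synthetic visual data for canine musculoskeletal diagnoses, addressing the scarcity of real data. Visual annotations are mapped to the textual domain and combined with several prompting techniques, and a model trained on the resulting synthetic data performs well on real data.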

Details

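Motivation: Real data for some tasks (such as diagnosing rare diseases) is scarce and expensive to collect. The paper explores whether LLM-generated synthetic visual data can supplement training sets and improve AI model performance.

Result: A model trained solely on the synthetic data achieves an F1 score of 88% on 70 real-world documentations. The generated documentations are sensitive to the location and severity of the diagnosis while remaining independent of the dog’s sex.

Insight: LLM-generated synthetic data can effectively alleviate data scarcity, especially for rare diseases. Although the methodology is tailored to the medical domain, it can be adapted to other fields.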

Abstract: It is well-established that more data generally improves AI model performance. However, data collection can be challenging for certain tasks due to the rarity of occurrences or high costs. These challenges are evident in our use case, where we apply AI models to a novel approach for visually documenting the musculoskeletal condition of dogs. Here, abnormalities are marked as colored strokes on a body map of a dog. Since these strokes correspond to distinct muscles or joints, they can be mapped to the textual domain in which large language models (LLMs) operate. LLMs have demonstrated impressive capabilities across a wide range of tasks, including medical applications, offering promising potential for generating synthetic training data. In this work, we investigate whether LLMs can effectively generate synthetic visual training data for canine musculoskeletal diagnoses. For this, we developed a mapping that segments visual documentations into over 200 labeled regions representing muscles or joints. Using techniques like guided decoding, chain-of-thought reasoning, and few-shot prompting, we generated 1,000 synthetic visual documentations for patellar luxation (kneecap dislocation) diagnosis, the diagnosis for which we have the most real-world data. Our analysis shows that the generated documentations are sensitive to location and severity of the diagnosis while remaining independent of the dog’s sex. We further generated 1,000 visual documentations for various other diagnoses to create a binary classification dataset. A model trained solely on this synthetic data achieved an F1 score of 88% on 70 real-world documentations. These results demonstrate the potential of LLM-generated synthetic data, which is particularly valuable for addressing data scarcity in rare diseases. While our methodology is tailored to the medical domain, the insights and techniques can be adapted to other fields.


[35] Lego-Edit: A General Image Editing Framework with Model-Level Bricks and MLLM Builder cs.CVPDF

Qifei Jia, Yu Liu, Yajie Chai, Xintong Yao, Qiming Lu

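TL;DR: Lego-Edit is an image editing framework built on a multimodal large language model (MLLM). Using a model-level toolkit and progressive reinforcement learning, it provides general editing ability for open-domain user instructions and achieves state-of-the-art performance on several benchmarks.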

Details

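Motivation: Existing instruction-based image editing methods struggle to generalize to diverse user instructions outside their training domain, which limits their practical application.

Result: Lego-Edit achieves state-of-the-art performance on GEdit-Bench and ImgBench and can use newly introduced tools without additional fine-tuning.

Insight: Combining the general capabilities of an MLLM with model-level tools can significantly improve the adaptability and flexibility of an image editing system on open-domain instructions.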

Abstract: Instruction-based image editing has garnered significant attention due to its direct interaction with users. However, real-world user instructions are immensely diverse, and existing methods often fail to generalize effectively to instructions outside their training domain, limiting their practical application. To address this, we propose Lego-Edit, which leverages the generalization capability of Multi-modal Large Language Model (MLLM) to organize a suite of model-level editing tools to tackle this challenge. Lego-Edit incorporates two key designs: (1) a model-level toolkit comprising diverse models efficiently trained on limited data and several image manipulation functions, enabling fine-grained composition of editing actions by the MLLM; and (2) a three-stage progressive reinforcement learning approach that uses feedback on unannotated, open-domain instructions to train the MLLM, equipping it with generalized reasoning capabilities for handling real-world instructions. Experiments demonstrate that Lego-Edit achieves state-of-the-art performance on GEdit-Bench and ImgBench. It exhibits robust reasoning capabilities for open-domain instructions and can utilize newly introduced editing tools without additional fine-tuning. Code is available: https://github.com/xiaomi-research/lego-edit.


[36] Runge-Kutta Approximation and Decoupled Attention for Rectified Flow Inversion and Semantic Editing cs.CV | cs.AIPDF

Weiming Chen, Zhihan Zhu, Yijia Wang, Zhihai He

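TL;DR: The paper proposes a high-order inversion method based on a Runge-Kutta solver to improve the inversion accuracy of rectified flow models, and introduces the DDTA mechanism to decouple multimodal attention for better semantic control, achieving both high fidelity and editability.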

Details

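Motivation: Rectified flow models outperform DDIM-based diffusion models in generative performance, but in practice they suffer from low inversion accuracy and entangled multimodal attention, which limit consistency with the source image and semantic control.

Result: On image reconstruction and text-guided editing tasks, the method achieves state-of-the-art performance in fidelity and editability.

Insight: Runge-Kutta methods can improve inversion accuracy efficiently, and attention decoupling is a key direction for improving multimodal diffusion models.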

Abstract: Rectified flow (RF) models have recently demonstrated superior generative performance compared to DDIM-based diffusion models. However, in real-world applications, they suffer from two major challenges: (1) low inversion accuracy that hinders the consistency with the source image, and (2) entangled multimodal attention in diffusion transformers, which hinders precise attention control. To address the first challenge, we propose an efficient high-order inversion method for rectified flow models based on the Runge-Kutta solver of differential equations. To tackle the second challenge, we introduce Decoupled Diffusion Transformer Attention (DDTA), a novel mechanism that disentangles text and image attention inside the multimodal diffusion transformers, enabling more precise semantic control. Extensive experiments on image reconstruction and text-guided editing tasks demonstrate that our method achieves state-of-the-art performance in terms of fidelity and editability. Code is available at https://github.com/wmchen/RKSovler_DDTA.
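To see why a higher-order solver helps inversion, recall that a rectified flow model defines an ODE dx/dt = v(x, t), and inversion amounts to integrating that ODE in the opposite time direction. The sketch below compares a first-order Euler step with the classical fourth-order Runge-Kutta (RK4) step on a round trip (integrate one way, then back). The scalar state and the toy velocity field v(x, t) = -x are stand-ins for a trained network, and classical RK4 is an assumed choice; the abstract states only that the method is based on a Runge-Kutta solver.

```python
def euler_step(velocity, x, t, h):
    """First-order step: one velocity evaluation."""
    return x + h * velocity(x, t)


def rk4_step(velocity, x, t, h):
    """Classical fourth-order Runge-Kutta step: four velocity evaluations."""
    k1 = velocity(x, t)
    k2 = velocity(x + 0.5 * h * k1, t + 0.5 * h)
    k3 = velocity(x + 0.5 * h * k2, t + 0.5 * h)
    k4 = velocity(x + h * k3, t + h)
    return x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def integrate(step, velocity, x, t_start, t_end, num_steps):
    """Integrate from t_start to t_end; a negative step size runs in reverse."""
    h = (t_end - t_start) / num_steps
    t = t_start
    for _ in range(num_steps):
        x = step(velocity, x, t, h)
        t += h
    return x


def toy_velocity(x, t):
    """Hypothetical velocity field standing in for a trained model."""
    return -x


def round_trip_error(step, x0=1.0, num_steps=10):
    """Integrate from t=0 to t=1 and back, then measure the mismatch with x0."""
    x1 = integrate(step, toy_velocity, x0, 0.0, 1.0, num_steps)
    recovered = integrate(step, toy_velocity, x1, 1.0, 0.0, num_steps)
    return abs(recovered - x0)
```

On this toy problem the RK4 round trip recovers the starting point far more closely than Euler at the same step count, at the price of four velocity evaluations per step instead of one.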


[37] MEJO: MLLM-Engaged Surgical Triplet Recognition via Inter- and Intra-Task Joint Optimization cs.CV

Yiyi Zhang, Yuchen Yuan, Ying Zheng, Jialun Pei, Jinpeng Li

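TL;DR: The paper proposes MEJO, a framework that addresses the long-tailed distribution problem in surgical triplet recognition through inter- and intra-task joint optimization. It uses an MLLM to enrich semantic features and coordinates gradient learning, and performs strongly on the CholecT45 and CholecT50 datasets.

Details

Motivation: Surgical triplet recognition (instrument, verb, target, and their combinations) suffers from long-tailed data distributions. Existing multi-task learning methods face both inter-task and intra-task optimization conflicts, so a more effective joint optimization strategy is needed.

Result: MEJO outperforms baseline methods on the CholecT45 and CholecT50 datasets, validating the effectiveness of the framework.

Insight: Jointly resolving inter- and intra-task conflicts, combined with high-level semantic information from an MLLM, can effectively improve performance on long-tailed datasets.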

Abstract: Surgical triplet recognition, which involves identifying instrument, verb, target, and their combinations, is a complex surgical scene understanding challenge plagued by long-tailed data distribution. The mainstream multi-task learning paradigm benefiting from cross-task collaborative promotion has shown promising performance in identifying triplets, but two key challenges remain: 1) inter-task optimization conflicts caused by entangling task-generic and task-specific representations; 2) intra-task optimization conflicts due to class-imbalanced training data. To overcome these difficulties, we propose the MLLM-Engaged Joint Optimization (MEJO) framework that empowers both inter- and intra-task optimization for surgical triplet recognition. For inter-task optimization, we introduce the Shared-Specific-Disentangled (S$^2$D) learning scheme that decomposes representations into task-shared and task-specific components. To enhance task-shared representations, we construct a Multimodal Large Language Model (MLLM) powered probabilistic prompt pool to dynamically augment visual features with expert-level semantic cues. Additionally, comprehensive task-specific cues are modeled via distinct task prompts covering the temporal-spatial dimensions, effectively mitigating inter-task ambiguities. To tackle intra-task optimization conflicts, we develop a Coordinated Gradient Learning (CGL) strategy, which dissects and rebalances the positive-negative gradients originating from head and tail classes for more coordinated learning behaviors. Extensive experiments on the CholecT45 and CholecT50 datasets demonstrate the superiority of our proposed framework, validating its effectiveness in handling optimization conflicts.


[38] Cross-Layer Vision Smoothing: Enhancing Visual Understanding via Sustained Focus on Key Objects in Large Vision-Language Models cs.CV | cs.AI

Jianfei Zhao, Feng Zhang, Xin Sun, Lingxing Kong, Zhixing Tan

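TL;DR: The paper proposes Cross-Layer Vision Smoothing (CLVS), which strengthens the visual understanding of large vision-language models (LVLMs) by sustaining focus on key objects; experiments confirm its effectiveness across multiple tasks.

Details

Motivation: LVLMs attend to key objects in an image only briefly; the authors hypothesize that sustained focus on these objects can improve visual capabilities.

Result: Evaluated on four benchmarks across three LVLMs, CLVS achieves state-of-the-art performance on visual understanding tasks, with especially notable gains in relation and attribute understanding.

Insight: Sustained, smoothed attention on key objects effectively improves the visual capabilities of LVLMs, particularly on multi-level tasks.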

Abstract: Large Vision-Language Models (LVLMs) can accurately locate key objects in images, yet their attention to these objects tends to be very brief. Motivated by the hypothesis that sustained focus on key objects can improve LVLMs’ visual capabilities, we propose Cross-Layer Vision Smoothing (CLVS). The core idea of CLVS is to incorporate a vision memory that smooths the attention distribution across layers. Specifically, we initialize this vision memory with position-unbiased visual attention in the first layer. In subsequent layers, the model’s visual attention jointly considers the vision memory from previous layers, while the memory is updated iteratively, thereby maintaining smooth attention on key objects. Given that visual understanding primarily occurs in the early and middle layers of the model, we use uncertainty as an indicator of completed visual understanding and terminate the smoothing process accordingly. Experiments on four benchmarks across three LVLMs confirm the effectiveness and generalizability of our method. CLVS achieves state-of-the-art performance on a variety of visual understanding tasks, with particularly significant improvements in relation and attribute understanding.


[39] MSGFusion: Multimodal Scene Graph-Guided Infrared and Visible Image Fusion cs.CV

Guihui Li, Bowei Dong, Kaizhi Dong, Jiayi Li, Haiyong Zheng

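TL;DR: MSGFusion is an infrared and visible image fusion framework guided by multimodal scene graphs. By combining structured information from text and vision, it significantly improves the semantic consistency and detail preservation of fused images.

Details

Motivation: Current infrared and visible image fusion methods rely mainly on low-level visual features (such as texture and contrast) and struggle to capture high-level semantic information. Methods that add text guidance still do not explicitly model entities, attributes, and relationships, which limits fusion performance.

Result: On multiple public datasets, MSGFusion outperforms existing methods in detail preservation and structural clarity, and shows better semantic consistency and generalization in downstream tasks such as low-light object detection, semantic segmentation, and medical image fusion.

Insight: Structured scene graphs can effectively bridge low-level features and high-level semantics, offering a new research direction for multimodal image fusion.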

Abstract: Infrared and visible image fusion has garnered considerable attention owing to the strong complementarity of these two modalities in complex, harsh environments. While deep learning-based fusion methods have made remarkable advances in feature extraction, alignment, fusion, and reconstruction, they still depend largely on low-level visual cues, such as texture and contrast, and struggle to capture the high-level semantic information embedded in images. Recent attempts to incorporate text as a source of semantic guidance have relied on unstructured descriptions that neither explicitly model entities, attributes, and relationships nor provide spatial localization, thereby limiting fine-grained fusion performance. To overcome these challenges, we introduce MSGFusion, a multimodal scene graph-guided fusion framework for infrared and visible imagery. By deeply coupling structured scene graphs derived from text and vision, MSGFusion explicitly represents entities, attributes, and spatial relations, and then synchronously refines high-level semantics and low-level details through successive modules for scene graph representation, hierarchical aggregation, and graph-driven fusion. Extensive experiments on multiple public benchmarks show that MSGFusion significantly outperforms state-of-the-art approaches, particularly in detail preservation and structural clarity, and delivers superior semantic consistency and generalizability in downstream tasks such as low-light object detection, semantic segmentation, and medical image fusion.


[40] T-SiamTPN: Temporal Siamese Transformer Pyramid Networks for Robust and Efficient UAV Tracking cs.CV

Hojat Ardi, Amir Jahanshahi, Ali Diba

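TL;DR: T-SiamTPN is a temporal-aware Siamese tracking framework that uses explicit temporal modeling to address the weak handling of temporal dependencies in existing trackers, significantly improving performance and robustness.

Details

Motivation: Existing Siamese trackers rely mainly on spatial cues and overlook temporal dependencies, which limits robustness in long-term tracking and under occlusion. In addition, correlation operations restrict their ability to handle non-linear appearance changes.

Result: The tracker runs in real time on a Jetson Nano (7.1 FPS), improves success rate by 13.7% and precision by 14.7% over the baseline, and is competitive with state-of-the-art methods.

Insight: (1) Temporal modeling is crucial for tracking robustness; (2) a lightweight design can balance performance and efficiency, making it suitable for real-world deployment.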

Abstract: Aerial object tracking remains a challenging task due to scale variations, dynamic backgrounds, clutter, and frequent occlusions. While most existing trackers emphasize spatial cues, they often overlook temporal dependencies, resulting in limited robustness in long-term tracking and under occlusion. Furthermore, correlation-based Siamese trackers are inherently constrained by the linear nature of correlation operations, making them ineffective against complex, non-linear appearance changes. To address these limitations, we introduce T-SiamTPN, a temporal-aware Siamese tracking framework that extends the SiamTPN architecture with explicit temporal modeling. Our approach incorporates temporal feature fusion and attention-based interactions, strengthening temporal consistency and enabling richer feature representations. These enhancements yield significant improvements over the baseline and achieve performance competitive with state-of-the-art trackers. Crucially, despite the added temporal modules, T-SiamTPN preserves computational efficiency. Deployed on the resource-constrained Jetson Nano, the tracker runs in real time at 7.1 FPS, demonstrating its suitability for real-world embedded applications without notable runtime overhead. Experimental results highlight substantial gains: compared to the baseline, T-SiamTPN improves success rate by 13.7% and precision by 14.7%. These findings underscore the importance of temporal modeling in Siamese tracking frameworks and establish T-SiamTPN as a strong and efficient solution for aerial object tracking. Code is available at: https://github.com/to/be/released


[41] MATTER: Multiscale Attention for Registration Error Regression cs.CV

Shipeng Liu, Ziliang Xiong, Khac-Hoang Ngo, Per-Erik Forssén

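TL;DR: The paper proposes MATTER, a regression-based method for point cloud registration (PCR) quality validation that uses multiscale feature extraction and attention-based aggregation to significantly improve the accuracy and robustness of registration error estimation.

Details

Motivation: Existing PCR quality validation methods treat the problem as classification, which yields coarse results. The authors use regression for finer-grained quantification and extend feature extraction to improve performance, especially on point clouds with heterogeneous spatial density.

Result: Across diverse datasets, MATTER significantly outperforms existing classification-based methods, especially on point clouds with heterogeneous spatial density. It also improves the quality of downstream tasks such as mapping.

Insight: Regression recovers information that coarse classification discards; multiscale extraction and attention effectively capture the features of complex point cloud data.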

Abstract: Point cloud registration (PCR) is crucial for many downstream tasks, such as simultaneous localization and mapping (SLAM) and object tracking. This makes detecting and quantifying registration misalignment, i.e., PCR quality validation, an important task. All existing methods treat validation as a classification task, aiming to assign the PCR quality to a few classes. In this work, we instead use regression for PCR validation, allowing for a more fine-grained quantification of the registration quality. We also extend previously used misalignment-related features by using multiscale extraction and attention-based aggregation. This leads to accurate and robust registration error estimation on diverse datasets, especially for point clouds with heterogeneous spatial densities. Furthermore, when used to guide a mapping downstream task, our method significantly improves the mapping quality for a given amount of re-registered frames, compared to the state-of-the-art classification-based method.


[42] 4DRadar-GS: Self-Supervised Dynamic Driving Scene Reconstruction with 4D Radar cs.CV

Xiao Tang, Guirong Zhuo, Cong Wang, Boyuan Zheng, Minqing Huang

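TL;DR: 4DRadar-GS is a self-supervised framework for dynamic driving scene reconstruction that uses 4D radar. By exploiting the velocity and spatial information from 4D radar, it accurately segments dynamic objects and recovers depth, improving the accuracy and temporal consistency of dynamic scene reconstruction.

Details

Motivation: Existing methods reconstruct dynamic objects poorly because of imprecise motion estimation and weak temporal consistency. Incorporating 4D radar is a promising way to improve reconstruction quality for dynamic driving scenes.

Result: The method achieves state-of-the-art 3D reconstruction performance for dynamic driving scenes on the OmniHD-Scenes dataset.

Insight: Doppler information from 4D radar provides key motion cues for reconstructing dynamic objects, and combining it with self-supervised learning can significantly improve the modeling accuracy of dynamic scenes.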

Abstract: 3D reconstruction and novel view synthesis are critical for validating autonomous driving systems and training advanced perception models. Recent self-supervised methods have gained significant attention due to their cost-effectiveness and enhanced generalization in scenarios where annotated bounding boxes are unavailable. However, existing approaches, which often rely on frequency-domain decoupling or optical flow, struggle to accurately reconstruct dynamic objects due to imprecise motion estimation and weak temporal consistency, resulting in incomplete or distorted representations of dynamic scene elements. To address these challenges, we propose 4DRadar-GS, a 4D Radar-augmented self-supervised 3D reconstruction framework tailored for dynamic driving scenes. Specifically, we first present a 4D Radar-assisted Gaussian initialization scheme that leverages 4D Radar’s velocity and spatial information to segment dynamic objects and recover monocular depth scale, generating accurate Gaussian point representations. In addition, we propose a Velocity-guided PointTrack (VGPT) model, which is jointly trained with the reconstruction pipeline under scene flow supervision, to track fine-grained dynamic trajectories and construct temporally consistent representations. Evaluated on the OmniHD-Scenes dataset, 4DRadar-GS achieves state-of-the-art performance in dynamic driving scene 3D reconstruction.


[43] Time-step Mixup for Efficient Spiking Knowledge Transfer from Appearance to Event Domain cs.CV

Yuqi Xie, Shuhan Ye, Chong Wang, Jiazhen Xu, Le Shen

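TL;DR: The paper proposes Time-step Mixup Knowledge Transfer (TMKT), which mixes RGB and DVS inputs across time steps to address the modality gap when training spiking neural networks on event camera data.

Details

Motivation: Combining event cameras with spiking neural networks promises energy-efficient visual processing, but scarce event data and the sparsity of DVS outputs limit training. Existing methods overlook the distribution gap between the RGB and DVS modalities.

Result: Experiments on multiple datasets validate the method, which achieves superior spiking image classification performance.

Insight: Time-step mixing effectively alleviates modality shift during training, and modality-aware auxiliary objectives strengthen cross-modal discrimination.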

Abstract: The integration of event cameras and spiking neural networks holds great promise for energy-efficient visual processing. However, the limited availability of event data and the sparse nature of DVS outputs pose challenges for effective training. Although some prior work has attempted to transfer semantic knowledge from RGB datasets to DVS, it often overlooks the significant distribution gap between the two modalities. In this paper, we propose Time-step Mixup knowledge transfer (TMKT), a novel fine-grained mixing strategy that exploits the asynchronous nature of SNNs by interpolating RGB and DVS inputs at various time-steps. To enable label mixing in cross-modal scenarios, we further introduce modality-aware auxiliary learning objectives. These objectives support the time-step mixup process and enhance the model’s ability to discriminate effectively across different modalities. Our approach enables smoother knowledge transfer, alleviates modality shift during training, and achieves superior performance in spiking image classification tasks. Extensive experiments demonstrate the effectiveness of our method across multiple datasets. The code will be released after the double-blind review process.
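
One possible reading of mixing at the time-step level can be sketched as follows: for a sequence of T time steps fed to a spiking network, a random subset of steps takes its input from the RGB stream and the rest from the DVS stream. This is an assumption-laden illustration, not the authors' implementation: the function name `time_step_mixup`, the swap-whole-steps rule, and the use of the RGB fraction as a mixing coefficient are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Frame = List[float]  # a flattened frame; real inputs would be tensors


def time_step_mixup(
    rgb_frames: Sequence[Frame],
    dvs_frames: Sequence[Frame],
    rgb_ratio: float,
    rng: random.Random,
) -> Tuple[List[Frame], List[int], float]:
    """Build a mixed sequence by taking some time steps from RGB and the rest from DVS.

    Returns the mixed frames, a 0/1 mask (1 = step came from RGB), and the
    fraction of RGB steps, which could serve as a label-mixing coefficient.
    """
    if len(rgb_frames) != len(dvs_frames):
        raise ValueError("both streams must have the same number of time steps")
    if not 0.0 <= rgb_ratio <= 1.0:
        raise ValueError("rgb_ratio must lie in [0, 1]")

    num_steps = len(rgb_frames)
    num_rgb = round(rgb_ratio * num_steps)
    rgb_steps = set(rng.sample(range(num_steps), num_rgb))

    mask = [1 if t in rgb_steps else 0 for t in range(num_steps)]
    mixed = [rgb_frames[t] if mask[t] else dvs_frames[t] for t in range(num_steps)]
    mix_coefficient = num_rgb / num_steps if num_steps else 0.0
    return mixed, mask, mix_coefficient


# Usage: four time steps, half of them drawn from the RGB stream.
rgb = [[1.0, 1.0] for _ in range(4)]
dvs = [[0.0, 0.0] for _ in range(4)]
mixed, mask, coefficient = time_step_mixup(rgb, dvs, 0.5, random.Random(0))
print(mask, coefficient)
```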


[44] MMMS: Multi-Modal Multi-Surface Interactive Segmentation cs.CV | cs.LG

Robin Schön, Julian Lorenz, Katja Ludwig, Daniel Kienzle, Rainer Lienhart

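TL;DR: The paper proposes MMMS, a method that generates segmentation masks interactively from user clicks, with a focus on segmenting multiple entangled surfaces in the same image, and introduces a new evaluation metric.

Details

Motivation: Existing segmentation methods perform poorly on multiple entangled or adjacent surfaces in the same image and lack a suitable evaluation metric. Multi-modal inputs could improve segmentation, but existing methods do not fully exploit them.

Result: Multi-modal inputs significantly improve performance, reducing NoC@90 by up to 1.28 clicks per surface on DeLiVER and up to 1.19 on MFNet. The RGB-only baseline also performs competitively in the single-mask scenario.

Insight: Multi-modal fusion and late integration of interaction information effectively improve segmentation accuracy, especially in complex scenes.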

Abstract: In this paper, we present a method to interactively create segmentation masks on the basis of user clicks. We pay particular attention to the segmentation of multiple surfaces that are simultaneously present in the same image. Since these surfaces may be heavily entangled and adjacent, we also present a novel extended evaluation metric that accounts for the challenges of this scenario. Additionally, the presented method is able to use multi-modal inputs to facilitate the segmentation task. At the center of this method is a network architecture which takes as input an RGB image, a number of non-RGB modalities, an erroneous mask, and encoded clicks. Based on this input, the network predicts an improved segmentation mask. We design our architecture such that it adheres to two conditions: (1) The RGB backbone is only available as a black-box. (2) To reduce the response time, we want our model to integrate the interaction-specific information after the image feature extraction and the multi-modal fusion. We refer to the overall task as Multi-Modal Multi-Surface interactive segmentation (MMMS). We are able to show the effectiveness of our multi-modal fusion strategy. Using additional modalities, our system reduces the NoC@90 by up to 1.28 clicks per surface on average on DeLiVER and up to 1.19 on MFNet. On top of this, we are able to show that our RGB-only baseline achieves competitive, and in some cases even superior performance when tested in a classical, single-mask interactive segmentation scenario.
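
For readers unfamiliar with NoC@90, the classical single-mask version of the metric counts how many clicks are needed before the predicted mask reaches an IoU of at least 0.90 with the ground truth. The sketch below shows that classical form only; it does not reproduce the extended multi-surface metric introduced in the paper, and the cap of 20 clicks is a common convention assumed here, not a value taken from the abstract.

```python
from typing import Sequence


def number_of_clicks(
    ious_after_each_click: Sequence[float],
    threshold: float = 0.90,
    max_clicks: int = 20,
) -> int:
    """Return the 1-based index of the first click whose IoU reaches the threshold.

    If the threshold is never reached within max_clicks, return max_clicks.
    """
    for click_index, iou in enumerate(ious_after_each_click[:max_clicks], start=1):
        if iou >= threshold:
            return click_index
    return max_clicks


def mean_number_of_clicks(
    iou_curves: Sequence[Sequence[float]],
    threshold: float = 0.90,
    max_clicks: int = 20,
) -> float:
    """Average NoC over several objects (or surfaces), one IoU curve each."""
    counts = [number_of_clicks(curve, threshold, max_clicks) for curve in iou_curves]
    return sum(counts) / len(counts)


# Usage: the first object needs 3 clicks, the second needs 1.
curves = [[0.50, 0.80, 0.92], [0.95]]
print(mean_number_of_clicks(curves))
```

A reduction of 1.28 in this average therefore means the user needs, on average, about one and a quarter fewer corrections per surface.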


[45] SHREC 2025: Protein surface shape retrieval including electrostatic potential cs.CV | q-bio.BM | I.3.8; I.5.4; J.3

Taher Yacoub, Camille Depenveiller, Atsushi Tatsuma, Tin Barisin, Eugen Rusakov

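TL;DR: The SHREC 2025 track focuses on protein surface shape retrieval. Nine teams participated, and 15 methods were evaluated on a dataset of 11,555 protein surfaces. The results show that retrieval methods using electrostatic potential as complementary information perform best.

Details

Motivation: Protein surface shape retrieval is important in bioinformatics. The track investigates how molecular surface descriptors such as electrostatic potential can improve retrieval performance, especially for classes with limited data.

Result: Retrieval methods that incorporate electrostatic potential perform best across multiple metrics, with notable gains for classes with limited data.

Insight: Electrostatic potential, as a molecular surface descriptor, can significantly improve protein surface shape retrieval and remains effective when data is scarce, which underscores the importance of multi-modal information.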

Abstract: This SHREC 2025 track dedicated to protein surface shape retrieval involved 9 participating teams. We evaluated the performance in retrieval of 15 proposed methods on a large dataset of 11,555 protein surfaces with calculated electrostatic potential (a key molecular surface descriptor). The performance in retrieval of the proposed methods was evaluated through different metrics (Accuracy, Balanced accuracy, F1 score, Precision and Recall). The best retrieval performance was achieved by the proposed methods that used the electrostatic potential complementary to molecular surface shape. This observation was also valid for classes with limited data which highlights the importance of taking into account additional molecular surface descriptors.


[46] PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era cs.CV

Xu Zheng, Chenfei Liao, Ziqiao Weng, Kaiyu Lei, Zihao Dongfang

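TL;DR: This survey discusses the rise of omnidirectional vision in the embodied AI era, covering its importance, recent breakthroughs, and future challenges, and proposes PANORAMA, an ideal panoramic system architecture.

Details

Motivation: Traditional pinhole vision offers limited environmental perception in some domains, whereas omnidirectional vision provides more complete environmental awareness, so demand is growing in robotics, industrial inspection, environmental monitoring, and similar fields.

Result: The paper surveys recent progress in omnidirectional vision and proposes a systematic architecture that offers directions for future research.

Insight: Omnidirectional vision has great potential in the embodied AI era, but its research foundations need strengthening; future work must address data, generalization, and system integration.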

Abstract: Omnidirectional vision, using 360-degree vision to understand the environment, has become increasingly critical across domains like robotics, industrial inspection, and environmental monitoring. Compared to traditional pinhole vision, omnidirectional vision provides holistic environmental awareness, significantly enhancing the completeness of scene perception and the reliability of decision-making. However, foundational research in this area has historically lagged behind traditional pinhole vision. This talk presents an emerging trend in the embodied AI era: the rapid development of omnidirectional vision, driven by growing industrial demand and academic interest. We highlight recent breakthroughs in omnidirectional generation, omnidirectional perception, omnidirectional understanding, and related datasets. Drawing on insights from both academia and industry, we propose an ideal panoramic system architecture in the embodied AI era, PANORAMA, which consists of four key subsystems. Moreover, we offer in-depth opinions related to emerging trends and cross-community impacts at the intersection of panoramic vision and embodied AI, along with the future roadmap and open challenges. This overview synthesizes state-of-the-art advancements and outlines challenges and opportunities for future research in building robust, general-purpose omnidirectional AI systems in the embodied AI era.


[47] Dual-Stage Reweighted MoE for Long-Tailed Egocentric Mistake Detection cs.CV | cs.AI | cs.LG

Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Sicong Li

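TL;DR: The paper proposes a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) framework for long-tailed egocentric mistake detection. By combining feature-level and classification-level expert modules and optimizing different loss functions in stages, it significantly improves detection of rare and ambiguous mistake instances.

Details

Motivation: Mistake detection in egocentric video is challenged by rare mistake instances and class imbalance. Conventional methods perform poorly on such problems, so a new framework is needed.

Result: DR-MoE performs strongly on rare and ambiguous mistake instances, and the code is open source for reproduction.

Insight: Combining feature-level and classification-level expert modules effectively mitigates the long-tail problem, and the multi-objective classifier design further improves robustness and generalization.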

Abstract: In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To handle the challenges posed by subtle and infrequent mistakes, we propose a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) framework. In the first stage, features are extracted using a frozen ViViT model and a LoRA-tuned ViViT model, which are combined through a feature-level expert module. In the second stage, three classifiers are trained with different objectives: reweighted cross-entropy to mitigate class imbalance, AUC loss to improve ranking under skewed distributions, and label-aware loss with sharpness-aware minimization to enhance calibration and generalization. Their predictions are fused using a classification-level expert module. The proposed method achieves strong performance, particularly in identifying rare and ambiguous mistake instances. The code is available at https://github.com/boyuh/DR-MoE.
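
The first of the three second-stage objectives, reweighted cross-entropy, can be sketched as follows. The abstract does not state the weighting rule, so inverse class frequency is used here as an assumption; the helper names `inverse_frequency_weights` and `reweighted_cross_entropy` are hypothetical and not taken from the released code.

```python
import math
from collections import Counter
from typing import List, Sequence


def inverse_frequency_weights(labels: Sequence[int], num_classes: int) -> List[float]:
    """Weight each class by n / (num_classes * count), so rare classes weigh more."""
    counts = Counter(labels)
    total = len(labels)
    return [
        total / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def reweighted_cross_entropy(
    logits_batch: Sequence[Sequence[float]],
    labels: Sequence[int],
    class_weights: Sequence[float],
) -> float:
    """Weighted mean of per-sample negative log-likelihoods."""
    weighted_loss = 0.0
    weight_sum = 0.0
    for logits, label in zip(logits_batch, labels):
        probability = softmax(logits)[label]
        weighted_loss += -class_weights[label] * math.log(probability)
        weight_sum += class_weights[label]
    return weighted_loss / weight_sum


# Usage: nine "correct" clips (class 0) and one "mistake" clip (class 1).
labels = [0] * 9 + [1]
weights = inverse_frequency_weights(labels, num_classes=2)
logits = [[2.0, 0.0]] * 9 + [[2.0, 0.0]]  # the single mistake is misclassified
print(weights, reweighted_cross_entropy(logits, labels, weights))
```

With these weights the single misclassified mistake contributes as much to the loss as all nine majority samples combined, which is the intended effect of reweighting under class imbalance.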


[48] Brought a Gun to a Knife Fight: Modern VFM Baselines Outgun Specialized Detectors on In-the-Wild AI Image Detection cs.CV

Yue Zhou, Xinan He, Kaiqing Lin, Bing Fan, Feng Ding

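TL;DR: A simple linear classifier on a modern vision foundation model (VFM) significantly outperforms detectors designed specifically for AI-generated images, improving in-the-wild accuracy by more than 20%. This shows that the ‘firepower’ of an updated model is more effective than the ‘craftsmanship’ of a static detector, and highlights that test data must be independent of the model’s pre-training history.

Details

Motivation: Existing specialized detectors excel on curated datasets but perform poorly in real-world scenarios, with especially high false-negative rates. The authors explore whether modern vision foundation models can address this real-world problem more effectively.

Result: Modern VFMs achieve in-the-wild detection accuracy more than 20% higher than specialized detectors. However, performance drops sharply on test data collected after the model’s pre-training cut-off.

Insight: 1. The ‘firepower’ of a modern VFM (what it learned in pre-training) suits a moving target better than the ‘craftsmanship’ of a static detector. 2. Model evaluation must strictly rule out contamination from pre-training data.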

Abstract: While specialized detectors for AI-generated images excel on curated benchmarks, they fail catastrophically in real-world scenarios, as evidenced by their critically high false-negative rates on ‘in-the-wild’ benchmarks. Instead of crafting another specialized ‘knife’ for this problem, we bring a ‘gun’ to the fight: a simple linear classifier on a modern Vision Foundation Model (VFM). Trained on identical data, this baseline decisively ‘outguns’ bespoke detectors, boosting in-the-wild accuracy by a striking margin of over 20%. Our analysis pinpoints the source of the VFM’s ‘firepower’: First, by probing text-image similarities, we find that recent VLMs (e.g., Perception Encoder, Meta CLIP2) have learned to align synthetic images with forgery-related concepts (e.g., ‘AI-generated’), unlike previous versions. Second, we speculate that this is due to data exposure, as both this alignment and overall accuracy plummet on a novel dataset scraped after the VFM’s pre-training cut-off date, ensuring it was unseen during pre-training. Our findings yield two critical conclusions: 1) For the real-world ‘gunfight’ of AI-generated image detection, the raw ‘firepower’ of an updated VFM is far more effective than the ‘craftsmanship’ of a static detector. 2) True generalization evaluation requires test data to be independent of the model’s entire training history, including pre-training.


[49] Dream3DAvatar: Text-Controlled 3D Avatar Reconstruction from a Single Image cs.CV

Gaofeng Liu, Hengsen Li, Ruoyu Gao, Xuetong Li, Zhiyuan Ma

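TL;DR: The paper proposes Dream3DAvatar, an efficient, text-driven two-stage framework for reconstructing 3D avatars from a single image, which addresses the control of geometry and texture when generating occluded regions.

Details

Motivation: Because monocular input carries limited information, current 3D avatar reconstruction techniques struggle to control the geometry and texture of occluded regions. The authors propose a two-stage framework that redesigns the reconstruction pipeline and adds text control to improve quality and controllability.

Result: Experiments show that Dream3DAvatar generates high-quality, animation-ready 3D avatars without post-processing and outperforms existing baselines across multiple evaluation metrics.

Insight: Combining adapters with text-driven multi-view generation effectively handles occluded regions, and a feedforward Transformer delivers both efficiency and quality in 3D reconstruction.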

Abstract: With the rapid advancement of 3D representation techniques and generative models, substantial progress has been made in reconstructing full-body 3D avatars from a single image. However, this task remains fundamentally ill-posed due to the limited information available from monocular input, making it difficult to control the geometry and texture of occluded regions during generation. To address these challenges, we redesign the reconstruction pipeline and propose Dream3DAvatar, an efficient and text-controllable two-stage framework for 3D avatar generation. In the first stage, we develop a lightweight, adapter-enhanced multi-view generation model. Specifically, we introduce the Pose-Adapter to inject SMPL-X renderings and skeletal information into SDXL, enforcing geometric and pose consistency across views. To preserve facial identity, we incorporate ID-Adapter-G, which injects high-resolution facial features into the generation process. Additionally, we leverage BLIP2 to generate high-quality textual descriptions of the multi-view images, enhancing text-driven controllability in occluded regions. In the second stage, we design a feedforward Transformer model equipped with a multi-view feature fusion module to reconstruct high-fidelity 3D Gaussian Splat representations (3DGS) from the generated images. Furthermore, we introduce ID-Adapter-R, which utilizes a gating mechanism to effectively fuse facial features into the reconstruction process, improving high-frequency detail recovery. Extensive experiments demonstrate that our method can generate realistic, animation-ready 3D avatars without any post-processing and consistently outperforms existing baselines across multiple evaluation metrics.


[50] Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models cs.CV | cs.AI

Yan Chen, Long Li, Teng Xi, Long Zeng, Jingdong Wang

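TL;DR: The paper proposes a two-stage reinforcement learning framework, yielding the model PeBR-R1, to improve the perception and reasoning capabilities of vision-language models (VLMs). Stage-wise training addresses the shortcomings of directly transplanting methods from language models, and experiments confirm its effectiveness.

Details

Motivation: Reinforcement learning methods that work well for large language models (LLMs) are suboptimal when applied directly to VLMs, because VLMs must accurately perceive visual inputs before they can reason. A two-stage learning method tailored to VLMs is therefore needed.

Result: PeBR-R1 performs strongly on seven benchmark datasets, validating the method.

Insight: In visual reasoning tasks, perception is the foundation of reasoning, and stage-wise training outperforms directly transplanting LLM methods.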

Abstract: Reinforcement learning (RL) has proven highly effective in eliciting the reasoning capabilities of large language models (LLMs). Inspired by this success, recent studies have explored applying similar techniques to vision-language models (VLMs), aiming to enhance their reasoning performance. However, directly transplanting RL methods from LLMs to VLMs is suboptimal, as the tasks faced by VLMs are inherently more complex. Specifically, VLMs must first accurately perceive and understand visual inputs before reasoning can be effectively performed. To address this challenge, we propose a two-stage reinforcement learning framework designed to jointly enhance both the perceptual and reasoning capabilities of VLMs. To mitigate the vanishing advantage issue commonly observed in RL training, we first perform dataset-level sampling to selectively strengthen specific capabilities using distinct data sources. During training, the first stage focuses on improving the model’s visual perception through coarse- and fine-grained visual understanding, while the second stage targets the enhancement of reasoning abilities. After the proposed two-stage reinforcement learning process, we obtain PeBR-R1, a vision-language model with significantly enhanced perceptual and reasoning capabilities. Experimental results on seven benchmark datasets demonstrate the effectiveness of our approach and validate the superior performance of PeBR-R1 across diverse visual reasoning tasks.


[51] HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models cs.CV

Xu Li, Yuxuan Liang, Xiaolei Chen, Yi Zheng, Haotian Chen

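TL;DR: HERO is an efficient inference framework for high-resolution large vision-language models (HR-LVLMs). It dynamically allocates the visual token budget and selectively retains complementary tokens, significantly improving the efficiency-accuracy trade-off.

Details

Motivation: HR-LVLMs crop high-resolution images into local tiles and encode them independently. This improves fine-grained visual understanding but increases the number of visual tokens, which greatly raises computational and memory overhead. The paper aims to solve this efficiency problem.

Result: HERO achieves superior efficiency-accuracy trade-offs across multiple benchmarks and model scales, without any training.

Insight: The complementarity of visual tokens across stages is key to efficient inference; dynamic budget allocation and selective token retention are effective ways to make HR-LVLMs more efficient.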

Abstract: By cropping high-resolution images into local tiles and encoding them independently, High-Resolution Large Vision-Language Models (HR-LVLMs) have demonstrated remarkable fine-grained visual understanding capabilities. However, this divide-and-conquer paradigm significantly increases the number of visual tokens, resulting in substantial computational and memory overhead. To better understand and address this challenge, we empirically investigate visual token utilization in HR-LVLMs and uncover three key findings: (1) the local tiles have varying importance, jointly determined by visual saliency and task relevance; (2) the CLS token in CLIP-based vision encoders exhibits a two-stage attention pattern across layers, with each stage attending to different types of visual tokens; (3) the visual tokens emphasized at different stages encode information at varying levels of granularity, playing complementary roles within LVLMs. Building on these insights, we propose HERO, a High-resolution visual token early dropping framework that integrates content-adaptive token budget allocation with function-aware token selection. By accurately estimating tile-level importance and selectively retaining visual tokens with complementary roles, HERO achieves superior efficiency-accuracy trade-offs across diverse benchmarks and model scales, all in a training-free manner. This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.
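
The idea of content-adaptive budget allocation can be illustrated with a minimal sketch: given one importance score per tile, split a fixed token budget across tiles in proportion to those scores, then keep only the highest-scoring tokens in each tile. Proportional allocation with largest-remainder rounding is an assumption made here for illustration; the abstract does not specify how HERO estimates tile importance or selects tokens, and both function names are hypothetical.

```python
from typing import List, Sequence


def allocate_token_budget(tile_scores: Sequence[float], total_budget: int) -> List[int]:
    """Split total_budget across tiles proportionally to their importance scores.

    Uses largest-remainder rounding so the per-tile budgets sum exactly to
    total_budget.
    """
    score_sum = sum(tile_scores)
    if score_sum <= 0:
        raise ValueError("tile scores must have a positive sum")

    exact = [total_budget * score / score_sum for score in tile_scores]
    budgets = [int(value) for value in exact]  # floor, since values are non-negative
    leftover = total_budget - sum(budgets)

    # Hand the remaining tokens to the tiles with the largest fractional parts.
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - budgets[i], reverse=True
    )
    for index in by_remainder[:leftover]:
        budgets[index] += 1
    return budgets


def keep_top_tokens(token_scores: Sequence[float], budget: int) -> List[int]:
    """Return the indices of the `budget` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(token_scores)), key=lambda i: token_scores[i], reverse=True)
    return sorted(ranked[:budget])


# Usage: three tiles share a budget of 10 tokens.
print(allocate_token_budget([5, 3, 2], 10))
print(keep_top_tokens([0.1, 0.9, 0.4, 0.7], 2))
```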


[52] TFANet: Three-Stage Image-Text Feature Alignment Network for Robust Referring Image Segmentation cs.CV | cs.AI

Qianqi Lu, Yuxiang Xie, Jing Zhang, Shiwei Zou, Yan Chen

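TL;DR: The paper proposes TFANet, a three-stage image-text feature alignment network that uses a hierarchical framework to address multimodal misalignment and language semantic loss in referring image segmentation (RIS).

Details

Motivation: Existing RIS methods fall short in multimodal alignment and in preserving language semantics, and in complex scenes they often mislocalize targets or segment them incompletely.

Result: TFANet performs well in complex scenes, localizing and segmenting targets more accurately.

Insight: The hierarchical three-stage design effectively tackles the difficulties of multimodal alignment; in particular, word-level semantic deepening compensates for semantic degradation introduced in earlier stages.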

Abstract: Referring Image Segmentation (RIS) is a task that segments image regions based on language expressions, requiring fine-grained alignment between two modalities. However, existing methods often struggle with multimodal misalignment and language semantic loss, especially in complex scenes containing multiple visually similar objects, where uniquely described targets are frequently mislocalized or incompletely segmented. To tackle these challenges, this paper proposes TFANet, a Three-stage Image-Text Feature Alignment Network that systematically enhances multimodal alignment through a hierarchical framework comprising three stages: Knowledge Plus Stage (KPS), Knowledge Fusion Stage (KFS), and Knowledge Intensification Stage (KIS). In the first stage, we design the Multiscale Linear Cross-Attention Module (MLAM), which facilitates bidirectional semantic exchange between visual features and textual representations across multiple scales. This establishes rich and efficient alignment between image regions and different granularities of linguistic descriptions. Subsequently, the KFS further strengthens feature alignment through the Cross-modal Feature Scanning Module (CFSM), which applies multimodal selective scanning to capture long-range dependencies and construct a unified multimodal representation. This is essential for modeling long-range cross-modal dependencies and enhancing alignment accuracy in complex scenes. Finally, in the KIS, we propose the Word-level Linguistic Feature-guided Semantic Deepening Module (WFDM) to compensate for semantic degradation introduced in earlier stages.


[53] Enhancing Dual Network Based Semi-Supervised Medical Image Segmentation with Uncertainty-Guided Pseudo-Labeling cs.CV

Yunyao Lu, Yihang Wu, Ahmad Chaddad, Tareef Daqqaq, Reem Kateb

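TL;DR: The paper proposes a semi-supervised 3D medical image segmentation framework based on a dual-network architecture. It reduces noisy pseudo-labels with a Cross Consistency Enhancement module and a dynamic weighting strategy, and lowers prediction uncertainty with a contrastive learning mechanism.

Details

Motivation: Existing semi-supervised segmentation methods suffer from noisy pseudo-labels and insufficient supervision in the feature space, which limits their practicality for medical image segmentation.

Result: The method performs strongly on three datasets, Left Atrial, NIH Pancreas, and BraTS-2019 (for example, an 89.95% Dice score on Left Atrial with 10% labeled data), outperforming existing methods.

Insight: Combining pseudo-label refinement with feature alignment effectively improves semi-supervised medical image segmentation, especially when labeled data is scarce.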

Abstract: Despite the remarkable performance of supervised medical image segmentation models, relying on a large amount of labeled data is impractical in real-world situations. Semi-supervised learning approaches aim to alleviate this challenge using unlabeled data through pseudo-label generation. Yet, existing semi-supervised segmentation methods still suffer from noisy pseudo-labels and insufficient supervision within the feature space. To solve these challenges, this paper proposes a novel semi-supervised 3D medical image segmentation framework based on a dual-network architecture. Specifically, we investigate a Cross Consistency Enhancement module using both cross pseudo and entropy-filtered supervision to reduce the noisy pseudo-labels, while we design a dynamic weighting strategy to adjust the contributions of pseudo-labels using an uncertainty-aware mechanism (i.e., Kullback-Leibler divergence). In addition, we use a self-supervised contrastive learning mechanism to align uncertain voxel features with reliable class prototypes by effectively differentiating between trustworthy and uncertain predictions, thus reducing prediction uncertainty. Extensive experiments are conducted on three 3D segmentation datasets, Left Atrial, NIH Pancreas and BraTS-2019. The proposed approach consistently exhibits superior performance across various settings (e.g., 89.95% Dice score on Left Atrial with 10% labeled data) compared to the state-of-the-art methods. Furthermore, the usefulness of the proposed modules is further validated via ablation experiments.
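
As a rough illustration of uncertainty-aware weighting with Kullback-Leibler divergence, the sketch below measures how strongly the two networks disagree on a voxel and turns that disagreement into a weight for the pseudo-label. The mapping `exp(-KL)` is an assumption chosen for simplicity; the abstract names KL divergence as the uncertainty signal but does not give the weighting formula, and the function names are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete class distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def pseudo_label_weight(p: Sequence[float], q: Sequence[float]) -> float:
    """Map disagreement between two predictions to a weight in (0, 1].

    Identical predictions give weight 1; larger divergence gives a smaller weight.
    """
    return math.exp(-kl_divergence(p, q))


# Usage: per-voxel class probabilities from network A and network B.
agree = pseudo_label_weight([0.9, 0.1], [0.9, 0.1])
disagree = pseudo_label_weight([0.9, 0.1], [0.1, 0.9])
print(agree, disagree)
```

Under this scheme, a pseudo-label on which the two networks agree contributes fully to the loss, while a contested one is down-weighted rather than discarded outright.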


[54] A Synthetic Data Pipeline for Supporting Manufacturing SMEs in Visual Assembly Control cs.CV | cs.RO

Jonas Werheid, Shengjie He, Aymen Gannouni, Anas Abdelrazeq, Robert H. Schmitt

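TL;DR: The paper proposes a synthetic-data-based approach to visual assembly control that uses CAD data and object detection algorithms to give manufacturing small- and medium-sized enterprises (SMEs) a resource-efficient way to generate data and manage quality.

Details

Motivation: Assembly quality control is essential in manufacturing, but conventional computer vision methods are costly in data acquisition and annotation, which SMEs in particular struggle to afford. Synthetic data can cut these costs, yet its use in practical assembly quality control remains limited.

Result: mAP@0.5:0.95 reaches up to 99.5% on the simulated training data and up to 93% when transferred to real-world test data, demonstrating the effectiveness of the approach.

Insight: With a synthetic data generation pipeline, SMEs can implement visual assembly control more efficiently, with lower resource investment and a lower technical barrier.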

Abstract: Quality control of assembly processes is essential in manufacturing to ensure not only the quality of individual components but also their proper integration into the final product. To assist in this matter, automated assembly control using computer vision methods has been widely implemented. However, the costs associated with image acquisition, annotation, and training of computer vision algorithms pose challenges for integration, especially for small- and medium-sized enterprises (SMEs), which often lack the resources for extensive training, data collection, and manual image annotation. Synthetic data offers the potential to reduce manual data collection and labeling. Nevertheless, its practical application in the context of assembly quality remains limited. In this work, we present a novel approach for easily integrable and data-efficient visual assembly control. Our approach leverages simulated scene generation based on computer-aided design (CAD) data and object detection algorithms. The results demonstrate a time-saving pipeline for generating image data in manufacturing environments, achieving a mean Average Precision (mAP@0.5:0.95) of up to 99.5% for correctly identifying instances of synthetic planetary gear system components within our simulated training data, and up to 93% when transferred to real-world camera-captured testing data. This research highlights the effectiveness of synthetic data generation within an adaptable pipeline and underscores its potential to support SMEs in implementing resource-efficient visual assembly control solutions.
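
The metric mAP@0.5:0.95 averages average precision over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below covers only the IoU and threshold part of that computation; a full mAP also needs per-class precision-recall curves, which are omitted. The box coordinates are made up for illustration.

```python
def box_iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds 0.50, 0.55, ..., 0.95 behind the "0.5:0.95" notation.
THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred_box, gt_box):
    """Thresholds at which this prediction would count as a true positive."""
    iou = box_iou(pred_box, gt_box)
    return [t for t in THRESHOLDS if iou >= t]


pred = (0.0, 0.0, 10.0, 10.0)   # hypothetical detected gear component
truth = (1.0, 0.0, 11.0, 10.0)  # hypothetical ground-truth box
print(box_iou(pred, truth), thresholds_passed(pred, truth))
```

A detection that is only slightly misaligned, as here, still fails the strictest thresholds, which is why mAP@0.5:0.95 is a more demanding number than mAP@0.5.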


[55] Weakly and Self-Supervised Class-Agnostic Motion Prediction for Autonomous Driving cs.CVPDF

Ruibo Li, Hanyu Shi, Zhe Wang, Guosheng Lin

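TL;DR: The paper studies weakly and self-supervised class-agnostic motion prediction for autonomous driving from LiDAR point clouds. It reduces annotation needs through foreground/background or non-ground/ground masks and proposes a Robust Consistency-aware Chamfer Distance loss to improve performance.

Details

Motivation: Autonomous driving requires an accurate understanding of motion in dynamic environments, but conventional methods depend on large amounts of annotation. The goal is a motion prediction method that needs less annotation.

Result: Experiments show that the proposed weakly and self-supervised models outperform existing self-supervised methods, and that the weakly supervised models even approach some supervised ones.

Insight: Foreground/background or non-ground/ground masks are effective supervision signals for motion prediction, reducing annotation while keeping performance high.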

Abstract: Understanding motion in dynamic environments is critical for autonomous driving, thereby motivating research on class-agnostic motion prediction. In this work, we investigate weakly and self-supervised class-agnostic motion prediction from LiDAR point clouds. Outdoor scenes typically consist of mobile foregrounds and static backgrounds, allowing motion understanding to be associated with scene parsing. Based on this observation, we propose a novel weakly supervised paradigm that replaces motion annotations with fully or partially annotated (1%, 0.1%) foreground/background masks for supervision. To this end, we develop a weakly supervised approach utilizing foreground/background cues to guide the self-supervised learning of motion prediction models. Since foreground motion generally occurs in non-ground regions, non-ground/ground masks can serve as an alternative to foreground/background masks, further reducing annotation effort. Leveraging non-ground/ground cues, we propose two additional approaches: a weakly supervised method requiring fewer (0.01%) foreground/background annotations, and a self-supervised method without annotations. Furthermore, we design a Robust Consistency-aware Chamfer Distance loss that incorporates multi-frame information and robust penalty functions to suppress outliers in self-supervised learning. Experiments show that our weakly and self-supervised models outperform existing self-supervised counterparts, and our weakly supervised models even rival some supervised ones. This demonstrates that our approaches effectively balance annotation effort and performance.
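
For readers unfamiliar with the base quantity, the following is a minimal sketch of the plain symmetric Chamfer distance between two point sets. It is not the paper's Robust Consistency-aware variant, which adds multi-frame information and robust penalty functions. The squared-distance, mean-reduced convention used here is one common choice and is an assumption.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean squared nearest-neighbour distance
    from A to B plus the same quantity from B to A."""

    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_way(src, dst):
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


# Two tiny 3D point sets (made-up coordinates): e.g. a predicted warped
# frame and the observed next frame in a self-supervised setup.
warped = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
observed = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
print(chamfer_distance(warped, observed))
```

This brute-force version is quadratic in the number of points; real LiDAR pipelines would use a spatial index or a GPU kernel for the nearest-neighbour search.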


[56] MSDNet: Efficient 4D Radar Super-Resolution via Multi-Stage Distillation cs.CVPDF

Minqing Huang, Shouyi Lu, Boyuan Zheng, Ziyao Li, Xiao Tang

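TL;DR: MSDNet proposes a multi-stage distillation framework that efficiently transfers LiDAR priors to substantially improve the resolution of 4D radar point clouds while balancing reconstruction quality and computational efficiency.

Details

Motivation: Existing 4D radar super-resolution methods suffer from high training cost, high inference latency, and poor generalization, making accuracy and efficiency hard to balance. MSDNet aims to address these problems.

Result: Experiments show that MSDNet achieves high-fidelity reconstruction and low-latency inference on 4D radar point cloud super-resolution and improves downstream task performance.

Insight: The multi-stage distillation and noise adapter design offers a new approach to efficient radar data enhancement and may extend to other point cloud super-resolution tasks.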

Abstract: 4D radar super-resolution, which aims to reconstruct sparse and noisy point clouds into dense and geometrically consistent representations, is a foundational problem in autonomous perception. However, existing methods often suffer from high training cost or rely on complex diffusion-based sampling, resulting in high inference latency and poor generalization, making it difficult to balance accuracy and efficiency. To address these limitations, we propose MSDNet, a multi-stage distillation framework that efficiently transfers dense LiDAR priors to 4D radar features to achieve both high reconstruction quality and computational efficiency. The first stage performs reconstruction-guided feature distillation, aligning and densifying the student’s features through feature reconstruction. In the second stage, we propose diffusion-guided feature distillation, which treats the stage-one distilled features as a noisy version of the teacher’s representations and refines them via a lightweight diffusion network. Furthermore, we introduce a noise adapter that adaptively aligns the noise level of the feature with a predefined diffusion timestep, enabling a more precise denoising. Extensive experiments on the VoD and in-house datasets demonstrate that MSDNet achieves both high-fidelity reconstruction and low-latency inference in the task of 4D radar point cloud super-resolution, and consistently improves performance on downstream tasks. The code will be publicly available upon publication.


[57] Enhancing Video Large Language Models with Structured Multi-Video Collaborative Reasoning (early version) cs.CVPDF

Zhihao He, Tianyao He, Tieyuan Chen, Yun Xu, Huabin Liu

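TL;DR: The paper proposes a multi-video collaborative framework that uses structured video representations and a graph fusion module to strengthen the reasoning of video large language models (VLMs), addressing the spatio-temporal incompleteness of single videos and redundant information.

Details

Motivation: Current video language models tend to hallucinate and make errors on comprehensive video reasoning tasks because individual videos are spatio-temporally incomplete and contain redundant information. Collaborating across multiple videos is a potential solution, but feeding the video data in directly can degrade performance.

Result: Experiments confirm the framework's effectiveness and show its potential to improve the reasoning of video language models.

Insight: Structured representations and multi-video collaboration can effectively reduce redundant information and improve the accuracy and robustness of video reasoning.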

Abstract: Despite the prosperity of the video language model, the current pursuit of comprehensive video reasoning is thwarted by the inherent spatio-temporal incompleteness within individual videos, resulting in hallucinations and inaccuracies. A promising solution is to augment the reasoning performance with multiple related videos. However, video tokens are numerous and contain redundant information, so directly feeding the relevant video data into a large language model to enhance responses could be counterproductive. To address this challenge, we propose a multi-video collaborative framework for video language models. For efficient and flexible video representation, we establish a Video Structuring Module to represent the video’s knowledge as a spatio-temporal graph. Based on the structured video representation, we design the Graph Fusion Module to fuse the structured knowledge and valuable information from related videos into the augmented graph node tokens. Finally, we construct an elaborate multi-video structured prompt to integrate the graph, visual, and textual tokens as the input to the large language model. Extensive experiments substantiate the effectiveness of our framework, showcasing its potential as a promising avenue for advancing video language models.


[58] WHU-STree: A Multi-modal Benchmark Dataset for Street Tree Inventory cs.CVPDF

Ruifei Ding, Zhe Chen, Wen Fan, Chen Long, Huijuan Xiao

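TL;DR: WHU-STree is a multi-modal street tree dataset that supports multiple street tree inventory tasks and addresses the small scale, limited annotation, or single modality of existing data.

Details

Motivation: Traditional street tree surveys are time-consuming and labor-intensive, and existing datasets acquired with Mobile Mapping Systems (MMS) are small in scale, sparsely annotated, or single-modality, which limits comprehensive analysis. WHU-STree aims to address these problems.

Result: Experiments demonstrate the potential of multi-modal data fusion and highlight cross-domain applicability as critical for practical algorithm deployment.

Insight: Multi-modal fusion, multi-task collaboration, cross-domain generalization, spatial pattern learning, and multi-modal large language models are important directions for future street tree asset management.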

Abstract: Street trees are vital to urban livability, providing ecological and social benefits. Establishing a detailed, accurate, and dynamically updated street tree inventory has become essential for optimizing these multifunctional assets within space-constrained urban environments. Given that traditional field surveys are time-consuming and labor-intensive, automated surveys utilizing Mobile Mapping Systems (MMS) offer a more efficient solution. However, existing MMS-acquired tree datasets are limited by small-scale scenes, limited annotation, or single modality, restricting their utility for comprehensive analysis. To address these limitations, we introduce WHU-STree, a cross-city, richly annotated, and multi-modal urban street tree dataset. Collected across two distinct cities, WHU-STree integrates synchronized point clouds and high-resolution images, encompassing 21,007 annotated tree instances across 50 species and 2 morphological parameters. Leveraging these characteristics, WHU-STree concurrently supports over 10 tasks related to street tree inventory. We benchmark representative baselines for two key tasks: tree species classification and individual tree segmentation. Extensive experiments and in-depth analysis demonstrate the significant potential of multi-modal data fusion and underscore cross-domain applicability as a critical prerequisite for practical algorithm deployment. In particular, we identify key challenges and outline future work for fully exploiting WHU-STree, encompassing multi-modal fusion, multi-task collaboration, cross-domain generalization, spatial pattern learning, and Multi-modal Large Language Models for street tree asset management. The WHU-STree dataset is accessible at: https://github.com/WHU-USI3DV/WHU-STree.


[59] More performant and scalable: Rethinking contrastive vision-language pre-training of radiology in the LLM era cs.CVPDF

Yingtai Li, Haoran Lai, Xiaoqian Zhou, Shuai Ming, Wenxin Ma

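TL;DR: The paper explores how large language models (LLMs) can make medical contrastive vision-language pre-training more performant and scalable. It automatically extracts diagnostic labels from radiology reports to build low-cost, large-scale "silver-standard" datasets and shows that models pre-trained on this data perform strongly.

Details

Motivation: The emergence of LLMs creates new opportunities for medical vision-language pre-training, especially in fields where annotated data is scarce, such as medical imaging. The paper aims to use LLMs to extract high-quality labels automatically, lowering pre-training cost and improving model performance.

Result: 1. Zero-shot diagnosis AUC reaches 83.8% on CT-RATE and 77.3% on RAD-ChestCT.
2. Cross-modal retrieval improves substantially (image-image MAP@50 = 53.7%, report-image Recall@100 = 52.2%).

Insight: 1. LLMs can produce high-quality annotations at low cost, lowering the barrier to medical AI development.
2. Supervised pre-training is key to improving vision-language alignment.
3. A simple architecture (such as a 3D ResNet-18) combined with LLM-annotated data can yield high-performing medical AI systems.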

Abstract: The emergence of Large Language Models (LLMs) presents unprecedented opportunities to revolutionize medical contrastive vision-language pre-training. In this paper, we show how LLMs can facilitate large-scale supervised pre-training, thereby advancing vision-language alignment. We begin by demonstrating that modern LLMs can automatically extract diagnostic labels from radiology reports with remarkable precision (>96% AUC in our experiments) without complex prompt engineering, enabling the creation of large-scale “silver-standard” datasets at a minimal cost (~$3 for 50k CT image-report pairs). Further, we find that a vision encoder trained on this “silver-standard” dataset achieves performance comparable to those trained on labels extracted by specialized BERT-based models, thereby democratizing the access to large-scale supervised pre-training. Building on this foundation, we proceed to reveal that supervised pre-training fundamentally improves contrastive vision-language alignment. Our approach achieves state-of-the-art performance using only a 3D ResNet-18 with vanilla CLIP training, including 83.8% AUC for zero-shot diagnosis on CT-RATE, 77.3% AUC on RAD-ChestCT, and substantial improvements in cross-modal retrieval (MAP@50=53.7% for image-image, Recall@100=52.2% for report-image). These results demonstrate the potential of utilizing LLMs to facilitate more performant and scalable medical AI systems. Our code is available at https://github.com/SadVoxel/More-performant-and-scalable.
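
The retrieval figures above use Recall@K. As a rough illustration, the sketch below computes it for report-to-image retrieval under the assumption that each report has exactly one matching image and that Recall@K is the fraction of queries whose match appears in the top K results. The identifiers and rankings are invented, and the paper's exact evaluation protocol may differ.

```python
def recall_at_k(rankings, ground_truth, k):
    """Fraction of queries whose ground-truth item is among the top-k results.

    rankings:     dict mapping query id -> list of item ids, best match first
    ground_truth: dict mapping query id -> the single correct item id
    """
    hits = sum(
        1 for query, ranked in rankings.items() if ground_truth[query] in ranked[:k]
    )
    return hits / len(rankings)


# Three hypothetical reports, each retrieving from three candidate CT volumes.
rankings = {
    "report_1": ["ct_1", "ct_2", "ct_3"],
    "report_2": ["ct_3", "ct_1", "ct_2"],
    "report_3": ["ct_1", "ct_2", "ct_3"],
}
ground_truth = {"report_1": "ct_1", "report_2": "ct_2", "report_3": "ct_3"}

for k in (1, 2, 3):
    print(k, recall_at_k(rankings, ground_truth, k))
```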


[60] Road Obstacle Video Segmentation cs.CVPDF

Shyam Nandan Rai, Shyamgopal Karthik, Mariana-Iuliana Georgescu, Barbara Caputo, Carlo Masone

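TL;DR: The paper proposes a new road-obstacle video segmentation approach that incorporates temporal information and validates it on four evaluation benchmarks.

Details

Motivation: Existing road-obstacle segmentation methods mostly operate on single frames and ignore the temporal nature of the problem, producing inconsistent predictions. The paper argues that the task is temporal and that correlations between consecutive frames need to be modeled.

Result: The proposed approach sets a new state of the art on long-range video sequences.

Insight: Road-obstacle segmentation is a temporal task, and temporal modeling markedly improves the continuity and consistency of segmentation.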

Abstract: With the growing deployment of autonomous driving agents, the detection and segmentation of road obstacles have become critical to ensure safe autonomous navigation. However, existing road-obstacle segmentation methods are applied on individual frames, overlooking the temporal nature of the problem, leading to inconsistent prediction maps between consecutive frames. In this work, we demonstrate that the road-obstacle segmentation task is inherently temporal, since the segmentation maps for consecutive frames are strongly correlated. To address this, we curate and adapt four evaluation benchmarks for road-obstacle video segmentation and evaluate 11 state-of-the-art image- and video-based segmentation methods on these benchmarks. Moreover, we introduce two strong baseline methods based on vision foundation models. Our approach establishes a new state-of-the-art in road-obstacle video segmentation for long-range video sequences, providing valuable insights and direction for future research.


[61] Vi-SAFE: A Spatial-Temporal Framework for Efficient Violence Detection in Public Surveillance cs.CV | I.2.10; I.4.8PDF

Ligang Chang, Shengkai Xu, Liangchang Shen, Binhan Xu, Junqiao Wang

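TL;DR: Vi-SAFE is a spatial-temporal framework for efficient violence detection in public surveillance. It combines an enhanced YOLOv8 with TSN, optimizes for a lightweight structure and computational efficiency, and performs well on the RWF-2000 dataset.

Details

Motivation: Violence detection in public surveillance is critical for public safety but faces challenges from small-scale targets, complex environments, and real-time analysis, so efficient solutions are needed.

Result: On RWF-2000, Vi-SAFE reaches an accuracy of 0.88, outperforming TSN alone (0.77) and other existing methods, with strong computational efficiency.

Insight: 1. Combining spatial and temporal modeling markedly improves violence detection performance.
2. Lightweight design and attention-mechanism optimization are essential for real-time surveillance systems.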

Abstract: Violence detection in public surveillance is critical for public safety. This study addresses challenges such as small-scale targets, complex environments, and real-time temporal analysis. We propose Vi-SAFE, a spatial-temporal framework that integrates an enhanced YOLOv8 with a Temporal Segment Network (TSN) for video surveillance. The YOLOv8 model is optimized with GhostNetV3 as a lightweight backbone, an exponential moving average (EMA) attention mechanism, and pruning to reduce computational cost while maintaining accuracy. YOLOv8 and TSN are trained separately on pedestrian and violence datasets, where YOLOv8 extracts human regions and TSN performs binary classification of violent behavior. Experiments on the RWF-2000 dataset show that Vi-SAFE achieves an accuracy of 0.88, surpassing TSN alone (0.77) and outperforming existing methods in both accuracy and efficiency, demonstrating its effectiveness for public safety surveillance. Code is available at https://anonymous.4open.science/r/Vi-SAFE-3B42/README.md.
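
Temporal Segment Networks are generally described as splitting a video into equal-length segments and sampling one snippet from each. The sketch below illustrates that sparse sampling step only. The abstract does not say how Vi-SAFE samples frames, so the function name, the center-frame default, and the segment count are assumptions.

```python
import random


def sample_segment_indices(num_frames, num_segments, rng=None):
    """Pick one frame index from each of `num_segments` equal-length segments.

    With rng=None the center frame of each segment is returned (a common
    choice at test time); with an rng a random frame per segment is drawn.
    """
    if num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    indices = []
    for s in range(num_segments):
        start = s * num_frames // num_segments
        end = (s + 1) * num_frames // num_segments  # exclusive bound
        if rng is None:
            indices.append((start + end - 1) // 2)
        else:
            indices.append(rng.randrange(start, end))
    return indices


center = sample_segment_indices(30, 3)
jittered = sample_segment_indices(30, 3, rng=random.Random(0))
print(center, jittered)
```

In a pipeline like the one described, the sampled frames would be the person-region crops produced by the detector, passed on to the temporal classifier.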


[62] Curriculum Multi-Task Self-Supervision Improves Lightweight Architectures for Onboard Satellite Hyperspectral Image Segmentation cs.CV | cs.AI | cs.LGPDF

Hugo Carlesso, Josiane Mothe, Radu Tudor Ionescu

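TL;DR: The paper proposes a novel curriculum multi-task self-supervised learning (CMTSSL) framework designed for lightweight hyperspectral image (HSI) analysis. By jointly learning spatial and spectral features, it substantially improves the performance of lightweight models.

Details

Motivation: Hyperspectral data is high-dimensional and satellite data transfer rates are low, so lightweight, efficient models are needed to support onboard processing and reduce the transmission of redundant data.

Result: Validated on four public datasets, with marked gains in downstream segmentation, using models more than 16,000x lighter than some state-of-the-art models.

Insight: CMTSSL shows the potential of lightweight models under self-supervised learning, particularly for practical onboard hyperspectral image analysis.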

Abstract: Hyperspectral imaging (HSI) captures detailed spectral signatures across hundreds of contiguous bands per pixel, being indispensable for remote sensing applications such as land-cover classification, change detection, and environmental monitoring. Due to the high dimensionality of HSI data and the slow rate of data transfer in satellite-based systems, compact and efficient models are required to support onboard processing and minimize the transmission of redundant or low-value data, e.g. cloud-covered areas. To this end, we introduce a novel curriculum multi-task self-supervised learning (CMTSSL) framework designed for lightweight architectures for HSI analysis. CMTSSL integrates masked image modeling with decoupled spatial and spectral jigsaw puzzle solving, guided by a curriculum learning strategy that progressively increases data complexity during self-supervision. This enables the encoder to jointly capture fine-grained spectral continuity, spatial structure, and global semantic features. Unlike prior dual-task SSL methods, CMTSSL simultaneously addresses spatial and spectral reasoning within a unified and computationally efficient design, being particularly suitable for training lightweight models for onboard satellite deployment. We validate our approach on four public benchmark datasets, demonstrating consistent gains in downstream segmentation tasks, using architectures that are over 16,000x lighter than some state-of-the-art models. These results highlight the potential of CMTSSL in generalizable representation learning with lightweight architectures for real-world HSI applications. Our code is publicly available at https://github.com/hugocarlesso/CMTSSL.


[63] Intelligent Vacuum Thermoforming Process cs.CV | cs.LG | I.2.10; I.4.9PDF

Andi Kuswoyo, Christos Margadji, Sebastian W. Pattinson

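TL;DR: The paper proposes a vision-based quality control system for optimizing vacuum thermoforming process parameters, using a k-Nearest Neighbour algorithm to adjust the parameters, reduce defects, and improve production efficiency.

Details

Motivation: Variations in material properties and tooling configurations make consistent quality hard to guarantee in vacuum thermoforming, so an efficient way to optimize process parameters is needed.

Result: The model performed well in adjusting heating power, heating time, and vacuum time, effectively reducing defects and improving production efficiency.

Insight: Vision-based methods combined with a k-Nearest Neighbour algorithm can effectively optimize vacuum thermoforming and suggest an approach for optimizing similar manufacturing processes.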

Abstract: Ensuring consistent quality in vacuum thermoforming presents challenges due to variations in material properties and tooling configurations. This research introduces a vision-based quality control system to predict and optimise process parameters, thereby enhancing part quality with minimal data requirements. A comprehensive dataset was developed using visual data from vacuum-formed samples subjected to various process parameters, supplemented by image augmentation techniques to improve model training. A k-Nearest Neighbour algorithm was subsequently employed to identify adjustments needed in process parameters by mapping low-quality parts to their high-quality counterparts. The model exhibited strong performance in adjusting heating power, heating time, and vacuum time to reduce defects and improve production efficiency.
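
The abstract states that a k-Nearest Neighbour algorithm maps low-quality parts to high-quality counterparts but gives no further detail. The sketch below is one possible interpretation under stated assumptions: each part has an image-derived feature vector, and the suggested settings are the mean parameters of the k nearest high-quality parts. The feature vectors, parameter values, units, and the averaging step are all hypothetical.

```python
import math


def suggest_parameters(defect_features, good_parts, k=2):
    """Average the process parameters of the k high-quality parts whose
    feature vectors lie closest (Euclidean distance) to the defective part."""
    nearest = sorted(
        good_parts, key=lambda part: math.dist(defect_features, part["features"])
    )[:k]
    names = nearest[0]["params"].keys()
    return {
        name: sum(part["params"][name] for part in nearest) / len(nearest)
        for name in names
    }


# Hypothetical high-quality reference parts (features and settings invented).
good_parts = [
    {"features": [0.0, 0.0],
     "params": {"heating_power": 80.0, "heating_time": 40.0, "vacuum_time": 10.0}},
    {"features": [1.0, 0.0],
     "params": {"heating_power": 90.0, "heating_time": 50.0, "vacuum_time": 12.0}},
    {"features": [5.0, 5.0],
     "params": {"heating_power": 60.0, "heating_time": 30.0, "vacuum_time": 8.0}},
]

suggestion = suggest_parameters([0.4, 0.1], good_parts, k=2)
print(suggestion)
```

The adjustment to apply would then be the difference between the suggested values and the settings that produced the defective part.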


[64] ResidualViT for Efficient Temporally Dense Video Encoding cs.CV | cs.AI | cs.IR | eess.IVPDF

Mattia Soldan, Fabian Caba Heilbron, Bernard Ghanem, Josef Sivic, Bryan Russell

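TL;DR: Proposes the ResidualViT architecture, which uses residual connections and a token reduction module to handle temporally dense video tasks efficiently. Combined with a lightweight distillation strategy, it markedly cuts computational cost and speeds up inference while preserving model performance.

Details

Motivation: Temporally dense video tasks require dense frame-level feature computation, which is computationally expensive, so redundancy and efficiency need to be addressed.

Result: Computational cost drops by up to 60% and inference is up to 2.5x faster, with accuracy close to that of the original model.

Insight: Temporal redundancy in video can be exploited effectively, and residual connections with token reduction markedly improve efficiency.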

Abstract: Several video understanding tasks, such as natural language temporal video grounding, temporal activity localization, and audio description generation, require “temporally dense” reasoning over frames sampled at high temporal resolution. However, computing frame-level features for these tasks is computationally expensive given the temporal resolution requirements. In this paper, we make three contributions to reduce the cost of computing features for temporally dense tasks. First, we introduce a vision transformer (ViT) architecture, dubbed ResidualViT, that leverages the large temporal redundancy in videos to efficiently compute temporally dense frame-level features. Our architecture incorporates (i) learnable residual connections that ensure temporal consistency across consecutive frames and (ii) a token reduction module that enhances processing speed by selectively discarding temporally redundant information while reusing weights of a pretrained foundation model. Second, we propose a lightweight distillation strategy to approximate the frame-level features of the original foundation model. Finally, we evaluate our approach across four tasks and five datasets, in both zero-shot and fully supervised settings, demonstrating significant reductions in computational cost (up to 60%) and improvements in inference speed (up to 2.5x faster), all while closely approximating the accuracy of the original foundation model.


[65] RadGame: An AI-Powered Platform for Radiology Education cs.CV | cs.AIPDF

Mohammed Baharoon, Siavash Raissi, John S. Jun, Thibault Heintz, Mahmoud Alabbad

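TL;DR: RadGame is an AI-powered, gamified radiology education platform that uses automated feedback to improve learners' localization and report-writing skills, clearly outperforming traditional methods.

Details

Motivation: Traditional radiology education lacks immediate, scalable feedback; RadGame aims to address this through AI and gamification.

Result: Participants using RadGame improved localization accuracy by 68% (versus 17% with traditional methods) and report-writing accuracy by 31% (versus 4% with traditional methods).

Insight: AI-driven gamified education can markedly improve training outcomes and opens a new path for medical education.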

Abstract: We introduce RadGame, an AI-powered gamified platform for radiology education that targets two core skills: localizing findings and generating reports. Traditional radiology training is based on passive exposure to cases or active practice with real-time input from supervising radiologists, limiting opportunities for immediate and scalable feedback. RadGame addresses this gap by combining gamification with large-scale public datasets and automated, AI-driven feedback that provides clear, structured guidance to human learners. In RadGame Localize, players draw bounding boxes around abnormalities, which are automatically compared to radiologist-drawn annotations from public datasets, and visual explanations are generated by vision-language models for findings the user missed. In RadGame Report, players compose findings given a chest X-ray, patient age and indication, and receive structured AI feedback based on radiology report generation metrics, highlighting errors and omissions compared to a radiologist’s written ground truth report from public datasets, producing a final performance and style score. In a prospective evaluation, participants using RadGame achieved a 68% improvement in localization accuracy compared to 17% with traditional passive methods and a 31% improvement in report-writing accuracy compared to 4% with traditional methods after seeing the same cases. RadGame highlights the potential of AI-driven gamification to deliver scalable, feedback-rich radiology training and reimagines the application of medical AI resources in education.


[66] Image Realness Assessment and Localization with Multimodal Features cs.CV | eess.IVPDF

Lovish Kaushik, Agnij Biswas, Somdyuti Paul

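TL;DR: The paper proposes a framework that assesses the realness of AI-generated images and localizes visually inconsistent regions, using multimodal features to improve realness prediction.

Details

Motivation: Assessing the realness of AI-generated images and identifying local inconsistencies is crucial for their practical use and for improving generative models.

Result: Experiments show that the method improves realness prediction performance and effectively distinguishes realistic from unrealistic regions in an image.

Insight: Multimodal features (combining vision and language) can assess and localize realness problems in AI-generated images more effectively.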

Abstract: A reliable method of quantifying the perceptual realness of AI-generated images and identifying visually inconsistent regions is crucial for practical use of AI-generated images and for improving photorealism of generative AI via realness feedback during training. This paper introduces a framework that accomplishes both overall objective realness assessment and local inconsistency identification of AI-generated images using textual descriptions of visual inconsistencies generated by vision-language models trained on large datasets that serve as reliable substitutes for human annotations. Our results demonstrate that the proposed multimodal approach improves objective realness prediction performance and produces dense realness maps that effectively distinguish between realistic and unrealistic spatial regions.


[67] StyleSculptor: Zero-Shot Style-Controllable 3D Asset Generation with Texture-Geometry Dual Guidance cs.CVPDF

Zefan Qu, Zhenwei Wang, Haoyuan Wang, Ke Xu, Gerhard Hancke

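TL;DR: StyleSculptor proposes a training-free, zero-shot method that generates style-controllable 3D assets through texture-geometry dual guidance. Its core module, Style Disentangled Attention (SD-Attn), dynamically fuses features from the content and style images.

Details

Motivation: In practical applications such as video games and virtual reality, generating 3D assets that match the style of existing assets is a common need. Conventional methods struggle to provide fine-grained style control, which StyleSculptor aims to address.

Result: Experiments show that StyleSculptor outperforms baseline methods in generating high-fidelity 3D assets and supports fine-grained control over texture, geometry, or both styles.

Insight: A dynamic attention mechanism and a disentanglement strategy enable efficient style control and suggest a new approach for 3D generation tasks.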

Abstract: Creating 3D assets that follow the texture and geometry style of existing ones is often desirable or even inevitable in practical applications like video gaming and virtual reality. While impressive progress has been made in generating 3D objects from text or images, creating style-controllable 3D assets remains a complex and challenging problem. In this work, we propose StyleSculptor, a novel training-free approach for generating style-guided 3D assets from a content image and one or more style images. Unlike previous works, StyleSculptor achieves style-guided 3D generation in a zero-shot manner, enabling fine-grained 3D style control that captures the texture, geometry, or both styles of user-provided style images. At the core of StyleSculptor is a novel Style Disentangled Attention (SD-Attn) module, which establishes a dynamic interaction between the input content image and style image for style-guided 3D asset generation via a cross-3D attention mechanism, enabling stable feature fusion and effective style-guided generation. To alleviate semantic content leakage, we also introduce a style-disentangled feature selection strategy within the SD-Attn module, which leverages the variance of 3D feature patches to disentangle style- and content-significant channels, allowing selective feature injection within the attention framework. With SD-Attn, the network can dynamically compute texture-, geometry-, or both-guided features to steer the 3D generation process. Built upon this, we further propose the Style Guided Control (SGC) mechanism, which enables exclusive geometry- or texture-only stylization, as well as adjustable style intensity control. Extensive experiments demonstrate that StyleSculptor outperforms existing baseline methods in producing high-fidelity 3D assets.


[68] 3D Aware Region Prompted Vision Language Model cs.CVPDF

An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li

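TL;DR: SR-3D is a vision-language model that combines 2D images and 3D data and supports flexible region prompting through a shared visual token space, without multi-frame labeling. It enriches 2D features with 3D positional embeddings and performs strongly on 2D and 3D benchmarks.

Details

Motivation: Existing vision-language models connect 2D and 3D data only weakly, and annotating 3D data requires extensive multi-frame labeling. SR-3D aims to address this and enable more efficient 3D spatial understanding.

Result: Achieves state-of-the-art performance on 2D and 3D benchmarks, applies to in-the-wild videos without 3D inputs or annotations, and accurately infers spatial relationships and metric measurements.

Insight: SR-3D shows that 2D and 3D representation spaces can be unified, offering a more efficient approach to scene understanding that also supports cross-frame reasoning in real-world scenes.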

Abstract: We present the Spatial Region 3D (SR-3D) aware vision-language model, which connects single-view 2D images and multi-view 3D data through a shared visual token space. SR-3D supports flexible region prompting, allowing users to annotate regions with bounding boxes, segmentation masks on any frame, or directly in 3D, without the need for exhaustive multi-frame labeling. We achieve this by enriching 2D visual features with 3D positional embeddings, which allows the 3D model to draw upon strong 2D priors for more accurate spatial reasoning across frames, even when objects of interest do not co-occur within the same view. Extensive experiments on both general 2D vision language and specialized 3D spatial benchmarks demonstrate that SR-3D achieves state-of-the-art performance, underscoring its effectiveness for unifying 2D and 3D representation space for scene understanding. Moreover, we observe applicability to in-the-wild videos without sensory 3D inputs or ground-truth 3D annotations, where SR-3D accurately infers spatial relationships and metric measurements.


cs.CL [Back]

[69] MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables cs.CL | cs.AI | 68T50 | I.2.7PDF

Matteo Marcuzzo, Alessandro Zangari, Andrea Albarelli, Jose Camacho-Collados, Mohammad Taher Pilehvar

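TL;DR: MORABLES is a benchmark built from fables and short stories for evaluating the abstract moral reasoning of large language models (LLMs). The study finds that although larger models outperform smaller ones, they remain vulnerable to adversarial manipulation and rely on shallow patterns rather than genuine moral reasoning.

Details

Motivation: As LLMs excel at standard reading comprehension tasks, research is shifting toward evaluating complex abstract reasoning and deeper comprehension, especially moral reasoning. The rich narratives of fables and stories provide an ideal framework for this.

Result: Larger models outperform smaller ones but are brittle under adversarial inputs and often rely on shallow patterns rather than deep reasoning. Reasoning-enhanced models fail to close the performance gap meaningfully.

Insight: Performance on moral reasoning tasks is driven mainly by scale rather than reasoning ability. Adversarial testing exposes model brittleness, indicating that current approaches still fall short of genuine abstract reasoning.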

Abstract: As LLMs excel on standard reading comprehension benchmarks, attention is shifting toward evaluating their capacity for complex abstract reasoning and inference. Literature-based benchmarks, with their rich narrative and moral depth, provide a compelling framework for evaluating such deeper comprehension skills. Here, we present MORABLES, a human-verified benchmark built from fables and short stories drawn from historical literature. The main task is structured as multiple-choice questions targeting moral inference, with carefully crafted distractors that challenge models to go beyond shallow, extractive question answering. To further stress-test model robustness, we introduce adversarial variants designed to surface LLM vulnerabilities and shortcuts due to issues such as data contamination. Our findings show that, while larger models outperform smaller ones, they remain susceptible to adversarial manipulation and often rely on superficial patterns rather than true moral reasoning. This brittleness results in significant self-contradiction, with the best models refuting their own answers in roughly 20% of cases depending on the framing of the moral choice. Interestingly, reasoning-enhanced models fail to bridge this gap, suggesting that scale - not reasoning ability - is the primary driver of performance.


[70] SENTRA: Selected-Next-Token Transformer for LLM Text Detection cs.CL | cs.LGPDF

Mitchell Plyler, Yilun Zhang, Alexander Tuzhilin, Saoud Khalifah, Sen Tian

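TL;DR: SENTRA is a Transformer-based LLM text detector that uses selected-next-token-probability sequences and contrastive pre-training, and significantly outperforms existing baselines, especially in out-of-domain settings.

Details

Motivation: As LLMs grow more capable and widespread, misuse of their generated text is an increasing problem, creating the need for a general-purpose detector that identifies undeclared LLM-generated text.

Result: In experiments on three public datasets spanning 24 text domains, SENTRA significantly outperforms existing baselines in the out-of-domain setting.

Insight: SENTRA improves out-of-domain detection through contrastive pre-training and token-probability sequences, suggesting a new approach to LLM text detection.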

Abstract: LLMs are becoming increasingly capable and widespread. Consequently, the potential and reality of their misuse is also growing. In this work, we address the problem of detecting LLM-generated text that is not explicitly declared as such. We present a novel, general-purpose, and supervised LLM text detector, SElected-Next-Token tRAnsformer (SENTRA). SENTRA is a Transformer-based encoder leveraging selected-next-token-probability sequences and utilizing contrastive pre-training on large amounts of unlabeled data. Our experiments on three popular public datasets across 24 domains of text demonstrate SENTRA is a general-purpose classifier that significantly outperforms popular baselines in the out-of-domain setting.


[71] MORQA: Benchmarking Evaluation Metrics for Medical Open-Ended Question Answering cs.CL | 68T50 (Primary) 68T45 (Secondary) | I.2.7; I.2.10PDF

Wen-wai Yim, Asma Ben Abacha, Zixuan Yu, Robert Doerning, Fei Xia

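TL;DR: MORQA is a new multilingual benchmark for assessing natural language generation evaluation metrics on medical open-ended question answering. It compares traditional metrics and LLM-based methods against expert-rated gold-standard answers.

Details

Motivation: Open-ended question answering in the medical domain places very high demands on accuracy, relevance, and domain expertise. Traditional automatic metrics such as BLEU do poorly at distinguishing between high-quality answers, so more effective evaluation methods are needed.

Result: LLM-based evaluators significantly outperform traditional metrics, particularly in sensitivity to semantic nuance and in handling variability among reference answers.

Insight: Medical QA evaluation needs to prioritize agreement with human expert judgment. LLM-based methods have an advantage in semantic understanding and in handling multiple reference answers.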

Abstract: Evaluating natural language generation (NLG) systems in the medical domain presents unique challenges due to the critical demands for accuracy, relevance, and domain-specific expertise. Traditional automatic evaluation metrics, such as BLEU, ROUGE, and BERTScore, often fall short in distinguishing between high-quality outputs, especially given the open-ended nature of medical question answering (QA) tasks where multiple valid responses may exist. In this work, we introduce MORQA (Medical Open-Response QA), a new multilingual benchmark designed to assess the effectiveness of NLG evaluation metrics across three medical visual and text-based QA datasets in English and Chinese. Unlike prior resources, our datasets feature 2-4+ gold-standard answers authored by medical professionals, along with expert human ratings for three English and Chinese subsets. We benchmark both traditional metrics and large language model (LLM)-based evaluators, such as GPT-4 and Gemini, finding that LLM-based approaches significantly outperform traditional metrics in correlating with expert judgments. We further analyze factors driving this improvement, including LLMs’ sensitivity to semantic nuances and robustness to variability among reference answers. Our results provide the first comprehensive, multilingual qualitative study of NLG evaluation in the medical domain, highlighting the need for human-aligned evaluation methods. All datasets and annotations will be publicly released to support future research.
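
Benchmarking a metric against expert judgments comes down to correlating the metric's scores with human ratings over the same answers. The abstract does not name the correlation coefficient, so the sketch below assumes Spearman rank correlation, a common choice for this kind of meta-evaluation. All scores and ratings are invented for illustration.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


expert_ratings = [1, 2, 3, 4, 5]             # hypothetical expert scores
metric_a = [0.10, 0.40, 0.35, 0.80, 0.90]    # metric that mostly agrees
metric_b = [0.50, 0.40, 0.30, 0.20, 0.10]    # metric that ranks in reverse
print(spearman(expert_ratings, metric_a), spearman(expert_ratings, metric_b))
```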


[72] MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts cs.CL | cs.AIPDF

Jiayi He, Yangmin Huang, Qianyun Du, Xiangying Zhou, Zhiyang He

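TL;DR: MedFact is a new benchmark dataset designed for Chinese medical fact-checking. It contains 2,116 expert-annotated instances covering 13 medical specialties, 8 error types, and 4 writing styles. A comprehensive evaluation of 20 leading large language models (LLMs) finds that models can detect errors but still struggle to localize them precisely.

Details

Motivation: With LLMs widely deployed in healthcare, their factual reliability urgently needs testing. Existing benchmarks cover narrow domains and fail to reflect the complexity of real-world medical information.

Result: LLMs perform reasonably on error detection but fall well short of human experts on error localization, and they exhibit an over-criticism phenomenon.

Insight: Advanced reasoning techniques (such as multi-agent collaboration) can exacerbate the models' over-criticism tendency, which points to the need for more reliable models in the medical domain.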

Abstract: The increasing deployment of Large Language Models (LLMs) in healthcare necessitates a rigorous evaluation of their factual reliability. However, existing benchmarks are often limited by narrow domains of data, failing to capture the complexity of real-world medical information. To address this critical gap, we introduce MedFact, a new and challenging benchmark for Chinese medical fact-checking. MedFact comprises 2,116 expert-annotated instances curated from diverse real-world texts, spanning 13 medical specialties, 8 fine-grained error types, 4 writing styles, and multiple difficulty levels. Its construction employs a hybrid AI-human framework where iterative expert feedback refines an AI-driven, multi-criteria filtering process, ensuring both high data quality and difficulty. We conduct a comprehensive evaluation of 20 leading LLMs, benchmarking their performance on veracity classification and error localization against a human expert baseline. Our results reveal that while models can often determine if a text contains an error, precisely localizing it remains a substantial challenge, with even top-performing models falling short of human performance. Furthermore, our analysis uncovers a frequent “over-criticism” phenomenon, a tendency for models to misidentify correct information as erroneous, which is exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. By highlighting these critical challenges for deploying LLMs in medical applications, MedFact provides a robust resource to drive the development of more factually reliable and medically aware models.


[73] Audited Reasoning Refinement: Fine-Tuning Language Models via LLM-Guided Step-Wise Evaluation and Correction cs.CLPDF

Sumanta Bhattacharyya, Sara Riaz, Pedram Rooshenas

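TL;DR: The paper proposes R2tA, a method that trains task-specific small reasoning models on intermediate reasoning traces generated and refined by LLMs, addressing the scarcity of labeled data.

Details

Motivation: Training task-specific small models is challenging when direct human supervision or high-quality labels are scarce. Intermediate reasoning traces generated by LLMs can be systematically refined to provide effective supervision signals.

Result: On the extended entity relationship diagram (EERD) evaluation task, R2tA performs well and offers a low-cost, scalable path to LLM adaptation.

Insight: R2tA shows the potential of using LLMs to generate high-quality supervision signals in data-scarce domains, enabling reproducible AI tools for education and other complex tasks.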

Abstract: Training a task-specific small reasoning model is challenging when direct human supervision or high-quality labels are scarce. However, LLMs with reasoning capabilities produce abundant intermediate reasoning traces that can be systematically refined to create effective supervision signals. We propose Reason-Refine-then-Align (R2tA), which turns refined model rationales into supervision for training task-specific reasoning models. Our method generates initial reasoning and responses from an open-source base model on task-specific inputs, then refines these traces, fixing hallucinations and inconsistencies, to form a high-fidelity dataset. We perform a two-stage alignment, supervised fine-tuning (SFT), followed by direct preference optimization (DPO) to calibrate the model’s intermediate reasoning with human-validated conceptual preferences and then condition the final output on that aligned reasoning. As a case study, we apply R2tA to evaluate extended entity relationship diagrams (EERDs) in database system design, a structurally complex task where prompt-only methods miss or hallucinate errors. We curated a dataset of 600 EERD variants (train/test split of 450/150, respectively) with induced mistakes spanning 11 categories. Empirical evaluation suggests R2tA provides a practical, cost-effective path to scalable LLM adaptation in data-scarce domains, enabling reproducible AI tools for education and beyond.
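
The second alignment stage mentioned above is direct preference optimization (DPO). As a rough illustration of what that objective computes, here is a minimal sketch of the standard per-example DPO loss on scalar sequence log-probabilities. The function name, the `beta` value, and the numbers are assumptions for illustration; the abstract does not describe the authors' training configuration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy being trained
    (logp_*) and under the frozen reference model (ref_logp_*).
    beta=0.1 is an assumed, commonly used value.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy already prefers the chosen
# (refined) rationale more strongly than the reference model does.
print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

The loss shrinks as the policy raises the likelihood of the preferred response relative to the reference model and lowers that of the rejected one.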


[74] FunAudio-ASR Technical Report cs.CL | cs.AI | cs.SD | eess.ASPDF

Keyu An, Yanni Chen, Chong Deng, Changfeng Gao, Zhifu Gao

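TL;DR: FunAudio-ASR is a large-scale, LLM-based automatic speech recognition (ASR) system. It combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios, and it is optimized for practical deployment.

Details

Motivation: Existing LLM-based ASR systems perform well on open-source benchmarks but often underperform on real industry evaluation sets, and they are prone to hallucination, which seriously degrades user experience.

Result: FunAudio-ASR achieves SOTA performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.

Insight: LLM-based ASR systems need deployment-oriented optimizations (such as streaming capability and noise robustness) to perform well in industrial applications, rather than relying only on open-source benchmark scores.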

Abstract: In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present FunAudio-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, FunAudio-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and satisfying other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to production-oriented optimizations, FunAudio-ASR achieves SOTA performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.


[75] A comparison of pipelines for the translation of a low resource language based on transformers cs.CL | cs.CE | cs.CY | cs.LGPDF

Chiara Bonfanti, Michele Colombino, Giulia Coucourde, Faeze Memari, Stefano Pinardi

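TL;DR: The paper compares three pipelines for training transformer-based neural networks for machine translation of Bambara, a low-resource language. The simple transformer model performs best, particularly in the low-resource setting.

Details

Motivation: Bambara is a low-resource African language that lacks high-quality machine translation tools. The paper compares training pipelines to offer a practical solution for low-resource translation.

Result: On Bayelemagaba, the simple transformer achieves the best scores (10% BLEU, 21% chrF). On the Yiri dataset, it reaches 33.81% BLEU and 41% chrF. LLaMA3-based models perform better on single datasets than on aggregated collections.

Insight: Simple models may be more robust for low-resource translation, while fine-tuned models are better at capturing dataset-specific patterns. Language distillation shows potential for integrating low-resource languages into pre-trained models.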

Abstract: This work compares three pipelines for training transformer-based neural networks to produce machine translators for Bambara, a Mandé language spoken in Africa by about 14,188,850 people. The first pipeline trains a simple transformer to translate sentences from French into Bambara. The second fine-tunes LLaMA3 (3B-8B) instructor models using decoder-only architectures for French-to-Bambara translation. Models from the first two pipelines were trained with different hyperparameter combinations to improve BLEU and chrF scores, evaluated on both test sentences and official Bambara benchmarks. The third pipeline uses language distillation with a student-teacher dual neural network to integrate Bambara into a pre-trained LaBSE model, which provides language-agnostic embeddings. A BERT extension is then applied to LaBSE to generate translations. All pipelines were tested on Dokotoro (medical) and Bayelemagaba (mixed domains). Results show that the first pipeline, although simpler, achieves the best translation accuracy (10% BLEU, 21% chrF on Bayelemagaba), consistent with low-resource translation results. On the Yiri dataset, created for this work, it achieves 33.81% BLEU and 41% chrF. Instructor-based models perform better on single datasets than on aggregated collections, suggesting they capture dataset-specific patterns more effectively.


[76] MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models cs.CLPDF

Vijay Govindarajan, Pratik Patel, Sahil Tripathi, Md Azizul Hoque, Gautam Siddharth Kashyap

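TL;DR: The paper proposes a zero-shot audio captioning system that uses a pre-trained audio CLIP model to extract features and build a structured prompt, which guides an LLM to generate captions and substantially improves performance.

Details

Motivation: Audio captioning datasets are limited, and conventional methods need large amounts of training data. The paper uses pre-trained models for zero-shot captioning to reduce that dependence on data.

Result: With MAGIC search and the WavCaps model, the NLG mean score rises from 4.7 to 7.3 (reported as a 35% improvement).

Insight: 1. The audio-text matching model and keyword selection are critical to performance; 2. A single-keyword prompt works best; 3. Performance drops by 50% when no keyword list is used.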

Abstract: Automated Audio Captioning (AAC) generates captions for audio clips but faces challenges due to limited datasets compared to image captioning. To overcome this, we propose a zero-shot AAC system that leverages pre-trained models, eliminating the need for extensive training. Our approach uses a pre-trained audio CLIP model to extract auditory features and generate a structured prompt, which guides a Large Language Model (LLM) in caption generation. Unlike traditional greedy decoding, our method refines token selection through the audio CLIP model, ensuring alignment with the audio content. Experimental results demonstrate a 35% improvement in NLG mean score (from 4.7 to 7.3) using MAGIC search with the WavCaps model. The performance is heavily influenced by the audio-text matching model and keyword selection, with optimal results achieved using a single keyword prompt, and a 50% performance drop when no keyword list is used.
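
To make the contrast with greedy decoding concrete, the sketch below shows one decoding step in which the language model's top-k candidates are re-scored by an audio-text similarity term. It is a simplified illustration of the MAGIC-style idea, not the paper's implementation: the degeneration penalty of the original MAGIC formulation is omitted, and `magic_step`, `beta`, and all probabilities and similarity values are hypothetical.

```python
import math


def magic_step(lm_probs, audio_sims, k=3, beta=1.0):
    """Pick the next token from the LM's top-k candidates.

    lm_probs:   token -> language-model probability for the next token
    audio_sims: token -> similarity between the audio clip and the text
                extended with that token (from an audio-text model)
    beta:       weight of the audio-matching term (assumed value)
    """
    top_k = sorted(lm_probs, key=lm_probs.get, reverse=True)[:k]
    # Softmax of the similarities, restricted to the top-k candidates.
    normaliser = sum(math.exp(audio_sims[t]) for t in top_k)
    scores = {
        t: lm_probs[t] + beta * math.exp(audio_sims[t]) / normaliser
        for t in top_k
    }
    return max(scores, key=scores.get)


# Hypothetical step: the LM slightly prefers "dog", but the audio matches "rain".
lm_probs = {"dog": 0.5, "rain": 0.3, "car": 0.2}
audio_sims = {"dog": 0.1, "rain": 2.0, "car": 0.0}
print(magic_step(lm_probs, audio_sims))            # audio-guided choice
print(magic_step(lm_probs, audio_sims, beta=0.0))  # reduces to greedy decoding
```

Setting `beta` to zero recovers plain greedy selection, which is a convenient way to see how much the audio term changes the output.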


[77] EconProver: Towards More Economical Test-Time Scaling for Automated Theorem Proving cs.CL | cs.AIPDF

Mukai Li, Linfeng Song, Zhenwen Liang, Jiahao Xu, Shansan Gong

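TL;DR: EconProver proposes two complementary methods (a dynamic CoT switching mechanism and diverse parallel-scaled reinforcement learning) that reduce computational cost while maintaining automated theorem proving performance. Experiments show it matches baseline performance at only 12% of the computational cost.

Details

Motivation: Test-time scaling strategies widely used in automated theorem proving (such as reflective CoT reasoning and more sampling passes) add significant computational overhead, and existing cost analyses do not adequately account for the differences in sampling cost across strategies.

Result: Experiments on miniF2F and ProofNet show performance comparable to the baselines at only 12% of their computational cost.

Insight: Combining a dynamic reasoning strategy with better sampling efficiency can significantly improve the cost-efficiency of ATP models, offering a practical route to lightweight deployment.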

Abstract: Large Language Models (LLMs) have recently advanced the field of Automated Theorem Proving (ATP), attaining substantial performance gains through widely adopted test-time scaling strategies, notably reflective Chain-of-Thought (CoT) reasoning and increased sampling passes. However, they both introduce significant computational overhead for inference. Moreover, existing cost analyses typically regulate only the number of sampling passes, while neglecting the substantial disparities in sampling costs introduced by different scaling strategies. In this paper, we systematically compare the efficiency of different test-time scaling strategies for ATP models and demonstrate the inefficiency of the current state-of-the-art (SOTA) open-source approaches. We then investigate approaches to significantly reduce token usage and sample passes while maintaining the original performance. Specifically, we propose two complementary methods that can be integrated into a unified EconRL pipeline for amplified benefits: (1) a dynamic Chain-of-Thought (CoT) switching mechanism designed to mitigate unnecessary token consumption, and (2) Diverse parallel-scaled reinforcement learning (RL) with trainable prefixes to enhance pass rates under constrained sampling passes. Experiments on miniF2F and ProofNet demonstrate that our EconProver achieves comparable performance to baseline methods with only 12% of the computational cost. This work provides actionable insights for deploying lightweight ATP models without sacrificing performance.


[78] Positional Encoding via Token-Aware Phase Attention cs.CL | cs.AIPDF

Yu Wang, Sheng Shen, Rémi Munos, Hongyuan Zhan

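TL;DR: The paper proposes TAPA, a new positional encoding method that incorporates a learnable phase function into the attention mechanism. It addresses RoPE's long-range limitation and extends to longer contexts with only light fine-tuning, rather than post-hoc rescaling or hyperparameter retuning.

Details

Motivation: RoPE has a distance-dependent bias in long-range modeling, and existing extension methods usually require adjustments after pretraining (such as rescaling or hyperparameter retuning), which limits their flexibility.

Result: TAPA attains significantly lower perplexity on long-context tasks than the RoPE family and extrapolates to unseen lengths.

Insight: The learnable phase function is the key: it lets the model adapt flexibly to dependencies at different distances, improving long-range modeling.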

Abstract: We prove under practical assumptions that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE’s ability to model long contexts. RoPE extension methods may alleviate this issue, but they typically require post-hoc adjustments after pretraining, such as rescaling or hyperparameter retuning. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism. TAPA preserves token interactions over long range, extends to longer contexts with direct and light fine-tuning, extrapolates to unseen lengths, and attains significantly lower perplexity on long-context than RoPE families.
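
For readers unfamiliar with why RoPE scores depend on distance at all, the sketch below illustrates the standard RoPE property in two dimensions: after rotating the query and key by angles proportional to their positions, the attention score depends only on the relative offset. It demonstrates plain RoPE, not TAPA (the abstract does not specify TAPA's phase function), and the vectors and `theta` are arbitrary illustrative values.

```python
import math


def rotate(vec, pos, theta=0.1):
    """Apply a 2-D rotary embedding: rotate (x, y) by the angle pos * theta."""
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)


def rope_score(q, k, q_pos, k_pos, theta=0.1):
    """Dot product of the position-rotated query and key."""
    qr = rotate(q, q_pos, theta)
    kr = rotate(k, k_pos, theta)
    return qr[0] * kr[0] + qr[1] * kr[1]


q, k = (1.0, 0.5), (0.3, -0.8)  # arbitrary example vectors

# Same offset (2) at different absolute positions gives the same score.
print(rope_score(q, k, 3, 1), rope_score(q, k, 12, 10))

# The score varies with the offset itself.
for offset in (0, 8, 16, 32):
    print(offset, round(rope_score(q, k, offset, 0), 4))
```

Because the score is a fixed function of the offset for given content vectors, any systematic trend in that function acts as a distance-dependent bias, which is the behavior the abstract analyzes.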


[79] PAC: Pronunciation-Aware Contextualized Large Language Model-based Automatic Speech Recognition cs.CL | eess.ASPDF

Li Fu, Yu Xin, Sunlu Zeng, Lu Fan, Youzheng Wu

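TL;DR: The paper proposes PAC, a pronunciation-aware contextualized framework that addresses pronunciation modeling and homophone discrimination in LLM-based automatic speech recognition (ASR). A two-stage learning approach significantly reduces word error rate (WER) and biased WER on long-tail words.

Details

Motivation: In LLM-based ASR systems, effectively modeling pronunciation and discriminating homophones are key challenges, especially for raw or long-tail words.

Result: On Librispeech and AISHELL-1, PAC reduces relative WER by 30.2% and 53.8%, respectively, compared with pre-trained LLM-based ASR models, and reduces biased WER on long-tail words by 31.8% and 60.5%.

Insight: Context modeling that combines pronunciation (phoneme) and grapheme information, together with reinforcement learning, is an effective way to improve ASR performance, especially on complex or long-tail words.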

Abstract: This paper presents a Pronunciation-Aware Contextualized (PAC) framework to address two key challenges in Large Language Model (LLM)-based Automatic Speech Recognition (ASR) systems: effective pronunciation modeling and robust homophone discrimination. Both are essential for raw or long-tail word recognition. The proposed approach adopts a two-stage learning paradigm. First, we introduce a pronunciation-guided context learning method. It employs an interleaved grapheme-phoneme context modeling strategy that incorporates grapheme-only distractors, encouraging the model to leverage phonemic cues for accurate recognition. Then, we propose a pronunciation-discriminative reinforcement learning method with perturbed label sampling to further enhance the model's ability to distinguish contextualized homophones. Experimental results on the public English Librispeech and Mandarin AISHELL-1 datasets indicate that PAC: (1) reduces relative Word Error Rate (WER) by 30.2% and 53.8% compared to pre-trained LLM-based ASR models, and (2) achieves 31.8% and 60.5% relative reductions in biased WER for long-tail words compared to strong baselines, respectively.
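
The results above are stated as relative WER reductions, which are easy to misread as absolute differences. The following sketch shows how WER and a relative reduction are typically computed. The transcripts and the baseline WER values are hypothetical (the abstract reports only the relative figures), and `wer` assumes a non-empty reference.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Fraction of the baseline error that was removed."""
    return (baseline - improved) / baseline


# One substitution and one deletion against a six-word reference.
print(wer("the cat sat on the mat", "the cat sit on mat"))

# Hypothetical example: a WER drop from 10.0% to 6.98% is a 30.2% relative
# reduction, although the absolute drop is only about 3 points.
print(relative_reduction(10.0, 6.98))
```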


[80] Don’t Change My View: Ideological Bias Auditing in Large Language Models cs.CL | cs.AIPDF

Paul Kröger, Emilio Barkett

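TL;DR: The paper presents a statistical method for detecting ideological bias in large language models (LLMs) that is suitable for auditing black-box systems.

Details

Motivation: LLM outputs may influence public opinion, so methods are needed to detect whether a model has been intentionally steered toward a specific ideological position.

Result: Experiments validate the method's practical applicability and support independent post hoc audits of LLM behavior.

Insight: The method provides a tool for detecting and guarding against ideological steering in LLMs, supporting transparency and accountability.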

Abstract: As large language models (LLMs) become increasingly embedded in products used by millions, their outputs may influence individual beliefs and, cumulatively, shape public opinion. If the behavior of LLMs can be intentionally steered toward specific ideological positions, such as political or religious views, then those who control these systems could gain disproportionate influence over public discourse. Although it remains an open question whether LLMs can reliably be guided toward coherent ideological stances and whether such steering can be effectively prevented, a crucial first step is to develop methods for detecting when such steering attempts occur. In this work, we adapt a previously proposed statistical method to the new context of ideological bias auditing. Our approach carries over the model-agnostic design of the original framework, which does not require access to the internals of the language model. Instead, it identifies potential ideological steering by analyzing distributional shifts in model outputs across prompts that are thematically related to a chosen topic. This design makes the method particularly suitable for auditing proprietary black-box systems. We validate our approach through a series of experiments, demonstrating its practical applicability and its potential to support independent post hoc audits of LLM behavior.


[81] Mitigating Strategy Preference Bias in Emotional Support Conversation via Uncertainty Estimations cs.CLPDF

Yougen Zhou, Qin Chen, Ningning Zhou, Jie Zhou, Xingjiao Wu

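TL;DR: The paper analyzes the causes of strategy preference bias in LLM strategy planning for emotional support conversation (ESC) and proposes a reinforcement learning method based on knowledge boundaries and a dual reward function, which reduces the bias and improves strategy planning accuracy.

Details

Motivation: In emotional support conversation, LLMs' strategy preference bias leads to poor ESC quality, and existing work has not adequately studied the root causes of the bias.

Result: Experiments on the ESCov and ExTES datasets show that the method outperforms the baselines.

Insight: LLMs' strategy preference bias is closely tied to their knowledge boundaries. Entropy-based adjustment can better balance diversity and accuracy in strategy selection.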

Abstract: Emotional support conversation (ESC) aims to alleviate distress through empathetic dialogue, yet large language models (LLMs) face persistent challenges in delivering effective ESC due to low accuracy in strategy planning. Moreover, there is a considerable preference bias towards specific strategies. Prior methods using fine-tuned strategy planners have shown potential in reducing such bias, while the underlying causes of the preference bias in LLMs have not been well studied. To address these issues, we first reveal the fundamental causes of the bias by identifying the knowledge boundaries of LLMs in strategy planning. Then, we propose an approach to mitigate the bias by reinforcement learning with a dual reward function, which optimizes strategy planning via both accuracy and entropy-based confidence for each region according to the knowledge boundaries. Experiments on the ESCov and ExTES datasets with multiple LLM backbones show that our approach outperforms the baselines, confirming the effectiveness of our approach.


[82] Chat-Driven Text Generation and Interaction for Person Retrieval cs.CL | I.2.7; I.4.9PDF

Zequn Xie, Chuxin Wang, Sihang Cai, Yeqiang Wang, Shulei Wang

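TL;DR: The paper proposes an annotation-free framework for text-based person retrieval, consisting of Multi-Turn Text Generation (MTG) and Multi-Turn Text Interaction (MTI) modules, that significantly improves retrieval accuracy and robustness.

Details

Motivation: Traditional text-based person search (TBPS) relies on large amounts of manually annotated text descriptions, which is costly and hard to scale. The paper reduces this dependence by generating pseudo-labels through simulated dialogues and refining queries through dynamic interaction.

Result: Experiments show competitive or superior retrieval results without manual annotation, with significantly improved robustness and usability.

Insight: Generating pseudo-labels through simulated dialogue and refining queries through dynamic interaction is an effective annotation-free approach that opens a path to practical TBPS deployment.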

Abstract: Text-based person search (TBPS) enables the retrieval of person images from large-scale databases using natural language descriptions, offering critical value in surveillance applications. However, a major challenge lies in the labor-intensive process of obtaining high-quality textual annotations, which limits scalability and practical deployment. To address this, we introduce two complementary modules: Multi-Turn Text Generation (MTG) and Multi-Turn Text Interaction (MTI). MTG generates rich pseudo-labels through simulated dialogues with MLLMs, producing fine-grained and diverse visual descriptions without manual supervision. MTI refines user queries at inference time through dynamic, dialogue-based reasoning, enabling the system to interpret and resolve vague, incomplete, or ambiguous descriptions - characteristics often seen in real-world search scenarios. Together, MTG and MTI form a unified and annotation-free framework that significantly improves retrieval accuracy, robustness, and usability. Extensive evaluations demonstrate that our method achieves competitive or superior results while eliminating the need for manual captions, paving the way for scalable and practical deployment of TBPS systems.


[83] Towards Inclusive Toxic Content Moderation: Addressing Vulnerabilities to Adversarial Attacks in Toxicity Classifiers Tackling LLM-generated Content cs.CLPDF

Shaz Furniturewala, Arkaitz Zubiaga

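TL;DR: The paper addresses the vulnerability of toxicity classifiers to LLM-generated content and adversarial attacks. It proposes a new strategy based on mechanistic interpretability that identifies and suppresses vulnerable circuits to improve robustness and fairness.

Details

Motivation: With the widespread use of large language models (LLMs), machine-generated content has surged, and conventional moderation classifiers trained on human text perform poorly on LLM-generated content and adversarial attacks. Current defenses are mostly reactive and lack proactive identification of vulnerabilities and targeted fixes.

Result: Models contain distinct heads that are either crucial for performance or vulnerable to attack. Suppressing the vulnerable heads improves performance on adversarial input. Different heads are responsible for vulnerability across different demographic groups, which gives direction for improving fairness.

Insight: The paper shows that vulnerability is closely tied to model structure and offers targeted guidance for designing future toxicity classifiers, in particular on improving robustness while preserving fairness.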

Abstract: The volume of machine-generated content online has grown dramatically due to the widespread use of Large Language Models (LLMs), leading to new challenges for content moderation systems. Conventional content moderation classifiers, which are usually trained on text produced by humans, suffer from misclassifications due to LLM-generated text deviating from their training data and adversarial attacks that aim to avoid detection. Present-day defence tactics are reactive rather than proactive, since they rely on adversarial training or external detection models to identify attacks. In this work, we aim to identify the vulnerable components of toxicity classifiers that contribute to misclassification, proposing a novel strategy based on mechanistic interpretability techniques. Our study focuses on fine-tuned BERT and RoBERTa classifiers, testing on diverse datasets spanning a variety of minority groups. We use adversarial attacking techniques to identify vulnerable circuits. Finally, we suppress these vulnerable circuits, improving performance against adversarial attacks. We also provide demographic-level insights into these vulnerable circuits, exposing fairness and robustness gaps in model training. We find that models have distinct heads that are either crucial for performance or vulnerable to attack and suppressing the vulnerable heads improves performance on adversarial input. We also find that different heads are responsible for vulnerability across different demographic groups, which can inform more inclusive development of toxicity detection models.


[84] Case-Based Decision-Theoretic Decoding with Quality Memories cs.CLPDF

Hiroyuki Deguchi, Masaaki Nagata

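TL;DR: The paper proposes case-based decision-theoretic (CBDT) decoding, which estimates expected utility from examples of domain data and complements conventional minimum Bayes risk (MBR) decoding. In multi-domain translation and image captioning tasks, CBDT decoding outperforms MAP decoding, and the combination of MBR and CBDT outperforms MBR alone.

Details

Motivation: Conventional MBR decoding depends on texts sampled from the text generation model, so it struggles to capture out-of-domain knowledge. The authors propose CBDT decoding to address this.

Result: CBDT decoding outperforms MAP decoding, and the combination of MBR and CBDT outperforms MBR decoding alone.

Insight: Examples of domain data can give better expected-utility estimates, especially for out-of-domain knowledge, making CBDT decoding an effective complement to MBR.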

Abstract: Minimum Bayes risk (MBR) decoding is a decision rule of text generation, which selects the hypothesis that maximizes the expected utility and robustly generates higher-quality texts than maximum a posteriori (MAP) decoding. However, it depends on sample texts drawn from the text generation model; thus, it is difficult to find a hypothesis that correctly captures out-of-domain knowledge or information. To tackle this issue, we propose case-based decision-theoretic (CBDT) decoding, another method to estimate the expected utility using examples of domain data. CBDT decoding not only generates higher-quality texts than MAP decoding, but also the combination of MBR and CBDT decoding outperformed MBR decoding in seven domain De–En and Ja↔En translation tasks and image captioning tasks on MSCOCO and nocaps datasets.
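
As background for the decision rule described above, here is a minimal sketch of sampling-based MBR decoding: each candidate is scored by its average utility against samples drawn from the model, and the highest-scoring candidate is returned. The utility (a simple word-overlap F1) and the example strings are assumptions chosen for brevity; real systems use metrics such as BLEU or learned metrics, and the paper's CBDT estimator is not reproduced here.

```python
def overlap_f1(hypothesis, reference):
    """Toy utility: F1 over word overlap between two strings."""
    hyp, ref = hypothesis.split(), reference.split()
    common = sum(min(hyp.count(w), ref.count(w)) for w in set(hyp))
    if common == 0:
        return 0.0
    precision, recall = common / len(hyp), common / len(ref)
    return 2 * precision * recall / (precision + recall)


def mbr_decode(candidates, samples, utility=overlap_f1):
    """Return the candidate with the highest average utility over the samples."""
    def expected_utility(candidate):
        return sum(utility(candidate, s) for s in samples) / len(samples)
    return max(candidates, key=expected_utility)


# Hypothetical model samples; in practice these are drawn from the generator.
samples = ["the cat sat", "the cat sat down", "the cat sat", "a dog ran"]
candidates = ["the cat sat", "the cat sat down", "a dog ran"]
print(mbr_decode(candidates, samples))
```

Since the expected utility is estimated only from the model's own samples, a model that rarely produces in-domain phrasing cannot reward it, which is the limitation the abstract sets out to address with domain examples.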


[85] HistoryBankQA: Multilingual Temporal Question Answering on Historical Events cs.CLPDF

Biswadip Mandal, Anant Khandelwal, Manish Gupta

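TL;DR: The paper presents HistoryBank, a multilingual database of historical events covering 10 languages and 10M+ events, and builds a temporal question answering benchmark spanning 6 tasks, on which it evaluates several popular language models (such as GPT4o and Gemma-2).

Details

Motivation: Existing temporal reasoning datasets are limited in scale, lack multilingual coverage, and focus more on contemporary events, so they cannot fully assess language models' temporal reasoning.

Result: GPT4o performs best across all tasks and languages; Gemma-2 performs best among the small models.

Insight: The work provides a resource for advancing multilingual, temporally aware natural language understanding and shows both the potential and the limits of language models in reasoning about historical events.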

Abstract: Temporal reasoning about historical events is a critical skill for NLP tasks like event extraction, historical entity linking, temporal question answering, timeline summarization, temporal event clustering and temporal natural language inference. Yet efforts on benchmarking temporal reasoning capabilities of large language models (LLMs) are rather limited. Existing temporal reasoning datasets are limited in scale, lack multilingual coverage and focus more on contemporary events. To address these limitations, we present HistoryBank, a multilingual database of 10M+ historical events extracted from Wikipedia timeline pages and article infoboxes. Our database provides unprecedented coverage in both historical depth and linguistic breadth with 10 languages. Additionally, we construct a comprehensive question answering benchmark for temporal reasoning across all languages. This benchmark covers a diverse set of 6 temporal QA reasoning tasks, and we evaluate a suite of popular language models (LLaMA-3-8B, Mistral-7B, Gemma-2-9b, Qwen3-8B, GPT4o) to assess their performance on these tasks. As expected, GPT4o performs best across all answer types and languages; Gemma-2 outperforms the other small language models. Our work aims to provide a comprehensive resource for advancing multilingual and temporally-aware natural language understanding of historical events. To facilitate further research, we will make our code and datasets publicly available upon acceptance of this paper.


[86] Contrastive Learning with Enhanced Abstract Representations using Grouped Loss of Abstract Semantic Supervision cs.CLPDF

Omri Suissa, Muhiim Ali, Shengmai Chen, Yinuo Cai, Shekhar Pradhan

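TL;DR: The paper proposes a grouped contrastive loss that strengthens abstract concept recognition in vision-language models (VLMs) and introduces the MAGIC dataset.

Details

Motivation: Humans can recognize abstract concepts in images, not just objects and their relationships. The authors investigate whether VLMs have this capacity and propose a method to strengthen it.

Result: Experiments show that the CLEAR GLASS model improves on SOTA models in abstract concept recognition.

Insight: By introducing abstract concept information implicitly through contrastive learning, the model gains abstraction capacity as an emergent property, without being exposed to the higher-level concepts during training.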

Abstract: Humans can recognize an image as an instance of a general concept, beyond simply identifying its objects and their relationships. In this paper, we investigate (1) the extent to which VLMs have this concept abstraction capacity, and (2) strategies for encoding the sort of higher-concept information in images that would enable the resulting VLM (the CLEAR GLASS model) to have this capability to a greater degree. To this end, we introduce a grouped image-caption dataset (MAGIC), which consists of several groups of image captions and for each group a set of associated images and higher-level conceptual labels. We use a novel contrastive loss technique to induce the model to encode in the representation of each image (caption) in a group the information that is common to all members of the image-caption group. Our main contribution is a grouped contrastive loss function based on text-image contrastive groups (outer contrastive loss) as well as an inner loss which measures the distances between image-caption instances in the group. Our training methodology results in the CLEAR GLASS model having the concept abstraction capacity as an emergent capacity because the model is not exposed to the higher-level concepts associated with each group. Instead, the training forces the model to create for each image-caption group a semantic representation that brings it closer to the semantic representation of the higher-level concepts in the latent semantic space. Our experiments show that this training methodology results in a model which shows improvement in abstract concept recognition compared to SOTA models.


[87] ConvergeWriter: Data-Driven Bottom-Up Article Construction cs.CLPDF

Binquan Ji, Jiaqi Wang, Ruiting Li, Xingchen Han, Yiyang Qi

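TL;DR: The paper proposes ConvergeWriter, a “bottom-up,” data-driven framework that uses a “Retrieval-First for Knowledge, Clustering for Structure” strategy to address the content fragmentation and factual inaccuracies of existing “top-down” methods when generating long-form, factual documents.

Details

Motivation: When large language models (LLMs) generate long-form, factual documents, “top-down” methods often leave the generated content disconnected from the knowledge base. The authors use a data-driven approach so that generated content stays faithful to the source material.

Result: Experiments on 14B and 32B parameter models show performance comparable to or exceeding existing baselines. The method is expected to have particular advantages in knowledge-constrained scenarios that demand high fidelity and structural coherence.

Insight: Constraining generation with a data-driven process can markedly reduce hallucination risk, offering a reliable long-document generation paradigm for high-stakes, knowledge-intensive domains.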

Abstract: Large Language Models (LLMs) have shown remarkable prowess in text generation, yet producing long-form, factual documents grounded in extensive external knowledge bases remains a significant challenge. Existing “top-down” methods, which first generate a hypothesis or outline and then retrieve evidence, often suffer from a disconnect between the model’s plan and the available knowledge, leading to content fragmentation and factual inaccuracies. To address these limitations, we propose a novel “bottom-up,” data-driven framework that inverts the conventional generation pipeline. Our approach is predicated on a “Retrieval-First for Knowledge, Clustering for Structure” strategy, which first establishes the “knowledge boundaries” of the source corpus before any generative planning occurs. Specifically, we perform exhaustive iterative retrieval from the knowledge base and then employ an unsupervised clustering algorithm to organize the retrieved documents into distinct “knowledge clusters.” These clusters form an objective, data-driven foundation that directly guides the subsequent generation of a hierarchical outline and the final document content. This bottom-up process ensures that the generated text is strictly constrained by and fully traceable to the source material, proactively adapting to the finite scope of the knowledge base and fundamentally mitigating the risk of hallucination. Experimental results on both 14B and 32B parameter models demonstrate that our method achieves performance comparable to or exceeding state-of-the-art baselines, and is expected to demonstrate unique advantages in knowledge-constrained scenarios that demand high fidelity and structural coherence. Our work presents an effective paradigm for generating reliable, structured, long-form documents, paving the way for more robust LLM applications in high-stakes, knowledge-intensive domains.


[88] Data Augmentation for Maltese NLP using Transliterated and Machine Translated Arabic Data cs.CLPDF

Kurt Micallef, Nizar Habash, Claudia Borg

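TL;DR: The paper explores how Arabic-language resources can support Maltese natural language processing (NLP) through cross-lingual data augmentation, including several transliteration schemes and machine translation approaches, and shows that such augmentation significantly benefits Maltese NLP tasks.

Details

Motivation: Maltese is a unique Semitic language that has been heavily influenced by Romance and Germanic languages, particularly Italian and English. Despite its Semitic roots, its orthography is based on the Latin script, which sets it apart from its closest Arabic relatives. The authors explore whether Arabic resources can support Maltese NLP through data augmentation.

Result: Arabic-based data augmentation can significantly improve performance on Maltese NLP tasks, with positive effects on both monolingual and multilingual models.

Insight: Although Maltese and Arabic differ in script, suitable transliteration and augmentation methods allow Arabic resources to compensate for scarce Maltese data and improve NLP performance. This offers a new approach to data augmentation for low-resource languages.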

Abstract: Maltese is a unique Semitic language that has evolved under extensive influence from Romance and Germanic languages, particularly Italian and English. Despite its Semitic roots, its orthography is based on the Latin script, creating a gap between it and its closest linguistic relatives in Arabic. In this paper, we explore whether Arabic-language resources can support Maltese natural language processing (NLP) through cross-lingual augmentation techniques. We investigate multiple strategies for aligning Arabic textual data with Maltese, including various transliteration schemes and machine translation (MT) approaches. As part of this, we also introduce novel transliteration systems that better represent Maltese orthography. We evaluate the impact of these augmentations on monolingual and multilingual models and demonstrate that Arabic-based augmentation can significantly benefit Maltese NLP tasks.


[89] Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents cs.CL | cs.MMPDF

Fuyu Xing, Zimu Wang, Wei Wang, Haiyang Zhang

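TL;DR: The paper presents the first systematic evaluation of large vision-language models (LVLMs), including DeepSeek-VL2 and the Qwen-VL series, on multimedia event extraction (M2E2). It reveals performance differences between few-shot prompting and fine-tuning settings and identifies directions for improvement.

Details

Motivation: As multimedia content grows rapidly, multimedia event extraction (M2E2) is increasingly important. LVLMs perform well on multimodal tasks, but their use for M2E2 remains underexplored.

Result: (1) Few-shot LVLMs do better on visual tasks but worse on textual tasks; (2) LoRA fine-tuning substantially improves performance; (3) combining modalities works better still. Semantic precision, localization, and cross-modal grounding remain challenges.

Insight: LVLMs have potential for multimodal tasks, but their performance on textual tasks needs further improvement, and semantic alignment in cross-modal tasks remains to be solved.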

Abstract: The proliferation of multimedia content necessitates the development of effective Multimedia Event Extraction (M2E2) systems. Though Large Vision-Language Models (LVLMs) have shown strong cross-modal capabilities, their utility in the M2E2 task remains underexplored. In this paper, we present the first systematic evaluation of representative LVLMs, including DeepSeek-VL2 and the Qwen-VL series, on the M2E2 dataset. Our evaluations cover text-only, image-only, and cross-media subtasks, assessed under both few-shot prompting and fine-tuning settings. Our key findings highlight the following valuable insights: (1) Few-shot LVLMs perform notably better on visual tasks but struggle significantly with textual tasks; (2) Fine-tuning LVLMs with LoRA substantially enhances model performance; and (3) LVLMs exhibit strong synergy when combining modalities, achieving superior performance in cross-modal settings. We further provide a detailed error analysis to reveal persistent challenges in areas such as semantic precision, localization, and cross-modal grounding, which remain critical obstacles for advancing M2E2 capabilities.


[90] The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations cs.CL | cs.AIPDF

Yubo Zhu, Dongrui Liu, Zecheng Lin, Wei Tong, Sheng Zhong

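TL;DR: This paper proposes a method that estimates the difficulty of an input question from the hidden representations of a large language model (LLM), avoiding the computational cost and generality problems of existing approaches.

Details

Motivation: Existing methods rely on repeated sampling, auxiliary models, or fine-tuning the target model, which is computationally expensive and may compromise generality. The paper aims to estimate question difficulty directly from the LLM's hidden representations.

Result: Experiments show that the method outperforms existing baselines at difficulty estimation and, when used to guide adaptive reasoning strategies such as Self-Consistency, improves inference efficiency.

Insight: An LLM's hidden representations already encode question difficulty; using this information avoids redundant computation and enables efficient inference.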

Abstract: Estimating the difficulty of input questions as perceived by large language models (LLMs) is essential for accurate performance evaluation and adaptive inference. Existing methods typically rely on repeated response sampling, auxiliary models, or fine-tuning the target model itself, which may incur substantial computational costs or compromise generality. In this paper, we propose a novel approach for difficulty estimation that leverages only the hidden representations produced by the target LLM. We model the token-level generation process as a Markov chain and define a value function to estimate the expected output quality given any hidden state. This allows for efficient and accurate difficulty estimation based solely on the initial hidden state, without generating any output tokens. Extensive experiments across both textual and multimodal tasks demonstrate that our method consistently outperforms existing baselines in difficulty estimation. Moreover, we apply our difficulty estimates to guide adaptive reasoning strategies, including Self-Consistency, Best-of-N, and Self-Refine, achieving higher inference efficiency with fewer generated tokens.


[91] Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings cs.CL | cs.AIPDF

Shiyu Li, Yang Tang, Ruijie Liu, Shi-Zhe Chen, Xi Chen

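TL;DR: Conan-embedding-v2 is a 1.4B-parameter LLM trained from scratch to address the data and training gaps between LLMs and text-embedding models. It introduces a cross-lingual retrieval dataset, a soft-masking mechanism, and dynamic hard negative mining, and reaches SOTA performance.

Details

Motivation: LLMs perform well on text embedding tasks, but existing methods rely on fine-tuning (for example, with LoRA) and are limited by the data and training gaps between LLMs and embedding models.

Result: Achieves SOTA performance on MTEB and Chinese MTEB.

Insight: With expanded data and improved training mechanisms, a small LLM can also achieve high performance on embedding tasks.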

Abstract: Large language models (LLMs) have recently demonstrated excellent performance in text embedding tasks. Previous work usually uses LoRA to fine-tune existing LLMs, which are limited by the data and training gap between LLMs and embedding models. In this work, we introduce Conan-embedding-v2, a new 1.4B-parameter LLM trained from scratch and fine-tuned as a text embedder. First, we add news data and multilingual pairs for LLM pretraining to bridge the data gap. Based on this, we propose a cross-lingual retrieval dataset that enables the LLM to better integrate embeddings across different languages. Second, whereas LLMs use a causal mask with token-level loss, embedding models use a bidirectional mask with sentence-level loss. This training gap makes full fine-tuning less effective than LoRA. We introduce a soft-masking mechanism to gradually transition between these two types of masks, enabling the model to learn more comprehensive representations. Based on this, we propose a dynamic hard negative mining method that exposes the model to more difficult negative examples throughout the training process. Being intuitive and effective, with only approximately 1.4B parameters, Conan-embedding-v2 achieves SOTA performance on both the Massive Text Embedding Benchmark (MTEB) and Chinese MTEB (May 19, 2025).


[92] All Roads Lead to Rome: Graph-Based Confidence Estimation for Large Language Model Reasoning cs.CL | cs.AIPDF

Caiqi Zhang, Chang Shu, Ehsan Shareghi, Nigel Collier

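TL;DR: The paper proposes training-free, graph-based confidence estimation methods for large language model (LLM) reasoning tasks. Reasoning paths are modeled as directed graphs, and graph properties such as centrality, path convergence, and path weighting are used to improve confidence estimation.

Details

Motivation: Existing confidence estimation methods target factual QA tasks and generalize poorly to complex reasoning tasks, so a method better suited to reasoning is needed.

Result: Experiments with two LLMs on three reasoning datasets show improved confidence estimation and better performance on two downstream tasks.

Insight: Modeling reasoning paths through graph properties captures uncertainty in the reasoning process more effectively, making confidence estimation more robust.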

Abstract: Confidence estimation is essential for the reliable deployment of large language models (LLMs). Existing methods are primarily designed for factual QA tasks and often fail to generalize to reasoning tasks. To address this gap, we propose a set of training-free, graph-based confidence estimation methods tailored to reasoning tasks. Our approach models reasoning paths as directed graphs and estimates confidence by exploiting graph properties such as centrality, path convergence, and path weighting. Experiments with two LLMs on three reasoning datasets demonstrate improved confidence estimation and enhanced performance on two downstream tasks.
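
As a rough illustration of the graph view described in this abstract, the sketch below merges several sampled reasoning paths into one directed graph and scores each final answer by how many paths converge on it. The function names, the toy step labels, and the choice of "share of converging paths" as the score are all assumptions made for illustration; the paper's actual estimators are not specified here.

```python
from collections import Counter, defaultdict


def build_graph(paths):
    """Merge reasoning paths into a directed graph stored as edge counts."""
    edges = Counter()
    for path in paths:
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += 1
    return edges


def convergence_confidence(paths):
    """Share of sampled paths that terminate at each final answer node."""
    finals = Counter(path[-1] for path in paths)
    total = sum(finals.values())
    return {answer: count / total for answer, count in finals.items()}


def in_degree(edges):
    """Distinct predecessors per node, a very simple centrality proxy."""
    preds = defaultdict(set)
    for src, dst in edges:
        preds[dst].add(src)
    return {node: len(srcs) for node, srcs in preds.items()}


# Toy paths (hypothetical): each list is one sampled chain of reasoning steps.
paths = [
    ["q", "a=3", "b=6", "ans=9"],
    ["q", "b=6", "ans=9"],
    ["q", "a=3", "c=5", "ans=8"],
]
edges = build_graph(paths)
conf = convergence_confidence(paths)
deg = in_degree(edges)
```

Edge counts could serve as path weights in a fuller version; here they are only collected, not used in the score.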


[93] Automated Generation of Research Workflows from Academic Papers: A Full-text Mining Framework cs.CL | cs.DL | cs.IRPDF

Heng Zhang, Chengzhi Zhang

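TL;DR: The paper proposes an end-to-end framework that automatically generates structured research workflows by mining full-text academic papers, addressing the limitation that existing methods extract only fragmented procedural components.

Details

Motivation: Improving research reproducibility and advancing the “AI for Science” paradigm requires automatically generating complete research workflows, whereas existing methods usually extract only fragmented procedural information.

Result: Paragraph identification reaches an F1-score of 0.9772; workflow generation yields ROUGE-1/2/L scores of 0.4543/0.2877/0.4427; classification precision is 0.958.

Insight: Research workflows in NLP have gradually shifted from feature engineering to ablation studies, with growing emphasis on data analysis. The method offers a technical framework for automated workflow generation and a new perspective for studying how scientific paradigms evolve.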

Abstract: The automated generation of research workflows is essential for improving the reproducibility of research and accelerating the paradigm of “AI for Science”. However, existing methods typically extract merely fragmented procedural components and thus fail to capture complete research workflows. To address this gap, we propose an end-to-end framework that generates comprehensive, structured research workflows by mining full-text academic papers. As a case study in the Natural Language Processing (NLP) domain, our paragraph-centric approach first employs Positive-Unlabeled (PU) Learning with SciBERT to identify workflow-descriptive paragraphs, achieving an F1-score of 0.9772. Subsequently, we utilize Flan-T5 with prompt learning to generate workflow phrases from these paragraphs, yielding ROUGE-1, ROUGE-2, and ROUGE-L scores of 0.4543, 0.2877, and 0.4427, respectively. These phrases are then systematically categorized into data preparation, data processing, and data analysis stages using ChatGPT with few-shot learning, achieving a classification precision of 0.958. By mapping categorized phrases to their locations in the documents, we finally generate readable visual flowcharts of the entire research workflows. This approach facilitates the analysis of workflows derived from an NLP corpus and reveals key methodological shifts over the past two decades, including the increasing emphasis on data analysis and the transition from feature engineering to ablation studies. Our work offers a validated technical framework for automated workflow generation, along with a novel, process-oriented perspective for the empirical investigation of evolving scientific paradigms. Source code and data are available at: https://github.com/ZH-heng/research_workflow.


[94] Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models cs.CL | cs.AIPDF

Yuval Weiss, David Demitri Africa, Paula Buttery, Richard Diehl Martinez

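TL;DR: This paper studies the learning dynamics and performance of ReLoRA in small language models (SLMs) and finds that it generally performs worse than standard training on loss, Paloma perplexity, and BLiMP, with the gap widening for larger models.

Details

Motivation: Parameter-efficient methods such as LoRA work well for fine-tuning large language models (LLMs), but their use in pretraining (as in ReLoRA), and especially their effect on SLMs, is not well understood. SLMs have lower computational and environmental costs, so studying their learning dynamics and performance matters.

Result: ReLoRA underperforms standard training in SLMs, and the gap is larger for the larger models. Analysis of the learning dynamics indicates that ReLoRA reinforces the rank deficiencies found in smaller models.

Insight: Low-rank update strategies such as ReLoRA may not transfer easily to SLM pretraining, which points to the need for more research in the low-compute regime.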

Abstract: Parameter-efficient methods such as LoRA have revolutionised the fine-tuning of LLMs. Still, their extension to pretraining via ReLoRA is less well understood, especially for small language models (SLMs), which offer lower computational and environmental costs. This work is the first systematic study of ReLoRA in SLMs (11M-66M parameters), evaluating both performance and learning dynamics. Through ablation experiments, we find that ReLoRA generally performs worse than standard training on loss, Paloma perplexity and BLiMP, with the gap widening for the larger models. Further analysis of the learning dynamics of the models indicates that ReLoRA reinforces the rank deficiencies found in smaller models. These results indicate that low-rank update strategies may not transfer easily to SLM pretraining, highlighting the need for more research in the low-compute regime.


[95] SitLLM: Large Language Models for Sitting Posture Health Understanding via Pressure Sensor Data cs.CLPDF

Jian Gao, Fufangchen Zhao, Yiyang Zhang, Danfeng Yan

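TL;DR: SitLLM is a lightweight multimodal framework that combines pressure sensing with large language models (LLMs) to enable fine-grained sitting posture understanding and personalized health feedback.

Details

Motivation: Existing sitting posture monitoring systems recognize posture only coarsely and lack the semantic expressiveness needed for personalized feedback; SitLLM aims to address this.

Result: Enables fine-grained posture understanding and personalized health feedback.

Insight: A multimodal approach combining pressure sensors with LLMs can improve the semantic expressiveness of sitting posture monitoring.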

Abstract: Poor sitting posture is a critical yet often overlooked factor contributing to long-term musculoskeletal disorders and physiological dysfunctions. Existing sitting posture monitoring systems, although leveraging visual, IMU, or pressure-based modalities, often suffer from coarse-grained recognition and lack the semantic expressiveness necessary for personalized feedback. In this paper, we propose SitLLM, a lightweight multimodal framework that integrates flexible pressure sensing with large language models (LLMs) to enable fine-grained posture understanding and personalized health-oriented response generation. SitLLM comprises three key components: (1) a Gaussian-Robust Sensor Embedding Module that partitions pressure maps into spatial patches and injects local noise perturbations for robust feature extraction; (2) a Prompt-Driven Cross-Modal Alignment Module that reprograms sensor embeddings into the LLM’s semantic space via multi-head cross-attention using the pre-trained vocabulary embeddings; and (3) a Multi-Context Prompt Module that fuses feature-level, structure-level, statistical-level, and semantic-level contextual information to guide instruction comprehension.


[96] Multi-Model Synthetic Training for Mission-Critical Small Language Models cs.CL | cs.AI | cs.LG | 68T50 68T50 | I.2.7; I.2.6PDF

Nolan Platt, Pragyansmita Nayak

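TL;DR: The paper uses large language models (LLMs) to generate synthetic data for training a small model, substantially reducing cost for maritime-domain tasks while maintaining relatively high accuracy.

Details

Motivation: Applying LLMs to specialized fields is constrained by the scarcity and complexity of domain-specific data, and using large models directly for inference is expensive.

Result: The fine-tuned small model is cheaper than using a large model directly for inference and reaches 75% accuracy on maritime tasks.

Insight: Small models trained on synthetic data can approach the performance of expensive large models in specialized domains, offering a practical option where manual annotation is infeasible.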

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across many domains, yet their application to specialized fields remains constrained by the scarcity and complexity of domain-specific training data. We present a novel approach that achieves a 261x cost reduction for maritime intelligence by using LLMs as one-time teachers rather than using them directly for inference. Our method transforms 3.2 billion Automatic Identification System (AIS) vessel tracking records into 21,543 synthetic question and answer pairs through multi-model generation (GPT-4o and o3-mini), preventing overfitting and ensuring accurate reasoning. The resulting fine-tuned Qwen2.5-7B model achieves 75% accuracy on maritime tasks, while being substantially cheaper than using a larger model for inference. We show that smaller, cheaper models, when fine-tuned properly, can provide similar accuracy compared to larger models that are prohibitively expensive. Our work contributes to the growing field of synthetic dataset generation for specialized AI applications and presents a highly reproducible framework for domains where manual annotation is infeasible. Beyond expanding research in the growing field of specialized small language models, our approach has immediate applications in maritime safety, security operations, and vessel traffic management systems in various industries.


[97] Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO cs.CL | cs.AIPDF

Francesco Pappone, Ruggero Marino Lazzaroni, Federico Califano, Niccolò Gentile, Roberto Marras

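TL;DR: The paper proposes a semantic reward model based on an encoder-only transformer, used within the GRPO framework to produce high-quality explanations that better match expert reasoning.

Details

Motivation: Although large language models (LLMs) can generate human-like text, aligning their outputs with complex qualitative goals such as pedagogical soundness remains challenging. Keyword-based metrics such as ROUGE miss semantic quality, and LLM-as-a-judge evaluation is slow and expensive.

Result: Compared with a strong SFT baseline, the proposed semantic reward significantly improves the faithfulness and clarity of generated explanations.

Insight: Lightweight encoder models can provide nuanced reward signals for complex generation tasks, without relying on expensive LLM evaluation or keyword-based metrics.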

Abstract: While Large Language Models (LLMs) excel at generating human-like text, aligning their outputs with complex, qualitative goals like pedagogical soundness remains a significant challenge. Standard reinforcement learning techniques often rely on slow and expensive LLM-as-a-judge evaluations or on brittle, keyword-based metrics like ROUGE, which fail to capture the semantic essence of a high-quality explanation. In this work, we introduce a novel approach to reward shaping within the Group Relative Policy Optimisation (GRPO) framework. Our central contribution is the use of a small, efficient encoder-only transformer as a semantic reward model. This model provides a dense, semantically rich reward signal based on the cosine similarity between a generated explanation and a ground-truth reference, guiding the policy towards explanations that are not just factually correct but also structurally and conceptually aligned with expert reasoning. We apply this method to the task of training a model for the Italian medical-school entrance examinations, following standard domain-adaptive continued pre-training (CPT) and supervised fine-tuning (SFT). Our results demonstrate that GRPO with our proposed semantic reward significantly improves explanation faithfulness and clarity over a strong SFT baseline, showcasing the power of using lightweight encoder models for nuanced reward shaping in complex generation tasks.
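
The reward described in this abstract is a cosine similarity between the embedding of a generated explanation and that of a reference. A minimal sketch follows, assuming the embeddings are already available as plain lists of floats (the encoder itself is not shown, and the vectors below are made up). The group-relative normalization is the commonly described GRPO form, reward minus group mean divided by group standard deviation; whether the authors use exactly this variant is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_reward(generated_emb, reference_emb):
    """Reward = similarity between generated and reference embeddings."""
    return cosine_similarity(generated_emb, reference_emb)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (mean 0, unit scale)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical embeddings: one reference and a group of three generations.
reference = [1.0, 0.0, 1.0]
group = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
rewards = [semantic_reward(g, reference) for g in group]
advantages = group_relative_advantages(rewards)
```

Generations closer to the reference receive positive advantages and those further away receive negative ones, which is the signal a policy-gradient update would then use.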


[98] Empowering LLMs with Parameterized Skills for Adversarial Long-Horizon Planning cs.CLPDF

Sijia Cui, Shuai Xu, Aiyao He, Yanna Wang, Bo Xu

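TL;DR: The paper proposes PLAP (Plan with Language, Act with Parameter), a planning framework that combines language models with parameterized skills to improve LLM planning in adversarial long-horizon environments.

Details

Motivation: Existing methods either struggle to generate reliable low-level actions in long-horizon environments or rely heavily on expert experience to translate high-level tasks into action sequences. PLAP aims to address both problems.

Result: GPT-4o-driven PLAP in a zero-shot setting outperforms 80% of baseline agents, and Qwen2-72B-driven PLAP with few-shot examples surpasses the top-tier scripted agent CoacAI.

Insight: A parameterized skill library bridges the gap between low-level action generation and high-level task decomposition, improving planning ability.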

Abstract: Recent advancements in Large Language Models (LLMs) have led to the development of LLM-based AI agents. A key challenge is the creation of agents that can effectively ground themselves in complex, adversarial long-horizon environments. Existing methods mainly focus on (1) using LLMs as policies to interact with the environment through generating low-level feasible actions, and (2) utilizing LLMs to generate high-level tasks or language guides to stimulate action generation. However, the former struggles to generate reliable actions, while the latter relies heavily on expert experience to translate high-level tasks into specific action sequences. To address these challenges, we introduce the Plan with Language, Act with Parameter (PLAP) planning framework that facilitates the grounding of LLM-based agents in long-horizon environments. The PLAP method comprises three key components: (1) a skill library containing environment-specific parameterized skills, (2) a skill planner powered by LLMs, and (3) a skill executor converting the parameterized skills into executable action sequences. We implement PLAP in MicroRTS, a long-horizon real-time strategy game that provides an unfamiliar and challenging environment for LLMs. The experimental results demonstrate the effectiveness of PLAP. In particular, GPT-4o-driven PLAP in a zero-shot setting outperforms 80% of baseline agents, and Qwen2-72B-driven PLAP, with carefully crafted few-shot examples, surpasses the top-tier scripted agent, CoacAI. Additionally, we design comprehensive evaluation metrics and test 6 closed-source and 2 open-source LLMs within the PLAP framework, ultimately releasing an LLM leaderboard ranking long-horizon skill planning ability. Our code is available at https://github.com/AI-Research-TeamX/PLAP.


[99] LLM Hallucination Detection: A Fast Fourier Transform Method Based on Hidden Layer Temporal Signals cs.CLPDF

Jinxin Li, Gang Tu, ShengYu Cheng, Junjie Hu, Jinting Wang

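TL;DR: The paper proposes HSAD, a hallucination detection method that applies the Fast Fourier Transform (FFT) to temporal signals built from the hidden layers of large language models (LLMs), and reports substantial gains over existing methods.

Details

Motivation: Hallucination limits the use of LLMs in reliability-sensitive settings. Existing methods are constrained by external knowledge coverage (factuality checking) or fail to capture deviations in reasoning dynamics (static hidden-state analysis), so their effectiveness and robustness are limited.

Result: Across benchmarks including TruthfulQA, HSAD improves on prior methods by more than 10 percentage points.

Insight: Combining reasoning-process modeling with frequency-domain analysis opens a new direction for LLM hallucination detection and highlights the importance of temporal dynamics.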

Abstract: Hallucination remains a critical barrier for deploying large language models (LLMs) in reliability-sensitive applications. Existing detection methods largely fall into two categories: factuality checking, which is fundamentally constrained by external knowledge coverage, and static hidden-state analysis, which fails to capture deviations in reasoning dynamics. As a result, their effectiveness and robustness remain limited. We propose HSAD (Hidden Signal Analysis-based Detection), a novel hallucination detection framework that models the temporal dynamics of hidden representations during autoregressive generation. HSAD constructs hidden-layer signals by sampling activations across layers, applies Fast Fourier Transform (FFT) to obtain frequency-domain representations, and extracts the strongest non-DC frequency component as spectral features. Furthermore, by leveraging the autoregressive nature of LLMs, HSAD identifies optimal observation points for effective and reliable detection. Across multiple benchmarks, including TruthfulQA, HSAD achieves over 10 percentage points improvement compared to prior state-of-the-art methods. By integrating reasoning-process modeling with frequency-domain analysis, HSAD establishes a new paradigm for robust hallucination detection in LLMs.
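
To make the "strongest non-DC frequency component" step concrete, here is a small sketch on a synthetic one-dimensional signal. It uses a naive discrete Fourier transform written with the standard library rather than an FFT routine; the resulting spectrum is the same, only slower to compute. The signal is a toy stand-in for a sequence of sampled activations, and the function names are illustrative, not taken from the paper.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of DFT bins 0..n//2 for a real-valued signal."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        mags.append(abs(coeff))
    return mags


def strongest_non_dc(signal):
    """Return (bin index, magnitude) of the largest bin, skipping bin 0 (DC)."""
    mags = dft_magnitudes(signal)
    k = max(range(1, len(mags)), key=lambda i: mags[i])
    return k, mags[k]


# Toy signal: constant offset 2 plus one cosine at 3 cycles per 16 samples.
signal = [2 + math.cos(2 * math.pi * 3 * t / 16) for t in range(16)]
peak_bin, peak_mag = strongest_non_dc(signal)
```

Skipping bin 0 matters because the constant offset dominates the spectrum; without that step the "strongest component" would simply report the mean level of the signal.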


[100] The Few-shot Dilemma: Over-prompting Large Language Models cs.CLPDF

Yongjian Tang, Doruk Tuncel, Christian Koerner, Thomas Runkler

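TL;DR: The paper examines over-prompting, where excessive examples in a prompt degrade the performance of large language models (LLMs), outlines a few-shot prompting framework, and shows experimentally that finding the optimal number of examples matters.

Details

Motivation: Conventional wisdom holds that adding relevant few-shot examples improves LLM performance, but in some LLMs too many examples reduce it, a phenomenon the paper calls the "few-shot dilemma". The study aims to quantify and address this problem.

Result: Experiments show that too many domain-specific examples degrade performance in certain LLMs; by optimizing the number of examples, the method improves performance on requirement classification.

Insight: The quantity and quality of few-shot examples must be balanced carefully, since over-prompting is counterproductive; better prompt design can noticeably improve LLMs in practical use.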

Abstract: Over-prompting, a phenomenon where excessive examples in prompts lead to diminished performance in Large Language Models (LLMs), challenges the conventional wisdom about in-context few-shot learning. To investigate this few-shot dilemma, we outline a prompting framework that leverages three standard few-shot selection methods - random sampling, semantic embedding, and TF-IDF vectors - and evaluate these methods across multiple LLMs, including GPT-4o, GPT-3.5-turbo, DeepSeek-V3, Gemma-3, LLaMA-3.1, LLaMA-3.2, and Mistral. Our experimental results reveal that incorporating excessive domain-specific examples into prompts can paradoxically degrade performance in certain LLMs, which contradicts the prior empirical conclusion that more relevant few-shot examples universally benefit LLMs. Given the trend of LLM-assisted software engineering and requirement analysis, we experiment with two real-world software requirement classification datasets. By gradually increasing the number of TF-IDF-selected and stratified few-shot examples, we identify their optimal quantity for each LLM. This combined approach achieves superior performance with fewer examples, avoiding the over-prompting problem, thus surpassing the state-of-the-art by 1% in classifying functional and non-functional requirements.
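
One of the selection methods named in this abstract ranks candidate examples by TF-IDF similarity to the query. The sketch below shows that idea with a hand-rolled TF-IDF and cosine similarity, so that the cap `k` on the number of selected examples is explicit. The whitespace tokenizer, the smoothed IDF formula, and the sample requirement sentences are assumptions for illustration, not details from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) with a smoothed IDF term."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n) / (1 + c)) + 1 for tok, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (c / len(toks)) * idf[t] for t, c in tf.items()})
    return vectors


def cosine(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_examples(query, pool, k):
    """Return the k pool items most similar to the query."""
    vectors = tfidf_vectors(pool + [query])
    query_vec = vectors[-1]
    ranked = sorted(
        range(len(pool)),
        key=lambda i: cosine(query_vec, vectors[i]),
        reverse=True,
    )
    return [pool[i] for i in ranked[:k]]


# Hypothetical requirement statements used as the few-shot pool.
pool = [
    "the system shall encrypt user passwords",
    "the app must load within two seconds",
    "users can export reports as pdf",
]
query = "the system shall encrypt stored data"
top_two = select_examples(query, pool, k=2)
```

Sweeping `k` upward and measuring task accuracy at each value is the kind of experiment that would reveal where additional examples stop helping.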


[101] Evaluating LLM Alignment on Personality Inference from Real-World Interview Data cs.CLPDF

Jianfeng Zhu, Julina Maharjan, Xinyu Li, Karin G. Coifman, Ruoming Jin

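TL;DR: This paper introduces a benchmark for evaluating how well large language models (LLMs) infer personality traits from real interview data. Experiments show limited performance on continuous trait assessment, with all correlations below 0.26, and chain-of-thought prompting brings only minimal gains, underscoring the difficulty of aligning LLMs with complex human attributes.

Details

Motivation: LLMs are increasingly used in settings that require psychological understanding, such as counseling, but their ability to infer personality traits from real conversations is under-studied. Prior work has mostly relied on simulated personas and has not examined continuous personality assessment.

Result: Pearson correlations between LLM-predicted traits and ground-truth scores all remain below 0.26, and chain-of-thought prompting offers only minimal improvement over zero-shot. This suggests that personality inference relies more on latent semantic representation than on explicit reasoning.

Insight: Aligning LLMs with complex human attributes remains challenging; future work should consider trait-specific prompting, context-aware modeling, and alignment-oriented fine-tuning.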

Abstract: Large Language Models (LLMs) are increasingly deployed in roles requiring nuanced psychological understanding, such as emotional support agents, counselors, and decision-making assistants. However, their ability to interpret human personality traits, a critical aspect of such applications, remains unexplored, particularly in ecologically valid conversational settings. While prior work has simulated LLM “personas” using discrete Big Five labels on social media data, the alignment of LLMs with continuous, ground-truth personality assessments derived from natural interactions is largely unexamined. To address this gap, we introduce a novel benchmark comprising semi-structured interview transcripts paired with validated continuous Big Five trait scores. Using this dataset, we systematically evaluate LLM performance across three paradigms: (1) zero-shot and chain-of-thought prompting with GPT-4.1 Mini, (2) LoRA-based fine-tuning applied to both RoBERTa and Meta-LLaMA architectures, and (3) regression using static embeddings from pretrained BERT and OpenAI’s text-embedding-3-small. Our results reveal that all Pearson correlations between model predictions and ground-truth personality traits remain below 0.26, highlighting the limited alignment of current LLMs with validated psychological constructs. Chain-of-thought prompting offers minimal gains over zero-shot, suggesting that personality inference relies more on latent semantic representation than explicit reasoning. These findings underscore the challenges of aligning LLMs with complex human attributes and motivate future work on trait-specific prompting, context-aware modeling, and alignment-oriented fine-tuning.
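
The headline metric in this abstract is the Pearson correlation between predicted and ground-truth trait scores. For readers who want the computation spelled out, a minimal standard-library sketch follows; the score lists are invented for illustration and are not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical trait scores for four participants.
predicted = [1.0, 2.0, 3.0, 4.0]
ground_truth = [2.0, 1.0, 4.0, 3.0]
r = pearson_r(predicted, ground_truth)
```

A value near 1 means predictions rise and fall with the true scores; values below 0.26, as reported in the abstract, indicate only a weak linear relationship.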


[102] ChartGaze: Enhancing Chart Understanding in LVLMs with Eye-Tracking Guided Attention Refinement cs.CL | cs.CV | cs.LGPDF

Ali Salamatian, Amirhossein Abaskohi, Wan-Cyuan Fan, Mir Rayat Imtiaz Hossain, Leonid Sigal

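TL;DR: ChartGaze uses eye-tracking data to improve attention alignment in LVLMs for chart question answering, refining attention to raise both accuracy and interpretability.

Details

Motivation: Existing LVLMs perform poorly on chart question answering when attention spreads to irrelevant regions, diverging from human gaze. The study uses human gaze data to refine model attention.

Result: The method improves accuracy by up to 2.56 percentage points across multiple models and also improves attention alignment.

Insight: Human gaze data can effectively guide attention refinement, improving performance and interpretability in chart understanding.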

Abstract: Charts are a crucial visual medium for communicating and representing information. While Large Vision-Language Models (LVLMs) have made progress on chart question answering (CQA), the task remains challenging, particularly when models attend to irrelevant regions of the chart. In this work, we present ChartGaze, a new eye-tracking dataset that captures human gaze patterns during chart reasoning tasks. Through a systematic comparison of human and model attention, we find that LVLMs often diverge from human gaze, leading to reduced interpretability and accuracy. To address this, we propose a gaze-guided attention refinement that aligns image-text attention with human fixations. Our approach improves both answer accuracy and attention alignment, yielding gains of up to 2.56 percentage points across multiple models. These results demonstrate the promise of incorporating human gaze to enhance both the reasoning quality and interpretability of chart-focused LVLMs.


[103] WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents cs.CLPDF

Zile Qiao, Guoxin Chen, Xuanzhong Chen, Donglei Yu, Wenbiao Yin

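TL;DR: WebResearcher proposes a framework built on an iterative deep-research paradigm (WebResearcher) and a scalable data synthesis engine (WebFrontier). It addresses the context suffocation and noise contamination of mono-contextual approaches, improves tool-use capabilities, and achieves state-of-the-art performance on 6 benchmarks.

Details

Motivation: When autonomously discovering and synthesizing external knowledge, current AI agents face context suffocation and noise contamination, which limit their performance on long-horizon reasoning tasks.

Result: Achieves state-of-the-art performance on 6 benchmarks, even surpassing frontier proprietary systems; its training data also significantly improves the tool-use capabilities of mono-contextual methods.

Insight: 1. Iterative and parallel strategies can overcome context limits in long-horizon reasoning; 2. High-quality synthetic data is key to improving tool-use capabilities.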

Abstract: Recent advances in deep-research systems have demonstrated the potential for AI agents to autonomously discover and synthesize knowledge from external sources. In this paper, we introduce WebResearcher, a novel framework for building such agents through two key components: (1) WebResearcher, an iterative deep-research paradigm that reformulates deep research as a Markov Decision Process, where agents periodically consolidate findings into evolving reports while maintaining focused workspaces, overcoming the context suffocation and noise contamination that plague existing mono-contextual approaches; and (2) WebFrontier, a scalable data synthesis engine that generates high-quality training data through tool-augmented complexity escalation, enabling systematic creation of research tasks that bridge the gap between passive knowledge recall and active knowledge construction. Notably, we find that the training data from our paradigm significantly enhances tool-use capabilities even for traditional mono-contextual methods. Furthermore, our paradigm naturally scales through parallel thinking, enabling concurrent multi-agent exploration for more comprehensive conclusions. Extensive experiments across 6 challenging benchmarks demonstrate that WebResearcher achieves state-of-the-art performance, even surpassing frontier proprietary systems.


[104] Scaling Agents via Continual Pre-training cs.CLPDF

Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang

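TL;DR: The paper proposes Agentic CPT, a continual pre-training method for building strong agentic foundation models, and validates its performance advantages experimentally.

Details

Motivation: Existing large language models underperform on agentic tasks mainly because robust agentic foundation models are missing: during post-training, models must learn diverse agentic behaviors while aligning them to expert demonstrations, which creates optimization tensions.

Result: AgentFounder-30B achieves state-of-the-art performance on multiple benchmarks while retaining strong tool-use ability.

Insight: Continual pre-training can ease the optimization tensions in agentic training and improve model performance and generalization.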

Abstract: Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agents training pipeline to build powerful agentic foundational models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retaining strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.


[105] Towards General Agentic Intelligence via Environment Scaling cs.CLPDF

Runnan Fang, Shihao Cai, Baixuan Li, Jialong Wu, Guangyu Li

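TL;DR: The paper pursues general agentic intelligence through environment scaling: it designs a framework that automatically constructs diverse simulated environments and uses a two-phase fine-tuning strategy to improve agents' function-calling ability.

Details

Motivation: For large language models to call diverse APIs reliably in real applications, agents need to develop precise, robust function-calling ability through interaction with environments. Environment diversity is key, but scaling environments in a principled way and training agents effectively are the main challenges.

Result: Experiments show that AgentScaler significantly improves function-calling capability across multiple agentic benchmarks.

Insight: Environment diversity is critical to function-calling ability, and systematically scaling environments combined with staged training improves agent generality.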

Abstract: Advanced agentic intelligence is a prerequisite for deploying Large Language Models in practical, real-world applications. Diverse real-world APIs demand precise, robust function-calling intelligence, which requires agents to develop these capabilities through interaction in varied environments. The breadth of function-calling competence is closely tied to the diversity of environments in which agents are trained. In this work, we scale up environments as a step towards advancing general agentic intelligence. This gives rise to two central challenges: (i) how to scale environments in a principled manner, and (ii) how to effectively train agentic capabilities from experiences derived through interactions with these environments. To address these, we design a scalable framework that automatically constructs heterogeneous environments that are fully simulated, systematically broadening the space of function-calling scenarios. We further adapt a two-phase agent fine-tuning strategy: first endowing agents with fundamental agentic capabilities, then specializing them for domain-specific contexts. Extensive experiments on agentic benchmarks, tau-bench, tau2-Bench, and ACEBench, demonstrate that our trained model, AgentScaler, significantly enhances the function-calling capability of models.


[106] ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization cs.CLPDF

Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou

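TL;DR: ReSum proposes a paradigm that unlocks long-horizon search through context summarization, addressing the problem that LLM agents exhaust their context before completing complex search tasks.

Details

Motivation: Agents based on large language models (LLMs) perform well on knowledge-intensive tasks, but for queries involving multiple entities, intertwined relationships, and high uncertainty, the context window becomes the main obstacle.

Result: Across three benchmarks, ReSum delivers an average absolute improvement of 4.5% over ReAct, with further gains of up to 8.2% after ReSum-GRPO training; WebResummer-30B performs strongly with only 1K training samples.

Insight: Context summarization effectively mitigates context limits for LLM agents in long-horizon search, and appropriate training improves performance further.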

Abstract: Large Language Model (LLM)-based web agents demonstrate strong performance on knowledge-intensive tasks but are hindered by context window limitations in paradigms like ReAct. Complex queries involving multiple entities, intertwined relationships, and high uncertainty demand extensive search cycles that rapidly exhaust context budgets before reaching complete solutions. To overcome this challenge, we introduce ReSum, a novel paradigm that enables indefinite exploration through periodic context summarization. ReSum converts growing interaction histories into compact reasoning states, maintaining awareness of prior discoveries while bypassing context constraints. For paradigm adaptation, we propose ReSum-GRPO, integrating GRPO with segmented trajectory training and advantage broadcasting to familiarize agents with summary-conditioned reasoning. Extensive experiments on web agents of varying scales across three benchmarks demonstrate that ReSum delivers an average absolute improvement of 4.5% over ReAct, with further gains of up to 8.2% following ReSum-GRPO training. Notably, with only 1K training samples, our WebResummer-30B (a ReSum-GRPO-trained version of WebSailor-30B) achieves 33.3% Pass@1 on BrowseComp-zh and 18.3% on BrowseComp-en, surpassing existing open-source web agents.
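
The control flow implied by "periodic context summarization" can be sketched as an agent loop that compresses its history whenever a budget is exceeded. Everything below is a hypothetical scaffold: `step` and `summarize` stand in for model calls, the budget is counted in characters rather than tokens, and the restart-from-summary policy is one plausible reading of the abstract, not the authors' implementation.

```python
def run_with_resum(question, step, summarize, budget, max_steps):
    """Agent loop that replaces a long history with a compact summary.

    step(history) -> (observation, answer); answer is None until finished.
    summarize(history) -> short string capturing prior discoveries.
    """
    history = [question]
    for _ in range(max_steps):
        observation, answer = step(history)
        if answer is not None:
            return answer, history
        history.append(observation)
        # Character count is a crude proxy for the token budget.
        if sum(len(item) for item in history) > budget:
            history = [question, summarize(history)]
    return None, history


# Toy stand-ins for the model calls (hypothetical behavior).
def toy_step(history):
    if any(item.startswith("SUMMARY") for item in history):
        return "", "done"
    return "x" * 10, None


def toy_summarize(history):
    return "SUMMARY of %d items" % len(history)


answer, final_history = run_with_resum(
    "q?", toy_step, toy_summarize, budget=25, max_steps=10
)
```

Because the question is always kept alongside the summary, the agent never loses the task definition even after many compression rounds.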


eess.IV [Back]

[107] Enhancing Radiographic Disease Detection with MetaCheX, a Context-Aware Multimodal Model eess.IV | cs.CV | cs.LGPDF

Nathan He, Cody Chen

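TL;DR: MetaCheX integrates chest X-ray images with patient metadata, significantly improving the accuracy and fairness of disease detection.

Details

Motivation: Existing deep learning models neglect patient metadata, limiting diagnostic accuracy and fairness.

Result: Outperforms radiograph-only models on the CheXpert Plus dataset, with a significant increase in AUROC.

Insight: Metadata improves model generalizability and reduces bias, bringing the model closer to clinical decision-making.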

Abstract: Existing deep learning models for chest radiology often neglect patient metadata, limiting diagnostic accuracy and fairness. To bridge this gap, we introduce MetaCheX, a novel multimodal framework that integrates chest X-ray images with structured patient metadata to replicate clinical decision-making. Our approach combines a convolutional neural network (CNN) backbone with metadata processed by a multilayer perceptron through a shared classifier. Evaluated on the CheXpert Plus dataset, MetaCheX consistently outperformed radiograph-only baseline models across multiple CNN architectures. By integrating metadata, the overall diagnostic accuracy was significantly improved, measured by an increase in AUROC. The results of this study demonstrate that metadata reduces algorithmic bias and enhances model generalizability across diverse patient populations. MetaCheX advances clinical artificial intelligence toward robust, context-aware radiographic disease detection.


[108] DinoAtten3D: Slice-Level Attention Aggregation of DinoV2 for 3D Brain MRI Anomaly Classification eess.IV | cs.AI | cs.CVPDF

Fazle Rafsani, Jay Shah, Catherine D. Chong, Todd J. Schwedt, Teresa Wu

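TL;DR: The paper proposes DinoAtten3D, an attention-based method for 3D medical image anomaly classification. It uses the pretrained DINOv2 model to extract features and a soft attention mechanism to assign adaptive weights to 2D axial slices. A composite loss combining supervised contrastive learning with class-variance regularization addresses data scarcity and class imbalance.

Details

Motivation: Anomaly detection and classification in medical imaging are critical for early diagnosis but remain challenging because of limited annotated data, class imbalance, and the high cost of expert labeling.

Result: Shows strong anomaly classification performance on the ADNI dataset and a headache cohort despite limited data and significant class imbalance.

Insight: Pretrained 2D foundation models combined with attention-based slice aggregation can make 3D medical image anomaly detection more robust.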

Abstract: Anomaly detection and classification in medical imaging are critical for early diagnosis but remain challenging due to limited annotated data, class imbalance, and the high cost of expert labeling. Emerging vision foundation models such as DINOv2, pretrained on extensive, unlabeled datasets, offer generalized representations that can potentially alleviate these limitations. In this study, we propose an attention-based global aggregation framework tailored specifically for 3D medical image anomaly classification. Leveraging the self-supervised DINOv2 model as a pretrained feature extractor, our method processes individual 2D axial slices of brain MRIs, assigning adaptive slice-level importance weights through a soft attention mechanism. To further address data scarcity, we employ a composite loss function combining supervised contrastive learning with class-variance regularization, enhancing inter-class separability and intra-class consistency. We validate our framework on the ADNI dataset and an institutional multi-class headache cohort, demonstrating strong anomaly classification performance despite limited data availability and significant class imbalance. Our results highlight the efficacy of utilizing pretrained 2D foundation models combined with attention-based slice aggregation for robust volumetric anomaly detection in medical imaging. Our implementation is publicly available at https://github.com/Rafsani/DinoAtten3D.git.
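
The slice-level soft attention described in this abstract amounts to scoring each slice feature, turning the scores into weights with a softmax, and taking the weighted sum as the volume-level feature. A minimal sketch with plain lists follows. The single scoring vector `attn_vector` and the tiny two-dimensional features are assumptions for illustration; the real model works on DINOv2 features and its scoring network is not specified here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_slices(slice_features, attn_vector):
    """Pool per-slice features into one vector using soft attention weights."""
    scores = [
        sum(f * w for f, w in zip(feat, attn_vector))
        for feat in slice_features
    ]
    weights = softmax(scores)
    dim = len(slice_features[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, slice_features))
        for d in range(dim)
    ]
    return pooled, weights


# Two toy slice features; a zero scoring vector gives uniform weights.
features = [[1.0, 0.0], [3.0, 2.0]]
pooled_uniform, weights_uniform = aggregate_slices(features, [0.0, 0.0])
pooled_focus, weights_focus = aggregate_slices(features, [1.0, 1.0])
```

With the zero scoring vector the result is plain mean pooling; a learned scoring vector shifts weight toward the slices that score higher, which is what lets informative slices dominate the volume-level feature.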


[109] DeepEyeNet: Generating Medical Report for Retinal Images eess.IV | cs.AI | cs.CV

Jia-Hong Huang

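TL;DR: The paper presents DeepEyeNet, an AI-driven system that automatically generates medical reports for retinal images, aiming to ease the shortage of ophthalmologists.

Details

Motivation: Retinal diseases are increasingly prevalent while ophthalmologists are in short supply, and traditional manual report generation is slow and error-prone. AI automation could substantially improve diagnostic efficiency and reduce doctors' workloads.

Result: The proposed methods achieve state-of-the-art performance across a range of evaluation metrics.

Insight: Automated medical report generation with AI has the potential to improve clinical efficiency, diagnostic accuracy, and patient care, but technical limitations and clinical trust still need to be addressed.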

Abstract: The increasing prevalence of retinal diseases poses a significant challenge to the healthcare system, as the demand for ophthalmologists surpasses the available workforce. This imbalance creates a bottleneck in diagnosis and treatment, potentially delaying critical care. Traditional methods of generating medical reports from retinal images rely on manual interpretation, which is time-consuming and prone to errors, further straining ophthalmologists’ limited resources. This thesis investigates the potential of Artificial Intelligence (AI) to automate medical report generation for retinal images. AI can quickly analyze large volumes of image data, identifying subtle patterns essential for accurate diagnosis. By automating this process, AI systems can greatly enhance the efficiency of retinal disease diagnosis, reducing doctors’ workloads and enabling them to focus on more complex cases. The proposed AI-based methods address key challenges in automated report generation: (1) A multi-modal deep learning approach captures interactions between textual keywords and retinal images, resulting in more comprehensive medical reports; (2) Improved methods for medical keyword representation enhance the system’s ability to capture nuances in medical terminology; (3) Strategies to overcome RNN-based models’ limitations, particularly in capturing long-range dependencies within medical descriptions; (4) Techniques to enhance the interpretability of the AI-based report generation system, fostering trust and acceptance in clinical practice. These methods are rigorously evaluated using various metrics and achieve state-of-the-art performance. This thesis demonstrates AI’s potential to revolutionize retinal disease diagnosis by automating medical report generation, ultimately improving clinical efficiency, diagnostic accuracy, and patient care.


[110] MEGAN: Mixture of Experts for Robust Uncertainty Estimation in Endoscopy Videos eess.IV | cs.AI | cs.CV | cs.LG

Damola Agbelese, Krishna Chaitanya, Pushpak Pati, Chaitanya Parmar, Pooya Mobadersany

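TL;DR: MEGAN is a multi-expert gating network that combines several expert models based on Evidential Deep Learning (EDL), markedly improving prediction confidence and calibration for UC disease severity estimation in endoscopy videos.

Details

Motivation: Current uncertainty quantification methods in medical AI typically rely on a single expert's annotations, overlooking the inter-rater variability that is common in healthcare. MEGAN aims to address this by combining annotations from multiple experts with diverse modeling strategies.

Result: In a UC clinical trial, MEGAN improved the F1-score by 3.5% and reduced ECE by 30.5% relative to existing methods, and it enabled uncertainty-guided sample stratification.

Insight: MEGAN shows that combining multi-expert annotations with diverse modeling strategies can improve both model performance and uncertainty quantification in medical AI while reducing the annotation burden.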

Abstract: Reliable uncertainty quantification (UQ) is essential in medical AI. Evidential Deep Learning (EDL) offers a computationally efficient way to quantify model uncertainty alongside predictions, unlike traditional methods such as Monte Carlo (MC) Dropout and Deep Ensembles (DE). However, all these methods often rely on a single expert’s annotations as ground truth for model training, overlooking the inter-rater variability in healthcare. To address this issue, we propose MEGAN, a Multi-Expert Gating Network that aggregates uncertainty estimates and predictions from multiple AI experts via EDL models trained with diverse ground truths and modeling strategies. MEGAN’s gating network optimally combines predictions and uncertainties from each EDL model, enhancing overall prediction confidence and calibration. We extensively benchmark MEGAN on endoscopy videos for ulcerative colitis (UC) disease severity estimation, assessed by visual labeling of the Mayo Endoscopic Subscore (MES), where inter-rater variability is prevalent. In a large-scale prospective UC clinical trial, MEGAN achieved a 3.5% improvement in F1-score and a 30.5% reduction in Expected Calibration Error (ECE) compared to existing methods. Furthermore, MEGAN facilitated uncertainty-guided sample stratification, reducing the annotation burden and potentially increasing efficiency and consistency in UC trials.
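Since the headline calibration result is reported as Expected Calibration Error, a short sketch of the standard equal-width-bin ECE may help. The number of bins and the left-closed binning convention are assumptions made here for illustration; the abstract does not state the paper's exact evaluation settings.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    confidences: predicted probability of the chosen class, in [0, 1].
    correct: booleans, True where the prediction matched the label.
    n_bins: number of bins (10 is a common default, assumed here).
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Four predictions at 90% confidence, of which three are right.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                 [True, True, True, False])
```

A lower value means stated confidence tracks observed accuracy more closely, which is the sense in which a 30.5% reduction indicates better calibration.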


cs.LG

[111] Similarity-Distance-Magnitude Activations cs.LG | cs.CL

Allen Schmaltz

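TL;DR: The paper proposes SDM (Similarity-Distance-Magnitude), a more robust and interpretable activation function. By adding similarity and distance awareness, it improves on the standard softmax function, making it more robust to covariate shifts and out-of-distribution inputs, and it provides interpretability by exemplar.

Details

Motivation: The standard softmax function is not sufficiently robust to covariate shifts and out-of-distribution inputs in high-probability regions, and it lacks interpretability. The author aims to address these issues by adding similarity and distance awareness.

Result: SDM outperforms softmax under covariate shifts and out-of-distribution inputs and yields better calibration for selective classification.

Insight: Combining similarity and distance information can improve the robustness and interpretability of an activation function and gives richer context about the model's decision boundary.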

Abstract: We introduce a more robust and interpretable formulation of the standard softmax activation function commonly used with neural networks by adding Similarity (i.e., correctly predicted depth-matches into training) awareness and Distance-to-training-distribution awareness to the existing output Magnitude (i.e., decision-boundary) awareness. When used as the final-layer activation with language models, the resulting Similarity-Distance-Magnitude (SDM) activation function is more robust than the softmax function to covariate shifts and out-of-distribution inputs in high-probability regions, and provides interpretability-by-exemplar via dense matching. Complementing the prediction-conditional estimates, the SDM activation enables a partitioning of the class-wise empirical CDFs to guard against low class-wise recall among selective classifications. These properties make it preferable for selective classification, even when considering post-hoc calibration methods over the softmax.


[112] When Inverse Data Outperforms: Exploring the Pitfalls of Mixed Data in Multi-Stage Fine-Tuning cs.LG | cs.CL

Mengyi Deng, Xin Li, Tingyu Zhu, Zhicheng Yang, Zhijiang Guo

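TL;DR: The paper explores the pitfalls of mixed data in multi-stage fine-tuning. It constructs r1k, a high-quality reverse reasoning dataset, and analyzes how SFT and DPO affect alignment under bidirectional reasoning objectives. The results show that naively mixing data weakens the directional distinction, while DPO may suppress less preferred paths.

Details

Motivation: Existing methods focus mainly on unidirectional supervised fine-tuning (SFT) and overlook the intricate interplay between diverse reasoning patterns. The paper sets out to explore the potential problems of mixed data in multi-stage fine-tuning and its effect on alignment.

Result: SFT on r1k improves accuracy by 1.6% to 6.8% over s1k, but mixing data weakens the directional distinction. DPO can partially recover the distinction, but it may also suppress less preferred reasoning paths.

Insight: Mixed reasoning data can introduce conflicting supervision signals, so more robust, direction-aware alignment strategies are needed.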

Abstract: Existing work has shown that o1-level performance can be achieved with limited data distillation, but most existing methods focus on unidirectional supervised fine-tuning (SFT), overlooking the intricate interplay between diverse reasoning patterns. In this paper, we construct r1k, a high-quality reverse reasoning dataset derived by inverting 1,000 forward examples from s1k, and examine how SFT and Direct Preference Optimization (DPO) affect alignment under bidirectional reasoning objectives. SFT on r1k yields a 1.6%–6.8% accuracy improvement over s1k across evaluated benchmarks. However, naively mixing forward and reverse data during SFT weakens the directional distinction. Although DPO can partially recover this distinction, it also suppresses less preferred reasoning paths by shifting the probability mass toward irrelevant outputs. These findings suggest that mixed reasoning data introduce conflicting supervision signals, underscoring the need for robust and direction-aware alignment strategies.


[113] WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning cs.LG | cs.CL

Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao

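TL;DR: The paper proposes WebSailor-V2, which uses synthetic data and scalable reinforcement learning to improve open-source agents on complex information-seeking tasks and narrow the gap to proprietary agents.

Details

Motivation: Proprietary agents such as DeepResearch show superhuman capabilities on complex information-seeking tasks, whereas open-source models lack comparable systematic reasoning abilities. This work aims to close that gap through synthetic data and reinforcement learning.

Result: WebSailor-V2 outperforms all open-source agents on complex information-seeking tasks and approaches the performance of proprietary agents.

Insight: Systematic reasoning ability is key to narrowing the gap between proprietary and open-source agents, and combining synthetic data with reinforcement learning can effectively improve model capability.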

Abstract: Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents’ performance and closing the capability gap.


[114] Flexible Multimodal Neuroimaging Fusion for Alzheimer’s Disease Progression Prediction cs.LG | cs.AI | cs.CV | eess.IV

Benjamin Burns, Yuan Xue, Douglas W. Scharre, Xia Ning

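TL;DR: The paper proposes PerM-MoE, which gives each modality its own router to make multimodal models more flexible when modalities are missing, and applies it to Alzheimer's disease progression prediction.

Details

Motivation: Alzheimer's disease progression varies widely between patients, and existing multimodal models lose predictive accuracy when modalities are missing, which limits clinical use.

Result: On the ADNI dataset, PerM-MoE outperforms the state-of-the-art Flex-MoE in most modality-missingness settings and makes more effective use of its experts.

Insight: Independent per-modality routers help multimodal models handle missing modalities flexibly and improve robustness.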

Abstract: Alzheimer’s disease (AD) is a progressive neurodegenerative disease with high inter-patient variance in rate of cognitive decline. AD progression prediction aims to forecast patient cognitive decline and benefits from incorporating multiple neuroimaging modalities. However, existing multimodal models fail to make accurate predictions when many modalities are missing during inference, as is often the case in clinical settings. To increase multimodal model flexibility under high modality missingness, we introduce PerM-MoE, a novel sparse mixture-of-experts method that uses independent routers for each modality in place of the conventional, single router. Using T1-weighted MRI, FLAIR, amyloid beta PET, and tau PET neuroimaging data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), we evaluate PerM-MoE, state-of-the-art Flex-MoE, and unimodal neuroimaging models on predicting two-year change in Clinical Dementia Rating-Sum of Boxes (CDR-SB) scores under varying levels of modality missingness. PerM-MoE outperforms the state of the art in most variations of modality missingness and demonstrates more effective utility of experts than Flex-MoE.
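To make the idea of independent per-modality routers concrete, here is a rough sketch of sparse top-k gating applied separately to each available modality. The router logits are supplied directly as inputs (in a real model they would come from a learned router applied to that modality's embedding), `None` marks a missing modality, and all names are hypothetical; this is not the PerM-MoE implementation.

```python
import math


def top_k_gate(router_logits, k):
    """Keep the k highest-scoring experts and renormalize with a softmax."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    m = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - m) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def route_per_modality(router_logits_by_modality, k=2):
    """Route each available modality with its own router output.

    Missing modalities (value None) are skipped, so their absence does not
    disturb the routing decisions made for the modalities that are present.
    """
    return {name: top_k_gate(logits, k)
            for name, logits in router_logits_by_modality.items()
            if logits is not None}


routing = route_per_modality({
    "t1_mri": [2.0, 1.0, 0.0],   # experts 0 and 1 are preferred
    "tau_pet": None,             # modality missing for this patient
    "flair": [0.0, 0.0, 3.0],    # expert 2 is preferred
}, k=2)
```

With a single shared router, the gate would see a combined input that changes whenever a modality drops out; keeping routers separate is one plausible reading of why the method tolerates high missingness.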


[115] InJecteD: Analyzing Trajectories and Drift Dynamics in Denoising Diffusion Probabilistic Models for 2D Point Cloud Generation cs.LG | cs.CV

Sanyam Jain, Khuram Naveed, Illia Oleksiienko, Alexandros Iosifidis, Ruben Pauwels

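TL;DR: InJecteD is a framework for analyzing the trajectories and drift dynamics of DDPMs in 2D point cloud generation. By quantifying trajectory properties, it makes the models more transparent and helps practitioners debug and refine generative models.

Details

Motivation: To study the trajectories and drift dynamics of DDPMs in 2D point cloud generation in order to improve model interpretability and support human-AI collaboration.

Result: Experiments reveal three phases of the denoising process and show that Fourier-based embeddings improve trajectory stability and reconstruction quality.

Insight: Interpretability is essential for debugging and refining generative models, and dataset-specific behaviors offer a new perspective on model design.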

Abstract: This work introduces InJecteD, a framework for interpreting Denoising Diffusion Probabilistic Models (DDPMs) by analyzing sample trajectories during the denoising process of 2D point cloud generation. We apply this framework to three datasets from the Datasaurus Dozen (bullseye, dino, and circle) using a simplified DDPM architecture with customizable input and time embeddings. Our approach quantifies trajectory properties, including displacement, velocity, clustering, and drift field dynamics, using statistical metrics such as Wasserstein distance and cosine similarity. By enhancing model transparency, InJecteD supports human-AI collaboration by enabling practitioners to debug and refine generative models. Experiments reveal distinct denoising phases: initial noise exploration, rapid shape formation, and final refinement, with dataset-specific behaviors, for example, bullseye's concentric convergence vs. dino's complex contour formation. We evaluate four model configurations, varying embeddings and noise schedules, demonstrating that Fourier-based embeddings improve trajectory stability and reconstruction quality.
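The trajectory properties named in the abstract (displacement, velocity, cosine similarity) can be computed for a single 2D sample path as in the sketch below. The function names and the toy trajectory are illustrative assumptions; the paper's exact metric definitions are not given in the abstract.

```python
import math


def step_velocities(trajectory):
    """Per-step displacement vectors between consecutive 2D positions."""
    return [(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:])]


def net_displacement(trajectory):
    """Straight-line distance from the first to the last position."""
    (x0, y0), (xn, yn) = trajectory[0], trajectory[-1]
    return math.hypot(xn - x0, yn - y0)


def cosine_similarity(u, v):
    """Cosine of the angle between two 2D vectors (0.0 if either is zero)."""
    norm_u, norm_v = math.hypot(*u), math.hypot(*v)
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return (u[0] * v[0] + u[1] * v[1]) / (norm_u * norm_v)


# Toy path of one point across four denoising steps (made-up values).
trajectory = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
velocities = step_velocities(trajectory)
# Similarity of consecutive velocities: 1.0 means the point keeps heading
# the same way, values near 0 or below indicate a change of direction.
direction_consistency = [cosine_similarity(a, b)
                         for a, b in zip(velocities, velocities[1:])]
```

Tracking how direction consistency evolves over the denoising steps is one simple way to see phases such as early wandering followed by steady convergence.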


[116] iCD: A Implicit Clustering Distillation Mathod for Structural Information Mining cs.LG | cs.CV

Xiang Xue, Yatu Ji, Qing-dao-er-ji Ren, Bao Shi, Min Lu

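TL;DR: iCD is a clustering distillation method that requires no feature alignment. Using decoupled local logit representations and Gram matrices, it mines interpretable structural knowledge and markedly improves performance on fine-grained classification tasks.

Details

Motivation: Conventional logit knowledge distillation is simple and easy to use, but its decision-making process lacks interpretability, which limits further application. iCD aims to address this through structural information mining.

Result: Experiments on benchmark datasets confirm the effectiveness of iCD, particularly on fine-grained classification tasks, with a peak improvement of 5.08%.

Insight: iCD offers an interpretable distillation method that needs no complex alignment and shows the importance of structural information in training student models.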

Abstract: Logit Knowledge Distillation has gained substantial research interest in recent years due to its simplicity and lack of requirement for intermediate feature alignment; however, it suffers from limited interpretability in its decision-making process. To address this, we propose implicit Clustering Distillation (iCD): a simple and effective method that mines and transfers interpretable structural knowledge from logits, without requiring ground-truth labels or feature-space alignment. iCD leverages Gram matrices over decoupled local logit representations to enable student models to learn latent semantic structural patterns. Extensive experiments on benchmark datasets demonstrate the effectiveness of iCD across diverse teacher-student architectures, with particularly strong performance in fine-grained classification tasks – achieving a peak improvement of +5.08% over the baseline. The code is available at: https://github.com/maomaochongaa/iCD.
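As a rough illustration of transferring structure through Gram matrices, the sketch below builds a sample-by-sample Gram matrix from a batch of logits and compares teacher and student with a mean squared difference. The decoupling into local logit representations is omitted, and the MSE comparison is an assumption chosen for simplicity; the released iCD code is the reference for the actual loss.

```python
def gram_matrix(logits):
    """Pairwise inner products between the logit vectors of a batch.

    Entry (i, j) measures how similarly samples i and j are scored across
    classes, which is label-free structural information.
    """
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in logits]
            for row_i in logits]


def gram_mse(teacher_logits, student_logits):
    """Mean squared difference between teacher and student Gram matrices."""
    g_teacher = gram_matrix(teacher_logits)
    g_student = gram_matrix(student_logits)
    n = len(g_teacher)
    return sum((g_teacher[i][j] - g_student[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)


teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[2.0, 0.0], [0.0, 1.0]]
structure_loss = gram_mse(teacher, student)
```

Because the Gram matrix has shape batch by batch, teacher and student can be compared without matching feature dimensions, which is consistent with the abstract's claim that no feature-space alignment is required.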


[117] Tool-R1: Sample-Efficient Reinforcement Learning for Agentic Tool Use cs.LG | cs.CV

Yabo Zhang, Yihan Zeng, Qingyun Li, Zhen Hu, Kavin Han

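TL;DR: Tool-R1 is a reinforcement learning framework that helps large language models (LLMs) carry out multi-step tool-use tasks by generating executable Python code, improving task accuracy and robustness.

Details

Motivation: Although LLMs are strong at language understanding and reasoning, they remain limited on real-world tasks that require up-to-date knowledge, precise operations, or specialized tool use. Tool use can extend their capabilities, but existing methods perform poorly on complex, multi-step tasks.

Result: On the GAIA benchmark, Tool-R1 substantially improves accuracy and robustness, gaining about 10% over baselines, with larger improvements on complex multi-step tasks.

Insight: Tool-R1 shows the potential of reinforcement learning for improving LLM tool use and offers a new approach to reliable, efficient tool-augmented reasoning in real-world applications.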

Abstract: Large language models (LLMs) have demonstrated strong capabilities in language understanding and reasoning, yet they remain limited when tackling real-world tasks that require up-to-date knowledge, precise operations, or specialized tool use. To address this, we propose Tool-R1, a reinforcement learning framework that enables LLMs to perform general, compositional, and multi-step tool use by generating executable Python code. Tool-R1 supports integration of user-defined tools and standard libraries, with variable sharing across steps to construct coherent workflows. An outcome-based reward function, combining LLM-based answer judgment and code execution success, guides policy optimization. To improve training efficiency, we maintain a dynamic sample queue to cache and reuse high-quality trajectories, reducing the overhead of costly online sampling. Experiments on the GAIA benchmark show that Tool-R1 substantially improves both accuracy and robustness, achieving about 10% gain over strong baselines, with larger improvements on complex multi-step tasks. These results highlight the potential of Tool-R1 for enabling reliable and efficient tool-augmented reasoning in real-world applications. Our code will be available at https://github.com/YBYBZhang/Tool-R1.
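The dynamic sample queue mentioned in the abstract can be pictured as a bounded buffer that caches high-reward trajectories for reuse. The sketch below is a guess at the general shape of such a component: the class name, the reward threshold, and the oldest-first eviction rule are all assumptions, not details taken from Tool-R1.

```python
import random
from collections import deque


class TrajectoryQueue:
    """Bounded cache of high-quality trajectories (hypothetical design)."""

    def __init__(self, capacity, min_reward):
        # deque(maxlen=...) silently evicts the oldest entry when full.
        self.buffer = deque(maxlen=capacity)
        self.min_reward = min_reward

    def add(self, trajectory, reward):
        """Cache the trajectory only if its reward clears the threshold."""
        if reward >= self.min_reward:
            self.buffer.append((trajectory, reward))
            return True
        return False

    def sample(self, k, rng=random):
        """Draw up to k cached trajectories without replacement."""
        k = min(k, len(self.buffer))
        return rng.sample(list(self.buffer), k)


queue = TrajectoryQueue(capacity=2, min_reward=0.5)
rollouts = [("a", 0.9), ("b", 0.1), ("c", 0.7), ("d", 1.0)]
accepted = [queue.add(traj, reward) for traj, reward in rollouts]
```

Reusing cached rollouts in this way trades some freshness of the training data for fewer expensive online sampling calls, which is the efficiency argument the abstract makes.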


cs.AI

[118] Small Models, Big Results: Achieving Superior Intent Extraction through Decomposition cs.AI | cs.CL

Danielle Cohen, Yoni Halpern, Noam Kahlon, Joel Oren, Omri Berkovitch

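TL;DR: This paper proposes a decomposed approach that uses structured interaction summarization and improved intent extraction to markedly strengthen the intent understanding of small models in resource-constrained settings, even surpassing the base performance of large models.

Details

Motivation: Current MLLMs are powerful but cannot run efficiently on-device, which raises privacy, cost, and latency concerns. Small models perform poorly at intent understanding, which limits their use.

Result: The intent understanding of small models improves markedly and even surpasses the base performance of large models.

Insight: With task decomposition and structured information processing, small models can match or even exceed large models in resource-constrained settings.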

Abstract: Understanding user intents from UI interaction trajectories remains a challenging, yet crucial, frontier in intelligent agent development. While massive, datacenter-based, multi-modal large language models (MLLMs) possess greater capacity to handle the complexities of such sequences, smaller models, which can run on-device to provide a privacy-preserving, low-cost, and low-latency user experience, struggle with accurate intent inference. We address these limitations by introducing a novel decomposed approach. First, we perform structured interaction summarization, capturing key information from each user action. Second, we perform intent extraction using a fine-tuned model operating on the aggregated summaries. This method improves intent understanding in resource-constrained models, even surpassing the base performance of large MLLMs.


[119] Zero-shot Graph Reasoning via Retrieval Augmented Framework with LLMs cs.AI | cs.CL

Hanqing Li, Kiran Sheena Jyothi, Henry Liang, Sharika Mahadevan, Diego Klabjan

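TL;DR: This paper proposes GRRAF, a new training-free method that combines retrieval-augmented generation (RAG) with the code-generation capabilities of large language models (LLMs) to solve a variety of graph reasoning tasks.

Details

Motivation: Existing methods require extensive fine-tuning or depend on predefined algorithms; GRRAF addresses these shortcomings with a retrieval-augmented framework.

Result: On the GraphInstruct dataset, GRRAF reaches 100% accuracy on most graph reasoning tasks and scales to large graphs with 10,000 nodes.

Insight: GRRAF demonstrates the potential of LLMs for complex graph reasoning tasks while avoiding the training cost of conventional methods.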

Abstract: We propose a new, training-free method, Graph Reasoning via Retrieval Augmented Framework (GRRAF), that harnesses retrieval-augmented generation (RAG) alongside the code-generation capabilities of large language models (LLMs) to address a wide range of graph reasoning tasks. In GRRAF, the target graph is stored in a graph database, and the LLM is prompted to generate executable code queries that retrieve the necessary information. This approach circumvents the limitations of existing methods that require extensive finetuning or depend on predefined algorithms, and it incorporates an error feedback loop with a time-out mechanism to ensure both correctness and efficiency. Experimental evaluations on the GraphInstruct dataset reveal that GRRAF achieves 100% accuracy on most graph reasoning tasks, including cycle detection, bipartite graph checks, shortest path computation, and maximum flow, while maintaining consistent token costs regardless of graph sizes. Imperfect but still very high performance is observed on subgraph matching. Notably, GRRAF scales effectively to large graphs with up to 10,000 nodes.


[120] V-Math: An Agentic Approach to the Vietnamese National High School Graduation Mathematics Exams cs.AI | cs.CV | cs.CY

Duong Q. Nguyen, Quy P. Nguyen, Nguyen Van Nhon, Quang-Thinh Bui, H. Nguyen-Xuan

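TL;DR: The paper presents V-Math, an autonomous agentic framework designed to help Vietnamese high school students prepare for the National High School Graduation Mathematics Exams. The system integrates three AI agents: a specification-matrix-conditioned question generator, a solver/explainer that provides detailed step-by-step solutions, and a personalized tutor that adapts content to student performance.

Details

Motivation: The exam is critically important for students, but traditional preparation lacks personalized, efficient tools, and teachers struggle to produce high-quality questions quickly. V-Math aims to address these pain points with AI, improving the efficiency and equity of exam preparation.

Result: Preliminary evaluations show that V-Math generates questions that conform to the specification matrix, achieves high solution accuracy, gives clear explanations, and enriches the variety of practice materials.

Insight: A combination of AI agents can efficiently support education and exam preparation while reducing teachers' workload. Future directions may include extending to other subjects and more complex learning scenarios.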

Abstract: This paper develops an autonomous agentic framework called V-Math that aims to assist Vietnamese high school students in preparing for the National High School Graduation Mathematics Exams (NHSGMEs). The salient framework integrates three specialized AI agents: a specification-matrix-conditioned question generator, a solver/explainer for detailed step-by-step reasoning, and a personalized tutor that adapts to student performance. Beyond enabling self-paced student practice, V-Math supports teachers by generating innovative, compliant exam questions and building diverse, high-quality question banks. This reduces manual workload and enriches instructional resources. We describe the system architecture, focusing on practice modes for learners and teacher-oriented features for question generation. Preliminary evaluations demonstrate that V-Math produces matrix-aligned exams with high solution accuracy, delivers coherent explanations, and enhances the variety of practice materials. These results highlight its potential to support scalable, equitable mathematics preparation aligned with national standards while also empowering teachers through AI-assisted exam creation.


[121] HLSMAC: A New StarCraft Multi-Agent Challenge for High-Level Strategic Decision-Making cs.AI | cs.CV | cs.GT | cs.LG | cs.MA

Xingxing Hong, Yungong Wang, Dexin Jin, Ye Yuan, Ximing Huang

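TL;DR: HLSMAC is a new multi-agent reinforcement learning benchmark built on StarCraft II that focuses on evaluating high-level strategic decision-making, addressing the fact that existing benchmarks such as SMAC mainly test micromanagement.

Details

Motivation: Existing multi-agent reinforcement learning benchmarks such as SMAC concentrate on micromanagement and lack evaluation of high-level strategic decision-making. This limits analysis of how multi-agent systems perform in complex strategic environments.

Result: Experiments show that HLSMAC effectively evaluates multi-agent performance in high-level strategic decision-making and serves as a robust testbed.

Insight: High-level strategic decision-making is key to multi-agent performance in complex environments and calls for dedicated benchmarks and evaluation metrics.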

Abstract: Benchmarks are crucial for assessing multi-agent reinforcement learning (MARL) algorithms. While StarCraft II-related environments have driven significant advances in MARL, existing benchmarks like SMAC focus primarily on micromanagement, limiting comprehensive evaluation of high-level strategic intelligence. To address this, we introduce HLSMAC, a new cooperative MARL benchmark with 12 carefully designed StarCraft II scenarios based on classical stratagems from the Thirty-Six Stratagems. Each scenario corresponds to a specific stratagem and is designed to challenge agents with diverse strategic elements, including tactical maneuvering, timing coordination, and deception, thereby opening up avenues for evaluating high-level strategic decision-making capabilities. We also propose novel metrics across multiple dimensions beyond conventional win rate, such as ability utilization and advancement efficiency, to assess agents’ overall performance within the HLSMAC environment. We integrate state-of-the-art MARL algorithms and LLM-based agents with our benchmark and conduct comprehensive experiments. The results demonstrate that HLSMAC serves as a robust testbed for advancing multi-agent strategic decision-making.


[122] Simulating Clinical AI Assistance using Multimodal LLMs: A Case Study in Diabetic Retinopathy cs.AI | cs.CV | cs.HC

Nadim Barakat, William Lotter

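TL;DR: This paper studies how multimodal large language models (MLLMs) perform at diabetic retinopathy (DR) detection and evaluates how different output formats affect clinical AI assistance. Experiments show that MedGemma has higher sensitivity than GPT-4o, while GPT-4o performs stably under specific conditions. Multimodal outputs may improve clinical trust and utility.

Details

Motivation: Diabetic retinopathy is one of the leading causes of blindness worldwide, and current FDA-cleared AI systems mainly provide binary outputs, which limits clinical trust and utility. The study explores how MLLMs can improve clinical AI assistance through different output formats.

Result: MedGemma outperformed GPT-4o at baseline with higher sensitivity. In the collaboration experiment, GPT-4o reached an AUROC of up to 0.96 when guided by MedGemma's descriptive outputs. Descriptive outputs could enhance explainability and clinical trust.

Insight: Multimodal outputs may substantially improve the utility of clinical AI, and open, lightweight models such as MedGemma show particular promise in low-resource settings. Descriptive outputs help build clinical trust.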

Abstract: Diabetic retinopathy (DR) is a leading cause of blindness worldwide, and AI systems can expand access to fundus photography screening. Current FDA-cleared systems primarily provide binary referral outputs, where this minimal output may limit clinical trust and utility. Yet, determining the most effective output format to enhance clinician-AI performance is an empirical challenge that is difficult to assess at scale. We evaluated multimodal large language models (MLLMs) for DR detection and their ability to simulate clinical AI assistance across different output types. Two models were tested on IDRiD and Messidor-2: GPT-4o, a general-purpose MLLM, and MedGemma, an open-source medical model. Experiments included: (1) baseline evaluation, (2) simulated AI assistance with synthetic predictions, and (3) actual AI-to-AI collaboration where GPT-4o incorporated MedGemma outputs. MedGemma outperformed GPT-4o at baseline, achieving higher sensitivity and AUROC, while GPT-4o showed near-perfect specificity but low sensitivity. Both models adjusted predictions based on simulated AI inputs, but GPT-4o’s performance collapsed with incorrect ones, whereas MedGemma remained more stable. In actual collaboration, GPT-4o achieved strong results when guided by MedGemma’s descriptive outputs, even without direct image access (AUROC up to 0.96). These findings suggest MLLMs may improve DR screening pipelines and serve as scalable simulators for studying clinical AI assistance across varying output configurations. Open, lightweight models such as MedGemma may be especially valuable in low-resource settings, while descriptive outputs could enhance explainability and clinician trust in clinical workflows.


cs.HC

[123] Textarium: Entangling Annotation, Abstraction and Argument cs.HC | cs.CL | H.5.2; H.5.4; I.7.1; J.5

Philipp Proff, Marian Dörk

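TL;DR: Textarium is a web-based environment that supports the interpretation of text by connecting annotation, abstraction, and argumentation. It provides a visual interface for scholarly reading and writing and combines human analysis with lightweight computation to bridge close and distant reading.

Details

Motivation: The digital humanities currently lack effective tools for making text interpretation transparent and shareable, particularly in scholarly reading and writing. Textarium aims to fill this gap by supporting complex interpretive practices with technology.

Result: The authors developed a fully functional web tool that supports interpretive reading and writing in scholarly research and makes the process more transparent and shareable.

Insight: Combining visualization with technical support can make text interpretation more transparent and efficient, offering new tools and methods for digital humanities research.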

Abstract: We present a web-based environment that connects annotation, abstraction, and argumentation during the interpretation of text. As a visual interface for scholarly reading and writing, Textarium combines human analysis with lightweight computational processing to bridge close and distant reading practices. Readers can highlight text, group keywords into concepts, and embed these observations as anchors in essays. The interface renders these interpretive actions as parameterized visualization states. Through a speculative design process of co-creative and iterative prototyping, we developed a reading-writing approach that makes interpretive processes transparent and shareable within digital narratives.


cs.RO

[124] The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning cs.RO | cs.CL | cs.CV

Titong Jiang, Xuefeng Jiang, Yuan Ma, Xin Wen, Bailin Li

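TL;DR: LightVLA is a differentiable visual token pruning framework that dynamically evaluates token importance and uses Gumbel softmax for efficient pruning, improving both the efficiency and the performance of VLA models.

Details

Motivation: Deploying VLA models on resource-constrained platforms is limited by the heavy computation over large numbers of visual tokens, so an efficient, performance-driven pruning method is needed.

Result: On the LIBERO benchmark, LightVLA reduces FLOPs by 59.1% and latency by 38.2% while improving the task success rate by 2.9%.

Insight: In pursuing optimal performance, LightVLA spontaneously learns to prune tokens from a performance-driven perspective, offering an efficient and practical solution for real-time robotic systems.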

Abstract: We present LightVLA, a simple yet effective differentiable token pruning framework for vision-language-action (VLA) models. While VLA models have shown impressive capability in executing real-world robotic tasks, their deployment on resource-constrained platforms is often bottlenecked by the heavy attention-based computation over large sets of visual tokens. LightVLA addresses this challenge through adaptive, performance-driven pruning of visual tokens: It generates dynamic queries to evaluate visual token importance, and adopts Gumbel softmax to enable differentiable token selection. Through fine-tuning, LightVLA learns to preserve the most informative visual tokens while pruning tokens which do not contribute to task execution, thereby improving efficiency and performance simultaneously. Notably, LightVLA requires no heuristic magic numbers and introduces no additional trainable parameters, making it compatible with modern inference frameworks. Experimental results demonstrate that LightVLA outperforms different VLA models and existing token pruning methods across diverse tasks on the LIBERO benchmark, achieving higher success rates with substantially reduced computational overhead. Specifically, LightVLA reduces FLOPs and latency by 59.1% and 38.2% respectively, with a 2.9% improvement in task success rate. Meanwhile, we also investigate the learnable query-based token pruning method LightVLA* with additional trainable parameters, which also achieves satisfactory performance. Our work reveals that as VLA pursues optimal performance, LightVLA spontaneously learns to prune tokens from a performance-driven perspective. To the best of our knowledge, LightVLA is the first work to apply adaptive visual token pruning to VLA tasks with the collateral goals of efficiency and performance, marking a significant step toward more efficient, powerful and practical real-time robotic systems.
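The Gumbel softmax relaxation that the abstract relies on for differentiable token selection can be sketched in a few lines. The token importance scores below are made-up inputs and the temperature is an arbitrary choice; how LightVLA produces scores from its dynamic queries, and how it backpropagates through the selection, is not reproduced here.

```python
import math
import random


def gumbel_softmax(scores, temperature=1.0, rng=random):
    """Relaxed one-hot sample: softmax of (scores + Gumbel noise) / T."""
    noisy = []
    for s in scores:
        # Clamp away from 0 so the double logarithm stays finite.
        u = max(rng.random(), 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        noisy.append((s + gumbel_noise) / temperature)
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)  # seeded only to make the sketch repeatable
token_scores = [2.0, 0.5, -1.0, 0.1]  # hypothetical importance scores
soft_selection = gumbel_softmax(token_scores, temperature=0.5, rng=rng)
# Hard choice used in the forward pass; with an autodiff framework the
# soft weights would carry gradients (straight-through estimation).
kept_token = max(range(len(soft_selection)),
                 key=soft_selection.__getitem__)
```

Lower temperatures push the soft weights toward a one-hot vector, while higher temperatures keep them smoother and give more informative gradients early in training.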


[125] Neural 3D Object Reconstruction with Small-Scale Unmanned Aerial Vehicles cs.RO | cs.AR | cs.CV | cs.ET | cs.SY | eess.SY

Àlmos Veres-Vitàlyos, Genis Castillo Gomez-Raya, Filip Lemic, Daniel Johannes Bugelnig, Bernhard Rinner

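TL;DR: This paper proposes a system architecture for high-quality 3D reconstruction with small UAVs. A dual-reconstruction pipeline creates a real-time feedback loop between data capture and flight control, markedly improving reconstruction quality.

Details

Motivation: The use of small UAVs for complex tasks such as high-quality 3D reconstruction is limited by payload and autonomy constraints; this work aims to overcome that limitation.

Result: Experiments show that dynamic trajectory adaptation clearly outperforms static flight paths and also performs well in multi-UAV configurations.

Insight: The work demonstrates that small UAVs can achieve high-quality 3D reconstruction in constrained environments, a capability previously limited to larger platforms.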

Abstract: Small Unmanned Aerial Vehicles (UAVs) exhibit immense potential for navigating indoor and hard-to-reach areas, yet their significant constraints in payload and autonomy have largely prevented their use for complex tasks like high-quality 3-Dimensional (3D) reconstruction. To overcome this challenge, we introduce a novel system architecture that enables fully autonomous, high-fidelity 3D scanning of static objects using UAVs weighing under 100 grams. Our core innovation lies in a dual-reconstruction pipeline that creates a real-time feedback loop between data capture and flight control. A near-real-time (near-RT) process uses Structure from Motion (SfM) to generate an instantaneous pointcloud of the object. The system analyzes the model quality on the fly and dynamically adapts the UAV’s trajectory to intelligently capture new images of poorly covered areas. This ensures comprehensive data acquisition. For the final, detailed output, a non-real-time (non-RT) pipeline employs a Neural Radiance Fields (NeRF)-based Neural 3D Reconstruction (N3DR) approach, fusing SfM-derived camera poses with precise Ultra Wide-Band (UWB) location data to achieve superior accuracy. We implemented and validated this architecture using Crazyflie 2.1 UAVs. Our experiments, conducted in both single- and multi-UAV configurations, conclusively show that dynamic trajectory adaptation consistently improves reconstruction quality over static flight paths. This work demonstrates a scalable and autonomous solution that unlocks the potential of miniaturized UAVs for fine-grained 3D reconstruction in constrained environments, a capability previously limited to much larger platforms.


[126] ActiveVLN: Towards Active Exploration via Multi-Turn RL in Vision-and-Language Navigation cs.RO | cs.AI | cs.CV

Zekai Zhang, Weiye Zhu, Hewei Pan, Xiangchen Wang, Rongtao Xu

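TL;DR: ActiveVLN proposes an active exploration framework based on multi-turn reinforcement learning (RL) for vision-and-language navigation (VLN). It addresses the reliance of existing methods on imitation learning and expert trajectories and improves RL efficiency with a dynamic pruning strategy.

Details

Motivation: Existing VLN methods rely mainly on imitation learning (IL), which is costly and lacks active exploration. RL is promising, but prior approaches depend on expert trajectories for reward shaping and do not achieve open-ended exploration. ActiveVLN aims to enable active exploration through multi-turn RL.

Result: ActiveVLN achieves larger performance gains than DAgger-based and prior RL methods and, despite using a smaller model, is competitive with state-of-the-art approaches.

Insight: With active exploration and RL-based optimization, VLN can move away from dependence on expert trajectories and improve navigation diversity and efficiency.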

Abstract: The Vision-and-Language Navigation (VLN) task requires an agent to follow natural language instructions and navigate through complex environments. Existing MLLM-based VLN methods primarily rely on imitation learning (IL) and often use DAgger for post-training to mitigate covariate shift. While effective, these approaches incur substantial data collection and training costs. Reinforcement learning (RL) offers a promising alternative. However, prior VLN RL methods lack dynamic interaction with the environment and depend on expert trajectories for reward shaping, rather than engaging in open-ended active exploration. This restricts the agent’s ability to discover diverse and plausible navigation routes. To address these limitations, we propose ActiveVLN, a VLN framework that explicitly enables active exploration through multi-turn RL. In the first stage, a small fraction of expert trajectories is used for IL to bootstrap the agent. In the second stage, the agent iteratively predicts and executes actions, automatically collects diverse trajectories, and optimizes multiple rollouts via the GRPO objective. To further improve RL efficiency, we introduce a dynamic early-stopping strategy to prune long-tail or likely failed trajectories, along with additional engineering optimizations. Experiments show that ActiveVLN achieves the largest performance gains over IL baselines compared to both DAgger-based and prior RL-based post-training methods, while reaching competitive performance with state-of-the-art approaches despite using a smaller model. Code and data will be released soon.
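The second training stage optimizes multiple rollouts with the GRPO objective, whose core ingredient is a group-relative advantage: each rollout's reward is standardized against the other rollouts for the same instruction. The sketch below shows only that normalization step, using the population standard deviation and a small epsilon as assumptions; the clipped policy-gradient loss and ActiveVLN's early-stopping rule are not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts.

    rewards: scalar outcome rewards for rollouts of the same task.
    eps: guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one navigation instruction: two succeed, two fail.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed, and groups in which every rollout gets the same reward contribute zero advantage.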


[127] Unleashing the Power of Discrete-Time State Representation: Ultrafast Target-based IMU-Camera Spatial-Temporal Calibration cs.RO | cs.CV

Junlin Song, Antoine Richard, Miguel Olivares-Mendez

TL;DR: 该论文提出了一种基于离散时间状态表示的极高效IMU-相机时空标定方法,解决了传统连续时间状态标定方法的高计算成本问题,并弥补了离散时间状态在时间标定中的弱点。

Details

Motivation: 视觉-惯性融合在机器人导航和增强现实等智能应用中至关重要,但现有的标定方法通常采用连续时间状态表示(如B样条),计算成本较高。随着无人机、手机等设备的普及,高效的标定方法能为大规模设备节省大量时间。

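Result: The method greatly improves calibration efficiency while maintaining high accuracy, saving considerable time when calibrating devices at scale.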

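Insight: Discrete-time state representation has strong efficiency potential for calibration tasks. Targeted optimization can overcome its shortcomings, offering a practical solution for real-world visual-inertial fusion.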

Abstract: Visual-inertial fusion is crucial for a large number of intelligent and autonomous applications, such as robot navigation and augmented reality. To bootstrap and achieve optimal state estimation, the spatial-temporal displacements between IMU and cameras must be calibrated in advance. Most existing calibration methods adopt continuous-time state representation, more specifically the B-spline. Although these methods achieve precise spatial-temporal calibration, they suffer from high computational cost caused by continuous-time state representation. To this end, we propose a novel and extremely efficient calibration method that unleashes the power of discrete-time state representation. Moreover, the weakness of discrete-time state representation in temporal calibration is tackled in this paper. With the increasing production of drones, cellphones and other visual-inertial platforms, if one million devices need calibration around the world, saving one minute for the calibration of each device means saving 2083 work days in total. To benefit both the research and industry communities, our code will be open-sourced.
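
The 2083-work-day figure in the abstract can be checked with a short calculation. The sketch below assumes an 8-hour work day, which the abstract does not state but which reproduces the quoted number; the function and constant names are hypothetical.

```python
DEVICES = 1_000_000            # devices needing calibration (from the abstract)
MINUTES_SAVED_PER_DEVICE = 1   # time saved per device (from the abstract)
HOURS_PER_WORK_DAY = 8         # assumption: not stated in the abstract


def work_days_saved(devices, minutes_per_device, hours_per_work_day):
    """Convert total minutes saved across all devices into work days."""
    total_hours = devices * minutes_per_device / 60
    return total_hours / hours_per_work_day


days = work_days_saved(DEVICES, MINUTES_SAVED_PER_DEVICE, HOURS_PER_WORK_DAY)
print(int(days))  # prints 2083
```

One million minutes is about 16,667 hours, which at 8 hours per day comes to roughly 2083.3 work days, matching the abstract once truncated to a whole number.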


cs.SI [Back]

[128] Podcasts as a Medium for Participation in Collective Action: A Case Study of Black Lives Matter cs.SI | cs.CL | cs.CY [PDF]

Theodora Moldovan, Arianna Pera, Davide Vega, Luca Maria Aiello

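TL;DR: Using the BLM movement as a case study, this paper examines how participation in collective action is expressed in podcasts, filling a gap in research on audio formats, and analyzes how emotions relate to stages of action.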

Details

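Motivation: Prior research on collective action has focused mainly on text-based content, while podcasts as an audio medium remain underexplored. This study is a first attempt to analyze spoken expressions of participation and their emotional dimensions through podcast transcripts.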

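Result: Emotional profiles vary by stage of action. Positive emotions are more prominent in the call-to-action and intention stages, while negative emotions are negatively associated with collective action. These results challenge theoretical expectations.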

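Insight: Emotional expression in podcasts may be medium-specific, offering a new perspective for studying social-movement participation in spoken digital discourse.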

Abstract: We study how participation in collective action is articulated in podcast discussions, using the Black Lives Matter (BLM) movement as a case study. While research on collective action discourse has primarily focused on text-based content, this study takes a first step toward analyzing audio formats by using podcast transcripts. Using the Structured Podcast Research Corpus (SPoRC), we investigated spoken language expressions of participation in collective action, categorized as problem-solution, call-to-action, intention, and execution. We identified podcast episodes discussing racial justice after important BLM-related events in May and June of 2020, and extracted participatory statements using a layered framework adapted from prior work on social media. We examined the emotional dimensions of these statements, detecting eight key emotions and their association with varying stages of activism. We found that emotional profiles vary by stage, with different positive emotions standing out during calls-to-action, intention, and execution. We detected negative associations between collective action and negative emotions, contrary to theoretical expectations. Our work contributes to a better understanding of how activism is expressed in spoken digital discourse and how emotional framing may depend on the format of the discussion.