Table of Contents
- cs.CV [Total: 78]
- cs.CL [Total: 11]
- cs.LG [Total: 6]
- cs.DB [Total: 1]
- cs.AI [Total: 3]
- eess.IV [Total: 1]
- cs.MM [Total: 1]
- cs.PL [Total: 1]
- cs.IR [Total: 1]
- cs.RO [Total: 3]
cs.CV
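[1] Gaussian See, Gaussian Do: Semantic 3D Motion Transfer from Multiview Video cs.CV
Yarin Bekor, Gal Michael Harari, Or Perel, Or Litany
TL;DR: The paper proposes Gaussian See, Gaussian Do, a method for rig-free, cross-category transfer of semantic 3D motion from multiview video, achieving high fidelity and structural consistency through dynamic 3D Gaussian Splatting reconstruction.
Details
Motivation: Existing methods suffer from weak semantic correspondence and poor structural consistency in cross-category motion transfer; this work aims to address both and produce more natural motion transfer.
Result: On the benchmark, the method shows better motion fidelity and structural consistency than the baselines.
Insight: The anchor-based view-aware mechanism improves cross-view consistency and speeds up convergence, and the 4D reconstruction pipeline consolidates noisy supervision videos.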
Abstract: We present Gaussian See, Gaussian Do, a novel approach for semantic 3D motion transfer from multiview video. Our method enables rig-free, cross-category motion transfer between objects with semantically meaningful correspondence. Building on implicit motion transfer techniques, we extract motion embeddings from source videos via condition inversion, apply them to rendered frames of static target shapes, and use the resulting videos to supervise dynamic 3D Gaussian Splatting reconstruction. Our approach introduces an anchor-based view-aware motion embedding mechanism, ensuring cross-view consistency and accelerating convergence, along with a robust 4D reconstruction pipeline that consolidates noisy supervision videos. We establish the first benchmark for semantic 3D motion transfer and demonstrate superior motion fidelity and structural consistency compared to adapted baselines. Code and data for this paper are available at https://gsgd-motiontransfer.github.io/
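[2] When CNNs Outperform Transformers and Mambas: Revisiting Deep Architectures for Dental Caries Segmentation cs.CV | cs.AI
Aashish Ghimire, Jun Zeng, Roshan Paudel, Nikhil Kumar Tomar, Deepak Ranjan Nayak
TL;DR: This paper benchmarks 12 deep learning architectures on dental caries segmentation and finds that the CNN-based DoubleU-Net outperforms Transformer- and Mamba-based methods, suggesting that architecture-task alignment matters more than model complexity in domain-specific tasks.
Details
Motivation: Automated segmentation of dental caries is critical for clinical diagnosis, but existing methods perform poorly because of low contrast and limited data. The paper systematically evaluates CNN, Transformer, and Mamba architectures to find the model best suited to dental images.
Result: The CNN-based DoubleU-Net performs best (Dice = 0.7345), ahead of the Transformer- and Mamba-based methods, indicating that simpler architectures can be better for specific tasks.
Insight: Although Transformers and Mamba have a theoretical advantage in global modeling, the spatial priors of CNNs are more effective when data is limited, which highlights how much architecture choice matters in domain-specific tasks.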
Abstract: Accurate identification and segmentation of dental caries in panoramic radiographs are critical for early diagnosis and effective treatment planning. Automated segmentation remains challenging due to low lesion contrast, morphological variability, and limited annotated data. In this study, we present the first comprehensive benchmarking of convolutional neural networks, vision transformers and state-space Mamba architectures for automated dental caries segmentation on panoramic radiographs using the DC1000 dataset. Twelve state-of-the-art architectures, including VMUnet, MambaUNet, VMUNetv2, RMAMamba-S, TransNetR, PVTFormer, DoubleU-Net, and ResUNet++, were trained under identical configurations. Results reveal that, contrary to the growing trend toward complex attention-based architectures, the CNN-based DoubleU-Net achieved the highest Dice coefficient of 0.7345, mIoU of 0.5978, and precision of 0.8145, outperforming all transformer and Mamba variants. In the study, the top 3 results across all performance metrics were achieved by CNN-based architectures. Here, Mamba and transformer-based methods, despite their theoretical advantage in global context modeling, underperformed due to limited data and weaker spatial priors. These findings underscore that architecture-task alignment matters more than model complexity in domain-specific medical image segmentation. Our code is available at: https://github.com/JunZengz/dental-caries-segmentation.
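For readers unfamiliar with the Dice and IoU metrics quoted in this entry, the following is a minimal sketch of how both are commonly computed for a pair of binary masks. It is not taken from the paper's repository; the function name `dice_and_iou`, the toy masks, and the convention for two empty masks are assumptions made for illustration.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equally sized binary masks (nested lists of 0/1)."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    # Pixels marked as lesion in both masks.
    intersection = sum(p and t for p, t in zip(flat_pred, flat_target))
    pred_sum = sum(flat_pred)
    target_sum = sum(flat_target)
    if pred_sum + target_sum == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0, 1.0
    union = pred_sum + target_sum - intersection
    dice = 2 * intersection / (pred_sum + target_sum)
    iou = intersection / union
    return dice, iou


# Toy 2x3 masks (made up for illustration).
pred = [[0, 1, 1], [0, 1, 0]]
target = [[0, 1, 0], [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)
print(f"Dice={dice:.4f} IoU={iou:.4f}")
```

For a single pair of masks the two metrics are linked by IoU = Dice / (2 - Dice). Scores averaged over a dataset, such as the reported Dice and mIoU, need not satisfy that identity exactly.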
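[3] B-Rep Distance Functions (BR-DF): How to Represent a B-Rep Model by Volumetric Distance Functions? cs.CV | cs.AI
Fuyang Zhang, Pradeep Kumar Jayaraman, Xiang Xu, Yasutaka Furukawa
TL;DR: This paper proposes BR-DF, a new geometric representation that encodes a CAD boundary representation (B-Rep) with volumetric distance functions and converts it into a watertight B-Rep model. The method jointly generates an SDF and per-face UDFs and reaches a 100% success rate in model generation.
Details
Motivation: Geometry processing and generation of CAD boundary representations (B-Reps) often suffer from high failure rates. The authors aim to improve B-Rep representation and generation by using volumetric distance functions (an SDF and UDFs).
Result: On CAD generation, performance is comparable to SOTA methods, with a 100% success rate in producing B-Rep models.
Insight: Volumetric distance functions (an SDF and UDFs) can effectively represent the complex geometry and topology of a B-Rep, and they combine with diffusion models for generation tasks.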
Abstract: This paper presents a novel geometric representation for CAD Boundary Representation (B-Rep) based on volumetric distance functions, dubbed B-Rep Distance Functions (BR-DF). BR-DF encodes the surface mesh geometry of a CAD model as a signed distance function (SDF). B-Rep vertices, edges, faces and their topology information are encoded as per-face unsigned distance functions (UDFs). An extension of the Marching Cubes algorithm converts BR-DF directly into a watertight CAD B-Rep model (strictly speaking, a faceted B-Rep model). A surprising characteristic of BR-DF is that this conversion process never fails. Leveraging the volumetric nature of BR-DF, we propose a multi-branch latent diffusion with a 3D U-Net backbone for jointly generating the SDF and per-face UDFs of a BR-DF model. Our approach achieves comparable CAD generation performance against SOTA methods while reaching an unprecedented 100% success rate in producing (faceted) B-Rep models.
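To make the distinction between a signed distance function and per-face unsigned distance functions concrete, here is a small sketch for an axis-aligned cube. It only illustrates the two definitions; it is not the paper's encoding or its Marching Cubes extension, and the helper names `box_sdf` and `face_udf` are hypothetical.

```python
import math


def box_sdf(p, h):
    """Signed distance from point p to a cube of half-size h centred at the origin.

    Negative inside, positive outside (standard box SDF formula).
    """
    q = [abs(c) - h for c in p]
    outside = math.sqrt(sum(max(c, 0.0) ** 2 for c in q))
    inside = min(max(q), 0.0)
    return outside + inside


def face_udf(p, h, axis, sign):
    """Unsigned distance from p to one square face of the same cube.

    The face lies in the plane p[axis] == sign * h.
    """
    total = 0.0
    for i, c in enumerate(p):
        if i == axis:
            total += (c - sign * h) ** 2          # distance to the face plane
        else:
            total += max(abs(c) - h, 0.0) ** 2    # overshoot past the face edge
    return math.sqrt(total)


center = (0.0, 0.0, 0.0)
outside_pt = (2.0, 0.0, 0.0)
print(box_sdf(center, 1.0), box_sdf(outside_pt, 1.0))
print(face_udf(outside_pt, 1.0, 0, +1), face_udf(outside_pt, 1.0, 0, -1))
```

In this toy setting the SDF carries the inside/outside information, while the smallest per-face UDF indicates which face a point is closest to.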
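[4] GeoSceneGraph: Geometric Scene Graph Diffusion Model for Text-guided 3D Indoor Scene Synthesis cs.CV
Antonio Ruiz, Tao Wu, Andrew Melnik, Qing Cheng, Xuqin Wang
TL;DR: GeoSceneGraph is a diffusion model based on geometric scene graphs for synthesizing 3D indoor scenes from text prompts. It exploits the graph structure and geometric symmetries of scenes, needs neither predefined relationship classes nor ground-truth relationship annotations, and achieves high scene coherence and realism.
Details
Motivation: Existing methods either train generative models from scratch or rely on vision-language models (VLMs); they either ignore the graph structure of scenes or require user-provided semantic graphs or ground-truth relationship annotations, which limits the coherence and flexibility of scene generation.
Result: Without relying on ground-truth relationships, GeoSceneGraph achieves performance comparable to methods that do.
Insight: The work shows the importance of geometric symmetries and scene graph structure in 3D scene generation, and proposes a way to condition EGNNs on text features.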
Abstract: Methods that synthesize indoor 3D scenes from text prompts have wide-ranging applications in film production, interior design, video games, virtual reality, and synthetic data generation for training embodied agents. Existing approaches typically either train generative models from scratch or leverage vision-language models (VLMs). While VLMs achieve strong performance, particularly for complex or open-ended prompts, smaller task-specific models remain necessary for deployment on resource-constrained devices such as extended reality (XR) glasses or mobile phones. However, many generative approaches that train from scratch overlook the inherent graph structure of indoor scenes, which can limit scene coherence and realism. Conversely, methods that incorporate scene graphs either demand a user-provided semantic graph, which is generally inconvenient and restrictive, or rely on ground-truth relationship annotations, limiting their capacity to capture more varied object interactions. To address these challenges, we introduce GeoSceneGraph, a method that synthesizes 3D scenes from text prompts by leveraging the graph structure and geometric symmetries of 3D scenes, without relying on predefined relationship classes. Despite not using ground-truth relationships, GeoSceneGraph achieves performance comparable to methods that do. Our model is built on equivariant graph neural networks (EGNNs), but existing EGNN approaches are typically limited to low-dimensional conditioning and are not designed to handle complex modalities such as text. We propose a simple and effective strategy for conditioning EGNNs on text features, and we validate our design through ablation studies.
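[5] HULFSynth : An INR based Super-Resolution and Ultra Low-Field MRI Synthesis via Contrast factor estimation cs.CV | cs.LG
Pranav Indrakanti, Ivor Simpson
TL;DR: The paper proposes an unsupervised, single-image, bidirectional MRI synthesis method that estimates contrast factors to synthesize between high-field (HF) and ultra-low-field (ULF) MRI in both directions, and uses an implicit neural representation (INR) for the super-resolution task.
Details
Motivation: Existing MRI synthesis models lack physics-driven modeling of the contrast changes between HF and ULF, which limits the accuracy and robustness of synthesis.
Result: WM-GM contrast improved by 52% in synthetic ULF-like images and by 37% in 64mT images. The model's robustness to target contrast, noise, and initial seeding was verified.
Insight: Combining physics-driven models with deep learning (such as INRs) can markedly improve the accuracy and robustness of MRI synthesis.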
Abstract: We present an unsupervised single image bidirectional Magnetic Resonance Image (MRI) synthesizer that synthesizes an Ultra-Low Field (ULF) like image from a High-Field (HF) magnitude image and vice-versa. Unlike existing MRI synthesis models, our approach is inspired by the physics that drives contrast changes between HF and ULF MRIs. Our forward model simulates an HF to ULF transformation by estimating the tissue-type Signal-to-Noise ratio (SNR) values based on target contrast values. For the Super-Resolution task, we used an Implicit Neural Representation (INR) network to synthesize an HF image by simultaneously predicting tissue-type segmentations and image intensity without observed HF data. The proposed method is evaluated using synthetic ULF-like data generated from standard 3T T$_1$-weighted images for qualitative assessments and paired 3T-64mT T$_1$-weighted images for validation experiments. WM-GM contrast improved by 52% in synthetic ULF-like images and 37% in 64mT images. Sensitivity experiments demonstrated the robustness of our forward model to variations in target contrast, noise and initial seeding.
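[6] InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization cs.CV
Daniel Gilo, Or Litany
TL;DR: InstructMix2Mix (I-Mix2Mix) is a framework for multi-view image editing from sparse views; it distills the editing capabilities of a 2D diffusion model into a pretrained multi-view diffusion model to achieve cross-view consistent edits.
Details
Motivation: Existing methods, based on per-scene neural fields or temporal attention mechanisms, tend to produce artifacts and inconsistencies in multi-view editing, so a more effective solution is needed.
Result: Experiments show that I-Mix2Mix significantly improves multi-view consistency while maintaining high per-frame edit quality.
Insight: The data-driven 3D prior of a multi-view diffusion model can address consistency in sparse-view editing, and the attention modification that enhances cross-view coherence adds no extra cost.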
Abstract: We address the task of multi-view image editing from sparse input views, where the inputs can be seen as a mix of images capturing the scene from different viewpoints. The goal is to modify the scene according to a textual instruction while preserving consistency across all views. Existing methods, based on per-scene neural fields or temporal attention mechanisms, struggle in this setting, often producing artifacts and incoherent edits. We propose InstructMix2Mix (I-Mix2Mix), a framework that distills the editing capabilities of a 2D diffusion model into a pretrained multi-view diffusion model, leveraging its data-driven 3D prior for cross-view consistency. A key contribution is replacing the conventional neural field consolidator in Score Distillation Sampling (SDS) with a multi-view diffusion student, which requires novel adaptations: incremental student updates across timesteps, a specialized teacher noise scheduler to prevent degeneration, and an attention modification that enhances cross-view coherence without additional cost. Experiments demonstrate that I-Mix2Mix significantly improves multi-view consistency while maintaining high per-frame edit quality.
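[7] Skin-R1: Toward Trustworthy Clinical Reasoning for Dermatological Diagnosis cs.CV | cs.AI | cs.CL
Zehao Liu, Wejieying Ren, Jipeng Zhang, Tianxiang Zhao, Jingxi Zhu
TL;DR: This paper proposes SkinR1, a vision-language model for dermatological diagnosis that combines deep, textbook-based reasoning with the generalization capabilities of reinforcement learning, addressing data heterogeneity, missing reasoning supervision, and limited generalization.
Details
Motivation: The trustworthiness and clinical utility of vision-language models for dermatological diagnosis are limited by data heterogeneity, a lack of reliable reasoning supervision, and limited generalization; SkinR1 aims to address these challenges.
Result: On multiple dermatology datasets, SkinR1 achieves superior diagnostic accuracy, and the ablation study confirms the importance of its reasoning foundation.
Insight: The key to SkinR1 is combining expert knowledge (textbooks) with a data-driven method (reinforcement learning), which offers a new approach to trustworthiness and generalization for models in the medical domain.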
Abstract: The emergence of vision-language models (VLMs) has opened new possibilities for clinical reasoning and has shown promising performance in dermatological diagnosis. However, their trustworthiness and clinical utility are often limited by three major factors: (1) Data heterogeneity, where diverse datasets lack consistent diagnostic labels and clinical concept annotations; (2) Absence of grounded diagnostic rationales, leading to a scarcity of reliable reasoning supervision; and (3) Limited scalability and generalization, as models trained on small, densely annotated datasets struggle to transfer nuanced reasoning to large, sparsely annotated ones. To address these limitations, we propose SkinR1, a novel dermatological VLM that combines deep, textbook-based reasoning with the broad generalization capabilities of reinforcement learning (RL). SkinR1 systematically resolves the key challenges through a unified, end-to-end framework. First, we design a textbook-based reasoning generator that synthesizes high-fidelity, hierarchy-aware, and differential-diagnosis (DDx)-informed trajectories, providing reliable expert-level supervision. Second, we leverage the constructed trajectories for supervised fine-tuning (SFT), empowering the model with grounded reasoning ability. Third, we develop a novel RL paradigm that, by incorporating the hierarchical structure of diseases, effectively transfers these grounded reasoning patterns to large-scale, sparse data. Extensive experiments on multiple dermatology datasets demonstrate that SkinR1 achieves superior diagnostic accuracy. The ablation study demonstrates the importance of the reasoning foundation instilled by SFT.
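[8] FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding cs.CV
Zhenshi Li, Weikang Yu, Dilxat Muhtar, Xueliang Zhang, Pengfeng Xiao
TL;DR: FarSLIP proposes a fine-grained CLIP adaptation framework for remote sensing data; with the multi-granularity dataset MGRS-200k and patch-to-patch distillation, it improves feature discriminability while preserving semantic coherence.
Details
Motivation: Existing CLIP variants for remote sensing are limited by global alignment, and directly applying region-text alignment methods leads to performance degradation.
Result: FarSLIP sets a new SOTA on remote sensing open-vocabulary semantic segmentation, zero-shot classification, and image-text retrieval.
Insight: Directly applying general-purpose region alignment methods can damage CLIP's semantic coherence; fine-grained alignment strategies better suited to remote sensing data are needed.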
Abstract: As CLIP’s global alignment limits its ability to capture fine-grained details, recent efforts have focused on enhancing its region-text alignment. However, current remote sensing (RS)-specific CLIP variants still inherit this limited spatial awareness. We identify two key limitations behind this: (1) current RS image-text datasets generate global captions from object-level labels, leaving the original object-level supervision underutilized; (2) despite the success of region-text alignment methods in the general domain, their direct application to RS data often leads to performance degradation. To address these, we construct the first multi-granularity RS image-text dataset, MGRS-200k, featuring rich object-level textual supervision for RS region-category alignment. We further investigate existing fine-grained CLIP tuning strategies and find that current explicit region-text alignment methods, whether direct or indirect, underperform due to severe degradation of CLIP’s semantic coherence. Building on these findings, we propose FarSLIP, a Fine-grained Aligned RS Language-Image Pretraining framework. Rather than the commonly used patch-to-CLS self-distillation, FarSLIP employs patch-to-patch distillation to align local and global visual cues, which improves feature discriminability while preserving semantic coherence. Additionally, to effectively utilize region-text supervision, it employs simple CLS token-based region-category alignment rather than explicit patch-level alignment, further enhancing spatial awareness. FarSLIP features improved fine-grained vision-language alignment in the RS domain and sets a new state of the art not only on RS open-vocabulary semantic segmentation, but also on image-level tasks such as zero-shot classification and image-text retrieval. Our dataset, code, and models are available at https://github.com/NJU-LHRS/FarSLIP.
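[9] nnMIL: A generalizable multiple instance learning framework for computational pathology cs.CV
Xiangde Luo, Jinxi Xiang, Yuanfeng Ji, Ruijiang Li
TL;DR: nnMIL is a general multiple-instance learning framework for computational pathology that improves the generalizability and reliability of predictions through random sampling and a lightweight aggregator.
Details
Motivation: In current computational pathology, methods for aggregating WSI features have design limitations that hurt generalizability and reliability.
Result: It performs strongly across 40,000 WSIs and 35 clinical tasks, particularly in cross-model generalization and survival prediction.
Insight: With random sampling and uncertainty quantification, nnMIL offers a reliable route to clinical translation of pathology foundation models.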
Abstract: Computational pathology holds substantial promise for improving diagnosis and guiding treatment decisions. Recent pathology foundation models enable the extraction of rich patch-level representations from large-scale whole-slide images (WSIs), but current approaches for aggregating these features into slide-level predictions remain constrained by design limitations that hinder generalizability and reliability. Here, we developed nnMIL, a simple yet broadly applicable multiple-instance learning framework that connects patch-level foundation models to robust slide-level clinical inference. nnMIL introduces random sampling at both the patch and feature levels, enabling large-batch optimization, task-aware sampling strategies, and efficient and scalable training across datasets and model architectures. A lightweight aggregator performs sliding-window inference to generate ensemble slide-level predictions and supports principled uncertainty estimation. Across 40,000 WSIs encompassing 35 clinical tasks and four pathology foundation models, nnMIL consistently outperformed existing MIL methods for disease diagnosis, histologic subtyping, molecular biomarker detection, and pan-cancer prognosis prediction. It further demonstrated strong cross-model generalization, reliable uncertainty quantification, and robust survival stratification in multiple external cohorts. In conclusion, nnMIL offers a practical and generalizable solution for translating pathology foundation models into clinically meaningful predictions, advancing the development and deployment of reliable AI systems in real-world settings.
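The combination of random patch sampling during training and sliding-window ensembling at inference can be sketched as below. This is a heavily simplified illustration under stated assumptions, not nnMIL itself: patch features are reduced to single floats, `toy_aggregator` is a placeholder mean-pooling function standing in for a learned aggregator, and the spread of window predictions is used as a crude uncertainty proxy.

```python
import random
import statistics


def sample_patches(features, k, rng):
    """Randomly draw up to k patch features for one training step."""
    return rng.sample(features, min(k, len(features)))


def toy_aggregator(window):
    """Placeholder aggregator (assumption): mean of scalar patch features."""
    return sum(window) / len(window)


def sliding_window_predict(features, window, stride):
    """Aggregate each window of patches, then ensemble the window predictions.

    Returns (mean prediction, population std of the window predictions).
    """
    last_start = max(len(features) - window, 0)
    preds = [toy_aggregator(features[i:i + window])
             for i in range(0, last_start + 1, stride)]
    return statistics.mean(preds), statistics.pstdev(preds)


features = [0.0, 1.0, 2.0, 3.0]  # made-up scalar "patch features"
batch = sample_patches(features, 3, random.Random(0))
mean_pred, spread = sliding_window_predict(features, window=2, stride=2)
print(batch, mean_pred, spread)
```

Sampling a fixed number of patches per slide is what makes uniform large batches possible; the window ensemble then covers the whole slide at inference time.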
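[10] CPSL: Representing Volumetric Video via Content-Promoted Scene Layers cs.CV | cs.MM
Kaiyuan Hu, Yili Jin, Junhua Liu, Xize Duan, Hong Kang
TL;DR: CPSL is a compact 2.5D video representation that decomposes each frame into geometry-consistent layers, bringing the perceptual benefits of volumetric video to conventional 2D content while reducing computation and rendering cost.
Details
Motivation: Existing volumetric video representations (such as point clouds or neural fields) are costly to capture, compute, and render, which limits their scalability and real-time use.
Result: Across multiple benchmarks, CPSL outperforms the baselines in perceptual quality and boundary fidelity while reducing storage and rendering cost severalfold.
Insight: CPSL offers a practical path from 2D video to 2.5D immersive media, balancing quality against cost.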
Abstract: Volumetric video enables immersive and interactive visual experiences by supporting free viewpoint exploration and realistic motion parallax. However, existing volumetric representations, from explicit point clouds to implicit neural fields, remain costly in capture, computation, and rendering, which limits their scalability for on-demand video and reduces their feasibility for real-time communication. To bridge this gap, we propose Content-Promoted Scene Layers (CPSL), a compact 2.5D video representation that brings the perceptual benefits of volumetric video to conventional 2D content. Guided by per-frame depth and content saliency, CPSL decomposes each frame into a small set of geometry-consistent layers equipped with soft alpha bands and an edge-depth cache that jointly preserve occlusion ordering and boundary continuity. These lightweight, 2D-encodable assets enable parallax-corrected novel-view synthesis via depth-weighted warping and front-to-back alpha compositing, bypassing expensive 3D reconstruction. Temporally, CPSL maintains inter-frame coherence using motion-guided propagation and per-layer encoding, supporting real-time playback with standard video codecs. Across multiple benchmarks, CPSL achieves superior perceptual quality and boundary fidelity compared with layer-based and neural-field baselines while reducing storage and rendering cost severalfold. Our approach offers a practical path from 2D video to scalable 2.5D immersive media.
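The front-to-back alpha compositing mentioned in this abstract is a standard rendering step, and a per-pixel version can be sketched as follows. This is the generic textbook formulation, not CPSL's implementation; the function name and the early-exit threshold are assumptions.

```python
def composite_front_to_back(layers):
    """Composite one pixel from layers ordered nearest first.

    layers: list of ((r, g, b), alpha) with alpha in [0, 1].
    Returns (rgb tuple, accumulated coverage).
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still passing through nearer layers
    for rgb, alpha in layers:
        weight = transmittance * alpha
        color = [c + weight * v for c, v in zip(color, rgb)]
        transmittance *= 1.0 - alpha
        if transmittance <= 1e-4:
            break  # nearer layers are effectively opaque; farther ones are hidden
    return tuple(color), 1.0 - transmittance


# A half-transparent red layer in front of an opaque blue layer.
pixel, coverage = composite_front_to_back([
    ((1.0, 0.0, 0.0), 0.5),
    ((0.0, 0.0, 1.0), 1.0),
])
print(pixel, coverage)
```

Processing nearest layers first allows the loop to stop as soon as the pixel is opaque, which is one reason front-to-back ordering suits layered representations.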
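[11] Unsupervised Discovery of Long-Term Spatiotemporal Periodic Workflows in Human Activities cs.CV
Fan Yang, Quanting Xie, Atsunori Moteki, Shoichi Masui, Shan Jiang
TL;DR: This paper introduces a benchmark of 580 multimodal human activity sequences focused on unsupervised discovery of long-term spatiotemporal periodic workflows, and proposes a lightweight, training-free baseline. The baseline performs strongly across tasks and has practical deployment advantages.
Details
Motivation: Existing research focuses mainly on short-term periodic activities, while long-term periodic workflows with complex structure and low-contrast patterns remain underexplored. The paper aims to fill this gap and to provide a benchmark and method that support real-world applications.
Result: Experiments show that the method outperforms competing methods by a substantial margin on all evaluation tasks, and that in real-world applications its deployment advantages are on par with traditional supervised workflow detection approaches, without the need for annotation or retraining.
Insight: Unsupervised discovery of long-term spatiotemporal periodic workflows has practical value, and a lightweight method can achieve high performance without annotations.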
Abstract: Periodic human activities with implicit workflows are common in manufacturing, sports, and daily life. While short-term periodic activities – characterized by simple structures and high-contrast patterns – have been widely studied, long-term periodic workflows with low-contrast patterns remain largely underexplored. To bridge this gap, we introduce the first benchmark comprising 580 multimodal human activity sequences featuring long-term periodic workflows. The benchmark supports three evaluation tasks aligned with real-world applications: unsupervised periodic workflow detection, task completion tracking, and procedural anomaly detection. We also propose a lightweight, training-free baseline for modeling diverse periodic workflow patterns. Experiments show that: (i) our benchmark presents significant challenges to both unsupervised periodic detection methods and zero-shot approaches based on powerful large language models (LLMs); (ii) our baseline outperforms competing methods by a substantial margin in all evaluation tasks; and (iii) in real-world applications, our baseline demonstrates deployment advantages on par with traditional supervised workflow detection approaches, eliminating the need for annotation and retraining. Our project page is https://sites.google.com/view/periodicworkflow.
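[12] RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems cs.CV
Jaro Meyer, Frédéric Giraud, Joschua Wüthrich, Marc Pollefeys, Philipp Fürnstahl
TL;DR: RocSync is a low-cost, general-purpose temporal synchronization method that uses an LED clock to achieve millisecond-level synchronization of multi-camera systems; it works with heterogeneous devices and improves downstream multi-view computer vision tasks.
Details
Motivation: Heterogeneous camera systems (for example, professional and consumer-grade devices, or RGB and infrared sensors) often lack hardware synchronization, which makes spatiotemporal alignment of multi-view video streams difficult and hurts dynamic-scene applications.
Result: Experiments show that the method outperforms light-, audio-, and timecode-based synchronization, with a residual error of 1.34 ms RMSE, and directly improves downstream tasks such as multi-view pose estimation and 3D reconstruction.
Insight: RocSync simplifies the synchronization pipeline and broadens access to advanced vision-based sensing in unconstrained environments, with particular potential in industrial and clinical settings.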
Abstract: Accurate spatiotemporal alignment of multi-view video streams is essential for a wide range of dynamic-scene applications such as multi-view 3D reconstruction, pose estimation, and scene understanding. However, synchronizing multiple cameras remains a significant challenge, especially in heterogeneous setups combining professional and consumer-grade devices, visible and infrared sensors, or systems with and without audio, where common hardware synchronization capabilities are often unavailable. This limitation is particularly evident in real-world environments, where controlled capture conditions are not feasible. In this work, we present a low-cost, general-purpose synchronization method that achieves millisecond-level temporal alignment across diverse camera systems while supporting both visible (RGB) and infrared (IR) modalities. The proposed solution employs a custom-built LED Clock that encodes time through red and infrared LEDs, allowing visual decoding of the exposure window (start and end times) from recorded frames for millisecond-level synchronization. We benchmark our method against hardware synchronization and achieve a residual error of 1.34 ms RMSE across multiple recordings. In further experiments, our method outperforms light-, audio-, and timecode-based synchronization approaches and directly improves downstream computer vision tasks, including multi-view pose estimation and 3D reconstruction. Finally, we validate the system in large-scale surgical recordings involving over 25 heterogeneous cameras spanning both IR and RGB modalities. This solution simplifies and streamlines the synchronization pipeline and expands access to advanced vision-based sensing in unconstrained environments, including industrial and clinical applications.
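The residual error in this entry is reported as a root-mean-square error (RMSE) over timing offsets. As a reminder of what that statistic measures, here is a minimal sketch; the residual values are made up and do not come from the paper.

```python
import math


def rmse(offsets_ms):
    """Root-mean-square of signed timing residuals, in milliseconds."""
    return math.sqrt(sum(e * e for e in offsets_ms) / len(offsets_ms))


# Hypothetical residuals between estimated and reference frame times.
residuals_ms = [1.0, -2.0, 2.0, -1.0]
print(f"{rmse(residuals_ms):.3f} ms")
```

Because the residuals are squared before averaging, positive and negative offsets do not cancel and larger errors are penalized more heavily than they would be by a mean absolute error.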
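[13] Artificial intelligence approaches for energy-efficient laser cutting machines cs.CV | cs.AI | cs.LG | cs.RO
Mohamed Abdallah Salem, Hamdy Ahmed Ashour, Ahmed Elshenawy
TL;DR: The study uses deep learning to optimize the energy consumption of laser cutting machines and proposes an adaptive closed-loop control approach that achieves substantial energy savings.
Details
Motivation: Energy consumption and environmental impact are major challenges for laser cutting machines; in particular, the open-loop operation of CO2 laser suction pumps wastes energy, so an adaptive solution is needed.
Result: Experiments show that the method reduces the suction pump's energy consumption by 20% to 50%, contributing to sustainable manufacturing.
Insight: Deep learning has potential for industrial energy saving, and closed-loop adaptive control is an effective way to improve energy efficiency.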
Abstract: This research addresses the significant challenges of energy consumption and environmental impact in laser cutting by proposing novel deep learning (DL) methodologies to achieve energy reduction. Recognizing the current lack of adaptive control and the open-loop nature of CO2 laser suction pumps, this study utilizes closed-loop configurations that dynamically adjust pump power based on both the material being cut and the smoke level generated. To implement this adaptive system, diverse material classification methods are introduced, including techniques leveraging lens-less speckle sensing with a customized Convolutional Neural Network (CNN) and an approach using a USB camera with transfer learning via the pre-trained VGG16 CNN model. Furthermore, a separate DL model for smoke level detection is employed to simultaneously refine the pump’s power output. This integration prompts the exhaust suction pump to automatically halt during inactive times and dynamically adjust power during operation, leading to experimentally proven and remarkable energy savings, with results showing a 20% to 50% reduction in the smoke suction pump’s energy consumption, thereby contributing substantially to sustainable development in the manufacturing sector.
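The closed-loop idea described in this abstract (pump power driven by the classified material and the detected smoke level, with the pump halted when idle) can be sketched as a simple control rule. Everything specific here is an assumption for illustration: the material names, the base power fractions, and the linear scaling are hypothetical and are not taken from the paper.

```python
# Hypothetical base power per material, as a fraction of rated pump power.
BASE_POWER = {"wood": 0.8, "acrylic": 0.6, "paper": 0.4}


def pump_power(material, smoke_level, cutting):
    """Return the commanded pump power as a fraction of rated power.

    material:    label from a material classifier (assumed to exist upstream)
    smoke_level: smoke estimate in [0, 1] from a smoke detector model
    cutting:     False while the machine is inactive
    """
    if not cutting:
        return 0.0  # halt the pump during inactive times
    base = BASE_POWER.get(material, 1.0)  # unknown material: fail safe to full power
    smoke = min(max(smoke_level, 0.0), 1.0)
    # Assumed rule: run between 50% and 100% of the base level depending on smoke.
    return min(1.0, base * (0.5 + 0.5 * smoke))


print(pump_power("wood", 1.0, True))
print(pump_power("paper", 0.0, True))
print(pump_power("wood", 0.7, False))
```

An open-loop pump would run at a fixed level regardless of these inputs, which is where the reported savings would come from.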
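[14] EGSA-PT: Edge-Guided Spatial Attention with Progressive Training for Monocular Depth Estimation and Segmentation of Transparent Objects cs.CV | cs.AI | cs.RO
Gbenga Omotara, Ramy Farag, Seyed Mohamad Ali Tousi, G. N. DeSouza
TL;DR: The paper proposes an Edge-Guided Spatial Attention (EGSA) mechanism and a progressive training strategy to improve monocular depth estimation and semantic segmentation of transparent objects, addressing cross-task interactions in transparent object perception.
Details
Motivation: Transparent object perception is a major challenge in computer vision because transparency degrades both depth estimation and semantic segmentation. Existing multi-task learning methods are limited by negative interactions between tasks.
Result: On the Syn-TODD and ClearPose benchmarks, EGSA improves depth accuracy, most notably in transparent regions, while preserving competitive segmentation performance.
Insight: Boundary information is critical in multi-task learning, and a progressive training strategy can exploit data from different modalities to improve robustness.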
Abstract: Transparent object perception remains a major challenge in computer vision research, as transparency confounds both depth estimation and semantic segmentation. Recent work has explored multi-task learning frameworks to improve robustness, yet negative cross-task interactions often hinder performance. In this work, we introduce Edge-Guided Spatial Attention (EGSA), a fusion mechanism designed to mitigate destructive interactions by incorporating boundary information into the fusion between semantic and geometric features. On both Syn-TODD and ClearPose benchmarks, EGSA consistently improved depth accuracy over the current state-of-the-art method (MODEST), while preserving competitive segmentation performance, with the largest improvements appearing in transparent regions. Besides our fusion design, our second contribution is a multi-modal progressive training strategy, where learning transitions from edges derived from RGB images to edges derived from predicted depth images. This approach allows the system to bootstrap learning from the rich textures contained in RGB images and then switch to more relevant geometric content in depth maps, while eliminating the need for ground-truth depth at training time. Together, these contributions highlight edge-guided fusion as a robust approach capable of improving transparent object perception.
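[15] Logit-Based Losses Limit the Effectiveness of Feature Knowledge Distillation cs.CV | cs.AI | cs.LG
Nicholas Cooper, Lijun Chen, Sailesh Dwivedy, Danna Gurari
TL;DR: The paper proposes a feature knowledge distillation framework that does not rely on logit-based losses and selects the best teacher layers by evaluating knowledge quality, improving performance across datasets and models by up to 15%.
Details
Motivation: Conventional knowledge distillation methods rely on logit-based losses (such as cross entropy) together with intermediate-layer features, which limits the effectiveness of distillation. This paper aims to improve distillation by using purely feature-based loss functions.
Result: On three image classification datasets and four student-teacher pairs, the method achieves SOTA performance, with top-1 accuracy gains of up to 15%.
Insight: Logit-based losses are not necessary for knowledge distillation and may even limit performance. Optimizing feature losses and choosing suitable teacher layers can substantially improve distillation.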
Abstract: Knowledge distillation (KD) methods can transfer knowledge of a parameter-heavy teacher model to a light-weight student model. The status quo for feature KD methods is to utilize loss functions based on logits (i.e., pre-softmax class scores) and intermediate layer features (i.e., latent representations). Unlike previous approaches, we propose a feature KD framework for training the student’s backbone using feature-based losses exclusively (i.e., without logit-based losses such as cross entropy). Leveraging recent discoveries about the geometry of latent representations, we introduce a knowledge quality metric for identifying which teacher layers provide the most effective knowledge for distillation. Experiments on three image classification datasets with four diverse student-teacher pairs, spanning convolutional neural networks and vision transformers, demonstrate our KD method achieves state-of-the-art performance, delivering top-1 accuracy boosts of up to 15% over standard approaches. We publicly share our code to facilitate future work at https://github.com/Thegolfingocto/KD_wo_CE.
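To illustrate what a purely feature-based distillation loss looks like, as opposed to one computed on logits, here is a generic feature-matching sketch. It is not the loss or the knowledge quality metric proposed in the paper; the normalized mean-squared-error form and the function names are assumptions chosen for simplicity.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def feature_loss(student_feat, teacher_feat):
    """Mean squared error between L2-normalized student and teacher features.

    No class scores are involved: the student is pulled toward the teacher's
    latent representation only.
    """
    s = l2_normalize(student_feat)
    t = l2_normalize(teacher_feat)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)


print(feature_loss([3.0, 4.0], [6.0, 8.0]))  # same direction
print(feature_loss([1.0, 0.0], [0.0, 1.0]))  # orthogonal directions
```

In practice the student and teacher features usually have different dimensions, so a learned projection would sit between them; that step is omitted here.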
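[16] Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation cs.CV | cs.AI | cs.LG
Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Viacheslav Vasilev
TL;DR: Kandinsky 5.0 is a family of foundation models for image and video generation, comprising efficient lightweight image and video generation models and a high-performance video generation model. Through data curation and multi-stage training, combined with self-supervised fine-tuning and reinforcement learning, it achieves high-quality generation.
Details
Motivation: To provide high-quality, efficient image and video generation models, to improve generation performance through data curation and multi-stage training, and to advance the accessibility of and research on generative models.
Result: Kandinsky 5.0 achieves high generation speeds and state-of-the-art performance across tasks, with its generation quality confirmed by human evaluation.
Insight: Data curation and a multi-stage training framework can markedly improve the performance and adaptability of generative models. Open-sourcing and sharing resources helps further development and application of generative models.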
Abstract: This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core model line-ups: Kandinsky 5.0 Image Lite, a line-up of 6B parameter image generation models; Kandinsky 5.0 Video Lite, fast and lightweight 2B parameter text-to-video and image-to-video models; and Kandinsky 5.0 Video Pro, 19B parameter models that achieve superior video generation quality. We provide a comprehensive review of the data curation lifecycle, including collection, processing, filtering and clustering, for the multi-stage training pipeline that involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.
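[17] FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation cs.CV
Yueru He, Xueqing Peng, Yupeng Cao, Yan Wang, Lingfei Qian
TL;DR: FinCriticalED is a visual benchmark for fact-level OCR evaluation on financial documents, emphasizing the accuracy of numerical and temporal information rather than surface-level text similarity.
Details
Motivation: Financial documents are tightly structured and visually dense; traditional OCR metrics (such as ROUGE and edit distance) fail to capture critical factual errors (such as sign inversions or shifted dates), which can lead to materially different interpretations.
Result: The strongest proprietary models achieve the highest factual accuracy, but substantial errors remain in visually intricate numerical and temporal contexts.
Insight: FinCriticalED provides a rigorous benchmark for high-precision visual fact extraction in finance and similar domains, and exposes the limitations of existing models in complex scenarios.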
Abstract: We introduce FinCriticalED (Financial Critical Error Detection), a visual benchmark for evaluating OCR and vision language models on financial documents at the fact level. Financial documents contain visually dense and table-heavy layouts where numerical and temporal information is tightly coupled with structure. In high-stakes settings, small OCR mistakes such as sign inversion or shifted dates can lead to materially different interpretations, while traditional OCR metrics like ROUGE and edit distance capture only surface-level text similarity. FinCriticalED provides 500 image-HTML pairs with expert-annotated financial facts covering over seven hundred numerical and temporal facts. It introduces three key contributions. First, it establishes the first fact-level evaluation benchmark for financial document understanding, shifting evaluation from lexical overlap to domain-critical factual correctness. Second, all annotations are created and verified by financial experts with strict quality control over signs, magnitudes, and temporal expressions. Third, we develop an LLM-as-Judge evaluation pipeline that performs structured fact extraction and contextual verification for visually complex financial documents. We benchmark OCR systems, open-source vision language models, and proprietary models on FinCriticalED. Results show that although the strongest proprietary models achieve the highest factual accuracy, substantial errors remain in visually intricate numerical and temporal contexts. Through quantitative evaluation and expert case studies, FinCriticalED provides a rigorous foundation for advancing visual factual precision in financial and other precision-critical domains.
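The point that edit distance captures only surface similarity can be shown with a small example: a dropped minus sign and a harmless label typo both sit one edit away from the reference, yet only the first changes the financial fact. The sketch below uses a standard Levenshtein implementation; the sample strings and the `parse_amount` helper are hypothetical and unrelated to the benchmark's actual pipeline.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def parse_amount(text):
    """Toy fact extractor (assumption): the integer after the colon."""
    return int(text.split(":")[1].strip().replace(",", ""))


reference = "Net income: -1,250"
ocr_sign_flip = "Net income: 1,250"   # minus sign lost: fact is wrong
ocr_typo = "Net incone: -1,250"       # label typo: fact is intact

for candidate in (ocr_sign_flip, ocr_typo):
    print(edit_distance(reference, candidate),
          parse_amount(candidate) == parse_amount(reference))
```

Both candidates score identically on edit distance, which is the gap a fact-level evaluation is meant to close.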
[18] CKDA: Cross-modality Knowledge Disentanglement and Alignment for Visible-Infrared Lifelong Person Re-identification cs.CVPDF
Zhenyu Cui, Jiahuan Zhou, Yuxin Peng
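TL;DR: This paper proposes Cross-modality Knowledge Disentanglement and Alignment (CKDA) to resolve the conflict between modality-specific and modality-common knowledge in visible-infrared lifelong person re-identification. MCP and MSP modules disentangle the knowledge, and a CKA module aligns new knowledge with old.
Details
Motivation: Existing methods for cross-modal lifelong person re-identification overlook the mutual interference between modality-specific and modality-common knowledge, which leads to collaborative forgetting. CKDA aims to separate and preserve both kinds of knowledge in a balanced way.
Result: On four benchmark datasets, CKDA outperforms existing methods, verifying its effectiveness.
Insight: Separating and aligning modality-specific and modality-common knowledge is key to cross-modal lifelong learning.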
Abstract: Lifelong person Re-IDentification (LReID) aims to match the same person employing continuously collected individual data from different scenarios. To achieve continuous all-day person matching across day and night, Visible-Infrared Lifelong person Re-IDentification (VI-LReID) focuses on sequential training on data from visible and infrared modalities and pursues average performance over all data. To this end, existing methods typically exploit cross-modal knowledge distillation to alleviate the catastrophic forgetting of old knowledge. However, these methods ignore the mutual interference of modality-specific knowledge acquisition and modality-common knowledge anti-forgetting, where conflicting knowledge leads to collaborative forgetting. To address the above problems, this paper proposes a Cross-modality Knowledge Disentanglement and Alignment method, called CKDA, which explicitly separates and preserves modality-specific knowledge and modality-common knowledge in a balanced way. Specifically, a Modality-Common Prompting (MCP) module and a Modality-Specific Prompting (MSP) module are proposed to explicitly disentangle and purify discriminative information that coexists and is specific to different modalities, avoiding the mutual interference between both knowledge. In addition, a Cross-modal Knowledge Alignment (CKA) module is designed to further align the disentangled new knowledge with the old one in two mutually independent inter- and intra-modality feature spaces based on dual-modality prototypes in a balanced manner. Extensive experiments on four benchmark datasets verify the effectiveness and superiority of our CKDA against state-of-the-art methods. The source code of this paper is available at https://github.com/PKU-ICST-MIPL/CKDA-AAAI2026.
[19] Complex-Valued 2D Gaussian Representation for Computer-Generated Holography cs.CV | cs.GR | cs.LGPDF
Yicheng Zhan, Xiangjun Gao, Long Quan, Kaan Akşit
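TL;DR: The paper proposes a hologram representation based on structured complex-valued 2D Gaussian primitives, which substantially reduces the parameter search space and storage requirements and supports efficient optimization and high-fidelity reconstruction through end-to-end training.
Details
Motivation: Conventional hologram methods usually store information per pixel, resulting in a large parameter search space and high storage requirements that limit scalability and optimization efficiency.
Result: Experiments show up to 2.5x lower VRAM usage, 50% faster optimization, and higher-fidelity reconstructions.
Insight: By shrinking the parameter space and streamlining computation, the method offers a more scalable solution for next-generation computer-generated holography systems.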
Abstract: We propose a new hologram representation based on structured complex-valued 2D Gaussian primitives, which replaces per-pixel information storage and reduces the parameter search space by up to 10:1. To enable end-to-end training, we develop a differentiable rasterizer for our representation, integrated with a GPU-optimized light propagation kernel in free space. Our extensive experiments show that our method achieves up to 2.5x lower VRAM usage and 50% faster optimization while producing higher-fidelity reconstructions than existing methods. We further introduce a conversion procedure that adapts our representation to practical hologram formats, including smooth and random phase-only holograms. Our experiments show that this procedure can effectively suppress noise artifacts observed in previous methods. By reducing the hologram parameter search space, our representation enables a more scalable hologram estimation in the next-generation computer-generated holography systems.
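As a rough illustration of replacing per-pixel storage with a few primitives, the sketch below sums isotropic complex-valued 2D Gaussians into a complex field. The parameterization (center, width, amplitude, phase), the function name `render_field`, and the grid size are assumptions for illustration only; the paper's structured primitives, differentiable rasterizer, and light propagation kernel are not reproduced here.

```python
import cmath
import math

def render_field(gaussians, width, height):
    """Sum isotropic complex-valued 2D Gaussians into a complex field (list of rows)."""
    field = [[0j for _ in range(width)] for _ in range(height)]
    for cx, cy, sigma, amplitude, phase in gaussians:
        weight = amplitude * cmath.exp(1j * phase)  # complex coefficient of the primitive
        for y in range(height):
            for x in range(width):
                r2 = (x - cx) ** 2 + (y - cy) ** 2
                field[y][x] += weight * math.exp(-r2 / (2 * sigma ** 2))
    return field

# Two hypothetical primitives: (center_x, center_y, sigma, amplitude, phase)
gaussians = [(4.0, 4.0, 1.5, 1.0, math.pi / 2), (10.0, 6.0, 2.0, 0.5, 0.0)]
field = render_field(gaussians, width=16, height=12)

per_pixel_params = 16 * 12 * 2         # amplitude and phase stored for every pixel
primitive_params = len(gaussians) * 5  # five numbers per Gaussian in this toy setup
print(per_pixel_params, primitive_params)
```

The parameter counts in this toy setup only show the direction of the saving; they are not the ratios reported in the abstract.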
[20] Computer Vision Modeling of the Development of Geometric and Numerical Concepts in Humans cs.CVPDF
Zekun Wang, Sashank Varma
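TL;DR: This paper examines whether computer vision (CV) models can mirror the developmental trajectory of human geometric and numerical concepts, finding that ResNet-50 aligns with children's development on some geometric concepts and on numerical representation.
Details
Motivation: Mathematical thinking is a core part of human cognition; the authors use CV models to study how it develops, offering cognitive science new tools and perspectives.
Result: The model aligns with children's development for some classes of geometric concepts (e.g., Euclidean Geometry) and for numerical representation (the mental number line), but not for others (e.g., Chiral Figures).
Insight: CV models are a promising tool for studying the development of human mathematical cognition; further work should explore additional model architectures and larger benchmarks.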
Abstract: Mathematical thinking is a fundamental aspect of human cognition. Cognitive scientists have investigated the mechanisms that underlie our ability to think geometrically and numerically, to take two prominent examples, and developmental scientists have documented the trajectories of these abilities over the lifespan. Prior research has shown that computer vision (CV) models trained on the unrelated task of image classification nevertheless learn latent representations of geometric and numerical concepts similar to those of adults. Building on this demonstrated cognitive alignment, the current study investigates whether CV models also show developmental alignment: whether their performance improves across training to match the developmental progressions observed in children. In a detailed case study of the ResNet-50 model, we show that this is the case. For the case of geometry and topology, we find developmental alignment for some classes of concepts (Euclidean Geometry, Geometrical Figures, Metric Properties, Topology) but not others (Chiral Figures, Geometric Transformations, Symmetrical Figures). For the case of number, we find developmental alignment in the emergence of a human-like "mental number line" representation with experience. These findings show the promise of computer vision models for understanding the development of mathematical understanding in humans. They point the way to future research exploring additional model architectures and building larger benchmarks.
[21] UniHOI: Unified Human-Object Interaction Understanding via Unified Token Space cs.CV | cs.AIPDF
Panqi Yang, Haodong Jing, Nanning Zheng, Yongqiang Ma
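TL;DR: The paper proposes UniHOI, which jointly models human-object interaction (HOI) detection and generation through a unified token space, improving knowledge sharing and generalization.
Details
Motivation: HOI detection and generation have traditionally been handled separately, hindering comprehensive interaction understanding.
Result: Accuracy improves by 4.9% on long-tailed HOI detection and interaction metrics by 42.0% on open-vocabulary generation, achieving state-of-the-art performance.
Insight: Jointly modeling the two dual HOI tasks makes more efficient use of knowledge and data, improving performance and generalization.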
Abstract: In the field of human-object interaction (HOI), detection and generation are two dual tasks that have traditionally been addressed separately, hindering the development of comprehensive interaction understanding. To address this, we propose UniHOI, which jointly models HOI detection and generation via a unified token space, thereby effectively promoting knowledge sharing and enhancing generalization. Specifically, we introduce a symmetric interaction-aware attention module and a unified semi-supervised learning paradigm, enabling effective bidirectional mapping between images and interaction semantics even under limited annotations. Extensive experiments demonstrate that UniHOI achieves state-of-the-art performance in both HOI detection and generation. Specifically, UniHOI improves accuracy by 4.9% on long-tailed HOI detection and boosts interaction metrics by 42.0% on open-vocabulary generation tasks.
[22] CellGenNet: A Knowledge-Distilled Framework for Robust Cell Segmentation in Cancer Tissues cs.CVPDF
Srijan Ray, Bikesh K. Nirala, Jason T. Yustein, Sundaresh Ram
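TL;DR: CellGenNet is a knowledge-distillation framework for robust cell segmentation in cancer tissues under limited supervision; its student-teacher architecture, combined with a hybrid loss and consistency regularization, improves segmentation accuracy and generalization.
Details
Motivation: Accurate nuclei segmentation in microscopy whole slide images (WSIs) remains challenging because of variability in staining, imaging conditions, and tissue morphology, especially when labeled data is limited.
Result: Experiments on WSIs from diverse cancer tissues show that CellGenNet outperforms supervised and semi-supervised baselines in segmentation accuracy and generalization.
Insight: By combining teacher-generated pseudo-labels with a hybrid loss function, CellGenNet mitigates class imbalance and preserves the detail of minority nuclear structures.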
Abstract: Accurate nuclei segmentation in microscopy whole slide images (WSIs) remains challenging due to variability in staining, imaging conditions, and tissue morphology. We propose CellGenNet, a knowledge distillation framework for robust cross-tissue cell segmentation under limited supervision. CellGenNet adopts a student-teacher architecture, where a capacity teacher is trained on sparse annotations and generates soft pseudo-labels for unlabeled regions. The student is optimized using a joint objective that integrates ground-truth labels, teacher-derived probabilistic targets, and a hybrid loss function combining binary cross-entropy and Tversky loss, enabling asymmetric penalties to mitigate class imbalance and better preserve minority nuclear structures. Consistency regularization and layerwise dropout further stabilize feature representations and promote reliable feature transfer. Experiments across diverse cancer tissue WSIs show that CellGenNet improves segmentation accuracy and generalization over supervised and semi-supervised baselines, supporting scalable and reproducible histopathology analysis.
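The hybrid loss mentioned in this abstract combines binary cross-entropy with Tversky loss. A minimal sketch of that combination follows, using the standard definitions of both terms; the mixing weight and the `alpha`/`beta` values are assumptions, not the paper's settings, and the teacher-student distillation and consistency terms are omitted.

```python
import math

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy; targets may be hard labels or soft pseudo-labels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)

def tversky_loss(probs, targets, alpha=0.3, beta=0.7, eps=1e-7):
    """1 - Tversky index; beta > alpha penalizes false negatives more than false positives."""
    tp = sum(p * t for p, t in zip(probs, targets))
    fp = sum(p * (1 - t) for p, t in zip(probs, targets))
    fn = sum((1 - p) * t for p, t in zip(probs, targets))
    return 1 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def hybrid_loss(probs, targets, weight=0.5):
    """Convex combination of the two terms; the 0.5 weight is an assumed default."""
    return weight * bce_loss(probs, targets) + (1 - weight) * tversky_loss(probs, targets)

targets = [1, 1, 0, 0, 0, 0]
missed_nucleus = [1, 0, 0, 0, 0, 0]   # one false negative
extra_detection = [1, 1, 1, 0, 0, 0]  # one false positive

print(tversky_loss(missed_nucleus, targets), tversky_loss(extra_detection, targets))
```

With `beta` larger than `alpha`, missing a foreground pixel costs more than adding a spurious one, which is the asymmetric penalty the abstract credits with preserving minority nuclear structures.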
[23] ProPL: Universal Semi-Supervised Ultrasound Image Segmentation via Prompt-Guided Pseudo-Labeling cs.CVPDF
Yaxiong Chen, Qicong Wang, Chunlei Li, Jingliang Hu, Yilei Shi
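TL;DR: ProPL is a universal semi-supervised ultrasound image segmentation framework that adapts flexibly to multiple organs and tasks through prompt-guided dual decoders and an uncertainty-driven pseudo-label calibration module.
Details
Motivation: Existing ultrasound segmentation methods are typically specialized for specific anatomical structures or tasks, limiting their clinical utility, so a universal solution is needed.
Result: ProPL outperforms state-of-the-art methods across various metrics, setting a new benchmark for universal ultrasound segmentation.
Insight: Prompting and pseudo-label calibration are key to improving semi-supervised learning.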
Abstract: Existing approaches for the problem of ultrasound image segmentation, whether supervised or semi-supervised, are typically specialized for specific anatomical structures or tasks, limiting their practical utility in clinical settings. In this paper, we pioneer the task of universal semi-supervised ultrasound image segmentation and propose ProPL, a framework that can handle multiple organs and segmentation tasks while leveraging both labeled and unlabeled data. At its core, ProPL employs a shared vision encoder coupled with prompt-guided dual decoders, enabling flexible task adaptation through a prompting-upon-decoding mechanism and reliable self-training via an uncertainty-driven pseudo-label calibration (UPLC) module. To facilitate research in this direction, we introduce a comprehensive ultrasound dataset spanning 5 organs and 8 segmentation tasks. Extensive experiments demonstrate that ProPL outperforms state-of-the-art methods across various metrics, establishing a new benchmark for universal ultrasound image segmentation.
[24] Evaluating Multimodal Large Language Models on Vertically Written Japanese Text cs.CV | cs.CLPDF
Keito Sasagawa, Shuhei Kurita, Daisuke Kawahara
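TL;DR: This paper evaluates multimodal large language models (MLLMs) on vertically written Japanese text and finds that existing models perform worse on it than on horizontal text. Using a synthetic OCR dataset for fine-tuning and evaluation plus a real-world evaluation dataset, it shows that training improves model capability.
Details
Motivation: Some Japanese documents are written vertically, yet research on MLLM support for vertical text is limited, so model performance on this task needs to be evaluated and improved.
Result: Existing MLLMs perform poorly on vertically written Japanese text, but training on the synthetic dataset improves their ability.
Insight: Recognizing vertical text is an important challenge in cross-lingual document understanding, and targeted training can effectively improve model performance.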
Abstract: Multimodal Large Language Models (MLLMs) have seen rapid advances in recent years and are now being applied to visual document understanding tasks. They are expected to process a wide range of document images across languages, including Japanese. Understanding documents from images requires models to read what is written in them. Since some Japanese documents are written vertically, support for vertical writing is essential. However, research specifically focused on vertically written Japanese text remains limited. In this study, we evaluate the reading capability of existing MLLMs on vertically written Japanese text. First, we generate a synthetic Japanese OCR dataset by rendering Japanese texts into images, and use it for both model fine-tuning and evaluation. This dataset includes Japanese text in both horizontal and vertical writing. We also create an evaluation dataset sourced from real-world document images containing vertically written Japanese text. Using these datasets, we demonstrate that the existing MLLMs perform worse on vertically written Japanese text than on horizontally written Japanese text. Furthermore, we show that training MLLMs on our synthesized Japanese OCR dataset improves the performance of models that previously could not handle vertical writing. The datasets and code are publicly available at https://github.com/llm-jp/eval_vertical_ja.
[25] Reasoning via Video: The First Evaluation of Video Models’ Reasoning Abilities through Maze-Solving Tasks cs.CV | cs.AIPDF
Cheng Yang, Haiyuan Wan, Yiran Peng, Xin Cheng, Zhaoyang Yu
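TL;DR: This paper explores a new paradigm of reasoning via video generation and introduces VR-Bench, a benchmark that evaluates video models' reasoning abilities through maze-solving tasks.
Details
Motivation: The progression from text generation to text-based reasoning prompts the question of whether video models can likewise reason via video generation. Video's explicit spatial layouts and temporal continuity make it an ideal substrate for spatial reasoning.
Result: Video models outperform leading vision-language models (VLMs) on spatial reasoning tasks and generalize well; diverse sampling at test time improves reasoning reliability by 10-20%.
Insight: With its explicit spatiotemporal structure, video is an ideal medium for spatial reasoning, and diverse test-time sampling is an effective way to improve the reasoning reliability of video models.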
Abstract: Video Models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Analogous to the development from text generation to text-based reasoning in language modeling, the development of video models motivates us to ask: Can video models reason via video generation? Compared with the discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity, which serves as an ideal substrate for spatial reasoning. In this work, we explore the reasoning via video paradigm and introduce VR-Bench – a comprehensive benchmark designed to systematically evaluate video models’ reasoning capabilities. Grounded in maze-solving tasks that inherently require spatial planning and multi-step reasoning, VR-Bench contains 7,920 procedurally generated videos across five maze types and diverse visual styles. Our empirical analysis demonstrates that SFT can efficiently elicit the reasoning ability of video models. Video models exhibit stronger spatial perception during reasoning, outperforming leading VLMs and generalizing well across diverse scenarios, tasks, and levels of complexity. We further discover a test-time scaling effect, where diverse sampling during inference improves reasoning reliability by 10–20%. These findings highlight the unique potential and scalability of reasoning via video for spatial reasoning tasks.
[26] MambaTrack3D: A State Space Model Framework for LiDAR-Based Object Tracking under High Temporal Variation cs.CVPDF
Shengjing Tian, Yinan Han, Xiantong Zhao, Xuehu Liu, Qi Lang
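TL;DR: MambaTrack3D is a state-space-model framework for LiDAR object tracking under high temporal variation. Its MIP and GFEM modules address the computational complexity and temporal redundancy of existing methods, and it performs strongly on both high-temporal-variation and standard datasets.
Details
Motivation: High temporal variation (HTV) in dynamic outdoor environments challenges 3D single object tracking in LiDAR point clouds. Existing methods suffer from high computational complexity, temporal redundancy, and insufficient use of geometric priors.
Result: On KITTI-HTV and nuScenes-HTV, MambaTrack3D outperforms other methods, improving on HVTrack by up to 6.5 in success and 9.5 in precision. It also remains highly competitive on the standard KITTI dataset.
Insight: MambaTrack3D demonstrates the potential of state space models for tracking under high temporal variation while maintaining efficiency and generalization.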
Abstract: Dynamic outdoor environments with high temporal variation (HTV) pose significant challenges for 3D single object tracking in LiDAR point clouds. Existing memory-based trackers often suffer from quadratic computational complexity, temporal redundancy, and insufficient exploitation of geometric priors. To address these issues, we propose MambaTrack3D, a novel HTV-oriented tracking framework built upon the state space model Mamba. Specifically, we design a Mamba-based Inter-frame Propagation (MIP) module that replaces conventional single-frame feature extraction with efficient inter-frame propagation, achieving near-linear complexity while explicitly modeling spatial relations across historical frames. Furthermore, a Grouped Feature Enhancement Module (GFEM) is introduced to separate foreground and background semantics at the channel level, thereby mitigating temporal redundancy in the memory bank. Extensive experiments on KITTI-HTV and nuScenes-HTV benchmarks demonstrate that MambaTrack3D consistently outperforms both HTV-oriented and normal-scenario trackers, achieving improvements of up to 6.5 success and 9.5 precision over HVTrack under moderate temporal gaps. On the standard KITTI dataset, MambaTrack3D remains highly competitive with state-of-the-art normal-scenario trackers, confirming its strong generalization ability. Overall, MambaTrack3D achieves a superior accuracy-efficiency trade-off, delivering robust performance across both specialized HTV and conventional tracking scenarios.
[27] TiCAL: Typicality-Based Consistency-Aware Learning for Multimodal Emotion Recognition cs.CVPDF
Wen Yin, Siyu Zhan, Cencen Liu, Xin Hu, Guiduo Duan
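TL;DR: TiCAL is a typicality-based, consistency-aware framework for multimodal emotion recognition that addresses inter-modal emotion conflicts; it embeds features in hyperbolic space and incorporates consistency estimates into learning to improve performance.
Details
Motivation: In multimodal emotion recognition, different modalities may express conflicting emotional tendencies, which existing methods overlook. TiCAL dynamically assesses per-sample consistency and uses typicality estimation to handle modality inconsistency.
Result: Evaluated on CMU-MOSEI and MER2023, TiCAL improves on the state-of-the-art DMD by about 2.6%.
Insight: Hyperbolic embeddings and dynamic consistency estimation help mitigate inter-modal emotion conflicts and improve recognition accuracy.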
Abstract: Multimodal Emotion Recognition (MER) aims to accurately identify human emotional states by integrating heterogeneous modalities such as visual, auditory, and textual data. Existing approaches predominantly rely on unified emotion labels to supervise model training, often overlooking a critical challenge: inter-modal emotion conflicts, wherein different modalities within the same sample may express divergent emotional tendencies. In this work, we address this overlooked issue by proposing a novel framework, Typicality-based Consistent-aware Multimodal Emotion Recognition (TiCAL), inspired by the stage-wise nature of human emotion perception. TiCAL dynamically assesses the consistency of each training sample by leveraging pseudo unimodal emotion labels alongside a typicality estimation. To further enhance emotion representation, we embed features in a hyperbolic space, enabling the capture of fine-grained distinctions among emotional categories. By incorporating consistency estimates into the learning process, our method improves model performance, particularly on samples exhibiting high modality inconsistency. Extensive experiments on benchmark datasets, e.g., CMU-MOSEI and MER2023, validate the effectiveness of TiCAL in mitigating inter-modal emotional conflicts and enhancing overall recognition accuracy, e.g., with about 2.6% improvements over the state-of-the-art DMD.
[28] A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models cs.CVPDF
Duo Li, Zuhao Yang, Xiaoqin Zhang, Ling Shao, Shijian Lu
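TL;DR: This paper studies visual token redundancy in discrete diffusion-based multimodal large language models (dMLLMs), showing how redundancy evolves across architectures and tasks and how visual token pruning affects responses and efficiency. Redundancy emerges only in from-scratch dMLLMs handling long-answer tasks, and the study identifies suitable acceleration strategies for each model type.
Details
Motivation: Existing dMLLMs incur significant inference overhead from full-sequence attention at each denoising step, and modality-agnostic optimizations overlook visual token redundancy. The paper takes a modality-specific view to study how redundancy relates to dMLLM efficiency and information loss.
Result: Redundancy appears only in from-scratch dMLLMs on long-answer tasks; pruning causes information loss, which only from-scratch dMLLMs progressively recover during late denoising steps; layer-skipping suits AR-to-diffusion dMLLMs, whereas progressive or late-step pruning suits from-scratch dMLLMs.
Insight: Efficiency optimization for dMLLMs must account for how the model was trained and the task at hand; pruning strategies should be chosen per model type.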
Abstract: Discrete diffusion-based multimodal large language models (dMLLMs) have emerged as a promising alternative to autoregressive MLLMs thanks to their advantages in parallel decoding and bidirectional context modeling, but most existing dMLLMs incur significant computational overhead during inference due to the full-sequence attention computation in each denoising step. Pioneering studies attempt to resolve this issue from a modality-agnostic perspective via key-value cache optimization or efficient sampling, but most of them overlook modality-specific visual token redundancy. In this work, we conduct a comprehensive study on how visual token redundancy evolves with different dMLLM architectures and tasks and how visual token pruning affects dMLLM responses and efficiency. Specifically, our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. In addition, we validate that visual token pruning introduces non-negligible information loss in dMLLMs and only from-scratch dMLLMs can recover the lost information progressively during late denoising steps. Furthermore, our study shows that layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs. Overall, this work offers a new perspective on efficiency optimization for dMLLMs, greatly advancing their applicability across various multimodal understanding tasks.
[29] Gaussian Blending: Rethinking Alpha Blending in 3D Gaussian Splatting cs.CVPDF
Junseo Koo, Jinseo Jeong, Gunhee Kim
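TL;DR: The paper proposes Gaussian Blending as a replacement for the alpha blending used in 3D Gaussian Splatting (3DGS), addressing the blurring and staircase artifacts that appear when zooming and improving rendering quality in novel view synthesis.
Details
Motivation: 3DGS methods show blurring and staircase artifacts when synthesizing views at sampling rates unseen during training; the authors speculate that this stems from a fundamental limitation of alpha blending.
Result: Experiments show that Gaussian Blending captures fine details at unseen sampling rates and outperforms existing novel view synthesis models.
Insight: Treating alpha and transmittance as spatially varying distributions is key to improving 3DGS rendering quality.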
Abstract: The recent introduction of 3D Gaussian Splatting (3DGS) has significantly advanced novel view synthesis. Several studies have further improved the rendering quality of 3DGS, yet they still exhibit noticeable visual discrepancies when synthesizing views at sampling rates unseen during training. Specifically, they suffer from (i) erosion-induced blurring artifacts when zooming in and (ii) dilation-induced staircase artifacts when zooming out. We speculate that these artifacts arise from the fundamental limitation of the alpha blending adopted in 3DGS methods. Instead of the conventional alpha blending that computes alpha and transmittance as scalar quantities over a pixel, we propose to replace it with our novel Gaussian Blending that treats alpha and transmittance as spatially varying distributions. Thus, transmittances can be updated considering the spatial distribution of alpha values across the pixel area, allowing nearby background splats to contribute to the final rendering. Our Gaussian Blending maintains real-time rendering speed and requires no additional memory cost, while being easily integrated as a drop-in replacement into existing 3DGS-based or other NVS frameworks. Extensive experiments demonstrate that Gaussian Blending effectively captures fine details at various sampling rates unseen during training, consistently outperforming existing novel view synthesis models across both unseen and seen sampling rates.
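For context, the conventional scheme this abstract argues against treats alpha and transmittance as one scalar per pixel. The sketch below shows that standard front-to-back alpha compositing for a single pixel with single-channel colors; the function name and sample values are illustrative, and the paper's Gaussian Blending, which makes these quantities vary spatially within the pixel, is not reproduced here.

```python
def alpha_blend(splats):
    """Front-to-back compositing for one pixel.

    splats: list of (color, alpha) pairs sorted from nearest to farthest.
    Returns the blended color and the remaining transmittance.
    """
    color = 0.0
    transmittance = 1.0  # fraction of light still unblocked
    for c, a in splats:
        color += transmittance * a * c  # contribution weighted by visibility
        transmittance *= (1.0 - a)      # each splat blocks a share of what is left
    return color, transmittance

# Hypothetical pixel covered by a bright near splat and a darker far splat
pixel_color, remaining = alpha_blend([(1.0, 0.5), (0.2, 0.5)])
print(pixel_color, remaining)
```

Because each splat contributes a single scalar alpha for the whole pixel, this formulation cannot express how coverage is distributed across the pixel area, which is the limitation the abstract speculates is behind the zoom artifacts.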
[30] An Event-triggered System for Social Persuasion and Danger Alert in Elder Home Monitoring cs.CV | cs.MMPDF
Jun-Yi Liu, Chung-Hao Chen, Ya-Chi Tsao, Ssu-Yao Wu, Yu-Ting Tsao
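TL;DR: The paper presents an event-triggered system for social persuasion and danger alerts in elder home monitoring, combining GMM background modeling with SVM-based machine learning.
Details
Motivation: Considering both the physical and mental health of elders, an intuitive system is needed to detect potential dangers and encourage social interaction.
Result: Experiments in the homes of 5 families detected and recorded three types of events from daily life.
Insight: Combining computer vision and machine learning can address elders' safety and social needs, and an intuitive mode of operation is essential.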
Abstract: In the study, the physical state and mental state of elders are both considered, and an event-triggered system has been developed to detect three events: watch dog, danger notice and photo link. By adopting GMM background modeling, the motion behavior of visitors and elders can be detected in the watch dog event and danger notice event respectively. Experiments were set in home scenarios, and 5 families participated in detecting and recording three types of events from their life activities. In addition, the captured images were analyzed using SVM machine learning. Because elders lack technical experience, an intuitive operation resembling a normal life activity was designed to create communication between elders and relatives via social media.
[31] Generating Natural-Language Surgical Feedback: From Structured Representation to Domain-Grounded Evaluation cs.CV | cs.AI | cs.CL | cs.LGPDF
Firdavs Nasriddinov, Rafal Kocielnik, Anima Anandkumar, Andrew J. Hung
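TL;DR: The paper proposes a structure-aware approach that mines Instrument-Action-Target (IAT) triplets from surgical training transcripts and uses them to guide GPT-4o in generating high-quality surgical feedback.
Details
Motivation: High-quality feedback during surgical training is pivotal for trainee skill acquisition, and automating natural-language feedback requires clinically relevant structured representations.
Result: The IAT structured representation improves feedback generation quality, with gains in both recognition AUC and fidelity scores.
Insight: Structured representations such as IAT improve the clinical fidelity and verifiability of generated feedback.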
Abstract: High-quality intraoperative feedback from a surgical trainer is pivotal for improving trainee performance and long-term skill acquisition. Automating natural, trainer-style feedback promises timely, accessible, and consistent guidance at scale but requires models that understand clinically relevant representations. We present a structure-aware pipeline that learns a surgical action ontology from real trainer-to-trainee transcripts (33 surgeries) and uses it to condition feedback generation. We contribute by (1) mining Instrument-Action-Target (IAT) triplets from real-world feedback text and clustering surface forms into normalized categories, (2) fine-tuning a video-to-IAT model that leverages the surgical procedure and task contexts as well as fine-grained temporal instrument motion, and (3) demonstrating how to effectively use IAT triplet representations to guide GPT-4o in generating clinically grounded, trainer-style feedback. We show that, on Task 1: Video-to-IAT recognition, our context injection and temporal tracking deliver consistent AUC gains (Instrument: 0.67 to 0.74; Action: 0.60 to 0.63; Tissue: 0.74 to 0.79). For Task 2: feedback text generation (rated on a 1-5 fidelity rubric where 1 = opposite/unsafe, 3 = admissible, and 5 = perfect match to a human trainer), GPT-4o from video alone scores 2.17, while IAT conditioning reaches 2.44 (+12.4%), doubling the share of admissible generations with score >= 3 from 21% to 42%. Traditional text-similarity metrics also improve: word error rate decreases by 15-31% and ROUGE (phrase/substring overlap) increases by 9-64%. Grounding generation in explicit IAT structure improves fidelity and yields clinician-verifiable rationales, supporting auditable use in surgical training.
[32] Unbiased Semantic Decoding with Vision Foundation Models for Few-shot Segmentation cs.CVPDF
Jin Wang, Bingfeng Zhang, Jian Pang, Weifeng Liu, Baodi Liu
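TL;DR: This paper proposes an Unbiased Semantic Decoding (USD) strategy integrated with SAM, which extracts target information from both the support and query sets and uses CLIP semantics as guidance to make consistent predictions. It comprises two feature enhancement strategies and a target prompt generator that strengthen SAM's unbiased semantic discrimination.
Details
Motivation: In few-shot segmentation, existing methods mainly extract prompts from the support set, which is insufficient to activate SAM's generalization ability and tends to bias decoding on unknown classes. The paper aims for unbiased decoding by combining CLIP semantics with multi-level feature enhancement.
Result: Without re-training the foundation models, the method uses semantically discriminative features and prompts rich in target information to focus on the target region for more accurate few-shot segmentation.
Insight: Combining CLIP's semantic capability with SAM's feature extraction, through multi-level prompt generation and feature enhancement, can mitigate bias in few-shot segmentation.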
Abstract: Few-shot segmentation has garnered significant attention. Many recent approaches attempt to introduce the Segment Anything Model (SAM) to handle this task. With the strong generalization ability and rich object-specific extraction ability of the SAM model, such a solution shows great potential in few-shot segmentation. However, the decoding process of SAM highly relies on accurate and explicit prompts, making previous approaches mainly focus on extracting prompts from the support set, which is insufficient to activate the generalization ability of SAM, and this design easily results in a biased decoding process when adapting to unknown classes. In this work, we propose an Unbiased Semantic Decoding (USD) strategy integrated with SAM, which extracts target information from both the support and query set simultaneously to perform consistent predictions guided by the semantics of the Contrastive Language-Image Pre-training (CLIP) model. Specifically, to enhance the unbiased semantic discrimination of SAM, we design two feature enhancement strategies that leverage the semantic alignment capability of CLIP to enrich the original SAM features, mainly including a global supplement at the image level to provide a generalized category indication from the support image and a local guidance at the pixel level to provide a useful target location from the query image. Besides, to generate target-focused prompt embeddings, a learnable visual-text target prompt generator is proposed by interacting target text embeddings and CLIP visual features. Without requiring re-training of the vision foundation models, the features with semantic discrimination draw attention to the target region through the guidance of prompts with rich target information.
[33] WaveFuse-AL: Cyclical and Performance-Adaptive Multi-Strategy Active Learning for Medical Images cs.CV | cs.LGPDF
Nishchala Thakur, Swati Kochhar, Deepti R. Bathula, Sukrit Gupta
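TL;DR: WaveFuse-AL is a framework that adaptively fuses multiple active learning strategies, using cyclical and performance-driven mechanisms to adjust strategy importance dynamically and improve annotation efficiency for medical images.
Details
Motivation: Annotating medical images is costly, and individual active learning strategies behave inconsistently across stages of learning. WaveFuse-AL addresses this through adaptive multi-strategy fusion.
Result: On three benchmarks (APTOS-2019, RSNA Pneumonia Detection, and ISIC-2018), it consistently outperforms single-strategy and alternating-strategy baselines, with statistically significant improvements on ten of twelve metric measurements.
Insight: Adaptive multi-strategy fusion improves active learning for medical images, particularly under limited annotation budgets.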
Abstract: Active learning reduces annotation costs in medical imaging by strategically selecting the most informative samples for labeling. However, individual acquisition strategies often exhibit inconsistent behavior across different stages of the active learning cycle. We propose Cyclical and Performance-Adaptive Multi-Strategy Active Learning (WaveFuse-AL), a novel framework that adaptively fuses multiple established acquisition strategies (BALD, BADGE, Entropy, and CoreSet) throughout the learning process. WaveFuse-AL integrates cyclical (sinusoidal) temporal priors with performance-driven adaptation to dynamically adjust strategy importance over time. We evaluate WaveFuse-AL on three medical imaging benchmarks: APTOS-2019 (multi-class classification), RSNA Pneumonia Detection (binary classification), and ISIC-2018 (skin lesion segmentation). Experimental results demonstrate that WaveFuse-AL consistently outperforms both single-strategy and alternating-strategy baselines, achieving statistically significant performance improvements (on ten out of twelve metric measurements) while maximizing the utility of limited annotation budgets.
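One way to picture the fusion of a sinusoidal prior with performance-driven adaptation is the sketch below. Every formula in it is an assumption made for illustration: the phase-shifted sine prior, the softmax-style performance term, the `period` and `temperature` values, and the weighted-sum fusion are not taken from the paper, which the abstract does not specify at this level of detail.

```python
import math

STRATEGIES = ["BALD", "BADGE", "Entropy", "CoreSet"]

def cyclical_prior(round_idx, period=8):
    """Assumed prior: one phase-shifted sinusoid per strategy, kept strictly positive."""
    n = len(STRATEGIES)
    return [1.0 + math.sin(2 * math.pi * (round_idx / period + k / n)) + 1e-3
            for k in range(n)]

def fused_weights(round_idx, recent_gains, temperature=0.1):
    """Multiply the prior by an exponential of each strategy's recent gain, then normalize."""
    prior = cyclical_prior(round_idx)
    perf = [math.exp(g / temperature) for g in recent_gains]
    raw = [p * q for p, q in zip(prior, perf)]
    total = sum(raw)
    return [r / total for r in raw]

def fuse_scores(per_strategy_scores, weights):
    """Weighted sum of per-sample scores; each strategy's scores are assumed to lie in [0, 1]."""
    n_samples = len(per_strategy_scores[0])
    return [sum(w * s[i] for w, s in zip(weights, per_strategy_scores))
            for i in range(n_samples)]

# Hypothetical acquisition scores for three unlabeled samples, one row per strategy
scores = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.5], [0.4, 0.4, 0.9], [0.1, 0.7, 0.3]]
weights = fused_weights(round_idx=0, recent_gains=[0.0, 0.0, 0.0, 0.0])
fused = fuse_scores(scores, weights)
to_label = max(range(len(fused)), key=lambda i: fused[i])
print(weights, fused, to_label)
```

In this toy version the prior alone decides the weights when all strategies have performed equally, and the performance term shifts weight toward strategies whose recent selections helped more.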
[34] Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance cs.CVPDF
Songze Li, Mingyu Gao, Tonghua Su, Xu-Yao Zhang, Zhongjie Wang
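TL;DR: The paper proposes a method that tackles catastrophic forgetting in multimodal continual instruction tuning through dynamic gradient guidance, using the geometry of parameter space to approximate the missing gradients of old tasks.
Details
Motivation: Catastrophic forgetting, where learning new tasks degrades performance on previous ones, limits multimodal continual learning.
Result: The method achieves state-of-the-art performance on multimodal continual instruction tuning datasets without model expansion, while mitigating catastrophic forgetting.
Insight: Catastrophic forgetting can be viewed as the absence of old-task gradients during new-task learning, which geometric approximation and dynamic regulation can address.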
Abstract: Multimodal continual instruction tuning enables multimodal large language models to sequentially adapt to new tasks while building upon previously acquired knowledge. However, this continual learning paradigm faces the significant challenge of catastrophic forgetting, where learning new tasks leads to performance degradation on previous ones. In this paper, we introduce a novel insight into catastrophic forgetting by conceptualizing it as a problem of missing gradients from old tasks during new task learning. Our approach approximates these missing gradients by leveraging the geometric properties of the parameter space, specifically using the directional vector between current parameters and previously optimal parameters as gradient guidance. This approximated gradient can be further integrated with real gradients from a limited replay buffer and regulated by a Bernoulli sampling strategy that dynamically balances model stability and plasticity. Extensive experiments on multimodal continual instruction tuning datasets demonstrate that our method achieves state-of-the-art performance without model expansion, effectively mitigating catastrophic forgetting while maintaining a compact architecture.
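The idea of using the direction between current and previously optimal parameters as a stand-in for the missing old-task gradient can be sketched as a single update step. This is a toy version under stated assumptions: the unscaled difference `theta - theta_old`, the plain SGD update, and the `guidance_prob` parameter are illustrative choices, and the replay-buffer gradients mentioned in the abstract are left out.

```python
import random

def guided_step(params, new_task_grad, old_optimum, lr=0.1, guidance_prob=0.5, rng=random):
    """One SGD step on the new task, optionally adding an approximated old-task gradient.

    The approximation (theta - theta_old) points away from the old optimum, so
    descending along it pulls the parameters back toward the old solution.
    """
    use_guidance = rng.random() < guidance_prob  # Bernoulli draw: stability vs. plasticity
    updated = []
    for theta, g_new, theta_old in zip(params, new_task_grad, old_optimum):
        g_old = (theta - theta_old) if use_guidance else 0.0
        updated.append(theta - lr * (g_new + g_old))
    return updated

params = [1.0, -2.0]
old_optimum = [0.0, 0.0]
zero_grad = [0.0, 0.0]  # isolate the effect of the guidance term

always_guided = guided_step(params, zero_grad, old_optimum, guidance_prob=1.0)
never_guided = guided_step(params, zero_grad, old_optimum, guidance_prob=0.0)
print(always_guided, never_guided)
```

A higher `guidance_prob` favors stability on old tasks, while a lower one leaves more steps driven purely by the new-task gradient.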
[35] When to Think and When to Look: Uncertainty-Guided Lookback cs.CV | cs.CLPDF
Jing Bi, Filippos Bellos, Junjia Guo, Yayuan Li, Chao Huang
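TL;DR: This paper presents the first systematic analysis of how thinking (generating explicit intermediate reasoning chains) affects visual reasoning in large vision language models (LVLMs). Thinking does not always help: overly long chains can ignore the image and degrade performance. Building on this, the authors propose uncertainty-guided lookback, which combines an uncertainty signal with adaptive lookback prompts during decoding to improve performance.
Details
Motivation: Although intermediate reasoning chains boost performance in large language models and LVLMs, there has been no systematic study of how they affect visual reasoning. The paper fills this gap by examining the actual effect of thinking on visual tasks.
Result: The method improves overall MMMU performance, with the largest gains in categories where standard thinking is weak; it outperforms several strong decoding baselines and sets a new state of the art under fixed model families and token budgets. It also yields consistent improvements on five additional benchmarks.
Insight: Thinking is not always beneficial; what matters is adjusting model behavior dynamically so that visual information is not ignored. Short lookback phrases combined with an uncertainty signal are an effective way to improve visual grounding.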
Abstract: Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi pass decoding. We show that more thinking is not always better; long chains often yield long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty guided lookback, a training free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math focused visual reasoning datasets.
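A minimal sketch of the triggering logic might look like the following. It assumes, purely for illustration, that the uncertainty signal is the entropy of the next-token distribution and that a fixed phrase is injected when it exceeds a threshold; the phrase, the threshold, and the function names are hypothetical, and the breadth search described in the abstract is not shown.

```python
import math

LOOKBACK_PROMPT = "Looking back at the image,"  # hypothetical lookback phrase

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def maybe_insert_lookback(generated, next_token_probs, threshold=1.0):
    """Append the lookback phrase when the model is uncertain about the next token."""
    if token_entropy(next_token_probs) > threshold:
        return generated + [LOOKBACK_PROMPT]
    return generated

confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]

trace = ["The", "answer", "depends", "on"]
print(maybe_insert_lookback(trace, confident))
print(maybe_insert_lookback(trace, uncertain))
```

Because the check runs only at decoding time, a scheme of this shape needs no additional training, consistent with the training-free framing in the abstract.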
[36] MMCM: Multimodality-aware Metric using Clustering-based Modes for Probabilistic Human Motion Prediction cs.CV
Kyotaro Tokoro, Hiromu Taketsugu, Norimichi Ukita
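TL;DR: This paper proposes MMCM, a new metric for probabilistic human motion prediction (HMP) that uses clustering-based modes to evaluate both the coverage and the validity of predicted motions, addressing shortcomings of existing metrics.
Details
Motivation: Existing HMP metrics typically reward only how widely the predicted motions are spread, neglecting whether they cover multiple modes and whether they are kinematically valid. MMCM aims to fill this gap.
Result: Experiments validate that the clustering yields sensible mode definitions and that MMCM accurately scores multimodal predictions.
Insight: Defining modes explicitly, and realizing them through clustering, improves the quality of evaluation for multimodal motion prediction.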
Abstract: This paper proposes a novel metric for Human Motion Prediction (HMP). Since a single past sequence can lead to multiple possible futures, a probabilistic HMP method predicts such multiple motions. While a single motion predicted by a deterministic method is evaluated only with the difference from its ground truth motion, multiple predicted motions should also be evaluated based on their distribution. For this evaluation, this paper focuses on the following two criteria. (a) Coverage: motions should be distributed among multiple motion modes to cover diverse possibilities. (b) Validity: motions should be kinematically valid as future motions observable from a given past motion. However, existing metrics simply appreciate widely distributed motions even if these motions are observed in a single mode and kinematically invalid. To resolve these disadvantages, this paper proposes a Multimodality-aware Metric using Clustering-based Modes (MMCM). For (a) coverage, MMCM divides a motion space into several clusters, each of which is regarded as a mode. These modes are used to explicitly evaluate whether predicted motions are distributed among multiple modes. For (b) validity, MMCM identifies valid modes by collecting possible future motions from a motion dataset. Our experiments validate that our clustering yields sensible mode definitions and that MMCM accurately scores multimodal predictions. Code: https://github.com/placerkyo/MMCM
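A rough sketch of mode-based coverage as the abstract above describes it: predictions are assigned to the nearest cluster (mode), and the score counts how many valid modes are hit. The scoring formula, the centroids, and the valid-mode set here are illustrative assumptions, not the MMCM definition from the paper or its repository.

```python
def nearest_mode(motion, centroids):
    """Index of the centroid closest (squared Euclidean) to a flattened motion vector."""
    dists = [sum((m - c) ** 2 for m, c in zip(motion, cen)) for cen in centroids]
    return dists.index(min(dists))

def mode_coverage(predictions, centroids, valid_modes):
    """Fraction of valid modes hit by at least one prediction (assumed scoring rule)."""
    hit = {nearest_mode(p, centroids) for p in predictions}
    return len(hit & valid_modes) / len(valid_modes)

# Hypothetical 2-D "motion space" with three modes, of which only 0 and 1 are valid futures.
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
valid_modes = {0, 1}

spread_in_one_mode = [(0.1, 0.2), (-0.3, 0.1), (0.4, -0.2)]
spread_across_modes = [(0.1, 0.2), (4.8, 5.1), (9.9, 0.3)]

score_single = mode_coverage(spread_in_one_mode, centroids, valid_modes)
score_multi = mode_coverage(spread_across_modes, centroids, valid_modes)
```

Under this toy rule, predictions clustered in one mode cover half of the valid modes, while predictions spread across modes cover all of them; the prediction landing in the invalid mode earns no credit.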
[37] VisPlay: Self-Evolving Vision-Language Models from Images cs.CV | cs.AI | cs.CL | cs.LG
Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, Yonghui Yang
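TL;DR: VisPlay is a self-evolving reinforcement learning framework for vision-language models that improves reasoning ability without human annotation.
Details
Motivation: Existing RL approaches rely on human-annotated labels or task-specific heuristics to define rewards, which is costly and hard to scale. VisPlay aims to address this with a self-evolving mechanism.
Result: Trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks.
Insight: VisPlay demonstrates a scalable path toward self-evolving multimodal intelligence and offers a new direction for reinforcement learning without labeled data.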
Abstract: Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model into two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
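The abstract above trains both roles with GRPO. As a reminder of the core idea, the sketch below computes group-relative advantages by standardizing rewards within a group of responses sampled for the same prompt. It uses the population standard deviation and a small `eps`; those choices, and the example rewards, are assumptions rather than details from VisPlay.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four responses to the same generated question.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Responses that beat their group's average receive positive advantages and the rest negative ones, so no separate value network is needed.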
[38] Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset cs.CV
Geon Choi, Hangyul Yoon, Hyunju Shin, Hyunki Park, Sang Hoon Seo
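TL;DR: This paper proposes instruction-guided lesion segmentation (ILS), a new paradigm that addresses two limitations of existing chest X-ray lesion segmentation models: reliance on long, expert-level text inputs and a small number of target labels. Using the automatically generated large-scale dataset MIMIC-ILS and the fine-tuned vision-language model ROSALIA, it segments diverse lesion types from simple instructions.
Details
Motivation: Current chest X-ray lesion segmentation models see limited use, mainly because they depend on complex text inputs and cover few target labels. The paper proposes a new paradigm that simplifies user input and broadens segmentation capability.
Result: ROSALIA achieves high segmentation and textual accuracy on the newly proposed task, supporting the value of MIMIC-ILS as a foundational resource for pixel-level lesion grounding.
Insight: Automatically generating large-scale annotated datasets is an effective way to improve medical image segmentation, and models driven by simple instructions are more practical.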
Abstract: The applicability of current lesion segmentation models for chest X-rays (CXRs) has been limited both by a small number of target labels and the reliance on long, detailed expert-level text inputs, creating a barrier to practical use. To address these limitations, we introduce a new paradigm: instruction-guided lesion segmentation (ILS), which is designed to segment diverse lesion types based on simple, user-friendly instructions. Under this paradigm, we construct MIMIC-ILS, the first large-scale instruction-answer dataset for CXR lesion segmentation, using our fully automated multimodal pipeline that generates annotations from chest X-ray images and their corresponding reports. MIMIC-ILS contains 1.1M instruction-answer pairs derived from 192K images and 91K unique segmentation masks, covering seven major lesion types. To empirically demonstrate its utility, we introduce ROSALIA, a vision-language model fine-tuned on MIMIC-ILS. ROSALIA can segment diverse lesions and provide textual explanations in response to user instructions. The model achieves high segmentation and textual accuracy in our newly proposed task, highlighting the effectiveness of our pipeline and the value of MIMIC-ILS as a foundational resource for pixel-level CXR lesion grounding.
[39] MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping cs.CV | cs.CL
Yushi Huang, Zining Wang, Zhihang Yuan, Yifu Ding, Ruihao Gong
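TL;DR: MoDES is a training-free framework that accelerates inference of Mixture-of-Experts (MoE) multimodal large language models (MLLMs) by dynamically skipping experts. It addresses the performance degradation that existing methods suffer on multimodal tasks, improving both efficiency and accuracy.
Details
Motivation: Existing expert-skipping methods were designed for unimodal large language models, and applying them directly to multimodal tasks causes considerable performance degradation. The authors find that this is because these methods ignore the heterogeneous contributions of experts across MoE layers and the modality-specific behavior of tokens.
Result: MoDES far outperforms previous approaches across 13 benchmarks on 3 model series. For example, when skipping 88% of experts on Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67% (97.33% vs. 86.66%). It also speeds up inference, improving prefilling time by 2.16× and decoding time by 1.26×.
Insight: Dynamic expert skipping in MoE layers for multimodal tasks must account for modality specificity; combining global and local information is key to estimating expert importance accurately.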
Abstract: Mixture-of-Experts (MoE) multimodal large language models (MLLMs) excel at vision-language tasks, but they suffer from high computational inefficiency. To reduce inference overhead, expert skipping methods have been proposed to deactivate redundant experts based on the current input tokens. However, we find that applying these methods, originally designed for unimodal large language models (LLMs), to MLLMs results in considerable performance degradation. This is primarily because such methods fail to account for the heterogeneous contributions of experts across MoE layers and modality-specific behaviors of tokens within these layers. Motivated by these findings, we propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. It incorporates a globally-modulated local gating (GMLG) mechanism that integrates global layer-wise importance into local routing probabilities to accurately estimate per-token expert importance. A dual-modality thresholding (DMT) method is then applied, which processes tokens from each modality separately, to derive the skipping schedule. To set the optimal thresholds, we introduce a frontier search algorithm that exploits monotonicity properties, cutting convergence time from several days to a few hours. Extensive experiments for 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches. For instance, when skipping 88% of experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67% (97.33% vs. 86.66%). Furthermore, MoDES significantly enhances inference speed, improving the prefilling time by 2.16× and the decoding time by 1.26×.
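To make the two mechanisms in the abstract above concrete, here is a minimal sketch of one plausible reading: per-token expert importance is the local routing probability scaled by a global layer-wise weight (GMLG), and an expert is skipped when that importance falls below a threshold chosen per modality (DMT). The multiplicative form, the threshold values, and all names are assumptions for illustration only.

```python
def experts_to_keep(routing_probs, layer_importance, modality, thresholds):
    """Return the routed experts kept for one token in one MoE layer.

    routing_probs: dict of expert index -> router probability for this token.
    layer_importance: global weight of this layer (assumed to scale the probabilities).
    thresholds: dict of modality name -> skipping threshold (hypothetical values).
    """
    tau = thresholds[modality]
    return [e for e, p in routing_probs.items() if layer_importance * p >= tau]

thresholds = {"text": 0.10, "vision": 0.25}   # hypothetical per-modality thresholds
routed = {3: 0.55, 7: 0.30, 12: 0.15}         # top-k experts and routing probabilities
layer_importance = 0.8

kept_text = experts_to_keep(routed, layer_importance, "text", thresholds)
kept_vision = experts_to_keep(routed, layer_importance, "vision", thresholds)
```

With these toy numbers a text token keeps all three routed experts, while a vision token keeps only the strongest one, which shows how separate thresholds let the two modalities be pruned at different rates.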
[40] BrainRotViT: Transformer-ResNet Hybrid for Explainable Modeling of Brain Aging from 3D sMRI cs.CV | cs.LG
Wasif Jalal, Md Nafiu Rahman, M. Sohel Rahman
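TL;DR: BrainRotViT is a hybrid architecture combining a Vision Transformer (ViT) with a residual CNN to model brain aging from 3D sMRI. By pairing a ViT encoder with a ResNet regressor, it delivers efficient and interpretable brain-age prediction and performs well across multiple datasets.
Details
Motivation: Traditional regression and CNN-based methods for brain-age prediction face manual feature engineering, limited receptive fields, and overfitting on heterogeneous data. Pure transformer models require large datasets and high computational cost. The authors propose a hybrid architecture that combines global context modeling with local refinement as an efficient solution.
Result: Validated across 11 MRI datasets, the model achieves an MAE of 3.34 years (Pearson r = 0.98). It generalizes well to 4 independent cohorts (MAEs between 3.77 and 5.04 years). Attention maps highlight aging-associated brain regions.
Insight: 1. The hybrid architecture combines the global modeling of transformers with the local refinement of CNNs; 2. Pretraining on an auxiliary task improves feature learning; 3. Model interpretability helps in understanding brain-aging patterns and their relation to neurodegenerative disease.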
Abstract: Accurate brain age estimation from structural MRI is a valuable biomarker for studying aging and neurodegeneration. Traditional regression and CNN-based methods face limitations such as manual feature engineering, limited receptive fields, and overfitting on heterogeneous data. Pure transformer models, while effective, require large datasets and high computational cost. We propose Brain ResNet over trained Vision Transformer (BrainRotViT), a hybrid architecture that combines the global context modeling of vision transformers (ViT) with the local refinement of residual CNNs. A ViT encoder is first trained on an auxiliary age and sex classification task to learn slice-level features. The frozen encoder is then applied to all sagittal slices to generate a 2D matrix of embedding vectors, which is fed into a residual CNN regressor that incorporates subject sex at the final fully-connected layer to estimate continuous brain age. Our method achieves an MAE of 3.34 years (Pearson r = 0.98, Spearman ρ = 0.97, R² = 0.95) on validation across 11 MRI datasets encompassing more than 130 acquisition sites, outperforming baseline and state-of-the-art models. It also generalizes well across 4 independent cohorts with MAEs between 3.77 and 5.04 years. Analyses on the brain age gap (the difference between the predicted age and actual age) show that aging patterns are associated with Alzheimer’s disease, cognitive impairment, and autism spectrum disorder. Model attention maps highlight aging-associated regions of the brain, notably the cerebellar vermis, precentral and postcentral gyri, temporal lobes, and medial superior frontal gyrus. Our results demonstrate that this method provides an efficient, interpretable, and generalizable framework for brain-age prediction, bridging the gap between CNN- and transformer-based approaches while opening new avenues for aging and neurodegeneration research.
[41] Think Visually, Reason Textually: Vision-Language Synergy in ARC cs.CV | cs.AI | cs.CL
Beichen Zhang, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan
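TL;DR: The paper proposes two strategies that combine vision and language (VLSR and MSSC) for abstract reasoning on ARC-AGI, and shows that vision and language are complementary in multimodal reasoning, with improved performance.
Details
Motivation: Existing methods treat ARC-AGI as a purely textual reasoning task and overlook the importance of visual abstraction, yet naively rendering the grids as images degrades performance. The authors therefore explore the complementary roles of vision and language in reasoning.
Result: Experiments show up to a 4.33% improvement over text-only baselines across multiple models and ARC-AGI tasks.
Insight: Vision and language each have strengths in multimodal reasoning: vision excels at pattern abstraction, while language excels at symbolic formulation and precise execution. Combining the two is an important step toward general artificial intelligence.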
Abstract: Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code will be released soon.
[42] Towards Unbiased Cross-Modal Representation Learning for Food Image-to-Recipe Retrieval cs.CV | cs.MM
Qing Wang, Chong-Wah Ngo, Ee-Peng Lim
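TL;DR: This paper uses causal theory to address representation-learning bias in cross-modal retrieval between recipes and food images, proposes a new debiasing method, and reports the best retrieval performance on the Recipe1M dataset.
Details
Motivation: Existing methods treat a recipe as a text source describing a dish's visual appearance, ignoring that details are lost due to the cooking process, dish presentation, and image-capturing conditions. This biases image-recipe similarity judgments.
Result: On Recipe1M, the theory-informed formulation reaches an oracle retrieval performance of MedR=1 at test sizes of 1K, 10K, and 50K, and the proposed plug-and-play module sets new state-of-the-art retrieval results.
Insight: Causal theory gives a principled account of bias in cross-modal representation learning, and a backdoor adjustment can alleviate the bias and improve retrieval performance.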
Abstract: This paper addresses the challenges of learning representations for recipes and food images in the cross-modal retrieval problem. As the relationship between a recipe and its cooked dish is cause-and-effect, treating a recipe as a text source that describes the visual appearance of a dish when learning representations, as existing approaches do, creates a bias that misleads image-and-recipe similarity judgment. Specifically, a food image may not equally capture every detail in a recipe, due to factors such as the cooking process, dish presentation, and image-capturing conditions. Current representation learning tends to capture dominant visual-text alignment while overlooking subtle variations that determine retrieval relevance. In this paper, we model such bias in cross-modal representation learning using causal theory. The causal view of this problem suggests ingredients as one of the confounder sources, and a simple backdoor adjustment can alleviate the bias. By causal intervention, we reformulate the conventional model for food-to-recipe retrieval with an additional term to remove the potential bias in similarity judgment. Based on this theory-informed formulation, we empirically show that the oracle retrieval performance on the Recipe1M dataset is MedR=1 across testing data sizes of 1K, 10K, and even 50K. We also propose a plug-and-play neural module, which is essentially a multi-label ingredient classifier for debiasing. New state-of-the-art search performances are reported on the Recipe1M dataset.
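For reference, the backdoor adjustment mentioned in the abstract above has the standard form below. The symbol assignment is an assumption made here for illustration: X stands for the image-recipe pair, Y for the similarity judgment, and z ranges over values of the confounder (ingredients, per the abstract). The paper's own notation and extra term may differ.

```latex
P\bigl(Y \mid \mathrm{do}(X)\bigr) \;=\; \sum_{z} P\bigl(Y \mid X, z\bigr)\, P(z)
```

Compared with the ordinary conditional P(Y | X), the adjustment averages over the confounder with its prior P(z) rather than with P(z | X), which is what removes the spurious path through z.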
[43] Physics-Based Benchmarking Metrics for Multimodal Synthetic Images cs.CV | cs.AI
Kishor Datta Gupta, Marufa Kamal, Md. Mahfuzur Rahman, Fahad Rahman, Mohd Ariful Haque
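TL;DR: This paper proposes PCMDE, a new evaluation metric for multimodal synthetic images that combines physical constraints with large language models to compensate for the weaknesses of existing metrics in semantic and structural accuracy.
Details
Motivation: Existing metrics (such as BLEU and CIDEr) often fail to capture the semantic and structural accuracy of multimodal synthetic images, especially in domain-specific or context-dependent scenarios.
Result: PCMDE is presented as capturing semantic and structural accuracy better than existing metrics.
Insight: Combining physical constraints with large language models can provide a more comprehensive approach to evaluating multimodal data.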
Abstract: Current state-of-the-art measures such as BLEU, CIDEr, VQA score, SigLIP-2, and CLIPScore are often unable to capture semantic or structural accuracy, especially in domain-specific or context-dependent scenarios. To address this, this paper proposes a Physics-Constrained Multimodal Data Evaluation (PCMDE) metric that combines large language models with reasoning, knowledge-based mapping, and vision-language models to overcome these limitations. The architecture comprises three main stages: (1) feature extraction of spatial and semantic information with multimodal features through object detection and VLMs; (2) Confidence-Weighted Component Fusion for adaptive component-level validation; and (3) physics-guided reasoning using large language models to enforce structural and relational constraints (e.g., alignment, position, consistency).
[44] SkinGPT-R1: Adapter-Only Dual Distillation for Efficient Dermatology Reasoning cs.CV
Yuhao Shen, Jiahe Qian, Zhangtianyi Chen, Yuanhao He, Juexiao Zhou
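TL;DR: SkinGPT-R1 is a dermatology-focused vision-language model that makes diagnoses through explicit, step-by-step reasoning chains. The paper introduces the DermCoT corpus and the DermBench benchmark and improves dermatologic reasoning performance.
Details
Motivation: Dermatologic diagnosis requires complex reasoning, and existing models lack standardized, verifiable reasoning chains. This paper aims to fill that gap.
Result: On DermBench, SkinGPT-R1 ranks first with an average score of 4.031 out of 5, about 41% higher than Vision-R1. It also delivers stable accuracy gains over Vision-R1 on three dermatology classification benchmarks.
Insight: Standardized chain-of-thought supervision and visual distillation are key to improving dermatology diagnostic models.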
Abstract: We present SkinGPT-R1, a dermatology focused vision language model that makes diagnostic chain of thought reasoning explicit, step by step, and verifiable. To support skin specific reasoning, we build DermCoT, a corpus of standardized dermatologic chain of thought narratives that combines 10,000 DermEval filtered training cases with 3,000 dermatologist scored certified cases, and we define DermEval as a physician aligned six dimensional evaluator and DermBench as the corresponding benchmark for dermatologic chain of thought quality. On DermBench, across 14 general, reasoning, and medical vision language models, SkinGPT-R1 achieves an average score of 4.031 out of 5 over the six clinician defined dimensions, ranks 1st among all systems, and improves the average score over Vision-R1 by about 41%. On three dermatology classification benchmarks, SkinGPT-R1 delivers stable accuracy gains over Vision-R1 and remains competitive among strong vision language models. Ablation results further show that DermCoT based chain of thought supervision provides substantial improvements over the base model and that adding dermatology aware visual distillation yields consistent additional gains in both narrative quality and recognition.
[45] Graph Query Networks for Object Detection with Automotive Radar cs.CV | cs.LG
Loveneet Saini, Hasan Tercan, Tobias Meisen
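TL;DR: The paper proposes Graph Query Networks (GQN), a graph-based attention framework that addresses the challenge that sparse, irregular radar reflections pose for 3D object detection, and improves detection performance.
Details
Motivation: Radar's long wavelengths produce sparse and irregular reflections that traditional grid- and sequence-based convolutional and transformer detectors handle poorly.
Result: On NuScenes, GQN improves relative mAP by up to 53%, including a +8.2% gain over the strongest prior radar method, and cuts peak graph-construction overhead by 80% at a moderate FLOPs cost.
Insight: Modeling each radar object individually through graph queries and attention can substantially improve detection on sparse, irregular data.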
Abstract: Object detection with 3D radar is essential for 360-degree automotive perception, but radar’s long wavelengths produce sparse and irregular reflections that challenge traditional grid and sequence-based convolutional and transformer detectors. This paper introduces Graph Query Networks (GQN), an attention-based framework that models objects sensed by radar as graphs, to extract individualized relational and contextual features. GQN employs a novel concept of graph queries to dynamically attend over the bird’s-eye view (BEV) space, constructing object-specific graphs processed by two novel modules: EdgeFocus for relational reasoning and DeepContext Pooling for contextual aggregation. On the NuScenes dataset, GQN improves relative mAP by up to +53%, including a +8.2% gain over the strongest prior radar method, while reducing peak graph construction overhead by 80% with moderate FLOPs cost.
[46] Edge-Centric Relational Reasoning for 3D Scene Graph Prediction cs.CV
Yanni Ma, Hao Liu, Yulan Guo, Theo Gevers, Martin R. Oswald
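TL;DR: LEO is an edge-centric relational reasoning framework that treats relations as nodes and builds a line graph to capture high-order relational dependencies, improving the accuracy of 3D scene graph prediction.
Details
Motivation: Existing methods typically adopt object-centric graph neural networks, which restrict relation representations to pairwise object context and make it hard to capture the high-order dependencies needed to predict relations in complex 3D scenes.
Result: Experiments on the 3DSSG dataset with two competitive baselines show consistent improvements in relation prediction; the framework can be integrated with existing object-centric methods.
Insight: Treating relations as nodes in a line graph is an effective way to perform high-order relational reasoning, and the edge-centric design compensates for the limits of object-centric methods.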
Abstract: 3D scene graph prediction aims to abstract complex 3D environments into structured graphs consisting of objects and their pairwise relationships. Existing approaches typically adopt object-centric graph neural networks, where relation edge features are iteratively updated by aggregating messages from connected object nodes. However, this design inherently restricts relation representations to pairwise object context, making it difficult to capture high-order relational dependencies that are essential for accurate relation prediction. To address this limitation, we propose a Link-guided Edge-centric relational reasoning framework with Object-aware fusion, namely LEO, which enables progressive reasoning from relation-level context to object-level understanding. Specifically, LEO first predicts potential links between object pairs to suppress irrelevant edges, and then transforms the original scene graph into a line graph where each relation is treated as a node. A line graph neural network is applied to perform edge-centric relational reasoning to capture inter-relation context. The enriched relation features are subsequently integrated into the original object-centric graph to enhance object-level reasoning and improve relation prediction. Our framework is model-agnostic and can be integrated with any existing object-centric method. Experiments on the 3DSSG dataset with two competitive baselines show consistent improvements, highlighting the effectiveness of our edge-to-object reasoning paradigm.
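The line-graph transformation described in the abstract above can be sketched in a few lines: every relation (edge) becomes a node, and two relation nodes are linked when they share an object. This is the generic undirected construction with made-up objects; whether LEO uses a directed or weighted variant is not stated in the abstract, so treat those details as assumptions.

```python
from itertools import combinations

def line_graph(relations):
    """Build line-graph adjacency over relations given as (subject, object) pairs.

    Node i of the result is relations[i]; nodes i and j are adjacent when the
    two relations share at least one object.
    """
    adj = {i: set() for i in range(len(relations))}
    for i, j in combinations(range(len(relations)), 2):
        if set(relations[i]) & set(relations[j]):
            adj[i].add(j)
            adj[j].add(i)
    return adj

# Hypothetical candidate relations that survived link prediction.
relations = [
    ("chair", "floor"),
    ("table", "floor"),
    ("lamp", "table"),
    ("picture", "wall"),
]
lg = line_graph(relations)
```

Message passing on `lg` lets the chair-floor relation exchange context with table-floor directly, which an object-centric graph can only do indirectly through the shared `floor` node.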
[47] Text2Loc++: Generalizing 3D Point Cloud Localization from Natural Language cs.CV
Yan Xia, Letian Shi, Yilin Di, Joao F. Henriques, Daniel Cremers
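TL;DR: Text2Loc++ is a neural network that localizes 3D point cloud submaps from natural language descriptions, using a coarse-to-fine pipeline to align language and point clouds across modalities.
Details
Motivation: To solve cross-modal alignment between complex, diverse natural language descriptions and 3D point cloud localization, addressing the shortcomings of existing methods on diverse urban environments and complex linguistic expressions.
Result: On KITTI360Pose, Text2Loc++ outperforms existing methods by up to 15%, and it generalizes robustly to the new dataset.
Insight: 1. Removing explicit text-instance matching improves efficiency; 2. Masked Instance Training (MIT) and hierarchical contrastive learning aid cross-modal alignment; 3. A lightweight framework performs well in the fine localization stage.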
Abstract: We tackle the problem of localizing 3D point cloud submaps using complex and diverse natural language descriptions, and present Text2Loc++, a novel neural network designed for effective cross-modal alignment between language and point clouds in a coarse-to-fine localization pipeline. To support benchmarking, we introduce a new city-scale dataset covering both color and non-color point clouds from diverse urban scenes, and organize location descriptions into three levels of linguistic complexity. In the global place recognition stage, Text2Loc++ combines a pretrained language model with a Hierarchical Transformer with Max pooling (HTM) for sentence-level semantics, and employs an attention-based point cloud encoder for spatial understanding. We further propose Masked Instance Training (MIT) to filter out non-aligned objects and improve multimodal robustness. To enhance the embedding space, we introduce Modality-aware Hierarchical Contrastive Learning (MHCL), incorporating cross-modal, submap-, text-, and instance-level losses. In the fine localization stage, we completely remove explicit text-instance matching and design a lightweight yet powerful framework based on Prototype-based Map Cloning (PMC) and a Cascaded Cross-Attention Transformer (CCAT). Extensive experiments on the KITTI360Pose dataset show that Text2Loc++ outperforms existing methods by up to 15%. In addition, the proposed model exhibits robust generalization when evaluated on the new dataset, effectively handling complex linguistic expressions and a wide variety of urban environments. The code and dataset will be made publicly available.
[48] Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models cs.CV
Mehran Tamjidi, Hamidreza Dastmalchi, Mohammadreza Alimoradijazi, Ali Cheraghian, Aijun An
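TL;DR: This paper proposes Uni-Adapter, a training-free online test-time adaptation method for 3D vision-language foundation models (VLFMs) that uses dynamic prototype learning and a graph-based label smoothing module to handle distribution shift in real-world data.
Details
Motivation: 3D VLFMs perform well in open-world point cloud tasks but degrade in practical scenarios where data are noisy, incomplete, or drawn from a different distribution. An adaptation method that needs no retraining is required.
Result: State-of-the-art results on several 3D benchmarks, improving ModelNet-40C by 10.55%, ScanObjectNN-C by 8.26%, and ShapeNet-C by 4.49% over the source 3D VLFMs.
Insight: Combining dynamic prototype learning with label smoothing effectively mitigates distribution shift and improves 3D VLFMs in real-world scenarios without retraining.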
Abstract: 3D Vision-Language Foundation Models (VLFMs) have shown strong generalization and zero-shot recognition capabilities in open-world point cloud processing tasks. However, these models often underperform in practical scenarios where data are noisy, incomplete, or drawn from a different distribution than the training data. To address this, we propose Uni-Adapter, a novel training-free online test-time adaptation (TTA) strategy for 3D VLFMs based on dynamic prototype learning. We define a 3D cache to store class-specific cluster centers as prototypes, which are continuously updated to capture intra-class variability in heterogeneous data distributions. These dynamic prototypes serve as anchors for cache-based logit computation via similarity scoring. Simultaneously, a graph-based label smoothing module captures inter-prototype similarities to enforce label consistency among similar prototypes. Finally, we unify predictions from the original 3D VLFM and the refined 3D cache using entropy-weighted aggregation for reliable adaptation. Without retraining, Uni-Adapter effectively mitigates distribution shifts, achieving state-of-the-art performance on diverse 3D benchmarks over different 3D VLFMs, improving ModelNet-40C by 10.55%, ScanObjectNN-C by 8.26%, and ShapeNet-C by 4.49% over the source 3D VLFMs.
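The final step in the abstract above, entropy-weighted aggregation of the model's predictions and the cache's predictions, might look like the sketch below. The weighting rule (weights proportional to exp(-entropy), so the more confident source counts more) and the example logits are assumptions; the abstract does not give the exact formula.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_weighted_fusion(model_logits, cache_logits):
    """Fuse two logit vectors, trusting the lower-entropy source more (assumed rule)."""
    w_model = math.exp(-entropy(softmax(model_logits)))
    w_cache = math.exp(-entropy(softmax(cache_logits)))
    total = w_model + w_cache
    fused = [(w_model * a + w_cache * b) / total
             for a, b in zip(model_logits, cache_logits)]
    return fused, w_model / total, w_cache / total

# Hypothetical 3-class example: the base model is unsure, the prototype cache is confident.
model_logits = [1.0, 1.1, 0.9]
cache_logits = [0.2, 4.0, 0.1]
fused, w_model, w_cache = entropy_weighted_fusion(model_logits, cache_logits)
```

Here the cache receives the larger weight because its prediction is sharper, so the fused logits follow its choice of class 1.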
[49] A Multimodal Transformer Approach for UAV Detection and Aerial Object Recognition Using Radar, Audio, and Video Data cs.CV
Mauro Larrat, Claudomiro Sales
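TL;DR: The paper proposes a multimodal Transformer for UAV detection and aerial object recognition that fuses radar, video (RGB and infrared), and audio data, using self-attention to achieve high-performance classification suitable for real-time use.
Details
Motivation: Single-modality approaches to UAV detection and aerial object recognition have limitations, so a robust system that fuses multimodal data is needed to improve performance.
Result: On an independent test set the model achieves macro-averaged accuracy of 0.9812 and F1-score of 0.9826, and it runs efficiently at 41.11 FPS, making it suitable for real-time applications.
Insight: Fusing multimodal data with a Transformer architecture can substantially improve aerial object classification, providing a high-accuracy solution for UAV detection in complex airspace.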
Abstract: Unmanned aerial vehicle (UAV) detection and aerial object recognition are critical for modern surveillance and security, prompting a need for robust systems that overcome limitations of single-modality approaches. This research addresses these challenges by designing and rigorously evaluating a novel multimodal Transformer model that integrates diverse data streams: radar, visual band video (RGB), infrared (IR) video, and audio. The architecture effectively fuses distinct features from each modality, leveraging the Transformer’s self-attention mechanisms to learn comprehensive, complementary, and highly discriminative representations for classification. The model demonstrated exceptional performance on an independent test set, achieving macro-averaged metrics of 0.9812 accuracy, 0.9873 recall, 0.9787 precision, 0.9826 F1-score, and 0.9954 specificity. Notably, it exhibited particularly high precision and recall in distinguishing drones from other aerial objects. Furthermore, computational analysis confirmed its efficiency, with 1.09 GFLOPs, 1.22 million parameters, and an inference speed of 41.11 FPS, highlighting its suitability for real-time applications. This study presents a significant advancement in aerial object classification, validating the efficacy of multimodal data fusion via a Transformer architecture for achieving state-of-the-art performance, thereby offering a highly accurate and resilient solution for UAV detection and monitoring in complex airspace.
[50] What Your Features Reveal: Data-Efficient Black-Box Feature Inversion Attack for Split DNNs cs.CV
Zhihan Ren, Lijun He, Jiaxi Liang, Xinzhu Fu, Haixia Bi
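TL;DR: This paper proposes FIA-Flow, a black-box feature inversion attack framework that data-efficiently reconstructs high-fidelity images from intermediate features in Split DNNs, revealing a more severe privacy risk than previously recognized.
Details
Motivation: Offloading computation in Split DNNs can leak private data, but existing feature inversion attack (FIA) methods produce limited reconstruction quality, which makes the true extent of the privacy threat hard to assess.
Result: Across several models (e.g., AlexNet and ResNet) and layers, FIA-Flow achieves more faithful and semantically aligned feature inversion, indicating a more severe privacy risk in Split DNNs.
Insight: Leaked intermediate features in Split DNNs can pose a more severe privacy risk than assumed, and FIA-Flow offers a more effective tool for privacy assessment.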
Abstract: Split DNNs enable edge devices by offloading intensive computation to a cloud server, but this paradigm exposes privacy vulnerabilities, as the intermediate features can be exploited to reconstruct the private inputs via Feature Inversion Attack (FIA). Existing FIA methods often produce limited reconstruction quality, making it difficult to assess the true extent of privacy leakage. To reveal the privacy risk of the leaked features, we introduce FIA-Flow, a black-box FIA framework that achieves high-fidelity image reconstruction from intermediate features. To exploit the semantic information within intermediate features, we design a Latent Feature Space Alignment Module (LFSAM) to bridge the semantic gap between the intermediate feature space and the latent space. Furthermore, to rectify distributional mismatch, we develop Deterministic Inversion Flow Matching (DIFM), which projects off-manifold features onto the target manifold with one-step inference. This decoupled design simplifies learning and enables effective training with few image-feature pairs. To quantify privacy leakage from a human perspective, we also propose two metrics based on a large vision-language model. Experiments show that FIA-Flow achieves more faithful and semantically aligned feature inversion across various models (AlexNet, ResNet, Swin Transformer, DINO, and YOLO11) and layers, revealing a more severe privacy threat in Split DNNs than previously recognized.
[51] Fast Post-Hoc Confidence Fusion for 3-Class Open-Set Aerial Object Detection cs.CV | cs.LG | cs.RO
Spyridon Loukovitis, Vasileios Karampinis, Athanasios Voulodimos
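TL;DR: This paper proposes a lightweight, model-agnostic post-processing framework for three-class open-set object detection in UAV navigation. By fusing multiple confidence estimates with per-detection features, it improves both open-set and closed-set performance.
Details
Motivation: UAV navigation needs reliable air-to-air detectors that can distinguish objects seen during training (ID) from unseen objects (OOD). Existing open-set methods typically rely on a single uncertainty score with thresholding, which limits flexibility and tends to conflate OOD objects with background clutter.
Result: The method surpasses threshold-based baselines by an average of 2.7% AUROC while retaining or improving open-set mAP, and improves closed-set mAP by up to 9 points (an 18% relative gain).
Insight: Fusing multiple confidence estimates and integrating detection features improves open-set detection and closed-set results at the same time, giving a more reliable solution for practical applications such as UAV navigation.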
Abstract: Developing reliable UAV navigation systems requires robust air-to-air object detectors capable of distinguishing between objects seen during training and previously unseen objects. While many methods address closed-set detection and achieve high-confidence recognition of in-domain (ID) targets, they generally do not tackle open-set detection, which requires simultaneous handling of both ID and out-of-distribution (OOD) objects. Existing open-set approaches typically rely on a single uncertainty score with thresholding, limiting flexibility and often conflating OOD objects with background clutter. In contrast, we propose a lightweight, model-agnostic post-processing framework that explicitly separates background from unknown objects while preserving the base detector’s performance. Our approach extends open-set detection beyond binary ID/OOD classification to real-time three-way classification among ID targets, OOD objects, and background. To this end, we employ a fusion scheme that aggregates multiple confidence estimates and per-detection features using a compact multilayer perceptron (MLP). Incorporating different logit variants into the MLP consistently enhances performance across both binary and three-class classification without compromising throughput. Extensive ablation and comparative experiments confirm that our method surpasses threshold-based baselines in two-class classification by an average of 2.7% AUROC, while retaining or improving open-set mAP. Furthermore, our study uniquely enables robust three-class classification, a critical capability for safe UAV navigation, where OOD objects must be actively avoided and background regions safely ignored. Comparative analysis highlights that our method surpasses competitive techniques in AUROC across datasets, while improving closed-set mAP by up to 9 points, an 18% relative gain.
[52] IPTQ-ViT: Post-Training Quantization of Non-linear Functions for Integer-only Vision Transformers cs.CV | cs.AI
Gihwan Kim, Jemin Lee, Hyungshin Kim
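TL;DR: IPTQ-ViT is a novel post-training quantization framework for fully integer-only vision transformers that requires no retraining. It approximates non-linear functions with a polynomial-based GELU and a bit-shifting-based Softmax, and uses a unified metric to select the best approximation function per activation layer, improving quantization performance.
Details
Motivation: Existing quantization-aware training (QAT) methods depend on expensive retraining, while existing post-training quantization (PTQ) methods either quantize non-linear functions only partially or fail to achieve fully integer-only inference. IPTQ-ViT aims to solve these problems so that vision transformers can be deployed efficiently in resource-constrained environments.
Result: IPTQ-ViT improves top-1 accuracy on image classification by 1.78 percentage points on average (up to 6.44) and object detection by 1.0 mAP over previous PTQ methods. It outperforms partial floating-point PTQ methods and reaches accuracy and latency comparable to integer-only QAT methods.
Insight: IPTQ-ViT shows that efficient approximation of non-linear functions is critical to quantization performance, and that a unified metric can balance quantization accuracy against computational cost.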
Abstract: Previous Quantization-Aware Training (QAT) methods for vision transformers rely on expensive retraining to recover accuracy loss in non-linear layer quantization, limiting their use in resource-constrained environments. In contrast, existing Post-Training Quantization (PTQ) methods either partially quantize non-linear functions or adjust activation distributions to maintain accuracy but fail to achieve fully integer-only inference. In this paper, we introduce IPTQ-ViT, a novel PTQ framework for fully integer-only vision transformers without retraining. We present approximation functions: a polynomial-based GELU optimized for vision data and a bit-shifting-based Softmax designed to improve approximation accuracy in PTQ. In addition, we propose a unified metric integrating quantization sensitivity, perturbation, and computational cost to select the optimal approximation function per activation layer. IPTQ-ViT outperforms previous PTQ methods, achieving up to 6.44%p (avg. 1.78%p) top-1 accuracy improvement for image classification and 1.0 mAP for object detection. IPTQ-ViT outperforms partial floating-point PTQ methods under W8A8 and W4A8, and achieves accuracy and latency comparable to integer-only QAT methods. We plan to release our code at https://github.com/gihwan-kim/IPTQ-ViT.git.
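To show what a bit-shifting Softmax can look like in general, the sketch below rewrites exp(x - max) as a power of two, handles the integer part of the exponent with a right shift, and approximates the fractional part linearly. This is a generic shift-style approximation simulated in floating point, with the linear term 2^(-f) ≈ 1 - f/2 chosen here as an assumption; it is not the function proposed in IPTQ-ViT.

```python
import math

LOG2_E = math.log2(math.e)

def exact_softmax(xs):
    """Reference softmax using the exponential function."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def shift_softmax(xs):
    """Softmax with exp replaced by a power of two split into a shift and a linear term."""
    m = max(xs)
    vals = []
    for x in xs:
        t = (m - x) * LOG2_E      # exp(x - m) == 2 ** (-t), with t >= 0
        q = math.floor(t)         # integer part: realized as a right shift by q bits
        f = t - q                 # fractional part in [0, 1)
        vals.append((1.0 - f / 2.0) / (1 << q))   # 2 ** (-f) ~= 1 - f / 2 (assumed)
    total = sum(vals)
    return [v / total for v in vals]

logits = [2.0, 1.0, 0.1]
exact = exact_softmax(logits)
approx = shift_softmax(logits)
max_err = max(abs(a - b) for a, b in zip(exact, approx))
```

The linear term is exact at f = 0 and f = 1 and slightly overestimates in between, so the approximate probabilities stay close to the reference while avoiding any call to `exp` in the inner loop.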
[53] Zero-Shot Open-Vocabulary Human Motion Grounding with Test-Time Training cs.CVPDF
Yunjiao Zhou, Xinyan Chen, Junlang Qian, Lihua Xie, Jianfei Yang
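TL;DR: ZOMG is a zero-shot, open-vocabulary framework that segments motion sequences into semantically aligned sub-actions without annotations or fine-tuning. It combines language semantic partition with soft masking optimization and outperforms prior methods by 8.7% mAP on the HumanML3D benchmark.
Details
Motivation: Understanding complex human activities requires decomposing motion into fine-grained, semantically aligned sub-actions. Existing methods rely on dense supervision and predefined action classes, which does not carry over to open-vocabulary, real-world settings.
Result: Outperforms prior methods by 8.7% mAP on the HumanML3D benchmark, with significant gains in downstream retrieval as well, establishing a new paradigm for annotation-free motion understanding.
Insight: The semantic decomposition ability of large language models, combined with soft masking optimization, enables efficient open-vocabulary motion segmentation in the zero-shot setting.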
Abstract: Understanding complex human activities demands the ability to decompose motion into fine-grained, semantically aligned sub-actions. This motion grounding process is crucial for behavior analysis, embodied AI and virtual reality. Yet, most existing methods rely on dense supervision with predefined action classes, which is infeasible in open-vocabulary, real-world settings. In this paper, we propose ZOMG, a zero-shot, open-vocabulary framework that segments motion sequences into semantically meaningful sub-actions without requiring any annotations or fine-tuning. Technically, ZOMG integrates (1) language semantic partition, which leverages large language models to decompose instructions into ordered sub-action units, and (2) soft masking optimization, which learns instance-specific temporal masks to focus on frames critical to sub-actions, while maintaining intra-segment continuity and enforcing inter-segment separation, all without altering the pretrained encoder. Experiments on three motion-language datasets demonstrate state-of-the-art motion grounding effectiveness and efficiency, outperforming prior methods by +8.7% mAP on the HumanML3D benchmark. Downstream retrieval also improves significantly, establishing a new paradigm for annotation-free motion understanding.
[54] Breaking Expert Knowledge Limits: Self-Pruning for Large Language Models cs.CVPDF
Haidong Kang, Lihong Lin, Enneng Yang, Hongning Dai, Hao Wang
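TL;DR: The paper proposes AutoPrune, a self-pruning method that uses large language models (LLMs) to design pruning algorithms automatically, removing the dependence on expert knowledge and addressing the outlier value issue at high pruning ratios.
Details
Motivation: Existing LLM pruning methods rely on expert knowledge to design algorithms, which is costly and limits performance. In addition, an outlier value issue at high pruning ratios causes sharp performance degradation.
Result: On mainstream LLM benchmarks, AutoPrune outperforms existing methods in both performance and interpretability.
Insight: LLMs can design pruning algorithms for themselves, reducing the dependence on expert knowledge, and dynamic sparsity allocation is key to limiting performance degradation at high pruning ratios.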
Abstract: Large language models (LLMs) have achieved remarkable performance on a wide range of tasks, but their massive size hinders real-world deployment. Existing pruning methods tailored for LLMs (e.g., Wanda) rely heavily on manually designed pruning algorithms, which leads to huge labor costs and requires expert knowledge. Furthermore, we are the first to identify the serious outlier value issue behind the dramatic performance degradation under high pruning ratios, which is caused by uniform sparsity; this raises an additional concern about how to design adaptive pruning sparsity suited to LLMs. Can LLMs prune by themselves? In this work, we give an affirmative answer by proposing a novel pruning method called AutoPrune, which first overcomes expert knowledge limits by leveraging LLMs to design optimal pruning algorithms for themselves automatically, without any expert knowledge. Specifically, to mitigate the black-box nature of LLMs, we propose a Graph-driven Chain-of-Thought (GCoT) to optimize prompts, significantly enhancing the reasoning process in learning the pruning algorithm and enabling us to generate pruning algorithms with superior performance and interpretability in the next generation. Finally, grounded in insights into the outlier value issue, we introduce Skew-aware Dynamic Sparsity Allocation (SDSA) to overcome it, mitigating performance degradation under high pruning ratios. We conduct extensive experiments on mainstream LLM benchmarks, demonstrating the superiority of AutoPrune, which consistently outperforms state-of-the-art competitors. The code is available at: https://anonymous.4open.science/r/AutoPrune.
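For context on the hand-designed baseline named in the abstract above, the sketch below shows a Wanda-style criterion as it is commonly described: each weight is scored by its magnitude times the L2 norm of the matching input activation, and the lowest-scoring weights in every output row are zeroed at one uniform sparsity. This is not AutoPrune or SDSA; the function names and toy data are assumptions made for illustration.

```python
import math


def wanda_scores(weights, activations):
    """Score each weight as |W[i][j]| * ||X[:, j]||_2.

    weights: list of rows, one row per output neuron.
    activations: list of calibration samples, one value per input feature.
    """
    n_in = len(weights[0])
    norms = [math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(n_in)]
    return [[abs(w) * norms[j] for j, w in enumerate(row)] for row in weights]


def prune_rows(weights, scores, sparsity):
    """Zero the lowest-scoring fraction of weights in every row (uniform sparsity)."""
    pruned = []
    for row, score_row in zip(weights, scores):
        k = int(len(row) * sparsity)  # number of weights removed from this row
        drop = set(sorted(range(len(row)), key=lambda j: score_row[j])[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


# Toy example: the large weight -2.0 sees a tiny activation and is pruned,
# while the small weight 0.1 sees a large activation and survives.
W = [[0.5, -2.0, 0.1, 1.0]]
X = [[1.0, 0.1, 10.0, 1.0], [1.0, 0.1, 10.0, 1.0]]
W_pruned = prune_rows(W, wanda_scores(W, X), sparsity=0.5)
```

Every row here gets the same sparsity, which is the uniform allocation the abstract identifies as the source of the outlier value issue at high pruning ratios.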
[55] ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation cs.CVPDF
Simon Boeder, Fabian Gigengack, Simon Roesler, Holger Caesar, Benjamin Risse
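TL;DR: ShelfOcc is a vision-only 3D occupancy estimation method that generates semantic voxel labels from video, avoiding the geometric inconsistencies and depth bleeding of 2D projection supervision and providing true 3D supervision.
Details
Motivation: Existing self-supervised and weakly supervised occupancy estimation methods rely on 2D projection or rendering-based supervision, which suffers from geometric inconsistencies and depth bleeding, while other approaches depend heavily on LiDAR. ShelfOcc aims to remove these limitations without LiDAR or manual 3D annotations.
Result: On the Occ3D-nuScenes benchmark, ShelfOcc achieves up to a 34% relative improvement over previous weakly supervised methods.
Insight: High-quality 3D supervision is essential for robust occupancy learning and is an important complement to architectural innovation.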
Abstract: Recent progress in self- and weakly supervised occupancy estimation has largely relied on 2D projection or rendering-based supervision, which suffers from geometric inconsistencies and severe depth bleeding. We thus introduce ShelfOcc, a vision-only method that overcomes these limitations without relying on LiDAR. ShelfOcc brings supervision into native 3D space by generating metrically consistent semantic voxel labels from video, enabling true 3D supervision without any additional sensors or manual 3D annotations. While recent vision-based 3D geometry foundation models provide a promising source of prior knowledge, they do not work out of the box as a prediction due to sparse or noisy and inconsistent geometry, especially in dynamic driving scenes. Our method introduces a dedicated framework that mitigates these issues by filtering and accumulating static geometry consistently across frames, handling dynamic content and propagating semantic information into a stable voxel representation. This data-centric shift in supervision for weakly/shelf-supervised occupancy estimation allows the use of essentially any SOTA occupancy model architecture without relying on LiDAR data. We argue that such high-quality supervision is essential for robust occupancy learning and constitutes an important complementary avenue to architectural innovation. On the Occ3D-nuScenes benchmark, ShelfOcc substantially outperforms all previous weakly/shelf-supervised methods (up to a 34% relative improvement), establishing a new data-driven direction for LiDAR-free 3D scene understanding.
[56] D4C: Data-free Quantization for Contrastive Language-Image Pre-training Models cs.CV | cs.LGPDF
Wenlun Zhang, Yunshan Zhong, Zihao Ding, Xinyu Li, Kentaro Yoshioka
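TL;DR: The paper proposes D4C, the first data-free quantization (DFQ) framework tailored for CLIP models. It substantially improves quantization performance through semantic injection, structural contrastive generation, and perturbation-aware enhancement.
Details
Motivation: Existing DFQ methods perform poorly when applied directly to CLIP, mainly because the synthesized samples have insufficient semantic content and low image diversity. A DFQ method designed specifically for CLIP is therefore needed.
Result: Across bit-widths and models, D4C significantly improves CLIP quantization performance (for example, Top-1 accuracy gains of 12.4% and 18.9% on CIFAR-10 for CLIP ResNet-50 and ViT-B/32 under W4A8).
Insight: Semantic alignment and diversity enhancement are key to quantizing CLIP successfully, and D4C offers a reference for quantizing other vision-language models.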
Abstract: Data-Free Quantization (DFQ) offers a practical solution for model compression without requiring access to real data, making it particularly attractive in privacy-sensitive scenarios. While DFQ has shown promise for unimodal models, its extension to Vision-Language Models such as Contrastive Language-Image Pre-training (CLIP) models remains underexplored. In this work, we reveal that directly applying existing DFQ techniques to CLIP results in substantial performance degradation due to two key limitations: insufficient semantic content and low intra-image diversity in synthesized samples. To tackle these challenges, we propose D4C, the first DFQ framework tailored for CLIP. D4C synthesizes semantically rich and structurally diverse pseudo images through three key components: (1) Prompt-Guided Semantic Injection aligns generated images with real-world semantics using text prompts; (2) Structural Contrastive Generation reproduces compositional structures of natural images by leveraging foreground-background contrastive synthesis; and (3) Perturbation-Aware Enhancement applies controlled perturbations to improve sample diversity and robustness. These components jointly empower D4C to synthesize images that are both semantically informative and structurally diverse, effectively bridging the performance gap of DFQ on CLIP. Extensive experiments validate the effectiveness of D4C, showing significant performance improvements on various bit-widths and models. For example, under the W4A8 setting with CLIP ResNet-50 and ViT-B/32, D4C achieves Top-1 accuracy improvement of 12.4% and 18.9% on CIFAR-10, 6.8% and 19.7% on CIFAR-100, and 1.4% and 5.7% on ImageNet-1K in zero-shot classification, respectively.
[57] Representation Space Constrained Learning with Modality Decoupling for Multimodal Object Detection cs.CVPDF
YiKang Shao, Tao Shi
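TL;DR: The paper proposes RSC-MD, which analyzes fusion degradation in multimodal object detection and designs two modules to address gradient suppression and modality imbalance, achieving state-of-the-art performance on multiple datasets.
Details
Motivation: Multimodal object detection improves robustness, but most existing methods neglect fusion degradation and none analyze it theoretically. The paper aims to fill this gap and propose a remedy.
Result: Experiments on the FLIR, LLVIP, M3FD, and MFAD datasets show that RSC-MD effectively alleviates fusion degradation and achieves SOTA performance.
Insight: 1. The core causes of fusion degradation are gradient suppression and modality imbalance; 2. Decoupling modality learning and constraining the representation space can significantly improve performance.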
Abstract: Multimodal object detection has attracted significant attention in both academia and industry for its enhanced robustness. Although numerous studies have focused on improving modality fusion strategies, most neglect fusion degradation, and none provide a theoretical analysis of its underlying causes. To fill this gap, this paper presents a systematic theoretical investigation of fusion degradation in multimodal detection and identifies two key optimization deficiencies: (1) the gradients of unimodal branch backbones are severely suppressed under multimodal architectures, resulting in under-optimization of the unimodal branches; (2) disparities in modality quality cause weaker modalities to experience stronger gradient suppression, which in turn results in imbalanced modality learning. To address these issues, this paper proposes a Representation Space Constrained Learning with Modality Decoupling (RSC-MD) method, which consists of two modules. The RSC module and the MD module are designed to respectively amplify the suppressed gradients and eliminate inter-modality coupling interference as well as modality imbalance, thereby enabling the comprehensive optimization of each modality-specific backbone. Extensive experiments conducted on the FLIR, LLVIP, M3FD, and MFAD datasets demonstrate that the proposed method effectively alleviates fusion degradation and achieves state-of-the-art performance across multiple benchmarks. The code and training procedures will be released at https://github.com/yikangshao/RSC-MD.
[58] HV-Attack: Hierarchical Visual Attack for Multimodal Retrieval Augmented Generation cs.CV | cs.AI | cs.IRPDF
Linyin Luo, Yujuan Ding, Yunshan Ma, Wenqi Fan, Hanjiang Lai
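TL;DR: The paper proposes HV-Attack, a visual attack on multimodal retrieval-augmented generation (MRAG) systems that adds imperceptible perturbations to the user's input image to break the alignment and semantic consistency of the generator's inputs.
Details
Motivation: Prior adversarial research shows that MRAG systems are vulnerable to knowledge poisoning attacks. This paper explores a new setting in which MRAG is attacked through visual perturbations alone; the challenge lies in the robustness of MRAG systems and the complexity of the generation chain.
Result: Experiments on the OK-VQA and InfoSeek datasets show that HV-Attack significantly degrades the performance of CLIP-based retrievers and of the BLIP-2 and LLaVA generators.
Insight: Visual perturbations can attack MRAG systems effectively, exposing a security vulnerability, particularly in how perturbations propagate through the multi-stage generation chain.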
Abstract: Advanced multimodal Retrieval-Augmented Generation (MRAG) techniques have been widely applied to enhance the capabilities of Large Multimodal Models (LMMs), but they also bring along novel safety issues. Existing adversarial research has revealed the vulnerability of MRAG systems to knowledge poisoning attacks, which fool the retriever into recalling injected poisoned contents. However, our work considers a different setting: a visual attack on MRAG that solely adds imperceptible perturbations to users' image inputs, without manipulating any other components. This is challenging due to the robustness of fine-tuned retrievers and large-scale generators, and the effect of visual perturbation may be further weakened by propagation through the RAG chain. We propose a novel Hierarchical Visual Attack that misaligns and disrupts the two inputs (the multimodal query and the augmented knowledge) of MRAG’s generator to confuse its generation. We further design a hierarchical two-stage strategy to obtain misaligned augmented knowledge. We disrupt the image input of the retriever to make it recall irrelevant knowledge from the original database, by optimizing the perturbation which first breaks the cross-modal alignment and then disrupts the multimodal semantic alignment. We conduct extensive experiments on two widely-used MRAG datasets: OK-VQA and InfoSeek. We use CLIP-based retrievers and two LMMs, BLIP-2 and LLaVA, as generators. Results demonstrate the effectiveness of our visual attack on MRAG through the significant decrease in both retrieval and generation performance.
[59] Driving in Spikes: An Entropy-Guided Object Detector for Spike Cameras cs.CVPDF
Ziyan Liu, Qi Su, Lulu Tang, Zhaofei Yu, Tiejun Huang
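TL;DR: The paper proposes EASD, an end-to-end spike camera detector whose dual-branch design handles the sparse, discrete output of spike cameras, and introduces DSEC Spike, the first driving-oriented simulated spike detection benchmark.
Details
Motivation: Object detection in autonomous driving suffers from motion blur and saturation under fast motion and extreme lighting. Spike cameras offer microsecond latency and ultra-high dynamic range, but standard image detectors cannot process their sparse, discrete output.
Result: EASD achieves efficient object detection on spike camera data.
Insight: The sparse output of spike cameras calls for dedicated detection methods, and a dual-branch design can capture global semantics and local detail together.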
Abstract: Object detection in autonomous driving suffers from motion blur and saturation under fast motion and extreme lighting. Spike cameras offer microsecond latency and ultra-high dynamic range for object detection by using per-pixel asynchronous integrate-and-fire. However, their sparse, discrete output cannot be processed by standard image-based detectors, posing a critical challenge for end-to-end spike stream detection. We propose EASD, an end-to-end spike camera detector with a dual-branch design: a Temporal Based Texture plus Feature Fusion branch for global cross-slice semantics, and an Entropy Selective Attention branch for object-centric details. To close the data gap, we introduce DSEC Spike, the first driving-oriented simulated spike detection benchmark.
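The per-pixel integrate-and-fire mechanism mentioned in the abstract above explains why the output is a sparse binary stream rather than an image. A minimal single-pixel sketch follows, assuming a fixed threshold and a subtractive reset; the function name, threshold, and reset rule are illustrative assumptions and do not describe the EASD pipeline or the DSEC Spike simulator.

```python
def integrate_and_fire(intensities, threshold=1.0):
    """Turn a sequence of per-step light intensities into a binary spike train.

    The pixel accumulates incoming light and emits a spike (1) whenever the
    accumulator reaches the threshold, then subtracts the threshold
    (assumed reset rule for this sketch).
    """
    accumulator = 0.0
    spikes = []
    for light in intensities:
        accumulator += light
        if accumulator >= threshold:
            spikes.append(1)
            accumulator -= threshold
        else:
            spikes.append(0)
    return spikes


# A bright pixel fires often, a dim pixel fires rarely.
bright = integrate_and_fire([0.5] * 8)
dim = integrate_and_fire([0.125] * 8)
```

In this toy model brightness is encoded in the firing rate, so a detector has to aggregate spikes over time before any texture becomes visible.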
[60] SIGMMA: Hierarchical Graph-Based Multi-Scale Multi-modal Contrastive Alignment of Histopathology Image and Spatial Transcriptome cs.CV | cs.LGPDF
Dabin Jeong, Amirhossein Vahidi, Ciro Ramírez-Suástegui, Marie Moullet, Kevin Ly
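TL;DR: SIGMMA proposes a multi-scale, multi-modal contrastive alignment framework for learning hierarchical representations of histopathology images and spatial transcriptome data, substantially improving gene-expression prediction and cross-modal retrieval.
Details
Motivation: Existing methods typically align HE images with ST data at a single scale, overlooking fine-grained cellular structures and their spatial organization.
Result: SIGMMA improves gene-expression prediction by 9.78% on average and cross-modal retrieval by 26.93% on average.
Insight: Multi-scale modeling combined with a cell-interaction graph better captures complex relationships in the tissue microenvironment.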
Abstract: Recent advances in computational pathology have leveraged vision-language models to learn joint representations of Hematoxylin and Eosin (HE) images with spatial transcriptomic (ST) profiles. However, existing approaches typically align HE tiles with their corresponding ST profiles at a single scale, overlooking fine-grained cellular structures and their spatial organization. To address this, we propose Sigmma, a multi-modal contrastive alignment framework for learning hierarchical representations of HE images and spatial transcriptome profiles across multiple scales. Sigmma introduces multi-scale contrastive alignment, ensuring that representations learned at different scales remain coherent across modalities. Furthermore, by representing cell interactions as a graph and integrating inter- and intra-subgraph relationships, our approach effectively captures cell-cell interactions, ranging from fine to coarse, within the tissue microenvironment. We demonstrate that Sigmma learns representations that better capture cross-modal correspondences, leading to an average improvement of 9.78% in the gene-expression prediction task and 26.93% in the cross-modal retrieval task across datasets. We further show that it learns meaningful multi-tissue organization in downstream analyses.
[61] Deep Learning for Accurate Vision-based Catch Composition in Tropical Tuna Purse Seiners cs.CVPDF
Xabier Lekunberri, Ahmad Kamal, Izaro Goienetxea, Jon Ruiz, Iñaki Quincoces
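TL;DR: The paper proposes a deep learning multi-stage pipeline for species identification on tropical tuna purse seiners, combining segmentation, tracking, and classification to significantly improve identification accuracy.
Details
Motivation: Electronic monitoring systems on tropical tuna purse seiners generate large volumes of video, and manual analysis is costly and hard to keep accurate. AI can streamline this workflow, but species identification remains challenging, particularly for similar species such as bigeye and yellowfin tuna.
Result: The YOLOv9-SAM2 segmentation approach reaches a validation mAP of 0.66 ± 0.03 and a recall of 0.88 ± 0.03. Combined with the hierarchical classification model, 84.8% of individuals are segmented and classified, with a mean average error of 4.5%.
Insight: 1. A hierarchical classification model suits similar-species identification better than a standard multiclass model; 2. Combining YOLOv9 with SAM2 performs well on segmentation; 3. The method may extend to other fishery monitoring scenarios.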
Abstract: Purse seiners play a crucial role in tuna fishing, as approximately 69% of the world’s tropical tuna is caught using this gear. All tuna Regional Fisheries Management Organizations have established minimum standards to use electronic monitoring (EM) in fisheries in addition to traditional observers. The EM systems produce a massive amount of video data that human analysts must process. Integrating artificial intelligence (AI) into their workflow can decrease that workload and improve the accuracy of the reports. However, species identification still poses significant challenges for AI, as achieving balanced performance across all species requires appropriate training data. Here, we quantify the difficulty experts face in distinguishing bigeye tuna (BET, Thunnus obesus) from yellowfin tuna (YFT, Thunnus albacares) using images captured by EM systems. We found inter-expert agreements of 42.9% ± 35.6% for BET and 57.1% ± 35.6% for YFT. We then present a multi-stage pipeline to estimate the species composition of the catches using a reliable ground-truth dataset based on identifications made by observers on board. Three segmentation approaches are compared: Mask R-CNN, a combination of DINOv2 with SAM2, and an integration of YOLOv9 with SAM2. We found that the latter performs best, with a validation mean average precision of 0.66 ± 0.03 and a recall of 0.88 ± 0.03. Segmented individuals are tracked using ByteTrack. For classification, we evaluate a standard multiclass classification model and a hierarchical approach, finding that the hierarchical approach generalizes better. All our models were cross-validated during training and tested on fishing operations with fully known catch composition. Combining YOLOv9-SAM2 with the hierarchical classification produced the best estimations, with 84.8% of the individuals being segmented and classified with a mean average error of 4.5%.
[62] FunnyNodules: A Customizable Medical Dataset Tailored for Evaluating Explainable AI cs.CVPDF
Luisa Gallée, Yiheng Xiong, Meinrad Beer, Michael Götz
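TL;DR: FunnyNodules is a fully parameterized synthetic medical image dataset for systematically evaluating attribute-based reasoning in medical AI models.
Details
Motivation: Existing medical image datasets lack annotated reasoning information, which is essential for developing explainable AI (xAI) models. FunnyNodules fills this gap.
Result: Demonstrates the use of FunnyNodules in model-agnostic evaluations, such as checking whether models learn the correct attribute-target relations and analyzing attention alignment.
Insight: A fully controllable synthetic dataset provides a flexible and reliable benchmark platform for explainability research in medical AI.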
Abstract: Densely annotated medical image datasets that capture not only diagnostic labels but also the underlying reasoning behind these diagnoses are scarce. Such reasoning-related annotations are essential for developing and evaluating explainable AI (xAI) models that reason similarly to radiologists: making correct predictions for the right reasons. To address this gap, we introduce FunnyNodules, a fully parameterized synthetic dataset designed for systematic analysis of attribute-based reasoning in medical AI models. The dataset generates abstract, lung nodule-like shapes with controllable visual attributes such as roundness, margin sharpness, and spiculation. Target class is derived from a predefined attribute combination, allowing full control over the decision rule that links attributes to the diagnostic class. We demonstrate how FunnyNodules can be used in model-agnostic evaluations to assess whether models learn correct attribute-target relations, to interpret over- or underperformance in attribute prediction, and to analyze attention alignment with attribute-specific regions of interest. The framework is fully customizable, supporting variations in dataset complexity, target definitions, class balance, and beyond. With complete ground truth information, FunnyNodules provides a versatile foundation for developing, benchmarking, and conducting in-depth analyses of explainable AI methods in medical image analysis.
[63] Learning to Expand Images for Efficient Visual Autoregressive Modeling cs.CVPDF
Ruiqing Yang, Kaixin Zhang, Zheng Zhang, Shan You, Tao Huang
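TL;DR: Proposes EAR, a new generation paradigm that emulates the center-outward perception pattern of the human visual system to achieve efficient autoregressive image generation.
Details
Motivation: Existing autoregressive models for visual generation are inefficient, either because of token-by-token decoding or because of the complexity of multi-scale representations.
Result: Experiments on ImageNet show that EAR achieves state-of-the-art trade-offs between fidelity and efficiency among single-scale autoregressive models.
Insight: By emulating human visual perception, EAR both reduces computational cost and improves generation quality, pointing to a new direction for scalable, cognitively aligned autoregressive image generation.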
Abstract: Autoregressive models have recently shown great promise in visual generation by leveraging discrete token sequences akin to language modeling. However, existing approaches often suffer from inefficiency, either due to token-by-token decoding or the complexity of multi-scale representations. In this work, we introduce Expanding Autoregressive Representation (EAR), a novel generation paradigm that emulates the human visual system’s center-outward perception pattern. EAR unfolds image tokens in a spiral order from the center and progressively expands outward, preserving spatial continuity and enabling efficient parallel decoding. To further enhance flexibility and speed, we propose a length-adaptive decoding strategy that dynamically adjusts the number of tokens predicted at each step. This biologically inspired design not only reduces computational cost but also improves generation quality by aligning the generation order with perceptual relevance. Extensive experiments on ImageNet demonstrate that EAR achieves state-of-the-art trade-offs between fidelity and efficiency on single-scale autoregressive models, setting a new direction for scalable and cognitively aligned autoregressive image generation.
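To make the center-outward ordering in the abstract above concrete, the sketch below enumerates the cells of an n x n token grid in a plain square spiral that starts at the center. The abstract does not specify EAR's exact traversal or how tokens are grouped for parallel decoding, so the direction order, starting cell, and function name here are assumptions.

```python
def spiral_order(n):
    """Return the (row, col) cells of an n x n grid in center-outward spiral order."""
    r = c = (n - 1) // 2                 # start at the center (upper-left center if n is even)
    order = [(r, c)]
    directions = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # right, down, left, up
    step, d = 1, 0
    while len(order) < n * n:
        for _ in range(2):               # each run length is used for two directions
            dr, dc = directions[d % 4]
            for _ in range(step):
                r, c = r + dr, c + dc
                if 0 <= r < n and 0 <= c < n:   # skip positions outside the grid
                    order.append((r, c))
            d += 1
        step += 1
    return order


# For a 3 x 3 grid the walk starts at (1, 1) and ends in the top-right corner.
order_3 = spiral_order(3)
```

Consecutive cells in this ordering are always grid neighbors when n is odd, which is one simple way to keep the spatial continuity the abstract refers to.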
[64] A Hybrid CNN-ViT-GNN Framework with GAN-Based Augmentation for Intelligent Weed Detection in Precision Agriculture cs.CVPDF
Pandiyaraju V, Abishek Karthik, Sreya Mynampati, Poovarasan L, D. Saraswathi
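TL;DR: The paper proposes a hybrid deep learning framework combining CNN, ViT, and GNN for weed detection in precision agriculture, with GAN-based augmentation and pre-training to improve performance. Experiments report high accuracy (99.33%) and practicality.
Details
Motivation: Weed detection is essential to sustainable management in precision agriculture, but current methods fall short under complex field conditions.
Result: Reaches 99.33% accuracy, precision, recall, and F1-score on multi-benchmark datasets.
Insight: The hybrid framework captures local, global, and relational features together, and GAN augmentation and pre-training significantly improve performance on small datasets.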
Abstract: Weed detection is an essential element of precision agriculture, since accurate species identification allows a farmer to apply herbicides selectively and supports sustainable crop management. This paper proposes a hybrid deep learning framework for weed detection that combines Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs) to build robustness to multiple field conditions. A Generative Adversarial Network (GAN)-based augmentation method is applied to balance class distributions and improve generalization. Further, a self-supervised contrastive pre-training method helps to learn more features from limited annotated data. Experiments yield 99.33% accuracy, precision, recall, and F1-score on multi-benchmark datasets. The proposed model architecture enables local, global, and relational feature representations and offers high interpretability and adaptability. Practically, the framework allows real-time, efficient deployment to edge devices for automated weed detection, reducing over-reliance on herbicides and providing scalable, sustainable precision-farming options.
[65] Transferable Dual-Domain Feature Importance Attack against AI-Generated Image Detector cs.CV | cs.CRPDF
Weiheng Zhu, Gang Cao, Jing Liu, Lifang Yu, Shaowei Weng
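TL;DR: The paper proposes DuFIA, a dual-domain feature importance attack against AI-generated image detectors. It jointly models spatial and frequency-domain feature importance to generate adversarial examples, with strong cross-model transferability, transparency, and robustness.
Details
Motivation: Existing AI-generated image (AIGI) detectors perform well under clean conditions, but their security against adversarial attacks has not been studied sufficiently. Advanced adversarial attacks are therefore needed to evaluate it.
Result: Experiments show that DuFIA exhibits good cross-model transferability, transparency, and robustness across a range of AIGI detectors.
Insight: Jointly modeling feature importance in both domains can significantly improve the transferability and effectiveness of adversarial attacks, offering a new approach to attacking AIGI detectors.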
Abstract: Recent AI-generated image (AIGI) detectors achieve impressive accuracy under clean conditions. From an anti-forensics perspective, it is important to develop advanced adversarial attacks for evaluating the security of such detectors, a topic that remains insufficiently explored. This letter proposes a Dual-domain Feature Importance Attack (DuFIA) scheme to invalidate AIGI detectors to some extent. Forensically important features are captured by the spatially interpolated gradient and frequency-aware perturbation. The adversarial transferability is enhanced by jointly modeling spatial and frequency-domain feature importance, which are fused to guide the optimization-based adversarial example generation. Extensive experiments across various AIGI detectors verify the cross-model transferability, transparency and robustness of DuFIA.
[66] From Low-Rank Features to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers cs.CVPDF
Huiyuan Tian, Bonan Xu, Shijian Li, Xin Jin
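TL;DR: The paper analyzes why feature distillation fails in Vision Transformers (ViTs) and proposes two strategies grounded in the low-rank structure of the features, substantially improving ViT feature distillation.
Details
Motivation: Feature-map knowledge distillation (KD) is highly effective for convolutional networks but works poorly for ViTs. The paper aims to explain the failure and propose improvements.
Result: On ImageNet-1K, DeiT-Tiny accuracy rises from 74.86% to 77.53% (feature lifting) and 78.23% (width alignment).
Insight: The low-rank nature of ViT features and the encoding mismatch are key to the distillation failure, and targeted remedies can improve results substantially.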
Abstract: Feature-map knowledge distillation (KD) is highly effective for convolutional networks but often fails for Vision Transformers (ViTs). To understand this failure and guide method design, we conduct a two-view representation analysis of ViTs. First, a layer-wise Singular Value Decomposition (SVD) of full feature matrices shows that final-layer representations are globally low-rank: for CaiT-S24, only 121/61/34/14 dimensions suffice to capture 99%/95%/90%/80% of the energy. In principle, this suggests that a compact student plus a simple linear projector should be enough for feature alignment, contradicting the weak empirical performance of standard feature KD. To resolve this paradox, we introduce a token-level Spectral Energy Pattern (SEP) analysis that measures how each token uses channel capacity. SEP reveals that, despite the global low-rank structure, individual tokens distribute energy over most channels, forming a high-bandwidth encoding pattern. This results in an encoding mismatch between wide teachers and narrow students. Motivated by this insight, we propose two minimal, mismatch-driven strategies: (1) post-hoc feature lifting with a lightweight projector retained during inference, or (2) native width alignment that widens only the student’s last block to the teacher’s width. On ImageNet-1K, these strategies reactivate simple feature-map distillation in ViTs, raising DeiT-Tiny accuracy from 74.86% to 77.53% and 78.23% when distilling from CaiT-S24, while also improving standalone students trained without any teacher. Our analysis thus explains why ViT feature distillation fails and shows how exploiting low-rank structure yields effective, interpretable remedies and concrete design guidance for compact ViTs.
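The "dimensions needed to capture X% of the energy" figures in the abstract above can be read off a singular value spectrum. The sketch below assumes that energy means the sum of squared singular values, which is the usual convention but is not stated in the abstract; the function name and the toy spectrum are illustrative and unrelated to CaiT-S24.

```python
def rank_for_energy(singular_values, fraction):
    """Smallest number of leading singular directions whose squared singular
    values sum to at least `fraction` of the total (assumed energy definition)."""
    energies = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energies)
    running = 0.0
    for k, energy in enumerate(energies, start=1):
        running += energy
        if running >= fraction * total:
            return k
    return len(energies)


# Toy spectrum: squared values are 9, 4, 1, 1, 1 (total 16).
toy_spectrum = [3.0, 2.0, 1.0, 1.0, 1.0]
ranks = [rank_for_energy(toy_spectrum, f) for f in (0.5, 0.8, 0.9, 1.0)]
```

A rapidly decaying spectrum gives small ranks for high energy fractions, which is the global low-rank picture; the token-level analysis in the abstract is a separate measurement that this sketch does not cover.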
[67] AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning cs.CVPDF
Urjitkumar Patel, Fang-Chun Yeh, Chinmay Gondhalekar
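TL;DR: AVATAAR is a modular, interpretable video question answering framework that combines global and local video context and refines its retrieval strategy through a feedback loop, significantly improving long-video understanding.
Details
Motivation: As video content proliferates, demand for understanding and answering questions about long videos is growing, and existing large vision language models underperform on complex queries.
Result: On the CinePile benchmark, AVATAAR significantly outperforms the baseline across categories such as temporal reasoning and technical queries, with relative gains of up to 8.2%.
Insight: The feedback loop is crucial to the gains across modules, suggesting that iterative reasoning can effectively replicate human-like thinking.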
Abstract: With the increasing prevalence of video content, effectively understanding and answering questions about long-form videos has become essential for numerous applications. Although large vision language models (LVLMs) have enhanced performance, they often face challenges with nuanced queries that demand both a comprehensive understanding and detailed analysis. To overcome these obstacles, we introduce AVATAAR, a modular and interpretable framework that combines global and local video context, along with a Pre Retrieval Thinking Agent and a Rethink Module. AVATAAR creates a persistent global summary and establishes a feedback loop between the Rethink Module and the Pre Retrieval Thinking Agent, allowing the system to refine its retrieval strategies based on partial answers and replicate human-like iterative reasoning. On the CinePile benchmark, AVATAAR demonstrates significant improvements over a baseline, achieving relative gains of +5.6% in temporal reasoning, +5% in technical queries, +8% in theme-based questions, and +8.2% in narrative comprehension. Our experiments confirm that each module contributes positively to the overall performance, with the feedback loop being crucial for adaptability. These findings highlight AVATAAR’s effectiveness in enhancing video understanding capabilities. Ultimately, AVATAAR presents a scalable solution for long-form Video Question Answering (QA), merging accuracy, interpretability, and extensibility.
[68] CompTrack: Information Bottleneck-Guided Low-Rank Dynamic Token Compression for Point Cloud Tracking cs.CV | cs.AIPDF
Sifan Zhou, Yichao Cao, Jiahao Nie, Yuqian Fu, Ziyu Zhao
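TL;DR: CompTrack proposes information bottleneck-guided low-rank dynamic token compression to address spatial and informational redundancy in point cloud tracking, achieving efficient real-time tracking.
Details
Motivation: The sparsity of point clouds introduces spatial redundancy (background noise) and informational redundancy (within the foreground), limiting the accuracy and efficiency of existing 3D single object trackers.
Result: Performs strongly on the KITTI, nuScenes, and Waymo datasets while running in real time at 90 FPS.
Insight: Dynamically compressing point cloud information via information entropy and low-rank approximation significantly improves tracking efficiency and accuracy.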
Abstract: 3D single object tracking (SOT) in LiDAR point clouds is a critical task in computer vision and autonomous driving. Despite considerable progress, the inherent sparsity of point clouds introduces a dual-redundancy challenge that limits existing trackers: (1) vast spatial redundancy from background noise impairs accuracy, and (2) informational redundancy within the foreground hinders efficiency. To tackle these issues, we propose CompTrack, a novel end-to-end framework that systematically eliminates both forms of redundancy in point clouds. First, CompTrack incorporates a Spatial Foreground Predictor (SFP) module to filter out irrelevant background noise based on information entropy, addressing spatial redundancy. Subsequently, its core is an Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module that eliminates the informational redundancy within the foreground. Theoretically grounded in low-rank approximation, this module leverages an online SVD analysis to adaptively compress the redundant foreground into a compact and highly informative set of proxy tokens. Extensive experiments on the KITTI, nuScenes and Waymo datasets demonstrate that CompTrack achieves top tracking performance with superior efficiency, running at a real-time 90 FPS on a single RTX 3090 GPU.
[69] US-X Complete: A Multi-Modal Approach to Anatomical 3D Shape Recovery cs.CV | cs.LGPDF
Miruna-Alexandra Gafencu, Yordanka Velikova, Nassir Navab, Mohammad Farid Azampour
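TL;DR: The paper proposes a new multi-modal deep learning method that combines information from ultrasound and a single X-ray image to complete occluded spinal anatomy in 3D ultrasound.
Details
Motivation: Ultrasound offers radiation-free, real-time visualization in spinal procedures, but acoustic shadowing from bone limits its ability to visualize complete anatomy.
Result: Experiments show that the method significantly outperforms the state of the art in 3D ultrasound vertebral reconstruction (p < 0.001), and phantom studies serve as an initial step toward clinical translation.
Insight: Incorporating a single X-ray image effectively mitigates ultrasound's key limitation while preserving its strengths as the primary imaging modality.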
Abstract: Ultrasound offers a radiation-free, cost-effective solution for real-time visualization of spinal landmarks, paraspinal soft tissues and neurovascular structures, making it valuable for intraoperative guidance during spinal procedures. However, ultrasound suffers from inherent limitations in visualizing complete vertebral anatomy, in particular vertebral bodies, due to acoustic shadowing effects caused by bone. In this work, we present a novel multi-modal deep learning method for completing occluded anatomical structures in 3D ultrasound by leveraging complementary information from a single X-ray image. To enable training, we generate paired training data consisting of: (1) 2D lateral vertebral views that simulate X-ray scans, and (2) 3D partial vertebrae representations that mimic the limited visibility and occlusions encountered during ultrasound spine imaging. Our method integrates morphological information from both imaging modalities and demonstrates significant improvements in vertebral reconstruction (p < 0.001) compared to the state of the art in 3D ultrasound vertebral completion. We perform phantom studies as an initial step toward future clinical translation, and achieve a more accurate, complete volumetric lumbar spine visualization overlaid on the ultrasound scan without the need for registration with preoperative modalities such as computed tomography. This demonstrates that integrating a single X-ray projection mitigates ultrasound’s key limitation while preserving its strengths as the primary imaging modality. Code and data can be found at https://github.com/miruna20/US-X-Complete
[70] FlashMesh: Faster and Better Autoregressive Mesh Synthesis via Structured Speculation cs.CVPDF
Tingrui Shen, Yiheng Zhang, Chen Tang, Chuan Ping, Zixing Zhao
TL;DR: FlashMesh is a fast, high-fidelity 3D mesh generation framework based on structured speculation. It rethinks autoregressive decoding through a predict-correct-verify paradigm, improving both inference speed and generation quality.
Details
Motivation: Autoregressive models generate high-quality 3D meshes by decoding vertices and faces sequentially, but token-by-token decoding makes inference slow and limits practical use in interactive and large-scale applications.
Result: Experiments show that FlashMesh achieves up to a 2x speedup over standard autoregressive models while also improving generation fidelity.
Insight: Structural priors in mesh data can be systematically harnessed to accelerate and enhance autoregressive generation.
Abstract: Autoregressive models can generate high-quality 3D meshes by sequentially producing vertices and faces, but their token-by-token decoding results in slow inference, limiting practical use in interactive and large-scale applications. We present FlashMesh, a fast and high-fidelity mesh generation framework that rethinks autoregressive decoding through a predict-correct-verify paradigm. The key insight is that mesh tokens exhibit strong structural and geometric correlations that enable confident multi-token speculation. FlashMesh leverages this by introducing a speculative decoding scheme tailored to the commonly used hourglass transformer architecture, enabling parallel prediction across face, point, and coordinate levels. Extensive experiments show that FlashMesh achieves up to a 2x speedup over standard autoregressive models while also improving generation fidelity. Our results demonstrate that structural priors in mesh data can be systematically harnessed to accelerate and enhance autoregressive generation.
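The predict-correct-verify idea can be illustrated with a generic greedy speculative decoding loop. This is a minimal sketch under stated assumptions, not FlashMesh's hourglass-specific scheme: `target` and `draft` are hypothetical next-token functions, and verification is written as a loop here even though a real model would score all proposed positions in one parallel forward pass.

```python
from typing import Callable, List

Token = int
NextFn = Callable[[List[Token]], Token]  # maps a prefix to its next token


def speculative_decode(target: NextFn, draft: NextFn, prompt: List[Token],
                       max_new: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding; output equals plain greedy decoding with `target`."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # Predict: the cheap draft proposes up to k tokens.
        proposal: List[Token] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # Verify: keep the longest prefix the target agrees with.
        accepted = 0
        for i, tok in enumerate(proposal):
            if target(out + proposal[:i]) != tok:
                break
            accepted += 1
        out.extend(proposal[:accepted])
        # Correct: on the first mismatch, take the target's own token.
        if accepted < len(proposal):
            out.append(target(out))
    return out
```

When the draft agrees with the target often, several tokens are committed per verification round, which is where the speedup would come from.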
[71] The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification cs.CV | cs.AIPDF
Dante Francisco Wasmuht, Otto Brookes, Maximillian Schall, Pablo Palencia, Chris Beirne
TL;DR: SA-FARI is the largest open-source multi-animal tracking dataset for wild animals to date, spanning 99 species categories and 11,609 videos with rich spatio-temporal annotations, and it provides a new benchmark for generalizable wildlife tracking models.
Details
Motivation: Existing datasets are limited in scale, restricted to a few species, or lack temporal and geographical diversity, so they cannot support training general-purpose, cross-species multi-animal tracking models.
Result: SA-FARI provides high-quality benchmark data for wildlife analysis and multi-animal tracking.
Insight: Large-scale, diverse datasets are key to advancing generalizable wildlife tracking.
Abstract: Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity - leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated, culminating in ~46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multi-animal tracking in the wild. The dataset is available at https://www.conservationxlabs.com/sa-fari.
[72] Hierarchical Semantic Tree Anchoring for CLIP-Based Class-Incremental Learning cs.CV | cs.LGPDF
Tao Hu, Lan Li, Zhen-Hao Xie, Da-Wei Zhou
TL;DR: This paper proposes HASTEN, a CLIP-based hierarchical semantic tree anchoring method that explicitly injects hierarchical information into class-incremental learning (CIL) to reduce catastrophic forgetting.
Details
Motivation: Existing CLIP-based CIL methods do not explicitly capture the hierarchy of visual and linguistic concepts, which leads to drift in fine-grained class features during incremental updates and to catastrophic forgetting.
Result: HASTEN outperforms existing methods in experiments while providing a unified structured representation.
Insight: Explicitly modeling hierarchical structure effectively mitigates catastrophic forgetting in CIL.
Abstract: Class-Incremental Learning (CIL) enables models to learn new classes continually while preserving past knowledge. Recently, vision-language models like CLIP offer transferable features via multi-modal pre-training, making them well-suited for CIL. However, real-world visual and linguistic concepts are inherently hierarchical: a textual concept like “dog” subsumes fine-grained categories such as “Labrador” and “Golden Retriever,” and each category entails its images. Existing CLIP-based CIL methods fail to explicitly capture this inherent hierarchy, leading to drift in fine-grained class features during incremental updates and ultimately to catastrophic forgetting. To address this challenge, we propose HASTEN (Hierarchical Semantic Tree Anchoring), which anchors hierarchical information into CIL to reduce catastrophic forgetting. First, we employ an external knowledge graph as supervision to embed visual and textual features in hyperbolic space, effectively preserving hierarchical structure as data evolves. Second, to mitigate catastrophic forgetting, we project gradients onto the null space of the shared hyperbolic mapper, preventing interference with prior tasks. These two steps work synergistically to enable the model to resist forgetting by maintaining hierarchical relationships. Extensive experiments show that HASTEN consistently outperforms existing methods while providing a unified structured representation.
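The null-space projection step can be sketched in plain Euclidean terms. This is an illustration under assumptions, not HASTEN's implementation: the abstract does not say which matrix defines the null space, so `features` below is a hypothetical matrix of inputs seen by a linear layer on earlier tasks, and `null_space_projector` is a name introduced here.

```python
import numpy as np


def null_space_projector(features: np.ndarray, tol: float = 1e-10) -> np.ndarray:
    """Return P such that features @ (P @ g) is ~0 for any gradient vector g.

    `features` has shape (n_samples, d); P has shape (d, d).
    """
    _, s, vt = np.linalg.svd(features, full_matrices=True)
    # Numerical rank: singular values above a relative tolerance.
    rank = int(np.sum(s > tol * s.max())) if s.size and s.max() > 0 else 0
    null_basis = vt[rank:].T  # columns span the null space of `features`
    return null_basis @ null_basis.T


rng = np.random.default_rng(0)
old_features = rng.normal(size=(3, 8))   # assumed inputs from earlier tasks
grad = rng.normal(size=8)                # raw gradient for the new task
safe_grad = null_space_projector(old_features) @ grad
```

An update along `safe_grad` leaves the layer's responses to the stored inputs unchanged to first order, which is the sense in which interference with prior tasks is avoided.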
[73] MambaIO: Global-Coordinate Inertial Odometry for Pedestrians via Multi-Scale Frequency-Decoupled Modeling cs.CV | cs.ROPDF
Shanshan Zhang
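TL;DR: MambaIO is a new inertial odometry method built on multi-scale frequency-decoupled modeling. The paper analyzes the suitability of the global coordinate frame for pedestrian scenarios and combines a Mamba architecture with a convolutional structure to achieve high-accuracy localization.
Details
Motivation: Inertial odometry has traditionally processed IMU measurements in the global coordinate frame. Recent drone studies show that the body frame can improve accuracy, which calls for re-evaluating whether the global frame suits pedestrian inertial odometry.
Result: On multiple public datasets, MambaIO substantially reduces localization error and achieves SOTA performance.
Insight: The choice of coordinate frame for pedestrian inertial odometry deserves systematic evaluation, and frequency decoupling with multi-scale modeling can substantially improve how IMU measurements are processed.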
Abstract: Inertial Odometry (IO) enables real-time localization using only acceleration and angular velocity measurements from an Inertial Measurement Unit (IMU), making it a promising solution for localization in consumer-grade applications. Traditionally, IMU measurements in IO have been processed under two coordinate system paradigms: the body coordinate frame and the global coordinate frame, with the latter being widely adopted. However, recent studies in drone scenarios have demonstrated that the body frame can significantly improve localization accuracy, prompting a re-evaluation of the suitability of the global frame for pedestrian IO. To address this issue, this paper systematically evaluates the effectiveness of the global coordinate frame in pedestrian IO through theoretical analysis, qualitative inspection, and quantitative experiments. Building upon these findings, we further propose MambaIO, which decomposes IMU measurements into high-frequency and low-frequency components using a Laplacian pyramid. The low-frequency component is processed by a Mamba architecture to extract implicit contextual motion cues, while the high-frequency component is handled by a convolutional structure to capture fine-grained local motion details. Experiments on multiple public datasets show that MambaIO substantially reduces localization error and achieves state-of-the-art (SOTA) performance. To the best of our knowledge, this is the first application of the Mamba architecture to the inertial odometry task.
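The frequency split can be illustrated with a single-level, Laplacian-pyramid-style decomposition of an IMU sequence. This is a simplified sketch: the binomial kernel, the absence of downsampling, and the function name `split_frequencies` are assumptions made here, not details from the paper.

```python
import numpy as np


def split_frequencies(imu: np.ndarray, kernel=(1, 4, 6, 4, 1)):
    """Split a (T, C) IMU sequence into low- and high-frequency parts.

    low  = smoothed signal (what a Mamba-style branch would consume)
    high = residual detail (what a convolutional branch would consume)
    The split is lossless: low + high reproduces the input.
    """
    k = np.asarray(kernel, dtype=float)
    k /= k.sum()
    low = np.stack(
        [np.convolve(imu[:, c], k, mode="same") for c in range(imu.shape[1])],
        axis=1,
    )
    high = imu - low
    return low, high


rng = np.random.default_rng(0)
imu = rng.normal(size=(200, 6))  # 3-axis accelerometer + 3-axis gyroscope
low, high = split_frequencies(imu)
```

A full Laplacian pyramid would repeat this on a downsampled `low` to obtain several scales; one level is enough to show the idea.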
[74] GEO-Bench-2: From Performance to Capability, Rethinking Evaluation in Geospatial AI cs.CV | cs.AIPDF
Naomi Simumba, Nils Lehmann, Paolo Fraccaro, Hamed Alemohammad, Geeth De Mel
TL;DR: GEO-Bench-2 is a standardized evaluation framework for assessing the performance and capabilities of Geospatial Foundation Models (GeoFMs) across tasks. It shows that model choice should follow task requirements and data modalities, and it supports reproducible research and methodological innovation.
Details
Motivation: Geospatial AI currently lacks standardized evaluation protocols, which makes models hard to compare and improve. GEO-Bench-2 aims to fill this gap with a comprehensive evaluation framework.
Result: Experiments show that no single model is best across all tasks. Models pretrained on natural images excel on high-resolution tasks, while EO-specific models perform better on multispectral applications.
Insight: Model choice should follow task characteristics and data modalities, and a single GeoFM that performs well across all tasks remains an open research goal. GEO-Bench-2 provides a principled basis for targeted improvement.
Abstract: Geospatial Foundation Models (GeoFMs) are transforming Earth Observation (EO), but evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation across 19 permissively-licensed datasets. We introduce “capability” groups to rank models on datasets that share common characteristics (e.g., resolution, bands, temporality). This enables users to identify which models excel in each capability and determine which areas need improvement in future work. To support both fair comparison and methodological innovation, we define a prescriptive yet flexible evaluation protocol. This not only ensures consistency in benchmarking but also facilitates research into model adaptation strategies, a key and open challenge in advancing GeoFMs for downstream tasks. Our experiments show that no single model dominates across all tasks, confirming the specificity of the choices made during architecture design and pretraining. While models pretrained on natural images (ConvNext ImageNet, DINO V3) excel on high-resolution tasks, EO-specific models (TerraMind, Prithvi, and Clay) outperform them on multispectral applications such as agriculture and disaster response. These findings demonstrate that optimal model choice depends on task requirements, data modalities, and constraints. This shows that the goal of a single GeoFM that performs well across all tasks remains open for future research. GEO-Bench-2 enables informed, reproducible GeoFM evaluation tailored to specific use cases. Code, data, and leaderboard for GEO-Bench-2 are publicly released under a permissive license.
[75] MF-GCN: A Multi-Frequency Graph Convolutional Network for Tri-Modal Depression Detection Using Eye-Tracking, Facial, and Acoustic Features cs.CV | cs.AIPDF
Sejuti Rahman, Swakshar Deb, MD. Sameer Iqbal Chowdhury, MD. Jubair Ahmed Sourov, Mohammad Shamsuddin
TL;DR: The paper proposes a Multi-Frequency Graph Convolutional Network (MF-GCN) that combines eye-tracking, facial, and acoustic features for depression detection. Its Multi-Frequency Filter Bank Module (MFFBM) lets the model use both low- and high-frequency signals, and it consistently outperforms traditional baselines.
Details
Motivation: Existing graph-based models focus on low-frequency information, whereas depression detection benefits from integrating multiple modalities and both low- and high-frequency signals.
Result: Binary classification reaches a sensitivity of 0.96 and an F2 score of 0.94; three-class classification reaches a sensitivity of 0.79 and a specificity of 0.87. Generalizability is validated on the CMDC dataset.
Insight: Multi-frequency signals and multimodal fusion matter for depression detection, and the model shows potential for cross-cultural generalization.
Abstract: Eye tracking data quantifies the attentional bias towards negative stimuli that is frequently observed in depressed groups. Audio and video data capture the affective flattening and psychomotor retardation characteristic of depression. Statistical validation confirmed their significant discriminative power in distinguishing depressed from non-depressed groups. We address a critical limitation of existing graph-based models that focus on low-frequency information and propose a Multi-Frequency Graph Convolutional Network (MF-GCN). This framework consists of a novel Multi-Frequency Filter Bank Module (MFFBM), which can leverage both low- and high-frequency signals. Extensive evaluation against traditional machine learning algorithms and deep learning frameworks demonstrates that MF-GCN consistently outperforms baselines. In binary (depressed and non-depressed) classification, the model achieved a sensitivity of 0.96 and F2 score of 0.94. For the 3-class (no depression, mild to moderate depression and severe depression) classification task, the proposed method achieved a sensitivity of 0.79 and specificity of 0.87 and significantly surpassed other models. To validate generalizability, the model was also evaluated on the Chinese Multimodal Depression Corpus (CMDC) dataset and achieved a sensitivity of 0.95 and F2 score of 0.96. These results confirm that our trimodal, multi-frequency framework effectively captures cross-modal interaction for accurate depression detection.
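To make "low and high frequency signals" on a graph concrete, the sketch below builds a two-band filter from the symmetric normalized Laplacian. It is a textbook construction chosen for illustration; the actual design of the paper's MFFBM is not given in the abstract, and `two_band_graph_filter` is a hypothetical name.

```python
import numpy as np


def two_band_graph_filter(adj: np.ndarray, x: np.ndarray):
    """Split node features x (N, F) into low- and high-frequency parts.

    With L = I - D^{-1/2} A D^{-1/2} (eigenvalues in [0, 2]):
      low  = (I - L/2) x   keeps smooth, neighbor-agreeing components
      high = (L/2) x       keeps components that differ between neighbors
    """
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.where(deg > 0, deg, 1.0)), 0.0)
    norm_adj = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    lap = np.eye(adj.shape[0]) - norm_adj
    high = 0.5 * (lap @ x)
    low = x - high
    return low, high


# A 6-node ring graph as a toy example.
n = 6
ring = np.zeros((n, n))
for i in range(n):
    ring[i, (i + 1) % n] = ring[(i + 1) % n, i] = 1.0
```

On the ring, a constant signal passes entirely through the low band, while a signal that alternates sign between neighbors passes entirely through the high band; a conventional GCN layer behaves like the low band alone.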
[76] Hyperspectral Image Classification using Spectral-Spatial Mixer Network cs.CVPDF
Mohammed Q. Alkhatib
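TL;DR: SS-MixNet is a lightweight deep learning model for hyperspectral image classification that combines 3D convolutions with MLP-style mixer blocks and an attention mechanism, achieving the best results among compared methods with only 1% of labeled data.
Details
Motivation: Hyperspectral image classification with limited labeled data still needs efficient, lightweight models that capture long-range dependencies along both spectral and spatial dimensions.
Result: The model reaches 95.68% and 93.86% overall accuracy on the QUH-Tangdaowan and QUH-Qingyun datasets, respectively, outperforming the compared methods.
Insight: SS-MixNet performs well under limited supervision, which supports the value of combining spectral and spatial features with an attention mechanism.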
Abstract: This paper introduces SS-MixNet, a lightweight and effective deep learning model for hyperspectral image (HSI) classification. The architecture integrates 3D convolutional layers for local spectral-spatial feature extraction with two parallel MLP-style mixer blocks that capture long-range dependencies in spectral and spatial dimensions. A depthwise convolution-based attention mechanism is employed to enhance discriminative capability with minimal computational overhead. The model is evaluated on the QUH-Tangdaowan and QUH-Qingyun datasets using only 1% of labeled data for training and validation. SS-MixNet achieves the highest performance among compared methods, including 2D-CNN, 3D-CNN, IP-SWIN, SimPoolFormer, and HybridKAN, reaching 95.68% and 93.86% overall accuracy on the Tangdaowan and Qingyun datasets, respectively. The results, supported by quantitative metrics and classification maps, confirm the model’s effectiveness in delivering accurate and robust predictions with limited supervision. The code will be made publicly available at: https://github.com/mqalkhatib/SS-MixNet
[77] First Frame Is the Place to Go for Video Content Customization cs.CVPDF
Jingxi Chen, Zongxia Li, Zhichao Liu, Guangyao Shi, Xiyang Wu
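TL;DR: This paper offers a new perspective: in video generation models, the first frame is not only the spatial-temporal starting point but also an implicit memory buffer that stores visual entities, which can be used for diverse video content customization.
Details
Motivation: The first frame has traditionally been viewed as merely a seed for video generation. This paper reveals its potential as a conceptual memory buffer, opening a new route to video content customization.
Result: The approach achieves robust and generalized video content customization across diverse scenarios, validating its effectiveness.
Insight: The first frame in video generation models has a powerful, overlooked capability that can be used for efficient reference-based video customization.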
Abstract: What role does the first frame play in video generation models? Traditionally, it’s viewed as the spatial-temporal starting point of a video, merely a seed for subsequent animation. In this work, we reveal a fundamentally different perspective: video models implicitly treat the first frame as a conceptual memory buffer that stores visual entities for later reuse during generation. Leveraging this insight, we show that it’s possible to achieve robust and generalized video content customization in diverse scenarios, using only 20-50 training examples without architectural changes or large-scale finetuning. This unveils a powerful, overlooked capability of video generation models for reference-based video customization.
[78] GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization cs.CVPDF
Yikun Wang, Zuyan Liu, Ziyi Wang, Pengfei Liu, Han Hu
TL;DR: GeoVista is a new agentic visual reasoning model focused on geolocalization. It integrates image zoom-in and web-search tools and is trained with supervised fine-tuning followed by reinforcement learning, greatly improving geolocalization ability.
Details
Motivation: Existing agentic visual reasoning models focus mainly on image manipulation tools and lack generality. Geolocalization requires nuanced visual understanding plus web search, and no existing benchmark meets the need for high-resolution imagery and deep agentic reasoning.
Result: GeoVista greatly surpasses open-source agentic models on geolocalization and performs comparably to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.
Insight: Success at geolocalization requires combining visual reasoning with external information retrieval; a hierarchical reward and tool invocation embedded in the reasoning loop are key to improving agentic model performance.
Abstract: Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools, leaving a gap toward more general-purpose agentic models. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses during reasoning. Since existing geolocalization benchmarks fail to meet the need for high-resolution imagery and the localization challenge for deep agentic reasoning, we curate GeoBench, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities to rigorously evaluate the geolocalization ability of agentic models. We also propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. We develop a complete training pipeline for it, including a cold-start supervised fine-tuning (SFT) stage to learn reasoning patterns and tool-use priors, followed by a reinforcement learning (RL) stage to further enhance reasoning ability. We adopt a hierarchical reward to leverage multi-level geographical information and improve overall geolocalization performance. Experimental results show that GeoVista greatly surpasses other open-source agentic models on the geolocalization task and achieves performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.
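One way a hierarchical reward over multi-level geographical information could look is sketched below. Everything specific here is an assumption for illustration: the three levels, their weights, and the stop-at-first-mismatch rule are hypothetical, since the abstract does not define the reward.

```python
# Hypothetical levels and weights, coarse to fine; not taken from the paper.
LEVEL_WEIGHTS = (("country", 0.2), ("region", 0.3), ("city", 0.5))


def hierarchical_reward(pred: dict, truth: dict) -> float:
    """Credit each correctly predicted level, stopping at the first mismatch."""
    reward = 0.0
    for level, weight in LEVEL_WEIGHTS:
        if pred.get(level) is None or pred.get(level) != truth.get(level):
            break  # finer levels cannot count once a coarser one is wrong
        reward += weight
    return reward


truth = {"country": "France", "region": "Ile-de-France", "city": "Paris"}
partial = {"country": "France", "region": "Ile-de-France", "city": "Versailles"}
```

Partial credit of this kind gives the RL stage a denser signal than an all-or-nothing city match would.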
cs.CL [Back]
[79] Temporal Predictors of Outcome in Reasoning Language Models cs.CLPDF
Joey David
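TL;DR: This paper studies how well early hidden states of a large language model (LLM) predict the final outcome in multi-step reasoning tasks, and finds that eventual correctness is highly predictable after only a few reasoning tokens.
Details
Motivation: To determine how early an LLM internally commits to its eventual outcome in multi-step reasoning, with a view to better interpretability and inference-time control.
Result: Eventual correctness is highly predictable after only a few reasoning tokens. For harder questions, predictive accuracy drops, which the authors attribute to a selection artifact: hard items are disproportionately represented in long CoTs.
Insight: The model's internal self-assessment of success emerges early, which has implications for interpretability and for inference-time control.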
Abstract: The chain-of-thought (CoT) paradigm uses the elicitation of step-by-step rationales as a proxy for reasoning, gradually refining the model’s latent representation of a solution. However, it remains unclear just how early a Large Language Model (LLM) internally commits to an eventual outcome. We probe this by training linear classifiers on hidden states after the first t reasoning tokens, showing that eventual correctness is highly predictable after only a few tokens, even when longer outputs are needed to reach a definite answer. We show that, for harder questions, a drop in predictive accuracy highlights a selection artifact: hard items are disproportionately represented in long CoTs. Overall, our results imply that for reasoning models, internal self-assessment of success tends to emerge after only a few tokens, with implications for interpretability and for inference-time control.
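The probing setup, a linear classifier trained on hidden states after the first t reasoning tokens, can be sketched with logistic regression. The data below is a synthetic stand-in (random vectors with a planted linear signal), not real model activations, and the function names are introduced here for illustration.

```python
import numpy as np


def train_linear_probe(states: np.ndarray, labels: np.ndarray,
                       lr: float = 0.5, steps: int = 300):
    """Fit logistic regression by full-batch gradient descent."""
    n, d = states.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(states @ w + b)))
        grad = probs - labels            # gradient of the log-loss w.r.t. logits
        w -= lr * (states.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(states, labels, w, b) -> float:
    preds = (states @ w + b) > 0
    return float((preds == labels.astype(bool)).mean())


rng = np.random.default_rng(0)
hidden_t = rng.normal(size=(200, 16))          # stand-in for states at token t
direction = rng.normal(size=16)                # planted "correctness" direction
correct = (hidden_t @ direction > 0).astype(float)
w, b = train_linear_probe(hidden_t, correct)
```

In a real study one probe would be fit per prefix length t, on held-out data, and accuracy plotted against t to see how early correctness becomes decodable.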
[80] COMPASS: Context-Modulated PID Attention Steering System for Hallucination Mitigation cs.CLPDF
Snigdha Pandya, Rohan Nagale, Kenji Sahay, Anna Lin, Shikhar Shiromani
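TL;DR: COMPASS is a lightweight framework based on PID control that dynamically modulates attention heads to reduce contextual hallucination in LLMs and improve factual consistency.
Details
Motivation: LLMs often produce fluent but inaccurate statements even when relevant evidence is available. This failure stems from how attention is allocated, so a method is needed to understand and steer this internal behavior for greater trustworthiness.
Result: Across benchmarks (HotpotQA, XSum, HaluEval, RAGTruth), COMPASS consistently reduces contextual hallucination rates (2.8 to 5.8 percent absolute) while revealing how attention heads contribute to evidence alignment.
Insight: Feedback-driven interpretability offers a path toward scientific understanding of LLM behavior, and dynamic modulation of attention heads is an effective way to reduce hallucination.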
Abstract: Large language models (LLMs) often generate fluent but factually incorrect statements despite having access to relevant evidence, a failure mode rooted in how they allocate attention between contextual and parametric knowledge. Understanding and steering this internal behavior is key both for trustworthy deployment and for scientific interpretability of model mechanisms. We introduce COMPASS (Context-Modulated PID Attention Steering System), a lightweight, interpretable control framework that embeds a model-based feedback loop directly within decoding. COMPASS quantifies context reliance via a transparent metric, the Context Reliance Score (CRS), which serves as an online probe of how attention heads ground generation in evidence. Using this interpretable signal, a PID controller dynamically modulates attention heads to maintain factual consistency without retraining or multi-pass decoding. Across benchmarks (HotpotQA, XSum, HaluEval, RAGTruth), COMPASS consistently reduces contextual hallucination rates (2.8 to 5.8 percent absolute) while revealing how distinct attention heads contribute to evidence alignment. These results highlight feedback-driven interpretability as a pathway toward scientific understanding of LLM behavior.
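The control loop can be illustrated with a textbook discrete PID controller. This is a generic sketch, not COMPASS itself: the gains, the target value, and the toy plant in which the control signal nudges the measured score are assumptions; the abstract does not specify how CRS is computed or how the output is applied to attention heads.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PID:
    kp: float
    ki: float = 0.0
    kd: float = 0.0
    integral: float = 0.0
    prev_error: Optional[float] = None

    def step(self, setpoint: float, measurement: float, dt: float = 1.0) -> float:
        """Return the control signal for one decoding step."""
        error = setpoint - measurement
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Toy closed loop: the "score" stands in for a context-reliance measurement,
# and the control signal is assumed to shift it by half its value per step.
controller = PID(kp=1.0)
score, target = 0.0, 0.8
for _ in range(20):
    score += 0.5 * controller.step(target, score)
```

With a proportional term alone the toy score converges geometrically to the target; integral and derivative terms would matter when the response is biased or noisy.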
[81] HinTel-AlignBench: A Framework and Benchmark for Hindi-Telugu with English-Aligned Samples cs.CL | cs.LGPDF
Rishikant Chigrupaatii, Ponnada Sai Tulasi Kanishka, Lalit Chandra Routhu, Martin Patel Sama Supratheek Reddy, Divyam Gupta
TL;DR: HinTel-AlignBench is a framework and benchmark for Hindi and Telugu with English-aligned samples. It targets limitations in evaluating multilingual Vision-Language Models (VLMs), including dataset quality, task coverage, and sample diversity.
Details
Motivation: As multilingual VLMs gain prominence, current evaluations fall short: they rely on unverified auto-translations, cover narrow tasks and domains, use limited sample sizes, and lack cultural and natively sourced question-answering data.
Result: Across all models, performance on Indian-language tasks is lower than on the English counterparts for 4 out of 5 tasks, with an average drop of 8.3 points in Hindi and 5.5 points in Telugu.
Insight: An analysis of common failure modes points to concrete areas for improvement in multilingual multimodal understanding.
Abstract: With nearly 1.5 billion people and more than 120 major languages, India represents one of the most diverse regions in the world. As multilingual Vision-Language Models (VLMs) gain prominence, robust evaluation methodologies are essential to drive progress toward equitable AI for low-resource languages. Current multilingual VLM evaluations suffer from four major limitations: reliance on unverified auto-translations, narrow task/domain coverage, limited sample sizes, and lack of cultural and natively sourced Question-Answering (QA). To address these gaps, we present a scalable framework to evaluate VLMs in Indian languages and compare their performance with that in English. Using the framework, we generate HinTel-AlignBench, a benchmark that draws from diverse sources in Hindi and Telugu with English-aligned samples. Our contributions are threefold: (1) a semi-automated dataset creation framework combining back-translation, filtering, and human verification; (2) the most comprehensive vision-language benchmark for Hindi and Telugu, including adapted English datasets (VQAv2, RealWorldQA, CLEVR-Math) and native novel Indic datasets (JEE for STEM, VAANI for cultural grounding) with approximately 4,000 QA pairs per language; and (3) a detailed performance analysis of various State-of-the-Art (SOTA) open-weight and closed-source VLMs. We find that performance regresses from English to Indian languages on 4 out of 5 tasks across all the models, with an average regression of 8.3 points in Hindi and 5.5 points in Telugu. We categorize common failure modes to highlight concrete areas of improvement in multilingual multimodal understanding.
[82] OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition cs.CL | cs.AIPDF
Xinli Tao, Xin Dong, Xuezhong Zhou
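TL;DR: OEMA is a zero-shot clinical named entity recognition framework that uses multi-agent collaboration and ontology-guided reasoning to address the annotation dependence and example-selection problems of earlier methods, with strong results on the MTSamples and VAERS datasets.
Details
Motivation: Supervised clinical NER methods such as CRF and BioClinicalBERT require large amounts of costly annotated data. Zero-shot methods reduce this dependence but struggle with coarse example selection and with integrating prompts with self-improvement.
Result: On MTSamples and VAERS, OEMA achieves state-of-the-art exact-match performance; under related-match, it matches the supervised BioClinicalBERT and surpasses CRF.
Insight: Combining multi-agent collaboration with an ontology substantially improves zero-shot NER and offers a low-cost, efficient option for clinical NLP.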
Abstract: Clinical named entity recognition (NER) is crucial for extracting information from electronic health records (EHRs), but supervised models like CRF and BioClinicalBERT require costly annotated data. While zero-shot NER with large language models (LLMs) reduces this dependency, it struggles with example selection granularity and integrating prompts with self-improvement. To address this, we propose OEMA, a zero-shot clinical NER framework using multi-agent collaboration. OEMA’s three components are: a self-annotator generating examples, a discriminator filtering them via SNOMED CT, and a predictor using entity descriptions for accurate inference. On MTSamples and VAERS datasets, OEMA achieves state-of-the-art exact-match performance. Under related-match, it matches supervised BioClinicalBERT and surpasses CRF. OEMA addresses key zero-shot NER challenges through ontology-guided reasoning and multi-agent collaboration, achieving near-supervised performance and showing promise for clinical NLP applications.
[83] Context Cascade Compression: Exploring the Upper Limits of Text Compression cs.CL | cs.CVPDF
Fanfan Liu, Haibo Qiu
TL;DR: The paper proposes C3 (Context Cascade Compression), a text compression method that cascades two LLMs of different sizes, one for compression and one for decoding, to achieve efficient text compression and explore the upper limits of the compression ratio.
Details
Motivation: In long-context tasks, million-token inputs pose significant computational and memory challenges for LLMs. Earlier work, DeepSeek-OCR, explored optical compression of context and reported preliminary results, which motivates a more efficient text compression approach.
Result: Decoding accuracy reaches 98% at a 20x compression ratio and stays at around 93% at 40x, outperforming optical character compression.
Insight: C3 demonstrates the effectiveness of pure-text compression, suggests a potential upper bound on compression ratios for optical character compression, OCR, and related fields, and shows that cascading LLMs is feasible.
Abstract: Million-level token inputs in long-context tasks pose significant computational and memory challenges for Large Language Models (LLMs). Recently, DeepSeek-OCR conducted research into the feasibility of Contexts Optical Compression and achieved preliminary results. Inspired by this, we introduce Context Cascade Compression (C3) to explore the upper limits of text compression. Our method cascades two LLMs of different sizes to handle the compression and decoding tasks. Specifically, a small LLM, acting as the first stage, performs text compression by condensing a long context into a set of latent tokens (e.g., 32 or 64 in length), achieving a high ratio of text tokens to latent tokens. A large LLM, as the second stage, then executes the decoding task on this compressed context. Experiments show that at a 20x compression ratio (where the number of text tokens is 20 times the number of latent tokens), our model achieves 98% decoding accuracy, compared to approximately 60% for DeepSeek-OCR. When we further increase the compression ratio to 40x, the accuracy is maintained at around 93%. This indicates that in the domain of context compression, C3 Compression demonstrates superior performance and feasibility over optical character compression. C3 uses a simpler, pure-text pipeline that ignores factors like layout, color, and information loss from a visual encoder. This also suggests a potential upper bound for compression ratios in future work on optical character compression, OCR, and related fields. Codes and model weights are publicly accessible at https://github.com/liufanfanlff/C3-Context-Cascade-Compression
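The compression ratio as the abstract defines it (text tokens divided by latent tokens) implies the context lengths worked out below. This is plain arithmetic on the figures quoted above; `text_tokens_covered` is a helper name introduced here, and pairing each latent size with each ratio is an assumption about the experimental grid.

```python
def text_tokens_covered(latent_tokens: int, compression_ratio: float) -> int:
    """Number of text tokens represented by `latent_tokens` at a given ratio."""
    return int(latent_tokens * compression_ratio)


# Latent lengths (32, 64) and ratios (20x, 40x) are the values quoted in the abstract.
for latents in (32, 64):
    for ratio in (20, 40):
        print(f"{latents} latent tokens at {ratio}x -> "
              f"{text_tokens_covered(latents, ratio)} text tokens")
```

So a 32-token latent sequence at 20x would stand in for 640 text tokens, and a 64-token sequence at 40x for 2,560.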
[84] HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning cs.CLPDF
Alexis Correa-Guillén, Carlos Gómez-Rodríguez, David Vilares
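TL;DR: HEAD-QA v2 is an expanded and updated Spanish/English healthcare multiple-choice reasoning dataset that responds to the need for high-quality data supporting complex linguistic and conceptual reasoning in healthcare.
Details
Motivation: As demand grows for complex reasoning in healthcare, existing datasets need to be expanded and updated to support research and model improvement.
Result: Performance is driven mainly by model scale and intrinsic reasoning ability, while complex inference strategies yield limited gains.
Insight: HEAD-QA v2 is a reliable resource for advancing biomedical reasoning and model improvement, and the results underline the role of model scale in performance.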
Abstract: We introduce HEAD-QA v2, an expanded and updated version of a Spanish/English healthcare multiple-choice reasoning dataset originally released by Vilares and Gómez-Rodríguez (2019). The update responds to the growing need for high-quality datasets that capture the linguistic and conceptual complexity of healthcare reasoning. We extend the dataset to over 12,000 questions from ten years of Spanish professional exams, benchmark several open-source LLMs using prompting, RAG, and probability-based answer selection, and provide additional multilingual versions to support future work. Results indicate that performance is mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies obtaining limited gains. Together, these results establish HEAD-QA v2 as a reliable resource for advancing research on biomedical reasoning and model improvement.
[85] The Empowerment of Science of Science by Large Language Models: New Tools and Methods cs.CL | cs.AIPDF
Guoqiang Liang, Jingqian Gong, Mengxuan Li, Gege Lin, Shuo Zhang
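TL;DR: The paper reviews the potential of large language models (LLMs) for the Science of Science (scientometrics), covering techniques such as prompt engineering and knowledge-enhanced retrieval-augmented generation, and discusses an AI-agent-based model for scientific evaluation along with LLM-based methods for detecting new research fronts and building knowledge graphs.
Details
Motivation: LLMs perform well on natural language understanding and multimodal tasks, but their use in scientometrics has not been systematically explored. This paper aims to fill that gap and examine how LLMs can empower the field.
Result: The paper outlines the potential of LLMs in scientometrics and proposes concrete technical routes and application scenarios.
Insight: LLMs are useful beyond language tasks: through knowledge graphs and agent-based models they can help make scientometrics more intelligent.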
Abstract: Large language models (LLMs) have exhibited exceptional capabilities in natural language understanding and generation, image recognition, and multimodal tasks, charting a course towards AGI and emerging as a central issue in the global technological race. This manuscript conducts a comprehensive review of the core technologies that support LLMs from a user standpoint, including prompt engineering, knowledge-enhanced retrieval augmented generation, fine tuning, pretraining, and tool learning. Additionally, it traces the historical development of Science of Science (SciSci) and presents a forward looking perspective on the potential applications of LLMs within the scientometric domain. Furthermore, it discusses the prospect of an AI agent based model for scientific evaluation, and presents new research fronts detection and knowledge graph building methods with LLMs.
[86] A Compliance-Preserving Retrieval System for Aircraft MRO Task Search cs.CL | cs.AI | cs.ET | cs.IRPDF
Byungho Jo
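TL;DR: This paper presents a compliance-preserving retrieval system for aircraft MRO (maintenance, repair, and overhaul) task search that combines LLM reranking with semantic search, substantially improving retrieval speed and accuracy.
Details
Motivation: Aircraft maintenance technicians spend a large share of their time searching manuals, and existing systems do not offer search that is both efficient and compliant. The paper aims to speed up retrieval while keeping every procedure traceable to certified sources.
Result: On 49k synthetic queries, retrieval accuracy exceeds 90%. In bilingual controlled studies with 10 licensed technicians, the top-10 success rate is 90.9% and lookup time falls from 6-15 minutes to 18 seconds per task.
Insight: Semantic retrieval can deliver large efficiency gains under strict regulatory constraints while still meeting compliance requirements.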
Abstract: Aircraft Maintenance Technicians (AMTs) spend up to 30% of work time searching manuals, a documented efficiency bottleneck in MRO operations where every procedure must be traceable to certified sources. We present a compliance-preserving retrieval system that adapts LLM reranking and semantic search to aviation MRO environments by operating alongside, rather than replacing, certified legacy viewers. The system constructs revision-robust embeddings from ATA chapter hierarchies and uses vision-language parsing to structure certified content, allowing technicians to preview ranked tasks and access verified procedures in existing viewers. Evaluation on 49k synthetic queries achieves >90% retrieval accuracy, while bilingual controlled studies with 10 licensed AMTs demonstrate 90.9% top-10 success rate and 95% reduction in lookup time, from 6-15 minutes to 18 seconds per task. These gains provide concrete evidence that semantic retrieval can operate within strict regulatory constraints and meaningfully reduce operational workload in real-world multilingual MRO workflows.
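The quoted 95% reduction can be checked against the stated times. The calculation below is a reader's sanity check, not the authors' method; which baseline they used is not stated, and `relative_reduction` is a helper name introduced here.

```python
def relative_reduction(before_s: float, after_s: float) -> float:
    """Fraction of time saved when a task drops from before_s to after_s seconds."""
    return 1.0 - after_s / before_s


AFTER_S = 18.0                              # reported lookup time per task
low_baseline = relative_reduction(6 * 60, AFTER_S)    # vs. the 6-minute end
high_baseline = relative_reduction(15 * 60, AFTER_S)  # vs. the 15-minute end
print(f"{low_baseline:.0%} to {high_baseline:.0%}")
```

Against the 6-minute end of the range the reduction is 95%, and against the 15-minute end it is 98%, so the 95% figure is consistent with the conservative baseline.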
[87] DEPO: Dual-Efficiency Preference Optimization for LLM Agents cs.CL | cs.AIPDF
Sirui Chen, Mengshi Zhao, Lei Xu, Yuying Zhao, Beier Zhu
TL;DR: DEPO is a dual-efficiency preference optimization method that jointly optimizes tokens per step and the number of steps needed to complete a task, improving both the efficiency and the performance of LLM agents.
Details
Motivation: LLM agents have become better at reasoning and decision-making, but richer reasoning usually means longer chains of thought (CoT), which hurts interaction efficiency in real-world scenarios. A systematic definition of LLM agent efficiency is also missing, which hinders targeted improvement.
Result: On WebShop and BabyAI, DEPO cuts token usage by up to 60.9% and steps by up to 26.9% while improving performance by up to 29.3%. It also generalizes to out-of-domain benchmarks and remains data-efficient.
Insight: The dual-efficiency definition gives efficiency optimization for LLM agents a clear target, and DEPO shows that joint optimization works and that the efficiency gains transfer to other tasks.
Abstract: Recent advances in large language models (LLMs) have greatly improved their reasoning and decision-making abilities when deployed as agents. Richer reasoning, however, often comes at the cost of longer chain of thought (CoT), hampering interaction efficiency in real-world scenarios. Moreover, a systematic definition of LLM agent efficiency is still lacking, hindering targeted improvements. To this end, we introduce dual-efficiency, comprising (i) step-level efficiency, which minimizes tokens per step, and (ii) trajectory-level efficiency, which minimizes the number of steps to complete a task. Building on this definition, we propose DEPO, a dual-efficiency preference optimization method that jointly rewards succinct responses and fewer action steps. Experiments on WebShop and BabyAI show that DEPO cuts token usage by up to 60.9% and steps by up to 26.9%, while achieving up to a 29.3% improvement in performance. DEPO also generalizes to three out-of-domain math benchmarks and retains its efficiency gains when trained on only 25% of the data. Our project page is at https://opencausalab.github.io/DEPO.
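The two efficiency notions can be written down as simple trajectory statistics. The metrics follow the definitions in the abstract; the `Trajectory` type and the `prefer` rule for picking the preferred side of a preference pair are hypothetical illustrations, not DEPO's actual objective or data construction.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    tokens_per_step: List[int]   # tokens generated at each agent step
    success: bool


def step_level_cost(t: Trajectory) -> float:
    """Average tokens per step (lower means better step-level efficiency)."""
    return sum(t.tokens_per_step) / len(t.tokens_per_step)


def trajectory_level_cost(t: Trajectory) -> int:
    """Number of steps taken (lower means better trajectory-level efficiency)."""
    return len(t.tokens_per_step)


def prefer(a: Trajectory, b: Trajectory) -> Trajectory:
    """Hypothetical rule: success first, then fewer steps, then fewer tokens per step."""
    if a.success != b.success:
        return a if a.success else b
    key = lambda t: (trajectory_level_cost(t), step_level_cost(t))
    return a if key(a) <= key(b) else b


concise = Trajectory([40, 35, 50], success=True)
verbose = Trajectory([120, 90, 80, 110, 95], success=True)
chosen = prefer(concise, verbose)
```

Pairs of this kind (chosen vs. rejected) are the usual input to preference optimization; how DEPO weights the two efficiency terms is not specified in the abstract.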
[88] Multimodal Evaluation of Russian-language Architectures cs.CL | cs.AI | cs.CVPDF
Artem Chervyakov, Ulyana Isaeva, Anton Emelyanov, Artem Safin, Maria Tikhonova
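TL;DR: This paper introduces Mera Multi, a multimodal evaluation framework for Russian that fills the gap in Russian multimodal benchmarks. It covers text, image, audio, and video, and uses 18 newly constructed tasks to evaluate both general-purpose models and modality-specific architectures.
Details
Motivation: The capabilities and limitations of multimodal large language models (MLLMs) are insufficiently studied in Russian-language settings, where relevant benchmarks are missing.
Result: The paper reports baseline results for Russian multimodal models, giving future research a point of reference.
Insight: The methodology is not limited to Russian: it can be replicated for other languages, particularly within the Slavic family, addressing a gap in the linguistic diversity of multimodal evaluation.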
Abstract: Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-spoken architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.
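[89] HSKBenchmark: Modeling and Benchmarking Chinese Second Language Acquisition in Large Language Models through Curriculum Tuning cs.CL | cs.AI
Qihao Yang, Xuelin Wang, Jiale Chen, Xuelian Dong, Yuxin Hao
TL;DR: The paper introduces HSKBenchmark, the first benchmark for staged modeling and writing assessment in Chinese second language acquisition (SLA). It includes a curriculum-tuning framework and an evaluation system, and shows that LLM writing performance is comparable to that of advanced human learners.
Details
Motivation: Experiments that control human learners' language input are ethically and practically infeasible, so a controllable and reproducible way to model Chinese SLA is needed.
Result: Experiments show that HSKBenchmark models Chinese SLA effectively and that the fine-tuned LLMs write on par with advanced human learners.
Insight: LLMs can simulate human language-learning trajectories, offering a new tool for language acquisition modeling and LLM interpretability research.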
Abstract: Language acquisition is vital to revealing the nature of human language intelligence and has recently emerged as a promising perspective for improving the interpretability of large language models (LLMs). However, it is ethically and practically infeasible to conduct experiments that require controlling human learners’ language inputs. This poses challenges for the verifiability and scalability of language acquisition modeling, particularly in Chinese second language acquisition (SLA). While LLMs provide a controllable and reproducible alternative, a systematic benchmark to support phase-wise modeling and assessment is still lacking. In this paper, we present HSKBenchmark, the first benchmark for staged modeling and writing assessment of LLMs in Chinese SLA. It covers HSK levels 3 to 6 and includes authentic textbooks with 6.76 million tokens, 16K synthetic instruction samples, 30 test topics, and a linguistically grounded evaluation system. To simulate human learning trajectories, we introduce a curriculum-tuning framework that trains models from beginner to advanced levels. An evaluation system is created to examine level-based grammar coverage, writing errors, lexical and syntactic complexity, and holistic scoring. We also build HSKAgent, fine-tuned on 10K learner compositions. Extensive experimental results demonstrate that HSKBenchmark not only models Chinese SLA effectively, but also serves as a reliable benchmark for dynamic writing assessment in LLMs. Our fine-tuned LLMs have writing performance on par with advanced human learners and exhibit human-like acquisition characteristics. The HSKBenchmark, HSKAgent, and checkpoints serve as foundational tools and resources, with the potential to pave the way for future research on language acquisition modeling and LLMs interpretability. Code and data are publicly available at: https://github.com/CharlesYang030/HSKB.
cs.LG
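[90] Empowering Multi-Turn Tool-Integrated Reasoning with Group Turn Policy Optimization cs.LG | cs.AI | cs.CL
Yifeng Ding, Hung Le, Songyang Han, Kangrui Ruan, Zhenghui Jin
TL;DR: The paper proposes GTPO, a new reinforcement learning algorithm for multi-turn Tool-Integrated Reasoning (TIR) that improves training through fine-grained reward assignment and self-supervised reward shaping.
Details
Motivation: Existing RL methods perform poorly on multi-turn TIR tasks, mainly because trajectory-level reward signals are too coarse to guide learning over complex multi-turn interactions.
Result: Experiments show that GTPO outperforms GRPO by 3.0% on average across several reasoning benchmarks.
Insight: Fine-grained reward signals and self-supervision can effectively address training stagnation in multi-turn TIR tasks.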
Abstract: Training Large Language Models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR) - where models iteratively reason, generate code, and verify through execution - remains challenging for existing reinforcement learning (RL) approaches. Current RL methods, exemplified by Group Relative Policy Optimization (GRPO), suffer from coarse-grained, trajectory-level rewards that provide insufficient learning signals for complex multi-turn interactions, leading to training stagnation. To address this issue, we propose Group Turn Policy Optimization (GTPO), a novel RL algorithm specifically designed for training LLMs on multi-turn TIR tasks. GTPO introduces three key innovations: (1) turn-level reward assignment that provides fine-grained feedback for individual turns, (2) return-based advantage estimation where normalized discounted returns are calculated as advantages, and (3) self-supervised reward shaping that exploits self-supervision signals from generated code to densify sparse binary outcome-based rewards. Our comprehensive evaluation demonstrates that GTPO outperforms GRPO by 3.0% on average across diverse reasoning benchmarks, establishing its effectiveness for advancing complex mathematical reasoning in the real world.
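The second innovation listed in the abstract, return-based advantage estimation, can be sketched as follows. This is a minimal illustration under stated assumptions: the discount factor `gamma`, the choice to normalize over every turn in the sampled group, and both function names are hypothetical, since the abstract does not specify these details.

```python
from statistics import mean, pstdev


def discounted_returns(turn_rewards, gamma=0.9):
    """Return-to-go for each turn: G_t = r_t + gamma * G_{t+1}."""
    returns, running = [], 0.0
    for r in reversed(turn_rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def group_turn_advantages(group_rewards, gamma=0.9, eps=1e-8):
    """Normalize per-turn returns over all turns in a sampled group.

    group_rewards: list of trajectories, each a list of turn-level rewards.
    The normalization scope is an assumption made for this sketch.
    """
    group_returns = [discounted_returns(r, gamma) for r in group_rewards]
    flat = [g for traj in group_returns for g in traj]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(g - mu) / (sigma + eps) for g in traj] for traj in group_returns]


# Two illustrative trajectories: one ends in success, one does not.
advantages = group_turn_advantages([[0.0, 0.0, 1.0], [0.0, 0.0]], gamma=0.5)
```

Compared with a single trajectory-level reward, each turn here receives its own signal, so earlier turns of a successful trajectory are credited less than the final one.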
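[91] Dynamic Nested Hierarchies: Pioneering Self-Evolution in Machine Learning Architectures for Lifelong Intelligence cs.LG | cs.CV
Akbar Anbar Jafari, Cagri Ozcinar, Gholamreza Anbarjafari
TL;DR: The paper proposes Dynamic Nested Hierarchies to address the poor adaptability of existing machine learning models in non-stationary environments. Models autonomously adjust their optimization levels and update frequencies, which enables self-evolution and supports lifelong learning.
Details
Motivation: Existing machine learning models, including large language models, excel at static tasks but perform poorly in non-stationary environments because of rigid architectures. The paper aims to let models adapt to changing environments and learn continually by adjusting their structure dynamically.
Result: Through theoretical analysis and experiments, the paper reports superior performance of dynamic nested hierarchies in language modeling, continual learning, and long-context reasoning.
Insight: By mimicking neuroplasticity, dynamic nested hierarchies enable model self-evolution, which the authors present as a foundational step toward general-purpose AI.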
Abstract: Contemporary machine learning models, including large language models, exhibit remarkable capabilities in static tasks yet falter in non-stationary environments due to rigid architectures that hinder continual adaptation and lifelong learning. Building upon the nested learning paradigm, which decomposes models into multi-level optimization problems with fixed update frequencies, this work proposes dynamic nested hierarchies as the next evolutionary step in advancing artificial intelligence and machine learning. Dynamic nested hierarchies empower models to autonomously adjust the number of optimization levels, their nesting structures, and update frequencies during training or inference, inspired by neuroplasticity to enable self-evolution without predefined constraints. This innovation addresses the anterograde amnesia in existing models, facilitating true lifelong learning by dynamically compressing context flows and adapting to distribution shifts. Through rigorous mathematical formulations, theoretical proofs of convergence, expressivity bounds, and sublinear regret in varying regimes, alongside empirical demonstrations of superior performance in language modeling, continual learning, and long-context reasoning, dynamic nested hierarchies establish a foundational advancement toward adaptive, general-purpose intelligence.
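[92] Knowledge Graphs as Structured Memory for Embedding Spaces: From Training Clusters to Explainable Inference cs.LG | cs.CV
Artur A. Oliveira, Mateus Espadoto, Roberto M. Cesar, Roberto Hirata
TL;DR: The paper proposes Graph Memory (GM), a structured non-parametric framework that augments inference in embedding spaces with prototype nodes and relational edges, supporting efficient inference and explainability.
Details
Motivation: Conventional embedding-space inference methods such as kNN usually lack explicit modeling of global structure and reliability, which limits interpretability and inference efficiency. GM aims to fill this gap with structured memory.
Result: On synthetic and real datasets (including breast histopathology), GM achieves accuracy competitive with kNN and Label Spreading while offering better calibration and smoother decision boundaries.
Insight: By explicitly modeling reliability and relational structure, GM bridges local evidence and global consistency in non-parametric learning.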
Abstract: We introduce Graph Memory (GM), a structured non-parametric framework that augments embedding-based inference with a compact, relational memory over region-level prototypes. Rather than treating each training instance in isolation, GM summarizes the embedding space into prototype nodes annotated with reliability indicators and connected by edges that encode geometric and contextual relations. This design unifies instance retrieval, prototype-based reasoning, and graph-based label propagation within a single inductive model that supports both efficient inference and faithful explanation. Experiments on synthetic and real datasets including breast histopathology (IDC) show that GM achieves accuracy competitive with $k$NN and Label Spreading while offering substantially better calibration and smoother decision boundaries, all with an order of magnitude fewer samples. By explicitly modeling reliability and relational structure, GM provides a principled bridge between local evidence and global consistency in non-parametric learning.
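[93] Deep Pathomic Learning Defines Prognostic Subtypes and Molecular Drivers in Colorectal Cancer cs.LG | cs.AI | cs.CV | q-bio.GN
Zisong Wang, Xuanyu Wang, Hang Chen, Haizhou Wang, Yuxin Chen
TL;DR: The deep learning model TDAM-CRC stratifies colorectal cancer prognosis, reveals underlying molecular mechanisms, and provides a tool for personalized diagnosis and treatment.
Details
Motivation: The high heterogeneity of colorectal cancer (CRC) makes prognostic stratification difficult, and the conventional TNM staging system is inadequate for personalized medicine.
Result: The model outperforms the conventional staging system and existing models in risk stratification, and MRPL37 is confirmed as an independent prognostic biomarker.
Insight: The high-risk subtype is associated with metabolic reprogramming and an immunosuppressive microenvironment, which suggests new therapeutic targets for CRC.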
Abstract: Precise prognostic stratification of colorectal cancer (CRC) remains a major clinical challenge due to its high heterogeneity. The conventional TNM staging system is inadequate for personalized medicine. We aimed to develop and validate a novel multiple instance learning model, TDAM-CRC, using histopathological whole-slide images for accurate prognostic prediction and to uncover its underlying molecular mechanisms. We trained the model on the TCGA discovery cohort (n=581), validated it in an independent external cohort (n=1031), and further integrated multi-omics data to improve model interpretability and identify novel prognostic biomarkers. The results demonstrated that TDAM-CRC achieved robust risk stratification in both cohorts. Its predictive performance significantly outperformed the conventional clinical staging system and multiple state-of-the-art models. The TDAM-CRC risk score was confirmed as an independent prognostic factor in multivariable analysis. Multi-omics analysis revealed that the high-risk subtype is closely associated with metabolic reprogramming and an immunosuppressive tumor microenvironment. Through interaction network analysis, we identified and validated Mitochondrial Ribosomal Protein L37 (MRPL37) as a key hub gene linking deep pathomic features to clinical prognosis. We found that high expression of MRPL37, driven by promoter hypomethylation, serves as an independent biomarker of favorable prognosis. Finally, we constructed a nomogram incorporating the TDAM-CRC risk score and clinical factors to provide a precise and interpretable clinical decision-making tool for CRC patients. Our AI-driven pathological model TDAM-CRC provides a robust tool for improved CRC risk stratification, reveals new molecular targets, and facilitates personalized clinical decision-making.
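[94] GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning cs.LG | cs.CV
Yanchen Xu, Ziheng Jiao, Hongyuan Zhang, Xuelong Li
TL;DR: GRPO-RM extends the GRPO reinforcement learning method to representation learning models, using a predefined output set and a tailored reward function to optimize model performance.
Details
Motivation: GRPO performs well in fine-tuning large language models (LLMs). The paper investigates whether it generalizes to representation learning models.
Result: Experiments on multiple real-world datasets validate the effectiveness of the method.
Insight: The GRPO optimization framework can be transferred to other model types. The key is an output design and reward mechanism adapted to the task.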
Abstract: The Group Relative Policy Optimization (GRPO), a reinforcement learning method used to fine-tune large language models (LLMs), has proved its effectiveness in practical applications such as DeepSeek-R1. It raises a question whether GRPO can be generalized to representation learning models. In this paper, we propose Group Relative Policy Optimization for Representation Model (GRPO-RM), and investigate the performance of GRPO-like policy in post-training representation models. Specifically, our method establishes a predefined output set to functionally replace token sequence sampling in LLMs, thereby generating an output group, which is essential for the probability-driven optimization of GRPO. In addition, a specialized reward function is designed to accommodate the properties of representation models. Extensive experiments are conducted on various real-world datasets to validate the effectiveness of our proposed method.
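[95] NTK-Guided Implicit Neural Teaching cs.LG | cs.CV
Chen Zhang, Wei Zuo, Bingyang Cheng, Yikun Wang, Wei-Bin Kou
TL;DR: The paper proposes NINT, which accelerates the training of Implicit Neural Representations (INRs) by dynamically selecting the coordinates that maximize global functional updates, using the Neural Tangent Kernel (NTK) to improve training efficiency.
Details
Motivation: Fitting high-resolution signals with INRs requires optimizing over millions of coordinates at very high computational cost, so a more efficient training method is needed.
Result: Experiments show that NINT cuts training time by nearly half while maintaining or improving representation quality, outperforming existing sampling strategies.
Insight: Dynamically selecting training coordinates via the NTK is an effective acceleration technique, well suited to implicit representation of high-resolution signals.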
Abstract: Implicit Neural Representations (INRs) parameterize continuous signals via multilayer perceptrons (MLPs), enabling compact, resolution-independent modeling for tasks like image, audio, and 3D reconstruction. However, fitting high-resolution signals demands optimizing over millions of coordinates, incurring prohibitive computational costs. To address it, we propose NTK-Guided Implicit Neural Teaching (NINT), which accelerates training by dynamically selecting coordinates that maximize global functional updates. Leveraging the Neural Tangent Kernel (NTK), NINT scores examples by the norm of their NTK-augmented loss gradients, capturing both fitting errors and heterogeneous leverage (self-influence and cross-coordinate coupling). This dual consideration enables faster convergence compared to existing methods. Through extensive experiments, we demonstrate that NINT significantly reduces training time by nearly half while maintaining or improving representation quality, establishing state-of-the-art acceleration among recent sampling-based strategies.
cs.DB
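[96] BBox DocVQA: A Large Scale Bounding Box Grounded Dataset for Enhancing Reasoning in Document Visual Question Answer cs.DB | cs.AI | cs.CV
Wenhan Yu, Wang Chen, Guanqiang Qi, Weikang Li, Yang Li
TL;DR: BBox DocVQA is a large-scale dataset that uses bounding-box annotations to strengthen spatial reasoning in document visual question answering. It is built with an automated pipeline that combines region segmentation, semantic judgment, and question generation, followed by human verification for quality.
Details
Motivation: Existing DocVQA datasets usually lack fine-grained spatial annotations, which limits the spatial reasoning and evidence localization of vision-language models. BBox DocVQA fills this gap.
Result: The dataset contains 3.6K documents and 32K QA pairs, covering single-region, multi-region, and multi-page scenarios. Experiments show that current VLMs still struggle with spatial grounding and reasoning accuracy, while fine-tuning on BBox DocVQA substantially improves performance.
Insight: Bounding-box annotations are essential for spatial-semantic alignment, and the dataset's diversity makes it a valuable resource for research on interpretable spatial reasoning.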
Abstract: Document Visual Question Answering (DocVQA) is a fundamental task for multimodal document understanding and a key testbed for vision-language reasoning. However, most existing DocVQA datasets are limited to the page level and lack fine-grained spatial grounding, constraining the interpretability and reasoning capability of Vision Language Models (VLMs). To address this gap, we introduce BBox DocVQA, a large-scale, bounding-box-grounded dataset designed to enhance spatial reasoning and evidence localization in visual documents. We further present an automated construction pipeline, Segment Judge and Generate, which integrates a segment model for region segmentation, a VLM for semantic judgment, and another advanced VLM for question-answer generation, followed by human verification for quality assurance. The resulting dataset contains 3.6K diverse documents and 32K QA pairs, encompassing single- and multi-region as well as single- and multi-page scenarios. Each QA instance is grounded on explicit bounding boxes, enabling fine-grained evaluation of spatial-semantic alignment. Benchmarking multiple state-of-the-art VLMs (e.g., GPT-5, Qwen2.5-VL, and InternVL) on BBox DocVQA reveals persistent challenges in spatial grounding and reasoning accuracy. Furthermore, fine-tuning on BBox DocVQA substantially improves both bounding box localization and answer generation, validating its effectiveness for enhancing the reasoning ability of VLMs. Our dataset and code will be publicly released to advance research on interpretable and spatially grounded vision-language reasoning.
cs.AI
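[97] ProRAC: A Neuro-symbolic Method for Reasoning about Actions with LLM-based Progression cs.AI | cs.CL
Haoyong Wu, Yongmei Liu
TL;DR: ProRAC is a neuro-symbolic framework that uses large language models (LLMs) to solve Reasoning about Actions and Change (RAC) problems by executing actions step by step to derive the final state and then answering the query.
Details
Motivation: Conventional RAC methods perform poorly on complex problems. ProRAC leverages the flexibility of LLMs to improve reasoning.
Result: ProRAC performs strongly on RAC benchmarks across domains, task types, and LLM backbones.
Insight: Combining neuro-symbolic methods with LLMs can improve performance and generalization on complex reasoning tasks.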
Abstract: In this paper, we propose ProRAC (Progression-based Reasoning about Actions and Change), a neuro-symbolic framework that leverages LLMs to tackle RAC problems. ProRAC extracts fundamental RAC elements including actions and questions from the problem, progressively executes each action to derive the final state, and then evaluates the query against the progressed state to arrive at an answer. We evaluate ProRAC on several RAC benchmarks, and the results demonstrate that our approach achieves strong performance across different benchmarks, domains, LLM backbones, and types of RAC tasks.
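[98] Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration cs.AI | cs.CV
Yifu Guo, Zishan Xu, Zhiyuan Yao, Yuquan Lu, Jiaye Lin
TL;DR: Octopus proposes a new paradigm for multimodal agentic reasoning built on six core capabilities. It explores autonomously, dynamically selects the most suitable capability, and performs strongly on Octopus-Bench.
Details
Motivation: Existing multimodal reasoning models lack the human-like ability to explore autonomously and cannot adapt to dynamically changing task requirements. Octopus aims to solve this problem.
Result: Experiments show that Octopus achieves the best performance on most Octopus-Bench tasks, underscoring the importance of capability coordination.
Insight: Multimodal agentic reasoning needs dynamic coordination of a full set of capabilities to adapt to complex task demands.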
Abstract: Existing multimodal reasoning models and frameworks suffer from fundamental architectural limitations: most lack the human-like ability to autonomously explore diverse reasoning pathways, whether in direct inference, tool-driven visual exploration, programmatic visual manipulation, or intrinsic visual imagination. Consequently, they struggle to adapt to dynamically changing capability requirements in real-world tasks. Meanwhile, humans exhibit a complementary set of thinking abilities when addressing such tasks, whereas existing methods typically cover only a subset of these dimensions. Inspired by this, we propose Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration, a new paradigm for multimodal agentic reasoning. We define six core capabilities essential for multimodal reasoning and organize a comprehensive evaluation benchmark, Octopus-Bench, accordingly. Octopus is capable of autonomously exploring during reasoning and dynamically selecting the most appropriate capability based on the current state. Experimental results show that Octopus achieves the best performance on the vast majority of tasks in Octopus-Bench, highlighting the crucial role of capability coordination in agentic multimodal reasoning.
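[99] IPR-1: Interactive Physical Reasoner cs.AI | cs.CV
Mingyu Zhang, Lifeng Zhuo, Tianxi Tan, Guocan Xie, Xian Nie
TL;DR: IPR-1 is an interactive physical reasoner that uses world-model rollouts and a physics-centric action code (PhysCode) to reinforce a vision-language model (VLM) policy, improving physical reasoning across many game environments.
Details
Motivation: The paper asks whether an agent can acquire human-like physical reasoning through interaction with environments, and tests reasoning from primitive intuition to goal-driven behavior in a Game-to-Unseen (G2U) setting.
Result: Pretrained on 1,000+ games, IPR performs robustly on all three reasoning levels, matches GPT-5 overall and surpasses it on Curiosity, and transfers zero-shot to unseen games.
Insight: Physics-centric interaction is an effective path to better reasoning, and more training games and interaction steps further improve performance.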
Abstract: Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more experience. We study this in a Game-to-Unseen (G2U) setting, curating 1,000+ heterogeneous games with diverse physical and causal mechanisms, and evaluate at three human-like levels: Survival, Curiosity, Utility, from primitive intuition to goal-driven reasoning. Our analysis reveals complementary failures: VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyze physics and causality. We therefore propose IPR (Interactive Physical Reasoner), using world-model rollouts to score and reinforce a VLM’s policy, and introduce PhysCode, a physics-centric action code aligning semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1,000+ games, our IPR performs robustly on three levels, matches GPT-5 overall, and surpasses it on Curiosity. We find that performance improves with more training games and interaction steps, and that the model also zero-shot transfers to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning.
eess.IV
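[100] Application of Graph Based Vision Transformers Architectures for Accurate Temperature Prediction in Fiber Specklegram Sensors eess.IV | cs.AI | cs.CV
Abhishek Sebastian
TL;DR: The study examines transformer-based architectures (ViT, Swin Transformer, LINA-ViT, and MAP-ViGAT) for temperature prediction with fiber specklegram sensors. Transformer models outperform traditional CNNs, and XAI techniques are used to improve interpretability.
Details
Motivation: The nonlinear nature of fiber specklegram data makes temperature prediction challenging. Traditional methods have limited effectiveness, so more capable modeling approaches are needed.
Result: ViT achieves an MAE of 1.15, outperforming CNNs. GAT-ViT and MAP-ViGAT are also competitive, highlighting the importance of adaptive attention mechanisms.
Insight: Transformer architectures show promise for modeling nonlinear data, and XAI techniques help explain model decisions, pointing to new directions for industrial monitoring and structural health assessment.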
Abstract: Fiber Specklegram Sensors (FSS) are highly effective for environmental monitoring, particularly for detecting temperature variations. However, the nonlinear nature of specklegram data presents significant challenges for accurate temperature prediction. This study investigates the use of transformer-based architectures, including Vision Transformers (ViTs), Swin Transformers, and emerging models such as Learnable Importance Non-Symmetric Attention Vision Transformers (LINA-ViT) and Multi-Adaptive Proximity Vision Graph Attention Transformers (MAP-ViGAT), to predict temperature from specklegram data over a range of 0 to 120 Celsius. The results show that ViTs achieved a Mean Absolute Error (MAE) of 1.15, outperforming traditional models such as CNNs. GAT-ViT and MAP-ViGAT variants also demonstrated competitive accuracy, highlighting the importance of adaptive attention mechanisms and graph-based structures in capturing complex modal interactions and phase shifts in specklegram data. Additionally, this study incorporates Explainable AI (XAI) techniques, including attention maps and saliency maps, to provide insights into the decision-making processes of the transformer models, improving interpretability and transparency. These findings establish transformer architectures as strong benchmarks for optical fiber-based temperature sensing and offer promising directions for industrial monitoring and structural health assessment applications.
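For reference, the Mean Absolute Error quoted in the abstract is the average magnitude of the prediction errors. A minimal sketch follows; the temperature values are made up for illustration, and the abstract does not state the unit of the reported MAE, so degrees Celsius is only an assumption here.

```python
def mean_absolute_error(y_true, y_pred):
    """MAE = (1/n) * sum(|y_i - y_hat_i|)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical ground-truth and predicted temperatures within the 0 to 120 range.
true_temps = [20.0, 60.0, 100.0]
pred_temps = [21.0, 58.5, 101.0]
mae = mean_absolute_error(true_temps, pred_temps)
```

Because MAE is expressed in the same unit as the target, an MAE of 1.15 over a 0 to 120 range would correspond to an average error of roughly 1% of the full span.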
cs.MM
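[101] ChartEditor: A Reinforcement Learning Framework for Robust Chart Editing cs.MM | cs.CL
Liangyu Chen, Yichen Xu, Jianzhe Ma, Yuqi Liu, Donglu Yang
TL;DR: The paper presents the ChartEditVista benchmark and the ChartEditor framework. The former is a diverse chart-editing benchmark with 7,964 samples. The latter is a reinforcement-learning-based chart-editing model that uses a rendering reward to enforce code executability and visual fidelity.
Details
Motivation: Current chart-editing benchmarks lack data diversity and assume access to the complete chart code, which does not match real-world use. The authors propose ChartEditVista and ChartEditor to close this gap.
Result: Experiments and human evaluation show that ChartEditor outperforms models of similar and larger scale on chart-editing tasks, and that ChartEditVista provides a robust evaluation setting.
Insight: Automated generation of high-quality data combined with a reinforcement learning framework offers an effective approach to chart editing in realistic settings.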
Abstract: Chart editing reduces manual effort in visualization design. Typical benchmarks are limited in data diversity and assume access to complete chart code, which is seldom available in real-world scenarios. To address this gap, we present ChartEditVista, a comprehensive benchmark consisting of 7,964 samples spanning 31 chart categories. It encompasses diverse editing instructions and covers nearly all editable chart elements. The inputs in ChartEditVista include only the original chart image and natural language editing instructions, without the original chart code. ChartEditVista is generated through a fully automated pipeline that produces, edits, and verifies charts, ensuring high-quality chart editing data. In addition, we introduce two novel fine-grained, rule-based evaluation metrics: the layout metric, which evaluates the position, size, and color of graphical components; and the text metric, which jointly assesses textual content and font styling. Building on ChartEditVista, we present ChartEditor, a model trained using a reinforcement learning framework that incorporates a novel rendering reward to simultaneously enforce code executability and visual fidelity. Through extensive experiments and human evaluations, we demonstrate that ChartEditVista provides a robust evaluation, while ChartEditor consistently outperforms models of similar and larger scale on chart editing tasks.
cs.PL
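[102] SkyEgg: Joint Implementation Selection and Scheduling for Hardware Synthesis using E-graphs cs.PL | cs.CL
Youwei Xiao, Yuyang Zou, Yun Liang
TL;DR: SkyEgg is an e-graph-based hardware synthesis framework that jointly optimizes implementation selection and scheduling using equality saturation and MILP solving, improving performance on FPGAs.
Details
Motivation: Existing high-level synthesis (HLS) tools separate implementation selection from scheduling and therefore cannot fully exploit the heterogeneous architecture of modern FPGAs.
Result: On Xilinx Kintex UltraScale+ FPGAs, SkyEgg achieves an average speedup of 3.01x over Vitis HLS, with improvements of up to 5.22x for complex expressions.
Insight: Modeling the design space uniformly in an e-graph and solving it with MILP offers a new approach to hardware synthesis.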
Abstract: Hardware synthesis from high-level descriptions remains fundamentally limited by the sequential optimization of interdependent design decisions. Current methodologies, including state-of-the-art high-level synthesis (HLS) tools, artificially separate implementation selection from scheduling, leading to suboptimal designs that cannot fully exploit modern FPGA heterogeneous architectures. Implementation selection is typically performed by ad-hoc pattern matching on operations, a process that does not consider the impact on scheduling. Subsequently, scheduling algorithms operate on fixed selection solutions with inaccurate delay estimates, which misses critical optimization opportunities from appropriately configured FPGA blocks like DSP slices. We present SkyEgg, a novel hardware synthesis framework that jointly optimizes implementation selection and scheduling using the e-graph data structure. Our key insight is that both algebraic transformations and hardware implementation choices can be uniformly represented as rewrite rules within an e-graph, modeling the complete design space of implementation candidates to be selected and scheduled together. First, SkyEgg constructs an e-graph from the input program. It then applies both algebraic and implementation rewrites through equality saturation. Finally, it formulates the joint optimization as a mixed-integer linear programming (MILP) problem on the saturated e-graph. We provide both exact MILP solving and an efficient ASAP heuristic for scalable synthesis. Our evaluation on benchmarks from diverse applications targeting Xilinx Kintex UltraScale+ FPGAs demonstrates that SkyEgg achieves an average speedup of 3.01x over Vitis HLS, with improvements up to 5.22x for complex expressions.
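As background for the ASAP heuristic mentioned in the abstract, the sketch below shows textbook as-soon-as-possible scheduling on a fixed dataflow graph: each operation starts once all of its predecessors have finished. This is not SkyEgg's algorithm, which works on a saturated e-graph and also chooses implementations; the operation names and latencies are hypothetical.

```python
from graphlib import TopologicalSorter


def asap_schedule(deps, latency):
    """Earliest start cycle for each operation.

    deps: maps an operation to the set of operations it depends on.
    latency: maps every operation to its latency in cycles.
    """
    start = {}
    # static_order() yields predecessors before the operations that use them.
    for op in TopologicalSorter(deps).static_order():
        start[op] = max(
            (start[p] + latency[p] for p in deps.get(op, ())), default=0
        )
    return start


# Hypothetical expression (a * b) + c with a 3-cycle multiplier.
deps = {"mul": {"a", "b"}, "add": {"mul", "c"}}
latency = {"a": 0, "b": 0, "c": 0, "mul": 3, "add": 1}
schedule = asap_schedule(deps, latency)
makespan = max(schedule[op] + latency[op] for op in schedule)
```

The example makes the coupling described in the abstract visible: if a different implementation of `mul` had a different latency, the start time of `add` and the overall makespan would change, which is why selecting implementations before scheduling can miss better designs.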
cs.IR
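[103] CroPS: Improving Dense Retrieval with Cross-Perspective Positive Samples in Short-Video Search cs.IR | cs.CL
Ao Xie, Jiahui Chen, Quanzhi Zhu, Xiaoze Jiang, Zhiheng Qin
TL;DR: CroPS is a novel data engine for dense retrieval that introduces positive samples from multiple perspectives. It addresses the "filter bubble" caused by training on historical user interaction data and improves short-video search performance.
Details
Motivation: Current dense retrieval systems are trained on historical user interactions, which tends to produce a "filter bubble" that ignores potentially relevant but previously unexposed content. CroPS aims to mitigate this by introducing diverse positive samples from multiple perspectives.
Result: Experiments show that CroPS significantly outperforms baselines in offline evaluation and online A/B tests, improving retrieval performance and reducing query reformulation rates. It is deployed on the Kuaishou search platform.
Insight: Injecting positive samples from multiple perspectives can mitigate the filter bubble and improve the generalization and diversity of dense retrieval models.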
Abstract: Dense retrieval has become a foundational paradigm in modern search systems, especially on short-video platforms. However, most industrial systems adopt a self-reinforcing training pipeline that relies on historically exposed user interactions for supervision. This paradigm inevitably leads to a filter bubble effect, where potentially relevant but previously unseen content is excluded from the training signal, biasing the model toward narrow and conservative retrieval. In this paper, we present CroPS (Cross-Perspective Positive Samples), a novel retrieval data engine designed to alleviate this problem by introducing diverse and semantically meaningful positive examples from multiple perspectives. CroPS enhances training with positive signals derived from user query reformulation behavior (query-level), engagement data in recommendation streams (system-level), and world knowledge synthesized by large language models (knowledge-level). To effectively utilize these heterogeneous signals, we introduce a Hierarchical Label Assignment (HLA) strategy and a corresponding H-InfoNCE loss that together enable fine-grained, relevance-aware optimization. Extensive experiments conducted on Kuaishou Search, a large-scale commercial short-video search platform, demonstrate that CroPS significantly outperforms strong baselines both offline and in live A/B tests, achieving superior retrieval performance and reducing query reformulation rates. CroPS is now fully deployed in Kuaishou Search, serving hundreds of millions of users daily.
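The abstract does not give the form of the H-InfoNCE loss, so it is not reproduced here. As background only, the sketch below shows the standard InfoNCE loss for one positive and several negatives, which the name suggests H-InfoNCE builds on; the similarity scores and the temperature value are hypothetical.

```python
import math


def info_nce(pos_score, neg_scores, temperature=0.1):
    """Standard InfoNCE: -log(exp(s+/t) / sum_j exp(s_j/t)).

    pos_score: similarity between the query and its positive item.
    neg_scores: similarities between the query and negative items.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]


# The loss shrinks as the positive is scored further above the negatives.
loss_good = info_nce(0.9, [0.1, 0.2])
loss_poor = info_nce(0.3, [0.1, 0.2])
```

In this plain form every positive is treated identically. A hierarchical variant would presumably need to distinguish positives by source or relevance level, but how CroPS does so is not stated in the abstract.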
cs.RO
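[104] SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models cs.RO | cs.CL | cs.CV
Senyu Fei, Siyin Wang, Li Ji, Ao Li, Shiduo Zhang
TL;DR: SRPO is a self-referential policy optimization framework that uses the model's own successful trajectories as a reference. It addresses the reliance of VLA models on expert demonstrations and sparse rewards in robotic manipulation, significantly improving training efficiency and performance.
Details
Motivation: Existing VLA models rely on expert demonstrations in robotic manipulation, which causes demonstration bias and limits performance. RL is used as a post-training strategy, but current methods are inefficient because rewards are sparse. SRPO aims to solve these problems with a self-referential approach.
Result: On the LIBERO benchmark, SRPO starts from a 48.9% success rate and reaches a new state-of-the-art 99.2% in only 200 RL steps, a 103% relative improvement. On LIBERO-Plus it achieves a 167% performance improvement.
Insight: Through self-reference and latent representations, SRPO addresses reward sparsity in VLA-RL, showing that the model's own trajectories can serve as an efficient training signal and that latent representations support generalization across environments.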
Abstract: Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations, leading to demonstration bias and limiting performance. Reinforcement learning (RL) is a vital post-training strategy to overcome these limits, yet current VLA-RL methods, including group-based optimization approaches, are crippled by severe reward sparsity. Relying on binary success indicators wastes valuable information in failed trajectories, resulting in low training efficiency. To solve this, we propose Self-Referential Policy Optimization (SRPO), a novel VLA-RL framework. SRPO eliminates the need for external demonstrations or manual reward engineering by leveraging the model’s own successful trajectories, generated within the current training batch, as a self-reference. This allows us to assign a progress-wise reward to failed attempts. A core innovation is the use of latent world representations to measure behavioral progress robustly. Instead of relying on raw pixels or requiring domain-specific fine-tuning, we utilize the compressed, transferable encodings from a world model’s latent space. These representations naturally capture progress patterns across environments, enabling accurate, generalized trajectory comparison. Empirical evaluations on the LIBERO benchmark demonstrate SRPO’s efficiency and effectiveness. Starting from a supervised baseline with 48.9% success, SRPO achieves a new state-of-the-art success rate of 99.2% in just 200 RL steps, representing a 103% relative improvement without any extra supervision. Furthermore, SRPO shows substantial robustness, achieving a 167% performance improvement on the LIBERO-Plus benchmark.
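The 103% figure in the abstract is a relative improvement, not a difference in percentage points. A short check, using only the two LIBERO success rates quoted above (the helper name is illustrative):

```python
def relative_improvement(before, after):
    """Relative change in percent: (after - before) / before * 100."""
    return (after - before) / before * 100.0


# Success rates on LIBERO as reported in the abstract.
gain = relative_improvement(48.9, 99.2)
points = 99.2 - 48.9
```

The absolute gain is about 50.3 percentage points, which is about 102.9% of the 48.9% starting point and rounds to the reported 103%. The same check cannot be repeated for LIBERO-Plus because the abstract gives only the relative figure.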
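[105] Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception cs.RO | cs.CV
Jiashu Yang, Yifan Han, Yucheng Xie, Ning Guo, Wenzhao Lian
TL;DR: EyeVLA is a robotic eyeball vision system that perceives the environment efficiently by combining active visual perception and instruction-driven actions (rotation and zoom) with the open-world understanding of vision-language models.
Details
Motivation: Existing vision models and fixed RGB-D camera systems struggle to combine wide-area coverage with fine-grained detail, which limits their effectiveness in open-world robotic applications.
Result: Experiments show that EyeVLA efficiently handles instructed scenes in real-world environments and acquires more accurate visual information through active actions.
Insight: Combining vision-language models with reinforcement learning lets robots perceive their environment actively and more intelligently, improving downstream task performance.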
Abstract: In embodied AI perception systems, visual perception should be active: the goal is not to passively process static images, but to actively acquire more informative data within pixel and spatial budget constraints. Existing vision models and fixed RGB-D camera systems fundamentally fail to reconcile wide-area coverage with fine-grained detail acquisition, severely limiting their efficacy in open-world robotic applications. To address this issue, we propose EyeVLA, a robotic eyeball for active visual perception that can take proactive actions based on instructions, enabling clear observation of fine-grained target objects and detailed information across a wide spatial extent. EyeVLA discretizes action behaviors into action tokens and integrates them with vision-language models (VLMs) that possess strong open-world understanding capabilities, enabling joint modeling of vision, language, and actions within a single autoregressive sequence. By using the 2D bounding box coordinates to guide the reasoning chain and applying reinforcement learning to refine the viewpoint selection policy, we transfer the open-world scene understanding capability of the VLM to a vision language action (VLA) policy using only minimal real-world data. Experiments show that our system efficiently performs instructed scenes in real-world environments and actively acquires more accurate visual information through instruction-driven actions of rotation and zoom, thereby achieving strong environmental perception capabilities. EyeVLA introduces a novel robotic vision system that leverages detailed and spatially rich, large-scale embodied data, and actively acquires highly informative visual observations for downstream embodied tasks.
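[106] In-N-On: Scaling Egocentric Manipulation with in-the-wild and on-task Data cs.RO | cs.AI | cs.CV
Xiongyi Cai, Ri-Zhao Qiu, Geng Chen, Lai Wei, Isabella Liu
TL;DR: The paper proposes a method for learning manipulation policies from large-scale egocentric data. Human data is collected in two categories, in-the-wild data and on-task data, and used to train Human0, a language-conditioned flow matching policy.
Details
Motivation: Existing approaches mostly use egocentric video for simple pre-training, which does not unlock its full potential, so more effective ways to collect and use the data are needed.
Result: Human0 exhibits new properties learned from human data, including following language instructions, few-shot learning, and improved robustness from on-task data.
Insight: Using human data at scale and by category significantly improves manipulation policy learning, and domain adaptation is key to narrowing the gap between humans and robots.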
Abstract: Egocentric videos are a valuable and scalable data source for learning manipulation policies. However, due to significant data heterogeneity, most existing approaches utilize human data for simple pre-training, which does not unlock its full potential. This paper first provides a scalable recipe for collecting and using egocentric data by categorizing human data into two categories, in-the-wild and on-task, along with a systematic analysis of how to use the data. We first curate a dataset, PHSD, which contains over 1,000 hours of diverse in-the-wild egocentric data and over 20 hours of on-task data directly aligned to the target manipulation tasks. This enables learning a large egocentric language-conditioned flow matching policy, Human0. With domain adaptation techniques, Human0 minimizes the gap between humans and humanoids. Empirically, we show Human0 achieves several novel properties from scaling human data, including language following of instructions from only human data, few-shot learning, and improved robustness using on-task data. Project website: https://xiongyicai.github.io/In-N-On/