Table of Contents

cs.CV

[1] Unveiling the Underwater World: CLIP Perception Model-Guided Underwater Image Enhancement cs.CV

Jiangzhong Cao, Zekai Zeng, Xu Zhang, Huan Zhang, Chunling Fan

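TL;DR: This paper proposes an underwater image enhancement method based on a CLIP perception model, combined with curriculum contrastive regularization, to improve the perceptual quality and content restoration of enhanced images.

Details

Motivation: Existing underwater image enhancement methods often overlook human perception and lack sufficient constraints on the solution space, so enhanced images suffer from degraded quality or inadequate content restoration.

Result: Experiments show that the method outperforms existing methods in visual quality and generalization ability.

Insight: Combining a CLIP perception model with curriculum contrastive learning balances perceptual quality and content restoration more effectively and avoids over-enhancement or under-enhancement.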

Abstract: High-quality underwater images are essential both for machine vision tasks and for viewers who value their aesthetic appeal. However, the quality of underwater images is severely affected by light absorption and scattering. Deep learning-based methods for Underwater Image Enhancement (UIE) have achieved good performance. However, these methods often overlook human perception and lack sufficient constraints within the solution space. Consequently, the enhanced images often suffer from diminished perceptual quality or poor content restoration. To address these issues, we propose a UIE method with a Contrastive Language-Image Pre-Training (CLIP) perception loss module and curriculum contrastive regularization. First, to develop a perception model for underwater images that aligns more closely with human visual perception, the visual semantic feature extraction capability of the CLIP model is leveraged to learn an appropriate prompt pair to map and evaluate the quality of underwater images. This CLIP perception model is then incorporated as a perception loss module into the enhancement network to improve the perceptual quality of enhanced images. Furthermore, the CLIP perception model is integrated with the curriculum contrastive regularization to enhance the constraints imposed on the enhanced images within the CLIP perceptual space, mitigating the risk of both under-enhancement and over-enhancement. Specifically, the CLIP perception model is employed to assess and categorize the learning difficulty level of negatives in the regularization process, ensuring comprehensive and nuanced utilization of distorted images and negatives with varied quality levels. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in terms of visual quality and generalization ability.


[2] SPARC: Concept-Aligned Sparse Autoencoders for Cross-Model and Cross-Modal Interpretability cs.CV | cs.AI

Ali Nasiri-Sarvi, Hassan Rivaz, Mahdi S. Hosseini

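TL;DR: SPARC proposes a new framework that uses a Global TopK sparsity mechanism and a Cross-Reconstruction Loss to learn a unified latent space across models and modalities, significantly improving concept alignment.

Details

Motivation: Existing interpretability methods such as sparse autoencoders produce latent concepts independently for each model, which yields incompatible concept spaces and limits cross-model interpretability. SPARC aims to solve this problem.

Result: On Open Images, SPARC raises Jaccard similarity to 0.80, more than three times that of previous methods.

Insight: SPARC's learning framework not only aligns concepts but also supports applications such as text-guided spatial localization and cross-modal retrieval.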

Abstract: Understanding how different AI models encode the same high-level concepts, such as objects or attributes, remains challenging because each model typically produces its own isolated representation. Existing interpretability methods like Sparse Autoencoders (SAEs) produce latent concepts individually for each model, resulting in incompatible concept spaces and limiting cross-model interpretability. To address this, we introduce SPARC (Sparse Autoencoders for Aligned Representation of Concepts), a new framework that learns a single, unified latent space shared across diverse architectures and modalities (e.g., vision models like DINO, and multimodal models like CLIP). SPARC’s alignment is enforced through two key innovations: (1) a Global TopK sparsity mechanism, ensuring all input streams activate identical latent dimensions for a given concept; and (2) a Cross-Reconstruction Loss, which explicitly encourages semantic consistency between models. On Open Images, SPARC dramatically improves concept alignment, achieving a Jaccard similarity of 0.80, more than tripling the alignment compared to previous methods. SPARC creates a shared sparse latent space where individual dimensions often correspond to similar high-level concepts across models and modalities, enabling direct comparison of how different architectures represent identical concepts without requiring manual alignment or model-specific analysis. As a consequence of this aligned representation, SPARC also enables practical applications such as text-guided spatial localization in vision-only models and cross-model/cross-modal retrieval. Code and models are available at https://github.com/AtlasAnalyticsLab/SPARC.
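The Global TopK idea and the Jaccard alignment score mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the shared latent indices are chosen, so summing pre-activations across streams is an assumption made here for illustration, and the function names `global_topk_indices`, `apply_mask`, and `jaccard` are hypothetical rather than taken from the SPARC code base.

```python
from typing import Dict, List, Set


def global_topk_indices(streams: Dict[str, List[float]], k: int) -> List[int]:
    """Pick ONE shared set of k latent indices for all streams.

    Assumption: scores are summed over streams before ranking; the paper
    may aggregate differently.
    """
    n = len(next(iter(streams.values())))
    totals = [sum(acts[i] for acts in streams.values()) for i in range(n)]
    ranked = sorted(range(n), key=lambda i: totals[i], reverse=True)
    return sorted(ranked[:k])


def apply_mask(acts: List[float], keep: List[int]) -> List[float]:
    """Zero every latent that is not in the shared index set."""
    keep_set = set(keep)
    return [a if i in keep_set else 0.0 for i, a in enumerate(acts)]


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity between two sets of active latent indices."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    streams = {"dino": [0.9, 0.1, 0.4, 0.0], "clip": [0.2, 0.8, 0.3, 0.1]}
    keep = global_topk_indices(streams, k=2)
    sparse = {name: apply_mask(acts, keep) for name, acts in streams.items()}
    print(keep, sparse)
```

Because every stream is masked with the same index set, the active sets coincide by construction, which is the property the Jaccard score measures; a per-model TopK would give each stream its own, possibly disjoint, set.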


[3] A Probabilistic Approach to Uncertainty Quantification Leveraging 3D Geometry cs.CV | cs.AI

Rushil Desai, Frederik Warburg, Trevor Darrell, Marissa Ramirez de Chanlatte

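TL;DR: This paper proposes BayesSDF, a probabilistic framework for uncertainty quantification in neural implicit SDF models, addressing the shortcomings of existing methods in geometric consistency and computational efficiency.

Details

Motivation: Scientific simulation applications (such as modeling fluid flow through forests) need precise 3D geometry and uncertainty quantification, but existing methods usually neglect geometric consistency, leading to poorly calibrated uncertainty.

Result: On synthetic and real-world datasets, BayesSDF outperforms existing methods in calibration and geometric consistency, providing reliable uncertainty measures for downstream tasks.

Insight: The continuity and differentiability of SDFs make them better suited to physical modeling than radiance field models, and geometry-aware uncertainty quantification is essential for scientific simulation and robotic decision-making.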

Abstract: Quantifying uncertainty in neural implicit 3D representations, particularly those utilizing Signed Distance Functions (SDFs), remains a substantial challenge due to computational inefficiencies, scalability issues, and geometric inconsistencies. Existing methods typically neglect direct geometric integration, leading to poorly calibrated uncertainty maps. We introduce BayesSDF, a novel probabilistic framework for uncertainty quantification in neural implicit SDF models, motivated by scientific simulation applications in 3D environments, such as modeling fluid flow through forests, where precise surface geometry and awareness of surface geometric uncertainty are essential. Unlike radiance-based models such as NeRF or 3D Gaussian splatting, which lack explicit surface formulations, SDFs define continuous and differentiable geometry, making them better suited for physical modeling and analysis. BayesSDF leverages a Laplace approximation to quantify local surface instability via Hessian-based metrics, enabling computationally efficient, surface-aware uncertainty estimation. Our method shows that uncertainty predictions correspond closely with poorly reconstructed geometry, providing actionable confidence measures for downstream use. Extensive evaluations on synthetic and real-world datasets demonstrate that BayesSDF outperforms existing methods in both calibration and geometric consistency, establishing a strong foundation for uncertainty-aware 3D scene reconstruction, simulation, and robotic decision-making.


[4] LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance cs.CV | cs.AI

Zhang Li, Biao Yang, Qiang Liu, Shuo Zhang, Zhiyin Ma

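TL;DR: LIRA combines a Semantic-Enhanced Feature Extractor with Interleaved Local Visual Coupling to address the limitations of large multi-modal models in segmentation and visual comprehension, improving segmentation accuracy and reducing hallucinations.

Details

Motivation: Large multi-modal models (LMMs) suffer from inaccurate segmentation and hallucinated comprehension, mainly because of weak visual comprehension and a lack of fine-grained perception.

Result: LIRA achieves state-of-the-art performance on segmentation and comprehension tasks.

Insight: Segmentation precision is positively correlated with latent semantic relations, and fine-grained supervision effectively reduces hallucinations.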

Abstract: While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These challenges stem primarily from constraints in weak visual comprehension and a lack of fine-grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) Semantic-Enhanced Feature Extractor (SEFE) improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; (2) Interleaved Local Visual Coupling (ILVC) autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine-grained supervision to mitigate hallucinations. Furthermore, we find that the precision of object segmentation is positively correlated with the latent related semantics of the token. To quantify this relationship and the model’s potential semantic inferring ability, we introduce the Attributes Evaluation (AttrEval) dataset. Our experiments show that LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks. Code will be available at https://github.com/echo840/LIRA.


[5] Advancing Offline Handwritten Text Recognition: A Systematic Review of Data Augmentation and Generation Techniques cs.CV | cs.AI | cs.LG

Yassin Hussein Rassul, Aram M. Ahmed, Polla Fattah, Bryar A. Hassan, Arwaa W. Abdulkareem

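TL;DR: This paper surveys data augmentation and generation techniques for offline handwritten text recognition, discusses the strengths and weaknesses of traditional methods and deep learning methods (such as GANs, diffusion models, and transformer-based approaches), and proposes future research directions.

Details

Motivation: Offline handwritten text recognition (HTR) is widely used in areas such as historical document digitization, but the scarcity of annotated data limits its performance, especially for low-resource languages and complex scripts. The paper aims to address this problem.

Result: The survey summarizes the limitations of existing datasets, points out shortcomings in evaluation metrics, and offers suggestions for improving the diversity of generative models.

Insight: Generating realistic and diverse handwriting samples is essential to improving HTR performance; future research should focus on multilingual support and generation for complex scripts.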

Abstract: Offline Handwritten Text Recognition (HTR) systems play a crucial role in applications such as historical document digitization, automatic form processing, and biometric authentication. However, their performance is often hindered by the limited availability of annotated training data, particularly for low-resource languages and complex scripts. This paper presents a comprehensive survey of offline handwritten data augmentation and generation techniques designed to improve the accuracy and robustness of HTR systems. We systematically examine traditional augmentation methods alongside recent advances in deep learning, including Generative Adversarial Networks (GANs), diffusion models, and transformer-based approaches. Furthermore, we explore the challenges associated with generating diverse and realistic handwriting samples, particularly in preserving script authenticity and addressing data scarcity. This survey follows the PRISMA methodology, ensuring a structured and rigorous selection process. Our analysis began with 1,302 primary studies, which were filtered down to 848 after removing duplicates, drawing from key academic sources such as IEEE Digital Library, Springer Link, Science Direct, and ACM Digital Library. By evaluating existing datasets, assessment metrics, and state-of-the-art methodologies, this survey identifies key research gaps and proposes future directions to advance the field of handwritten text generation across diverse linguistic and stylistic landscapes.


[6] When Trackers Date Fish: A Benchmark and Framework for Underwater Multiple Fish Tracking cs.CV

Weiran Li, Yeqiang Liu, Qiannan Guo, Yijie Wei, Hwa Liang Leo

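TL;DR: The paper presents MFT25, the first dataset dedicated to underwater multiple fish tracking, and develops SU-T, a tracking framework based on an Unscented Kalman Filter and FishIoU matching, which significantly improves underwater fish tracking performance.

Details

Motivation: Terrestrial multiple object tracking is mature, but underwater scenes remain underexplored because of complex environments and distinctive fish motion patterns, despite their importance to marine ecology and aquaculture.

Result: SU-T reaches 34.1 HOTA and 44.6 IDF1 on MFT25, the leading performance.

Insight: Underwater fish tracking differs markedly from terrestrial object tracking and calls for methods designed around fish morphology and motion patterns.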

Abstract: Multiple object tracking (MOT) technology has made significant progress in terrestrial applications, but underwater tracking scenarios remain underexplored despite their importance to marine ecology and aquaculture. We present Multiple Fish Tracking Dataset 2025 (MFT25), the first comprehensive dataset specifically designed for underwater multiple fish tracking, featuring 15 diverse video sequences with 408,578 meticulously annotated bounding boxes across 48,066 frames. Our dataset captures various underwater environments, fish species, and challenging conditions including occlusions, similar appearances, and erratic motion patterns. Additionally, we introduce Scale-aware and Unscented Tracker (SU-T), a specialized tracking framework featuring an Unscented Kalman Filter (UKF) optimized for non-linear fish swimming patterns and a novel Fish-Intersection-over-Union (FishIoU) matching that accounts for the unique morphological characteristics of aquatic species. Extensive experiments demonstrate that our SU-T baseline achieves state-of-the-art performance on MFT25, with 34.1 HOTA and 44.6 IDF1, while revealing fundamental differences between fish tracking and terrestrial object tracking scenarios. MFT25 establishes a robust foundation for advancing research in underwater tracking systems with important applications in marine biology, aquaculture monitoring, and ecological conservation. The dataset and codes are released at https://vranlee.github.io/SU-T/.


[7] SImpHAR: Advancing impedance-based human activity recognition using 3D simulation and text-to-motion models cs.CV | cs.AI

Lala Shakti Swarup Ray, Mengxi Liu, Deepika Gurung, Bo Zhou, Sungho Suh

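TL;DR: SImpHAR proposes a new framework for human activity recognition based on bio-impedance sensing. It generates synthetic data through 3D simulation and text-to-motion models, addressing the scarcity of labeled data, and significantly outperforms existing methods.

Details

Motivation: Bio-impedance sensing has unique advantages for fine-grained motion capture, but the lack of labeled data limits its use. SImpHAR aims to overcome this limitation with simulated data and modular training.

Result: SImpHAR performs well on the ImpAct dataset and two public benchmarks, with gains of up to 22.3% in accuracy and 21.8% in macro F1 score.

Insight: Simulation-driven data augmentation and modular training show considerable promise for impedance-based human activity recognition and offer a new approach for data-scarce domains.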

Abstract: Human Activity Recognition (HAR) with wearable sensors is essential for applications in healthcare, fitness, and human-computer interaction. Bio-impedance sensing offers unique advantages for fine-grained motion capture but remains underutilized due to the scarcity of labeled data. We introduce SImpHAR, a novel framework addressing this limitation through two core contributions. First, we propose a simulation pipeline that generates realistic bio-impedance signals from 3D human meshes using shortest-path estimation, soft-body physics, and text-to-motion generation, serving as a digital twin for data augmentation. Second, we design a two-stage training strategy with a decoupled approach that enables broader activity coverage without requiring label-aligned synthetic data. We evaluate SImpHAR on our collected ImpAct dataset and two public benchmarks, showing consistent improvements over state-of-the-art methods, with gains of up to 22.3% and 21.8% in terms of accuracy and macro F1 score, respectively. Our results highlight the promise of simulation-driven augmentation and modular training for impedance-based HAR.


[8] Hierarchical Multi-Stage Transformer Architecture for Context-Aware Temporal Action Localization cs.CV

Hayat Ullah, Arslan Munir, Oliver Nina

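TL;DR: This paper proposes PCL-Former, a hierarchical multi-stage transformer architecture for context-aware temporal action localization. Three dedicated transformer modules handle candidate segment identification, action classification, and temporal boundary prediction, respectively, which significantly improves performance.

Details

Motivation: Inspired by the success of transformers and multi-stage architectures in video recognition and object detection, the paper explores their potential for temporal action localization.

Result: Experiments on THUMOS14, ActivityNet-1.3, and HACS show that PCL-Former outperforms the current best methods by 2.8%, 1.2%, and 4.8%, respectively.

Insight: With modular design and dedicated loss functions, transformer architectures can significantly improve the accuracy and generalization of temporal action localization in multi-stage tasks.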

Abstract: Inspired by the recent success of transformers and multi-stage architectures in the video recognition and object detection domains, we thoroughly explore the rich spatio-temporal properties of transformers within a multi-stage architecture paradigm for the temporal action localization (TAL) task. This exploration led to the development of a hierarchical multi-stage transformer architecture called PCL-Former, where each subtask is handled by a dedicated transformer module with a specialized loss function. Specifically, the Proposal-Former identifies candidate segments in an untrimmed video that may contain actions, the Classification-Former classifies the action categories within those segments, and the Localization-Former precisely predicts the temporal boundaries (i.e., start and end) of the action instances. To evaluate the performance of our method, we have conducted extensive experiments on three challenging benchmark datasets: THUMOS14, ActivityNet-1.3, and HACS Segments. We also conducted detailed ablation experiments to assess the impact of each individual module of our PCL-Former. The obtained quantitative results validate the effectiveness of the proposed PCL-Former, outperforming state-of-the-art TAL approaches by 2.8%, 1.2%, and 4.8% on THUMOS14, ActivityNet-1.3, and HACS datasets, respectively.


[9] THOR: Thermal-guided Hand-Object Reasoning via Adaptive Vision Sampling cs.CV

Soroush Shahi, Farzad Shahabi, Rama Nabulsi, Glenn Fernandes, Aggelos Katsaggelos

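TL;DR: THOR proposes a thermal-guided adaptive RGB frame sampling method that uses thermal data to adjust the RGB sampling rate dynamically, reducing power consumption and data volume while maintaining accurate hand-object activity recognition.

Details

Motivation: Continuously processing RGB images from wearable cameras entails high power consumption, large data volumes, and privacy concerns. THOR uses thermal sensing to regulate RGB sampling intelligently, enabling more efficient real-time activity monitoring.

Result: Experiments show that, using only 3% of the original RGB data, THOR captures all activity segments and reaches a hand-related activity recognition F1 score of 95%, comparable to using all RGB data (94%).

Insight: Thermal sensing can serve as an efficient trigger that greatly reduces the data processing burden on wearable devices while maintaining accurate activity recognition, offering a practical path to long-term use of wearable cameras.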

Abstract: Wearable cameras are increasingly used as an observational and interventional tool for human behaviors by providing detailed visual data of hand-related activities. This data can be leveraged to facilitate memory recall for logging of behavior or timely interventions aimed at improving health. However, continuous processing of RGB images from these cameras consumes significant power impacting battery lifetime, generates a large volume of unnecessary video data for post-processing, raises privacy concerns, and requires substantial computational resources for real-time analysis. We introduce THOR, a real-time adaptive spatio-temporal RGB frame sampling method that leverages thermal sensing to capture hand-object patches and classify them in real-time. We use low-resolution thermal camera data to identify moments when a person switches from one hand-related activity to another, and adjust the RGB frame sampling rate by increasing it during activity transitions and reducing it during periods of sustained activity. Additionally, we use the thermal cues from the hand to localize the region of interest (i.e., the hand-object interaction) in each RGB frame, allowing the system to crop and process only the necessary part of the image for activity recognition. We develop a wearable device to validate our method through an in-the-wild study with 14 participants and over 30 activities, and further evaluate it on Ego4D (923 participants across 9 countries, totaling 3,670 hours of video). Our results show that using only 3% of the original RGB video data, our method captures all the activity segments, and achieves hand-related activity recognition F1-score (95%) comparable to using the entire RGB video (94%). Our work provides a more practical path for the longitudinal use of wearable cameras to monitor hand-related activities and health-risk behaviors in real time.
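The rate-switching behaviour described in the abstract (sample RGB faster at activity transitions, slower during sustained activity) can be sketched as follows. The abstract does not describe THOR's actual transition detector, so a mean absolute difference between consecutive low-resolution thermal frames stands in for it here; the threshold and the two frame rates are hypothetical values chosen only for illustration.

```python
from typing import List, Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Average per-pixel change between two flattened thermal frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def choose_rgb_rates(
    thermal_frames: List[Sequence[float]],
    threshold: float = 1.0,   # hypothetical change threshold (degrees)
    low_fps: float = 1.0,     # hypothetical rate during sustained activity
    high_fps: float = 15.0,   # hypothetical rate during a transition
) -> List[float]:
    """Return one RGB sampling rate per thermal frame."""
    rates = [low_fps]  # the first frame has no predecessor to compare with
    for prev, cur in zip(thermal_frames, thermal_frames[1:]):
        transition = mean_abs_diff(prev, cur) > threshold
        rates.append(high_fps if transition else low_fps)
    return rates


if __name__ == "__main__":
    frames = [[30.0, 30.0], [30.0, 30.0], [36.0, 35.0], [36.0, 35.0]]
    print(choose_rgb_rates(frames))
```

Only the thermal stream is inspected continuously in this sketch; the RGB camera is driven by the returned rates, which is where the power and data savings described in the abstract would come from.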


[10] EA: An Event Autoencoder for High-Speed Vision Sensing cs.CV | cs.AI

Riadul Islam, Joey Mulé, Dhandeep Challagundla, Shahmir Rizvi, Sean Carson

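TL;DR: This paper proposes a new Event Autoencoder (EA) that efficiently compresses and reconstructs event camera data while preserving key spatio-temporal features, addressing object detection on sparse, noisy event streams in highly dynamic environments. The method outperforms existing techniques in both performance and efficiency.

Details

Motivation: Traditional frame-based vision systems suffer from motion blur, high latency, and data redundancy in dynamic environments. Event cameras capture brightness changes asynchronously, but their sparse, noisy event streams make object detection challenging.

Result: On the SEFD dataset, EA achieves accuracy comparable to YOLO-v4 with up to 35.5× fewer parameters; on embedded devices it reaches high frame rates (8 to 44.8 FPS), with up to 87.84× better FPS than the state of the art.

Insight: The event autoencoder offers an efficient solution for low-power, highly dynamic edge computing scenarios, balancing accuracy and efficiency.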

Abstract: High-speed vision sensing is essential for real-time perception in applications such as robotics, autonomous vehicles, and industrial automation. Traditional frame-based vision systems suffer from motion blur, high latency, and redundant data processing, limiting their performance in dynamic environments. Event cameras, which capture asynchronous brightness changes at the pixel level, offer a promising alternative but pose challenges in object detection due to sparse and noisy event streams. To address this, we propose an event autoencoder architecture that efficiently compresses and reconstructs event data while preserving critical spatial and temporal features. The proposed model employs convolutional encoding and incorporates adaptive threshold selection and a lightweight classifier to enhance recognition accuracy while reducing computational complexity. Experimental results on the existing Smart Event Face Dataset (SEFD) demonstrate that our approach achieves comparable accuracy to the YOLO-v4 model while utilizing up to $35.5\times$ fewer parameters. Implementations on embedded platforms, including Raspberry Pi 4B and NVIDIA Jetson Nano, show high frame rates ranging from 8 FPS up to 44.8 FPS. The proposed classifier exhibits up to 87.84x better FPS than the state-of-the-art and significantly improves event-based vision performance, making it ideal for low-power, high-speed applications in real-time edge computing.


[11] Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning cs.CV | cs.AI | cs.CL

Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius

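TL;DR: Video-RTS combines efficient RL training with a video-adaptive test-time scaling strategy, significantly improving video reasoning with high data efficiency and without a resource-intensive SFT step.

Details

Motivation: Conventional RL-based video reasoning methods rely on large-scale supervised fine-tuning and long chain-of-thought annotations, which makes them costly and hard to scale.

Result: Across multiple video reasoning benchmarks, Video-RTS improves average accuracy by 2.4% while using only 3.6% of the training samples. It gains 4.2% on Video-Holmes and 2.6% on MMVU.

Insight: Pure RL training and the adaptive TTS strategy are complementary, offering a new approach to efficient video reasoning.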

Abstract: Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and finetuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach to improve video reasoning capability with drastically improved data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Based on observations about the data scaling of RL samples, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by an average of 2.4% in accuracy using only 3.6% training samples. For example, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark, and a 2.6% improvement on MMVU. Notably, our pure RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS’s strong reasoning performance.
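The sparse-to-dense test-time scaling loop in the abstract (add frames iteratively until the outputs are consistent) can be sketched as below. The doubling schedule, the requirement of unanimous agreement, and the majority-vote fallback at the frame cap are all assumptions; `answer_fn` is a hypothetical stand-in for a call to the video reasoning model with a given frame count and sampling seed.

```python
from collections import Counter
from typing import Callable, Tuple


def sparse_to_dense_answer(
    answer_fn: Callable[[int, int], str],  # (num_frames, seed) -> answer
    start_frames: int = 8,
    max_frames: int = 32,
    samples: int = 3,
) -> Tuple[str, int]:
    """Return (answer, frames_used), densifying frames only when needed."""
    n = start_frames
    while True:
        answers = [answer_fn(n, seed) for seed in range(samples)]
        top, count = Counter(answers).most_common(1)[0]
        # Stop when all samples agree, or when the frame budget is exhausted
        # (in which case the most common answer is returned).
        if count == samples or n >= max_frames:
            return top, n
        n = min(n * 2, max_frames)


if __name__ == "__main__":
    def toy_model(num_frames: int, seed: int) -> str:
        # Toy behaviour: answers only stabilise once 16 frames are visible.
        return "B" if num_frames >= 16 else "AB"[seed % 2]

    print(sparse_to_dense_answer(toy_model))
```

Easy questions exit at the sparse setting, so the extra compute of dense frame sampling is spent only on inputs where the sampled outputs disagree.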


[12] Mask6D: Masked Pose Priors For 6D Object Pose Estimation cs.CV

Yuechen Xie, Haobo Jiang, Jin Xie

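TL;DR: Mask6D introduces a new pre-training strategy for 6D object pose estimation that combines 2D-3D correspondence maps and visible mask maps, effectively improving pose estimation in occluded or cluttered scenes.

Details

Motivation: Current 6D pose estimation methods based on monocular RGB images perform poorly when the target is occluded or the scene is cluttered, because 2D feature backbones struggle to extract discriminative pose features. Mask6D aims to solve this by introducing additional modal information.

Result: Experiments show that Mask6D outperforms existing end-to-end methods on 6D pose estimation, especially in occluded and cluttered scenes.

Insight: Introducing pose-aware multi-modal information (such as 2D-3D correspondence maps) can significantly improve the robustness of pose estimation, especially in complex scenes.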

Abstract: Robust 6D object pose estimation in cluttered or occluded conditions using monocular RGB images remains a challenging task. One reason is that current pose estimation networks struggle to extract discriminative, pose-aware features using 2D feature backbones, especially when the available RGB information is limited due to target occlusion in cluttered scenes. To mitigate this, we propose a novel pose estimation-specific pre-training strategy named Mask6D. Our approach incorporates pose-aware 2D-3D correspondence maps and visible mask maps as additional modal information, which are combined with RGB images for reconstruction-based model pre-training. Essentially, this 2D-3D correspondence maps a transformed 3D object model to 2D pixels, reflecting the pose information of the target in the camera coordinate system. Meanwhile, the integrated visible mask map can effectively guide our model to disregard cluttered background information. In addition, an object-focused pre-training loss function is designed to further help our network remove background interference. Finally, we fine-tune our pre-trained pose prior-aware network via a conventional pose training strategy to achieve reliable pose prediction. Extensive experiments verify that our method outperforms previous end-to-end pose estimation methods.


[13] Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection cs.CV

Yupeng Hu, Changxing Ding, Chang Sun, Shaoli Huang, Xiangmin Xu

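TL;DR: This paper proposes a Bilateral Collaboration framework (BC-HOI) that achieves open vocabulary human-object interaction detection through Attention Bias Guidance (ABG) and LLM-based Supervision Guidance (LSG), addressing the overly coarse-grained visual features of existing methods.

Details

Motivation: Open vocabulary human-object interaction detection must detect all triplets of interest (human, verb, object) in an image, but existing methods rely on visual features from large vision-language models that are typically holistic and coarse-grained, which conflicts with the fine-grained needs of detection.

Result: On the HICO-DET and V-COCO benchmarks, BC-HOI performs strongly in both open vocabulary and closed settings.

Insight: A collaboration mechanism between vision-language models and large language models can effectively improve open vocabulary interaction detection, particularly in generating fine-grained features.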

Abstract: Open vocabulary Human-Object Interaction (HOI) detection is a challenging task that detects all <human, verb, object> triplets of interest in an image, even those that are not pre-defined in the training set. Existing approaches typically rely on output features generated by large Vision-Language Models (VLMs) to enhance the generalization ability of interaction representations. However, the visual features produced by VLMs are holistic and coarse-grained, which contradicts the nature of detection tasks. To address this issue, we propose a novel Bilateral Collaboration framework for open vocabulary HOI detection (BC-HOI). This framework includes an Attention Bias Guidance (ABG) component, which guides the VLM to produce fine-grained instance-level interaction features according to the attention bias provided by the HOI detector. It also includes a Large Language Model (LLM)-based Supervision Guidance (LSG) component, which provides fine-grained token-level supervision for the HOI detector by the LLM component of the VLM. LSG enhances the ability of ABG to generate high-quality attention bias. We conduct extensive experiments on two popular benchmarks, HICO-DET and V-COCO, consistently achieving superior performance in the open vocabulary and closed settings. The code will be released on GitHub.


[14] What Demands Attention in Urban Street Scenes? From Scene Understanding towards Road Safety: A Survey of Vision-driven Datasets and Studies cs.CV

Yaoqi Huang, Julie Stephany Berrio, Mao Shan, Stewart Worrall

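TL;DR: This paper systematically summarizes the visual elements that demand attention in traffic scenes, proposes a new taxonomy, and analyzes 35 vision-driven tasks and 73 datasets, with the aim of advancing road safety research.

Details

Motivation: To apply advances in computer vision to road safety, the paper integrates key elements from multiple domains and provides a unified taxonomy and analysis that help researchers select resources more efficiently.

Result: The paper summarizes the weaknesses of existing research, stresses the need for standards unification and resource optimization, and points out future research directions.

Insight: The integrated taxonomy and cross-domain analysis give road safety research a unified framework that helps fill research gaps and optimize resource allocation.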

Abstract: Advances in vision-based sensors and computer vision algorithms have significantly improved the analysis and understanding of traffic scenarios. To facilitate the use of these improvements for road safety, this survey systematically categorizes the critical elements that demand attention in traffic scenarios and comprehensively analyzes available vision-driven tasks and datasets. Compared to existing surveys that focus on isolated domains, our taxonomy categorizes attention-worthy traffic entities into two main groups that are anomalies and normal but critical entities, integrating ten categories and twenty subclasses. It establishes connections between inherently related fields and provides a unified analytical framework. Our survey highlights the analysis of 35 vision-driven tasks and comprehensive examinations and visualizations of 73 available datasets based on the proposed taxonomy. The cross-domain investigation covers the pros and cons of each benchmark with the aim of providing information on standards unification and resource optimization. Our article concludes with a systematic discussion of the existing weaknesses, underlining the potential effects and promising solutions from various perspectives. The integrated taxonomy, comprehensive analysis, and recapitulatory tables serve as valuable contributions to this rapidly evolving field by providing researchers with a holistic overview, guiding strategic resource selection, and highlighting critical research gaps.


[15] FIFA: Unified Faithfulness Evaluation Framework for Text-to-Video and Video-to-Text Generation cs.CV | cs.CL | cs.GR

Liqiang Jing, Viet Lai, Seunghyun Yoon, Trung Bui, Xinya Du

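TL;DR: This paper proposes FIFA, a framework for unified faithfulness evaluation of text-to-video and video-to-text generation, addressing the limitation that existing methods target a single task and cannot assess hallucinated content in open-ended responses.

Details

Motivation: Video multimodal large language models (VideoMLLMs) perform well on video-to-text and text-to-video tasks but often hallucinate, generating content that contradicts the visual input. Existing evaluation methods target only one task and cannot assess hallucinations in open-ended responses.

Result: Experiments show that FIFA aligns more closely with human judgment than existing evaluation methods, and that Post-Correction effectively improves factual consistency in both text and video generation.

Insight: FIFA provides a unified evaluation and correction method for multimodal generation tasks, highlighting the importance of semantic dependency modeling and tool-assisted correction.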

Abstract: Video Multimodal Large Language Models (VideoMLLMs) have achieved remarkable progress in both Video-to-Text and Text-to-Video tasks. However, they often suffer from hallucinations, generating content that contradicts the visual input. Existing evaluation methods are limited to one task (e.g., V2T) and also fail to assess hallucinations in open-ended, free-form responses. To address this gap, we propose FIFA, a unified FaIthFulness evAluation framework that extracts comprehensive descriptive facts, models their semantic dependencies via a Spatio-Temporal Semantic Dependency Graph, and verifies them using VideoQA models. We further introduce Post-Correction, a tool-based correction framework that revises hallucinated content. Extensive experiments demonstrate that FIFA aligns more closely with human judgment than existing evaluation methods, and that Post-Correction effectively improves factual consistency in both text and video generation.


[16] Speak2Sign3D: A Multi-modal Pipeline for English Speech to American Sign Language Animation cs.CV

Kazi Mahathir Rahman, Naveed Imtiaz Nafis, Md. Farhan Sadik, Mohammad Al Rafi, Mehedi Hasan Shahed

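TL;DR: This paper proposes Speak2Sign3D, a multi-modal pipeline that converts English speech into smooth 3D American Sign Language animation by combining speech recognition, text-to-gloss translation, and animation generation, addressing the lack of work on the reverse translation direction.

Details

Motivation: The work aims to help deaf and hard-of-hearing people communicate more easily, fills the gap in existing research on speech-to-sign-language animation, and tackles the technical challenges of multi-step conversion.

Result: The system achieves high translation accuracy, with BLEU scores of 0.7714 and 0.8923, and generates smooth 3D sign language animation.

Insight: Combining modalities (speech, text, animation) and using high-quality datasets are key to improving sign language translation systems; keypoint-based animation yields more natural motion for sign generation.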

Abstract: Helping deaf and hard-of-hearing people communicate more easily is the main goal of Automatic Sign Language Translation. Although most past research has focused on turning sign language into text, doing the reverse, turning spoken English into sign language animations, has been largely overlooked. That’s because it involves multiple steps, such as understanding speech, translating it into sign-friendly grammar, and generating natural human motion. In this work, we introduce a complete pipeline that converts English speech into smooth, realistic 3D sign language animations. Our system starts with Whisper to translate spoken English into text. Then, we use a MarianMT machine translation model to translate that text into American Sign Language (ASL) gloss, a simplified version of sign language that captures meaning without grammar. This model performs well, reaching BLEU scores of 0.7714 and 0.8923. To make the gloss translation more accurate, we also use word embeddings such as Word2Vec and FastText to understand word meanings. Finally, we animate the translated gloss using a 3D keypoint-based motion system trained on Sign3D-WLASL, a dataset we created by extracting body, hand, and face key points from real ASL videos in the WLASL dataset. To support the gloss translation stage, we also built a new dataset called BookGlossCorpus-CG, which turns everyday English sentences from the BookCorpus dataset into ASL gloss using grammar rules. Our system stitches everything together by smoothly interpolating between signs to create natural, continuous animations. Unlike previous works like How2Sign and Phoenix-2014T that focus on recognition or use only one type of data, our pipeline brings together audio, text, and motion in a single framework that goes all the way from spoken English to lifelike 3D sign language animation.
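The final stitching step, interpolating between consecutive signs, can be illustrated with a short sketch. The abstract only says the system interpolates smoothly, so plain linear interpolation between the last pose of one sign and the first pose of the next is an assumption, and the names `Pose`, `blend`, and `stitch` are hypothetical.

```python
from typing import List

Pose = List[float]  # flattened keypoint coordinates for one animation frame


def blend(a: Pose, b: Pose, steps: int) -> List[Pose]:
    """Return `steps` evenly spaced in-between poses, endpoints excluded."""
    return [
        [x + (y - x) * t / (steps + 1) for x, y in zip(a, b)]
        for t in range(1, steps + 1)
    ]


def stitch(signs: List[List[Pose]], steps: int = 2) -> List[Pose]:
    """Concatenate per-sign clips, inserting transition frames between them."""
    out: List[Pose] = []
    for clip in signs:
        if out:  # bridge from the previous sign's last pose
            out.extend(blend(out[-1], clip[0], steps))
        out.extend(clip)
    return out


if __name__ == "__main__":
    hello = [[0.0, 0.0]]
    world = [[3.0, 6.0]]
    for frame in stitch([hello, world], steps=2):
        print(frame)
```

A production system would likely use a smoother easing curve or a learned transition model; the linear version is shown only because it makes the role of the transition frames easy to see.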


[17] ILNet: Trajectory Prediction with Inverse Learning Attention for Enhancing Intention Capture cs.CV

Mingjin Zeng, Nan Ouyang, Wenkang Wan, Lei Ao, Qing Cai

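TL;DR: ILNet proposes a multi-agent trajectory prediction method that combines Inverse Learning attention with a Dynamic Anchor Selection module, significantly improving intention capture and prediction accuracy and achieving state-of-the-art performance on multiple datasets.

Details

Motivation: Existing methods lack dynamic modeling of spatio-temporal coordination when capturing interaction intentions, and fixed-anchor strategies adapt poorly to different future environments. Inspired by human driving behavior, the authors propose inverse learning attention and dynamic anchor selection to improve trajectory prediction.

Result: ILNet achieves state-of-the-art performance on the INTERACTION and Argoverse datasets, with higher accuracy and more multimodal trajectory distributions in complex interaction scenarios in particular.

Insight: Dynamic modeling of interaction intentions and flexible anchor selection are key to better trajectory prediction; the inverse learning approach effectively strengthens the model's ability to capture intentions.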

Abstract: Trajectory prediction for multi-agent interaction scenarios is a crucial challenge. Most advanced methods model agent interactions by efficiently factorized attention based on the temporal and agent axes. However, this static and forward modeling lacks explicit interactive spatio-temporal coordination, capturing only obvious and immediate behavioral intentions. Alternatively, the modern trajectory prediction framework refines the successive predictions by a fixed-anchor selection strategy, which is difficult to adapt to different future environments. It is acknowledged that human drivers dynamically adjust initial driving decisions based on further assumptions about the intentions of surrounding vehicles. Motivated by human driving behaviors, this paper proposes ILNet, a multi-agent trajectory prediction method with Inverse Learning (IL) attention and a Dynamic Anchor Selection (DAS) module. IL Attention employs an inverse learning paradigm to model interactions at neighboring moments, introducing proposed intentions to dynamically encode the spatio-temporal coordination of interactions, thereby enhancing the model’s ability to capture complex interaction patterns. Then, the learnable DAS module is proposed to extract multiple trajectory change keypoints as anchors in parallel with almost no increase in parameters. Experimental results show that ILNet achieves state-of-the-art performance on the INTERACTION and Argoverse motion forecasting datasets. In particular, in challenging interaction scenarios, ILNet achieves higher accuracy and more multimodal distributions of trajectories with fewer parameters. Our code is available at https://github.com/mjZeng11/ILNet.


[18] A model-agnostic active learning approach for animal detection from camera traps cs.CVPDF

Thi Thu Thuy Nguyen, Duc Thanh Nguyen

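TL;DR: The paper proposes a model-agnostic active learning method for animal detection in camera-trap data; by combining uncertainty and diversity measures, it substantially reduces the amount of labelled data required.

Details

Motivation: Camera traps capture huge volumes of wildlife data, making labelling and model training expensive. Existing active learning methods require full access to the model, which limits their applicability.

Result: Experiments show that training on only 30% of the data, selected by the method, matches or exceeds the detection performance obtained with the full training set.

Insight: Model-agnostic active learning cuts annotation cost while preserving performance, offering an efficient solution for wildlife monitoring.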

Abstract: Smart data selection is becoming increasingly important in data-driven machine learning. Active learning offers a promising solution by allowing machine learning models to be effectively trained with optimal data including the most informative samples from large datasets. Wildlife data captured by camera traps are excessive in volume, requiring tremendous effort in labelling data and training animal detection models. Therefore, applying active learning to optimise the amount of labelled data would be a great aid in enabling automated wildlife monitoring and conservation. However, existing active learning techniques require that a machine learning model (i.e., an object detector) be fully accessible, limiting the applicability of the techniques. In this paper, we propose a model-agnostic active learning approach for detection of animals captured by camera traps. Our approach integrates uncertainty and diversity quantities of samples at both the object-based and image-based levels into the active learning sample selection process. We validate our approach on a benchmark animal dataset. Experimental results demonstrate that, using only 30% of the training data selected by our approach, a state-of-the-art animal detector can achieve performance equal to or greater than that obtained with the complete training dataset.
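
The selection idea in this abstract, combining uncertainty and diversity using only detector outputs, can be sketched as follows. This is a minimal illustration and not the authors' algorithm: the entropy-based uncertainty, the nearest-neighbour diversity term, the weight `alpha`, and the toy `pool` are all assumptions made for the example.

```python
import math

def binary_entropy(p):
    """Entropy (in bits) of a single detection with confidence p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def image_uncertainty(confidences):
    """Mean entropy over an image's detections; 0 if nothing was detected."""
    if not confidences:
        return 0.0
    return sum(binary_entropy(p) for p in confidences) / len(confidences)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_batch(pool, budget, alpha=0.5):
    """Greedily pick `budget` images, trading off uncertainty and diversity.

    pool:  dict image_id -> (detection confidences, image feature vector)
    alpha: hypothetical weight between the two terms (not from the paper).
    Only detector outputs are used, so no access to model internals is needed.
    """
    selected = []
    remaining = set(pool)
    while remaining and len(selected) < budget:
        def score(img):
            conf, feat = pool[img]
            unc = image_uncertainty(conf)
            # Diversity: distance to the nearest already-selected image.
            div = min((euclidean(feat, pool[s][1]) for s in selected), default=1.0)
            return alpha * unc + (1 - alpha) * div
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy pool: (confidences, 2-D feature) per image.
pool = {
    "a": ([0.5, 0.5], (0.0, 0.0)),   # maximally uncertain
    "b": ([0.52], (0.1, 0.0)),       # uncertain, but almost a duplicate of "a"
    "c": ([0.99], (5.0, 5.0)),       # confident, but far from the others
    "d": ([0.98], (0.0, 0.1)),       # confident and close to "a"
}
picked = select_batch(pool, budget=2)
```

In this toy pool the first pick is driven by uncertainty and the second by diversity, so the near-duplicate "b" is skipped. Distances are left unnormalized here; a practical version would rescale both terms to comparable ranges.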


[19] Token Bottleneck: One Token to Remember Dynamics cs.CVPDF

Taekyung Kim, Dongyoon Han, Byeongho Heo, Jeongeun Park, Sangdoo Yun

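TL;DR: The paper proposes Token Bottleneck (ToBo), a self-supervised learning framework that compresses a dynamic scene into a bottleneck token and predicts the subsequent scene, yielding compact, temporally aware visual representations.

Details

Motivation: Visual representations of dynamic scenes need to be compact and temporally aware to support tasks such as visual tracking and robotic manipulation. Existing methods typically lack efficient temporal modeling.

Result: ToBo outperforms baselines on tasks such as video label propagation and robot manipulation, and the pre-trained model is also validated in real-world environments.

Insight: ToBo achieves efficient temporal modeling with a simple design, shows the potential of self-supervised learning for dynamic scene understanding, and applies across model scales.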

Abstract: Deriving compact and temporally aware visual representations from dynamic scenes is essential for successful execution of sequential scene understanding tasks such as visual tracking and robotic manipulation. In this paper, we introduce Token Bottleneck (ToBo), a simple yet intuitive self-supervised learning pipeline that squeezes a scene into a bottleneck token and predicts the subsequent scene using minimal patches as hints. The ToBo pipeline facilitates the learning of sequential scene representations by conservatively encoding the reference scene into a compact bottleneck token during the squeeze step. In the expansion step, we guide the model to capture temporal dynamics by predicting the target scene using the bottleneck token along with few target patches as hints. This design encourages the vision backbone to embed temporal dependencies, thereby enabling understanding of dynamic transitions across scenes. Extensive experiments in diverse sequential tasks, including video label propagation and robot manipulation in simulated environments demonstrate the superiority of ToBo over baselines. Moreover, deploying our pre-trained model on physical robots confirms its robustness and effectiveness in real-world environments. We further validate the scalability of ToBo across different model scales.


[20] Concept-TRAK: Understanding how diffusion models learn concepts through concept-level attribution cs.CV | cs.LGPDF

Yonghyun Park, Chieh-Hsin Lai, Satoshi Hayakawa, Yuhta Takida, Naoki Murata

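TL;DR: The paper proposes Concept-TRAK, which uses concept-level attribution to understand how diffusion models learn concepts, improving on existing attribution methods by focusing on specific elements such as styles or objects.

Details

Motivation: With diffusion models widely used for image generation, copyright and model transparency have become key challenges. Existing attribution methods only identify training samples that influence an entire image and cannot focus on specific elements.

Result: Concept-TRAK substantially outperforms prior methods on the AbC benchmark, and case studies show its practical value for copyright protection, content safety analysis, and compositional learning.

Insight: Concept-level attribution provides actionable insights for the responsible development and governance of generative AI and helps address transparency and copyright concerns.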

Abstract: While diffusion models excel at image generation, their growing adoption raises critical concerns around copyright issues and model transparency. Existing attribution methods identify training examples influencing an entire image, but fall short in isolating contributions to specific elements, such as styles or objects, that matter most to stakeholders. To bridge this gap, we introduce concept-level attribution via a novel method called Concept-TRAK. Concept-TRAK extends influence functions with two key innovations: (1) a reformulated diffusion training loss based on diffusion posterior sampling, enabling robust, sample-specific attribution; and (2) a concept-aware reward function that emphasizes semantic relevance. We evaluate Concept-TRAK on the AbC benchmark, showing substantial improvements over prior methods. Through diverse case studies, ranging from identifying IP-protected and unsafe content to analyzing prompt engineering and compositional learning, we demonstrate how concept-level attribution yields actionable insights for responsible generative AI development and governance.


[21] Divergence-Based Similarity Function for Multi-View Contrastive Learning cs.CV | cs.LG | 68T07, 62H12 | I.2.6; I.4.8; I.5.1PDF

Jae Hyoung Jeon, Cheolsu Lim, Myungjoo Kang

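TL;DR: This paper proposes a divergence-based similarity function (DSF) for multi-view contrastive learning that explicitly captures the joint structure of all views by representing each set of augmented views as a distribution and measuring similarity as the divergence between distributions. Experiments show that DSF performs well and efficiently across tasks and does not require a tuned temperature hyperparameter.

Details

Motivation: Existing multi-view contrastive learning methods combine views mainly at the loss or feature level, but capture only pairwise relationships and fail to model the joint structure across all views.

Result: DSF performs well and efficiently on tasks such as kNN classification and linear evaluation, outperforming other multi-view methods.

Insight: A distribution-level similarity function models multi-view relationships more completely and avoids the complexity of hyperparameter tuning.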

Abstract: Recent success in contrastive learning has sparked growing interest in more effectively leveraging multiple augmented views of an instance. While prior methods incorporate multiple views at the loss or feature level, they primarily capture pairwise relationships and fail to model the joint structure across all views. In this work, we propose a divergence-based similarity function (DSF) that explicitly captures the joint structure by representing each set of augmented views as a distribution and measuring similarity as the divergence between distributions. Extensive experiments demonstrate that DSF consistently improves performance across various tasks, including kNN classification and linear evaluation, while also offering greater efficiency compared to other multi-view methods. Furthermore, we establish a theoretical connection between DSF and cosine similarity, and show that, unlike cosine similarity, DSF operates effectively without requiring a temperature hyperparameter.


[22] Edge-Boundary-Texture Loss: A Tri-Class Generalization of Weighted Binary Cross-Entropy for Enhanced Edge Detection cs.CVPDF

Hao Shu

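TL;DR: The paper proposes a new loss function, the EBT loss, which divides pixels into three classes (edge, boundary, and texture) and assigns them different weights to improve edge-detection precision and boundary localization. Experiments show it outperforms the widely used WBCE loss without elaborate hyperparameter tuning.

Details

Motivation: The conventional WBCE loss treats all non-edge pixels alike, ignoring structural differences near edges and producing blurred predictions. A finer-grained form of supervision is needed.

Result: On multiple benchmarks, the EBT loss outperforms the WBCE loss both quantitatively and qualitatively, is robust to hyperparameter changes, and is easy to apply in practice.

Insight: Loss design should account for task characteristics; differentiated supervision can significantly improve model performance. The EBT loss offers a more flexible and efficient optimization direction for edge detection.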

Abstract: Edge detection (ED) remains a fundamental task in computer vision, yet its performance is often hindered by the ambiguous nature of non-edge pixels near object boundaries. The widely adopted Weighted Binary Cross-Entropy (WBCE) loss treats all non-edge pixels uniformly, overlooking the structural nuances around edges and often resulting in blurred predictions. In this paper, we propose the Edge-Boundary-Texture (EBT) loss, a novel objective that explicitly divides pixels into three categories, edge, boundary, and texture, and assigns each a distinct supervisory weight. This tri-class formulation enables more structured learning by guiding the model to focus on both edge precision and contextual boundary localization. We theoretically show that the EBT loss generalizes the WBCE loss, with the latter becoming a limit case. Extensive experiments across multiple benchmarks demonstrate the superiority of the EBT loss both quantitatively and perceptually. Furthermore, the consistent use of unified hyperparameters across all models and datasets, along with robustness to their moderate variations, indicates that the EBT loss requires minimal fine-tuning and is easily deployable in practice.
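
The tri-class weighting described in this abstract can be illustrated with a small sketch. The abstract does not give the exact class definitions or weights, so the rule used here (a non-edge pixel within `radius` of an edge counts as "boundary") and the weight values are assumptions, not the paper's formulation.

```python
import math

def classify_pixels(label, radius=1):
    """Tag each pixel as 'edge', 'boundary', or 'texture'.

    label: 2D list of 0/1 ground-truth edge labels.
    Assumed rule: a non-edge pixel within `radius` (Chebyshev distance) of an
    edge pixel is 'boundary'; every other non-edge pixel is 'texture'.
    """
    h, w = len(label), len(label[0])
    tags = [["texture"] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if label[y][x] == 1:
                tags[y][x] = "edge"
                continue
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and label[ny][nx] == 1:
                        tags[y][x] = "boundary"
    return tags

def tri_class_bce(pred, label, weights, radius=1, eps=1e-7):
    """Binary cross-entropy summed over pixels, weighted by pixel class."""
    tags = classify_pixels(label, radius)
    total = 0.0
    for y, row in enumerate(label):
        for x, target in enumerate(row):
            p = min(max(pred[y][x], eps), 1 - eps)
            bce = -(target * math.log(p) + (1 - target) * math.log(1 - p))
            total += weights[tags[y][x]] * bce
    return total

label = [[0, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 0, 0]]
pred = [[0.1, 0.2, 0.1, 0.05],
        [0.3, 0.8, 0.4, 0.05],
        [0.1, 0.2, 0.1, 0.05]]
tags = classify_pixels(label)

# Hypothetical weights; the paper's values are not stated in the abstract.
loss_ebt = tri_class_bce(pred, label, {"edge": 10.0, "boundary": 2.0, "texture": 1.0})
# With equal boundary and texture weights the loss is an ordinary two-class WBCE.
loss_wbce = tri_class_bce(pred, label, {"edge": 10.0, "boundary": 1.0, "texture": 1.0})
```

Setting the boundary and texture weights equal collapses the three classes into two, which matches the abstract's statement that WBCE is a limit case of the EBT loss.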


[23] MOST: Motion Diffusion Model for Rare Text via Temporal Clip Banzhaf Interaction cs.CVPDF

Yin Wang, Mu li, Zhiying Leng, Frederick W. B. Li, Xiaohui Liang

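TL;DR: MOST is a new motion diffusion model that uses temporal clip Banzhaf interaction to tackle human motion generation from rare text prompts, achieving fine-grained text-motion matching and eliminating redundancy.

Details

Motivation: Existing methods suffer from coarse-grained matching and overlooked semantics when generating human motion for rare text prompts; MOST aims to solve these problems through fine-grained temporal clip interaction.

Result: MOST achieves state-of-the-art performance in text-motion retrieval and generation, performing especially well on rare text prompts.

Insight: Fine-grained clip interaction effectively addresses motion redundancy and improves motion generation quality for rare text prompts.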

Abstract: We introduce MOST, a novel motion diffusion model via temporal clip Banzhaf interaction, aimed at addressing the persistent challenge of generating human motion from rare language prompts. While previous approaches struggle with coarse-grained matching and overlook important semantic cues due to motion redundancy, our key insight lies in leveraging fine-grained clip relationships to mitigate these issues. MOST’s retrieval stage presents the first formulation of its kind - temporal clip Banzhaf interaction - which precisely quantifies textual-motion coherence at the clip level. This facilitates direct, fine-grained text-to-motion clip matching and eliminates prevalent redundancy. In the generation stage, a motion prompt module effectively utilizes retrieved motion clips to produce semantically consistent movements. Extensive evaluations confirm that MOST achieves state-of-the-art text-to-motion retrieval and generation performance by comprehensively addressing previous challenges, as demonstrated through quantitative and qualitative results highlighting its effectiveness, especially for rare prompts.
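
For readers unfamiliar with the game-theoretic term, the Banzhaf interaction index between two players averages their joint marginal contribution over all coalitions of the remaining players. The sketch below computes it for a toy game; how MOST defines players and the value function is not given in the abstract, so `toy_value` and the player names are purely illustrative.

```python
from itertools import combinations

def banzhaf_interaction(players, value, i, j):
    """Banzhaf interaction index between players i and j.

    Averages v(S+{i,j}) - v(S+{i}) - v(S+{j}) + v(S) over every coalition S
    drawn from the remaining players (2**(n-2) coalitions in total).
    """
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    count = 0
    for r in range(len(others) + 1):
        for subset in combinations(others, r):
            s = frozenset(subset)
            total += value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            count += 1
    return total / count

def toy_value(coalition):
    """Hypothetical value: a bonus only when 'text' and 'clip1' co-occur,
    plus a small additive term per member that carries no interaction."""
    match = 1.0 if {"text", "clip1"} <= coalition else 0.0
    return match + 0.1 * len(coalition)

players = ["text", "clip1", "clip2"]
i_text_clip1 = banzhaf_interaction(players, toy_value, "text", "clip1")
i_text_clip2 = banzhaf_interaction(players, toy_value, "text", "clip2")
```

The additive per-member term cancels out of the index, so only the genuine pairing between "text" and "clip1" registers as interaction. Exact enumeration is exponential in the number of players, so it is only practical for small games like this one.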


[24] Ambiguity-aware Point Cloud Segmentation by Adaptive Margin Contrastive Learning cs.CVPDF

Yang Chen, Yueqi Duan, Haowen Sun, Jiwen Lu, Yap-Peng Tan

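TL;DR: The paper proposes a point cloud semantic segmentation method based on adaptive margin contrastive learning, addressing the performance loss that existing methods incur by ignoring ambiguous regions in point clouds.

Details

Motivation: Existing methods apply the same penalty objective to every point, ignoring the ambiguity of points in transition regions and their less discriminative features, which limits model performance.

Result: Experiments on the S3DIS and ScanNet datasets show that the method significantly improves the performance and robustness of point cloud segmentation.

Insight: Estimating point ambiguity and adapting the optimization accordingly can significantly improve semantic segmentation, especially in regions with ambiguous boundaries.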

Abstract: This paper proposes an adaptive margin contrastive learning method for 3D semantic segmentation on point clouds. Most existing methods use equally penalized objectives, which ignore the per-point ambiguities and less discriminated features stemming from transition regions. However, as highly ambiguous points may be indistinguishable even for humans, their manually annotated labels are less reliable, and hard constraints over these points would lead to sub-optimal models. To address this, we first design AMContrast3D, a method that incorporates contrastive learning into an ambiguity estimation framework, tailored to adaptive objectives for individual points based on ambiguity levels. As a result, our method promotes model training, which ensures the correctness of low-ambiguity points while allowing mistakes for high-ambiguity points. As ambiguities are formulated based on position discrepancies across labels, optimization during inference is constrained by the assumption that all unlabeled points are uniformly unambiguous, lacking ambiguity awareness. Inspired by the insight of joint training, we further propose AMContrast3D++, which integrates two branches trained in parallel, where a novel ambiguity prediction module concurrently learns point ambiguities from generated embeddings. To this end, we design a masked refinement mechanism that leverages predicted ambiguities to enable the ambiguous embeddings to be more reliable, thereby boosting segmentation performance and enhancing robustness. Experimental results on 3D indoor scene datasets, S3DIS and ScanNet, demonstrate the effectiveness of the proposed method. Code is available at https://github.com/YangChenApril/AMContrast3D.


[25] Capturing Stable HDR Videos Using a Dual-Camera System cs.CV | eess.IVPDF

Qianyu Zhang, Bolun Zheng, Hangjia Pan, Lingyu Zhu, Zunjie Zhu

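TL;DR: The paper proposes a dual-camera system (DCS) for HDR video reconstruction in which one camera captures a stable reference sequence and the other supplies complementary information, combined with an exposure-adaptive fusion network (EAFNet) to address flickering caused by exposure fluctuations.

Details

Motivation: In HDR video reconstruction, exposure fluctuations in alternating-exposure methods cause flickering, so a more stable solution is needed.

Result: State-of-the-art performance on multiple datasets, with reduced flickering and artifacts.

Insight: Dividing roles between two cameras and fusing features exposure-adaptively can significantly improve the stability and quality of HDR video.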

Abstract: In HDR video reconstruction, exposure fluctuations in reference images from alternating exposure methods often result in flickering. To address this issue, we propose a dual-camera system (DCS) for HDR video acquisition, where one camera is assigned to capture consistent reference sequences, while the other is assigned to capture non-reference sequences for information supplementation. To tackle the challenges posed by video data, we introduce an exposure-adaptive fusion network (EAFNet) to achieve more robust results. EAFNet introduces a pre-alignment subnetwork to explore the influence of exposure, selectively emphasizing the valuable features across different exposure levels. Then, the enhanced features are fused by the asymmetric cross-feature fusion subnetwork, which explores reference-dominated attention maps to improve image fusion by aligning cross-scale features and performing cross-feature fusion. Finally, the reconstruction subnetwork adopts a DWT-based multiscale architecture to reduce ghosting artifacts and refine features at different resolutions. Extensive experimental evaluations demonstrate that the proposed method achieves state-of-the-art performance on different datasets, validating the great potential of the DCS in HDR video reconstruction. The code and data captured by DCS will be available at https://github.com/zqqqyu/DCS.


[26] Cross-Modal Dual-Causal Learning for Long-Term Action Recognition cs.CVPDF

Xu Shaowu, Jia Xibin, Gao Junyu, Sun Qianmei, Chang Jing

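TL;DR: The paper proposes Cross-Modal Dual-Causal Learning (CMDCL), which uses a structural causal model to uncover causal relationships between videos and label texts, addressing the reliance on statistical correlations and the cross-modal biases of vision-language models in long-term action recognition (LTAR).

Details

Motivation: Long-term action recognition is challenging because of long temporal spans, complex action correlations, and visual confounders. Existing vision-language models rely on statistical correlations rather than causal mechanisms and lack cross-modal causal modeling. CMDCL is proposed to address these problems.

Result: The effectiveness of CMDCL is validated on three benchmarks: Charades, Breakfast, and COIN.

Insight: Causal mechanisms are crucial in cross-modal tasks; dual-causal intervention effectively improves the robustness of long-term action recognition.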

Abstract: Long-term action recognition (LTAR) is challenging due to extended temporal spans with complex atomic action correlations and visual confounders. Although vision-language models (VLMs) have shown promise, they often rely on statistical correlations instead of causal mechanisms. Moreover, existing causality-based methods address modal-specific biases but lack cross-modal causal modeling, limiting their utility in VLM-based LTAR. This paper proposes Cross-Modal Dual-Causal Learning (CMDCL), which introduces a structural causal model to uncover causal relationships between videos and label texts. CMDCL addresses cross-modal biases in text embeddings via textual causal intervention and removes confounders inherent in the visual modality through visual causal intervention guided by the debiased text. These dual-causal interventions enable robust action representations to address LTAR challenges. Experimental results on three benchmarks, Charades, Breakfast and COIN, demonstrate the effectiveness of the proposed model. Our code is available at https://github.com/xushaowu/CMDCL.


[27] Omni-Fusion of Spatial and Spectral for Hyperspectral Image Segmentation cs.CVPDF

Qing Zhang, Guoquan Pei, Yan Wang

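TL;DR: The paper proposes Omni-Fuse, a new method for medical hyperspectral image segmentation that uses cross-dimensional feature fusion and bidirectional attention to significantly improve segmentation performance over existing methods.

Details

Motivation: Medical hyperspectral imaging (MHSI) holds promise for disease diagnosis, but high dimensionality and spectral redundancy make it challenging to fuse spatial and spectral information effectively.

Result: On two microscopic hyperspectral image datasets, Omni-Fuse significantly outperforms existing methods, improving DSC by more than 5.73 percent.

Insight: Cross-dimensional feature fusion and bidirectional attention effectively address the challenge of fusing spatial and spectral information in medical hyperspectral images while remaining computationally efficient.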

Abstract: Medical Hyperspectral Imaging (MHSI) has emerged as a promising tool for enhanced disease diagnosis, particularly in computational pathology, offering rich spectral information that aids in identifying subtle biochemical properties of tissues. Despite these advantages, effectively fusing both spatial-dimensional and spectral-dimensional information from MHSIs remains challenging due to their inherent high dimensionality and spectral redundancy. To solve the above challenges, we propose a novel spatial-spectral omni-fusion network for hyperspectral image segmentation, named Omni-Fuse. Here, we introduce abundant cross-dimensional feature fusion operations, including a cross-dimensional enhancement module that refines both spatial and spectral features through bidirectional attention mechanisms, a spectral-guided spatial query selection to select the most spectral-related spatial feature as the query, and a two-stage cross-dimensional decoder which dynamically guides the model to focus on the selected spatial query. Despite numerous attention blocks, Omni-Fuse remains efficient in execution. Experiments on two microscopic hyperspectral image datasets show that our approach can significantly improve the segmentation performance compared with the state-of-the-art methods, with over 5.73 percent improvement in DSC. Code is available at: https://github.com/DeepMed-Lab-ECNU/Omni-Fuse.


[28] EXAONE Path 2.0: Pathology Foundation Model with End-to-End Supervision cs.CV | cs.AI | cs.LGPDF

Myungjang Pyeon, Janghyeon Lee, Minsoo Lee, Juseung Yun, Hwanil Choi

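TL;DR: EXAONE Path 2.0 is a pathology foundation model that learns patch-level representations under end-to-end supervision, significantly improving data efficiency and task performance.

Details

Motivation: In digital pathology, most current methods train patch encoders with self-supervised learning (SSL), but SSL may overlook complex domain-specific features and is less data efficient. EXAONE Path 2.0 aims to overcome these limitations.

Result: State-of-the-art average performance across 10 biomarker prediction tasks, demonstrating remarkable data efficiency.

Insight: End-to-end supervised learning outperforms self-supervised learning in pathology, particularly in capturing complex features and improving data efficiency.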

Abstract: In digital pathology, whole-slide images (WSIs) are often difficult to handle due to their gigapixel scale, so most approaches train patch encoders via self-supervised learning (SSL) and then aggregate the patch-level embeddings via multiple instance learning (MIL) or slide encoders for downstream tasks. However, patch-level SSL may overlook complex domain-specific features that are essential for biomarker prediction, such as mutation status and molecular characteristics, as SSL methods rely only on basic augmentations selected for natural image domains on small patch-level area. Moreover, SSL methods remain less data efficient than fully supervised approaches, requiring extensive computational resources and datasets to achieve competitive performance. To address these limitations, we present EXAONE Path 2.0, a pathology foundation model that learns patch-level representations under direct slide-level supervision. Using only 37k WSIs for training, EXAONE Path 2.0 achieves state-of-the-art average performance across 10 biomarker prediction tasks, demonstrating remarkable data efficiency.


[29] Learning from Sparse Point Labels for Dense Carcinosis Localization in Advanced Ovarian Cancer Assessment cs.CV | cs.LGPDF

Farahdiba Zarin, Riccardo Oliva, Vinkle Srivastav, Armine Vardazaryan, Andrea Rosati

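TL;DR: The paper proposes a method for learning dense localization from sparse point labels, applied to 2D carcinosis keypoint localization in advanced ovarian cancer assessment, and introduces a new loss function, the Crag and Tail loss, to make learning more effective.

Details

Motivation: In the medical domain, dense pixel-level annotation is costly and often impractical, especially for new tasks. The work studies how to learn dense prediction tasks from sparse pixel-level labels to advance research where annotation resources are limited.

Result: Extensive experiments validate the method, which achieves accurate dense localization of carcinosis keypoints.

Insight: With a suitable loss function and task formulation, high-quality dense prediction remains possible even when annotation resources are limited.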

Abstract: Learning from sparse labels is a challenge commonplace in the medical domain. This is due to numerous factors, such as annotation cost, and is especially true for newly introduced tasks. When dense pixel-level annotations are needed, this becomes even more unfeasible. However, being able to learn from just a few annotations at the pixel-level, while extremely difficult and underutilized, can drive progress in studies where perfect annotations are not immediately available. This work tackles the challenge of learning the dense prediction task of keypoint localization from a few point annotations in the context of 2d carcinosis keypoint localization from laparoscopic video frames for diagnostic planning of advanced ovarian cancer patients. To enable this, we formulate the problem as a sparse heatmap regression from a few point annotations per image and propose a new loss function, called Crag and Tail loss, for efficient learning. Our proposed loss function effectively leverages positive sparse labels while minimizing the impact of false negatives or missed annotations. Through an extensive ablation study, we demonstrate the effectiveness of our approach in achieving accurate dense localization of carcinosis keypoints, highlighting its potential to advance research in scenarios where dense annotations are challenging to obtain.


[30] ClipGS: Clippable Gaussian Splatting for Interactive Cinematic Visualization of Volumetric Medical Data cs.CVPDF

Chengkun Li, Yuqi Tong, Kai Chen, Zhenya Yang, Ruiyang Li

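TL;DR: ClipGS is a Gaussian splatting framework with clipping-plane support for interactive cinematic visualization of volumetric medical data; a learnable truncation scheme and an adaptive adjustment model improve rendering quality and efficiency.

Details

Motivation: Visualizing volumetric medical data is essential for diagnosis and surgical planning, but the high computational cost and low rendering speed of existing techniques limit interactive use.

Result: On five volumetric medical datasets (CT and anatomical slice data), the method reaches an average PSNR of 36.635 at 156 FPS with a 16.1 MB model size, outperforming existing methods.

Insight: Dynamically adjusting the visibility and shape of Gaussian primitives enables efficient interaction while maintaining high-quality rendering.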

Abstract: The visualization of volumetric medical data is crucial for enhancing diagnostic accuracy and improving surgical planning and education. Cinematic rendering techniques significantly enrich this process by providing high-quality visualizations that convey intricate anatomical details, thereby facilitating better understanding and decision-making in medical contexts. However, the high computing cost and low rendering speed limit interactive visualization in practical applications. In this paper, we introduce ClipGS, an innovative Gaussian splatting framework with clipping plane support, for interactive cinematic visualization of volumetric medical data. To address the challenges posed by dynamic interactions, we propose a learnable truncation scheme that automatically adjusts the visibility of Gaussian primitives in response to the clipping plane. In addition, we design an adaptive adjustment model to dynamically adjust the deformation of Gaussians and refine the rendering performance. We validate our method on five volumetric medical datasets (including CT and anatomical slice data), and reach an average rendering quality of 36.635 PSNR with 156 FPS and 16.1 MB model size, outperforming state-of-the-art methods in rendering quality and efficiency.
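
As a point of reference for what a clipping plane does to a set of Gaussian primitives, the sketch below applies a hard, non-learned cut: a primitive is kept only if its center lies on the visible side of the plane. This is a simple baseline for illustration, not ClipGS's learnable truncation scheme; the function names and the sample data are assumptions.

```python
def signed_distance(point, plane_normal, plane_offset):
    """Signed distance from `point` to the plane n.x + d = 0.

    The normal does not need to be unit length; it is normalized here.
    """
    norm = sum(c * c for c in plane_normal) ** 0.5
    dot = sum(p * n for p, n in zip(point, plane_normal))
    return (dot + plane_offset) / norm

def clip_gaussians(centers, plane_normal, plane_offset):
    """Keep only primitives whose centers lie on the non-negative side."""
    return [c for c in centers
            if signed_distance(c, plane_normal, plane_offset) >= 0.0]

# Three hypothetical Gaussian centers and the plane z = 0.5 (normal along +z).
centers = [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0), (0.0, 0.0, 0.5)]
kept = clip_gaussians(centers, plane_normal=(0.0, 0.0, 2.0), plane_offset=-1.0)
```

A hard test on centers ignores each Gaussian's spatial extent, so primitives straddling the plane are either fully shown or fully hidden. That limitation is the kind of issue a learned, per-primitive visibility adjustment is meant to address.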


[31] Diff$^2$I2P: Differentiable Image-to-Point Cloud Registration with Diffusion Prior cs.CVPDF

Juncheng Mu, Chengwei Ren, Weixiang Zhang, Liang Pan, Xiao-Ping Zhang

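TL;DR: Diff²I2P uses a diffusion prior and a differentiable correspondence tuning module to improve cross-modal image-to-point cloud registration.

Details

Motivation: Existing methods enforce cross-modal feature alignment through metric learning but overlook the inherent modality gap between image and point cloud data, resulting in poor registration.

Result: On the 7-Scenes benchmark, registration recall improves by more than 7%, outperforming existing methods.

Insight: Diffusion models can serve as a strong prior for cross-modal feature learning, significantly improving registration between images and point clouds.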

Abstract: Learning cross-modal correspondences is essential for image-to-point cloud (I2P) registration. Existing methods achieve this mostly by utilizing metric learning to enforce feature alignment across modalities, disregarding the inherent modality gap between image and point data. Consequently, this paradigm struggles to ensure accurate cross-modal correspondences. To this end, inspired by the cross-modal generation success of recent large diffusion models, we propose Diff$^2$I2P, a fully Differentiable I2P registration framework, leveraging a novel and effective Diffusion prior for bridging the modality gap. Specifically, we propose a Control-Side Score Distillation (CSD) technique to distill knowledge from a depth-conditioned diffusion model to directly optimize the predicted transformation. However, the gradients on the transformation fail to backpropagate onto the cross-modal features due to the non-differentiability of correspondence retrieval and PnP solver. To this end, we further propose a Deformable Correspondence Tuning (DCT) module to estimate the correspondences in a differentiable way, followed by the transformation estimation using a differentiable PnP solver. With these two designs, the Diffusion model serves as a strong prior to guide the cross-modal feature learning of image and point cloud for forming robust correspondences, which significantly improves the registration. Extensive experimental results demonstrate that Diff$^2$I2P consistently outperforms SoTA I2P registration methods, achieving over 7% improvement in registration recall on the 7-Scenes benchmark.


[32] MK-Pose: Category-Level Object Pose Estimation via Multimodal-Based Keypoint Learning cs.CV | cs.ROPDF

Yifan Yang, Peili Song, Enfan Lan, Dong Liu, Jingtai Liu

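TL;DR: MK-Pose is a multimodal keypoint learning framework for category-level object pose estimation that combines RGB images, point clouds, and textual descriptions, and improves performance through self-supervised keypoint detection and a graph-enhanced feature fusion module.

Details

Motivation: Existing methods rely on a single modality (such as RGB or point clouds) and struggle with object occlusion and cross-category generalization, which motivates a multimodal fusion approach.

Result: MK-Pose outperforms existing methods on the CAMERA25 and REAL275 datasets, and its cross-dataset capability is tested on HouseCat6D.

Insight: Multimodal fusion and graph modeling effectively improve the robustness of pose estimation, especially under occlusion and across categories.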

Abstract: Category-level object pose estimation, which predicts the pose of objects within a known category without prior knowledge of individual instances, is essential in applications like warehouse automation and manufacturing. Existing methods relying on RGB images or point cloud data often struggle with object occlusion and generalization across different instances and categories. This paper proposes a multimodal-based keypoint learning framework (MK-Pose) that integrates RGB images, point clouds, and category-level textual descriptions. The model uses a self-supervised keypoint detection module enhanced with attention-based query generation, soft heatmap matching and graph-based relational modeling. Additionally, a graph-enhanced feature fusion module is designed to integrate local geometric information and global context. MK-Pose is evaluated on the CAMERA25 and REAL275 datasets, and is further tested for cross-dataset capability on the HouseCat6D dataset. The results demonstrate that MK-Pose outperforms existing state-of-the-art methods in both IoU and average precision without shape priors. Code will be released at https://github.com/yangyifanYYF/MK-Pose.


[33] FlexGaussian: Flexible and Cost-Effective Training-Free Compression for 3D Gaussian Splatting cs.CVPDF

Boyuan Tian, Qizhe Gao, Siran Xianyu, Xiaotong Cui, Minjia Zhang

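TL;DR: FlexGaussian is a training-free compression method for 3D Gaussian splatting that combines mixed-precision quantization with attribute-discriminative pruning to achieve efficient compression suitable for mobile devices.

Details

Motivation: Growing demand for large-scale 3D models on resource-constrained devices calls for flexible, efficient compression; current methods lack flexibility and require retraining.

Result: Compression of up to 96.4% with a PSNR drop of less than 1 dB, running 1.7-2.1x faster than state-of-the-art training-free methods.

Insight: Efficient compression without retraining can significantly ease deployment of 3D models on mobile devices.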

Abstract: 3D Gaussian splatting has become a prominent technique for representing and rendering complex 3D scenes, due to its high fidelity and speed advantages. However, the growing demand for large-scale models calls for effective compression to reduce memory and computation costs, especially on mobile and edge devices with limited resources. Existing compression methods effectively reduce 3D Gaussian parameters but often require extensive retraining or fine-tuning, lacking flexibility under varying compression constraints. In this paper, we introduce FlexGaussian, a flexible and cost-effective method that combines mixed-precision quantization with attribute-discriminative pruning for training-free 3D Gaussian compression. FlexGaussian eliminates the need for retraining and adapts easily to diverse compression targets. Evaluation results show that FlexGaussian achieves up to 96.4% compression while maintaining high rendering quality (<1 dB drop in PSNR), and is deployable on mobile devices. FlexGaussian delivers high compression ratios within seconds, being 1.7-2.1x faster than state-of-the-art training-free methods and 10-100x faster than training-involved approaches. The code is being prepared and will be released soon at: https://github.com/Supercomputing-System-AI-Lab/FlexGaussian
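
The two ingredients named in this abstract, pruning and quantization applied without retraining, can be sketched generically as below. This is not FlexGaussian's algorithm: using opacity as the importance score, the uniform quantizer, and the 8-bit setting are assumptions chosen for illustration.

```python
def quantize(values, bits):
    """Uniformly quantize floats to 2**bits levels.

    Returns (codes, lo, step) so the values can be approximately restored.
    """
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / step) for v in values]
    return codes, lo, step

def dequantize(codes, lo, step):
    """Map integer codes back to approximate float values."""
    return [lo + c * step for c in codes]

def prune(gaussians, keep_ratio):
    """Keep the top fraction of primitives by opacity.

    Opacity as the importance score is a hypothetical choice for this sketch.
    """
    k = max(1, int(len(gaussians) * keep_ratio))
    return sorted(gaussians, key=lambda g: g["opacity"], reverse=True)[:k]

# Toy primitives with one scalar "color" attribute each.
gaussians = [
    {"opacity": 0.9, "color": 0.0},
    {"opacity": 0.1, "color": 0.12},
    {"opacity": 0.5, "color": 0.5},
    {"opacity": 0.7, "color": 1.0},
]
kept = prune(gaussians, keep_ratio=0.5)

colors = [g["color"] for g in gaussians]
codes, lo, step = quantize(colors, bits=8)
restored = dequantize(codes, lo, step)
```

Mixed precision would correspond to calling `quantize` with a different `bits` value per attribute, spending more bits on attributes that affect rendering quality most. Uniform quantization bounds the per-value error by half a step, which makes the quality cost of a given bit width easy to reason about.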


[34] Text-promptable Object Counting via Quantity Awareness Enhancement cs.CVPDF

Miaojing Shi, Xiaowen Zhang, Zijie Yue, Yong Luo, Cairong Zhao

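TL;DR: The paper proposes QUANet, which strengthens the model's counting ability through quantity-oriented text prompts and a vision-text quantity alignment loss, and introduces a dual-stream adaptive counting decoder and a cross-stream quantity ranking loss; it performs strongly on multiple benchmarks.

Details

Motivation: Existing methods for text-promptable counting attend only to object category and ignore quantity information, which limits counting accuracy.

Result: Strong zero-shot class-agnostic counting performance on datasets such as FSC-147 and CARPK.

Insight: Quantity information is crucial for text-promptable counting; the dual-stream structure and adapters effectively improve the model's generalization.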

Abstract: Recent advances in large vision-language models (VLMs) have shown remarkable progress in solving the text-promptable object counting problem. Representative methods typically specify text prompts with object category information in images. This however is insufficient for training the model to accurately distinguish the number of objects in the counting task. To this end, we propose QUANet, which introduces novel quantity-oriented text prompts with a vision-text quantity alignment loss to enhance the model’s quantity awareness. Moreover, we propose a dual-stream adaptive counting decoder consisting of a Transformer stream, a CNN stream, and a number of Transformer-to-CNN enhancement adapters (T2C-adapters) for density map prediction. The T2C-adapters facilitate the effective knowledge communication and aggregation between the Transformer and CNN streams. A cross-stream quantity ranking loss is proposed in the end to optimize the ranking orders of predictions from the two streams. Extensive experiments on standard benchmarks such as FSC-147, CARPK, PUCPR+, and ShanghaiTech demonstrate our model’s strong generalizability for zero-shot class-agnostic counting. Code is available at https://github.com/viscom-tongji/QUANet


[35] Spatial-Temporal Graph Mamba for Music-Guided Dance Video Synthesis cs.CVPDF

Hao Tang, Ling Shao, Zhenyu Zhang, Luc Van Gool, Nicu Sebe

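TL;DR: STG-Mamba is a spatial-temporal graph Mamba method for music-guided dance video synthesis, implemented as two mappings (music-to-skeleton and skeleton-to-video), and produces better results than existing methods.

Details

Motivation: Existing methods for music-guided dance video synthesis struggle to capture the spatial and temporal dependencies between joints simultaneously, so a more effective method is needed to generate natural, fluid dance videos.

Result: Experiments show that STG-Mamba outperforms existing methods on music-guided dance video synthesis.

Insight: Modeling spatial and temporal dependencies jointly is key to generating fluid dance videos, and a self-supervised regularization network can effectively improve video generation quality.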

Abstract: We propose a novel spatial-temporal graph Mamba (STG-Mamba) for the music-guided dance video synthesis task, i.e., to translate the input music to a dance video. STG-Mamba consists of two translation mappings: music-to-skeleton translation and skeleton-to-video translation. In the music-to-skeleton translation, we introduce a novel spatial-temporal graph Mamba (STGM) block to effectively construct skeleton sequences from the input music, capturing dependencies between joints in both the spatial and temporal dimensions. For the skeleton-to-video translation, we propose a novel self-supervised regularization network to translate the generated skeletons, along with a conditional image, into a dance video. Lastly, we collect a new skeleton-to-video translation dataset from the Internet, containing 54,944 video clips. Extensive experiments demonstrate that STG-Mamba achieves significantly better results than existing methods.


[36] A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding cs.CV | cs.ROPDF

Zhenyang Liu, Sixiao Zheng, Siyu Chen, Cairong Zhao, Longfei Liang

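TL;DR: The paper proposes SpatialReasoner, a neural representation framework that uses large language model (LLM)-driven spatial reasoning to address the weak reasoning about spatial relations in open-vocabulary 3D visual grounding.

Details

Motivation: Open-vocabulary 3D visual grounding is essential for applications such as autonomous navigation and robotics, but existing methods handle spatial relations in language queries (such as "the book on the chair") poorly; spatial reasoning needs to improve in both the language queries and the 3D scenes.

Result: Experiments show that the method integrates seamlessly into different neural representations, outperforms baseline models, and improves their spatial reasoning capability.

Insight: LLM-driven spatial reasoning combined with a feature field enhanced by visual properties enables more accurate open-vocabulary 3D visual grounding for complex queries.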

Abstract: Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries, which is crucial for embodied AI applications such as autonomous navigation, robotics, and augmented reality. Learning 3D language fields through neural representations enables accurate understanding of 3D scenes from limited viewpoints and facilitates the localization of target objects in complex environments. However, existing language field methods struggle to accurately localize instances using spatial relations in language queries, such as "the book on the chair." This limitation mainly arises from inadequate reasoning about spatial relations in both language queries and 3D scenes. In this work, we propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning that constructs a visual properties-enhanced hierarchical feature field for open-vocabulary 3D visual grounding. To enable spatial reasoning in language queries, SpatialReasoner fine-tunes an LLM to capture spatial relations and explicitly infer instructions for the target, anchor, and spatial relation. To enable spatial reasoning in 3D scenes, SpatialReasoner incorporates visual properties (opacity and color) to construct a hierarchical feature field. This field represents language and instance features using distilled CLIP features and masks extracted via the Segment Anything Model (SAM). The field is then queried using the inferred instructions in a hierarchical manner to localize the target 3D instance based on the spatial relation in the language query. Extensive experiments show that our framework can be seamlessly integrated into different neural representations, outperforming baseline models in 3D visual grounding while empowering their spatial reasoning capability.


[37] Hierarchical Feature Alignment for Gloss-Free Sign Language Translation cs.CVPDF

Sobhan Asasi, Mohamed Ilyes Lakhal, Richard Bowden

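TL;DR: The paper proposes a gloss-free hierarchical pre-training strategy that improves sign language translation through video pseudo-glosses and contrastive learning.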

Details

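Motivation: Existing sign language translation methods suffer from misalignment between visual and textual representations during end-to-end learning. Gloss-free methods are more flexible but require effective alignment strategies.

Result: Experiments show improved BLEU-4 and ROUGE scores while maintaining efficiency.

Insight: A hierarchical alignment strategy captures the structure of sign language more effectively and improves translation performance.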

Abstract: Sign Language Translation (SLT) attempts to convert sign language videos into spoken sentences. However, many existing methods struggle with the disparity between visual and textual representations during end-to-end learning. Gloss-based approaches help to bridge this gap by leveraging structured linguistic information. While gloss-free methods offer greater flexibility and remove the burden of annotation, they require effective alignment strategies. Recent advances in Large Language Models (LLMs) have enabled gloss-free SLT by generating text-like representations from sign videos. In this work, we introduce a novel hierarchical pre-training strategy inspired by the structure of sign language, incorporating pseudo-glosses and contrastive video-language alignment. Our method hierarchically extracts features at frame, segment, and video levels, aligning them with pseudo-glosses and the spoken sentence to enhance translation quality. Experiments demonstrate that our approach improves BLEU-4 and ROUGE scores while maintaining efficiency.
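
The contrastive video-language alignment mentioned in the abstract can be illustrated with a generic InfoNCE-style loss. This is a minimal sketch under stated assumptions, not the paper's objective: the abstract does not give the loss, so the function names, the temperature value, and the toy feature vectors are all hypothetical. The same function could in principle be applied at each level (frame, segment, video) with the matching text-side features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(video_feats, text_feats, temperature=0.07):
    """Video-to-text InfoNCE: video i is the positive for text i,
    every other text in the batch acts as a negative.
    The temperature is an assumed, commonly used default."""
    n = len(video_feats)
    total = 0.0
    for i in range(n):
        logits = [cosine(video_feats[i], text_feats[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy features (hypothetical): two clips and two sentence embeddings.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(videos, aligned_texts)
loss_shuffled = info_nce(videos, shuffled_texts)
```

With these toy inputs the loss is close to zero for correctly paired features and large for swapped pairs, which is the behavior a contrastive alignment term is meant to reward.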


[38] Residual Prior-driven Frequency-aware Network for Image Fusion cs.CV | cs.LG | cs.MMPDF

Guan Zheng, Xue Wang, Wenhua Qian, Peng Liu, Runzhuo Ma

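TL;DR: The paper proposes RPFNet, a residual prior-driven frequency-aware network that addresses global feature modeling and complementary information capture in image fusion. It achieves efficient fusion through a dual-branch feature extraction framework and several loss functions.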

Details

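Motivation: In image fusion, conventional global spatial modeling is computationally expensive, and the lack of ground truth makes it hard to capture complementary information effectively. RPFNet aims to address both problems through frequency-domain modeling and residual prior extraction.

Result: Experiments show that RPFNet effectively integrates discriminative features, enhances texture details and salient objects, and improves performance on high-level vision tasks.

Insight: Combining frequency-domain modeling with residual priors offers a new way to handle computational cost and complementary information capture in image fusion, and jointly optimizing several loss functions constrains model training more effectively.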

Abstract: Image fusion aims to integrate complementary information across modalities to generate high-quality fused images, thereby enhancing the performance of high-level vision tasks. While global spatial modeling mechanisms show promising results, constructing long-range feature dependencies in the spatial domain incurs substantial computational costs. Additionally, the absence of ground-truth exacerbates the difficulty of capturing complementary features effectively. To tackle these challenges, we propose a Residual Prior-driven Frequency-aware Network, termed as RPFNet. Specifically, RPFNet employs a dual-branch feature extraction framework: the Residual Prior Module (RPM) extracts modality-specific difference information from residual maps, thereby providing complementary priors for fusion; the Frequency Domain Fusion Module (FDFM) achieves efficient global feature modeling and integration through frequency-domain convolution. Additionally, the Cross Promotion Module (CPM) enhances the synergistic perception of local details and global structures through bidirectional feature interaction. During training, we incorporate an auxiliary decoder and saliency structure loss to strengthen the model’s sensitivity to modality-specific differences. Furthermore, a combination of adaptive weight-based frequency contrastive loss and SSIM loss effectively constrains the solution space, facilitating the joint capture of local details and global features while ensuring the retention of complementary information. Extensive experiments validate the fusion performance of RPFNet, which effectively integrates discriminative features, enhances texture details and salient objects, and can effectively facilitate the deployment of the high-level vision task.


[39] DIFFUMA: High-Fidelity Spatio-Temporal Video Prediction via Dual-Path Mamba and Diffusion Enhancement cs.CV | cs.AIPDF

Xinyu Xie, Weifeng Cao, Jun Shi, Yangyang Hu, Hui Liang

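TL;DR: The paper proposes the DIFFUMA model and the CHDL dataset for high-fidelity spatio-temporal video prediction, with strong results in semiconductor manufacturing.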

Details

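Motivation: To address the challenge of high-precision spatio-temporal video prediction in industrial settings, in particular the lack of dedicated datasets for semiconductor manufacturing.

Result: On the CHDL dataset, MSE drops by 39% and SSIM rises from 0.926 to 0.988. The model also performs well on natural phenomena datasets.

Insight: Industrial AI needs dedicated datasets and purpose-built model designs. DIFFUMA offers a new approach to high-precision dynamics modeling.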

Abstract: Spatio-temporal video prediction plays a pivotal role in critical domains, ranging from weather forecasting to industrial automation. However, in high-precision industrial scenarios such as semiconductor manufacturing, the absence of specialized benchmark datasets severely hampers research on modeling and predicting complex processes. To address this challenge, we make a twofold contribution. First, we construct and release the Chip Dicing Lane Dataset (CHDL), the first public temporal image dataset dedicated to the semiconductor wafer dicing process. Captured via an industrial-grade vision system, CHDL provides a much-needed and challenging benchmark for high-fidelity process modeling, defect detection, and digital twin development. Second, we propose DIFFUMA, an innovative dual-path prediction architecture specifically designed for such fine-grained dynamics. The model captures global long-range temporal context through a parallel Mamba module, while simultaneously leveraging a diffusion module, guided by temporal features, to restore and enhance fine-grained spatial details, effectively combating feature degradation. Experiments demonstrate that on our CHDL benchmark, DIFFUMA significantly outperforms existing methods, reducing the Mean Squared Error (MSE) by 39% and improving the Structural Similarity (SSIM) from 0.926 to a near-perfect 0.988. This superior performance also generalizes to natural phenomena datasets. Our work not only delivers a new state-of-the-art (SOTA) model but, more importantly, provides the community with an invaluable data resource to drive future research in industrial AI.


[40] PromptTea: Let Prompts Tell TeaCache the Optimal Threshold cs.CVPDF

Zishen Huang, Chunyu Yang, Mengyuan Ren

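TL;DR: The paper proposes Prompt-Complexity-Aware (PCA) caching and a dynamic mechanism, DynCFGCache, which significantly speed up inference for video generation models while preserving visual fidelity.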

Details

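Motivation: Despite progress in video generation, inference speed remains a bottleneck. Fixed-interval caching performs poorly in complex scenes, and manually tuning thresholds is inefficient and not robust.

Result: A 2.79x speedup on the Wan2.1 model while maintaining high visual fidelity.

Insight: Semantic information in the input prompt is important for caching decisions, and dynamic mechanisms strike a better balance between speed and generation quality.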

Abstract: Despite recent progress in video generation, inference speed remains a major bottleneck. A common acceleration strategy involves reusing model outputs via caching mechanisms at fixed intervals. However, we find that such fixed-frequency reuse significantly degrades quality in complex scenes, while manually tuning reuse thresholds is inefficient and lacks robustness. To address this, we propose Prompt-Complexity-Aware (PCA) caching, a method that automatically adjusts reuse thresholds based on scene complexity estimated directly from the input prompt. By incorporating prompt-derived semantic cues, PCA enables more adaptive and informed reuse decisions than conventional caching methods. We also revisit the assumptions behind TeaCache and identify a key limitation: it suffers from poor input-output relationship modeling due to an oversimplified prior. To overcome this, we decouple the noisy input, enhance the contribution of meaningful textual information, and improve the model’s predictive accuracy through multivariate polynomial feature expansion. To further reduce computational cost, we replace the static CFGCache with DynCFGCache, a dynamic mechanism that selectively reuses classifier-free guidance (CFG) outputs based on estimated output variations. This allows for more flexible reuse without compromising output quality. Extensive experiments demonstrate that our approach achieves significant acceleration (for example, a 2.79x speedup on the Wan2.1 model) while maintaining high visual fidelity across a range of scenes.
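
The idea of adjusting a reuse threshold from prompt complexity can be sketched with a toy cache schedule. Everything here is an assumption for illustration: the abstract does not specify how complexity is scored, how it maps to a threshold, or how output change is estimated, so `threshold_from_complexity`, `plan_steps`, the linear mapping, and the per-step change values are hypothetical.

```python
def threshold_from_complexity(complexity, low=0.05, high=0.30):
    """Map a complexity score in [0, 1] to a reuse threshold.
    Hypothetical linear rule: complex prompts get a smaller threshold,
    so cached outputs are reused less often."""
    clipped = min(max(complexity, 0.0), 1.0)
    return high - clipped * (high - low)


def plan_steps(step_changes, threshold):
    """Decide per denoising step whether to run the model (True)
    or reuse the cached output (False).

    step_changes holds an estimated relative change per step. The estimate
    is accumulated, and the model is rerun once the accumulated value
    reaches the threshold. The first step always runs."""
    plan = []
    accumulated = 0.0
    for index, change in enumerate(step_changes):
        accumulated += change
        if index == 0 or accumulated >= threshold:
            plan.append(True)
            accumulated = 0.0
        else:
            plan.append(False)
    return plan


# Hypothetical per-step change estimates for a 10-step schedule.
changes = [0.1] * 10
simple_plan = plan_steps(changes, threshold_from_complexity(0.0))
complex_plan = plan_steps(changes, threshold_from_complexity(1.0))
```

In this toy setting the simple prompt skips some model calls while the complex prompt recomputes at every step, which mirrors the trade-off the abstract describes between speed and quality in complex scenes.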


[41] Finetuning Vision-Language Models as OCR Systems for Low-Resource Languages: A Case Study of Manchu cs.CVPDF

Yan Hon Michael Chung, Donghyeok Choi

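TL;DR: By fine-tuning vision-language models (VLMs), the paper builds effective OCR systems for Manchu, an endangered language, and substantially improves recognition on real historical documents.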

Details

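Motivation: Manchu is an endangered language that lacks effective OCR systems for real-world historical documents, which hinders research on early modern Eastern Eurasian history.

Result: 1. LLaMA-3.2-11B achieves 98.3% word accuracy and a 0.0024 character error rate on synthetic data.
2. It maintains 93.1% accuracy on real-world documents, far above the traditional CRNN baseline (72.5%).

Insight: 1. VLMs show promise for OCR in low-resource languages.
2. Training on synthetic data can transfer effectively to real-world documents.
3. The framework can be extended to other endangered languages and lowers the technical barrier.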

Abstract: Manchu, a critically endangered language essential for understanding early modern Eastern Eurasian history, lacks effective OCR systems that can handle real-world historical documents. This study develops high-performing OCR systems by fine-tuning three open-source vision-language models (LLaMA-3.2-11B, Qwen2.5-VL-7B, Qwen2.5-VL-3B) on 60,000 synthetic Manchu word images using parameter-efficient training. LLaMA-3.2-11B achieved exceptional performance with 98.3% word accuracy and 0.0024 character error rate on synthetic data, while crucially maintaining 93.1% accuracy on real-world handwritten documents. Comparative evaluation reveals substantial advantages over traditional approaches: while a CRNN baseline achieved 99.8% synthetic accuracy, it suffered severe degradation to 72.5% on real documents. Our approach demonstrates effective synthetic-to-real domain transfer, providing a cost-effective solution deployable on accessible infrastructure. This work establishes a transferable framework for endangered language OCR that removes technical and financial barriers in digital humanities, enabling historians and linguists to process historical archives without specialized computing resources. Code and model weights are available at https://github.com/mic7ch1/ManchuAI-OCR.
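
The two metrics quoted above, word accuracy and character error rate (CER), can be computed as in the following sketch. It assumes the common definitions (exact-match rate over words, and total edit distance divided by total reference characters); the paper's exact evaluation script is not given in the abstract, and the romanized sample strings are illustrative only.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_char == hyp_char else 1)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


def character_error_rate(references, hypotheses):
    """Total character edits divided by total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    characters = sum(len(r) for r in references)
    return edits / characters


def word_accuracy(references, hypotheses):
    """Fraction of words recognized exactly."""
    correct = sum(1 for r, h in zip(references, hypotheses) if r == h)
    return correct / len(references)


# Illustrative romanized words; the second prediction has one wrong character.
refs = ["manju", "gisun"]
hyps = ["manju", "gisum"]
cer = character_error_rate(refs, hyps)
accuracy = word_accuracy(refs, hyps)
```

Note how the two metrics diverge: a single wrong character costs a whole word in word accuracy but only one tenth of the characters in CER, which is why the abstract reports both.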


[42] FOLC-Net: A Federated-Optimized Lightweight Architecture for Enhanced MRI Disease Diagnosis across Axial, Coronal, and Sagittal Views cs.CV | cs.AIPDF

Saif Ur Rehman Khan, Muhammad Nabeel Asim, Sebastian Vollmer, Andreas Dengel

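TL;DR: FOLC-Net proposes a federated-optimized lightweight architecture for enhanced MRI disease diagnosis, targeting performance on axial, coronal, and sagittal views.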

Details

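Motivation: Existing MRI disease diagnosis models degrade when handling multi-view data, especially on the sagittal view. FOLC-Net aims to address this and improve adaptability and accuracy.

Result: On the sagittal view, FOLC-Net reaches 92.44% accuracy, compared with 88.37% and 88.95% for the compared methods. It is also reported to be more accurate and robust across all individual views and on multi-view data.

Insight: FOLC-Net illustrates the potential of lightweight architectures and federated learning in medical image analysis, particularly for multi-view data, and offers a reliable solution for medical applications in decentralized environments.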

Abstract: The framework is designed to improve performance in the analysis of combined as well as single anatomical perspectives for MRI disease diagnosis. It specifically addresses the performance degradation observed in state-of-the-art (SOTA) models, particularly when processing axial, coronal, and sagittal anatomical planes. The paper introduces the FOLC-Net framework, which incorporates a novel federated-optimized lightweight architecture with approximately 1.217 million parameters and a storage requirement of only 0.9 MB. FOLC-Net integrates Manta-ray foraging optimization (MRFO) mechanisms for efficient model structure generation, global model cloning for scalable training, and ConvNeXt for enhanced client adaptability. The model was evaluated on combined multi-view data as well as individual views, such as axial, coronal, and sagittal, to assess its robustness in various medical imaging scenarios. Moreover, FOLC-Net tests a ShallowFed model on different data to evaluate its ability to generalize beyond the training dataset. The results show that FOLC-Net outperforms existing models, particularly in the challenging sagittal view. For instance, FOLC-Net achieved an accuracy of 92.44% on the sagittal view, significantly higher than the 88.37% accuracy of study method (DL + Residual Learning) and 88.95% of DL models. Additionally, FOLC-Net demonstrated improved accuracy across all individual views, providing a more reliable and robust solution for medical image analysis in decentralized environments. FOLC-Net addresses the limitations of existing SOTA models by providing a framework that ensures better adaptability to individual views while maintaining strong performance in multi-view settings. The incorporation of MRFO, global model cloning, and ConvNeXt ensures that FOLC-Net performs better in real-world medical applications.


[43] Learning Deliberately, Acting Intuitively: Unlocking Test-Time Reasoning in Multimodal LLMs cs.CV | cs.CL | cs.LGPDF

Yahan Yu, Yuyang Dong, Masafumi Oyamada

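TL;DR: The paper proposes D2I, a framework that strengthens the understanding and reasoning of multimodal LLMs through a rule-based format reward during training, without extra annotations or complex rewards, and switches to intuitive reasoning at evaluation time. The method performs well on in-domain and out-of-domain benchmarks and highlights the role of format rewards in building transferable reasoning skills in MLLMs.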

Details

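Motivation: Multimodal reasoning research faces challenges in modality alignment and training cost. Existing methods rely on additional data annotation and rule-based rewards, which limits scalability. The paper aims to improve MLLM reasoning with a framework that needs no extra resources.

Result: D2I outperforms baseline methods on both in-domain and out-of-domain benchmarks, supporting its effectiveness and transferability.

Insight: The format reward plays a key role in improving MLLM reasoning, and decoupling explicit reasoning at training time from implicit reasoning at test time opens a direction for future work.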

Abstract: Reasoning is a key capability for large language models (LLMs), particularly when applied to complex tasks such as mathematical problem solving. However, multimodal reasoning research still requires further exploration of modality alignment and training costs. Many of these approaches rely on additional data annotation and relevant rule-based rewards to enhance the understanding and reasoning ability, which significantly increases training costs and limits scalability. To address these challenges, we propose the Deliberate-to-Intuitive reasoning framework (D2I) that improves the understanding and reasoning ability of multimodal LLMs (MLLMs) without extra annotations and complex rewards. Specifically, our method sets deliberate reasoning strategies to enhance modality alignment only through the rule-based format reward during training. While evaluating, the reasoning style shifts to intuitive, which removes deliberate reasoning strategies during training and implicitly reflects the model’s acquired abilities in the response. D2I outperforms baselines across both in-domain and out-of-domain benchmarks. Our findings highlight the role of format reward in fostering transferable reasoning skills in MLLMs, and inspire directions for decoupling training-time reasoning depth from test-time response flexibility.
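
A rule-based format reward of the kind the abstract refers to can be as simple as a pattern check on the response. The sketch below is an assumption for illustration: the abstract does not state the required format, so the `<think>` and `<answer>` tags and the binary 0/1 scoring are hypothetical choices.

```python
import re

# Hypothetical template: a non-empty reasoning span followed by a non-empty answer span.
RESPONSE_PATTERN = re.compile(
    r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*",
    re.DOTALL,
)


def format_reward(response):
    """Return 1.0 if the whole response follows the template, else 0.0.
    The reward checks structure only; it never inspects answer correctness."""
    return 1.0 if RESPONSE_PATTERN.fullmatch(response) else 0.0


well_formed = format_reward("<think>compare the two bars</think><answer>B</answer>")
bare_answer = format_reward("B")
empty_reasoning = format_reward("<think></think><answer>B</answer>")
extra_prefix = format_reward("Answer: <think>x</think><answer>B</answer>")
```

Because such a reward depends only on structure, it needs no extra annotations, which is consistent with the abstract's claim that training relies on the format reward alone.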


[44] Democratizing High-Fidelity Co-Speech Gesture Video Generation cs.CV | cs.AIPDF

Xu Yang, Shaoli Huang, Shenbo Xie, Xuelin Chen, Yifei Liu

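TL;DR: The paper proposes a lightweight framework that uses 2D full-body skeletons as an auxiliary condition, together with a diffusion model and audio-skeleton feature fusion, to generate high-quality, audio-synchronized videos of speakers' gestures. It also releases CSG-405, described as the first public large-scale dataset of its kind.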

Details

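Motivation: Research on co-speech gesture video generation is constrained by the scarcity of large-scale public datasets and by high computational demands, and the audio-to-visual mapping is one-to-many. The paper aims to address these issues and democratize research.

Result: Experiments show that the method exceeds existing approaches in visual quality and synchronization and generalizes across speakers and contexts.

Insight: Using skeletons as an intermediate representation efficiently links audio to visual output, reduces computational cost, and improves the diversity and synchronization of generated videos.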

Abstract: Co-speech gesture video generation aims to synthesize realistic, audio-aligned videos of speakers, complete with synchronized facial expressions and body gestures. This task presents challenges due to the significant one-to-many mapping between audio and visual content, further complicated by the scarcity of large-scale public datasets and high computational demands. We propose a lightweight framework that utilizes 2D full-body skeletons as an efficient auxiliary condition to bridge audio signals with visual outputs. Our approach introduces a diffusion model conditioned on fine-grained audio segments and a skeleton extracted from the speaker’s reference image, predicting skeletal motions through skeleton-audio feature fusion to ensure strict audio coordination and body shape consistency. The generated skeletons are then fed into an off-the-shelf human video generation model with the speaker’s reference image to synthesize high-fidelity videos. To democratize research, we present CSG-405, the first public dataset with 405 hours of high-resolution videos across 71 speech types, annotated with 2D skeletons and diverse speaker demographics. Experiments show that our method exceeds state-of-the-art approaches in visual quality and synchronization while generalizing across speakers and contexts.


[45] HVI-CIDNet+: Beyond Extreme Darkness for Low-Light Image Enhancement cs.CVPDF

Qingsen Yan, Kangbiao Shi, Yixu Feng, Tao Hu, Peng Wu

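TL;DR: The paper proposes a new color space, HVI, and an improved network, HVI-CIDNet+, for enhancing low-light images in extremely dark conditions, addressing the color bias and brightness artifacts of existing methods.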

Details

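Motivation: Existing low-light image enhancement methods based on the sRGB and HSV color spaces suffer from color bias, brightness artifacts, and noise, so a more effective solution is needed.

Result: HVI-CIDNet+ outperforms existing methods on 10 benchmark datasets.

Insight: 1. Priors from vision-language models can markedly improve restoration in extremely dark regions; 2. Choosing the processing strategy per region (convolution or self-attention) helps optimize overall performance.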

Abstract: Low-Light Image Enhancement (LLIE) aims to restore vivid content and details from corrupted low-light images. However, existing standard RGB (sRGB) color space-based LLIE methods often produce color bias and brightness artifacts due to the inherent high color sensitivity. While Hue, Saturation, and Value (HSV) color space can decouple brightness and color, it introduces significant red and black noise artifacts. To address this problem, we propose a new color space for LLIE, namely Horizontal/Vertical-Intensity (HVI), defined by the HV color map and learnable intensity. The HV color map enforces small distances for the red coordinates to remove red noise artifacts, while the learnable intensity compresses the low-light regions to remove black noise artifacts. Additionally, we introduce the Color and Intensity Decoupling Network+ (HVI-CIDNet+), built upon the HVI color space, to restore damaged content and mitigate color distortion in extremely dark regions. Specifically, HVI-CIDNet+ leverages abundant contextual and degraded knowledge extracted from low-light images using pre-trained vision-language models, integrated via a novel Prior-guided Attention Block (PAB). Within the PAB, latent semantic priors can promote content restoration, while degraded representations guide precise color correction, both particularly in extremely dark regions through the meticulously designed cross-attention fusion mechanism. Furthermore, we construct a Region Refinement Block that employs convolution for information-rich regions and self-attention for information-scarce regions, ensuring accurate brightness adjustments. Comprehensive results from benchmark experiments demonstrate that the proposed HVI-CIDNet+ outperforms the state-of-the-art methods on 10 datasets.


[46] Physics-Grounded Motion Forecasting via Equation Discovery for Trajectory-Guided Image-to-Video Generation cs.CV | cs.AIPDF

Tao Feng, Xianbing Zhao, Zhenhua Chen, Tien Tsin Wong, Hamid Rezatofighi

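TL;DR: The paper proposes an image-to-video generation framework that combines symbolic regression with trajectory guidance, improving the physical alignment of generated videos through physics-grounded motion forecasting.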

Details

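Motivation: Existing diffusion and autoregressive video generation models achieve strong visual realism but lack physical alignment and cannot accurately reproduce real-world object motion, mainly because they rely on statistical correlations rather than physical laws.

Result: In classical mechanics scenarios (spring-mass, pendulum, and projectile motion), the method recovers the ground-truth analytical equations and improves the physical alignment of generated videos.

Insight: By combining symbolic regression with video generation, the paper shows that physical laws can make generated content more realistic, suggesting a direction for future physics-based generative models.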

Abstract: Recent advances in diffusion-based and autoregressive video generation models have achieved remarkable visual realism. However, these models typically lack accurate physical alignment, failing to replicate real-world dynamics in object motion. This limitation arises primarily from their reliance on learned statistical correlations rather than capturing mechanisms adhering to physical laws. To address this issue, we introduce a novel framework that integrates symbolic regression (SR) and trajectory-guided image-to-video (I2V) models for physics-grounded video forecasting. Our approach extracts motion trajectories from input videos, uses a retrieval-based pre-training mechanism to enhance symbolic regression, and discovers equations of motion to forecast physically accurate future trajectories. These trajectories then guide video generation without requiring fine-tuning of existing models. Evaluated on scenarios in Classical Mechanics, including spring-mass, pendulums, and projectile motions, our method successfully recovers ground-truth analytical equations and improves the physical alignment of generated videos over baseline methods.
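
The step of discovering an equation of motion from an observed trajectory and then forecasting future positions can be illustrated on projectile motion. This is a simplified stand-in, not the paper's method: it fixes the functional form to a quadratic and fits it by least squares, whereas symbolic regression searches over candidate expressions. The helper names and the sample trajectory are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    rows = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(rows[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (rows[r][n] - known) / rows[r][r]
    return solution


def fit_quadratic(times, heights):
    """Least-squares fit of y(t) = c0 + c1*t + c2*t^2 via the normal equations."""
    power_sums = [sum(t ** k for t in times) for k in range(5)]
    matrix = [[power_sums[i + j] for j in range(3)] for i in range(3)]
    rhs = [sum(y * t ** i for t, y in zip(times, heights)) for i in range(3)]
    return solve_linear(matrix, rhs)


def forecast(coefficients, t):
    """Evaluate the fitted polynomial at a future time t."""
    return sum(c * t ** k for k, c in enumerate(coefficients))


# Hypothetical observed trajectory: y0 = 1 m, v0 = 3 m/s, g = 9.81 m/s^2.
times = [i * 0.1 for i in range(6)]
heights = [1.0 + 3.0 * t - 4.905 * t * t for t in times]
coefficients = fit_quadratic(times, heights)
future_height = forecast(coefficients, 0.8)
```

On this noise-free data the fit recovers the coefficients of y(t) = 1 + 3t - 4.905t², and the forecast extrapolates beyond the observed window. In the paper's pipeline, forecast trajectories of this kind are what guide the image-to-video model.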


[47] Know Your Attention Maps: Class-specific Token Masking for Weakly Supervised Semantic Segmentation cs.CVPDF

Joelle Hanna, Damian Borth

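TL;DR: The paper proposes a Vision Transformer (ViT)-based method for weakly supervised semantic segmentation that uses multiple [CLS] tokens and a random masking strategy to make attention maps more interpretable and to generate accurate pseudo segmentation masks.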

Details

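Motivation: Traditional weakly supervised semantic segmentation (WSSS) methods rely on external modules such as Class Activation Maps to generate pseudo-masks. The paper aims to solve WSSS directly with ViT attention maps, reducing the need for fine-grained labeled data.

Result: Experiments on multiple benchmark datasets show that the generated pseudo-masks are of higher quality than those of existing methods, and a segmentation model trained on them performs comparably to fully supervised models.

Insight: ViT self-attention maps can be used directly for WSSS, and the random masking strategy improves the accuracy of class assignment.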

Abstract: Weakly Supervised Semantic Segmentation (WSSS) is a challenging problem that has been extensively studied in recent years. Traditional approaches often rely on external modules like Class Activation Maps to highlight regions of interest and generate pseudo segmentation masks. In this work, we propose an end-to-end method that directly utilizes the attention maps learned by a Vision Transformer (ViT) for WSSS. We propose training a sparse ViT with multiple [CLS] tokens (one for each class), using a random masking strategy to promote [CLS] token - class assignment. At inference time, we aggregate the different self-attention maps of each [CLS] token corresponding to the predicted labels to generate pseudo segmentation masks. Our proposed approach enhances the interpretability of self-attention maps and ensures accurate class assignments. Extensive experiments on two standard benchmarks and three specialized datasets demonstrate that our method generates accurate pseudo-masks, outperforming related works. Those pseudo-masks can be used to train a segmentation model which achieves results comparable to fully-supervised models, significantly reducing the need for fine-grained labeled data.
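
The inference step, aggregating the attention maps of the [CLS] tokens that correspond to predicted labels into a pseudo mask, can be sketched as follows. This is a minimal illustration under assumptions: the abstract does not say which layers or heads are aggregated or how background is decided, so the per-class maps are taken as already averaged, and the background threshold and function name are hypothetical.

```python
def pseudo_mask(class_attention, predicted_classes, background_threshold=0.3):
    """Build a per-patch label map from class-specific attention scores.

    class_attention maps a non-zero class id to a list of per-patch scores
    in [0, 1] (assumed already averaged over heads and layers).
    Only classes in predicted_classes compete for a patch. A patch whose
    best score does not exceed the threshold is labeled 0 (background)."""
    num_patches = len(next(iter(class_attention.values())))
    mask = []
    for patch in range(num_patches):
        best_class, best_score = 0, background_threshold
        for class_id in predicted_classes:
            score = class_attention[class_id][patch]
            if score > best_score:
                best_class, best_score = class_id, score
        mask.append(best_class)
    return mask


# Hypothetical attention scores for four patches and three [CLS] tokens.
attention = {
    1: [0.90, 0.20, 0.10, 0.40],
    2: [0.10, 0.80, 0.20, 0.60],
    3: [0.95, 0.95, 0.95, 0.95],  # class 3 is not predicted, so it is ignored
}
mask = pseudo_mask(attention, predicted_classes=[1, 2])
```

Restricting the competition to predicted classes is what keeps a strongly activated but absent class (class 3 in the toy data) from leaking into the mask.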


[48] IAP: Invisible Adversarial Patch Attack through Perceptibility-Aware Localization and Perturbation Optimization cs.CV | cs.AIPDF

Subrat Kishore Dutta, Xiao Zhang

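TL;DR: The paper proposes IAP, a new adversarial patch attack framework that generates highly invisible adversarial patches through perceptibility-aware localization and perturbation optimization. IAP achieves competitive attack success rates with markedly better invisibility than existing baselines and can bypass existing patch defenses.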

Details

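Motivation: Existing adversarial patch attacks either perform poorly in targeted attack scenarios or produce patches that are not contextually coherent, making them easy for human examiners to notice and unable to evade automatic patch defenses. A stealthier and more effective attack is therefore needed.

Result: Experiments across multiple image benchmarks and model architectures show that, in targeted settings, IAP achieves competitive attack success rates with significantly improved patch invisibility, and renders several state-of-the-art patch defenses ineffective.

Insight: Balancing attack effectiveness against what the human visual system can perceive yields stealthier adversarial patches that also bypass defenses, which suggests new directions for research on adversarial attacks and defenses.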

Abstract: Despite modifying only a small localized input region, adversarial patches can drastically change the prediction of computer vision models. However, prior methods either cannot perform satisfactorily under targeted attack scenarios or fail to produce contextually coherent adversarial patches, causing them to be easily noticeable by human examiners and insufficiently stealthy against automatic patch defenses. In this paper, we introduce IAP, a novel attack framework that generates highly invisible adversarial patches based on perceptibility-aware localization and perturbation optimization schemes. Specifically, IAP first searches for a proper location to place the patch by leveraging classwise localization and sensitivity maps, balancing the susceptibility of patch location to both victim model prediction and human visual system, then employs a perceptibility-regularized adversarial loss and a gradient update rule that prioritizes color constancy for optimizing invisible perturbations. Comprehensive experiments across various image benchmarks and model architectures demonstrate that IAP consistently achieves competitive attack success rates in targeted settings with significantly improved patch invisibility compared to existing baselines. In addition to being highly imperceptible to humans, IAP is shown to be stealthy enough to render several state-of-the-art patch defenses ineffective.


[49] SemRaFiner: Panoptic Segmentation in Sparse and Noisy Radar Point Clouds cs.CVPDF

Matthias Zeller, Daniel Casado Herraez, Bengisu Ayan, Jens Behley, Michael Heidingsfeld

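TL;DR: The paper proposes SemRaFiner, a method for panoptic segmentation in sparse and noisy radar point clouds that improves accuracy by optimizing feature extraction and the training procedure.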

Details

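Motivation: Existing camera- and LiDAR-based semantic scene understanding methods degrade in adverse weather and usually provide no motion information. Radar sensors overcome these limitations, but their point clouds are sparse and noisy, so improved segmentation methods are needed.

Result: Experiments suggest that SemRaFiner outperforms state-of-the-art methods for radar-based panoptic segmentation.

Insight: Radar's robustness in adverse weather makes it an important complement for scene understanding in autonomous driving, but its sparse and noisy data call for dedicated methods.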

Abstract: Semantic scene understanding, including the perception and classification of moving agents, is essential to enabling safe and robust driving behaviours of autonomous vehicles. Cameras and LiDARs are commonly used for semantic scene understanding. However, both sensor modalities face limitations in adverse weather and usually do not provide motion information. Radar sensors overcome these limitations and directly offer information about moving agents by measuring the Doppler velocity, but the measurements are comparably sparse and noisy. In this paper, we address the problem of panoptic segmentation in sparse radar point clouds to enhance scene understanding. Our approach, called SemRaFiner, accounts for changing density in sparse radar point clouds and optimizes the feature extraction to improve accuracy. Furthermore, we propose an optimized training procedure to refine instance assignments by incorporating a dedicated data augmentation. Our experiments suggest that our approach outperforms state-of-the-art methods for radar-based panoptic segmentation.


[50] Adaptive Part Learning for Fine-Grained Generalized Category Discovery: A Plug-and-Play Enhancement cs.CVPDF

Qiyuan Dai, Hanzhuo Huang, Yu Wu, Sibei Yang

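TL;DR: The paper proposes Adaptive Part Learning (APL), which uses shared learnable part queries and DINO part priors to generate consistent object parts and their correspondences without additional annotations, significantly improving performance on fine-grained category discovery.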

Details

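Motivation: Existing GCD methods rely on DINO's global representation, which introduces an inherent trade-off between discriminability and generalization and falls short for fine-grained categories.

Result: APL significantly improves GCD frameworks on fine-grained datasets.

Insight: Part-level learning can resolve the tension between discriminability and generalization in global representations, making it well suited to fine-grained tasks.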

Abstract: Generalized Category Discovery (GCD) aims to recognize unlabeled images from known and novel classes by distinguishing novel classes from known ones, while also transferring knowledge from another set of labeled images with known classes. Existing GCD methods rely on self-supervised vision transformers such as DINO for representation learning. However, focusing solely on the global representation of the DINO CLS token introduces an inherent trade-off between discriminability and generalization. In this paper, we introduce an adaptive part discovery and learning method, called APL, which generates consistent object parts and their correspondences across different similar images using a set of shared learnable part queries and DINO part priors, without requiring any additional annotations. More importantly, we propose a novel all-min contrastive loss to learn discriminative yet generalizable part representation, which adaptively highlights discriminative object parts to distinguish similar categories for enhanced discriminability while simultaneously sharing other parts to facilitate knowledge transfer for improved generalization. Our APL can easily be incorporated into different GCD frameworks by replacing their CLS token feature with our part representations, showing significant enhancements on fine-grained datasets.


[51] Pre-Columbian Settlements Shaped Palm Clusters in the Sierra Nevada de Santa Marta, Colombia cs.CVPDF

Sebastian Fajardo, Sina Mohammadi, Jonas Gregorio de Souza, César Ardila, Alan Tapscott Baltar

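TL;DR: The paper proposes an approach that combines deep learning with a clustering algorithm to identify palm distributions in satellite imagery and, from them, estimate areas managed by ancient populations, revealing their long-term influence on vegetation.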

Details

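Motivation: The study addresses the long-term effects of ancient human activity on tropical forests, in particular identifying managed areas at high-resolution scales, and offers a new perspective for ecology and archaeology.

Result: Palms are significantly more abundant near archaeological sites showing large infrastructure investment, and the managed areas around major sites may be up to two orders of magnitude larger than archaeological evidence alone indicates.

Insight: Ancient human activity left a lasting ecological footprint by fostering palm proliferation, which may have lowered the cost of establishing infrastructure in hard-to-reach locations. The study shows the potential of combining AI with ecological and archaeological data to reveal human-environment interactions.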

Abstract: Ancient populations markedly transformed Neotropical forests, yet understanding the long-term effects of ancient human management, particularly at high-resolution scales, remains challenging. In this work we propose a new approach to investigate archaeological areas of influence based on vegetation signatures. It consists of a deep learning model trained on satellite imagery to identify palm trees, followed by a clustering algorithm to identify palm clusters, which are then used to estimate ancient management areas. To assess the palm distribution in relation to past human activity, we applied the proposed approach to unique high-resolution satellite imagery data covering 765 km² of the Sierra Nevada de Santa Marta, Colombia. With this work, we also release a manually annotated palm tree dataset along with estimated locations of archaeological sites from ground-surveys and legacy records. Results demonstrate how palms were significantly more abundant near archaeological sites showing large infrastructure investment. The extent of the largest palm cluster indicates that ancient human-managed areas linked to major infrastructure sites may be up to two orders of magnitude bigger than indicated by archaeological evidence alone. Our findings suggest that pre-Columbian populations influenced local vegetation fostering conditions conducive to palm proliferation, leaving a lasting ecological footprint. This may have lowered the logistical costs of establishing infrastructure-heavy settlements in otherwise less accessible locations. Overall, this study demonstrates the potential of integrating artificial intelligence approaches with new ecological and archaeological data to identify archaeological areas of interest through vegetation patterns, revealing fine-scale human-environment interactions.


[52] CheXPO: Preference Optimization for Chest X-ray VLMs with Counterfactual Rationale cs.CV | cs.AIPDF

Xiao Liang, Jiawei Hu, Di Wang, Zhi Ma, Lin Zhao

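TL;DR: CheXPO proposes a chest X-ray preference optimization strategy that combines confidence-similarity joint mining with counterfactual rationale, mitigating hallucinations of vision-language models in medical applications and significantly reducing the need for expert annotation.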

Details

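Motivation: Vision-language models (VLMs) are prone to hallucinations in medical applications, which undermines reliability. Conventional preference optimization faces clinically irrelevant samples, imbalanced data distributions, and high expert annotation costs, so a more efficient and scalable solution is needed.

Result: CheXPO achieves an 8.93% relative performance gain using only 5% of the SFT samples and reaches state-of-the-art performance across diverse clinical tasks.

Insight: By jointly mining confidence and similarity and adding counterfactual rationales, CheXPO provides a scalable and interpretable solution, particularly suited to real-world radiology applications.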

Abstract: Vision-language models (VLMs) are prone to hallucinations that critically compromise reliability in medical applications. While preference optimization can mitigate these hallucinations through clinical feedback, its implementation faces challenges such as clinically irrelevant training samples, imbalanced data distributions, and prohibitive expert annotation costs. To address these challenges, we introduce CheXPO, a Chest X-ray Preference Optimization strategy that combines confidence-similarity joint mining with counterfactual rationale. Our approach begins by synthesizing a unified, fine-grained multi-task chest X-ray visual instruction dataset across different question types for supervised fine-tuning (SFT). We then identify hard examples through token-level confidence analysis of SFT failures and use similarity-based retrieval to expand hard examples for balancing preference sample distributions, while synthetic counterfactual rationales provide fine-grained clinical preferences, eliminating the need for additional expert input. Experiments show that CheXPO achieves 8.93% relative performance gain using only 5% of SFT samples, reaching state-of-the-art performance across diverse clinical tasks and providing a scalable, interpretable solution for real-world radiology applications.


[53] Hallucinating 360°: Panoramic Street-View Generation via Local Scenes Diffusion and Probabilistic Prompting cs.CV | cs.RO | eess.IVPDF

Fei Teng, Kai Luo, Sheng Wu, Siyu Li, Pujun Guo

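TL;DR: Percep360 proposes a panoramic street-view generation method based on local scenes diffusion and probabilistic prompting, addressing coherence and controllability in panoramic data generation.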

Details

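Motivation: Panoramic perception is important for autonomous driving, but acquiring panoramic data is complex and time-consuming. Existing methods cannot generate panoramic data with high quality and controllability.

Result: The generated images outperform the original stitched images on no-reference quality metrics and improve downstream perception models.

Insight: A diffusion process combined with dynamically selected prompts yields high-quality panoramic data generation, offering a new approach to data augmentation for autonomous driving.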

Abstract: Panoramic perception holds significant potential for autonomous driving, enabling vehicles to acquire a comprehensive 360° surround view in a single shot. However, autonomous driving is a data-driven task. Complete panoramic data acquisition requires complex sampling systems and annotation pipelines, which are time-consuming and labor-intensive. Although existing street view generation models have demonstrated strong data regeneration capabilities, they can only learn from the fixed data distribution of existing datasets and cannot achieve high-quality, controllable panoramic generation. In this paper, we propose the first panoramic generation method Percep360 for autonomous driving. Percep360 enables coherent generation of panoramic data with control signals based on the stitched panoramic data. Percep360 focuses on two key aspects: coherence and controllability. Specifically, to overcome the inherent information loss caused by the pinhole sampling process, we propose the Local Scenes Diffusion Method (LSDM). LSDM reformulates the panorama generation as a spatially continuous diffusion process, bridging the gaps between different data distributions. Additionally, to achieve the controllable generation of panoramic images, we propose a Probabilistic Prompting Method (PPM). PPM dynamically selects the most relevant control cues, enabling controllable panoramic image generation. We evaluate the effectiveness of the generated images from three perspectives: image quality assessment (i.e., no-reference and with reference), controllability, and their utility in real-world Bird’s Eye View (BEV) segmentation. Notably, the generated data consistently outperforms the original stitched images in no-reference quality metrics and enhances downstream perception models. The source code will be publicly available at https://github.com/Bryant-Teng/Percep360.


[54] A multi-modal dataset for insect biodiversity with imagery and DNA at the trap and individual level cs.CVPDF

Johanna Orsholm, John Quinto, Hannu Autto, Gaia Banelyte, Nicolas Chazot

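TL;DR: This paper introduces the MassID45 dataset, which combines molecular and imaging data to support automatic classification of bulk insect samples and advances tiny object detection and instance segmentation.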

Details

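Motivation: Insects are highly diverse and many populations are declining, so efficient methods are needed to study their diversity. Existing techniques mostly rely on individual-specimen data and cannot meet the needs of bulk-sample ecological surveys.

Result: The dataset combines the taxonomic resolution of DNA barcodes with abundance estimates from bulk images, providing a new tool for rapid, large-scale study of insect communities.

Insight: The dataset drives innovation in tiny object detection and instance segmentation and opens new directions for both ecology and machine learning research.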

Abstract: Insects comprise millions of species, many experiencing severe population declines under environmental and habitat changes. High-throughput approaches are crucial for accelerating our understanding of insect diversity, with DNA barcoding and high-resolution imaging showing strong potential for automatic taxonomic classification. However, most image-based approaches rely on individual specimen data, unlike the unsorted bulk samples collected in large-scale ecological surveys. We present the Mixed Arthropod Sample Segmentation and Identification (MassID45) dataset for training automatic classifiers of bulk insect samples. It uniquely combines molecular and imaging data at both the unsorted sample level and the full set of individual specimens. Human annotators, supported by an AI-assisted tool, performed two tasks on bulk images: creating segmentation masks around each individual arthropod and assigning taxonomic labels to over 17 000 specimens. Combining the taxonomic resolution of DNA barcodes with precise abundance estimates of bulk images holds great potential for rapid, large-scale characterization of insect communities. This dataset pushes the boundaries of tiny object detection and instance segmentation, fostering innovation in both ecological and machine learning research.


[55] Free on the Fly: Enhancing Flexibility in Test-Time Adaptation with Online EM cs.CVPDF

Qiyuan Dai, Sibei Yang

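TL;DR: FreeTTA is a training-free, broadly applicable test-time adaptation method that uses an online EM algorithm with zero-shot predictions from vision-language models as priors, significantly improving performance in cross-domain and out-of-distribution settings.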

Details

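Motivation: Vision-language models are limited in practice by domain shift and distributional change, and traditional test-time adaptation methods rely on costly training or on assumptions about storing data. FreeTTA aims to address these issues and make adaptation more flexible.

Result: Across 15 datasets in cross-domain and out-of-distribution settings, FreeTTA achieves stable and significant improvements over existing methods.

Insight: Explicitly modeling the test data distribution and exploiting relationships among samples is a previously unexplored and effective direction, and the online EM algorithm is an efficient tool for real-time adaptation.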

Abstract: Vision-Language Models (VLMs) have become prominent in open-world image recognition for their strong generalization abilities. Yet, their effectiveness in practical applications is compromised by domain shifts and distributional changes, especially when test data distributions diverge from training data. Therefore, the paradigm of test-time adaptation (TTA) has emerged, enabling the use of online off-the-shelf data at test time, supporting independent sample predictions, and eliminating reliance on test annotations. Traditional TTA methods, however, often rely on costly training or optimization processes, or make unrealistic assumptions about accessing or storing historical training and test data. Instead, this study proposes FreeTTA, a training-free and universally available method that makes no assumptions, to enhance the flexibility of TTA. More importantly, FreeTTA is the first to explicitly model the test data distribution, enabling the use of intrinsic relationships among test samples to enhance predictions of individual samples without simultaneous access, a direction not previously explored. FreeTTA achieves these advantages by introducing an online EM algorithm that utilizes zero-shot predictions from VLMs as priors to iteratively compute the posterior probabilities of each online test sample and update parameters. Experiments demonstrate that FreeTTA achieves stable and significant improvements compared to state-of-the-art methods across 15 datasets in both cross-domain and out-of-distribution settings.
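
The online EM idea in the abstract can be illustrated with a minimal sketch. This is not the FreeTTA algorithm: it assumes one Gaussian per class with a shared isotropic variance, hypothetical initial class means, and a pseudo-count of 1, none of which come from the paper. It only shows how a zero-shot prior can be combined with a running estimate of the test distribution, one sample at a time, without storing past samples.

```python
import math


class OnlineEM:
    """Toy online EM over class means (illustrative assumptions only)."""

    def __init__(self, init_means, variance=1.0, pseudo_count=1.0):
        # init_means, variance and pseudo_count are hypothetical choices.
        self.means = [list(m) for m in init_means]
        self.counts = [pseudo_count] * len(init_means)
        self.variance = variance

    def step(self, x, zero_shot_probs):
        # E-step: posterior is proportional to prior * Gaussian likelihood.
        log_post = []
        for mean, prior in zip(self.means, zero_shot_probs):
            sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
            log_post.append(math.log(max(prior, 1e-12)) - sq_dist / (2.0 * self.variance))
        top = max(log_post)  # subtract the max for numerical stability
        weights = [math.exp(v - top) for v in log_post]
        total = sum(weights)
        posterior = [w / total for w in weights]

        # M-step: responsibility-weighted running mean, no sample storage.
        for k, r in enumerate(posterior):
            self.counts[k] += r
            step_size = r / self.counts[k]
            self.means[k] = [m + step_size * (xi - m) for m, xi in zip(self.means[k], x)]
        return posterior


# Two classes in a 2-D feature space; the prior is uninformative here.
em = OnlineEM(init_means=[[0.0, 0.0], [4.0, 4.0]])
posterior = em.step([3.5, 4.2], zero_shot_probs=[0.5, 0.5])
```

Each call to `step` returns the adapted prediction for the current sample and then folds that sample into the class statistics, so later samples benefit from earlier ones.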


[56] MCA-RG: Enhancing LLMs with Medical Concept Alignment for Radiology Report Generation cs.CV | cs.AIPDF

Qilong Xing, Zikai Song, Youjia Zhang, Na Feng, Junqing Yu

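TL;DR: MCA-RG is a knowledge-driven method that improves radiology report generation by explicitly aligning visual features with medical concepts. It uses pathology and anatomy knowledge banks to enhance feature extraction, and contrastive learning plus a matching loss to optimize the model.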

Details

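Motivation: Large language models (LLMs) for radiology report generation (RRG) still face barriers to clinical adoption, mainly because pathological and anatomical features are hard to map accurately to text descriptions and because feature extraction is semantic-agnostic.

Result: Experiments on MIMIC-CXR and CheXpert Plus show that MCA-RG outperforms existing methods, confirming its effectiveness.

Insight: Explicit alignment of medical concepts with visual features is key to better radiology report generation; knowledge-driven feature enhancement and gating can significantly improve model performance.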

Abstract: Despite significant advancements in adapting Large Language Models (LLMs) for radiology report generation (RRG), clinical adoption remains challenging due to difficulties in accurately mapping pathological and anatomical features to their corresponding text descriptions. Additionally, semantic-agnostic feature extraction further hampers the generation of accurate diagnostic reports. To address these challenges, we introduce Medical Concept Aligned Radiology Report Generation (MCA-RG), a knowledge-driven framework that explicitly aligns visual features with distinct medical concepts to enhance the report generation process. MCA-RG utilizes two curated concept banks: a pathology bank containing lesion-related knowledge, and an anatomy bank with anatomical descriptions. The visual features are aligned with these medical concepts and undergo tailored enhancement. We further propose an anatomy-based contrastive learning procedure to improve the generalization of anatomical features, coupled with a matching loss for pathological features to prioritize clinically relevant regions. Additionally, a feature gating mechanism is employed to filter out low-quality concept features. Finally, the visual features correspond to individual medical concepts and are leveraged to guide the report generation process. Experiments on two public benchmarks (MIMIC-CXR and CheXpert Plus) demonstrate that MCA-RG achieves superior performance, highlighting its effectiveness in radiology report generation.


[57] Cross-Modality Masked Learning for Survival Prediction in ICI Treated NSCLC Patients cs.CV | cs.AIPDF

Qilong Xing, Zikai Song, Bingxin Gong, Lian Yang, Junqing Yu

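TL;DR: This paper proposes a cross-modality masked learning framework for survival prediction in non-small cell lung cancer (NSCLC) patients, combining 3D CT images and clinical data for more accurate multi-modal feature fusion.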

Details

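Motivation: Survival prediction for NSCLC patients after immunotherapy is essential for personalized treatment, but the lack of large relevant datasets and effective multi-modal feature fusion methods has held the field back.

Result: The method performs strongly in multi-modal integration, surpassing existing methods and setting a new benchmark for NSCLC survival prediction.

Insight: A masked learning strategy that uses the intact modality to reconstruct the missing parts integrates multi-modal features more effectively and improves model performance.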

Abstract: Accurate prognosis of non-small cell lung cancer (NSCLC) patients undergoing immunotherapy is essential for personalized treatment planning, enabling informed patient decisions, and improving both treatment outcomes and quality of life. However, the lack of large, relevant datasets and effective multi-modal feature fusion strategies poses significant challenges in this domain. To address these challenges, we present a large-scale dataset and introduce a novel framework for multi-modal feature fusion aimed at enhancing the accuracy of survival prediction. The dataset comprises 3D CT images and corresponding clinical records from NSCLC patients treated with immune checkpoint inhibitors (ICI), along with progression-free survival (PFS) and overall survival (OS) data. We further propose a cross-modality masked learning approach for medical feature fusion, consisting of two distinct branches, each tailored to its respective modality: a Slice-Depth Transformer for extracting 3D features from CT images and a graph-based Transformer for learning node features and relationships among clinical variables in tabular data. The fusion process is guided by a masked modality learning strategy, wherein the model utilizes the intact modality to reconstruct missing components. This mechanism improves the integration of modality-specific features, fostering more effective inter-modality relationships and feature interactions. Our approach demonstrates superior performance in multi-modal integration for NSCLC survival prediction, surpassing existing methods and setting a new benchmark for prognostic models in this context.


[58] GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning cs.CV | cs.LGPDF

S M Taslim Uddin Raju, Md. Milon Islam, Md Rezwanul Haque, Hamdi Altaheri, Fakhri Karray

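TL;DR: GNN-ViTCap combines a GNN with a vision Transformer and uses dynamic clustering and an attention mechanism to improve WSI classification and captioning, significantly boosting performance.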

Details

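Motivation: WSI classification and captioning face challenges such as redundant patches and unknown patch positions, and existing methods struggle to combine local and global information effectively.

Result: Classification reaches an F1 score of 0.934 and an AUC of 0.963; captioning reaches a BLEU-4 score of 0.811 and a METEOR score of 0.569, surpassing existing methods.

Insight: Combining a GNN with a vision Transformer effectively addresses patch redundancy and context modeling in WSIs, improving the efficiency of pathological diagnosis.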

Abstract: Microscopic assessment of histopathology images is vital for accurate cancer diagnosis and treatment. Whole Slide Image (WSI) classification and captioning have become crucial tasks in computer-aided pathology. However, microscopic WSIs face challenges such as redundant patches and unknown patch positions due to subjective pathologist captures. Moreover, generating automatic pathology captions remains a significant challenge. To address these issues, we introduce a novel GNN-ViTCap framework for classification and caption generation from histopathological microscopic images. First, a visual feature extractor generates patch embeddings. Redundant patches are then removed by dynamically clustering these embeddings using deep embedded clustering and selecting representative patches via a scalar dot attention mechanism. We build a graph by connecting each node to its nearest neighbors in the similarity matrix and apply a graph neural network to capture both local and global context. The aggregated image embeddings are projected into the language model’s input space through a linear layer and combined with caption tokens to fine-tune a large language model. We validate our method on the BreakHis and PatchGastric datasets. GNN-ViTCap achieves an F1 score of 0.934 and an AUC of 0.963 for classification, along with a BLEU-4 score of 0.811 and a METEOR score of 0.569 for captioning. Experimental results demonstrate that GNN-ViTCap outperforms state-of-the-art approaches, offering a reliable and efficient solution for microscopy-based patient diagnosis.
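
The graph-construction step described above (connect each node to its nearest neighbors in the similarity matrix) can be sketched in a few lines. The sketch assumes cosine similarity and directed top-k edges; the abstract does not state which similarity measure or value of k the authors use, and the function names and toy embeddings are hypothetical.

```python
import math


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity; assumes no embedding is the zero vector."""
    norms = [math.sqrt(sum(x * x for x in e)) for e in embeddings]
    n = len(embeddings)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim


def knn_edges(sim, k):
    """Directed edges (i, j) from each node i to its k most similar nodes j != i."""
    edges = []
    for i, row in enumerate(sim):
        candidates = [j for j in range(len(row)) if j != i]
        candidates.sort(key=lambda j: row[j], reverse=True)
        edges.extend((i, j) for j in candidates[:k])
    return edges


# Four toy patch embeddings forming two obvious pairs.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
edges = knn_edges(cosine_similarity_matrix(patches), k=1)
```

The resulting edge list is the kind of input a graph neural network layer would consume; the GNN itself is outside the scope of this sketch.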


[59] MST-Distill: Mixture of Specialized Teachers for Cross-Modal Knowledge Distillation cs.CV | cs.LG | cs.MMPDF

Hui Li, Pengfei Yang, Juanyang Chen, Le Dong, Yanxin Chen

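TL;DR: MST-Distill is a new cross-modal knowledge distillation framework that mixes multiple specialized teacher models with a dynamic routing network to address the distillation path selection and knowledge drift problems of conventional methods.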

Details

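Motivation: In cross-modal settings, conventional knowledge distillation methods struggle to exploit the complementary knowledge of multimodal teacher models because of data and statistical heterogeneity. The paper empirically reveals the limitations of existing methods and proposes a more effective solution.

Result: Experiments on five multimodal datasets show that MST-Distill significantly outperforms existing methods.

Insight: A dynamic mixture of teacher models and a masking module that suppresses modality discrepancies are key to improving cross-modal knowledge distillation.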

Abstract: Knowledge distillation, as an efficient knowledge transfer technique, has achieved remarkable success in unimodal scenarios. However, in cross-modal settings, conventional distillation methods encounter significant challenges due to data and statistical heterogeneities, failing to leverage the complementary prior knowledge embedded in cross-modal teacher models. This paper empirically reveals two critical issues in existing approaches: distillation path selection and knowledge drift. To address these limitations, we propose MST-Distill, a novel cross-modal knowledge distillation framework featuring a mixture of specialized teachers. Our approach employs a diverse ensemble of teacher models across both cross-modal and multimodal configurations, integrated with an instance-level routing network that facilitates adaptive and dynamic distillation. This architecture effectively transcends the constraints of traditional methods that rely on monotonous and static teacher models. Additionally, we introduce a plug-in masking module, independently trained to suppress modality-specific discrepancies and reconstruct teacher representations, thereby mitigating knowledge drift and enhancing transfer effectiveness. Extensive experiments across five diverse multimodal datasets, spanning visual, audio, and text, demonstrate that our method significantly outperforms existing state-of-the-art knowledge distillation methods in cross-modal distillation tasks. The source code is available at https://github.com/Gray-OREO/MST-Distill.
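
To make the instance-level routing idea concrete, the sketch below weights per-teacher distillation losses by a softmax gate computed for a single input. It is an illustration under assumptions, not MST-Distill itself: the abstract does not say whether routing is soft or top-k, which divergence is used, or how the masking module enters the loss. Here the gate is a plain softmax, the per-teacher loss is KL(teacher || student), and all names are hypothetical.

```python
import math


def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def routed_distillation_loss(student_logits, teacher_logits_list, routing_logits):
    """Gate-weighted sum of KL(teacher || student) over all teachers, for one instance."""
    student = softmax(student_logits)
    gates = softmax(routing_logits)  # one routing weight per teacher
    losses = [kl_divergence(softmax(t), student) for t in teacher_logits_list]
    return sum(g * l for g, l in zip(gates, losses)), gates


# One instance, two teachers: the first agrees with the student, the second does not.
loss, gates = routed_distillation_loss(
    student_logits=[2.0, 0.0],
    teacher_logits_list=[[2.0, 0.0], [0.0, 2.0]],
    routing_logits=[0.0, 0.0],
)
```

With equal routing logits each teacher contributes half of its loss; a learned router would shift the weight toward whichever teacher is more useful for the given instance.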


[60] Evaluating Large Multimodal Models for Nutrition Analysis: A Benchmark Enriched with Contextual Metadata cs.CVPDF

Bruce Coburn, Jiangpeng He, Megan E. Rollo, Satvinder S. Dhaliwal, Deborah A. Kerr

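TL;DR: This paper studies how integrating contextual metadata (such as location, time, and food type) improves the performance of large multimodal models (LMMs) in nutrition analysis, and introduces a new dataset, ACETADA.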

Details

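Motivation: Existing work mainly evaluates proprietary models (such as GPT-4), overlooks the potential of the wide range of open-weight models, and leaves contextual metadata and its interaction with reasoning modifiers unstudied.

Result: Integrating metadata intelligently significantly reduces the Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) of predicted nutritional values.

Insight: Context-aware LMMs have great potential for nutrition analysis, and the capabilities of open-weight LMMs remain underexplored.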

Abstract: Large Multimodal Models (LMMs) are increasingly applied to meal images for nutrition analysis. However, existing work primarily evaluates proprietary models, such as GPT-4. This leaves the broad range of LLMs underexplored. Additionally, the influence of integrating contextual metadata and its interaction with various reasoning modifiers remains largely uncharted. This work investigates how interpreting contextual metadata derived from GPS coordinates (converted to location/venue type), timestamps (transformed into meal/day type), and the food items present can enhance LMM performance in estimating key nutritional values. These values include calories, macronutrients (protein, carbohydrates, fat), and portion sizes. We also introduce ACETADA, a new food-image dataset slated for public release. This open dataset provides nutrition information verified by the dietitian and serves as the foundation for our analysis. Our evaluation across eight LMMs (four open-weight and four closed-weight) first establishes the benefit of contextual metadata integration over straightforward prompting with images alone. We then demonstrate how this incorporation of contextual information enhances the efficacy of reasoning modifiers, such as Chain-of-Thought, Multimodal Chain-of-Thought, Scale Hint, Few-Shot, and Expert Persona. Empirical results show that integrating metadata intelligently, when applied through straightforward prompting strategies, can significantly reduce the Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) in predicted nutritional values. This work highlights the potential of context-aware LMMs for improved nutrition analysis.
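
For readers unfamiliar with the two error metrics named above, here is a small sketch of how MAE and MAPE are computed. The calorie values are made up for illustration and are not taken from ACETADA or from the paper's results.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted|, in the same units as the inputs."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual|, in percent (actual must be nonzero)."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical calorie values (kcal) for three meals, for illustration only.
true_kcal = [500.0, 250.0, 800.0]
pred_kcal = [450.0, 300.0, 800.0]

mae = mean_absolute_error(true_kcal, pred_kcal)              # about 33.3 kcal
mape = mean_absolute_percentage_error(true_kcal, pred_kcal)  # about 10 percent
```

MAE reports the error in the unit of the quantity (kcal, grams), while MAPE normalizes by the true value, which makes errors comparable across nutrients of very different magnitudes.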


[61] Reading a Ruler in the Wild cs.CVPDF

Yimu Pan, Manas Mehta, Gwen Sincerbeaux, Jeffery A. Goldstein, Alison D. Gernand

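TL;DR: This paper proposes RulerNet, a deep learning framework that reformulates ruler reading as a unified keypoint-detection problem and represents the ruler with geometric-progression parameters, addressing the challenge of converting pixel measurements to real-world scale in complex environments.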

Details

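Motivation: Traditional methods rely on handcrafted thresholds or fixed, ruler-specific pipelines and generalize poorly across ruler types and imaging conditions. The work aims to develop a general method that robustly infers real-world scale.

Result: Experiments show that RulerNet delivers accurate, consistent, and efficient scale estimates under challenging real-world conditions, across diverse ruler types and imaging conditions.

Insight: Geometric-progression parameterization is an effective way to handle perspective transformations, synthetic data can markedly improve generalization, and the lightweight network design makes real-time use possible.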

Abstract: Accurately converting pixel measurements into absolute real-world dimensions remains a fundamental challenge in computer vision and limits progress in key applications such as biomedicine, forensics, nutritional analysis, and e-commerce. We introduce RulerNet, a deep learning framework that robustly infers scale “in the wild” by reformulating ruler reading as a unified keypoint-detection problem and by representing the ruler with geometric-progression parameters that are invariant to perspective transformations. Unlike traditional methods that rely on handcrafted thresholds or rigid, ruler-specific pipelines, RulerNet directly localizes centimeter marks using a distortion-invariant annotation and training strategy, enabling strong generalization across diverse ruler types and imaging conditions while mitigating data scarcity. We also present a scalable synthetic-data pipeline that combines graphics-based ruler generation with ControlNet to add photorealistic context, greatly increasing training diversity and improving performance. To further enhance robustness and efficiency, we propose DeepGP, a lightweight feed-forward network that regresses geometric-progression parameters from noisy marks and eliminates iterative optimization, enabling real-time scale estimation on mobile or edge devices. Experiments show that RulerNet delivers accurate, consistent, and efficient scale estimates under challenging real-world conditions. These results underscore its utility as a generalizable measurement tool and its potential for integration with other vision components for automated, scale-aware analysis in high-impact domains. A live demo is available at https://huggingface.co/spaces/ymp5078/RulerNet-Demo.
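
One way to picture the geometric-progression representation is the rough sketch below. It is neither RulerNet nor DeepGP: it assumes, as a reading of the abstract, that the pixel gaps between consecutive centimeter marks change by a constant ratio along the ruler, and it recovers that ratio with a simple median rather than a learned regressor. The function names and the sample numbers are hypothetical.

```python
def geometric_marks(first_position, first_gap, ratio, count):
    """Pixel positions of `count` marks whose gaps form a geometric progression."""
    positions = [first_position]
    gap = first_gap
    for _ in range(count - 1):
        positions.append(positions[-1] + gap)
        gap *= ratio
    return positions


def estimate_ratio(positions):
    """Median ratio of consecutive gaps; needs at least three marks."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    ratios = sorted(g2 / g1 for g1, g2 in zip(gaps, gaps[1:]))
    return ratios[len(ratios) // 2]


def local_pixels_per_cm(positions):
    """Pixel length of each centimeter interval between adjacent marks."""
    return [b - a for a, b in zip(positions, positions[1:])]


# Synthetic ruler: marks start at pixel 100, first centimeter spans 40 px,
# and each following centimeter appears 5 percent shorter (receding ruler).
marks = geometric_marks(first_position=100.0, first_gap=40.0, ratio=0.95, count=6)
ratio = estimate_ratio(marks)
scale = local_pixels_per_cm(marks)
```

Under this assumption, three numbers (first position, first gap, ratio) describe every mark on the ruler, which is why a small network can regress them from noisy detections.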


[62] Evaluating Attribute Confusion in Fashion Text-to-Image Generation cs.CVPDF

Ziyue Liu, Federico Girella, Yiming Wang, Davide Talon

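TL;DR: The paper proposes L-VQAScore, a new metric based on Visual Question Answering (VQA), for evaluating attribute confusion in fashion text-to-image generation models; it outperforms existing methods.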

Details

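Motivation: Current automatic evaluation methods for text-to-image (T2I) generation models fall short in the fashion domain, in particular failing to capture complex entity-attribute associations accurately (for example, attribute confusion).

Result: On a new dataset featuring challenging alignment scenarios, L-VQAScore outperforms existing T2I evaluation methods and correlates more strongly with human judgments.

Insight: A localized VQA strategy effectively captures fine-grained entity-attribute associations and offers a new approach to T2I evaluation.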

Abstract: Despite the rapid advances in Text-to-Image (T2I) generation models, their evaluation remains challenging in domains like fashion, involving complex compositional generation. Recent automated T2I evaluation methods leverage pre-trained vision-language models to measure cross-modal alignment. However, our preliminary study reveals that they are still limited in assessing rich entity-attribute semantics, facing challenges in attribute confusion, i.e., when attributes are correctly depicted but associated with the wrong entities. To address this, we build on a Visual Question Answering (VQA) localization strategy targeting one single entity at a time across both visual and textual modalities. We propose a localized human evaluation protocol and introduce a novel automatic metric, Localized VQAScore (L-VQAScore), that combines visual localization with VQA probing both correct (reflection) and mis-localized (leakage) attribute generation. On a newly curated dataset featuring challenging compositional alignment scenarios, L-VQAScore outperforms state-of-the-art T2I evaluation methods in terms of correlation with human judgments, demonstrating its strength in capturing fine-grained entity-attribute associations. We believe L-VQAScore can be a reliable and scalable alternative to subjective evaluations.


[63] Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data cs.CVPDF

Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao

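TL;DR: This paper introduces the MotionMillion dataset and the MotionMillion-Eval benchmark to advance zero-shot text-to-motion generation, and achieves strong generalization by scaling up data and model parameters.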

Details

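Motivation: Current text-to-motion methods have limited zero-shot generalization, mainly because training datasets are too small and a comprehensive evaluation framework is lacking.

Result: The model generalizes strongly to out-of-domain and complex compositional motions.

Insight: Large-scale data and a comprehensive evaluation framework are key to zero-shot motion generation.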

Abstract: Generating diverse and natural human motion sequences based on textual descriptions constitutes a fundamental and challenging research area within the domains of computer vision, graphics, and robotics. Despite significant advancements in this field, current methodologies often face challenges regarding zero-shot generalization capabilities, largely attributable to the limited size of training datasets. Moreover, the lack of a comprehensive evaluation framework impedes the advancement of this task by failing to identify directions for improvement. In this work, we aim to push text-to-motion into a new era, that is, to achieve zero-shot generalization. To this end, we first develop an efficient annotation pipeline and introduce MotionMillion, the largest human motion dataset to date, featuring over 2,000 hours and 2 million high-quality motion sequences. Additionally, we propose MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation. Leveraging a scalable architecture, we scale our model to 7B parameters and validate its performance on MotionMillion-Eval. Our results demonstrate strong generalization to out-of-domain and complex compositional motions, marking a significant step toward zero-shot human motion generation. The code is available at https://github.com/VankouF/MotionMillion-Codes.


[64] Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models cs.CVPDF

Tiezheng Zhang, Yitong Li, Yu-cheng Chou, Jieneng Chen, Alan Yuille

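TL;DR: The paper proposes the Vision-Language-Vision (VLV) auto-encoder framework, which builds an information bottleneck from pretrained components (a vision encoder, a T2I diffusion model, and an LLM) to distill knowledge from diffusion models efficiently, reducing training cost and data requirements.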

Details

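Motivation: Conventional vision-language models (VLMs) require large numbers of high-quality image-text pairs and GPU resources, making training expensive. The paper aims to cut training cost and data requirements through knowledge distillation and reuse of pretrained components.

Result: With little data and at low cost, the VLV framework achieves captioning performance comparable to GPT-4o and Gemini 2.0 Flash.

Insight: Combining pretrained components with an information-bottleneck design can significantly reduce VLM training cost and data requirements while maintaining high performance.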

Abstract: Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically necessitates training on billions of high-quality image-text pairs, requiring millions of GPU hours. This paper introduces the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages key pretrained components: a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and subsequently, a Large Language Model (LLM). Specifically, we establish an information bottleneck by regularizing the language representation space, achieved through freezing the pretrained T2I diffusion decoder. Our VLV pipeline effectively distills knowledge from the text-conditioned diffusion model using continuous embeddings, demonstrating comprehensive semantic understanding via high-quality reconstructions. Furthermore, by fine-tuning a pretrained LLM to decode the intermediate language representations into detailed descriptions, we construct a state-of-the-art (SoTA) captioner comparable to leading models like GPT-4o and Gemini 2.0 Flash. Our method demonstrates exceptional cost-efficiency and significantly reduces data requirements; by primarily utilizing single-modal images for training and maximizing the utility of existing pretrained models (image encoder, T2I diffusion model, and LLM), it circumvents the need for massive paired image-text datasets, keeping the total training expenditure under $1,000 USD.


[65] 4KAgent: Agentic Any Image to 4K Super-Resolution cs.CV | eess.IVPDF

Yushen Zuo, Qi Zheng, Mingyang Wu, Xinrui Jiang, Renjie Li

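TL;DR: 4KAgent is a general-purpose agentic upscaling system that can bring any image to 4K resolution. It combines a profiling module, a perception agent, and a restoration agent, and performs strongly across multiple task categories.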

Details

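Motivation: Existing super-resolution methods are usually designed for specific tasks and lack generality and flexibility. 4KAgent aims to provide unified, cross-domain super-resolution through an agentic system, especially for extremely low-resolution and severely degraded images.

Result: Across 26 benchmarks, 4KAgent achieves the best results on both perceptual and fidelity metrics (such as NIQE and PSNR), covering natural images, medical imaging, and other domains.

Insight: The agentic approach offers a new paradigm for low-level vision tasks, and its dynamic planning-and-execution mechanism may inspire broader research on vision-centric autonomous agents.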

Abstract: We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly distorted inputs at 256x256, into crystal-clear, photorealistic 4K outputs. 4KAgent comprises three core components: (1) Profiling, a module that customizes the 4KAgent pipeline based on bespoke use cases; (2) A Perception Agent, which leverages vision-language models alongside image quality assessment experts to analyze the input image and make a tailored restoration plan; and (3) A Restoration Agent, which executes the plan, following a recursive execution-reflection paradigm, guided by a quality-driven mixture-of-expert policy to select the optimal output for each step. Additionally, 4KAgent embeds a specialized face restoration pipeline, significantly enhancing facial details in portrait and selfie photos. We rigorously evaluate our 4KAgent across 11 distinct task categories encompassing a total of 26 diverse benchmarks, setting new state-of-the-art on a broad spectrum of imaging domains. Our evaluations cover natural images, portrait photos, AI-generated content, satellite imagery, fluorescence microscopy, and medical imaging like fundoscopy, ultrasound, and X-ray, demonstrating superior performance in terms of both perceptual (e.g., NIQE, MUSIQ) and fidelity (e.g., PSNR) metrics. By establishing a novel agentic paradigm for low-level vision tasks, we aim to catalyze broader interest and innovation within vision-centric autonomous agents across diverse research communities. We will release all the code, models, and results at: https://4kagent.github.io.


[66] Towards Multimodal Understanding via Stable Diffusion as a Task-Aware Feature Extractor cs.CV | cs.LGPDF

Vatsal Agarwal, Matthew Gwilliam, Gefen Kohavi, Eshan Verma, Daniel Ulbricht

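TL;DR: This paper explores using a pretrained text-to-image diffusion model (Stable Diffusion) as a task-aware feature extractor to compensate for CLIP's inability to capture fine-grained information in visual encoding. It analyzes the semantic richness and image-text alignment of diffusion features and proposes a strategy that fuses CLIP and diffusion features, improving the visual understanding of multimodal models.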

Details

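Motivation: Existing multimodal large language models (MLLMs) rely on CLIP as the visual encoder, but CLIP cannot fully capture fine-grained, task-relevant visual information. The paper explores whether diffusion models can serve as better visual encoders.

Result: On general VQA and specialized MLLM benchmarks, the approach that fuses diffusion features shows clear advantages, especially on spatial and compositional reasoning tasks.

Insight: Diffusion models not only generate high-quality images; their internal features can also serve as strong visual encoders, providing richer fine-grained information for multimodal understanding.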

Abstract: Recent advances in multimodal large language models (MLLMs) have enabled image-based question-answering capabilities. However, a key limitation is the use of CLIP as the visual encoder; while it can capture coarse global information, it can often miss fine-grained details that are relevant to the input query. To address these shortcomings, this work studies whether pre-trained text-to-image diffusion models can serve as instruction-aware visual encoders. Through an analysis of their internal representations, we find diffusion features are both rich in semantics and can encode strong image-text alignment. Moreover, we find that we can leverage text conditioning to focus the model on regions relevant to the input question. We then investigate how to align these features with large language models and uncover a leakage phenomenon, where the LLM can inadvertently recover information from the original diffusion prompt. We analyze the causes of this leakage and propose a mitigation strategy. Based on these insights, we explore a simple fusion strategy that utilizes both CLIP and conditional diffusion features. We evaluate our approach on both general VQA and specialized MLLM benchmarks, demonstrating the promise of diffusion models for visual understanding, particularly in vision-centric tasks that require spatial and compositional reasoning. Our project page can be found at https://vatsalag99.github.io/mustafar/.


cs.CL [Back]

[67] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities cs.CL | cs.AIPDF

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva

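TL;DR: The Gemini 2.X model family comprises Gemini 2.5 Pro and Gemini 2.5 Flash, together with the earlier Gemini 2.0 Flash and Flash-Lite models, and pushes the frontier of reasoning, multimodality, long context, and agentic capabilities.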

Details

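Motivation: The goal is to push the boundary of model capability in complex agentic problem solving and to offer a full range of options from high performance to low cost.

Result: The Gemini 2.X models span the Pareto frontier of capability versus cost and support complex agentic problem solving.

Insight: The work suggests that models combining multiple capabilities can significantly broaden the range of agentic tasks while improving cost-effectiveness.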

Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.


[68] ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time cs.CLPDF

Kiarash Zahirnia, Zahra Golpayegani, Walid Ahmad, Yang Liu

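TL;DR: ETT extends the context length of short-context Transformer models through efficient fine-tuning at test time, improving long-text understanding with constant memory and linear computation overhead.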

Details

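Motivation: The computation and memory overhead of Transformer models grows quadratically with sequence length, which limits their use on long-context tasks. ETT aims to address this.

Result: On GPT-Large and Phi-2, the context length is extended from 1k to 32k tokens, with up to a 30 percent improvement in accuracy; fine-tuning only some modules works better than full fine-tuning.

Insight: Fine-tuning only the second layer of the FFNs is more effective than full fine-tuning, suggesting a new route to efficient long-context extension.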

Abstract: Transformer-based Language Models’ computation and memory overhead increase quadratically as a function of sequence length. The quadratic cost poses challenges when employing LLMs for processing long sequences. In this work, we introduce ETT (Extend at Test-Time), a method for extending the context length of short context Transformer-based LLMs, with constant memory requirement and linear computation overhead. ETT enables the extension of the context length at test-time by efficiently fine-tuning the model’s parameters on the input context, chunked into overlapping small subsequences. We evaluate ETT on LongBench by extending the context length of GPT-Large and Phi-2 up to 32 times, increasing from 1k to 32k tokens. This results in up to a 30 percent improvement in the model’s accuracy. We also study how context can be stored in LLM’s weights effectively and efficiently. Through a detailed ablation study, we examine which Transformer modules are most beneficial to fine-tune at test-time. Interestingly, we find that fine-tuning the second layer of the FFNs is more effective than full fine-tuning, leading to a further improvement in the models’ accuracy.
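
The chunking step mentioned in the abstract (splitting the input context into overlapping small subsequences) can be sketched as follows. The chunk length and overlap are assumptions chosen for illustration, since the abstract does not give the values used, and the fine-tuning pass that would consume each chunk is not shown.

```python
def overlapping_chunks(tokens, chunk_len, overlap):
    """Split a token list into windows of `chunk_len` that share `overlap` tokens."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    stride = chunk_len - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_len])
        if start + chunk_len >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Ten placeholder token ids, windows of four tokens overlapping by two.
tokens = list(range(10))
chunks = overlapping_chunks(tokens, chunk_len=4, overlap=2)
# chunks == [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Because every chunk has a fixed length, the memory needed per fine-tuning step stays constant, and the number of chunks grows linearly with the input length, which matches the cost profile the abstract claims.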


[69] PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning cs.CL | cs.LGPDF

Zeming Chen, Angelika Romanou, Gail Weiss, Antoine Bosselut

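TL;DR: PERK is a parameter-efficient test-time learning method for long-context reasoning that encodes long input contexts into a lightweight model adapter at test time, significantly improving reasoning performance.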

Details

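Motivation: Existing test-time learning methods are too memory-intensive to apply to long-context reasoning, so a more parameter-efficient approach is needed.

Result: PERK significantly outperforms the baseline on long-context reasoning tasks, with average absolute gains of up to 90% for GPT-2 and up to 27% for the largest evaluated model, Qwen-2.5-0.5B, and it is more robust to reasoning complexity, length extrapolation, and the location of relevant information.

Insight: PERK is memory-intensive during training, but at inference time it scales more efficiently than prompt-based methods, offering an efficient solution for long-context reasoning.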

Abstract: Long-context reasoning requires accurately identifying relevant information in extensive, noisy input contexts. Previous research shows that using test-time learning to encode context directly into model parameters can effectively enable reasoning over noisy information. However, meta-learning methods for enabling test-time learning are prohibitively memory-intensive, preventing their application to long context settings. In this work, we propose PERK (Parameter Efficient Reasoning over Knowledge), a scalable approach for learning to encode long input contexts using gradient updates to a lightweight model adapter at test time. Specifically, PERK employs two nested optimization loops in a meta-training phase. The inner loop rapidly encodes contexts into a low-rank adapter (LoRA) that serves as a parameter-efficient memory module for the base model. Concurrently, the outer loop learns to use the updated adapter to accurately recall and reason over relevant information from the encoded long context. Our evaluations on several long-context reasoning tasks show that PERK significantly outperforms the standard prompt-based long-context baseline, achieving average absolute performance gains of up to 90% for smaller models (GPT-2) and up to 27% for our largest evaluated model, Qwen-2.5-0.5B. In general, PERK is more robust to reasoning complexity, length extrapolation, and the locations of relevant information in contexts. Finally, we show that while PERK is memory-intensive during training, it scales more efficiently at inference time than prompt-based long-context inference.


[70] Exploring Task Performance with Interpretable Models via Sparse Auto-Encoders cs.CL | cs.LGPDF

Shun Wang, Tyler Loakman, Youbo Lei, Yi Liu, Bohao Yang

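TL;DR: This paper applies a dictionary-learning approach based on sparse autoencoders to decompose large language models (LLMs), extracting monosemantic features and revealing model-internal misunderstandings, which are then used to reformulate prompts and improve downstream task performance.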

Details

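Motivation: LLMs are traditionally viewed as black-box algorithms, which lack interpretability and are hard to optimize. By decomposing model neurons, the paper aims to improve interpretability and task performance.

Result: The method significantly improves performance on downstream tasks such as mathematical reasoning and metaphor detection while improving model interpretability.

Insight: Sparse autoencoders can effectively decompose the complex features of LLMs and expose the internal mechanisms behind model behavior, offering a new way to improve model performance.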

Abstract: Large Language Models (LLMs) are traditionally viewed as black-box algorithms, therefore reducing trustworthiness and obscuring potential approaches to increasing performance on downstream tasks. In this work, we apply an effective LLM decomposition method using a dictionary-learning approach with sparse autoencoders. This helps extract monosemantic features from polysemantic LLM neurons. Remarkably, our work identifies model-internal misunderstanding, allowing the automatic reformulation of the prompts with additional annotations to improve the interpretation by LLMs. Moreover, this approach demonstrates a significant performance improvement in downstream tasks, such as mathematical reasoning and metaphor detection.


[71] Temporal Analysis of Climate Policy Discourse: Insights from Dynamic Embedded Topic Modeling cs.CLPDF

Rafiu Adekoya Badekale, Adewale Akinfaderin

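TL;DR: The paper applies the dynamic embedded topic model (DETM) to analyze how global climate policy discourse has evolved, revealing a shift from an early focus on greenhouse gases to a recent emphasis on implementation and technical collaboration.

Details

Motivation: Traditional manual thematic coding is time-consuming and struggles to capture the complexity and dynamics of global policy discourse, so an automated, machine-learning-based method is needed to analyze how policy language evolves.

Result: DETM effectively captures the evolution of climate policy topics, for example the shift from greenhouse gases to technical collaboration and global agreements.

Insight: Dynamic topic models can be a powerful tool for analyzing the evolution of policy discourse, helping policymakers and researchers identify trends and design response strategies.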

Abstract: Understanding how policy language evolves over time is critical for assessing global responses to complex challenges such as climate change. Temporal analysis helps stakeholders, including policymakers and researchers, to evaluate past priorities, identify emerging themes, design governance strategies, and develop mitigation measures. Traditional approaches, such as manual thematic coding, are time-consuming and limited in capturing the complex, interconnected nature of global policy discourse. With the increasing relevance of unsupervised machine learning, these limitations can be addressed, particularly under high-volume, complex, and high-dimensional data conditions. In this work, we explore a novel approach that applies the dynamic embedded topic model (DETM), a probabilistic model designed to capture the temporal dynamics of topics, to analyze the evolution of global climate policy discourse. We collected a corpus of United Nations Framework Convention on Climate Change (UNFCCC) policy decisions from 1995 to 2023, excluding 2020 due to the postponement of COP26 as a result of the COVID-19 pandemic. The model reveals shifts from early emphases on greenhouse gases and international conventions to recent focuses on implementation, technical collaboration, capacity building, finance, and global agreements. Section 3 presents the modeling pipeline, including preprocessing, model training, and visualization of temporal word distributions. Our results show that DETM is a scalable and effective tool for analyzing the evolution of global policy discourse. Section 4 discusses the implications of these findings, and we conclude with future directions and refinements to extend this approach to other policy domains.


[72] Perception-Aware Policy Optimization for Multimodal Reasoning cs.CLPDF

Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang

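TL;DR: The paper proposes PAPO, a method that adds a perception-aware supervision signal to multimodal reasoning training and significantly improves performance on vision-dependent tasks.

Details

Motivation: Existing RLVR methods perform poorly on multimodal reasoning tasks, mainly because the model's perception of visual inputs is inadequate.

Result: PAPO improves results on multimodal benchmarks by 4.4% overall and by close to 8.0% on tasks with high vision dependency, and reduces perception errors by 30.5%.

Insight: A perception-aware supervision signal can significantly improve multimodal reasoning performance without relying on external resources.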

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: https://mikewangwzhl.github.io/PAPO.


[73] Enhancing Food-Domain Question Answering with a Multimodal Knowledge Graph: Hybrid QA Generation and Diversity Analysis cs.CLPDF

Srihari K B, Pushpak Bhattacharyya

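TL;DR: The paper proposes a unified food-domain question answering framework that combines a multimodal knowledge graph (MMKG) with generative AI, significantly improving the reliability and diversity of answers.

Details

Motivation: Food-domain question answering needs to combine multimodal information (such as recipes, ingredients, and images) to give complete answers, and traditional methods fall short on reliability and diversity.

Result: BERTScore improves by 16.2%, FID drops by 37.8%, and CLIP alignment rises by 31.1%. In the diagnostic analysis, the mismatch rate falls from 35.2% to 7.3%, and image reuse reaches 94.1% accuracy.

Insight: Combining structured knowledge with multimodal generation significantly improves the reliability and diversity of food-domain question answering and suggests an approach for similar tasks.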

Abstract: We propose a unified food-domain QA framework that combines a large-scale multimodal knowledge graph (MMKG) with generative AI. Our MMKG links 13,000 recipes, 3,000 ingredients, 140,000 relations, and 14,000 images. We generate 40,000 QA pairs using 40 templates and LLaVA/DeepSeek augmentation. Joint fine-tuning of Meta LLaMA 3.1-8B and Stable Diffusion 3.5-Large improves BERTScore by 16.2%, reduces FID by 37.8%, and boosts CLIP alignment by 31.1%. Diagnostic analyses, namely CLIP-based mismatch detection (35.2% to 7.3%) and LLaVA-driven hallucination checks, ensure factual and visual fidelity. A hybrid retrieval-generation strategy achieves 94.1% accurate image reuse and 85% adequacy in synthesis. Our results demonstrate that structured knowledge and multimodal generation together enhance reliability and diversity in food QA.


[74] Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation cs.CL | cs.LGPDF

Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson

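TL;DR: The paper proposes SambaY, a decoder-hybrid-decoder architecture that introduces a Gated Memory Unit (GMU) for efficient cross-layer representation sharing, substantially improving inference efficiency and performance for long-sequence generation.

Details

Motivation: Although hybrid architectures such as Samba and YOCO perform well in sequence modeling, prior work has not fully explored the efficiency potential of sharing representations between state space model (SSM) layers. The paper aims to fill this gap.

Result: The largest SambaY-based model, Phi4-mini-Flash-Reasoning, outperforms the Phi4-mini-Reasoning baseline on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, with up to 10x higher decoding throughput.

Insight: The GMU markedly improves model efficiency through cross-layer memory sharing, which suggests that representation sharing has considerable potential for long-sequence modeling. Adding Differential Attention further improves model performance.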

Abstract: Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decoder-decoder architecture, YOCO, have shown promising performance gains over Transformers, prior works have not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder. SambaY significantly enhances decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through extensive scaling experiments, we demonstrate that our model exhibits a significantly lower irreducible loss compared to a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes. Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves significantly better performance than Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, while delivering up to 10x higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM inference framework. We release our training codebase on open-source data at https://github.com/microsoft/ArchScale.


[75] FuDoBa: Fusing Document and Knowledge Graph-based Representations with Bayesian Optimisation cs.CLPDF

Boshko Koloski, Senja Pollak, Roberto Navigli, Blaž Škrlj

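TL;DR: FuDoBa is a Bayesian optimisation-based method that fuses LLM-based embeddings with domain-specific structured knowledge to produce low-dimensional, task-relevant representations, improving classification performance while reducing computational complexity.

Details

Motivation: Embeddings generated by LLMs are powerful, but in domain-specific applications they can be too generic or too computationally expensive. FuDoBa aims to address this.

Result: Experiments on six datasets show that the method matches or exceeds baselines that rely on LLM embeddings alone.

Insight: Fusing domain-specific knowledge can significantly improve the task relevance of embeddings while reducing computational complexity, pointing to a new direction for document representation.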

Abstract: Building on the success of Large Language Models (LLMs), LLM-based representations have dominated the document representation landscape, achieving great performance on the document embedding benchmarks. However, the high-dimensional, computationally expensive embeddings from LLMs tend to be either too generic or inefficient for domain-specific applications. To address these limitations, we introduce FuDoBa, a Bayesian optimisation-based method that integrates LLM-based embeddings with domain-specific structured knowledge, sourced both locally and from external repositories like WikiData. This fusion produces low-dimensional, task-relevant representations while reducing training complexity and yielding interpretable early-fusion weights for enhanced classification performance. We demonstrate the effectiveness of our approach on six datasets in two domains, showing that when paired with robust AutoML-based classifiers, our proposed representation learning approach performs on par with, or surpasses, those produced solely by the proprietary LLM-based embedding baselines.


[76] Adaptive Termination for Multi-round Parallel Reasoning: An Universal Semantic Entropy-Guided Framework cs.CLPDF

Zenan Xu, Zexuan Qiu, Guanhua Huang, Kun Li, Siheng Li

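TL;DR: The paper proposes a semantic entropy-guided adaptive termination framework that combines the strengths of sequential and parallel reasoning, using dynamic control and early termination to improve inference efficiency and quality.

Details

Motivation: Current LLM inference approaches (sequential reasoning and parallel reasoning) are either inefficient or lack coordination, so a flexible collaborative inference framework is needed to address these limitations.

Result: Semantic entropy shows a strong negative correlation with accuracy and can efficiently guide the termination of reasoning, improving inference efficiency and quality.

Insight: Combining the advantages of sequential and parallel reasoning and using semantic entropy to control the reasoning process dynamically is an effective way to strengthen model reasoning.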

Abstract: Recent advances in large language models (LLMs) have accelerated progress toward artificial general intelligence, with inference-time scaling emerging as a key technique. Contemporary approaches leverage either sequential reasoning (iteratively extending chains of thought) or parallel reasoning (generating multiple solutions simultaneously) to scale inference. However, both paradigms face fundamental limitations: sequential scaling typically relies on arbitrary token budgets for termination, leading to inefficiency or premature cutoff; while parallel scaling often lacks coordination among parallel branches and requires intrusive fine-tuning to perform effectively. In light of these challenges, we aim to design a flexible test-time collaborative inference framework that exploits the complementary strengths of both sequential and parallel reasoning paradigms. Towards this goal, the core challenge lies in developing an efficient and accurate intrinsic quality metric to assess model responses during collaborative inference, enabling dynamic control and early termination of the reasoning trace. To address this challenge, we introduce semantic entropy (SE), which quantifies the semantic diversity of parallel model responses and serves as a robust indicator of reasoning quality due to its strong negative correlation with accuracy…
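
The idea of scoring a batch of parallel responses by their semantic diversity can be sketched as follows. This is a minimal illustration, not the authors' implementation: responses are grouped into meaning clusters and the Shannon entropy of the cluster proportions is computed. The `normalize_answer` equivalence rule, the `threshold` value, and all function names are assumptions made for the example; the abstract does not say how semantic equivalence is judged or how the stopping rule is set.

```python
import math
from collections import Counter


def semantic_entropy(responses, canonicalize):
    """Shannon entropy (in nats) over clusters of semantically equivalent responses."""
    clusters = Counter(canonicalize(r) for r in responses)
    total = sum(clusters.values())
    # Sum p * log(1/p) over clusters; full agreement yields exactly 0.0.
    return sum((n / total) * math.log(total / n) for n in clusters.values())


def should_stop(responses, canonicalize, threshold=0.5):
    """Hypothetical stopping rule: terminate once the branches largely agree."""
    return semantic_entropy(responses, canonicalize) <= threshold


def normalize_answer(text):
    # Assumed equivalence rule: responses match if their final answers are equal
    # after trimming whitespace and lowercasing. A real system would likely use
    # a stronger semantic-equivalence check.
    return text.strip().lower()


agreeing = ["42", " 42", "42 ", "42"]   # all branches give the same answer
divided = ["42", "41", "40", "39"]      # every branch disagrees
```

With these inputs, `should_stop(agreeing, normalize_answer)` is true because the entropy is zero, while `divided` gives the maximum entropy for four responses, `math.log(4)`, and reasoning would continue.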


[77] Shifting from Ranking to Set Selection for Retrieval Augmented Generation cs.CL | cs.IRPDF

Dahyun Lee, Yongrae Jo, Haeju Park, Moontae Lee

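TL;DR: The paper proposes SETR, a retrieval-augmented generation (RAG) method that moves from traditional ranking to set selection: it explicitly identifies the information requirements of a query and selects an optimal set of passages, improving retrieval quality in multi-hop question answering.

Details

Motivation: Existing RAG methods mainly rerank passages by their individual relevance, which cannot guarantee that the passage set is comprehensive for complex queries such as multi-hop questions, and this degrades generation quality.

Result: On multi-hop RAG benchmarks, SETR outperforms existing methods, including proprietary LLM-based rerankers and open-source baselines, in both answer correctness and retrieval quality.

Insight: Selecting a set of passages, rather than ranking them individually, meets the information needs of complex queries more effectively and offers an efficient alternative for RAG systems.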

Abstract: Retrieval in Retrieval-Augmented Generation (RAG) must ensure that retrieved passages are not only individually relevant but also collectively form a comprehensive set. Existing approaches primarily rerank top-k passages based on their individual relevance, often failing to meet the information needs of complex queries in multi-hop question answering. In this work, we propose a set-wise passage selection approach and introduce SETR, which explicitly identifies the information requirements of a query through Chain-of-Thought reasoning and selects an optimal set of passages that collectively satisfy those requirements. Experiments on multi-hop RAG benchmarks show that SETR outperforms both proprietary LLM-based rerankers and open-source baselines in terms of answer correctness and retrieval quality, providing an effective and efficient alternative to traditional rerankers in RAG systems. The code is available at https://github.com/LGAI-Research/SetR
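
The difference between ranking passages individually and selecting them as a set can be illustrated with a toy example. This is a sketch under stated assumptions, not SETR itself: the paper identifies requirements through Chain-of-Thought reasoning, whereas the code below substitutes a plain greedy coverage heuristic. The passage records, their `score` and `covers` fields, and the requirement labels are all invented for illustration.

```python
def top_k(passages, k):
    """Baseline: keep the k passages with the highest individual relevance."""
    return sorted(passages, key=lambda p: p["score"], reverse=True)[:k]


def greedy_set_select(passages, requirements, k):
    """Pick up to k passages that together cover as many requirements as possible."""
    chosen, uncovered = [], set(requirements)
    candidates = list(passages)
    while uncovered and candidates and len(chosen) < k:
        # Prefer the passage covering the most still-missing requirements,
        # breaking ties by individual relevance score.
        best = max(candidates, key=lambda p: (len(p["covers"] & uncovered), p["score"]))
        if not best["covers"] & uncovered:
            break  # no remaining passage adds new information
        chosen.append(best)
        candidates.remove(best)
        uncovered -= best["covers"]
    return chosen


# Hypothetical two-hop query: "Where was the director of film X born?"
requirements = {"director_of_film", "birthplace_of_director"}
passages = [
    {"id": "A", "score": 0.90, "covers": {"director_of_film"}},
    {"id": "B", "score": 0.85, "covers": {"director_of_film"}},
    {"id": "C", "score": 0.60, "covers": {"birthplace_of_director"}},
]

ranked_ids = [p["id"] for p in top_k(passages, 2)]
selected_ids = [p["id"] for p in greedy_set_select(passages, requirements, 2)]
```

Here the top-2 ranking returns two passages that both answer the first hop, while the set-wise selection swaps the redundant passage for the lower-scored one that answers the second hop.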


[78] SCoRE: Streamlined Corpus-based Relation Extraction using Multi-Label Contrastive Learning and Bayesian kNN cs.CL | cs.AI | cs.IR | cs.LGPDF

Luca Mariotti, Veronica Guidetti, Federica Mandreoli

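TL;DR: SCoRE is an efficient, modular relation extraction system that combines multi-label contrastive learning with a Bayesian kNN classifier. It is suited to low-supervision settings, outperforms existing methods, and consumes less energy.

Details

Motivation: Extending knowledge graphs (KGs) requires efficient and adaptable relation extraction (RE) methods, especially in low-supervision and noisy settings. SCoRE aims to provide a solution that needs no fine-tuning and allows flexible switching of the pre-trained language model (PLM).

Result: On five benchmarks, SCoRE matches or surpasses state-of-the-art methods while significantly reducing energy consumption. The analysis indicates that more complex model designs can degrade performance.

Insight: Simple, efficient designs such as SCoRE have the advantage in practical applications, whereas complex models may perform poorly on noisy data.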

Abstract: The growing demand for efficient knowledge graph (KG) enrichment leveraging external corpora has intensified interest in relation extraction (RE), particularly under low-supervision settings. To address the need for adaptable and noise-resilient RE solutions that integrate seamlessly with pre-trained large language models (PLMs), we introduce SCoRE, a modular and cost-effective sentence-level RE system. SCoRE enables easy PLM switching, requires no finetuning, and adapts smoothly to diverse corpora and KGs. By combining supervised contrastive learning with a Bayesian k-Nearest Neighbors (kNN) classifier for multi-label classification, it delivers robust performance despite the noisy annotations of distantly supervised corpora. To improve RE evaluation, we propose two novel metrics: Correlation Structure Distance (CSD), measuring the alignment between learned relational patterns and KG structures, and Precision at R (P@R), assessing utility as a recommender system. We also release Wiki20d, a benchmark dataset replicating real-world RE conditions where only KG-derived annotations are available. Experiments on five benchmarks show that SCoRE matches or surpasses state-of-the-art methods while significantly reducing energy consumption. Further analyses reveal that increasing model complexity, as seen in prior work, degrades performance, highlighting the advantages of SCoRE’s minimal design. Combining efficiency, modularity, and scalability, SCoRE stands as an optimal choice for real-world RE applications.


[79] VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation cs.CL | cs.AIPDF

Ziang Ye, Yang Zhang, Wentao Shi, Xiaoyu You, Fuli Feng

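TL;DR: The paper exposes a vulnerability in the visual grounding of GUI agents and proposes VisualTrap, a stealthy backdoor attack that misleads the agent into mapping task plans to trigger locations instead of the intended targets.

Details

Motivation: The close integration of GUI agents with personal devices creates security risks, and the threat of backdoor attacks in particular remains under-studied. By examining the visual grounding vulnerability, the paper aims to show that this new type of attack is feasible.

Result: Experiments show that VisualTrap attacks effectively with as little as 5% poisoned data and generalizes to downstream tasks and across GUI environments (for example, from mobile/web to desktop).

Insight: The visual grounding mechanism of GUI agents carries serious security risks, and defenses need further research to counter the threat of backdoor attacks.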

Abstract: Graphical User Interface (GUI) agents powered by Large Vision-Language Models (LVLMs) have emerged as a revolutionary approach to automating human-machine interactions, capable of autonomously operating personal devices (e.g., mobile phones) or applications within the device to perform complex real-world tasks in a human-like manner. However, their close integration with personal devices raises significant security concerns, with many threats, including backdoor attacks, remaining largely unexplored. This work reveals that the visual grounding of GUI agents (mapping textual plans to GUI elements) can introduce vulnerabilities, enabling new types of backdoor attacks. With backdoor attacks targeting visual grounding, the agent's behavior can be compromised even when given correct task-solving plans. To validate this vulnerability, we propose VisualTrap, a method that can hijack the grounding by misleading the agent into mapping textual plans to trigger locations instead of the intended targets. VisualTrap uses the common method of injecting poisoned data for attacks, and does so during the pre-training of visual grounding to ensure practical feasibility of attacking. Empirical results show that VisualTrap can effectively hijack visual grounding with as little as 5% poisoned data and highly stealthy visual triggers (invisible to the human eye); and the attack can be generalized to downstream tasks, even after clean fine-tuning. Moreover, the injected trigger can remain effective across different GUI environments, e.g., being trained on mobile/web and generalizing to desktop environments. These findings underscore the urgent need for further research on backdoor attack risks in GUI agents.


[80] Rethinking Verification for LLM Code Generation: From Generation to Testing cs.CLPDF

Zihan Ma, Taolin Zhang, Maosong Cao, Wenwei Zhang, Minnan Luo

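TL;DR: The paper examines the limitations of how LLM code generation is evaluated, proposes multi-dimensional metrics that quantify test-suite thoroughness, and introduces a human-LLM collaborative method (SAGA) that improves the quality and coverage of test cases, with results clearly ahead of existing benchmarks.

Details

Motivation: Existing code generation benchmarks such as HumanEval and LiveCodeBench contain only a limited number of homogeneous test cases, so subtle faults go undetected. This overstates model performance and distorts reward estimation in reinforcement learning.

Result: SAGA reaches a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench. The code generation evaluation benchmark synthesized with SAGA has a verifier accuracy 10.78% higher than that of LiveCodeBench-v6.

Insight: Human-LLM collaboration can significantly improve the thoroughness and quality of test cases, laying a foundation for reliable LLM code evaluation and supporting further progress in reinforcement learning for code generation.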

Abstract: Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, resulting in subtle faults going undetected. This not only artificially inflates measured performance but also compromises accurate reward estimation in reinforcement learning frameworks utilizing verifiable rewards (RLVR). To address these critical shortcomings, we systematically investigate the test-case generation (TCG) task by proposing multi-dimensional metrics designed to rigorously quantify test-suite thoroughness. Furthermore, we introduce a human-LLM collaborative method (SAGA), leveraging human programming expertise with LLM reasoning capability, aimed at significantly enhancing both the coverage and the quality of generated test cases. In addition, we develop a TCGBench to facilitate the study of the TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench. The Verifier Accuracy (Verifier Acc) of the code generation evaluation benchmark synthesized by SAGA is 10.78% higher than that of LiveCodeBench-v6. These results demonstrate the effectiveness of our proposed method. We hope this work contributes to building a scalable foundation for reliable LLM code evaluation, further advancing RLVR in code generation, and paving the way for automated adversarial test synthesis and adaptive benchmark integration.


[81] Investigating the Robustness of Retrieval-Augmented Generation at the Query Level cs.CLPDF

Sezen Perçin, Xin Su, Qutub Sha Syed, Phillip Howard, Aleksei Kuvshinov

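TL;DR: The paper studies the query-level robustness of retrieval-augmented generation (RAG) systems, finds that their performance is easily affected by small changes to the query, and proposes an evaluation framework along with practical recommendations.

Details

Motivation: Large language models (LLMs) are hard to update efficiently with new information. RAG addresses this by dynamically incorporating external knowledge, but its performance depends heavily on the quality of the input query. The paper sets out to analyze how sensitive RAG is to query perturbations.

Result: Experiments show that the performance of commonly used retrievers drops significantly under minor query variations, and that the evaluation framework can effectively quantify RAG robustness.

Insight: Robustness may become a bottleneck for deploying RAG systems in practice, which calls for better retrieval modules or more robust query handling.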

Abstract: Large language models (LLMs) are very costly and inefficient to update with new information. To address this limitation, retrieval-augmented generation (RAG) has been proposed as a solution that dynamically incorporates external knowledge during inference, improving factual consistency and reducing hallucinations. Despite its promise, RAG systems face practical challenges, most notably a strong dependence on the quality of the input query for accurate retrieval. In this paper, we investigate the sensitivity of different components in the RAG pipeline to various types of query perturbations. Our analysis reveals that the performance of commonly used retrievers can degrade significantly even under minor query variations. We study each module in isolation as well as their combined effect in an end-to-end question answering setting, using both general-domain and domain-specific datasets. Additionally, we propose an evaluation framework to systematically assess the query-level robustness of RAG pipelines and offer actionable recommendations for practitioners based on the results of more than 1092 experiments we performed.


[82] FRaN-X: FRaming and Narratives-eXplorer cs.CLPDF

Artur Muratov, Hana Fatima Shaikh, Vanshikaa Jani, Tarek Mahmoud, Zhuohan Xie

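TL;DR: FRaN-X is a tool that automatically detects entities and classifies their narrative roles. It supports five languages and two domains and provides interactive visualizations for analyzing narrative framing in the media.

Details

Motivation: In media analysis, automatically detecting and labeling the narrative roles of entities (such as protagonist, antagonist, or innocent) is a challenge. FRaN-X aims to address it and help analysts understand how different sources frame their narratives.

Result: The authors built a publicly available tool, FRaN-X, that supports multilingual and multi-domain analysis and offers intuitive graph visualizations.

Insight: The tool helps media analysts quickly identify and compare narrative framing across sources and reveals how an entity's role changes, which makes it suitable for cross-cultural and cross-domain narrative analysis.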

Abstract: We present FRaN-X, a Framing and Narratives Explorer that automatically detects entity mentions and classifies their narrative roles directly from raw text. FRaN-X comprises a two-stage system that combines sequence labeling with fine-grained role classification to reveal how entities are portrayed as protagonists, antagonists, or innocents, using a unique taxonomy of 22 fine-grained roles nested under these three main categories. The system supports five languages (Bulgarian, English, Hindi, Russian, and Portuguese) and two domains (the Russia-Ukraine Conflict and Climate Change). It provides an interactive web interface for media analysts to explore and compare framing across different sources, tackling the challenge of automatically detecting and labeling how entities are framed. Our system allows end users to focus on a single article as well as analyze up to four articles simultaneously. We provide aggregate level analysis including an intuitive graph visualization that highlights the narrative a group of articles are pushing. Our system includes a search feature for users to look up entities of interest, along with a timeline view that allows analysts to track an entity’s role transitions across different contexts within the article. The FRaN-X system and the trained models are licensed under an MIT License. FRaN-X is publicly accessible at https://fran-x.streamlit.app/ and a video demonstration is available at https://youtu.be/VZVi-1B6yYk.


[83] Discrete Diffusion Models for Language Generation cs.CL | cs.LG | stat.ML | 68T50 (Primary) 68Q32, 60J27 (Secondary) | G.3PDF

Ashen Weligalle

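TL;DR: The paper examines the feasibility and performance of a discrete diffusion model (D3PM) for natural language generation and compares it with autoregressive (AR) models. D3PM has an advantage in parallel generation speed, but its compression performance trails the AR model.

Details

Motivation: Diffusion models excel at generating continuous data such as images and video, but applying them to discrete data such as natural language remains challenging. The paper investigates the potential of discrete diffusion models for language generation.

Result: The best D3PM model reaches a BPT of 5.72 (mean 8.05), behind the AR model's mean BPT of 4.59, but it processes up to 3.97 batches per second, indicating potential for parallel generation.

Insight: Diffusion models offer an efficiency advantage for discrete data generation but still lag AR models in generative quality (for example, compression performance), which gives direction to future work on non-autoregressive language generation.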

Abstract: Diffusion models have emerged as a powerful class of generative models, achieving state-of-the-art results in continuous data domains such as image and video generation. Their core mechanism involves a forward diffusion process that gradually transforms structured data into a Gaussian-like distribution, followed by a learned reverse process to reconstruct the data. While successful in continuous modalities, applying this framework to discrete data, particularly natural language, remains challenging due to token dependency complexities and the lack of a defined generation order. This thesis investigates the feasibility and performance of discrete diffusion models for natural language generation. Specifically, we evaluate the Discrete Denoising Diffusion Probabilistic Model (D3PM) and compare it with traditional autoregressive (AR) language models. To assess generative performance, we use Bits Per Token (BPT), Negative Log-Likelihood (NLL), Perplexity (PPL), and Batch Processing Speed. Results show the best-performing D3PM model achieves a BPT of 5.72, with a mean of 8.05. The AR model outperforms in compression with a lower mean BPT of 4.59, but D3PM achieves higher processing speed, reaching up to 3.97 batches per sec., indicating potential for parallel generation. All evaluations were conducted under consistent conditions (generating 100,000 tokens per model with a fixed batch size of four) for fair comparison. This research presents a detailed analysis of diffusion-based vs. autoregressive models, highlighting trade-offs in generative quality and efficiency. Findings emphasize both the promise and limitations of diffusion models for discrete data, supporting future work in non-autoregressive language generation.
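
The three quality metrics named in the abstract are usually different views of the same quantity, which the following sketch makes explicit. It assumes the standard definitions, which the abstract does not spell out: BPT is the average negative log-likelihood per token measured in base 2, NLL is the same quantity in nats, and perplexity is the exponential of the per-token NLL. The function names are illustrative.

```python
import math


def bpt_to_nll(bpt):
    """Convert bits per token to negative log-likelihood in nats per token."""
    return bpt * math.log(2)


def bpt_to_perplexity(bpt):
    """Perplexity is 2 raised to the bits per token (equivalently exp of the NLL)."""
    return 2.0 ** bpt


# BPT figures reported in the abstract.
d3pm_best_ppl = bpt_to_perplexity(5.72)
ar_mean_ppl = bpt_to_perplexity(4.59)
```

Under these assumed definitions, the best D3PM run at 5.72 BPT corresponds to a perplexity of roughly 52.7, and the AR mean of 4.59 BPT to roughly 24.1, so a gap of about one bit per token roughly doubles the perplexity.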


eess.IV [Back]

[84] Mamba Goes HoME: Hierarchical Soft Mixture-of-Experts for 3D Medical Image Segmentation eess.IV | cs.CVPDF

Szymon Płotka, Maciej Chrabaszcz, Gizem Mert, Ewa Szczurek, Arkadiusz Sitek

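TL;DR: The paper proposes HoME, a hierarchical soft mixture-of-experts model for 3D medical image segmentation. Its two-level token-routing layer makes long-context modeling more efficient and clearly outperforms existing methods.

Details

Motivation: 3D medical image segmentation faces challenges such as handling diverse modalities and data variability, and it needs an efficient method for long-context modeling.

Result: HoME achieves state-of-the-art results on multi-modality 3D medical image datasets and generalizes well.

Insight: Hierarchical expert routing effectively combines local feature extraction with global context fusion, which suits complex medical image segmentation tasks.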

Abstract: In recent years, artificial intelligence has significantly advanced medical image segmentation. However, challenges remain, including efficient 3D medical image processing across diverse modalities and handling data variability. In this work, we introduce Hierarchical Soft Mixture-of-Experts (HoME), a two-level token-routing layer for efficient long-context modeling, specifically designed for 3D medical image segmentation. Built on the Mamba state-space model (SSM) backbone, HoME enhances sequential modeling through sparse, adaptive expert routing. The first stage employs a Soft Mixture-of-Experts (SMoE) layer to partition input sequences into local groups, routing tokens to specialized per-group experts for localized feature extraction. The second stage aggregates these outputs via a global SMoE layer, enabling cross-group information fusion and global context refinement. This hierarchical design, combining local expert routing with global expert refinement, improves generalizability and segmentation performance, surpassing state-of-the-art results across datasets from the three most commonly used 3D medical imaging modalities and data quality.


[85] Mitigating Multi-Sequence 3D Prostate MRI Data Scarcity through Domain Adaptation using Locally-Trained Latent Diffusion Models for Prostate Cancer Detection eess.IV | cs.CVPDF

Emerson P. Grabke, Babak Taati, Masoom A. Haider

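TL;DR: The paper proposes CCELLA++, an improved latent diffusion model (LDM) that generates multi-sequence prostate MRI data to address data scarcity and improve classifier performance in prostate cancer detection.

Details

Motivation: The existing CCELLA LDM is limited to the axial T2-weighted sequence, does not study inter-institutional domain shift, and is not optimized for histopathology outcomes. CCELLA++ aims to address these issues and improve clinical utility.

Result: CCELLA++ improves FID for the HighB and ADC sequences, and in domain adaptation, classifiers pretrained on its synthetic data outperform those pretrained on real data.

Insight: Synthetic data can outperform real data in small-sample domain adaptation, and the ability to generate multiple sequences supports progress in medical image analysis.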

Abstract: Objective: Latent diffusion models (LDMs) could mitigate data scarcity challenges affecting machine learning development for medical image interpretation. The recent CCELLA LDM improved prostate cancer detection performance using synthetic MRI for classifier training but was limited to the axial T2-weighted (AxT2) sequence, did not investigate inter-institutional domain shift, and prioritized radiology over histopathology outcomes. We propose CCELLA++ to address these limitations and improve clinical utility. Methods: CCELLA++ expands CCELLA for simultaneous biparametric prostate MRI (bpMRI) generation, including the AxT2, high b-value diffusion series (HighB) and apparent diffusion coefficient map (ADC). Domain adaptation was investigated by pretraining classifiers on real or LDM-generated synthetic data from an internal institution, followed by fine-tuning on progressively smaller fractions of an out-of-distribution, external dataset. Results: CCELLA++ improved 3D FID for HighB and ADC but not AxT2 (0.013, 0.012, 0.063 respectively) sequences compared to CCELLA (0.060). Classifier pretraining with CCELLA++ bpMRI outperformed real bpMRI in AP and AUC for all domain adaptation scenarios. CCELLA++ pretraining achieved the highest classifier performance below 50% (n=665) external dataset volume. Conclusion: Synthetic bpMRI generated by our method can improve downstream classifier generalization and performance beyond real bpMRI or CCELLA-generated AxT2-only images. Future work should seek to quantify medical image sample quality, balance multi-sequence LDM training, and condition the LDM with additional information. Significance: The proposed CCELLA++ LDM can generate synthetic bpMRI that outperforms real data for domain adaptation with a limited target institution dataset. Our code is available at https://github.com/grabkeem/CCELLA-plus-plus


[86] Capsule-ConvKAN: A Hybrid Neural Approach to Medical Image Classification eess.IV | cs.CV | cs.LGPDF

Laura Pituková, Peter Sinčák, László József Kovács

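TL;DR: The study proposes Capsule-ConvKAN, a new hybrid neural network architecture that combines the strengths of the Capsule Network and the Convolutional Kolmogorov–Arnold Network, and achieves the best performance (91.21% accuracy) on a medical image classification task.

Details

Motivation: Traditional convolutional neural networks struggle to capture complex spatial features in medical image classification. To improve feature representation and classification accuracy, the authors explore a hybrid architecture that combines the dynamic routing of the Capsule Network with the flexibility of the Convolutional Kolmogorov–Arnold Network.

Result: Experiments on a histopathological image dataset show that Capsule-ConvKAN, at 91.21% accuracy, outperforms the other architectures compared.

Insight: Capsule-ConvKAN captures spatial patterns and handles complex features more effectively, offering a new way to address the limitations of traditional convolutional models in medical image classification.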

Abstract: This study conducts a comprehensive comparison of four neural network architectures: Convolutional Neural Network, Capsule Network, Convolutional Kolmogorov–Arnold Network, and the newly proposed Capsule–Convolutional Kolmogorov–Arnold Network. The proposed Capsule-ConvKAN architecture combines the dynamic routing and spatial hierarchy capabilities of Capsule Network with the flexible and interpretable function approximation of Convolutional Kolmogorov–Arnold Networks. This novel hybrid model was developed to improve feature representation and classification accuracy, particularly in challenging real-world biomedical image data. The architectures were evaluated on a histopathological image dataset, where Capsule-ConvKAN achieved the highest classification performance with an accuracy of 91.21%. The results demonstrate the potential of the newly introduced Capsule-ConvKAN in capturing spatial patterns, managing complex features, and addressing the limitations of traditional convolutional models in medical image classification.


[87] Fast Equivariant Imaging: Acceleration for Unsupervised Learning via Augmented Lagrangian and Auxiliary PnP Denoisers eess.IV | cs.CV | cs.LG | math.OCPDF

Guixian Xu, Jinglai Li, Junqi Tang

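TL;DR: The paper proposes Fast Equivariant Imaging (FEI), a framework that accelerates unsupervised learning through the method of Lagrange multipliers and plug-and-play (PnP) denoisers, significantly improving training efficiency and performance.

Details

Motivation: Unsupervised learning for imaging tasks needs methods that are both efficient and high-performing. The traditional Equivariant Imaging (EI) approach is inefficient and needs improvement.

Result: On the CT100 dataset, U-Net training is about 10x faster, with better performance than standard EI.

Insight: Combining Lagrange multipliers with plug-and-play denoisers yields an efficient, high-performing approach to unsupervised learning.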

Abstract: We propose Fast Equivariant Imaging (FEI), a novel unsupervised learning framework to efficiently train deep imaging networks without ground-truth data. By reformulating the Equivariant Imaging based optimization problem via the method of Lagrange multipliers and utilizing plug-and-play denoisers, this novel unsupervised scheme shows superior efficiency and performance compared to the vanilla Equivariant Imaging paradigm. In particular, our PnP-FEI scheme achieves an order-of-magnitude (10x) acceleration over standard EI when training a U-Net on the CT100 dataset for X-ray CT reconstruction, with improved generalization performance.


[88] Speckle2Self: Self-Supervised Ultrasound Speckle Reduction Without Clean Data eess.IV | cs.AI | cs.CVPDF

Xuesong Li, Nassir Navab, Zhongliang Jiang

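TL;DR: Speckle2Self is a self-supervised ultrasound speckle reduction method that requires no clean data. It uses a multi-scale perturbation operation to model and remove speckle noise effectively.

Details

Motivation: Speckle noise in ultrasound images is tissue-dependent, so conventional denoising approaches such as Noise2Noise or blind-spot networks cannot be applied directly. A self-supervised method that relies only on a single noisy image is therefore needed.

Result: The method is validated on simulated and real ultrasound images, outperforms conventional filters and state-of-the-art learning-based methods, and generalizes across devices.

Insight: The high spatial dependency of ultrasound speckle noise can be modeled through multi-scale perturbation, and the low-rank nature of the anatomical structure is key to denoising.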

Abstract: Image denoising is a fundamental task in computer vision, particularly in medical ultrasound (US) imaging, where speckle noise significantly degrades image quality. Although recent advancements in deep neural networks have led to substantial improvements in denoising for natural images, these methods cannot be directly applied to US speckle noise, as it is not purely random. Instead, US speckle arises from complex wave interference within the body microstructure, making it tissue-dependent. This dependency means that obtaining two independent noisy observations of the same scene, as required by pioneering Noise2Noise, is not feasible. Additionally, blind-spot networks also cannot handle US speckle noise due to its high spatial dependency. To address this challenge, we introduce Speckle2Self, a novel self-supervised algorithm for speckle reduction using only single noisy observations. The key insight is that applying a multi-scale perturbation (MSP) operation introduces tissue-dependent variations in the speckle pattern across different scales, while preserving the shared anatomical structure. This enables effective speckle suppression by modeling the clean image as a low-rank signal and isolating the sparse noise component. To demonstrate its effectiveness, Speckle2Self is comprehensively compared with conventional filter-based denoising algorithms and SOTA learning-based methods, using both realistic simulated US images and human carotid US images. Additionally, data from multiple US machines are employed to evaluate model generalization and adaptability to images from unseen domains. Code and datasets will be released upon acceptance.


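[89] SimCortex: Collision-free Simultaneous Cortical Surfaces Reconstruction eess.IV | cs.CV

Kaveh Moradkhani, R Jarrett Rushmore, Sylvain Bouix

TL;DR: SimCortex is a deep learning framework that simultaneously reconstructs collision-free cortical surfaces from T1-weighted MRI data, addressing the overlaps, self-intersections, and topological defects common in existing methods.

Details

Motivation: Existing cortical surface reconstruction methods often produce overlapping surfaces and topological defects because of complex geometry and strict topological requirements, and so fall short of what reliable neuroanatomical analysis needs.

Result: On standard datasets, SimCortex substantially reduces surface overlaps and self-intersections while maintaining geometric accuracy, outperforming existing methods.

Insight: Combining deep learning with diffeomorphic deformation can efficiently handle the complex topological problems in cortical surface reconstruction, offering a more reliable tool for neuroanatomical analysis.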

Abstract: Accurate cortical surface reconstruction from magnetic resonance imaging (MRI) data is crucial for reliable neuroanatomical analyses. Current methods have to contend with complex cortical geometries and strict topological requirements, and they often produce surfaces with overlaps, self-intersections, and topological defects. To overcome these shortcomings, we introduce SimCortex, a deep learning framework that simultaneously reconstructs all brain surfaces (left/right white-matter and pial) from T1-weighted (T1w) MRI volumes while preserving topological properties. Our method first segments the T1w image into a nine-class tissue label map. From these segmentations, we generate subject-specific, collision-free initial surface meshes. These surfaces serve as precise initializations for subsequent multiscale diffeomorphic deformations. Employing stationary velocity fields (SVFs) integrated via scaling-and-squaring, our approach ensures smooth, topology-preserving transformations with significantly reduced surface collisions and self-intersections. Evaluations on standard datasets demonstrate that SimCortex dramatically reduces surface overlaps and self-intersections, surpassing current methods while maintaining state-of-the-art geometric accuracy.
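
The scaling-and-squaring integration named in the abstract can be illustrated in one dimension. This is a minimal sketch, not SimCortex's implementation: it assumes a hypothetical velocity function `v(x)` in place of a 3D velocity grid, and the names `scaling_and_squaring` and `n_steps` are illustrative.

```python
import math

def scaling_and_squaring(v, n_steps=8):
    """Approximate the flow exp(v) of a stationary velocity field v.

    Scaling: start from the small step phi(x) = x + v(x) / 2**n_steps.
    Squaring: compose phi with itself n_steps times (phi <- phi o phi).
    """
    scale = 2 ** n_steps

    def phi(x):
        return x + v(x) / scale

    for _ in range(n_steps):
        # The default argument binds the map from the previous iteration,
        # so each new closure is that map composed with itself.
        def phi(x, prev=phi):
            return prev(prev(x))
    return phi

# Hypothetical velocity field v(x) = 0.5 * x, whose exact flow is x * exp(0.5).
flow = scaling_and_squaring(lambda x: 0.5 * x)
points = [-1.0, 0.0, 0.5, 1.0]
warped = [flow(x) for x in points]
```

Each small step is close to the identity, so the composed map keeps the points in order here. That is the 1D analogue of the topology-preserving behavior the abstract attributes to diffeomorphic deformations.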


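[90] Deep Brain Net: An Optimized Deep Learning Model for Brain tumor Detection in MRI Images Using EfficientNetB0 and ResNet50 with Transfer Learning eess.IV | cs.CV

Daniel Onah, Ravish Desai

TL;DR: The paper proposes Deep Brain Net, a deep learning model that combines the EfficientNetB0 and ResNet50 architectures with transfer learning to detect brain tumors in MRI images, achieving high accuracy and computational efficiency.

Details

Motivation: Existing deep learning models show promise for brain tumor detection but still fall short in accuracy and computational efficiency. This work aims to improve performance by combining efficient architectures with transfer learning.

Result: On MRI datasets, the model reaches 88% accuracy, an 88.75% weighted F1 score, and a 98.17% macro AUC-ROC score, outperforming existing methods.

Insight: Combining efficient architectures with transfer learning can effectively improve the accuracy and computational efficiency of brain tumor detection, providing a reliable aid for clinical diagnosis.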

Abstract: In recent years, deep learning has shown great promise in the automated detection and classification of brain tumors from MRI images. However, achieving high accuracy and computational efficiency remains a challenge. In this research, we propose Deep Brain Net, a novel deep learning system designed to optimize performance in the detection of brain tumors. The model integrates the strengths of two advanced neural network architectures, EfficientNetB0 and ResNet50, combined with transfer learning to improve generalization and reduce training time. The EfficientNetB0 architecture enhances model efficiency by utilizing mobile inverted bottleneck blocks, which incorporate depthwise separable convolutions. This design significantly reduces the number of parameters and computational cost while preserving the model's ability to learn complex feature representations. The ResNet50 architecture, pre-trained on large-scale datasets like ImageNet, is fine-tuned for brain tumor classification. Its use of residual connections allows for training deeper networks by mitigating the vanishing gradient problem and avoiding performance degradation. The integration of these components ensures that the proposed system is both computationally efficient and highly accurate. Extensive experiments performed on publicly available MRI datasets demonstrate that Deep Brain Net consistently outperforms existing state-of-the-art methods in terms of classification accuracy, precision, recall, and computational efficiency. The result is an accuracy of 88 percent, a weighted F1 score of 88.75 percent, and a macro AUC ROC score of 98.17 percent, which demonstrates the robustness and clinical potential of Deep Brain Net in assisting radiologists with brain tumor diagnosis.
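
The abstract reports a weighted F1 score alongside accuracy. As a reminder of how that metric is usually defined (per-class F1 averaged with weights equal to class support), here is a minimal sketch. The labels are made up for illustration and are unrelated to the paper's data; the function name `weighted_f1` is an assumption.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)

# Illustrative three-class labels (not from the paper).
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = weighted_f1(y_true, y_pred)
```

Because the weights follow class frequency, frequent classes dominate this score, whereas a macro average (as used for the reported AUC) weights every class equally.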


cs.CR [Back]

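[91] The bitter lesson of misuse detection cs.CR | cs.AI | cs.CL

Hadrien Mariaccia, Charbel-Raphaël Segerie, Diego Dorn

TL;DR: This paper introduces BELLS, a benchmark framework for evaluating LLM supervision systems. It finds that existing specialized supervision systems perform poorly against diverse adversarial attacks, whereas simple detection with generalist LLMs is more effective.

Details

Motivation: Existing LLM supervision systems lack a comprehensive public benchmark, so their performance under diverse adversarial attacks cannot be assessed.

Result: Specialized supervision systems have low detection rates, and simple detection with generalist LLMs is more effective. However, LLMs suffer from metacognitive incoherence: Claude 3.7 and Mistral Large, for example, still respond to harmful queries at a high rate.

Insight: General LLM capabilities are essential for detecting diverse misuse. Simple scaffolding could improve robustness, but the trade-offs need further study.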

Abstract: Prior work on jailbreak detection has established the importance of adversarial robustness for LLMs but has largely focused on the model's ability to resist adversarial inputs and to output safe content, rather than the effectiveness of external supervision systems. The only public and independent benchmark of these guardrails to date evaluates a narrow set of supervisors on limited scenarios. Consequently, no comprehensive public benchmark yet verifies how well supervision systems from the market perform under realistic, diverse attacks. To address this, we introduce BELLS, a Benchmark for the Evaluation of LLM Supervision Systems. The framework is two-dimensional, spanning harm severity (benign, borderline, harmful) and adversarial sophistication (direct vs. jailbreak), and provides a rich dataset covering 3 jailbreak families and 11 harm categories. Our evaluations reveal drastic limitations of specialized supervision systems. While they recognize some known jailbreak patterns, their semantic understanding and generalization capabilities are very limited, sometimes with detection rates close to zero when asking a harmful question directly or with a new jailbreak technique such as base64 encoding. Simply asking generalist LLMs if the user question is “harmful or not” largely outperforms these supervisors from the market according to our BELLS score. But frontier LLMs still suffer from metacognitive incoherence, often responding to queries they correctly identify as harmful (up to 30 percent for Claude 3.7 and greater than 50 percent for Mistral Large). These results suggest that simple scaffolding could significantly improve misuse detection robustness, but more research is needed to assess the tradeoffs of such techniques. Our results support the “bitter lesson” of misuse detection: general capabilities of LLMs are necessary to detect a diverse array of misuses and jailbreaks.


cs.RO [Back]

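[92] Learning to Evaluate Autonomous Behaviour in Human-Robot Interaction cs.RO | cs.CV | cs.LG

Matteo Tiezzi, Tommaso Apicella, Carlos Cardenas-Perez, Giovanni Fregonese, Stefano Dafarra

TL;DR: The paper proposes NeME, a framework for evaluating the autonomous behaviour of humanoid robots; it uses deep learning to classify actions from joint trajectories, enabling policy evaluation without human intervention.

Details

Motivation: Conventional evaluation methods are hard to reproduce and fail to capture the complexity of robot trajectories, so a new way is needed to measure how imitation learning performs in complex human-robot interaction tasks.

Result: The framework is validated on the ergoCub humanoid robot; experiments show that NeME aligns better with the actual success rate and is reproducible and systematic.

Insight: NeME offers an automated, scalable solution for policy evaluation in complex HRI tasks, reducing the need for human involvement.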

Abstract: Evaluating and comparing the performance of autonomous Humanoid Robots is challenging, as success rate metrics are difficult to reproduce and fail to capture the complexity of robot movement trajectories, critical in Human-Robot Interaction and Collaboration (HRIC). To address these challenges, we propose a general evaluation framework that measures the quality of Imitation Learning (IL) methods by focusing on trajectory performance. We devise the Neural Meta Evaluator (NeME), a deep learning model trained to classify actions from robot joint trajectories. NeME serves as a meta-evaluator to compare the performance of robot control policies, enabling policy evaluation without requiring human involvement in the loop. We validate our framework on ergoCub, a humanoid robot, using teleoperation data and comparing IL methods tailored to the available platform. The experimental results indicate that our method is more aligned with the success rate obtained on the robot than baselines, offering a reproducible, systematic, and insightful means for comparing the performance of multimodal imitation learning approaches in complex HRI tasks.


q-bio.QM [Back]

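[93] DeepRetro: Retrosynthetic Pathway Discovery using Iterative LLM Reasoning q-bio.QM | cs.AI | cs.CL | cs.LG | q-bio.BM | q-bio.MN

Shreyas Vinaya Sathyanarayana, Rahil Shah, Sharanabasava D. Hiremath, Rishikesh Panda, Rahul Jana

TL;DR: DeepRetro proposes an iterative retrosynthesis framework that combines LLMs with conventional template-based/MCTS tools and explores novel synthesis routes through dynamic feedback and correction.

Details

Motivation: Existing retrosynthesis methods mostly rely on predefined templates and struggle to discover new routes, while LLM-based methods still fall short on multi-step planning.

Result: The framework identifies viable and novel synthesis routes, supports feedback from human experts through an interactive interface, and is applied to the synthesis of complex natural products.

Insight: Combining iterative LLM reasoning with human feedback can significantly improve the novelty and practicality of retrosynthesis.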

Abstract: Retrosynthesis, the identification of precursor molecules for a target compound, is pivotal for synthesizing complex molecules, but faces challenges in discovering novel pathways beyond predefined templates. Recent large language model (LLM) approaches to retrosynthesis have shown promise, but how to harness LLM reasoning capabilities for effective multi-step planning remains an open question. To address this challenge, we introduce DeepRetro, an open-source, iterative, hybrid LLM-based retrosynthetic framework. Our approach integrates the strengths of conventional template-based/Monte Carlo tree search tools with the generative power of LLMs in a step-wise, feedback-driven loop. Initially, synthesis planning is attempted with a template-based engine. If this fails, the LLM subsequently proposes single-step retrosynthetic disconnections. Crucially, these suggestions undergo rigorous validity, stability, and hallucination checks before the resulting precursors are recursively fed back into the pipeline for further evaluation. This iterative refinement allows for dynamic pathway exploration and correction. We demonstrate the potential of this pipeline through benchmark evaluations and case studies, showcasing its ability to identify viable and potentially novel retrosynthetic routes. In particular, we develop an interactive graphical user interface that allows expert human chemists to provide human-in-the-loop feedback to the reasoning algorithm. This approach successfully generates novel pathways for complex natural product compounds, demonstrating the potential for iterative LLM reasoning to advance the state of the art in complex chemical syntheses.
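
The control flow described in the abstract (template engine first, LLM fallback, checks, then recursion on the precursors) can be sketched as a small recursive function. This is only an illustration of that loop under stated assumptions, not DeepRetro's code: `plan_route`, the three injected callables, and the toy string "molecules" are all hypothetical stand-ins.

```python
def plan_route(target, template_engine, llm_propose, passes_checks, max_depth=4):
    """Return a nested route dict for `target`, or None if no route is found."""
    if max_depth < 0:
        return None
    # Step 1: try the template-based engine first.
    steps = template_engine(target)
    if steps is not None:
        return {"target": target, "source": "template", "steps": steps}
    # Step 2: fall back to LLM-proposed single-step disconnections.
    for precursors in llm_propose(target):
        # Step 3: discard proposals that fail the checks.
        if not passes_checks(target, precursors):
            continue
        # Step 4: feed each precursor back into the same pipeline.
        children = [
            plan_route(p, template_engine, llm_propose, passes_checks, max_depth - 1)
            for p in precursors
        ]
        if all(child is not None for child in children):
            return {"target": target, "source": "llm", "children": children}
    return None

# Toy stand-ins (all hypothetical): molecules are plain strings.
KNOWN = {"A", "B", "C"}
PROPOSALS = {"T": [("X", "A"), ("A", "M")], "M": [("B", "C")]}

def toy_template_engine(mol):
    return ["buy " + mol] if mol in KNOWN else None

def toy_llm_propose(mol):
    return PROPOSALS.get(mol, [])

def toy_checks(target, precursors):
    return "X" not in precursors  # pretend "X" fails the validity check

route = plan_route("T", toy_template_engine, toy_llm_propose, toy_checks)
```

In the toy run, the first proposal for `"T"` is rejected by the check, and the second succeeds because `"A"` is resolved by the template stand-in while `"M"` needs one more LLM-proposed step.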


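[94] PAST: A multimodal single-cell foundation model for histopathology and spatial transcriptomics in cancer q-bio.QM | cs.CV | stat.AP

Changchun Yang, Haoyang Li, Yushuai Wu, Yilan Zhang, Yifeng Jiao

TL;DR: PAST is a multimodal single-cell foundation model that jointly learns from histopathology images and single-cell transcriptomes to unify cross-modal representations, significantly improving prediction and analysis in cancer research.

Details

Motivation: Current pathology foundation models typically lack integration with molecular data at single-cell resolution, which limits their use in precision oncology.

Result: Across multiple cancers and tasks, PAST outperforms existing methods and shows strong generalizability and scalability.

Insight: PAST provides a versatile tool for high-resolution spatial omics, mechanistic discovery, and precision cancer research.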

Abstract: While pathology foundation models have transformed cancer image analysis, they often lack integration with molecular data at single-cell resolution, limiting their utility for precision oncology. Here, we present PAST, a pan-cancer single-cell foundation model trained on 20 million paired histopathology images and single-cell transcriptomes spanning multiple tumor types and tissue contexts. By jointly encoding cellular morphology and gene expression, PAST learns unified cross-modal representations that capture both spatial and molecular heterogeneity at the cellular level. This approach enables accurate prediction of single-cell gene expression, virtual molecular staining, and multimodal survival analysis directly from routine pathology slides. Across diverse cancers and downstream tasks, PAST consistently exceeds the performance of existing approaches, demonstrating robust generalizability and scalability. Our work establishes a new paradigm for pathology foundation models, providing a versatile tool for high-resolution spatial omics, mechanistic discovery, and precision cancer research.


cs.LG [Back]

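[95] Can Interpretation Predict Behavior on Unseen Data? cs.LG | cs.AI | cs.CL

Victoria R. Li, Jenny Kaufmann, Martin Wattenberg, David Alvarez-Melis, Naomi Saphra

TL;DR: This paper examines whether interpretability can predict model behavior on unseen data; experiments show that simple interpretability tools can predict a model's out-of-distribution (OOD) performance.

Details

Motivation: The goal is to test whether interpretability can predict how a model responds to unseen input data, not only the effects of targeted interventions.

Result: When a model shows hierarchical attention patterns on in-distribution data, it also tends to generalize hierarchically on OOD data.

Insight: The findings offer a proof of concept for interpretability's potential to predict unseen model behavior and motivate further research.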

Abstract: Interpretability research often aims to predict how a model will respond to targeted interventions on specific mechanisms. However, it rarely predicts how a model will respond to unseen input data. This paper explores the promises and challenges of interpretability as a tool for predicting out-of-distribution (OOD) model behavior. Specifically, we investigate the correspondence between attention patterns and OOD generalization in hundreds of Transformer models independently trained on a synthetic classification task. These models exhibit several distinct systematic generalization rules OOD, forming a diverse population for correlational analysis. In this setting, we find that simple observational tools from interpretability can predict OOD performance. In particular, when in-distribution attention exhibits hierarchical patterns, the model is likely to generalize hierarchically on OOD data – even when the rule’s implementation does not rely on these hierarchical patterns, according to ablation tests. Our findings offer a proof-of-concept to motivate further interpretability work on predicting unseen model behavior.


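[96] Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model cs.LG | cs.AI | cs.CL

Jing Liang, Hongyao Tang, Yi Ma, Jinyi Liu, Yan Zheng

TL;DR: The paper proposes ReMix, an efficient off-policy reinforcement finetuning method for large language models (LLMs) that sharply reduces training cost and improves reasoning ability.

Details

Motivation: Most existing reinforcement finetuning (RFT) methods are on-policy RL and cannot fully use historical data, which makes them costly in compute and time. Introducing off-policy RL can substantially improve efficiency and scalability.

Result: ReMix performs strongly on several math reasoning benchmarks, with Pass@1 accuracy of 52.10% for the 1.5B model and 63.27%/64.39% for the 7B model. Training cost is 30x to 450x lower than that of other advanced models.

Insight: The analysis reveals an implicit preference induced by off-policy discrepancy (for example, the “Whipping Effect” that favors shorter responses) and a collapse mode of self-reflection behavior under severe off-policyness.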

Abstract: Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs). One major limitation of most existing Reinforcement Finetuning (RFT) methods is that they are on-policy RL in nature, i.e., data generated during the past learning process is not fully utilized. This inevitably comes at a significant cost of compute and time, posing a stringent bottleneck on continuing economic and efficient scaling. To this end, we launch the renaissance of off-policy RL and propose Reincarnating Mix-policy Proximal Policy Gradient (ReMix), a general approach to enable on-policy RFT methods like PPO and GRPO to leverage off-policy data. ReMix consists of three major components: (1) Mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio for efficient training; (2) KL-Convex policy constraint to balance the trade-off between stability and flexibility; (3) Policy reincarnation to achieve a seamless transition from efficient early-stage learning to steady asymptotic improvement. In our experiments, we train a series of ReMix models upon PPO, GRPO and 1.5B, 7B base models. ReMix shows an average Pass@1 accuracy of 52.10% (for 1.5B model) with 0.079M response rollouts, 350 training steps and achieves 63.27%/64.39% (for 7B model) with 0.007M/0.011M response rollouts, 50/75 training steps, on five math reasoning benchmarks (i.e., AIME’24, AMC’23, Minerva, OlympiadBench, and MATH500). Compared with 15 recent advanced models, ReMix shows SOTA-level performance with an over 30x to 450x reduction in training cost in terms of rollout data volume. In addition, we reveal insightful findings via multifaceted analysis, including the implicit preference for shorter responses due to the Whipping Effect of off-policy discrepancy, the collapse mode of self-reflection behavior under the presence of severe off-policyness, etc.


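[97] Denoising Multi-Beta VAE: Representation Learning for Disentanglement and Generation cs.LG | cs.AI | cs.CV

Anshuk Uppal, Yuhta Takida, Chieh-Hsin Lai, Yuki Mitsufuji

TL;DR: This paper proposes Denoising Multi-Beta VAE, a new framework that learns different latent representations using multiple β values to address the trade-off between disentanglement and generation quality in generative models, and uses a non-linear diffusion model to transition smoothly between them.

Details

Motivation: The conventional β-VAE framework trades off disentanglement against generation quality: increasing β improves disentanglement at the expense of generation quality. This paper aims to resolve the trade-off by learning latent representations for multiple β values, unifying disentanglement with high-quality generation.

Result: Experiments show that the framework improves markedly on both disentanglement and generation quality, with smooth transitions in latent space and consistent, controllable generation.

Insight: Dynamically adjusting β can effectively balance disentanglement and generation quality, suggesting a new direction for latent-space design in generative models.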

Abstract: Disentangled and interpretable latent representations in generative models typically come at the cost of generation quality. The $\beta$-VAE framework introduces a hyperparameter $\beta$ to balance disentanglement and reconstruction quality, where setting $\beta > 1$ introduces an information bottleneck that favors disentanglement over sharp, accurate reconstructions. To address this trade-off, we propose a novel generative modeling framework that leverages a range of $\beta$ values to learn multiple corresponding latent representations. First, we obtain a slew of representations by training a single variational autoencoder (VAE), with a new loss function that controls the information retained in each latent representation such that higher $\beta$ values prioritize disentanglement over reconstruction fidelity. We then introduce a non-linear diffusion model that smoothly transitions latent representations corresponding to different $\beta$ values. This model denoises towards less disentangled and more informative representations, ultimately leading to (almost) lossless representations, enabling sharp reconstructions. Furthermore, our model supports sample generation without input images, functioning as a standalone generative model. We evaluate our framework in terms of both disentanglement and generation quality. Additionally, we observe smooth transitions in the latent spaces with respect to changes in $\beta$, facilitating consistent manipulation of generated outputs.
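
For readers unfamiliar with the role of $\beta$, the sketch below shows the standard single-$\beta$ objective (reconstruction error plus $\beta$ times a KL term) for a diagonal Gaussian posterior. It is not the new multi-$\beta$ loss the paper proposes, whose form the abstract does not give. The function names and the numbers are illustrative assumptions.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))

def beta_vae_loss(recon_error, mu, logvar, beta):
    """Reconstruction error plus beta times the KL term."""
    return recon_error + beta * kl_to_standard_normal(mu, logvar)

# Made-up posterior parameters and a fixed reconstruction error of 1.0.
mu, logvar = [0.5, -0.2], [0.0, -1.0]
losses = {beta: beta_vae_loss(1.0, mu, logvar, beta) for beta in (1.0, 4.0, 16.0)}
```

With the reconstruction error held fixed, a larger $\beta$ penalizes the same posterior more heavily, which is the information bottleneck the abstract refers to.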


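[98] A Principled Framework for Multi-View Contrastive Learning cs.LG | cs.CV

Panagiotis Koromilas, Efthymios Georgiou, Giorgos Bouritsas, Theodoros Giannakopoulos, Mihalis A. Nicolaou

TL;DR: Addressing the limitations of existing multi-view contrastive learning methods, the paper proposes two new loss functions, MV-InfoNCE and MV-DHEL, and demonstrates their advantages on multi-view and multimodal data through theoretical proofs and experiments.

Details

Motivation: Current multi-view contrastive learning methods have several limitations (such as conflicting objectives and insufficient modeling of cross-view interactions) and cannot fully exploit the benefits of multiple views. The paper aims to address these issues with a more effective multi-view contrastive learning framework.

Result: Experiments show that the proposed methods outperform existing ones on multi-view and multimodal data, and in particular avoid dimensionality collapse when the number of views is high.

Insight: 1. Multi-view learning is better achieved with more effective loss functions; 2. Decoupling alignment from uniformity is key to improving performance; 3. Multi-view contrastive learning extends to multimodal settings.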

Abstract: Contrastive Learning (CL), a leading paradigm in Self-Supervised Learning (SSL), typically relies on pairs of data views generated through augmentation. While multiple augmentations per instance (more than two) improve generalization in supervised learning, current CL methods handle additional views suboptimally by simply aggregating different pairwise objectives. This approach suffers from four critical limitations: (L1) it utilizes multiple optimization terms per data point, resulting in conflicting objectives, (L2) it fails to model all interactions across views and data points, (L3) it inherits fundamental limitations (e.g. alignment-uniformity coupling) from pairwise CL losses, and (L4) it prevents fully realizing the benefits of increased view multiplicity observed in supervised settings. We address these limitations through two novel loss functions: MV-InfoNCE, which extends InfoNCE to incorporate all possible view interactions simultaneously in one term per data point, and MV-DHEL, which decouples alignment from uniformity across views while scaling interaction complexity with view multiplicity. Both approaches are theoretically grounded: we prove they asymptotically optimize for alignment of all views and uniformity, providing principled extensions to multi-view contrastive learning. Our empirical results on ImageNet1K and three other datasets demonstrate that our methods consistently outperform existing multi-view approaches and effectively scale with increasing view multiplicity. We also apply our objectives to multimodal data and show that, in contrast to other contrastive objectives, they can scale beyond just two modalities. Most significantly, ablation studies reveal that MV-DHEL with five or more views effectively mitigates dimensionality collapse by fully utilizing the embedding space, thereby delivering multi-view benefits observed in supervised learning.
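
To make limitation (L1) concrete, here is a minimal sketch of the pairwise-aggregation baseline the abstract criticizes: a standard two-view InfoNCE term averaged over every ordered pair of views, which yields several optimization terms per data point. It does not implement MV-InfoNCE or MV-DHEL, whose exact forms the abstract does not state. The function names, the temperature `tau`, and the embeddings are illustrative assumptions.

```python
import math
from itertools import permutations

def info_nce(sim_pos, sim_negs, tau=0.1):
    """InfoNCE for one anchor: negative log-softmax of the positive similarity."""
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[0]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pairwise_aggregated_loss(views, tau=0.1):
    """views[k][i] is the embedding of sample i under view k.

    Averages a two-view InfoNCE term over every ordered pair of views,
    so each data point contributes several separate terms.
    """
    n = len(views[0])
    terms = []
    for a, b in permutations(range(len(views)), 2):
        for i in range(n):
            pos = dot(views[a][i], views[b][i])
            negs = [dot(views[a][i], views[b][j]) for j in range(n) if j != i]
            terms.append(info_nce(pos, negs, tau))
    return sum(terms) / len(terms)

# Three views of two samples, as unit vectors (made-up numbers).
views = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.8, 0.6], [0.6, 0.8]],
    [[0.96, 0.28], [0.28, 0.96]],
]
loss = pairwise_aggregated_loss(views)
```

With three views there are six ordered view pairs, so every sample appears in six separate InfoNCE terms. That is the "multiple optimization terms per data point" situation that the paper's single-term objectives are designed to avoid.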


cs.AI [Back]

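[99] Scaling Towards the Information Boundary of Instruction Set: InfinityInstruct-Subject Technical Report cs.AI | cs.CL

Li Du, Hanyu Zhao, Yiming Ju, Tengfei Pan

TL;DR: The paper presents a systematic framework for constructing instruction data and uses it to build InfinityInstruct-Subject. Through hierarchical labeling and targeted data generation, it addresses the limited coverage and depth of existing instruction datasets and significantly improves models' instruction-following ability.

Details

Motivation: Although existing instruction datasets have reached tens of millions of samples, models still struggle with complex instructions and tasks in rare domains, mainly because instruction sets have not expanded enough in coverage and depth.

Result: Experiments show that InfinityInstruct-Subject significantly improves instruction-following ability across multiple foundation models and benchmark tasks, with greater coverage and depth than comparable synthesized instruction datasets.

Insight: Systematically improving the quality and diversity of instruction data strengthens generalization and complex-task handling more effectively than simply adding more data.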

Abstract: Instruction tuning has become a foundation for unlocking the capabilities of large-scale pretrained models and improving their performance on complex tasks. Thus, the construction of high-quality instruction datasets is crucial for enhancing model performance and generalizability. Although current instruction datasets have reached tens of millions of samples, models finetuned on them may still struggle with complex instruction following and tasks in rare domains. This is primarily due to limited expansion in both “coverage” (coverage of task types and knowledge areas) and “depth” (instruction complexity) of the instruction set. To address this issue, we propose a systematic instruction data construction framework, which integrates a hierarchical labeling system, an informative seed selection algorithm, an evolutionary data synthesis process, and a model deficiency diagnosis with targeted data generation. These components form an iterative closed-loop to continuously enhance the coverage and depth of instruction data. Based on this framework, we construct InfinityInstruct-Subject, a high-quality dataset containing ~1.5 million instructions. Experiments on multiple foundation models and benchmark tasks demonstrate its effectiveness in improving instruction-following capabilities. Further analyses suggest that InfinityInstruct-Subject shows enlarged coverage and depth compared to comparable synthesized instruction datasets. Our work lays a theoretical and practical foundation for the efficient, continuous evolution of instruction datasets, moving from data quantity expansion to qualitative improvement.


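[100] The User-Centric Geo-Experience: An LLM-Powered Framework for Enhanced Planning, Navigation, and Dynamic Adaptation cs.AI | cs.CV

Jieren Deng, Aleksandar Cvetkovic, Pak Kiu Chung, Dragomir Yankov, Chiqun Zhang

TL;DR: This paper proposes an LLM-based, user-centric geo-experience framework in which three cooperative agents (travel planning, destination assistance, and local discovery) address the static, fragmented nature of traditional travel-planning systems, substantially improving query interpretation, navigation accuracy, and adaptability.

Details

Motivation: Traditional travel-planning systems struggle with complex real-world situations such as changing environments and itinerary disruptions, resulting in a poor user experience. This paper aims to fill the gaps in intelligent trip planning, precise navigation, and dynamic adaptation.

Result: Experiments show strong performance in query interpretation, navigation accuracy, and disruption adaptation, suiting scenarios such as urban exploration and emergency response.

Insight: LLMs and multi-agent cooperation enable a dynamic, user-centric design, pointing to new directions for future intelligent geo-services.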

Abstract: Traditional travel-planning systems are often static and fragmented, leaving them ill-equipped to handle real-world complexities such as evolving environmental conditions and unexpected itinerary disruptions. In this paper, we identify three gaps between existing service providers causing frustrating user experience: intelligent trip planning, precision “last-100-meter” navigation, and dynamic itinerary adaptation. We propose three cooperative agents: a Travel Planning Agent that employs grid-based spatial grounding and map analysis to help resolve complex multi-modal user queries; a Destination Assistant Agent that provides fine-grained guidance for the final navigation leg of each journey; and a Local Discovery Agent that leverages image embeddings and Retrieval-Augmented Generation (RAG) to detect and respond to trip plan disruptions. With evaluations and experiments, our system demonstrates substantial improvements in query interpretation, navigation accuracy, and disruption resilience, underscoring its promise for applications from urban exploration to emergency response.


cs.IR [Back]

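[101] DS@GT at CheckThat! 2025: Exploring Retrieval and Reranking Pipelines for Scientific Claim Source Retrieval on Social Media Discourse cs.IR | cs.CL

Jeanette Schofield, Shuyu Tian, Hoang Thanh Thanh Truong, Maximilian Heil

TL;DR: For CLEF 2025 CheckThat! Lab Task 4b, the DS@GT team explored multiple retrieval and reranking methods for finding the sources of scientific claims in social media tweets, achieving an MRR@5 of 0.58, an improvement of 0.15 over the baseline.

Details

Motivation: Social media users often make scientific claims without citing sources, creating a growing need to verify those claims.

Result: The team achieved an MRR@5 of 0.58 on CLEF 2025 CheckThat! Lab Task 4b, ranking 16th out of 30 teams.

Insight: Data augmentation and optimization of the retrieval pipeline are important for improving scientific claim source retrieval.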

Abstract: Social media users often make scientific claims without citing where these claims come from, generating a need to verify these claims. This paper details work done by the DS@GT team for CLEF 2025 CheckThat! Lab Task 4b Scientific Claim Source Retrieval, which seeks to find relevant scientific papers based on implicit references in tweets. Our team explored 6 different data augmentation techniques, 7 different retrieval and reranking pipelines, and finetuned a bi-encoder. Achieving an MRR@5 of 0.58, our team ranked 16th out of 30 teams for the CLEF 2025 CheckThat! Lab Task 4b, an improvement of 0.15 over the BM25 baseline of 0.43. Our code is available on GitHub at https://github.com/dsgt-arc/checkthat-2025-swd/tree/main/subtask-4b.
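
The MRR@5 metric quoted above is the mean, over queries, of the reciprocal rank of the correct paper, counting a query as zero when the correct paper is not in the top five. Here is a minimal sketch with made-up paper IDs; the function name `mrr_at_k` and the data are illustrative and not taken from the team's repository.

```python
def mrr_at_k(ranked_lists, gold_ids, k=5):
    """Mean reciprocal rank of the gold item within the top k results."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_ids):
        top_k = ranked[:k]
        if gold in top_k:
            total += 1.0 / (top_k.index(gold) + 1)
    return total / len(gold_ids)

# One ranked result list per tweet (made-up IDs).
ranked_lists = [
    ["p3", "p1", "p9"],                    # gold at rank 2 -> 1/2
    ["p7", "p2", "p4", "p5"],              # gold at rank 1 -> 1
    ["p8", "p6", "p0", "p1", "p2", "p5"],  # gold at rank 6 -> 0 (outside top 5)
]
gold_ids = ["p1", "p7", "p5"]
score = mrr_at_k(ranked_lists, gold_ids)  # (0.5 + 1.0 + 0.0) / 3 = 0.5
```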


eess.AS [Back]

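[102] Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation eess.AS | cs.AI | cs.CL

Saierdaer Yusuyin, Te Ma, Hao Huang, Zhijian Ou

TL;DR: The paper proposes a pronunciation-lexicon-free method for phoneme-based crosslingual speech recognition, using the joint stochastic approximation (JSA) algorithm to train speech-to-phoneme (S2P), phoneme-to-grapheme (P2G), and grapheme-to-phoneme (G2P) models, with significant performance gains.

Details

Motivation: Existing phoneme-based crosslingual speech recognition methods require a pronunciation lexicon, which limits their applicability. This study aims to remove that requirement and proposes a more efficient method.

Result: In Polish and Indonesian experiments, only 10 minutes of phoneme supervision yields a 5% error rate reduction; in adaptation with cross-domain text-only data, the method achieves a 9% error rate reduction.

Insight: Lexicon-free methods show promise for crosslingual speech recognition, and the JSA algorithm performs well on discrete latent variable models.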

Abstract: Recently, pre-trained models with phonetic supervision have demonstrated their advantages for crosslingual speech recognition in data efficiency and information sharing across languages. However, a limitation is that a pronunciation lexicon is needed for such phoneme-based crosslingual speech recognition. In this study, we aim to eliminate the need for pronunciation lexicons and propose a latent variable model based method, with phonemes being treated as discrete latent variables. The new method consists of a speech-to-phoneme (S2P) model and a phoneme-to-grapheme (P2G) model, and a grapheme-to-phoneme (G2P) model is introduced as an auxiliary inference model. To jointly train the three models, we utilize the joint stochastic approximation (JSA) algorithm, which is a stochastic extension of the EM (expectation-maximization) algorithm and has demonstrated superior performance particularly in estimating discrete latent variable models. Based on the Whistle multilingual pre-trained S2P model, crosslingual experiments are conducted in Polish (130 h) and Indonesian (20 h). With only 10 minutes of phoneme supervision, the new method, JSA-SPG, achieves 5% error rate reductions compared to the best crosslingual fine-tuning approach using subword or full phoneme supervision. Furthermore, it is found that in language domain adaptation (i.e., utilizing cross-domain text-only data), JSA-SPG outperforms the standard practice of language model fusion via the auxiliary support of the G2P model by 9% error rate reductions. To facilitate reproducibility and encourage further exploration in this field, we open-source the JSA-SPG training code and complete pipeline.


cs.HC [Back]

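[103] Super Kawaii Vocalics: Amplifying the “Cute” Factor in Computer Voice cs.HC | cs.AI | cs.CL | cs.CY | cs.SD | eess.AS

Yuto Mandai, Katie Seaborn, Tomoyasu Nakano, Xin Sun, Yijia Wang

TL;DR: This paper explores how adjusting the fundamental and formant frequencies of a voice can enhance the perceived “cuteness” of computer voice, offering a preliminary model and method for the study of “kawaii” vocalics.

Details

Motivation: Existing research focuses mainly on the visual side of “kawaii” and neglects voice. This paper aims to fill that gap by studying how vocal characteristics can make computer voices sound cuter.

Result: Certain voices reach a kawaii “sweet spot” under specific frequency adjustments, but some voices show a ceiling effect.

Insight: Perceived cuteness depends not only on frequency characteristics but also on the type of voice itself, indicating a non-linear relationship in cuteness perception.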

Abstract: “Kawaii” is the Japanese concept of cute, which carries sociocultural connotations related to social identities and emotional responses. Yet, virtually all work to date has focused on the visual side of kawaii, including in studies of computer agents and social robots. In pursuit of formalizing the new science of kawaii vocalics, we explored what elements of voice relate to kawaii and how they might be manipulated, manually and automatically. We conducted a four-phase study (grand N = 512) with two varieties of computer voices: text-to-speech (TTS) and game character voices. We found kawaii “sweet spots” through manipulation of fundamental and formant frequencies, but only for certain voices and to a certain extent. Findings also suggest a ceiling effect for the kawaii vocalics of certain voices. We offer empirical validation of the preliminary kawaii vocalics model and an elementary method for manipulating kawaii perceptions of computer voice.


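[104] Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents cs.HC | cs.CL

Zackary Rackauckas, Julia Hirschberg

TL;DR: This study examines how voiced virtual agents styled after Japanese anime affect user interaction in a multimodal language learning environment, finding that agent design (especially voice, persona, and linguistic style) significantly affects user experience and learning motivation.

Details

Motivation: The study explores how culturally and emotionally stylized virtual agents can improve user interaction and learning outcomes in language learning environments.

Result: The agents' stylized design (especially voice and linguistic style) significantly affects user experience, learning motivation, and strategy.

Insight: The study shows the potential of affective, culturally stylized agents to improve user interaction and learning, offering guidance for designing similar systems.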

Abstract: This study investigates how stylized, voiced agents shape user interaction in a multimodal language learning environment. We conducted a mixed-methods evaluation of 54 participants interacting with anime-inspired characters powered by large language models and expressive text-to-speech synthesis. These agents responded in Japanese character language, offering users asynchronous, semi-structured conversation in varying speech styles and emotional tones. We analyzed user engagement patterns, perceived usability, emotional responses, and learning behaviors, with particular attention to how agent stylization influenced interaction across language proficiency levels and cultural backgrounds. Our findings reveal that agent design, especially voice, persona, and linguistic style, substantially affected user experience, motivation, and strategy. This work contributes to the understanding of affective, culturally stylized agents in human-agent interaction and offers guidance for designing more engaging, socially responsive systems.


cs.GR [Back]

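[105] 3D-Generalist: Self-Improving Vision-Language-Action Models for Crafting 3D Worlds cs.GR | cs.CV

Fan-Yun Sun, Shengguang Wu, Christian Jacobsen, Thomas Yim, Haoming Zou

TL;DR: The paper proposes 3D-Generalist, a scalable method that uses vision-language models (VLMs) trained with self-improvement fine-tuning to generate high-quality 3D environments as training data for foundation models. The method performs well at generating simulation-ready 3D environments and in the quality and scalability of its synthetic data.

Details

Motivation: Although large-scale pretraining gives models language and visual reasoning abilities, their spatial reasoning remains limited by the lack of data grounded in the 3D world. Manually creating immersive 3D worlds (for VR, gaming, and robotics applications) is highly labor-intensive, so an automated method is needed.

Result: The generated 3D environments are simulation-ready. After fine-tuning, a model pre-trained on the generated data outperforms models pre-trained on human-crafted synthetic data and approaches results obtained with large-scale real data.

Insight: Automatically generated 3D synthetic data can replace or complement human-crafted data, providing an efficient, high-quality data source for pretraining foundation models.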

Abstract: Despite large-scale pretraining endowing models with language and vision reasoning capabilities, improving their spatial reasoning capability remains challenging due to the lack of data grounded in the 3D world. While it is possible for humans to manually create immersive and interactive worlds through 3D graphics, as seen in applications such as VR, gaming, and robotics, this process remains highly labor-intensive. In this paper, we propose a scalable method for generating high-quality 3D environments that can serve as training data for foundation models. We recast 3D environment building as a sequential decision-making problem, employing Vision-Language-Models (VLMs) as policies that output actions to jointly craft a 3D environment’s layout, materials, lighting, and assets. Our proposed framework, 3D-Generalist, trains VLMs to generate more prompt-aligned 3D environments via self-improvement fine-tuning. We demonstrate the effectiveness of 3D-Generalist and the proposed training strategy in generating simulation-ready 3D environments. Furthermore, we demonstrate its quality and scalability in synthetic data generation by pretraining a vision foundation model on the generated data. After fine-tuning the pre-trained model on downstream tasks, we show that it surpasses models pre-trained on meticulously human-crafted synthetic data and approaches results achieved with real data orders of magnitude larger.