cs.CV [Total: 174]
cs.CL [Total: 46]
eess.IV [Total: 2]
cs.NI [Total: 1]
cs.AR [Total: 1]
cs.RO [Total: 15]
cs.SD [Total: 2]
hep-ex [Total: 1]
cs.MM [Total: 1]
cs.LG [Total: 15]
cs.CR [Total: 3]
cs.AI [Total: 7]
cs.SE [Total: 1]

cs.CV [Back]

[1] MOTION: ML-Assisted On-Device Low-Latency Motion Recognition cs.CV | cs.AI | cs.HCPDF

Veeramani Pugazhenthi, Wei-Hsiang Chu, Junwei Lu, Jadyn N. Miyahira, Soheil Salehi

TL;DR: 该论文提出了一种基于AutoML的低延迟手势识别方法，适用于嵌入式设备，尤其是在医疗监测领域。通过使用三轴加速度计传感器和轻量级机器学习算法，实现了高效的实时手势识别。

Details

Motivation: 随着人机交互和医疗监测需求的增加，嵌入式设备需要快速、高效且低延迟的手势识别能力，以避免误报。本文旨在探索一种基于AutoML的方法，仅使用三轴加速度计传感器实现高效的实时手势识别。

Result: 在WeBe Band设备上实现了可靠的实时手势识别，表现出高精度和低延迟。神经网络在平衡精度、延迟和内存使用方面表现最佳。

Insight: AutoML和轻量级机器学习算法的结合可以在嵌入式设备上实现高效的低延迟手势识别，为医疗监测等需要快速响应和安全的场景提供了潜在解决方案。

Abstract: The use of tiny devices capable of low-latency gesture recognition is gaining momentum in everyday human-computer interaction and especially in medical monitoring fields. Embedded solutions such as fall detection, rehabilitation tracking, and patient supervision require fast and efficient tracking of movements while avoiding unwanted false alarms. This study presents an efficient solution on how to build very efficient motion-based models only using triaxial accelerometer sensors. We explore the capability of the AutoML pipelines to extract the most important features from the data segments. This approach also involves training multiple lightweight machine learning algorithms using the extracted features. We use WeBe Band, a multi-sensor wearable device that is equipped with a powerful enough MCU to effectively perform gesture recognition entirely on the device. Of the models explored, we found that the neural network provided the best balance between accuracy, latency, and memory use. Our results also demonstrate that reliable real-time gesture recognition can be achieved in WeBe Band, with great potential for real-time medical monitoring solutions that require a secure and fast response time.

[2] Closing the Gap: Data-Centric Fine-Tuning of Vision Language Models for the Standardized Exam Questions cs.CV | cs.AI | cs.CL | cs.CYPDF

Egemen Sert, Şeyda Ertekin

TL;DR: 论文提出了一种数据中心的方法，通过高质量的多模态数据微调视觉语言模型，用于标准化考试题目，取得了接近SOTA的性能。

Details

Motivation: 标准化考试题目提供了一个严格的多模态推理测试平台，但目前的研究主要集中在算法改进上，而数据中心的视觉语言推理基础仍未得到充分探索。

Result: 模型在新发布的YKSUniform基准（1,854个多模态考试题目）上达到78.6%的准确率，仅比Gemini 2.0 Flash低1.0%。

Insight: 数据组成和表征语法在多模态推理中起决定性作用，精心设计的数据中心方法可以将监督微调提升至接近SOTA的性能。

Abstract: Multimodal reasoning has become a cornerstone of modern AI research. Standardized exam questions offer a uniquely rigorous testbed for such reasoning, providing structured visual contexts and verifiable answers. While recent progress has largely focused on algorithmic advances such as reinforcement learning (e.g., GRPO, DPO), the data centric foundations of vision language reasoning remain less explored. We show that supervised fine-tuning (SFT) with high-quality data can rival proprietary approaches. To this end, we compile a 161.4 million token multimodal dataset combining textbook question-solution pairs, curriculum aligned diagrams, and contextual materials, and fine-tune Qwen-2.5VL-32B using an optimized reasoning syntax (QMSA). The resulting model achieves 78.6% accuracy, only 1.0% below Gemini 2.0 Flash, on our newly released benchmark YKSUniform, which standardizes 1,854 multimodal exam questions across 309 curriculum topics. Our results reveal that data composition and representational syntax play a decisive role in multimodal reasoning. This work establishes a data centric framework for advancing open weight vision language models, demonstrating that carefully curated and curriculum-grounded multimodal data can elevate supervised fine-tuning to near state-of-the-art performance.

Abdolazim Rezaei, Mehdi Sookhak

TL;DR: PEFT-DML是一种参数高效的深度度量学习框架，用于自动驾驶中鲁棒的多模态3D目标检测。通过将多种模态映射到共享潜在空间，并在训练中引入LoRA和适配器层，提升了效率和对传感器丢失的鲁棒性。

Details

Motivation: 自动驾驶环境中传感器可能因故障或环境变化不可用，传统方法无法处理多模态的动态变化。PEFT-DML旨在解决这一问题。

Result: 在nuScenes基准测试中表现出更高的准确性和鲁棒性。

Insight: 共享潜在空间和参数高效的微调机制可有效解决多模态动态性问题。

Abstract: This study introduces PEFT-DML, a parameter-efficient deep metric learning framework for robust multi-modal 3D object detection in autonomous driving. Unlike conventional models that assume fixed sensor availability, PEFT-DML maps diverse modalities (LiDAR, radar, camera, IMU, GNSS) into a shared latent space, enabling reliable detection even under sensor dropout or unseen modality class combinations. By integrating Low-Rank Adaptation (LoRA) and adapter layers, PEFT-DML achieves significant training efficiency while enhancing robustness to fast motion, weather variability, and domain shifts. Experiments on benchmarks nuScenes demonstrate superior accuracy.

[4] DL-CapsNet: A Deep and Light Capsule Network cs.CVPDF

Pouya Shiri, Amirali Baniasadi

TL;DR: DL-CapsNet提出了一种深度且轻量化的胶囊网络，通过胶囊汇总层降低复杂度，保持了高准确率的同时减少了参数量和训练时间。

Details

Motivation: CNN在处理重叠类别和仿射变换的图像时表现不足，CapsNet虽更准确但复杂度高。为此，作者提出DL-CapsNet以平衡准确性与效率。

Result: DL-CapsNet在复杂数据集上表现优异，参数量少，训练和推理速度快。

Insight: 胶囊网络可通过深度化和轻量化设计实现高效准确的分类，适用于复杂场景。

Abstract: Capsule Network (CapsNet) is among the promising classifiers and a possible successor of the classifiers built based on Convolutional Neural Network (CNN). CapsNet is more accurate than CNNs in detecting images with overlapping categories and those with applied affine transformations. In this work, we propose a deep variant of CapsNet consisting of several capsule layers. In addition, we design the Capsule Summarization layer to reduce the complexity by reducing the number of parameters. DL-CapsNet, while being highly accurate, employs a small number of parameters and delivers faster training and inference. DL-CapsNet can process complex datasets with a high number of categories.

[5] Satellite to Street : Disaster Impact Estimator cs.CV | cs.AIPDF

Sreesritha Sai, Sai Venkata Suma Sreeja, Deepthi, Nikhil

TL;DR: 该论文提出了一种名为’Satellite-to-Street: Disaster Impact Estimator’的深度学习框架，用于通过联合处理灾前和灾后卫星图像，生成精细的像素级损害地图，以提升灾害损害评估的准确性和效率。

Details

Motivation: 传统灾害损害评估依赖人工解读卫星图像，存在速度慢、主观性强且难以规模化的问题。现有的深度学习方法（如U-Net和变化检测模型）在捕捉细微结构变化和处理类别不平衡方面表现不佳。

Result: 在公开的灾害数据集上，相较于传统分割模型和变化检测基线模型，该方法在结构损害的定位和分类方面表现更优，生成了更准确的损害地图。

Insight: 通过结合局部和全局信息，以及优化损失函数设计，可以显著提升灾害损害评估模型的性能。该方法的数据驱动特性为灾害管理提供了高效且一致的决策支持。

Abstract: Accurate post-disaster damage assessment is of high importance for prioritizing emergency response; however, manual interpretation of satellite imagery is slow, subjective, and hard to scale. While deep-learning models for image segmentation, such as U-Net-based baselines and change-detection models, are useful baselines, they often struggle with subtle structural variations and severe class imbalance, yielding poor detection of highly damaged regions. The present work proposes a deep-learning framework that jointly processes pre- and post-disaster satellite images to obtain fine-grained pixel-level damage maps: Satellite-to-Street: Disaster Impact Estimator. The model uses a modified dual-input U-Net architecture with enhanced feature fusion to capture both the local structural changes as well as the broader contextual cues. Class-aware weighted loss functions are integrated in order to handle the dominance of undamaged pixels in real disaster datasets, thus enhancing sensitivity toward major and destroyed categories. Experimentation on publicly available disaster datasets shows improved localization and classification of structural damage when compared to traditional segmentation and baseline change-detection models. The resulting damage maps provide a rapid and consistent assessment mechanism to support and not replace expert decision-making, thus allowing more efficient, data-driven disaster management.

[6] ProvRain: Rain-Adaptive Denoising and Vehicle Detection via MobileNet-UNet and Faster R-CNN cs.CVPDF

Aswinkumar Varathakumaran, Nirmala Paramanandham

TL;DR: ProvRain提出了一种结合MobileNet-U-Net和Faster R-CNN的管道，用于雨夜条件下的车辆检测和去噪，通过课程学习和合成数据提升了性能。

Details

Motivation: 在夜间和恶劣天气（如雨天）条件下，车辆检测面临严重的噪声问题，影响检测精度。现有方法难以平衡去噪和检测性能。

Result: 与基线相比，检测准确率提升8.94%，召回率提升10.25%；去噪方面PSNR提升10-15%，SSIM提升5-6%，LPIPS降低67%。

Insight: 课程学习和混合数据是提升模型在恶劣天气下性能的关键；轻量级架构适用于实时应用。

Abstract: Provident vehicle detection has a lot of scope in the detection of vehicle during night time. The extraction of features other than the headlamps of vehicles allows us to detect oncoming vehicles before they appear directly on the camera. However, it faces multiple issues especially in the field of night vision, where a lot of noise caused due to weather conditions such as rain or snow as well as camera conditions. This paper focuses on creating a pipeline aimed at dealing with such noise while at the same time maintaining the accuracy of provident vehicular detection. The pipeline in this paper, ProvRain, uses a lightweight MobileNet-U-Net architecture tuned to generalize to robust weather conditions by using the concept of curricula training. A mix of synthetic as well as available data from the PVDN dataset is used for this. This pipeline is compared to the base Faster RCNN architecture trained on the PVDN dataset to see how much the addition of a denoising architecture helps increase the detection model’s performance in rainy conditions. The system boasts an 8.94% increase in accuracy and a 10.25% increase in recall in the detection of vehicles in rainy night time frames. Similarly, the custom MobileNet-U-Net architecture that was trained also shows a 10-15% improvement in PSNR, a 5-6% increase in SSIM, and upto a 67% reduction in perceptual error (LPIPS) compared to other transformer approaches.

[7] Conceptual Evaluation of Deep Visual Stereo Odometry for the MARWIN Radiation Monitoring Robot in Accelerator Tunnels cs.CV | cs.ROPDF

André Dehne, Juri Zach, Peer Stelldinger

TL;DR: 本文探讨了在MARWIN辐射监测机器人中使用深度视觉立体里程计(DVSO)的可行性，替代现有基于激光雷达的方案，以提高在单调加速隧道中的导航灵活性和自主性。

Details

Motivation: 现有基于激光雷达的导航方案在预定义区域表现稳健，但在未知几何和障碍物环境下缺乏灵活性。DVSO作为纯视觉方案，有望解决这一问题。

Result: DVSO预期优势包括减少尺度漂移、低成本传感和可扩展数据收集，但在低纹理表面、光照变化和辐射环境下仍面临挑战。

Insight: DVSO为MARWIN在安全关键基础设施中实现更高自主性提供了研究方向，但其在实际应用中的鲁棒性和计算负载仍需进一步验证。

Abstract: The MARWIN robot operates at the European XFEL to perform autonomous radiation monitoring in long, monotonous accelerator tunnels where conventional localization approaches struggle. Its current navigation concept combines lidar-based edge detection, wheel/lidar odometry with periodic QR-code referencing, and fuzzy control of wall distance, rotation, and longitudinal position. While robust in predefined sections, this design lacks flexibility for unknown geometries and obstacles. This paper explores deep visual stereo odometry (DVSO) with 3D-geometric constraints as a focused alternative. DVSO is purely vision-based, leveraging stereo disparity, optical flow, and self-supervised learning to jointly estimate depth and ego-motion without labeled data. For global consistency, DVSO can subsequently be fused with absolute references (e.g., landmarks) or other sensors. We provide a conceptual evaluation for accelerator tunnel environments, using the European XFEL as a case study. Expected benefits include reduced scale drift via stereo, low-cost sensing, and scalable data collection, while challenges remain in low-texture surfaces, lighting variability, computational load, and robustness under radiation. The paper defines a research agenda toward enabling MARWIN to navigate more autonomously in constrained, safety-critical infrastructures.

[8] Exploring Diagnostic Prompting Approach for Multimodal LLM-based Visual Complexity Assessment: A Case Study of Amazon Search Result Pages cs.CVPDF

Divendar Murtadak, Yoon Kim, Trilokya Akula

TL;DR: 该研究探讨了诊断性提示方法对多模态大语言模型（MLLM）在亚马逊搜索结果页（SRP）视觉复杂性评估中的可靠性提升。结果显示，诊断性提示显著优于基于格式塔原则的标准提示，但性能仍有提升空间。

Details

Motivation: 研究旨在解决MLLM在视觉复杂性评估中的可靠性问题，尤其是在实际应用如亚马逊SRP中。现有的基于格式塔原则的提示方法效果有限，因此探索诊断性提示的潜力。

Result: 诊断性提示将F1分数从0.031提升至0.297（相对提升858%），但绝对性能仍较低（Cohen’s κ = 0.071）。MLLM更关注设计元素（如徽章杂乱），而人类更注重内容相似性。

Insight: 1. 诊断性提示是提升MLLM评估可靠性的有效方向；2. MLLM与人类评估的对齐仍需改进，尤其是在复杂视觉任务中；3. 提示方法需结合更大规模的真实数据集以进一步提升性能。

Abstract: This study investigates whether diagnostic prompting can improve Multimodal Large Language Model (MLLM) reliability for visual complexity assessment of Amazon Search Results Pages (SRP). We compare diagnostic prompting with standard gestalt principles-based prompting using 200 Amazon SRP pages and human expert annotations. Diagnostic prompting showed notable improvements in predicting human complexity judgments, with F1-score increasing from 0.031 to 0.297 (+858% relative improvement), though absolute performance remains modest (Cohen’s $κ$ = 0.071). The decision tree revealed that models prioritize visual design elements (badge clutter: 38.6% importance) while humans emphasize content similarity, suggesting partial alignment in reasoning patterns. Failure case analysis reveals persistent challenges in MLLM visual perception, particularly for product similarity and color intensity assessment. Our findings indicate that diagnostic prompting represents a promising initial step toward human-aligned MLLM-based evaluation, though failure cases with consistent human-MLLM disagreement require continued research and refinement in prompting approaches with larger ground truth datasets for reliable practical deployment.

[9] A Fast and Efficient Modern BERT based Text-Conditioned Diffusion Model for Medical Image Segmentation cs.CV | cs.LGPDF

Venkata Siddharth Dhara, Pawan Kumar

TL;DR: 论文提出了一种基于现代BERT的FastTextDiff模型，用于医学图像分割，通过结合医学文本标注提升语义表示，减少了对密集像素标签的依赖，提高了分割效率和准确性。

Details

Motivation: 现有医学图像分割方法依赖密集像素标签，标注成本高且耗时。FastTextDiff通过文本标注增强语义表示，解决了这一问题。

Result: FastTextDiff在分割精度和训练效率上优于传统扩散模型。

Insight: 多模态技术（文本+图像）在医学图像分析中具有潜力；ModernBERT是Clinical BioBERT的高效替代方案。

Abstract: In recent times, denoising diffusion probabilistic models (DPMs) have proven effective for medical image generation and denoising, and as representation learners for downstream segmentation. However, segmentation performance is limited by the need for dense pixel-wise labels, which are expensive, time-consuming, and require expert knowledge. We propose FastTextDiff, a label-efficient diffusion-based segmentation model that integrates medical text annotations to enhance semantic representations. Our approach uses ModernBERT, a transformer capable of processing long clinical notes, to tightly link textual annotations with semantic content in medical images. Trained on MIMIC-III and MIMIC-IV, ModernBERT encodes clinical knowledge that guides cross-modal attention between visual and textual features. This study validates ModernBERT as a fast, scalable alternative to Clinical BioBERT in diffusion-based segmentation pipelines and highlights the promise of multi-modal techniques for medical image analysis. By replacing Clinical BioBERT with ModernBERT, FastTextDiff benefits from FlashAttention 2, an alternating attention mechanism, and a 2-trillion-token corpus, improving both segmentation accuracy and training efficiency over traditional diffusion-based models.

Davide Nadalini, Manuele Rusci, Elia Cereda, Luca Benini, Francesco Conti

TL;DR: 该论文提出了一种多模态的片上学习（ODL）技术，用于在超低功耗（ULP）MCU上实现单目深度估计（MDE）。通过结合深度传感器和摄像头数据，系统在部署环境中动态生成伪标签并进行片上微调，显著提升了深度估计的准确性。

Details

Motivation: 为了解决单目深度估计在物联网（IoT）平台上因训练数据和实际传感器数据之间域偏移导致的精度下降问题，论文提出了一种高效的片上学习方法。

Result: 实验表明，该方法在KITTI和NYUv2数据集上的精度仅下降2%和1.5%，微调内存消耗减少2.2倍，并在17.8分钟内完成片上微调，将深度估计误差从4.9米显著降低至0.6米。

Insight: 该方法展示了在资源受限的IoT平台上动态适应新环境的潜力，为低功耗设备的实时深度估计提供了可行的解决方案。

Abstract: Monocular depth estimation (MDE) plays a crucial role in enabling spatially-aware applications in Ultra-low-power (ULP) Internet-of-Things (IoT) platforms. However, the limited number of parameters of Deep Neural Networks for the MDE task, designed for IoT nodes, results in severe accuracy drops when the sensor data observed in the field shifts significantly from the training dataset. To address this domain shift problem, we present a multi-modal On-Device Learning (ODL) technique, deployed on an IoT device integrating a Greenwaves GAP9 MicroController Unit (MCU), a 80 mW monocular camera and a 8 x 8 pixel depth sensor, consuming $\approx$300mW. In its normal operation, this setup feeds a tiny 107 k-parameter $μ$PyD-Net model with monocular images for inference. The depth sensor, usually deactivated to minimize energy consumption, is only activated alongside the camera to collect pseudo-labels when the system is placed in a new environment. Then, the fine-tuning task is performed entirely on the MCU, using the new data. To optimize our backpropagation-based on-device training, we introduce a novel memory-driven sparse update scheme, which minimizes the fine-tuning memory to 1.2 MB, 2.2x less than a full update, while preserving accuracy (i.e., only 2% and 1.5% drops on the KITTI and NYUv2 datasets). Our in-field tests demonstrate, for the first time, that ODL for MDE can be performed in 17.8 minutes on the IoT node, reducing the root mean squared error from 4.9 to 0.6m with only 3 k self-labeled samples, collected in a real-life deployment scenario.

[11] Exploring Automated Recognition of Instructional Activity and Discourse from Multimodal Classroom Data cs.CVPDF

Ivo Bueno, Ruikun Hou, Babette Bühler, Tim Fütterer, James Drimalla

TL;DR: 论文探讨了利用AI自动化分析课堂录像和转录的多模态数据，以识别教学活动和话语，为教师提供可操作的反馈。通过164小时视频和68节课的转录数据，设计了多模态并行处理流程，结果显示微调模型显著优于提示方法。

Details

Motivation: 当前课堂互动的观察依赖人工标注，费时费力且难以扩展。研究旨在开发AI驱动的自动化分析工具，为教师提供高效反馈。

Result: 微调模型在视频和转录分析中的macro-F1分数分别为0.577和0.460，显著优于提示方法。

Insight: 研究表明微调模型在多模态课堂数据分析中表现更优，为可扩展的教师反馈系统奠定了基础。

Abstract: Observation of classroom interactions can provide concrete feedback to teachers, but current methods rely on manual annotation, which is resource-intensive and hard to scale. This work explores AI-driven analysis of classroom recordings, focusing on multimodal instructional activity and discourse recognition as a foundation for actionable feedback. Using a densely annotated dataset of 164 hours of video and 68 lesson transcripts, we design parallel, modality-specific pipelines. For video, we evaluate zero-shot multimodal LLMs, fine-tuned vision-language models, and self-supervised video transformers on 24 activity labels. For transcripts, we fine-tune a transformer-based classifier with contextualized inputs and compare it against prompting-based LLMs on 19 discourse labels. To handle class imbalance and multi-label complexity, we apply per-label thresholding, context windows, and imbalance-aware loss functions. The results show that fine-tuned models consistently outperform prompting-based approaches, achieving macro-F1 scores of 0.577 for video and 0.460 for transcripts. These results demonstrate the feasibility of automated classroom analysis and establish a foundation for scalable teacher feedback systems.

[12] SemImage: Semantic Image Representation for Text, a Novel Framework for Embedding Disentangled Linguistic Features cs.CV | cs.LGPDF

Mohammad Zare

TL;DR: SemImage是一种新颖的文本表示方法，将文档转换为语义图像，通过HSV颜色空间解耦语言学特征，并使用CNN处理。在分类任务中表现出色，且增强了可解释性。

Details

Motivation: 传统文本处理方法通常依赖于序列模型（如RNN或Transformer），而SemImage旨在通过图像表示和CNN处理文本，同时解耦语言学特征（如主题和情感），以提高分类性能和可解释性。

Result: 实验表明SemImage在多标签和单标签文本分类任务中优于BERT和分层注意力网络等基线模型，并通过消融实验验证了HSV表示和动态边界行的重要性。

Insight: 1. 通过2D图像表示文本可以有效利用CNN的空间感知能力。
2. 解耦的语言学特征（HSV通道）提供了可视化的解释性，使主题和情感变化直观可见。

Abstract: We propose SemImage, a novel method for representing a text document as a two-dimensional semantic image to be processed by convolutional neural networks (CNNs). In a SemImage, each word is represented as a pixel in a 2D image: rows correspond to sentences and an additional boundary row is inserted between sentences to mark semantic transitions. Each pixel is not a typical RGB value but a vector in a disentangled HSV color space, encoding different linguistic features: the Hue with two components H_cos and H_sin to account for circularity encodes the topic, Saturation encodes the sentiment, and Value encodes intensity or certainty. We enforce this disentanglement via a multi-task learning framework: a ColorMapper network maps each word embedding to the HSV space, and auxiliary supervision is applied to the Hue and Saturation channels to predict topic and sentiment labels, alongside the main task objective. The insertion of dynamically computed boundary rows between sentences yields sharp visual boundaries in the image when consecutive sentences are semantically dissimilar, effectively making paragraph breaks salient. We integrate SemImage with standard 2D CNNs (e.g., ResNet) for document classification. Experiments on multi-label datasets (with both topic and sentiment annotations) and single-label benchmarks demonstrate that SemImage can achieve competitive or better accuracy than strong text classification baselines (including BERT and hierarchical attention networks) while offering enhanced interpretability. An ablation study confirms the importance of the multi-channel HSV representation and the dynamic boundary rows. Finally, we present visualizations of SemImage that qualitatively reveal clear patterns corresponding to topic shifts and sentiment changes in the generated image, suggesting that our representation makes these linguistic features visible to both humans and machines.

[13] TeleViT1.0: Teleconnection-aware Vision Transformers for Subseasonal to Seasonal Wildfire Pattern Forecasts cs.CVPDF

Ioannis Prapas, Nikolaos Papadopoulos, Nikolaos-Ioannis Bountos, Dimitrios Michail, Gustau Camps-Valls

TL;DR: TeleViT1.0是一种基于Transformer的模型，通过多尺度融合局部火灾驱动因子、全球场数据和遥相关指数，显著提升了季节性至亚季节性火灾预测的准确性，超越了传统方法。

Details

Motivation: 长期火灾预测需要综合考虑本地和全球尺度的因素，传统方法难以有效融合这些多尺度信息，因此需要一种新的架构来提高预测能力。

Result: TeleViT在所有预测时间范围内（包括四个月）的表现优于U-Net++、ViT和气候学方法，尤其在非洲稀树草原等季节性一致的火灾区域表现最佳。

Insight: 全局场和遥相关指数提供了粗粒度的上下文信息，而局部token主导了预测，表明显式编码大尺度地球系统背景的架构可以扩展火灾的可预测性。

Abstract: Forecasting wildfires weeks to months in advance is difficult, yet crucial for planning fuel treatments and allocating resources. While short-term predictions typically rely on local weather conditions, long-term forecasting requires accounting for the Earth’s interconnectedness, including global patterns and teleconnections. We introduce TeleViT, a Teleconnection-aware Vision Transformer that integrates (i) fine-scale local fire drivers, (ii) coarsened global fields, and (iii) teleconnection indices. This multi-scale fusion is achieved through an asymmetric tokenization strategy that produces heterogeneous tokens processed jointly by a transformer encoder, followed by a decoder that preserves spatial structure by mapping local tokens to their corresponding prediction patches. Using the global SeasFire dataset (2001-2021, 8-day resolution), TeleViT improves AUPRC performance over U-Net++, ViT, and climatology across all lead times, including horizons up to four months. At zero lead, TeleViT with indices and global inputs reaches AUPRC 0.630 (ViT 0.617, U-Net 0.620), at 16x8day lead (around 4 months), TeleViT variants using global input maintain 0.601-0.603 (ViT 0.582, U-Net 0.578), while surpassing the climatology (0.572) at all lead times. Regional results show the highest skill in seasonally consistent fire regimes, such as African savannas, and lower skill in boreal and arid regions. Attention and attribution analyses indicate that predictions rely mainly on local tokens, with global fields and indices contributing coarse contextual information. These findings suggest that architectures explicitly encoding large-scale Earth-system context can extend wildfire predictability on subseasonal-to-seasonal timescales.

[14] Comparative Analysis of Vision Transformer, Convolutional, and Hybrid Architectures for Mental Health Classification Using Actigraphy-Derived Images cs.CV | cs.LGPDF

Ifeanyi Okala

TL;DR: 论文比较了VGG16、ViT-B/16和CoAtNet-Tiny三种方法在基于腕部活动信号图像的心理健康分类中的表现，发现CoAtNet-Tiny在性能和稳定性上表现最佳。

Details

Motivation: 研究旨在探索不同图像分类方法在心理健康诊断中的适用性，尤其关注基于活动信号的图像数据。

Result: CoAtNet-Tiny在平均准确率、精确率、召回率和F1得分上表现最优，尤其是在少数类别（抑郁和精神分裂症）上表现突出。

Insight: 混合架构可能更适合基于活动信号图像的心理健康分类任务，而纯卷积或纯Transformer架构的表现则较为不稳定。

Abstract: This work examines how three different image-based methods, VGG16, ViT-B/16, and CoAtNet-Tiny, perform in identifying depression, schizophrenia, and healthy controls using daily actigraphy records. Wrist-worn activity signals from the Psykose and Depresjon datasets were converted into 30 by 48 images and evaluated through a three-fold subject-wise split. Although all methods fitted the training data well, their behaviour on unseen data differed. VGG16 improved steadily but often settled at lower accuracy. ViT-B/16 reached strong results in some runs, but its performance shifted noticeably from fold to fold. CoAtNet-Tiny stood out as the most reliable, recording the highest average accuracy and the most stable curves across folds. It also produced the strongest precision, recall, and F1-scores, particularly for the underrepresented depression and schizophrenia classes. Overall, the findings indicate that CoAtNet-Tiny performed most consistently on the actigraphy images, while VGG16 and ViT-B/16 yielded mixed results. These observations suggest that certain hybrid designs may be especially suited for mental-health work that relies on actigraphy-derived images.

[15] TinyViT: Field Deployable Transformer Pipeline for Solar Panel Surface Fault and Severity Screening cs.CV | eess.IVPDF

Ishwaryah Pandiarajan, Mohamed Mansoor Roomi Sindha, Uma Maheswari Pandyan, Sharafia N

TL;DR: TinyViT是一种紧凑的太阳能板表面故障检测系统，仅使用可见光图像实现故障分类和严重性评估，结合Transformer分割和特征工程，适用于资源有限的场景。

Details

Motivation: 太阳能光伏资产的高效运维需要低成本、可扩展的表面故障检测方法，传统多模态成像方案存在经济和物流限制。

Result: 在真实数据集上验证了分类和回归模块的准确性，与专用方法竞争且具有可解释性。

Insight: 低成本的可见光图像结合深度学习与经典机器学习，可替代昂贵的多模态传感器，适用于资源有限的环境。

Abstract: Sustained operation of solar photovoltaic assets hinges on accurate detection and prioritization of surface faults across vast, geographically distributed modules. While multi modal imaging strategies are popular, they introduce logistical and economic barriers for routine farm level deployment. This work demonstrates that deep learning and classical machine learning may be judiciously combined to achieve robust surface anomaly categorization and severity estimation from planar visible band imagery alone. We introduce TinyViT which is a compact pipeline integrating Transformer based segmentation, spectral-spatial feature engineering, and ensemble regression. The system ingests consumer grade color camera mosaics of PV panels, classifies seven nuanced surface faults, and generates actionable severity grades for maintenance triage. By eliminating reliance on electroluminescence or IR sensors, our method enables affordable, scalable upkeep for resource limited installations, and advances the state of solar health monitoring toward universal field accessibility. Experiments on real public world datasets validate both classification and regression sub modules, achieving accuracy and interpretability competitive with specialized approaches.

[16] Hybrid Synthetic Data Generation with Domain Randomization Enables Zero-Shot Vision-Based Part Inspection Under Extreme Class Imbalance cs.CV | cs.LGPDF

Ruo-Syuan Mei, Sixian Jia, Guangze Li, Soo Yeon Lee, Brian Musser

TL;DR: 该论文提出了一种混合合成数据生成框架，结合仿真渲染、域随机化和真实背景合成，实现了零样本视觉工业零件检测，无需手动标注，显著提升了极端类别不平衡下的检测性能。

Details

Motivation: 工业质量检测中，高质量标注数据难以获取且缺陷样本稀少，导致类别不平衡问题，限制了机器学习方法的广泛应用。合成数据生成为解决这一问题提供了高效、经济的解决方案。

Result: 在300个真实工业零件的测试集上，检测mAP@0.5达到0.995，分类准确率为96%，平衡准确率为90.1%，显著优于少样本基准方法。

Insight: 合成数据生成可有效解决工业检测中的数据不足和类别不平衡问题，且无需人工标注，为实际应用提供了高效、可扩展的解决方案。

Abstract: Machine learning, particularly deep learning, is transforming industrial quality inspection. Yet, training robust machine learning models typically requires large volumes of high-quality labeled data, which are expensive, time-consuming, and labor-intensive to obtain in manufacturing. Moreover, defective samples are intrinsically rare, leading to severe class imbalance that degrades model performance. These data constraints hinder the widespread adoption of machine learning-based quality inspection methods in real production environments. Synthetic data generation (SDG) offers a promising solution by enabling the creation of large, balanced, and fully annotated datasets in an efficient, cost-effective, and scalable manner. This paper presents a hybrid SDG framework that integrates simulation-based rendering, domain randomization, and real background compositing to enable zero-shot learning for computer vision-based industrial part inspection without manual annotation. The SDG pipeline generates 12,960 labeled images in one hour by varying part geometry, lighting, and surface properties, and then compositing synthetic parts onto real image backgrounds. A two-stage architecture utilizing a YOLOv8n backbone for object detection and MobileNetV3-small for quality classification is trained exclusively on synthetic data and evaluated on 300 real industrial parts. The proposed approach achieves an mAP@0.5 of 0.995 for detection, 96% classification accuracy, and 90.1% balanced accuracy. Comparative evaluation against few-shot real-data baseline approaches demonstrates significant improvement. The proposed SDG-based approach achieves 90-91% balanced accuracy under severe class imbalance, while the baselines reach only 50% accuracy. These results demonstrate that the proposed method enables annotation-free, scalable, and robust quality inspection for real-world manufacturing applications.

[17] AutocleanEEG ICVision: Automated ICA Artifact Classification Using Vision-Language AI cs.CV | cs.LG | eess.IV | q-bio.QMPDF

Zag ElSayed, Grace Westerkamp, Gavin Gammoh, Yanchen Liu, Peyton Siekierski

TL;DR: 论文介绍了ICVision，一种基于视觉-语言AI的EEG ICA成分自动分类系统，通过多模态大语言模型（GPT-4 Vision）直接解析ICA可视化数据，模仿专家分类行为，性能超越传统分类器ICLabel。

Details

Motivation: 传统EEG ICA成分分类依赖人工提取特征（如ICLabel），限制了分类的准确性和解释性。ICVision的创新在于利用AI的视觉和语言推理能力，模拟专家对EEG数据的直观理解。

Result: 在3,168个ICA成分上的评估显示，与专家共识的Kappa值达0.677，优于MNE ICLabel。97%的输出被专家评为可解释且可操作。

Insight: ICVision展示了AI在科学领域的新范式：不仅能分类，还能像专家一样‘看’和‘解释’数据，为神经生理学及其他领域的可扩展、可解释AI代理开辟了新方向。

Abstract: We introduce EEG Autoclean Vision Language AI (ICVision) a first-of-its-kind system that emulates expert-level EEG ICA component classification through AI-agent vision and natural language reasoning. Unlike conventional classifiers such as ICLabel, which rely on handcrafted features, ICVision directly interprets ICA dashboard visualizations topography, time series, power spectra, and ERP plots, using a multimodal large language model (GPT-4 Vision). This allows the AI to see and explain EEG components the way trained neurologists do, making it the first scientific implementation of AI-agent visual cognition in neurophysiology. ICVision classifies each component into one of six canonical categories (brain, eye, heart, muscle, channel noise, and other noise), returning both a confidence score and a human-like explanation. Evaluated on 3,168 ICA components from 124 EEG datasets, ICVision achieved k = 0.677 agreement with expert consensus, surpassing MNE ICLabel, while also preserving clinically relevant brain signals in ambiguous cases. Over 97% of its outputs were rated as interpretable and actionable by expert reviewers. As a core module of the open-source EEG Autoclean platform, ICVision signals a paradigm shift in scientific AI, where models do not just classify, but see, reason, and communicate. It opens the door to globally scalable, explainable, and reproducible EEG workflows, marking the emergence of AI agents capable of expert-level visual decision-making in brain science and beyond.

[18] DenseScan: Advancing 3D Scene Understanding with 2D Dense Annotation cs.CV | cs.AIPDF

Zirui Wang, Tao Zhang

TL;DR: DenseScan是一个新颖的3D场景理解数据集，通过自动化流程生成密集的多级描述，结合2D图像和多模态大语言模型（MLLMs），提供丰富的语义标注和场景级问答任务支持。

Details

Motivation: 当前3D场景理解数据集缺乏丰富的语义标注，限制了视觉-语言任务的细致发展。DenseScan旨在填补这一空白，提供更全面的对象级描述和场景上下文。

Result: 实验表明，DenseScan显著提升了对象级理解和问答性能，优于传统标注方法。

Insight: DenseScan展示了语义标注与几何细节结合的重要性，为机器人、增强现实等领域提供了更丰富的3D场景理解工具。

Abstract: 3D understanding is a key capability for real-world AI assistance. High-quality data plays an important role in driving the development of the 3D understanding community. Current 3D scene understanding datasets often provide geometric and instance-level information, yet they lack the rich semantic annotations necessary for nuanced visual-language tasks.In this work, we introduce DenseScan, a novel dataset with detailed multi-level descriptions generated by an automated pipeline leveraging multi-view 2D images and multimodal large language models (MLLMs). Our approach enables dense captioning of scene elements, ensuring comprehensive object-level descriptions that capture context-sensitive details. Furthermore, we extend these annotations through scenario-based question generation, producing high-level queries that integrate object properties, spatial relationships, and scene context. By coupling geometric detail with semantic richness, DenseScan broadens the range of downstream tasks, from detailed visual-language navigation to interactive question answering. Experimental results demonstrate that our method significantly enhances object-level understanding and question-answering performance in 3D environments compared to traditional annotation pipelines. We release both the annotated dataset and our annotation pipeline to facilitate future research and applications in robotics, augmented reality, and beyond. Through DenseScan, we aim to catalyze new avenues in 3D scene understanding, allowing researchers and practitioners to tackle the complexities of real-world environments with richer, more contextually aware annotations.

[19] Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views cs.CVPDF

Kunwar Maheep Singh, Jianchun Chen, Vladislav Golyanik, Stephan J. Garbin, Thabo Beeler

TL;DR: 提出了Relightable Holoported Characters（RHC），一种基于稀疏视图RGB视频的新方法，用于动态人体的自由视角渲染和重新光照。该方法利用Transformer架构的RelightNet一步预测重光照效果，避免了传统OLAT方法的昂贵计算开销。

Details

Motivation: 传统的人体重光照方法（如OLAT）需要逐帧捕获光照基，计算成本高昂且无法实时处理动态人体。RHC旨在通过稀疏视图输入和单次网络推理实现高效且高质量的重光照。

Result: 实验表明，RHC在视觉逼真度和光照重现上优于现有方法，支持高效的单次推理。

Insight: 结合物理特征和Transformer架构的RHC展示了在稀疏视图下实现高质量动态人体重光照的潜力，为实时应用提供了新思路。

Abstract: We present Relightable Holoported Characters (RHC), a novel person-specific method for free-view rendering and relighting of full-body and highly dynamic humans solely observed from sparse-view RGB videos at inference. In contrast to classical one-light-at-a-time (OLAT)-based human relighting, our transformer-based RelightNet predicts relit appearance within a single network pass, avoiding costly OLAT-basis capture and generation. For training such a model, we introduce a new capture strategy and dataset recorded in a multi-view lightstage, where we alternate frames lit by random environment maps with uniformly lit tracking frames, simultaneously enabling accurate motion tracking and diverse illumination as well as dynamics coverage. Inspired by the rendering equation, we derive physics-informed features that encode geometry, albedo, shading, and the virtual camera view from a coarse human mesh proxy and the input views. Our RelightNet then takes these features as input and cross-attends them with a novel lighting condition, and regresses the relit appearance in the form of texel-aligned 3D Gaussian splats attached to the coarse mesh proxy. Consequently, our RelightNet implicitly learns to efficiently compute the rendering equation for novel lighting conditions within a single feed-forward pass. Experiments demonstrate our method’s superior visual fidelity and lighting reproduction compared to state-of-the-art approaches. Project page: https://vcai.mpi-inf.mpg.de/projects/RHC/

Yuzhen Hu, Saurabh Prasad

TL;DR: UniDiff提出了一种参数高效的框架，通过适应ImageNet预训练的扩散模型来处理多模态遥感数据，解决了稀疏标注的问题，并在多种模态数据上取得了良好的效果。

Details

Motivation: 多模态遥感数据的稀疏标注限制了现有监督方法的性能，而UniDiff旨在通过参数高效的扩散模型适应来解决这一问题。

Result: 在两个多模态遥感基准数据集上的实验表明，UniDiff能够有效缓解标注稀疏问题，实现对多模态数据的有效融合和特征提取。

Insight: 通过参数高效的设计和伪RGB锚定技术，UniDiff展示了预训练扩散模型在多模态遥感任务中的潜力，同时避免了灾难性遗忘。

Abstract: Sparse annotations fundamentally constrain multimodal remote sensing: even recent state-of-the-art supervised methods such as MSFMamba are limited by the availability of labeled data, restricting their practical deployment despite architectural advances. ImageNet-pretrained models provide rich visual representations, but adapting them to heterogeneous modalities such as hyperspectral imaging (HSI) and synthetic aperture radar (SAR) without large labeled datasets remains challenging. We propose UniDiff, a parameter-efficient framework that adapts a single ImageNet-pretrained diffusion model to multiple sensing modalities using only target-domain data. UniDiff combines FiLM-based timestep-modality conditioning, parameter-efficient adaptation of approximately 5% of parameters, and pseudo-RGB anchoring to preserve pre-trained representations and prevent catastrophic forgetting. This design enables effective feature extraction from remote sensing data under sparse annotations. Our results with two established multi-modal benchmarking datasets demonstrate that unsupervised adaptation of a pre-trained diffusion model effectively mitigates annotation constraints and achieves effective fusion of multi-modal remotely sensed data.

[21] HeartFormer: Semantic-Aware Dual-Structure Transformers for 3D Four-Chamber Cardiac Point Cloud Reconstruction cs.CVPDF

Zhengda Ma, Abhirup Banerjee

TL;DR: 该论文提出了一种基于点云的几何深度学习框架HeartFormer，用于从cine MRI数据重建3D四腔心脏点云，解决了传统cine MRI只能提供2D图像的局限。

Details

Motivation: 传统cine MRI仅提供2D切片图像，限制了心脏形态学和生理机制的全面理解。论文旨在通过3D点云重建填补这一空白。

Result: 在HeartCompv1和UK Biobank上的实验表明，HeartFormer表现稳健、准确且泛化能力强，优于现有SOTA方法。

Insight: 结合语义和几何特征的Transformer网络能够显著提升3D心脏点云的重建质量，为心脏形态学研究提供了新工具。

Abstract: We present the first geometric deep learning framework based on point cloud representation for 3D four-chamber cardiac reconstruction from cine MRI data. This work addresses a long-standing limitation in conventional cine MRI, which typically provides only 2D slice images of the heart, thereby restricting a comprehensive understanding of cardiac morphology and physiological mechanisms in both healthy and pathological conditions. To overcome this, we propose \textbf{HeartFormer}, a novel point cloud completion network that extends traditional single-class point cloud completion to the multi-class. HeartFormer consists of two key components: a Semantic-Aware Dual-Structure Transformer Network (SA-DSTNet) and a Semantic-Aware Geometry Feature Refinement Transformer Network (SA-GFRTNet). SA-DSTNet generates an initial coarse point cloud with both global geometry features and substructure geometry features. Guided by these semantic-geometry representations, SA-GFRTNet progressively refines the coarse output, effectively leveraging both global and substructure geometric priors to produce high-fidelity and geometrically consistent reconstructions. We further construct \textbf{HeartCompv1}, the first publicly available large-scale dataset with 17,000 high-resolution 3D multi-class cardiac meshes and point-clouds, to establish a general benchmark for this emerging research direction. Extensive cross-domain experiments on HeartCompv1 and UK Biobank demonstrate that HeartFormer achieves robust, accurate, and generalizable performance, consistently surpassing state-of-the-art (SOTA) methods. Code and dataset will be released upon acceptance at: https://github.com/10Darren/HeartFormer.

[22] Words into World: A Task-Adaptive Agent for Language-Guided Spatial Retrieval in AR cs.CV | cs.AI | cs.HCPDF

Lixing Guo, Tobias Höllerer

TL;DR: 该论文提出了一个模块化的AR代理系统，结合多模态大语言模型（MLLMs）和空间感知工具，支持复杂自然语言查询的空间推理与检索，并动态生成准确的3D锚点。

Details

Motivation: 传统AR系统通常依赖固定的类别检测器或标记物，无法处理开放词汇的自然语言查询，限制了其在复杂环境中的应用。

Result: 系统能够处理从简单物体识别到多物体关系推理的复杂查询，支持米级精度的3D定位。

Insight: 通过结合MLLMs的空间推理能力和AR的实时感知能力，可以实现更灵活、交互性更强的场景理解。

Abstract: Traditional augmented reality (AR) systems predominantly rely on fixed class detectors or fiducial markers, limiting their ability to interpret complex, open-vocabulary natural language queries. We present a modular AR agent system that integrates multimodal large language models (MLLMs) with grounded vision models to enable relational reasoning in space and language-conditioned spatial retrieval in physical environments. Our adaptive task agent coordinates MLLMs and coordinate-aware perception tools to address varying query complexities, ranging from simple object identification to multi-object relational reasoning, while returning meter-accurate 3D anchors. It constructs dynamic AR scene graphs encoding nine typed relations (spatial, structural-semantic, causal-functional), enabling MLLMs to understand not just what objects exist, but how they relate and interact in 3D space. Through task-adaptive region-of-interest highlighting and contextual spatial retrieval, the system guides human attention to information-dense areas while supporting human-in-the-loop refinement. The agent dynamically invokes coordinate-aware tools for complex queries-selection, measurement, comparison, and actuation-grounding language understanding in physical operations. The modular architecture supports plug-and-use vision-language models without retraining, establishing AR agents as intermediaries that augment MLLMs with real-world spatial intelligence for interactive scene understanding. We also introduce GroundedAR-Bench, an evaluation framework for language-driven real world localization and relation grounding across diverse environments.

[23] TGSFormer: Scalable Temporal Gaussian Splatting for Embodied Semantic Scene Completion cs.CVPDF

Rui Qian, Haozhi Cao, Tianchen Deng, Tianxin Hu, Weixiang Guo

TL;DR: TGSFormer是一个可扩展的时序高斯泼溅框架，用于解决嵌入式3D语义场景补全中的冗余和扩展性问题。

Details

Motivation: 现有高斯方法在无边界场景中因随机初始化导致冗余和扩展性差，传统深度引导方法也存在局部性和内存开销问题。

Result: 在局部和嵌入式SSC任务中达到SOTA性能，显著减少基元数量且保持场景完整性。

Insight: 持久高斯记忆和时序融合是提高嵌入式场景重建效率和扩展性的关键。

Abstract: Embodied 3D Semantic Scene Completion (SSC) infers dense geometry and semantics from continuous egocentric observations. Most existing Gaussian-based methods rely on random initialization of many primitives within predefined spatial bounds, resulting in redundancy and poor scalability to unbounded scenes. Recent depth-guided approach alleviates this issue but remains local, suffering from latency and memory overhead as scale increases. To overcome these challenges, we propose TGSFormer, a scalable Temporal Gaussian Splatting framework for embodied SSC. It maintains a persistent Gaussian memory for temporal prediction, without relying on image coherence or frame caches. For temporal fusion, a Dual Temporal Encoder jointly processes current and historical Gaussian features through confidence-aware cross-attention. Subsequently, a Confidence-aware Voxel Fusion module merges overlapping primitives into voxel-aligned representations, regulating density and maintaining compactness. Extensive experiments demonstrate that TGSFormer achieves state-of-the-art results on both local and embodied SSC benchmarks, offering superior accuracy and scalability with significantly fewer primitives while maintaining consistent long-term scene integrity. The code will be released upon acceptance.

[24] Optimizing Distributional Geometry Alignment with Optimal Transport for Generative Dataset Distillation cs.CVPDF

Xiao Cui, Yulei Qin, Wengang Zhou, Hongsheng Li, Houqiang Li

TL;DR: 该论文提出了一种基于最优传输（OT）的数据集蒸馏方法，通过全局和实例级别的细粒度对齐，解决现有方法忽视实例级特征和类内变化的问题，显著提升了性能。

Details

Motivation: 现有的大规模数据集蒸馏方法仅匹配全局统计量（如均值和方差），忽略了实例级特征和类内变化，导致泛化能力不足。作者通过引入最优传输（OT）距离最小化问题，实现对分布几何的细粒度对齐。

Result: 在ImageNet-1K等大规模数据集上，该方法在IPC=10设置下比其他方法至少提升了4%的准确率。

Insight: 最优传输提供了一种几何保真的分布匹配框架，能够有效保留复杂高维分布的局部模式和类内变化。该方法为数据集蒸馏提供了更细粒度的对齐思路。

Abstract: Dataset distillation seeks to synthesize a compact distilled dataset, enabling models trained on it to achieve performance comparable to models trained on the full dataset. Recent methods for large-scale datasets focus on matching global distributional statistics (e.g., mean and variance), but overlook critical instance-level characteristics and intraclass variations, leading to suboptimal generalization. We address this limitation by reformulating dataset distillation as an Optimal Transport (OT) distance minimization problem, enabling fine-grained alignment at both global and instance levels throughout the pipeline. OT offers a geometrically faithful framework for distribution matching. It effectively preserves local modes, intra-class patterns, and fine-grained variations that characterize the geometry of complex, high-dimensional distributions. Our method comprises three components tailored for preserving distributional geometry: (1) OT-guided diffusion sampling, which aligns latent distributions of real and distilled images; (2) label-image-aligned soft relabeling, which adapts label distributions based on the complexity of distilled image distributions; and (3) OT-based logit matching, which aligns the output of student models with soft-label distributions. Extensive experiments across diverse architectures and large-scale datasets demonstrate that our method consistently outperforms state-of-the-art approaches in an efficient manner, achieving at least 4% accuracy improvement under IPC=10 settings for each architecture on ImageNet-1K.

[25] ART-ASyn: Anatomy-aware Realistic Texture-based Anomaly Synthesis Framework for Chest X-Rays cs.CVPDF

Qinyi Cao, Jianan Fan, Weidong Cai

TL;DR: 本文提出了一种名为ART-ASyn的新框架，用于合成胸部X光片中真实且解剖学一致的异常纹理。通过改进的PBTSeg方法引导纹理增强，生成异常样本及其精确掩码，支持显式分割监督，并在未见数据集上展示了零样本异常分割的泛化能力。

Details

Motivation: 现有方法合成的异常在视觉上与真实病理模式存在差异，且忽略了解剖结构的一致性，限制了其在医学图像异常检测中的应用效果。

Result: 在高仿真异常生成和零样本分割任务上表现出色，验证了其在未见数据集上的泛化能力。

Insight: 解剖学意识和纹理增强的结合是提升医学图像异常检测性能的关键。

Abstract: Unsupervised anomaly detection aims to identify anomalies without pixel-level annotations. Synthetic anomaly-based methods exhibit a unique capacity to introduce controllable irregularities with known masks, enabling explicit supervision during training. However, existing methods often produce synthetic anomalies that are visually distinct from real pathological patterns and ignore anatomical structure. This paper presents a novel Anatomy-aware Realistic Texture-based Anomaly Synthesis framework (ART-ASyn) for chest X-rays that generates realistic and anatomically consistent lung opacity related anomalies using texture-based augmentation guided by our proposed Progressive Binary Thresholding Segmentation method (PBTSeg) for lung segmentation. The generated paired samples of synthetic anomalies and their corresponding precise pixel-level anomaly mask for each normal sample enable explicit segmentation supervision. In contrast to prior work limited to one-class classification, ART-ASyn is further evaluated for zero-shot anomaly segmentation, demonstrating generalizability on an unseen dataset without target-domain annotations. Code availability is available at https://github.com/angelacao-hub/ART-ASyn.

[26] Odometry Without Correspondence from Inertially Constrained Ruled Surfaces cs.CVPDF

Chenqi Zhu, Levi Burner, Yiannis Aloimonos

TL;DR: 该论文提出了一种新的视觉里程计算法，利用直线特征和惯性测量单元（IMU）数据，通过分析图像空间中的直线运动轨迹（规则曲面）来估计运动，避免了传统点对点对应的计算成本和不准确性。

Details

Motivation: 传统的视觉里程计依赖于点对点的特征对应，计算成本高且精度不稳定。为克服这一问题，研究者尝试利用直线特征和多传感器融合，但仍难以完全摆脱对应问题的困扰。本文提出通过分析图像中直线运动的轨迹（规则曲面）来估计运动，结合IMU数据降低求解空间的维度。

Result: 该方法能够高效地重建3D场景并估计视觉里程计，避免了传统点对点对应的计算开销和不准确性，且精度得到提升。

Insight: 利用直线运动的几何特征（规则曲面）结合IMU数据，可以有效解决传统视觉里程计中的对应问题，同时提高计算效率和鲁棒性。

Abstract: Visual odometry techniques typically rely on feature extraction from a sequence of images and subsequent computation of optical flow. This point-to-point correspondence between two consecutive frames can be costly to compute and suffers from varying accuracy, which affects the odometry estimate’s quality. Attempts have been made to bypass the difficulties originating from the correspondence problem by adopting line features and fusing other sensors (event camera, IMU) to improve performance, many of which still heavily rely on correspondence. If the camera observes a straight line as it moves, the image of the line sweeps a smooth surface in image-space time. It is a ruled surface and analyzing its shape gives information about odometry. Further, its estimation requires only differentially computed updates from point-to-line associations. Inspired by event cameras’ propensity for edge detection, this research presents a novel algorithm to reconstruct 3D scenes and visual odometry from these ruled surfaces. By constraining the surfaces with the inertia measurements from an onboard IMU sensor, the dimensionality of the solution space is greatly reduced.

[27] MVAD : A Comprehensive Multimodal Video-Audio Dataset for AIGC Detection cs.CVPDF

Mengxue Hu, Yunfeng Diao, Changtao Miao, Jianshu Li, Zhe Li

TL;DR: MVAD是首个专注于检测AI生成的多模态视频-音频内容的综合数据集，填补了现有数据集在多模态检测上的空白。

Details

Motivation: AI生成的多模态视频-音频内容快速发展，但现有数据集仅关注视觉模态或局限于面部深度伪造，限制了可信检测系统的开发。

Result: MVAD数据集为AI生成多模态内容的检测提供了首个综合性基准。

Insight: 多模态检测需要更全面的数据集支持，MVAD填补了这一领域的关键空白。

Abstract: The rapid advancement of AI-generated multimodal video-audio content has raised significant concerns regarding information security and content authenticity. Existing synthetic video datasets predominantly focus on the visual modality alone, while the few incorporating audio are largely confined to facial deepfakes–a limitation that fails to address the expanding landscape of general multimodal AI-generated content and substantially impedes the development of trustworthy detection systems. To bridge this critical gap, we introduce the Multimodal Video-Audio Dataset (MVAD), the first comprehensive dataset specifically designed for detecting AI-generated multimodal video-audio content. Our dataset exhibits three key characteristics: (1) genuine multimodality with samples generated according to three realistic video-audio forgery patterns; (2) high perceptual quality achieved through diverse state-of-the-art generative models; and (3) comprehensive diversity spanning realistic and anime visual styles, four content categories (humans, animals, objects, and scenes), and four video-audio multimodal data types. Our dataset will be available at https://github.com/HuMengXue0104/MVAD.

[28] Assimilation Matters: Model-level Backdoor Detection in Vision-Language Pretrained Models cs.CVPDF

Zhongqi Wang, Jie Zhang, Shiguang Shan, Xilin Chen

TL;DR: AMDET是一种无需训练数据集、触发器或下游分类器等先验知识的模型级后门检测框架，利用梯度反演揭示特征同化特性，有效检测视觉语言预训练模型中的后门。

Details

Motivation: 视觉语言预训练模型（如CLIP）易受后门攻击，现有检测方法依赖过多先验知识，难以实际应用。AMDET旨在解决这一局限性，提出无需先验知识的检测方法。

Result: 在3,600个后门模型上的实验表明，AMDET的F1分数达89.90%，单次检测仅需5分钟，且对自适应攻击具有强鲁棒性。

Insight: 特征同化是后门行为的核心特征，梯度反演可用于高效恢复潜在触发器，损失景观分析有助于区分自然与恶意后门。

Abstract: Vision-language pretrained models (VLPs) such as CLIP have achieved remarkable success, but are also highly vulnerable to backdoor attacks. Given a model fine-tuned by an untrusted third party, determining whether the model has been injected with a backdoor is a critical and challenging problem. Existing detection methods usually rely on prior knowledge of training dataset, backdoor triggers and targets, or downstream classifiers, which may be impractical for real-world applications. To address this, To address this challenge, we introduce Assimilation Matters in DETection (AMDET), a novel model-level detection framework that operates without any such prior knowledge. Specifically, we first reveal the feature assimilation property in backdoored text encoders: the representations of all tokens within a backdoor sample exhibit a high similarity. Further analysis attributes this effect to the concentration of attention weights on the trigger token. Leveraging this insight, AMDET scans a model by performing gradient-based inversion on token embeddings to recover implicit features that capable of activating backdoor behaviors. Furthermore, we identify the natural backdoor feature in the OpenAI’s official CLIP model, which are not intentionally injected but still exhibit backdoor-like behaviors. We then filter them out from real injected backdoor by analyzing their loss landscapes. Extensive experiments on 3,600 backdoored and benign-finetuned models with two attack paradigms and three VLP model structures show that AMDET detects backdoors with an F1 score of 89.90%. Besides, it achieves one complete detection in approximately 5 minutes on a RTX 4090 GPU and exhibits strong robustness against adaptive attacks. Code is available at: https://github.com/Robin-WZQ/AMDET

[29] mmPred: Radar-based Human Motion Prediction in the Dark cs.CVPDF

Junqiao Fan, Haocong Rao, Jiarui Zhang, Jianfei Yang, Lihua Xie

TL;DR: mmPred是首个基于毫米波雷达的扩散框架，用于黑暗环境下的人体运动预测，解决了雷达信号的多径效应和噪声问题，并通过双域历史运动表示和全局骨架关系Transformer提升性能。

Details

Motivation: 现有基于RGB-D相机的HMP方法对光照敏感且存在隐私问题，限制了其在消防、医疗等实际场景的应用。毫米波雷达因其鲁棒性和隐私保护特性成为新选择，但其信号噪声和多径效应带来了挑战。

Result: 在mmBody和mm-Fi数据集上分别以8.6%和22%的优势超越现有方法。

Insight: 雷达信号虽噪声多，但通过双域表示和全局建模可有效提升预测性能，为黑暗环境和隐私敏感场景提供了新思路。

Abstract: Existing Human Motion Prediction (HMP) methods based on RGB-D cameras are sensitive to lighting conditions and raise privacy concerns, limiting their real-world applications such as firefighting and healthcare. Motivated by the robustness and privacy-preserving nature of millimeter-wave (mmWave) radar, this work introduces radar as a novel sensing modality for HMP, for the first time. Nevertheless, radar signals often suffer from specular reflections and multipath effects, resulting in noisy and temporally inconsistent measurements, such as body-part miss-detection. To address these radar-specific artifacts, we propose mmPred, the first diffusion-based framework tailored for radar-based HMP. mmPred introduces a dual-domain historical motion representation to guide the generation process, combining a Time-domain Pose Refinement (TPR) branch for learning fine-grained details and a Frequency-domain Dominant Motion (FDM) branch for capturing global motion trends and suppressing frame-level inconsistency. Furthermore, we design a Global Skeleton-relational Transformer (GST) as the diffusion backbone to model global inter-joint cooperation, enabling corrupted joints to dynamically aggregate information from others. Extensive experiments show that mmPred achieves state-of-the-art performance, outperforming existing methods by 8.6% on mmBody and 22% on mm-Fi.

[30] MM-DETR: An Efficient Multimodal Detection Transformer with Mamba-Driven Dual-Granularity Fusion and Frequency-Aware Modality Adapters cs.CVPDF

Jianhong Han, Yupei Wang, Yuan Zhang, Liang Chen

TL;DR: MM-DETR是一种轻量高效的多模态目标检测框架，通过Mamba驱动的双粒度融合和频率感知模态适配器，解决了现有方法在性能和轻量设计上的不平衡问题。

Details

Motivation: 现有方法在融合互补模态信息时，难以平衡性能和轻量化设计，且共享主干网络或双流架构存在特征提取不足或参数冗余问题。

Result: 在四个多模态基准数据集上的实验证明了方法的有效性和泛化能力。

Insight: 1. 线性复杂度的跨模态建模是高效多模态融合的关键；2. 模态补全视角有助于细粒度融合；3. 空间-频率协同结构可兼顾模态特异性和轻量化。

Abstract: Multimodal remote sensing object detection aims to achieve more accurate and robust perception under challenging conditions by fusing complementary information from different modalities. However, existing approaches that rely on attention-based or deformable convolution fusion blocks still struggle to balance performance and lightweight design. Beyond fusion complexity, extracting modality features with shared backbones yields suboptimal representations due to insufficient modality-specific modeling, whereas dual-stream architectures nearly double the parameter count, ultimately limiting practical deployment. To this end, we propose MM-DETR, a lightweight and efficient framework for multimodal object detection. Specifically, we propose a Mamba-based dual granularity fusion encoder that reformulates global interaction as channel-wise dynamic gating and leverages a 1D selective scan for efficient cross-modal modeling with linear complexity. Following this design, we further reinterpret multimodal fusion as a modality completion problem. A region-aware 2D selective scanning completion branch is introduced to recover modality-specific cues, supporting fine-grained fusion along a bidirectional pyramid pathway with minimal overhead. To further reduce parameter redundancy while retaining strong feature extraction capability, a lightweight frequency-aware modality adapter is inserted into the shared backbone. This adapter employs a spatial-frequency co-expert structure to capture modality-specific cues, while a pixel-wise router dynamically balances expert contributions for efficient spatial-frequency fusion. Extensive experiments conducted on four multimodal benchmark datasets demonstrate the effectiveness and generalization capability of the proposed method.

[31] Towards aligned body representations in vision models cs.CV | cs.AIPDF

Andrey Gizdov, Andrea Procopio, Yichen Li, Daniel Harari, Tomer Ullman

TL;DR: 该论文研究了视觉模型中是否会出现与人类类似的粗粒度身体表征。通过心理物理学实验和语义分割任务，作者发现小模型自然地形成了人类相似的粗粒度表征，而大模型则倾向于过于细粒度的编码。

Details

Motivation: 人类依赖粗粒度的身体表征进行物理推理，但这些表征的内部结构尚不清楚。作者希望通过计算机视觉模型探索是否存在类似的表征，并对比大小模型的表现。

Result: 结果表明，小模型自然地形成了人类类似的粗粒度表征，而大模型则倾向于细粒度编码。

Insight: 粗粒度表征可能在资源有限的条件下更容易形成，这为理解人类大脑中的物理推理机制提供了可扩展的途径。

Abstract: Human physical reasoning relies on internal “body” representations - coarse, volumetric approximations that capture an object’s extent and support intuitive predictions about motion and physics. While psychophysical evidence suggests humans use such coarse representations, their internal structure remains largely unknown. Here we test whether vision models trained for segmentation develop comparable representations. We adapt a psychophysical experiment conducted with 50 human participants to a semantic segmentation task and test a family of seven segmentation networks, varying in size. We find that smaller models naturally form human-like coarse body representations, whereas larger models tend toward overly detailed, fine-grain encodings. Our results demonstrate that coarse representations can emerge under limited computational resources, and that machine representations can provide a scalable path toward understanding the structure of physical reasoning in the brain.

[32] THCRL: Trusted Hierarchical Contrastive Representation Learning for Multi-View Clustering cs.CVPDF

Jian Zhu

TL;DR: THCRL是一种新的多层次对比表示学习方法，用于解决多视图聚类中的不可信融合问题，通过两个关键模块实现高效且可靠的多视图数据融合。

Details

Motivation: 多视图聚类中，现有的方法常忽略视图内部的噪声以及忽视同一簇内最近邻的结构信息，导致融合结果不可信。

Result: THCRL在多视图聚类任务中取得了最先进的性能。

Insight: 强调了噪声处理和簇内结构信息的重要性，为可信的多视图融合提供了新思路。

Abstract: Multi-View Clustering (MVC) has garnered increasing attention in recent years. It is capable of partitioning data samples into distinct groups by learning a consensus representation. However, a significant challenge remains: the problem of untrustworthy fusion. This problem primarily arises from two key factors: 1) Existing methods often ignore the presence of inherent noise within individual views; 2) In traditional MVC methods using Contrastive Learning (CL), similarity computations typically rely on different views of the same instance, while neglecting the structural information from nearest neighbors within the same cluster. Consequently, this leads to the wrong direction for multi-view fusion. To address this problem, we present a novel Trusted Hierarchical Contrastive Representation Learning (THCRL). It consists of two key modules. Specifically, we propose the Deep Symmetry Hierarchical Fusion (DSHF) module, which leverages the UNet architecture integrated with multiple denoising mechanisms to achieve trustworthy fusion of multi-view data. Furthermore, we present the Average K-Nearest Neighbors Contrastive Learning (AKCL) module to align the fused representation with the view-specific representation. Unlike conventional strategies, AKCL enhances representation similarity among samples belonging to the same cluster, rather than merely focusing on the same sample across views, thereby reinforcing the confidence of the fused representation. Extensive experiments demonstrate that THCRL achieves the state-of-the-art performance in deep MVC tasks.

[33] WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing cs.CVPDF

Kaihang Pan, Weile Chen, Haiyi Qiu, Qifan Yu, Wendong Bu

TL;DR: WiseEdit是一个综合评估认知和创意驱动的图像编辑能力的基准，通过将其分解为三个递进步骤（感知、解释与想象）并结合三类知识（陈述性、程序性与元认知知识），揭示了当前先进模型在知识和创意推理方面的局限性。

Details

Motivation: 现有基准对认知和创意驱动的图像编辑能力评估过于狭窄，无法全面反映模型的智能水平。

Result: 基准测试客观展示了当前先进模型在知识推理和创造性方面的短板。

Insight: 认知和创意驱动的图像编辑需要结合多层次知识和推理能力，现有模型仍需改进。

Abstract: Recent image editing models boast next-level intelligent capabilities, facilitating cognition- and creativity-informed image editing. Yet, existing benchmarks provide too narrow a scope for evaluation, failing to holistically assess these advanced abilities. To address this, we introduce WiseEdit, a knowledge-intensive benchmark for comprehensive evaluation of cognition- and creativity-informed image editing, featuring deep task depth and broad knowledge breadth. Drawing an analogy to human cognitive creation, WiseEdit decomposes image editing into three cascaded steps, i.e., Awareness, Interpretation, and Imagination, each corresponding to a task that poses a challenge for models to complete at the specific step. It also encompasses complex tasks, where none of the three steps can be finished easily. Furthermore, WiseEdit incorporates three fundamental types of knowledge: Declarative, Procedural, and Metacognitive knowledge. Ultimately, WiseEdit comprises 1,220 test cases, objectively revealing the limitations of SoTA image editing models in knowledge-based cognitive reasoning and creative composition capabilities. The benchmark, evaluation code, and the generated images of each model will be made publicly available soon. Project Page: https://qnancy.github.io/wiseedit_project_page/.

[34] Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction cs.CVPDF

Jiazhen Liu, Mingkuan Feng, Long Chen

TL;DR: 该论文提出了一种新颖的范式STAMP，通过在MLLM中同时进行文本对话和非自回归的分割掩码预测，解决了对话能力、分割性能和推理速度之间的三难问题。

Details

Motivation: 在MLLM中集成分割任务时，现有方法难以同时保持对话能力、高分割性能和快速推理。这种三难问题促使研究者提出新的解决方案。

Result: STAMP在多个分割基准测试中表现优异，同时保持了对话能力和快速的推理速度。

Insight: 该研究表明，通过任务解耦和并行预测机制，可以在MLLM中高效地集成复杂的视觉任务，而不牺牲其核心能力。

Abstract: Integrating segmentation into Multimodal Large Language Models (MLLMs) presents a core trilemma: simultaneously preserving dialogue ability, achieving high segmentation performance, and ensuring fast inference. Prevailing paradigms are forced into a compromise. Embedding prediction methods introduce a conflicting pixel-level objective that degrades the MLLM’s general dialogue abilities. The alternative, next-token prediction, reframes segmentation as an autoregressive task, which preserves dialogue but forces a trade-off between poor segmentation performance with sparse outputs or prohibitive inference speeds with rich ones. We resolve this trilemma with all-mask prediction, a novel paradigm that decouples autoregressive dialogue generation from non-autoregressive mask prediction. We present STAMP: Simultaneous Textual All-Mask Prediction, an MLLM that embodies this paradigm. After generating a textual response, STAMP predicts an entire segmentation mask in a single forward pass by treating it as a parallel “fill-in-the-blank” task over image patches. This design maintains the MLLM’s dialogue ability by avoiding conflicting objectives, enables high segmentation performance by leveraging rich, bidirectional spatial context for all mask tokens, and achieves exceptional speed. Extensive experiments show that STAMP significantly outperforms state-of-the-art methods across multiple segmentation benchmarks, providing a solution that excels in dialogue, segmentation, and speed without compromise.

[35] Low-Bitrate Video Compression through Semantic-Conditioned Diffusion cs.CV | cs.AIPDF

Lingdong Wang, Guan-Ming Su, Divya Kothandaraman, Tsung-Wei Huang, Mohammad Hajiesmaili

TL;DR: 论文提出了一种名为DiSCo的语义视频压缩框架，通过生成先验合成细节，仅传输最关键的信息，从而在超低比特率下优于传统和基线语义编解码器。

Details

Motivation: 传统视频编解码器在超低比特率下因过度依赖像素保真度而产生严重失真，与人类感知不一致。DiSCo旨在通过语义和生成方法的结合解决这一问题。

Result: 实验显示，DiSCo在低比特率下相比传统和基线语义编解码器，感知指标提升了2-10倍。

Insight: 人类感知更关注语义而非像素精度，利用生成模型合成细节可显著提升低比特率下的视频质量。

Abstract: Traditional video codecs optimized for pixel fidelity collapse at ultra-low bitrates and produce severe artifacts. This failure arises from a fundamental misalignment between pixel accuracy and human perception. We propose a semantic video compression framework named DiSCo that transmits only the most meaningful information while relying on generative priors for detail synthesis. The source video is decomposed into three compact modalities: a textual description, a spatiotemporally degraded video, and optional sketches or poses that respectively capture semantic, appearance, and motion cues. A conditional video diffusion model then reconstructs high-quality, temporally coherent videos from these compact representations. Temporal forward filling, token interleaving, and modality-specific codecs are proposed to improve multimodal generation and modality compactness. Experiments show that our method outperforms baseline semantic and traditional codecs by 2-10X on perceptual metrics at low bitrates.

[36] SplatFont3D: Structure-Aware Text-to-3D Artistic Font Generation with Part-Level Style Control cs.CV | cs.GRPDF

Ji Gan, Lingxu Chen, Jiaxu Leng, Xinbo Gao

TL;DR: SplatFont3D 是一个结构感知的文本到3D艺术字体生成框架，利用3D高斯核函数实现精细的部件级风格控制，优于现有3D模型。

Details

Motivation: 现有研究主要关注2D艺术字体，3D艺术字体生成（3D-AFG）尚未充分探索。3D-AFG在沉浸式环境（如游戏和动画）中有应用潜力，还能提升2D-AFG质量。3D字体具有严格的结构约束和部件级风格控制需求。

Result: 实验表明，SplatFont3D在风格一致性、视觉质量和渲染效率上优于现有3D模型。

Insight: 1. 3D高斯核函数在3D字体生成中兼具高效性和可控性；2. 结合2D扩散模型可以提升3D生成质量；3. 动态部件分配策略有效解决了3D优化中的结构性问题。

Abstract: Artistic font generation (AFG) can assist human designers in creating innovative artistic fonts. However, most previous studies primarily focus on 2D artistic fonts in flat design, leaving personalized 3D-AFG largely underexplored. 3D-AFG not only enables applications in immersive 3D environments such as video games and animations, but also may enhance 2D-AFG by rendering 2D fonts of novel views. Moreover, unlike general 3D objects, 3D fonts exhibit precise semantics with strong structural constraints and also demand fine-grained part-level style control. To address these challenges, we propose SplatFont3D, a novel structure-aware text-to-3D AFG framework with 3D Gaussian splatting, which enables the creation of 3D artistic fonts from diverse style text prompts with precise part-level style control. Specifically, we first introduce a Glyph2Cloud module, which progressively enhances both the shapes and styles of 2D glyphs (or components) and produces their corresponding 3D point clouds for Gaussian initialization. The initialized 3D Gaussians are further optimized through interaction with a pretrained 2D diffusion model using score distillation sampling. To enable part-level control, we present a dynamic component assignment strategy that exploits the geometric priors of 3D Gaussians to partition components, while alleviating drift-induced entanglement during 3D Gaussian optimization. Our SplatFont3D provides more explicit and effective part-level style control than NeRF, attaining faster rendering efficiency. Experiments show that our SplatFont3D outperforms existing 3D models for 3D-AFG in style-text consistency, visual quality, and rendering efficiency.

[37] What about gravity in video generation? Post-Training Newton’s Laws with Verifiable Rewards cs.CVPDF

Minh-Quan Le, Yuanzhi Zhu, Vicky Kalogeiton, Dimitris Samaras

TL;DR: 论文提出了一种基于可验证奖励的后训练框架NewtonRewards，用于提升视频生成中的物理合理性，通过光学流和外观特征作为速度和质量的代理，强制执行牛顿运动定律和质能守恒。

Details

Motivation: 当前的视频扩散模型虽然能生成视觉上逼真的视频，但常常违反基本物理定律（如物体漂浮、加速度漂移等），导致视觉真实性与物理真实性之间存在明显差距。

Result: 在牛顿运动基准测试NewtonBench-60K上，NewtonRewards在物理合理性、运动平滑性和时间一致性上均优于现有方法，且对分布外变化具有鲁棒性。

Insight: 通过可验证的物理奖励，可以无需依赖人类或VLM反馈，实现对视频生成模型的物理合理性优化，为物理感知的视频生成提供了可扩展的路径。

Abstract: Recent video diffusion models can synthesize visually compelling clips, yet often violate basic physical laws-objects float, accelerations drift, and collisions behave inconsistently-revealing a persistent gap between visual realism and physical realism. We propose $\texttt{NewtonRewards}$, the first physics-grounded post-training framework for video generation based on $\textit{verifiable rewards}$. Instead of relying on human or VLM feedback, $\texttt{NewtonRewards}$ extracts $\textit{measurable proxies}$ from generated videos using frozen utility models: optical flow serves as a proxy for velocity, while high-level appearance features serve as a proxy for mass. These proxies enable explicit enforcement of Newtonian structure through two complementary rewards: a Newtonian kinematic constraint enforcing constant-acceleration dynamics, and a mass conservation reward preventing trivial, degenerate solutions. We evaluate $\texttt{NewtonRewards}$ on five Newtonian Motion Primitives (free fall, horizontal/parabolic throw, and ramp sliding down/up) using our newly constructed large-scale benchmark, $\texttt{NewtonBench-60K}$. Across all primitives in visual and physics metrics, $\texttt{NewtonRewards}$ consistently improves physical plausibility, motion smoothness, and temporal coherence over prior post-training methods. It further maintains strong performance under out-of-distribution shifts in height, speed, and friction. Our results show that physics-grounded verifiable rewards offer a scalable path toward physics-aware video generation.

[38] RecruitView: A Multimodal Dataset for Predicting Personality and Interview Performance for Human Resources Applications cs.CV | cs.AIPDF

Amit Kumar Gupta, Farhan Sheth, Hammad Shaikh, Dheeraj Kumar, Angkul Puniya

TL;DR: 该论文介绍了RecruitView数据集和一个名为CRMF的多模态深度学习框架，用于预测人格特质和面试表现，显著提升了性能指标。

Details

Motivation: 当前的人格特质和软技能自动评估面临数据集有限和方法无法捕捉人类特质固有几何结构的挑战。

Result: CRMF比基线方法在Spearman相关性和一致性指数上分别提升11.4%和6.0%，且参数更少。

Insight: 多流形建模能更有效地捕捉人格和行为的几何结构，自适应路由机制提升了模型灵活性。

Abstract: Automated personality and soft skill assessment from multimodal behavioral data remains challenging due to limited datasets and methods that fail to capture geometric structure inherent in human traits. We introduce RecruitView, a dataset of 2,011 naturalistic video interview clips from 300+ participants with 27,000 pairwise comparative judgments across 12 dimensions: Big Five personality traits, overall personality score, and six interview performance metrics. To leverage this data, we propose Cross-Modal Regression with Manifold Fusion (CRMF), a geometric deep learning framework that explicitly models behavioral representations across hyperbolic, spherical, and Euclidean manifolds. CRMF employs geometry-specific expert networks to capture hierarchical trait structures, directional behavioral patterns, and continuous performance variations simultaneously. An adaptive routing mechanism dynamically weights expert contributions based on input characteristics. Through principled tangent space fusion, CRMF achieves superior performance while training 40-50% fewer trainable parameters than large multimodal models. Extensive experiments demonstrate that CRMF substantially outperforms the selected baselines, achieving up to 11.4% improvement in Spearman correlation and 6.0% in concordance index. Our RecruitView dataset is publicly available at https://huggingface.co/datasets/AI4A-lab/RecruitView

[39] CausalAffect: Causal Discovery for Facial Affective Understanding cs.CV | cs.AIPDF

Guanyu Hu, Tangzheng Lian, Dimitrios Kollias, Oya Celiktutan, Xinyu Yang

TL;DR: 这篇论文提出了CausalAffect，首个用于面部情感分析的因果图发现框架，通过两级因果层次结构和特征级反事实干预机制，显著提升了AU检测和表情识别的性能。

Details

Motivation: 现有面部情感分析方法大多忽略了动作单元（AUs）与表情之间的因果依赖关系，无法推断心理学上可信的因果结构。

Result: 在六个基准测试中表现优异，AU检测和表情识别均达到SOTA，验证了因果结构与心理理论的一致性。

Insight: 揭示了新颖的抑制性和未表征的依赖关系，为可解释的面部行为分析提供了基础。

Abstract: Understanding human affect from facial behavior requires not only accurate recognition but also structured reasoning over the latent dependencies that drive muscle activations and their expressive outcomes. Although Action Units (AUs) have long served as the foundation of affective computing, existing approaches rarely address how to infer psychologically plausible causal relations between AUs and expressions directly from data. We propose CausalAffect, the first framework for causal graph discovery in facial affect analysis. CausalAffect models AU-AU and AU-Expression dependencies through a two-level polarity and direction aware causal hierarchy that integrates population-level regularities with sample-adaptive structures. A feature-level counterfactual intervention mechanism further enforces true causal effects while suppressing spurious correlations. Crucially, our approach requires neither jointly annotated datasets nor handcrafted causal priors, yet it recovers causal structures consistent with established psychological theories while revealing novel inhibitory and previously uncharacterized dependencies. Extensive experiments across six benchmarks demonstrate that CausalAffect advances the state of the art in both AU detection and expression recognition, establishing a principled connection between causal discovery and interpretable facial behavior. All trained models and source code will be released upon acceptance.

[40] RealGen: Photorealistic Text-to-Image Generation via Detector-Guided Rewards cs.CV | cs.AIPDF

Junyan Ye, Leiqi Zhu, Yuncheng Guo, Dongzhi Jiang, Zilong Huang

TL;DR: RealGen提出了一种基于检测器引导奖励的文本到图像生成框架，通过对抗生成和奖励机制显著提升了生成图像的真实感和细节。

Details

Motivation: 尽管现有文本到图像生成模型（如GPT-Image-1和Qwen-Image）在一致性和世界知识方面表现优异，但其生成图像仍存在明显的AI痕迹（如过于光滑的皮肤和油光满面），无法达到‘与现实无异’的目标。

Result: 实验表明，RealGen在真实感、细节和美学方面显著优于通用模型（如GPT-Image-1和Qwen-Image）和专业真实感模型（如FLUX-Krea）。

Insight: 通过引入对抗生成思想和检测器奖励机制，可以显著提升文本到图像生成的真实性，而自动化评测基准能够更准确反映用户体验。

Abstract: With the continuous advancement of image generation technology, advanced models such as GPT-Image-1 and Qwen-Image have achieved remarkable text-to-image consistency and world knowledge However, these models still fall short in photorealistic image generation. Even on simple T2I tasks, they tend to produce “ fake” images with distinct AI artifacts, often characterized by “overly smooth skin” and “oily facial sheens”. To recapture the original goal of “indistinguishable-from-reality” generation, we propose RealGen, a photorealistic text-to-image framework. RealGen integrates an LLM component for prompt optimization and a diffusion model for realistic image generation. Inspired by adversarial generation, RealGen introduces a “Detector Reward” mechanism, which quantifies artifacts and assesses realism using both semantic-level and feature-level synthetic image detectors. We leverage this reward signal with the GRPO algorithm to optimize the entire generation pipeline, significantly enhancing image realism and detail. Furthermore, we propose RealBench, an automated evaluation benchmark employing Detector-Scoring and Arena-Scoring. It enables human-free photorealism assessment, yielding results that are more accurate and aligned with real user experience. Experiments demonstrate that RealGen significantly outperforms general models like GPT-Image-1 and Qwen-Image, as well as specialized photorealistic models like FLUX-Krea, in terms of realism, detail, and aesthetics. The code is available at https://github.com/yejy53/RealGen.

[41] Structured Context Learning for Generic Event Boundary Detection cs.CVPDF

Xin Gu, Congcong Li, Xinyao Wang, Dexiang Hong, Libo Zhang

TL;DR: 本文提出了一种名为结构化上下文学习（Structured Context Learning）的新方法，用于通用事件边界检测（GEBD），通过结构化序列分区（SPoS）提供结构化上下文，实现了端到端训练和灵活性，计算复杂度线性增长，并在多个数据集上优于现有方法。

Details

Motivation: 现有的事件边界检测方法往往依赖特定时间模型（如GRU、LSTM和Transformer），缺乏灵活性和效率。本文旨在提出一种通用且高效的方法，以解决这一问题。

Result: 方法在Kinetics-GEBD、TAPOS和镜头转换检测数据集上表现优异，超越现有方法。

Insight: 结构化上下文学习能够显著提升事件边界检测的灵活性和效率，同时适应不同时间模型。

Abstract: Generic Event Boundary Detection (GEBD) aims to identify moments in videos that humans perceive as event boundaries. This paper proposes a novel method for addressing this task, called Structured Context Learning, which introduces the Structured Partition of Sequence (SPoS) to provide a structured context for learning temporal information. Our approach is end-to-end trainable and flexible, not restricted to specific temporal models like GRU, LSTM, and Transformers. This flexibility enables our method to achieve a better speed-accuracy trade-off. Specifically, we apply SPoS to partition the input frame sequence and provide a structured context for the subsequent temporal model. Notably, SPoS’s overall computational complexity is linear with respect to the video length. We next calculate group similarities to capture differences between frames, and a lightweight fully convolutional network is utilized to determine the event boundaries based on the grouped similarity maps. To remedy the ambiguities of boundary annotations, we adapt the Gaussian kernel to preprocess the ground-truth event boundaries. Our proposed method has been extensively evaluated on the challenging Kinetics-GEBD, TAPOS, and shot transition detection datasets, demonstrating its superiority over existing state-of-the-art methods.

[42] Learning What Helps: Task-Aligned Context Selection for Vision Tasks cs.CVPDF

Jingyu Guo, Emir Konuk, Fredrik Strand, Christos Matsoukas, Kevin Smith

TL;DR: TACS是一个通过学习选择任务相关上下文样本的框架，显著提升了视觉任务的性能，尤其是在数据有限或挑战性场景中。

Details

Motivation: 人类通过对比相关样本解决视觉不确定性，但ViTs缺乏识别哪些样本能提升预测的能力。

Result: 在18个数据集上（涵盖细粒度识别、医学图像分类和分割）TACS始终优于基于相似性的检索方法。

Insight: 任务对齐的样本选择比仅依赖相似性的检索更有效，尤其是在数据受限情况下。

Abstract: Humans often resolve visual uncertainty by comparing an image with relevant examples, but ViTs lack the ability to identify which examples would improve their predictions. We present Task-Aligned Context Selection (TACS), a framework that learns to select paired examples which truly improve task performance rather than those that merely appear similar. TACS jointly trains a selector network with the task model through a hybrid optimization scheme combining gradient-based supervision and reinforcement learning, making retrieval part of the learning objective. By aligning selection with task rewards, TACS enables discriminative models to discover which contextual examples genuinely help. Across 18 datasets covering fine-grained recognition, medical image classification, and medical image segmentation, TACS consistently outperforms similarity-based retrieval, particularly in challenging or data-limited settings.

[43] CC-FMO: Camera-Conditioned Zero-Shot Single Image to 3D Scene Generation with Foundation Model Orchestration cs.CVPDF

Boshi Tang, Henry Zheng, Rui Huang, Gao Huang

TL;DR: CC-FMO是一个零样本、相机条件化的单图像到3D场景生成方法，结合了语义感知的向量集表示和细节丰富的结构化潜在表示，解决了现有方法在物体姿态估计和空间一致性上的不足。

Details

Motivation: 高质量的单图像到3D场景生成对AR/VR和具身AI应用至关重要。现有方法在小数据集上训练的专用模型泛化能力有限，且大尺度3D基础模型在场景级生成上仍存在挑战。

Result: 实验表明CC-FMO生成高质量、相机对齐的组合场景，优于所有现有方法。

Insight: 结合语义信息与细节表示，以及相机条件化优化，是实现高质量3D场景生成的关键。

Abstract: High-quality 3D scene generation from a single image is crucial for AR/VR and embodied AI applications. Early approaches struggle to generalize due to reliance on specialized models trained on curated small datasets. While recent advancements in large-scale 3D foundation models have significantly enhanced instance-level generation, coherent scene generation remains a challenge, where performance is limited by inaccurate per-object pose estimations and spatial inconsistency. To this end, this paper introduces CC-FMO, a zero-shot, camera-conditioned pipeline for single-image to 3D scene generation that jointly conforms to the object layout in input image and preserves instance fidelity. CC-FMO employs a hybrid instance generator that combines semantics-aware vector-set representation with detail-rich structured latent representation, yielding object geometries that are both semantically plausible and high-quality. Furthermore, CC-FMO enables the application of foundational pose estimation models in the scene generation task via a simple yet effective camera-conditioned scale-solving algorithm, to enforce scene-level coherence. Extensive experiments demonstrate that CC-FMO consistently generates high-fidelity camera-aligned compositional scenes, outperforming all state-of-the-art methods.

[44] Terrain Sensing with Smartphone Structured Light: 2D Dynamic Time Warping for Grid Pattern Matching cs.CVPDF

Tanaka Nobuaki

TL;DR: 论文提出了一种基于智能手机结构光的低成本地形感知系统，通过投影网格图案并利用改进的二维动态时间规整（2D-DTW）算法匹配变形网格，从而重构地形不平度。

Details

Motivation: 低成本移动机器人在不平坦地形上运行时，微小的高低或倾斜可能难以视觉感知，但会显著影响运动稳定性。传统的一维动态时间规整（1D-DTW）无法直接适用于二维网格图案的匹配。

Result: 该系统不仅可用于地形感知，还能作为图像处理中结构化网格匹配的通用工具。

Insight: 2D-DTW算法的设计展示了在资源受限平台上实现复杂网格匹配的可行性，拓展了动态时间规整的应用场景。

Abstract: Low-cost mobile rovers often operate on uneven terrain where small bumps or tilts are difficult to perceive visually but can significantly affect locomotion stability. To address this problem, we explore a smartphone-based structured-light system that projects a grid pattern onto the ground and reconstructs local terrain unevenness from a single handheld device. The system is inspired by face-recognition projectors, but adapted for ground sensing. A key technical challenge is robustly matching the projected grid with its deformed observation under perspective distortion and partial occlusion. Conventional one-dimensional dynamic time warping (1D-DTW) is not directly applicable to such two-dimensional grid patterns. We therefore propose a topology-constrained two-dimensional dynamic time warping (2D-DTW) algorithm that performs column-wise alignment under a global grid consistency constraint. The proposed method is designed to be simple enough to run on resource limited platforms while preserving the grid structure required for accurate triangulation. We demonstrate that our 2D-DTW formulation can be used not only for terrain sensing but also as a general tool for matching structured grid patterns in image processing scenarios. This paper describes the overall system design as well as the 2D-DTW extension that emerged from this application.

[45] Image Generation as a Visual Planner for Robotic Manipulation cs.CV | cs.ROPDF

Ye Pang

TL;DR: 该论文提出了一种基于预训练图像生成模型的视觉规划器，通过LoRA微调实现机器人操作任务的视频生成。方法分为文本条件和轨迹条件两部分，实验表明其能生成平滑、连贯的机器人视频。

Details

Motivation: 当前视频扩散模型需要大量领域数据且泛化能力不足，而语言-图像预训练模型具有较强的组合能力，包括生成时间连贯的网格图像。论文探索这类模型能否在微调后作为机器人的视觉规划器。

Result: 在Jaco Play、Bridge V2和RT1数据集上的实验表明，提出的方法能生成与条件对齐的平滑、连贯机器人视频。

Insight: 预训练图像生成器隐含了时间连续性能力，即使未显式建模时间信息也能作为有效的视频规划器。

Abstract: Generating realistic robotic manipulation videos is an important step toward unifying perception, planning, and action in embodied agents. While existing video diffusion models require large domain-specific datasets and struggle to generalize, recent image generation models trained on language-image corpora exhibit strong compositionality, including the ability to synthesize temporally coherent grid images. This suggests a latent capacity for video-like generation even without explicit temporal modeling. We explore whether such models can serve as visual planners for robots when lightly adapted using LoRA finetuning. We propose a two-part framework that includes: (1) text-conditioned generation, which uses a language instruction and the first frame, and (2) trajectory-conditioned generation, which uses a 2D trajectory overlay and the same initial frame. Experiments on the Jaco Play dataset, Bridge V2, and the RT1 dataset show that both modes produce smooth, coherent robot videos aligned with their respective conditions. Our findings indicate that pretrained image generators encode transferable temporal priors and can function as video-like robotic planners under minimal supervision. Code is released at \href{https://github.com/pangye202264690373/Image-Generation-as-a-Visual-Planner-for-Robotic-Manipulation}{https://github.com/pangye202264690373/Image-Generation-as-a-Visual-Planner-for-Robotic-Manipulation}.

[46] Cross-Temporal 3D Gaussian Splatting for Sparse-View Guided Scene Update cs.CVPDF

Zeyuan An, Yanghang Xiao, Zhiying Leng, Frederick W. B. Li, Xiaohui Liang

TL;DR: 论文提出了一种名为Cross-Temporal 3DGS的新方法，通过稀疏视图和历史先验，高效地重建和更新3D场景，适用于非连续捕获的场景更新。

Details

Motivation: 真实世界中，密集扫描通常不可行或不便，稀疏视图的3D场景更新在城规、灾害评估等领域有重要应用价值。

Result: 实验显示，该方法在重建质量和数据效率上显著优于基线方法。

Insight: 仅用稀疏图像即可实现时间变化追踪，并支持非连续捕获的场景重建，为场景版本化和数字孪生提供了新思路。

Abstract: Maintaining consistent 3D scene representations over time is a significant challenge in computer vision. Updating 3D scenes from sparse-view observations is crucial for various real-world applications, including urban planning, disaster assessment, and historical site preservation, where dense scans are often unavailable or impractical. In this paper, we propose Cross-Temporal 3D Gaussian Splatting (Cross-Temporal 3DGS), a novel framework for efficiently reconstructing and updating 3D scenes across different time periods, using sparse images and previously captured scene priors. Our approach comprises three stages: 1) Cross-temporal camera alignment for estimating and aligning camera poses across different timestamps; 2) Interference-based confidence initialization to identify unchanged regions between timestamps, thereby guiding updates; and 3) Progressive cross-temporal optimization, which iteratively integrates historical prior information into the 3D scene to enhance reconstruction quality. Our method supports non-continuous capture, enabling not only updates using new sparse views to refine existing scenes, but also recovering past scenes from limited data with the help of current captures. Furthermore, we demonstrate the potential of this approach to achieve temporal changes using only sparse images, which can later be reconstructed into detailed 3D representations as needed. Experimental results show significant improvements over baseline methods in reconstruction quality and data efficiency, making this approach a promising solution for scene versioning, cross-temporal digital twins, and long-term spatial documentation.

[47] NeuroVolve: Evolving Visual Stimuli toward Programmable Neural Objectives cs.CVPDF

Haomiao Chen, Keith W Jamison, Mert R. Sabuncu, Amy Kuceyeski

TL;DR: NeuroVolve是一种生成框架，通过优化预训练视觉-语言模型嵌入空间中的神经目标函数，实现大脑引导的图像合成。它能编程激活或抑制单个或多个脑区，生成满足复杂约束的场景，并揭示语义轨迹。

Details

Motivation: 现有方法仅能复制孤立脑区的类别选择性（如FFA中的面孔），难以揭示复杂自然视觉中脑区间的交互。NeuroVolve旨在提供更灵活的神经目标编程功能，以研究脑区协同和对抗关系。

Result: 1. 生成单脑区的低层和语义特征特异性刺激；2. 合成符合定制神经目标（如脑区共激活或去相关）的图像；3. 捕捉被试特异性偏好，支持个性化合成。

Insight: NeuroVolve为研究脑区协同和对抗关系提供了新工具，同时支持个性化视觉刺激生成，有助于神经表征的映射和分析。

Abstract: What visual information is encoded in individual brain regions, and how do distributed patterns combine to create their neural representations? Prior work has used generative models to replicate known category selectivity in isolated regions (e.g., faces in FFA), but these approaches offer limited insight into how regions interact during complex, naturalistic vision. We introduce NeuroVolve, a generative framework that provides brain-guided image synthesis via optimization of a neural objective function in the embedding space of a pretrained vision-language model. Images are generated under the guidance of a programmable neural objective, i.e., activating or deactivating single regions or multiple regions together. NeuroVolve is validated by recovering known selectivity for individual brain regions, while expanding to synthesize coherent scenes that satisfy complex, multi-region constraints. By tracking optimization steps, it reveals semantic trajectories through embedding space, unifying brain-guided image editing and preferred stimulus generation in a single process. We show that NeuroVolve can generate both low-level and semantic feature-specific stimuli for single ROIs, as well as stimuli aligned to curated neural objectives. These include co-activation and decorrelation between regions, exposing cooperative and antagonistic tuning relationships. Notably, the framework captures subject-specific preferences, supporting personalized brain-driven synthesis and offering interpretable constraints for mapping, analyzing, and probing neural representations of visual information.

[48] Describe Anything Anywhere At Any Moment cs.CV | cs.AI | cs.ROPDF

Nicolas Gorlo, Lukas Schmid, Luca Carlone

TL;DR: DAAAM是一个新颖的时空记忆框架，支持大规模实时4D场景理解，通过优化前端处理和分层4D场景图，显著提升了语义描述的速度和准确性，并在多个基准测试中达到了最先进的结果。

Details

Motivation: 现有的计算机视觉和机器人应用需要在语义丰富度和实时性能之间权衡。DAAAM旨在克服这一挑战，提供既包含丰富语义又能实时处理的4D场景理解框架。

Result: 在NaVQA和SG3D基准测试中大幅领先基线方法，OC-NaVQA问题准确率提升53.6%，位置误差降低21.9%，时间误差降低21.6%，SG3D任务准确率提高27.8%。

Insight: DAAAM通过结合语义理解和时空一致性，展示了在复杂环境中实时场景理解的潜力，为增强现实和机器人自主性提供了新的工具。

Abstract: Computer vision and robotics applications ranging from augmented reality to robot autonomy in large-scale environments require spatio-temporal memory frameworks that capture both geometric structure for accurate language-grounding as well as semantic detail. Existing methods face a tradeoff, where producing rich open-vocabulary descriptions comes at the expense of real-time performance when these descriptions have to be grounded in 3D. To address these challenges, we propose Describe Anything, Anywhere, at Any Moment (DAAAM), a novel spatio-temporal memory framework for large-scale and real-time 4D scene understanding. DAAAM introduces a novel optimization-based frontend to infer detailed semantic descriptions from localized captioning models, such as the Describe Anything Model (DAM), leveraging batch processing to speed up inference by an order of magnitude for online processing. It leverages such semantic understanding to build a hierarchical 4D scene graph (SG), which acts as an effective globally spatially and temporally consistent memory representation. DAAAM constructs 4D SGs with detailed, geometrically grounded descriptions while maintaining real-time performance. We show that DAAAM’s 4D SG interfaces well with a tool-calling agent for inference and reasoning. We thoroughly evaluate DAAAM in the complex task of spatio-temporal question answering on the NaVQA benchmark and show its generalization capabilities for sequential task grounding on the SG3D benchmark. We further curate an extended OC-NaVQA benchmark for large-scale and long-time evaluations. DAAAM achieves state-of-the-art results in both tasks, improving OC-NaVQA question accuracy by 53.6%, position errors by 21.9%, temporal errors by 21.6%, and SG3D task grounding accuracy by 27.8% over the most competitive baselines, respectively. We release our data and code open-source.

[49] SatireDecoder: Visual Cascaded Decoupling for Enhancing Satirical Image Comprehension cs.CVPDF

Yue Jiang, Haiwei Xue, Minghao Han, Mingcheng Li, Xiaolu Hou

TL;DR: 该论文提出了SatireDecoder框架，通过视觉级联解耦和多代理系统增强讽刺图像的理解能力，显著减少了误解和幻觉现象。

Details

Motivation: 讽刺作为一种结合幽默与隐含批评的艺术表达形式，具有重要的社会价值。然而，现有的视觉语言模型在理解纯视觉讽刺时存在困难，无法有效整合局部实体关系与全局上下文。

Result: 实验证明SatireDecoder在讽刺图像理解任务中优于现有基线方法，显著提高了准确性并减少了幻觉现象。

Insight: 该方法展示了在复杂、高层次语义任务中视觉语言推理的潜力，特别是在需要整合局部与全局信息的场景中。

Abstract: Satire, a form of artistic expression combining humor with implicit critique, holds significant social value by illuminating societal issues. Despite its cultural and societal significance, satire comprehension, particularly in purely visual forms, remains a challenging task for current vision-language models. This task requires not only detecting satire but also deciphering its nuanced meaning and identifying the implicated entities. Existing models often fail to effectively integrate local entity relationships with global context, leading to misinterpretation, comprehension biases, and hallucinations. To address these limitations, we propose SatireDecoder, a training-free framework designed to enhance satirical image comprehension. Our approach proposes a multi-agent system performing visual cascaded decoupling to decompose images into fine-grained local and global semantic representations. In addition, we introduce a chain-of-thought reasoning strategy guided by uncertainty analysis, which breaks down the complex satire comprehension process into sequential subtasks with minimized uncertainty. Our method significantly improves interpretive accuracy while reducing hallucinations. Experimental results validate that SatireDecoder outperforms existing baselines in comprehending visual satire, offering a promising direction for vision-language reasoning in nuanced, high-level semantic tasks.

Thuraya Alzubaidi, Farhad R. Nezami, Muzammil Behzad

TL;DR: 该论文提出了一种参数高效的医学视觉-语言模型MedCT-VLM，通过低秩适应（LoRA）方法，在仅微调少量参数的情况下，显著提升了CT影像零样本分类任务的性能。

Details

Motivation: 尽管大规模视觉-语言预训练模型在多个领域表现出强大的零样本能力，但在体积医学影像中的应用仍受限。论文旨在通过参数高效的方法，解决下游临床任务中标注数据稀缺的问题。

Result: 零样本分类任务的AUROC从61.3%提升至68.9%，准确率从67.2%提升至73.6%，宏F1分数从32.1%提升至36.9%。

Insight: 参数高效方法（如LoRA）能够有效将大规模预训练能力迁移至医学影像任务，尤其是在标注数据稀缺的零样本场景中具有显著优势。

Abstract: Foundation models trained via vision-language pretraining have demonstrated strong zero-shot capabilities across diverse image domains, yet their application to volumetric medical imaging remains limited. We introduce MedCT-VLM: Medical CT Vision-Language Model, a parameter-efficient vision-language framework designed to adapt large-scale CT foundation models for downstream clinical tasks. MedCT-VLM uses a parameter-efficient approach to adapt CT-CLIP, a contrastive vision-language model trained on 25,692 chest CT volumes, for multi-label pathology classification using Low-Rank Adaptation (LoRA). Rather than fine-tuning the model’s 440 M parameters directly, we insert low-rank decomposition matrices into attention layers of both vision and text encoders, training only 1.67M parameters (0.38% of total). We evaluate on zero-shot classification across 18 thoracic pathologies, where the model must align CT embeddings with unseen text prompts at inference without task-specific training. LoRA fine-tuning improves mean AUROC from 61.3% to 68.9% (+7.6 pp), accuracy from 67.2% to 73.6% (+6.4 pp), and macro-F1 from 32.1% to 36.9% (+4.8 pp). These results demonstrate that parameter-efficient methods can effectively transfer large-scale pretraining to downstream medical imaging tasks, particularly for zero-shot scenarios where labeled data is scarce.

[51] Automatic Pith Detection in Tree Cross-Section Images Using Deep Learning cs.CV | cs.AIPDF

Tzu-I Liao, Mahmoud Fakhry, Jibin Yesudas Varghese

TL;DR: 该论文评估了五种深度学习模型（YOLOv9、U-Net、Swin Transformer、DeepLabV3和Mask R-CNN）在树木横截面图像中自动检测髓心的性能，Swin Transformer表现最佳。

Details

Motivation: 树木横截面中髓心的手动检测效率低且易出错，迫切需要可靠的自动化方法。

Result: Swin Transformer在精细分割中表现最优（准确率0.94），Mask R-CNN通过NMS将IoU从0.45提升至0.80。

Insight: 模型选择需根据数据集特点和应用需求，数据增强和后处理对不同模型性能提升显著。

Abstract: Pith detection in tree cross-sections is essential for forestry and wood quality analysis but remains a manual, error-prone task. This study evaluates deep learning models – YOLOv9, U-Net, Swin Transformer, DeepLabV3, and Mask R-CNN – to automate the process efficiently. A dataset of 582 labeled images was dynamically augmented to improve generalization. Swin Transformer achieved the highest accuracy (0.94), excelling in fine segmentation. YOLOv9 performed well for bounding box detection but struggled with boundary precision. U-Net was effective for structured patterns, while DeepLabV3 captured multi-scale features with slight boundary imprecision. Mask R-CNN initially underperformed due to overlapping detections, but applying Non-Maximum Suppression (NMS) improved its IoU from 0.45 to 0.80. Generalizability was next tested using an oak dataset of 11 images from Oregon State University’s Tree Ring Lab. Additionally, for exploratory analysis purposes, an additional dataset of 64 labeled tree cross-sections was used to train the worst-performing model to see if this would improve its performance generalizing to the unseen oak dataset. Key challenges included tensor mismatches and boundary inconsistencies, addressed through hyperparameter tuning and augmentation. Our results highlight deep learning’s potential for tree cross-section pith detection, with model choice depending on dataset characteristics and application needs.

[52] XAI-Driven Skin Disease Classification: Leveraging GANs to Augment ResNet-50 Performance cs.CV | cs.AIPDF

Kim Gerard A. Villanueva, Priyanka Kumar

TL;DR: 该论文提出了一种结合GAN和ResNet-50的计算机辅助诊断系统，用于解决皮肤病变分类中的数据不平衡和模型可解释性问题，并通过XAI技术提升了分类性能和临床可解释性。

Details

Motivation: 皮肤病变分类面临数据集不平衡、主观诊断方法和深度学习模型的黑箱问题，需要一种高性能且可解释的解决方案。

Result: 系统在HAM10000数据集上实现了92.50%的准确率和98.82%的Macro-AUC，优于基准模型。

Insight: 数据增强和XAI技术的结合不仅能提升分类性能，还能增强模型的临床可信度，为医疗诊断提供了更可靠的解决方案。

Abstract: Accurate and timely diagnosis of multi-class skin lesions is hampered by subjective methods, inherent data imbalance in datasets like HAM10000, and the “black box” nature of Deep Learning (DL) models. This study proposes a trustworthy and highly accurate Computer-Aided Diagnosis (CAD) system to overcome these limitations. The approach utilizes Deep Convolutional Generative Adversarial Networks (DCGANs) for per class data augmentation to resolve the critical class imbalance problem. A fine-tuned ResNet-50 classifier is then trained on the augmented dataset to classify seven skin disease categories. Crucially, LIME and SHAP Explainable AI (XAI) techniques are integrated to provide transparency by confirming that predictions are based on clinically relevant features like irregular morphology. The system achieved a high overall Accuracy of 92.50 % and a Macro-AUC of 98.82 %, successfully outperforming various prior benchmarked architectures. This work successfully validates a verifiable framework that combines high performance with the essential clinical interpretability required for safe diagnostic deployment. Future research should prioritize enhancing discrimination for critical categories, such as Melanoma NOS (F1-Score is 0.8602).

[53] Doppler-Enhanced Deep Learning: Improving Thyroid Nodule Segmentation with YOLOv5 Instance Segmentation cs.CV | cs.AI | cs.CE | cs.LG | cs.PFPDF

Mahmoud El Hussieni

TL;DR: 这篇论文探讨了利用YOLOv5实例分割算法在超声图像中进行甲状腺结节分割的方法，并发现包含多普勒图像可以显著提升分割性能。YOLOv5-Large模型在多普勒数据集上表现最佳。

Details

Motivation: 全球甲状腺癌发病率上升，需要开发计算机辅助检测方法。甲状腺结节的精确分割是AI辅助临床决策系统的关键步骤。

Result: YOLOv5-Large在多普勒数据集上表现最佳（Dice得分91%，mAP 0.87）。多普勒图像的加入使所有模型性能提升。

Insight: 多普勒图像虽常被医生忽略，但对分割任务有重要增益，为实时甲状腺结节自动检测提供了临床应用的潜力。

Abstract: The increasing prevalence of thyroid cancer globally has led to the development of various computer-aided detection methods. Accurate segmentation of thyroid nodules is a critical first step in the development of AI-assisted clinical decision support systems. This study focuses on instance segmentation of thyroid nodules using YOLOv5 algorithms on ultrasound images. We evaluated multiple YOLOv5 variants (Nano, Small, Medium, Large, and XLarge) across two dataset versions, with and without doppler images. The YOLOv5-Large algorithm achieved the highest performance with a dice score of 91% and mAP of 0.87 on the dataset including doppler images. Notably, our results demonstrate that doppler images, typically excluded by physicians, can significantly improve segmentation performance. The YOLOv5-Small model achieved 79% dice score when doppler images were excluded, while including them improved performance across all model variants. These findings suggest that instance segmentation with YOLOv5 provides an effective real-time approach for thyroid nodule detection, with potential clinical applications in automated diagnostic systems.

[54] MambaScope: Coarse-to-Fine Scoping for Efficient Vision Mamba cs.CV | cs.AIPDF

Shanhui Liu, Rui Xu, Yunke Wang

TL;DR: CF-ViM是一个自适应的Coarse-to-Fine框架，通过动态调整分辨率来高效处理图像，简单图像用粗粒度处理，复杂图像仅在需要时细化，显著提高了效率和准确性。

Details

Motivation: 现有的Vision Mamba在输入token数量上存在效率瓶颈，而传统的token削减方法（如剪枝或合并）会导致信息丢失。研究发现，并非所有图像都需要细粒度处理，简单图像可以用粗粒度高效处理，复杂图像则需局部细化。

Result: 在ImageNet上的实验表明，CF-ViM在准确性和效率上均优于基线Vision Mamba和现有的token削减技术。

Insight: 动态分辨率分配是关键，简单图像无需细粒度处理，复杂图像仅在关键区域细化，可高效平衡计算成本和信息保留。

Abstract: Vision Mamba has emerged as a promising and efficient alternative to Vision Transformers, yet its efficiency remains fundamentally constrained by the number of input tokens. Existing token reduction approaches typically adopt token pruning or merging to reduce computation. However, they inherently lead to information loss, as they discard or compress token representations. This problem is exacerbated when applied uniformly to fine-grained token representations across all images, regardless of visual complexity. We observe that not all inputs require fine-grained processing. Simple images can be effectively handled at coarse resolution, while only complex ones may warrant refinement. Based on this insight, we propose \textit{Coarse-to-Fine Vision Mamba (CF-ViM)}, an adaptive framework for efficient inference. CF-ViM first performs coarse-grained inference by dividing the input image into large patches, significantly reducing the token length and computation. When the model’s prediction confidence is low, selected regions are re-processed at a finer resolution to recover critical visual details with minimal additional cost. This dynamic resolution assignment strategy allows CF-ViM to allocate computation adaptively according to image complexity, ensuring efficient processing without compromising essential visual information. Experiments on ImageNet demonstrate that CF-ViM outperforms both the baseline Vision Mamba and state-of-the-art token reduction techniques in terms of accuracy and efficiency.

[55] Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer cs.CV | cs.AIPDF

Dong In Lee, Hyungjun Doh, Seunggeun Chi, Runlin Duan, Sangpil Kim

TL;DR: Dynamic-eDiTor是一种无需训练的文本驱动4D场景编辑框架，利用多模态扩散变换器（MM-DiT）和4D高斯喷洒（4DGS）实现一致的空间-时间编辑。

Details

Motivation: 现有方法依赖2D扩散模型独立编辑帧，导致运动失真、几何漂移和不完整编辑，Dynamic-eDiTor旨在解决这些问题。

Result: 在DyNeRF数据集上展示了优于现有方法的编辑质量和一致性。

Insight: 光流引导的令牌替换和多模态融合是实现一致4D编辑的关键。

Abstract: Recent progress in 4D representations, such as Dynamic NeRF and 4D Gaussian Splatting (4DGS), has enabled dynamic 4D scene reconstruction. However, text-driven 4D scene editing remains under-explored due to the challenge of ensuring both multi-view and temporal consistency across space and time during editing. Existing studies rely on 2D diffusion models that edit frames independently, often causing motion distortion, geometric drift, and incomplete editing. We introduce Dynamic-eDiTor, a training-free text-driven 4D editing framework leveraging Multimodal Diffusion Transformer (MM-DiT) and 4DGS. This mechanism consists of Spatio-Temporal Sub-Grid Attention (STGA) for locally consistent cross-view and temporal fusion, and Context Token Propagation (CTP) for global propagation via token inheritance and optical-flow-guided token replacement. Together, these components allow Dynamic-eDiTor to perform seamless, globally consistent multi-view video without additional training and directly optimize pre-trained source 4DGS. Extensive experiments on multi-view video dataset DyNeRF demonstrate that our method achieves superior editing fidelity and both multi-view and temporal consistency prior approaches. Project page for results and code: https://di-lee.github.io/dynamic-eDiTor/

[56] Silhouette-based Gait Foundation Model cs.CVPDF

Dingqiang Ye, Chao Fan, Kartik Narayan, Bingzhe Wu, Chengwen Luo

TL;DR: 该论文提出了一种名为FoundationGait的基于轮廓的步态基础模型，它是首个可扩展、自监督的预训练框架，解决了步态模型在可扩展性和泛化性上的长期问题。其最大版本拥有近0.13亿参数，并在包含200多万步行序列的12个公共步态数据集上进行预训练。实验表明，该模型在多种任务中表现优异，尤其在零样本步态识别中取得了里程碑式的成果。

Details

Motivation: 当前步态模型存在无法扩展和泛化的问题，尤其是在小规模和孤立任务中表现不佳。论文旨在通过一个统一的预训练框架解决这些问题，以支持多样化的步态任务。

Result: 在Gait3D和OU-MVLP数据集上分别取得了48.0%和64.5%的零样本识别准确率，展现了模型的强大泛化能力。

Insight: 大规模的预训练和统一框架设计是多任务步态分析的关键。FoundationGait的成功表明，基础模型可以显著提升步态任务的性能。

Abstract: Gait patterns play a critical role in human identification and healthcare analytics, yet current progress remains constrained by small, narrowly designed models that fail to scale or generalize. Building a unified gait foundation model requires addressing two longstanding barriers: (a) Scalability. Why have gait models historically failed to follow scaling laws? (b) Generalization. Can one model serve the diverse gait tasks that have traditionally been studied in isolation? We introduce FoundationGait, the first scalable, self-supervised pretraining framework for gait understanding. Its largest version has nearly 0.13 billion parameters and is pretrained on 12 public gait datasets comprising over 2 million walking sequences. Extensive experiments demonstrate that FoundationGait, with or without fine-tuning, performs robustly across a wide spectrum of gait datasets, conditions, tasks (e.g., human identification, scoliosis screening, depression prediction, and attribute estimation), and even input modality. Notably, it achieves 48.0% zero-shot rank-1 accuracy on the challenging in-the-wild Gait3D dataset (1,000 test subjects) and 64.5% on the largest in-the-lab OU-MVLP dataset (5,000+ test subjects), setting a new milestone in robust gait recognition. Coming code and model: https://github.com/ShiqiYu/OpenGait.

[57] Affordance-First Decomposition for Continual Learning in Video-Language Understanding cs.CVPDF

Mengzhu Xu, Hanzhi Liu, Ningkang Peng, Qianyu Chen, Canran Xiao

TL;DR: 该论文提出了Affordance-First Decomposition (AFD)方法，用于视频-语言理解的持续学习，通过在共享的时间对齐基板上映射缓慢变化的affordance tokens，并结合轻量级的调度器实现适应性和容量的动态增长，取得了多项任务的领先性能。

Details

Motivation: 现有方法在持续学习中难以明确区分稳定性与适应性，且通常依赖静态路由或重放历史数据，这可能受限于内存和隐私约束。AFD旨在在现实约束下显式指定稳定性和适应性的边界。

Result: AFD在多个协议中表现优异，例如在VideoQA中平均准确率达51.6%，遗忘率仅为-1.8%。

Insight: AFD通过显式分解稳定性和适应性，提供了一种可解释的持续学习方法，适用于视频-语言任务中的非平稳数据环境。

Abstract: Continual learning for video–language understanding is increasingly important as models face non-stationary data, domains, and query styles, yet prevailing solutions blur what should stay stable versus what should adapt, rely on static routing/capacity, or require replaying past videos. We aim to explicitly specify where stability lives and where plasticity should be focused under realistic memory and privacy constraints. We introduce Affordance-First Decomposition (AFD): videos are mapped to slowly varying affordance tokens that form a shared, time-aligned substrate, while a lightweight, query-routed, conflict-aware scheduler concentrates adaptation and grows capacity only when needed. The substrate is stabilized via weak alignment and teacher consistency, and training uses question-only replay. AFD achieves state-of-the-art across protocols: 51.6% average accuracy with -1.8% forgetting on domain-incremental VideoQA, ViLCo R@1@0.5 of 29.6% (MQ) and 20.7% (NLQ) with 18.4% stAP@0.25 (VQ), and 39.5% accuracy with -1.6% forgetting on time-incremental iVQA. Overall, AFD offers an explicit, interpretable split between a stable interaction-centered substrate and targeted adaptation.

[58] Optimizing LVLMs with On-Policy Data for Effective Hallucination Mitigation cs.CV | cs.AIPDF

Chengzhi Yu, Yifan Xu, Yifan Chen, Wenyi Zhang

TL;DR: 该论文提出了一种通过策略数据优化大型视觉语言模型（LVLMs）的方法，以减少幻觉问题。作者首先证明了策略数据优于非策略数据，并提出了一种幻觉分类器来确保数据质量，同时设计了一种动态样本重加权的DPO算法，显著降低了幻觉率。

Details

Motivation: LVLMs在多模态任务中表现出色，但幻觉问题仍是一个关键挑战。作者希望通过优化数据生成和训练方法，有效减轻幻觉现象。

Result: 实验表明，该方法在多个基准测试中表现优异，将LLaVA-1.5-7B的幻觉率降低了50.8%，并让LLaVA-1.5-13B的性能超越了GPT-4V。

Insight: 策略数据在幻觉缓解中至关重要，而动态样本重加权进一步提升了模型性能；同时，分类器的引入有助于从源头减少幻觉现象。

Abstract: Recently, large vision-language models (LVLMs) have risen to be a promising approach for multimodal tasks. However, principled hallucination mitigation remains a critical challenge.In this work, we first analyze the data generation process in LVLM hallucination mitigation and affirm that on-policy data significantly outperforms off-policy data, which thus calls for efficient and reliable preference annotation of on-policy data. We then point out that, existing annotation methods introduce additional hallucination in training samples, which may enhance the model’s hallucination patterns, to address this problem, we propose training a hallucination classifier giving binary annotations, which guarantee clean chosen samples for the subsequent alignment. To further harness of the power of on-policy data, we design a robust iterative direct preference optimization (DPO) algorithm adopting a dynamic sample reweighting scheme. We conduct comprehensive experiments on three benchmarks with comparison to 8 state-of-the-art baselines. In particular, our approach reduces the hallucination rate of LLaVA-1.5-7B on MMHalBench by 50.8% and the average hallucination rate on Object HalBench by 79.5%; more significantly, our method fully taps into the potential of open-source models, enabling LLaVA-1.5-13B to even surpass the performance of GPT-4V.

[59] Deep Learning-Based Computer Vision Models for Early Cancer Detection Using Multimodal Medical Imaging and Radiogenomic Integration Frameworks cs.CV | cs.AIPDF

Emmanuella Avwerosuoghene Oghenekaro

TL;DR: 论文提出了一种基于深度学习的计算机视觉模型，通过多模态医学影像和放射基因组学整合框架实现早期癌症检测。这些模型能够自动提取复杂的空间、形态和时间模式，超越了传统的放射学评估方法。

Details

Motivation: 早期癌症检测是现代医疗中的关键挑战，延迟诊断会显著降低生存率。深度学习的进展为医学影像分析带来了变革性进步。

Result: 模型能够识别人眼无法察觉的细微组织异常和肿瘤微环境变化，实现了更高的检测精度。

Insight: 多模态影像与放射基因组学的融合为个性化肿瘤学开辟了新方向，推动了无创诊断的发展。

Abstract: Early cancer detection remains one of the most critical challenges in modern healthcare, where delayed diagnosis significantly reduces survival outcomes. Recent advancements in artificial intelligence, particularly deep learning, have enabled transformative progress in medical imaging analysis. Deep learning-based computer vision models, such as convolutional neural networks (CNNs), transformers, and hybrid attention architectures, can automatically extract complex spatial, morphological, and temporal patterns from multimodal imaging data including MRI, CT, PET, mammography, histopathology, and ultrasound. These models surpass traditional radiological assessment by identifying subtle tissue abnormalities and tumor microenvironment variations invisible to the human eye. At a broader scale, the integration of multimodal imaging with radiogenomics linking quantitative imaging features with genomics, transcriptomics, and epigenetic biomarkers has introduced a new paradigm for personalized oncology. This radiogenomic fusion allows the prediction of tumor genotype, immune response, molecular subtypes, and treatment resistance without invasive biopsies.

[60] RS-ISRefiner: Towards Better Adapting Vision Foundation Models for Interactive Segmentation of Remote Sensing Images cs.CVPDF

Deliang Wang, Peng Liu

TL;DR: RS-ISRefiner提出了一种基于点击的交互式图像分割框架，专为遥感图像设计，通过适配器调优策略和混合注意力机制，提升了分割精度和效率。

Details

Motivation: 现有交互式图像分割方法主要针对自然图像，难以泛化到遥感领域，主要由于数据标注有限且计算开销大。

Result: 在6个遥感数据集上验证，表现优于现有方法，兼顾分割精度、效率和交互成本。

Insight: 适配器调优和混合注意力机制的结合，为解决遥感图像分割的挑战提供了高效且可泛化的解决方案。

Abstract: Interactive image segmentation(IIS) plays a critical role in generating precise annotations for remote sensing imagery, where objects often exhibit scale variations, irregular boundaries and complex backgrounds. However, existing IIS methods, primarily designed for natural images, struggle to generalize to remote sensing domains due to limited annotated data and computational overhead. To address these challenges, we proposed RS-ISRefiner, a novel click-based IIS framework tailored for remote sensing images. The framework employs an adapter-based tuning strategy that preserves the general representations of Vision Foundation Models while enabling efficient learning of remote sensing-specific spatial and boundary characteristics. A hybrid attention mechanism integrating convolutional local modeling with Transformer-based global reasoning enhances robustness against scale diversity and scene complexity. Furthermore, an improved probability map modulation scheme effectively incorporates historical user interactions, yielding more stable iterative refinement and higher boundary fidelity. Comprehensive experiments on six remote sensing datasets, including iSAID, ISPRS Potsdam, SandBar, NWPU, LoveDA Urban and WHUBuilding, demonstrate that RS-ISRefiner consistently outperforms state-of-the-art IIS methods in terms of segmentation accuracy, efficiency and interaction cost. These results confirm the effectiveness and generalizability of our framework, making it highly suitable for high-quality instance segmentation in practical remote sensing scenarios.

[61] TrajDiff: End-to-end Autonomous Driving without Perception Annotation cs.CV | cs.ROPDF

Xingtai Gui, Jianbo Zhao, Wencheng Han, Jikai Wang, Jiahao Gong

TL;DR: 论文提出了TrajDiff，一种无需感知标注的端到端自动驾驶生成方法，通过轨迹导向的BEV扩散框架直接生成多样化且合理的轨迹，避免了手工运动先验的需求。

Details

Motivation: 由于手工感知标注成本高昂，开发无需感知标注的规划方法变得至关重要。当前端到端自动驾驶系统依赖感知标注辅助规划，TrajDiff旨在解决这一问题。

Result: 在NAVSIM基准测试中取得87.5 PDMS，优于所有无需标注方法；通过数据扩展提升至88.5 PDMS，接近基于感知的高级方法。

Insight: 1. 无需感知标注的方法可以减少开发成本；2. 扩散模型适合生成多样化轨迹；3. 数据扩展对无需标注方法的效果有显著提升。

Abstract: End-to-end autonomous driving systems directly generate driving policies from raw sensor inputs. While these systems can extract effective environmental features for planning, relying on auxiliary perception tasks, developing perception annotation-free planning paradigms has become increasingly critical due to the high cost of manual perception annotation. In this work, we propose TrajDiff, a Trajectory-oriented BEV Conditioned Diffusion framework that establishes a fully perception annotation-free generative method for end-to-end autonomous driving. TrajDiff requires only raw sensor inputs and future trajectory, constructing Gaussian BEV heatmap targets that inherently capture driving modalities. We design a simple yet effective trajectory-oriented BEV encoder to extract the TrajBEV feature without perceptual supervision. Furthermore, we introduce Trajectory-oriented BEV Diffusion Transformer (TB-DiT), which leverages ego-state information and the predicted TrajBEV features to directly generate diverse yet plausible trajectories, eliminating the need for handcrafted motion priors. Beyond architectural innovations, TrajDiff enables exploration of data scaling benefits in the annotation-free setting. Evaluated on the NAVSIM benchmark, TrajDiff achieves 87.5 PDMS, establishing state-of-the-art performance among all annotation-free methods. With data scaling, it further improves to 88.5 PDMS, which is comparable to advanced perception-based approaches. Our code and model will be made publicly available.

[62] Multi-GRPO: Multi-Group Advantage Estimation for Text-to-Image Generation with Tree-Based Trajectories and Multiple Rewards cs.CVPDF

Qiang Lyu, Zicong Chen, Chongxiao Wang, Haolin Shi, Shibo Gao

TL;DR: Multi-GRPO提出了一种多组优势估计框架，通过树基轨迹和多奖励分组机制优化文本到图像生成模型，解决了现有GRPO方法的共享信用分配和奖励混合问题，提升了稳定性和对齐性能。

Details

Motivation: 现有GRPO方法在文本到图像生成中存在共享信用分配不准确（轨迹优势统一分配）和奖励混合不稳定的问题（多目标奖励权重预定义导致梯度冲突），影响了模型的对齐效果。

Result: Multi-GRPO在单奖励和多目标基准测试中表现优越，平衡了冲突目标，提升了稳定性和对齐性能。

Insight: 时间分组和奖励分组的正交设计有效解决了信用分配和梯度冲突问题，为多目标优化提供了一种新思路。

Abstract: Recently, Group Relative Policy Optimization (GRPO) has shown promising potential for aligning text-to-image (T2I) models, yet existing GRPO-based methods suffer from two critical limitations. (1) \textit{Shared credit assignment}: trajectory-level advantages derived from group-normalized sparse terminal rewards are uniformly applied across timesteps, failing to accurately estimate the potential of early denoising steps with vast exploration spaces. (2) \textit{Reward-mixing}: predefined weights for combining multi-objective rewards (e.g., text accuracy, visual quality, text color)–which have mismatched scales and variances–lead to unstable gradients and conflicting updates. To address these issues, we propose \textbf{Multi-GRPO}, a multi-group advantage estimation framework with two orthogonal grouping mechanisms. For better credit assignment, we introduce tree-based trajectories inspired by Monte Carlo Tree Search: branching trajectories at selected early denoising steps naturally forms \emph{temporal groups}, enabling accurate advantage estimation for early steps via descendant leaves while amortizing computation through shared prefixes. For multi-objective optimization, we introduce \emph{reward-based grouping} to compute advantages for each reward function \textit{independently} before aggregation, disentangling conflicting signals. To facilitate evaluation of multiple objective alignment, we curate \textit{OCR-Color-10}, a visual text rendering dataset with explicit color constraints. Across the single-reward \textit{PickScore-25k} and multi-objective \textit{OCR-Color-10} benchmarks, Multi-GRPO achieves superior stability and alignment performance, effectively balancing conflicting objectives. Code will be publicly available at \href{https://github.com/fikry102/Multi-GRPO}{https://github.com/fikry102/Multi-GRPO}.

[63] Joint Multi-scale Gated Transformer and Prior-guided Convolutional Network for Learned Image Compression cs.CVPDF

Zhengxin Chen, Xiaohai He, Tingrong Zhang, Shuhua Xiong, Chao Ren

TL;DR: 这篇论文提出了一种联合多尺度门控Transformer和先验引导卷积网络（MGTPCN）的深度学习图像压缩方法，通过改进局部和非局部特征提取能力，显著提升了压缩性能。

Details

Motivation: 传统图像压缩方法（如VVC）已被基于学习的图像压缩方法超越，但进一步提升性能需要更强的非线性变换编码能力。卷积层和Transformer块的特征提取能力是关键，因此论文旨在改进它们的局部和非局部特征提取能力。

Result: 实验表明，MGTPCN在性能和复杂性方面超越了现有最优算法。

Insight: 结合卷积和Transformer的优势，并通过多尺度和门控机制增强特征提取能力，是提升图像压缩性能的有效方法。

Abstract: Recently, learned image compression methods have made remarkable achievements, some of which have outperformed the traditional image codec VVC. The advantages of learned image compression methods over traditional image codecs can be largely attributed to their powerful nonlinear transform coding. Convolutional layers and shifted window transformer (Swin-T) blocks are the basic units of neural networks, and their representation capabilities play an important role in nonlinear transform coding. In this paper, to improve the ability of the vanilla convolution to extract local features, we propose a novel prior-guided convolution (PGConv), where asymmetric convolutions (AConvs) and difference convolutions (DConvs) are introduced to strengthen skeleton elements and extract high-frequency information, respectively. A re-parameterization strategy is also used to reduce the computational complexity of PGConv. Moreover, to improve the ability of the Swin-T block to extract non-local features, we propose a novel multi-scale gated transformer (MGT), where dilated window-based multi-head self-attention blocks with different dilation rates and depth-wise convolution layers with different kernel sizes are used to extract multi-scale features, and a gate mechanism is introduced to enhance non-linearity. Finally, we propose a novel joint Multi-scale Gated Transformer and Prior-guided Convolutional Network (MGTPCN) for learned image compression. Experimental results show that our MGTPCN surpasses state-of-the-art algorithms with a better trade-off between performance and complexity.

[64] Seeing the Wind from a Falling Leaf cs.CVPDF

Zhiyuan Gao, Jiageng Mao, Hong-Xing Yu, Haozhe Lou, Emily Yue-Ting Jia

TL;DR: 论文提出了一种端到端可微的逆向图形框架，从视频中恢复隐形的力场，如通过观察落叶估计风场。

Details

Motivation: 计算机视觉中长期以来目标是建模运动，但导致运动的隐形物理交互（如风力）尚未充分探索。

Result: 在合成和真实场景中验证了方法，能够从视频中推断出合理的力场，并展示了物理驱动的视频生成与编辑应用。

Insight: 该方法为理解像素背后的物理过程提供了新思路，填补了视觉与物理学之间的鸿沟。

Abstract: A longstanding goal in computer vision is to model motions from videos, while the representations behind motions, i.e. the invisible physical interactions that cause objects to deform and move, remain largely unexplored. In this paper, we study how to recover the invisible forces from visual observations, e.g., estimating the wind field by observing a leaf falling to the ground. Our key innovation is an end-to-end differentiable inverse graphics framework, which jointly models object geometry, physical properties, and interactions directly from videos. Through backpropagation, our approach enables the recovery of force representations from object motions. We validate our method on both synthetic and real-world scenarios, and the results demonstrate its ability to infer plausible force fields from videos. Furthermore, we show the potential applications of our approach, including physics-based video generation and editing. We hope our approach sheds light on understanding and modeling the physical process behind pixels, bridging the gap between vision and physics. Please check more video results in our \href{https://chaoren2357.github.io/seeingthewind/}{project page}.

[65] The Outline of Deception: Physical Adversarial Attacks on Traffic Signs Using Edge Patches cs.CVPDF

Haojie Jia, Te Hu, Haowen Li, Long Jin, Chongshi Xin

TL;DR: 论文提出了TESP-Attack方法，通过边缘对齐的对抗性补丁攻击交通标志分类模型，优化了隐蔽性和攻击效果。

Details

Motivation: 现有交通标志的物理对抗攻击缺乏隐蔽性，容易被人类察觉，限制了实际应用。

Result: 攻击成功率超过90%，展示了较强的跨模型迁移性和实际应用稳定性。

Insight: 人类视觉注意力集中于交通标志中心区域的特点可被利用以设计更隐蔽的攻击方法。

Abstract: Intelligent driving systems are vulnerable to physical adversarial attacks on traffic signs. These attacks can cause misclassification, leading to erroneous driving decisions that compromise road safety. Moreover, within V2X networks, such misinterpretations can propagate, inducing cascading failures that disrupt overall traffic flow and system stability. However, a key limitation of current physical attacks is their lack of stealth. Most methods apply perturbations to central regions of the sign, resulting in visually salient patterns that are easily detectable by human observers, thereby limiting their real-world practicality. This study proposes TESP-Attack, a novel stealth-aware adversarial patch method for traffic sign classification. Based on the observation that human visual attention primarily focuses on the central regions of traffic signs, we employ instance segmentation to generate edge-aligned masks that conform to the shape characteristics of the signs. A U-Net generator is utilized to craft adversarial patches, which are then optimized through color and texture constraints along with frequency domain analysis to achieve seamless integration with the background environment, resulting in highly effective visual concealment. The proposed method demonstrates outstanding attack success rates across traffic sign classification models with varied architectures, achieving over 90% under limited query budgets. It also exhibits strong cross-model transferability and maintains robust real-world performance that remains stable under varying angles and distances.

[66] EAG3R: Event-Augmented 3D Geometry Estimation for Dynamic and Extreme-Lighting Scenes cs.CV | cs.AIPDF

Xiaoshan Wu, Yifei Yu, Xiaoyang Lyu, Yihua Huang, Bo Wang

TL;DR: EAG3R提出了一种结合事件相机数据的三维几何估计框架，显著提升了动态环境和极端光照条件下的重建效果。

Details

Motivation: 传统RGB相机在动态物体和极端光照条件下表现不佳，EAG3R通过引入异步事件流增强几何估计的鲁棒性。

Result: 在单目深度估计、相机姿态跟踪和动态重建任务中，EAG3R显著优于现有RGB-only方法。

Insight: 事件相机数据能够有效弥补RGB相机在动态和低光环境下的不足，提升三维重建的鲁棒性和准确性。

Abstract: Robust 3D geometry estimation from videos is critical for applications such as autonomous navigation, SLAM, and 3D scene reconstruction. Recent methods like DUSt3R demonstrate that regressing dense pointmaps from image pairs enables accurate and efficient pose-free reconstruction. However, existing RGB-only approaches struggle under real-world conditions involving dynamic objects and extreme illumination, due to the inherent limitations of conventional cameras. In this paper, we propose EAG3R, a novel geometry estimation framework that augments pointmap-based reconstruction with asynchronous event streams. Built upon the MonST3R backbone, EAG3R introduces two key innovations: (1) a retinex-inspired image enhancement module and a lightweight event adapter with SNR-aware fusion mechanism that adaptively combines RGB and event features based on local reliability; and (2) a novel event-based photometric consistency loss that reinforces spatiotemporal coherence during global optimization. Our method enables robust geometry estimation in challenging dynamic low-light scenes without requiring retraining on night-time data. Extensive experiments demonstrate that EAG3R significantly outperforms state-of-the-art RGB-only baselines across monocular depth estimation, camera pose tracking, and dynamic reconstruction tasks.

[67] DEJIMA: A Novel Large-scale Japanese Dataset for Image Captioning and Visual Question Answering cs.CVPDF

Toshiki Katsube, Taiga Fukuhara, Kenichiro Ando, Yusuke Mukuta, Kohei Uehara

TL;DR: DEJIMA是一个针对日语视觉与语言建模的大规模数据集，填补了高质量日语资源的空白，包含3.88M图像-文本对，并通过严格的数据处理和LLM优化提升了数据的质量和文化代表性。

Details

Motivation: 解决日语视觉与语言建模中高质量、大规模数据集稀缺的问题，并提升模型在日语和文化背景下的表现。

Result: DEJIMA在日语语言自然性和文化代表性上优于翻译或人工标注的数据集，同时在多模态基准测试中显著提升了模型性能。

Insight: 文化背景和语言自然性是构建高质量视觉与语言数据集的关键因素，DEJIMA的成功验证了这一点。

Abstract: This work addresses the scarcity of high-quality, large-scale resources for Japanese Vision-and-Language (V&L) modeling. We present a scalable and reproducible pipeline that integrates large-scale web collection with rigorous filtering/deduplication, object-detection-driven evidence extraction, and Large Language Model (LLM)-based refinement under grounding constraints. Using this pipeline, we build two resources: an image-caption dataset (DEJIMA-Cap) and a VQA dataset (DEJIMA-VQA), each containing 3.88M image-text pairs, far exceeding the size of existing Japanese V&L datasets. Human evaluations demonstrate that DEJIMA achieves substantially higher Japaneseness and linguistic naturalness than datasets constructed via translation or manual annotation, while maintaining factual correctness at a level comparable to human-annotated corpora. Quantitative analyses of image feature distributions further confirm that DEJIMA broadly covers diverse visual domains characteristic of Japan, complementing its linguistic and cultural representativeness. Models trained on DEJIMA exhibit consistent improvements across multiple Japanese multimodal benchmarks, confirming that culturally grounded, large-scale resources play a key role in enhancing model performance. All data sources and modules in our pipeline are licensed for commercial use, and we publicly release the resulting dataset and metadata to encourage further research and industrial applications in Japanese V&L modeling.

[68] PolarGS: Polarimetric Cues for Ambiguity-Free Gaussian Splatting with Accurate Geometry Recovery cs.CVPDF

Bo Guo, Sijia Wen, Yifan Zhao, Jia Li, Zhiming Zheng

TL;DR: PolarGS是一种基于偏振光学的3D高斯泼溅扩展方法，通过偏振线索解决光测模糊问题，提升几何重建精度。

Details

Motivation: 传统3D高斯泼溅（3DGS）在反射和无纹理区域因光测模糊导致几何重建精度下降。偏振光能揭示表面方向，可作为光测线索的补充。

Result: PolarGS在几何重建精度上优于现有方法，且框架无关。

Insight: 偏振光作为光学先验，能有效解决光测模糊问题，适合复杂场景的几何重建。

Abstract: Recent advances in surface reconstruction for 3D Gaussian Splatting (3DGS) have enabled remarkable geometric accuracy. However, their performance degrades in photometrically ambiguous regions such as reflective and textureless surfaces, where unreliable cues disrupt photometric consistency and hinder accurate geometry estimation. Reflected light is often partially polarized in a manner that reveals surface orientation, making polarization an optic complement to photometric cues in resolving such ambiguities. Therefore, we propose PolarGS, an optics-aware extension of RGB-based 3DGS that leverages polarization as an optical prior to resolve photometric ambiguities and enhance reconstruction accuracy. Specifically, we introduce two complementary modules: a polarization-guided photometric correction strategy, which ensures photometric consistency by identifying reflective regions via the Degree of Linear Polarization (DoLP) and refining reflective Gaussians with Color Refinement Maps; and a polarization-enhanced Gaussian densification mechanism for textureless area geometry recovery, which integrates both Angle and Degree of Linear Polarization (A/DoLP) into a PatchMatch-based depth completion process. This enables the back-projection and fusion of new Gaussians, leading to more complete reconstruction. PolarGS is framework-agnostic and achieves superior geometric accuracy compared to state-of-the-art methods.

[69] CircleFlow: Flow-Guided Camera Blur Estimation using a Circle Grid Target cs.CVPDF

Jiajian He, Enjie Hu, Shiqi Chen, Tianchen Qiu, Huajun Feng

TL;DR: CircleFlow提出了一种基于圆形网格目标的高精度相机模糊估计框架，通过光流引导边缘定位和能量约束的隐式神经表示，解决了PSF估计的挑战性问题。

Details

Motivation: 相机模糊的点扩散函数（PSF）估算对光学表征和计算视觉至关重要，但由于其内在模糊性和逆问题的复杂性，传统方法难以实现高精度。

Result: 在仿真和真实数据上的实验表明，CircleFlow在PSF估计的精度和可靠性上达到了最先进水平。

Insight: CircleFlow通过结合结构化目标和可微分优化，解决了PSF估计的逆问题，为相机模糊的精确表征提供了新思路。

Abstract: The point spread function (PSF) serves as a fundamental descriptor linking the real-world scene to the captured signal, manifesting as camera blur. Accurate PSF estimation is crucial for both optical characterization and computational vision, yet remains challenging due to the inherent ambiguity and the ill-posed nature of intensity-based deconvolution. We introduce CircleFlow, a high-fidelity PSF estimation framework that employs flow-guided edge localization for precise blur characterization. CircleFlow begins with a structured capture that encodes locally anisotropic and spatially varying PSFs by imaging a circle grid target, while leveraging the target’s binary luminance prior to decouple image and kernel estimation. The latent sharp image is then reconstructed through subpixel alignment of an initialized binary structure guided by optical flow, whereas the PSF is modeled as an energy-constrained implicit neural representation. Both components are jointly optimized within a demosaicing-aware differentiable framework, ensuring physically consistent and robust PSF estimation enabled by accurate edge localization. Extensive experiments on simulated and real-world data demonstrate that CircleFlow achieves state-of-the-art accuracy and reliability, validating its effectiveness for practical PSF calibration.

[70] Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding cs.CVPDF

Pengfei Hu, Meng Cao, Yingyao Wang, Yi Wang, Jiahua Dong

TL;DR: 论文提出了一种名为SpecTemp的强化学习框架，用于高效的长视频理解，通过解耦时间感知与推理，利用双模型设计显著提高了推理效率。

Details

Motivation: 长视频理解需要处理冗余的多模态上下文，现有的思维框架方法存在效率瓶颈。

Result: 在多个视频理解基准测试中，SpecTemp保持了高准确性，同时显著加速了推理过程。

Insight: 双模型协作设计模仿了人脑的工作方式，平衡了效率与准确性，为长视频理解提供了一种新思路。

Abstract: Long video understanding is essential for human-like intelligence, enabling coherent perception and reasoning over extended temporal contexts. While the emerging thinking-with-frames paradigm, which alternates between global temporal reasoning and local frame examination, has advanced the reasoning capabilities of video multi-modal large language models (MLLMs), it suffers from a significant efficiency bottleneck due to the progressively growing and redundant multi-modal context. To address this, we propose SpecTemp, a reinforcement learning-based Speculative Temporal reasoning framework that decouples temporal perception from reasoning via a cooperative dual-model design. In SpecTemp, a lightweight draft MLLM rapidly explores and proposes salient frames from densely sampled temporal regions, while a powerful target MLLM focuses on temporal reasoning and verifies the draft’s proposals, iteratively refining its attention until convergence. This design mirrors the collaborative pathways of the human brain, balancing efficiency with accuracy. To support training, we construct the SpecTemp-80K dataset, featuring synchronized dual-level annotations for coarse evidence spans and fine-grained frame-level evidence. Experiments across multiple video understanding benchmarks demonstrate that SpecTemp not only maintains competitive accuracy but also significantly accelerates inference compared with existing thinking-with-frames methods.

[71] IRPO: Boosting Image Restoration via Post-training GRPO cs.CVPDF

Haoxuan Xu. Yi Liu, Boyuan Jiang, Jinlong Peng, Donghao Luo, Xiaobin Hu

TL;DR: IRPO提出了一种基于GRPO的后训练范式，通过数据选择和奖励建模提升图像复原任务的效果，优于现有基线方法。

Details

Motivation: 现有图像复原方法依赖像素级硬拟合，导致过度平滑和泛化能力差。IRPO旨在通过后训练范式解决这些问题。

Result: 在6个域内和5个域外任务中表现最佳，比AdaIR基线分别提升0.83 dB和3.43 dB。

Insight: 选择预训练阶段的低效样本和平衡多种奖励标准能显著提升图像复原任务的性能。

Abstract: Recent advances in post-training paradigms have achieved remarkable success in high-level generation tasks, yet their potential for low-level vision remains rarely explored. Existing image restoration (IR) methods rely on pixel-level hard-fitting to ground-truth images, struggling with over-smoothing and poor generalization. To address these limitations, we propose IRPO, a low-level GRPO-based post-training paradigm that systematically explores both data formulation and reward modeling. We first explore a data formulation principle for low-level post-training paradigm, in which selecting underperforming samples from the pre-training stage yields optimal performance and improved efficiency. Furthermore, we model a reward-level criteria system that balances objective accuracy and human perceptual preference through three complementary components: a General Reward for structural fidelity, an Expert Reward leveraging Qwen-VL for perceptual alignment, and a Restoration Reward for task-specific low-level quality. Comprehensive experiments on six in-domain and five out-of-domain (OOD) low-level benchmarks demonstrate that IRPO achieves state-of-the-art results across diverse degradation types, surpassing the AdaIR baseline by 0.83 dB on in-domain tasks and 3.43 dB on OOD settings. Our code can be shown in https://github.com/HaoxuanXU1024/IRPO.

[72] PanFlow: Decoupled Motion Control for Panoramic Video Generation cs.CVPDF

Cheng Zhang, Hanwen Liang, Donny Y. Chen, Qianyi Wu, Konstantinos N. Plataniotis

TL;DR: PanFlow提出了一种新颖的全景视频生成方法，通过解耦相机旋转和输入光流条件，实现了对大动态运动的精确控制，并在运动保真度、视觉质量和时间一致性上显著优于现有方法。

Details

Motivation: 全景视频生成在虚拟现实和沉浸式媒体中有广泛应用，但现有方法缺乏明确的运动控制，难以处理大动态和复杂场景的运动。

Result: 实验表明PanFlow在运动保真度、视觉质量和时间一致性上显著优于现有方法。

Insight: 解耦相机运动与其他动态因素是提升全景视频生成质量的关键，球形变换策略为全景边界运动一致性提供了新思路。

Abstract: Panoramic video generation has attracted growing attention due to its applications in virtual reality and immersive media. However, existing methods lack explicit motion control and struggle to generate scenes with large and complex motions. We propose PanFlow, a novel approach that exploits the spherical nature of panoramas to decouple the highly dynamic camera rotation from the input optical flow condition, enabling more precise control over large and dynamic motions. We further introduce a spherical noise warping strategy to promote loop consistency in motion across panorama boundaries. To support effective training, we curate a large-scale, motion-rich panoramic video dataset with frame-level pose and flow annotations. We also showcase the effectiveness of our method in various applications, including motion transfer and video editing. Extensive experiments demonstrate that PanFlow significantly outperforms prior methods in motion fidelity, visual quality, and temporal coherence. Our code, dataset, and models are available at https://github.com/chengzhag/PanFlow.

[73] AFRAgent : An Adaptive Feature Renormalization Based High Resolution Aware GUI agent cs.CVPDF

Neeraj Anand, Rishabh Jain, Sohan Patnaik, Balaji Krishnamurthy, Mausoom Sarkar

TL;DR: AFRAgent是一种基于自适应特征重归一化的高分辨率感知GUI代理，通过改进视觉编码器特征的空间信息，实现了在移动UI自动化任务中的高效表现，同时模型尺寸仅为竞争对手的四分之一。

Details

Motivation: 移动用户界面自动化需求日益增长，现有视觉语言模型（VLM）在GUI自动化中存在空间信息不足和模型过大的问题，AFRAgent旨在解决这些挑战。

Result: 在Meta-GUI和AITW基准测试中，AFRAgent建立了新的最先进基线。

Insight: 自适应特征重归一化技术可以显著提升视觉语言模型在高分辨率细节和空间信息上的表现，同时保持模型轻量化。

Abstract: There is a growing demand for mobile user interface (UI) automation, driven by its broad applications across industries. With the advent of visual language models (VLMs), GUI automation has progressed from generating text-based instructions for humans to autonomously executing tasks, thus optimizing automation workflows. Recent approaches leverage VLMs for this problem due to their ability to 1) process on-screen content directly, 2) remain independent of device-specific APIs by utilizing human actions (e.g., clicks, typing), and 3) apply real-world contextual knowledge for task understanding. However, these models often have trouble accurately identifying widgets and determining actions due to limited spatial information in vision encoder features. Additionally, top-performing models are often large, requiring extensive training and resulting in inference delays. In this work, we introduce AFRAgent, an instruct-BLIP-based multimodal architecture that achieves superior performance in GUI automation while being less than one-fourth the size of its nearest competitor. To enhance image embeddings in the large language model (LLM) pipeline, we propose an adaptive feature renormalization-based (a token-level affine transformation) technique that effectively enriches low-resolution image embeddings and fuses high-resolution details. We evaluate AFRAgent on Meta-GUI and AITW benchmarks, establishing a new state-of-the-art baseline for smartphone automation.

[74] TAP-CT: 3D Task-Agnostic Pretraining of Computed Tomography Foundation Models cs.CV | cs.AIPDF

Tim Veenboer, George Yiasemis, Eric Marcus, Vivien Van Veldhuizen, Cees G. M. Snoek

TL;DR: TAP-CT提出了一种任务无关的3D CT基础模型预训练方法，通过改进ViT和DINOv2适配3D体积数据，实现大规模自监督预训练。

Details

Motivation: 现有医学基础模型需大量微调或依赖资源密集型解码器，预训练目标偏向特定任务，因此需要一种强任务无关模型。

Result: 模型生成稳定且鲁棒的冻结表示，泛化性能强于下游任务，无需大量微调。

Insight: 3D任务无关预训练可通过架构适配和自监督学习有效捕获CT体积特征，减少下游任务依赖。

Abstract: Existing foundation models (FMs) in the medical domain often require extensive fine-tuning or rely on training resource-intensive decoders, while many existing encoders are pretrained with objectives biased toward specific tasks. This illustrates a need for a strong, task-agnostic foundation model that requires minimal fine-tuning beyond feature extraction. In this work, we introduce a suite of task-agnostic pretraining of CT foundation models (TAP-CT): a simple yet effective adaptation of Vision Transformers (ViTs) and DINOv2 for volumetric data, enabling scalable self-supervised pretraining directly on 3D CT volumes. Our approach incorporates targeted modifications to patch embeddings, positional encodings, and volumetric augmentations, making the architecture depth-aware while preserving the simplicity of the underlying architectures. We show that large-scale 3D pretraining on an extensive in-house CT dataset (105K volumes) yields stable, robust frozen representations that generalize strongly across downstream tasks. To promote transparency and reproducibility, and to establish a powerful, low-resource baseline for future research in medical imaging, we will release all pretrained models, experimental configurations, and downstream benchmark code at https://huggingface.co/fomofo/tap-ct-b-3d.

[75] Generative Adversarial Gumbel MCTS for Abstract Visual Composition Generation cs.CV | cs.AI | cs.CLPDF

Zirui Zhao, Boye Niu, David Hsu, Wee Sun Lee

TL;DR: 这篇论文提出了一个结合几何推理和神经语义的框架，用于生成抽象的视觉构图。通过AlphaGo风格的搜索确保可行性，并通过视觉语言模型评估语义对齐。该方法在Tangram Assembly任务中表现出色，显著优于扩散和自回归基线。

Details

Motivation: 抽象视觉构图需要在小几何基元的空间配置和关系下完成，传统方法难以处理组合放置选择和数据稀疏性问题，导致解空间稀疏。

Result: 在Tangram Assembly任务中，该方法在有效性和语义保真度上优于扩散和自回归基线，尤其在约束更严格时表现更佳。

Insight: 结合显式几何推理和神经语义的方法在稀疏解空间问题中表现优异，对抗奖励优化有助于生成更接近真实数据的结果。

Abstract: We study abstract visual composition, in which identity is primarily determined by the spatial configuration and relations among a small set of geometric primitives (e.g., parts, symmetry, topology). They are invariant primarily to texture and photorealistic detail. Composing such structures from fixed components under geometric constraints and vague goal specification (such as text) is non-trivial due to combinatorial placement choices, limited data, and discrete feasibility (overlap-free, allowable orientations), which create a sparse solution manifold ill-suited to purely statistical pixel-space generators. We propose a constraint-guided framework that combines explicit geometric reasoning with neural semantics. An AlphaGo-style search enforces feasibility, while a fine-tuned vision-language model scores semantic alignment as reward signals. Our algorithm uses a policy network as a heuristic in Monte-Carlo Tree Search and fine-tunes the network via search-generated plans. Inspired by the Generative Adversarial Network, we use the generated instances for adversarial reward refinement. Over time, the generation should approach the actual data more closely when the reward model cannot distinguish between generated instances and ground-truth. In the Tangram Assembly task, our approach yields higher validity and semantic fidelity than diffusion and auto-regressive baselines, especially as constraints tighten.

[76] Quantum-Inspired Spectral Geometry for Neural Operator Equivalence and Structured Pruning cs.CVPDF

Haijian Shao, Wei Liu, Xing Deng

TL;DR: 该论文提出了一种量子启发的几何框架，通过布洛赫超球上的归一化奇异值谱表示神经算子，证明了谱到功能的等价性定理，并提出了一种基于该度量的结构化剪枝方法QM-FRG。

Details

Motivation: 多模态智能在资源受限和异构硬件上的快速增长暴露了关键瓶颈，如多模态特征异构性、动态场景中的实时需求以及硬件特定的算子冗余。

Result: 仿真验证表明，所提出的度量方法在性能上优于基于幅值和随机基线的方法。大规模多模态变压器和异构硬件上的实验验证将在扩展版本中提供。

Insight: 量子启发的几何框架为跨模态和跨架构的算子替换提供了严格的理论基础，并通过结构化剪枝优化硬件性能。

Abstract: The rapid growth of multimodal intelligence on resource-constrained and heterogeneous domestic hardware exposes critical bottlenecks: multimodal feature heterogeneity, real-time requirements in dynamic scenarios, and hardware-specific operator redundancy. This work introduces a quantum-inspired geometric framework for neural operators that represents each operator by its normalized singular value spectrum on the Bloch hypersphere. We prove a tight spectral-to-functional equivalence theorem showing that vanishing Fubini–Study/Wasserstein-2 distance implies provable functional closeness, establishing the first rigorous foundation for cross-modal and cross-architecture operator substitutability. Based on this metric, we propose Quantum Metric-Driven Functional Redundancy Graphs (QM-FRG) and one-shot structured pruning. Controlled simulation validates the superiority of the proposed metric over magnitude and random baselines. An extensive experimental validation on large-scale multimodal transformers and domestic heterogeneous hardware (Huawei Ascend, Cambricon MLU, Kunlunxin) hardware is deferred to an extended journal version currently in preparation.

[77] Look, Recite, Then Answer: Enhancing VLM Performance via Self-Generated Knowledge Hints cs.CV | cs.AIPDF

Xisheng Feng

TL;DR: 该论文提出了一种名为‘Look, Recite, Then Answer’的框架，通过自生成知识提示增强视觉语言模型（VLM）的性能，解决了‘模态鸿沟’和‘推理驱动幻觉’问题。

Details

Motivation: 视觉语言模型在专业领域（如精准农业）表现不佳，主要因‘推理驱动幻觉’导致语言先验覆盖视觉感知，且视觉嵌入未能激活模型中已有的专家知识。

Result: 在AgroBench上实现了最先进的结果，杂草识别准确率比Qwen-VL提高23.6%，且无需外部搜索即可超越GPT-4o。

Insight: 模块化设计通过主动知识检索减轻了幻觉问题，将被动感知转化为可控的知识激活。

Abstract: Vision-Language Models (VLMs) exhibit significant performance plateaus in specialized domains like precision agriculture, primarily due to “Reasoning-Driven Hallucination” where linguistic priors override visual perception. A key bottleneck is the “Modality Gap”: visual embeddings fail to reliably activate the fine-grained expert knowledge already encoded in model parameters. We propose “Look, Recite, Then Answer,” a parameter-efficient framework that enhances VLMs via self-generated knowledge hints while keeping backbone models frozen. The framework decouples inference into three stages: (1) Look generates objective visual descriptions and candidate sets; (2) Recite employs a lightweight 1.7B router to transform visual cues into targeted queries that trigger candidate-specific parametric knowledge; (3) Answer performs parallel evidence alignment between descriptions and recited knowledge to select the most consistent label. On AgroBench, our method achieves state-of-the-art results, improving Weed Identification accuracy by 23.6% over Qwen-VL and surpassing GPT-4o without external search overhead. This modular design mitigates hallucinations by transforming passive perception into active, controllable knowledge retrieval

[78] HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics cs.CVPDF

Masatoshi Tateno, Gido Kato, Hirokatsu Kataoka, Yoichi Sato, Takuma Yagi

TL;DR: HanDyVQA是一个细粒度视频问答基准，专注于手-物交互的动态特性，涵盖操作和效果两方面，包含六类问题和11.1K问答对。现有视频基础模型表现不佳（73%准确率），彰显其挑战性。

Details

Motivation: 现有手-物交互基准或关注操作或关注效果，缺乏对动态特性的细粒度时空推理。HanDyVQA填补了这一空白，强调操纵风格、物体/手的运动以及部件级状态变化。

Result: 最佳模型Gemini-2.5-Pro仅达73%平均准确率，远低于人类（97%）。融入HOI线索的视觉特征可改进性能，但空间关系、运动和部件级理解仍是挑战。

Insight: 当前模型缺乏对HOI动态特性的细粒度理解。未来工作需整合显式HOI线索，并强化时空推理能力。

Abstract: Hand-object interaction (HOI) inherently involves dynamics where human manipulations produce distinct spatio-temporal effects on objects. However, existing semantic HOI benchmarks focused either on manipulation or on the resulting effects at a coarse level, lacking fine-grained spatio-temporal reasoning to capture the underlying dynamics in HOI. We introduce HanDyVQA, a fine-grained video question-answering benchmark that comprehensively covers both the manipulation and effect aspects of HOI. HanDyVQA comprises six complementary question types (Action, Process, Objects, Location, State Change, and Object Parts), totalling 11.1K multiple-choice QA pairs. Collected QA pairs recognizing manipulation styles, hand/object motions, and part-level state changes. HanDyVQA also includes 10.3K segmentation masks for Objects and Object Parts questions, enabling the evaluation of object/part-level reasoning in video object segmentation. We evaluated recent video foundation models on our benchmark and found that even the best-performing model, Gemini-2.5-Pro, reached only 73% average accuracy, which is far from human performance (97%). Further analysis shows the remaining challenges in spatial relationship, motion, and part-level geometric understanding. We also found that integrating explicit HOI-related cues into visual features improves performance, offering insights for developing future models with a deeper understanding of HOI dynamics.

[79] Multilingual Training-Free Remote Sensing Image Captioning cs.CVPDF

Carlos Rebelo, Gil Rocha, João Daniel Silva, Bruno Martins

TL;DR: 该论文提出了一种无需训练的多语言遥感图像描述方法，通过检索增强提示和图形重排策略，在多种语言中表现出色，并展示了直接生成目标语言描述的优势。

Details

Motivation: 现有遥感图像描述方法依赖大规模标注数据和英语，限制了全球适用性。论文旨在通过训练无关的多语言方法解决这一问题。

Result: 在四种基准数据集的十种语言上，该方法与全监督英语系统竞争，且直接生成目标语言描述优于翻译策略，PageRank提升性能达35%。

Insight: 1. VLM生成的描述视觉相关性高但词汇多样性大，LLM则在BLEU和CIDEr得分上更强。2. 直接生成目标语言描述优于翻译策略。

Abstract: Remote sensing image captioning has advanced rapidly through encoder–decoder models, although the reliance on large annotated datasets and the focus on English restricts global applicability. To address these limitations, we propose the first training-free multilingual approach, based on retrieval-augmented prompting. For a given aerial image, we employ a domain-adapted SigLIP2 encoder to retrieve related captions and few-shot examples from a datastore, which are then provided to a language model. We explore two variants: an image-blind setup, where a multilingual Large Language Model (LLM) generates the caption from textual prompts alone, and an image-aware setup, where a Vision–Language Model (VLM) jointly processes the prompt and the input image. To improve the coherence of the retrieved content, we introduce a graph-based re-ranking strategy using PageRank on a graph of images and captions. Experiments on four benchmark datasets across ten languages demonstrate that our approach is competitive with fully supervised English-only systems and generalizes to other languages. Results also highlight the importance of re-ranking with PageRank, yielding up to 35% improvements in performance metrics. Additionally, it was observed that while VLMs tend to generate visually grounded but lexically diverse captions, LLMs can achieve stronger BLEU and CIDEr scores. Lastly, directly generating captions in the target language consistently outperforms other translation-based strategies. Overall, our work delivers one of the first systematic evaluations of multilingual, training-free captioning for remote sensing imagery, advancing toward more inclusive and scalable multimodal Earth observation systems.

[80] Accelerating Streaming Video Large Language Models via Hierarchical Token Compression cs.CVPDF

Yiyu Wang, Xuyang Liu, Xiyan Gui, Xinying Lin, Boxue Yang

TL;DR: 论文提出了STC框架，通过分层令牌压缩加速流式视频大语言模型的处理，同时保留高准确率。

Details

Motivation: 流式视频大语言模型在处理连续视频流时，因视觉令牌处理的计算成本高和冗余帧处理效率低，导致实时部署困难。

Result: 实验显示，STC在多个基准测试中表现优异，ViT编码和LLM预填充延迟分别降低24.5%和45.3%，准确率保留99%。

Insight: 通过分层令牌压缩，可以在不显著损失性能的情况下显著提升流式视频大语言模型的处理效率。

Abstract: Streaming Video Large Language Models (VideoLLMs) have demonstrated impressive performance across various video understanding tasks, but they face significant challenges in real-time deployment due to the high computational cost of processing dense visual tokens from continuous video streams. In streaming video scenarios, the primary bottleneck lies in the Vision Transformer (ViT) encoding stage, where redundant processing of temporally similar frames leads to inefficiency. Additionally, inflated token sequences during LLM pre-filling further exacerbate latency and memory overhead. To address these challenges, we propose \textbf{S}treaming \textbf{T}oken \textbf{C}ompression (\textbf{STC}), a plug-and-play hierarchical framework that seamlessly integrates into existing streaming VideoLLMs, optimizing both ViT encoding and LLM pre-filling stages to accelerate processing. STC introduces two token-level accelerators: \textbf{STC-Cacher}, which reduces ViT encoding overhead by caching and reusing features from temporally similar frames, and \textbf{STC-Pruner}, which compresses the visual token sequence before it enters the LLM, preserving only the most salient tokens based on both spatial and temporal relevance. Extensive experiments on four baseline streaming VideoLLMs across five benchmarks demonstrate that STC outperforms other compression methods. Notably, STC retains up to \textbf{99%} of accuracy on the ReKV framework while reducing ViT encoding latency and LLM pre-filling latency by \textbf{24.5%} and \textbf{45.3%}.

[81] SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead cs.CV | cs.ROPDF

Chaojun Ni, Cheng Chen, Xiaofeng Wang, Zheng Zhu, Wenzhao Zheng

TL;DR: SwiftVLA是一种轻量级视觉-语言-动作（VLA）模型，通过引入4D视觉几何变换器和Fusion Tokens，显著提升了时空理解能力，同时保持了高效设计和低开销。

Details

Motivation: 现有基于预训练视觉-语言模型（VLM）的VLA模型参数量大，实用性受限；轻量级VLM虽减小了参数但牺牲了时空推理能力。SwiftVLA旨在解决这一问题。

Result: 在真实和模拟环境中，SwiftVLA性能优于轻量基线，接近参数量7倍的模型，速度提升18倍，内存占用减少12倍。

Insight: 轻量级VLA模型可通过高效的4D特征提取和融合策略达到接近大型模型的性能，适合边缘计算场景。

Abstract: Vision-Language-Action (VLA) models built on pretrained Vision-Language Models (VLMs) show strong potential but are limited in practicality due to their large parameter counts. To mitigate this issue, using a lightweight VLM has been explored, but it compromises spatiotemporal reasoning. Although some methods suggest that incorporating additional 3D inputs can help, they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. Therefore, we propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that extracts 4D features from 2D images. Then, to enhance the VLM’s ability to exploit both 2D images and 4D features, we introduce Fusion Tokens, a set of learnable tokens trained with a future prediction objective to generate unified representations for action generation. Finally, we introduce a mask-and-reconstruct strategy that masks 4D inputs to the VLM and trains the VLA to reconstruct them, enabling the VLM to learn effective 4D representations and allowing the 4D branch to be dropped at inference with minimal performance loss. Experiments in real and simulated environments show that SwiftVLA outperforms lightweight baselines and rivals VLAs up to 7 times larger, achieving comparable performance on edge devices while being 18 times faster and reducing memory footprint by 12 times.

[82] Hierarchical Semantic Alignment for Image Clustering cs.CV | cs.LGPDF

Xingyu Zhu, Beier Zhu, Yunfan Li, Junfeng Fang, Shuo Wang

TL;DR: 论文提出了一种名为CAE的分层语义对齐方法，用于提升图像聚类性能。该方法通过结合caption级的描述和noun级的概念，构建了一个与图像特征对齐的语义空间，并通过最优运输实现特征对齐。实验表明，该方法显著优于现有训练无关方法。

Details

Motivation: 现有的图像聚类方法通常利用名词作为外部语义知识，但由于名词的固有歧义性，可能导致语义表示失真，从而降低聚类质量。为了解决这一问题，论文提出了分层语义对齐方法。

Result: 在8个数据集上的实验表明，CAE显著优于现有训练无关方法，尤其在ImageNet-1K数据集上，准确率和ARI分别提升了4.2%和2.9%。

Insight: 结合多层级语义（caption和noun）可以有效解决名词歧义问题，提升聚类性能；训练无关的方法在实际应用中更具灵活性。

Abstract: Image clustering is a classic problem in computer vision, which categorizes images into different groups. Recent studies utilize nouns as external semantic knowledge to improve clus- tering performance. However, these methods often overlook the inherent ambiguity of nouns, which can distort semantic representations and degrade clustering quality. To address this issue, we propose a hierarChical semAntic alignmEnt method for image clustering, dubbed CAE, which improves cluster- ing performance in a training-free manner. In our approach, we incorporate two complementary types of textual seman- tics: caption-level descriptions, which convey fine-grained attributes of image content, and noun-level concepts, which represent high-level object categories. We first select relevant nouns from WordNet and descriptions from caption datasets to construct a semantic space aligned with image features. Then, we align image features with selected nouns and captions via optimal transport to obtain a more discriminative semantic space. Finally, we combine the enhanced semantic and image features to perform clustering. Extensive experiments across 8 datasets demonstrate the effectiveness of our method, notably surpassing the state-of-the-art training-free approach with a 4.2% improvement in accuracy and a 2.9% improvement in adjusted rand index (ARI) on the ImageNet-1K dataset.

[83] TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model cs.CVPDF

Alireza Javanmardi, Pragati Jaiswal, Tewodros Amberbir Habtegebrial, Christen Millerdurai, Shaoxiang Wang

TL;DR: 论文提出TalkingPose，一种基于扩散模型的框架，用于生成长时间序列的上半身人体动画，通过反馈机制增强时间一致性，无需额外计算开销或训练阶段。

Details

Motivation: 现有扩散模型在生成高质量运动方面表现优异，但受限于计算和内存，难以生成长时间序列的一致动画。

Result: 能够生成不受帧数限制且时间一致的动画。

Insight: 反馈机制在不增加计算开销的情况下有效提升了长序列生成的质量。

Abstract: Recent advancements in diffusion models have significantly improved the realism and generalizability of character-driven animation, enabling the synthesis of high-quality motion from just a single RGB image and a set of driving poses. Nevertheless, generating temporally coherent long-form content remains challenging. Existing approaches are constrained by computational and memory limitations, as they are typically trained on short video segments, thus performing effectively only over limited frame lengths and hindering their potential for extended coherent generation. To address these constraints, we propose TalkingPose, a novel diffusion-based framework specifically designed for producing long-form, temporally consistent human upper-body animations. TalkingPose leverages driving frames to precisely capture expressive facial and hand movements, transferring these seamlessly to a target actor through a stable diffusion backbone. To ensure continuous motion and enhance temporal coherence, we introduce a feedback-driven mechanism built upon image-based diffusion models. Notably, this mechanism does not incur additional computational costs or require secondary training stages, enabling the generation of animations with unlimited duration. Additionally, we introduce a comprehensive, large-scale dataset to serve as a new benchmark for human upper-body animation.

[84] Dual-Projection Fusion for Accurate Upright Panorama Generation in Robotic Vision cs.CVPDF

Yuhao Shan, Qianyi Yuan, Jingguo Liu, Shigang Li, Jianfeng Li

TL;DR: 论文提出了一种双投影融合网络，通过CNN和ViT分支分别提取局部几何结构和全局上下文信息，生成直立的全景图像，并在SUN360和M3D数据集上表现优异。

Details

Motivation: 机器人视觉中，非直立全景图像由于不稳定的机器人姿态会干扰下游任务。传统IMU校正方法存在漂移和外部干扰问题，而基于视觉的方法提供了可行的替代方案。

Result: 在SUN360和M3D数据集上，方法在倾斜角估计和直立全景生成任务中均优于现有方法。消融实验验证了各模块的贡献和任务协同性。

Insight: 联合估计倾斜角和重建全景图像的任务具有协同优势；双投影融合能够有效结合局部和全局特征。

Abstract: Panoramic cameras, capable of capturing a 360-degree field of view, are crucial in robotic vision, particularly in environments with sparse features. However, non-upright panoramas due to unstable robot postures hinder downstream tasks. Traditional IMU-based correction methods suffer from drift and external disturbances, while vision-based approaches offer a promising alternative. This study presents a dual-stream angle-aware generation network that jointly estimates camera inclination angles and reconstructs upright panoramic images. The network comprises a CNN branch that extracts local geometric structures from equirectangular projections and a ViT branch that captures global contextual cues from cubemap projections. These are integrated through a dual-projection adaptive fusion module that aligns spatial features across both domains. To further enhance performance, we introduce a high-frequency enhancement block, circular padding, and channel attention mechanisms to preserve 360° continuity and improve geometric sensitivity. Experiments on the SUN360 and M3D datasets demonstrate that our method outperforms existing approaches in both inclination estimation and upright panorama generation. Ablation studies further validate the contribution of each module and highlight the synergy between the two tasks. The code and related datasets can be found at: https://github.com/YuhaoShine/DualProjectionFusion.

[85] StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos cs.CV | cs.AI | cs.CLPDF

Daeun Lee, Subhojyoti Mukherjee, Branislav Kveton, Ryan A. Rossi, Viet Dac Lai

TL;DR: StreamGaze是一个新的基准测试，旨在评估多模态大语言模型（MLLMs）如何在流媒体视频中利用人类注视信号进行时序推理和主动预测。它填补了现有研究中的空白，并通过注视引导的任务全面评估模型的能力。

Details

Motivation: 现有流媒体视频基准测试仅关注时序推理，而未评估模型是否能解释或利用人类注视信号。StreamGaze提出了注视引导的任务，以更全面地衡量模型的能力，特别是在真实应用（如AR眼镜）中的实用性。

Result: 实验显示，当前最先进的MLLMs在StreamGaze任务上的表现显著低于人类水平，特别是在注视引导的时序推理和意图建模方面。详细分析揭示了模型的具体失败模式。

Insight: 1. 当前MLLMs在利用注视信号进行主动预测的能力上存在局限。2. 未来模型需要改进注视提示策略和时序推理能力。3. StreamGaze为相关研究提供了一个新的评估标准和数据支持。

Abstract: Streaming video understanding requires models not only to process temporally incoming frames, but also to anticipate user intention for realistic applications like AR glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether MLLMs can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively evaluate streaming video understanding. These tasks assess whether models can use real-time gaze to follow shifting attention and infer user intentions from only past and currently observed frames. To build StreamGaze, we develop a gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories via fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that closely reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, revealing fundamental limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze-prompting strategies, reasoning behaviors, and task-specific failure modes, offering deeper insight into why current MLLMs struggle and what capabilities future models must develop. All data and code will be publicly released to support continued research in gaze-guided streaming video understanding.

[86] SceneProp: Combining Neural Network and Markov Random Field for Scene-Graph Grounding cs.CVPDF

Keita Otani, Tatsuya Harada

TL;DR: SceneProp提出了一种新方法，将场景图落地任务重新定义为马尔可夫随机场中的最大后验推断问题，通过全局推理优化图像区域与查询节点的联合分配。

Details

Motivation: 现有方法在复杂的视觉查询中表现不佳，尤其是随着查询图规模的增大，性能反而下降，无法充分利用关系信息。SceneProp旨在解决这一问题。

Result: 在四个基准测试上显著优于现有方法，且性能随查询图的复杂性提升而提升。

Insight: 关系信息的增加可以改善视觉落地任务，这是之前未被充分利用的潜力。

Abstract: Grounding complex, compositional visual queries with multiple objects and relationships is a fundamental challenge for vision-language models. While standard phrase grounding methods excel at localizing single objects, they lack the structural inductive bias to parse intricate relational descriptions, often failing as queries become more descriptive. To address this structural deficit, we focus on scene-graph grounding, a powerful but less-explored formulation where the query is an explicit graph of objects and their relationships. However, existing methods for this task also struggle, paradoxically showing decreased performance as the query graph grows – failing to leverage the very information that should make grounding easier. We introduce SceneProp, a novel method that resolves this issue by reformulating scene-graph grounding as a Maximum a Posteriori (MAP) inference problem in a Markov Random Field (MRF). By performing global inference over the entire query graph, SceneProp finds the optimal assignment of image regions to nodes that jointly satisfies all constraints. This is achieved within an end-to-end framework via a differentiable implementation of the Belief Propagation algorithm. Experiments on four benchmarks show that our dedicated focus on the scene-graph grounding formulation allows SceneProp to significantly outperform prior work. Critically, its accuracy consistently improves with the size and complexity of the query graph, demonstrating for the first time that more relational context can, and should, lead to better grounding. Codes are available at https://github.com/keitaotani/SceneProp.

[87] Adaptive Evidential Learning for Temporal-Semantic Robustness in Moment Retrieval cs.CVPDF

Haojian Huang, Kaijing Ma, Jin Chen, Haodong Chen, Zhou Wu

TL;DR: 该论文提出了一种名为DEMR的新框架，通过结合Reflective Flipped Fusion块和改进的Geom-regularizer，解决了传统方法在视频时序片段检索中的模态不平衡和不确定性估计偏差问题，显著提升了检索的准确性和鲁棒性。

Details

Motivation: 传统方法在视频时序片段检索中难以处理细粒度信息和复杂歧义场景，且不确定性估计存在偏差，导致高不确定性错误地与准确样本关联。

Result: DEMR在ActivityNet-CD和Charades-CD等数据集上表现出更高的检索准确性、鲁棒性和可解释性。

Insight: 1. 跨模态对齐和文本敏感性的提升能有效减少不确定性估计的偏差；2. Geom-regularizer有助于自适应地对齐具有挑战性的时刻，提高检索性能。

Abstract: In the domain of moment retrieval, accurately identifying temporal segments within videos based on natural language queries remains challenging. Traditional methods often employ pre-trained models that struggle with fine-grained information and deterministic reasoning, leading to difficulties in aligning with complex or ambiguous moments. To overcome these limitations, we explore Deep Evidential Regression (DER) to construct a vanilla Evidential baseline. However, this approach encounters two major issues: the inability to effectively handle modality imbalance and the structural differences in DER’s heuristic uncertainty regularizer, which adversely affect uncertainty estimation. This misalignment results in high uncertainty being incorrectly associated with accurate samples rather than challenging ones. Our observations indicate that existing methods lack the adaptability required for complex video scenarios. In response, we propose Debiased Evidential Learning for Moment Retrieval (DEMR), a novel framework that incorporates a Reflective Flipped Fusion (RFF) block for cross-modal alignment and a query reconstruction task to enhance text sensitivity, thereby reducing bias in uncertainty estimation. Additionally, we introduce a Geom-regularizer to refine uncertainty predictions, enabling adaptive alignment with difficult moments and improving retrieval accuracy. Extensive testing on standard datasets and debiased datasets ActivityNet-CD and Charades-CD demonstrates significant enhancements in effectiveness, robustness, and interpretability, positioning our approach as a promising solution for temporal-semantic robustness in moment retrieval. The code is publicly available at https://github.com/KaijingOfficial/DEMR.

[88] Efficient and Scalable Monocular Human-Object Interaction Motion Reconstruction cs.CVPDF

Boran Wen, Ye Lu, Keyan Wan, Sirui Wang, Jiahong Zhou

TL;DR: 提出了一个高效的优化框架4DHOISolver，通过稀疏的人工标注约束4D HOI重建问题，并引入大规模数据集Open4DHOI，同时展示了其在RL模仿任务中的有效性。

Details

Motivation: 为解决从单目视频中高效且可扩展地提取4D HOI数据的挑战，利用互联网视频的多样性资源。

Result: 重建动作成功应用于RL模仿任务，但现有3D基础模型在预测人类-物体接触对应关系上仍有不足。

Insight: 稀疏人工标注在当前阶段仍是必要手段，同时自动预测精确接触关系仍是一个开放挑战。

Abstract: Generalized robots must learn from diverse, large-scale human-object interactions (HOI) to operate robustly in the real world. Monocular internet videos offer a nearly limitless and readily available source of data, capturing an unparalleled diversity of human activities, objects, and environments. However, accurately and scalably extracting 4D interaction data from these in-the-wild videos remains a significant and unsolved challenge. Thus, in this work, we introduce 4DHOISolver, a novel and efficient optimization framework that constrains the ill-posed 4D HOI reconstruction problem by leveraging sparse, human-in-the-loop contact point annotations, while maintaining high spatio-temporal coherence and physical plausibility. Leveraging this framework, we introduce Open4DHOI, a new large-scale 4D HOI dataset featuring a diverse catalog of 144 object types and 103 actions. Furthermore, we demonstrate the effectiveness of our reconstructions by enabling an RL-based agent to imitate the recovered motions. However, a comprehensive benchmark of existing 3D foundation models indicates that automatically predicting precise human-object contact correspondences remains an unsolved problem, underscoring the immediate necessity of our human-in-the-loop strategy while posing an open challenge to the community. Data and code will be publicly available at https://wenboran2002.github.io/open4dhoi/

[89] MM-ACT: Learn from Multimodal Parallel Generation to Act cs.CV | cs.LG | cs.ROPDF

Haotian Liang, Xinyi Chen, Bin Wang, Mingkang Chen, Yitian Liu

TL;DR: MM-ACT是一个统一的视觉-语言-动作（VLA）模型，通过共享的token空间整合文本、图像和动作，并在所有三种模态中生成。采用跨模态上下文共享学习策略，提升了动作生成的效率和质量。在仿真和真实机器人任务中表现出色。

Details

Motivation: 通用机器人策略需要语义理解以规划任务，同时需要与环境交互的预测能力。现有方法通常未能统一处理多模态信息，限制了性能。

Result: 在LIBERO仿真中任务成功率达96.3%，真实Franka机器人任务中72.0%，RoboTwin2.0双手机器人任务中52.38%，跨模态学习带来额外9.25%提升。

Insight: 统一的多模态学习和高效的解码策略显著提升了机器人策略的性能；跨模态学习对动作生成具有显著增益。

Abstract: A generalist robotic policy needs both semantic understanding for task planning and the ability to interact with the environment through predictive capabilities. To tackle this, we present MM-ACT, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in shared token space and performs generation across all three modalities. MM-ACT adopts a re-mask parallel decoding strategy for text and image generation, and employs a one-step parallel decoding strategy for action generation to improve efficiency. We introduce Context-Shared Multimodal Learning, a unified training paradigm that supervises generation in all three modalities from a shared context, enhancing action generation through cross-modal learning. Experiments were conducted on the LIBERO simulation and Franka real-robot setups as well as RoboTwin2.0 to assess in-domain and out-of-domain performances respectively. Our approach achieves a success rate of 96.3% on LIBERO, 72.0% across three tasks of real Franka, and 52.38% across eight bimanual tasks of RoboTwin2.0 with an additional gain of 9.25% from cross-modal learning. We release our codes, models and data at https://github.com/HHYHRHY/MM-ACT.

Zhiyuan You, Ke Wang, He Zhang, Xin Cai, Jinjin Gu

TL;DR: PhotoFramer是一个多模态图像构图指导框架，通过文本和图像生成帮助用户改进构图。它利用大规模数据集训练模型，提供分层任务（平移、放大、视角变换）指导，并证明文本指令与示例结合比仅用示例更有效。

Details

Motivation: 许多用户在拍照时缺乏构图技巧，导致照片效果不佳。PhotoFramer旨在通过提供自然语言指导和示例图像，帮助用户改进构图，从而提升照片质量。

Result: 实验表明，文本指令有效指导图像构图，且与示例结合比仅用示例效果更优。PhotoFramer为用户提供了实用的构图辅助工具。

Insight: 构图改进可以通过分层任务和多模态结合实现，合成数据可用于训练高质量的指导模型，自然语言与视觉示例结合能显著提升用户体验。

Abstract: Composition matters during the photo-taking process, yet many casual users struggle to frame well-composed images. To provide composition guidance, we introduce PhotoFramer, a multi-modal composition instruction framework. Given a poorly composed image, PhotoFramer first describes how to improve the composition in natural language and then generates a well-composed example image. To train such a model, we curate a large-scale dataset. Inspired by how humans take photos, we organize composition guidance into a hierarchy of sub-tasks: shift, zoom-in, and view-change tasks. Shift and zoom-in data are sampled from existing cropping datasets, while view-change data are obtained via a two-stage pipeline. First, we sample pairs with varying viewpoints from multi-view datasets, and train a degradation model to transform well-composed photos into poorly composed ones. Second, we apply this degradation model to expert-taken photos to synthesize poor images to form training pairs. Using this dataset, we finetune a model that jointly processes and generates both text and images, enabling actionable textual guidance with illustrative examples. Extensive experiments demonstrate that textual instructions effectively steer image composition, and coupling them with exemplars yields consistent improvements over exemplar-only baselines. PhotoFramer offers a practical step toward composition assistants that make expert photographic priors accessible to everyday users. Codes, model weights, and datasets have been released in https://zhiyuanyou.github.io/photoframer.

[91] S2AM3D: Scale-controllable Part Segmentation of 3D Point Cloud cs.CVPDF

Han Su, Tianyu Huang, Zichen Wan, Xiaohe Wu, Wangmeng Zuo

TL;DR: S2AM3D 提出了一种结合 2D 分割先验和 3D 一致性监督的方法，用于解决点云分割中的泛化性和多视图一致性问题，并引入了大规模高质量数据集支持模型训练。

Details

Motivation: 现有 3D 点云分割方法因数据稀缺导致泛化性不足，而引入 2D 预训练知识又可能导致多视图分割结果不一致。S2AM3D 旨在解决这些问题。

Result: S2AM3D 在多个评估设置中表现领先，对复杂结构和尺寸变化大的部分具有出色的鲁棒性和可控性。

Insight: 结合 2D 和 3D 信息可以有效提升点云分割的一致性和泛化性；尺度信号的引入为分割任务提供了灵活的粒度控制。

Abstract: Part-level point cloud segmentation has recently attracted significant attention in 3D computer vision. Nevertheless, existing research is constrained by two major challenges: native 3D models lack generalization due to data scarcity, while introducing 2D pre-trained knowledge often leads to inconsistent segmentation results across different views. To address these challenges, we propose S2AM3D, which incorporates 2D segmentation priors with 3D consistent supervision. We design a point-consistent part encoder that aggregates multi-view 2D features through native 3D contrastive learning, producing globally consistent point features. A scale-aware prompt decoder is then proposed to enable real-time adjustment of segmentation granularity via continuous scale signals. Simultaneously, we introduce a large-scale, high-quality part-level point cloud dataset with more than 100k samples, providing ample supervision signals for model training. Extensive experiments demonstrate that S2AM3D achieves leading performance across multiple evaluation settings, exhibiting exceptional robustness and controllability when handling complex structures and parts with significant size variations.

[92] LISA-3D: Lifting Language-Image Segmentation to 3D via Multi-View Consistency cs.CVPDF

Zhongbin Guo, Jiahe Liu, Wenyu Gao, Yushan Li, Chengzhi Li

TL;DR: LISA-3D是一个两阶段框架，将语言-图像分割提升到3D重建中，通过几何感知的LoRA层和多视角一致性损失实现跨视角一致性，无需额外3D文本标注。

Details

Motivation: 文本驱动的3D重建需要掩码生成器既能理解开放词汇指令，又能保持跨视角一致性。现有方法在跨视角一致性和数据效率方面存在局限。

Result: 在ScanRefer和Nr3D数据集上，LISA-3D比单视角基线方法提升了15.6%的准确率，仅需微调11.6M参数。

Insight: 1. 多视角一致性损失无需3D文本标注，降低了数据需求；2. 模块化设计支持灵活应用；3. 为零样本语言驱动的3D内容生成提供了实用方案。

Abstract: Text-driven 3D reconstruction demands a mask generator that simultaneously understands open-vocabulary instructions and remains consistent across viewpoints. We present LISA-3D, a two-stage framework that lifts language-image segmentation into 3D by retrofitting the instruction-following model LISA with geometry-aware Low-Rank Adaptation (LoRA) layers and reusing a frozen SAM-3D reconstructor. During training we exploit off-the-shelf RGB-D sequences and their camera poses to build a differentiable reprojection loss that enforces cross-view agreement without requiring any additional 3D-text supervision. The resulting masks are concatenated with RGB images to form RGBA prompts for SAM-3D, which outputs Gaussian splats or textured meshes without retraining. Across ScanRefer and Nr3D, LISA-3D improves language-to-3D accuracy by up to +15.6 points over single-view baselines while adapting only 11.6M parameters. The system is modular, data-efficient, and supports zero-shot deployment on unseen categories, providing a practical recipe for language-guided 3D content creation. Our code will be available at https://github.com/binisalegend/LISA-3D.

[93] Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model cs.CVPDF

Jing He, Haodong Li, Mingzhi Sheng, Ying-Cong Chen

TL;DR: Lotus-2提出了一种两阶段确定性框架，通过充分利用预训练的生成先验，实现稳定、准确且细粒度的几何密集预测，在单目深度估计和表面法线预测中达到了新的SOTA。

Details

Motivation: 解决单张图像中像素级几何属性恢复的固有不适定问题。传统判别回归模型受限于数据规模和质量，而扩散模型虽具有强大的世界先验，但其随机生成特性不适合确定性几何推理。

Result: Lotus-2在单目深度估计中达到SOTA，表面法线预测也具有竞争力，仅需1%传统大规模数据集的数据量。

Insight: 扩散模型可作为确定性世界先验，支持高质量的几何推理，超越传统判别和生成范式。

Abstract: Recovering pixel-wise geometric properties from a single image is fundamentally ill-posed due to appearance ambiguity and non-injective mappings between 2D observations and 3D structures. While discriminative regression models achieve strong performance through large-scale supervision, their success is bounded by the scale, quality and diversity of available data and limited physical reasoning. Recent diffusion models exhibit powerful world priors that encode geometry and semantics learned from massive image-text data, yet directly reusing their stochastic generative formulation is suboptimal for deterministic geometric inference: the former is optimized for diverse and high-fidelity image generation, whereas the latter requires stable and accurate predictions. In this work, we propose Lotus-2, a two-stage deterministic framework for stable, accurate and fine-grained geometric dense prediction, aiming to provide an optimal adaption protocol to fully exploit the pre-trained generative priors. Specifically, in the first stage, the core predictor employs a single-step deterministic formulation with a clean-data objective and a lightweight local continuity module (LCM) to generate globally coherent structures without grid artifacts. In the second stage, the detail sharpener performs a constrained multi-step rectified-flow refinement within the manifold defined by the core predictor, enhancing fine-grained geometry through noise-free deterministic flow matching. Using only 59K training samples, less than 1% of existing large-scale datasets, Lotus-2 establishes new state-of-the-art results in monocular depth estimation and highly competitive surface normal prediction. These results demonstrate that diffusion models can serve as deterministic world priors, enabling high-quality geometric reasoning beyond traditional discriminative and generative paradigms.

[94] TRoVe: Discovering Error-Inducing Static Feature Biases in Temporal Vision-Language Models cs.CVPDF

Maya Varma, Jean-Benoit Delbrouck, Sophie Ostmeier, Akshay Chaudhari, Curtis Langlotz

TL;DR: TRoVe是一种自动方法，用于发现时序视觉语言模型中导致错误的静态特征偏差，并通过评估框架验证其有效性。

Details

Motivation: 视觉语言模型(VLMs)在处理时序任务时可能依赖静态特征偏差（如背景或对象特征），导致系统预测错误，因此需要识别和消除这些偏差。

Result: TRoVe在识别静态特征偏差方面表现优异，且在7个现成VLMs和2个时序任务中发现新偏差，提高了测试性能。

Insight: 静态特征偏差是VLMs中的常见问题，TRoVe揭示了这些偏差并为进一步优化模型提供了方向。

Abstract: Vision-language models (VLMs) have made great strides in addressing temporal understanding tasks, which involve characterizing visual changes across a sequence of images. However, recent works have suggested that when making predictions, VLMs may rely on static feature biases, such as background or object features, rather than dynamic visual changes. Static feature biases are a type of shortcut and can contribute to systematic prediction errors on downstream tasks; as a result, identifying and characterizing error-inducing static feature biases is critical prior to real-world model deployment. In this work, we introduce TRoVe, an automated approach for discovering error-inducing static feature biases learned by temporal VLMs. Given a trained VLM and an annotated validation dataset associated with a downstream classification task, TRoVe extracts candidate static features from the dataset and scores each feature by (i) the effect of the feature on classification errors as well as (ii) the extent to which the VLM relies on the feature when making predictions. In order to quantitatively evaluate TRoVe, we introduce an evaluation framework consisting of 101 trained temporal VLMs paired with ground-truth annotations for learned static feature biases. We use this framework to demonstrate that TRoVe can accurately identify error-inducing static feature biases in VLMs, achieving a 28.6% improvement over the closest baseline. Finally, we apply TRoVe to 7 off-the-shelf VLMs and 2 temporal understanding tasks, surfacing previously-unknown static feature biases and demonstrating that knowledge of learned biases can aid in improving model performance at test time. Our code is available at https://github.com/Stanford-AIMI/TRoVe.

Anantha Padmanaban Krishna Kumar

TL;DR: 该论文研究了在Vision Transformers（ViT-B/16）中通过参数减少策略（权重共享和宽度缩减）优化模型性能的方法，发现减少MLP块的参数不仅不影响性能，反而能小幅提升准确率并改善训练稳定性。

Details

Motivation: 尽管扩大Vision Transformers的规模通常能提高性能，但并非总是单调增加。论文旨在探索通过参数减少策略优化ViT-B/16在ImageNet-1K上的表现，特别是MLP块的容量是否可以被缩减而不损害性能。

Result: GroupedMLP达到81.47% top-1准确率，ShallowMLP达到81.25%，均优于基线（81.05%）。训练稳定性显著改善，峰值到最终准确率退化从0.47%降至0.03%-0.06%。

Insight: ViT-B/16在ImageNet-1K上处于过参数化状态，MLP容量可以被缩减而不损害性能；参数共享和宽度缩减可作为有用的归纳偏置，优化参数分配对模型设计至关重要。

Abstract: Although scaling laws and many empirical results suggest that increasing the size of Vision Transformers often improves performance, model accuracy and training behavior are not always monotonically increasing with scale. Focusing on ViT-B/16 trained on ImageNet-1K, we study two simple parameter-reduction strategies applied to the MLP blocks, each removing 32.7% of the baseline parameters. Our \emph{GroupedMLP} variant shares MLP weights between adjacent transformer blocks and achieves 81.47% top-1 accuracy while maintaining the baseline computational cost. Our \emph{ShallowMLP} variant halves the MLP hidden dimension and reaches 81.25% top-1 accuracy with a 38% increase in inference throughput. Both models outperform the 86.6M-parameter baseline (81.05%) and exhibit substantially improved training stability, reducing peak-to-final accuracy degradation from 0.47% to the range 0.03% to 0.06%. These results suggest that, for ViT-B/16 on ImageNet-1K with a standard training recipe, the model operates in an overparameterized regime in which MLP capacity can be reduced without harming performance and can even slightly improve it. More broadly, our findings suggest that architectural constraints such as parameter sharing and reduced width may act as useful inductive biases, and highlight the importance of how parameters are allocated when designing Vision Transformers. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/parameter-efficient-vit-mlps.

[96] Accelerating Inference of Masked Image Generators via Reinforcement Learning cs.CVPDF

Pranav Subbaraman, Shufan Li, Siyan Zhao, Aditya Grover

TL;DR: 该论文提出了Speed-RL，一种基于强化学习的加速Masked Generative Models（MGM）推理的方法，能在更少步骤下生成高质量图像。

Details

Motivation: MGM在生成高质量图像时通常需多次采样步骤，导致推理速度慢。传统蒸馏方法难以高效解决这一问题，因此需要一种新的加速范式。

Result: 实验表明，该方法能将基础模型的推理速度提升3倍，同时保持图像质量。

Insight: 将加速任务转化为强化学习问题是一种新颖且有效的思路，避免了传统蒸馏方法的局限性。

Abstract: Masked Generative Models (MGM)s demonstrate strong capabilities in generating high-fidelity images. However, they need many sampling steps to create high-quality generations, resulting in slow inference speed. In this work, we propose Speed-RL, a novel paradigm for accelerating a pretrained MGMs to generate high-quality images in fewer steps. Unlike conventional distillation methods which formulate the acceleration problem as a distribution matching problem, where a few-step student model is trained to match the distribution generated by a many-step teacher model, we consider this problem as a reinforcement learning problem. Since the goal of acceleration is to generate high quality images in fewer steps, we can combine a quality reward with a speed reward and finetune the base model using reinforcement learning with the combined reward as the optimization target. Through extensive experiments, we show that the proposed method was able to accelerate the base model by a factor of 3x while maintaining comparable image quality.

[97] CycliST: A Video Language Model Benchmark for Reasoning on Cyclical State Transitions cs.CV | cs.AI | cs.LGPDF

Simon Kohaut, Daniel Ochs, Shun Zhang, Benedict Flade, Julian Eggert

TL;DR: CycliST 是一个新的基准数据集，用于评估视频语言模型（VLM）在周期性状态转换上的文本推理能力。它通过合成结构化的视频序列，模拟真实世界的周期性模式，并揭示当前模型的局限性。

Details

Motivation: 现有的视频语言模型在理解和推理周期性动态（如线性或轨道运动、时间依赖的视觉属性变化）方面表现不佳，缺乏时空认知能力。CycliST 旨在填补这一技术空白，推动模型在周期性模式理解上的进步。

Result: 当前 VLMs 在周期性模式检测、时间理解和定量场景分析中表现不佳，且无单一模型在所有任务中表现一致。规模或架构与性能无强相关性。

Insight: CycliST 突出了 VLMs 在周期性动态推理中的技术缺口，为未来模型提供了明确改进方向，强调需要更强大的时空理解能力。

Abstract: We present CycliST, a novel benchmark dataset designed to evaluate Video Language Models (VLM) on their ability for textual reasoning over cyclical state transitions. CycliST captures fundamental aspects of real-world processes by generating synthetic, richly structured video sequences featuring periodic patterns in object motion and visual attributes. CycliST employs a tiered evaluation system that progressively increases difficulty through variations in the number of cyclic objects, scene clutter, and lighting conditions, challenging state-of-the-art models on their spatio-temporal cognition. We conduct extensive experiments with current state-of-the-art VLMs, both open-source and proprietary, and reveal their limitations in generalizing to cyclical dynamics such as linear and orbital motion, as well as time-dependent changes in visual attributes like color and scale. Our results demonstrate that present-day VLMs struggle to reliably detect and exploit cyclic patterns, lack a notion of temporal understanding, and are unable to extract quantitative insights from scenes, such as the number of objects in motion, highlighting a significant technical gap that needs to be addressed. More specifically, we find no single model consistently leads in performance: neither size nor architecture correlates strongly with outcomes, and no model succeeds equally well across all tasks. By providing a targeted challenge and a comprehensive evaluation framework, CycliST paves the way for visual reasoning models that surpass the state-of-the-art in understanding periodic patterns.

[98] Structural Prognostic Event Modeling for Multimodal Cancer Survival Analysis cs.CVPDF

Yilan Zhang, Li Nanbo, Changchun Yang, Jürgen Schmidhuber, Xin Gao

TL;DR: 该论文提出了一种基于槽位（Slot）的结构化预后事件建模框架SlotSPE，用于整合组织学图像和基因谱数据，以提高癌症生存预测的准确性。该方法通过压缩多模态输入为紧凑的槽位表示，有效建模复杂的模态内和模态间交互，并在10种癌症基准测试中表现出色。

Details

Motivation: 当前的多模态癌症生存分析方法在建模高效和有效的模态内及模态间交互方面存在困难，尤其是捕获稀疏且未注释的关键预后事件。因此，需要一种能够揭示这些结构性信号并增强预测性能的方法。

Result: 在10种癌症数据集的8种中，SlotSPE优于现有方法，整体性能提升了2.9%。该方法对缺失基因组数据具有鲁棒性，并通过结构化事件分解显著提升了可解释性。

Insight: SlotSPE展示了通过结构化压缩多模态数据可以高效捕获稀疏的预后事件，从而改善癌症生存预测。该方法还为多模态数据的可解释性分析提供了新思路。

Abstract: The integration of histology images and gene profiles has shown great promise for improving survival prediction in cancer. However, current approaches often struggle to model intra- and inter-modal interactions efficiently and effectively due to the high dimensionality and complexity of the inputs. A major challenge is capturing critical prognostic events that, though few, underlie the complexity of the observed inputs and largely determine patient outcomes. These events, manifested as high-level structural signals such as spatial histologic patterns or pathway co-activations, are typically sparse, patient-specific, and unannotated, making them inherently difficult to uncover. To address this, we propose SlotSPE, a slot-based framework for structural prognostic event modeling. Specifically, inspired by the principle of factorial coding, we compress each patient’s multimodal inputs into compact, modality-specific sets of mutually distinctive slots using slot attention. By leveraging these slot representations as encodings for prognostic events, our framework enables both efficient and effective modeling of complex intra- and inter-modal interactions, while also facilitating seamless incorporation of biological priors that enhance prognostic relevance. Extensive experiments on ten cancer benchmarks show that SlotSPE outperforms existing methods in 8 out of 10 cohorts, achieving an overall improvement of 2.9%. It remains robust under missing genomic data and delivers markedly improved interpretability through structured event decomposition.

[99] OmniFD: A Unified Model for Versatile Face Forgery Detection cs.CVPDF

Haotian Liu, Haoyu Chen, Chenhui Pan, You Hu, Guoying Zhao

TL;DR: OmniFD是一个统一模型，用于多功能人脸伪造检测，能同时处理图像和视频分类、空间定位和时间定位四项任务，显著提高了效率和性能。

Details

Motivation: 现有的人脸伪造检测方法通常针对不同任务使用独立的模型，导致计算冗余且忽略了任务间的潜在关联。OmniFD旨在通过统一框架解决这一问题。

Result: 实验表明，OmniFD在多项任务上优于任务专用模型，例如视频分类精度提升4.63%，同时减少63%的参数和50%的训练时间。

Insight: 通过统一框架和多任务学习，OmniFD展示了任务间知识迁移的潜力，尤其是在图像数据辅助下提升视频任务的性能。

Abstract: Face forgery detection encompasses multiple critical tasks, including identifying forged images and videos and localizing manipulated regions and temporal segments. Current approaches typically employ task-specific models with independent architectures, leading to computational redundancy and ignoring potential correlations across related tasks. We introduce OmniFD, a unified framework that jointly addresses four core face forgery detection tasks within a single model, i.e., image and video classification, spatial localization, and temporal localization. Our architecture consists of three principal components: (1) a shared Swin Transformer encoder that extracts unified 4D spatiotemporal representations from both images and video inputs, (2) a cross-task interaction module with learnable queries that dynamically captures inter-task dependencies through attention-based reasoning, and (3) lightweight decoding heads that transform refined representations into corresponding predictions for all FFD tasks. Extensive experiments demonstrate OmniFD’s advantage over task-specific models. Its unified design leverages multi-task learning to capture generalized representations across tasks, especially enabling fine-grained knowledge transfer that facilitates other tasks. For example, video classification accuracy improves by 4.63% when image data are incorporated. Furthermore, by unifying images, videos and the four tasks within one framework, OmniFD achieves superior performance across diverse benchmarks with high efficiency and scalability, e.g., reducing 63% model parameters and 50% training time. It establishes a practical and generalizable solution for comprehensive face forgery detection in real-world applications. The source code is made available at https://github.com/haotianll/OmniFD.

Hamza Tahboub, Weiyan Shi, Gang Hua, Huaizu Jiang

TL;DR: 论文揭示了预训练的视觉语言模型（VLMs）在社交感知任务中的“社交退化”问题，并提出了一种名为SocialFusion的统一框架，通过最小化连接固定视觉编码器和语言模型来解决这一问题，从而在多任务中实现正向迁移并提升性能。

Details

Motivation: 现有的预训练视觉语言模型（VLMs）虽然在许多任务上表现出色，但在社交感知任务中却表现不佳，甚至出现负迁移现象。这源于预训练过程中视觉编码器对细微社交信息的表示能力受损，即“社交退化”问题。

Result: SocialFusion在五个社交任务中表现出正向迁移，整体性能提升，并在多个基准测试中达到与任务专用最先进模型相当的性能。

Insight: 论文指出，当前的VLM预训练策略可能对通用社交能力的获取有害，强调了需要更注重社交感知的训练范式。

Abstract: Understanding social interactions from visual cues is a fundamental challenge for a socially competent AI. While powerful pre-trained vision-language models (VLMs) have shown remarkable general capabilities, they surprisingly struggle to unify and learn multiple social perception tasks simultaneously, often exhibiting negative transfer. We identify that this negative transfer stems from a critical issue we term “social degradation,” whereby the general visual-linguistic pre-training process of VLMs impairs the visual encoder’s ability to represent nuanced social information. We investigate this behavior further under two lenses: decodability through linear representation probing and compatibility through gradient conflict analysis, revealing that both play a role in the degradation, especially the former, which is significantly compromised in the VLM pre-training process. To address these issues, we propose SocialFusion, a unified framework that learns a minimal connection between a frozen visual encoder and a language model. Compared with existing VLMs, it exhibits positive transfer across all five social tasks, leveraging synergies between them to enhance overall performance and achieves comparable performance to task-specific state-of-the-art models on various benchmarks. Our findings suggest that current VLM pre-training strategies may be detrimental to acquiring general social competence and highlight the need for more socially-aware training paradigms.

[101] Real-Time On-the-Go Annotation Framework Using YOLO for Automated Dataset Generation cs.CV | cs.AI | cs.ROPDF

Mohamed Abdallah Salem, Ahmed Harb Rabia

TL;DR: 提出了一个基于YOLO的实时标注框架，能够在边缘设备上即时标注图像，显著减少数据集准备时间，同时保持高质量。

Details

Motivation: 传统数据标注方法耗时长且劳动密集，尤其在农业等需要快速决策的应用场景中，高效的自动化标注是迫切需求。

Result: 实验表明，预训练和单类别配置在模型收敛性、性能和鲁棒性上具有显著优势，验证了框架的高效性和可行性。

Insight: 实时标注框架在农业等时间敏感场景中具有广泛应用潜力，同时揭示了预训练和单类别配置对模型性能的重要性。

Abstract: Efficient and accurate annotation of datasets remains a significant challenge for deploying object detection models such as You Only Look Once (YOLO) in real-world applications, particularly in agriculture where rapid decision-making is critical. Traditional annotation techniques are labor-intensive, requiring extensive manual labeling post data collection. This paper presents a novel real-time annotation approach leveraging YOLO models deployed on edge devices, enabling immediate labeling during image capture. To comprehensively evaluate the efficiency and accuracy of our proposed system, we conducted an extensive comparative analysis using three prominent YOLO architectures (YOLOv5, YOLOv8, YOLOv12) under various configurations: single-class versus multi-class annotation and pretrained versus scratch-based training. Our analysis includes detailed statistical tests and learning dynamics, demonstrating significant advantages of pretrained and single-class configurations in terms of model convergence, performance, and robustness. Results strongly validate the feasibility and effectiveness of our real-time annotation framework, highlighting its capability to drastically reduce dataset preparation time while maintaining high annotation quality.

[102] VSRD++: Autolabeling for 3D Object Detection via Instance-Aware Volumetric Silhouette Rendering cs.CVPDF

Zihua Liu, Hiroki Sakuma, Masatoshi Okutomi

TL;DR: VSRD++提出了一种用于单目3D目标检测的弱监督框架，通过基于神经场的体积渲染和2D弱标注避免了依赖3D标注，显著提升了性能。

Details

Motivation: 现有单目3D检测方法依赖大量3D标注，标注成本高。VSRD++旨在通过弱监督方法减少对3D标注的依赖。

Result: 在KITTI-360数据集上，VSRD++在静态和动态场景中均优于现有弱监督方法。

Insight: 体积渲染与动态建模的结合可以有效提升弱监督3D检测的性能。

Abstract: Monocular 3D object detection is a fundamental yet challenging task in 3D scene understanding. Existing approaches heavily depend on supervised learning with extensive 3D annotations, which are often acquired from LiDAR point clouds through labor-intensive labeling processes. To tackle this problem, we propose VSRD++, a novel weakly supervised framework for monocular 3D object detection that eliminates the reliance on 3D annotations and leverages neural-field-based volumetric rendering with weak 2D supervision. VSRD++ consists of a two-stage pipeline: multi-view 3D autolabeling and subsequent monocular 3D detector training. In the multi-view autolabeling stage, object surfaces are represented as signed distance fields (SDFs) and rendered as instance masks via the proposed instance-aware volumetric silhouette rendering. To optimize 3D bounding boxes, we decompose each instance’s SDF into a cuboid SDF and a residual distance field (RDF) that captures deviations from the cuboid. To address the geometry inconsistency commonly observed in volume rendering methods applied to dynamic objects, we model the dynamic objects by including velocity into bounding box attributes as well as assigning confidence to each pseudo-label. Moreover, we also employ a 3D attribute initialization module to initialize the dynamic bounding box parameters. In the monocular 3D object detection phase, the optimized 3D bounding boxes serve as pseudo labels for training monocular 3D object detectors. Extensive experiments on the KITTI-360 dataset demonstrate that VSRD++ significantly outperforms existing weakly supervised approaches for monocular 3D object detection on both static and dynamic scenes. Code is available at https://github.com/Magicboomliu/VSRD_plus_plus

[103] TabletopGen: Instance-Level Interactive 3D Tabletop Scene Generation from Text or Single Image cs.CVPDF

Ziqian Wang, Yonghao He, Licheng Yang, Wei Zou, Hongxuan Ma

TL;DR: TabletopGen 是一个无需训练的、全自动框架，用于从文本或单张图像生成高保真、实例级的交互式3D桌面场景，解决了现有方法在处理高密度布局和复杂空间关系时的不足。

Details

Motivation: 现有文本或图像驱动的3D场景生成方法主要关注大尺度场景，难以捕捉桌面场景的高密度布局和复杂空间关系。TabletopGen旨在填补这一空白，为具身AI（如机器人操纵策略学习）提供高质量的模拟场景。

Result: 实验和用户研究表明，TabletopGen在视觉保真度、布局准确性和物理合理性上显著优于现有方法，能够生成具有丰富风格和空间多样性的真实桌面场景。

Insight: 分阶段的空间推理方法（姿态和尺度对齐）是关键创新点，显著提升了3D重建的准确性，同时无需训练的特性使其更具通用性。

Abstract: Generating high-fidelity, physically interactive 3D simulated tabletop scenes is essential for embodied AI–especially for robotic manipulation policy learning and data synthesis. However, current text- or image-driven 3D scene generation methods mainly focus on large-scale scenes, struggling to capture the high-density layouts and complex spatial relations that characterize tabletop scenes. To address these challenges, we propose TabletopGen, a training-free, fully automatic framework that generates diverse, instance-level interactive 3D tabletop scenes. TabletopGen accepts a reference image as input, which can be synthesized by a text-to-image model to enhance scene diversity. We then perform instance segmentation and completion on the reference to obtain per-instance images. Each instance is reconstructed into a 3D model followed by canonical coordinate alignment. The aligned 3D models then undergo pose and scale estimation before being assembled into a collision-free, simulation-ready tabletop scene. A key component of our framework is a novel pose and scale alignment approach that decouples the complex spatial reasoning into two stages: a Differentiable Rotation Optimizer for precise rotation recovery and a Top-view Spatial Alignment mechanism for robust translation and scale estimation, enabling accurate 3D reconstruction from 2D reference. Extensive experiments and user studies show that TabletopGen achieves state-of-the-art performance, markedly surpassing existing methods in visual fidelity, layout accuracy, and physical plausibility, capable of generating realistic tabletop scenes with rich stylistic and spatial diversity. Our code will be publicly available.

Hang Wu, Ke Sun, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

TL;DR: 本文提出了一种名为M4-BLIP的创新框架，通过结合局部特征（尤其是面部区域）与全局特征，提升了多模态媒体操纵检测的准确性，并增强了结果的可解释性。

Details

Motivation: 随着数字媒体操纵的增多，现有方法常忽略局部信息的重要性，尤其是在面部区域的操纵。M4-BLIP通过结合局部和全局特征，填补了这一空白。

Result: 实验表明，M4-BLIP在多模态媒体操纵检测任务中表现优于现有方法，并通过可视化验证了其有效性。

Insight: 面部区域的局部特征对媒体操纵检测至关重要，结合LLM可以进一步提升检测结果的可解释性。

Abstract: In the contemporary digital landscape, multi-modal media manipulation has emerged as a significant societal threat, impacting the reliability and integrity of information dissemination. Current detection methodologies in this domain often overlook the crucial aspect of localized information, despite the fact that manipulations frequently occur in specific areas, particularly in facial regions. In response to this critical observation, we propose the M4-BLIP framework. This innovative framework utilizes the BLIP-2 model, renowned for its ability to extract local features, as the cornerstone for feature extraction. Complementing this, we incorporate local facial information as prior knowledge. A specially designed alignment and fusion module within M4-BLIP meticulously integrates these local and global features, creating a harmonious blend that enhances detection accuracy. Furthermore, our approach seamlessly integrates with Large Language Models (LLM), significantly improving the interpretability of the detection outcomes. Extensive quantitative and visualization experiments validate the effectiveness of our framework against the state-of-the-art competitors.

[105] S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance cs.CV | cs.AIPDF

Beining Xu, Siting Zhu, Zhao Jin, Junxian Li, Hesheng Wang

TL;DR: S$^2$-MLLM提出了一种高效框架，通过隐式空间推理增强MLLMs在3D视觉定位中的空间推理能力，避免了传统方法中对点云重建的低效依赖。

Details

Motivation: 传统MLLMs主要处理2D视觉输入，难以直接从有限视角理解3D场景的空间结构。现有方法依赖于点云重建，导致效率低下且空间推理能力受限。

Result: 在ScanRefer、Nr3D和Sr3D数据集上，S$^2$-MLLM表现出优异的性能、泛化能力和效率，显著优于现有方法。

Insight: 隐式空间推理避免了显式点云重建的低效问题，结合多级位置编码和注意力机制能够更准确地理解3D场景结构。

Abstract: 3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions, serving as a fundamental task for embodied AI and robotics. Recent advances in Multi-modal Large Language Models (MLLMs) have motivated research into extending them to 3DVG. However, MLLMs primarily process 2D visual inputs and struggle with understanding 3D spatial structure of scenes solely from these limited perspectives. Existing methods mainly utilize viewpoint-dependent rendering of reconstructed point clouds to provide explicit structural guidance for MLLMs in 3DVG tasks, leading to inefficiency and limited spatial reasoning. To address this issue, we propose S$^2$-MLLM, an efficient framework that enhances spatial reasoning in MLLMs through implicit spatial reasoning. We introduce a spatial guidance strategy that leverages the structure awareness of feed-forward 3D reconstruction. By acquiring 3D structural understanding during training, our model can implicitly reason about 3D scenes without relying on inefficient point cloud reconstruction. Moreover, we propose a structure-enhanced module (SE), which first employs intra-view and inter-view attention mechanisms to capture dependencies within views and correspondences across views. The module further integrates multi-level position encoding to associate visual representations with spatial positions and viewpoint information, enabling more accurate structural understanding. Extensive experiments demonstrate that S$^2$-MLLM unifies superior performance, generalization, and efficiency, achieving significant performance over existing methods across the ScanRefer, Nr3D, and Sr3D datasets. Code will be available upon acceptance.

[106] PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards cs.CVPDF

Shulei Wang, Longhui Wei, Xin He, Jianbo Ouyang, Hui Lu

TL;DR: 本文提出了一种可扩展的多主体个性化图像生成方法PSR，通过设计多主体数据集生成流程和一致性奖励机制，解决了现有模型在多主体生成中的性能下降问题。

Details

Motivation: 现有单主体个性化生成模型表现优异，但扩展到多主体时性能下降，主要原因是缺乏高质量多主体数据集和后训练策略不当。

Result: 实验表明，PSR在多主体个性化图像生成中显著提升了主体一致性和文本可控性。

Insight: 高质量的多主体训练数据和一致性的强化学习奖励是提升多主体生成性能的关键。

Abstract: Personalized generation models for a single subject have demonstrated remarkable effectiveness, highlighting their significant potential. However, when extended to multiple subjects, existing models often exhibit degraded performance, particularly in maintaining subject consistency and adhering to textual prompts. We attribute these limitations to the absence of high-quality multi-subject datasets and refined post-training strategies. To address these challenges, we propose a scalable multi-subject data generation pipeline that leverages powerful single-subject generation models to construct diverse and high-quality multi-subject training data. Through this dataset, we first enable single-subject personalization models to acquire knowledge of synthesizing multi-image and multi-subject scenarios. Furthermore, to enhance both subject consistency and text controllability, we design a set of Pairwise Subject-Consistency Rewards and general-purpose rewards, which are incorporated into a refined reinforcement learning stage. To comprehensively evaluate multi-subject personalization, we introduce a new benchmark that assesses model performance using seven subsets across three dimensions. Extensive experiments demonstrate the effectiveness of our approach in advancing multi-subject personalized image generation. Github Link: https://github.com/wang-shulei/PSR

[107] TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition cs.CVPDF

Junyuan Zhang, Bin Wang, Qintong Zhang, Fan Wu, Zichen Wen

TL;DR: TRivia是一种自监督微调方法，利用无标签表格图像训练视觉语言模型（VLMs），提升表格识别（TR）性能，无需人工标注。

Details

Motivation: 现有表格识别方法依赖有标签数据，但标注成本高昂，特别是开源模型因资源限制性能较差。TRivia旨在通过自监督学习解决这一瓶颈。

Result: TRivia-3B在三个基准测试中超越Gemini 2.5 Pro和MinerU2.5等现有系统，实现SOTA性能。

Insight: 自监督学习可有效解决表格识别领域的标注数据稀缺问题，问答反馈机制为无监督优化提供了新思路。

Abstract: Table recognition (TR) aims to transform table images into semi-structured representations such as HTML or Markdown. As a core component of document parsing, TR has long relied on supervised learning, with recent efforts dominated by fine-tuning vision-language models (VLMs) using labeled data. While VLMs have brought TR to the next level, pushing performance further demands large-scale labeled data that is costly to obtain. Consequently, although proprietary models have continuously pushed the performance boundary, open-source models, often trained with limited resources and, in practice, the only viable option for many due to privacy regulations, still lag far behind. To bridge this gap, we introduce TRivia, a self-supervised fine-tuning method that enables pretrained VLMs to learn TR directly from unlabeled table images in the wild. Built upon Group Relative Policy Optimization, TRivia automatically identifies unlabeled samples that most effectively facilitate learning and eliminates the need for human annotations through a question-answering-based reward mechanism. An attention-guided module generates diverse questions for each table image, and the ability to interpret the recognition results and answer them correctly provides feedback to optimize the TR model. This closed-loop process allows the TR model to autonomously learn to recognize, structure, and reason over tables without labeled data. Leveraging this pipeline, we present TRivia-3B, an open-sourced, compact, and state-of-the-art TR model that surpasses existing systems (e.g., Gemini 2.5 Pro, MinerU2.5) on three popular benchmarks. Model and code are released at: https://github.com/opendatalab/TRivia

[108] ViscNet: Vision-Based In-line Viscometry for Fluid Mixing Process cs.CVPDF

Jongwon Sohn, Juhyeon Moon, Hyunjoon Jung, Jaewook Nam

TL;DR: 论文提出了一种基于计算机视觉的粘度测量方法（ViscNet），通过分析流体混合过程中自由表面的光学畸变来非侵入式地推断粘度，适用于多变的光照条件，并引入多模式策略和不确定性量化以提高鲁棒性和可靠性。

Details

Motivation: 传统粘度测量方法是侵入式的且需要受控实验室环境，而真实工业场景条件复杂。因此，需要一种非侵入式、适应性强且可自动化操作的粘度测量方法。

Result: 在多样光照条件下，回归任务的平均绝对误差为0.113 log m² s⁻¹，粘度分类准确率达81%。多模式策略和不确定性量化显著提升了系统的可靠性和鲁棒性。

Insight: 1）视觉方法可在非受控环境下实现高精度的粘度测量；2）多模式输入和不确定性量化是提高工业传感器可靠性的有效手段。

Abstract: Viscosity measurement is essential for process monitoring and autonomous laboratory operation, yet conventional viscometers remain invasive and require controlled laboratory environments that differ substantially from real process conditions. We present a computer-vision-based viscometer that infers viscosity by exploiting how a fixed background pattern becomes optically distorted as light refracts through the mixing-driven, continuously deforming free surface. Under diverse lighting conditions, the system achieves a mean absolute error of 0.113 in log m2 s^-1 units for regression and reaches up to 81% accuracy in viscosity-class prediction. Although performance declines for classes with closely clustered viscosity values, a multi-pattern strategy improves robustness by providing enriched visual cues. To ensure sensor reliability, we incorporate uncertainty quantification, enabling viscosity predictions with confidence estimates. This stand-off viscometer offers a practical, automation-ready alternative to existing viscometry methods.

[109] nnMobileNet++: Towards Efficient Hybrid Networks for Retinal Image Analysis cs.CVPDF

Xin Li, Wenhui Zhu, Xuanzhao Dong, Hao Wang, Yujian Xiong

TL;DR: nnMobileNet++是一种轻量级混合网络，结合了卷积和Transformer模块，用于提升视网膜图像分析的性能和效率。

Details

Motivation: 纯粹卷积架构难以捕捉视网膜图像中的长距离依赖和不规则病变特征，限制了其在临床诊断中的可靠性。

Result: 在多个公开数据集上实现了SOTA或接近SOTA的性能，同时保持低计算成本。

Insight: 混合卷积和Transformer的架构可以兼顾局部和全局特征，是轻量级视网膜图像分析的潜在解决方案。

Abstract: Retinal imaging is a critical, non-invasive modality for the early detection and monitoring of ocular and systemic diseases. Deep learning, particularly convolutional neural networks (CNNs), has significant progress in automated retinal analysis, supporting tasks such as fundus image classification, lesion detection, and vessel segmentation. As a representative lightweight network, nnMobileNet has demonstrated strong performance across multiple retinal benchmarks while remaining computationally efficient. However, purely convolutional architectures inherently struggle to capture long-range dependencies and model the irregular lesions and elongated vascular patterns that characterize on retinal images, despite the critical importance of vascular features for reliable clinical diagnosis. To further advance this line of work and extend the original vision of nnMobileNet, we propose nnMobileNet++, a hybrid architecture that progressively bridges convolutional and transformer representations. The framework integrates three key components: (i) dynamic snake convolution for boundary-aware feature extraction, (ii) stage-specific transformer blocks introduced after the second down-sampling stage for global context modeling, and (iii) retinal image pretraining to improve generalization. Experiments on multiple public retinal datasets for classification, together with ablation studies, demonstrate that nnMobileNet++ achieves state-of-the-art or highly competitive accuracy while maintaining low computational cost, underscoring its potential as a lightweight yet effective framework for retinal image analysis.

[110] Diffusion Model in Latent Space for Medical Image Segmentation Task cs.CV | cs.AIPDF

Huynh Trinh Ngoc, Toan Nguyen Hai, Ba Luong Son, Long Tran Quoc

TL;DR: 提出了一种名为MedSegLatDiff的潜在空间扩散模型，结合VAE和潜在扩散模型，用于高效的医学图像分割。该方法通过压缩输入到低维潜在空间减少了噪声并加速训练，同时采用加权交叉熵损失更好地保留微小结构，在多个数据集上取得了优异的分割效果和不确定性建模能力。

Details

Motivation: 医学图像分割需要捕捉不确定性，传统方法仅生成单一分割掩码，无法满足需求。生成模型虽能生成多样掩码，但计算成本高，因此需要一种高效且能捕捉不确定性的方法。

Result: 在ISIC-2018、CVC-Clinic和LIDC-IDRI数据集上取得了SOTA或接近SOTA的Dice和IoU分数，同时能生成多样化的分割结果和置信度图。

Insight: 1. 潜在空间扩散模型在医学图像分割中兼具高效性和多样性；2. 加权交叉熵损失对小结构的保留至关重要；3. 置信度图增强了模型的临床可解释性和可靠性。

Abstract: Medical image segmentation is crucial for clinical diagnosis and treatment planning. Traditional methods typically produce a single segmentation mask, failing to capture inherent uncertainty. Recent generative models enable the creation of multiple plausible masks per image, mimicking the collaborative interpretation of several clinicians. However, these approaches remain computationally heavy. We propose MedSegLatDiff, a diffusion based framework that combines a variational autoencoder (VAE) with a latent diffusion model for efficient medical image segmentation. The VAE compresses the input into a low dimensional latent space, reducing noise and accelerating training, while the diffusion process operates directly in this compact representation. We further replace the conventional MSE loss with weighted cross entropy in the VAE mask reconstruction path to better preserve tiny structures such as small nodules. MedSegLatDiff is evaluated on ISIC-2018 (skin lesions), CVC-Clinic (polyps), and LIDC-IDRI (lung nodules). It achieves state of the art or highly competitive Dice and IoU scores while simultaneously generating diverse segmentation hypotheses and confidence maps. This provides enhanced interpretability and reliability compared to deterministic baselines, making the model particularly suitable for clinical deployment.

[111] TBT-Former: Learning Temporal Boundary Distributions for Action Localization cs.CVPDF

Thisara Rathnayaka, Uthayasanker Thayasivam

TL;DR: TBT-Former是一个用于时序动作定位的新架构，通过增强Transformer骨干网络、多尺度特征融合和边界分布回归头部，解决了边界模糊和信息整合的挑战。

Details

Motivation: 时序动作定位（TAL）中的模糊边界和多尺度信息融合是现有方法的瓶颈，尤其是基于Transformer的单阶段模型。

Result: 在THUMOS14和EPIC-Kitchens 100上表现优异，ActivityNet-1.3上也有竞争力。

Insight: 将边界回归问题转化为概率分布学习任务，提高了模型对边界不确定性的建模能力。

Abstract: Temporal Action Localization (TAL) remains a fundamental challenge in video understanding, aiming to identify the start time, end time, and category of all action instances within untrimmed videos. While recent single-stage, anchor-free models like ActionFormer have set a high standard by leveraging Transformers for temporal reasoning, they often struggle with two persistent issues: the precise localization of actions with ambiguous or “fuzzy” temporal boundaries and the effective fusion of multi-scale contextual information. In this paper, we introduce the Temporal Boundary Transformer (TBT-Former), a new architecture that directly addresses these limitations. TBT-Former enhances the strong ActionFormer baseline with three core contributions: (1) a higher-capacity scaled Transformer backbone with an increased number of attention heads and an expanded Multi-Layer Perceptron (MLP) dimension for more powerful temporal feature extraction; (2) a cross-scale feature pyramid network (FPN) that integrates a top-down pathway with lateral connections, enabling richer fusion of high-level semantics and low-level temporal details; and (3) a novel boundary distribution regression head. Inspired by the principles of Generalized Focal Loss (GFL), this new head recasts the challenging task of boundary regression as a more flexible probability distribution learning problem, allowing the model to explicitly represent and reason about boundary uncertainty. Within the paradigm of Transformer-based architectures, TBT-Former advances the formidable benchmark set by its predecessors, establishing a new level of performance on the highly competitive THUMOS14 and EPIC-Kitchens 100 datasets, while remaining competitive on the large-scale ActivityNet-1.3. Our code is available at https://github.com/aaivu/In21-S7-CS4681-AML-Research-Projects/tree/main/projects/210536K-Multi-Modal-Learning_Video-Understanding

[112] DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy cs.CVPDF

Jaewoo Song, Jooyoung Choi, Kanghyun Baek, Sangyub Lee, Daemin Park

TL;DR: DCText提出了一种无需训练的视觉文本生成方法，采用分治策略解决长文本或多文本生成中的全局注意力稀释问题，通过分段处理和改进的注意力掩码提升文本准确性和图像连贯性。

Details

Motivation: 当前文本到图像模型在长文本或多文本生成中存在全局注意力稀释的问题，导致文本渲染效果不佳。

Result: 在单句和多句基准测试中，DCText实现了最高的文本准确性，同时保持了图像质量，并具有最低的生成延迟。

Insight: 分治策略和局部注意力掩码的结合可以有效解决全局注意力稀释问题，同时无需额外训练成本。

Abstract: Despite recent text-to-image models achieving highfidelity text rendering, they still struggle with long or multiple texts due to diluted global attention. We propose DCText, a training-free visual text generation method that adopts a divide-and-conquer strategy, leveraging the reliable short-text generation of Multi-Modal Diffusion Transformers. Our method first decomposes a prompt by extracting and dividing the target text, then assigns each to a designated region. To accurately render each segment within their regions while preserving overall image coherence, we introduce two attention masks - Text-Focus and Context-Expansion - applied sequentially during denoising. Additionally, Localized Noise Initialization further improves text accuracy and region alignment without increasing computational cost. Extensive experiments on single- and multisentence benchmarks show that DCText achieves the best text accuracy without compromising image quality while also delivering the lowest generation latency.

[113] Gaussian Swaying: Surface-Based Framework for Aerodynamic Simulation with 3D Gaussians cs.CV | cs.GRPDF

Hongru Yan, Xiang Zhang, Zeyuan Chen, Fangyin Wei, Zhuowen Tu

TL;DR: 这篇论文提出了Gaussian Swaying，一种基于表面的3D高斯框架，用于高效且精细的空气动力学模拟，避免了繁琐的网格化需求。

Details

Motivation: 自然界中由空气动力学驱动的运动（如树枝摇曳、旗帜飘扬）对视觉和图形的真实感至关重要。现有的网格或粒子方法要么计算成本高，要么缺乏连续性。

Result: 在合成和真实数据集上，Gaussian Swaying实现了最先进的性能和效率。

Insight: 通过3D高斯建模表面，可在高效性和精细性之间取得平衡，为空气动力学模拟提供了一种可扩展的解决方案。

Abstract: Branches swaying in the breeze, flags rippling in the wind, and boats rocking on the water all show how aerodynamics shape natural motion – an effect crucial for realism in vision and graphics. In this paper, we present Gaussian Swaying, a surface-based framework for aerodynamic simulation using 3D Gaussians. Unlike mesh-based methods that require costly meshing, or particle-based approaches that rely on discrete positional data, Gaussian Swaying models surfaces continuously with 3D Gaussians, enabling efficient and fine-grained aerodynamic interaction. Our framework unifies simulation and rendering on the same representation: Gaussian patches, which support force computation for dynamics while simultaneously providing normals for lightweight shading. Comprehensive experiments on both synthetic and real-world datasets across multiple metrics demonstrate that Gaussian Swaying achieves state-of-the-art performance and efficiency, offering a scalable approach for realistic aerodynamic scene simulation.

[114] Lost in Distortion: Uncovering the Domain Gap Between Computer Vision and Brain Imaging - A Study on Pretraining for Age Prediction cs.CVPDF

Yanteng Zhang, Songheng Li, Zeyu Shen, Qizhen Lan, Lipei Zhang

TL;DR: 这篇论文探讨了预训练中数据质量对脑影像分析任务的影响，通过与计算机视觉实践的对比，揭示了异构数据质量的挑战与机遇。

Details

Motivation: 大型脑影像数据集为预训练提供了机会，但数据质量的高度异构性（从结构良好的扫描到严重失真的脑部体积）引发了一个基本问题：低质量数据是否能有效参与预训练，还是会阻碍模型学习？

Result: 结果显示，不同质量水平的数据在性能上存在显著差异，表明数据质量对预训练的有效性具有重要影响。

Insight: 论文强调了领域感知的数据筛选的必要性，以确保可信且可泛化的领域特定基础模型。同时，揭示了计算机视觉与临床神经影像标准之间的差距。

Abstract: Large-scale brain imaging datasets provide unprecedented opportunities for developing domain foundation models through pretraining. However, unlike natural image datasets in computer vision, these neuroimaging data often exhibit high heterogeneity in quality, ranging from well-structured scans to severely distorted or incomplete brain volumes. This raises a fundamental question: can noise or low-quality scans contribute meaningfully to pretraining, or do they instead hinder model learning? In this study, we systematically explore the role of data quality level in pretraining and its impact on downstream tasks. Specifically, we perform pretraining on datasets with different quality levels and perform fine-tuning for brain age prediction on external cohorts. Our results show significant performance differences across quality levels, revealing both opportunities and limitations. We further discuss the gap between computer vision practices and clinical neuroimaging standards, emphasizing the necessity of domain-aware curation to ensure trusted and generalizable domain-specific foundation models.

[115] IVCR-200K: A Large-Scale Multi-turn Dialogue Benchmark for Interactive Video Corpus Retrieval cs.CVPDF

Ning Han, Yawen Zeng, Shaohua Long, Chengqing Li, Sijie Yang

TL;DR: 本文介绍了IVCR-200K，一个支持多轮对话的大规模视频语料交互检索数据集，并提出了基于多模态大语言模型（MLLM）的框架，以提升用户与检索系统的交互体验。

Details

Motivation: 现有视频检索任务缺乏用户与系统之间的交互性，无法满足80.8%用户的个性化和动态需求，因此提出了交互式视频语料检索（IVCR）任务。

Result: 实验证明了该数据集和框架的有效性。

Insight: 交互性是提升视频检索任务用户体验的关键，多模态大语言模型为这一问题提供了新的解决方案。

Abstract: In recent years, significant developments have been made in both video retrieval and video moment retrieval tasks, which respectively retrieve complete videos or moments for a given text query. These advancements have greatly improved user satisfaction during the search process. However, previous work has failed to establish meaningful “interaction” between the retrieval system and the user, and its one-way retrieval paradigm can no longer fully meet the personalization and dynamic needs of at least 80.8% of users. In this paper, we introduce the Interactive Video Corpus Retrieval (IVCR) task, a more realistic setting that enables multi-turn, conversational, and realistic interactions between the user and the retrieval system. To facilitate research on this challenging task, we introduce IVCR-200K, a high-quality, bilingual, multi-turn, conversational, and abstract semantic dataset that supports video retrieval and even moment retrieval. Furthermore, we propose a comprehensive framework based on multi-modal large language models (MLLMs) to help users interact in several modes with more explainable solutions. The extensive experiments demonstrate the effectiveness of our dataset and framework.

[116] FOD-S2R: A FOD Dataset for Sim2Real Transfer Learning based Object Detection cs.CV | cs.AIPDF

Ashish Vashist, Qiranul Saadiyean, Suresh Sundaram, Chandra Sekhar Seelamantula

TL;DR: 论文提出了一个新的数据集FOD-S2R，专注于飞机油箱内部的异物检测（FOD），填补了封闭环境下缺乏专用数据集的空白，并通过实验验证了合成数据对提升现实检测性能的有效性。

Details

Motivation: 飞机油箱内的异物（FOD）可能导致严重的安全隐患，但目前缺乏针对封闭环境的数据集，限制了相关检测技术的发展。

Result: 实验表明，引入合成数据可以显著提升目标检测模型的精度和泛化能力。

Insight: 合成数据能够减少对真实标注的依赖，并为封闭环境下的检测任务提供有效支持。

Abstract: Foreign Object Debris (FOD) within aircraft fuel tanks presents critical safety hazards including fuel contamination, system malfunctions, and increased maintenance costs. Despite the severity of these risks, there is a notable lack of dedicated datasets for the complex, enclosed environments found inside fuel tanks. To bridge this gap, we present a novel dataset, FOD-S2R, composed of real and synthetic images of the FOD within a simulated aircraft fuel tank. Unlike existing datasets that focus on external or open-air environments, our dataset is the first to systematically evaluate the effectiveness of synthetic data in enhancing the real-world FOD detection performance in confined, closed structures. The real-world subset consists of 3,114 high-resolution HD images captured in a controlled fuel tank replica, while the synthetic subset includes 3,137 images generated using Unreal Engine. The dataset is composed of various Field of views (FOV), object distances, lighting conditions, color, and object size. Prior research has demonstrated that synthetic data can reduce reliance on extensive real-world annotations and improve the generalizability of vision models. Thus, we benchmark several state-of-the-art object detection models and demonstrate that introducing synthetic data improves the detection accuracy and generalization to real-world conditions. These experiments demonstrate the effectiveness of synthetic data in enhancing the model performance and narrowing the Sim2Real gap, providing a valuable foundation for developing automated FOD detection systems for aviation maintenance.

[117] Optimizing Stroke Risk Prediction: A Machine Learning Pipeline Combining ROS-Balanced Ensembles and XAI cs.CV | cs.LGPDF

A S M Ahsanul Sarkar Akib, Raduana Khawla, Abdul Hasib

TL;DR: 该论文提出了一种结合ROS平衡集成和XAI的机器学习流程，用于优化中风风险预测。通过集成多种机器学习模型并进行特征工程和数据预处理，实现了99.09%的高准确率，并通过XAI技术增强了模型的可解释性和临床适用性。

Details

Motivation: 中风是全球健康的主要威胁，早期风险评估对及时干预和有效预防至关重要。现有的预测方法在准确性和可解释性方面存在不足，需要开发更高效且透明的模型。

Result: 优化后的集成模型在中风预测数据集（SPD）上达到了99.09%的准确率。LIME分析识别出年龄、高血压和血糖水平为关键预测变量。

Insight: 研究表明，结合集成学习和XAI技术可以显著提升模型的准确性和可解释性，为中风风险的早期预测和个性化临床决策提供了有力工具。

Abstract: Stroke is a major cause of death and permanent impairment, making it a major worldwide health concern. For prompt intervention and successful preventative tactics, early risk assessment is essential. To address this challenge, we used ensemble modeling and explainable AI (XAI) techniques to create an interpretable machine learning framework for stroke risk prediction. A thorough evaluation of 10 different machine learning models using 5-fold cross-validation across several datasets was part of our all-inclusive strategy, which also included feature engineering and data pretreatment (using Random Over-Sampling (ROS) to solve class imbalance). Our optimized ensemble model (Random Forest + ExtraTrees + XGBoost) performed exceptionally well, obtaining a strong 99.09% accuracy on the Stroke Prediction Dataset (SPD). We improved the model’s transparency and clinical applicability by identifying three important clinical variables using LIME-based interpretability analysis: age, hypertension, and glucose levels. Through early prediction, this study highlights how combining ensemble learning with explainable AI (XAI) can deliver highly accurate and interpretable stroke risk assessment. By enabling data-driven prevention and personalized clinical decisions, our framework has the potential to transform stroke prediction and cardiovascular risk management.

[118] AlignVid: Training-Free Attention Scaling for Semantic Fidelity in Text-Guided Image-to-Video Generation cs.CVPDF

Yexin Liu, Wen-Jie Shu, Zile Huang, Haoze Zheng, Yueze Wang

TL;DR: AlignVid是一种无需训练的方法，通过注意力缩放和调度，改善文本引导的图像到视频生成中的语义忠实度，同时引入OmitI2V数据集评估语义忽视问题。

Details

Motivation: 现有的文本引导图像到视频生成方法在输入图像需要大幅变换时（如对象增删改），难以忠实于细粒度语义，称为语义忽视。研究发现高斯模糊可以改善语义忠实度，从而启发对齐方法的提出。

Result: 实验表明AlignVid能显著提升语义忠实度，同时限制美学退化。

Insight: 高斯模糊可改善语义忠实度，提示低熵注意力分布对语义对齐的重要性；轻量化干预能在提升语义的同时保持视觉质量。

Abstract: Text-guided image-to-video (TI2V) generation has recently achieved remarkable progress, particularly in maintaining subject consistency and temporal coherence. However, existing methods still struggle to adhere to fine-grained prompt semantics, especially when prompts entail substantial transformations of the input image (e.g., object addition, deletion, or modification), a shortcoming we term semantic negligence. In a pilot study, we find that applying a Gaussian blur to the input image improves semantic adherence. Analyzing attention maps, we observe clearer foreground-background separation. From an energy perspective, this corresponds to a lower-entropy cross-attention distribution. Motivated by this, we introduce AlignVid, a training-free framework with two components: (i) Attention Scaling Modulation (ASM), which directly reweights attention via lightweight Q or K scaling, and (ii) Guidance Scheduling (GS), which applies ASM selectively across transformer blocks and denoising steps to reduce visual quality degradation. This minimal intervention improves prompt adherence while limiting aesthetic degradation. In addition, we introduce OmitI2V to evaluate semantic negligence in TI2V generation, comprising 367 human-annotated samples that span addition, deletion, and modification scenarios. Extensive experiments demonstrate that AlignVid can enhance semantic fidelity.

[119] EvalTalker: Learning to Evaluate Real-Portrait-Driven Multi-Subject Talking Humans cs.CVPDF

Yingjie Zhou, Xilei Zhu, Siyu Ren, Ziyi Zhao, Ziwen Wang

TL;DR: 论文提出EvalTalker框架，用于评估多主体说话人（Multi-Talker）生成质量，并通过THQA-MT数据集分析了15种代表性Multi-Talker的感知差异和常见失真类型。

Details

Motivation: 当前Multi-Talker在多主体驱动能力上存在质量下降问题，限制了用户体验。为解决这一问题，需要一种评估框架来提升生成质量。

Result: EvalTalker在主观评分相关性上表现优异，为未来高质量Multi-Talker生成和评估研究奠定基础。

Insight: 多模态同步性和身份一致性是评估Multi-Talker生成质量的关键因素，EvalTalker为相关研究提供了标准化评估工具。

Abstract: Speech-driven Talking Human (TH) generation, commonly known as “Talker,” currently faces limitations in multi-subject driving capabilities. Extending this paradigm to “Multi-Talker,” capable of animating multiple subjects simultaneously, introduces richer interactivity and stronger immersion in audiovisual communication. However, current Multi-Talkers still exhibit noticeable quality degradation caused by technical limitations, resulting in suboptimal user experiences. To address this challenge, we construct THQA-MT, the first large-scale Multi-Talker-generated Talking Human Quality Assessment dataset, consisting of 5,492 Multi-Talker-generated THs (MTHs) from 15 representative Multi-Talkers using 400 real portraits collected online. Through subjective experiments, we analyze perceptual discrepancies among different Multi-Talkers and identify 12 common types of distortion. Furthermore, we introduce EvalTalker, a novel TH quality assessment framework. This framework possesses the ability to perceive global quality, human characteristics, and identity consistency, while integrating Qwen-Sync to perceive multimodal synchrony. Experimental results demonstrate that EvalTalker achieves superior correlation with subjective scores, providing a robust foundation for future research on high-quality Multi-Talker generation and evaluation.

[120] InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision cs.CVPDF

Chenting Wang, Yuhan Zhu, Yicheng Xu, Jiange Yang, Ziang Yan

TL;DR: 论文提出了InternVideo-Next，一种无需视频-文本监督的通用视频基础模型，通过两阶段预训练方案解决了像素级重建和语义抽象之间的冲突。

Details

Motivation: 大规模视频-文本预训练依赖噪声大、语义覆盖有限的合成字幕，忽略了隐式世界知识；而掩码视频建模（MVM）虽直接利用时空结构，但在通用任务上表现较差。论文旨在弥合这一差距。

Result: 在公开无标签视频上训练，InternVideo-Next在多个基准测试中达到SOTA性能。

Insight: 通过分离编码器-解码器设计和两阶段预训练，可以更好地平衡像素级重建和语义抽象，为通用视频表示学习提供了可扩展路径。

Abstract: Large-scale video-text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage, often overlooking implicit world knowledge such as object motion, 3D geometry, and physical cues. In contrast, masked video modeling (MVM) directly exploits spatiotemporal structures but trails text-supervised methods on general tasks. We find this gap arises from overlooked architectural issues: pixel-level reconstruction struggles with convergence and its low-level requirement often conflicts with semantics, while latent prediction often encourages shortcut learning. To address these, we disentangle the traditional encoder-decoder design into an Encoder-Predictor-Decoder (EPD) framework, where the predictor acts as a latent world model, and propose InternVideo-Next, a two-stage pretraining scheme that builds a semantically consistent yet detail-preserving latent space for this world model. First, conventional linear decoder in pixel MVM enforces the predictor output latent to be linearly projected to, thus separable in pixel space, causing the conflict with semantic abstraction. Our Stage 1 proposes a conditional diffusion decoder and injects reliable image-level semantic priors to enhance semantics and convergence, thus bridging pixel-level fidelity with high-level semantic abstraction. Stage 2 further learns world knowledge by predicting frozen Stage 1 targets within this space, mitigating shortcut learning. Trained on public, unlabeled videos, InternVideo-Next achieves state-of-the-art results across benchmarks and provides a scalable path toward general video representation learning.

[121] Handwritten Text Recognition for Low Resource Languages cs.CVPDF

Sayantan Dey, Alireza Alaei, Partha Pratim Roy

TL;DR: 这篇论文提出了一种名为BharatOCR的新型分割自由段落级手写印地语和乌尔都语文本识别方法，结合ViT、Transformer解码器和预训练语言模型，显著提升了低资源语言的OCR性能。

Details

Motivation: 手写文本识别在低资源语言（如印地语、乌尔都语）中面临巨大挑战，因为这些语言缺乏全面的语言资源。现有的OCR系统在这些语言中的表现不佳，亟需一种高效方法解决这一问题。

Result: 在多个数据集（NUST-UHWR、PUCIT-OUHL、Parimal-Urdu）上取得了96.24%、92.05%、94.80%的字符识别率，在印地语数据集上达到80.64%，优于现有方法。

Insight: 1. 结合视觉和语言模型可显著提升OCR性能；2. 低资源语言中，预训练语言模型对提升输出质量至关重要；3. 隐式分割技术简化了段落级文本识别的复杂性。

Abstract: Despite considerable progress in handwritten text recognition, paragraph-level handwritten text recognition, especially in low-resource languages, such as Hindi, Urdu and similar scripts, remains a challenging problem. These languages, often lacking comprehensive linguistic resources, require special attention to develop robust systems for accurate optical character recognition (OCR). This paper introduces BharatOCR, a novel segmentation-free paragraph-level handwritten Hindi and Urdu text recognition. We propose a ViT-Transformer Decoder-LM architecture for handwritten text recognition, where a Vision Transformer (ViT) extracts visual features, a Transformer decoder generates text sequences, and a pre-trained language model (LM) refines the output to improve accuracy, fluency, and coherence. Our model utilizes a Data-efficient Image Transformer (DeiT) model proposed for masked image modeling in this research work. In addition, we adopt a RoBERTa architecture optimized for masked language modeling (MLM) to enhance the linguistic comprehension and generative capabilities of the proposed model. The transformer decoder generates text sequences from visual embeddings. This model is designed to iteratively process a paragraph image line by line, called implicit line segmentation. The proposed model was evaluated using our custom dataset (‘Parimal Urdu’) and (‘Parimal Hindi’), introduced in this research work, as well as two public datasets. The proposed model achieved benchmark results in the NUST-UHWR, PUCIT-OUHL, and Parimal-Urdu datasets, achieving character recognition rates of 96.24%, 92.05%, and 94.80%, respectively. The model also provided benchmark results using the Hindi dataset achieving a character recognition rate of 80.64%. The results obtained from our proposed model indicated that it outperformed several state-of-the-art Urdu text recognition methods.

[122] OpenBox: Annotate Any Bounding Boxes in 3D cs.CVPDF

In-Jae Lee, Mungyeom Kim, Kwonyoung Ryu, Pierre Musacchio, Jaesik Park

TL;DR: OpenBox是一个两阶段自动标注管道，利用2D视觉基础模型为3D点云生成高质量边界框标注，无需自训练。

Details

Motivation: 现有3D目标检测方法通常忽略物体的物理状态，且需要多次自训练迭代优化标注，导致质量和效率不高。OpenBox旨在解决这些问题。

Result: 在Waymo、Lyft和nuScenes数据集上验证，OpenBox在标注质量和效率上优于基线方法。

Insight: 利用2D基础模型的信息可以有效辅助3D标注任务，减少对昂贵3D标注的依赖。

Abstract: Unsupervised and open-vocabulary 3D object detection has recently gained attention, particularly in autonomous driving, where reducing annotation costs and recognizing unseen objects are critical for both safety and scalability. However, most existing approaches uniformly annotate 3D bounding boxes, ignore objects’ physical states, and require multiple self-training iterations for annotation refinement, resulting in suboptimal quality and substantial computational overhead. To address these challenges, we propose OpenBox, a two-stage automatic annotation pipeline that leverages a 2D vision foundation model. In the first stage, OpenBox associates instance-level cues from 2D images processed by a vision foundation model with the corresponding 3D point clouds via cross-modal instance alignment. In the second stage, it categorizes instances by rigidity and motion state, then generates adaptive bounding boxes with class-specific size statistics. As a result, OpenBox produces high-quality 3D bounding box annotations without requiring self-training. Experiments on the Waymo Open Dataset, the Lyft Level 5 Perception dataset, and the nuScenes dataset demonstrate improved accuracy and efficiency over baselines.

[123] BlinkBud: Detecting Hazards from Behind via Sampled Monocular 3D Detection on a Single Earbud cs.CV | cs.HC | cs.LGPDF

Yunzhe Li, Jiajun Yan, Yuzhou Wei, Kechen Liu, Yize Zhao

TL;DR: BlinkBud是一种利用单耳机和配对手机的轻量级系统，通过采样单目3D检测技术检测用户后方的危险物，结合卡尔曼滤波和强化学习优化采样策略，显著降低能耗并提高准确性。

Details

Motivation: 骑行者或行人难以察觉后方高速接近的车辆，存在安全隐患。现有方案通常耗电高或准确性不足。

Result: 原型系统能耗极低（耳机29.8mW，手机702.6mW），误报率4.90%，漏检率1.47%。

Insight: 轻量化采样策略结合动态姿态校正是解决移动设备实时检测的有效途径。

Abstract: Failing to be aware of speeding vehicles approaching from behind poses a huge threat to the road safety of pedestrians and cyclists. In this paper, we propose BlinkBud, which utilizes a single earbud and a paired phone to online detect hazardous objects approaching from behind of a user. The core idea is to accurately track visually identified objects utilizing a small number of sampled camera images taken from the earbud. To minimize the power consumption of the earbud and the phone while guaranteeing the best tracking accuracy, a novel 3D object tracking algorithm is devised, integrating both a Kalman filter based trajectory estimation scheme and an optimal image sampling strategy based on reinforcement learning. Moreover, the impact of constant user head movements on the tracking accuracy is significantly eliminated by leveraging the estimated pitch and yaw angles to correct the object depth estimation and align the camera coordinate system to the user’s body coordinate system, respectively. We implement a prototype BlinkBud system and conduct extensive real-world experiments. Results show that BlinkBud is lightweight with ultra-low mean power consumptions of 29.8 mW and 702.6 mW on the earbud and smartphone, respectively, and can accurately detect hazards with a low average false positive ratio (FPR) and false negative ratio (FNR) of 4.90% and 1.47%, respectively.

[124] Textured Geometry Evaluation: Perceptual 3D Textured Shape Metric via 3D Latent-Geometry Network cs.CVPDF

Tianyu Luan, Xuelu Feng, Zixin Zhu, Phani Nuney, Sheng Liu

TL;DR: 论文提出了一种新的纹理几何评估方法TGE，通过直接分析3D网格和纹理，避免了渲染的局限性，结合几何与颜色信息评估3D模型的保真度，并在真实数据集上超越现有方法。

Details

Motivation: 现有的3D模型评估方法（如Chamfer Distance）往往与人类感知不一致，而基于渲染的2D图像质量指标又存在视角敏感性和结构覆盖不全的问题，且多数方法依赖合成数据，与实际失真存在差距。

Result: 实验表明，TGE在真实世界失真数据集上优于基于渲染和仅几何的评估方法。

Insight: 直接在3D空间中进行评估可以避免渲染带来的视角偏差和结构覆盖不全问题，结合颜色和几何信息能更全面地反映人类感知的保真度。

Abstract: Textured high-fidelity 3D models are crucial for games, AR/VR, and film, but human-aligned evaluation methods still fall behind despite recent advances in 3D reconstruction and generation. Existing metrics, such as Chamfer Distance, often fail to align with how humans evaluate the fidelity of 3D shapes. Recent learning-based metrics attempt to improve this by relying on rendered images and 2D image quality metrics. However, these approaches face limitations due to incomplete structural coverage and sensitivity to viewpoint choices. Moreover, most methods are trained on synthetic distortions, which differ significantly from real-world distortions, resulting in a domain gap. To address these challenges, we propose a new fidelity evaluation method that is based directly on 3D meshes with texture, without relying on rendering. Our method, named Textured Geometry Evaluation TGE, jointly uses the geometry and color information to calculate the fidelity of the input textured mesh with comparison to a reference colored shape. To train and evaluate our metric, we design a human-annotated dataset with real-world distortions. Experiments show that TGE outperforms rendering-based and geometry-only methods on real-world distortion dataset.

[125] PointNet4D: A Lightweight 4D Point Cloud Video Backbone for Online and Offline Perception in Robotic Applications cs.CVPDF

Yunze Liu, Zifan Wang, Peiran Wu, Jiayang Ao

TL;DR: PointNet4D是一个轻量级的4D点云视频骨干网络，专为机器人在线和离线感知任务设计，通过创新的Hybrid Mamba-Transformer时序融合模块和4DMAP预训练策略，显著提升了计算效率和时序建模能力。

Details

Motivation: 动态4D环境（3D空间随时间演化）的理解对机器人应用至关重要。现有4D骨干网络依赖高计算成本的时空卷积和Transformer，难以满足实时资源约束需求。

Result: 在7个数据集的9个任务上评估，表现一致优于基线。在RoboTwin和HandoverSim机器人应用基准上取得显著提升。

Insight: 轻量化设计（如Mamba的应用）在实时4D任务中具有潜力；时序预训练策略（4DMAP）能有效提升动态场景建模能力。

Abstract: Understanding dynamic 4D environments-3D space evolving over time-is critical for robotic and interactive systems. These applications demand systems that can process streaming point cloud video in real-time, often under resource constraints, while also benefiting from past and present observations when available. However, current 4D backbone networks rely heavily on spatiotemporal convolutions and Transformers, which are often computationally intensive and poorly suited to real-time applications. We propose PointNet4D, a lightweight 4D backbone optimized for both online and offline settings. At its core is a Hybrid Mamba-Transformer temporal fusion block, which integrates the efficient state-space modeling of Mamba and the bidirectional modeling power of Transformers. This enables PointNet4D to handle variable-length online sequences efficiently across different deployment scenarios. To enhance temporal understanding, we introduce 4DMAP, a frame-wise masked auto-regressive pretraining strategy that captures motion cues across frames. Our extensive evaluations across 9 tasks on 7 datasets, demonstrating consistent improvements across diverse domains. We further demonstrate PointNet4D’s utility by building two robotic application systems: 4D Diffusion Policy and 4D Imitation Learning, achieving substantial gains on the RoboTwin and HandoverSim benchmarks.

[126] FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution cs.CVPDF

Seungho Choi, Jeahun Sung, Jihyong Oh

TL;DR: FRAMER提出了一种利用扩散先验的频率对齐自蒸馏方法，通过自适应调制改善真实图像超分辨率任务的性能和感知质量。

Details

Motivation: 真实图像超分辨率任务面临混合未知退化问题，扩散模型虽在感知质量上优于GANs，但在高频细节重建上存在不足，主要由于低频偏差和深度层次的“低频优先”特性。

Result: 在U-Net和DiT主干网络（如Stable Diffusion 2, 3）上，FRAMER显著提高了PSNR/SSIM和感知指标（LPIPS、NIQE、MANIQA、MUSIQ）。

Insight: FRAMER的关键在于利用频率对齐的动态调制，有效缓解扩散模型的低频偏差问题，同时提升高频细节的还原能力。

Abstract: Real-image super-resolution (Real-ISR) seeks to recover HR images from LR inputs with mixed, unknown degradations. While diffusion models surpass GANs in perceptual quality, they under-reconstruct high-frequency (HF) details due to a low-frequency (LF) bias and a depth-wise “low-first, high-later” hierarchy. We introduce FRAMER, a plug-and-play training scheme that exploits diffusion priors without changing the backbone or inference. At each denoising step, the final-layer feature map teaches all intermediate layers. Teacher and student feature maps are decomposed into LF/HF bands via FFT masks to align supervision with the model’s internal frequency hierarchy. For LF, an Intra Contrastive Loss (IntraCL) stabilizes globally shared structure. For HF, an Inter Contrastive Loss (InterCL) sharpens instance-specific details using random-layer and in-batch negatives. Two adaptive modulators, Frequency-based Adaptive Weight (FAW) and Frequency-based Alignment Modulation (FAM), reweight per-layer LF/HF signals and gate distillation by current similarity. Across U-Net and DiT backbones (e.g., Stable Diffusion 2, 3), FRAMER consistently improves PSNR/SSIM and perceptual metrics (LPIPS, NIQE, MANIQA, MUSIQ). Ablations validate the final-layer teacher and random-layer negatives.

[127] Rice-VL: Evaluating Vision-Language Models for Cultural Understanding Across ASEAN Countries cs.CV | cs.AIPDF

Tushar Pranav, Eshan Pandey, Austria Lyka Diane Bala, Aman Chadha, Indriyati Atmosukarto

TL;DR: 论文提出RICE-VL基准，评估VLMs在东南亚文化理解上的表现，揭示其西方中心偏见，并提出改进需求。

Details

Motivation: 当前VLMs存在西方中心偏见，无法有效服务于东南亚多样化文化背景的需求，亟需针对性评估和改进。

Result: 评估显示VLMs在低资源国家和抽象文化领域表现较差，视觉定位任务暴露其空间和上下文准确性不足。

Insight: VLMs需更包容的文化多样化训练，以提升全球服务能力；基准设计可推动跨文化领域的模型改进。

Abstract: Vision-Language Models (VLMs) excel in multimodal tasks but often exhibit Western-centric biases, limiting their effectiveness in culturally diverse regions like Southeast Asia (SEA). To address this, we introduce RICE-VL, a novel benchmark evaluating VLM cultural understanding across 11 ASEAN countries. RICE-VL includes over 28,000 human-curated Visual Question Answering (VQA) samples – covering True or False, Fill-in-the-Blank, and open-ended formats – and 1,000 image-bounding box pairs for Visual Grounding, annotated by culturally informed experts across 14 sub-ground categories. We propose SEA-LAVE, an extension of the LAVE metric, assessing textual accuracy, cultural alignment, and country identification. Evaluations of six open- and closed-source VLMs reveal significant performance gaps in low-resource countries and abstract cultural domains. The Visual Grounding task tests models’ ability to localize culturally significant elements in complex scenes, probing spatial and contextual accuracy. RICE-VL exposes limitations in VLMs’ cultural comprehension and highlights the need for inclusive model development to better serve diverse global populations.

[128] MDiff4STR: Mask Diffusion Model for Scene Text Recognition cs.CVPDF

Yongkun Du, Miaomiao Zhao, Songlin Fan, Zhineng Chen, Caiyan Jia

TL;DR: MDiff4STR首次将Mask Diffusion模型引入场景文本识别任务，通过改进的噪声策略和token替换机制，提高了识别精度和效率。

Details

Motivation: 由于MDM在视觉语言任务中展现了效率和精度的平衡潜力，但直接应用于STR任务时精度落后于ARMs，因此作者提出MDiff4STR以解决这一差距。

Result: 在多个STR基准测试中，MDiff4STR表现优异，尤其在精度上超越SOTA ARMs，且仅需三步去噪即可快速推理。

Insight: MDM在STR任务中的潜力被低估，通过针对性的改进策略，可以显著提升性能；token替换机制可能是解决模型预测过自信问题的通用方法。

Abstract: Mask Diffusion Models (MDMs) have recently emerged as a promising alternative to auto-regressive models (ARMs) for vision-language tasks, owing to their flexible balance of efficiency and accuracy. In this paper, for the first time, we introduce MDMs into the Scene Text Recognition (STR) task. We show that vanilla MDM lags behind ARMs in terms of accuracy, although it improves recognition efficiency. To bridge this gap, we propose MDiff4STR, a Mask Diffusion model enhanced with two key improvement strategies tailored for STR. Specifically, we identify two key challenges in applying MDMs to STR: noising gap between training and inference, and overconfident predictions during inference. Both significantly hinder the performance of MDMs. To mitigate the first issue, we develop six noising strategies that better align training with inference behavior. For the second, we propose a token-replacement noise mechanism that provides a non-mask noise type, encouraging the model to reconsider and revise overly confident but incorrect predictions. We conduct extensive evaluations of MDiff4STR on both standard and challenging STR benchmarks, covering diverse scenarios including irregular, artistic, occluded, and Chinese text, as well as whether the use of pretraining. Across these settings, MDiff4STR consistently outperforms popular STR models, surpassing state-of-the-art ARMs in accuracy, while maintaining fast inference with only three denoising steps. Code: https://github.com/Topdu/OpenOCR.

[129] \textit{ViRectify}: A Challenging Benchmark for Video Reasoning Correction with Multimodal Large Language Models cs.CVPDF

Xusen Hei, Jiali Chen, Jinyu Yang, Mengchen Zhao, Yi Cai

TL;DR: extit{ViRectify} 是一个用于评估多模态大语言模型（MLLMs）在视频推理中错误纠正能力的综合基准，包含30K标注实例，覆盖动态感知、科学推理和具身决策领域。提出的轨迹证据驱动框架显著提升了模型性能。

Details

Motivation: 现有基准缺乏对MLLMs在视频推理中错误纠正能力的系统性评估，限制了对其弱点的理解和性能改进。

Result: GPT-5在 extit{ViRectify}上仅达到31.94%的纠正准确率；Qwen2.5-VL-7B通过框架超越72B变体，验证了有效性。

Insight: 1) 视频推理错误纠正存在系统性不对称；2) 数据集可用于反思学习；3) extit{ViRectify}为评估MLLMs提供了新方向。

Abstract: As multimodal large language models (MLLMs) frequently exhibit errors in complex video reasoning scenarios, correcting these errors is critical for uncovering their weaknesses and improving performance. However, existing benchmarks lack systematic evaluation of MLLMs’ ability to identify and correct these video reasoning errors. To bridge this gap, we propose \textit{ViRectify}, a comprehensive benchmark to evaluate their fine-grained correction capability. Through an AI-assisted annotation pipeline with human verification, we construct a dataset of over 30\textit{K} instances spanning dynamic perception, scientific reasoning, and embodied decision-making domains. In \textit{ViRectify}, we challenge MLLMs to perform step-wise error identification and generate rationales with key video evidence grounding. In addition, we further propose the trajectory evidence-driven correction framework, comprising step-wise error trajectory and reward modeling on visual evidence-grounded correction. It encourages the model to explicitly concentrate on error propagation and key timestamps for correction. Extensive evaluation across 16 advanced MLLMs demonstrates that our \textit{ViRectify} serves as a challenging testbed, where GPT-5 achieves only 31.94% correction accuracy. Our framework enables a Qwen2.5-VL-7B to consistently outperform the variants of 72B on \textit{ViRectify}, showing the effectiveness of our approach. Further analysis uncovers systematic asymmetries in error correction across models, and our dataset is also a valuable data resource to perform reflection learning. We believe \textit{ViRectify} provides a new direction for comprehensively evaluating the advanced MLLMs in video reasoning.

[130] Language-Guided Open-World Anomaly Segmentation cs.CVPDF

Klara Reichard, Nikolas Brasch, Nassir Navab, Federico Tombari

TL;DR: Clipomaly提出了一种基于CLIP的开放世界和异常分割方法，首次实现了对自动驾驶场景中未知物体的分割和语义标注。

Details

Motivation: 现有开放世界分割方法难以对未知区域赋予有意义的语义标签，而开放词汇分割方法无法适用于异常分割任务，因为未知类别不受限制。

Result: 在异常分割基准测试中达到SOTA，同时提供可解释性和灵活性。

Insight: CLIP的图文嵌入能力可用于开放世界任务，动态词汇扩展是实现灵活标注的关键。

Abstract: Open-world and anomaly segmentation methods seek to enable autonomous driving systems to detect and segment both known and unknown objects in real-world scenes. However, existing methods do not assign semantically meaningful labels to unknown regions, and distinguishing and learning representations for unknown classes remains difficult. While open-vocabulary segmentation methods show promise in generalizing to novel classes, they require a fixed inference vocabulary and thus cannot be directly applied to anomaly segmentation where unknown classes are unconstrained. We propose Clipomaly, the first CLIP-based open-world and anomaly segmentation method for autonomous driving. Our zero-shot approach requires no anomaly-specific training data and leverages CLIP’s shared image-text embedding space to both segment unknown objects and assign human-interpretable names to them. Unlike open-vocabulary methods, our model dynamically extends its vocabulary at inference time without retraining, enabling robust detection and naming of anomalies beyond common class definitions such as those in Cityscapes. Clipomaly achieves state-of-the-art performance on established anomaly segmentation benchmarks while providing interpretability and flexibility essential for practical deployment.

[131] CourtMotion: Learning Event-Driven Motion Representations from Skeletal Data for Basketball cs.CV | cs.MAPDF

Omer Sela, Michael Chertok, Lior Wolf

TL;DR: CourtMotion是一个基于骨骼数据的时空建模框架，用于分析和预测篮球比赛中的事件和战术动作。它通过图神经网络和Transformer架构捕捉球员的细微动作模式及其语义意义，显著提升了轨迹预测和事件识别的性能。

Details

Motivation: 传统方法仅依赖球员位置数据，无法捕捉身体朝向、防守姿态或投篮准备动作等关键信息。CourtMotion旨在通过结合骨骼数据和语义建模，提升篮球事件预测的准确性。

Result: 在NBA跟踪数据上，轨迹预测误差比基于位置的最先进模型降低35%；在传球、投篮、抢断等事件识别任务中显著优于现有方法。

Insight: 骨骼数据能够提供比位置数据更丰富的运动语义信息，而事件投影头的设计有助于模型理解动作的战术意图。

Abstract: This paper presents CourtMotion, a spatiotemporal modeling framework for analyzing and predicting game events and plays as they develop in professional basketball. Anticipating basketball events requires understanding both physical motion patterns and their semantic significance in the context of the game. Traditional approaches that use only player positions fail to capture crucial indicators such as body orientation, defensive stance, or shooting preparation motions. Our two-stage approach first processes skeletal tracking data through Graph Neural Networks to capture nuanced motion patterns, then employs a Transformer architecture with specialized attention mechanisms to model player interactions. We introduce event projection heads that explicitly connect player movements to basketball events like passes, shots, and steals, training the model to associate physical motion patterns with their tactical purposes. Experiments on NBA tracking data demonstrate significant improvements over position-only baselines: 35% reduction in trajectory prediction error compared to state-of-the-art position-based models and consistent performance gains across key basketball analytics tasks. The resulting pretrained model serves as a powerful foundation for multiple downstream tasks, with pick detection, shot taker identification, assist prediction, shot location classification, and shot type recognition demonstrating substantial improvements over existing methods.

[132] ChronosObserver: Taming 4D World with Hyperspace Diffusion Sampling cs.CVPDF

Qisen Wang, Yifan Zhao, Peisen Shen, Jialu Li, Jia Li

TL;DR: ChronosObserver提出了一种无需训练的方法，通过超空间扩散采样实现高保真、3D一致的多视图视频生成。

Details

Motivation: 现有的相机控制视频生成模型难以直接生成3D一致且时间同步的多视图视频，这是实现4D世界建模的关键能力。现有方法依赖数据增强或测试时优化，但泛化性和扩展性受限。

Result: 实验表明，该方法能够生成高保真、3D一致且时间同步的多视图视频，且无需训练或调优扩散模型。

Insight: 通过超空间约束同步多视图采样，可实现高效、高质量的4D场景生成，为未来视频生成技术提供了新思路。

Abstract: Although prevailing camera-controlled video generation models can produce cinematic results, lifting them directly to the generation of 3D-consistent and high-fidelity time-synchronized multi-view videos remains challenging, which is a pivotal capability for taming 4D worlds. Some works resort to data augmentation or test-time optimization, but these strategies are constrained by limited model generalization and scalability issues. To this end, we propose ChronosObserver, a training-free method including World State Hyperspace to represent the spatiotemporal constraints of a 4D world scene, and Hyperspace Guided Sampling to synchronize the diffusion sampling trajectories of multiple views using the hyperspace. Experimental results demonstrate that our method achieves high-fidelity and 3D-consistent time-synchronized multi-view videos generation without training or fine-tuning for diffusion models.

[133] ELVIS: Enhance Low-Light for Video Instance Segmentation in the Dark cs.CVPDF

Joanne Lin, Ruirui Lin, Yini Li, David Bull, Nantheera Anantrasirichai

TL;DR: ELVIS是一个新框架，通过无监督合成低光视频管道、退化谱合成网络和增强解码器头，提升视频实例分割在低光环境下的性能。

Details

Motivation: 低光条件下的视频实例分割面临噪声、模糊和低对比度的挑战，缺乏大规模标注数据和现有合成管道的局限性阻碍了进展。

Result: 在合成低光YouTube-VIS 2019数据集上性能提升高达+3.7AP。

Insight: 通过建模时空退化和域适应策略，可以有效提升低光环境下的视频分割性能。

Abstract: Video instance segmentation (VIS) for low-light content remains highly challenging for both humans and machines alike, due to adverse imaging conditions including noise, blur and low-contrast. The lack of large-scale annotated datasets and the limitations of current synthetic pipelines, particularly in modeling temporal degradations, further hinder progress. Moreover, existing VIS methods are not robust to the degradations found in low-light videos and, as a result, perform poorly even when finetuned on low-light data. In this paper, we introduce \textbf{ELVIS} (\textbf{E}nhance \textbf{L}ow-light for \textbf{V}ideo \textbf{I}nstance \textbf{S}egmentation), a novel framework that enables effective domain adaptation of state-of-the-art VIS models to low-light scenarios. ELVIS comprises an unsupervised synthetic low-light video pipeline that models both spatial and temporal degradations, a calibration-free degradation profile synthesis network (VDP-Net) and an enhancement decoder head that disentangles degradations from content features. ELVIS improves performances by up to \textbf{+3.7AP} on the synthetic low-light YouTube-VIS 2019 dataset. Code will be released upon acceptance.

[134] QuantumCanvas: A Multimodal Benchmark for Visual Learning of Atomic Interactions cs.CV | cond-mat.mtrl-sci | quant-phPDF

Can Polat, Erchin Serpedin, Mustafa Kurban, Hasan Kurban

TL;DR: QuantumCanvas是一个多模态基准数据集，专注于从原子对的量子相互作用中学习，提供了基于物理的图像表示和数值属性标注，提升了量子机器学习模型的性能。

Details

Motivation: 现有分子和材料机器学习模型缺乏物理可迁移性，因为它们通常拟合整体分子或晶体的相关性，而非原子对的量子相互作用。QuantumCanvas旨在填补这一空白。

Result: 在能量缺口等任务中，模型表现优异（如GATv2的MAE为0.201 eV）。预训练后模型在QM9等数据集上展现出更好的泛化能力和收敛稳定性。

Insight: 将轨道物理与视觉表示学习结合，为学习可迁移的量子相互作用提供了原则性方法，同时增强了模型的物理可解释性。

Abstract: Despite rapid advances in molecular and materials machine learning, most models still lack physical transferability: they fit correlations across whole molecules or crystals rather than learning the quantum interactions between atomic pairs. Yet bonding, charge redistribution, orbital hybridization, and electronic coupling all emerge from these two-body interactions that define local quantum fields in many-body systems. We introduce QuantumCanvas, a large-scale multimodal benchmark that treats two-body quantum systems as foundational units of matter. The dataset spans 2,850 element-element pairs, each annotated with 18 electronic, thermodynamic, and geometric properties and paired with ten-channel image representations derived from l- and m-resolved orbital densities, angular field transforms, co-occupancy maps, and charge-density projections. These physically grounded images encode spatial, angular, and electrostatic symmetries without explicit coordinates, providing an interpretable visual modality for quantum learning. Benchmarking eight architectures across 18 targets, we report mean absolute errors of 0.201 eV on energy gap using GATv2, 0.265 eV on HOMO and 0.274 eV on LUMO using EGNN. For energy-related quantities, DimeNet attains 2.27 eV total-energy MAE and 0.132 eV repulsive-energy MAE, while a multimodal fusion model achieves a 2.15 eV Mermin free-energy MAE. Pretraining on QuantumCanvas further improves convergence stability and generalization when fine-tuned on larger datasets such as QM9, MD17, and CrysMTM. By unifying orbital physics with vision-based representation learning, QuantumCanvas provides a principled and interpretable basis for learning transferable quantum interactions through coupled visual and numerical modalities. Dataset and model implementations are available at https://github.com/KurbanIntelligenceLab/QuantumCanvas.

[135] Diffusion Fuzzy System: Fuzzy Rule Guided Latent Multi-Path Diffusion Modeling cs.CV | cs.AIPDF

Hailong Yang, Te Zhang, Kup-sze Choi, Zhaohong Deng

TL;DR: 该论文提出了扩散模糊系统（DFS），通过模糊规则引导的多路径扩散建模解决传统扩散模型在复杂图像特征捕捉和多路径协调中的问题，提升了训练稳定性和生成质量。

Details

Motivation: 扩散模型在生成高质量图像方面表现出色，但难以有效管理具有显著特征差异的图像集合，且在捕捉复杂特征时容易产生冲突结果。现有方法尝试通过多路径学习不同区域图像特征，但协调效率低且计算成本高。

Result: 在LSUN Bedroom、LSUN Church和MS COCO数据集上的实验表明，DFS训练更稳定、收敛更快，且在图像质量、图文对齐和目标参考对比精度上均优于基线模型。

Insight: 模糊规则和多路径设计的结合能够有效提升扩散模型对复杂特征的建模能力，同时动态协调和潜在空间压缩为解决多路径扩散的效率问题提供了新思路。

Abstract: Diffusion models have emerged as a leading technique for generating images due to their ability to create high-resolution and realistic images. Despite their strong performance, diffusion models still struggle in managing image collections with significant feature differences. They often fail to capture complex features and produce conflicting results. Research has attempted to address this issue by learning different regions of an image through multiple diffusion paths and then combining them. However, this approach leads to inefficient coordination among multiple paths and high computational costs. To tackle these issues, this paper presents a Diffusion Fuzzy System (DFS), a latent-space multi-path diffusion model guided by fuzzy rules. DFS offers several advantages. First, unlike traditional multi-path diffusion methods, DFS uses multiple diffusion paths, each dedicated to learning a specific class of image features. By assigning each path to a different feature type, DFS overcomes the limitations of multi-path models in capturing heterogeneous image features. Second, DFS employs rule-chain-based reasoning to dynamically steer the diffusion process and enable efficient coordination among multiple paths. Finally, DFS introduces a fuzzy membership-based latent-space compression mechanism to reduce the computational costs of multi-path diffusion effectively. We tested our method on three public datasets: LSUN Bedroom, LSUN Church, and MS COCO. The results show that DFS achieves more stable training and faster convergence than existing single-path and multi-path diffusion models. Additionally, DFS surpasses baseline models in both image quality and alignment between text and images, and also shows improved accuracy when comparing generated images to target references.

[136] FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention cs.CVPDF

Zipeng Wang, Dan Xu

TL;DR: FlashVGGT提出了一种基于描述符注意力机制的高效视觉几何变换器，通过压缩空间信息为紧凑的描述符令牌，大幅降低了计算开销，并在多视图图像3D重建任务中表现出色。

Details

Motivation: 传统的视觉几何变换器（如VGGT）由于使用全自注意力机制，计算复杂度高且无法扩展到长序列图像。FlashVGGT旨在解决这一问题，提升模型的效率和可扩展性。

Result: 实验结果表明，FlashVGGT在1,000张图像上的推理时间仅为VGGT的9.3%，且能扩展到3,000张以上的图像序列，同时保持重建精度。

Insight: 通过压缩和复用描述符，可以有效降低Transformer模型的复杂度，尤其在处理长序列任务时表现出高效性和可扩展性。

Abstract: 3D reconstruction from multi-view images is a core challenge in computer vision. Recently, feed-forward methods have emerged as efficient and robust alternatives to traditional per-scene optimization techniques. Among them, state-of-the-art models like the Visual Geometry Grounding Transformer (VGGT) leverage full self-attention over all image tokens to capture global relationships. However, this approach suffers from poor scalability due to the quadratic complexity of self-attention and the large number of tokens generated in long image sequences. In this work, we introduce FlashVGGT, an efficient alternative that addresses this bottleneck through a descriptor-based attention mechanism. Instead of applying dense global attention across all tokens, FlashVGGT compresses spatial information from each frame into a compact set of descriptor tokens. Global attention is then computed as cross-attention between the full set of image tokens and this smaller descriptor set, significantly reducing computational overhead. Moreover, the compactness of the descriptors enables online inference over long sequences via a chunk-recursive mechanism that reuses cached descriptors from previous chunks. Experimental results show that FlashVGGT achieves reconstruction accuracy competitive with VGGT while reducing inference time to just 9.3% of VGGT for 1,000 images, and scaling efficiently to sequences exceeding 3,000 images. Our project page is available at https://wzpscott.github.io/flashvggt_page/.

[137] RoleMotion: A Large-Scale Dataset towards Robust Scene-Specific Role-Playing Motion Synthesis with Fine-grained Descriptions cs.CV | cs.AIPDF

Junran Peng, Yiheng Huang, Silei Shen, Zeji Wei, Jingwei Yang

TL;DR: RoleMotion是一个大规模的人类动作数据集，专注于场景和角色的精细描述，填补了现有数据集在功能性和一致性上的不足。

Details

Motivation: 现有文本数据集通常是分散的、功能单一且缺乏精细标注，无法覆盖多样化的社交活动场景。RoleMotion旨在解决这些问题，提供一个高质量、场景专用的数据集。

Result: 实验结果表明，RoleMotion在文本驱动的全身动作生成任务中表现出高质量和功能性。

Insight: \n1. 精细的场景和角色设计可以显著提升动作数据的实用性和覆盖范围。\n2. 身体和手部动作的联合生成是一个值得探索的方向。

Abstract: In this paper, we introduce RoleMotion, a large-scale human motion dataset that encompasses a wealth of role-playing and functional motion data tailored to fit various specific scenes. Existing text datasets are mainly constructed decentrally as amalgamation of assorted subsets that their data are nonfunctional and isolated to work together to cover social activities in various scenes. Also, the quality of motion data is inconsistent, and textual annotation lacks fine-grained details in these datasets. In contrast, RoleMotion is meticulously designed and collected with a particular focus on scenes and roles. The dataset features 25 classic scenes, 110 functional roles, over 500 behaviors, and 10296 high-quality human motion sequences of body and hands, annotated with 27831 fine-grained text descriptions. We build an evaluator stronger than existing counterparts, prove its reliability, and evaluate various text-to-motion methods on our dataset. Finally, we explore the interplay of motion generation of body and hands. Experimental results demonstrate the high-quality and functionality of our dataset on text-driven whole-body generation.

[138] Depth Matching Method Based on ShapeDTW for Oil-Based Mud Imager cs.CV | physics.geo-phPDF

Fengfeng Li, Zhou Feng, Hongliang Wu, Hao Zhang, Han Tian

TL;DR: 本文提出了一种基于ShapeDTW算法的深度匹配方法，用于解决油基泥浆微电阻率成像仪在测井操作中的深度对齐问题，通过结合HOG1D和原始信号的特征提取，实现了复杂纹理图像的精确对齐。

Details

Motivation: 在油基泥浆微电阻率成像仪的测井操作中，即使经过速度校正，上下两组垫片的图像仍存在深度对齐问题。为解决这一问题，本文提出了一种形态敏感的深度匹配方法。

Result: 现场测试表明，该方法能够对具有复杂纹理、深度偏移或局部尺度变化的图像实现精确对齐，并提供了特征扩展的灵活框架。

Insight: 该方法不仅适用于油基泥浆成像仪的深度对齐问题，还可通过集成其他特征描述符，适应不同地质特征的需求，展现了较强的通用性和扩展性。

Abstract: In well logging operations using the oil-based mud (OBM) microresistivity imager, which employs an interleaved design with upper and lower pad sets, depth misalignment issues persist between the pad images even after velocity correction. This paper presents a depth matching method for borehole images based on the Shape Dynamic Time Warping (ShapeDTW) algorithm. The method extracts local shape features to construct a morphologically sensitive distance matrix, better preserving structural similarity between sequences during alignment. We implement this by employing a combined feature set of the one-dimensional Histogram of Oriented Gradients (HOG1D) and the original signal as the shape descriptor. Field test examples demonstrate that our method achieves precise alignment for images with complex textures, depth shifts, or local scaling. Furthermore, it provides a flexible framework for feature extension, allowing the integration of other descriptors tailored to specific geological features.

[139] SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge cs.CV | cs.ROPDF

Yumeng He, Ying Jiang, Jiayin Lu, Yin Yang, Chenfanfu Jiang

TL;DR: SPARK是一个从单张RGB图像重建具有物理一致性的、运动结构的3D物体部件的框架，结合VLM知识生成模拟就绪的关节物体。

Details

Motivation: 当前创建模拟就绪的关节3D物体需要专家建模，且过程耗时费力。SPARK旨在通过VLM和生成模型自动化这一过程。

Result: 实验表明SPARK能生成多样类别的高质量、模拟就绪关节资产，适用于机器人操作和交互建模。

Insight: 结合VLM知识和生成模型，可以自动化生成复杂的关节3D物体，减少专家干预需求。

Abstract: Articulated 3D objects are critical for embodied AI, robotics, and interactive scene understanding, yet creating simulation-ready assets remains labor-intensive and requires expert modeling of part hierarchies and motion structures. We introduce SPARK, a framework for reconstructing physically consistent, kinematic part-level articulated objects from a single RGB image. Given an input image, we first leverage VLMs to extract coarse URDF parameters and generate part-level reference images. We then integrate the part-image guidance and the inferred structure graph into a generative diffusion transformer to synthesize consistent part and complete shapes of articulated objects. To further refine the URDF parameters, we incorporate differentiable forward kinematics and differentiable rendering to optimize joint types, axes, and origins under VLM-generated open-state supervision. Extensive experiments show that SPARK produces high-quality, simulation-ready articulated assets across diverse categories, enabling downstream applications such as robotic manipulation and interaction modeling.

[140] Generative Editing in the Joint Vision-Language Space for Zero-Shot Composed Image Retrieval cs.CVPDF

Xin Wang, Haipeng Zhang, Mang Li, Zhaohui Xia, Yueguo Chen

TL;DR: 该论文提出了一种名为Fusion-Diff的生成式编辑框架，用于解决零样本组合图像检索（ZS-CIR）中的模态鸿沟问题。通过多模态融合特征编辑策略和轻量级Control-Adapter，仅需小规模合成数据集即可实现高效检索。

Details

Motivation: 现有的零样本CIR方法在桥接视觉-语言模态鸿沟方面表现不佳，且监督方法依赖昂贵的三元组标注数据。作者提出了一种无需标注数据的高效生成式编辑框架。

Result: 在CIRR、FashionIQ和CIRCO等标准基准测试中，Fusion-Diff显著优于现有零样本方法。

Insight: 生成式编辑和多模态特征融合是一种有效的零样本CIR解决方案，尤其是在数据效率和解耦模态对齐方面表现突出。

Abstract: Composed Image Retrieval (CIR) enables fine-grained visual search by combining a reference image with a textual modification. While supervised CIR methods achieve high accuracy, their reliance on costly triplet annotations motivates zero-shot solutions. The core challenge in zero-shot CIR (ZS-CIR) stems from a fundamental dilemma: existing text-centric or diffusion-based approaches struggle to effectively bridge the vision-language modality gap. To address this, we propose Fusion-Diff, a novel generative editing framework with high effectiveness and data efficiency designed for multimodal alignment. First, it introduces a multimodal fusion feature editing strategy within a joint vision-language (VL) space, substantially narrowing the modality gap. Second, to maximize data efficiency, the framework incorporates a lightweight Control-Adapter, enabling state-of-the-art performance through fine-tuning on only a limited-scale synthetic dataset of 200K samples. Extensive experiments on standard CIR benchmarks (CIRR, FashionIQ, and CIRCO) demonstrate that Fusion-Diff significantly outperforms prior zero-shot approaches. We further enhance the interpretability of our model by visualizing the fused multimodal representations.

[141] ViT$^3$: Unlocking Test-Time Training in Vision cs.CVPDF

Dongchen Han, Yining Li, Tianyu Li, Zixuan Cao, Ziming Wang

TL;DR: ViT$^3$是一种纯测试时训练（TTT）架构，通过系统实证研究总结了六条实用设计原则，实现了线性复杂度和并行计算，并在多个视觉任务中表现出色。

Details

Motivation: 测试时训练（TTT）在视觉序列建模中潜力巨大，但目前缺乏对内部模块和训练选择的系统性理解与设计指南。ViT$^3$旨在填补这一空白。

Result: ViT$^3$在图像分类、生成、目标检测和语义分割等任务中，性能优于或匹配现有线性复杂度模型，并缩小了与优化视觉Transformer的差距。

Insight: 1. TTT设计需要关注内部模块和训练的动态平衡；2. 线性复杂度和并行化是实现高效视觉TTT的关键；3. ViT$^3$为未来研究提供了可扩展的基线。

Abstract: Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, in this paper, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training (ViT$^3$) model, a pure TTT architecture that achieves linear complexity and parallelizable computation. We evaluate ViT$^3$ across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that ViT$^3$ consistently matches or outperforms advanced linear-complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the ViT$^3$ baseline can facilitate future work on visual TTT models. Code is available at https://github.com/LeapLabTHU/ViTTT.

[142] Bridging the Scale Gap: Balanced Tiny and General Object Detection in Remote Sensing Imagery cs.CVPDF

Zhicheng Zhao, Yin Huang, Lingma Sun, Chenglong Li, Jin Tang

TL;DR: 该论文提出了ScaleBridge-Det框架，首次为遥感图像中的微小物体设计了大规模检测方法，通过尺度自适应专家路由和密度引导查询分配，实现了对不同尺度物体的平衡检测性能。

Details

Motivation: 遥感图像中的微小物体检测面临极端尺度变化和密度分布的挑战，现有大规模基础模型未能有效解决这一问题。论文旨在填补这一研究空白，提升多尺度物体的检测平衡性。

Result: 在AI-TOD-V2和DTOD数据集上实现了最先进的性能，并在VisDrone上展示了卓越的跨域鲁棒性。

Insight: 通过尺度自适应和密度引导机制，能够在多尺度密集分布场景中实现平衡检测，解决了传统方法在微小物体和大物体共存时的性能瓶颈。

Abstract: Tiny object detection in remote sensing imagery has attracted significant research interest in recent years. Despite recent progress, achieving balanced detection performance across diverse object scales remains a formidable challenge, particularly in scenarios where dense tiny objects and large objects coexist. Although large foundation models have revolutionized general vision tasks, their application to tiny object detection remains unexplored due to the extreme scale variation and density distribution inherent to remote sensing imagery. To bridge this scale gap, we propose ScaleBridge-Det, to the best of our knowledge, the first large detection framework designed for tiny objects, which could achieve balanced performance across diverse scales through scale-adaptive expert routing and density-guided query allocation. Specifically, we introduce a Routing-Enhanced Mixture Attention (REM) module that dynamically selects and fuses scale-specific expert features via adaptive routing to address the tendency of standard MoE models to favor dominant scales. REM generates complementary and discriminative multi-scale representations suitable for both tiny and large objects. Furthermore, we present a Density-Guided Dynamic Query (DGQ) module that predicts object density to adaptively adjust query positions and numbers, enabling efficient resource allocation for objects of varying scales. The proposed framework allows ScaleBridge-Det to simultaneously optimize performance for both dense tiny and general objects without trade-offs. Extensive experiments on benchmark and cross-domain datasets demonstrate that ScaleBridge-Det achieves state-of-the-art performance on AI-TOD-V2 and DTOD, while exhibiting superior cross-domain robustness on VisDrone.

[143] Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation cs.CVPDF

Haodong Yan, Hang Yu, Zhide Zhong, Weilin Yuan, Xin Gong

TL;DR: 该论文提出了一种基于结构和接触感知表示的开放手-物体交互（HOI）视频生成方法，解决了现有方法在2D和3D表示之间的权衡问题，并结合联合生成范式实现了物理真实和时间一致的视频生成。

Details

Motivation: 目前的手-物体交互视频生成方法在建模物理约束（如接触和遮挡）时面临挑战，2D和3D表示无法同时保证可扩展性和交互真实性。

Result: 该方法在生成物理真实和时间一致的HOI视频方面优于现有技术，并在开放世界场景中表现出强大的泛化能力。

Insight: 交互导向和可扩展的监督信号有助于模型学习细粒度交互物理，并适应开放世界场景。

Abstract: Generating realistic hand-object interactions (HOI) videos is a significant challenge due to the difficulty of modeling physical constraints (e.g., contact and occlusion between hands and manipulated objects). Current methods utilize HOI representation as an auxiliary generative objective to guide video synthesis. However, there is a dilemma between 2D and 3D representations that cannot simultaneously guarantee scalability and interaction fidelity. To address this limitation, we propose a structure and contact-aware representation that captures hand-object contact, hand-object occlusion, and holistic structure context without 3D annotations. This interaction-oriented and scalable supervision signal enables the model to learn fine-grained interaction physics and generalize to open-world scenarios. To fully exploit the proposed representation, we introduce a joint-generation paradigm with a share-and-specialization strategy that generates interaction-oriented representations and videos. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on two real-world datasets in generating physics-realistic and temporally coherent HOI videos. Furthermore, our approach exhibits strong generalization to challenging open-world scenarios, highlighting the benefit of our scalable design. Our project page is https://hgzn258.github.io/SCAR/.

[144] DreamingComics: A Story Visualization Pipeline via Subject and Layout Customized Generation using Video Models cs.CVPDF

Patrick Kwon, Chen Chen

TL;DR: DreamingComics提出了一种基于预训练视频扩散-Transformer（DiT）的故事可视化框架，通过RegionalRoPE编码和LLM布局生成器，提升了角色一致性和风格相似性。

Details

Motivation: 当前故事可视化方法主要依赖文本定位主题，难以保持艺术一致性，DreamingComics旨在解决这一问题。

Result: 角色一致性提升29.2%，风格相似性提升36.2%，空间准确性显著提高。

Insight: 视频模型的时空先验对故事可视化任务有益，LLM生成布局提供了灵活的控制手段。

Abstract: Current story visualization methods tend to position subjects solely by text and face challenges in maintaining artistic consistency. To address these limitations, we introduce DreamingComics, a layout-aware story visualization framework. We build upon a pretrained video diffusion-transformer (DiT) model, leveraging its spatiotemporal priors to enhance identity and style consistency. For layout-based position control, we propose RegionalRoPE, a region-aware positional encoding scheme that re-indexes embeddings based on the target layout. Additionally, we introduce a masked condition loss to further constrain each subject’s visual features to their designated region. To infer layouts from natural language scripts, we integrate an LLM-based layout generator trained to produce comic-style layouts, enabling flexible and controllable layout conditioning. We present a comprehensive evaluation of our approach, showing a 29.2% increase in character consistency and a 36.2% increase in style similarity compared to previous methods, while displaying high spatial accuracy. Our project page is available at https://yj7082126.github.io/dreamingcomics/

[145] FreqEdit: Preserving High-Frequency Features for Robust Multi-Turn Image Editing cs.CVPDF

Yucheng Liao, Jiajun Liang, Kaiqian Cui, Baoquan Zhao, Haoran Xie

TL;DR: FreqEdit 是一个无需训练的图像多轮编辑框架，通过高频特征注入、自适应注入策略和路径补偿机制，有效解决了多轮编辑中的高频信息丢失问题。

Details

Motivation: 当前的指令驱动图像编辑模型在单次编辑中表现优异，但在多轮编辑中会出现严重的质量退化。研究发现，高频信息的渐进性丢失是主要原因。

Result: 实验表明，FreqEdit 在身份保持和指令遵循方面优于七种现有方法。

Insight: 高频信息的保留是多轮编辑质量的关键，通过动态注入和补偿机制可以有效解决这一问题。

Abstract: Instruction-based image editing through natural language has emerged as a powerful paradigm for intuitive visual manipulation. While recent models achieve impressive results on single edits, they suffer from severe quality degradation under multi-turn editing. Through systematic analysis, we identify progressive loss of high-frequency information as the primary cause of this quality degradation. We present FreqEdit, a training-free framework that enables stable editing across 10+ consecutive iterations. Our approach comprises three synergistic components: (1) high-frequency feature injection from reference velocity fields to preserve fine-grained details, (2) an adaptive injection strategy that spatially modulates injection strength for precise region-specific control, and (3) a path compensation mechanism that periodically recalibrates the editing trajectory to prevent over-constraint. Extensive experiments demonstrate that FreqEdit achieves superior performance in both identity preservation and instruction following compared to seven state-of-the-art baselines.

[146] VideoScoop: A Non-Traditional Domain-Independent Framework For Video Analysis cs.CV | cs.DBPDF

Hafsa Billah

TL;DR: VideoScoop提出了一种非传统的、领域无关的视频分析框架，结合了关系模型和图模型，支持通用视频情境分析（VSA）。

Details

Motivation: 传统视频情境分析需要人工参与或自定义算法，无法通用且效率低。该研究旨在提供一种通用框架，克服这些局限。

Result: 实验表明框架在AL、CM和SL领域的情境检测中准确、高效且鲁棒。

Insight: 多模态表示与通用查询语言结合，提高了视频情境分析的灵活性和适用范围。

Abstract: Automatically understanding video contents is important for several applications in Civic Monitoring (CM), general Surveillance (SL), Assisted Living (AL), etc. Decades of Image and Video Analysis (IVA) research have advanced tasks such as content extraction (e.g., object recognition and tracking). Identifying meaningful activities or situations (e.g., two objects coming closer) remains difficult and cannot be achieved by content extraction alone. Currently, Video Situation Analysis (VSA) is done manually with a human in the loop, which is error-prone and labor-intensive, or through custom algorithms designed for specific video types or situations. These algorithms are not general-purpose and require a new algorithm/software for each new situation or video from a new domain. This report proposes a general-purpose VSA framework that overcomes the above limitations. Video contents are extracted once using state-of-the-art Video Content Extraction technologies. They are represented using two alternative models – the extended relational model (R++) and graph models. When represented using R++, the extracted contents can be used as data streams, enabling Continuous Query Processing via the proposed Continuous Query Language for Video Analysis. The graph models complement this by enabling the detection of situations that are difficult or impossible to detect using the relational model alone. Existing graph algorithms and newly developed algorithms support a wide variety of situation detection. To support domain independence, primitive situation variants across domains are identified and expressed as parameterized templates. Extensive experiments were conducted across several interesting situations from three domains – AL, CM, and SL– to evaluate the accuracy, efficiency, and robustness of the proposed approach using a dataset of videos of varying lengths from these domains.

[147] Robust Rigid and Non-Rigid Medical Image Registration Using Learnable Edge Kernels cs.CVPDF

Ahsan Raza Siyal, Markus Haltmeier, Ruth Steiger, Malik Galijasevic, Elke Ruth Gizewski

TL;DR: 论文提出了一种结合可学习边缘核的方法，用于医学图像的刚性和非刚性配准，显著提升了多模态图像的配准效果。

Details

Motivation: 医学图像配准在多模态、时间点或不同主体间的对齐中至关重要，但传统方法在对比度差异、空间扭曲等问题上表现不佳。

Result: 在多个实验设置和公开数据集上，该方法均优于现有技术，显著提升了图像对齐和解剖结构分析的准确性。

Insight: 通过学习最优边缘特征，该方法能够更好地捕捉医学图像中的关键结构信息，从而在多模态配准中表现出更强的鲁棒性和适应性。

Abstract: Medical image registration is crucial for various clinical and research applications including disease diagnosis or treatment planning which require alignment of images from different modalities, time points, or subjects. Traditional registration techniques often struggle with challenges such as contrast differences, spatial distortions, and modality-specific variations. To address these limitations, we propose a method that integrates learnable edge kernels with learning-based rigid and non-rigid registration techniques. Unlike conventional layers that learn all features without specific bias, our approach begins with a predefined edge detection kernel, which is then perturbed with random noise. These kernels are learned during training to extract optimal edge features tailored to the task. This adaptive edge detection enhances the registration process by capturing diverse structural features critical in medical imaging. To provide clearer insight into the contribution of each component in our design, we introduce four variant models for rigid registration and four variant models for non-rigid registration. We evaluated our approach using a dataset provided by the Medical University across three setups: rigid registration without skull removal, with skull removal, and non-rigid registration. Additionally, we assessed performance on two publicly available datasets. Across all experiments, our method consistently outperformed state-of-the-art techniques, demonstrating its potential to improve multi-modal image alignment and anatomical structure analysis.

[148] Evaluating SAM2 for Video Semantic Segmentation cs.CVPDF

Syed Hesham Syed Ariff, Yun Liu, Guolei Sun, Jing Yang, Henghui Ding

TL;DR: 论文探讨了如何将SAM2（Segmentation Anything Model 2）扩展到视频语义分割（VSS）任务，提出了两种主要方法，并验证了SAM2在提升VSS任务性能方面的有效性。

Details

Motivation: SAM2已在图像和视频的即时对象分割中表现出色，但其扩展到密集视频语义分割任务仍面临空间精度、时间一致性和多对象跟踪等挑战。本文旨在探索如何利用SAM2解决这些问题。

Result: 实验表明，SAM2的精确边界预测能力显著提升了VSS任务的整体性能。

Insight: SAM2的对象感知能力使其在复杂场景中表现优异，但其在高密度语义分割任务中仍需进一步优化。

Abstract: The Segmentation Anything Model 2 (SAM2) has proven to be a powerful foundation model for promptable visual object segmentation in both images and videos, capable of storing object-aware memories and transferring them temporally through memory blocks. While SAM2 excels in video object segmentation by providing dense segmentation masks based on prompts, extending it to dense Video Semantic Segmentation (VSS) poses challenges due to the need for spatial accuracy, temporal consistency, and the ability to track multiple objects with complex boundaries and varying scales. This paper explores the extension of SAM2 for VSS, focusing on two primary approaches and highlighting firsthand observations and common challenges faced during this process. The first approach involves using SAM2 to extract unique objects as masks from a given image, with a segmentation network employed in parallel to generate and refine initial predictions. The second approach utilizes the predicted masks to extract unique feature vectors, which are then fed into a simple network for classification. The resulting classifications and masks are subsequently combined to produce the final segmentation. Our experiments suggest that leveraging SAM2 enhances overall performance in VSS, primarily due to its precise predictions of object boundaries.

[149] SAM3-UNet: Simplified Adaptation of Segment Anything Model 3 cs.CVPDF

Xinyu Xiong, Zihuang Wu, Lei Lu, Yufa Xia

TL;DR: SAM3-UNet是Segment Anything Model 3（SAM3）的简化变体，旨在低成本适应下游任务。它由SAM3图像编码器、高效参数微调的简单适配器和轻量级U-Net解码器组成，实验表明其在多个任务上优于SAM2-UNet等方法，且训练内存需求低。

Details

Motivation: 为了解决Segment Anything Model 3（SAM3）在下游任务中适应成本高的问题，提出了SAM3-UNet，旨在提供一个高效且轻量化的解决方案。

Result: 在镜面检测和显著性物体检测等任务中，SAM3-UNet优于SAM2-UNet和其他先进方法，且训练内存需求低（小于6 GB）。

Insight: 通过简化SAM3的结构并引入适配器和轻量级解码器，可以在降低计算资源需求的同时提升下游任务的性能。

Abstract: In this paper, we introduce SAM3-UNet, a simplified variant of Segment Anything Model 3 (SAM3), designed to adapt SAM3 for downstream tasks at a low cost. Our SAM3-UNet consists of three components: a SAM3 image encoder, a simple adapter for parameter-efficient fine-tuning, and a lightweight U-Net-style decoder. Preliminary experiments on multiple tasks, such as mirror detection and salient object detection, demonstrate that the proposed SAM3-UNet outperforms the prior SAM2-UNet and other state-of-the-art methods, while requiring less than 6 GB of GPU memory during training with a batch size of 12. The code is publicly available at https://github.com/WZH0120/SAM3-UNet.

[150] Generative Action Tell-Tales: Assessing Human Motion in Synthesized Videos cs.CVPDF

Xavier Thomas, Youngsun Lim, Ananya Srinivasan, Audrey Zheng, Deepti Ghadiyaram

TL;DR: 该论文提出了一种新的评估指标，用于衡量生成视频中人类动作的视觉和时间正确性。通过结合外观无关的骨骼几何特征和外观特征，该方法在复杂动作的动态和解剖学合理性上优于现有方法。

Details

Motivation: 现有的视频生成模型缺乏评估复杂人类动作视觉和时间正确性的稳健指标，尤其是外观偏见的编码器和多模态大语言模型难以捕捉动作的细节和动力学。

Result: 实验表明，该方法在多个基准测试中表现优异，与人类感知相关性更强，揭示了当前视频生成模型的关键局限性。

Insight: 外观和几何特征的融合是评估生成视频中复杂动作有效性的关键，为视频生成研究设立了新标准。

Abstract: Despite rapid advances in video generative models, robust metrics for evaluating visual and temporal correctness of complex human actions remain elusive. Critically, existing pure-vision encoders and Multimodal Large Language Models (MLLMs) are strongly appearance-biased, lack temporal understanding, and thus struggle to discern intricate motion dynamics and anatomical implausibilities in generated videos. We tackle this gap by introducing a novel evaluation metric derived from a learned latent space of real-world human actions. Our method first captures the nuances, constraints, and temporal smoothness of real-world motion by fusing appearance-agnostic human skeletal geometry features with appearance-based features. We posit that this combined feature space provides a robust representation of action plausibility. Given a generated video, our metric quantifies its action quality by measuring the distance between its underlying representations and this learned real-world action distribution. For rigorous validation, we develop a new multi-faceted benchmark specifically designed to probe temporally challenging aspects of human action fidelity. Through extensive experiments, we show that our metric achieves substantial improvement of more than 68% compared to existing state-of-the-art methods on our benchmark, performs competitively on established external benchmarks, and has a stronger correlation with human perception. Our in-depth analysis reveals critical limitations in current video generative models and establishes a new standard for advanced research in video generation.

[151] Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights cs.CV | cs.AIPDF

Juanxi Tian, Siyuan Li, Conghui He, Lijun Wu, Cheng Tan

TL;DR: 该论文提出了Envision基准，用于评估多模态模型在因果世界过程理解与生成中的表现。通过设计序列化的多图像生成任务，揭示了现有模型在动态过程建模中的局限性。

Details

Motivation: 当前多模态模型依赖静态单图像生成任务，导致模型在动态过程建模和世界知识内化方面表现不足。Envision旨在解决这一问题。

Result: 结果显示，专门化的T2I模型在美学表现上优秀但缺乏世界知识，统一模型在因果一致性上表现更优，但仍落后于闭源模型。

Insight: 动态过程建模需要超越静态模式匹配的能力，世界知识的内化是实现高质量多帧生成的关键。

Abstract: Current multimodal models aim to transcend the limitations of single-modality representations by unifying understanding and generation, often using text-to-image (T2I) tasks to calibrate semantic consistency. However, their reliance on static, single-image generation in training and evaluation leads to overfitting to static pattern matching and semantic fusion, while fundamentally hindering their ability to model dynamic processes that unfold over time. To address these constraints, we propose Envision-a causal event progression benchmark for chained text-to-multi-image generation. Grounded in world knowledge and structured by spatiotemporal causality, it reorganizes existing evaluation dimensions and includes 1,000 four-stage prompts spanning six scientific and humanities domains. To transition evaluation from single images to sequential frames and assess whether models truly internalize world knowledge while adhering to causal-temporal constraints, we introduce Envision-Score, a holistic metric integrating multi-dimensional consistency, physicality, and aesthetics. Comprehensive evaluation of 15 models (10 specialized T2I models, 5 unified models) uncovers: specialized T2I models demonstrate proficiency in aesthetic rendering yet lack intrinsic world knowledge. Unified multimodal models bridge this gap, consistently outperforming specialized counterparts in causal narrative coherence. However, even these unified architectures remain subordinate to closed-source models and struggle to overcome the core challenge of spatiotemporal consistency. This demonstrates that a focus on causally-isolated single images impedes multi-frame reasoning and generation, promoting static pattern matching over dynamic world modeling-ultimately limiting world knowledge internalization, generation.

[152] Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling cs.CVPDF

Meng Cao, Haokun Lin, Haoyuan Li, Haoran Tang, Rongtao Xu

TL;DR: 该论文提出了MILO，一种隐式空间世界建模范式，通过视觉生成器提供几何感知反馈，提升多模态大语言模型（MLLMs）的空间推理能力。同时提出了RePE编码方案和GeoGen数据集，实验表明该方法显著优于现有基准。

Details

Motivation: 当前MLLMs的空间推理能力不足，主要依赖文本符号描述，缺乏视觉化关联，导致空间概念理解不全面。

Result: 实验证明了MILO在空间推理任务上的显著性能提升，优于现有基准。

Insight: 通过视觉化反馈和相对位姿编码，MLLMs可以更全面地理解3D空间结构。

Abstract: Spatial reasoning, the ability to understand and interpret the 3D structure of the world, is a critical yet underdeveloped capability in Multimodal Large Language Models (MLLMs). Current methods predominantly rely on verbal descriptive tuning, which suffers from visual illiteracy, i.e., they learn spatial concepts through textual symbols alone, devoid of connection to their visual manifestations. To bridge this gap, this paper introduces MILO, an Implicit spatIaL wOrld modeling paradigm that simulates human-like spatial imagination. MILO integrates a visual generator to provide geometry-aware feedback, thereby implicitly grounding the MLLM’s symbolic reasoning in perceptual experience. Complementing this paradigm, we propose RePE (Relative Positional Encoding), a novel encoding scheme that captures relative camera-pose transformations, offering superior performance over absolute coordinate systems. To support the training, we construct GeoGen, a large-scale Geometry-aware Generative dataset with approximately 2,241 videos and 67,827 observation-action-outcome triplets. Experiments demonstrate that our approach significantly enhances spatial reasoning capabilities across multiple baselines and benchmarks, offering a more holistic understanding of 3D space.

[153] CauSight: Learning to Supersense for Visual Causal Discovery cs.CVPDF

Yize Zhang, Meiqi Chen, Sirui Chen, Bo Peng, Yanxi Zhang

TL;DR: 论文提出了视觉因果发现的任务，并构建了VCG-32K数据集和CauSight模型，通过因果感知推理超越了GPT-4.1的表现。

Details

Motivation: 人类的因果思维能力使其能够理解现象背后的原因，而非仅仅是观察到的内容。为了让现代AI系统复现这一能力，论文提出了视觉因果发现的任务，旨在从视觉实体中推断因果关系。

Result: 实验结果表明，CauSight在视觉因果发现任务上超越了GPT-4.1，性能提升超过三倍（绝对增益达21%）。

Insight: 通过结合大规模数据集和因果感知的推理方法，可以有效提升AI系统在复杂视觉任务中的因果推理能力。

Abstract: Causal thinking enables humans to understand not just what is seen, but why it happens. To replicate this capability in modern AI systems, we introduce the task of visual causal discovery. It requires models to infer cause-and-effect relations among visual entities across diverse scenarios instead of merely perceiving their presence. To this end, we first construct the Visual Causal Graph dataset (VCG-32K), a large-scale collection of over 32,000 images annotated with entity-level causal graphs, and further develop CauSight, a novel vision-language model to perform visual causal discovery through causally aware reasoning. Our training recipe integrates three components: (1) training data curation from VCG-32K, (2) Tree-of-Causal-Thought (ToCT) for synthesizing reasoning trajectories, and (3) reinforcement learning with a designed causal reward to refine the reasoning policy. Experiments show that CauSight outperforms GPT-4.1 on visual causal discovery, achieving over a threefold performance boost (21% absolute gain). Our code, model, and dataset are fully open-sourced at project page: https://github.com/OpenCausaLab/CauSight.

[154] OpenREAD: Reinforced Open-Ended Reasoing for End-to-End Autonomous Driving with LLM-as-Critic cs.CVPDF

Songyan Zhang, Wenhui Huang, Zhan Chen, Chua Jiahao Collister, Qihang Huang

TL;DR: 该论文提出了一种名为OpenREAD的端到端自动驾驶框架，通过强化开环推理（RFT）结合大型语言模型（LLM）作为评估器，显著提升了自动驾驶的知识驱动性能。

Details

Motivation: 现有自动驾驶框架中的监督微调（SFT）和强化微调（RFT）方法在泛化推理能力和量化开放性问题奖励方面存在局限性。OpenREAD旨在通过结合LLM作为评估器，解决这些限制。

Result: 实验表明，OpenREAD在推理和规划任务上达到了SOTA性能，尤其在上下游任务中均实现了显著改进。

Insight: LLM作为批评器的引入为开放性问题提供了有效的奖励量化方法，推动了知识驱动自动驾驶的发展。

Abstract: Recently, two-stage fine-tuning strategies, e.g., acquiring essential driving knowledge through supervised fine-tuning (SFT) and further enhancing decision-making and planning via reinforcement fine-tuning (RFT), have shown strong potential in advancing the knowledge-driven autonomous driving (AD) paradigm. However, the learning nature of SFT still limits the generalization of reasoning, thereby constraining the full potential of driving performance. Meanwhile, current RFT approaches are primarily applied to downstream tasks, since scene understanding is an open-ended problem where corresponding rewards are difficult to quantify. To address these limitations, we propose OpenREAD, an OPEN-ended REasoning reinforced vision-language model (VLM)-based autonomous driving (AD) framework that enables end-to-end RFT across the full spectrum from high-level reasoning to low-level trajectory planning. Specifically, we begin by constructing large-scale Chain-of-Thought (CoT) annotations on open-source driving-related knowledge datasets, and employ the powerful Qwen3 large language model (LLM) as the critic in RFT to quantify reasoning quality for open-ended questions during reward modeling. Extensive experiments confirm that joint end-to-end RFT yields substantial improvements in both upstream and downstream tasks, enabling OpenREAD to achieve state-of-the-art performance on reasoning and planning benchmarks.

[155] PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models cs.CVPDF

Zeqing Wang, Keze Wang, Lei Zhang

TL;DR: 论文提出了PhyDetEx，通过构建PID数据集和微调VLM模型，检测和解释T2V模型生成的视频是否物理合理，发现开源模型在物理合理性方面仍有挑战。

Details

Motivation: 现有T2V模型在视频生成方面取得进展，但其是否理解物理规律并生成合理视频仍是问题。现有VLMs难以检测视频中的物理不合理内容。

Result: 实验表明，T2V模型在物理合理性方面仍有不足，尤其是开源模型。

Insight: 理解物理规律是T2V模型的难点，需要更多研究提升其物理合理性。数据集和方法的公开有助于推动相关研究。

Abstract: Driven by the growing capacity and training scale, Text-to-Video (T2V) generation models have recently achieved substantial progress in video quality, length, and instruction-following capability. However, whether these models can understand physics and generate physically plausible videos remains a question. While Vision-Language Models (VLMs) have been widely used as general-purpose evaluators in various applications, they struggle to identify the physically impossible content from generated videos. To investigate this issue, we construct a \textbf{PID} (\textbf{P}hysical \textbf{I}mplausibility \textbf{D}etection) dataset, which consists of a \textit{test split} of 500 manually annotated videos and a \textit{train split} of 2,588 paired videos, where each implausible video is generated by carefully rewriting the caption of its corresponding real-world video to induce T2V models producing physically implausible content. With the constructed dataset, we introduce a lightweight fine-tuning approach, enabling VLMs to not only detect physically implausible events but also generate textual explanations on the violated physical principles. Taking the fine-tuned VLM as a physical plausibility detector and explainer, namely \textbf{PhyDetEx}, we benchmark a series of state-of-the-art T2V models to assess their adherence to physical laws. Our findings show that although recent T2V models have made notable progress toward generating physically plausible content, understanding and adhering to physical laws remains a challenging issue, especially for open-source models. Our dataset, training code, and checkpoints are available at \href{https://github.com/Zeqing-Wang/PhyDetEx}{https://github.com/Zeqing-Wang/PhyDetEx}.

[156] COACH: Collaborative Agents for Contextual Highlighting - A Multi-Agent Framework for Sports Video Analysis cs.CVPDF

Tsz-To Wong, Ching-Chun Huang, Hong-Han Shuai

TL;DR: COACH提出了一种可重构的多智能体系统（MAS）框架，用于体育视频分析，解决了现有端到端模型在时间层次理解、泛化性和可解释性方面的不足。

Details

Motivation: 现有端到端模型在体育视频分析中难以处理多层次时间上下文，泛化能力差且开发成本高。COACH旨在通过多智能体系统提供灵活、可扩展且可解释的解决方案。

Result: 在羽毛球分析任务中展示了框架的适应性，能够同时处理短期推理和长期生成任务。

Insight: 多智能体系统为体育视频分析提供了新的范式，通过模块化设计提升了系统的灵活性和可解释性。

Abstract: Intelligent sports video analysis demands a comprehensive understanding of temporal context, from micro-level actions to macro-level game strategies. Existing end-to-end models often struggle with this temporal hierarchy, offering solutions that lack generalization, incur high development costs for new tasks, and suffer from poor interpretability. To overcome these limitations, we propose a reconfigurable Multi-Agent System (MAS) as a foundational framework for sports video understanding. In our system, each agent functions as a distinct “cognitive tool” specializing in a specific aspect of analysis. The system’s architecture is not confined to a single temporal dimension or task. By leveraging iterative invocation and flexible composition of these agents, our framework can construct adaptive pipelines for both short-term analytic reasoning (e.g., Rally QA) and long-term generative summarization (e.g., match summaries). We demonstrate the adaptability of this framework using two representative tasks in badminton analysis, showcasing its ability to bridge fine-grained event detection and global semantic organization. This work presents a paradigm shift towards a flexible, scalable, and interpretable system for robust, cross-task sports video intelligence.The project homepage is available at https://aiden1020.github.io/COACH-project-page

[157] TransientTrack: Advanced Multi-Object Tracking and Classification of Cancer Cells with Transient Fluorescent Signals cs.CV | q-bio.CB | q-bio.QMPDF

Florian Bürger, Martim Dias Gomes, Nica Gutu, Adrián E. Granada, Noémie Moreau

TL;DR: TransientTrack是一个基于深度学习的轻量级框架，用于多通道显微镜视频数据中的细胞追踪，特别适用于瞬态荧光信号的情况。它能识别关键事件（如细胞分裂和死亡），结合Transformer网络和多阶段匹配技术，在多样化条件下表现优异。

Details

Motivation: 当前细胞追踪方法主要针对恒定信号，无法检测关键事件（如细胞死亡）。TransientTrack填补了这一空白，专注于瞬态信号和关键事件的追踪，提升了对癌症细胞动态的定量研究能力。

Result: 在多样化条件下表现优异，能够有效追踪细胞并捕捉关键事件（如分裂和死亡），成功应用于化疗药物的单细胞分析。

Insight: TransientTrack的创新在于对瞬态信号的处理和关键事件的识别，为癌症治疗反应的定量研究提供了新工具。

Abstract: Tracking cells in time-lapse videos is an essential technique for monitoring cell population dynamics at a single-cell level. Current methods for cell tracking are developed on videos with mostly single, constant signals and do not detect pivotal events such as cell death. Here, we present TransientTrack, a deep learning-based framework for cell tracking in multi-channel microscopy video data with transient fluorescent signals that fluctuate over time following processes such as the circadian rhythm of cells. By identifying key cellular events - mitosis (cell division) and apoptosis (cell death) our method allows us to build complete trajectories, including cell lineage information. TransientTrack is lightweight and performs matching on cell detection embeddings directly, without the need for quantification of tracking-specific cell features. Furthermore, our approach integrates Transformer Networks, multi-stage matching using all detection boxes, and the interpolation of missing tracklets with the Kalman Filter. This unified framework achieves strong performance across diverse conditions, effectively tracking cells and capturing cell division and death. We demonstrate the use of TransientTrack in an analysis of the efficacy of a chemotherapeutic drug at a single-cell level. The proposed framework could further advance quantitative studies of cancer cell dynamics, enabling detailed characterization of treatment response and resistance mechanisms. The code is available at https://github.com/bozeklab/TransientTrack.

[158] KM-ViPE: Online Tightly Coupled Vision-Language-Geometry Fusion for Open-Vocabulary Semantic SLAM cs.CVPDF

Zaid Nasser, Mikhail Iumanov, Tianhao Li, Maxim Popov, Jaafar Mahmoud

TL;DR: KM-ViPE是一个实时开放词汇SLAM框架，仅需单目摄像头输入，无需深度传感器或离线校准，适用于动态环境，结合了几何约束和视觉语言特征

Details

Motivation: 现有SLAM系统通常需要深度传感器或离线校准，且在动态环境中表现不佳；KM-ViPE旨在解决这些问题，提供更实用的实时空间智能解决方案

Result: 与现有方法相比，KM-ViPE在未校准单目输入和动态环境下表现优越，适用于自主机器人和AR/VR应用

Insight: 视觉语言特征的融合增强了SLAM的语义理解能力，而在线操作使其更适用于实际场景。

Abstract: We present KM-ViPE (Knowledge Mapping Video Pose Engine), a real-time open-vocabulary SLAM framework for uncalibrated monocular cameras in dynamic environments. Unlike systems requiring depth sensors and offline calibration, KM-ViPE operates directly on raw RGB streams, making it ideal for ego-centric applications and harvesting internet-scale video data for training. KM-ViPE tightly couples DINO visual features with geometric constraints through a high-level features based adaptive robust kernel that handles both moving objects and movable static objects (e.g., moving furniture in ego-centric views). The system performs simultaneous online localization and open-vocabulary semantic mapping by fusing geometric and deep visual features aligned with language embeddings. Our results are competitive with state-of-the-art approaches, while existing solutions either operate offline, need depth data and/or odometry estimation, or lack dynamic scene robustness. KM-ViPE benefits from internet-scale training and uniquely combines online operation, uncalibrated monocular input, and robust handling of dynamic scenes, which makes it a good fit for autonomous robotics and AR/VR applications and advances practical spatial intelligence capabilities for embodied AI.

[159] SARL: Spatially-Aware Self-Supervised Representation Learning for Visuo-Tactile Perception cs.CVPDF

Gurmeher Khurana, Lan Wei, Dandan Zhang

TL;DR: SARL是一种空间感知的自监督学习框架，通过引入三种地图级目标（SAL、PPDA、RAM）来保留特征图的空间结构，显著提升融合视觉-触觉数据的感知任务性能。

Details

Motivation: 现有自监督学习框架通常将特征图压缩为全局向量，忽略了空间结构，而融合视觉-触觉数据的任务需要保留局部几何信息。SARL旨在解决这一问题。

Result: 在边缘姿态回归任务中，SARL的MAE为0.3955，比次优方法（0.5682 MAE）提升了30%，接近监督学习的上限。

Insight: 研究结果表明，对于融合视觉-触觉数据，结构化空间等变性是最有效的信号，其特征能够根据物体几何结构变化，从而提升机器人感知能力。

Abstract: Contact-rich robotic manipulation requires representations that encode local geometry. Vision provides global context but lacks direct measurements of properties such as texture and hardness, whereas touch supplies these cues. Modern visuo-tactile sensors capture both modalities in a single fused image, yielding intrinsically aligned inputs that are well suited to manipulation tasks requiring visual and tactile information. Most self-supervised learning (SSL) frameworks, however, compress feature maps into a global vector, discarding spatial structure and misaligning with the needs of manipulation. To address this, we propose SARL, a spatially-aware SSL framework that augments the Bootstrap Your Own Latent (BYOL) architecture with three map-level objectives, including Saliency Alignment (SAL), Patch-Prototype Distribution Alignment (PPDA), and Region Affinity Matching (RAM), to keep attentional focus, part composition, and geometric relations consistent across views. These losses act on intermediate feature maps, complementing the global objective. SARL consistently outperforms nine SSL baselines across six downstream tasks with fused visual-tactile data. On the geometry-sensitive edge-pose regression task, SARL achieves a Mean Absolute Error (MAE) of 0.3955, a 30% relative improvement over the next-best SSL method (0.5682 MAE) and approaching the supervised upper bound. These findings indicate that, for fused visual-tactile data, the most effective signal is structured spatial equivariance, in which features vary predictably with object geometry, which enables more capable robotic perception.

[160] Med-VCD: Mitigating Hallucination for Medical Large Vision Language Models through Visual Contrastive Decoding cs.CVPDF

Zahra Mahdavi, Zahra Khodakaramimaghsoud, Hooman Khaloo, Sina Bakhshandeh Taleshani, Erfan Hashemi

TL;DR: 论文提出了一种名为Med-VCD的稀疏视觉对比解码方法，旨在减少医疗大型视觉语言模型（LVLM）中的幻觉输出，无需额外的解码时间开销。该方法通过动态选择视觉信息令牌来提高准确性，并在多项医疗任务中表现优异。

Details

Motivation: 尽管大型视觉语言模型（LVLM）在医疗领域的应用日益广泛，但其仍面临幻觉输出的问题（看似合理但实际错误）。现有方法多依赖二次解码或回滚过程，效率较低且可能引发模态间的不对齐。因此，作者希望开发一种高效且可靠的解决方案。

Result: 在八个医疗数据集上的实验表明，Med-VCD平均提高了13%的事实准确率和6%的幻觉准确率，优于基线模型。

Insight: 论文展示了通过动态令牌选择和视觉对比解码，可以在不牺牲效率的情况下显著减少医疗LVLM的幻觉问题。这一方法为其他领域的高效解码提供了借鉴。

Abstract: Large vision-language models (LVLMs) are now central to healthcare applications such as medical visual question answering and imaging report generation. Yet, these models remain vulnerable to hallucination outputs that appear plausible but are in fact incorrect. In the natural image domain, several decoding strategies have been proposed to mitigate hallucinations by reinforcing visual evidence, but most rely on secondary decoding or rollback procedures that substantially slow inference. Moreover, existing solutions are often domain-specific and may introduce misalignment between modalities or between generated and ground-truth content. We introduce Med-VCD, a sparse visual-contrastive decoding method that mitigates hallucinations in medical LVLMs without the time overhead of secondary decoding. Med-VCD incorporates a novel token-sparsification strategy that selects visually informed tokens on the fly, trimming redundancy while retaining critical visual context and thus balancing efficiency with reliability. Evaluations on eight medical datasets, spanning ophthalmology, radiology, and pathology tasks in visual question answering, report generation, and dedicated hallucination benchmarks, show that Med-VCD raises factual accuracy by an average of 13% and improves hallucination accuracy by 6% relative to baseline medical LVLMs.

[161] Physical ID-Transfer Attacks against Multi-Object Tracking via Adversarial Trajectory cs.CVPDF

Chenyi Wang, Yanmao Man, Raymond Muller, Ming Li, Z. Berkay Celik

TL;DR: 该论文提出了一种名为AdvTraj的新型物理攻击方法，通过对抗性轨迹干扰多目标跟踪（MOT）系统的ID分配，揭示现有跟踪系统的潜在弱点。

Details

Motivation: 多目标跟踪（MOT）在计算机视觉中至关重要，但其安全性研究较少。现有攻击手段多为针对特定模型或离线数据的数字攻击，缺乏对物理环境中在线MOT系统的威胁分析。

Result: 在白盒攻击下，AdvTraj对SORT的成功率达100%，对其他SOTA MOT算法的攻击成功率高达93%。

Insight: MOT系统在目标关联阶段存在漏洞，可通过对抗性轨迹被操纵。为进一步增强MOT系统的鲁棒性提供了方向。

Abstract: Multi-Object Tracking (MOT) is a critical task in computer vision, with applications ranging from surveillance systems to autonomous driving. However, threats to MOT algorithms have yet been widely studied. In particular, incorrect association between the tracked objects and their assigned IDs can lead to severe consequences, such as wrong trajectory predictions. Previous attacks against MOT either focused on hijacking the trackers of individual objects, or manipulating the tracker IDs in MOT by attacking the integrated object detection (OD) module in the digital domain, which are model-specific, non-robust, and only able to affect specific samples in offline datasets. In this paper, we present AdvTraj, the first online and physical ID-manipulation attack against tracking-by-detection MOT, in which an attacker uses adversarial trajectories to transfer its ID to a targeted object to confuse the tracking system, without attacking OD. Our simulation results in CARLA show that AdvTraj can fool ID assignments with 100% success rate in various scenarios for white-box attacks against SORT, which also have high attack transferability (up to 93% attack success rate) against state-of-the-art (SOTA) MOT algorithms due to their common design principles. We characterize the patterns of trajectories generated by AdvTraj and propose two universal adversarial maneuvers that can be performed by a human walker/driver in daily scenarios. Our work reveals under-explored weaknesses in the object association phase of SOTA MOT systems, and provides insights into enhancing the robustness of such systems.

[162] Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models cs.CVPDF

Zhongyu Yang, Dannong Xu, Wei Pang, Yingfang Yuan

TL;DR: Script是一种基于图结构和查询条件化的语义令牌修剪方法，用于提升多模态大型语言模型（MLLMs）的效率。它通过视觉冗余令牌修剪和查询相关语义保留模块，实现了无需重新训练即可显著提高模型速度和性能的效果。

Details

Motivation: 现有的令牌修剪方法在多模态大型语言模型中存在冗余令牌处理不足或忽视用户查询相关性等问题，导致效率低下和性能损失。

Result: 在14个图像和视频理解基准测试中，Script较现有方法表现更优，实现了速度和计算资源的显著节约。

Insight: 结合图结构和查询条件的修剪策略能有效平衡模型效率与性能，为多模态任务提供了轻量化的解决方案。

Abstract: The rapid growth of visual tokens in multimodal large language models (MLLMs) leads to excessive memory consumption and inference latency, especially when handling high-resolution images and videos. Token pruning is a technique used to mitigate this issue by removing redundancy, but existing methods often ignore relevance to the user query or suffer from the limitations of attention mechanisms, reducing their adaptability and effectiveness. To address these challenges, we propose Script, a plug-and-play pruning method that requires no retraining and generalizes across diverse MLLMs. Script comprises two modules: a graph-structured pruning module that removes visually redundant tokens, and a query-conditioned semantic pruning module that preserves query-relevant visual information. Together, they enhance performance on multimodal tasks. Experiments on fourteen benchmarks across image and video understanding tasks show that Script consistently achieves higher model efficiency and predictive accuracy compared to existing pruning methods. On LLaVA-NeXT-7B, it achieves up to 6.8x prefill speedup and 10x FLOP reduction, while retaining 96.88% of the original performance.

[163] GrndCtrl: Grounding World Models via Self-Supervised Reward Alignment cs.CV | cs.AI | cs.LG | cs.ROPDF

Haoyang He, Jay Patrikar, Dong-Ki Kim, Max Smith, Daniel McGann

TL;DR: 本文提出了GrndCtrl框架，通过自监督奖励对齐方法提升世界模型的几何接地能力，解决导航任务中的空间一致性和长期稳定性问题。

Details

Motivation: 当前视频世界模型虽然视觉保真度高，但缺乏几何接地性，限制了其在需要空间一致性和长期稳定性的导航任务中的应用。

Result: 实验表明GrndCtrl在户外环境中显著优于监督微调方法，模型展现出更高的空间一致性和导航稳定性。

Insight: 类似于大型语言模型的精调对齐，GrndCtrl展示了如何通过可验证奖励桥接生成预训练与接地行为，为世界模型的实际应用提供了新方向。

Abstract: Recent advances in video world modeling have enabled large-scale generative models to simulate embodied environments with high visual fidelity, providing strong priors for prediction, planning, and control. Yet, despite their realism, these models often lack geometric grounding, limiting their use in navigation tasks that require spatial coherence and long-horizon stability. We introduce Reinforcement Learning with World Grounding (RLWG), a self-supervised post-training framework that aligns pretrained world models with a physically verifiable structure through geometric and perceptual rewards. Analogous to reinforcement learning from verifiable feedback (RLVR) in language models, RLWG can use multiple rewards that measure pose cycle-consistency, depth reprojection, and temporal coherence. We instantiate this framework with GrndCtrl, a reward-aligned adaptation method based on Group Relative Policy Optimization (GRPO), yielding world models that maintain stable trajectories, consistent geometry, and reliable rollouts for embodied navigation. Like post-training alignment in large language models, GrndCtrl leverages verifiable rewards to bridge generative pretraining and grounded behavior, achieving superior spatial coherence and navigation stability over supervised fine-tuning in outdoor environments.

[164] SpriteHand: Real-Time Versatile Hand-Object Interaction with Autoregressive Video Generation cs.CV | cs.HCPDF

Zisu Li, Hengye Lyu, Jiaxin Shi, Yufeng Zeng, Mingming Fan

TL;DR: SpriteHand是一个基于自回归视频生成的实时手-物交互框架，能够合成多样化的手-物交互视频，适用于多种对象类型和运动模式。

Details

Motivation: 传统基于仿真的方法依赖于预定义的刚体模型和预设的手势，无法捕捉与非刚体或复杂结构的动态交互。SpriteHand旨在解决这一限制。

Result: 模型支持18 FPS、640x368分辨率的实时生成，在视觉质量、物理合理性和交互保真度上优于基线方法。

Insight: 通过结合生成方法和因果推理架构，SpriteHand展示了在复杂交互任务中超越传统仿真方法的潜力。

Abstract: Modeling and synthesizing complex hand-object interactions remains a significant challenge, even for state-of-the-art physics engines. Conventional simulation-based approaches rely on explicitly defined rigid object models and pre-scripted hand gestures, making them inadequate for capturing dynamic interactions with non-rigid or articulated entities such as deformable fabrics, elastic materials, hinge-based structures, furry surfaces, or even living creatures. In this paper, we present SpriteHand, an autoregressive video generation framework for real-time synthesis of versatile hand-object interaction videos across a wide range of object types and motion patterns. SpriteHand takes as input a static object image and a video stream in which the hands are imagined to interact with the virtual object embedded in a real-world scene, and generates corresponding hand-object interaction effects in real time. Our model employs a causal inference architecture for autoregressive generation and leverages a hybrid post-training approach to enhance visual realism and temporal coherence. Our 1.3B model supports real-time streaming generation at around 18 FPS and 640x368 resolution, with an approximate 150 ms latency on a single NVIDIA RTX 5090 GPU, and more than a minute of continuous output. Experiments demonstrate superior visual quality, physical plausibility, and interaction fidelity compared to both generative and engine-based baselines.

[165] Artemis: Structured Visual Reasoning for Perception Policy Learning cs.CVPDF

Wei Tang, Yanpeng Sun, Shan Zhang, Xiaofan Li, Piotr Koniusz

TL;DR: Artemis是一个感知策略学习框架，通过结构化基于提议的推理（而非自然语言）提升视觉感知任务性能，支持显式中间状态跟踪和监督，并在多任务中表现优异。

Details

Motivation: 现有强化学习框架常使用自然语言作为中间推理链条，但这可能降低视觉感知任务的性能。核心问题在于推理形式不匹配：视觉感知需要空间和物体中心的推理，而非非结构化的语义推理。

Result: Artemis在基础任务（如检测和grounding）和多任务（计数和几何感知）中表现优异，同时在通用MLLM基准测试中竞争性领先。

Insight: 空间对齐的推理形式是提升感知策略学习的关键，结构化提议优于语言推理，为通用感知策略提供了可扩展路径。

Abstract: Recent reinforcement-learning frameworks for visual perception policy have begun to incorporate intermediate reasoning chains expressed in natural language. Empirical observations indicate that such purely linguistic intermediate reasoning often reduces performance on perception tasks. We argue that the core issue lies not in reasoning per se but in the form of reasoning: while these chains perform semantic reasoning in an unstructured linguistic space, visual perception requires reasoning in a spatial and object-centric space. In response, we introduce Artemis, a perception-policy learning framework that performs structured proposal-based reasoning, where each intermediate step is represented as a (label, bounding-box) pair capturing a verifiable visual state. This design enables explicit tracking of intermediate states, direct supervision for proposal quality, and avoids ambiguity introduced by language-based reasoning. Artemis is built on Qwen2.5-VL-3B, achieves strong performance on grounding and detection task and exhibits substantial generalization to counting and geometric-perception tasks. The consistent improvements across these diverse settings confirm that aligning reasoning with spatial representations enhances perception-policy learning. Owing to its strengthened visual reasoning, Artemis also achieves competitive performance on general MLLM benchmarks, illustrating that spatially grounded reasoning provides a principled route toward scalable and general perception policies.

[166] PAI-Bench: A Comprehensive Benchmark For Physical AI cs.CVPDF

Fengzhe Zhou, Jiannan Huang, Jialuo Li, Deva Ramanan, Humphrey Shi

TL;DR: PAI-Bench是一个统一的、全面的基准测试，用于评估物理AI的感知和预测能力，覆盖视频生成、条件视频生成和视频理解任务。研究发现，当前视频生成模型在物理一致性上表现不佳，而多模态大语言模型在预测和因果解释上有限。

Details

Motivation: 当前多模态大语言模型和视频生成模型在理解和预测真实世界动态方面的能力尚未充分研究，因此需要一个全面评估物理AI能力的基准测试。

Result: 研究发现视频生成模型在视觉保真度上表现良好，但物理动态一致性较差；多模态大语言模型在预测和因果解释上存在局限。

Insight: 当前模型在物理AI的感知和预测需求上仍处于早期阶段，未来研究需填补这些关键空白。

Abstract: Physical AI aims to develop models that can perceive and predict real-world dynamics; yet, the extent to which current multi-modal large language models and video generative models support these abilities is insufficiently understood. We introduce Physical AI Bench (PAI-Bench), a unified and comprehensive benchmark that evaluates perception and prediction capabilities across video generation, conditional video generation, and video understanding, comprising 2,808 real-world cases with task-aligned metrics designed to capture physical plausibility and domain-specific reasoning. Our study provides a systematic assessment of recent models and shows that video generative models, despite strong visual fidelity, often struggle to maintain physically coherent dynamics, while multi-modal large language models exhibit limited performance in forecasting and causal interpretation. These observations suggest that current systems are still at an early stage in handling the perceptual and predictive demands of Physical AI. In summary, PAI-Bench establishes a realistic foundation for evaluating Physical AI and highlights key gaps that future systems must address.

[167] Learning Visual Affordance from Audio cs.CVPDF

Lidong Lu, Guo Chen, Zhu Wei, Yicheng Liu, Tong Lu

TL;DR: 该论文提出了音频-视觉可供性定位（AV-AG）任务，通过音频信号分割物体交互区域，弥补了传统方法依赖文本或视频的局限性。作者构建了首个AV-AG数据集，并提出AVAGFormer模型，融合音频和视觉信号进行高效掩码预测，实现了最先进的性能。

Details

Motivation: 传统可供性定位方法依赖文本指令或演示视频，但文本可能存在歧义，视频可能受遮挡等因素影响。而音频能提供实时、语义丰富且视觉独立的线索，为理解交互区域提供了更直观的途径。

Result: 实验表明，AVAGFormer在AV-AG任务上达到最优性能，超越相关任务的基线方法，证明了音频信号的有效性。

Insight: 音频信号可作为视觉任务的补充信息，尤其在动态交互场景中表现出优势；端到端设计能显著提升跨模态任务的性能。

Abstract: We introduce Audio-Visual Affordance Grounding (AV-AG), a new task that segments object interaction regions from action sounds. Unlike existing approaches that rely on textual instructions or demonstration videos, which often limited by ambiguity or occlusion, audio provides real-time, semantically rich, and visually independent cues for affordance grounding, enabling more intuitive understanding of interaction regions. To support this task, we construct the first AV-AG dataset, comprising a large collection of action sounds, object images, and pixel-level affordance annotations. The dataset also includes an unseen subset to evaluate zero-shot generalization. Furthermore, we propose AVAGFormer, a model equipped with a semantic-conditioned cross-modal mixer and a dual-head decoder that effectively fuses audio and visual signals for mask prediction. Experiments show that AVAGFormer achieves state-of-the-art performance on AV-AG, surpassing baselines from related tasks. Comprehensive analyses highlight the distinctions between AV-AG and AVS, the benefits of end-to-end modeling, and the contribution of each component. Code and dataset have been released on https://jscslld.github.io/AVAGFormer/.

[168] MV-TAP: Tracking Any Point in Multi-View Videos cs.CVPDF

Jahyeok Koo, Inès Hyeonsu Kim, Mungyeom Kim, Junghyun Park, Seohyun Park

TL;DR: MV-TAP提出了一种新颖的多视角视频点跟踪方法，通过利用跨视角信息和几何一致性，显著提升了动态场景中点轨迹的完整性和可靠性。

Details

Motivation: 多视角相机系统能够捕捉复杂动态场景的丰富信息，但现有方法在多视角点跟踪任务中表现不足，尤其是在跨视角信息利用和轨迹完整性方面。

Result: 实验表明MV-TAP在挑战性基准测试中表现优于现有方法，为多视角点跟踪任务提供了强基线。

Insight: 跨视角注意力机制和几何一致性是多视角点跟踪任务的关键，合成数据的训练可以提升模型在真实场景中的泛化能力。

Abstract: Multi-view camera systems enable rich observations of complex real-world scenes, and understanding dynamic objects in multi-view settings has become central to various applications. In this work, we present MV-TAP, a novel point tracker that tracks points across multi-view videos of dynamic scenes by leveraging cross-view information. MV-TAP utilizes camera geometry and a cross-view attention mechanism to aggregate spatio-temporal information across views, enabling more complete and reliable trajectory estimation in multi-view videos. To support this task, we construct a large-scale synthetic training dataset and real-world evaluation sets tailored for multi-view tracking. Extensive experiments demonstrate that MV-TAP outperforms existing point-tracking methods on challenging benchmarks, establishing an effective baseline for advancing research in multi-view point tracking.

[169] Improved Mean Flows: On the Challenges of Fastforward Generative Models cs.CV | cs.LGPDF

Zhengyang Geng, Yiyang Lu, Zongze Wu, Eli Shechtman, J. Zico Kolter

TL;DR: 论文iMF改进了MeanFlow框架，解决了训练目标和引导机制中的挑战，提出了一种更稳定的训练目标和灵活的引导方法，显著提升了生成模型的性能。

Details

Motivation: MeanFlow作为一种一步生成模型框架，其”fastforward”特性在训练目标和引导机制上引入了挑战，需改进以实现更高的效率和灵活性。

Result: iMF在ImageNet 256×256数据集上实现了1.72 FID（单次函数评估），显著优于同类方法，接近多步方法的性能。

Insight: 改进的训练目标和灵活的引导机制能显著提升一步生成模型的性能，同时保持高效和灵活性。

Abstract: MeanFlow (MF) has recently been established as a framework for one-step generative modeling. However, its ``fastforward’’ nature introduces key challenges in both the training objective and the guidance mechanism. First, the original MF’s training target depends not only on the underlying ground-truth fields but also on the network itself. To address this issue, we recast the objective as a loss on the instantaneous velocity $v$, re-parameterized by a network that predicts the average velocity $u$. Our reformulation yields a more standard regression problem and improves the training stability. Second, the original MF fixes the classifier-free guidance scale during training, which sacrifices flexibility. We tackle this issue by formulating guidance as explicit conditioning variables, thereby retaining flexibility at test time. The diverse conditions are processed through in-context conditioning, which reduces model size and benefits performance. Overall, our $\textbf{improved MeanFlow}$ ($\textbf{iMF}$) method, trained entirely from scratch, achieves $\textbf{1.72}$ FID with a single function evaluation (1-NFE) on ImageNet 256$\times$256. iMF substantially outperforms prior methods of this kind and closes the gap with multi-step methods while using no distillation. We hope our work will further advance fastforward generative modeling as a stand-alone paradigm.

[170] TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models cs.CVPDF

Zhiheng Liu, Weiming Ren, Haozhe Liu, Zijian Zhou, Shoufa Chen

TL;DR: TUNA提出了一种原生统一多模态模型（UMM），通过级联VAE编码器和表示编码器构建统一的连续视觉表示空间，支持端到端的图像与视频理解和生成任务。

Details

Motivation: 此前统一多模态模型采用解耦的表示形式，导致表示格式不匹配问题。TUNA旨在通过统一视觉表示空间解决这一问题，并提升多模态任务的性能。

Result: 在图像/视频理解、生成和编辑任务上实现了最优性能，验证了统一表示设计的有效性和扩展性。

Insight: 1. 统一表示空间避免了表示格式不匹配；2. 更强的预训练表示编码器提升多模态任务性能；3. 联合训练理解和生成任务能互相受益。

Abstract: Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder. This unified representation space allows end-to-end processing of images and videos for both understanding and generation tasks. Compared to prior UMMs with decoupled representations, TUNA’s unified visual space avoids representation format mismatches introduced by separate encoders, outperforming decoupled alternatives in both understanding and generation. Moreover, we observe that stronger pretrained representation encoders consistently yield better performance across all multimodal tasks, highlighting the importance of the representation encoder. Finally, in this unified setting, jointly training on both understanding and generation data allows the two tasks to benefit from each other rather than interfere. Our extensive experiments on multimodal understanding and generation benchmarks show that TUNA achieves state-of-the-art results in image and video understanding, image and video generation, and image editing, demonstrating the effectiveness and scalability of its unified representation design.

[171] Generative Video Motion Editing with 3D Point Tracks cs.CVPDF

Yao-Chih Lee, Zhoutong Zhang, Jiahui Huang, Jui-Hsien Wang, Joon-Young Lee

TL;DR: 本文提出了一种基于3D点轨迹的视频运动编辑框架，能够同时编辑相机和物体运动，解决了现有方法在复杂运动场景中缺乏一致性和精细控制的问题。

Details

Motivation: 相机和物体运动是视频叙事的核心，但现有技术（如I2V和V2V）在复杂运动中缺乏一致性和精细控制能力。需要一种能够联合编辑相机和物体运动的方法。

Result: 模型实现了多样化的运动编辑功能，包括联合相机/物体操作、运动转移和非刚性变形，展现了视频编辑的新创意潜力。

Insight: 3D轨迹比2D轨迹更适合复杂运动编辑，因其能显式提供深度信息，解决遮挡和深度问题，从而提升编辑的精确性和一致性。

Abstract: Camera and object motions are central to a video’s narrative. However, precisely editing these captured motions remains a significant challenge, especially under complex object movements. Current motion-controlled image-to-video (I2V) approaches often lack full-scene context for consistent video editing, while video-to-video (V2V) methods provide viewpoint changes or basic object translation, but offer limited control over fine-grained object motion. We present a track-conditioned V2V framework that enables joint editing of camera and object motion. We achieve this by conditioning a video generation model on a source video and paired 3D point tracks representing source and target motions. These 3D tracks establish sparse correspondences that transfer rich context from the source video to new motions while preserving spatiotemporal coherence. Crucially, compared to 2D tracks, 3D tracks provide explicit depth cues, allowing the model to resolve depth order and handle occlusions for precise motion editing. Trained in two stages on synthetic and real data, our model supports diverse motion edits, including joint camera/object manipulation, motion transfer, and non-rigid deformation, unlocking new creative potential in video editing.

[172] Objects in Generated Videos Are Slower Than They Appear: Models Suffer Sub-Earth Gravity and Don’t Know Galileo’s Principle…for now cs.CVPDF

Varun Varma Thozhiyoor, Shivam Tripathi, Venkatesh Babu Radhakrishnan, Anand Bhattad

TL;DR: 论文研究表明，现有视频生成模型未能准确模拟重力效应，生成的物体下落速度比实际慢。通过提出的无单位测试方法，论文揭示了模型违反伽利略等效原理的问题，并证明小规模微调能显著改善物理模拟效果。

Details

Motivation: 视频生成模型作为潜在的世界模拟器，需要理解和编码物理规律。论文探讨了其对重力这一基本物理定律的表现，旨在揭示现有模型的局限性并提出改进方法。

Result: 1. 模型生成的重力加速度平均为$1.81,\mathrm{m/s^2}$，远低于地球重力。2. 微调后将有效重力提升至$6.43,\mathrm{m/s^2}$（达到地球重力的65%），并在多场景中验证了泛化能力。

Insight: 1. 视频生成模型在物理规律模拟上存在明显缺陷，需针对性改进。2. 小规模数据微调即可显著提升特定物理规律的模拟能力，展现了模型的可修正性。

Abstract: Video generators are increasingly evaluated as potential world models, which requires them to encode and understand physical laws. We investigate their representation of a fundamental law: gravity. Out-of-the-box video generators consistently generate objects falling at an effectively slower acceleration. However, these physical tests are often confounded by ambiguous metric scale. We first investigate if observed physical errors are artifacts of these ambiguities (e.g., incorrect frame rate assumptions). We find that even temporal rescaling cannot correct the high-variance gravity artifacts. To rigorously isolate the underlying physical representation from these confounds, we introduce a unit-free, two-object protocol that tests the timing ratio $t_1^2/t_2^2 = h_1/h_2$, a relationship independent of $g$, focal length, and scale. This relative test reveals violations of Galileo’s equivalence principle. We then demonstrate that this physical gap can be partially mitigated with targeted specialization. A lightweight low-rank adaptor fine-tuned on only 100 single-ball clips raises $g_{\mathrm{eff}}$ from $1.81,\mathrm{m/s^2}$ to $6.43,\mathrm{m/s^2}$ (reaching $65%$ of terrestrial gravity). This specialist adaptor also generalizes zero-shot to two-ball drops and inclined planes, offering initial evidence that specific physical laws can be corrected with minimal data.

[173] Visual Sync: Multi-Camera Synchronization via Cross-View Object Motion cs.CV | cs.AI | cs.LG | cs.ROPDF

Shaowei Liu, David Yifan Yao, Saurabh Gupta, Shenlong Wang

TL;DR: VisualSync是一种基于多视角动力学的优化框架，用于对齐未标定和非同步的多摄像机视频，达到毫秒级精度。

Details

Motivation: 现有方法需要受控环境、特定目标、手动校正或昂贵硬件，而VisualSync旨在解决这一挑战。

Result: 在四个多样化数据集上实验表明，VisualSync优于基线方法，中位同步误差低于50毫秒。

Insight: 关键在于利用共同可见的3D点在同步后满足极线约束。

Abstract: Today, people can easily record memorable moments, ranging from concerts, sports events, lectures, family gatherings, and birthday parties with multiple consumer cameras. However, synchronizing these cross-camera streams remains challenging. Existing methods assume controlled settings, specific targets, manual correction, or costly hardware. We present VisualSync, an optimization framework based on multi-view dynamics that aligns unposed, unsynchronized videos at millisecond accuracy. Our key insight is that any moving 3D point, when co-visible in two cameras, obeys epipolar constraints once properly synchronized. To exploit this, VisualSync leverages off-the-shelf 3D reconstruction, feature matching, and dense tracking to extract tracklets, relative poses, and cross-view correspondences. It then jointly minimizes the epipolar error to estimate each camera’s time offset. Experiments on four diverse, challenging datasets show that VisualSync outperforms baseline methods, achieving an median synchronization error below 50 ms.

[174] Data-Centric Visual Development for Self-Driving Labs cs.CV | cs.ROPDF

Anbang Liu, Guanzhong Hu, Jiayi Wang, Ping Guo, Han Liu

TL;DR: 论文提出了一种结合真实数据和虚拟数据生成的混合管道，用于训练自驱动实验室中的高精度模型，解决了数据稀缺问题，并在气泡检测任务中实现了高准确性。

Details

Motivation: 自驱动实验室需要高精度模型，但训练这类模型需要大量标注数据，尤其是负样本稀缺。本文旨在通过结合真实和虚拟数据生成，解决数据不足的问题。

Result: 在真实测试集上，仅使用真实数据的模型达到99.6%准确率，混合数据训练保持99.4%准确率，同时减少了数据收集和审核负担。

Insight: 结合真实和虚拟数据生成是一种可扩展且经济高效的数据增强策略，适用于自驱动实验室及其他视觉任务中的稀有事件检测。

Abstract: Self-driving laboratories offer a promising path toward reducing the labor-intensive, time-consuming, and often irreproducible workflows in the biological sciences. Yet their stringent precision requirements demand highly robust models whose training relies on large amounts of annotated data. However, this kind of data is difficult to obtain in routine practice, especially negative samples. In this work, we focus on pipetting, the most critical and precision sensitive action in SDLs. To overcome the scarcity of training data, we build a hybrid pipeline that fuses real and virtual data generation. The real track adopts a human-in-the-loop scheme that couples automated acquisition with selective human verification to maximize accuracy with minimal effort. The virtual track augments the real data using reference-conditioned, prompt-guided image generation, which is further screened and validated for reliability. Together, these two tracks yield a class-balanced dataset that enables robust bubble detection training. On a held-out real test set, a model trained entirely on automatically acquired real images reaches 99.6% accuracy, and mixing real and generated data during training sustains 99.4% accuracy while reducing collection and review load. Our approach offers a scalable and cost-effective strategy for supplying visual feedback data to SDL workflows and provides a practical solution to data scarcity in rare event detection and broader vision tasks.

cs.CL [Back]

[175] Towards Corpus-Grounded Agentic LLMs for Multilingual Grammatical Analysis cs.CLPDF

Matej Klemen, Tjaša Arčon, Luka Terčon, Marko Robnik-Šikonja, Kaja Dobrovoljc

TL;DR: 这篇论文探讨了如何利用具备自主性的大型语言模型（LLMs）简化基于语料库的语法分析，通过推理和代码生成等技术在多语言任务中实现数据驱动的回答。

Details

Motivation: 传统基于语料库的语法研究需要大量方法和技术努力，而LLMs的自主性能力可以显著简化这一过程。

Result: 评估了系统在主导顺序准确性、覆盖完整性和分布保真度三个维度上的表现，证明了LLMs与结构化语言数据结合的可行性。

Insight: LLMs的自主推理能力为基于语料库的语法研究提供了可解释、可扩展的自动化工具。

Abstract: Empirical grammar research has become increasingly data-driven, but the systematic analysis of annotated corpora still requires substantial methodological and technical effort. We explore how agentic large language models (LLMs) can streamline this process by reasoning over annotated corpora and producing interpretable, data-grounded answers to linguistic questions. We introduce an agentic framework for corpus-grounded grammatical analysis that integrates concepts such as natural-language task interpretation, code generation, and data-driven reasoning. As a proof of concept, we apply it to Universal Dependencies (UD) corpora, testing it on multilingual grammatical tasks inspired by the World Atlas of Language Structures (WALS). The evaluation spans 13 word-order features and over 170 languages, assessing system performance across three complementary dimensions - dominant-order accuracy, order-coverage completeness, and distributional fidelity - which reflect how well the system generalizes, identifies, and quantifies word-order variations. The results demonstrate the feasibility of combining LLM reasoning with structured linguistic data, offering a first step toward interpretable, scalable automation of corpus-based grammatical inquiry.

[176] Minimal-Edit Instruction Tuning for Low-Resource Indic GEC cs.CLPDF

Akhil Rajeev P

TL;DR: 这篇论文提出了一种无需数据扩增的低资源印度语言语法纠错方法，通过指令调优大语言模型和确定性解码实现高效、可复现的纠错效果。

Details

Motivation: 印度语言的语法纠错面临监督数据稀缺、脚本多样和形态丰富等挑战，传统方法依赖数据扩增，计算成本高且效果有限。

Result: 在GLEU评估中，Malayalam得分92.41（排名第六），Hindi得分81.44（排名第三），表明该方法高效且可复现。

Insight: 未来可探索更强的形态句法约束和人类中心化的保守编辑评估，以进一步提升性能。

Abstract: Grammatical error correction for Indic languages faces limited supervision, diverse scripts, and rich morphology. We propose an augmentation-free setup that uses instruction-tuned large language models and conservative decoding. A 12B GEMMA 3 model is instruction-tuned in bnb 4-bit precision with parameter-efficient fine-tuning (PEFT) and Alpaca-style formatting. Decoding follows a deterministic, constraint-aware procedure with a lightweight normaliser that encourages minimal, meaning-preserving edits. We operationalise inference, subsequent to instruction fine-tuning (IFT), via a fixed, language-specific prompt directly synthesised from a deterministic error classifier’s taxonomy, label distributions, and precedence ordering computed on the training corpus. Under the official untuned GLEU evaluation, the system scores 92.41 on Malayalam, sixth overall, and 81.44 on Hindi, third overall. These results indicate that classifier-informed prompt design, adapter-based instruction tuning, and deterministic decoding provide a reproducible and a computationally efficient alternative to augmentation-centred pipelines for Indic GEC. The approach also motivates future work on stronger morphosyntactic constraints and human-centred evaluation of conservative edits.

[177] OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion cs.CL | cs.AIPDF

Sai Koneru, Matthias Huck, Jan Niehues

TL;DR: OmniFusion提出了一种新颖的多模态融合方法，将预训练的多模态基础模型（MMFM）和翻译专用大型语言模型（LLM）结合，实现了端到端的多语言多模态翻译，降低了延迟并提升了翻译质量。

Details

Motivation: 当前的开源文本翻译LLM仅支持级联流水线的语音翻译（ST），额外延迟问题严重且无法利用多模态上下文（如图像）辅助消歧。预训练的MMFM虽然具备多模态感知能力，但缺乏多语言覆盖和专业化翻译性能。

Result: 实验表明，OmniFusion能有效利用音频和视觉输入，在SimulST任务中比级联流水线降低1秒延迟，同时提升了整体翻译质量。

Insight: 多模态信息的融合可以显著提升翻译任务的效果，尤其是在需要低延迟的场景（如同传翻译）中，端到端设计优于传统的级联方法。

Abstract: There has been significant progress in open-source text-only translation large language models (LLMs) with better language coverage and quality. However, these models can be only used in cascaded pipelines for speech translation (ST), performing automatic speech recognition first followed by translation. This introduces additional latency, which is particularly critical in simultaneous ST (SimulST), and prevents the model from exploiting multimodal context, such as images, which can aid disambiguation. Pretrained multimodal foundation models (MMFMs) already possess strong perception and reasoning capabilities across multiple modalities, but generally lack the multilingual coverage and specialized translation performance of dedicated translation LLMs. To build an effective multimodal translation system, we propose an end-to-end approach that fuses MMFMs with translation LLMs. We introduce a novel fusion strategy that connects hidden states from multiple layers of a pretrained MMFM to a translation LLM, enabling joint end-to-end training. The resulting model, OmniFusion, built on Omni 2.5-7B as the MMFM and SeedX PPO-7B as the translation LLM, can perform speech-to-text, speech-and-image-to-text, and text-and-image-to-text translation. Experiments demonstrate that OmniFusion effectively leverages both audio and visual inputs, achieves a 1-second latency reduction in SimulST compared to cascaded pipelines and also improves the overall translation quality\footnote{Code is available at https://github.com/saikoneru/OmniFusion}.

[178] EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education cs.CL | cs.AIPDF

Guoqing Ma, Jia Zhu, Hanghui Guo, Weijie Shi, Yue Cui

TL;DR: EduEval是一个用于评估大型语言模型在中国K-12教育中的表现的层次化基准测试，包含认知框架、真实性和规模三大贡献。

Details

Motivation: 大型语言模型在教育领域有巨大潜力，但未经严格评估的部署可能威胁教育标准。

Result: 模型在事实性任务表现良好，但在课堂对话分类和创造性任务上表现不佳；开源模型在教育推理任务中超越闭源模型。

Insight: 不同教育目标需要定制化的提示方法，开源模型在教育推理任务中表现出潜力。

Abstract: Large language models (LLMs) demonstrate significant potential for educational applications. However, their unscrutinized deployment poses risks to educational standards, underscoring the need for rigorous evaluation. We introduce EduEval, a comprehensive hierarchical benchmark for evaluating LLMs in Chinese K-12 education. This benchmark makes three key contributions: (1) Cognitive Framework: We propose the EduAbility Taxonomy, which unifies Bloom’s Taxonomy and Webb’s Depth of Knowledge to organize tasks across six cognitive dimensions including Memorization, Understanding, Application, Reasoning, Creativity, and Ethics. (2) Authenticity: Our benchmark integrates real exam questions, classroom conversation, student essays, and expert-designed prompts to reflect genuine educational challenges; (3) Scale: EduEval comprises 24 distinct task types with over 11,000 questions spanning primary to high school levels. We evaluate 14 leading LLMs under both zero-shot and few-shot settings, revealing that while models perform well on factual tasks, they struggle with classroom dialogue classification and exhibit inconsistent results in creative content generation. Interestingly, several open source models outperform proprietary systems on complex educational reasoning. Few-shot prompting shows varying effectiveness across cognitive dimensions, suggesting that different educational objectives require tailored approaches. These findings provide targeted benchmarking metrics for developing LLMs specifically optimized for diverse Chinese educational tasks.

[179] Evidence-Guided Schema Normalization for Temporal Tabular Reasoning cs.CL | cs.AI | cs.IRPDF

Ashish Thanga, Vibhu Dixit, Abhilash Shankarampeta, Vivek Gupta

TL;DR: 论文提出了一种基于SQL的方法，解决时态表格推理中因模式设计质量导致的QA精度问题，并通过规范化、语义命名和时间锚定原则显著提升了性能。

Details

Motivation: 当前QA系统在处理半结构化表格的时态推理时面临挑战，特别是模式设计对精度的影响大于模型容量的假设。

Result: 最佳配置（Gemini 2.5 Flash模式 + Gemini-2.0-Flash查询）实现了80.39 EM，比基线（68.89 EM）提升了16.8%。

Insight: 模式设计的质量对QA精度的影响远超模型容量，规范化、语义一致性和时间锚定是关键优化方向。

Abstract: Temporal reasoning over evolving semi-structured tables poses a challenge to current QA systems. We propose a SQL-based approach that involves (1) generating a 3NF schema from Wikipedia infoboxes, (2) generating SQL queries, and (3) query execution. Our central finding challenges model scaling assumptions: the quality of schema design has a greater impact on QA precision than model capacity. We establish three evidence-based principles: normalization that preserves context, semantic naming that reduces ambiguity, and consistent temporal anchoring. Our best configuration (Gemini 2.5 Flash schema + Gemini-2.0-Flash queries) achieves 80.39 EM, a 16.8% improvement over the baseline (68.89 EM).

Vsevolod Kovalev, Parteek Kumar

TL;DR: 该论文提出了CourseTimeQA基准和一个基于单GPU延迟预算的跨模态融合方法CrossFusion-RAG，用于解决教育视频中的时间戳问答问题。

Details

Motivation: 研究的目标是在单GPU延迟和内存限制下，实现高效的时间戳问答，特别是在教育视频场景中。

Result: 在CourseTimeQA上，CrossFusion-RAG将nDCG@10提高了0.10，MRR提高了0.08，且中位端到端延迟为1.55秒。

Insight: 论文表明，在有限的硬件资源下，通过结合跨模态信息和时序一致性优化，可以显著提升时间戳问答的性能。

Abstract: We study timestamped question answering over educational lecture videos under a single-GPU latency/memory budget. Given a natural-language query, the system retrieves relevant timestamped segments and synthesizes a grounded answer. We present CourseTimeQA (52.3 h, 902 queries across six courses) and a lightweight, latency-constrained cross-modal retriever (CrossFusion-RAG) that combines frozen encoders, a learned 512->768 vision projection, shallow query-agnostic cross-attention over ASR and frames with a temporal-consistency regularizer, and a small cross-attentive reranker. On CourseTimeQA, CrossFusion-RAG improves nDCG@10 by 0.10 and MRR by 0.08 over a strong BLIP-2 retriever while achieving approximately 1.55 s median end-to-end latency on a single A100. Closest comparators (zero-shot CLIP multi-frame pooling; CLIP + cross-encoder reranker + MMR; learned late-fusion gating; text-only hybrid with cross-encoder reranking and its MMR variant; caption-augmented text retrieval; non-learned temporal smoothing) are evaluated under matched hardware and indexing. We report robustness across ASR noise (WER quartiles), diagnostics for temporal localization, and full training/tuning details to support reproducible comparison.

[181] Mitigating the Threshold Priming Effect in Large Language Model-Based Relevance Judgments via Personality Infusing cs.CL | cs.IRPDF

Nuo Chen, Hanpei Fang, Jiqun Liu, Wilson Wei, Tetsuya Sakai

TL;DR: 该论文研究了大型语言模型（LLM）在相关性标注任务中的阈值启动效应，并通过模拟人格特质（如大五人格）来降低这种效应。研究发现，某些人格配置（如高开放性和低神经质）能有效减少启动效应，且最佳人格配置可能因模型和任务类型而异。

Details

Motivation: 现有的研究表明，LLM在相关性标注任务中容易受到启动效应的影响，即先前的标注可能对后续标注产生偏差。心理学的理论表明人格特质与此类偏差有关，但LLM中模拟的人格是否会产生类似效果尚不清楚。

Result: 高开放性和低神经质等人格配置在减少启动效应方面表现一致。最佳人格配置因模型和任务类型不同而有所变化。

Insight: LLM的行为可以通过模拟人格特质进行调整，这为改善其在实际任务中的可靠性提供了新思路。同时，心理学理论与LLM实践的结合可能为未来的研究开辟新方向。

Abstract: Recent research has explored LLMs as scalable tools for relevance labeling, but studies indicate they are susceptible to priming effects, where prior relevance judgments influence later ones. Although psychological theories link personality traits to such biases, it is unclear whether simulated personalities in LLMs exhibit similar effects. We investigate how Big Five personality profiles in LLMs influence priming in relevance labeling, using multiple LLMs on TREC 2021 and 2022 Deep Learning Track datasets. Our results show that certain profiles, such as High Openness and Low Neuroticism, consistently reduce priming susceptibility. Additionally, the most effective personality in mitigating priming may vary across models and task types. Based on these findings, we propose personality prompting as a method to mitigate threshold priming, connecting psychological evidence with LLM-based evaluation practices.

[182] SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling cs.CL | cs.AIPDF

Yang Xiao, Chunpu Xu, Ruifeng Yuan, Jiashuo Wang, Wenjie Li

TL;DR: SCALE提出了一种选择性资源分配框架，通过区分和处理数学推理任务中的简单与复杂子问题，显著提升了大型语言模型的计算效率与准确性。

Details

Motivation: 现有的测试时计算扩展方法在推理阶段对所有子问题均匀分配资源，导致复杂子问题资源不足而简单子问题资源浪费，限制了性能提升。

Result: 实验表明，SCALE在AIME25数据集上将准确率从57.50%提升到71.25%，同时计算成本减少了33%-53%。

Insight: 选择性资源分配是解决测试时计算扩展中均匀分配资源局限的关键，借鉴了双系统理论的思路，为未来高效推理提供了新方向。

Abstract: Test-time compute scaling has emerged as a powerful paradigm for enhancing mathematical reasoning in large language models (LLMs) by allocating additional computational resources during inference. However, current methods employ uniform resource distribution across all reasoning sub-problems, creating fundamental bottlenecks where challenging sub-problems receive insufficient attention while routine operations consume disproportionate resources. This uniform allocation creates performance bottlenecks where additional computational resources yield diminishing returns. Inspired by dual-process theory, we propose \textbf{SCALE} (Selective Resource Allocation), a framework that selectively allocates computational resources based on sub-problem difficulty. SCALE operates through four stages: (1) problem decomposition into sequential reasoning sub-problems, (2) difficulty assessment of each sub-problem to distinguish between routine operations and computationally challenging sub-problems, (3) selective processing mode assignment between System 1 for simple sub-problems and System 2 for complex ones, and (4) sequential execution with context propagation. By concentrating resources on challenging sub-problems while processing routine operations efficiently, SCALE achieves substantial performance improvements with superior resource utilization. Extensive experiments demonstrate that SCALE significantly outperforms uniform scaling baselines, achieving accuracy improvements of up to 13.75 percentage points (57.50% to 71.25% on AIME25) while reducing computational costs by 33%-53%, representing a major advance in test-time scaling that addresses fundamental limitations of current approaches.

Diego A. B. Moreira, Alef I. Ferreira, Jhessica Silva, Gabriel O. dos Santos, Gustavo Bonil

TL;DR: CACARA提出了一种跨模态对齐方法，通过以文本为中心的策略实现高效的多模态和多语言学习，避免了昂贵的多模态和多语言训练过程。

Details

Motivation: 现有多模态模型通常需要昂贵的多模态和多语言训练，限制了其扩展性和效率。CACARA试图通过一种更轻量的对齐学习方式，实现跨模态和多语言能力的无缝集成。

Result: 在音频到文本检索任务中，R@1指标提升了14.24个百分点，且训练成本与单语言模型相当。

Insight: 多模态和多语言能力可以通过轻量级对齐学习实现，无需昂贵的全模态或多语言训练，为扩展模型提供了高效路径。

Abstract: As deep learning models evolve, new applications and challenges are rapidly emerging. Tasks that once relied on a single modality, such as text, images, or audio, are now enriched by seamless interactions between multimodal data. These connections bridge information gaps: an image can visually materialize a text, while audio can add context to an image. Researchers have developed numerous multimodal models, but most rely on resource-intensive training across multiple modalities. Similarly, extending these models to new languages often follows the same resource-heavy training strategy. In this work, we propose a multimodal and multilingual architecture, CACARA, trained through emergent alignment learning, enabling the seamless integration of new modalities into an existing bimodal/multimodal model without requiring full retraining. This work breaks new ground by demonstrating that this emergent alignment paradigm can unlock multilingual capabilities from monolingual training. By fine-tuning the newly incorporated modality only on data aligned with the English language, our model develops support for over 100 languages without explicit multilingual pretraining or tuning of the text encoder. Such emergent multimodal and multilingual properties are gained efficiently, preserving previously learned knowledge at a training cost comparable to that of a monolingual model. Our strategy achieves up to a 14.24 percentage points improvement in R@1 audio-to-text retrieval, outperforming state-of-the-art multimodal models – all without the heavy computational cost of retraining across every modality and language.

[184] G-KV: Decoding-Time KV Cache Eviction with Global Attention cs.CL | cs.AIPDF

Mengqi Liao, Lu Wang, Chaoyun Zhang, Zekai Shen, Xiaowei Mao

TL;DR: G-KV提出了一种KV缓存淘汰方法，通过全局评分机制结合局部和历史注意力评分，更准确地评估令牌重要性，并引入了强化学习和蒸馏等技术以优化压缩KV缓存设置下的模型性能。

Details

Motivation: 由于长序列带来的计算和内存挑战，现有的KV缓存压缩方法通常关注提示压缩或基于局部注意力评分的令牌淘汰，忽视了令牌的长期重要性。

Result: G-KV在效率和性能上显著优于现有方法，代码已开源。

Insight: 全局评分机制能更准确地识别重要令牌，后训练技术进一步提升了模型在压缩KV缓存下的表现。

Abstract: Recent reasoning large language models (LLMs) excel in complex tasks but encounter significant computational and memory challenges due to long sequence lengths. KV cache compression has emerged as an effective approach to greatly enhance the efficiency of reasoning. However, existing methods often focus on prompt compression or token eviction with local attention score, overlooking the long-term importance of tokens. We propose G-KV, a KV cache eviction method that employs a global scoring mechanism, combining local and historical attention scores to more accurately assess token importance. Additionally, we introduce post-training techniques, including reinforcement learning and distillation, to optimize models for compressed KV cache settings. The code of this paper is available on: https://github.com/microsoft/G-KV.

[185] Developing a Comprehensive Framework for Sentiment Analysis in Turkish cs.CLPDF

Cem Rifki Aydin

TL;DR: 该论文提出了一个全面的土耳其语情感分析框架，并针对英语提出了一些特定方法，结合多种特征和机器学习方法，取得了优于神经网络的结果。

Details

Motivation: 为土耳其语设计一个全面的情感分析框架，填补该语言相关研究的空白，并对英语情感分析提出新方法。

Result: 在所有提出的方法中均取得了state-of-the-art的显著成果。

Insight: 土耳其语的细粒度形态分析可推广到其他形态丰富的语言；新型词嵌入和上下文窗口定义对其他NLP任务也有意义。

Abstract: In this thesis, we developed a comprehensive framework for sentiment analysis that takes its many aspects into account mainly for Turkish. We have also proposed several approaches specific to sentiment analysis in English only. We have accordingly made five major and three minor contributions. We generated a novel and effective feature set by combining unsupervised, semi-supervised, and supervised metrics. We then fed them as input into classical machine learning methods, and outperformed neural network models for datasets of different genres in both Turkish and English. We created a polarity lexicon with a semi-supervised domain-specific method, which has been the first approach applied for corpora in Turkish. We performed a fine morphological analysis for the sentiment classification task in Turkish by determining the polarities of morphemes. This can be adapted to other morphologically-rich or agglutinative languages as well. We have built a novel neural network architecture, which combines recurrent and recursive neural network models for English. We built novel word embeddings that exploit sentiment, syntactic, semantic, and lexical characteristics for both Turkish and English. We also redefined context windows as subclauses in modelling word representations in English. This can also be applied to other linguistic fields and natural language processing tasks. We have achieved state-of-the-art and significant results for all these original approaches. Our minor contributions include methods related to aspect-based sentiment in Turkish, parameter redefinition in the semi-supervised approach, and aspect term extraction techniques for English. This thesis can be considered the most detailed and comprehensive study made on sentiment analysis in Turkish as of July, 2020. Our work has also contributed to the opinion classification problem in English.

[186] Catch Me If You Can: How Smaller Reasoning Models Pretend to Reason with Mathematical Fidelity cs.CLPDF

Subramanyam Sahoo, Vinija Jain, Saanidhya Vats, Siddharth Mohapatra, Rui Min

TL;DR: 该论文提出了一个新的诊断框架，通过四个互补的维度（前向-后向一致性、传递性覆盖率、反事实敏感性和扰动鲁棒性）来区分语言模型中的真正数学推理与表面模式匹配。研究发现，尽管Qwen3-0.6B在MenatQA数据集上的答案准确率达到70%以上，但其推理能力较差，表明模型依赖模式匹配而非逻辑计算。

Details

Motivation: 当前语言模型在数学推理能力的评估中，主要依赖答案准确率，这可能会掩盖逻辑计算中的根本性缺陷。因此，需要一种更全面的诊断方法来揭示模型是否真正掌握数学推理能力。

Result: 结果显示，尽管模型答案准确率达到70%以上，但在推理真实性维度上表现不佳（如后向一致性仅为15%），表明其推理能力有限。

Insight: 研究发现小型模型可能严重依赖模式匹配而非逻辑推理，强调了超越表面准确率指标的重要性，并为开发更通用的推理评估方法提供了基础。

Abstract: Current evaluation of mathematical reasoning in language models relies primarily on answer accuracy, potentially masking fundamental failures in logical computation. We introduce a diagnostic framework that distinguishes genuine mathematical reasoning from superficial pattern matching through four complementary axes: forward-backward consistency, transitivity coverage, counterfactual sensitivity, and perturbation robustness. Through a case study applying this framework to Qwen3-0.6B on the MenatQA dataset, we reveal a striking disconnect between surface performance and reasoning fidelity. While the model achieves reasonable answer accuracy (70%+), it demonstrates poor backward consistency (15%), limited transitivity coverage (32.2%), and brittle sensitivity to perturbations. Our diagnostics expose reasoning failures invisible to traditional accuracy metrics, suggesting that this small model relies heavily on pattern matching rather than genuine logical computation. While our empirical findings are based on a single 600M-parameter model, the diagnostic framework itself is model-agnostic and generalizable. We release our evaluation protocols to enable the research community to assess reasoning fidelity across different model scales and architectures, moving beyond surface-level accuracy toward verifiable mathematical reasoning.

[187] Slovak Conceptual Dictionary cs.CL | cs.AIPDF

Miroslav Blšták

TL;DR: 论文介绍了一个新的斯洛伐克语概念词典，填补了斯洛伐克语在机器可读语言学数据方面的空白。

Details

Motivation: 斯洛伐克语作为一种低资源语言，缺乏足够的语言学工具和数据，限制了自然语言处理任务的性能。

Result: 新词典为斯洛伐克语的自动化文本处理提供了支持，提升了相关任务的可行性。

Insight: 低资源语言的数据工具开发具有重要意义，能显著改善其自然语言处理任务的结果。

Abstract: When solving tasks in the field of natural language processing, we sometimes need dictionary tools, such as lexicons, word form dictionaries or knowledge bases. However, the availability of dictionary data is insufficient in many languages, especially in the case of low resourced languages. In this article, we introduce a new conceptual dictionary for the Slovak language as the first linguistic tool of this kind. Since Slovak language is a language with limited linguistic resources and there are currently not available any machine-readable linguistic data sources with a sufficiently large volume of data, many tasks which require automated processing of Slovak text achieve weaker results compared to other languages and are almost impossible to solve.

[188] Wikontic: Constructing Wikidata-Aligned, Ontology-Aware Knowledge Graphs with Large Language Models cs.CL | cs.AI | cs.LGPDF

Alla Chepurova, Aydar Bulatov, Yuri Kuratov, Mikhail Burtsev

TL;DR: Wikontic提出了一种多阶段流水线，用于构建与Wikidata对齐、面向本体论的知识图谱（KG），并通过类型和关系约束以及实体规范化提升KG质量，同时在多项基准测试中表现优异。

Details

Motivation: 当前基于大语言模型（LLM）的系统通常仅将知识图谱作为辅助结构用于文本检索，而对其内在质量的研究不足。Wikontic旨在通过结构化方法生成高质量KG，以更好地支持LLM的知识基础。

Result: 1) MuSiQue上正确实体出现率达96%；2) HotpotQA和MuSiQue的F1分别为76.0和59.8，优于需要文本上下文的基线；3) MINE-1上信息保留率达86%（SOTA）。

Insight: 通过结构化约束和实体规范化，Wikontic证明LLM可以有效生成高质量KG，为LLM与结构化知识的结合提供了可扩展方案。

Abstract: Knowledge graphs (KGs) provide structured, verifiable grounding for large language models (LLMs), but current LLM-based systems commonly use KGs as auxiliary structures for text retrieval, leaving their intrinsic quality underexplored. In this work, we propose Wikontic, a multi-stage pipeline that constructs KGs from open-domain text by extracting candidate triplets with qualifiers, enforcing Wikidata-based type and relation constraints, and normalizing entities to reduce duplication. The resulting KGs are compact, ontology-consistent, and well-connected; on MuSiQue, the correct answer entity appears in 96% of generated triplets. On HotpotQA, our triplets-only setup achieves 76.0 F1, and on MuSiQue 59.8 F1, matching or surpassing several retrieval-augmented generation baselines that still require textual context. In addition, Wikontic attains state-of-the-art information-retention performance on the MINE-1 benchmark (86%), outperforming prior KG construction methods. Wikontic is also efficient at build time: KG construction uses less than 1,000 output tokens, about 3$\times$ fewer than AriGraph and $<$1/20 of GraphRAG. The proposed pipeline enhances the quality of the generated KG and offers a scalable solution for leveraging structured knowledge in LLMs.

[189] ART: Adaptive Response Tuning Framework – A Multi-Agent Tournament-Based Approach to LLM Response Optimization cs.CL | cs.AIPDF

Omer Jauhar Khan

TL;DR: ART是一个基于多智能体竞赛的框架，通过ELO排名和多智能体协作优化LLM响应质量，显著提升了准确性和一致性。

Details

Motivation: 现有的单一LLM模型在响应中常出现不一致和幻觉问题，需要一种系统化的方法来优化输出质量。

Result: 实验显示ART在响应准确性、一致性和可靠性上显著优于单模型方法，质量指标提升8.4%。

Insight: 多智能体协作和竞赛机制能有效减少LLM的错误，并提供更可靠的共识输出。

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, single-model responses often exhibit inconsistencies, hallucinations, and varying quality across different query domains. This paper presents ART (Adaptive Response Tuning), a novel framework that employs tournament-style ELO ranking and multi-agent reasoning to systematically optimize LLM outputs. By enabling multiple LLM agents to compete, critique, and collaborate through structured tournament workflows, ART produces consensus responses that outperform individual model outputs. Our framework introduces configurable tournament parameters, dynamic agent selection, and multiple consensus fusion strategies. Experimental evaluations demonstrate significant improvements in response accuracy, coherence, and reliability compared to baseline single-model approaches. The ART framework provides a scalable, production-ready solution for applications requiring high-quality, vetted LLM responses, achieving an 8.4% improvement in overall quality metrics and R22 values exceeding 0.96 in ELO rating convergence.

[190] Graphing the Truth: Structured Visualizations for Automated Hallucination Detection in LLMs cs.CLPDF

Tanmay Agrawal

TL;DR: 该论文提出了一种框架，通过交互式视觉知识图谱组织专有知识和模型生成的内容，帮助用户直观识别LLM的幻觉区域，并提供反馈以提升模型可靠性。

Details

Motivation: 大语言模型（LLM）在企业环境中常因上下文窗口限制和预训练数据与专有知识不一致导致幻觉问题，现有缓解策略成本高或缺乏确定性。

Result: 用户可通过可视化界面诊断不一致性并为模型提供反馈，形成人机协作的工作流以提高模型可靠性。

Insight: 视觉化工具不仅能辅助用户检测幻觉，还能通过结构化反馈循环持续优化模型性能。

Abstract: Large Language Models have rapidly advanced in their ability to interpret and generate natural language. In enterprise settings, they are frequently augmented with closed-source domain knowledge to deliver more contextually informed responses. However, operational constraints such as limited context windows and inconsistencies between pre-training data and supplied knowledge often lead to hallucinations, some of which appear highly credible and escape routine human review. Current mitigation strategies either depend on costly, large-scale gold-standard Q&A curation or rely on secondary model verification, neither of which offers deterministic assurance. This paper introduces a framework that organizes proprietary knowledge and model-generated content into interactive visual knowledge graphs. The objective is to provide end users with a clear, intuitive view of potential hallucination zones by linking model assertions to underlying sources of truth and indicating confidence levels. Through this visual interface, users can diagnose inconsistencies, identify weak reasoning chains, and supply corrective feedback. The resulting human-in-the-loop workflow creates a structured feedback loop that can enhance model reliability and continuously improve response quality.

Breanna E. Green, Ashley L. Shea, Pengfei Zhao, Drew B. Margolin

TL;DR: 论文比较了ChatGPT（包括GPT-3.5、GPT-4和GPT-4o版本）与人类在复杂社交媒体数据分类任务中的表现，结果显示GPT-4在处理细腻语言时表现不佳，提示设计对性能有一定帮助。

Details

Motivation: 随着生成式人工智能工具（如ChatGPT）在计算社会科学中的应用日益广泛，研究其在复杂任务（如分类和标注包含细腻语言的数据集）中的表现具有重要意义。

Result: 结果表明，尽管提示中包含标签定义可能提升性能，但GPT-4在分类细腻语言时表现不佳。定性分析还揭示了四项具体发现。

Insight: 论文指出，在处理细腻语言的分类任务中使用ChatGPT时应谨慎，提示设计对改善性能有一定作用，但仍无法完全替代人类标注。

Abstract: Generative artificial intelligence tools, like ChatGPT, are an increasingly utilized resource among computational social scientists. Nevertheless, there remains space for improved understanding of the performance of ChatGPT in complex tasks such as classifying and annotating datasets containing nuanced language. Method. In this paper, we measure the performance of GPT-4 on one such task and compare results to human annotators. We investigate ChatGPT versions 3.5, 4, and 4o to examine performance given rapid changes in technological advancement of large language models. We craft four prompt styles as input and evaluate precision, recall, and F1 scores. Both quantitative and qualitative evaluations of results demonstrate that while including label definitions in prompts may help performance, overall GPT-4 has difficulty classifying nuanced language. Qualitative analysis reveals four specific findings. Our results suggest the use of ChatGPT in classification tasks involving nuanced language should be conducted with prudence.

[192] FastPOS: Language-Agnostic Scalable POS Tagging Framework Low-Resource Use Case cs.CLPDF

Md Abdullah Al Kafi, Sumit Kumar Banshal

TL;DR: 提出一个语言无关的基于Transformer的词性标注框架FastPOS，适用于低资源语言，以孟加拉语和印地语为例验证其高效性和可移植性。

Details

Motivation: 为了解决低资源语言在NLP任务（如词性标注）中缺乏标注数据的问题，设计了一个轻量且易于移植的框架。

Result: 在孟加拉语和印地语中分别达到96.85%和97%的词性标注准确率，F1分数优秀。

Insight: 数据集标注质量对模型性能有显著影响，模块化设计可极大降低跨语言NLP任务的开发成本。

Abstract: This study proposes a language-agnostic transformer-based POS tagging framework designed for low-resource languages, using Bangla and Hindi as case studies. With only three lines of framework-specific code, the model was adapted from Bangla to Hindi, demonstrating effective portability with minimal modification. The framework achieves 96.85 percent and 97 percent token-level accuracy across POS categories in Bangla and Hindi while sustaining strong F1 scores despite dataset imbalance and linguistic overlap. A performance discrepancy in a specific POS category underscores ongoing challenges in dataset curation. The strong results stem from the underlying transformer architecture, which can be replaced with limited code adjustments. Its modular and open-source design enables rapid cross-lingual adaptation while reducing model design and tuning overhead, allowing researchers to focus on linguistic preprocessing and dataset refinement, which are essential for advancing NLP in underrepresented languages.

[193] Auxiliary-Hyperparameter-Free Sampling: Entropy Equilibrium for Text Generation cs.CLPDF

Xiaodong Cai, Hai Lin, Shaoxiong Zhan, Weiqi Luo, Hong-Gee Kim

TL;DR: 论文提出了一种无需额外超参数的采样方法EES，通过平衡归一化熵与概率质量动态调整候选集，简化部署的同时提升性能。

Details

Motivation: 现有的文本生成采样方法通常需要引入额外超参数，导致调参复杂且部署困难。作者希望通过信息论启发的方法解决这一问题。

Result: 实验表明，EES在推理和生成任务中表现稳定，准确性、连贯性和多样性均优于或接近现有方法。

Insight: 通过平衡熵与概率质量，可以简化采样过程并提升生成质量，这一思路可推广到其他生成任务中。

Abstract: Token sampling strategies critically influence text generation quality in large language models (LLMs). However, existing methods introduce additional hyperparameters, requiring extensive tuning and complicating deployment. We present Entropy Equilibrium Sampling (EES), an auxiliary hyperparameter-free approach inspired by information theory that can dynamically adjust candidate sets by balancing normalized entropy with probability mass. We evaluate EES on both reasoning and generation tasks across a range of model architectures. Our results show that EES consistently performs well across temperature settings, delivering competitive accuracy and coherence while maintaining diversity. By eliminating the need for hyperparameter tuning, EES greatly simplifies deployment while improving performance. Code is available at https://github.com/shuanncai/EES

[194] Less is More: Resource-Efficient Low-Rank Adaptation cs.CL | cs.AIPDF

Chunlin Tian, Xuyang Wei, Huanrong Liu, Zhijiang Guo, Li Li

TL;DR: EffiLoRA通过减少LoRA中的参数冗余和动态调整资源分配，显著提升了资源效率，同时保持了模型性能。

Details

Motivation: LoRA虽然是一种参数高效的微调方法，但在复杂数据集上仍存在计算开销大和参数干扰的问题。

Result: EffiLoRA在多种任务（如常识推理、视觉指令调和图像生成）上均优于LoRA，表现出更高的效率和鲁棒性。

Insight: 减少参数冗余和动态资源分配是提升参数高效微调方法性能的关键。

Abstract: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method for Large Language Models (LLMs), but it still incurs notable overhead and suffers from parameter interference in complex datasets. While re- cent works decouple LoRA update matrices to exploit matrix-wise asymmetry, training costs remain high. We revisit LoRA from the perspective of inter-matrix and intra-layer parameter redundancy and propose Resource-Efficient Low-Rank Adaptation, EffiLoRA, a lightweight and generalizable approach for language, multimodal, and diffusion models. EffiLoRA employs a unified A matrix across all transformer layers and introduces a runtime selective B matrices up- date to dynamically trade-off the system resource budget and model performance. EffiLoRA consistently outperforms LoRA across diverse modalities, including commonsense reasoning, visual instruction tuning, and image generation, demon- strating improved efficiency and robustness.

[195] Table as a Modality for Large Language Models cs.CL | cs.AIPDF

Liyao Li, Chao Ye, Wentao Ye, Yifei Sun, Zhe Jiang

TL;DR: 该论文指出大语言模型（LLMs）在处理表格数据时因结构信息丢失而表现不佳，并提出了一种将表格作为独立模态整合到LLMs的多模态框架TAMO，显著提升了模型性能。

Details

Motivation: 当前LLMs通过序列化表格数据和元信息来处理任务，但这种方法导致结构信息丢失，限制了模型在表格推理任务中的表现。

Result: 在多个基准数据集（如HiTab、WikiTQ等）上平均相对增益达42.65%，验证了方法的有效性。

Insight: 表格的结构信息对推理任务至关重要，将其视为独立模态可以显著提升LLMs的表格处理能力。

Abstract: To migrate the remarkable successes of Large Language Models (LLMs), the community has made numerous efforts to generalize them to the table reasoning tasks for the widely deployed tabular data. Despite that, in this work, by showing a probing experiment on our proposed StructQA benchmark, we postulate that even the most advanced LLMs (such as GPTs) may still fall short of coping with tabular data. More specifically, the current scheme often simply relies on serializing the tabular data, together with the meta information, then inputting them through the LLMs. We argue that the loss of structural information is the root of this shortcoming. In this work, we further propose TAMO, which bears an ideology to treat the tables as an independent modality integrated with the text tokens. The resulting model in TAMO is a multimodal framework consisting of a hypergraph neural network as the global table encoder seamlessly integrated with the mainstream LLM. Empirical results on various benchmarking datasets, including HiTab, WikiTQ, WikiSQL, FeTaQA, and StructQA, have demonstrated significant improvements on generalization with an average relative gain of 42.65%.

[196] Dr.Mi-Bench: A Modular-integrated Benchmark for Scientific Deep Research Agent cs.CLPDF

Zhihan Guo, Feiyang Xu, Yifan Li, Muzhi Li, Shuai Zou

TL;DR: Dr.Mi-Bench是一个模块化集成的基准测试，用于评估科学领域深度研究代理的核心能力，填补了现有基准测试在高层次规划和推理能力评估上的空白。

Details

Motivation: 学术文献的爆炸式增长需要自动化深度研究代理，但目前缺乏全面评估这些代理在科学领域中高层次规划和推理能力的基准测试。

Result: 实验结果显示，代理在科学领域中表现分散，尤其是在多源检索和跨领域一致性方面存在明显弱点。高层次规划能力是解锁基础大语言模型推理潜力的关键。

Insight: Dr.Mi-Bench揭示了代理在科学领域的薄弱环节，为开发更可靠的学术研究助手提供了诊断工具。

Abstract: The explosive growth in academic literature necessitates automated deep research (DR) agents, yet their evaluation remains a significant challenge. First, existing benchmarks often focus narrowly on retrieval while neglecting high-level planning and reasoning. Second, existing benchmarks favor general domains over the scientific domains that are the core application for DR agents. To address these gaps, we introduce Dr.Mi-Bench, a Modular-integrated benchmark for scientific DR agents. Grounded in academic literature, our benchmark uses a human-annotated dataset of 200 instances across 10 scientific domains, including both research and review papers. Besides, we also propose a Modular-integrated Evaluation Paradigm for DR Agents (Dr.Mi-Eval), a novel modular-integrated evaluation paradigm, which leverages the rich structure of academic papers to assess the core competencies of planning, retrieval, and reasoning through two complementary modes: an end-to-end evaluation for DR agents and an isolated evaluation for foundational LLMs as potential backbones. Experimental results reveal a fragmented performance landscape: agents exhibit specialized strengths but share critical weaknesses, most notably in performing the multi-source retrieval required for review-style tasks and performing consistently across diverse scientific fields. Moreover, improving high-level planning capability is the crucial factor for unlocking the reasoning potential of foundational LLMs as backbones. By exposing these actionable failure modes, Dr.Mi-Bench provides a diagnostic tool to guide the development of more reliable academic research assistants.

[197] ELR-1000: A Community-Generated Dataset for Endangered Indic Indigenous Languages cs.CL | cs.HCPDF

Neha Joshi, Pamir Gogoi, Aasim Mirza, Aayush Jansari, Aditya Yadavalli

TL;DR: ELR-1000是一个多模态数据集，包含1060种来自印度东部偏远地区的传统食谱，涉及10种濒危语言。研究发现，尽管大型语言模型（LLM）在这些低资源文化特异性语言的翻译中表现不佳，但提供针对性上下文能显著提升翻译质量。

Details

Motivation: 研究旨在通过收集濒危语言中的传统食谱数据，填补自然语言处理（NLP）领域在低资源文化特异性语言上的空白，并推动对这些语言的保护和技术开发。

Result: LLM在低资源、文化特异性语言翻译中表现不佳，但提供背景信息、翻译示例和文化保护指南后，翻译质量显著提升。

Insight: 研究强调了为少数语言和领域定制基准的必要性，以便开发更具包容性和文化敏感性的语言技术。

Abstract: We present a culturally-grounded multimodal dataset of 1,060 traditional recipes crowdsourced from rural communities across remote regions of Eastern India, spanning 10 endangered languages. These recipes, rich in linguistic and cultural nuance, were collected using a mobile interface designed for contributors with low digital literacy. Endangered Language Recipes (ELR)-1000 – captures not only culinary practices but also the socio-cultural context embedded in indigenous food traditions. We evaluate the performance of several state-of-the-art large language models (LLMs) on translating these recipes into English and find the following: despite the models’ capabilities, they struggle with low-resource, culturally-specific language. However, we observe that providing targeted context – including background information about the languages, translation examples, and guidelines for cultural preservation – leads to significant improvements in translation quality. Our results underscore the need for benchmarks that cater to underrepresented languages and domains to advance equitable and culturally-aware language technologies. As part of this work, we release the ELR-1000 dataset to the NLP community, hoping it motivates the development of language technologies for endangered languages.

[198] How do we measure privacy in text? A survey of text anonymization metrics cs.CLPDF

Yaxuan Ren, Krithika Ramesh, Yaxing Yao, Anjalie Field

TL;DR: 这篇论文通过系统性调查澄清和调和了评估文本隐私保护的指标，分析了六种隐私概念，并评估了这些指标与法律标准（如HIPAA和GDPR）以及用户期望的吻合程度。

Details

Motivation: 文本匿名化在敏感数据领域的NLP研究和模型开发中至关重要，但目前缺乏统一的隐私评估指标。论文旨在填补这一空白，提供更强大、可比性更强的隐私评估方法。

Result: 论文给出了隐私评估指标的全面分析，揭示了当前方法的局限性和改进方向。

Insight: 1. 隐私评估需要兼顾技术和法律视角；2. 用户期望是对现有法律标准的重要补充；3. 未来研究应关注指标的鲁棒性和可比性。

Abstract: In this work, we aim to clarify and reconcile metrics for evaluating privacy protection in text through a systematic survey. Although text anonymization is essential for enabling NLP research and model development in domains with sensitive data, evaluating whether anonymization methods sufficiently protect privacy remains an open challenge. In manually reviewing 47 papers that report privacy metrics, we identify and compare six distinct privacy notions, and analyze how the associated metrics capture different aspects of privacy risk. We then assess how well these notions align with legal privacy standards (HIPAA and GDPR), as well as user-centered expectations grounded in HCI studies. Our analysis offers practical guidance on navigating the landscape of privacy evaluation approaches further and highlights gaps in current practices. Ultimately, we aim to facilitate more robust, comparable, and legally aware privacy evaluations in text anonymization.

[199] DrawingBench: Evaluating Spatial Reasoning and UI Interaction Capabilities of Large Language Models through Mouse-Based Drawing Tasks cs.CLPDF

Hyunjun Kim, Sooyoung Ryu

TL;DR: DrawingBench是一个透明、可审计的评估框架，通过空间推理任务验证大型语言模型（LLM）的可信赖性，提供8项客观标准和多轮反馈机制。实验表明，模型在明确标准下表现优异，但长期规划能力仍有不足。

Details

Motivation: 随着自主AI系统的普及，评估其可靠性成为关键，现有基准缺乏透明度和可审计性。

Result: 模型在明确标准下表现完美，反馈机制显著提升性能，但长期规划和工具状态管理存在系统性错误。

Insight: 透明评估框架和外部监督比自我纠正更能建立对自主系统的信任。

Abstract: As agentic AI systems increasingly operate autonomously, establishing trust through verifiable evaluation becomes critical. Yet existing benchmarks lack the transparency and auditability needed to assess whether agents behave reliably. We present DrawingBench, a verification framework for evaluating the trustworthiness of agentic LLMs through spatial reasoning tasks that require generating sequences of low-level GUI actions. Unlike opaque evaluations, DrawingBench provides transparent, rule-based assessment: 8 objective criteria enable reproducible scoring, while action-level inspection allows stakeholders to audit agent behavior. Our framework comprises 250 diverse prompts across 20 categories and 4 difficulty levels, deterministic evaluation metrics, and an external oversight mechanism through multi-turn feedback that enables human control over agent refinement. Evaluating four state-of-the-art LLMs (Claude-4 Sonnet, GPT-4.1, GPT-4.1-mini, Gemini-2.5 Flash) across 1,000 tests, we establish both capabilities and limitations: models achieved 92.8% perfect performance with structured external feedback driving significant improvements (average +3.2%, up to +32.8% for complex scenes), but systematic error patterns emerged in tool state management and long-horizon planning. Notably, specification clarity proved more important than task complexity – models achieved 100% perfect performance when given explicit, verifiable criteria. These findings demonstrate that transparent evaluation frameworks can establish trust in agentic systems, with external oversight proving more reliable than self-correction for guiding agent behavior. Our open-source framework provides a template for trustworthy agent assessment. Code and data: https://github.com/hyunjun1121/DrawingBench

[200] TempPerturb-Eval: On the Joint Effects of Internal Temperature and External Perturbations in RAG Robustness cs.CL | cs.AIPDF

Yongxin Zhou, Philippe Mulhem, Didier Schwab

TL;DR: 该论文研究了检索增强生成（RAG）系统中文本扰动与温度设置的交互作用，提出了一个分析框架，发现在高温度设置下模型对扰动更敏感。

Details

Motivation: 现有RAG系统的评估通常孤立地分析检索质量和生成参数，忽视了它们的交互作用。本文旨在填补这一空白。

Result: 实验表明，高温度设置会放大扰动的影响，而某些扰动类型对温度的敏感性呈现非线性。

Insight: 模型在高温度下对噪声更敏感，选择合适的温度可以提升RAG系统在噪声环境下的表现。

Abstract: The evaluation of Retrieval-Augmented Generation (RAG) systems typically examines retrieval quality and generation parameters like temperature in isolation, overlooking their interaction. This work presents a systematic investigation of how text perturbations (simulating noisy retrieval) interact with temperature settings across multiple LLM runs. We propose a comprehensive RAG Perturbation-Temperature Analysis Framework that subjects retrieved documents to three distinct perturbation types across varying temperature settings. Through extensive experiments on HotpotQA with both open-source and proprietary LLMs, we demonstrate that performance degradation follows distinct patterns: high-temperature settings consistently amplify vulnerability to perturbations, while certain perturbation types exhibit non-linear sensitivity across the temperature range. Our work yields three key contributions: (1) a diagnostic benchmark for assessing RAG robustness, (2) an analytical framework for quantifying perturbation-temperature interactions, and (3) practical guidelines for model selection and parameter tuning under noisy retrieval conditions.

[201] Generalist Large Language Models Outperform Clinical Tools on Medical Benchmarks cs.CLPDF

Krithik Vishwanath, Mrigayu Ghosh, Anton Alyakin, Daniel Alexander Alber, Yindalon Aphinyanaphongs

TL;DR: 通用大语言模型（LLMs）在医学基准测试中表现优于专用临床AI工具，凸显了其在医疗决策支持中的潜力。

Details

Motivation: 专用临床AI助手在医疗实践中逐渐普及，但缺乏独立、定量化的评估，形成证据空白。本文旨在验证通用LLMs是否优于这些专用工具。

Result: 通用LLMs在完整度、沟通质量、上下文意识和系统安全性推理等方面均优于临床AI工具，其中GPT-5表现最佳。

Insight: 现有临床AI工具可能落后于前沿LLMs，强调了在临床部署前进行透明独立评估的重要性。

Abstract: Specialized clinical AI assistants are rapidly entering medical practice, often framed as safer or more reliable than general-purpose large language models (LLMs). Yet, unlike frontier models, these clinical tools are rarely subjected to independent, quantitative evaluation, creating a critical evidence gap despite their growing influence on diagnosis, triage, and guideline interpretation. We assessed two widely deployed clinical AI systems (OpenEvidence and UpToDate Expert AI) against three state-of-the-art generalist LLMs (GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5) using a 1,000-item mini-benchmark combining MedQA (medical knowledge) and HealthBench (clinician-alignment) tasks. Generalist models consistently outperformed clinical tools, with GPT-5 achieving the highest scores, while OpenEvidence and UpToDate demonstrated deficits in completeness, communication quality, context awareness, and systems-based safety reasoning. These findings reveal that tools marketed for clinical decision support may often lag behind frontier LLMs, underscoring the urgent need for transparent, independent evaluation before deployment in patient-facing workflows.

[202] SUPERChem: A Multimodal Reasoning Benchmark in Chemistry cs.CL | cs.AI | cs.LGPDF

Zehua Zhao, Zhixian Huang, Junren Li, Siyu Lin, Junting Zhou

TL;DR: SUPERChem是一个多模态化学推理评测基准，包含500道专家精心设计的问题，涵盖多个子领域，支持多模态和纯文本格式，旨在解决现有评测标准的不足，提升大语言模型在化学领域的推理能力。

Details

Motivation: 现有评测标准在化学推理任务中存在任务过于简化、缺乏过程评估以及与专家技能不匹配的问题，影响了评测的有效性和实用性，因此需要更严谨的评测基准。

Result: 表现最好的模型GPT-5（High）准确率为38.5%，Gemini 2.5 Pro和DeepSeek-V3.1-Think紧随其后。SUPERChem成功区分了高保真推理和基于启发式的模型表现。

Insight: 1. 多模态信息对模型推理表现有显著影响；2. RPF评分能有效衡量推理质量；3. 当前模型的化学推理能力仍需提升。

Abstract: Current benchmarks for evaluating the chemical reasoning capabilities of Large Language Models (LLMs) are limited by oversimplified tasks, lack of process-level evaluation, and misalignment with expert-level chemistry skills. To address these issues, we introduce SUPERChem, a benchmark of 500 expert-curated reasoning-intensive chemistry problems, covering diverse subfields and provided in both multimodal and text-only formats. Original content and an iterative curation pipeline eliminate flawed items and mitigate data contamination. Each problem is paired with an expert-authored solution path, enabling Reasoning Path Fidelity (RPF) scoring to evaluate reasoning quality beyond final-answer accuracy. Evaluations against a human baseline of 40.3% accuracy show that even the best-performing model, GPT-5 (High), reaches only 38.5%, followed closely by Gemini 2.5 Pro (37.9%) and DeepSeek-V3.1-Think (37.3%). SUPERChem elicits multi-step, multimodal reasoning, reveals model-dependent effects of visual information, and distinguishes high-fidelity reasoners from heuristic ones. By providing a challenging benchmark and a reliable evaluation framework, SUPERChem aims to facilitate the advancement of LLMs toward expert-level chemical intelligence. The dataset of the benchmark is available at https://huggingface.co/datasets/ZehuaZhao/SUPERChem.

[203] Kardia-R1: Unleashing LLMs to Reason toward Understanding and Empathy for Emotional Support via Rubric-as-Judge Reinforcement Learning cs.CL | cs.AIPDF

Jiahao Yuan, Zhiqing Cui, Hanqing Wang, Yuansheng Gao, Yucheng Zhou

TL;DR: 论文提出了Kardia-R1框架，通过Rubric-as-Judge强化学习方法，结合大规模用户基准数据集KardiaBench，训练具有逐步共情认知能力的模型，显著优于现有方法。

Details

Motivation: 现有对话系统在情感支持方面存在两个主要问题：一是依赖缺乏用户身份持久性的数据集，难以捕捉个性化情感细节；二是依赖不透明的粗粒度奖励信号，阻碍了可验证的共情推理发展。

Result: 在四种LLM主干上的实验表明，Kardia-R1在情感准确性、共情、相关性、身份一致性和安全性方面均优于其他方法。

Insight: 通过规则引导的迭代优化，模型能够更好地理解和响应用户的个性化情感需求，提升了对话系统的共情能力和可解释性。

Abstract: As web platforms evolve towards greater personalization and emotional complexity, conversational agents must transcend superficial empathy to demonstrate identity-aware emotional reasoning. However, existing systems face two limitations: (1) reliance on situation-centric datasets lacking persistent user identity, which hampers the capture of personalized affective nuances; and (2) dependence on opaque, coarse reward signals that hinder development of verifiable empathetic reasoning. To address these gaps, we introduce KardiaBench, a large-scale user-grounded benchmark comprising 178,080 QA pairs across 22,080 multi-turn conversations anchored to 671 real-world profiles. The dataset is constructed via a model-in-the-loop pipeline with iterative rubric-guided refinement to ensure psychological plausibility and persona consistency. This progressive empathy pipeline that integrates user comprehension, contextual reasoning, and emotion perception into conversations, followed by iterative critique and rubric-based refinement to ensure psychological plausibility, emotional fidelity, and persona consistency. Building on this, we propose Kardia-R1, a framework that trains models for interpretable, stepwise empathetic cognition. Kardia-R1 leverages Rubric-as-Judge Empathetic Reinforcement Learning (Rubric-ERL), a GRPO-based method that uses explainable, human-aligned rubric rewards to tightly couple user understanding, emotional inference, and supportive response generation. Extensive experiments across four LLM backbones demonstrate that Kardia-R1 consistently outperforms othet methods in emotion accuracy, empathy, relevance, persona consistency, and safety. Our dataset and model will be released at https://github.com/JhCircle/Kardia-R1.

[204] Agreement-Constrained Probabilistic Minimum Bayes Risk Decoding cs.CL | cs.AI | cs.LGPDF

Koki Natsumi, Hiroyuki Deguchi, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

TL;DR: AC-PMBR解码通过结合知识蒸馏模型改进概率最小贝叶斯风险解码（PMBR），减少了矩阵补全的近似误差，并在翻译质量与计算成本之间实现了更好的平衡。

Details

Motivation: 最小贝叶斯风险（MBR）解码因其高质量翻译而闻名，但其计算成本高（二次时间）。概率MBR（PMBR）解码虽然减少了计算量，但降低了翻译质量。为此，作者提出一种改进方法，以在保持计算效率的同时提升翻译质量。

Result: AC-PMBR在WMT’23 En↔De翻译任务中表现优于PMBR解码，同时计算成本相近。

Insight: 知识蒸馏模型可以有效指导矩阵补全任务，提升解码效率和质量。

Abstract: Minimum Bayes risk (MBR) decoding generates high-quality translations by maximizing the expected utility of output candidates, but it evaluates all pairwise scores over the candidate set; hence, it takes quadratic time with respect to the number of candidates. To reduce the number of utility function calls, probabilistic MBR (PMBR) decoding partially evaluates quality scores using sampled pairs of candidates and completes the missing scores with a matrix completion algorithm. Nevertheless, it degrades the translation quality as the number of utility function calls is reduced. Therefore, to improve the trade-off between quality and cost, we propose agreement-constrained PMBR (AC-PMBR) decoding, which leverages a knowledge distilled model to guide the completion of the score matrix. Our AC-PMBR decoding improved approximation errors of matrix completion by up to 3 times and achieved higher translation quality compared with PMBR decoding at a comparable computational cost on the WMT’23 En$\leftrightarrow$De translation tasks.

Md. Rafiul Biswas, Firoj Alam, Wajdi Zaghouani

TL;DR: MARSAD是一个多功能的NLP平台，专注于阿拉伯语社交媒体内容的实时监控和分析，提供情感分析、情绪分析、宣传检测等功能，并支持数据抓取和可视化。

Details

Motivation: 社交媒体数据的实时分析与监测在阿拉伯语世界中具有重要价值，但目前缺乏易用且功能全面的工具。MARSAD旨在填补这一空白。

Result: MARSAD实现了实时社交媒体数据的多维度分析，并为用户提供了可视化报告。

Insight: 整合多种分析功能和使用友好界面可以显著提升非技术用户的分析能力。

Abstract: MARSAD is a multifunctional natural language processing (NLP) platform designed for real-time social media monitoring and analysis, with a particular focus on the Arabic-speaking world. It enables researchers and non-technical users alike to examine both live and archived social media content, producing detailed visualizations and reports across various dimensions, including sentiment analysis, emotion analysis, propaganda detection, fact-checking, and hate speech detection. The platform also provides secure data-scraping capabilities through API keys for accessing public social media data. MARSAD’s backend architecture integrates flexible document storage with structured data management, ensuring efficient processing of large and multimodal datasets. Its user-friendly frontend supports seamless data upload and interaction.

[206] DyFuLM: An Advanced Multimodal Framework for Sentiment Analysis cs.CLPDF

Ruohan Zhou, Jiachen Yuan, Churui Yang, Wenzheng Huang, Guoyan Zhang

TL;DR: DyFuLM是一种多模态情感分析框架，通过动态融合学习和门控特征聚合提升情感表征能力，实验证明其在粗粒度和细粒度任务中均表现优异。

Details

Motivation: 当前的情感分析任务在处理复杂文本表达时仍面临语义层次和情感细微差别的挑战，需要一个更强大的多模态框架来解决这一问题。

Result: 在实验数据集上，DyFuLM实现了82.64%的粗粒度准确率和68.48%的细粒度准确率，同时误差最小（MAE=0.0674，MSE=0.0082），决定系数最高（R²=0.6903）。消融实验进一步证明各模块的贡献。

Insight: 动态特征融合和门控机制在多模态情感分析中至关重要，能够显著提升模型的表征能力和任务平衡性。

Abstract: Understanding sentiment in complex textual expressions remains a fundamental challenge in affective computing. To address this, we propose a Dynamic Fusion Learning Model (DyFuLM), a multimodal framework designed to capture both hierarchical semantic representations and fine-grained emotional nuances. DyFuLM introduces two key moodules: a Hierarchical Dynamic Fusion module that adaptively integrates multi-level features, and a Gated Feature Aggregation module that regulates cross-layer information ffow to achieve balanced representation learning. Comprehensive experiments on multi-task sentiment datasets demonstrate that DyFuLM achieves 82.64% coarse-grained and 68.48% fine-grained accuracy, yielding the lowest regression errors (MAE = 0.0674, MSE = 0.0082) and the highest R^2 coefficient of determination (R^2= 0.6903). Furthermore, the ablation study validates the effectiveness of each module in DyFuLM. When all modules are removed, the accuracy drops by 0.91% for coarse-grained and 0.68% for fine-grained tasks. Keeping only the gated fusion module causes decreases of 0.75% and 0.55%, while removing the dynamic loss mechanism results in drops of 0.78% and 0.26% for coarse-grained and fine-grained sentiment classification, respectively. These results demonstrate that each module contributes significantly to feature interaction and task balance. Overall, the experimental findings further validate that DyFuLM enhances sentiment representation and overall performance through effective hierarchical feature fusion.

[207] PromptBridge: Cross-Model Prompt Transfer for Large Language Models cs.CL | cs.AIPDF

Yaxuan Wang, Quan Liu, Zhenting Wang, Zichao Li, Wei Wei

TL;DR: 论文提出PromptBridge框架，解决大语言模型（LLMs）中模型漂移问题，通过无需训练的方法实现跨模型提示迁移，减少重新优化的成本。

Details

Motivation: 大语言模型在实际应用中频繁切换，但为某个模型优化的提示在其他模型上效果大幅下降（模型漂移现象），需要一种高效方法实现提示的跨模型迁移。

Result: 实验证明，PromptBridge在单智能体和多智能体设置下均能提升任务准确率，同时降低迁移成本。

Insight: 模型漂移现象普遍且严重，PromptBridge为解决这一问题提供了一种高效、低成本的解决方案。

Abstract: Large language models (LLMs) underpin applications in code generation, mathematical reasoning, and agent-based workflows. In practice, systems access LLMs via commercial APIs or open-source deployments, and the model landscape (e.g., GPT, Claude, Llama) evolves rapidly. This rapid evolution forces frequent model switches driven by capability, cost, deployment constraints, and privacy. Yet prompts are highly model-sensitive: reusing a prompt engineered for one model on another often yields substantially worse performance than a prompt optimized for the target model. We term this phenomenon Model Drifting. Through extensive empirical analysis across diverse LLM configurations, we show that model drifting is both common and severe. To address this challenge, we introduce PromptBridge, a training-free framework that preserves prompt effectiveness under model switches, enabling cross-model prompt transfer without costly per-task or per-model re-optimization. PromptBridge requires only a small set of alignment tasks for calibration. It first applies Model-Adaptive Reflective Prompt Evolution (MAP-RPE) to obtain task- and model-specific optimal prompts via iterative reflective refinement and quantitative evaluation. Using the resulting calibrated prompt pairs for the source and target models, PromptBridge learns a cross-model prompt mapping. At test time, i.e., for an unseen task, given a source-model prompt, this mapping directly produces an optimized prompt for the target model. Experiments in single-agent and multi-agent settings show that PromptBridge consistently improves downstream accuracy while reducing migration effort. The code will be available soon.

[208] MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages cs.CLPDF

Yexing Du, Kaiyuan Liu, Youcheng Pan, Bo Yang, Keqi Deng

TL;DR: MCAT框架通过语言扩展和优化的语音适配器模块，解决了MLLMs在多语言语音到文本翻译中的覆盖范围和效率问题，支持70种语言的互译，并在实验中表现优异。

Details

Motivation: 当前MLLMs在语音到文本翻译任务中面临语言覆盖范围有限和推理速度下降的问题，亟需一种能够高效支持多语言互译的解决方案。

Result: 在FLEURS数据集上，MCAT在70x69个方向上超越现有端到端模型，同时提升了批量推理效率，仅需约1亿可训练参数和每种语言10小时的S2TT数据。

Insight: 通过创新的语言扩展和语音序列优化，MCAT展示了MLLMs在多语言翻译任务中的巨大潜力，同时为开源社区提供了实用工具。

Abstract: Multimodal Large Language Models (MLLMs) have achieved great success in Speech-to-Text Translation (S2TT) tasks. However, current research is constrained by two key challenges: language coverage and efficiency. Most of the popular S2TT datasets are substantially English-centric, which restricts the scaling-up of MLLMs’ many-to-many translation capabilities. Moreover, the inference speed of MLLMs degrades dramatically when the speech is converted into long sequences (e.g., 750 tokens). To address these limitations, we propose a Multilingual Cost-effective Accelerated Speech-to-Text Translator (MCAT) framework, which includes two innovations. First, a language scaling method that leverages curriculum learning and a data balancing strategy is introduced to extend the language coverage supported by MLLMs to 70 languages and achieve mutual translation among these languages. Second, an optimized speech adapter module is designed to reduce the length of the speech sequence to only 30 tokens. Extensive experiments were conducted on MLLMs of different scales (9B and 27B). The experimental results demonstrate that MCAT not only surpasses state-of-the-art end-to-end models on the FLEURS dataset across 70x69 directions but also enhances batch inference efficiency. This is achieved with only ~100M trainable parameters and by using only 10 hours of S2TT data per language. Furthermore, we have released MCAT as open-source to promote the development of MLLMs for robust S2TT capabilities. The code and models are released at https://github.com/yxduir/m2m-70.

[209] Learning the Boundary of Solvability: Aligning LLMs to Detect Unsolvable Problems cs.CL | cs.AIPDF

Dengyun Peng, Qiguang Chen, Bofei Liu, Jiannan Guan, Libo Qin

TL;DR: 论文提出了UnsolvableQA和UnsolvableRL方法，旨在让LLM不仅能解决问题，还能识别问题是否无解。通过构建包含可解和不可解问题的数据集，并结合强化学习框架，实现了近乎完美的无解检测能力，同时提升了可解任务的准确性。

Details

Motivation: 当前LLM在区分客观无解问题（问题本身的矛盾）和主观能力限制（超出模型能力的问题）时表现不佳，导致幻觉和过度自信。为了解决这一问题，研究提出了新方法。

Result: 实验结果表明，该方法实现了近乎完美的无解检测能力，同时在可解任务上提升了准确性。Capability Collapse现象的发现支持了显式暴露于无解数据的重要性。

Insight: 研究揭示了显式训练模型识别无解问题的重要性，这不仅提升了模型的可靠性，还防止了过度自信问题。这种思路可以推广到其他LLM任务中。

Abstract: Ensuring LLM reliability requires not only solving complex problems but also recognizing when a problem is unsolvable. Current models often struggle to distinguish objective unsolvability (inherent contradictions in the problem) from subjective capability limitations (problems beyond the model’s competence), which leads to hallucinations and overconfidence. To address this, we propose UnsolvableQA and UnsolvableRL to solve feasible problems, detect inherent contradictions, and prudently refuse tasks beyond capability. Specifically, we construct UnsolvableQA, a dataset of paired solvable and unsolvable instances derived via a dual-track methodology: programmatic generation for logic puzzles and a novel “Reverse Construction” method that injects contradictions into valid reasoning chains for mathematics. Building on this dataset, we introduce UnsolvableRL, a reinforcement learning framework with three reward components jointly accounting for accuracy, unsolvability, and difficulty. Empirical results show that our approach achieves near-perfect unsolvability detection while also improving accuracy on solvable tasks. Crucially, we identify Capability Collapse, demonstrating that explicit exposure to unsolvable data is indispensable for preventing models from becoming systematically overconfident. Our code and data are available at https://github.com/sfasfaffa/unsolvableQA.

[210] Self-Supervised Borrowing Detection on Multilingual Wordlists cs.CLPDF

Tim Wientzek

TL;DR: 论文提出了一种全自监督的方法用于多语言词表中的借用检测，结合了全局对应模型的PMI相似度和基于语音特征向量的轻量对比学习，并通过自动选择决策阈值避免标注数据的需求。实验表明该方法优于现有字符串相似度方法（如NED和SCA），且性能与监督基线相当或更好。

Details

Motivation: 现有的借用检测方法通常依赖标注数据或特定语言知识，限制了其可扩展性和通用性。论文旨在开发一种不依赖标注数据且适用于多语言的自监督方法。

Result: 实验证明PMI相似度优于NED和SCA等传统方法，组合方法性能与监督基线相当或更好。消融实验验证了字符编码、温度设置和数据增强的重要性。

Insight: 全局统计信息（PMI）和局部特征（语音向量）的结合能显著提升借用检测性能；自监督方法在无标注场景下具有潜力。

Abstract: This paper presents a fully self-supervised approach to borrowing detection in multilingual wordlists. The method combines two sources of information: PMI similarities based on a global correspondence model and a lightweight contrastive component trained on phonetic feature vectors. It further includes an automatic procedure for selecting decision thresholds without requiring labeled data. Experiments on benchmark datasets show that PMI alone already improves over existing string similarity measures such as NED and SCA, and that the combined similarity performs on par with or better than supervised baselines. An ablation study highlights the importance of character encoding, temperature settings and augmentation strategies. The approach scales to datasets of different sizes, works without manual supervision and is provided with a command-line tool that allows researchers to conduct their own studies.

[211] Beware of Reasoning Overconfidence: Pitfalls in the Reasoning Process for Multi-solution Tasks cs.CLPDF

Jiannan Guan, Qiguang Chen, Libo Qin, Dengyun Peng, Jinhao Liu

TL;DR: 论文探讨了大语言模型（LLM）在多解任务中的局限性，提出了‘推理过度自信’的概念，并通过实验和理论分析揭示了其成因与解决方法。

Details

Motivation: 大语言模型在单解任务中表现优异，但在需要全面多样化答案的多解任务中表现不佳，研究者希望探究这一现象的成因并提供改进方法。

Result: 实验显示Short-CoT在多解任务中存在显著过度自信，而Long-CoT通过迭代探索和自我反思有效缓解了这一现象。

Insight: 研究表明，LLM在多解任务中的表现不仅需要关注单解准确性，还需评估其探索全面性，未来研究应注重推理过程的多样性。

Abstract: Large Language Models (LLMs) excel in reasoning tasks requiring a single correct answer, but they perform poorly in multi-solution tasks that require generating comprehensive and diverse answers. We attribute this limitation to \textbf{reasoning overconfidence}: a tendency to express undue certainty in an incomplete solution set. To examine the effect, we introduce \textit{MuSoBench}, a benchmark of multi-solution problems. Experiments show that the conventional short chain-of-thought (Short-CoT) prompting paradigm exhibits pronounced overconfidence, whereas the emerging long chain-of-thought (Long-CoT) approach mitigates it through iterative exploration and self-reflection. We further characterise observable behaviours and influential factors. To probe the underlying cause, we propose the \textbf{cognitive-rigidity hypothesis}, which posits that overconfidence arises when the reasoning process prematurely converges on a narrow set of thought paths. An attention-entropy analysis offers preliminary support for this view. These findings provide tools for assessing the completeness of LLM reasoning and highlight the need to move evaluation beyond single-answer accuracy toward comprehensive exploration.

[212] Reasoning About the Unsaid: Misinformation Detection with Omission-Aware Graph Inference cs.CLPDF

Zhengjia Wang, Danding Wang, Qiang Sheng, Jiaying Wu, Juan Cao

TL;DR: 该论文提出了OmiGraph，首个专注于遗漏型虚假信息检测的框架，通过构建遗漏感知图和动态关系建模，显著提升了检测性能。

Details

Motivation: 现有虚假信息检测多关注显性伪造内容，而隐性遗漏信息同样能误导读者却少有研究。

Result: 在两个大规模数据集上平均提升5.4% F1和5.3% ACC。

Insight: 遗漏信息是虚假信息的重要形式，动态建模其意图和内容能显著提升检测效果。

Abstract: This paper investigates the detection of misinformation, which deceives readers by explicitly fabricating misleading content or implicitly omitting important information necessary for informed judgment. While the former has been extensively studied, omission-based deception remains largely overlooked, even though it can subtly guide readers toward false conclusions under the illusion of completeness. To pioneer in this direction, this paper presents OmiGraph, the first omission-aware framework for misinformation detection. Specifically, OmiGraph constructs an omission-aware graph for the target news by utilizing a contextual environment that captures complementary perspectives of the same event, thereby surfacing potentially omitted contents. Based on this graph, omission-oriented relation modeling is then proposed to identify the internal contextual dependencies, as well as the dynamic omission intents, formulating a comprehensive omission relation representation. Finally, to extract omission patterns for detection, OmiGraph introduces omission-aware message-passing and aggregation that establishes holistic deception perception by integrating the omission contents and relations. Experiments show that, by considering the omission perspective, our approach attains remarkable performance, achieving average improvements of +5.4% F1 and +5.3% ACC on two large-scale benchmarks.

[213] InnoGym: Benchmarking the Innovation Potential of AI Agents cs.CL | cs.AI | cs.CV | cs.LG | cs.MAPDF

Jintian Zhang, Kewei Xu, Jingsheng Zheng, Zhuoyun Yu, Yuqi Zhu

TL;DR: 该论文提出了InnoGym，一个用于评估AI代理创新潜力的基准和框架，强调方法多样性和原创性。

Details

Motivation: 现有基准主要关注答案的正确性，而忽视了解决方案背后方法的多样性和原创性，这对真正的创新至关重要。

Result: 实验表明，部分代理能产生新颖方法，但鲁棒性不足限制了性能增益，揭示了创造力和有效性之间的差距。

Insight: 创新不仅需要新颖性，还需要方法的鲁棒性，未来的基准需同时评估两者。

Abstract: LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.

[214] Beyond SFT: Reinforcement Learning for Safer Large Reasoning Models with Better Reasoning Ability cs.CLPDF

Jinghan Jia, Nathalie Baracaldo, Sijia Liu

TL;DR: 本文探讨了如何通过强化学习（RL）改善大型推理模型（LRMs）的安全性和推理能力，弥补监督微调（SFT）方法的不足。

Details

Motivation: LRMs通过显式推理链提升数学和逻辑问题解决能力，但也引入了新的安全风险（如中间推理路径中的不安全行为）。现有基于SFT的安全对齐方法效果不一致且可能损害推理能力，因此需要更鲁棒的方法。

Result: 实验显示RL在多模型家族和基准测试中实现了更强的安全性，推理能力未受显著影响。通过熵分析，RL能有效抑制不安全推理行为。

Insight: RL框架更适合LRMs的安全对齐，因其能动态优化策略，平衡安全性和推理能力。SFT的静态训练方式难以应对复杂推理路径中的安全问题。

Abstract: Large reasoning models (LRMs) extend large language models by generating explicit chain-of-thought (CoT) reasoning, significantly improving mathematical and logical problem solving. However, this explicit reasoning process also introduces new safety risks, as unsafe behaviors often emerge within intermediate reasoning trajectories, even when final answers appear harmless. Existing safety alignment approaches primarily rely on supervised fine-tuning (SFT) over safety-oriented long CoT datasets. While intuitive, we find that SFT produces inconsistent safety improvements, degrades reasoning ability, and generalizes poorly across model families. These limitations suggest that purely supervised approaches are insufficient for robust safety alignment in LRMs. To address this, we investigate reinforcement learning (RL) as a complementary optimization framework for LRM safety training. Unlike SFT, RL directly optimizes model policies with reward feedback, enabling more adaptive and stable alignment. Extensive experiments across multiple model families and benchmarks show that RL achieves stronger and more consistent safety gains while maintaining reasoning competence. Further analysis of reflection dynamics and token-level entropy reveals that RL suppresses unsafe exploratory reasoning while preserving reflective depth, leading to safer and more reliable reasoning processes.

[215] BHRAM-IL: A Benchmark for Hallucination Recognition and Assessment in Multiple Indian Languages cs.CL | cs.AI | cs.ETPDF

Hrishikesh Terdalkar, Kirtan Bhojani, Aryan Dongare, Omm Aditya Behera

TL;DR: BHRAM-IL 是一个专注于多印度语言的幻觉识别与评估的基准数据集，涵盖印度语、古吉拉特语、马拉地语、奥里亚语及英语。

Details

Motivation: 尽管幻觉检测在英语中已广泛研究，但在资源匮乏的印度语言中仍缺乏探索。BHRAM-IL 填补了这一空白。

Result: 整体评分为0.23，语言修正模糊评分为0.385，展示了数据集的有效性。

Insight: BHRAM-IL为多语言幻觉检测提供了标准化工具，并揭示了LLM在印度语言中的表现差距。

Abstract: Large language models (LLMs) are increasingly deployed in multilingual applications but often generate plausible yet incorrect or misleading outputs, known as hallucinations. While hallucination detection has been studied extensively in English, under-resourced Indian languages remain largely unexplored. We present BHRAM-IL, a benchmark for hallucination recognition and assessment in multiple Indian languages, covering Hindi, Gujarati, Marathi, Odia, along with English. The benchmark comprises 36,047 curated questions across nine categories spanning factual, numerical, reasoning, and linguistic tasks. We evaluate 14 state-of-the-art multilingual LLMs on a benchmark subset of 10,265 questions, analyzing cross-lingual and factual hallucinations across languages, models, scales, categories, and domains using category-specific metrics normalized to (0,1) range. Aggregation over all categories and models yields a primary score of 0.23 and a language-corrected fuzzy score of 0.385, demonstrating the usefulness of BHRAM-IL for hallucination-focused evaluation. The dataset, and the code for generation and evaluation are available on GitHub (https://github.com/sambhashana/BHRAM-IL/) and HuggingFace (https://huggingface.co/datasets/sambhashana/BHRAM-IL/) to support future research in multilingual hallucination detection and mitigation.

[216] Cross-Lingual Interleaving for Speech Language Models cs.CL | cs.AIPDF

Adel Moumen, Guangzhi Sun, Philip C. Woodland

TL;DR: 论文提出了一种跨语言交错方法，用于混合多语言的语音标记，无需文本监督，并发布了EN-FR训练数据集和评测基准，显著提升了单语语义准确性和跨语言能力。

Details

Motivation: 由于缺乏多语言的评测基准和训练数据，当前的语音语言模型（SLM）主要集中于英语，跨语言学习困难。本文旨在通过跨语言交错方法解决这一问题，推动多语言SLM的发展。

Result: 在360M和1B参数的SLM上，跨语言交错方法提升了单语语义准确性，增强了跨语言延续能力和隐藏状态对齐。

Insight: 跨语言交错是一种简单且可扩展的方法，能够有效构建理解并支持多语言对话的SLM。

Abstract: Spoken Language Models (SLMs) aim to learn linguistic competence directly from speech using discrete units, widening access to Natural Language Processing (NLP) technologies for languages with limited written resources. However, progress has been largely English-centric due to scarce spoken evaluation benchmarks and training data, making cross-lingual learning difficult. We present a cross-lingual interleaving method that mixes speech tokens across languages without textual supervision. We also release an EN-FR training dataset, TinyStories (~42k hours), together with EN-FR spoken StoryCloze and TopicCloze benchmarks for cross-lingual semantic evaluation, both synthetically generated using GPT-4. On 360M and 1B SLMs under matched training-token budgets, interleaving improves monolingual semantic accuracy, enables robust cross-lingual continuation, and strengthens cross-lingual hidden-state alignment. Taken together, these results indicate that cross-lingual interleaving is a simple, scalable route to building multilingual SLMs that understand and converse across languages. All resources will be made open-source to support reproducibility.

[217] Rectifying LLM Thought from Lens of Optimization cs.CL | cs.AIPDF

Junnan Liu, Hongwei Liu, Songyang Zhang, Kai Chen

TL;DR: 该论文提出了RePro方法，通过优化视角改进大型语言模型（LLM）的推理过程，解决了长链思维提示（CoT）中的次优行为。

Details

Motivation: 尽管长链思维提示（CoT）展现了强大的推理能力，但其容易出现过思考和推理链过长的问题，影响性能。论文希望通过优化视角重新设计推理过程。

Result: 实验表明，RePro在数学、科学和编程任务中显著提升了推理性能，减少了次优推理行为。

Insight: 通过优化视角分析推理过程，可以有效改进LLM的推理能力和效率。

Abstract: Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors.

[218] How Far Are We from Genuinely Useful Deep Research Agents? cs.CLPDF

Dingling Zhang, He Zhu, Jincheng Ren, Kangqi Song, Xinran Zhou

TL;DR: 该论文提出了一个名为FINDER的新基准测试和DEFT故障分类法，用于评估和分类深度研究代理在生成综合性报告时的失败模式。研究发现现有系统在证据整合、验证和推理规划方面表现不佳。

Details

Motivation: 现有深度研究代理（DRAs）主要基于问答基准测试，缺乏对生成综合性报告的研究。现有基准测试任务复杂且评价指标主观，无法反映用户需求。

Result: 实验显示，现有DRAs在任务理解方面表现尚可，但在证据整合、验证和推理规划方面存在显著不足。

Insight: 研究表明，提升DRAs的实用性需要重点关注证据处理和推理能力，而非单纯的任务理解。

Abstract: Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs were validated on question-answering benchmarks, while research on generating comprehensive reports remains overlooked. Worse, current benchmarks for report synthesis suffer from task complexity and subjective metrics – this fails to reflect user demands and limits the practical utility of generated reports. To address these gaps, we present Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks with 419 structured checklist items that standardize report structure, analytical depth, and factual grounding. Based on approximately 1,000 reports produced by mainstream DRAs, we further propose Deep rEsearch Failure Taxonomy (DEFT), the first failure taxonomy for deep research agents. DEFT contains 14 fine-grained failure modes across reasoning, retrieval, and generation, and is built upon grounded theory with human-LLM co-annotating and inter-annotator reliability validation. Our experimental findings reveal that current DRAs struggle not with task comprehension but with evidence integration, verification, and reasoning-resilient planning.

[219] The Art of Scaling Test-Time Compute for Large Language Models cs.CLPDF

Aradhye Agarwal, Ayan Sengupta, Tanmoy Chakraborty

TL;DR: 论文探讨了在大型语言模型（LLM）推理过程中动态分配计算资源（即测试时扩展，TTS）对推理能力的提升作用，并通过大规模实验揭示了不同TTS策略的效果与模型类型及问题难度的关系。

Details

Motivation: 现有研究缺乏对不同TTS策略在相同条件下的系统性比较，且模型类型和问题难度对性能的影响尚不明确，因此有必要填补这些空白。

Result: 研究发现：(1) 没有一种TTS策略在所有情况下表现最优；(2) 推理模型在不同难度和跟踪长度下表现截然不同，可分为短期和长期两类；(3) 给定模型类型，TTS性能随计算预算单调提升。

Insight: 论文揭示了模型类型和问题难度对TTS策略选择的关键影响，为优化推理性能提供了实用的方法论。

Abstract: Test-time scaling (TTS) – the dynamic allocation of compute during inference – is a promising direction for improving reasoning in large language models (LLMs). However, a systematic comparison of well-known TTS strategies under identical conditions is missing, and the influence of model type and problem difficulty on performance remains unclear. To address these gaps, we conduct the first large-scale study of TTS, spanning over thirty billion tokens generated using eight open-source LLMs (7B to 235B parameters), across four reasoning datasets. We observe three consistent trends: (1) no single TTS strategy universally dominates; (2) reasoning models exhibit distinct trace-quality patterns across problem difficulty and trace length, forming short-horizon and long-horizon categories; and (3) for a given model type, the optimal TTS performance scales monotonically with compute budget. Based on these insights, we provide a practical recipe for selecting the best TTS strategy, considering problem difficulty, model type, and compute budget, providing a practical guide to effective inference-time scaling.

[220] Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling cs.CL | cs.LGPDF

Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, Song Han

TL;DR: 论文提出了一种名为Four Over Six (4/6)的改进NVFP4量化算法，通过自适应块缩放减少量化误差，显著提高训练和推理性能。

Details

Motivation: 随着大语言模型规模的增大，低精度数值格式（如NVFP4）因速度和内存优势而流行。然而，NVFP4量化可能导致训练发散和推理性能下降，尤其是在处理最大值附近的值时误差较大。4/6算法旨在解决这一问题。

Result: 实验表明，4/6能防止训练发散，使训练损失更接近BF16精度，并易于集成到多种训练后量化方法中，显著提升下游任务准确率。

Insight: 浮点格式的量化误差集中在最大值附近，自适应块缩放能有效改善这一问题。4/6为NVFP4的广泛应用提供了新思路。

Abstract: As large language models have grown larger, low-precision numerical formats such as NVFP4 have become increasingly popular due to the speed and memory benefits they provide. However, to accelerate computation with NVFP4, all matrix multiplication operands–weights and activations in the forward pass, and weights, activations, and gradients in the backward pass–must be quantized to NVFP4, often leading to divergence during training and performance degradation during inference. NVFP4 by evaluating multiple potential scale factors for each block of values. To address this issue, in this work we introduce Four Over Six (4/6), a modification to the NVFP4 quantization algorithm that evaluates two potential scale factors for each block of values. Unlike integer formats, floating-point formats such as FP4 have the most quantization error on near-maximal values in each block, which we find to be primarily responsible for downstream performance degradation. We find that for some blocks, scaling to smaller FP4 values makes the distribution of representable values more uniform, improving representation of near-maximal values. Importantly, 4/6 can be implemented efficiently on NVIDIA Blackwell GPUs, making it viable to use while training LLMs with NVFP4. In pre-training experiments with transformer and hybrid model architectures, we find that 4/6 prevents divergence in several cases, bringing training loss significantly closer to BF16 compared to models trained with current state-of-the-art NVFP4 training recipes. We also find that 4/6 can be easily incorporated into many different post-training quantization methods and generally improves downstream accuracy. We hope this inspires future work in training and deploying models with NVFP4.

eess.IV [Back]

[221] MedCondDiff: Lightweight, Robust, Semantically Guided Diffusion for Medical Image Segmentation eess.IV | cs.AI | cs.CV | cs.LGPDF

Ruirui Huang, Jiacheng Li

TL;DR: MedCondDiff是一种基于扩散模型的医学图像分割框架，通过结合语义先验和轻量级设计，提高了鲁棒性和效率。

Details

Motivation: 医学图像分割需要高效且鲁棒的模型，传统扩散模型计算开销大，MedCondDiff旨在解决这一问题。

Result: 在多器官、多模态数据集上表现优异，性能接近或优于传统扩散模型。

Insight: 语义引导的扩散模型是医学图像分割的有效架构，尤其在效率和鲁棒性方面表现突出。

Abstract: We introduce MedCondDiff, a diffusion-based framework for multi-organ medical image segmentation that is efficient and anatomically grounded. The model conditions the denoising process on semantic priors extracted by a Pyramid Vision Transformer (PVT) backbone, yielding a semantically guided and lightweight diffusion architecture. This design improves robustness while reducing both inference time and VRAM usage compared to conventional diffusion models. Experiments on multi-organ, multi-modality datasets demonstrate that MedCondDiff delivers competitive performance across anatomical regions and imaging modalities, underscoring the potential of semantically guided diffusion models as an effective class of architectures for medical imaging tasks.

[222] Disentangling Progress in Medical Image Registration: Beyond Trend-Driven Architectures towards Domain-Specific Strategies eess.IV | cs.CVPDF

Bailiang Jian, Jiazhen Pan, Rohit Jena, Morteza Ghahremani, Hongwei Bran Li

TL;DR: 该论文探讨了医学图像配准中低层通用计算模块与高层领域特定设计的影响，发现后者对性能提升更为关键。

Details

Motivation: 近年来，医学图像配准领域常借鉴计算机视觉中的通用架构（如大核CNN、Transformer等），但这些模块的贡献与领域特定设计的相对重要性尚不明确。

Result: 领域特定设计显著提升了标准U-Net基线的性能（平均相对改进约3%），超越了通用模块的效果。

Insight: 未来医学图像配准研究应更关注领域特定设计，而非盲目追随通用架构趋势。

Abstract: Medical image registration drives quantitative analysis across organs, modalities, and patient populations. Recent deep learning methods often combine low-level “trend-driven” computational blocks from computer vision, such as large-kernel CNNs, Transformers, and state-space models, with high-level registration-specific designs like motion pyramids, correlation layers, and iterative refinement. Yet, their relative contributions remain unclear and entangled. This raises a central question: should future advances in registration focus on importing generic architectural trends or on refining domain-specific design principles? Through a modular framework spanning brain, lung, cardiac, and abdominal registration, we systematically disentangle the influence of these two paradigms. Our evaluation reveals that low-level “trend-driven” computational blocks offer only marginal or inconsistent gains, while high-level registration-specific designs consistently deliver more accurate, smoother, and more robust deformations. These domain priors significantly elevate the performance of a standard U-Net baseline, far more than variants incorporating “trend-driven” blocks, achieving an average relative improvement of $\sim3%$. All models and experiments are released within a transparent, modular benchmark that enables plug-and-play comparison for new architectures and registration tasks (https://github.com/BailiangJ/rethink-reg). This dynamic and extensible platform establishes a common ground for reproducible and fair evaluation, inviting the community to isolate genuine methodological contributions from domain priors. Our findings advocate a shift in research emphasis: from following architectural trends to embracing domain-specific design principles as the true drivers of progress in learning-based medical image registration.

cs.NI [Back]

[223] LM4Opt-RA: A Multi-Candidate LLM Framework with Structured Ranking for Automating Network Resource Allocation cs.NI | cs.AI | cs.CLPDF

Tasnim Ahmed, Siana Rizwan, Naveed Ejaz, Salimur Choudhury

TL;DR: 本文提出了一种基于多候选LLM框架（LM4Opt-RA）的方法，结合结构化排名机制，用于自动化网络资源分配问题。通过引入NL4RA数据集和LLM辅助数学评估（LAME）指标，提升了复杂数学建模任务的性能。

Details

Motivation: 现有的LLM方法和基准数据集难以处理动态环境下资源分配的复杂性，需要一种更高效的自动化解决方案。

Result: 在LAME评分中，Llama-3.1-70B得分0.8007，显著优于其他模型，但仍不及人类专家水平。

Insight: 1）数学任务的评估需专用指标（如LAME）；2）结构化排名和多候选策略能显著提升LLM性能。

Abstract: Building on advancements in Large Language Models (LLMs), we can tackle complex analytical and mathematical reasoning tasks requiring nuanced contextual understanding. A prime example of such complex tasks is modelling resource allocation optimization in networks, which extends beyond translating natural language inputs into mathematical equations or Linear Programming (LP), Integer Linear Programming (ILP), and Mixed-Integer Linear Programming (MILP) models. However, existing benchmarks and datasets cannot address the complexities of such problems with dynamic environments, interdependent variables, and heterogeneous constraints. To address this gap, we introduce NL4RA, a curated dataset comprising 50 resource allocation optimization problems formulated as LP, ILP, and MILP. We then evaluate the performance of well-known open-source LLMs with varying parameter counts. To enhance existing LLM based methods, we introduce LM4Opt RA, a multi candidate framework that applies diverse prompting strategies such as direct, few shot, and chain of thought, combined with a structured ranking mechanism to improve accuracy. We identified discrepancies between human judgments and automated scoring such as ROUGE, BLEU, or BERT scores. However, human evaluation is time-consuming and requires specialized expertise, making it impractical for a fully automated end-to-end framework. To quantify the difference between LLM-generated responses and ground truth, we introduce LLM-Assisted Mathematical Evaluation (LAME), an automated metric designed for mathematical formulations. Using LM4Opt-RA, Llama-3.1-70B achieved a LAME score of 0.8007, outperforming other models by a significant margin, followed closely by Llama-3.1-8B. While baseline LLMs demonstrate considerable promise, they still lag behind human expertise; our proposed method surpasses these baselines regarding LAME and other metrics.

cs.AR [Back]

[224] Ternary-Input Binary-Weight CNN Accelerator Design for Miniature Object Classification System with Query-Driven Spatial DVS cs.AR | cs.CV | eess.IVPDF

Yuyang Li, Swasthik Muloor, Jack Laudati, Nickolas Dematteis, Yidam Park

TL;DR: 提出了一种面向微型成像系统的CNN硬件加速器设计，优化目标和功耗，通过三元DVS输出和三元输入二值权重神经网络减少计算和内存需求。

Details

Motivation: 微型成像系统受限于内存和功耗约束，传统机器学习方法能耗过高，无法满足小型电池的需求。

Result: 在28 nm CMOS工艺下，数据尺寸减少81%，MAC操作减少27%，功耗1.6 mW，推理时间440 ms，FoM提升7.3倍。

Insight: 三元输入二值权重网络和DVS传感器重构技术为微型系统提供了高效的硬件加速解决方案。

Abstract: Miniature imaging systems are essential for space-constrained applications but are limited by memory and power constraints. While machine learning can reduce data size by extracting key features, its high energy demands often exceed the capacity of small batteries. This paper presents a CNN hardware accelerator optimized for object classification in miniature imaging systems. It processes data from a spatial Dynamic Vision Sensor (DVS), reconfigurable to a temporal DVS via pixel sharing, minimizing sensor area. By using ternary DVS outputs and a ternary-input, binary-weight neural network, the design reduces computation and memory needs. Fabricated in 28 nm CMOS, the accelerator cuts data size by 81% and MAC operations by 27%. It achieves 440 ms inference time at just 1.6 mW power consumption, improving the Figure-of-Merit (FoM) by 7.3x over prior CNN accelerators for miniature systems.

cs.RO [Back]

[225] A Comprehensive Survey on Surgical Digital Twin cs.RO | cs.AI | cs.CVPDF

Afsah Sharaf Khan, Falong Fan, Doohwan DH Kim, Abdurrahman Alshareef, Dong Chen

TL;DR: 本文对手术数字孪生（SDT）进行了全面综述，梳理了其定义、分类、技术挑战和研究进展，并提出了未来的研究方向。

Details

Motivation: 随着多模态手术数据和实时计算的快速发展，SDT作为一种虚拟镜像技术，能够在术前、术中和术后为医疗决策提供支持。然而，其在实际应用中仍面临诸多挑战，如数据融合、计算效率、鲁棒性和隐私合规等问题。

Result: 本文系统化了SDT的研究现状和未来方向，为设计和部署SDT提供了指导。

Insight: SDT的发展需要解决技术挑战（如实时性和鲁棒性）和临床需求（如隐私和合规性）的多重平衡，同时标准化和数据治理是实现临床转化的关键。

Abstract: With the accelerating availability of multimodal surgical data and real-time computation, Surgical Digital Twins (SDTs) have emerged as virtual counterparts that mirror, predict, and inform decisions across pre-, intra-, and postoperative care. Despite promising demonstrations, SDTs face persistent challenges: fusing heterogeneous imaging, kinematics, and physiology under strict latency budgets; balancing model fidelity with computational efficiency; ensuring robustness, interpretability, and calibrated uncertainty; and achieving interoperability, privacy, and regulatory compliance in clinical environments. This survey offers a critical, structured review of SDTs. We clarify terminology and scope, propose a taxonomy by purpose, model fidelity, and data sources, and synthesize state-of-the-art achievements in deformable registration and tracking, real-time simulation and co-simulation, AR/VR guidance, edge-cloud orchestration, and AI for scene understanding and prediction. We contrast non-robotic twins with robot-in-the-loop architectures for shared control and autonomy, and identify open problems in validation and benchmarking, safety assurance and human factors, lifecycle “digital thread” integration, and scalable data governance. We conclude with a research agenda toward trustworthy, standards-aligned SDTs that deliver measurable clinical benefit. By unifying vocabulary, organizing capabilities, and highlighting gaps, this work aims to guide SDT design and deployment and catalyze translation from laboratory prototypes to routine surgical care.

[226] Foundation Models for Trajectory Planning in Autonomous Driving: A Review of Progress and Open Challenges cs.RO | cs.CVPDF

Kemal Oksuz, Alexandru Buburuzan, Anthony Knittel, Yuhan Yao, Puneet K. Dokania

TL;DR: 本文综述了多模态基础模型在自动驾驶轨迹规划中的应用，总结了37种最新方法的设计选择、优势和局限性，并探讨了开源代码和数据集的可用性。

Details

Motivation: 自动驾驶技术从传统的手工设计转向基于基础模型的统一方法，能够直接从原始感官输入推断运动轨迹，自然语言作为额外模态的引入进一步扩展了应用场景。

Result: 总结了基础模型在自动驾驶轨迹规划中的进展和挑战，强调了开源代码和数据集的不足。

Insight: 基础模型为自动驾驶轨迹规划提供了统一且高效的解决方案，但开源资源的缺乏可能阻碍进一步研究。

Abstract: The emergence of multi-modal foundation models has markedly transformed the technology for autonomous driving, shifting away from conventional and mostly hand-crafted design choices towards unified, foundation-model-based approaches, capable of directly inferring motion trajectories from raw sensory inputs. This new class of methods can also incorporate natural language as an additional modality, with Vision-Language-Action (VLA) models serving as a representative example. In this review, we provide a comprehensive examination of such methods through a unifying taxonomy to critically evaluate their architectural design choices, methodological strengths, and their inherent capabilities and limitations. Our survey covers 37 recently proposed approaches that span the landscape of trajectory planning with foundation models. Furthermore, we assess these approaches with respect to the openness of their source code and datasets, offering valuable information to practitioners and researchers. We provide an accompanying webpage that catalogs the methods based on our taxonomy, available at: https://github.com/fiveai/FMs-for-driving-trajectories

[227] Learning from Watching: Scalable Extraction of Manipulation Trajectories from Human Videos cs.RO | cs.CVPDF

X. Hu, G. Ye

TL;DR: 该论文提出了一种结合大型视频理解基础模型与点跟踪技术的方法，从人类操作视频中提取密集轨迹，以更全面地利用互联网规模的示范视频，实现更高效、可扩展的机器人学习。

Details

Motivation: 现有的数据收集方法依赖真实机器人平台，成本高且劳动密集；而现有基于人类视频的方法主要关注手部检测或物体姿态估计，未能充分利用视频中丰富的交互信息。

Result: 实验表明，该方法能够准确跟踪整个操作过程中的关键点，为机器人学习提供了更具扩展性和数据效率的途径。

Insight: 通过利用互联网视频资源，可以低成本地获取大量高质量操作数据，从而推动机器人学习的进一步发展。

Abstract: Collecting high-quality data for training large-scale robotic models typically relies on real robot platforms, which is labor-intensive and costly, whether via teleoperation or scripted demonstrations. To scale data collection, many researchers have turned to leveraging human manipulation videos available online. However, current methods predominantly focus on hand detection or object pose estimation, failing to fully exploit the rich interaction cues embedded in these videos. In this work, we propose a novel approach that combines large foundation models for video understanding with point tracking techniques to extract dense trajectories of all task-relevant keypoints during manipulation. This enables more comprehensive utilization of Internet-scale human demonstration videos. Experimental results demonstrate that our method can accurately track keypoints throughout the entire manipulation process, paving the way for more scalable and data-efficient robot learning.

Nivedan Yakolli, Avinash Gautam, Abhijit Das, Yuankai Qi, Virendra Singh Shekhawat

TL;DR: 本文是一篇关于通过视觉与语言导航（VLN）提升人机协作的调查论文，总结了200多篇相关研究，提出了未来VLN系统的改进方向，包括双向通信、多机器人协作和动态决策。

Details

Motivation: 当前的VLN模型在多机器人协作和复杂环境中存在局限性，如双向通信不足、歧义处理和协作决策能力较弱。本文旨在通过综述现有研究，推动VLN与机器人技术的结合，提升人机交互能力。

Result: 研究发现，未来的VLN系统需要结合先进的自然语言理解和动态角色分配，以提升多机器人协作效率。

Insight: VLN技术的进一步发展依赖于跨模态理解和动态协作能力的提升，特别是在实时交互和复杂环境中的应用潜力巨大。

Abstract: Vision-and-Language Navigation (VLN) is a multi-modal, cooperative task requiring agents to interpret human instructions, navigate 3D environments, and communicate effectively under ambiguity. This paper presents a comprehensive review of recent VLN advancements in robotics and outlines promising directions to improve multi-robot coordination. Despite progress, current models struggle with bidirectional communication, ambiguity resolution, and collaborative decision-making in the multi-agent systems. We review approximately 200 relevant articles to provide an in-depth understanding of the current landscape. Through this survey, we aim to provide a thorough resource that inspires further research at the intersection of VLN and robotics. We advocate that the future VLN systems should support proactive clarification, real-time feedback, and contextual reasoning through advanced natural language understanding (NLU) techniques. Additionally, decentralized decision-making frameworks with dynamic role assignment are essential for scalable, efficient multi-robot collaboration. These innovations can significantly enhance human-robot interaction (HRI) and enable real-world deployment in domains such as healthcare, logistics, and disaster response.

Yanjia Huang, Xianshun Jiang, Xiangbo Gao, Mingyang Wu, Zhengzhong Tu

TL;DR: VISTAv2提出了一种生成式世界模型，用于室内视觉语言导航任务，通过在线预测未来视图并生成价值地图，结合动作和语言指令进行规划，提升了导航的鲁棒性和效率。

Details

Motivation: 现有的视觉语言导航方法在多模态任务中面临挑战，尤其是在连续真实空间中的在线规划和动作条件预测方面表现不足。VISTAv2旨在填补这一缺口，通过生成式模型预测未来视图并生成明确的规划价值。

Result: 在MP3D和RoboTHOR基准测试中，VISTAv2表现优于基线方法，证明了动作条件预测、指令引导融合和在线价值地图的关键作用。

Insight: VISTAv2为解决视觉语言导航任务提供了一种实用且可解释的方案，强调了动作条件和多模态融合的重要性。

Abstract: Vision-and-Language Navigation (VLN) requires agents to follow language instructions while acting in continuous real-world spaces. Prior image imagination based VLN work shows benefits for discrete panoramas but lacks online, action-conditioned predictions and does not produce explicit planning values; moreover, many methods replace the planner with long-horizon objectives that are brittle and slow. To bridge this gap, we propose VISTAv2, a generative world model that rolls out egocentric future views conditioned on past observations, candidate action sequences, and instructions, and projects them into an online value map for planning. Unlike prior approaches, VISTAv2 does not replace the planner. The online value map is fused at score level with the base objective, providing reachability and risk-aware guidance. Concretely, we employ an action-aware Conditional Diffusion Transformer video predictor to synthesize short-horizon futures, align them with the natural language instruction via a vision-language scorer, and fuse multiple rollouts in a differentiable imagination-to-value head to output an imagined egocentric value map. For efficiency, rollouts occur in VAE latent space with a distilled sampler and sparse decoding, enabling inference on a single consumer GPU. Evaluated on MP3D and RoboTHOR, VISTAv2 improves over strong baselines, and ablations show that action-conditioned imagination, instruction-guided value fusion, and the online value-map planner are all critical, suggesting that VISTAv2 offers a practical and interpretable route to robust VLN.

[230] Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning cs.RO | cs.CVPDF

Qiwei Liang, Boyang Cai, Minghao Lai, Sitong Zhuang, Tao Lin

TL;DR: AFRO是一个自监督框架，通过生成扩散过程和联合建模正逆向动力学，学习动态感知的3D表示，显著提升了机器人在仿真和现实任务中的操作成功率。

Details

Motivation: 当前3D视觉预训练方法在机器人操作任务中表现不佳，主要原因是缺乏状态-动作-状态动态建模和显式几何重建的冗余。

Result: AFRO在16个仿真和4个现实任务中显著提升操作成功率，优于现有预训练方法，且能随着数据量和任务复杂性扩展。

Insight: AFRO学习的特征具有语义丰富性和判别性，为机器人3D表示学习提供了有效的预训练解决方案。

Abstract: Despite strong results on recognition and segmentation, current 3D visual pre-training methods often underperform on robotic manipulation. We attribute this gap to two factors: the lack of state-action-state dynamics modeling and the unnecessary redundancy of explicit geometric reconstruction. We introduce AFRO, a self-supervised framework that learns dynamics-aware 3D representations without action or reconstruction supervision. AFRO casts state prediction as a generative diffusion process and jointly models forward and inverse dynamics in a shared latent space to capture causal transition structure. To prevent feature leakage in action learning, we employ feature differencing and inverse-consistency supervision, improving the quality and stability of visual features. When combined with Diffusion Policy, AFRO substantially increases manipulation success rates across 16 simulated and 4 real-world tasks, outperforming existing pre-training approaches. The framework also scales favorably with data volume and task complexity. Qualitative visualizations indicate that AFRO learns semantically rich, discriminative features, offering an effective pre-training solution for 3D representation learning in robotics. Project page: https://kolakivy.github.io/AFRO/

[231] Arcadia: Toward a Full-Lifecycle Framework for Embodied Lifelong Learning cs.RO | cs.CVPDF

Minghe Gao, Juncheng Li, Yuze Lin, Xuqi Liu, Jiaming Ji

TL;DR: Arcadia提出了一种全生命周期的框架（Arcadia），通过紧密耦合四个阶段（自主数据采集、生成式场景重建与增强、共享表征架构、仿真反馈与进化），实现了持续的终身学习与端到端泛化，显著提升了导航与操作任务的性能。

Details

Motivation: 现有的具身学习方法通常只优化单一阶段（如数据收集或训练），无法持续改进或泛化到新场景。Arcadia认为具身学习应是一个全生命周期的问题，需要通过闭环系统实现各阶段的紧密耦合。

Result: 实验证明，Arcadia在导航与操作任务上表现优异，并能稳健地迁移到真实机器人，表明其支持终身学习和端到端泛化。

Insight: 具身学习的持续改进依赖于全生命周期的闭环设计，单一阶段的优化不足以实现长期泛化和适应能力。

Abstract: We contend that embodied learning is fundamentally a lifecycle problem rather than a single-stage optimization. Systems that optimize only one link (data collection, simulation, learning, or deployment) rarely sustain improvement or generalize beyond narrow settings. We introduce Arcadia, a closed-loop framework that operationalizes embodied lifelong learning by tightly coupling four stages: (1) Self-evolving exploration and grounding for autonomous data acquisition in physical environments, (2) Generative scene reconstruction and augmentation for realistic and extensible scene creation, (3) a Shared embodied representation architecture that unifies navigation and manipulation within a single multimodal backbone, and (4) Sim-from-real evaluation and evolution that closes the feedback loop through simulation-based adaptation. This coupling is non-decomposable: removing any stage breaks the improvement loop and reverts to one-shot training. Arcadia delivers consistent gains on navigation and manipulation benchmarks and transfers robustly to physical robots, indicating that a tightly coupled lifecycle: continuous real-world data acquisition, generative simulation update, and shared-representation learning, supports lifelong improvement and end-to-end generalization. We release standardized interfaces enabling reproducible evaluation and cross-model comparison in reusable environments, positioning Arcadia as a scalable foundation for general-purpose embodied agents.

[232] RealAppliance: Let High-fidelity Appliance Assets Controllable and Workable as Aligned Real Manuals cs.RO | cs.AI | cs.CVPDF

Yuzheng Gao, Yuxing Long, Lei Kang, Yuchong Guo, Ziyan Yu

TL;DR: 论文介绍了RealAppliance数据集和RealAppliance-Bench基准测试，旨在解决现有家电资产渲染质量低、机制不完整以及与手册不对齐的问题。

Details

Motivation: 现有的家电资产存在渲染质量差、机制不完整以及与手册不对齐的问题，导致模拟与现实之间的差距阻碍了家电操作的发展。

Result: 模型在RealAppliance-Bench上的表现分析为家电操作研究提供了深入见解。

Insight: 高保真数据集和全面的基准测试有助于缩小模拟与现实之间的差距，推动家电操作规划技术的发展。

Abstract: Existing appliance assets suffer from poor rendering, incomplete mechanisms, and misalignment with manuals, leading to simulation-reality gaps that hinder appliance manipulation development. In this work, we introduce the RealAppliance dataset, comprising 100 high-fidelity appliances with complete physical, electronic mechanisms, and program logic aligned with their manuals. Based on these assets, we propose the RealAppliance-Bench benchmark, which evaluates multimodal large language models and embodied manipulation planning models across key tasks in appliance manipulation planning: manual page retrieval, appliance part grounding, open-loop manipulation planning, and closed-loop planning adjustment. Our analysis of model performances on RealAppliance-Bench provides insights for advancing appliance manipulation research

[233] MILE: A Mechanically Isomorphic Exoskeleton Data Collection System with Fingertip Visuotactile Sensing for Dexterous Manipulation cs.RO | cs.CV | cs.HCPDF

Jinda Du, Jieji Ren, Qiaojun Yu, Ningbin Zhang, Yu Deng

TL;DR: MILE是一个机械同构的外骨骼数据收集系统，通过指尖视觉触觉传感实现灵巧操作，解决了现有数据收集系统中的运动重定向不精确、效率低和缺失高分辨率触觉信号的问题。

Details

Motivation: 模仿学习在灵巧手操作中具有潜力，但缺乏大规模高保真数据限制了其效果。现有数据收集系统存在运动重定向不准确、效率低和缺失高分辨率触觉信号等问题。

Result: 外骨骼的多关节平均绝对角度误差低于1度；遥操作成功率提升64%；结合触觉信号后，成功率比仅用视觉的基线平均提升25%。

Insight: 机械同构设计避免了非线性重定向问题，而高分辨率触觉信号显著提升了操作的精度和成功率，验证了其在复杂操作任务中的实用性。

Abstract: Imitation learning provides a promising approach to dexterous hand manipulation, but its effectiveness is limited by the lack of large-scale, high-fidelity data. Existing data-collection pipelines suffer from inaccurate motion retargeting, low data-collection efficiency, and missing high-resolution fingertip tactile sensing. We address this gap with MILE, a mechanically isomorphic teleoperation and data-collection system co-designed from human hand to exoskeleton to robotic hand. The exoskeleton is anthropometrically derived from the human hand, and the robotic hand preserves one-to-one joint-position isomorphism, eliminating nonlinear retargeting and enabling precise, natural control. The exoskeleton achieves a multi-joint mean absolute angular error below one degree, while the robotic hand integrates compact fingertip visuotactile modules that provide high-resolution tactile observations. Built on this retargeting-free interface, we teleoperate complex, contact-rich in-hand manipulation and efficiently collect a multimodal dataset comprising high-resolution fingertip visuotactile signals, RGB-D images, and joint positions. The teleoperation pipeline achieves a mean success rate improvement of 64%. Incorporating fingertip tactile observations further increases the success rate by an average of 25% over the vision-only baseline, validating the fidelity and utility of the dataset. Further details are available at: https://sites.google.com/view/mile-system.

[234] Sign Language Recognition using Bidirectional Reservoir Computing cs.RO | cs.CVPDF

Nitin Kumar Singh, Arie Rachmad Syulistyo, Yuichiro Tanaka, Hakaru Tamukoh

TL;DR: 论文提出了一种基于双向储层计算(BRC)的手语识别(SLR)方法，结合MediaPipe提取的手部关节点坐标作为输入，显著降低了计算资源需求，适用于资源受限的边缘设备。

Details

Motivation: 现有基于深度学习的手语识别方法计算复杂度高，不适合资源受限设备，因此需要一种高效且低资源消耗的解决方案。

Result: 在WLASL数据集上达到57.71%的准确率，训练时间仅9秒，远低于Bi-GRU方法的55分钟38秒。

Insight: BRC架构在低资源需求下仍能有效捕捉时序依赖性，为边缘设备上的实时手语识别提供了可行方案。

Abstract: Sign language recognition (SLR) facilitates communication between deaf and hearing individuals. Deep learning is widely used to develop SLR-based systems; however, it is computationally intensive and requires substantial computational resources, making it unsuitable for resource-constrained devices. To address this, we propose an efficient sign language recognition system using MediaPipe and an echo state network (ESN)-based bidirectional reservoir computing (BRC) architecture. MediaPipe extracts hand joint coordinates, which serve as inputs to the ESN-based BRC architecture. The BRC processes these features in both forward and backward directions, efficiently capturing temporal dependencies. The resulting states of BRC are concatenated to form a robust representation for classification. We evaluated our method on the Word-Level American Sign Language (WLASL) video dataset, achieving a competitive accuracy of 57.71% and a significantly lower training time of only 9 seconds, in contrast to the 55 minutes and $38$ seconds required by the deep learning-based Bi-GRU approach. Consequently, the BRC-based SLR system is well-suited for edge devices.

Thomas Chabal, Shizhe Chen, Jean Ponce, Cordelia Schmid

TL;DR: FOM-Nav提出了一种模块化框架，通过结合Frontier-Object Maps和视觉语言模型，解决了Object Goal Navigation问题，提升了探索效率和导航性能。

Details

Motivation: 现有的隐式内存方法难以保留长期记忆和规划，而显式地图方法缺乏丰富的语义信息。FOM-Nav旨在弥补这些不足。

Result: FOM-Nav在MP3D和HM3D基准上取得了SOTA性能，尤其在导航效率指标SPL上表现突出，并在真实机器人上验证了可行性。

Insight: 多模态场景理解和分层规划的结合显著提升了导航效率和目标定位能力。

Abstract: This paper addresses the Object Goal Navigation problem, where a robot must efficiently find a target object in an unknown environment. Existing implicit memory-based methods struggle with long-term memory retention and planning, while explicit map-based approaches lack rich semantic information. To address these challenges, we propose FOM-Nav, a modular framework that enhances exploration efficiency through Frontier-Object Maps and vision-language models. Our Frontier-Object Maps are built online and jointly encode spatial frontiers and fine-grained object information. Using this representation, a vision-language model performs multimodal scene understanding and high-level goal prediction, which is executed by a low-level planner for efficient trajectory generation. To train FOM-Nav, we automatically construct large-scale navigation datasets from real-world scanned environments. Extensive experiments validate the effectiveness of our model design and constructed dataset. FOM-Nav achieves state-of-the-art performance on the MP3D and HM3D benchmarks, particularly in navigation efficiency metric SPL, and yields promising results on a real robot.

[236] Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer cs.RO | cs.CVPDF

Haoru Xue, Tairan He, Zi Wang, Qingwei Ben, Wenli Xiao

TL;DR: 论文提出了一种师生引导学习框架，通过视觉感知实现仿人机器人的运动与操作策略迁移，利用阶段性重置探索策略和GRPO微调方法提升了仿真到现实RL的性能。

Details

Motivation: 近年来GPU加速的光照真实仿真为机器人学习提供了可扩展的数据生成路径，但如何在仿真环境中训练的策略能在现实世界中零样本（zero-shot）表现稳健仍是一个挑战。论文旨在解决这一问题，特别是在高难度的铰接物体交互任务中。

Result: 论文提出的策略在多样化门类型任务中实现了稳健的零样本性能，任务完成时间比人类遥控操作快31.7%。

Insight: 1. 大规模物理和视觉随机化是仿真训练策略泛化到现实的关键；
2. 阶段性重置探索和GRPO微调是提升长时程任务稳定性的有效手段；
3. 仅RGB感知的策略可以完成复杂的铰接物体操作任务，展示了仿真到现实的潜力。

Abstract: Recent progress in GPU-accelerated, photorealistic simulation has opened a scalable data-generation path for robot learning, where massive physics and visual randomization allow policies to generalize beyond curated environments. Building on these advances, we develop a teacher-student-bootstrap learning framework for vision-based humanoid loco-manipulation, using articulated-object interaction as a representative high-difficulty benchmark. Our approach introduces a staged-reset exploration strategy that stabilizes long-horizon privileged-policy training, and a GRPO-based fine-tuning procedure that mitigates partial observability and improves closed-loop consistency in sim-to-real RL. Trained entirely on simulation data, the resulting policy achieves robust zero-shot performance across diverse door types and outperforms human teleoperators by up to 31.7% in task completion time under the same whole-body control stack. This represents the first humanoid sim-to-real policy capable of diverse articulated loco-manipulation using pure RGB perception.

Fei Liu, Shichao Xie, Minghua Luo, Zedong Chu, Junjun Hu

TL;DR: NavForesee提出了一个统一的视觉-语言世界模型，结合高层次语言规划和预测性世界模型想象，显著提升了长视野任务的导航性能。

Details

Motivation: 现有的导航代理在复杂自然语言指令和未见环境的长视野任务中表现不佳，缺乏鲁棒的长时规划和预测能力。

Result: 在R2R-CE和RxR-CE基准测试中取得了极具竞争力的性能。

Insight: 显式语言规划与隐式时空预测的结合是提升具身智能导航能力的关键。

Abstract: Embodied navigation for long-horizon tasks, guided by complex natural language instructions, remains a formidable challenge in artificial intelligence. Existing agents often struggle with robust long-term planning about unseen environments, leading to high failure rates. To address these limitations, we introduce NavForesee, a novel Vision-Language Model (VLM) that unifies high-level language planning and predictive world model imagination within a single, unified framework. Our approach empowers a single VLM to concurrently perform planning and predictive foresight. Conditioned on the full instruction and historical observations, the model is trained to understand the navigation instructions by decomposing the task, tracking its progress, and formulating the subsequent sub-goal. Simultaneously, it functions as a generative world model, providing crucial foresight by predicting short-term environmental dynamics and long-term navigation milestones. The VLM’s structured plan guides its targeted prediction, while the imagined future provides rich context to inform the navigation actions, creating a powerful internal feedback loop of perception-planning/prediction-action. We demonstrate through extensive experiments on the R2R-CE and RxR-CE benchmark that NavForesee achieves highly competitive performance in complex scenarios. Our work highlights the immense potential of fusing explicit language planning with implicit spatiotemporal prediction, paving the way for more intelligent and capable embodied agents.

[238] Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models cs.RO | cs.CVPDF

Paul Pacaud, Ricardo Garcia, Shizhe Chen, Cordelia Schmid

TL;DR: 本文提出了Guardian，一种利用视觉语言模型（VLM）检测机器人规划与执行错误的方法。通过自动合成机器人失败数据，作者构建了三个新的失败检测基准，并训练了Guardian模型，显著提升了任务成功率。

Details

Motivation: 现有的视觉语言模型在机器人失败检测中的准确性和泛化能力受限于失败数据的稀缺性，为了解决这一问题，作者提出了一种自动生成失败数据的方法。

Result: Guardian在现有和新基准上均表现优异，同时在实际机器人系统中显著提高了任务成功率。

Insight: 自动合成失败数据可以有效解决数据稀缺问题，而结合多视角推理的VLM在机器人失败检测中具有显著潜力。

Abstract: Robust robotic manipulation requires reliable failure detection and recovery. Although current Vision-Language Models (VLMs) show promise, their accuracy and generalization are limited by the scarcity of failure data. To address this data gap, we propose an automatic robot failure synthesis approach that procedurally perturbs successful trajectories to generate diverse planning and execution failures. This method produces not only binary classification labels but also fine-grained failure categories and step-by-step reasoning traces in both simulation and the real world. With it, we construct three new failure detection benchmarks: RLBench-Fail, BridgeDataV2-Fail, and UR5-Fail, substantially expanding the diversity and scale of existing failure datasets. We then train Guardian, a VLM with multi-view images for detailed failure reasoning and detection. Guardian achieves state-of-the-art performance on both existing and newly introduced benchmarks. It also effectively improves task success rates when integrated into a state-of-the-art manipulation system in simulation and real robots, demonstrating the impact of our generated failure data.

[239] RoaD: Rollouts as Demonstrations for Closed-Loop Supervised Fine-Tuning of Autonomous Driving Policies cs.RO | cs.AI | cs.CV | cs.LGPDF

Guillermo Garcia-Cobo, Maximilian Igl, Peter Karkus, Zhejun Zhang, Michael Watson

TL;DR: RoaD提出了一种利用策略自身的闭环rollout数据作为额外训练样本的方法，通过专家引导生成高质量轨迹，有效解决了自动驾驶策略在闭环部署中因覆盖偏移导致的误差累积问题。

Details

Motivation: 现有的自动驾驶策略通常通过开环行为克隆人类示范进行训练，但在闭环部署中容易因覆盖偏移导致误差累积。RoaD旨在通过闭环rollout数据增强训练样本，提升策略的鲁棒性。

Result: 在WOSAC和AlpaSim两个仿真平台上，RoaD表现优于或接近现有方法，驾驶分数提高41%，碰撞减少54%。

Insight: 闭环rollout数据可以有效缓解覆盖偏移问题，专家引导则确保了数据的质量和实用性，为自动驾驶策略的鲁棒性提供了新思路。

Abstract: Autonomous driving policies are typically trained via open-loop behavior cloning of human demonstrations. However, such policies suffer from covariate shift when deployed in closed loop, leading to compounding errors. We introduce Rollouts as Demonstrations (RoaD), a simple and efficient method to mitigate covariate shift by leveraging the policy’s own closed-loop rollouts as additional training data. During rollout generation, RoaD incorporates expert guidance to bias trajectories toward high-quality behavior, producing informative yet realistic demonstrations for fine-tuning. This approach enables robust closed-loop adaptation with orders of magnitude less data than reinforcement learning, and avoids restrictive assumptions of prior closed-loop supervised fine-tuning (CL-SFT) methods, allowing broader applications domains including end-to-end driving. We demonstrate the effectiveness of RoaD on WOSAC, a large-scale traffic simulation benchmark, where it performs similar or better than the prior CL-SFT method; and in AlpaSim, a high-fidelity neural reconstruction-based simulator for end-to-end driving, where it improves driving score by 41% and reduces collisions by 54%.

cs.SD [Back]

[240] MoLT: Mixture of Layer-Wise Tokens for Efficient Audio-Visual Learning cs.SD | cs.CV | cs.MMPDF

Kyeongha Rho, Hyeongkeun Lee, Jae Won Cho, Joon Son Chung

TL;DR: 论文提出了MoLT（Mixture of Layer-Wise Tokens），一种高效参数和内存的音视频学习框架，通过并行轻量级方案替代传统的逐层适应，仅从深层提取和融合分层标记，提升了性能和效率。

Details

Motivation: 传统的逐层音视频学习方法计算和内存开销大，MoLT通过优化适应策略，减少冗余和错误传播，提高了模型的效率和性能。

Result: 在多项音视频任务（如音视频问答、分割和事件定位）上，MoLT优于现有方法，证明了其高效性和性能优势。

Insight: 深层特征更稳定且信息丰富，MoLT通过专注于深层提取和动态融合，实现了性能与效率的平衡，为正交正则化在多模态学习中的应用提供了新的视角。

Abstract: In this paper, we propose Mixture of Layer-Wise Tokens (MoLT), a parameter- and memory-efficient adaptation framework for audio-visual learning. The key idea of MoLT is to replace conventional, computationally heavy sequential adaptation at every transformer layer with a parallel, lightweight scheme that extracts and fuses layer-wise tokens only from the late layers. We adopt two types of adapters to distill modality-specific information and cross-modal interaction into compact latent tokens in a layer-wise manner. A token fusion module then dynamically fuses these layer-wise tokens by taking into account their relative significance. To prevent the redundancy of latent tokens, we apply an orthogonality regularization between latent tokens during training. Through the systematic analysis of the position of adaptation in the pre-trained transformers, we extract latent tokens only from the late layers of the transformers. This strategic adaptation approach avoids error propagation from the volatile early-layer features, thereby maximizing the adaptation performance while maintaining parameter and memory efficiency. Through extensive experiments, we demonstrate that MoLT outperforms existing methods on diverse audio-visual benchmarks, including Audio-Visual Question Answering, Audio-Visual Segmentation, and Audio-Visual Event Localization.

Jiaying Hong, Ting Zhu, Thanet Markchom, Huizhi Liang

TL;DR: 论文提出了Art2Music框架，通过多模态情感对齐从艺术图像生成音乐，利用ArtCaps数据集和轻量级跨模态方法显著提升了感知自然性和频谱保真度。

Details

Motivation: 现有方法依赖昂贵的情感标注，缺乏灵活性，因此需要更灵活的情感对齐方法用于多模态音乐生成。

Result: 实验在Mel-Cepstral Distortion等指标上显著提升，小规模LLM评分验证了情感对齐的跨模态一致性。

Insight: Art2Music在少量训练数据下表现稳健，为情感对齐的创意音频生成提供了可扩展方案。

Abstract: With the rise of AI-generated content (AIGC), generating perceptually natural and feeling-aligned music from multimodal inputs has become a central challenge. Existing approaches often rely on explicit emotion labels that require costly annotation, underscoring the need for more flexible feeling-aligned methods. To support multimodal music generation, we construct ArtiCaps, a pseudo feeling-aligned image-music-text dataset created by semantically matching descriptions from ArtEmis and MusicCaps. We further propose Art2Music, a lightweight cross-modal framework that synthesizes music from artistic images and user comments. In the first stage, images and text are encoded with OpenCLIP and fused using a gated residual module; the fused representation is decoded by a bidirectional LSTM into Mel-spectrograms with a frequency-weighted L1 loss to enhance high-frequency fidelity. In the second stage, a fine-tuned HiFi-GAN vocoder reconstructs high-quality audio waveforms. Experiments on ArtiCaps show clear improvements in Mel-Cepstral Distortion, Frechet Audio Distance, Log-Spectral Distance, and cosine similarity. A small LLM-based rating study further verifies consistent cross-modal feeling alignment and offers interpretable explanations of matches and mismatches across modalities. These results demonstrate improved perceptual naturalness, spectral fidelity, and semantic consistency. Art2Music also maintains robust performance with only 50k training samples, providing a scalable solution for feeling-aligned creative audio generation in interactive art, personalized soundscapes, and digital art exhibitions.

hep-ex [Back]

[242] Panda: Self-distillation of Reusable Sensor-level Representations for High Energy Physics hep-ex | cs.CVPDF

Samuel Young, Kazuhiro Terao

TL;DR: Panda提出了一种直接从原始未标记LArTPC数据中学习可重用传感器级别表征的方法，通过分层稀疏3D编码器和多视图自蒸馏目标显著提高了标签效率和重建质量。

Details

Motivation: 当前高能物理实验中依赖复杂的手工设计算法或任务特定神经网络，这些方法需要大量标记数据和校准过程，效率低下。Panda旨在解决这一问题。

Result: 在模拟数据集上，Panda仅用1/1000标签即超越先前语义分割SOTA；小型预测头也能实现与SOTA相当的粒子识别性能。

Insight: Panda表明直接从原始数据学习表征可显著提升效率和性能，为高能物理实验提供了一种更通用的解决方案。

Abstract: Liquid argon time projection chambers (LArTPCs) provide dense, high-fidelity 3D measurements of particle interactions and underpin current and future neutrino and rare-event experiments. Physics reconstruction typically relies on complex detector-specific pipelines that use tens of hand-engineered pattern recognition algorithms or cascades of task-specific neural networks that require extensive, labeled simulation that requires a careful, time-consuming calibration process. We introduce \textbf{Panda}, a model that learns reusable sensor-level representations directly from raw unlabeled LArTPC data. Panda couples a hierarchical sparse 3D encoder with a multi-view, prototype-based self-distillation objective. On a simulated dataset, Panda substantially improves label efficiency and reconstruction quality, beating the previous state-of-the-art semantic segmentation model with 1,000$\times$ fewer labels. We also show that a single set-prediction head 1/20th the size of the backbone with no physical priors trained on frozen outputs from Panda can result in particle identification that is comparable with state-of-the-art (SOTA) reconstruction tools. Full fine-tuning further improves performance across all tasks.

cs.MM [Back]

[243] Audio-Visual World Models: Towards Multisensory Imagination in Sight and Sound cs.MM | cs.CV | cs.SDPDF

Jiahua Wang, Shannan Yan, Leqi Zheng, Jialong Wu, Yaoxin Mao

TL;DR: 该论文提出了首个音频-视觉世界模型（AVWM）的正式框架，整合了多模态环境模拟，引入了AVW-4k数据集和AV-CDiT模型，实现了高保真的多模态预测，并验证了其在连续音频-视觉导航任务中的实用性。

Details

Motivation: 现有世界模型主要关注视觉观测，但现实世界感知涉及多模态感官（如听觉）。音频提供了空间和时间线索（如声源定位），但其在世界模型中的应用尚未充分探索。本文旨在填补这一空白。

Result: AV-CDiT在视觉和听觉模态上实现了高保真的预测，并在连续音频-视觉导航任务中显著提升了智能体的性能。

Insight: 音频与视觉的联合建模能显著增强智能体的环境理解与任务执行能力，为多模态世界模型的发展提供了新方向。

Abstract: World models simulate environmental dynamics to enable agents to plan and reason about future states. While existing approaches have primarily focused on visual observations, real-world perception inherently involves multiple sensory modalities. Audio provides crucial spatial and temporal cues such as sound source localization and acoustic scene properties, yet its integration into world models remains largely unexplored. No prior work has formally defined what constitutes an audio-visual world model or how to jointly capture binaural spatial audio and visual dynamics under precise action control with task reward prediction. This work presents the first formal framework for Audio-Visual World Models (AVWM), formulating multimodal environment simulation as a partially observable Markov decision process with synchronized audio-visual observations, fine-grained actions, and task rewards. To address the lack of suitable training data, we construct AVW-4k, a dataset comprising 30 hours of binaural audio-visual trajectories with action annotations and reward signals across 76 indoor environments. We propose AV-CDiT, an Audio-Visual Conditional Diffusion Transformer with a novel modality expert architecture that balances visual and auditory learning, optimized through a three-stage training strategy for effective multimodal integration. Extensive experiments demonstrate that AV-CDiT achieves high-fidelity multimodal prediction across visual and auditory modalities with reward. Furthermore, we validate its practical utility in continuous audio-visual navigation tasks, where AVWM significantly enhances the agent’s performance.

cs.LG [Back]

[244] Rep3Net: An Approach Exploiting Multimodal Representation for Molecular Bioactivity Prediction cs.LG | cs.CL | q-bio.QMPDF

Sabrina Islam, Md. Atiqur Rahman, Md. Bakhtiar Hasan, Md. Hasanul Kabir

TL;DR: Rep3Net是一种结合多种模态表示的深度学习架构，用于分子生物活性预测，整合了分子描述符、图表示和SMILES嵌入，显著提升了预测性能。

Details

Motivation: 传统QSAR模型在分子生物活性预测中因无法捕捉化合物的结构和上下文信息而表现不佳，需要一种更全面的表示方法。

Result: 在PARP-1数据集上表现优异，优于传统模型（如GCN、GAT、XGBoost）。

Insight: 多模态表示能够更全面地捕捉分子特性，为药物发现中的计算筛选提供了可扩展的框架。

Abstract: In early stage drug discovery, bioactivity prediction of molecules against target proteins plays a crucial role. Trdaitional QSAR models that utilizes molecular descriptor based data often struggles to predict bioactivity of molecules effectively due to its limitation in capturing structural and contextual information embedded within each compound. To address this challenge, we propose Rep3Net, a unified deep learning architecture that not only incorporates descriptor data but also includes spatial and relational information through graph-based represenation of compounds and contextual information through ChemBERTa generated embeddings from SMILES strings. Our model employing multimodal concatenated features produce reliable bioactivity prediction on Poly [ADP-ribose] polymerase 1 (PARP-1) dataset. PARP-1 is a crucial agent in DNA damage repair and has become a significant theraputic target in malignancies that depend on it for survival and growth. A comprehensive analysis and comparison with conventional standalone models including GCN, GAT, XGBoost, etc. demonstrates that our architecture achieves the highest predictive performance. In computational screening of compounds in drug discovery, our architecture provides a scalable framework for bioactivity prediction.

[245] Statistical NLP for Optimization of Clinical Trial Success Prediction in Pharmaceutical R&D cs.LG | cs.CL | q-bio.QMPDF

Michael R. Doane

TL;DR: 这篇论文提出了一种基于统计NLP的概率分类器，用于预测神经科学领域临床试验的技术和监管成功率（pTRS），以优化制药研发中的资源分配。

Details

Motivation: 制药研发（尤其是神经科学领域）的失败率和成本极高，成功率低于10%。通过及时识别有潜力的项目，可以优化资源分配并降低财务风险。

Result: 在10万多例临床试验数据上，非LLM模型的ROC-AUC为0.64，而BioBERT模型的ROC-AUC提升至0.74，预测误差比行业基准低40%。

Insight: 通过NLP驱动的预测模型可以为制药研发提供更精准的决策支持，特别是在神经科学领域，帮助优化投资和战略规划。

Abstract: This work presents the development and evaluation of an NLP-enabled probabilistic classifier designed to estimate the probability of technical and regulatory success (pTRS) for clinical trials in the field of neuroscience. While pharmaceutical R&D is plagued by high attrition rates and enormous costs, particularly within neuroscience, where success rates are below 10%, timely identification of promising programs can streamline resource allocation and reduce financial risk. Leveraging data from the ClinicalTrials.gov database and success labels from the recently developed Clinical Trial Outcome dataset, the classifier extracts text-based clinical trial features using statistical NLP techniques. These features were integrated into several non-LLM frameworks (logistic regression, gradient boosting, and random forest) to generate calibrated probability scores. Model performance was assessed on a retrospective dataset of 101,145 completed clinical trials spanning 1976-2024, achieving an overall ROC-AUC of 0.64. An LLM-based predictive model was then built using BioBERT, a domain-specific language representation encoder. The BioBERT-based model achieved an overall ROC-AUC of 0.74 and a Brier Score of 0.185, indicating its predictions had, on average, 40% less squared error than would be observed using industry benchmarks. The BioBERT-based model also made trial outcome predictions that were superior to benchmark values 70% of the time overall. By integrating NLP-driven insights into drug development decision-making, this work aims to enhance strategic planning and optimize investment allocation in neuroscience programs.

[246] Towards Active Synthetic Data Generation for Finetuning Language Models cs.LG | cs.CLPDF

Samuel Kessler, Menglin Xia, Daniel Madrigal Diaz, Dongge Han, Helia Heshemi

TL;DR: 论文研究了动态生成合成数据以微调语言模型的方法，证明了基于学生模型状态的闭环生成优于静态生成，并发现简单的主动学习选择标准效果最佳。

Details

Motivation: 现有的静态合成数据生成方法可能不够高效，论文希望通过动态生成数据来提升微调效果。

Result: 在四个数学和逻辑推理数据集上，动态生成方法显著提升了学生模型的性能。

Insight: 动态生成合成数据可以更高效地利用有限的预算，简单的主动学习方法在实际应用中可能优于复杂的设计。

Abstract: A common and effective means for improving language model capabilities involves finetuning a student'' language model's parameters on generations from a more proficient teacher’’ model. Termed ``synthetic data’’, these generations are often produced before any student finetuning, but some work has considered generating new synthetic samples as training progresses. This paper studies and advocates for the latter case, where data are generated in an iterative, closed-loop fashion that is guided by the current state of the student model. For a fixed budget of generated samples, or a budget in terms of compute spent querying a teacher, we show that this curation of finetuning data affords improved student performance over static generation. Further, while there have been several LLM-specific methods proposed that operate in this regime, we find that simple, inexpensive selection criteria from the active learning literature tend to be most performant. We validate these claims across four mathematical and logical reasoning datasets using four different small language models.

[247] Mode-Conditioning Unlocks Superior Test-Time Scaling cs.LG | cs.AI | cs.CLPDF

Chen Henry Wu, Sachin Goyal, Aditi Raghunathan

TL;DR: 该论文提出了模式调节（ModC）框架，通过显式分配计算资源到不同推理模式，解决了并行采样中的多样性崩溃问题，显著提升了测试时的扩展效率。

Details

Motivation: 并行采样在测试时扩展中有巨大潜力，但由于多样性崩溃问题（模型集中在少数模式上，导致重复错误），其效果受限。论文旨在通过模式调节最大化数据的多样性利用。

Result: 1) Qwen2.5-7B在OpenThoughts上实现了4倍效率提升；2) 在NuminaMath等数据集上无监督ModC取得了10%的性能增益；3) ModC提升了强化学习的多样性和性能。

Insight: 标准训练未能充分利用数据多样性，而ModC通过显式模式调节解锁了多样性的潜力，为测试时扩展提供了简单有效的解决方案。

Abstract: Parallel sampling promises substantial gains in test-time scaling, but its effectiveness is sharply limited by diversity collapse, where models concentrate on a few modes and repeated samples produce the same mistakes. We propose the mode-conditioning (ModC) framework, which explicitly allocates test-time compute across reasoning modes using either specialist models or mode-specific prefixes. ModC consistently improves scaling across controlled graph-search tasks and large-scale reasoning benchmarks, spanning model families and sizes from 0.5B to 7B. On OpenThoughts, fine-tuning Qwen2.5-7B with ModC achieves a 4x efficiency gain over standard training while also improving the maximum attainable Pass@k. We further show that gradient clustering enables ModC without explicit mode labels, yielding up to 10% gains on datasets such as NuminaMath. Finally, we show that ModC improves reinforcement learning (RL) and can further boost diversity-inducing RL methods. These results demonstrate that standard training underutilizes the diversity in data, and that ModC provides a simple, effective remedy for unlocking the full benefits of diversity in test-time scaling.

[248] Pay Attention Later: From Vector Space Diffusion to Linearithmic Spectral Phase-Locking cs.LG | cs.AI | cs.CLPDF

Alper Yıldırım, İbrahim Yücedağ

TL;DR: 论文提出了一种新型模型PRISM，通过谐波卷积和频率编码解决了传统Transformer的语义对齐成本和塑性-稳定性问题，并在翻译任务中验证了其有效性。

Details

Motivation: 传统Transformer依赖扩散学习导致语义对齐成本高昂，且在新概念适应时表现出灾难性遗忘问题。论文旨在解决这一塑性-稳定性难题。

Result: 在WMT14翻译任务中，PRISM在塑性-稳定性压力测试中表现优异（96% 5-shot学习率，仅损失0.84 BLEU），而Transformer则完全失败（损失10.55 BLEU）。

Insight: 谐波表示能够有效解耦记忆与推理，为实时知识适应提供了一种结构化解法。

Abstract: Standard Transformers suffer from a “Semantic Alignment Tax”, a prohibitive optimization cost required to organize a chaotic initialization into a coherent geometric map via local gradient diffusion. We hypothesize that this reliance on diffusive learning creates “Catastrophic Rigidity”, rendering models unable to adapt to novel concepts without destroying their pre-trained reasoning capabilities. To isolate this phenomenon, we introduce Iterative Semantic Map Refinement (ISMR), a diagnostic protocol revealing that alignment is a fixed geometric barrier that scaling cannot solve; a 20-layer model overcomes this barrier no faster than a 1-layer model. We introduce the Phase-Resonant Intelligent Spectral Model (PRISM). PRISM encodes semantic identity as resonant frequencies in the complex domain (C^d) and replaces quadratic self-attention with linearithmic O(N log N) Gated Harmonic Convolutions. We validate PRISM on the WMT14 translation task. While the Standard Transformer maintains a slight edge in general competence on static benchmarks (23.88 vs 21.40 BLEU), it fails the “Plasticity-Stability” stress test completely. When injected with novel concepts, the Transformer suffers Catastrophic Forgetting, degrading by -10.55 BLEU points while achieving only 60% acquisition. In contrast, PRISM demonstrates Lossless Plasticity, achieving 96% 5-shot acquisition with negligible degradation (-0.84 BLEU). These results suggest that harmonic representations effectively decouple memory from reasoning, offering a structural solution to the plasticity-stability dilemma in real-time knowledge adaptation.

[249] Stabilizing Reinforcement Learning with LLMs: Formulation and Practices cs.LG | cs.AI | cs.CLPDF

Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang

TL;DR: 这篇论文提出了一种基于大型语言模型（LLM）的强化学习（RL）新框架，解释了为什么以及在什么条件下可通过替代的令牌级目标优化真实的序列级奖励。研究通过一阶近似证明，只有当训练-推断差异和策略陈旧性最小化时，替代目标才逐渐有效。实验表明，结合重要性采样修正和路由重放技术，可显著提升RL训练的稳定性。

Details

Motivation: 传统的强化学习方法在面对大型语言模型时存在训练不稳定和收敛速度慢的问题，尤其是在处理序列生成任务时。论文旨在提出一种理论和实践结合的框架，解释和解决这些问题。

Result: 实验表明：1. 在策略训练中，重要性采样修正能显著提升稳定性；2. 非策略训练中，结合裁剪和路由重放技术是必要的；3. 稳定训练后，长时间优化性能不受冷启动影响。

Insight: 1. 训练稳定性的关键在于减小训练-推断差异和策略陈旧性；2. 具体技术（如重要性采样修正和路由重放）为未来RL研究提供了实用方案。

Abstract: This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.

[250] ZIP-RC: Zero-overhead Inference-time Prediction of Reward and Cost for Adaptive and Interpretable Generation cs.LG | cs.AI | cs.CLPDF

Rohin Manvi, Joey Hong, Tim Seyde, Maxime Labonne, Mathias Lechner

TL;DR: ZIP-RC是一种自适应推理方法，通过在推理时预测奖励和成本的零开销预测，让语言模型能够动态调整生成策略，从而在质量和效率之间找到平衡。

Details

Motivation: 现有大型语言模型在推理时缺乏自我反省能力，无法预估自身成功所需的计算量。人类能够实时调整投入的努力，而现有方法（如Best-of-N）固定计算预算，缺乏灵活性，且额外的验证模型增加了开销。

Result: 在混合难度数学基准上，ZIP-RC在相同或更低平均成本下，比多数投票方法精度提高12%，并在质量、计算量和延迟之间实现了平滑的帕累托前沿。

Insight: ZIP-RC展示了语言模型可以通过高效的实时自省实现自适应推理，同时避免了额外计算开销，为动态生成策略提供了新思路。

Abstract: Large language models excel at reasoning but lack key aspects of introspection, including anticipating their own success and the computation required to achieve it. Humans use real-time introspection to decide how much effort to invest, when to make multiple attempts, when to stop, and when to signal success or failure. Without this, LLMs struggle to make intelligent meta-cognition decisions. Test-time scaling methods like Best-of-N drive up cost and latency by using a fixed budget of samples regardless of the marginal benefit of each one at any point in generation, and the absence of confidence signals can mislead people, prevent appropriate escalation to better tools, and undermine trustworthiness. Learned verifiers or reward models can provide confidence estimates, but do not enable adaptive inference and add substantial cost by requiring extra models or forward passes. We present ZIP-RC, an adaptive inference method that equips models with zero-overhead inference-time predictions of reward and cost. At every token, ZIP-RC reuses reserved or unused logits in the same forward pass as next-token prediction to output a joint distribution over final reward and remaining length – no extra models, architecture change, or inference overhead. This full joint distribution is used to compute a sampling utility which is the linear combination of the expected maximum reward, total compute, and latency of set of samples if generated to completion. During inference, we maximize this utility with meta-actions that determine which prefix of tokens to continue or initiate sampling from. On mixed-difficulty mathematical benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost, and traces smooth Pareto frontiers between quality, compute, and latency. By providing real-time reward-cost introspection, ZIP-RC enables adaptive, efficient reasoning.

[251] HalluGraph: Auditable Hallucination Detection for Legal RAG Systems via Knowledge Graph Alignment cs.LG | cs.AI | cs.CLPDF

Valentin Noël, Elimane Yassine Seidou, Charly Ken Capo-Chichi, Ghanem Amari

TL;DR: HalluGraph是一种基于知识图谱对齐的法律RAG系统幻觉检测框架，提供可审计的透明性和可追踪性，显著优于现有语义相似性方法。

Details

Motivation: 法律AI系统在生成文本时需要确保其内容与源文件一致，但现有幻觉检测方法依赖语义相似性，可能导致关键实体或关系被错误替换，带来严重后果。

Result: 在控制文档和生成性法律任务中，HalluGraph分别达到AUC=0.979和AUC≈0.89，显著优于语义相似性基线。

Insight: 结构化对齐方法在高风险法律应用中更具可靠性和透明度，能够生成从生成内容到源文件的完整审计路径。

Abstract: Legal AI systems powered by retrieval-augmented generation (RAG) face a critical accountability challenge: when an AI assistant cites case law, statutes, or contractual clauses, practitioners need verifiable guarantees that generated text faithfully represents source documents. Existing hallucination detectors rely on semantic similarity metrics that tolerate entity substitutions, a dangerous failure mode when confusing parties, dates, or legal provisions can have material consequences. We introduce HalluGraph, a graph-theoretic framework that quantifies hallucinations through structural alignment between knowledge graphs extracted from context, query, and response. Our approach produces bounded, interpretable metrics decomposed into \textit{Entity Grounding} (EG), measuring whether entities in the response appear in source documents, and \textit{Relation Preservation} (RP), verifying that asserted relationships are supported by context. On structured control documents, HalluGraph achieves near-perfect discrimination ($>$400 words, $>$20 entities), HalluGraph achieves $AUC = 0.979$, while maintaining robust performance ($AUC \approx 0.89$) on challenging generative legal task, consistently outperforming semantic similarity baselines. The framework provides the transparency and traceability required for high-stakes legal applications, enabling full audit trails from generated assertions back to source passages.

[252] Agentic Policy Optimization via Instruction-Policy Co-Evolution cs.LG | cs.AI | cs.CLPDF

Han Zhou, Xingchen Wan, Ivan Vulić, Anna Korhonen

TL;DR: INSPO提出了一种新颖的指令-策略协同进化框架，通过动态优化指令强化RL代理的性能，显著优于静态指令方法。

Details

Motivation: 现有RLVR依赖静态手动设计的指令，可能无法适应模型和环境的动态变化，限制了代理性能提升的空间。

Result: 在多轮检索和推理任务中，INSPO显著超越静态指令基线，生成更高效的策略路径。

Insight: 指令的动态优化是提升RL代理性能的关键，且不需显著增加计算开销。

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capability of large language models (LLMs), enabling autonomous agents that can conduct effective multi-turn and tool-integrated reasoning. While instructions serve as the primary protocol for defining agents, RLVR typically relies on static and manually designed instructions. However, those instructions may be suboptimal for the base model, and the optimal instruction may change as the agent’s policy improves and explores the interaction with the environment. To bridge the gap, we introduce INSPO, a novel Instruction-Policy co-evolution framework that integrates instruction optimization as a dynamic component of the reinforcement learning (RL) loop. INSPO maintains a dynamic population of instruction candidates that are sampled with questions, where reward signals in RL loops are automatically attributed to each instruction, and low performers are periodically pruned. New instructions are generated and verified through an on-policy reflection mechanism, where an LLM-based optimizer analyzes past experience from a replay buffer and evolves more effective strategies given the current policy. We conduct extensive experiments on multi-turn retrieval and reasoning tasks, demonstrating that INSPO substantially outperforms strong baselines relying on static instructions. INSPO discovers innovative instructions that guide the agent toward more strategic reasoning paths, achieving substantial performance gains with only a marginal increase in computational overhead.

[253] SelfAI: Building a Self-Training AI System with LLM Agents cs.LG | cs.AI | cs.CVPDF

Xiao Wu, Ting-Zhu Huang, Liang-Jian Deng, Xiaobing Yu, Yu Zhong

TL;DR: SelfAI是一个多智能体平台，通过结合用户代理、认知代理和实验管理器，实现自主科学发现和超参数优化，同时减少冗余实验，优于传统方法。

Details

Motivation: 现有基于LLM的自主科学发现系统局限于狭窄领域，缺乏实时交互和停止搜索的机制，导致效率低下和可重复性问题。SelfAI旨在解决这些不足。

Result: 在回归、NLP、计算机视觉等多个任务中，SelfAI表现优于贝叶斯优化和基于LLM的方法，减少了冗余实验。

Insight: SelfAI展示了多智能体协同在科学发现中的潜力，同时强调了评估指标对优化过程的重要性。

Abstract: Recent work on autonomous scientific discovery has leveraged LLM-based agents to integrate problem specification, experiment planning, and execution into end-to-end systems. However, these frameworks are often confined to narrow application domains, offer limited real-time interaction with researchers, and lack principled mechanisms for determining when to halt exploration, resulting in inefficiencies, reproducibility challenges, and under-utilized human expertise. To address these gaps, we propose \textit{SelfAI}, a general multi-agent platform that combines a User Agent for translating high-level research objectives into standardized experimental configurations, a Cognitive Agent powered by LLMs with optimal stopping criteria to iteratively refine hyperparameter searches, and an Experiment Manager responsible for orchestrating parallel, fault-tolerant training workflows across heterogeneous hardware while maintaining a structured knowledge base for continuous feedback. We further introduce two novel evaluation metrics, Score and $\text{AUP}_D$, to quantify discovery efficiency and search diversity. Across regression, NLP, computer vision, scientific computing, medical imaging, and drug discovery benchmarks, SelfAI consistently achieves strong performance and reduces redundant trials compared to classical Bayesian optimization and LLM-based baselines, while enabling seamless interaction with human researchers.

[254] REM: Evaluating LLM Embodied Spatial Reasoning through Multi-Frame Trajectories cs.LG | cs.AI | cs.CVPDF

Jacob Thompson, Emiliano Garcia-Lopez, Yonatan Bisk

TL;DR: 论文提出了REM基准，用于评估多模态大语言模型（MLLMs）在长时程空间推理任务中的表现，揭示了其在复杂空间任务中的局限性，并与人类表现形成鲜明对比。

Details

Motivation: 人类通过导航构建视角独立的认知地图，而当前的MLLMs尽管经过大量视频训练，仍缺乏这种基础的空间推理能力，这限制了其在具身应用中的潜力。

Result: 当前最佳模型在简单任务中表现尚可，但随着任务复杂度增加，表现急剧下降，远不如人类。

Insight: MLLMs需要更强大的空间表示能力以应对动态环境，REM为此提供了具体的诊断指标和优化方向。

Abstract: Humans build viewpoint-independent cognitive maps through navigation, enabling intuitive reasoning about object permanence and spatial relations. We argue that multimodal large language models (MLLMs), despite extensive video training, lack this fundamental spatial reasoning capability, a critical limitation for embodied applications. To demonstrate these limitations and drive research, we introduce REM (Reasoning over Embodied Multi-Frame Trajectories), a benchmark using controllable 3D environments for long-horizon embodied spatial reasoning. REM systematically evaluates key aspects like object permanence/distinction, spatial relationships, and numerical tracking across dynamic embodied viewpoints. Our evaluation shows that the best-performing current models exhibit promising overall performance, but become increasingly unreliable at even moderate complexity levels easily handled by humans. These findings highlight challenges MLLMs face in developing robust spatial representations from sequential visual input. Consequently, REM provides targeted metrics and diagnostics to foster improved spatial understanding in future models.

[255] Open-Set Domain Adaptation Under Background Distribution Shift: Challenges and A Provably Efficient Solution cs.LG | cs.AI | cs.CVPDF

Shravan Chaudhari, Yoav Wald, Suchi Saria

TL;DR: 该论文提出了一种名为\ours{}的方法，用于解决开放集识别问题，即使在背景分布发生变化的挑战性情况下也能保证性能。论文通过理论和实验证明了该方法的有效性，并揭示了新类别大小等因素对性能的影响。

Details

Motivation: 现实中机器学习系统的部署面临数据分布变化的挑战，尤其是新类别的出现和已知类别分布的变化。现有开放集识别方法通常假设背景分布固定，无法应对背景分布变化的情况。

Result: 实验结果表明，\ours{}在图像和文本数据上显著优于现有开放集识别方法，尤其是在背景分布变化的情况下。

Insight: 新类别的大小对开放集识别的性能有显著影响，这一发现填补了现有研究的空白。

Abstract: As we deploy machine learning systems in the real world, a core challenge is to maintain a model that is performant even as the data shifts. Such shifts can take many forms: new classes may emerge that were absent during training, a problem known as open-set recognition, and the distribution of known categories may change. Guarantees on open-set recognition are mostly derived under the assumption that the distribution of known classes, which we call \emph{the background distribution}, is fixed. In this paper we develop \ours{}, a method that is guaranteed to solve open-set recognition even in the challenging case where the background distribution shifts. We prove that the method works under benign assumptions that the novel class is separable from the non-novel classes, and provide theoretical guarantees that it outperforms a representative baseline in a simplified overparameterized setting. We develop techniques to make \ours{} scalable and robust, and perform comprehensive empirical evaluations on image and text data. The results show that \ours{} significantly outperforms existing open-set recognition methods under background shift. Moreover, we provide new insights into how factors such as the size of the novel class influences performance, an aspect that has not been extensively explored in prior work.

[256] First On-Orbit Demonstration of a Geospatial Foundation Model cs.LG | cs.AI | cs.CVPDF

Andrew Du, Roberto Del Prete, Alejandro Mousist, Nick Manser, Fabrice Marre

TL;DR: 论文展示了如何在资源受限的空间硬件上部署紧凑型地理空间基础模型（GeoFM），并通过压缩和领域适应保持性能，成功在国际空间站上实现轨道推理。

Details

Motivation: 地理空间基础模型（GeoFMs）在地球观测任务中具有广泛泛化能力，但其大模型尺寸限制了在资源有限的空间硬件上的部署。

Result: 成功在国际空间站的IMAGIN-e载荷上实现轨道推理，证明压缩后的模型在高性能的同时满足资源需求。

Insight: 模型压缩和领域适应是实现在资源受限环境中部署大型AI模型的关键，为地球观测任务中的机载AI提供了可行路径。

Abstract: Geospatial foundation models (GeoFMs) promise broad generalisation capacity for Earth observation (EO) tasks, particularly under data-limited conditions. However, their large size poses a barrier to deployment on resource-constrained space hardware. To address this, we present compact variants of a Vision Transformer (ViT)-based GeoFM that preserve downstream task performance while enabling onboard execution. Evaluation across five downstream tasks and validation in two representative flight environments show that model compression and domain adaptation are critical to reducing size and resource demands while maintaining high performance under operational conditions. We further demonstrate reliable on-orbit inference with the IMAGIN-e payload aboard the International Space Station. These results establish a pathway from large GeoFMs to flight-ready, resource-efficient deployments, expanding the feasibility of onboard AI for EO missions.

[257] Stay Unique, Stay Efficient: Preserving Model Personality in Multi-Task Merging cs.LG | cs.CVPDF

Kuangpu Guo, Yuhe Ding, Jian Liang, Zilei Wang, Ran He

TL;DR: 该论文提出了DTS框架，通过保留任务特定信息来解决多任务模型合并中的性能下降问题，同时最小化存储开销。

Details

Motivation: 现有模型合并方法在多任务场景中性能下降显著，尤其是相似任务上表现不佳，需要保留任务特定信息以提升性能。

Result: 实验表明，DTS在性能上优于现有方法，且每任务仅需1%的额外存储开销，在未见任务上也表现出更好的泛化能力。

Insight: 保留任务特定信息的关键是高效分解和选择性融合，可通过语义相似性实现数据无关的任务泛化。

Abstract: Model merging has emerged as a promising paradigm for enabling multi-task capabilities without additional training. However, existing methods often experience substantial performance degradation compared with individually fine-tuned models, even on similar tasks, underscoring the need to preserve task-specific information. This paper proposes Decomposition, Thresholding, and Scaling (DTS), an approximation-based personalized merging framework that preserves task-specific information with minimal storage overhead. DTS first applies singular value decomposition to the task-specific information and retains only a small subset of singular values and vectors. It then introduces a novel thresholding strategy that partitions singular vector elements into groups and assigns a scaling factor to each group. To enable generalization to unseen tasks, we further extend DTS with a variant that fuses task-specific information in a data-free manner based on the semantic similarity of task characteristics. Extensive experiments demonstrate that DTS consistently outperforms state-of-the-art baselines while requiring only 1% additional storage per task. Furthermore, experiments on unseen tasks show that the DTS variant achieves significantly better generalization performance. Our code is available at https://github.com/krumpguo/DTS.

[258] Forget Less, Retain More: A Lightweight Regularizer for Rehearsal-Based Continual Learning cs.LG | cs.CVPDF

Lama Alssum, Hasan Abed Al Kader Hammoud, Motasem Alfarra, Juan C Leon Alcazar, Bernard Ghanem

TL;DR: 该论文提出了一种轻量化的正则化策略（信息最大化正则化器），用于缓解基于记忆的持续学习方法中的灾难性遗忘问题。该方法通过专注于预期的标签分布，实现了类别无关的正则化，从而显著提升性能且计算开销小。

Details

Motivation: 深度神经网络在持续学习任务中容易发生灾难性遗忘，即新任务的学习会覆盖旧任务的知识。现有方法虽然在缓解问题上有一定效果，但在计算开销和实用性上仍有不足。

Result: 实验表明，IM正则化器在多个数据集和任务数量下都能显著提升基线性能，且计算开销极小。此外，IM在视频数据上也表现出色，证明了其数据无关的通用性。

Insight: 1. 专注于标签分布而非具体数据是一种高效的缓解灾难性遗忘的策略；2. 轻量化的设计使得该方法在计算效率和实用性上具有优势；3. 该方法可以扩展到复杂场景（如视频），展现了较强的适应性。

Abstract: Deep neural networks suffer from catastrophic forgetting, where performance on previous tasks degrades after training on a new task. This issue arises due to the model’s tendency to overwrite previously acquired knowledge with new information. We present a novel approach to address this challenge, focusing on the intersection of memory-based methods and regularization approaches. We formulate a regularization strategy, termed Information Maximization (IM) regularizer, for memory-based continual learning methods, which is based exclusively on the expected label distribution, thus making it class-agnostic. As a consequence, IM regularizer can be directly integrated into various rehearsal-based continual learning methods, reducing forgetting and favoring faster convergence. Our empirical validation shows that, across datasets and regardless of the number of tasks, our proposed regularization strategy consistently improves baseline performance at the expense of a minimal computational overhead. The lightweight nature of IM ensures that it remains a practical and scalable solution, making it applicable to real-world continual learning scenarios where efficiency is paramount. Finally, we demonstrate the data-agnostic nature of our regularizer by applying it to video data, which presents additional challenges due to its temporal structure and higher memory requirements. Despite the significant domain gap, our experiments show that IM regularizer also improves the performance of video continual learning methods.

cs.CR [Back]

[259] Large Language Models Cannot Reliably Detect Vulnerabilities in JavaScript: The First Systematic Benchmark and Evaluation cs.CR | cs.CL | cs.SEPDF

Qingyuan Fei, Xin Liu, Song Li, Shujiang Wu, Jianwei Hou

TL;DR: 该论文首次系统评估了大型语言模型（LLMs）在JavaScript漏洞检测中的能力，提出了构建基准的三原则，并开发了FORGEJS和JUDGEJS框架。结果显示LLMs在此任务中表现不佳，存在鲁棒性问题。

Details

Motivation: 现有JavaScript漏洞检测的基准存在覆盖不全、低估或高估LLM能力的问题，需要系统性评估和更全面的基准。

Result: LLMs在JavaScript漏洞检测中表现有限，推理能力不足且鲁棒性差，表明LLM在此任务中尚不可靠。

Insight: LLMs当前无法可靠检测JavaScript漏洞，未来需优化基准设计和模型能力。

Abstract: Researchers have proposed numerous methods to detect vulnerabilities in JavaScript, especially those assisted by Large Language Models (LLMs). However, the actual capability of LLMs in JavaScript vulnerability detection remains questionable, necessitating systematic evaluation and comprehensive benchmarks. Unfortunately, existing benchmarks suffer from three critical limitations: (1) incomplete coverage, such as covering a limited subset of CWE types; (2) underestimation of LLM capabilities caused by unreasonable ground truth labeling; and (3) overestimation due to unrealistic cases such as using isolated vulnerable files rather than complete projects. In this paper, we introduce, for the first time, three principles for constructing a benchmark for JavaScript vulnerability detection that directly address these limitations: (1) comprehensiveness, (2) no underestimation, and (3) no overestimation. Guided by these principles, we propose FORGEJS, the first automatic benchmark generation framework for evaluating LLMs’ capability in JavaScript vulnerability detection. Then, we use FORGEJS to construct ARENAJS-the first systematic benchmark for LLM-based JavaScript vulnerability detection-and further propose JUDGEJS, an automatic evaluation framework. We conduct the first systematic evaluation of LLMs for JavaScript vulnerability detection, leveraging JUDGEJS to assess seven popular commercial LLMs on ARENAJS. The results show that LLMs not only exhibit limited reasoning capabilities, but also suffer from severe robustness defects, indicating that reliable JavaScript vulnerability detection with LLMs remains an open challenge.

[260] Securing Large Language Models (LLMs) from Prompt Injection Attacks cs.CR | cs.CL | cs.LGPDF

Omar Farooq Khan Suri, John McCrae

TL;DR: 该论文探讨了大型语言模型（LLMs）在提示注入攻击（Prompt Injection Attacks）中的安全问题，提出了一种基于任务特定微调（JATMO）的防御方法，并通过遗传攻击框架HOUYI评估其鲁棒性。研究发现JATMO虽能降低攻击成功率，但仍存在漏洞，尤其在多语言或代码相关场景下。

Details

Motivation: 随着LLMs在现实应用中的广泛部署，其灵活性带来的安全风险日益凸显，尤其是提示注入攻击。研究旨在评估一种新型防御方法JATMO的有效性，并揭示其局限性。

Result: 实验显示，JATMO降低了攻击成功率，但未能完全阻止注入攻击；多语言和代码相关攻击仍能绕过防御。此外，任务性能与攻击脆弱性之间存在权衡。

Insight: 1. 单靠任务特定微调不足以完全防御提示注入攻击；2. 防御策略需要多层设计并结合对抗性训练；3. 未来研究方向应关注如何在提高任务性能的同时保持安全性。

Abstract: Large Language Models (LLMs) are increasingly being deployed in real-world applications, but their flexibility exposes them to prompt injection attacks. These attacks leverage the model’s instruction-following ability to make it perform malicious tasks. Recent work has proposed JATMO, a task-specific fine-tuning approach that trains non-instruction-tuned base models to perform a single function, thereby reducing susceptibility to adversarial instructions. In this study, we evaluate the robustness of JATMO against HOUYI, a genetic attack framework that systematically mutates and optimizes adversarial prompts. We adapt HOUYI by introducing custom fitness scoring, modified mutation logic, and a new harness for local model testing, enabling a more accurate assessment of defense effectiveness. We fine-tuned LLaMA 2-7B, Qwen1.5-4B, and Qwen1.5-0.5B models under the JATMO methodology and compared them with a fine-tuned GPT-3.5-Turbo baseline. Results show that while JATMO reduces attack success rates relative to instruction-tuned models, it does not fully prevent injections; adversaries exploiting multilingual cues or code-related disruptors still bypass defenses. We also observe a trade-off between generation quality and injection vulnerability, suggesting that better task performance often correlates with increased susceptibility. Our results highlight both the promise and limitations of fine-tuning-based defenses and point toward the need for layered, adversarially informed mitigation strategies.

[261] EmoRAG: Evaluating RAG Robustness to Symbolic Perturbations cs.CR | cs.AI | cs.CLPDF

Xinyun Zhou, Xinfeng Li, Yinan Peng, Ming Xu, Xuanwang Zhang

TL;DR: 这篇论文揭示了RAG系统对情感符号扰动的脆弱性，展示了单一情感符号如何极大干扰检索效果，并提出针对性防御措施。

Details

Motivation: 研究动机在于揭示RAG系统对情感符号等细微扰动的敏感性及其潜在威胁，弥补当前研究中忽略的脆弱性。

Result: 1. 单一情感符号导致99%无关检索；2. 开头位置扰动F1达0.92；3. 大模型更易受影响。

Insight: RAG实际鲁棒性低于假设，情感符号可被恶意利用；现有防御不足，需针对性设计。

Abstract: Retrieval-Augmented Generation (RAG) systems are increasingly central to robust AI, enhancing large language model (LLM) faithfulness by incorporating external knowledge. However, our study unveils a critical, overlooked vulnerability: their profound susceptibility to subtle symbolic perturbations, particularly through near-imperceptible emoticon tokens such as “(@_@)” that can catastrophically mislead retrieval, termed EmoRAG. We demonstrate that injecting a single emoticon into a query makes it nearly 100% likely to retrieve semantically unrelated texts that contain a matching emoticon. Our extensive experiment across general question-answering and code domains, using a range of state-of-the-art retrievers and generators, reveals three key findings: (I) Single-Emoticon Disaster: Minimal emoticon injections cause maximal disruptions, with a single emoticon almost 100% dominating RAG output. (II) Positional Sensitivity: Placing an emoticon at the beginning of a query can cause severe perturbation, with F1-Scores exceeding 0.92 across all datasets. (III) Parameter-Scale Vulnerability: Counterintuitively, models with larger parameters exhibit greater vulnerability to the interference. We provide an in-depth analysis to uncover the underlying mechanisms of these phenomena. Furthermore, we raise a critical concern regarding the robustness assumption of current RAG systems, envisioning a threat scenario where an adversary exploits this vulnerability to manipulate the RAG system. We evaluate standard defenses and find them insufficient against EmoRAG. To address this, we propose targeted defenses, analyzing their strengths and limitations in mitigating emoticon-based perturbations. Finally, we outline future directions for building robust RAG systems.

cs.AI [Back]

[262] Probing the “Psyche’’ of Large Reasoning Models: Understanding Through a Human Lens cs.AI | cs.CLPDF

Yuxiang Chen, Zuohan Wu, Ziwei Wang, Xiangning Yu, Xujia Li

TL;DR: 这篇论文提出了一个全面的分类法，用于理解大型推理模型（LRMs）的‘心理’过程，并结合人类认知视角对其进行深入分析。作者开发了名为CAPO的自动标注框架，并生成了一份标注数据集。研究发现，当前模型的自我监控行为往往流于表面，因此提倡多步反思以提高推理能力。

Details

Motivation: 大型推理模型（LRMs）展现出类人的推理行为，但缺乏系统的理论框架来理解其内部‘心理’过程。通过结合人类心理学的分类方法，可以更好地理解和改进LRMs的表现。

Result: 实验表明CAPO在标注一致性上优于基线方法。分析发现，当前模型的‘双重检查’行为效果有限，多步反思可能是更有效的改进路径。

Insight: 将人类认知理论应用于LRMs分析有助于揭示其局限性，并指导改进方向（如强调多步反思）。CAPO框架为大规模分析提供了可扩展的工具。

Abstract: Large reasoning models (LRMs) have garnered significant attention from researchers owing to their exceptional capability in addressing complex tasks. Motivated by the observed human-like behaviors in their reasoning processes, this paper introduces a comprehensive taxonomy to characterize atomic reasoning steps and probe the psyche'' of LRM intelligence. Specifically, it comprises five groups and seventeen categories derived from human mental processes, thereby grounding the understanding of LRMs in an interdisciplinary perspective. The taxonomy is then applied for an in-depth understanding of current LRMs, resulting in a distinct labeled dataset that comprises 277,534 atomic reasoning steps. Using this resource, we analyze contemporary LRMs and distill several actionable takeaways for improving training and post-training of reasoning models. Notably, our analysis reveals that prevailing post-answer double-checks’’ (self-monitoring evaluations) are largely superficial and rarely yield substantive revisions. Thus, incentivizing comprehensive multi-step reflection, rather than simple self-monitoring, may offer a more effective path forward. To complement the taxonomy, an automatic annotation framework, named CAPO, is proposed to leverage large language models (LLMs) for generating the taxonomy-based annotations. Experimental results demonstrate that CAPO achieves higher consistency with human experts compared to baselines, facilitating a scalable and comprehensive analysis of LRMs from a human cognitive perspective. Together, the taxonomy, CAPO, and the derived insights provide a principled, scalable path toward understanding and advancing LRM reasoning.

[263] Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics cs.AI | cs.CLPDF

Jinu Lee, Kyoung-Woon On, Simeng Han, Arman Cohan, Julia Hockenmaier

TL;DR: 该论文提出了LEGIT数据集，用于评估法律领域LLM生成推理痕迹的质量，并探讨了检索增强生成（RAG）和基于Rubrics的强化学习（RL）对法律推理能力的互补作用。

Details

Motivation: 在法律领域，评估LLM生成的推理痕迹的质量至关重要，但因其复杂性而具有挑战性。论文旨在通过构建专家级法律推理数据集（LEGIT）来解决这一问题。

Result: 1. LLM的法律推理能力受问题覆盖率和正确性严重影响；2. RAG提升整体推理能力，而RL提高正确性但降低覆盖率。

Insight: RAG和RL在法律推理中各有优势，结合使用可能更有效；层次树结构为复杂领域推理评估提供了新思路。

Abstract: Evaluating the quality of LLM-generated reasoning traces in expert domains (e.g., law) is essential for ensuring credibility and explainability, yet remains challenging due to the inherent complexity of such reasoning tasks. We introduce LEGIT (LEGal Issue Trees), a novel large-scale (24K instances) expert-level legal reasoning dataset with an emphasis on reasoning trace evaluation. We convert court judgments into hierarchical trees of opposing parties’ arguments and the court’s conclusions, which serve as rubrics for evaluating the issue coverage and correctness of the reasoning traces. We verify the reliability of these rubrics via human expert annotations and comparison with coarse, less informative rubrics. Using the LEGIT dataset, we show that (1) LLMs’ legal reasoning ability is seriously affected by both legal issue coverage and correctness, and that (2) retrieval-augmented generation (RAG) and RL with rubrics bring complementary benefits for legal reasoning abilities, where RAG improves overall reasoning capability, whereas RL improves correctness albeit with reduced coverage.

[264] H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons cs.AI | cs.CL | cs.CYPDF

Cheng Gao, Huimin Chen, Chaojun Xiao, Zhiyi Chen, Zhiyuan Liu

TL;DR: 论文研究了大型语言模型（LLMs）中与幻觉行为相关的神经元（H-Neurons），探索了它们的识别、行为影响及起源，揭示了稀疏神经元与幻觉的因果关系及其在预训练中的形成。

Details

Motivation: 幻觉（plausible but factually incorrect outputs）是LLMs的主要问题之一，严重影响了其可靠性。现有研究多从宏观角度（如训练数据和目标）分析，但对神经元层面的机制缺乏深入探讨。

Result: 研究发现不到0.1%的神经元能可靠预测幻觉行为，且这些神经元在预训练阶段就已形成，展现出强泛化能力。

Insight: H-Neurons的存在表明幻觉行为与特定神经元直接相关，这一发现为通过微观干预提升LLMs的可靠性提供了新思路。

Abstract: Large language models (LLMs) frequently generate hallucinations – plausible but factually incorrect outputs – undermining their reliability. While prior work has examined hallucinations from macroscopic perspectives such as training data and objectives, the underlying neuron-level mechanisms remain largely unexplored. In this paper, we conduct a systematic investigation into hallucination-associated neurons (H-Neurons) in LLMs from three perspectives: identification, behavioral impact, and origins. Regarding their identification, we demonstrate that a remarkably sparse subset of neurons (less than $0.1%$ of total neurons) can reliably predict hallucination occurrences, with strong generalization across diverse scenarios. In terms of behavioral impact, controlled interventions reveal that these neurons are causally linked to over-compliance behaviors. Concerning their origins, we trace these neurons back to the pre-trained base models and find that these neurons remain predictive for hallucination detection, indicating they emerge during pre-training. Our findings bridge macroscopic behavioral patterns with microscopic neural mechanisms, offering insights for developing more reliable LLMs.

[265] From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning cs.AI | cs.CLPDF

Sitao Cheng, Xunjian Yin, Ruiwen Zhou, Yuxuan Li, Xinyi Wang

TL;DR: 论文研究了RL（强化学习）在互补推理中的作用，发现RL能够作为推理的合成器而非简单的概率放大器，但需要以SFT（监督微调）为基础掌握原子技能。通过分解任务为参数推理和上下文推理，实验表明RL在复杂推理任务中具备泛化能力。

Details

Motivation: 探讨RL如何提升推理能力，尤其是是否能够通过合成新技能而非仅放大现有行为来提高泛化能力。通过互补推理任务，严格分解能力并评估泛化表现。

Result: SFT在分布内表现良好但在O.O.D.泛化中失败，而RL能够合成复杂策略。RL的成功依赖于模型通过SFT掌握的原子技能。

Insight: RL不仅是行为的放大器，还能在原子技能基础上合成新推理策略。分解训练结合RL为复杂推理任务的泛化提供了可行路径。

Abstract: The mechanism by which RL contributes to reasoning capabilities-whether it incentivizes the synthesis of new skills or merely amplifies existing behaviors-remains a subject of intense debate. In this work, we investigate this question through the lens of Complementary Reasoning, a complex task that requires integrating internal parametric knowledge with external contextual information. Using a controlled synthetic dataset of human biographies, we strictly decouple this ability into two atomic skills: Parametric Reasoning (relying on internal knowledge) and Contextual Reasoning (depending on external information). To rigorously assess capability boundaries, we evaluate generalization across three distinct levels of difficulty: I.I.D., Composition, and Zero-shot settings. We find that while SFT is sufficient for in-distribution performance, it struggles with O.O.D. generalization, particularly in Zero-shot settings where relational combinations are novel. Crucially, we identify the SFT Generalization Paradox: Models supervised solely on the composite task achieve near-perfect in-distribution accuracy but collapse on out-of-distribution generalization, indicating their reliance on rote memorization of path shortcuts. In contrast, we find that RL acts as a reasoning synthesizer rather than a probability amplifier. However, we uncover a strict atomic prerequisite: RL can only synthesize these complex strategies if the base model has first mastered the independent atomic skills (Parametric and Contextual) via SFT. These findings challenge the view of RL as a mere amplifier, suggesting that given sufficient atomic foundations, RL can actively synthesize complex reasoning strategies from learned primitives without explicit supervision on such complex strategies. This indicates that decoupled atomic training followed by RL offers a scalable path to generalization for complex reasoning tasks.

[266] Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback cs.AI | cs.CL | cs.CVPDF

Aiden Yiliu Li, Bizhi Yu, Daoan Lei, Tianhe Ren, Shilong Liu

TL;DR: Chain-of-Ground (CoG) 提出了一种无需训练的渐进式视觉推理框架，通过迭代调整假设提升 GUI 定位能力，在 ScreenSpot Pro 和 TPanel UI 数据集上分别提升 4.8 和 6.9 个百分点。

Details

Motivation: 当前多模态大语言模型在复杂用户界面的 GUI 定位任务中存在小目标、视觉相似目标及布局歧义等问题，限制了其定位能力。

Result: 在 ScreenSpot Pro 和 TPanel UI 上分别达到 68.4% 和优于 Qwen3 VL 235B 6.9 个百分点的性能。

Insight: 迭代性结构化推理能显著提升定位能力，无需额外训练即可解锁多模态模型的潜力。

Abstract: GUI grounding aims to align natural language instructions with precise regions in complex user interfaces. Advanced multimodal large language models show strong ability in visual GUI grounding but still struggle with small or visually similar targets and ambiguity in real world layouts. These limitations arise from limited grounding capacity and from underuse of existing reasoning potential. We present Chain of Ground CoG a training free multi step grounding framework that uses multimodal large language models for iterative visual reasoning and refinement. Instead of direct prediction the model progressively reflects and adjusts its hypotheses leading to more accurate and interpretable localization. Our approach achieves 68.4 accuracy on the ScreenSpot Pro benchmark an improvement of 4.8 points. To measure real world generalization we introduce TPanel UI a dataset of 420 labeled industrial control panels with visual distortions such as blur and masking. On TPanel UI Chain of Ground improves over the strong baseline Qwen3 VL 235B by 6.9 points showing the effectiveness of multi step training free grounding across real world and digital interfaces. These results highlight a direction for unlocking grounding potential through structured iterative refinement instead of additional training.

[267] LLM CHESS: Benchmarking Reasoning and Instruction-Following in LLMs through Chess cs.AI | cs.CLPDF

Sai Kolasani, Maxim Saplin, Nicholas Crispino, Kyle Montgomery, Jared Quincy Davis

TL;DR: LLM CHESS是一个评估框架，通过象棋领域的长期交互测试大语言模型（LLMs）在推理和指令遵循上的泛化能力，并对50多个开源和闭源模型进行了排名。

Details

Motivation: 现有静态基准测试容易过拟合和记忆，且无法动态评估模型的推理和指令遵循能力，因此需要一个更灵活且难以饱和的评估方法。

Result: 实验显示即使顶级模型也难以稳定完成游戏或获胜，表明LLM CHESS能有效区分推理和非推理模型。

Insight: 动态任务设计减少过拟合和记忆问题，为评估模型能力提供了更具挑战性和可持续性的方法。

Abstract: We introduce LLM CHESS, an evaluation framework designed to probe the generalization of reasoning and instruction-following abilities in large language models (LLMs) through extended agentic interaction in the domain of chess. We rank over 50 open and closed source models by playing against a random opponent using a range of behavioral metrics, including win and loss rates, move quality, move legality, hallucinated actions, and game duration. For a subset of top reasoning models, we derive an Elo estimate by playing against a chess engine with variably configured skill, which allows for comparisons between models in an easily understandable way. Despite the simplicity of the instruction-following task and the weakness of the opponent, many state-of-the-art models struggle to complete games or achieve consistent wins. Similar to other benchmarks on complex reasoning tasks, our experiments reveal a clear separation between reasoning and non-reasoning models. However, unlike existing static benchmarks, the stochastic and dynamic nature of LLM CHESS uniquely reduces overfitting and memorization while preventing benchmark saturation, proving difficult even for top reasoning models. To support future work on evaluating reasoning and instruction-following in LLMs, we release our experimental framework, a public leaderboard, and a dataset of associated games.

[268] Med-CMR: A Fine-Grained Benchmark Integrating Visual Evidence and Clinical Logic for Medical Complex Multimodal Reasoning cs.AI | cs.CVPDF

Haozhen Gong, Xiaozhong Ji, Yuansen Liu, Wenbin Wu, Xiaoxiao Yan

TL;DR: Med-CMR是一个细粒度的医学复杂多模态推理基准，通过系统能力分解、挑战性任务设计和广泛高质量数据覆盖，评估多模态大语言模型（MLLMs）在医学领域的表现。

Details

Motivation: 现有的MLLMs在临床工作流程中的应用潜力尚不明确，缺乏针对复杂医学推理能力的细粒度评估基准。

Result: GPT-5表现最佳（MCQ准确率57.81，开放问题得分48.70），但专业医学MLLMs未必优于通用模型，长尾泛化是主要失败模式。

Insight: Med-CMR揭示了医学MLLMs在视觉推理整合和罕见病例稳健性上的不足，为未来临床系统提供严格基准。

Abstract: MLLMs MLLMs are beginning to appear in clinical workflows, but their ability to perform complex medical reasoning remains unclear. We present Med-CMR, a fine-grained Medical Complex Multimodal Reasoning benchmark. Med-CMR distinguishes from existing counterparts by three core features: 1) Systematic capability decomposition, splitting medical multimodal reasoning into fine-grained visual understanding and multi-step reasoning to enable targeted evaluation; 2) Challenging task design, with visual understanding across three key dimensions (small-object detection, fine-detail discrimination, spatial understanding) and reasoning covering four clinically relevant scenarios (temporal prediction, causal reasoning, long-tail generalization, multi-source integration); 3) Broad, high-quality data coverage, comprising 20,653 Visual Question Answering (VQA) pairs spanning 11 organ systems and 12 imaging modalities, validated via a rigorous two-stage (human expert + model-assisted) review to ensure clinical authenticity. We evaluate 18 state-of-the-art MLLMs with Med-CMR, revealing GPT-5 as the top-performing commercial model: 57.81 accuracy on multiple-choice questions (MCQs) and a 48.70 open-ended score, outperforming Gemini 2.5 Pro (49.87 MCQ accuracy, 45.98 open-ended score) and leading open-source model Qwen3-VL-235B-A22B (49.34 MCQ accuracy, 42.62 open-ended score). However, specialized medical MLLMs do not reliably outperform strong general models, and long-tail generalization emerges as the dominant failure mode. Med-CMR thus provides a stress test for visual-reasoning integration and rare-case robustness in medical MLLMs, and a rigorous yardstick for future clinical systems.

cs.SE [Back]

[269] Bias Testing and Mitigation in Black Box LLMs using Metamorphic Relations cs.SE | cs.CLPDF

Sina Salimian, Gias Uddin, Sumon Biswas, Henry Leung

TL;DR: 本文提出了一个统一的框架，通过六种新颖的形变关系（MRs）系统评估和减轻黑盒大语言模型（LLMs）中的社会偏见。该方法不仅能够自动化检测隐藏的偏见，还能利用MR生成的多样本进行微调，显著提高模型的公平性。

Details

Motivation: 随着LLMs的广泛应用，其输出中隐含的社会偏见问题日益突出。现有方法难以应对间接或复杂的偏见诱导提示，迫切需要一种系统化的评估与减轻机制。

Result: 在BiasAsker基准测试中，MRs揭示了更多隐藏偏见（提升14%）。微调后，模型的安全响应率从54.7%提高到88.9%。

Insight: 形变关系不仅是有效的偏见检测工具，还能通过生成多样化的偏见样本直接用于模型优化，为提升LLMs的公平性提供了实用机制。

Abstract: The widespread deployment of Large Language Models (LLMs) has intensified concerns about subtle social biases embedded in their outputs. Existing guardrails often fail when faced with indirect or contextually complex bias-inducing prompts. To address these limitations, we propose a unified framework for both systematic bias evaluation and targeted mitigation. Our approach introduces six novel Metamorphic Relations (MRs) that, based on metamorphic testing principles, transform direct bias-inducing inputs into semantically equivalent yet adversarially challenging variants. These transformations enable an automated method for exposing hidden model biases: when an LLM responds inconsistently or unfairly across MR-generated variants, the underlying bias becomes detectable. We further show that the same MRs can be used to generate diverse bias-inducing samples for fine-tuning, directly linking the testing process to mitigation. Using six state-of-the-art LLMs - spanning open-source and proprietary models - and a representative subset of 385 questions from the 8,978-item BiasAsker benchmark covering seven protected groups, our MRs reveal up to 14% more hidden biases compared to existing tools. Moreover, fine-tuning with both original and MR-mutated samples significantly enhances bias resiliency, increasing safe response rates from 54.7% to over 88.9% across models. These results highlight metamorphic relations as a practical mechanism for improving fairness in conversational AI.

Table of Contents

cs.CV [Back]

[1] MOTION: ML-Assisted On-Device Low-Latency Motion Recognition cs.CV | cs.AI | cs.HCPDF

[2] Closing the Gap: Data-Centric Fine-Tuning of Vision Language Models for the Standardized Exam Questions cs.CV | cs.AI | cs.CL | cs.CYPDF

[3] PEFT-DML: Parameter-Efficient Fine-Tuning Deep Metric Learning for Robust Multi-Modal 3D Object Detection in Autonomous Driving cs.CV | cs.AIPDF

[4] DL-CapsNet: A Deep and Light Capsule Network cs.CVPDF

[5] Satellite to Street : Disaster Impact Estimator cs.CV | cs.AIPDF

[6] ProvRain: Rain-Adaptive Denoising and Vehicle Detection via MobileNet-UNet and Faster R-CNN cs.CVPDF

[7] Conceptual Evaluation of Deep Visual Stereo Odometry for the MARWIN Radiation Monitoring Robot in Accelerator Tunnels cs.CV | cs.ROPDF

[8] Exploring Diagnostic Prompting Approach for Multimodal LLM-based Visual Complexity Assessment: A Case Study of Amazon Search Result Pages cs.CVPDF

[9] A Fast and Efficient Modern BERT based Text-Conditioned Diffusion Model for Medical Image Segmentation cs.CV | cs.LGPDF

[10] Multi-modal On-Device Learning for Monocular Depth Estimation on Ultra-low-power MCUs cs.CVPDF

[11] Exploring Automated Recognition of Instructional Activity and Discourse from Multimodal Classroom Data cs.CVPDF

[12] SemImage: Semantic Image Representation for Text, a Novel Framework for Embedding Disentangled Linguistic Features cs.CV | cs.LGPDF

[13] TeleViT1.0: Teleconnection-aware Vision Transformers for Subseasonal to Seasonal Wildfire Pattern Forecasts cs.CVPDF

[14] Comparative Analysis of Vision Transformer, Convolutional, and Hybrid Architectures for Mental Health Classification Using Actigraphy-Derived Images cs.CV | cs.LGPDF

[15] TinyViT: Field Deployable Transformer Pipeline for Solar Panel Surface Fault and Severity Screening cs.CV | eess.IVPDF

[16] Hybrid Synthetic Data Generation with Domain Randomization Enables Zero-Shot Vision-Based Part Inspection Under Extreme Class Imbalance cs.CV | cs.LGPDF

[17] AutocleanEEG ICVision: Automated ICA Artifact Classification Using Vision-Language AI cs.CV | cs.LG | eess.IV | q-bio.QMPDF

[18] DenseScan: Advancing 3D Scene Understanding with 2D Dense Annotation cs.CV | cs.AIPDF

[19] Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views cs.CVPDF

[20] UniDiff: Parameter-Efficient Adaptation of Diffusion Models for Land Cover Classification with Multi-Modal Remotely Sensed Imagery and Sparse Annotations cs.CVPDF

[21] HeartFormer: Semantic-Aware Dual-Structure Transformers for 3D Four-Chamber Cardiac Point Cloud Reconstruction cs.CVPDF

[22] Words into World: A Task-Adaptive Agent for Language-Guided Spatial Retrieval in AR cs.CV | cs.AI | cs.HCPDF

[23] TGSFormer: Scalable Temporal Gaussian Splatting for Embodied Semantic Scene Completion cs.CVPDF

[24] Optimizing Distributional Geometry Alignment with Optimal Transport for Generative Dataset Distillation cs.CVPDF

[25] ART-ASyn: Anatomy-aware Realistic Texture-based Anomaly Synthesis Framework for Chest X-Rays cs.CVPDF

[26] Odometry Without Correspondence from Inertially Constrained Ruled Surfaces cs.CVPDF

[27] MVAD : A Comprehensive Multimodal Video-Audio Dataset for AIGC Detection cs.CVPDF

[28] Assimilation Matters: Model-level Backdoor Detection in Vision-Language Pretrained Models cs.CVPDF

[29] mmPred: Radar-based Human Motion Prediction in the Dark cs.CVPDF

[30] MM-DETR: An Efficient Multimodal Detection Transformer with Mamba-Driven Dual-Granularity Fusion and Frequency-Aware Modality Adapters cs.CVPDF

[31] Towards aligned body representations in vision models cs.CV | cs.AIPDF

[32] THCRL: Trusted Hierarchical Contrastive Representation Learning for Multi-View Clustering cs.CVPDF

[33] WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing cs.CVPDF

[34] Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction cs.CVPDF

[35] Low-Bitrate Video Compression through Semantic-Conditioned Diffusion cs.CV | cs.AIPDF

[36] SplatFont3D: Structure-Aware Text-to-3D Artistic Font Generation with Part-Level Style Control cs.CV | cs.GRPDF

[37] What about gravity in video generation? Post-Training Newton’s Laws with Verifiable Rewards cs.CVPDF

[38] RecruitView: A Multimodal Dataset for Predicting Personality and Interview Performance for Human Resources Applications cs.CV | cs.AIPDF

[39] CausalAffect: Causal Discovery for Facial Affective Understanding cs.CV | cs.AIPDF

[40] RealGen: Photorealistic Text-to-Image Generation via Detector-Guided Rewards cs.CV | cs.AIPDF

[41] Structured Context Learning for Generic Event Boundary Detection cs.CVPDF

[42] Learning What Helps: Task-Aligned Context Selection for Vision Tasks cs.CVPDF

[43] CC-FMO: Camera-Conditioned Zero-Shot Single Image to 3D Scene Generation with Foundation Model Orchestration cs.CVPDF

[44] Terrain Sensing with Smartphone Structured Light: 2D Dynamic Time Warping for Grid Pattern Matching cs.CVPDF

[45] Image Generation as a Visual Planner for Robotic Manipulation cs.CV | cs.ROPDF

[46] Cross-Temporal 3D Gaussian Splatting for Sparse-View Guided Scene Update cs.CVPDF

[47] NeuroVolve: Evolving Visual Stimuli toward Programmable Neural Objectives cs.CVPDF

[48] Describe Anything Anywhere At Any Moment cs.CV | cs.AI | cs.ROPDF

[49] SatireDecoder: Visual Cascaded Decoupling for Enhancing Satirical Image Comprehension cs.CVPDF

[50] Scaling Down to Scale Up: Towards Operationally-Efficient and Deployable Clinical Models via Cross-Modal Low-Rank Adaptation for Medical Vision-Language Models cs.CVPDF

[51] Automatic Pith Detection in Tree Cross-Section Images Using Deep Learning cs.CV | cs.AIPDF

[52] XAI-Driven Skin Disease Classification: Leveraging GANs to Augment ResNet-50 Performance cs.CV | cs.AIPDF

[53] Doppler-Enhanced Deep Learning: Improving Thyroid Nodule Segmentation with YOLOv5 Instance Segmentation cs.CV | cs.AI | cs.CE | cs.LG | cs.PFPDF

[54] MambaScope: Coarse-to-Fine Scoping for Efficient Vision Mamba cs.CV | cs.AIPDF

[55] Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer cs.CV | cs.AIPDF

[56] Silhouette-based Gait Foundation Model cs.CVPDF

[57] Affordance-First Decomposition for Continual Learning in Video-Language Understanding cs.CVPDF

[58] Optimizing LVLMs with On-Policy Data for Effective Hallucination Mitigation cs.CV | cs.AIPDF

[59] Deep Learning-Based Computer Vision Models for Early Cancer Detection Using Multimodal Medical Imaging and Radiogenomic Integration Frameworks cs.CV | cs.AIPDF

[60] RS-ISRefiner: Towards Better Adapting Vision Foundation Models for Interactive Segmentation of Remote Sensing Images cs.CVPDF

[61] TrajDiff: End-to-end Autonomous Driving without Perception Annotation cs.CV | cs.ROPDF

[62] Multi-GRPO: Multi-Group Advantage Estimation for Text-to-Image Generation with Tree-Based Trajectories and Multiple Rewards cs.CVPDF

[63] Joint Multi-scale Gated Transformer and Prior-guided Convolutional Network for Learned Image Compression cs.CVPDF

[64] Seeing the Wind from a Falling Leaf cs.CVPDF

[65] The Outline of Deception: Physical Adversarial Attacks on Traffic Signs Using Edge Patches cs.CVPDF

[66] EAG3R: Event-Augmented 3D Geometry Estimation for Dynamic and Extreme-Lighting Scenes cs.CV | cs.AIPDF

[67] DEJIMA: A Novel Large-scale Japanese Dataset for Image Captioning and Visual Question Answering cs.CVPDF

[68] PolarGS: Polarimetric Cues for Ambiguity-Free Gaussian Splatting with Accurate Geometry Recovery cs.CVPDF

[69] CircleFlow: Flow-Guided Camera Blur Estimation using a Circle Grid Target cs.CVPDF

[70] Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding cs.CVPDF

[71] IRPO: Boosting Image Restoration via Post-training GRPO cs.CVPDF

[72] PanFlow: Decoupled Motion Control for Panoramic Video Generation cs.CVPDF

[73] AFRAgent : An Adaptive Feature Renormalization Based High Resolution Aware GUI agent cs.CVPDF

[74] TAP-CT: 3D Task-Agnostic Pretraining of Computed Tomography Foundation Models cs.CV | cs.AIPDF

[75] Generative Adversarial Gumbel MCTS for Abstract Visual Composition Generation cs.CV | cs.AI | cs.CLPDF

[76] Quantum-Inspired Spectral Geometry for Neural Operator Equivalence and Structured Pruning cs.CVPDF

[77] Look, Recite, Then Answer: Enhancing VLM Performance via Self-Generated Knowledge Hints cs.CV | cs.AIPDF

[78] HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics cs.CVPDF